Analysis: what the QA space can learn from the CrowdStrike saga

The initial confusion has cleared since CrowdStrike’s software caused a major, global Windows outage over the summer, on July 19, when millions of PCs and servers went down, impacting hundreds of businesses around the world, mostly in the U.S.

Since then, enough time has passed that a retrospective could be valuable, not just for CrowdStrike, but for the greater software testing community.

So what happened exactly, and what should financial services firms and other companies do to avoid these major QA crises?

“Antivirus and security software typically needs to run in a privileged mode, something other applications cannot touch,” explained Matt Heusser, managing director at Excelon Development, where he recruits and trains software testers and conducts testing and development work.


“If security software runs with the same security as everything else, then malevolent software could act like a biological virus, finding the security software and overwriting or destroying it,” he elaborated in a recent analysis.

In a 2009 decision, the EU claimed that Microsoft’s antivirus software, Windows Defender, had an unfair, monopoly-like advantage in that it could run in the kernel, the most privileged and incorruptible memory and disk space.

The EU forced Microsoft to allow third-party developers, such as CrowdStrike, to run in the kernel.
Heusser pointed out that CrowdStrike’s initial incident report claimed that it was not new code, but a “content update”, similar to the signature of a virus, that caused a crash in the software.

“Because this was running in the kernel, or OS, this exception could not be trapped,” Heusser said.

“Because the software ran on bootup to check for viruses, a simple reboot could not fix the issue. Apple and Linux computers do not run antivirus software in this privileged mode and did not manifest the issue,” he added.

‘Dysfunctional security update’

The “dysfunctional security update”, as Heusser put it, affected only about 1% of the total number of Windows computers.

Yet, as of 2023, CrowdStrike held 15% of the software security marketplace.

“That means that CrowdStrike’s high-end security software is deployed on computers that are considered mission-critical,” he said.

Parametrix estimates the direct losses to all affected companies at $5.4 billion, which does not count reputational damage or lost future revenue.

A timeline of the CrowdStrike outage events

“While the incident report itself is more than a little ambiguous, using high-level terms that could have multiple meanings, a few things are clear,” Heusser continued.

The company has the source code, some of which runs in the Windows kernel, as well as content updates.

The content updates include both Rapid Response Content, which needs to be deployed as soon as possible to address serious security vulnerabilities, and more routine changes, which go through a more rigorous process, he added.

“It’s easy to jump to exactly where the defect was; CrowdStrike itself claimed it was a bug in its test system,” Heusser noted.

Specifically, the report stated: “Due to a bug in the Content Validator, one of two Template Instances passed validation despite containing problematic content data.”

Heusser thinks “this is a bit of an odd statement, putting the bug in the test system instead of framing it as a bug in the data that the tests missed.”
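As a purely illustrative sketch of the kind of gap the report describes (the template types, field counts and checks below are invented, not taken from CrowdStrike’s materials), a validator that applies one template’s expected schema to every instance will happily pass content the runtime cannot handle:

```cpp
#include <cstdio>
#include <vector>

// Purely hypothetical sketch of a build-time content validator. It applies
// template type A's expected field count (20) to every instance, so a type-B
// instance that really needs 21 fields still "passes validation".
enum class TemplateType { A, B };

struct TemplateInstance {
    TemplateType type;
    std::vector<int> fields;
};

bool validate(const TemplateInstance& instance) {
    // Bug: the check ignores the template type entirely
    return instance.fields.size() >= 20;
}

int main() {
    TemplateInstance typeB{TemplateType::B, std::vector<int>(20, 0)};
    std::printf("passed validation: %s\n", validate(typeB) ? "yes" : "no"); // prints "yes"
}
```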


According to the later root cause analysis, released on August 6, there was a missing test for a scenario where the signature was literal text, instead of a pattern match using wildcards.

“There was also a mismatch between the expected number of inputs, which caused the software to access memory that was out of bounds,” he wrote.
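A minimal user-space sketch of that failure mode (the structures and field counts below are invented for illustration; the actual driver code is not public) shows both the out-of-bounds access and the kind of bounds check that would have prevented it:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch: a detection rule references input fields by index, and
// a content update supplies those fields. If the rule expects a 21st field
// (index 20) while the update supplies only 20, the read goes out of bounds.
// In user space that is a catchable bug; in a kernel driver an unchecked read
// past the end of the buffer can crash the whole machine.
struct DetectionRule {
    std::vector<int> fieldIndexes;  // which input fields the rule inspects
};

bool evaluate(const DetectionRule& rule, const std::vector<std::string>& inputs) {
    for (int idx : rule.fieldIndexes) {
        if (idx >= static_cast<int>(inputs.size())) {
            // The kind of bounds check the root cause analysis says was missing
            std::fprintf(stderr, "field %d requested, only %zu supplied\n",
                         idx, inputs.size());
            return false;
        }
        if (inputs[idx].empty()) {
            return false;
        }
    }
    return true;
}

int main() {
    DetectionRule rule{{0, 5, 20}};                   // expects a 21st field
    std::vector<std::string> update(20, "data");      // update supplies only 20
    std::printf("rule matched: %s\n", evaluate(rule, update) ? "yes" : "no");
}
```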

DevOps and automation, at least in theory, resolve several risks by reducing the reliance on humans, who can be inconsistent and prone to error.

“Yet, as it turns out, it is humans who write the computer programs doing the automating, and those programs are prone to error, too,” Heusser stressed.

“First, the test data itself was malformed,” he continued. “Then, the test software was missing a test. Then, the architecture failed to trap the failure with something like a try/catch block or some other form of structured exception handling.”

“Running code inside an exception handler causes it not to crash, but to jump elsewhere in the code — the handler — in the event of a divide-by-zero or other type of error that causes a crash if unhandled,” Heusser explained.
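In user-space C++, that control flow looks like the sketch below. Kernel drivers cannot rely on C++ exceptions in the same way (Windows kernel code uses structured exception handling with its own syntax), so this is only an illustration of the jump-to-the-handler idea Heusser describes:

```cpp
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> fields(20, 1);       // the content update supplied 20 fields
    try {
        // .at() throws std::out_of_range instead of reading past the buffer
        int value = fields.at(20);        // a rule asks for a 21st field (index 20)
        std::cout << value << '\n';
    } catch (const std::out_of_range& e) {
        // Execution jumps here; the program logs the problem and keeps running
        std::cerr << "recovered from bad field access: " << e.what() << '\n';
    }
    return 0;
}
```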

Customers could not choose to do a staged deployment of these content updates even if they wanted to, he pointed out.
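For comparison, a staged (or canary) rollout gate can be as simple as the hypothetical sketch below; nothing in it reflects how CrowdStrike actually ships content, it only illustrates the mechanism customers were asking for: each host hashes its own identifier into a bucket and installs an update only once the publisher widens the rollout percentage.

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical staged-rollout check: an update is tagged with the percentage
// of the fleet that should receive it. Each host hashes its own identifier
// into a bucket from 0 to 99 and installs the update only if that bucket is
// inside the current rollout window.
bool shouldInstall(const std::string& hostId, int rolloutPercent) {
    std::size_t bucket = std::hash<std::string>{}(hostId) % 100;
    return static_cast<int>(bucket) < rolloutPercent;
}

int main() {
    const char* hosts[] = {"atm-0042", "pos-terminal-7", "trading-gateway-3"};
    for (const char* host : hosts) {
        // Stage 1: 5% of the fleet; widen to 50% and then 100% only if nothing crashes
        std::printf("%s -> %s\n", host, shouldInstall(host, 5) ? "install" : "wait");
    }
    return 0;
}
```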

One item “buried in the post-incident report,” as Heusser put it, is the list of suggestions for improvement, which includes “local developer testing” as well as “content update and rollback testing.”

“The implication of these comments is that developer testing was not occurring — that, had a programmer tested the change on a machine, real or virtual, they would have seen the error and been able to prevent it,” he said.

“If that comment does not lead to more questions and better answers about the CrowdStrike incident, hopefully, it might at least lead to questions at many organisations.”

‘A bit like insurance’

Heusser placed the entire incident in a wider context by saying that “it can be helpful to think of testing and quality efforts as a bit like insurance. Spend a little money now to reduce the impact if something goes wrong later.”

He added: “Like health insurance, the hope is to never need the benefits, yet the risk of going uninsured is not worth it.”

That said, Heusser offered several lessons to consider, at a time when, perhaps, the idea of risk and its impact is something management is willing to confront and take action to mitigate.

He said it is vital to look at the worst possible outcomes, and then work backward to measure them and assess the risk.

“Soap opera tests and soap opera analysis are two ways to do this in a couple hours in a cross-functional workshop.”

In addition, Heusser drew a distinction between reliability and resilience: reliability focuses on continuous uptime, high availability and the absence of defects.

“Resilience sees failure as inevitable and focuses on fast issue discovery, service restoration, backup, cutover and redundancy to minimize the impact of failure.”

Moreover, human oversight remains vital, Heusser said.

“In an age where people see tester as a bad word or overhead, organizations keep introducing defects that an actual person doing testing would catch,” he added.

“If the organization is not ready to have a tester role, developers can check each other’s work or implement cross-team testing, where teams test each other’s work with a beginner’s mind.”

