For testing engineers in banks and insurers, the stakes around resilience have never been higher. Digital platforms are expected to operate without interruption, yet outages and system failures continue to grab headlines across industries.
In financial services, where every second of downtime can translate into regulatory scrutiny, lost trades, or customer mistrust, QA teams are under pressure to expand their remit beyond functional correctness into full-scale resilience testing.
The shift to cloud-native architectures and distributed applications has created both opportunities and risks. These systems enable innovation and speed, but also increase the number of moving parts that can fail in unexpected ways.
High-profile incidents over the past two years, ranging from prolonged downtime at global tech platforms to failures at cybersecurity providers with direct impact on banks and airlines, underscore the urgency of building fault tolerance into every layer of software.
Against this backdrop, chaos engineering is gaining traction as a discipline that can complement and strengthen traditional QA practices. It is not about introducing risk recklessly, but about creating controlled experiments to expose weaknesses before they trigger real-world crises.
“Chaos engineering is a valuable framework for tamping down on those unexpected problems lying in wait across every back-end system,” stated Rohan Gupta, VP for Cloud, Security, & DevOps at California-based AI software testing and engineering firm R Systems.
From Spotify’s hours-long outage to the CrowdStrike incident in 2024 that grounded airlines and disrupted critical industries like healthcare and financial services, the consequences of system failure have become stark.
Meta also suffered widespread disruption in early 2024, “causing millions of users to lose their ability to post content and refresh their feeds, a nightmare for consumers, influencers, and businesses,” Gupta noted.
Complexity creates new vulnerabilities
As financial services institutions expand their digital platforms, they also increase the complexity of the systems that underpin them.
“This complexity, in turn, can create new vulnerabilities that can halt experiences and operations, ultimately impacting their bottom line,” Gupta warned.
He pointed to research showing that “Global 2000 companies, on average, lose $200 million annually due to unexpected failures in their digital environments.”
For QA and testing teams, this raises a fundamental challenge: traditional testing and monitoring are not enough.
“Such testing and monitoring procedures often overlook vital distributed systems supporting a bank’s platform or app, or fail to identify potential blind spots inside them,” Gupta explained.
To address these limitations, he argued that enterprises need to make chaos engineering part of routine practice.
“Enterprises can leverage the power of chaos engineering, the testing of an ecosystem by running custom, confined experiments to identify single points of failure and build resiliency so businesses have confidence in their systems,” he said.
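To make the idea of a confined experiment concrete, the following is a minimal sketch rather than a description of any tool Gupta or R Systems uses. It assumes a hypothetical banking platform in which a request succeeds only if every dependency has at least one healthy replica. The service names, replica counts, and the `request_succeeds` steady-state check are all illustrative assumptions. Each experiment removes one replica of one service in a copy of the topology, so the blast radius stays confined, and records which services turn out to be single points of failure.

```python
# Hypothetical topology: service name -> number of healthy replicas.
# These names and counts are assumptions made up for illustration.
SERVICES = {"api-gateway": 2, "auth": 2, "ledger-db": 1, "fx-rates": 1}


def request_succeeds(replicas):
    """Steady-state hypothesis (assumed): a request succeeds only if
    every dependency still has at least one healthy replica."""
    return all(count >= 1 for count in replicas.values())


def find_single_points_of_failure(replicas):
    """Run one confined experiment per service: remove a single replica
    in a copy of the topology and check whether the hypothesis holds."""
    single_points = []
    for name in replicas:
        experiment = dict(replicas)   # work on a copy, never the original
        experiment[name] -= 1         # inject the fault: lose one replica
        if not request_succeeds(experiment):
            single_points.append(name)
    return single_points


print(find_single_points_of_failure(SERVICES))  # ['ledger-db', 'fx-rates']
```

In a real environment the fault would be injected into running infrastructure and the steady-state check would be a live business metric, but the loop of hypothesis, confined fault, and observation is the same.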
Gupta drew an analogy with elite sports preparation: “Enterprises should treat chaos engineering as a routine practice, just like sports teams before every game. These groups would never participate in matches without understanding their opponent or ensuring they are in the best possible position to win. They train under pressure, run through potential scenarios, and test their plays to identify the weaknesses of their opponents. This same mindset applies to enterprise engineering teams preparing for potential chaos in their environments.”
Gupta highlighted how controlled chaos experiments could have mitigated real-world failures.
“In the case of CrowdStrike, the company’s engineering team could have run controlled chaos experiments on endpoint update mechanisms. This would have aimed to test rollback strategies and ensure corrupted updates would not propagate across critical systems, minimising the risk of grounding airlines or disrupting essential industries.”
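A rollback experiment of the kind Gupta describes might look something like the sketch below. It is not based on how CrowdStrike distributes updates. The ring layout, host names, version strings, and the `health_check` function are hypothetical and exist only to illustrate the pattern: push a deliberately corrupted update to a small first ring, confirm the health check catches it, and verify that the rollback stops it from reaching the rest of the fleet.

```python
def staged_rollout(fleet, rings, new_version, health_check):
    """Apply new_version ring by ring. If any host in a ring fails the
    health check, restore that ring and stop. Returns True on success."""
    for ring in rings:
        previous = {host: fleet[host] for host in ring}  # snapshot for rollback
        for host in ring:
            fleet[host] = new_version
        if not all(health_check(fleet[host]) for host in ring):
            fleet.update(previous)  # roll back only the ring just touched
            return False
    return True


# Hypothetical fleet of six hosts, all on a known-good version.
fleet = {f"host-{i}": "v1" for i in range(6)}
rings = [["host-0"], ["host-1", "host-2"], ["host-3", "host-4", "host-5"]]


def health_check(version):
    # Assumed check: the injected bad update is marked "corrupt".
    return "corrupt" not in version


# Chaos experiment: deliberately ship a corrupted update.
completed = staged_rollout(fleet, rings, "v2-corrupt", health_check)
print(completed, set(fleet.values()))
```

The experiment passes if the rollout reports failure and every host is still on the known-good version afterwards.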
He added: “For Meta, the company would have benefited from simulating high-traffic surges coupled with API rate-limit failures to detect bottlenecks that caused content refresh issues. This would have allowed teams to fine-tune auto-scaling and caching layers proactively without causing disruptions for countless users.”
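The traffic-surge scenario can be sketched in a similar way. The example below is an assumption-laden illustration, not a model of Meta's systems: `make_flaky_api`, `refresh_feed`, the 50-user pool, and the failure rates are all hypothetical. It injects rate-limit errors into a feed API at a chosen rate during a burst of requests and measures how many requests are still served, either fresh or from a cache fallback.

```python
import random


class RateLimitError(Exception):
    """Stands in for an HTTP 429 response from the upstream API."""


def make_flaky_api(failure_rate, rng):
    """Return a fetch function that fails with the injected probability."""
    def fetch_feed(user_id):
        if rng.random() < failure_rate:
            raise RateLimitError("429 Too Many Requests")
        return f"fresh-feed-{user_id}"
    return fetch_feed


def refresh_feed(user_id, fetch, cache):
    """Try the API first; on a rate-limit error fall back to the cache.
    Returns None when there is nothing cached for this user."""
    try:
        feed = fetch(user_id)
        cache[user_id] = feed
        return feed
    except RateLimitError:
        return cache.get(user_id)


def run_experiment(requests, failure_rate, seed=0):
    """Simulate a burst of requests and return the fraction served."""
    rng = random.Random(seed)          # seeded so runs are repeatable
    fetch = make_flaky_api(failure_rate, rng)
    cache = {}
    served = 0
    for i in range(requests):
        user_id = i % 50               # hypothetical pool of 50 users
        if refresh_feed(user_id, fetch, cache) is not None:
            served += 1
    return served / requests


for rate in (0.0, 0.3, 1.0):
    print(rate, run_experiment(1000, rate))
```

Comparing the served fraction across failure rates shows where the cache stops absorbing the fault, which is the kind of bottleneck such an experiment is meant to surface before users encounter it.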
The role of AI in chaos testing
AI is emerging as a force multiplier for QA teams adopting chaos testing. “Today, enterprises are integrating AI into their chaos engineering pipelines to accelerate root cause analysis (RCA) during and after these experiments,” Gupta explained.
“For example, when teams purposely spike memory or bring down servers as part of a test, AI can automatically generate remediation steps by analysing the root cause and providing guidance directly to engineering teams. This reduces manual effort and minimises the time it takes for engineering teams to resolve key issues not only in the experiments, but also in live scenarios.”
AI can also help scale testing as systems evolve. “Engineering teams could eventually be supported by AI agents that can identify potential points of failure by analysing architectural changes, referencing known vulnerabilities from industry examples, and developing new custom experiments and remediation approaches accordingly,” Gupta said.
Ultimately, Gupta emphasised that chaos engineering is as much about mindset as it is about technology.
“Chaos engineering encourages a mindset of preparedness and resilience within engineering teams. Instead of simply reacting to failures, engineering teams learn to anticipate and handle them proactively. This cultural shift improves incident response and decreases unexpected issues in production.”
The message for QA leaders in financial services is clear: “Chaos engineering helps enterprises prepare for disruptive events in a safe and controlled way. This enables them to build resilience across their entire organisation and operations, from their apps to their teams.”
He added: “AI is increasingly supporting these efforts by making it easier for teams to identify and anticipate potential issues. However, a strong engineering culture rooted in proactivity, not reactivity, remains critical.”
Returning to his sports analogy, Gupta said: “Just as sports teams need proper preparation, practice, and learning from mistakes before a big game, enterprises must also apply discipline and foresight to succeed in today’s complex digital landscape.”
He concluded that “in the end, this will help enterprises deliver great customer experiences and grow their businesses. If they fail to prepare for chaos, they risk facing serious consequences that could damage their reputations and long-term success.”