BofE escalates AI fears as regulators push banks toward continuous testing

The Bank of England’s growing alarm over Anthropic’s unreleased Mythos AI model is rapidly turning frontier AI testing into a core resilience issue for banks, with regulators increasingly demanding that firms prove how AI systems behave under stress, attack conditions and real-world operational scenarios.

Anthropic is now set to brief the Financial Stability Board (FSB) on cyber vulnerabilities identified by its Mythos model following a direct request from Bank of England Governor Andrew Bailey.

The Financial Stability Board is an international body that monitors risks to the global financial system and coordinates financial regulation among G20 countries and major regulators.

The development significantly escalates the regulatory focus on AI-driven cyber and operational risk across financial services, particularly for QA, quality engineering and software testing teams already under pressure to validate increasingly complex AI-enabled systems.

Several news outlets reported that Bailey, who chairs the FSB, asked Anthropic to discuss the capabilities of its “Mythos Preview” cybersecurity model with “leading finance ministries and central banks from the FSB.”

The FSB itself confirmed the engagement, saying it “welcomes engagement with Anthropic and other firms on emerging and frontier risks to global financial stability.”

The concern centres on Mythos’ ability to detect vulnerabilities across “web browsers, infrastructure and software,” capabilities regulators fear could dramatically accelerate cyber exploitation against banks reliant on legacy technology stacks.

Speaking at Columbia University last month, Bailey warned: “You wake up to find that Anthropic may have found a way to crack the whole cyber risk world open.”

“The issue is: to what extent is this new version of the product going to be able to identify vulnerabilities in other systems which can be exploited for cyberattack purposes,” he added.

That warning aligns with a wider regulatory shift already under way across the UK, Australia and other major jurisdictions: moving AI oversight away from static governance frameworks and toward continuous validation, resilience testing and live operational assurance.

Earlier this year, the Bank of England confirmed it was already running simulations to assess how AI agents could behave under stressed market conditions, including “herding” behaviour where multiple systems amplify volatility simultaneously.

Ed Birchall, VP enterprise AI at Nuix, described the move as “a big signal, not just for regulators, but for every financial institution.”

“We’re moving from ‘AI experimentation’ to ‘AI as market infrastructure’,” Birchall argued.

That transition is becoming increasingly important for QA and software testing teams as regulators begin treating AI systems not as isolated innovation projects but as embedded financial infrastructure requiring evidence-based assurance.

Regulators want proof, not policies

The Mythos escalation comes amid mounting concern from regulators that governance and assurance programmes are failing to keep pace with the speed of AI deployment.

Australia’s prudential regulator APRA warned this week that “governance, risk management, assurance and operational resilience practices are not keeping pace with the scale, speed, and complexity of AI adoption.”

“The systems and processes required to safely govern AI use aren’t keeping up,” said APRA Member Therese McCarthy Hockey.

The regulator explicitly linked those concerns to frontier AI cyber models such as Mythos, warning they “could enhance the discovery of vulnerabilities by bad actors” and “are expected to further increase the probability, speed and scale of cyber attacks.”

For QA and quality engineering teams, APRA’s findings were particularly striking.

The regulator warned that “the volume and speed of AI assisted software development is placing strain on the effectiveness of change and release management controls,” while identifying “gaps in the scope and coverage of security testing programmes for both AI implementation and responding to the AI augmented threat environment.”


APRA also criticised firms still relying on outdated assurance methodologies.

“We also observed reliance on point in time and sample based assurance methods, despite these methods being ill suited to probabilistic models that learn, adapt and degrade over time,” McCarthy Hockey stressed.

In one of its strongest observations for testing leaders, the regulator said: “Few entities had continuous validation or monitoring in place to detect issues such as model drift, bias, failure modes, or control breakdowns in a timely manner.”

The watchdog added that “assurance activities often lagged AI deployment,” especially where “agentic behaviour, automated decision making or AI assisted code generation were involved.”
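To make the idea of continuous validation concrete, the following is a minimal sketch of one widely used drift check, the population stability index (PSI), which compares the distribution of a model's live scores against a baseline captured at sign-off. It is an illustration only, not a method prescribed by APRA or any regulator quoted here: the function names, the bin count and the 0.1 and 0.25 thresholds are assumptions based on common industry rules of thumb.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float], live: Sequence[float], bins: int = 10
) -> float:
    """Compare two score samples using equal-width bins over the baseline range."""
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range live values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, l = shares(baseline), shares(live)
    return sum((li - bi) * math.log(li / bi) for bi, li in zip(b, l))


def drift_status(psi: float) -> str:
    """Map a PSI value to a status. Thresholds are an assumed rule of thumb."""
    if psi < 0.1:
        return "stable"
    if psi < 0.25:
        return "investigate"
    return "drifted"


# Hypothetical data: scores recorded at validation time vs. a shifted live sample.
baseline_scores = [i / 100 for i in range(100)]
live_scores = [s + 0.5 for s in baseline_scores]

print(drift_status(population_stability_index(baseline_scores, baseline_scores)))
print(drift_status(population_stability_index(baseline_scores, live_scores)))
```

Run on a schedule against production scores, a check of this kind produces a time series of evidence rather than a single point-in-time sign-off, which is the distinction the regulator draws.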

UK efforts

That concern mirrors the direction of travel emerging from UK regulators.

The Financial Conduct Authority’s AI Live Testing initiative is already pushing firms into “real-world conditions” testing environments designed to validate AI systems outside traditional proof-of-concept settings.

As Ed Towers, head of advanced analytics at the FCA, explained earlier this year: “We’re providing a structured but flexible space where firms can test AI-driven services in real-world conditions, all with our regulatory support and oversight.”

He added that the initiative aims to move firms beyond “POC paralysis,” where AI projects stall at the proof-of-concept stage.

Crucially, the FCA also broadened the definition of what constitutes an AI system requiring validation.

“We broadly define the AI system as: the actual AI model, information on the deployment context and core risks … governance and human in the loop considerations, evaluation techniques as well as the input and output controls,” Towers explained.

That framing significantly expands QA scope beyond model testing into governance validation, operational resilience, recovery testing and lifecycle monitoring.

Resilience and AI testing converge

The Mythos debate is also accelerating convergence between cyber testing, operational resilience and AI assurance.

The UK’s AI Security Institute recently warned that frontier AI cyber capability is advancing rapidly after testing Mythos.

“Frontier AI’s autonomous cyber and software capability is advancing quickly: the length of cyber tasks that frontier models can complete autonomously has doubled on the order of months, not years,” the institute said.

It added that Mythos had successfully completed a previously unsolved cybersecurity challenge known as “cooling tower” in three out of 10 attempts, describing it as a “notable capability jump.”

The IMF has similarly warned that “cyber risk does not respect borders,” adding: “As AI capabilities spread across countries, inconsistent oversight could weaken a globally interconnected system.”

The Bank of England’s concerns increasingly reflect fears around systemic AI behaviour rather than isolated model failures.

That includes scenarios where multiple AI systems behave similarly under stress, where AI-generated code introduces vulnerabilities faster than institutions can detect them, or where cyber models identify weaknesses across interconnected banking infrastructure simultaneously.

For QA teams, that fundamentally changes the testing mandate.

Testing now increasingly includes adversarial validation, recovery testing, AI drift monitoring, AI-assisted code assurance and third-party dependency testing, as well as proof that systems can remain within operational tolerances under degraded conditions.

APRA warned institutions must implement “robust security testing across AI-generated code, software components and libraries” while strengthening “continuous and proportionate” lifecycle monitoring.

The regulator also highlighted mounting concerns around supplier concentration and opaque AI supply chains.

“AI functionality is often embedded within broader software platforms or developer tooling, reducing transparency over where and how models are trained, updated or constrained,” APRA said.

It warned that “upstream dependencies such as foundation models, training data sources and fourth party service providers are opaque,” limiting firms’ ability to “independently assess model performance, bias, resilience and security.”

The Bank of England’s broader operational resilience agenda is increasingly moving in the same direction.

Under frameworks such as STAR-FS and cyber recovery guidance, regulators have repeatedly stressed that resilience must be “continuously tested, measured and evidenced.”

As the Bank recently stated: “Cyber-attacks remain a major threat to the financial sector. Resilience can no longer be assumed, it must be proven.”

For QA and software testing teams inside banks and financial institutions, the message from regulators is becoming increasingly clear: AI governance alone is no longer enough.

Testing, validation and operational evidence are becoming the mechanisms through which firms prove AI systems can be trusted.

