Banks warned over ‘ambiguity risk’ as agentic AI moves into core workflows

Banks are rapidly moving beyond generative AI chatbots and copilots into a far more autonomous phase of AI deployment, one where software agents can increasingly make decisions, execute workflows and interact with core banking systems with limited human intervention.

Across financial services, institutions are now experimenting with agentic AI systems capable of handling fraud investigations, compliance monitoring, customer interactions, trade accounting and operational workflows.

The push is being driven by growing pressure to cut operational costs, accelerate productivity and scale AI-enabled automation across highly complex banking environments.

But as autonomous AI systems move deeper into production infrastructure, regulators, software testing teams and operational resilience leaders are increasingly questioning whether many institutions fully understand the long-term risks emerging alongside the headline efficiency gains.

A new research paper examining the rise of Agentic AI in banking warns that autonomous AI systems may deliver strong gains in highly structured operational workflows while simultaneously introducing new forms of systemic and operational risk in more ambiguous environments.

The paper, The Rise of Agentic AI in Banking: Operational Performance and Systemic Risk in the US and UK (2022–2026), by India-based researcher Ahsan Perwez, argues that banks deploying increasingly autonomous AI agents into core business functions may need far more granular QA, software testing and governance frameworks than many current industry programmes provide.

The study focuses on the transition from “Assisted Intelligence” to “Agentic Intelligence”, describing Agentic AI systems as capable of “proactive goal-setting, multi-step reasoning, and autonomous execution across banking legacy systems without constant human prompting.”

According to the paper, by early 2026, major financial institutions including Goldman Sachs and JPMorgan Chase had already integrated autonomous AI systems into trade accounting, KYC and sales enablement workflows.

For QA and software testing teams inside banks, the paper raises growing concerns around what the author calls “task-level conditions” that determine whether autonomous AI systems generate operational value or create instability.

‘Reliable gains’ in rule-based workflows

The research identifies substantial reported efficiency gains from agentic deployments in banking operations, compliance and sales enablement.

The paper cites findings including “net cost reductions of up to 20%” from agentic financial closing workflows, alongside projections from Lloyds Banking Group of “£100 million in annual value from automating fraud investigations and complaints processing.”

Elsewhere, the study references a reported “30% increase in closed deals” following deployment of JPMorgan Chase’s “Coach AI” platform to financial advisers, while noting that agentic systems at Goldman Sachs were aimed at reclaiming “thousands of billable hours” in operational workflows.


Perwez also cited a reported “99.9% accuracy in real-time AI compliance monitoring for KYC and AML checks” from Phacet Labs, and noted that Bank of Singapore had reduced “source of wealth” memo drafting from 10 days to one hour using agentic assistants.

He argued these success stories largely cluster around highly structured, deterministic and rules-based tasks.

“Agentic AI systems in banking deliver reliable, scalable value only in tasks that fall below a threshold of contextual ambiguity,” Perwez wrote.

“Above this threshold, where human judgment, relational context, or novel edge cases are required, autonomous agents fail systematically, often producing outcomes worse than the human processes they replaced,” Perwez added.

For software testing and quality engineering teams, the implications are significant. The study repeatedly stresses that banks should avoid treating AI performance as a single aggregate metric and instead build testing, validation and governance programmes around task-level ambiguity classification.
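To illustrate why a single aggregate metric can mislead, here is a minimal sketch in Python. The ambiguity classes, record counts and outcomes are invented for illustration and do not come from the paper; the point is only that a strong overall figure can sit alongside a weak result in the high-ambiguity class.

```python
from collections import defaultdict

# Hypothetical evaluation records: (ambiguity_class, agent_outcome_ok).
# The classes and counts are invented for illustration only.
RESULTS = (
    [("low", True)] * 950 + [("low", False)] * 10
    + [("high", True)] * 15 + [("high", False)] * 25
)


def success_rates(results):
    """Return the aggregate success rate and a per-class breakdown."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for ambiguity_class, ok in results:
        totals[ambiguity_class] += 1
        passes[ambiguity_class] += int(ok)
    aggregate = sum(passes.values()) / sum(totals.values())
    per_class = {c: passes[c] / totals[c] for c in totals}
    return aggregate, per_class


aggregate, per_class = success_rates(RESULTS)
print(f"aggregate: {aggregate:.3f}")
for ambiguity_class, rate in sorted(per_class.items()):
    print(f"{ambiguity_class}: {rate:.3f}")
```

With these made-up numbers the aggregate success rate is 96.5%, while the high-ambiguity class succeeds only 37.5% of the time, a gap that a single headline figure would hide.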

‘The AI proof gap’

One of Perwez’s central arguments concerns what he describes as an “AI Proof Gap”: a disconnect between rapid investment in autonomous AI capabilities and the accountability, validation and oversight structures required to operate them safely.

“The Proof Gap appears widest in high-ambiguity task domains, where neither agents nor regulators have established clear success criteria,” Perwez wrote.

That argument lands directly in the middle of ongoing debates around AI testing, digital resilience and operational governance across financial services.

The study argued that many current AI performance assessments remain distorted because gains often emerge rapidly after deployment, while risks compound over much longer periods.

“A second observation: agentic AI gains and risks are not only different in type, they are different in timing,” Perwez stated.

“Gains are measurable within 6–12 months of deployment. Risks materialise 18–36 months post-deployment, after agents have embedded themselves into workflows and failure modes have had time to compound.”

Perwez warned this creates a structural testing and governance problem for banks rolling out autonomous systems into production environments without sufficiently long validation windows.

“This temporal asymmetry suggests that cost-benefit analyses conducted within the first year of deployment may systematically overstate net value,” he explained.
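The timing argument can be made concrete with a toy calculation. The sketch below is not the paper's model: the monthly gain, the month in which risk costs begin and their linear growth rate are all hypothetical parameters in arbitrary units, chosen only to show how a first-year review and a three-year review of the same deployment can point in opposite directions.

```python
def cumulative_net_value(months, monthly_gain=1.0, risk_start=18, risk_growth=0.25):
    """Sum gains minus risk costs over `months` months (arbitrary units).

    All parameters are hypothetical: gains accrue from month 1, while
    risk costs begin at `risk_start` and grow linearly each month after.
    """
    total = 0.0
    for month in range(1, months + 1):
        total += monthly_gain
        if month >= risk_start:
            # Assumed linear growth; real failure modes may compound faster.
            total -= risk_growth * (month - risk_start + 1)
    return total


year_one = cumulative_net_value(12)
year_three = cumulative_net_value(36)
print(year_one, year_three)
```

Under these assumptions the deployment shows a net value of 12.0 after one year and -11.5 after three, so an assessment that stops at month 12 never sees the costs that reverse the result.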


Alongside operational gains, the research catalogues several high-profile failures and risk events tied to autonomous or semi-autonomous AI systems.

These include the “Jane Street SEBI enforcement” case involving a reported “$565 million escrow obligation following unauthorised autonomous trading patterns,” as well as the “Lobstar Wilde” bot incident, where a customer-facing agent reportedly generated a “$250,000 direct loss” after autonomously misapplying interest-rate waivers.

Perwez argued these failures consistently emerge in “high-ambiguity” environments requiring contextual reasoning, judgment and nuanced escalation handling.

For QA and testing leaders, the findings reinforce growing industry concerns around edge-case testing, human-on-the-loop oversight, model validation and resilience testing for autonomous AI agents operating across core banking systems.

Perwez said that “banks that deploy agents uniformly, without ambiguity screening, risk reporting aggregate gains while quietly accumulating tail risk in high-ambiguity task portfolios.”

Regulatory pressure

Perwez also pointed toward increasing regulatory scrutiny around autonomous AI systems in banking.

The paper references emerging frameworks including the FCA’s 2026/27 Work Programme and the US Treasury’s FS AI RMF, but argues that many current approaches still focus primarily on transparency rather than task-level governance and testing requirements.

Perwez ultimately concludes that the future success of agentic AI in banking may depend less on whether institutions adopt the technology and more on how precisely they deploy, test and govern it.

“Agentic AI in banking is not a question of adoption versus caution,” he concluded. “It is a question of deployment precision.”

He further argued that banks “need task-level ambiguity classification systems before deploying agentic systems,” rather than treating AI deployment as a binary choice between innovation and risk management.
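The paper does not specify what such a classification system would look like. As a rough sketch only, a pre-deployment screen might score each task and route anything above a cut-off to human review. The `Task` fields, the scoring rule and the `AMBIGUITY_THRESHOLD` value below are all hypothetical and are not drawn from the study or from any regulatory framework.

```python
from dataclasses import dataclass

AMBIGUITY_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the paper


@dataclass
class Task:
    name: str
    rule_coverage: float  # assumed share of cases covered by explicit rules, 0..1
    needs_judgment: bool  # assumed flag for relational or contextual judgment


def ambiguity_score(task: Task) -> float:
    """Toy score: less rule coverage and more judgment mean more ambiguity."""
    score = 1.0 - task.rule_coverage
    if task.needs_judgment:
        score += 0.3
    return min(score, 1.0)


def route(task: Task) -> str:
    """Send low-ambiguity tasks to the agent and everything else to a human."""
    if ambiguity_score(task) < AMBIGUITY_THRESHOLD:
        return "agent"
    return "human_review"


kyc_check = Task("kyc_document_check", rule_coverage=0.95, needs_judgment=False)
waiver_dispute = Task("fee_waiver_dispute", rule_coverage=0.6, needs_judgment=True)
print(route(kyc_check), route(waiver_dispute))
```

In this sketch the rules-heavy KYC check is routed to the agent, while the waiver dispute, which needs judgment, is held for human review. A production screen would need scoring criteria validated against real outcomes rather than hand-picked weights.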

