Artificial intelligence is rapidly transforming software testing and development, with AI coding assistants now a familiar presence in the daily workflows of developers. From auto-completion to generating substantial code blocks, their impact is undeniable, QA insiders largely agree.
Among the most compelling use cases is the automatic generation of unit, integration and end-to-end tests. The prospect of AI churning out tests, boosting coverage metrics, and freeing developers from the often-tedious task of test creation sounds like a direct route to faster feedback and to conquering the backlog of untested code.
But is this powerful new capability a reliable asset, or merely a deceptive and quite risky shortcut?
According to Spain-based Jonathan Vila Lopez, developer advocate at Sonar and a member of industry group Barcelona JUG, the excitement surrounding AI in test generation must be balanced with caution.
“Uncritically accepting AI-generated tests can lead to a false sense of security,” he warned. “Developers might believe their codebase is robust due to high test counts, while the tests themselves could be superficial or even erroneous.”
Sonar, the company Vila Lopez represents, equips developers and organisations to deliver quality, secure code fit for development and production, whether AI-generated or written by developers.
Sonar is trusted by dozens of banks, large and small, around the world to clean more than half a trillion lines of code, so it is fair to say Vila Lopez’s company plays an integral role in delivering software at scale.
To grasp why AI-generated tests demand scrutiny, it is crucial to understand how these code-generating AIs learn. Most are large language models (LLMs) trained on vast datasets: billions of lines of code from public repositories such as GitHub, platforms such as Stack Overflow, open source projects and possibly your own company’s code.
Through this massive ingestion, the AI learns patterns: common coding structures, typical API usage, popular libraries, and prevalent coding styles. It becomes adept at predicting the next sequence of code, enabling it to write code that often appears correct on the surface.
‘Garbage in, garbage out’
But the inherent pitfall lies in the nature of this training data. It is an indiscriminate collection of all types of code, Vila Lopez stressed, including code that is riddled with bugs or contains security vulnerabilities.
The AI does not inherently distinguish “good” code from “bad” code, he explained: it simply reproduces the patterns it has observed most frequently. If buggy patterns are common in its training set, it will replicate them.
This is the classic “garbage in, garbage out” dilemma. Consequently, when developers task AI with writing tests, several critical issues can emerge, Vila Lopez noted.
AI-generated tests can be inaccurate, often validating the existing code, flaws and all, rather than the intended behaviour. This leads to two primary categories of problems.
One is the generation of tests that are syntactically correct but semantically wrong. AI can generate code that compiles and uses testing annotations (like @Test), seemingly saving considerable manual effort. However, correctness is far from guaranteed.
Vila Lopez highlighted several recurring flaws. These include incomplete tests, tests with no assertions, and tests with weak assertions such as ‘assertNotNull(result)’, which does not confirm that the result is correct.
He noted that “AI tends to test the simplest, most straightforward case,” often neglecting null inputs, error conditions and edge cases unless explicitly prompted.
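As a rough illustration of that gap, the sketch below pairs a happy-path test with the edge-case tests that often have to be requested explicitly. The `average` function and its error-handling contract are assumptions made for this example only.

```python
from typing import Optional, Sequence


def average(values: Optional[Sequence[float]]) -> float:
    """Hypothetical function under test: mean of a non-empty sequence."""
    if values is None:
        raise TypeError("values must not be None")
    if len(values) == 0:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def test_happy_path() -> None:
    # The simple case an assistant is likely to produce unprompted.
    assert average([2.0, 4.0]) == 3.0


def test_single_element() -> None:
    # Boundary case: the smallest valid input.
    assert average([5.0]) == 5.0


def test_none_input_is_rejected() -> None:
    # Null input: the assumed contract is a TypeError.
    try:
        average(None)
    except TypeError:
        return
    raise AssertionError("expected TypeError for None input")


def test_empty_input_is_rejected() -> None:
    # Error condition: an empty sequence has no mean.
    try:
        average([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Only the first test covers the straightforward case; the other three encode decisions about nulls, errors and boundaries that someone has to make deliberately.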
Furthermore, AI may generate irrelevant or “hallucinated” tests, scenarios that are nonsensical for the application or focus on trivial details instead of significant behaviours.
Another issue is subtle logic bugs, which include incorrect setup, such as improper mocking or initializing tests in an invalid state, or flawed assertions, such as wrong comparisons or off-by-one errors.
Tests may also be flaky, especially when involving concurrency or asynchronous operations, leading to results that pass or fail inconsistently. And even when technically accurate, AI lacks contextual understanding, Vila Lopez said.
Good tests often require domain-specific knowledge, and AI typically lacks deep context about a particular application unless extensively briefed, potentially testing a method correctly in isolation but missing its broader systemic implications.
A more insidious problem arises when an AI test validates the wrong behaviour because the code itself is buggy, he continued.
This highlights the crucial distinction between verification and validation. Verification asks, ‘Are we building the product right?’ and AI can perform reasonably well here. But validation asks, ‘Are we building the right product?’ This is where AI often falls short, Vila Lopez pointed out.
For instance, if a calculateTax method contains a bug that results in a negative tax for certain inputs, an AI analysing this code might generate a test asserting that calculateTax(badInput) should indeed return that negative number, thereby verifying the bug, he explained.
Given the propensity for AI-generated tests to be flawed, integrating static analysis tools becomes essential. These tools automatically scan code, including tests, against extensive rule sets, identifying potential bugs, security vulnerabilities and code quality issues.
When AI is rapidly introducing new code, this automated oversight acts as a critical quality check. Some tools even promote AI assurance, sometimes with stricter scrutiny applied to AI-generated code, Vila Lopez noted.
Human review
In addition to static analysis, he urged developers to adopt structured practices when using AI test generation. Mandatory human review is non-negotiable.
“Scrutinize AI-generated tests for logical soundness, assertion quality, requirement alignment and edge case coverage,” Vila Lopez said. Delegating wisely is also key: AI is well suited to boilerplate tasks such as generating test method skeletons, basic setup and teardown, simple mocking, or input variations for existing, trusted tests.
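One way to picture the “input variations” task is a table-driven test, sketched below. The `clamp` function and the case table are hypothetical. The sketch assumes a reviewer confirms each expected value against the requirements rather than taking it from the code's current output.

```python
from typing import List, Tuple


def clamp(value: int, low: int, high: int) -> int:
    """Hypothetical, already-trusted function: restrict value to [low, high]."""
    return max(low, min(value, high))


# Each row is (value, low, high, expected). An assistant could draft extra rows;
# a reviewer would still check the expected column by hand.
CASES: List[Tuple[int, int, int, int]] = [
    (5, 0, 10, 5),     # inside the range
    (-3, 0, 10, 0),    # below the lower bound
    (42, 0, 10, 10),   # above the upper bound
    (0, 0, 10, 0),     # exactly on the lower bound
    (10, 0, 10, 10),   # exactly on the upper bound
]


def test_clamp_variations() -> None:
    for value, low, high, expected in CASES:
        actual = clamp(value, low, high)
        assert actual == expected, (
            f"clamp({value}, {low}, {high}) returned {actual}, expected {expected}"
        )
```

Keeping the inputs in a table leaves the test logic untouched when rows are added, which makes the AI-drafted part easy to review line by line.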
“Garbage prompts yield garbage tests,” he warned. Providing the AI with relevant documentation, requirement snippets and examples of high-quality tests improves results significantly.
Developers should be explicit in prompts, clearly instructing the AI on what to test, what to mock and what to assert. AI output should be treated as a first draft, then reviewed, corrected and enhanced.
“Understand the nuances of your chosen AI tool,” Vila Lopez summarised. “Learn its common error patterns and effective prompting strategies.”
He also recommended that teams start small and controlled, experimenting with AI test generation on noncritical projects first to gauge its true productivity benefits and learning curve.
“AI test generation is undeniably a powerful emerging capability,” he said, “but it’s not a ‘fire and forget’ solution.”
The key, according to Vila Lopez, is to view AI as an intelligent assistant, not as an infallible expert. Allow it to handle rudimentary drafting and repetitive tasks, but always subject its output to rigorous human review and automated quality checks via static analysis.
“By combining AI’s speed with developer diligence and robust tooling,” he concluded, “teams can harness the benefits of AI-driven test generation without sacrificing the integrity and quality of their software.”