QA teams and engineers should take note: large language models (LLMs), which are increasingly integrated into the complex software systems of banks, financial services companies and other firms, are far from ideal.
In fact, these models are more often than not unable to spot or identify bugs and flaws in those systems, even major ones.
A new study conducted by OpenAI researchers highlights the ongoing challenges of using LLMs for real-world software engineering tasks, showing that while these AI systems can assist in coding, they are still far from capable of fully replacing human engineers.
Despite claims from OpenAI CEO Sam Altman that LLMs could eventually replace ‘low-level’ engineers, the research emphasised that AI models remain limited in their ability to handle complex issues and understand the root causes of bugs.
The study involved testing three advanced AI models – OpenAI’s GPT-4o and o1, and Anthropic’s Claude 3.5 Sonnet – on a variety of software engineering tasks sourced from the freelancing platform Upwork.
These tasks, totalling $1 million in potential payouts, were designed to assess the LLMs’ abilities in two categories: individual contributor tasks, such as fixing bugs or implementing features, and management tasks, in which the models role-play as a manager who evaluates and selects the best technical proposal to resolve a problem.
The researchers built a custom benchmark called SWE-Lancer to simulate real-life freelance tasks. This benchmark was designed to test the LLMs against actual issues faced by human engineers on freelance platforms.
Worrying results
The test results revealed significant limitations in the performance of AI models, particularly when it came to complex tasks that require a deep understanding of both the code and the underlying problems.
While the models showed some ability to solve bugs and automate certain tasks, they were unable to fully grasp why the bugs existed in the first place and often introduced new errors.
To generate the SWE-Lancer dataset, the researchers collaborated with 100 professional software engineers who helped identify potential freelance tasks on Upwork.
They extracted close to 1,500 tasks, which included bug fixes, feature implementations, and more intricate problems such as reviewing freelancer proposals.
These tasks were grouped into two categories: individual contributor tasks, worth roughly $585,225 in payouts, and managerial tasks, which focused on evaluating proposed solutions and fixes to software issues.
The research team created prompts based on the titles and descriptions of each task and compiled a snapshot of the relevant codebase.
In some cases, additional management tasks were generated to reflect scenarios in which the AI would need to evaluate and choose the best proposal for resolving a bug.
To ensure the models were genuinely addressing the software issues rather than simply retrieving answers from pre-existing sources, the SWE-Lancer evaluation gave them no internet access and no access to platforms such as GitHub, which could have given the models an unfair advantage.
The tasks were then run inside an isolated container, where the AI models were given the problem descriptions, codebases and other relevant data, but could not scrape any outside resources or code snippets.
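To give a flavour of this kind of isolation (this is a minimal sketch, not the researchers’ actual harness), a container can be launched with networking disabled so the model under test cannot reach GitHub or any other outside source. The image name, mount path and entry-point command below are hypothetical; the sketch assumes the Docker SDK for Python.

```python
# Minimal sketch of a network-isolated evaluation container.
# Assumes the Docker SDK for Python (pip install docker).
# The image name, mount path and test command are hypothetical placeholders.
import docker

client = docker.from_env()

logs = client.containers.run(
    image="swe-task-env:latest",                      # hypothetical image with the codebase snapshot baked in
    command="python run_patch_and_tests.py",          # hypothetical entry point that applies the model's patch
    network_disabled=True,                            # no internet: the model cannot look up existing fixes
    volumes={"/local/task_snapshot": {"bind": "/workspace", "mode": "ro"}},
    working_dir="/workspace",
    remove=True,                                      # clean up the container once the run finishes
)
print(logs.decode())
```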
Afterwards, the team developed detailed end-to-end tests to assess the AI models’ performance, which involved simulating real-world user flows, such as logging into applications, performing financial transactions, and ensuring that the AI’s solutions worked as expected.
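As an illustration of what such an end-to-end check might look like (the paper’s own tests are not reproduced here), the sketch below uses Playwright for Python to simulate a login followed by a payment; the URL, selectors and credentials are all hypothetical placeholders.

```python
# Sketch of an end-to-end test simulating a user flow (log in, then send a payment),
# in the spirit of the checks described above.
# Assumes Playwright for Python (pip install playwright && playwright install).
from playwright.sync_api import sync_playwright, expect

def test_login_and_send_payment():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Log into the application under test (hypothetical local instance)
        page.goto("http://localhost:8080/login")
        page.fill("#email", "qa.tester@example.com")
        page.fill("#password", "not-a-real-password")
        page.click("button[type=submit]")
        expect(page.locator("text=Dashboard")).to_be_visible()

        # Perform a financial transaction and assert the expected outcome
        page.click("text=Send money")
        page.fill("#amount", "25.00")
        page.click("button#confirm")
        expect(page.locator("text=Payment sent")).to_be_visible()

        browser.close()
```

A model-generated fix only counts as successful if flows like this still pass from end to end, which is a far stricter bar than unit tests on isolated functions.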
No credible alternative
The test results showed that the AI models, despite their impressive capabilities, did not perform well enough to replace human engineers, especially for the most challenging tasks.
In total, none of the three models came close to earning the full $1 million potential value of the tasks. Claude 3.5 Sonnet, the highest-performing model, managed to resolve only 26.2% of the individual contributor tasks and earned just over $200,000 of the potential task payout.
While this was the best result, the research team pointed out that even Claude 3.5 Sonnet’s solutions were mostly flawed, underlining the need for higher reliability before such models could be trusted for real-world deployment.
When it came to individual contributor tasks like bug fixes and feature implementations, the AI models showed some promise, but the vast majority of their solutions were incomplete or incorrect.
The models were able to localise problems, quickly pinpointing the relevant files and functions that likely caused an issue, often faster than human engineers.
However, the study revealed that these models struggled to understand how an issue might span multiple components or files. As a result, their solutions were either too simplistic or failed to address the underlying cause of the problem.
Interestingly, the AI models performed better when faced with managerial tasks. These tasks required the models to evaluate multiple proposals from human freelancers and choose the best solution to address a given problem.
In these cases, the AI demonstrated better reasoning and the ability to assess technical understanding, reflecting the strengths of LLMs in situations that require analysis rather than deep technical problem-solving.
The researchers noted that while LLMs can assist in simplifying tasks that involve routine debugging, the inability of these models to understand the root cause of problems makes them unreliable for tasks requiring comprehensive and accurate solutions.
This distinction between “low-level” coding and more complex engineering work is crucial, as it demonstrates that AI tools, while powerful, are still far from being able to take over all aspects of software development.
Long way to go
The study provides a clear picture of the current state of AI in the field of software development. While these language models show promise in assisting human engineers with basic tasks, they are still a long way from becoming autonomous engineers who can handle complex freelance tasks on platforms like Upwork.
The limitations identified by the study suggest that AI tools need further refinement before they can be used in real-world scenarios where precision and understanding are paramount.
However, the researchers also noted that the technology is improving rapidly, and it’s possible that AI models could soon evolve to handle more complex problems with greater accuracy.
For now, the results suggest that human software engineers will remain an essential part of the development process, especially for tasks that require a deep understanding of code and the intricate details of software systems.
While the models’ performance in management tasks offers some optimism for the future of AI in software development, the gap between AI’s current capabilities and the needs of real-world engineering remains significant.