Banks turn to synthetic data as QA bottlenecks meet new regulatory walls

Kalyan Veeramachaneni, a principal research scientist in the Laboratory for Information and Decision Systems at MIT, and co-founder of DataCebo

With regulators from Brussels to Washington tightening scrutiny over how test data is handled, financial institutions are running out of room to use production-based datasets.

In response, QA teams are rapidly deploying synthetic data, artificial datasets that mirror the statistical structure of real information, to maintain both agility and control.

According to Kalyan Veeramachaneni of MIT’s Laboratory for Information and Decision Systems, the shift marks a turning point: synthetic data is no longer experimental but a compliance-driven necessity shaping the next era of testing and AI model validation.

“While concrete numbers are hard to pin down, some estimates suggest that more than 60% of data used for AI applications in 2024 was synthetic, and this figure is expected to grow across industries,” said Veeramachaneni, who is also the co-founder of DataCebo.

For QA teams at banks, where data access is tightly controlled under frameworks such as the EU’s General Data Protection Regulation (GDPR) and the Digital Operational Resilience Act (DORA), synthetic data addresses a long-standing bottleneck.

“Because synthetic data don’t contain real-world information, they hold the promise of safeguarding privacy while reducing the cost and increasing the speed at which new AI models are developed,” Veeramachaneni said.

“But using synthetic data requires careful evaluation, planning, and checks and balances to prevent loss of performance when AI models are deployed,” he warned.

Regulatory obstacles

Banks and insurers typically test applications with anonymised production data, but this practice is increasingly restricted under GDPR and other privacy rules.

Regulators now expect strict guarantees that personal or transactional information is not exposed outside of production environments. For firms also subject to model-risk management frameworks in the U.S. and Europe, transparency over data lineage is becoming a board-level issue.

In this context, synthetic data provides a controlled alternative.

“One fundamental application which has grown tremendously over the past decade is using synthetic data to test software applications,” Veeramachaneni said.

“There is data-driven logic behind many software applications, so you need data to test that software and its functionality. In the past, people have resorted to manually generating data, but now we can use generative models to create as much data as we need.”


For QA teams testing payment rails or high-volume trading platforms, the benefits are obvious. “Because synthetic data aren’t drawn from real situations, they are also privacy-preserving,” Veeramachaneni said.

He noted that “one of the biggest problems in software testing has been getting access to sensitive real data for testing software in non-production environments, due to privacy concerns.”

“Another immediate benefit is in performance testing. You can create a billion transactions from a generative model and test how fast your system can process them,” Veeramachaneni explained.
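
As a rough illustration of that performance-testing idea, the Python sketch below generates a batch of artificial transactions and times how quickly a placeholder routine consumes them. Everything in it is an assumption made for illustration: the Transaction fields, the generate_transactions helper (a plain random generator standing in for a trained generative model) and the process function (a stand-in for the system under test) do not come from Veeramachaneni or DataCebo.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Transaction:
    # Hypothetical fields; a real payment record would carry far more.
    account_id: int
    amount_cents: int
    currency: str


def generate_transactions(n, seed=0):
    """Yield n random stand-in transactions (not a trained generative model)."""
    rng = random.Random(seed)
    currencies = ("EUR", "USD", "GBP")
    for _ in range(n):
        yield Transaction(
            account_id=rng.randrange(1, 10_000),
            amount_cents=rng.randrange(1, 5_000_000),
            currency=rng.choice(currencies),
        )


def process(tx, totals):
    """Hypothetical system under test: accumulate totals per currency."""
    totals[tx.currency] = totals.get(tx.currency, 0) + tx.amount_cents


def measure_throughput(n, seed=0):
    """Return (count, transactions per second, totals) for n synthetic rows."""
    batch = list(generate_transactions(n, seed))  # generated outside the timed region
    totals = {}
    start = time.perf_counter()
    for tx in batch:
        process(tx, totals)
    elapsed = time.perf_counter() - start
    rate = len(batch) / elapsed if elapsed > 0 else float("inf")
    return len(batch), rate, totals


if __name__ == "__main__":
    count, rate, _ = measure_throughput(100_000)
    print(f"processed {count} transactions at {rate:,.0f} per second")
```

The batch size here is deliberately small; a test at the scale described in the quote would stream records rather than hold them all in memory.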

Synthetic data is also finding applications in fraud detection, anti-money-laundering systems, and risk modelling, areas where regulators demand both accuracy and accountability.

“Sometimes, we want an AI model to help us predict an event that is less frequent. A bank may want to use an AI model to predict fraudulent transactions, but there may be too few real examples to train a model that can identify fraud accurately. Synthetic data provide data augmentation, additional data examples that are similar to the real data. These can significantly improve the accuracy of AI models,” Veeramachaneni said.

The technology also allows firms to train models faster and at lower cost. “Sometimes users don’t have time or the financial resources to collect all the data,” he stressed.

Veeramachaneni added that “if you end up with limited data and then try to train a model, it won’t perform well. You can augment by adding synthetic data to train those models better.”
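
To make the augmentation idea concrete, here is a minimal Python sketch that creates extra examples of a rare class by interpolating between pairs of real ones. It is a simplified, SMOTE-like stand-in chosen for brevity, not the generative models Veeramachaneni describes; the augment_minority function and the two features (amount and hour of day) are hypothetical.

```python
import random


def augment_minority(samples, n_new, seed=0):
    """Create n_new synthetic rows by interpolating between random pairs of real rows.

    Simplification: any two rows may be paired, rather than nearest neighbours.
    """
    if len(samples) < 2:
        raise ValueError("need at least two real examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical fraud examples: (amount, hour_of_day)
fraud_rows = [(950.0, 2.0), (1200.0, 3.0), (870.0, 1.0)]
augmented = fraud_rows + augment_minority(fraud_rows, n_new=7, seed=42)
```

Each new row lies between two real rows, so the augmented set stays within the range of the observed fraud cases. Whether such rows actually improve a model would still need to be checked on held-out real data.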

Risks of bias and compliance

Still, regulators will expect firms to prove that synthetic data does not compromise model reliability or perpetuate hidden bias.

“One of the biggest questions people often have in their mind is, if the data are synthetically created, why should I trust them?” Veeramachaneni said. “Determining whether you can trust the data often comes down to evaluating the overall system where you are using them.”

Bias remains a critical challenge in regulated industries. “Since it is created from a small amount of real data, the same bias that exists in the real data can carry over into the synthetic data,” Veeramachaneni warned.

“Just like with real data, you would need to purposefully make sure the bias is removed through different sampling techniques, which can create balanced datasets. It takes some careful planning, but you can calibrate the data generation to prevent the proliferation of bias.”
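
One simple sampling technique of the kind mentioned above is to draw the same number of rows from every group before training or generation. The Python sketch below shows that approach under stated assumptions: the balanced_sample helper, the "region" attribute and the row layout are hypothetical, and the quote does not specify which techniques Veeramachaneni has in mind.

```python
import random
from collections import defaultdict


def balanced_sample(rows, group_key, per_group, seed=0):
    """Draw per_group rows from each group.

    Groups smaller than per_group are sampled with replacement,
    so their rows may repeat in the result.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) >= per_group:
            sample.extend(rng.sample(members, per_group))
        else:
            sample.extend(rng.choices(members, k=per_group))
    return sample


# Hypothetical skewed dataset: 90 rows from region A, 10 from region B.
rows = [{"region": "A", "amount": i} for i in range(90)]
rows += [{"region": "B", "amount": i} for i in range(10)]
balanced = balanced_sample(rows, "region", per_group=20, seed=3)
```

Equal group sizes address only representation imbalance; other forms of bias in the underlying records would carry over unchanged.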

To support governance, MIT researchers have built evaluation tools that banks could adopt as part of their compliance reporting.

“To help with the evaluation process, our group created the Synthetic Data Metrics Library,” Veeramachaneni said, explaining that “we worried that people would use synthetic data in their environment and it would give different conclusions in the real world. We created a metrics and evaluation library to ensure checks and balances.”
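
A basic check of this kind compares the distribution of a column in the real data with the same column in the synthetic data. The Python sketch below computes a two-sample Kolmogorov-Smirnov statistic and turns it into a similarity score between 0 and 1. It is a hand-rolled illustration and does not use or reproduce the API of the Synthetic Data Metrics Library; the function names are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def similarity_score(real, synthetic):
    """1.0 means identical empirical distributions; 0.0 means the samples do not overlap."""
    return 1.0 - ks_statistic(real, synthetic)
```

A single-column score like this says nothing about relationships between columns, which is one reason a fuller evaluation suite is needed in practice.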

As QA teams move towards greater automation and resilience, synthetic data is expected to reshape how financial firms design both software testing and model validation workflows.

“I expect that the old systems of working with data, whether to build software applications, answer analytical questions, or train models, will dramatically change as we get more sophisticated at building these generative models,” Veeramachaneni concluded.

“A lot of things we have never been able to do before will now be possible.”

