
Deep Dive: will synthetic data change QA testing forever?


Synthetic enterprise data, meaning data generated from user behaviour on large companies’ software applications and powered by relatively young generative artificial intelligence (AI) technology, may change QA testing forever.

Moreover, generative AI can create realistic synthetic data for a host of models and scenarios, especially where real-world data is limited or sensitive. In other words, GenAI-powered synthetic enterprise data is ideal for QA testing.

At least, that is the view of research scientist and AI investor Kalyan Veeramachaneni, an MIT alumnus and founder of the US-based startup DataCebo.

Together with co-founder Neha Patki, Veeramachaneni developed a generative software system called the Synthetic Data Vault (SDV), which helps firms create synthetic data for testing software applications and training machine learning models.

Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning and language models.

Data generated by a computer simulation can also be seen as synthetic data. The open-source SDV is therefore ideal for software testing, Veeramachaneni argued.

“You need data to test these software applications,” he stated, pointing out that generative models are a step up from developers’ traditional approach of manually writing scripts to create synthetic data.

“With generative models, you can learn from a sample of data collected and then sample a large volume of synthetic data, which has the same properties as real data, or create specific scenarios and edge cases, and use the data to test your application,” Veeramachaneni said in a recent MIT blog post.
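To make the learn-then-sample idea concrete, here is a minimal sketch in plain Python. It is not SDV's algorithm or API: it swaps in a simple two-column Gaussian model, and the column meanings (account balance and monthly spend), the made-up "real" sample and every function name are assumptions chosen purely for illustration.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_gaussian_pair(rows):
    """Learn means, standard deviations and correlation of two columns."""
    xs = [row[0] for row in rows]
    ys = [row[1] for row in rows]
    return {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "sd_x": statistics.stdev(xs),
        "sd_y": statistics.stdev(ys),
        "rho": pearson(xs, ys),
    }


def sample_rows(model, n, rng):
    """Draw n synthetic rows that share the fitted means, spreads and correlation."""
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["sd_x"] * z1
        y = model["mean_y"] + model["sd_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Hypothetical "real" sample: (account balance, monthly spend). Invented data.
rng = random.Random(0)
real = []
for _ in range(200):
    balance = rng.gauss(5000, 1500)
    real.append((balance, 0.3 * balance + rng.gauss(0, 200)))

model = fit_gaussian_pair(real)               # learn from a small sample
synthetic = sample_rows(model, 10_000, rng)   # sample a large synthetic volume
```

A production generative model would capture far richer structure (categorical columns, multiple tables, constraints), but the shape of the workflow, fit on a small sample and then draw as many rows as needed, is the same.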

Taking a large financial institution as an example, he said: “If a bank wanted to test a program designed to reject transfers from accounts with no money in them, it would have to simulate many accounts simultaneously transacting.”

“Doing that with data created manually would take a lot of time. With generative models, customers can create any edge case they want to test.”
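The bank scenario can be sketched as a small test harness. Everything here is hypothetical: the `Account` class, the `try_transfer` function standing in for the program under test, and the generator that deliberately forces a share of accounts to be empty. For simplicity the transfers run one after another rather than simultaneously.

```python
import random
from dataclasses import dataclass


@dataclass
class Account:
    account_id: int
    balance_cents: int


def try_transfer(source, target, amount_cents):
    """Hypothetical system under test: reject transfers the source cannot fund."""
    if amount_cents <= 0 or source.balance_cents < amount_cents:
        return False
    source.balance_cents -= amount_cents
    target.balance_cents += amount_cents
    return True


def generate_accounts(n, empty_share, rng):
    """Generate n synthetic accounts, forcing roughly empty_share of them to zero."""
    accounts = []
    for account_id in range(n):
        if rng.random() < empty_share:
            balance = 0  # the edge case we want plenty of
        else:
            balance = rng.randint(1, 1_000_000)
        accounts.append(Account(account_id, balance))
    return accounts


rng = random.Random(42)
accounts = generate_accounts(1_000, empty_share=0.2, rng=rng)
total_before = sum(a.balance_cents for a in accounts)

rejected_from_empty = 0
for _ in range(5_000):
    source, target = rng.sample(accounts, 2)
    was_empty = source.balance_cents == 0
    accepted = try_transfer(source, target, rng.randint(1, 50_000))
    if was_empty:
        # The rule under test: an empty account can never send money.
        assert not accepted, "transfer from an empty account must be rejected"
        rejected_from_empty += 1
```

Raising `empty_share` is the manual equivalent of asking a generative model for more of one edge case: the tester dials up the rare situation instead of waiting for it to appear in real data.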

He continued by stressing that “it is common for industries to have data that is sensitive in some capacity. Often when you’re in a domain with sensitive data you are dealing with regulations, and even if there aren’t legal regulations, it’s in companies’ best interest to be diligent about who gets access to what at which time.”

Therefore, “synthetic data is always better from a privacy perspective,” Veeramachaneni argued.


In the last three years, DataCebo’s SDV has built an impressive client base of firms seeking the startup’s generative software system.

Veeramachaneni disclosed that SDV has been used more than one million times, with over 10,000 data scientists using the open-source library for generating synthetic tabular data.

Banks, insurers and other firms “use synthetic data instead of sensitive information in programs while still preserving the statistical relationships between datapoints,” he pointed out.

“Companies can also use synthetic data to run new software through simulations to see how it performs before releasing it to the public,” Veeramachaneni added.

Complex tasks

This approach does make testing processes a tad more complex, however.

“Enterprise data of this kind is complex, and there is no universal availability of it, unlike language data,” Veeramachaneni acknowledged.

“When folks use our publicly available software and report back if [it] works on a certain pattern, we learn a lot of these unique patterns, and it allows us to improve our algorithms,” he explained.

“From one perspective, we are building a corpus of these complex patterns, which for language and images is readily available.”

Veeramachaneni pointed out that his startup recently released features to improve the data’s usefulness, including the SDMetrics library, a set of tools to assess the “realism” of the generated data, and SDGym, a way to compare models’ performances.
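One common way to put a number on “realism” is to compare the distribution of a real column with its synthetic counterpart. The sketch below does so with the two-sample Kolmogorov-Smirnov statistic. It is an illustration only and does not use the SDMetrics API; the function name and the sample data are assumptions.

```python
import bisect
import random


def ks_similarity(real, synthetic):
    """Return 1 minus the two-sample KS statistic (1.0 means identical samples)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for value in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


# Invented data: one faithful synthetic column and one that is clearly shifted.
rng = random.Random(1)
real = [rng.gauss(100, 15) for _ in range(2_000)]
faithful = [rng.gauss(100, 15) for _ in range(2_000)]
shifted = [rng.gauss(130, 15) for _ in range(2_000)]

score_faithful = ks_similarity(real, faithful)
score_shifted = ks_similarity(real, shifted)
```

A score near 1 suggests the synthetic column follows the real one closely, while a low score flags a mismatch. A single-column check like this says nothing about relationships between columns, which need separate metrics.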

“It’s about ensuring organisations trust this new data,” Veeramachaneni said.

“[Our tools offer] programmable synthetic data, which means we allow enterprises to insert their specific insight and intuition to build more transparent models.”

As banks, insurance firms and other financial services institutions are rolling out AI applications at unprecedented speed, Veeramachaneni said his company is ultimately helping them to do so in a more transparent and responsible way.

“In the next few years, synthetic data from generative models will transform testing,” according to Veeramachaneni.

“We believe 90 percent of enterprise operations can be done with synthetic data,” he concluded.
