QA insiders generally agree that synthetic data can significantly enhance testing, and such data sets are gaining momentum.
Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning and language models.
Data generated from user behaviour in large companies’ software applications, combined with generative artificial intelligence, can produce a host of models and scenarios that are ideal for QA testing, especially where real-world data is limited or sensitive.
But do developers and engineers fully grasp the value of synthetic data yet, and understand its potential?
In that respect, the QA space still has a long way to go, thinks Haritha Khandabattu, a senior director at Gartner, where she primarily focuses on AI, GenAI and software engineering.
The Amsterdam-based analyst argues it is vital that testing teams start to understand the potential of synthetic data.

Because synthetic data is generated artificially through advanced methods such as machine learning, it can stand in when real-world data is unavailable. “It offers a multitude of compelling advantages, such as its flexibility and control, which allows engineers to model a wide range of scenarios that might not be possible with production data,” Khandabattu explained.
She stressed that market awareness of synthetic data for software testing “has been very low and its potential has not yet been realized by software engineering leaders.”
In fact, Gartner has found that 34% of software engineering leaders identified improving software quality as one of their top three performance objectives.
However, “many software engineering leaders are inadequately equipped to achieve these objectives because their teams rely on antiquated development and testing strategies,” Khandabattu wrote in a recent SDT post.
“QA leaders should evaluate the feasibility of synthetic data to boost software quality and accelerate delivery,” she argued.
While market awareness of synthetic data is generally low, it is rising, Khandabattu continued.
“Compared to large language models, synthetic data generation is a relatively mature market.”
She wrote that synthetically generated data for software testing offers a range of benefits such as security and compliance.
“Synthetic data can mitigate the risk of exposing sensitive or confidential information to comply with data privacy regulations.”
There is also reliability: synthetic data allows control over specific data characteristics, such as age, income or location, to match target customer demographics. Software engineers can generate data that meets their product’s testing needs and update it as use cases change.
“Once generated, datasets can be retrained for reliable and consistent testing scenarios,” Khandabattu noted.
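To make that control concrete, here is a minimal, tool-agnostic sketch using only Python’s standard library; every field name, range and location in it is purely illustrative, not tied to any product or vendor platform:

```python
import random

random.seed(7)  # fix the seed so generated test data is reproducible

# Illustrative demographic profile; all names and ranges here are hypothetical.
LOCATIONS = ["Amsterdam", "London", "Chicago"]

def synthetic_customer() -> dict:
    """Generate one synthetic customer record with controlled characteristics."""
    return {
        "age": random.randint(25, 65),                        # constrain the age band
        "income": round(random.uniform(30_000, 120_000), 2),  # constrain income
        "location": random.choice(LOCATIONS),                 # constrain geography
    }

# Any volume on demand -- no production data required.
customers = [synthetic_customer() for _ in range(1_000)]
```

If the test plan later changes, say it now needs retirees instead of working-age customers, only the ranges above change; no fresh production data has to be sourced or anonymised.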
“QA leaders should evaluate the feasibility of synthetic data to boost software quality and accelerate delivery.”
– Haritha Khandabattu
Next up is customization: synthetic data generation techniques and platforms provide capabilities to include diverse data patterns and edge cases.
“Since the data is artificially generated, test data can be made available even if a feature has no production data, resulting in the ability to test new features and inherently enhancing the test coverage,” Khandabattu wrote in her blog post.
Finally, synthetic generation can meet the rising demand for data.
“Quality engineers can create any volume of data they need without limitations or delays associated with real-world data acquisition,” she pointed out, adding that “this is particularly valuable for testing features with limited real-world data or for large-scale performance testing.”
She went on to argue that software engineering leaders can enhance development cycle efficiency by strategically transitioning to synthetic data for testing.
“This enables teams to conduct secure, efficient and comprehensive tests, resulting in high-quality software.”
One thing Khandabattu did note was the costs firms may encounter during the implementation and rollout of synthetic data.
“It is vital to determine ROI that outlines the strategic significance, expected returns and methods for mitigating risks to generate the requisite support and secure budget for synthetic data investment,” she said.
To accurately determine ROI, software engineering leaders should include non-financial benefits such as improved compliance, data security, and innovation, Khandabattu concluded.
Impact
Khandabattu is not alone in pushing for synthetic data to be given a more prominent spot within the QA ecosystem.
Research scientist and AI investor Kalyan Veeramachaneni, an MIT alumnus and founder of the US-based startup DataCebo, also thinks testers and developers should focus much more on synthetic data models.
He argues such data sets could change QA testing forever, since generative AI can create realistic synthetic data for a host of models and scenarios, especially where real-world data is limited or sensitive.
In other words, GenAI-powered synthetic enterprise data is simply ideal for QA testing.
Together with co-founder Neha Patki, Veeramachaneni developed a generative software system called the Synthetic Data Vault (SDV), which helps firms create synthetic data for testing software applications and training machine learning models.
Since data generated by a computer simulation qualifies as synthetic data, the open-source tool is well suited to software testing, Veeramachaneni argued.

“You need data to test these software applications,” he stated, pointing out this is a step up from developers’ traditional approach, when they manually write scripts to create synthetic data.
“With generative models, you can learn from a sample of data collected and then sample a large volume of synthetic data, which has the same properties as real data, or create specific scenarios and edge cases, and use the data to test your application,” Veeramachaneni set out in a recent MIT blog.
Singling out a large financial institution as an example, he said: “if a bank wanted to test a program designed to reject transfers from accounts with no money in them, it would have to simulate many accounts simultaneously transacting.”
“Doing that with data created manually would take a lot of time. With generative models, customers can create any edge case they want to test.”
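A rough sketch of that scenario, with a toy transfer rule and entirely hypothetical account records, shows how cheap the edge case becomes once the data is generated rather than staged manually:

```python
import random

random.seed(42)  # reproducible synthetic transfers

def should_reject(balance: float, amount: float) -> bool:
    """Toy rule under test: reject any transfer the balance cannot cover."""
    return amount > balance

# Synthetic edge case: thousands of empty accounts all attempting transfers,
# something slow and risky to reproduce with real customer data.
accounts = [{"id": i, "balance": 0.0} for i in range(10_000)]
transfers = [(account["id"], random.uniform(1.0, 500.0)) for account in accounts]

rejections = sum(
    should_reject(accounts[account_id]["balance"], amount)
    for account_id, amount in transfers
)
print(rejections)  # prints 10000 -- every transfer from an empty account is rejected
```

Scaling the scenario up, or skewing it towards other edge cases such as near-zero balances, is a one-line change to the generator rather than a data-collection exercise.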
He continued by stressing that “it is common for industries to have data that is sensitive in some capacity. Often when you’re in a domain with sensitive data you are dealing with regulations, and even if there aren’t legal regulations, it’s in companies’ best interest to be diligent about who gets access to what at which time.”
Therefore, “synthetic data is always better from a privacy perspective,” Veeramachaneni argued.
“If a bank wants to test a program designed to reject transfers from accounts with no money in them, it would have to simulate many accounts simultaneously transacting.”
– Kalyan Veeramachaneni
In the last three years, DataCebo’s SDV has built an impressive client base drawn to the startup’s generative software system.
Veeramachaneni disclosed that SDV has been used more than one million times, with over 10,000 data scientists using the open-source library for generating synthetic tabular data.
Banks, insurers and other firms “use synthetic data instead of sensitive information in programs while still preserving the statistical relationships between datapoints,” he pointed out.
“Companies can also use synthetic data to run new software through simulations to see how it performs before releasing it to the public,” Veeramachaneni added.
Complex tasks
This approach does make testing processes a tad more complex, however.
“Enterprise data of this kind is complex, and there is no universal availability of it, unlike language data,” Veeramachaneni acknowledged.
“When folks use our publicly available software and report back if it works on a certain pattern, we learn a lot of these unique patterns, and it allows us to improve our algorithms,” he explained.
“From one perspective, we are building a corpus of these complex patterns, which for language and images is readily available.”
Veeramachaneni pointed out that his startup recently released features to improve the data’s usefulness, including tools to assess the “realism” of the generated data, called the SDMetrics library, as well as a way to compare models’ performances, called SDGym.
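The idea behind such realism checks can be illustrated with a toy, standard-library-only score. To be clear, this is not the SDMetrics API; it is just a sketch of the kind of distribution comparison such tools automate across whole tables:

```python
import random
import statistics

random.seed(0)

# A "real" sample (simulated here for the sketch) and a synthetic sample
# drawn from a Gaussian fitted to it.
real = [random.gauss(100, 15) for _ in range(5_000)]
mu, sigma = statistics.mean(real), statistics.stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(5_000)]

def realism_score(a: list, b: list) -> float:
    """Crude 0-1 score: how closely b's mean and stdev track a's."""
    mean_gap = abs(statistics.mean(a) - statistics.mean(b)) / max(abs(statistics.mean(a)), 1e-9)
    std_gap = abs(statistics.stdev(a) - statistics.stdev(b)) / max(statistics.stdev(a), 1e-9)
    return max(0.0, 1.0 - (mean_gap + std_gap) / 2)

score = realism_score(real, synthetic)  # close to 1.0 when distributions match
```

Production-grade tools go much further, checking per-column shapes and cross-column relationships, but the principle is the same: quantify how faithfully the synthetic data mirrors the real distribution so organisations can trust it.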
“It’s about ensuring organisations trust this new data,” Veeramachaneni said.
He is convinced that “in the next few years, synthetic data from generative models will transform testing. I believe 90 percent of enterprise operations can be done with synthetic data.”