Exclusive: ‘AI productivity means more risk,’ warns Tricentis ML head

David Colwell, VP, AI & Machine Learning at Tricentis
David Colwell, VP, AI & Machine Learning at Tricentis

Global testing and QA firm Tricentis launched last month a new AI-powered platform that it claims can measure software speed and quality more efficiently and accurately.

The new platform, called Copilot, is supported by GenAI applications and helps quality engineering teams and developers integrate AI responsibly by testing complex applications.

The launch came only weeks after Vienna-based Tricentis brought a new SaaS-based test automation capability to the market, designed to help banks, financial services firms and other large entities that use SAP solutions to manage end-to-end transformation initiatives and need to store or process large amounts of personal or sensitive data.

Time for QA Financial to catch up with one of the firm’s key players, David Colwell, VP of AI and ML, to zoom in on the company’s generative AI investments in testing and overall DevOps efforts.

QA Financial: Congrats on your recent launch of Tricentis Copilot, which leverages GenAI to enhance the testing lifecycle. Why does this tool stand out from your main rivals?

David Colwell: The key differentiator for the Tricentis Copilot solutions is that they are embedded in our testing products, removing the need to switch between tools and offering full testing context – unlike external generative AI, or GenAI, tools. There are technological gaps in current generative AI and large language models, LLMs. Our technology is closing these gaps by adding what we like to think of as specialised brains, or micro-optimisations for specific fields in testing, into the mix.

QA Financial: So how do you deploy LLMs?

David Colwell: We use various methods to train LLM models such as Retrieval Augmented Analysis, or RAG, to enable the AI to understand specific user interfaces, test data or application types such as mobile, CRM, or ERP, and so on. This means that with Tricentis Copilot, users can complete complex tasks in testing that ChatGPT alone cannot solve. The solution can automatically suggest test cases and fixes, and optimise workflows to boost productivity and improve time-to-market. With Copilot, the first product to launch, code is easy to add to custom steps at the touch of a button, while fixes generated can be directly inserted into the code, reducing debug time and effort.


“AI tools are not infallible; they hallucinate, display overconfidence, and lack critical questioning.”.

– David Colwell

Our ‘explain’ function reads through existing code and provides a summary of what it does, making it easier for team members to leverage existing functionality. Best practices can be scaled by using the copilot to create a library of reusable custom steps across different tests or projects. Finally, non-technical users can ensure test robustness by harnessing natural language input on what a test needs to do to generate code without needing deep JavaScript expertise.

QA Financial: So how does AI fit into this approach, exactly?

David Colwell: Yes, the other main element that sets Copilot apart is our focus on providing responsible and high-quality AI solutions. We have worked hard to implement a holistic, responsible and high-quality AI approach that ensures testers and developers are productive. We strive to meet privacy, security, and compliance standards when building AI into our products, providing a user-centric and accessible design built to support the AI needs of all testers and developers. Our approach isn’t simply to build a wrapper around a large language model, but to create quality AI by combining machine learning and generative AI to power business productivity for our users. We also work to implement testing AI to help customers assess the readiness and potential risks of implementing AI in their own products.

QA Financial: Speaking of implementation, what are the main challenges of developing and integrating AI-powered testing applications in that respect?

David Colwell: There are two sides to every coin, and the flipside of AI-powered productivity is increased risk. AI tools are not infallible; they hallucinate, display overconfidence, and lack critical questioning. AI lawsuits have increased 26-fold since 2021. GitHub’s AI assistant may raise developer productivity by 55%, but it also produces code with defects 40% of the time. As LLMs evolve, ensuring responsible and trustworthy AI will become critical for data privacy, security and transparency.

Testing GenAI systems to protect against common concerns such as hallucination, bias and discrimination, misclassification, vulnerable systems, incuriosity and more, is therefore crucial. Complex AI models can, however, be rather like a ‘black box – opaque and not easily decipherable. This is because AI systems are nondeterministic. A single input can yield a variety of outputs, making them unpredictable and rendering traditional testing methods for GenAI inadequate. Unlike rule-based systems, where the logic and rules are explicitly defined, there is a lack of transparency on how AI models “learn”, making it hard to explain their reasoning.

QA Financial: David, you mentioned testing GenAI systems, so let’s zoom in on that. How do you test the functionality and performance of your own AI-powered tools and capabilities?

David Colwell: AI is prone to error, so testing it is crucial. When testing AI-powered tools it’s important to first determine whether they are based on a traditional or deterministic AI system, or a generative AI system. Testing a traditional AI system tends to be more straightforward because their outputs can be predetermined and validated. It is not capable of leaking sensitive data and information, and can be punished or rewarded if it recognises an object correctly or incorrectly. This means that testing can be much more controlled.


“GitHub’s AI assistant may raise developer productivity by 55%, but it also produces code with defects 40% of the time.”

– David Colwell

That said, it’s still crucial to assess potential biases within the data such as representation, recency, or selection bias. Exploratory testing methods can also be applied to uncover edge cases or try to simulate real world usage to help identify why the system might not perform as expected. Testing generative AI systems presents a greater challenge due to their inherently non-deterministic and variable outputs. Red teaming methodologies must be applied in which exploitable vulnerabilities are identified in order to gain access.

An AI red team is responsible for identifying specific risks that an implementation of generative AI may cause and explore the fallout. These risks, known as harms, are vectors by which the AI can act in a way that may cause harm to the organisation or its customers, or act outside of the programme’s intent. Harms include things such as hijacking an AI system or its components to use them for purposes unintended by the creators or owners, sensitive data extraction, or prompt overflowing to disrupt a system’s primary function.

It is then important to test the general quality of the AI system by asking whether it does the job it was intended for. For instance, if a bot cannot answer basic questions, it cannot be deployed because it’s not doing its job and it can therefore impact a business’ bottom line.

In addition to the specific harm-identification techniques of red teaming, there are also broader categories of testing that play a critical role in evaluating generative AI systems for performance, security, and reliability. Load/performance testing for instance evaluates how well a generative AI system performs under varying degrees of demand. It measures the system’s responsiveness, stability, and scalability when handling a large number of requests simultaneously. Performance testing is essential for ensuring that the AI system can maintain its efficiency and accuracy under real-world conditions, where user demands can be unpredictable and fluctuate significantly.

QA Financial: Yes, since user demands can be unpredictable and constantly evolve, there is increasing scrutiny of AI-based applications, how are you building anticipated compliance requirements into your apps?

David Colwell: We are focused on ensuring responsible and trustworthy use of AI in our products. We are guided by the AI principles from the Organization for Economic Cooperation and Development (OECD), which include transparency, privacy adherence, human control, fair application and accountability. We also adhere to the Code of Conduct for Azure OpenAI Service, our trusted provider for AI services. Microsoft provides enterprise level data privacy and data security compliance and maintains a multi-jurisdictional AI compliance programme for its AI services.

QA Financial: As we discussed, AI and GenAI are very much buzzwords at the moment in the industry. Undoubtedly the biggest trend in QA in years. How will they change the face and nature of testing in the years to come, you think?

David Colwell: Software testing is a huge area of opportunity for GenAI. Providing access to GenAI tools for testing is like providing each team member with a junior assistant that works for them. They can discover what needs to be tested, create those tests, then self-diagnose errors and optimise the code, as well as provide intelligent insights and analytics to debug applications. AI-based chatbots can also offer invaluable support to help answer questions, locate documentation, and provide links to additional resources. All of this means that the time spent on developing high quality software can be greatly reduced, allowing time and space for productivity and innovation.


“There is no structural or technological limitation to the implementation of AI, so any constraints are likely to be societal.”

– David Colwell

IDC estimates that enterprises will leverage GenAI and automation technologies to drive $1 trillion in productivity gains by 2026, and their survey data also shows that testing and automated software quality are a key area for anticipated GenAI benefits in the coming year. Examples of areas of focus include test prioritisation, identifying root cause of failure, automated test case creation and self-healing of test cases.

While GenAI shows great promise, quality engineering and development teams need well-designed, responsibly implemented solutions that fit within their testing workflows to realize these productivity benefits. We believe that Tricentis Copilot steps neatly into this gap.

QA Financial: Finally, is there anything else that you’d like to share with our readers?

David Colwell: There is no structural or technological limitation to the implementation of AI, so any constraints are likely to be societal. We won’t be able to adopt autonomous AI so quickly, for example. If you consider self-driving cars, despite the fact that they are technically less likely to cause an accident than a human, adoption will be slow due to concerns over where responsibility lies in the event of a crash. Overall approval for these highly autonomous systems will also be a key issue. We expect that AI will be able to complete tasks with very little human supervision within the next three to five years, but whether or not that is allowed is another question!



Stay up to date and receive our news, features and interviews for free

Our e-newsletter lands in your inbox every Friday. Sign up HERE in one simple step.


READ MORE