When it comes to testing the quality and reliability of artificial intelligence applications, and the larger language models that power them, the QA space still has a long way to go.
In fact, standard third-party tests are practically non-existent and there are no uniform testing models for AI yet, despite banks, insurers and other financial services firms rushing to roll out and implement AI-powered software in their digital infrastructure.
As a result, many AI solution providers have gone onto developing their own standards and testing practices, which are far from transparent or uniform, and often lack impartiality.
For this reason, the chief executive of Cohere, Aiden Gomez, warned recently that quality testing in AI language models is “a broken system.”
And AI industry giant Ai unicorn Anthropic acknowledged the other day that testing the quality of AI models is still “very limited”.
Men on a mission
The current fragmented AI testing landscape has prompted a number of young developers in California to set up a new startup to develop a standard test for artificial intelligence applications and the larger language models they use.
The founder said the firm, called Vals.ai, aims to create a global test for AI apps, with a specific focus on the financial services industry, corporate finance and legal services such as contract law and tax law.
CTO and founder Langston Nashold, based in Palo Alto, California, launched up the company with Red Havaei and Rayan Krishnan after the three completed Stanford’s masters program in artificial intelligence.
“Am super excited, this marks the official launch of our company, vals.ai,” Nashold shared.
Nashold and his team are working to develop a comprehensive, third-party test to review large language models.
“Model benchmarks today are currently self-reported: there’s a concerning amount of dataset cherry-picking when results are shared. Moreover, the models are often inadvertently trained on evaluation sets, compromising the integrity of results,” Nashold stressed.
He continued: “To combat this, our first initiative is a public benchmark. We’ve rigorously tested 15 LLMs on four domains, ensuring that two of the datasets remain completely private to prevent any data leakage.”
“One major issue we’re addressing is the lack of transparency and objectivity in current benchmarking practices.”
– Langston Nashold
So far, Vals.ai managed to identify potential errors and flaws in a range of AI models.
For example, most recently, it spotted that some models struggle with complex tax-related matters.
Vals.ai said that, in this instance, GPT-4 from OpenAi turned out to be the most accurate platform, at close to 55%, while Google’s Gemini Pro achieved a score of just under 32%.
Nashold disclosed the business has secured an unknown sum of pre-seed money from Pear Venture Capital, and a Sequoia scout investor has also allocated funds, but no further details were provided.
However, he did stress investor interest clearly illustrated there is a healthy appetite for objective testing, especially as banks, insurers, law firms and other financial services firms are rolling out AI applications at unprecedented speed.
GenAI testing
While Vals.ai takes a global approach and targets various sectors, including financial services, California-based Akto also spotted the need for AI application testing within the rapidly-evolving banking and insurance space.
Chief technology officer Ankush Jain told QA Financial that Akto’s new platform, GenAI Security Testing, is “the world’s first proactive generative artificial intelligence security testing platform” as he stressed the system has been designed to specifically target the security challenges many banks, insurers investors and other big finance players face.
The software developer said the new solution should better protect the security of large language models by testing for flaws and vulnerabilities in LLMs and their security layers, thereby picking up on attempts, or detected attempts, to link malicious codes for remote access, which often lead to outright hacks.
The system also focuses on cross-site scripting and other potential hacks that may allow the hackers to obtain access or information.
The primary aim is to constantly test the LLMs the financial institution uses in order to confirm whether the models are vulnerable to creating or producing incorrect or irrelevant outputs.
Jain pointed out that the solution includes a range of features, including over 60 test cases that cover a range of elements that may confirm system vulnerabilities in the GenAI infrastructure.
Examples include overdependency on specific data sets and prompt injection of untrustworthy data.
“Our generative AI security experts have developed these test cases to ensure protection for financial firms looking to deploy generative AI models,” he explained.
“The tests try to exploit LLM vulnerabilities through different encoding methods, separators and markers,” he continued.
“This specially detects weak security practices where developers encode the input or put special markers around the input.”
“New threats have emerged due to overreliance on AI outputs without proper verification.”
– Ankush Jain
Jain agreed with the Vals.ai team that there is a clear demand for a more uniform and standardised approach to AI testing in the wider financial services space as banks, insurers and other firms are rolling out AI and LLMs as never before, “driven by a desire for more efficient, automated workflows.”
However, “new threats have emerged, such as unauthorised prompt injections, denial-of-service attacks and data inaccuracies due to overreliance on AI outputs without proper verification,” he said.
“As hackers continue to find more creative ways to exploit LLMs, the need has arisen for security teams to discover a new, automated way to secure LLMs at scale,” Jain noted.
Akto is a venture-capital back startup that was launched by current CEO Milin Desai in 2022. Among its investors are Accel Partners, Notion Labs’ founder Akshay Kothari and Tenable’s founder Renaud Deraison.
Global efforts
The lack of global, uniform AI standard testing has not gone entirely unnoticed.
In both the U.S., UK and EU legislators are gradually starting to call for a set of standards or are even actively looking into quality assurance issues with regards to AI.
In the U.S. and Britain, for example, collaborations are underway to jointly test AI models and develop common standards.
Groups in both countries have said they are working to use the same tools, infrastructure and approach when it comes to AI testing.
Meanwhile, the European Parliament approved last month what is believed the world’s most comprehensive regulatory framework for the use and rollout of artificial intelligence.
However, the Act did fail to spell out any stipulations or rules with regards to testing AI applications or monitoring the implementation of AI tech in the financial services space.
The Artificial Intelligence Act bans certain AI applications, with new obligations for high-risk AI systems, which include banking and insurance and certain systems in law enforcement.
“Such systems must maintain use logs, be transparent and accurate, and ensure human oversight,” said MEP Dragos Tudorache, who worked on the new EU legislation.
Therefore, Tudorache, who is also the EP’s Civil Liberties Committee co-rapporteur, did acknowledge “much work lies ahead that goes beyond the AI Act itself.”
He told QA Financial that: “AI will push us to rethink. The AI Act is a starting point for a new model of governance built around technology. We must now focus on putting this law into practice with testing being a major element.”
Stay up to date and receive our news, features and interviews for free
Our e-newsletter lands in your inbox every Friday. Sign up HERE in one simple step.
LAST WEEK
DO NOT miss coverage of our conferences in Chicago and Toronto
READ MORE