Large language models (LLMs) power today’s generative artificial intelligence (AI) applications, excelling at tasks such as generating text, translating languages and answering questions.
However, a critical issue lurks beneath the surface: their tendency to produce content that is factually incorrect or even nonsensical, a phenomenon commonly referred to as “hallucination”.
Cognitive scientists argue that “confabulation” is a more accurate term for this particular behaviour in AI systems.
“All GenAI LLMs, without exception, are susceptible to hallucinations, albeit to varying degrees,” explained Prabhakar Srinivasan, co-head of AI at Synechron.
Hallucination levels differ, however. Srinivasan singled out a recent study by Vectara, whose ‘hallucination leaderboard for LLMs’ found that GPT4-Turbo currently leads with the lowest reported hallucination rate at 2.5%.
In contrast, Anthropic’s Claude 3 Opus exhibits a significantly higher rate of 7.4%.
“This variability underscores the challenge of mitigating hallucinations and the critical need for ongoing advancements in AI technology to enhance accuracy and reliability,” Srinivasan explained.
Impact and consequences
The consequences of hallucination in LLMs can be significant.
Consider the case of an AI chatbot for customer support, Srinivasan said. “Hallucination can render the chatbot hard to interact with, causing frustration for the customer during the interactions, drifting into profanity, poor humour, and even self-deprecating poetry.”
Hallucinations in LLMs may generate misleading or offensive content, compromising system reliability, jeopardising user trust and damaging brand image, he stressed. For banks, financial services firms and, in particular, insurance firms, this can mean significant reputational risk.
“Understanding the nuances of hallucinations in large language models is crucial for advancing their reliability,” Srinivasan stated, adding that the industry broadly categorises these errors into two primary types.
The first type is factuality hallucinations, which occur when a model generates output that contradicts verified facts or fabricates details. These break down further into factual inconsistency, such as incorrectly stating the year of the moon landing as 1975, and factual fabrication, such as falsely claiming that the tooth fairy originated in New York.
The second type is faithfulness hallucinations, where responses fail to follow the user’s instructions or the provided context, leading to logical discrepancies. These include instruction inconsistency, such as being asked to translate “What is the currency of SA?” into French but responding with the currency of the USA; context inconsistency, such as describing the Himalayas as spanning South African nations rather than South Asian ones; and logical inconsistency, such as identifying the second prime number as 2, despite the definition of a prime number.
Why does this happen?
Hallucination in large language models can be traced back to three main sources, Srinivasan elaborated.
Firstly, data-driven issues. “The foundation of any LLM is the data it’s trained on. When this data is flawed, whether due to mismatches between training data and task requirements or simply poor data quality, the model may lose its factual grounding,” he said.
Additionally, training on a vast corpus that includes biases or inconsistent information can embed these inaccuracies into the model, leading it to replicate these errors and biases in its outputs, Srinivasan noted.
Then there are training-driven issues.
“The training methodologies themselves can also introduce hallucinations,” he said, pointing to limitations in model architecture, such as weaknesses in the transformer structures used for learning, which can restrict a model’s ability to generate accurate predictions.
Furthermore, training objectives that don’t align with real-world use, such as reliance on autoregressive, token-by-token prediction, can exacerbate the risk of hallucinations by lowering the probability that the model stays within the bounds of a correct response across the whole of a long output.
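A rough illustration of that compounding effect (the 0.99 per-token figure and the independence assumption below are hypothetical, not drawn from Srinivasan or any cited study): even a model that picks a sound token 99% of the time becomes unlikely to get every token of a long answer right.

```python
# Illustration only: assumes a hypothetical 0.99 per-token accuracy and independent errors.
per_token_ok = 0.99
for n in (50, 100, 200):
    print(f"{n}-token answer: P(no slip anywhere) ~ {per_token_ok ** n:.2f}")
# 50 tokens: ~0.61, 100 tokens: ~0.37, 200 tokens: ~0.13
```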
And don’t forget inference-driven issues, Srinivasan continued. “During the inference stage, the actual generation of text, certain inherent limitations become apparent. Techniques intended to enhance diversity and creativity in responses can instead introduce randomness, leading to nonsensical or misleading outputs.”
Additionally, the decoding mechanisms may focus too narrowly on recent inputs or suffer from constraints such as the softmax bottleneck, which can distort the model’s output away from accurate or relevant answers.
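The diversity-versus-accuracy trade-off can be seen in a minimal sketch of temperature-scaled softmax sampling; the candidate tokens and scores below are invented for illustration and do not come from any particular model.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw next-token scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token candidates for "The moon landing took place in ...":
tokens = ["1969", "1970", "1975", "1958"]
logits = [5.0, 2.0, 1.5, 0.5]

for t in (0.5, 1.0, 1.5):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, probs)))
# Raising the temperature flattens the distribution, so sampling is more likely
# to pick a wrong year: more 'creative' output, less factual reliability.
```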
Spotting these errors is a challenge in itself. “Traditional metrics based on word overlap struggle to discern between plausible content and hallucinations, underlining the need for advanced detection techniques,” Srinivasan said.
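To illustrate the point, a crude unigram-overlap score (used here as a simplified stand-in for ROUGE- or BLEU-style metrics; the sentences are invented examples) rates a hallucinated answer almost as highly as a faithful one.

```python
def unigram_f1(reference: str, candidate: str) -> float:
    """Crude word-overlap score, a simplified stand-in for ROUGE/BLEU-style metrics."""
    ref_words, cand_words = set(reference.lower().split()), set(candidate.lower().split())
    overlap = len(ref_words & cand_words)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand_words), overlap / len(ref_words)
    return 2 * precision * recall / (precision + recall)

reference = "The Apollo 11 moon landing took place in 1969"
faithful = "The Apollo 11 moon landing took place in 1969"
hallucinated = "The Apollo 11 moon landing took place in 1975"

print(unigram_f1(reference, faithful))      # 1.0
print(unigram_f1(reference, hallucinated))  # ~0.89, nearly as high despite the wrong year
```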
“Detection benchmarks are crucial for testing how effectively LLMs can detect and mitigate their own hallucinations.”
– Prabhakar Srinivasan
According to the industry veteran, these detection methods can be grouped by whether they target factuality or faithfulness hallucinations.
“Researchers are continually refining methods to assess the reliability of large language models (LLMs), specifically focusing on their ability to produce truthful and accurate responses,” he pointed out, stressing that these assessments fall into two main categories.
Firstly, evaluation benchmarks. “These tools are designed to gauge the propensity of LLMs to hallucinate by measuring their factual accuracy and their ability to adhere to the given context.”
For instance: TruthfulQA presents LLMs with intricately devised questions that aim to reveal their factual inaccuracies, Srinivasan set out, while REALTIMEQA evaluates LLMs using questions based on current events, testing their ability to provide timely and accurate information.
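In practice, an evaluation benchmark of this kind amounts to posing a fixed set of questions and scoring the model’s answers. The sketch below is a hypothetical, simplified harness rather than the official TruthfulQA or REALTIMEQA tooling: the ask_model stub and the JSON question format are placeholders.

```python
import json

def ask_model(question: str) -> str:
    """Placeholder for a call to whichever LLM is being evaluated (hypothetical)."""
    raise NotImplementedError("wire this up to your model's API")

def benchmark_accuracy(path: str) -> float:
    """Score a model on a file of {"question": ..., "correct_answers": [...]} records."""
    with open(path) as f:
        items = json.load(f)
    hits = 0
    for item in items:
        answer = ask_model(item["question"]).lower()
        # Naive check: does the answer mention any accepted reference answer?
        if any(ref.lower() in answer for ref in item["correct_answers"]):
            hits += 1
    return hits / len(items)
```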
Then there are detection benchmarks, he added. “These benchmarks are crucial for testing how effectively LLMs can detect and mitigate their own hallucinations.”
As notable examples, Srinivasan singled out SelfCheckGPT-WikiBio, which scrutinises sentence-level hallucinations in Wikipedia-style passages generated by GPT-3, focusing on the accuracy of the generated content.
Also, HaluEval employs a combination of human evaluation and automated techniques to assess LLMs across various prompts, “helping to identify discrepancies in their outputs,” he said.
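The core intuition behind self-check approaches such as SelfCheckGPT, sampling the model several times and flagging sentences that its own samples fail to support, can be sketched roughly as follows. The overlap-based consistency score and the hard-coded samples are simplified, hypothetical illustrations, not the published method.

```python
def consistency_score(sentence: str, samples: list[str]) -> float:
    """Fraction of resampled answers that share most content words with the sentence.

    A crude proxy for SelfCheckGPT-style scoring: details the model reliably
    'knows' tend to reappear across samples, while hallucinated details tend not to.
    """
    words = {w.strip(".,").lower() for w in sentence.split() if len(w) > 3}
    if not words:
        return 1.0
    supported = 0
    for sample in samples:
        sample_words = {w.strip(".,").lower() for w in sample.split()}
        if len(words & sample_words) / len(words) > 0.5:
            supported += 1
    return supported / len(samples)

# Hypothetical resamples of the same prompt ("Tell me about Neil Armstrong"):
samples = [
    "Neil Armstrong walked on the moon in 1969.",
    "Neil Armstrong was the first person on the moon, landing in 1969.",
    "In 1969 Neil Armstrong became the first man to walk on the moon.",
]
print(consistency_score("Neil Armstrong walked on the moon in 1969.", samples))  # high
print(consistency_score("Neil Armstrong was born in Paris in 1975.", samples))   # low
```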
“FELM specifically tests ChatGPT’s proficiency in discerning factual information across diverse domains, highlighting its capability to detect and correct its own errors.
“Researchers, of course, continue to explore ways to combat hallucinations in LLMs, but it’s not that simple. In the meantime, users can manage LLM outputs more effectively with simple, practical strategies,” Srinivasan shared.
Clear and specific
Craft clear and specific prompts that guide the model towards generating the desired information accurately.
This minimises misunderstandings and reduces the chances of irrelevant or incorrect responses, according to Srinivasan.
“Regularly verify the information provided by the model against trusted sources, especially when using outputs for critical applications,” he suggested.
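By way of illustration, those two tips, specific prompts and verification against trusted sources, might translate into something like the sketch below. The prompt wording, the retail-bank scenario and the trusted-facts lookup are hypothetical examples, not a practice prescribed by Srinivasan.

```python
# Hypothetical examples only: the prompt wording and the 'trusted facts' lookup
# are illustrations of the advice above, not a prescribed implementation.
PROMPT_TEMPLATE = (
    "You are a customer-support assistant for a retail bank.\n"
    "Answer only the question below, in at most three sentences.\n"
    "Base your answer strictly on the policy excerpt provided.\n"
    "If the excerpt does not contain the answer, reply exactly: 'I don't know.'\n\n"
    "Policy excerpt:\n{context}\n\nQuestion: {question}"
)

# A minimal check of a model-supplied figure against a trusted internal source.
TRUSTED_FACTS = {"card_replacement_fee": "R150"}  # hypothetical source of truth

def verify_figure(model_answer: str, fact_key: str) -> bool:
    """Return True only if the trusted figure actually appears in the model's answer."""
    return TRUSTED_FACTS.get(fact_key, "") in model_answer

prompt = PROMPT_TEMPLATE.format(
    context="The standard card replacement fee is R150.",
    question="How much does it cost to replace my card?",
)
print(prompt)
```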
“After all, AI should be used to complement and enhance your human creativity, leveraging its capabilities to expand possibilities and to innovate effectively,” Srinivasan concluded.