LLM hallucination index

Published Date:

January 22, 2025

Last Updated On:

April 10, 2026

Galileo released an LLM Hallucination Index, which makes for very interesting reading. The charts shared cover a Q&A use case, with and without RAG, as well as long-form text generation.

Introduction

Hallucination has become a catch-all phrase for when the model generates responses which are incorrect or fabricated. Being able to measure hallucination is a first step in managing it.
It is very interesting to see what an equaliser RAG is: the disparity in model performance is much lower when RAG is introduced, as opposed to when RAG is absent.
What I like about this approach is the focus on the different tasks LLMs may be used for, ranging from chat to summarisation and more.
This is a practical benchmark, useful for enterprise generative AI teams that need to cater for variability in task types. For instance, a model that works well for chat might not be great at text summarisation.
The study also refers to the power of context, and argues that hallucination benchmarks need to take context into consideration. Retrieval-augmented generation (RAG) has been popularised as an avenue to provide a contextual reference for LLMs at inference time.
Granted, there is nuance with regard to the quality of the context, but measuring variability in LLM performance across RAG and non-RAG tasks is critical.

QnA with RAG

Retrieval-Augmented Generation (RAG) refers to a hybrid approach that combines retrieved information with the generative knowledge of a large language model (LLM). With LLMs such as GPT-3, retrieval-augmented generation typically involves a two-step process:

Retrieval: The model first retrieves relevant information from a pre-existing knowledge base or external database. This retrieval step helps the model gather specific and contextually relevant information related to the given task or query.
Generation: After retrieving relevant information, the model uses this information to generate a coherent and contextually appropriate response. This generation step allows the model to produce novel and contextually rich language based on the retrieved content.
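
The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular RAG system is built: the knowledge base, the word-overlap scoring, the prompt wording and the `echo_llm` stand-in are all assumptions made for the example, and a real setup would use a proper retriever and an actual model call.

```python
from collections import Counter

# Hypothetical toy knowledge base; a real system would query a search index
# or vector store instead.
KNOWLEDGE_BASE = [
    "RAG supplies retrieved context to the model at inference time.",
    "Hallucination is when a model generates incorrect or fabricated text.",
    "Summarisation condenses a long document into a short one.",
]


def tokenize(text):
    """Lowercase and split on whitespace, dropping surrounding punctuation."""
    return [word.strip(".,?!:;").lower() for word in text.split()]


def retrieve(query, docs, k=1):
    """Step 1 (retrieval): rank documents by word overlap with the query."""
    query_counts = Counter(tokenize(query))

    def overlap(doc):
        doc_counts = Counter(tokenize(doc))
        return sum(min(n, doc_counts[word]) for word, n in query_counts.items())

    return sorted(docs, key=overlap, reverse=True)[:k]


def build_prompt(query, passages):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )


def answer(query, llm, docs=KNOWLEDGE_BASE, k=1):
    """Step 2 (generation): hand the query plus retrieved context to the model."""
    return llm(build_prompt(query, retrieve(query, docs, k)))


def echo_llm(prompt):
    """Stand-in for a real model call (assumption): returns the prompt unchanged."""
    return prompt


if __name__ == "__main__":
    print(answer("What is hallucination?", echo_llm))
```

Passing the model in as a plain callable keeps the retrieval step testable on its own, which matters when comparing the same model with and without retrieved context.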

Short context RAG (SCR)

The Short Context RAG seeks to identify the most efficient model for understanding contexts up to 5k tokens. Its primary goal is to detect any loss of information or reasoning capability within these contexts. Similar to referencing select pages in a book, this method is especially suitable for tasks that demand domain-specific knowledge.

Medium context RAG (MCR)

The Medium Context RAG aims to determine the most effective model for comprehending long contexts spanning from 5k to 25k tokens. It focuses on identifying any loss of information and reasoning ability within these extensive contexts. Additionally, the index's authors experiment with a prompting technique known as Chain-of-Note to improve performance, as it has worked for short contexts. This task is akin to doing RAG on a few book chapters.
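
To make the token ranges concrete, here is a small sketch that sorts a context into one of the task buckets by length. The 5k and 25k boundaries come from the descriptions above; treating everything beyond 25k as long context, the inclusive boundaries, and the whitespace-based token estimate are assumptions made for illustration, and a real tokenizer would give different counts.

```python
SHORT_MAX_TOKENS = 5_000    # upper bound for short context, per the text above
MEDIUM_MAX_TOKENS = 25_000  # upper bound for medium context, per the text above


def rough_token_count(text):
    """Crude proxy (assumption): one token per whitespace-separated word."""
    return len(text.split())


def classify_context(num_tokens):
    """Return the task bucket (SCR, MCR or LCR) for a context length in tokens."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if num_tokens <= SHORT_MAX_TOKENS:
        return "SCR"
    if num_tokens <= MEDIUM_MAX_TOKENS:
        return "MCR"
    # Assumption: anything above the medium bound counts as long context.
    return "LCR"


if __name__ == "__main__":
    for length in (1_200, 18_000, 40_000):
        print(length, classify_context(length))
```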

Long context RAG (LCR)

This heatmap shows the model's ability to recall information in different parts of the context. The x-axis represents the length of the context during the experiment, and the y-axis represents the location of the information. Green indicates successful recall, while red indicates failure.
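
A grid like the one described can be assembled as sketched below. This is only an illustration of the layout (columns for context length, rows for where the information sits) and not the procedure used for the index. The `recalled` callback is a hypothetical hook for whatever test decides if the model recovered the information, and `placeholder_recall` is a dummy rule with no connection to real results.

```python
def build_recall_grid(context_lengths, depths, recalled):
    """Rows follow depths (y-axis), columns follow context lengths (x-axis)."""
    return [
        [recalled(length, depth) for length in context_lengths]
        for depth in depths
    ]


def render_grid(grid):
    """Text stand-in for the heatmap: G for recalled (green), R for missed (red)."""
    return "\n".join(
        "".join("G" if ok else "R" for ok in row) for row in grid
    )


def placeholder_recall(length, depth):
    """Dummy rule for demonstration only; replace with a real model check."""
    return length < 2000 or depth == 0.0


if __name__ == "__main__":
    lengths = [1000, 2000]  # context lengths in tokens (illustrative values)
    depths = [0.0, 0.5]     # relative position of the information in the context
    print(render_grid(build_recall_grid(lengths, depths, placeholder_recall)))
```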