LLM Research: LLM Hallucination Index by Galileo
Source 1
Supplementary reading to the research paper
Hallucination Index Results
Additional context
Adding additional context has emerged as a way to improve RAG performance and reduce reliance on vector databases.
What is the impact of context length?
Context length affects the design of a RAG system by influencing:
- retrieval strategies,
- computational resource needs, and
- the balance between precision and breadth.
Comparison of context length features
Hallucination and context evaluation methodologies
ChainPoll with GPT-4o
- Polls the model multiple times using chain-of-thought prompting, which helps to:
- Analyze the model's propensity to generate hallucinations (incorrect or fabricated information).
- Ensure adherence to context.
- This method helps in fine-tuning and benchmarking the model to ensure it performs accurately across various domains.
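A minimal sketch of the polling idea described above: ask a judge model the same chain-of-thought question several times and aggregate the votes. The `judge` callable, the prompt wording, the poll count, and the threshold are all assumptions standing in for the actual GPT-4o-based ChainPoll implementation.

```python
def chainpoll(judge, question, answer, n_polls=5, threshold=0.5):
    """ChainPoll-style hallucination check (simplified sketch).

    `judge` is any callable prompt -> "yes"/"no" (a stand-in for a real
    GPT-4o API call). Returns the fraction of polls that flagged a
    hallucination and whether it crosses the decision threshold.
    """
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Think step by step, then state yes or no: does the answer "
        "contain hallucinated claims?"
    )
    # Repeated sampling smooths out the variance of a single judgment.
    votes = [judge(prompt) == "yes" for _ in range(n_polls)]
    score = sum(votes) / n_polls
    return score, score >= threshold
```

Averaging several chain-of-thought judgments is what makes the score usable for benchmarking: a single sample from a stochastic judge is too noisy to compare models on.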
Needle Chunk
- The "needle" represents the crucial or most relevant piece of information that the model is expected to find and correctly utilize when generating a response.
- The positioning and content of the needle chunk are controlled variables.
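Controlling the needle's position can be sketched as inserting it at a chosen relative depth within filler context. The function and its `depth` parameter (0.0 = start of context, 1.0 = end) are illustrative, not the Index's exact harness.

```python
def build_haystack(filler_chunks, needle, depth):
    """Insert the needle chunk at a controlled relative depth within the
    filler context, as in needle-in-a-haystack evaluations.

    depth: float in [0, 1]; 0.0 places the needle first, 1.0 places it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = round(depth * len(filler_chunks))
    chunks = filler_chunks[:position] + [needle] + filler_chunks[position:]
    return "\n".join(chunks)
```

Sweeping `depth` across the context while holding the needle's content fixed isolates how retrieval quality varies with position in the window.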
Chain-of-X
- Chain-of-Note: The core idea of CoN is to generate sequential reading notes for retrieved documents, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer.
- The Chain-of-Thought approach mirrors human problem-solving methods, where complex issues are broken down into smaller components:
- By doing so, LLMs can tackle each segment of a problem with focused attention, reducing the likelihood of overlooking critical details or making erroneous assumptions.
- This sequential breakdown makes the reasoning process more transparent, allowing for easier identification and correction of any logical missteps.
- Others:
- Chain-of-Explanation
- Chain-of-Knowledge: The model utilizes interconnected pieces of information (knowledge) sequentially to form a coherent understanding or solve a problem. It implies a deeper integration of understanding, where each piece of knowledge informs the next, building a more comprehensive response.
- Chain-of-Verification: Generates an initial response, formulates verification questions, and revises the response based on these questions, reducing factual errors and hallucinations in the response.
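The Chain-of-Verification steps above (draft, question, verify, revise) can be sketched as a small pipeline around any prompt-to-text callable. The `llm` stub and the prompt wording are assumptions; the original method specifies its own prompts and answers verification questions independently of the draft to avoid repeating its errors.

```python
def chain_of_verification(llm, question):
    """Chain-of-Verification pipeline (simplified sketch).

    `llm` is any callable prompt -> text (a stand-in for a real model API).
    Steps: draft an answer, generate verification questions, answer each
    question independently, then revise the draft against that evidence.
    """
    draft = llm(f"Answer the question: {question}")
    vq_text = llm(f"List verification questions to fact-check: {draft}")
    verification_qs = [q for q in vq_text.splitlines() if q.strip()]
    # Answering each question in isolation keeps the checks from
    # inheriting the draft's mistakes.
    evidence = [llm(f"Answer independently: {q}") for q in verification_qs]
    revised = llm(
        "Revise the draft so it is consistent with the evidence.\n"
        f"Draft: {draft}\nEvidence: {evidence}"
    )
    return revised
```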
SelfCheck-BertScore
- BERTScore is a metric that leverages pre-trained BERT (Bidirectional Encoder Representations from Transformers) models to compute a similarity score between the reference text and the generated text.
- While traditional methods rely on exact n-gram matches, BERTScore considers semantic similarity by comparing the contextual embeddings of words in the generated text to those in the reference text.
- SelfCheck-BertScore is a variation of BERTScore: It evaluates not only the similarity between the reference and generated text but also checks for internal consistency within the generated text itself.
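The internal-consistency idea can be illustrated by sampling several responses to the same prompt and averaging their pairwise similarity. Note the similarity function here is a bag-of-words cosine, a deliberately toy stand-in for the BERT contextual-embedding matching that BERTScore actually performs (e.g. via the `bert-score` package); only the sampling-and-comparison structure is the point.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity of bag-of-words vectors (toy stand-in for the
    contextual-embedding similarity that BERTScore computes)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def selfcheck_consistency(samples):
    """Average pairwise similarity across sampled responses.

    Low consistency across samples suggests the model is confabulating:
    factual answers tend to be stable under resampling.
    """
    pairs = list(combinations(samples, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Responses that agree with each other score near 1.0; mutually contradictory samples drive the score toward 0, flagging likely hallucination without needing a gold reference.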
Other scores for comparison
- G-Eval
- Max pseudo-entropy
- GPTScore
- Random Guessing