LLM Research: LLM Hallucination Index by Galileo
Source 1
Supplementary reading to the research paper
Hallucination Index Results
Additional context
Adding additional context has emerged as a way to improve RAG performance and reduce reliance on vector databases.
What is the impact of context length?
Context length affects the design of a RAG system by influencing:
- retrieval strategies,
- computational resource needs, and
- the balance between precision and breadth.
Comparison of context length features
Hallucination and context evaluation methodologies
ChainPoll with GPT-4o
- Polls the model multiple times using chain-of-thought prompting, which helps to:
- Analyze the model's propensity to generate hallucinations (incorrect or fabricated information).
- Ensure adherence to context.
- This method helps in fine-tuning and benchmarking the model to ensure it performs accurately across various domains.
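A minimal sketch of the polling idea described above: ask a judge model the same chain-of-thought question several times and aggregate the votes. The `judge` callable, the prompt wording, the poll count, and the threshold are all assumptions standing in for the actual GPT-4o-based ChainPoll implementation.

```python
def chainpoll(judge, question, answer, n_polls=5, threshold=0.5):
    """ChainPoll-style hallucination check (simplified sketch).

    `judge` is any callable prompt -> "yes"/"no" (a stand-in for a real
    GPT-4o API call). Returns the fraction of polls that flagged a
    hallucination and whether it crosses the decision threshold.
    """
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Think step by step, then state yes or no: does the answer "
        "contain hallucinated claims?"
    )
    # Repeated sampling smooths out the variance of a single judgment.
    votes = [judge(prompt) == "yes" for _ in range(n_polls)]
    score = sum(votes) / n_polls
    return score, score >= threshold
```

Averaging several chain-of-thought judgments is what makes the score usable for benchmarking: a single sample from a stochastic judge is too noisy to compare models on.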
Needle Chunk
- The "needle" represents the crucial or most relevant piece of information that the model is expected to find and correctly utilize when generating a response.
- The positioning and content of the needle chunk are controlled variables.
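Controlling the needle's position can be sketched as inserting it at a chosen relative depth within filler context. The function and its `depth` parameter (0.0 = start of context, 1.0 = end) are illustrative, not the Index's exact harness.

```python
def build_haystack(filler_chunks, needle, depth):
    """Insert the needle chunk at a controlled relative depth within the
    filler context, as in needle-in-a-haystack evaluations.

    depth: float in [0, 1]; 0.0 places the needle first, 1.0 places it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = round(depth * len(filler_chunks))
    chunks = filler_chunks[:position] + [needle] + filler_chunks[position:]
    return "\n".join(chunks)
```

Sweeping `depth` across the context while holding the needle's content fixed isolates how retrieval quality varies with position in the window.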
Chain-of-X
- Chain-of-Note: The core idea of CoN is to generate sequential reading notes for retrieved documents, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer.
- The Chain-of-Thought approach mirrors human problem-solving methods, where complex issues are broken down into smaller components:
- By doing so, LLMs can tackle each segment of a problem with focused attention, reducing the likelihood of overlooking critical details or making erroneous assumptions.
- This sequential breakdown makes the reasoning process more transparent, allowing for easier identification and correction of any logical missteps.
- Others:
- Chain-of-Explanation
- Chain-of-Knowledge: The model utilizes interconnected pieces of information (knowledge) sequentially to form a coherent understanding or solve a problem. It implies a deeper integration of understanding, where each piece of knowledge informs the next, building a more comprehensive response.
- Chain-of-Verification: Generates an initial response, formulates verification questions, and revises the response based on these questions, reducing factual errors and hallucinations in the response.
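The Chain-of-Verification steps above (draft, question, verify, revise) can be sketched as a small pipeline around any prompt-to-text callable. The `llm` stub and the prompt wording are assumptions; the original method specifies its own prompts and answers verification questions independently of the draft to avoid repeating its errors.

```python
def chain_of_verification(llm, question):
    """Chain-of-Verification pipeline (simplified sketch).

    `llm` is any callable prompt -> text (a stand-in for a real model API).
    Steps: draft an answer, generate verification questions, answer each
    question independently, then revise the draft against that evidence.
    """
    draft = llm(f"Answer the question: {question}")
    vq_text = llm(f"List verification questions to fact-check: {draft}")
    verification_qs = [q for q in vq_text.splitlines() if q.strip()]
    # Answering each question in isolation keeps the checks from
    # inheriting the draft's mistakes.
    evidence = [llm(f"Answer independently: {q}") for q in verification_qs]
    revised = llm(
        "Revise the draft so it is consistent with the evidence.\n"
        f"Draft: {draft}\nEvidence: {evidence}"
    )
    return revised
```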
SelfCheck-BertScore
- BERTScore is a metric that leverages pre-trained BERT (Bidirectional Encoder Representations from Transformers) models to compute a similarity score between the reference text and the generated text.
- While traditional methods rely on exact n-gram matches, BERTScore considers semantic similarity by comparing the contextual embeddings of words in the generated text to those in the reference text.
- SelfCheck-BertScore is a variation of BERTScore: It evaluates not only the similarity between the reference and generated text but also checks for internal consistency within the generated text itself.
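The internal-consistency idea can be illustrated by sampling several responses to the same prompt and averaging their pairwise similarity. Note the similarity function here is a bag-of-words cosine, a deliberately toy stand-in for the BERT contextual-embedding matching that BERTScore actually performs (e.g. via the `bert-score` package); only the sampling-and-comparison structure is the point.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity of bag-of-words vectors (toy stand-in for the
    contextual-embedding similarity that BERTScore computes)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def selfcheck_consistency(samples):
    """Average pairwise similarity across sampled responses.

    Low consistency across samples suggests the model is confabulating:
    factual answers tend to be stable under resampling.
    """
    pairs = list(combinations(samples, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Responses that agree with each other score near 1.0; mutually contradictory samples drive the score toward 0, flagging likely hallucination without needing a gold reference.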
Other scores for comparison
- G-Eval
- Max pseudo-entropy
- GPTScore
- Random Guessing