LLM Evaluation

September 1, 2024

In this blog, I dive into the essential evaluation metrics and benchmarks for large language models (LLMs). From multiple-choice questions and generative tasks to translation challenges, I cover the key performance indicators used to assess LLMs.

Additionally, you’ll explore various benchmark suites, including those focused on coding capabilities, language understanding, reasoning, and more. Whether you’re interested in conversational AI or content moderation, this guide offers a comprehensive look into the tools used to measure LLM effectiveness.

Evaluation Metrics

Multiple-Choice Questions

Loglikelihood : Loglikelihood accuracy checks whether the choice to which the model assigns the highest log-probability is among the correct (gold) choices. Loglikelihood Norm additionally normalizes these log-probabilities by the length of each choice to account for varying choice lengths. The final score is the mean of individual scores across multiple examples.

# Formula
best_choice = argmax(choices_logprob)
Score = 1 if best_choice is in gold_ixs, otherwise 0

# Example:
Choices: ["Apple", "Banana", "Citrus"]
Log Probabilities: [-2.0, -1.5, -3.0]
Correct Choice Index: 1 (Banana)
best_choice = argmax([-2.0, -1.5, -3.0]) = 1
Score = 1
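Below is a minimal, runnable version of this scoring in Python (the helper name and the interface for the length-normalized variant are my own; the numbers come from the example above):

import numpy as np

def loglikelihood_acc(choices_logprob, gold_ixs, choice_lengths=None):
    """Return 1 if the best-scoring choice is a gold choice, else 0.
    If choice_lengths is given, log-probabilities are divided by choice length
    (the Loglikelihood Norm variant)."""
    scores = np.array(choices_logprob, dtype=float)
    if choice_lengths is not None:
        scores = scores / np.array(choice_lengths, dtype=float)
    return int(np.argmax(scores) in gold_ixs)

print(loglikelihood_acc([-2.0, -1.5, -3.0], gold_ixs=[1]))  # 1 ("Banana" wins)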

Recall : Recall_at_1 measures the fraction of instances where the choice with the best log-probability was correct. Recall_at_2 measures the fraction of instances where either the best or the second-best log-probability was correct. The final score is the mean of individual scores across multiple examples.

# Recall@1
Score = 1 if argmax(choices_logprob) is in gold_ixs, otherwise 0

# Recall@2
top_2_choices = np.argsort(choices_logprob)[-2:]
Score = 1 if any of the top 2 choices are in gold_ixs, otherwise 0
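A compact sketch of recall@k along the same lines (the function name is mine):

import numpy as np

def recall_at_k(choices_logprob, gold_ixs, k=1):
    """Return 1 if any of the k best choices by log-probability is a gold choice."""
    top_k = np.argsort(choices_logprob)[-k:]  # indices of the k highest log-probabilities
    return int(any(ix in gold_ixs for ix in top_k))

print(recall_at_k([-2.0, -1.5, -3.0], gold_ixs=[0], k=1))  # 0 (best choice is index 1)
print(recall_at_k([-2.0, -1.5, -3.0], gold_ixs=[0], k=2))  # 1 (index 0 is second best)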

Target Perplexity : Perplexity measures how well a probability distribution or model predicts a sample. Lower perplexity indicates better predictive performance. It evaluates the uncertainty in the model’s predictions.

\( \text{Perplexity} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(x_i) \right) \)

where N is the number of scored units in the reference text (tokens, words, or bytes, depending on the normalization), and \(P(x_i)\) represents the probability the model assigns to the i-th unit.

# Example
"""
Sample 1:
Log Probabilities: [-2.0, -2.5, -1.5, -1.0, -2.2]
Reference Text: "This is a test sentence."

Sample 2:
Log Probabilities: [-3.0, -2.5, -2.0]
Reference Text: "Another test."

Sample 3:
Log Probabilities: [-1.0, -1.5, -2.0, -2.5]
Reference Text: "Yet another example."
"""

Combined Log Probabilities = [-2.0, -2.5, -1.5, -1.0, -2.2, -3.0, -2.5, -2.0, -1.0, -1.5, -2.0, -2.5]
Total Number of Scored Units (N) = 5 + 3 + 4 = 12
Perplexity = exp(-1/12 * sum([-2.0, -2.5, -1.5, -1.0, -2.2, -3.0, -2.5, -2.0, -1.0, -1.5, -2.0, -2.5]))
           = exp(-1/12 * (-23.7))
           = exp(1.975)
           ≈ 7.21
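The same calculation in a few lines of Python, using the numbers above:

import math

log_probs = [-2.0, -2.5, -1.5, -1.0, -2.2,  # sample 1
             -3.0, -2.5, -2.0,              # sample 2
             -1.0, -1.5, -2.0, -2.5]        # sample 3

perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(round(perplexity, 2))  # 7.21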
  • Lower Perplexity: Indicates that the model is more certain about its predictions, suggesting better performance.
  • Higher Perplexity: Indicates that the model is less certain about its predictions, suggesting poorer performance.
  • Perplexity measures the model’s confidence in its predictions, which is valuable for understanding its performance in generating probable sequences, even if it does not directly compare predicted outputs to expected outputs.
  • A model can have low perplexity but still make incorrect predictions with high confidence, for example when it assigns a high probability to an incorrect answer and a low probability to the correct one.

Matthews Correlation Coefficient (MCC) : Matthews Correlation Coefficient (MCC) measures the quality of binary classifications. It considers true and false positives and negatives, and is generally regarded as a balanced measure even if the classes are of very different sizes. A high MCC indicates good predictive performance, while a low MCC indicates poor predictive performance.

\( \text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \)

  • The Matthews Correlation Coefficient (MCC) has a range of -1 to 1 where -1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier.
  • F1-score is preferred in cases where, by convention, the rarer or more ‘interesting’ samples are labeled as positive, such as patients who have a rare disease (they test positive). For these problems, the F1 score fulfills its purpose of being a good metric by placing more emphasis on the positive class.
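As a quick sketch, MCC can be computed with scikit-learn's matthews_corrcoef (the labels below are made up for illustration); it can equally be computed by plugging the confusion-matrix counts into the formula above.

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]  # gold labels (illustrative)
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]  # model predictions (illustrative)

# TP=3, TN=3, FP=1, FN=1  ->  MCC = (3*3 - 1*1) / sqrt(4*4*4*4) = 0.5
print(matthews_corrcoef(y_true, y_pred))  # 0.5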

Multi-class F1 Score : Multi-class F1 Score computes the F1 score for each possible choice class and averages it. It can be calculated using different averaging methods: macro, micro, or weighted.
A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric. In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance (i.e., you may have many more examples of one class than of the others).
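For example, with scikit-learn the three averaging modes differ only in the average argument (labels are made up for illustration):

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 2]  # gold classes (illustrative)
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]  # predicted classes (illustrative)

for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg))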

Generative Tasks

These metrics provide a comprehensive evaluation of generated summaries across different dimensions of accuracy, coverage, and similarity.

  • ROUGE-1: Measures 1-gram overlap.

  • ROUGE-2: Measures 2-gram overlap.

  • ROUGE-L: Measures longest common subsequence (LCS) overlap.

  • ROUGE-Lsum: Measures LCS for the entire summary.

  • ROUGE_t5: Corpus level ROUGE for all metrics.

  • Faithfulness: Based on SummaC method.

  • BERTScore: Measures similarity using BERT embeddings.

  • Extractiveness: Proportion of the summary from the source.

    • Coverage: Extent to which the summary covers the source content.
    • Density: Average length of extracted fragments.
    • Compression: Compression ratio relative to the source.

ROUGE Score : Recall-Oriented Understudy for Gisting Evaluation
Precision: (No. of matching n-grams) / (No. of n-grams in the candidate)
Recall: (No. of matching n-grams) / (No. of n-grams in the reference)
ROUGE (F1) = (2 * precision * recall) / (precision + recall)

# ROUGE-1
Reference: "The cat sat on the mat."
Candidate: "The cat is on the mat."

Common 1-grams: "The", "cat", "on", "the", "mat" (5/6)
Precision: 5/6 = 0.833
Recall: 5/6 = 0.833
F1 = 2 * (0.833 * 0.833) / (0.833 + 0.833) = 0.833

# ROUGE-2
Common 2-grams: "The cat", "on the", "the mat" (3/5)
Precision: 3/5 = 0.6
Recall: 3/5 = 0.6
F1 = 2 * (0.6 * 0.6) / (0.6 + 0.6) = 0.6

# ROUGE-L
Longest Common Subsequence (LCS) = "The cat on the mat" (length 5)
Precision: 5/6 = 0.833
Recall: 5/6 = 0.833
F1 = 2 * (0.833 * 0.833) / (0.833 + 0.833) = 0.833

# ROUGE-Lsum: Similar to ROUGE-L but considers the entire summary as a sequence.
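A minimal ROUGE-N implementation following the formulas above (simple whitespace tokenization and a helper name of my own; packaged implementations such as rouge-score also add stemming and ROUGE-L support):

from collections import Counter

def rouge_n(reference, candidate, n=1):
    """F1 over clipped n-gram overlap between reference and candidate."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())  # matching n-grams (clipped counts)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

ref, cand = "The cat sat on the mat.", "The cat is on the mat."
print(round(rouge_n(ref, cand, 1), 3))  # 0.833
print(round(rouge_n(ref, cand, 2), 3))  # 0.6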

Faithfulness : Measures how accurately the generated summary reflects the content of the source document. SummaC (Summarization Consistency) uses zero-shot learning with transformer models to assess if sentences in the generated summary are entailed by, contradict, or are neutral with respect to the source text.

\( \text{Faithfulness Score} = \frac{\text{Number of Entailed Sentences}}{\text{Total Number of Sentences in the Summary}} \)

Entailed Sentence: If the NLI model determines that the summary sentence is supported by the source text, it is considered “entailed.”
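A sketch of the scoring step, assuming an NLI model is available wrapped as a hypothetical nli_label callable that returns "entailment", "neutral", or "contradiction" for a (premise, hypothesis) pair:

def faithfulness_score(source, summary_sentences, nli_label):
    """Fraction of summary sentences the NLI model labels as entailed by the source."""
    labels = [nli_label(premise=source, hypothesis=sent) for sent in summary_sentences]
    return sum(lab == "entailment" for lab in labels) / len(summary_sentences)

# Usage sketch (nli_label is a placeholder for an actual NLI model):
# score = faithfulness_score(source_doc, summary_sentences, nli_label)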

BERTScore : Measures the similarity between the generated summary and the reference summary using contextual BERT embeddings; token embeddings are matched by cosine similarity and aggregated into precision, recall, and F1.
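In practice this can be computed with the bert-score package (a sketch; the first call downloads a model checkpoint):

from bert_score import score

cands = ["The cat sat on the mat."]
refs = ["A cat was sitting on the mat."]

P, R, F1 = score(cands, refs, lang="en")  # tensors with one value per candidate/reference pair
print(float(F1[0]))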

Extractiveness : Measures how much of the generated summary consists of exact fragments from the source document. It includes the metrics Summarization Coverage, Summarization Density, and Summarization Compression.

# Example
Source: "The cat sat on the mat. It looked very happy."
Summary: "The cat sat on the mat."

# Coverage : Measures how well the summary covers the content of the source document.
Coverage = (Number of matched words in the covered content) / (Total number of words in the source)
Coverage = 6/10 = 0.6

# Density : Measures the average length of the extracted fragments in the summary.
Density = (Number of words in extracted fragments) / (Total number of words in the summary)
Density = 6/6 = 1.0

# Compression
Compression = (Total number of words in the source) / (Total number of words in the summary)
Compression = 10/6 ≈ 1.67
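A rough sketch of these three ratios using the simple word-count definitions from the example above (real extractiveness metrics are computed over extractive fragments rather than plain word counts):

def extractiveness(source, summary):
    """Word-count based sketch of coverage, density, and compression."""
    src_tokens = source.lower().split()
    sum_tokens = summary.lower().split()
    matched = sum(1 for tok in sum_tokens if tok in src_tokens)  # summary words found in the source

    return {
        "coverage": matched / len(src_tokens),
        "density": matched / len(sum_tokens),
        "compression": len(src_tokens) / len(sum_tokens),
    }

print(extractiveness("The cat sat on the mat. It looked very happy.",
                     "The cat sat on the mat."))
# {'coverage': 0.6, 'density': 1.0, 'compression': 1.67 (approx.)}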

Translation Tasks

BLEU : Bilingual Evaluation Understudy (BLEU) score is a metric used to evaluate the quality of machine-generated text, such as translations or summaries, by comparing it against one or more reference texts. It is based on n-gram precision and includes a penalty for shorter sentences to ensure the generated text is both precise and complete.

\( \text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \)

here,
\(w_n\) = weights (often uniform, such as \(w_n\) = 1/N)
\(p_n\) = n-gram precisions for n-grams of length n
BP = Brevity Penalty : a penalty if the candidate sentence is shorter than the reference sentences to discourage under-generation.
The brevity penalty in BLEU is used to prevent very short translations from receiving a high score simply because they include words that are highly frequent in the reference translations. The penalty ensures that translations are not only precise but also adequately long to cover the content of the reference.
BP = 1 if c > r, else BP = \( \exp\left(1 - \frac{r}{c}\right) \)
c is the total length of the candidate corpus and r is the total length of the reference corpus.

# Reference Sentences:
Ref 1: "The cat sat on the mat."
Ref 2: "The quick brown fox jumps over the lazy dog."
Ref 3: "A stitch in time saves nine."

# Candidate Sentences:
Cand 1: "The cat is on the mat."
Cand 2: "The quick brown fox jumped over the lazy dog."
Cand 3: "A stitch in time saves ten."

Let p_1, p_2, ..., p_4 be the n-gram precisions.

BLEU_1 = BP * exp((1/1) * log(p_1)) = BP * p_1
.
.
BLEU_4 = BP * exp((1/4) * (log(p_1) + log(p_2) + log(p_3) + log(p_4)))
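Corpus-level BLEU for these three pairs can be computed with NLTK (a sketch; smoothing avoids zero n-gram counts on such short sentences):

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["the", "cat", "sat", "on", "the", "mat"]],
    [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]],
    [["a", "stitch", "in", "time", "saves", "nine"]],
]
candidates = [
    ["the", "cat", "is", "on", "the", "mat"],
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    ["a", "stitch", "in", "time", "saves", "ten"],
]

smooth = SmoothingFunction().method1
print(corpus_bleu(references, candidates, smoothing_function=smooth))  # BLEU up to 4-grams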

CHRF : Character n-gram F-score (CHRF) measures the character-level n-gram matches between the reference and candidate texts. It combines precision and recall using the F-score.

TER : Translation Edit Rate (TER) measures the number of edits needed to change the candidate text into the reference text. Edits include insertions, deletions, and substitutions.

\( \text{TER} = \frac{\text{Number of Edits}}{\text{Number of Words in Reference}} \)
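A word-level sketch of TER using plain edit distance (real TER additionally allows phrase shifts, which this ignores):

def ter(candidate, reference):
    """(insertions + deletions + substitutions) / number of words in the reference."""
    c, r = candidate.split(), reference.split()
    # Classic dynamic-programming edit distance over words.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(c)][len(r)] / len(r)

print(ter("The cat is on the mat.", "The cat sat on the mat."))  # 1 edit / 6 words ≈ 0.167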

BLEURT : BERT-based Learned Evaluation Metric (BLEURT) is a learned metric based on BERT that evaluates the quality of text generation tasks by comparing candidate texts to reference texts.

LLM Benchmarks

Coding Capabilities

CodeXGLUE

  • Description: Evaluates LLMs’ ability to understand and work with code across various tasks like code completion and translation.
  • Purpose: To assess code intelligence, including understanding, fixing, and explaining code.
  • Relevance: Essential for applications in software development, code analysis, and technical documentation.

HumanEval

  • Description: Contains programming challenges for evaluating LLMs’ ability to write functional code based on instructions.
  • Purpose: To test the generation of correct and efficient code from given requirements.
  • Relevance: Important for automated code generation tools, programming assistants, and coding education platforms.

Mostly Basic Python Programming (MBPP)

  • Description: Includes around 1,000 Python programming problems suitable for entry-level programmers.
  • Purpose: To evaluate proficiency in solving basic programming tasks and understanding of Python.
  • Relevance: Useful for beginner-level coding education, automated code generation, and entry-level programming testing.

Knowledge and Language Understanding

Massive Multitask Language Understanding (MMLU)

  • Description: Measures general knowledge across 57 different subjects, ranging from STEM to social sciences.
  • Purpose: To assess the LLM’s understanding and reasoning in a wide range of subject areas.
  • Relevance: Ideal for multifaceted AI systems that require extensive world knowledge and problem-solving ability.

AI2 Reasoning Challenge (ARC)

  • Description: Tests LLMs on grade-school science questions, requiring both deep general knowledge and reasoning abilities.
  • Purpose: To evaluate the ability to answer complex science questions that require logical reasoning.
  • Relevance: Useful for educational AI applications, automated tutoring systems, and general knowledge assessments.

General Language Understanding Evaluation (GLUE)

  • Description: A collection of various language tasks from multiple datasets, designed to measure overall language understanding.
  • Purpose: To provide a comprehensive assessment of language understanding abilities in different contexts.
  • Relevance: Crucial for applications requiring advanced language processing, such as chatbots and content analysis.

Natural Questions

  • Description: A collection of real-world questions people have Googled, paired with relevant Wikipedia pages to extract answers.
  • Purpose: To test the ability to find accurate short and long answers from web-based sources.
  • Relevance: Essential for search engines, information retrieval systems, and AI-driven question-answering tools.

Language Modelling Broadened to Account for Discourse Aspects (LAMBADA)

  • Description: A collection of passages testing the ability of language models to understand and predict text based on long-range context.
  • Purpose: To assess the models’ comprehension of narratives and their predictive abilities in text generation.
  • Relevance: Important for AI applications in narrative analysis, content creation, and long-form text understanding.

HellaSwag

  • Description: Tests natural language inference by requiring LLMs to complete passages in a way that requires understanding intricate details.
  • Purpose: To evaluate the model’s ability to generate contextually appropriate text continuations.
  • Relevance: Useful in content creation, dialogue systems, and applications requiring advanced text generation capabilities.

Multi-Genre Natural Language Inference (MultiNLI)

  • Description: A benchmark consisting of 433K sentence pairs across various genres of English data, testing natural language inference.
  • Purpose: To assess the ability of LLMs to assign correct labels to hypothesis statements based on premises.
  • Relevance: Vital for systems requiring advanced text comprehension and inference, such as automated reasoning and text analytics tools.

TriviaQA

  • Description: A reading comprehension test with questions from sources like Wikipedia, demanding contextual analysis.
  • Purpose: To assess the ability to sift through context and find accurate answers in complex texts.
  • Relevance: Suitable for AI systems in knowledge extraction, research, and detailed content analysis.

WinoGrande

  • Description: A large set of problems based on the Winograd Schema Challenge, testing context understanding in sentences.
  • Purpose: To evaluate the ability of LLMs to grasp nuanced context and subtle variations in text.
  • Relevance: Crucial for models dealing with narrative analysis, content personalization, and advanced text interpretation.

SciQ

  • Description: Consists of multiple-choice questions mainly in natural sciences like physics, chemistry, and biology.
  • Purpose: To test the ability to answer science-based questions, often with additional supporting text.
  • Relevance: Useful for educational tools, especially in science education and knowledge testing platforms.

SuperGLUE

  • Description: An advanced version of the GLUE benchmark, comprising more challenging and diverse language tasks.
  • Purpose: To evaluate deeper aspects of language understanding and reasoning.
  • Relevance: Important for sophisticated AI systems requiring advanced language processing capabilities.

Reasoning Capabilities

GSM8K

  • Description: A set of 8.5K grade-school math problems that require basic to intermediate math operations.
  • Purpose: To test LLMs’ ability to work through multistep math problems.
  • Relevance: Useful for assessing AI’s capability in solving basic mathematical problems, valuable in educational contexts.

Discrete Reasoning Over Paragraphs (DROP)

  • Description: An adversarially-created reading comprehension benchmark requiring models to navigate through references and execute operations like addition or sorting.
  • Purpose: To evaluate the ability of models to understand complex texts and perform discrete operations.
  • Relevance: Useful in advanced educational tools and text analysis systems requiring logical reasoning.

Counterfactual Reasoning Assessment (CRASS)

  • Description: Evaluates counterfactual reasoning abilities of LLMs, focusing on “what if” scenarios.
  • Purpose: To assess models’ ability to understand and reason about alternate scenarios based on given data.
  • Relevance: Important for AI applications in strategic planning, decision-making, and scenario analysis.

Large-scale ReAding Comprehension Dataset From Examinations (RACE)

  • Description: A set of reading comprehension questions derived from English exams given to Chinese students.
  • Purpose: To test LLMs’ understanding of complex reading material and their ability to answer examination-level questions.
  • Relevance: Useful in language learning applications and educational systems for exam preparation.

Big-Bench Hard (BBH)

  • Description: A subset of BIG-Bench focusing on the most challenging tasks requiring multi-step reasoning.
  • Purpose: To challenge LLMs with complex tasks demanding advanced reasoning skills.
  • Relevance: Important for evaluating the upper limits of AI capabilities in complex reasoning and problem-solving.

AGIEval

  • Description: A collection of standardized tests, including GRE, GMAT, SAT, LSAT, and civil service exams.
  • Purpose: To evaluate LLMs’ reasoning abilities and problem-solving skills across various academic and professional scenarios.
  • Relevance: Useful for assessing AI capabilities in standardized testing and professional qualification contexts.

BoolQ

  • Description: A collection of over 15,000 real yes/no questions from Google searches, paired with Wikipedia passages.
  • Purpose: To test the ability of LLMs to infer correct answers from contextual information that may not be explicit.
  • Relevance: Crucial for question-answering systems and knowledge-based AI applications where accurate inference is key.

Multi-Turn Open-Ended Conversations

MT-bench

  • Description: Tailored for evaluating the proficiency of chat assistants in sustaining multi-turn conversations.
  • Purpose: To test the ability of models to engage in coherent and contextually relevant dialogues over multiple turns.
  • Relevance: Essential for developing sophisticated conversational agents and chatbots.

Question Answering in Context (QuAC)

  • Description: Features 14,000 dialogues with 100,000 question-answer pairs, simulating student-teacher interactions.
  • Purpose: To challenge LLMs with context-dependent, sometimes unanswerable questions within dialogues.
  • Relevance: Useful for conversational AI, educational software, and context-aware information systems.

Grounding and Abstractive Summarisation

Ambient Clinical Intelligence Benchmark (ACI-BENCH)

  • Description: Contains full doctor-patient conversations and associated clinical notes from various medical domains.
  • Purpose: To challenge models to accurately generate clinical notes based on conversational data.
  • Relevance: Vital for AI applications in healthcare, especially in automated documentation and medical analysis.

Microsoft Machine Reading Comprehension (MS MARCO)

  • Description: A large-scale collection of natural language questions and answers derived from real web queries.
  • Purpose: To test the ability of models to accurately understand and respond to real-world queries.
  • Relevance: Crucial for search engines, question-answering systems, and other consumer-facing AI applications.

Query-based Multi-domain Meeting Summarisation (QMSum)

  • Description: A benchmark for summarizing relevant spans of meetings in response to specific queries.
  • Purpose: To evaluate the ability of models to extract and summarize important information from meeting content.
  • Relevance: Useful for business intelligence tools, meeting analysis applications, and automated summarization systems.

Physical Interaction: Question Answering (PIQA)

  • Description: Tests knowledge and understanding of the physical world through hypothetical scenarios and solutions.
  • Purpose: To measure the model’s capability in handling physical interaction scenarios.
  • Relevance: Important for AI applications in robotics, physical simulations, and practical problem-solving systems.

Content Moderation and Narrative Control

ToxiGen

  • Description: A dataset of toxic and benign statements about minority groups, focusing on implicit hate speech.
  • Purpose: To test a model’s ability to both identify and avoid generating toxic content.
  • Relevance: Crucial for content moderation systems, community management, and AI ethics research.

Helpfulness, Honesty, Harmlessness (HHH)

  • Description: Evaluates language models’ alignment with ethical standards such as helpfulness, honesty, and harmlessness.
  • Purpose: To assess the ethical responses of models in interaction scenarios.
  • Relevance: Vital for ensuring AI systems promote positive interactions and adhere to ethical standards.

TruthfulQA

  • Description: A benchmark for evaluating the truthfulness of LLMs in generating answers to questions prone to false beliefs and biases.
  • Purpose: To test the ability of models to provide accurate and unbiased information.
  • Relevance: Important for AI systems where delivering accurate and unbiased information is critical, such as in educational or advisory roles.

Responsible AI (RAI)

  • Description: A framework for evaluating the safety of chat-optimized models in conversational settings.
  • Purpose: To assess potential harmful content, IP leakage, and security breaches in AI-driven conversations.
  • Relevance: Crucial for developing safe and secure conversational AI applications, particularly in sensitive domains.