Paper | Code | EM | F1 | Model | Date |
--- | --- | --- | --- | --- | --- |
Model Card and Evaluations for Claude Models | | 87.5 | | Claude 2 (few-shot, k=5) | 2023-07-11 |
| | | 87.0 | | GPT-4-0613 | |
Model Card and Evaluations for Claude Models | | 86.7 | | Claude 1.3 (few-shot, k=5) | 2023-07-11 |
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | | 86.5 | | RankRAG-llama3-70b (Zero-Shot, KILT) | 2024-07-02 |
PaLM 2 Technical Report | ✓ Link | 86.1 | | PaLM 2-L (one-shot) | 2023-05-17 |
ChatQA: Surpassing GPT-4 on Conversational QA and RAG | | 85.6 | | ChatQA-1.5-llama3-70b (Zero-Shot, KILT) | 2024-01-18 |
Llama 2: Open Foundation and Fine-Tuned Chat Models | ✓ Link | 85.0 | | Llama 2 70B (one-shot) | 2023-07-18 |
GPT-4 Technical Report | ✓ Link | 84.8 | | GPT-4-0613 (Zero-shot) | 2023-03-15 |
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | | 82.9 | | RankRAG-llama3-8b (Zero-Shot, KILT) | 2024-07-02 |
PaLM 2 Technical Report | ✓ Link | 81.7 | | PaLM 2-M (one-shot) | 2023-05-17 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 81.4 | | PaLM-540B (Few-Shot) | 2022-04-05 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 81.4 | | PaLM-540B (One-Shot) | 2022-04-05 |
ChatQA: Surpassing GPT-4 on Conversational QA and RAG | | 81.0 | | ChatQA-1.5-llama3-8b (Zero-Shot, KILT) | 2024-01-18 |
Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | ✓ Link | 79.29 | | GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 2024-06-18 |
Model Card and Evaluations for Claude Models | | 78.9 | | Claude Instant 1.1 (few-shot, k=5) | 2023-07-11 |
REPLUG: Retrieval-Augmented Black-Box Language Models | ✓ Link | 77.3 | | code-davinci-002 175B + REPLUG LSR (Few-Shot) | 2023-01-30 |
PaLM: Scaling Language Modeling with Pathways | ✓ Link | 76.9 | | PaLM-540B (Zero-Shot) | 2022-04-05 |
REPLUG: Retrieval-Augmented Black-Box Language Models | ✓ Link | 76.8 | | code-davinci-002 175B + REPLUG (Few-Shot) | 2023-01-30 |
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | | 75.8 | | GLaM 62B/64E (One-shot) | 2021-12-13 |
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | | 75.8 | | GLaM 62B/64E (Few-shot) | 2021-12-13 |
RA-DIT: Retrieval-Augmented Dual Instruction Tuning | | 75.4 | | RA-DIT (Zero-Shot) | 2023-10-02 |
PaLM 2 Technical Report | ✓ Link | 75.2 | | PaLM 2-S (one-shot) | 2023-05-17 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 73.0 | | LLaMA 65B (few-shot, k=64) | 2023-02-27 |
FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering | | 72.6 | | FiE+PAQ | 2022-11-18 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 72.6 | | LLaMA 65B (few-shot, k=5) | 2023-02-27 |
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | | 72.6 | | RankRAG-llama3-70b (Zero-Shot, DPR) | 2024-07-02 |
Distilling Knowledge from Reader to Retriever for Question Answering | ✓ Link | 72.1 | | FiD+Distil | 2020-12-08 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 71.6 | | LLaMA 65B (one-shot) | 2023-02-27 |
End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering | ✓ Link | 71.4 | | EMDR2 | 2021-06-09 |
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | | 71.3 | | GLaM 62B/64E (Zero-shot) | 2021-12-13 |
Language Models are Few-Shot Learners | ✓ Link | 71.2 | | GPT-3 175B (Few-Shot) | 2020-05-28 |
Mistral 7B | ✓ Link | 69.9 | | Mistral 7B (5-shot) | 2023-10-10 |
ChatQA: Surpassing GPT-4 on Conversational QA and RAG | | 69.0 | | ChatQA-1.5-llama3-70b (Zero-Shot, DPR) | 2024-01-18 |
LLaMA: Open and Efficient Foundation Language Models | ✓ Link | 68.2 | | LLaMA 65B (zero-shot) | 2023-02-27 |
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering | ✓ Link | 67.6 | | Fusion-in-Decoder (large) | 2020-07-02 |
MemoReader: Large-Scale Reading Comprehension through Neural Memory Controller | | 67.21 | 73.26 | MemoReader | 2018-10-01 |
Simple and Effective Multi-Paragraph Reading Comprehension | ✓ Link | 66.37 | 71.32 | S-Norm | 2017-10-29 |
Mention Memory: incorporating textual knowledge into Transformers through entity mention attention | ✓ Link | 65.8 | | TOME-2 | 2021-10-12 |
SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments | | 58.2 | | Shakti-LLM (2.5B) | 2024-10-15 |
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | ✓ Link | 57.1 | | Branch-Train-MiX 4x7B (sampling top-2 experts) | 2024-03-12 |
Dense Passage Retrieval for Open-Domain Question Answering | ✓ Link | 56.8 | | DPR | 2020-04-10 |
Finetuned Language Models Are Zero-Shot Learners | ✓ Link | 56.7 | | FLAN 137B (zero-shot) | 2021-09-03 |
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | ✓ Link | 56.1 | | RAG | 2020-05-22 |
Dynamic Integration of Background Knowledge in Neural NLU Systems | | 50.56 | 56.73 | Reading Twice for NLU | 2017-06-08 |
Reinforced Mnemonic Reader for Machine Reading Comprehension | ✓ Link | 46.94 | 52.85 | Mnemonic Reader | 2017-05-08 |
Latent Retrieval for Weakly Supervised Open Domain Question Answering | ✓ Link | 45.0 | | ORQA | 2019-06-01 |
MEMEN: Multi-layer Embedding with Memory Networks for Machine Comprehension | | 43.16 | 46.90 | MEMEN | 2017-07-28 |
SpanBERT: Improving Pre-training by Representing and Predicting Spans | ✓ Link | | 83.6 | SpanBERT | 2019-07-24 |
Big Bird: Transformers for Longer Sequences | ✓ Link | | 80.9 | BigBird-etc | 2020-07-28 |
Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation | ✓ Link | | 80.1 | DPA-RAG | 2024-06-26 |
LinkBERT: Pretraining Language Models with Document Links | ✓ Link | | 78.2 | LinkBERT (large) | 2022-03-29 |
DyREx: Dynamic Query Representation for Extractive Question Answering | ✓ Link | | 77.37 | DyREx | 2022-10-26 |
Search-o1: Agentic Search-Enhanced Large Reasoning Models | ✓ Link | | 74.1 | Search-o1 | 2025-01-09 |
UnitedQA: A Hybrid Approach for Open Domain Question Answering | | | 70.3 | UnitedQA (Hybrid reader) | 2021-01-01 |
ReasonBERT: Pre-trained to Reason with Distant Supervision | ✓ Link | | 45.5 | ReasonBERT (RoBERTa) | 2021-09-10 |
ReasonBERT: Pre-trained to Reason with Distant Supervision | ✓ Link | | 37.2 | ReasonBERT (BERT) | 2021-09-10 |
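The EM and F1 columns above follow the usual extractive-QA scoring convention: both prediction and gold answers are normalized, each prediction is scored against its best-matching gold alias, and per-example scores are averaged and reported as percentages. The sketch below illustrates that convention in Python; it is not any leaderboard's official scorer, and `normalize_answer` mirrors the widely used SQuAD-style normalization (lowercase, strip punctuation and articles, collapse whitespace), which most of the papers listed here adopt.

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    drop articles (a/an/the), and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> bool:
    """EM: normalized strings must be identical."""
    return normalize_answer(prediction) == normalize_answer(gold)

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 over the normalized answers."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    """A question usually has several acceptable aliases;
    score the prediction against the best-matching one."""
    return max(metric_fn(prediction, gt) for gt in ground_truths)
```

A minimal usage example, with hypothetical data, showing how the table's percentage scores are produced:

```python
preds = {"q1": "the Eiffel Tower"}
golds = {"q1": ["Eiffel Tower", "La tour Eiffel"]}
em = 100.0 * sum(
    metric_max_over_ground_truths(exact_match, preds[q], golds[q]) for q in preds
) / len(preds)
f1 = 100.0 * sum(
    metric_max_over_ground_truths(f1_score, preds[q], golds[q]) for q in preds
) / len(preds)
print(f"EM: {em:.2f}  F1: {f1:.2f}")  # EM: 100.00  F1: 100.00
```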