question-answering-on-truthfulqa

Question Answering

Leaderboard

Paper	Code	MC1	MC2	% true	% info	% true (GPT-judge)	BLEURT	ROUGE	BLEU	EM	Accuracy	Model Name	Release Date
GPT-4 Technical Report	✓ Link	0.59										GPT-4 (RLHF)	2023-03-15
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space	✓ Link	0.56	0.75									Mistral-7B-Instruct-v0.2 + TruthX	2024-02-27
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space	✓ Link	0.54	0.74									LLaMA-2-7B-Chat + TruthX	2024-02-27
Representation Engineering: A Top-Down Approach to AI Transparency	✓ Link	0.54										LLaMA-2-Chat-13B + Representation Control (Contrast Vector)	2023-10-02
Representation Engineering: A Top-Down Approach to AI Transparency	✓ Link	0.48										LLaMA-2-Chat-7B + Representation Control (Contrast Vector)	2023-10-02
		0.389		88.6	83.5							Vicuna 7B + Inference Time Intervention (ITI)
		0.319		66.6	97.7							Alpaca 7B + Inference Time Intervention (ITI)
Scaling Language Models: Methods, Analysis & Insights from Training Gopher	✓ Link	0.295										Gopher 280B (zero-shot, Our Prompt + Choices)	2021-12-08
		0.288		45.1	93.8							LLaMA 7B + Inference Time Intervention (ITI)
Galactica: A Large Language Model for Science	✓ Link	0.26										GAL 120B	2022-11-16
Scaling Language Models: Methods, Analysis & Insights from Training Gopher	✓ Link	0.25										Gopher 7.1B (zero-shot, QA prompts)	2021-12-08
Galactica: A Large Language Model for Science	✓ Link	0.24										GAL 30B	2022-11-16
Scaling Language Models: Methods, Analysis & Insights from Training Gopher	✓ Link	0.23										Gopher 7.1B (zero-shot, Our Prompt + Choices)	2021-12-08
Scaling Language Models: Methods, Analysis & Insights from Training Gopher	✓ Link	0.23										Gopher 1.4B (zero-shot, QA prompts)	2021-12-08
TruthfulQA: Measuring How Models Mimic Human Falsehoods	✓ Link	0.22	0.39	29.50	89.84	29.87	-0.25	-9.41	-4.91			GPT-2 1.5B	2021-09-08
Scaling Language Models: Methods, Analysis & Insights from Training Gopher	✓ Link	0.217										Gopher 1.4B (zero-shot, Our Prompt + Choices)	2021-12-08
TruthfulQA: Measuring How Models Mimic Human Falsehoods	✓ Link	0.21	0.33	20.44	97.55	20.56	-0.56	-17.75	-17.38			GPT-3 175B	2021-09-08
Galactica: A Large Language Model for Science	✓ Link	0.21										OPT 175B	2022-11-16
TruthfulQA: Measuring How Models Mimic Human Falsehoods	✓ Link	0.20	0.36	26.68	89.96	27.17	-0.31	-11.35	-7.58			GPT-J 6B	2021-09-08
TruthfulQA: Measuring How Models Mimic Human Falsehoods	✓ Link	0.19	0.35	53.86	64.50	53.24	0.08	1.76	-0.16			UnifiedQA 3B	2021-09-08
Galactica: A Large Language Model for Science	✓ Link	0.19										GAL 125M	2022-11-16
Galactica: A Large Language Model for Science	✓ Link	0.19										GAL 1.3B	2022-11-16
Galactica: A Large Language Model for Science	✓ Link	0.19										GAL 6.7B	2022-11-16
Scaling Language Models: Methods, Analysis & Insights from Training Gopher	✓ Link	0.27										Gopher 280B (zero-shot, QA prompts)	2021-12-08
LLaMA: Open and Efficient Foundation Language Models	✓ Link			57	53							LLaMA 65B	2023-02-27
LLaMA: Open and Efficient Foundation Language Models	✓ Link			52	48							LLaMA 33B	2023-02-27
LLaMA: Open and Efficient Foundation Language Models	✓ Link			47	41							LLaMA 13B	2023-02-27
LLaMA: Open and Efficient Foundation Language Models	✓ Link			33	29							LLaMA 7B	2023-02-27
Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models	✓ Link									67.3		CoA	2024-03-26
Tree of Thoughts: Deliberate Problem Solving with Large Language Models	✓ Link									66.6		ToT	2023-05-17
Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models	✓ Link									63.3		CoA w/o actions	2024-03-26
Automatic Chain of Thought Prompting in Large Language Models	✓ Link									42.2		Auto-CoT	2022-10-07
SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments											68.4	Shakti-LLM (2.5B)	2024-10-15

OpenCodePapers
