OpenCodePapers

question-answering-on-truthfulqa

Question Answering
Leaderboard
| Paper | Code | MC1 | MC2 | % true | % info | % true (GPT-judge) | BLEURT | ROUGE | BLEU | EM | Accuracy | Model | Release date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4 Technical Report | ✓ | 0.59 | | | | | | | | | | GPT-4 (RLHF) | 2023-03-15 |
| TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space | ✓ | 0.56 | 0.75 | | | | | | | | | Mistral-7B-Instruct-v0.2 + TruthX | 2024-02-27 |
| TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space | ✓ | 0.54 | 0.74 | | | | | | | | | LLaMa-2-7B-Chat + TruthX | 2024-02-27 |
| Representation Engineering: A Top-Down Approach to AI Transparency | ✓ | 0.54 | | | | | | | | | | LLaMA-2-Chat-13B + Representation Control (Contrast Vector) | 2023-10-02 |
| Representation Engineering: A Top-Down Approach to AI Transparency | ✓ | 0.48 | | | | | | | | | | LLaMA-2-Chat-7B + Representation Control (Contrast Vector) | 2023-10-02 |
| | | 0.389 | | 88.6 | 83.5 | | | | | | | Vicuna 7B + Inference Time Intervention (ITI) | |
| | | 0.319 | | 66.6 | 97.7 | | | | | | | Alpaca 7B + Inference Time Intervention (ITI) | |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ✓ | 0.295 | | | | | | | | | | Gopher 280B (zero-shot, Our Prompt + Choices) | 2021-12-08 |
| | | 0.288 | | 45.1 | 93.8 | | | | | | | LLaMA 7B + Inference Time Intervention (ITI) | |
| Galactica: A Large Language Model for Science | ✓ | 0.26 | | | | | | | | | | GAL 120B | 2022-11-16 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ✓ | 0.25 | | | | | | | | | | Gopher 7.1 (zero-shot, QA prompts) | 2021-12-08 |
| Galactica: A Large Language Model for Science | ✓ | 0.24 | | | | | | | | | | GAL 30B | 2022-11-16 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ✓ | 0.23 | | | | | | | | | | Gopher 7.1B (zero-shot, Our Prompt + Choices) | 2021-12-08 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ✓ | 0.23 | | | | | | | | | | Gopher 1.4 (zero-shot, QA prompts) | 2021-12-08 |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | ✓ | 0.22 | 0.39 | 29.50 | 89.84 | 29.87 | -0.25 | -9.41 | -4.91 | | | GPT-2 1.5B | 2021-09-08 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ✓ | 0.217 | | | | | | | | | | Gopher 1.4B (zero-shot, Our Prompt + Choices) | 2021-12-08 |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | ✓ | 0.21 | 0.33 | 20.44 | 97.55 | 20.56 | -0.56 | -17.75 | -17.38 | | | GPT-3 175B | 2021-09-08 |
| Galactica: A Large Language Model for Science | ✓ | 0.21 | | | | | | | | | | OPT 175B | 2022-11-16 |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | ✓ | 0.20 | 0.36 | 26.68 | 89.96 | 27.17 | -0.31 | -11.35 | -7.58 | | | GPT-J 6B | 2021-09-08 |
| TruthfulQA: Measuring How Models Mimic Human Falsehoods | ✓ | 0.19 | 0.35 | 53.86 | 64.50 | 53.24 | 0.08 | 1.76 | -0.16 | | | UnifiedQA 3B | 2021-09-08 |
| Galactica: A Large Language Model for Science | ✓ | 0.19 | | | | | | | | | | GAL 125M | 2022-11-16 |
| Galactica: A Large Language Model for Science | ✓ | 0.19 | | | | | | | | | | GAL 1.3B | 2022-11-16 |
| Galactica: A Large Language Model for Science | ✓ | 0.19 | | | | | | | | | | GAL 6.7B | 2022-11-16 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | ✓ | 0.27 | | | | | | | | | | Gopher 280B (zero-shot, QA prompts) | 2021-12-08 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | | | 57 | 53 | | | | | | | LLaMA 65B | 2023-02-27 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | | | 52 | 48 | | | | | | | LLaMA 33B | 2023-02-27 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | | | 47 | 41 | | | | | | | LLaMA 13B | 2023-02-27 |
| LLaMA: Open and Efficient Foundation Language Models | ✓ | | | 33 | 29 | | | | | | | LLaMA 7B | 2023-02-27 |
| Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models | ✓ | | | | | | | | | 67.3 | | CoA | 2024-03-26 |
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | ✓ | | | | | | | | | 66.6 | | ToT | 2023-05-17 |
| Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models | ✓ | | | | | | | | | 63.3 | | CoA w/o actions | 2024-03-26 |
| Automatic Chain of Thought Prompting in Large Language Models | ✓ | | | | | | | | | 42.2 | | Auto-CoT | 2022-10-07 |
| SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments | | | | | | | | | | | 68.4 | Shakti-LLM (2.5B) | 2024-10-15 |