temporal-relation-extraction-on-vinoground

Temporal Relation Extraction

Leaderboard

Paper	Code	Text Score	Video Score	Group Score	Model Name	Release Date
-	-	59.2	51	35	GPT-4o (CoT)	-
-	-	54	38.2	24.6	GPT-4o	-
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	✓ Link	50.4	32.6	17.4	Qwen2-VL-72B	2024-09-18
LLaVA-OneVision: Easy Visual Task Transfer	✓ Link	48.4	35.2	21.8	LLaVA-OneVision-Qwen2-72B	2024-08-06
LLaVA-OneVision: Easy Visual Task Transfer	✓ Link	41.6	29.4	14.6	LLaVA-OneVision-Qwen2-7B	2024-08-06
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	✓ Link	40.2	32.4	15.2	Qwen2-VL-7B	2024-09-18
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	✓ Link	37	27.6	12.4	Gemini-1.5-Pro (CoT)	2024-03-08
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	✓ Link	36.2	21.8	8.4	VideoLLaMA2-72B	2024-06-11
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	✓ Link	35.8	22.6	10.2	Gemini-1.5-Pro	2024-03-08
-	-	32.8	28.8	10.6	Claude 3.5 Sonnet	-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone	✓ Link	32.6	29.2	11.2	MiniCPM-2.6	2024-08-03
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	✓ Link	30.8	28.4	9	InternLM-XC-2.5 (CoT)	2024-07-03
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	✓ Link	28.8	27.8	9.6	InternLM-XC-2.5	2024-07-03
-	-	25.8	22.2	5.2	LLaVA-NeXT-Video-34B (CoT)	-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	✓ Link	24.8	25.8	6.6	Video-LLaVA-7B	2023-11-16
-	-	24	22.4	6.2	Phi-3.5-Vision	-
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	✓ Link	23.8	25.6	6.8	MA-LMM-Vicuna-7B	2024-04-08
-	-	23	21.2	3.8	LLaVA-NeXT-Video-34B	-
-	-	21.8	26.2	6.8	LLaVA-NeXT-Video-7B (CoT)	-
-	-	21.8	25.6	6.2	LLaVA-NeXT-Video-7B	-
VTimeLLM: Empower LLM to Grasp Video Moments	✓ Link	19.4	27	5.2	VTimeLLM	2023-11-30
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	✓ Link	17	2.8	1.2	VideoCLIP	2021-09-28
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	✓ Link	10.6	5	1.2	LanguageBind	2023-10-03
ImageBind: One Embedding Space To Bind Them All	✓ Link	9.4	3.4	0.6	ImageBind	2023-05-09
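
The rows above appear to be ordered by Text Score, so the ranking can change when a different column is used. The sketch below is a minimal illustration, not part of the leaderboard itself: it copies four rows from the table into a tab-separated string and re-ranks them by Group Score. The names `ROWS`, `load_rows`, and `rank_by` are hypothetical helpers introduced only for this example, and the shortened header (`Model` instead of the full column set) is an assumption made to keep the sample small.

```python
import csv
import io

# Four rows copied from the leaderboard above (tab-separated).
# The reduced column set is an assumption made for brevity.
ROWS = """\
Model\tText Score\tVideo Score\tGroup Score
GPT-4o (CoT)\t59.2\t51\t35
GPT-4o\t54\t38.2\t24.6
Qwen2-VL-72B\t50.4\t32.6\t17.4
LLaVA-OneVision-Qwen2-72B\t48.4\t35.2\t21.8
"""


def load_rows(text):
    """Parse tab-separated leaderboard text into dicts with float scores."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "model": r["Model"],
            "text": float(r["Text Score"]),
            "video": float(r["Video Score"]),
            "group": float(r["Group Score"]),
        }
        for r in reader
    ]


def rank_by(rows, key):
    """Return rows sorted from highest to lowest on the given score key."""
    return sorted(rows, key=lambda r: r[key], reverse=True)


rows = load_rows(ROWS)
by_group = [r["model"] for r in rank_by(rows, "group")]
print(by_group)
```

In this sample, Qwen2-VL-72B sits above LLaVA-OneVision-Qwen2-72B when sorted by Text Score, but the two swap places when sorted by Group Score.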
