OpenCodePapers

Temporal Relation Extraction on Vinoground
Leaderboard
| Model | Text Score | Video Score | Group Score | Paper | Code | Release Date |
|---|---|---|---|---|---|---|
| GPT-4o (CoT) | 59.2 | 51 | 35 | – | – | – |
| GPT-4o | 54 | 38.2 | 24.6 | – | – | – |
| Qwen2-VL-72B | 50.4 | 32.6 | 17.4 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 2024-09-18 |
| LLaVA-OneVision-Qwen2-72B | 48.4 | 35.2 | 21.8 | LLaVA-OneVision: Easy Visual Task Transfer | ✓ | 2024-08-06 |
| LLaVA-OneVision-Qwen2-7B | 41.6 | 29.4 | 14.6 | LLaVA-OneVision: Easy Visual Task Transfer | ✓ | 2024-08-06 |
| Qwen2-VL-7B | 40.2 | 32.4 | 15.2 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | ✓ | 2024-09-18 |
| Gemini-1.5-Pro (CoT) | 37 | 27.6 | 12.4 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 2024-03-08 |
| VideoLLaMA2-72B | 36.2 | 21.8 | 8.4 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | ✓ | 2024-06-11 |
| Gemini-1.5-Pro | 35.8 | 22.6 | 10.2 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ | 2024-03-08 |
| Claude 3.5 Sonnet | 32.8 | 28.8 | 10.6 | – | – | – |
| MiniCPM-2.6 | 32.6 | 29.2 | 11.2 | MiniCPM-V: A GPT-4V Level MLLM on Your Phone | ✓ | 2024-08-03 |
| InternLM-XC-2.5 (CoT) | 30.8 | 28.4 | 9 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ | 2024-07-03 |
| InternLM-XC-2.5 | 28.8 | 27.8 | 9.6 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | ✓ | 2024-07-03 |
| LLaVA-NeXT-Video-34B (CoT) | 25.8 | 22.2 | 5.2 | – | – | – |
| Video-LLaVA-7B | 24.8 | 25.8 | 6.6 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | ✓ | 2023-11-16 |
| Phi-3.5-Vision | 24 | 22.4 | 6.2 | – | – | – |
| MA-LMM-Vicuna-7B | 23.8 | 25.6 | 6.8 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | ✓ | 2024-04-08 |
| LLaVA-NeXT-Video-34B | 23 | 21.2 | 3.8 | – | – | – |
| LLaVA-NeXT-Video-7B (CoT) | 21.8 | 26.2 | 6.8 | – | – | – |
| LLaVA-NeXT-Video-7B | 21.8 | 25.6 | 6.2 | – | – | – |
| VTimeLLM | 19.4 | 27 | 5.2 | VTimeLLM: Empower LLM to Grasp Video Moments | ✓ | 2023-11-30 |
| VideoCLIP | 17 | 2.8 | 1.2 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | ✓ | 2021-09-28 |
| LanguageBind | 10.6 | 5 | 1.2 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ✓ | 2023-10-03 |
| ImageBind | 9.4 | 3.4 | 0.6 | ImageBind: One Embedding Space To Bind Them All | ✓ | 2023-05-09 |
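The three metric columns follow the Winoground-style counterfactual protocol that Vinoground adopts: each group pairs two temporally contrastive captions with two videos, and a group counts toward the text score if the captions are matched to the right videos, toward the video score if the videos are matched to the right captions, and toward the group score only if both directions are correct. A minimal sketch of that aggregation (the function name and boolean-pair input format are illustrative assumptions, not the benchmark's actual evaluation code):

```python
# Sketch of Winoground-style score aggregation for a Vinoground-like
# benchmark. Each entry in `results` is one caption/video group:
# (text_correct, video_correct) booleans.

def vinoground_scores(results):
    """Return (text, video, group) accuracies in percent.

    Text  - fraction of groups where both captions were matched
            to the correct videos.
    Video - fraction where both videos were matched to the correct captions.
    Group - fraction where both directions were correct at once,
            which is why it is always <= min(text, video).
    """
    n = len(results)
    text = sum(t for t, _ in results) / n * 100
    video = sum(v for _, v in results) / n * 100
    group = sum(t and v for t, v in results) / n * 100
    return text, video, group

# Example with 4 groups:
scores = vinoground_scores([(True, True), (True, False),
                            (False, False), (True, True)])
# → (75.0, 50.0, 50.0)
```

This also explains the pattern visible in the leaderboard: every model's group score is bounded above by both its text and video scores, since the group metric demands joint correctness.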