Paper | Code | Accuracy (%) | Model Name | Release Date |
---|---|---|---|---|
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | ✓ Link | 66.7 | Gemini 1.5 Pro | 2024-03-08 |
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | ✓ Link | 65.4 | Video-RAG (based on LLaVA-Video) | 2024-11-20 |
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 64.0 | GPT-4o | 2024-06-14 |
Video Instruction Tuning With Synthetic Data | | 61.9 | LLaVA-Video | 2024-10-03 |