Summary: We introduce CG-Bench, a benchmark for clue-grounded question answering in long videos. Existing benchmarks focus primarily on short videos and rely on multiple-choice questions (MCQs), which lets models answer by elimination rather than genuine understanding. CG-Bench improves evaluation credibility by requiring models to retrieve the clues that support their answers. It comprises 1,219 manually curated videos organized into 14 primary, 171 secondary, and 638 tertiary categories, making it the largest benchmark for long video analysis, with 12,129 QA pairs spanning perception, reasoning, and hallucination question types. CG-Bench also introduces two clue-based evaluation methods, clue-grounded white-box and black-box evaluations, to ensure that answers rest on correct video understanding. Evaluations of a wide range of MLLMs reveal significant performance gaps in long video comprehension, especially between open-source and commercial models. We hope CG-Bench will drive the development of more reliable and capable MLLMs for long video understanding. All annotations and video data will be publicly released.
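To make the clue-grounded setup concrete, the sketch below shows what a single QA item with an annotated clue interval might look like. This is purely an illustration: every field name and value is hypothetical and does not reflect the released CG-Bench annotation schema.

```python
# Hypothetical illustration only: field names and values are invented for clarity
# and are NOT the released CG-Bench annotation format.
qa_item = {
    "video_id": "example_0001",          # hypothetical video identifier
    "question": "What does the person place on the table after entering the room?",
    "choices": ["A laptop", "A book", "A cup", "A phone"],
    "answer": "A book",
    "question_type": "perception",       # perception / reasoning / hallucination
    "clue_interval_sec": [512.0, 538.5], # start/end (seconds) of the supporting clue
}
```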
Rank | Models | LLM #param | #F (clue) | #F (long) | MCQ clue-acc. | MCQ long-acc. | mIoU | rec.@IoU | acc.@IoU | CRR | Open-Ended acc. |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Videochat2 | 7B | 16 | 16 | 35.5 | 19.1 | 1.07 | 1.91 | 0.91 | 53.8 | 18.4 |
2 | VideoLLaMA | 7B | 32 | 32 | 36.5 | 18.0 | 1.14 | 1.93 | 0.74 | 49.3 | 16.0 |
3 | Video-LLaVA | 7B | 8 | 8 | 33.4 | 16.8 | 1.01 | 1.89 | 0.64 | 50.3 | 12.0 |
4 | ST-LLM | 7B | 32 | 64 | 39.1 | 24.7 | 2.13 | 2.83 | 1.27 | 63.2 | 20.0 |
5 | Chat-UniVi-v1.5 | 13B | 32 | 64 | 41.2 | 26.7 | 1.91 | 2.45 | 1.29 | 64.7 | 21.8 |
6 | ShareGPT4Video | 16B | 16 | 16 | 40.9 | 27.1 | 2.07 | 2.39 | 1.02 | 66.4 | 21.5 |
7 | Qwen-VL-Chat | 7B | 4 | 4 | 37.7 | 20.7 | 0.93 | 1.06 | 0.52 | 54.9 | 20.1 |
8 | ViLA | 8B | 14 | 14 | 41.4 | 28.1 | 1.98 | 2.51 | 1.43 | 67.8 | 23.8 |
9 | InternVL-Chat-v1.5 | 20B | 10 | 10 | 42.0 | 28.5 | 1.85 | 2.48 | 1.09 | 67.9 | 22.9 |
10 | MiniCPM-v2.6 | 8B | 32 | 32 | 44.4 | 29.9 | 2.27 | 2.41 | 0.94 | 67.3 | 26.3 |
11 | Kangaroo | 8B | 32 | 64 | 46.4 | 31.2 | 2.32 | 2.97 | 1.54 | 67.2 | 25.9 |
12 | LLaVA-OneVision | 7B | 16 | 16 | 43.7 | 30.9 | 1.56 | 1.72 | 1.19 | 70.7 | 25.0 |
13 | Video-CCAM | 14B | 32 | 96 | 42.9 | 29.1 | 2.76 | 3.57 | 1.85 | 67.8 | 24.8 |
14 | LongVA | 7B | 32 | 128 | 42.6 | 28.7 | 2.91 | 3.15 | 1.32 | 67.4 | 24.2 |
15 | VITA | 8x7B | 32 | 32 | 47.4 | 33.0 | 2.99 | 3.16 | 2.15 | 69.6 | 28.0 |
16 | Qwen2-VL | 72B | 32 | 128 | 57.8 | 45.3 | 3.64 | 5.11 | 3.17 | 78.4 | 33.7 |
17 | InternVL2.5 | 72B | 32 | 32 | 59.5 | 44.2 | 3.90 | 5.05 | 2.46 | 74.3 | 34.2 |
18 | GPT-4o-08-06 | - | 32 | 128 | 58.6 | 44.9 | 5.73 | 8.12 | 4.33 | 76.6 | 39.2 |
19 | GPT-4o-mini-08-06 | - | 32 | 128 | 48.4 | 32.6 | 3.68 | 5.07 | 2.13 | 67.4 | 24.9 |
20 | Gemini-1.5-Pro | - | 32 | 128 | 50.9 | 37.8 | 3.85 | 5.61 | 2.64 | 74.3 | 28.7 |
21 | Gemini-1.5-Flash | - | 32 | 128 | 48.4 | 33.5 | 3.43 | 5.10 | 2.49 | 69.2 | 24.6 |
22 | Claude3.5-Sonnet | - | 32 | 50 | 56.5 | 40.3 | 4.17 | 5.99 | 2.73 | 71.3 | 35.6 |
This leaderboard tracks the performance of different models on the mini-set of the CG-Bench dataset, which contains 1,118 videos and 3,000 questions for fast evaluation. The evaluation metrics include:

- clue-acc. / long-acc.: multiple-choice (MCQ) accuracy when the model is evaluated with frames from the annotated clue interval (clue) versus frames sampled from the full long video (long); the #F columns list the number of frames used in each setting.
- mIoU, rec.@IoU, acc.@IoU (Cred. Eval.): clue-grounded interval evaluation, where the model also outputs the time interval supporting its answer, scored against the annotated clue interval (mean IoU, recall at an IoU threshold, and answer accuracy at an IoU threshold).
- CRR (Cred. Eval.): the clue-grounded credibility ratio, long-acc. divided by clue-acc., measuring how much of the clue-based accuracy is retained when the model must search the full video.
- acc. (Open-Ended): accuracy on open-ended question answering.

These correspond to the clue-grounded white-box and black-box evaluations described in the paper; see the paper for the exact metric definitions.
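As a rough guide to how these scores relate to model outputs, here is a minimal sketch, assuming a simplified setting in which each prediction carries one chosen option and one predicted time interval, and rec.@IoU / acc.@IoU are reported at a single fixed IoU threshold. The function and field names are illustrative; the official CG-Bench evaluation code may aggregate differently.

```python
from typing import Dict, List

def interval_iou(pred: List[float], gt: List[float]) -> float:
    """Temporal IoU between two [start, end] intervals (in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def clue_grounded_scores(preds: List[Dict], iou_thr: float = 0.5) -> Dict[str, float]:
    """Sketch of the interval-grounded metrics: mIoU, recall@IoU, accuracy@IoU.

    Each item in `preds` is assumed (hypothetically) to contain:
      pred_interval, gt_interval: [start, end] in seconds
      pred_answer, gt_answer:     chosen / correct MCQ option
    """
    ious = [interval_iou(p["pred_interval"], p["gt_interval"]) for p in preds]
    hits = [iou >= iou_thr for iou in ious]
    correct_and_hit = [h and p["pred_answer"] == p["gt_answer"]
                       for h, p in zip(hits, preds)]
    n = len(preds)
    return {
        "mIoU": 100.0 * sum(ious) / n,
        f"rec@{iou_thr}": 100.0 * sum(hits) / n,
        f"acc@{iou_thr}": 100.0 * sum(correct_and_hit) / n,
    }

def crr(long_acc: float, clue_acc: float) -> float:
    """Credibility ratio: share of clue-based accuracy retained on the full video."""
    return 100.0 * long_acc / clue_acc if clue_acc > 0 else 0.0

# Example: reproduces the CRR of the first row (long-acc 19.1, clue-acc 35.5 -> 53.8).
print(round(crr(19.1, 35.5), 1))  # 53.8
```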
If you want to submit your model, please fill in this sheet.
@misc{chen2024cgbench,
      title={CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding},
      author={Guo Chen and Yicheng Liu and Yifei Huang and Yuping He and Baoqi Pei and Jilan Xu and Yali Wang and Tong Lu and Limin Wang},
      year={2024},
      eprint={2412.12075},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}