CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Guo Chen*
Yicheng Liu
Yifei Huang
Yuping He
Baoqi Pei
Jilan Xu
Yali Wang
Tong Lu
Limin Wang
Nanjing University
[Paper]
[Dataset]
[GitHub]
[Leaderboard]

Summary: We introduce CG-Bench, a groundbreaking benchmark for clue-grounded question answering in long videos, addressing the limitations of existing benchmarks that focus primarily on short videos and rely on multiple-choice questions (MCQs), which allow models to answer by elimination rather than through genuine understanding. CG-Bench enhances evaluation credibility by requiring models to retrieve the clues relevant to each question. It includes 1,219 manually curated videos organized into 14 primary, 171 secondary, and 638 tertiary categories, making it the largest benchmark for long-video analysis. With 12,129 QA pairs spanning perception, reasoning, and hallucination question types, CG-Bench introduces clue-based evaluation methods: clue-grounded white-box and black-box evaluations, which ensure answers are based on correct video understanding. Evaluations of various MLLMs reveal significant performance gaps in long video comprehension, especially between open-source and commercial models. We aim for CG-Bench to drive the development of more reliable and capable MLLMs for long video understanding. All annotations and video data will be publicly released.
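To make the two MCQ settings used in the leaderboard below concrete, here is a minimal sketch of how a question could be posed under the clue-based setting (only the annotated clue segment is visible) versus the long-video setting (the full video is given). The frame-sampling helper and the `answer_mcq` model interface are hypothetical stand-ins, not the official CG-Bench evaluation pipeline.

```python
# Minimal sketch of the two MCQ evaluation settings (clue-based vs. full
# long video). All helpers used here (video.frame_at, model.answer_mcq)
# are hypothetical stand-ins, not the official CG-Bench evaluation code.

def sample_frames(video, start_s, end_s, num_frames=32):
    """Uniformly sample `num_frames` frames from [start_s, end_s] (seconds)."""
    step = (end_s - start_s) / num_frames
    return [video.frame_at(start_s + (i + 0.5) * step) for i in range(num_frames)]

def answer_question(model, video, question, options, clue_interval=None):
    """Ask one multiple-choice question in either evaluation setting."""
    if clue_interval is not None:
        # Clue-based setting (-> clue-acc.): only the annotated clue segment is shown.
        frames = sample_frames(video, clue_interval[0], clue_interval[1])
    else:
        # Long-video setting (-> long-acc.): frames are sampled from the whole video.
        frames = sample_frames(video, 0.0, video.duration)
    # The model is expected to return an option letter such as "B".
    return model.answer_mcq(frames, question, options)
```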



Leaderboard

| Rank | Model | LLM #Params | #Frames (clue) | #Frames (long) | MCQ clue-acc. | MCQ long-acc. | mIoU | rec.@IoU | acc.@IoU | CRR | Open-Ended acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Videochat2 | 7B | 16 | 16 | 35.5 | 19.1 | 1.07 | 1.91 | 0.91 | 53.8 | 18.4 |
| 2 | VideoLLaMA | 7B | 32 | 32 | 36.5 | 18.0 | 1.14 | 1.93 | 0.74 | 49.3 | 16.0 |
| 3 | Video-LLaVA | 7B | 8 | 8 | 33.4 | 16.8 | 1.01 | 1.89 | 0.64 | 50.3 | 12.0 |
| 4 | ST-LLM | 7B | 32 | 64 | 39.1 | 24.7 | 2.13 | 2.83 | 1.27 | 63.2 | 20.0 |
| 5 | Chat-UniVi-v1.5 | 13B | 32 | 64 | 41.2 | 26.7 | 1.91 | 2.45 | 1.29 | 64.7 | 21.8 |
| 6 | ShareGPT4Video | 16B | 16 | 16 | 40.9 | 27.1 | 2.07 | 2.39 | 1.02 | 66.4 | 21.5 |
| 7 | Qwen-VL-Chat | 7B | 4 | 4 | 37.7 | 20.7 | 0.93 | 1.06 | 0.52 | 54.9 | 20.1 |
| 8 | ViLA | 8B | 14 | 14 | 41.4 | 28.1 | 1.98 | 2.51 | 1.43 | 67.8 | 23.8 |
| 9 | InternVL-Chat-v1.5 | 20B | 10 | 10 | 42.0 | 28.5 | 1.85 | 2.48 | 1.09 | 67.9 | 22.9 |
| 10 | MiniCPM-v2.6 | 8B | 32 | 32 | 44.4 | 29.9 | 2.27 | 2.41 | 0.94 | 67.3 | 26.3 |
| 11 | Kangaroo | 8B | 32 | 64 | 46.4 | 31.2 | 2.32 | 2.97 | 1.54 | 67.2 | 25.9 |
| 12 | LLaVA-OneVision | 7B | 16 | 16 | 43.7 | 30.9 | 1.56 | 1.72 | 1.19 | 70.7 | 25.0 |
| 13 | Video-CCAM | 14B | 32 | 96 | 42.9 | 29.1 | 2.76 | 3.57 | 1.85 | 67.8 | 24.8 |
| 14 | LongVA | 7B | 32 | 128 | 42.6 | 28.7 | 2.91 | 3.15 | 1.32 | 67.4 | 24.2 |
| 15 | VITA | 8x7B | 32 | 32 | 47.4 | 33.0 | 2.99 | 3.16 | 2.15 | 69.6 | 28.0 |
| 16 | Qwen2-VL | 72B | 32 | 128 | 57.8 | 45.3 | 3.64 | 5.11 | 3.17 | 78.4 | 33.7 |
| 17 | InternVL2.5 | 72B | 32 | 32 | 59.5 | 44.2 | 3.90 | 5.05 | 2.46 | 74.3 | 34.2 |
| 18 | GPT-4o-08-06 | - | 32 | 128 | 58.6 | 44.9 | 5.73 | 8.12 | 4.33 | 76.6 | 39.2 |
| 19 | GPT-4mini-08-06 | - | 32 | 128 | 48.4 | 32.6 | 3.68 | 5.07 | 2.13 | 67.4 | 24.9 |
| 20 | Gemini-1.5-Pro | - | 32 | 128 | 50.9 | 37.8 | 3.85 | 5.61 | 2.64 | 74.3 | 28.7 |
| 21 | Gemini-1.5-Flash | - | 32 | 128 | 48.4 | 33.5 | 3.43 | 5.10 | 2.49 | 69.2 | 24.6 |
| 22 | Claude3.5-Sonnet | - | 32 | 50 | 56.5 | 40.3 | 4.17 | 5.99 | 2.73 | 71.3 | 35.6 |

This leaderboard tracks the performance of different models on the mini-set of the CG-Bench dataset, which contains 1,118 videos and 3,000 questions for fast evaluation. The evaluation metrics include:

  • MCQ: multiple-choice question accuracy in the clue-based and long-video settings:
    • clue-acc.: accuracy when the model is given only the annotated clue segments relevant to the question
    • long-acc.: accuracy when the model must process the entire long video
  • Cred. Eval.: credibility metrics for clue grounding (a computational sketch follows this list):
    • mIoU: mean Intersection over Union between predicted and ground-truth clue intervals
    • rec.@IoU: percentage of questions whose predicted clue interval reaches an IoU above the threshold, measuring the model's ability to locate relevant clues
    • acc.@IoU: percentage of questions answered correctly and grounded with an IoU above the threshold, tying answer quality to correct clue localization
    • CRR: the ratio of long-video accuracy to clue-based accuracy (long-acc. / clue-acc., reported as a percentage), measuring how much of its clue-grounded performance a model retains when it must process the full long video
  • Open-Ended: performance on open-ended question answering
    • acc.: accuracy on open-ended questions
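The credibility metrics reduce to simple arithmetic over 1-D temporal intervals. The sketch below is an illustration only: the IoU thresholds (0.1/0.3/0.5), the input format, and the aggregation are assumptions, not the official evaluation script.

```python
# Illustrative computation of the credibility metrics for temporal clue
# grounding. Thresholds and input format are assumptions.

def interval_iou(pred, gt):
    """IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def credibility_metrics(samples, thresholds=(0.1, 0.3, 0.5)):
    """samples: list of dicts with 'pred_interval', 'gt_interval', 'correct' (bool)."""
    ious = [interval_iou(s["pred_interval"], s["gt_interval"]) for s in samples]
    miou = 100.0 * sum(ious) / len(ious)
    # rec.@IoU: fraction of questions whose predicted clue interval reaches the threshold.
    rec = {t: 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    # acc.@IoU: fraction of questions answered correctly AND grounded above the threshold.
    acc = {t: 100.0 * sum(s["correct"] and iou >= t
                          for s, iou in zip(samples, ious)) / len(samples)
           for t in thresholds}
    return miou, rec, acc

def crr(long_acc, clue_acc):
    """CRR: share of clue-grounded accuracy retained on the full long video."""
    return 100.0 * long_acc / clue_acc
```

As a sanity check against the table above, `crr(44.9, 58.6)` gives roughly 76.6, matching the CRR reported for GPT-4o-08-06.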


If you want to submit your model, please fill in this sheet.

Benchmark Statistics

Video Meta: Our dataset comprises 1,219 videos with rich multimodal information, including vision, audio, and subtitles. Video durations range from 10 to 80 minutes, with videos of 20 to 30 minutes being the most prevalent. Videos are selected manually based on content relevance, so the duration distribution mirrors real-world distributions and exhibits a long-tail effect for longer videos. As illustrated in Figure 2, each video is classified with a three-tier tagging system that succinctly encapsulates its content: a primary layer of 14 categories, a secondary layer of 171 tags, and a tertiary layer of 638 tags. This multi-level tagging mechanism guarantees broad diversity of data content. For a more detailed breakdown of the tags, please consult the supplementary materials.
Question Meta: We annotate each video with high-quality question-answer-clue (QAC) triplets; an illustrative entry is sketched below. To ensure question diversity, we first establish a taxonomy with three main types: Perception, Reasoning, and Hallucination. As shown in Figure 3, Perception and Reasoning questions are further divided into 10 and 14 subcategories, respectively, while Hallucination questions combine elements of both perception and reasoning. Annotators are instructed to include negative options so that each question forms a multiple-choice QA, enabling straightforward and cost-effective assessment. To minimize loss of expression, annotators work in their native language during annotation. Each video is annotated with 6 to 15 QAC triplets, depending on its duration.
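For illustration only, a single annotation entry might look like the sketch below. The field names, tag labels, question text, and timestamps are hypothetical and do not reflect the released annotation format; the sketch is only meant to convey the QAC structure, the three-tier tags, and the multiple-choice layout described above.

```python
# Hypothetical QAC annotation entry; all names and values are illustrative.
qac_example = {
    "video_id": "example_0001",
    "duration_s": 1520,                     # ~25 min, within the most common range
    "tags": {                               # three-tier tagging system
        "primary": "Life Record",
        "secondary": "Travel Vlog",
        "tertiary": "City Walking Tour",
    },
    "question_type": "Reasoning",           # Perception / Reasoning / Hallucination
    "question": "Why does the host change buses halfway through the trip?",
    "options": {                            # negative options enable MCQ scoring
        "A": "The first bus breaks down",
        "B": "The route is closed for a festival",
        "C": "The host wants to film a landmark",
        "D": "The ticket is only valid for one zone",
    },
    "answer": "B",
    "clue_interval": [412.0, 447.5],        # seconds; grounds the answer in the video
}
```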




Benchmark Comparison

CG-Bench has diverse features that allow it to be compared with three distinct types of benchmarks, corresponding to the three sections of Table 2: Question Clue Grounding, Short-Video QA, and Long-Video QA benchmarks.
Question Grounding: Among question clue grounding benchmarks, NextGQA, Ego4D-NLQ, MultiHop-EgoQA, E.T. Bench, and RexTime center primarily on action and egocentric domains, and their videos are sampled from academic datasets. In comparison, the question clue grounding subset of CG-Bench, CG-Bench-QG, offers the largest number of videos and the longest average video length, and this diversity supports a far broader spectrum of question-grounding queries.
Short-Video Question Answering: Furthermore, we transform the QAC triplets into a novel short-video QA benchmark, termed CG-Bench-Clue. Compared with prior short-video benchmarks such as TempCompass, MVBench, and MMBench-Video, CG-Bench-Clue is the largest held-out, open-domain, multimodal short-video QA benchmark.
Long-Video Question Answering: For long-video QA, CG-Bench excels in the number of videos, video length, number of questions, and annotation quality. Owing to its clue interval annotations, CG-Bench further enables reliable evaluation for long videos and clue-assisted open-ended evaluation, features that set it apart from existing long-video benchmarks such as Video-MME and MLVU.




Experimental Results



Citation

@misc{chen2024cgbench,
      title={CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding}, 
      author={Guo Chen and Yicheng Liu and Yifei Huang and Yuping He and Baoqi Pei and Jilan Xu and Yali Wang and Tong Lu and Limin Wang},
      year={2024},
      eprint={2412.12075},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
