CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Guo Chen*
Yicheng Liu
Yifei Huang
Yuping He
Baoqi Pei
Jilan Xu
Yali Wang
Tong Lu
Limin Wang
Nanjing University
[Paper]
[Dataset]
[GitHub]
[Leaderboard]

Summary: We introduce CG-Bench, a groundbreaking benchmark for clue-grounded question answering in long videos, addressing the limitations of existing benchmarks that focus primarily on short videos and rely on multiple-choice questions (MCQs), which allow models to answer by elimination rather than through genuine understanding. CG-Bench enhances evaluation credibility by requiring models to retrieve the clues relevant to each question. It includes 1,219 manually curated videos organized into 14 primary, 171 secondary, and 638 tertiary categories, making it the largest benchmark for long-video analysis. With 12,129 QA pairs spanning perception, reasoning, and hallucination question types, CG-Bench introduces clue-based evaluation methods: clue-grounded white-box and black-box evaluations, which ensure answers are based on correct video understanding. Evaluations of various MLLMs reveal significant performance gaps in long video comprehension, especially between open-source and commercial models. We aim for CG-Bench to drive the development of more reliable and capable MLLMs for long video understanding. All annotations and video data will be publicly released.
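To make the two MCQ settings used in the leaderboard below concrete, here is a minimal sketch of how a question could be posed under the clue-based setting (only the annotated clue segment is visible) versus the long-video setting (the full video is given). The frame-sampling helper and the `answer_mcq` model interface are hypothetical stand-ins, not the official CG-Bench evaluation pipeline.

```python
# Minimal sketch of the two MCQ evaluation settings (clue-based vs. full
# long video). All helpers used here (video.frame_at, model.answer_mcq)
# are hypothetical stand-ins, not the official CG-Bench evaluation code.

def sample_frames(video, start_s, end_s, num_frames=32):
    """Uniformly sample `num_frames` frames from [start_s, end_s] (seconds)."""
    step = (end_s - start_s) / num_frames
    return [video.frame_at(start_s + (i + 0.5) * step) for i in range(num_frames)]

def answer_question(model, video, question, options, clue_interval=None):
    """Ask one multiple-choice question in either evaluation setting."""
    if clue_interval is not None:
        # Clue-based setting (-> clue-acc.): only the annotated clue segment is shown.
        frames = sample_frames(video, clue_interval[0], clue_interval[1])
    else:
        # Long-video setting (-> long-acc.): frames are sampled from the whole video.
        frames = sample_frames(video, 0.0, video.duration)
    # The model is expected to return an option letter such as "B".
    return model.answer_mcq(frames, question, options)
```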



Leaderboard

| Rank | Model | LLM #Params | #Frames (clue) | #Frames (long) | MCQ clue-acc. | MCQ long-acc. | mIoU | rec.@IoU | acc.@IoU | CRR | Open-Ended acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Videochat2 | 7B | 16 | 16 | 35.5 | 19.1 | 1.07 | 1.91 | 0.91 | 53.8 | 18.4 |
| 2 | VideoLLaMA | 7B | 32 | 32 | 36.5 | 18.0 | 1.14 | 1.93 | 0.74 | 49.3 | 16.0 |
| 3 | Video-LLaVA | 7B | 8 | 8 | 33.4 | 16.8 | 1.01 | 1.89 | 0.64 | 50.3 | 12.0 |
| 4 | ST-LLM | 7B | 32 | 64 | 39.1 | 24.7 | 2.13 | 2.83 | 1.27 | 63.2 | 20.0 |
| 5 | Chat-UniVi-v1.5 | 13B | 32 | 64 | 41.2 | 26.7 | 1.91 | 2.45 | 1.29 | 64.7 | 21.8 |
| 6 | ShareGPT4Video | 16B | 16 | 16 | 40.9 | 27.1 | 2.07 | 2.39 | 1.02 | 66.4 | 21.5 |
| 7 | Qwen-VL-Chat | 7B | 4 | 4 | 37.7 | 20.7 | 0.93 | 1.06 | 0.52 | 54.9 | 20.1 |
| 8 | ViLA | 8B | 14 | 14 | 41.4 | 28.1 | 1.98 | 2.51 | 1.43 | 67.8 | 23.8 |
| 9 | InternVL-Chat-v1.5 | 20B | 10 | 10 | 42.0 | 28.5 | 1.85 | 2.48 | 1.09 | 67.9 | 22.9 |
| 10 | MiniCPM-v2.6 | 8B | 32 | 32 | 44.4 | 29.9 | 2.27 | 2.41 | 0.94 | 67.3 | 26.3 |
| 11 | Kangaroo | 8B | 32 | 64 | 46.4 | 31.2 | 2.32 | 2.97 | 1.54 | 67.2 | 25.9 |
| 12 | LLaVA-OneVision | 7B | 16 | 16 | 43.7 | 30.9 | 1.56 | 1.72 | 1.19 | 70.7 | 25.0 |
| 13 | Video-CCAM | 14B | 32 | 96 | 42.9 | 29.1 | 2.76 | 3.57 | 1.85 | 67.8 | 24.8 |
| 14 | LongVA | 7B | 32 | 128 | 42.6 | 28.7 | 2.91 | 3.15 | 1.32 | 67.4 | 24.2 |
| 15 | VITA | 8x7B | 32 | 32 | 47.4 | 33.0 | 2.99 | 3.16 | 2.15 | 69.6 | 28.0 |
| 16 | Qwen2-VL | 72B | 32 | 128 | 57.8 | 45.3 | 3.64 | 5.11 | 3.17 | 78.4 | 33.7 |
| 17 | InternVL2.5 | 72B | 32 | 32 | 59.5 | 44.2 | 3.90 | 5.05 | 2.46 | 74.3 | 34.2 |
| 18 | GPT-4o-08-06 | - | 32 | 128 | 58.6 | 44.9 | 5.73 | 8.12 | 4.33 | 76.6 | 39.2 |
| 19 | GPT-4mini-08-06 | - | 32 | 128 | 48.4 | 32.6 | 3.68 | 5.07 | 2.13 | 67.4 | 24.9 |
| 20 | Gemini-1.5-Pro | - | 32 | 128 | 50.9 | 37.8 | 3.85 | 5.61 | 2.64 | 74.3 | 28.7 |
| 21 | Gemini-1.5-Flash | - | 32 | 128 | 48.4 | 33.5 | 3.43 | 5.10 | 2.49 | 69.2 | 24.6 |
| 22 | Claude3.5-Sonnet | - | 32 | 50 | 56.5 | 40.3 | 4.17 | 5.99 | 2.73 | 71.3 | 35.6 |

This leaderboard tracks the performance of different models on the mini-set of the CG-Bench dataset, which contains 1,118 videos and 3,000 questions for fast evaluation. The evaluation metrics include:

  • MCQ: multiple-choice question accuracy in the clue-based and long-video settings:
    • clue-acc.: accuracy when the model is given only the annotated clue segments relevant to the question
    • long-acc.: accuracy when the model must process the entire long video
  • Cred. Eval.: credibility metrics for clue grounding (a computational sketch follows this list):
    • mIoU: mean Intersection over Union between predicted and ground-truth clue intervals
    • rec.@IoU: percentage of questions whose predicted clue interval reaches an IoU above the threshold, measuring the model's ability to locate relevant clues
    • acc.@IoU: percentage of questions answered correctly and grounded with an IoU above the threshold, tying answer quality to correct clue localization
    • CRR: the ratio of long-video accuracy to clue-based accuracy (long-acc. / clue-acc., reported as a percentage), measuring how much of its clue-grounded performance a model retains when it must process the full long video
  • Open-Ended: performance on open-ended question answering
    • acc.: accuracy on open-ended questions
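The credibility metrics reduce to simple arithmetic over 1-D temporal intervals. The sketch below is an illustration only: the IoU thresholds (0.1/0.3/0.5), the input format, and the aggregation are assumptions, not the official evaluation script.

```python
# Illustrative computation of the credibility metrics for temporal clue
# grounding. Thresholds and input format are assumptions.

def interval_iou(pred, gt):
    """IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def credibility_metrics(samples, thresholds=(0.1, 0.3, 0.5)):
    """samples: list of dicts with 'pred_interval', 'gt_interval', 'correct' (bool)."""
    ious = [interval_iou(s["pred_interval"], s["gt_interval"]) for s in samples]
    miou = 100.0 * sum(ious) / len(ious)
    # rec.@IoU: fraction of questions whose predicted clue interval reaches the threshold.
    rec = {t: 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    # acc.@IoU: fraction of questions answered correctly AND grounded above the threshold.
    acc = {t: 100.0 * sum(s["correct"] and iou >= t
                          for s, iou in zip(samples, ious)) / len(samples)
           for t in thresholds}
    return miou, rec, acc

def crr(long_acc, clue_acc):
    """CRR: share of clue-grounded accuracy retained on the full long video."""
    return 100.0 * long_acc / clue_acc
```

As a sanity check against the table above, `crr(44.9, 58.6)` gives roughly 76.6, matching the CRR reported for GPT-4o-08-06.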


If you want to submit your model, please fill in this sheet.

Benchmark Statistics

Video Meta: Our dataset comprises 1,219 videos with rich multimodal information, including vision, audio, and subtitles. Video durations range from 10 to 80 minutes, with videos of 20 to 30 minutes being the most prevalent. Videos are selected manually based on content relevance, so the duration distribution mirrors real-world distributions and exhibits a long-tail effect for longer videos. As illustrated in Figure 2, each video is classified with a three-tier tagging system that succinctly encapsulates its content: a primary layer of 14 categories, a secondary layer of 171 tags, and a tertiary layer of 638 tags. This multi-level tagging mechanism guarantees broad diversity of data content. For a more detailed breakdown of the tags, please consult the supplementary materials.
Question Meta: We annotate each video with high-quality question-answer-clue (QAC) triplets; an illustrative entry is sketched below. To ensure question diversity, we first establish a taxonomy with three main types: Perception, Reasoning, and Hallucination. As shown in Figure 3, Perception and Reasoning questions are further divided into 10 and 14 subcategories, respectively, while Hallucination questions combine elements of both perception and reasoning. Annotators are instructed to include negative options so that each question forms a multiple-choice QA, enabling straightforward and cost-effective assessment. To minimize loss of expression, annotators work in their native language during annotation. Each video is annotated with 6 to 15 QAC triplets, depending on its duration.
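For illustration only, a single annotation entry might look like the sketch below. The field names, tag labels, question text, and timestamps are hypothetical and do not reflect the released annotation format; the sketch is only meant to convey the QAC structure, the three-tier tags, and the multiple-choice layout described above.

```python
# Hypothetical QAC annotation entry; all names and values are illustrative.
qac_example = {
    "video_id": "example_0001",
    "duration_s": 1520,                     # ~25 min, within the most common range
    "tags": {                               # three-tier tagging system
        "primary": "Life Record",
        "secondary": "Travel Vlog",
        "tertiary": "City Walking Tour",
    },
    "question_type": "Reasoning",           # Perception / Reasoning / Hallucination
    "question": "Why does the host change buses halfway through the trip?",
    "options": {                            # negative options enable MCQ scoring
        "A": "The first bus breaks down",
        "B": "The route is closed for a festival",
        "C": "The host wants to film a landmark",
        "D": "The ticket is only valid for one zone",
    },
    "answer": "B",
    "clue_interval": [412.0, 447.5],        # seconds; grounds the answer in the video
}
```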




Benchmark Comparison

CG-Bench has diverse features that allow it to be compared with three distinct types of benchmarks, corresponding to the three sections of Table 2: Question Clue Grounding, Short-Video QA, and Long-Video QA benchmarks.
Question Grounding: Among question clue grounding benchmarks, NextGQA, Ego4D-NLQ, MultiHop-EgoQA, E.T. Bench, and RexTime center primarily on action and egocentric domains, and their videos are sampled from academic datasets. In comparison, the question clue grounding subset of CG-Bench, CG-Bench-QG, offers the largest number of videos and the longest average video length, and this diversity supports a far broader spectrum of question-grounding queries.
Short-Video Question Answering: Furthermore, we transform the QAC triplets into a novel short-video QA benchmark, termed CG-Bench-Clue. Compared with prior short-video benchmarks such as TempCompass, MVBench, and MMBench-Video, CG-Bench-Clue is the largest held-out, open-domain, multimodal short-video QA benchmark.
Long-Video Question Answering: For long-video QA, CG-Bench excels in the number of videos, video length, number of questions, and annotation quality. Owing to its clue interval annotations, CG-Bench further enables reliable evaluation for long videos and clue-assisted open-ended evaluation, features that set it apart from existing long-video benchmarks such as Video-MME and MLVU.




Experimental Results



Citation

@misc{chen2024cgbench,
      title={CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding}, 
      author={Guo Chen and Yicheng Liu and Yifei Huang and Yuping He and Baoqi Pei and Jilan Xu and Yali Wang and Tong Lu and Limin Wang},
      year={2024},
      eprint={2412.12075},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
