EPFL · CS-503 Visual Intelligence · Spring 2026

Auxiliary Evidence Routing for Egocentric Video QA

Questions evaluated: 1,035
Evidence tools compared: 6
Native baseline: 32.3%
Rule-based router gain: +3.9 pp
Perfect oracle: 51.9%

Abstract

We study whether targeted auxiliary evidence can improve egocentric video QA under practical frame-budget and model-size constraints. Using HD-EPIC, a highly-detailed egocentric benchmark with 41 hours of video and 26K multiple-choice questions across 7 major categories, we compare a resource-bounded native-video baseline against six evidence tools: uniform frame sampling, CLIP retrieval (ViT-B/32), a Motion+CLIP cascade, OCR Crop, object tracking (GroundingDINO-tiny), and Uniform+CLIP.

We run a controlled ablation under a fixed budget of k=8 auxiliary frames, comparing augmentation (native video + auxiliary frames) against replacement (auxiliary frames only). On the 719-question fair intersection used for the mode comparison, the best augment tool (OCR Crop, 33.7%) barely edges the best replace tool (CLIP, 32.5%), both near the no-tool baseline of 32.3%. Replace mode, however, preserves a much larger per-question routing opportunity on this matched subset (+17.4 pp oracle gap vs. +6.1 pp for augment), making it cost-efficient at ~9× fewer frames.

Across 8 routing methods evaluated in replace mode, only a hand-crafted rule-based keyword router beats the best fixed tool (+3.9 pp), recovering the entire between-category gap. Learned classifiers fall below their matched single-tool baselines across the full, helpful, and cached-frame subsets. A large within-category gap of +16 pp to the perfect per-question oracle remains.

I. Introduction

Long-form first-person video question answering is inherently an active-vision problem: not every moment in a video is equally informative, and the evidence needed to answer a question may be brief, local, or easy to miss. In practice, however, compute and memory constraints often prevent us from feeding every frame to a very large video model. We therefore study a resource-bounded setting with a frozen Qwen3-VL-2B model, sampling the native video at 1 FPS with a cap of 64 frames. For a 6.4-minute clip (384 s), this means one frame every ≈ 6 seconds, leaving large temporal gaps in which brief but important cues such as scale readings, ingredient labels, or short object interactions can be missed entirely.
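The sampling arithmetic above can be checked with a short sketch. It assumes that frames beyond the cap are thinned evenly over the clip; the function name and this thinning rule are illustrative, not taken from our pipeline code.

```python
def native_frame_interval(clip_seconds: float, fps: float = 1.0, max_frames: int = 64) -> float:
    """Seconds between consecutive native frames once the frame cap is applied.

    Assumption: when the nominal rate would exceed the cap, the kept frames
    are spread evenly over the whole clip.
    """
    requested = max(1, int(clip_seconds * fps))  # frames at the nominal rate
    kept = min(requested, max_frames)            # the cap bounds what the model sees
    return clip_seconds / kept


print(native_frame_interval(384))   # 6.0  (the 6.4-minute threshold clip)
print(native_frame_interval(1200))  # 18.75 (a 20-minute clip)
```

Under this assumption the gap between native frames grows linearly with clip length once the cap is reached, which is why the long-clip regime is the one where auxiliary frames have the most room to help.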

Prior work addresses this by selecting or replacing parts of the visual input with targeted evidence. However, these approaches discard the global temporal context of the full video, which a modern video-native VLM already processes well. We instead study auxiliary evidence selection in two modes: augmentation (native video + auxiliary frames) and replacement (auxiliary frames only). We first compare both modes under the same budget, then use replacement — which preserves a 3× larger per-question routing opportunity — as the setting for our routing experiments.

This framing lets us ask three precise questions: (1) Can targeted auxiliary evidence improve egocentric video QA on top of the native video input of a modern VLM? (2) Is augmentation more effective than replacement under the same evidence budget? (3) If gains are category-specific, are they strong enough to justify question-type-conditioned tool routing?

We evaluate on HD-EPIC, a highly-detailed egocentric benchmark with 41 hours of kitchen video and 26K multiple-choice VQA questions spanning subcategories of 7 major categories, including Recipes, Ingredients, Object Motion, Nutrition, and 3D Perception. We focus on clips ≥ 6.4 minutes, where a limited native-frame budget leaves enough temporal sparsity that targeted frame selection may recover evidence missed by uniform video sampling.

II. Related Work

Prior work on long-video understanding uses evidence selection exclusively in a replacement setting, where selected frames or segments become the model's only visual input. Our work differs in two key ways. First, we study both augmentation (native video + auxiliary frames) and replacement as complementary conditions, and find that replacement preserves more routing opportunity. Second, our final goal is a question-type-conditioned router, introduced only if the ablation reveals systematic tool-specific gains.

Query-guided frame selection

GroundVQA (Di & Xie, CVPR 2024) performs temporal grounding before QA on long egocentric videos, localising the relevant segment before passing it to the model; this is a replacement approach that discards global context. Q-Frame (Zhang et al., ICCV 2025) uses CLIP text-image similarity to dynamically select the most relevant frames for a given question, with multi-resolution adaptation. M-LLM Frame Selection (Hu et al., CVPR 2025) trains a lightweight multimodal selector to score frames before passing the best ones to a frozen downstream VLM. All three operate in replacement mode and apply a single fixed selection mechanism rather than routing between strategies by question type.

Tool-based retrieval and agents

VideoAgent (Wang et al., ECCV 2024) uses an LLM as an agent that iteratively calls retrieval tools to assemble evidence, reasoning only over tool outputs without ever passing the full video to the VLM. Video-RAG (Luo et al., NeurIPS 2025) uses open-source tools (OCR, ASR, object detection) to extract text-based auxiliary information alongside video frames. Neither considers question-type-conditioned routing.

Egocentric-specific works

MFAS (Zhang et al., ICML 2024) performs adaptive patch-level selection for egocentric VQA, showing that zooming on local details improves recognition of small objects, directly motivating our OCR Crop tool. EgoTextVQA (Zhou et al., CVPR 2025) shows that even Gemini 1.5 Pro reaches only 33% accuracy on scene-text questions in egocentric video, motivating the use of specialized OCR evidence.

III. Method

Pipeline overview

For each question-video pair from HD-EPIC, we sample a candidate frame pool at 1 FPS from the clip window. An auxiliary evidence tool selects k=8 frames from this pool (our default budget; we ablate over k ∈ {8, 16, 32} in Exp. 3). The frozen VLM (Qwen3-VL-2B) then receives either: (a) the native video + the k auxiliary frames (augment mode), or (b) only the k auxiliary frames (replace mode), along with the multiple-choice question.

[Pipeline diagram: Input (HD-EPIC video, clip ≥ 6.4 min) → Candidate frame pool (1 FPS sampling) → Evidence tool (Uniform, CLIP (ViT-B/32), Motion+CLIP, OCR Crop, Object Tracking, or Uniform+CLIP; k = 8) → Mode (Augment: native video + aux frames; Replace: aux frames only) → Frozen VLM (Qwen3-VL-2B + MCQ prompt) → Answer (MCQ, A–E)]
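The difference between the two modes can be illustrated with a minimal sketch. It simplifies the native video to a list of frames and places the auxiliary frames after it; `build_visual_input` is a hypothetical helper, not the actual interface of our pipeline.

```python
def build_visual_input(native_frames, aux_frames, mode):
    """Return the ordered frame list handed to the VLM for one question.

    Simplification: the native video is represented as a plain frame list.
    """
    if mode == "augment":
        # Native video first, auxiliary frames appended after it.
        return list(native_frames) + list(aux_frames)
    if mode == "replace":
        # The auxiliary frames are the only visual evidence.
        return list(aux_frames)
    raise ValueError(f"unknown mode: {mode!r}")


native = [f"native_{i}" for i in range(64)]  # capped native sampling
aux = [f"aux_{i}" for i in range(8)]         # k = 8 tool-selected frames

print(len(build_visual_input(native, aux, "augment")))  # 72
print(len(build_visual_input(native, aux, "replace")))  # 8
```

With a full 64-frame native sample and k=8, augment mode passes 72 frames against 8 in replace mode, which is where the roughly 9× frame saving comes from.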

Auxiliary evidence tools

Uniform

Selects k frames at evenly-spaced indices from the candidate pool. Provides temporal coverage with no question-specific bias. Used as the temporal baseline.
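One simple way to realise evenly spaced selection is to take the midpoint of each of k equal-width segments of the pool. Whether the tool uses segment midpoints or includes the endpoints is not specified above, so the midpoint rule below is an assumption.

```python
def uniform_indices(pool_size: int, k: int = 8) -> list[int]:
    """Pick k evenly spaced frame indices from a pool of `pool_size` frames.

    Assumption: each index is the midpoint of one of k equal-width segments.
    """
    if pool_size <= 0:
        return []
    if pool_size <= k:
        return list(range(pool_size))  # pool smaller than budget: keep everything
    return [int((i + 0.5) * pool_size / k) for i in range(k)]


print(uniform_indices(384))  # [24, 72, 120, 168, 216, 264, 312, 360]
```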

CLIP Retrieval (ViT-B/32)

Ranks candidate frames by text-image cosine similarity to the question using OpenAI CLIP (openai/clip-vit-base-patch32). Uses hierarchical score-based selection to ensure temporal diversity within the top-k.
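The sketch below shows similarity ranking with a temporal-diversity constraint, operating on precomputed embeddings. The diversity rule used here (best-scoring frame in each of k temporal bins) is a stand-in assumption: the hierarchical selection scheme is not detailed above, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_diverse_top_k(frame_embs, text_emb, k=8):
    """Pick the frame most similar to the question in each of k temporal bins.

    Assumption: binning is one possible way to enforce temporal diversity.
    """
    scores = [cosine(f, text_emb) for f in frame_embs]
    n = len(scores)
    if n <= k:
        return list(range(n))
    picked = []
    for b in range(k):
        lo, hi = b * n // k, (b + 1) * n // k  # bin b covers indices [lo, hi)
        picked.append(max(range(lo, hi), key=scores.__getitem__))
    return picked


# Toy 2-D embeddings; the question embedding points along the first axis.
frames = [[0, 1], [1, 0], [1, 1], [0, 1]]
print(select_diverse_top_k(frames, [1, 0], k=2))  # [1, 2]
```

Pure top-k ranking would be free to return neighbouring frames from a single moment; binning trades some similarity for coverage of the clip.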

Motion + CLIP (cascade)

Two-stage cascade: motion detection (L1 pixel diff) over-selects 3× budget at high-change moments, then CLIP refines to the most question-relevant k frames.
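A minimal sketch of the two stages, assuming frames are given as flat lists of pixel intensities and that CLIP similarity scores have already been computed per frame. The function names and the choice to return indices in temporal order are illustrative assumptions.

```python
def l1_motion(frames):
    """Mean absolute pixel difference between each frame and its predecessor."""
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return scores


def motion_then_clip(frames, clip_scores, k=8, oversample=3):
    """Stage 1: shortlist oversample*k high-motion frames. Stage 2: keep the k most relevant."""
    motion = l1_motion(frames)
    by_motion = sorted(range(len(frames)), key=lambda i: motion[i], reverse=True)
    shortlist = by_motion[: oversample * k]
    best = sorted(shortlist, key=lambda i: clip_scores[i], reverse=True)[:k]
    return sorted(best)  # temporal order


# Toy clip: the scene changes at frames 2 and 4.
frames = [[0, 0], [0, 0], [10, 10], [10, 10], [0, 0], [0, 0]]
clip_scores = [0.9, 0.1, 0.2, 0.1, 0.8, 0.1]
print(motion_then_clip(frames, clip_scores, k=1, oversample=2))  # [4]
```

In the toy example frame 0 has the highest relevance score but is never considered, because the motion stage filters it out first. This is the inductive bias of the cascade: relevance is only judged among high-change moments.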

OCR Crop

EasyOCR ranks frames by detected-text confidence; saliency-based crop refinement then extracts high-resolution patches (≥384 px) around text regions and salient areas.
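The ranking and cropping steps can be sketched as follows, operating on detections that have already been extracted. Two assumptions are made: a frame is scored by its single most confident text detection, and "≥384 px" is read as a minimum side length for the crop. The saliency refinement is omitted, and both function names are hypothetical.

```python
def rank_by_text_confidence(detections_per_frame, k=8):
    """detections_per_frame: one list of (box, text, confidence) triples per frame.

    Assumption: a frame's score is its most confident text detection.
    """
    def frame_score(dets):
        return max((conf for _, _, conf in dets), default=0.0)

    order = sorted(range(len(detections_per_frame)),
                   key=lambda i: frame_score(detections_per_frame[i]),
                   reverse=True)
    return order[:k]


def expand_crop(box, image_w, image_h, min_side=384):
    """Grow an (x0, y0, x1, y1) text box to at least min_side px, clamped to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w = min(max(x1 - x0, min_side), image_w)
    h = min(max(y1 - y0, min_side), image_h)
    left = min(max(cx - w / 2, 0), image_w - w)
    top = min(max(cy - h / 2, 0), image_h - h)
    return (int(left), int(top), int(left + w), int(top + h))


dets = [[], [(None, "98 g", 0.9)], [(None, "a", 0.4), (None, "b", 0.6)]]
print(rank_by_text_confidence(dets, k=2))               # [1, 2]
print(expand_crop((1000, 500, 1100, 540), 1920, 1080))  # (858, 328, 1242, 712)
```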

Object Tracking (GroundingDINO-tiny)

Zero-shot object detection using IDEA-Research/grounding-dino-tiny. Extracts the object name from the question, scores frames by max detection confidence, and selects the k most salient frames.

Uniform + CLIP (cascade)

Combines uniform temporal coverage (stage 1) with CLIP semantic retrieval (stage 2). Ensures both temporal spread and question relevance within the k-frame budget.

Evaluation protocol

We evaluate on HD-EPIC questions filtered to clips ≥ 6.4 minutes (min_clip_duration_s=384), excluding TIME-tagged and multi-video questions. This yields 1,035 questions across 10 categories. For augment/replace head-to-head comparisons we further restrict to the fair intersection of 719 questions where all 6 tools produce a result in both modes. Each condition is run in both augment (native video + k aux frames) and replace (k aux frames only) mode. The metric is 5-choice multiple-choice accuracy.
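As a sketch of how the fair intersection and the accuracy metric fit together, assume per-question correctness is stored per (tool, mode) condition. The data layout and function names are illustrative.

```python
def fair_intersection(results):
    """results: {(tool, mode): {question_id: is_correct}}.

    Returns the question ids for which every condition produced a result.
    """
    id_sets = [set(per_question) for per_question in results.values()]
    return set.intersection(*id_sets) if id_sets else set()


def accuracy(per_question, ids):
    """Fraction of the given question ids answered correctly."""
    return sum(per_question[q] for q in ids) / len(ids)


# Toy example with three conditions and partially missing results.
results = {
    ("clip", "replace"): {1: True, 2: False, 3: True},
    ("clip", "augment"): {1: True, 2: True},
    ("uniform", "replace"): {1: False, 2: True, 4: True},
}
fair = fair_intersection(results)
print(sorted(fair))                                   # [1, 2]
print(accuracy(results[("clip", "replace")], fair))   # 0.5
```

Restricting every condition to the same question ids is what makes the augment/replace comparison head-to-head: a tool cannot look better simply because it failed to produce a result on hard questions.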

Routing

In Stage 2, we test 8 routing strategies in replace mode (k=8): (1) category-level LOOCV, (2–3) TF-IDF bigrams with LogReg/kNN, (4) a rule-based keyword router (regex over question text), (5) tool-agreement ensemble majority vote, (6) CLIP text embeddings (ViT-B/32) with LogReg/kNN, (7) 11 visual features per question (motion energy, motion std, brightness, contrast, colorfulness, edge density, clip length, CLIP text-image similarity mean/max/std, OCR density) with LogReg/kNN, and (8) combined text+visual features. Methods 1–6 are evaluated with leave-one-category-out cross-validation (LOOCV) on all 1,035 questions and, where applicable, with 80/20 splits on the 700 helpful-category questions. Visual methods require cached frames, so Methods 7–8 use the visual subset (623 questions for LOOCV; 427 helpful-category questions for 80/20). Each learned method is tested with both Logistic Regression and k-NN (k=5).
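To make method (4) concrete, a rule-based keyword router can be as small as an ordered list of regex patterns mapped to tools. The three patterns, their tool assignments, and the fallback below are hypothetical examples written for illustration; they are not the rules used in our experiments.

```python
import re

# Hypothetical rules: (pattern over the question text, tool to use).
# The first matching rule wins.
RULES = [
    (re.compile(r"\b(weigh|grams?|how much)\b", re.I), "ocr_crop"),
    (re.compile(r"\bwhen did\b|\bstep\b", re.I), "clip"),
    (re.compile(r"\bhow many times\b|\bmove[sd]?\b", re.I), "motion_then_clip"),
]
DEFAULT_TOOL = "uniform"  # hypothetical fallback when no rule matches


def route(question: str) -> str:
    """Return the evidence tool chosen for a question."""
    for pattern, tool in RULES:
        if pattern.search(question):
            return tool
    return DEFAULT_TOOL


print(route("How much did the participant weigh of brown onions in this video?"))  # ocr_crop
print(route("When did the participant perform step 3 of the recipe?"))             # clip
print(route("What colour is the bowl?"))                                           # uniform
```

Such a router needs no training data and no cached frames, which is why it can be evaluated on all questions.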

IV. Experiments & Results

We run four experiments on HD-EPIC, each addressing one of our three research questions. The frozen VLM is Qwen3-VL-2B-Instruct throughout. All experiments use a default budget of k=8 auxiliary frames drawn from a 1 FPS candidate pool. The evaluation set covers 1,035 questions from 10 long-video categories (clips ≥ 6.4 min). For head-to-head augment/replace comparisons we use the fair intersection of 719 questions where all 6 tools are available in both modes.

Two reference points appear throughout: random chance = 20% (5-choice MCQ); and the perfect oracle, a theoretical upper bound computed by assigning each question a tool that answers it correctly whenever at least one of the six tools does. It cannot be deployed in practice but measures how much a router could gain if it made no mistakes.

Qualitative examples

Each panel shows k=8 frames selected by two strategies for the same question.

✓ CLIP wins recipe_step: temporal step localization
Question: When did the participant perform step "Add couscous and boiling water to pot (no flame needed), leave until water has been absorbed" from recipe Fish Cakes and Vegetables?
Multiple-choice timestamps: CLIP retrieves frames semantically matching the pouring/mixing action.
[Figure: CLIP vs. Uniform frame grids for the recipe step question (image missing)]

Observation. CLIP retrieves frames that are semantically close to the question text, such as the pot, couscous, and pouring action, concentrating evidence around the relevant step. Uniform spreads k=8 frames evenly over the full clip (often 20+ minutes), so the brief step may fall between sampled timestamps entirely. This illustrates the within-category diversity: for recipe_step questions, which require temporal localization, CLIP's text-image alignment is the right inductive bias.

✓ OCR Crop wins ingredient_ingredient_weight: scale reading
Question: How much did the participant weigh of brown onions in this video?
A: 113 g  ·  B: 98 g ✓  ·  C: 117 g  ·  D: 101 g  ·  E: 94 g
[Figure: all 5 strategies for the ingredient weight question (image missing)]

All 5 tools compared. CLIP, Motion+CLIP, and Object Tracking have no inductive bias toward text displays. Their selected frames look virtually identical to Uniform (wide kitchen shots, no scale visible). Only OCR Crop consistently zooms in on the digital readout.

Focused view: OCR Crop vs Uniform
[Figure: OCR Crop vs. Uniform focused comparison (image missing)]

Observation. OCR Crop consistently retrieves frames where the digital scale is visible and readable (96, 98 g), providing a high-resolution crop of the display. Uniform sampling returns wide kitchen scenes, meaning the scale may not even be in frame at the sampled timestamps. For a numeric-readout question, the evidence quality gap is decisive.

Exp. 1
Does the choice of mode (augment vs. replace) matter?
We take the same 6 tools and same k=8 budget and evaluate them in two modes: replace (VLM sees only the k aux frames) and augment (VLM sees native video + k aux frames). The native-only baseline qwen_native serves as the reference. Evaluated on 719 fair-intersection questions.

Augment vs. replace: headline (k=8, n=719)

Figure 1. On the 719-question fair intersection, best-tool accuracy is near-equal across modes (augment 33.7% vs. replace 32.5%), but the oracle gap is nearly 3× larger in replace mode (+17.4 pp vs. +6.1 pp). Dashed line = qwen_native baseline (32.3%).

The two modes are nearly tied on best-tool accuracy (33.7% augment vs. 32.5% replace, +1.1 pp). The oracle bars tell a different story on the same 719 questions: replace preserves far more per-question routing opportunity (49.9% oracle vs. 39.8%). Adding the native video compresses the oracle gap by ~65%, because all tools converge once the VLM already has global context.

Finding: Replace mode is preferred for routing experiments: best-tool accuracy is equivalent (+1.1 pp), but it preserves a 3× larger per-question oracle gap (+17.4 pp vs. +6.1 pp) at ~9× fewer frames, making it the more informative and cost-efficient setting.
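The headline numbers for the two modes follow directly from the reported accuracies, as the short check below shows. It only uses the figures quoted above; `oracle_gap` is an illustrative helper.

```python
def oracle_gap(best_tool_acc: float, oracle_acc: float) -> float:
    """Headroom between the best fixed tool and the per-question oracle, in pp."""
    return oracle_acc - best_tool_acc


replace_gap = oracle_gap(32.5, 49.9)  # replace mode, n=719
augment_gap = oracle_gap(33.7, 39.8)  # augment mode, n=719

ratio = replace_gap / augment_gap            # how much larger the replace gap is
compression = 1 - augment_gap / replace_gap  # share of the gap lost in augment mode

print(f"{replace_gap:.1f} {augment_gap:.1f} {ratio:.2f} {compression:.0%}")
# 17.4 6.1 2.85 65%
```

The "3×" figure is therefore a rounding of about 2.85×, and the ~65% compression is the same comparison expressed as a fraction of the replace-mode gap.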

Routing opportunity decomposition (replace, k=8, full n=1,035)

[Stacked bar: best single tool 32.0% → rule-based router recovers +3.9 pp → 35.9% → within-category gap +16.0 pp (unsolved) → oracle 51.9%]

Figure 2. Routing opportunity decomposition on the full replace-mode routing set (k=8, n=1,035). This is larger than the 719-question fair intersection used in Figure 1. The +19.9 pp total gap splits into +3.9 pp between categories (fully recovered by the rule-based router) and +16.0 pp within categories (unsolved by any method).

This larger replace-mode routing gap is interesting because replace uses far fewer frames than augment. One possible explanation is that auxiliary frames in augment mode are appended after the native video, which may make them harder for the model to integrate or may even distract it from the original temporal context. In replace mode, the model is forced to rely on the selected evidence frames, so each tool's inductive bias remains clearer and the difference between tools is easier to exploit.

On the full replace-mode set, the total routing potential (+19.9 pp) can be decomposed into two distinct gaps. The between-category gap (+3.9 pp) exists because the best tool differs across question categories. A router that simply knows the category and assigns its historically best tool recovers this gap entirely. The within-category gap (+16.0 pp) is harder: even within a single category, different questions benefit from different tools, and no method we tested can exploit this signal. This decomposition explains why Exp. 4 focuses on category-level routing first.
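The decomposition can be sketched from a per-question correctness table. The version below picks each category's best tool in-sample, which is an optimistic simplification of the cross-validated category router; the record layout and function name are assumptions.

```python
from collections import defaultdict


def decompose(records, tools):
    """records: list of dicts with a 'category' key and one bool per tool name."""
    n = len(records)
    acc = {t: sum(r[t] for r in records) / n for t in tools}
    best_single = max(acc.values())

    by_cat = defaultdict(list)
    for r in records:
        by_cat[r["category"]].append(r)
    # Category router: every category uses its own best tool (chosen in-sample).
    cat_hits = sum(max(sum(r[t] for r in rows) for t in tools)
                   for rows in by_cat.values())
    category_router = cat_hits / n

    # Oracle: a question counts as solved if any tool answers it correctly.
    oracle = sum(any(r[t] for t in tools) for r in records) / n
    return {
        "best_single": best_single,
        "between_category_gap": category_router - best_single,
        "within_category_gap": oracle - category_router,
    }


records = [
    {"category": "a", "x": True, "y": False},
    {"category": "a", "x": True, "y": False},
    {"category": "b", "x": False, "y": True},
    {"category": "b", "x": True, "y": False},
    {"category": "b", "x": False, "y": True},
]
print(decompose(records, ["x", "y"]))
```

In the toy data, tool x is the best single tool (60%), switching to y for category b lifts accuracy to 80%, and the last 20 points are only reachable by choosing per question. The two gaps always sum to the oracle headroom over the best single tool.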

Exp. 2
Which evidence tool works best, and does it vary by question category?
We compare six tools: uniform, clip, motion_then_clip, ocr_crop, object_tracking, and uniform+clip, against the native baseline on the 719-question fair intersection, breaking results down by overall accuracy and by each of the 10 HD-EPIC categories.

Per-tool breakdown (k=8, n=719 fair intersection)

Figure 3. Per-tool accuracy in augment (dark) vs. replace (grey) modes (k=8, n=719). OCR Crop is the only tool with a large augment advantage (+3.5 pp). CLIP, Motion+CLIP, and Uniform+CLIP score higher in replace mode, showing augment is not uniformly better. Dashed red line = qwen_native (32.3%).

Augment does not uniformly outperform replace. CLIP, Motion+CLIP, and Uniform+CLIP each score higher in replace mode. OCR Crop is the only tool with a large augment advantage (+3.5 pp), because text/detail crops are complementary to the native video rather than redundant with it.

Per-category analysis (augment, k=8, n=719)

| Category | n | qwen_native | Best aug. tool | Aug. best | Δ vs native | Best rep. tool | Rep. best |
|---|---|---|---|---|---|---|---|
| obj. motion count | 200 | 44.5% | ocr_crop | 49.0% | +4.5 pp | uniform | 50.0% |
| obj. motion itin. | 134 | 11.2% | motion_then_clip | 14.2% | +3.0 pp | ocr_crop | 17.9% |
| 3d perc. fixture | 86 | 25.6% | object_tracking | 26.7% | +1.2 pp | ocr_crop | 32.6% |
| recipe multi-step | 50 | 32.0% | uniform+clip | 34.0% | +2.0 pp | clip | 30.0% |
| recipe multi-recog. | 50 | 56.0% | clip | 58.0% | +2.0 pp | uniform+clip | 54.0% |
| ingredient order | 50 | 34.0% | motion_then_clip | 36.0% | +2.0 pp | motion_then_clip | 32.0% |
| ingredient exact | 50 | 26.0% | ocr_crop | 28.0% | +2.0 pp | ocr_crop | 16.0% |
| nutrition | 35 | 54.3% | uniform | 65.7% | +11.4 pp | motion_then_clip | 57.1% |
| fine-grained act. | 31 | 25.8% | uniform | 25.8% | ±0.0 pp | object_tracking | 35.5% |
| recipe step | 33 | 15.2% | uniform | 12.1% | −3.0 pp | uniform+clip | 24.2% |

Table 2. Per-category breakdown (augment mode, k=8, n=719). For each category: native baseline, best augment tool with accuracy and Δ vs. native, and best replace tool for comparison.

In 8 of 10 categories, the per-category best augment tool beats qwen_native (1 tied, 1 below). Replace mode tends to produce higher best-tool scores in categories with high oracle headroom (e.g. fine-grained, recipe step), while augment dominates in nutrition and ingredient exact.
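The tally above, and the consistency of the per-category rows with the overall native baseline, can be reproduced from the values in Table 2. The snippet only re-enters the reported numbers; variable names are illustrative.

```python
# (category, n, qwen_native %, best augment tool %), copied from Table 2.
ROWS = [
    ("obj. motion count", 200, 44.5, 49.0),
    ("obj. motion itin.", 134, 11.2, 14.2),
    ("3d perc. fixture", 86, 25.6, 26.7),
    ("recipe multi-step", 50, 32.0, 34.0),
    ("recipe multi-recog.", 50, 56.0, 58.0),
    ("ingredient order", 50, 34.0, 36.0),
    ("ingredient exact", 50, 26.0, 28.0),
    ("nutrition", 35, 54.3, 65.7),
    ("fine-grained act.", 31, 25.8, 25.8),
    ("recipe step", 33, 15.2, 12.1),
]

wins = sum(aug > native for _, _, native, aug in ROWS)
ties = sum(aug == native for _, _, native, aug in ROWS)
losses = sum(aug < native for _, _, native, aug in ROWS)

total_n = sum(n for _, n, _, _ in ROWS)
weighted_native = sum(n * native for _, n, native, _ in ROWS) / total_n

print(wins, ties, losses)                   # 8 1 1
print(total_n, round(weighted_native, 1))   # 719 32.3
```

The category sizes sum to the 719-question fair intersection, and the size-weighted native accuracy recovers the 32.3% overall baseline.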

Finding: No single tool dominates. All six tools win at least one category across augment/replace views. OCR Crop is the strongest globally (33.7% augment), driven by large gains in text-heavy and counting tasks. The best tool is a function of the question type, which motivates Exp. 4.

Figure 4. Heatmap: augment mode accuracy per tool per category (k=8, n=719). Color scale is normalized per row (white = row minimum, red = row maximum). ★ marks the best tool per category. All six tools win at least one augment category, confirming that no single evidence strategy is universally optimal.

Reading the heatmap by row: rows where all cells have similar brightness (e.g. recipe recog.) indicate that all tools perform equally, so no routing gain is possible. Rows with high contrast (e.g. nutrition, ingredient exact) signal that the right tool matters a lot. The ★ column shifts across rows, with no fixed winner across categories.
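The per-row colour scale corresponds to a min-max normalisation of each category's accuracies. A minimal sketch, assuming a flat row (all tools equal) is mapped to the minimum colour; the function name is illustrative.

```python
def normalize_rows(matrix):
    """Min-max normalise each row to [0, 1]: 0 = row minimum, 1 = row maximum."""
    out = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        # A flat row carries no contrast, so every cell maps to 0.
        out.append([(v - lo) / span if span else 0.0 for v in row])
    return out


print(normalize_rows([[10, 20, 30], [5, 5, 5]]))
# [[0.0, 0.5, 1.0], [0.0, 0.0, 0.0]]
```

Because the scale is relative to each row, a saturated cell indicates the best tool for that category, not a high absolute accuracy; colours should not be compared across rows.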

Exp. 3
Does a larger frame budget (k=16, k=32) improve accuracy?
We re-run the ablation at k=8, k=16, and k=32 on categories with budget-ablation data (ingredient exact, obj. motion count, recipe multi-step; fine-grained has k=8/k=32 only). Results below focus on obj. motion count (n=200) in augment mode, where the resolution collapse is most dramatic; the other categories show no comparable collapse.

Budget sensitivity: augment mode (obj. motion count, n=200, shown as representative)

Figure 5. Accuracy vs. frame budget in augment mode (obj. motion count, n=200). CLIP, Motion+CLIP, Object Tracking, and Uniform+CLIP collapse at k=32 due to resolution compression when 32 aux frames compete with the native video for context budget. OCR Crop is the most robust (−4.5 pp from k=8 to k=32); Uniform declines moderately (−11 pp).

At k=32 in augment mode, CLIP-based and tracking tools collapse catastrophically (−25 to −35 pp). When 32 aux frames are added alongside the native video, Qwen3-VL must fit all frames into a fixed context window. Each frame is allocated fewer tokens, reducing its effective resolution. Tools that rely on fine-grained image features (CLIP similarity, GroundingDINO detections) degrade severely at low resolution. This collapse does not appear in replace mode, where no native video competes for the context budget.
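The resolution argument can be made concrete with a back-of-the-envelope sketch. It assumes a fixed visual token budget split evenly across all frames; the 16,384-token budget is a hypothetical value chosen for illustration, and the actual allocation performed by the Qwen3-VL processor may differ.

```python
def tokens_per_frame(k_aux: int, native_frames: int = 64,
                     visual_token_budget: int = 16384, augment: bool = True) -> int:
    """Tokens available per frame under an even split of a fixed budget.

    Assumptions: the budget value is hypothetical and the even split is a
    simplification of how the model allocates resolution.
    """
    total_frames = k_aux + (native_frames if augment else 0)
    return visual_token_budget // total_frames


for k in (8, 16, 32):
    print(k, tokens_per_frame(k), tokens_per_frame(k, augment=False))
# 8 227 2048
# 16 204 1024
# 32 170 512
```

Under these assumptions, each frame in augment mode at k=32 gets about a third of the tokens available per frame in replace mode at the same k. The sketch illustrates only the direction of the effect, not its measured size.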

Finding: More frames are not always better. In augment mode, k=32 causes a catastrophic resolution collapse for CLIP/tracking tools (−35 pp for both Motion+CLIP and Object Tracking). OCR Crop is uniquely robust because it selects high-resolution crops independently of image embedding quality. k=8 is the safest default budget: larger budgets can help in some categories, but they can also hurt badly when auxiliary frames compete with the native video for context.

Exp. 4
Can a router predict the best tool per question?
Since Exp. 2 shows the best tool varies by category, we test 8 routing strategies: a rule-based keyword router, text classifiers (TF-IDF, CLIP embeddings), an ensemble, visual feature classifiers, and a combined text+visual approach. All methods use replace mode at k=8, but the matched baseline depends on the evaluation subset: Methods 1–6 use all-category LOOCV (n=1,035) or helpful-category 80/20 splits (n=700), while visual methods use the cached-frame subset (n=623 LOOCV; n=427 helpful 80/20).

Routing methods: overview

| Input signal | Method | Features → classifier | Training |
|---|---|---|---|
| Category stats | M1 Category LOOCV | per-cat best tool | none |
| Category stats | M5 Ensemble majority vote | majority answer A–E | none |
| Question text | M4 ★ Rule-based router | "exact qty" · 10 regex rules | none ✓ |
| Question text | M2/3 TF-IDF | 500-d sparse → LogReg / kNN | LOOCV |
| Question text | M6 CLIP text embeddings | 512-d dense → LogReg / kNN | LOOCV |
| Video frames † | M7 Visual features | 11 descriptors → LogReg / kNN | LOOCV |
| Video frames † | M8 Text + Visual | 512+11 = 523-d → LogReg / kNN | LOOCV |

Figure 6. The 8 routing methods grouped by input signal. Each row shows the feature representation (bar charts = feature vectors, sparse = TF-IDF, dense = CLIP, coloured bars = 11 visual descriptors) and the classifier used. M4 (red, ★) marks the only method that beats its matched baseline; M1, M4, and M5 require no learned classifier. Tool colour dots in M1 indicate per-category tool assignment. † Visual methods require cached candidate frames (~60% of questions).

Routing results (replace mode, k=8)

Figure 7. LOOCV routing results in replace mode. Methods without † use all categories (n=1,035; best single tool = 32.0%; oracle = 51.9%). Visual methods require cached frames and use the visual subset (n=623; matched best single tool = 33.7%; oracle = 52.5%). Only the rule-based router beats its matched best-single-tool baseline.

Matched-baseline checks beyond LOOCV

Helpful categories 80/20 (n=700)

Matched baseline: motion_then_clip at 38.3%.

Rule-based routing reaches 42.7% (+4.4 pp).

TF-IDF and CLIP-text classifiers cluster at 35.7%, so learned text routing remains below the fixed-tool baseline even in the easier within-distribution split.

Visual + helpful 80/20 (n=427)

Matched baseline: motion_then_clip at 41.9%.

Visual and text+visual routers reach at most 39.6% (−2.4 pp).

The bonus text-only kNN run comes closest at 41.7% ± 4.1, but still does not exceed its matched baseline.

Results across 8 routing methods: (1) The rule-based router reaches 35.9% LOOCV and 42.7% on helpful-category 80/20. (2) TF-IDF bigrams and CLIP text classifiers both cluster at ~31–32% LOOCV. (3) On the cached-frame subset, visual feature classifiers peak at 31.0% LOOCV against a matched baseline of 33.7%. (4) Combined text+visual features perform no better than visual alone. (5) The text-only kNN run reaches 41.7% on the visual+helpful 80/20 subset, the strongest learned result but 0.3 pp below its matched baseline.

Finding: Routing is hard. Only the hand-crafted rule-based router beats the best fixed tool (+3.9 pp LOOCV; +4.4 pp on helpful-category 80/20). The learned classifiers and non-rule baselines all fall below their matched single-tool baselines. The gap decomposition from Exp. 1 explains why: the between-category gap (+3.9 pp) is recoverable with domain knowledge; the within-category gap (+16.0 pp) requires per-question signals that neither text nor cheap visual features can provide.

V. Conclusion and Limitations

Our main answer is nuanced. Targeted auxiliary evidence can help egocentric video QA, but not as a universal add-on to the native video input. Under our resource-bounded setup, the best augment tool improves only slightly over the native-video baseline (33.7% vs. 32.3%), so simply appending selected frames is not enough to produce a large overall gain. The stronger signal is conditional: different question types benefit from different evidence tools, especially for cues such as scale readings, ingredient labels, object motion, and temporal localization.

This leads to two conclusions about routing. First, augment and replace modes are nearly tied on best-tool accuracy on the 719-question fair intersection (+1.1 pp), but replace mode preserves a much larger routing opportunity on that matched subset (+17.4 pp oracle gap vs. +6.1 pp), making it the better setting for studying tool choice. Second, routing is useful only when it captures real question-type structure: on the full 1,035-question replace-mode routing set, a hand-crafted rule-based router is the only method that beats the best fixed tool (+3.9 pp LOOCV; +4.4 pp on helpful-category 80/20), while learned text, visual, and combined classifiers all fall below their matched single-tool baselines. The remaining +16 pp within-category oracle gap shows that the main unsolved challenge is per-question routing, which likely requires stronger visual or model-confidence signals beyond cheap first-order descriptors.

Limitations & Future Work

Sample size

Small per-category samples (as few as n=31) mean results lack statistical confidence for some categories.

Future work: Evaluating on the full HD-EPIC set (all clip lengths) would increase per-category n and yield more reliable category-level comparisons.

Within-category routing

The +16 pp within-category oracle gap is unsolved. Cheap visual features are insufficient; routing at the question level requires stronger signals.

Future work: Model confidence scores, VLM attention maps, or fine-grained embeddings (e.g. CLIP ViT-L/14) could provide the per-question discriminability that first-order descriptors lack.

Single domain

All experiments are conducted on long HD-EPIC kitchen videos. A short/medium-video check showed a smaller routing gap (+6.6 pp vs. +19.9 pp), so our conclusions are strongest for long-video QA.

Future work: Repeating the study on Ego4D or EPIC-Kitchens would test whether the between-category gap and tool rankings hold across different activity domains and clip lengths.

Cheap tools

Evidence tools use lightweight models (CLIP ViT-B/32, GroundingDINO-tiny, EasyOCR). Stronger backbones could change the relative tool ranking and routing conclusions.

Future work: Replacing CLIP ViT-B/32 with ViT-L/14 or using a larger OCR model (e.g. PaddleOCR) would test whether tool quality is the binding constraint.