Tinted Frames: Question Framing Blinds Vision-Language Models

Published in arXiv preprint, 2026

Project Website

Vision-language models (VLMs) often underutilize visual inputs, relying instead on language priors. This work shows that such blindness is not static but conditional: models selectively attend to visual information depending on how a question is framed. Even when different formulations require identical visual reasoning, constrained formats such as yes/no or multiple-choice reduce attention to relevant image regions, increase focus on uninformative tokens, and degrade accuracy.

We quantify this effect through cross-framing inconsistency, demonstrating that models frequently fail to preserve correct answers when questions are reformulated. Using attention rollout, we show that framing alters both the magnitude and spatial allocation of visual attention, particularly in layers responsible for cross-modal interaction. These shifts causally impact performance: restoring attention patterns via intervention improves accuracy.
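For readers unfamiliar with attention rollout (Abnar & Zuidema, 2020), the following is a minimal sketch of how visual attention mass can be measured with it; this is an illustration, not the paper's implementation. It assumes `attentions` is a list of per-layer attention matrices of shape (seq_len, seq_len), already averaged over heads, and `image_token_mask` is a boolean array marking the visual-token positions.

```python
import numpy as np

def attention_rollout(attentions):
    """Propagate attention through layers, folding in residual connections."""
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_attn in attentions:
        # Account for the residual stream: half identity, half attention, renormalized rows.
        attn = 0.5 * layer_attn + 0.5 * np.eye(seq_len)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        rollout = attn @ rollout
    return rollout

def visual_attention_mass(attentions, image_token_mask, query_idx=-1):
    """Fraction of rolled-out attention the final (answer) position places on image tokens."""
    rollout = attention_rollout(attentions)
    return rollout[query_idx, image_token_mask].sum()
```

Comparing this quantity for the same question phrased open-ended versus as yes/no or multiple-choice is one way to surface the framing-dependent drop in visual attention described above.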

To address this, we propose a lightweight prompt-tuning method that learns a small set of tokens to realign attention in constrained settings toward the robust patterns observed in open-ended generation. This approach improves visual grounding, reduces inconsistency across framings, and yields consistent gains across multiple models and benchmarks without modifying model weights.
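As a rough illustration of the setup, the sketch below shows a standard soft-prompt module in which only a few learned token embeddings are trainable and the VLM's weights stay frozen. The module and the loss terms named in the comments (`task_loss`, `align_loss`, `lambda_align`) are illustrative assumptions, not the paper's exact objective or API.

```python
import torch
import torch.nn as nn

class AttentionRealignPrompt(nn.Module):
    """Learnable soft prompt prepended to a constrained-format question."""

    def __init__(self, num_tokens: int, hidden_dim: int):
        super().__init__()
        # The only trainable parameters: a small set of soft prompt embeddings.
        self.prompt = nn.Parameter(torch.randn(num_tokens, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the learned tokens to the frozen input embeddings of the prompt.
        batch = input_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Hypothetical training step: keep the answer loss while pulling the
# constrained-framing attention toward the open-ended reference pattern.
#   task_loss  = cross_entropy(logits, labels)
#   align_loss = mse_loss(constrained_attention, reference_attention)
#   loss       = task_loss + lambda_align * align_loss
```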

Recommended citation: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta. (2026). "Tinted Frames: Question Framing Blinds Vision-Language Models." arXiv preprint.
Download Paper