DatBench: Discriminative, Faithful, and Efficient Vision-Language Model Evaluations
NOTE: If you prefer a PDF, the arXiv version of this post can be found here.
Executive Summary
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, do not represent downstream use-cases, and saturate early as models improve; (ii) 'blindly-solvable' questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone.
Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize their fidelity and discriminability. We find that transformations such as converting MCQs to generative tasks reveal sharp capability drops of up to 35%. In addition, filtering blindly-solvable and mislabeled samples enhances the discriminative power of these evaluations, while simultaneously reducing their computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves a 13× average speedup (up to 50×) while closely matching the discriminative power of the original datasets. Our work provides a path towards evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
Hugging Face Datasets:
- DatBench: https://huggingface.co/datasets/DatologyAI/DatBench
- DatBench-Full: https://huggingface.co/datasets/DatologyAI/DatBench-Full
Code: https://github.com/datologyai/DatBench
1. Introduction
Empirical evaluation is the primary mechanism through which progress in foundation models is recognized, compared, and acted upon. As machine learning has shifted from narrow, task-specific systems to general-purpose vision-language models (VLMs) [20], benchmarks now play an outsized role: they define what counts as progress and directly shape how substantial computational and human resources are allocated. Evaluations are no longer a passive reporting tool but an active driver of research direction.
However, modern evaluation pipelines are increasingly misaligned with the behaviors they aim to measure. As model inputs span multiple modalities and outputs become increasingly generative and stochastic, benchmarks must better disentangle genuine capabilities from superficial heuristics and inherent variance [57]. While the evaluation of language-only models has received sustained methodological attention [54, 55], VLM evaluation remains comparatively under-examined.
Recent evidence suggests that this gap has become a serious liability. Existing VLM benchmarks suffer from pervasive data quality failures, including mislabeled or ambiguous examples, questions solvable without visual input, and heavy reliance on multiple-choice formats that are not representative of downstream use-cases and are vulnerable to spurious correlations [9, 21, 48, 14, 12]. These artifacts inflate reported accuracy, introduce a substantial noise floor, and reduce the signal-to-noise ratio of the evaluations. In such a regime, small improvements, often on the order of a few percent, are more plausibly explained by overfitting to benchmark idiosyncrasies than by real capability gains, rendering the research community vulnerable to hill-climbing on noise [58, 20].
At the same time, evaluation has become a major computational bottleneck. Running comprehensive VLM evaluation suites now consumes a nontrivial fraction of total development compute [60]. For example, during the development of OLMo3, nearly 20% of the total compute budget for the post-training phase was reportedly dedicated to evaluation alone [69]. This burden is amplified for VLMs by the dense visual token sequences required to represent high-resolution images and the extended reasoning traces at inference time, which can collectively exceed tens of thousands of tokens per example [38]. Detailed analyses indicate that much of this cost is spent evaluating samples that are either trivial, noisy, or weakly discriminative [3, 24].
In this work, we argue that the design of effective evaluation should be treated as a data curation problem. Rather than repeatedly constructing new benchmarks from scratch, we propose to systematically transform and filter evaluation data to maximize faithfulness, discriminative power, and efficiency. This perspective mirrors recent successes in training data curation, in which careful data transformation and selection has produced large gains in model quality and compute efficiency [19, 17, 52, 18, 62, 51, 73, 72, 79]. We show that the same principles apply, with similar impact, to evaluation.
Guided by this view, we define three desiderata for modern VLM evaluation datasets: (i) faithfulness: examples should genuinely require visual input and reflect intended downstream use cases; (ii) discriminability: examples should reliably separate stronger models from weaker ones; and (iii) efficiency: evaluation should maximize signal per unit of compute. These criteria expose four systematic failure modes in existing benchmarks and motivate targeted interventions (Section 3).
First, multiple-choice formats are both unfaithful and weakly discriminative in generative settings. Converting MCQs to open-ended generation reveals large hidden capability gaps. On AI2D, for example, average accuracy drops from 77.56% to 40.53%, with the strongest MCQ model losing nearly 35 points. When generative conversion is infeasible, circular evaluation [21] collapses chance baselines and exposes similar inflation effects.
Second, many VLM benchmarks can be solved without vision. By evaluating models with the image removed, we find that over 70% of samples in VQA-v2 can be answered correctly using language priors alone. Such examples fundamentally fail to measure multimodal reasoning.
Third, low-resolution inputs and inaccurate or ambiguous annotations introduce substantial noise. Using a multi-stage filtering pipeline, we discard up to 43.9% of samples in benchmarks such as MME-RealWorld (Autonomous Driving). In these instances, evaluation is confounded by factual labeling errors and indeterminable ground truths, where poor image quality renders the target objects unrecognizable even to a human observer, effectively precluding reliable performance assessment.
Fourth, existing evaluation suites are inefficient. By explicitly selecting items with high discriminative power across a diverse set of 1B–10B scale models, we achieve speedups of up to 50× (13× on average) while closely matching the discriminative power of full benchmarks using a small fraction of the data (Figure 1).
Applying these interventions, we introduce DatBench (Section 4), a curated suite of VLM evaluations designed to be faithful, discriminative, and compute-efficient. To construct it, we partition the large pool of existing datasets into nine fundamental VLM capabilities and release two resulting artifacts:
- DatBench, a high-efficiency subset for rapid iteration that provides a 13× speedup on average across all capabilities while increasing signal per sample.
- DatBench-Full, the full collection of high-quality samples remaining after excluding blind-solvable or objectively low-quality data.
Beyond efficiency, our work provides empirical insights across 27 state-of-the-art VLMs, revealing structural limitations that are invisible under conventional evaluation (Section 5). We show that inference-time scaling can actively degrade perceptual performance through an overthinking penalty, that current VLMs exhibit a sharp tension between high-level reasoning and low-level perception, and that language priors systematically mask true multimodal capability across popular benchmarks. Together, these resources and findings improve evaluation quality while dramatically reducing its cost, offering a path toward evaluation practices that keep pace with the rapid advancement of vision-language models.
2. Related Work
Faithful Evaluation. Recent research has identified significant issues with the validity of VLM benchmarks, prompting various mitigation strategies. To address inflated performance caused by high chance baselines in multiple-choice evaluations, several studies propose reformulating tasks into generative answer-matching settings [1] or employing circular evaluation techniques [21]. More broadly, prior work shows that ambiguous and hard-to-solve comparative prompts can systematically induce spurious preferences in models, meaning that the evaluation prompts themselves can become a hidden source of bias when they implicitly force a choice without sufficient grounding or context [56]. This further motivates interventions like circular evaluations and other option-robust MCQ protocols. Other efforts focus on statistical refinement of evaluation metrics. For instance, [3] apply Item Response Theory (IRT)-motivated weighting to account for item difficulty and discrimination beyond simple average accuracy.
Beyond these issues, multiple-choice formats are also misaligned with real-world VLM usage, where models are typically deployed in open-ended, generative settings rather than selecting from a small, predefined set of options. As a result, strong MCQ performance may overstate practical capability by rewarding option elimination or prompt-specific biases; MCQ-based evaluations systematically misrepresent model abilities by constraining outputs and failing to probe the generative behaviors that dominate real-world LLM and VLM deployment [6].
Additional analyses suggest that many VLMs can perform well on certain benchmarks without meaningfully leveraging visual input, calling into question whether such evaluations truly measure visual understanding or multimodal reasoning [28, 29, 31, 27, 30]. In contrast to approaches that seek to recover signal through post hoc statistical modeling, our method improves evaluation reliability at the source by enhancing data quality via systematic transformation and filtering of benchmark examples, building on both prior work and newly introduced techniques.
Efficient & Discriminative Evaluation. Efforts to improve the efficiency of model evaluation largely draw from (1) psychometric modeling, and (2) exploiting semantic structure in evaluation data. IRT-based methods [3, 24] model latent capability variables in order to estimate item difficulty and discrimination. In practice, however, these approaches typically require large, dense response matrices (many models evaluated on many items) to fit parameters stably. Without this scale, estimates can become highly sensitive to hyperparameter choices.
An alternative line of work leverages semantic structure. For example, [23] employ embedding-based clustering to select representative subsets, while Scales++ [22] relies on qualitative, rubric-driven segmentation of tasks. These approaches face notable limitations. Clustering outcomes are tightly coupled to the choice of embedding model, a significant concern given the lack of unified multimodal embeddings, while rubric-based methods are inherently labor-intensive and subjective.
More broadly, approaches that optimize solely for preserving model rankings suffer from an inherent limitation. As we show in Section 3.4, rank correlation saturates quickly and can often be achieved even by random subsets whose individual samples do not reliably discriminate between weak and strong models. Consequently, prioritizing rank stability risks overfitting to a fixed set of evaluated models without guaranteeing the quality of the underlying examples. Prior work [71] has also proposed aggregating heterogeneous evaluations via Plackett-Luce models, emphasizing ordinal rankings for their robustness to metric calibration issues. While this addresses the challenge of combining diverse measurements, it operates downstream of data quality: aggregating rankings over noisy or blind-solvable samples still propagates those artifacts into the final ordering.
In contrast to these approaches, we shift the focus from preserving global rankings to the targeted curation of individual samples. First, we systematically transform and filter evaluation data to resolve quality issues such as low resolution and labeling errors. Second, we employ a discriminative subset selection strategy that, unlike rank-preservation methods, identifies high-signal samples without requiring the large-scale model response matrices necessary for stable IRT parameter fitting.
3. The Making of DatBench
In this section, we present the methodology for DatBench, a framework designed to transform noisy, large-scale VLM evaluation suites into high-quality, discriminative benchmarks. Our approach systematically addresses four critical failures in current evaluation regimes: (1) signal dilution in Multiple Choice Questions (MCQs), (2) examples solvable without visual context, (3) incorrect, ambiguous, or low-resolution samples, and (4) prohibitively high computational costs. Collectively, the first three interventions enhance the faithfulness and discriminative power of the evaluation data, while the fourth ensures the resulting benchmark is both efficient and discriminative.
Datasets & Capabilities. We define our goal as establishing a faithful, discriminative, and efficient evaluation for nine distinct VLM capabilities (see Fig. 2):
- Chart Understanding: extracting quantitative data and performing trend analysis on bar charts, pie charts, line graphs, and infographics;
- Document Understanding: parsing structured layouts and extracting key information from digital or scanned documents, with a focus on text-heavy visual processing;
- Scene OCR: recognizing and interpreting textual information found in natural environments, such as storefront names, street signs, and product labels;
- Math & Logic: solving multimodal mathematical problems, including geometry, physics mechanics diagrams, and complex logical puzzles;
- Spatial Reasoning: assessing the relative positions of objects and demonstrating a directional and physical understanding of 3D space;
- Grounding: identifying and localizing specific regions or objects referred to in text through bounding boxes or segmentation-style tasks;
- Counting: accurately enumerating specific objects across varied environments and overlapping visual contexts;
- Diagrams & Tables: interpreting grade-school diagrams and structured tables to extract data points and infer underlying relationships;
- General: performing high-level Visual Question Answering (VQA) based on holistic image descriptions and real-world scene comprehension.
To achieve this, we source a diverse pool of evaluation sets for each capability and apply our methodology to address problems (1)–(4), transforming them into refined, high-quality benchmarks. Table 1 details the specific dataset composition and selection rationale used to ensure broad coverage of image distributions across each capability.
Models. We leverage a diverse suite of 27 state-of-the-art models to evaluate and refine our benchmarks. The model families and their corresponding parameter sizes used in this study include: (1) Qwen3-VL (2B, 4B, and 8B Instruct variants, as well as 2B, 4B, and 8B Thinking models); (2) Qwen2.5-VL (3B and 7B Instruct variants); (3) Qwen2.5-Omni (3B and 7B multimodal versions); (4) InternVL3.5 (2B, 4B, and 8B Instruct variants); (5) InternVL3 (2B and 9B Instruct variants); (6) InternVL2.5 (2B, 4B, and 8B variants); (7) InternVL2 (2B, 4B, and 8B variants); and (8) Thinking & Specialist Models, comprising GLM-4.1V-9B (Base and Thinking), R-4B, SmolVLM2-2.2B, Phi-3.5-vision, and Gemma-3-4B-it. Using these models as a broad empirical base allows us to ensure our data-centric improvements generalize beyond any single model family.
For all experiments detailed in this study, model generation was standardized with a maximum output length of 4,096 tokens and the sampling configuration suggested by the corresponding model card or code repository.
3.1. MCQ Evaluations: High Noise, Low Fidelity
Problem: Chance Baselines and The Evaluation-Deployment Gap. Standard MCQ formats systematically overestimate model capability through two primary mechanisms: random guessing and task misalignment. First, multiple-choice questions introduce a non-trivial chance baseline (1/N for N options), allowing models to achieve inflated scores that add significant noise to performance metrics. This inflation is compounded when evaluating across multiple stochastic samples or models; the probability of an item appearing “solved” by at least one of M uniform random guesses grows rapidly as 1 − (1 − 1/N)^M. Second, there is a fundamental mismatch between evaluation and deployment: while most VLMs are used in generative contexts, MCQs merely test the ability to pick a candidate from a pre-defined list. This “closed-set” evaluation fails to capture the generative reasoning required for real-world tasks and allows models to rely on superficial shortcuts or linguistic priors within the options themselves [1]. As shown in Figure 3a, this creates a “perceived capability” bubble in which models appear proficient in MCQ formats while failing to produce the same answers in a fully generative regime.
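To make this inflation concrete, the short sketch below evaluates the chance-inflation term for a four-option MCQ; the specific values of M are illustrative (M = 27 mirrors the size of our model suite).

```python
# Probability that pure guessing "solves" an MCQ item at least once across
# M independent attempts (models or stochastic samples), for N answer options.
def p_solved_by_chance(n_options: int, n_attempts: int) -> float:
    return 1.0 - (1.0 - 1.0 / n_options) ** n_attempts

for m in (1, 5, 27):
    print(f"N=4, M={m}: {p_solved_by_chance(4, m):.3f}")
# N=4, M=1:  0.250
# N=4, M=5:  0.763
# N=4, M=27: ~1.000
```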
Solution: MCQ-to-Generative Transformation and Circular MCQ Evaluation. To bridge this gap, we adopt a two-pronged strategy to ensure measured performance reflects genuine visual reasoning. Wherever viable, we transform MCQs into open-ended generative tasks by removing candidate options and requiring the model to formulate a direct response. To score these free-form outputs without the brittleness of exact-string matching, we employ an LLM-as-judge (specifically Qwen3-30B [63], a cost-effective and capable judge) to perform semantic answer matching as in [1].
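As a rough sketch of the semantic answer-matching step, the snippet below shows one way such a judge could be queried; the prompt wording and the `query_judge` callable are illustrative placeholders rather than the exact protocol from [1].

```python
# Hypothetical LLM-as-judge semantic matching; the prompt and query_judge() are placeholders.
JUDGE_TEMPLATE = """You are grading a model's answer to a visual question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Does the model answer convey the same meaning as the reference? Reply "yes" or "no"."""

def semantic_match(question: str, reference: str, prediction: str, query_judge) -> bool:
    """Return True if the judge deems the free-form prediction equivalent to the reference."""
    verdict = query_judge(JUDGE_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction))
    return verdict.strip().lower().startswith("yes")
```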
We illustrate the impact of this transformation in Figure 3a, which compares standard MCQ accuracy against our generative transformation across 27 models on the AI2D dataset. We observe a distinct non-linear relationship: while high-performing models (80%+ MCQ accuracy) show tighter convergence between generative and discriminative performance, lower-tier models exhibit a sharp drop-off in the generative setting. This confirms that for weaker models, traditional MCQ benchmarks often mask a fundamental lack of generative skill through random guessing and closed-set shortcuts.
For tasks where options are structurally necessary, specifically inherently discriminative questions like "Which of the following..." where generative conversion would alter the question's core intent, we implement Circular Evaluation [21]. By rotating option permutations across N passes and crediting a point only if the model identifies the correct answer across all rotations, we effectively collapse the chance baseline. As shown in Figure 3b across 27 models, circular evaluation yields a steeper-than-unity slope relative to vanilla MCQ. This slope captures the persistence of the “false floor” inherent in standard formats: while vanilla MCQs grant models a significant head start (often 20-30% accuracy) through random guessing and position bias, circular evaluation reveals that genuine reasoning capability remains near zero for these same models. The steepness of the curve illustrates that vanilla MCQ continues to significantly inflate perceived performance while true accuracy remains low (< 50%); it is only as models achieve high-level robustness that the two metrics begin to align. By stripping away this artificial inflation, we ensure the benchmark can accurately signal the transition from zero to genuine competence, a critical signal that is otherwise obscured by the noisy MCQ baseline.
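A minimal sketch of the circular scoring rule, assuming cyclic rotations of the options and a hypothetical `ask_model` callable that returns the chosen letter:

```python
from string import ascii_uppercase

def circular_mcq_correct(question: str, options: list[str], answer: str, ask_model) -> bool:
    """Credit an item only if the model selects the correct option under every cyclic
    rotation of the choices; uniform guessing then passes with probability ~(1/N)^N."""
    letters = ascii_uppercase[:len(options)]
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        prompt = question + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, rotated))
        predicted = ask_model(prompt).strip().upper()[:1]
        if predicted != letters[rotated.index(answer)]:
            return False
    return True
```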
Correcting this inflation is crucial: a benchmark is most valuable when it can accurately track the emergence of a new capability. By allowing MCQ formats to provide a “false floor” of performance, we lose the ability to signal when a model truly transitions from zero to non-zero capability. Ultimately, these stricter criteria ensure that DatBench provides a more faithful representation of genuine model competence by stripping away the artificial inflation inherent in traditional formats.
3.2. The Mirage of Visual Understanding
Problem: Language Priors are often all you need. A significant challenge in VLM evaluation is “blind solvability”, a phenomenon in which models correctly answer questions without visual input by exploiting the language priors encoded in their LM backbones. This phenomenon fundamentally decouples benchmark performance from actual multimodal reasoning, inadvertently rewarding models with stronger language priors rather than superior visual understanding. This creates a “mirage” of progress in which improvements to the vision encoder or cross-modal connector are masked by the overwhelming influence of the text-based backbone. Consequently, models with more capable LMs are often deemed to be stronger VLMs simply because they are better at guessing answers from context.
Solution: Filtering Blind-Solvable Questions. To ensure DatBench measures genuine vision-language integration, we systematically identify and remove samples that models can solve “blind.” We conduct a comprehensive evaluation where all 27 models in our suite are queried using only the text prompts from each dataset, without the corresponding image inputs. For each dataset, we visualize this in a histogram (Figure 4) where the x-axis represents the number of models answering correctly and the y-axis represents the fraction of the dataset solved at that model-frequency.
As shown in Table 2, blind-solvable questions typically fall into three categories: (1) World Knowledge, where the answer is physically or culturally standard (e.g., a mosquito’s four-stage life cycle); (2) Visual Stereotypicality, where models exploit the skewed distribution of attributes in natural images to predict answers without visual confirmation (e.g., toilets usually being white); and (3) Purely Symbolic Reasoning, where the question contains all necessary information for an LLM to solve via logic or arithmetic (e.g., counting digits in a range).
We employ a systematic thresholding strategy (τ) to define rejection criteria based on task format. For datasets with open-ended, generative answers where the probability of a model guessing the exact string is negligible, we set a strict threshold (τ = 1); any sample solved by even a single model without visual input is discarded (e.g., CharXiv-Descriptive). Conversely, for tasks with a constrained solution space—such as Multiple Choice Questions (MCQ) or specialized counting tasks—we set higher thresholds to account for the increased baseline of random guessing and language priors. This includes datasets like CountBench, where answers are concentrated at low integers, or specific questions in AI2D that feature a limited set of candidate solutions evident from the prompt (see Row 1 of Table 2).
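A sketch of this thresholding step, assuming a binary matrix of blind (text-only) correctness has already been collected for the 27-model suite; names and shapes are illustrative.

```python
import numpy as np

def filter_blind_solvable(blind_correct: np.ndarray, tau: int) -> np.ndarray:
    """blind_correct: (n_models, n_items) binary matrix of text-only correctness.
    Keep an item only if fewer than tau models solve it without the image
    (tau = 1 for open-ended answers; higher for constrained formats such as MCQ or counting)."""
    n_blind_solvers = blind_correct.sum(axis=0)  # models answering correctly with no image
    return n_blind_solvers < tau                 # boolean mask: True = retain the item
```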
As illustrated in Figure 4a for AI2D, the distribution shows a significant “tail” of questions solvable by nearly all models without an image. Even in more recent evaluations like CharXiv Descriptive (Figure 4b), a large fraction of samples are solvable through language priors alone despite the descriptive nature of the task. In the General capability, this issue is most acute: over 70% of examples can be answered without the image. By removing these samples, DatBench ensures the evaluation focuses on high-quality data where visual reasoning is mandatory for success.
3.3. Incorrect Ground Truth and Ambiguity
Problem: The Cost of Evaluative Noise. Despite significant curation efforts, many existing VLM benchmarks contain non-trivial proportions of examples with incorrect ground-truth labels, ambiguous questions, or insufficient image resolution to support the required reasoning. Such noise fundamentally compromises benchmark validity; when a dataset punishes a model for providing a correct answer that contradicts a flawed label, it obscures genuine capability gains and encourages “hill-climbing on noise”. Since we source DatBench from a massive aggregate pool of candidate datasets, we have a surplus of examples that allows us to prioritize rigorous data quality over raw quantity.
Solution: Two-Stage Quality Filtering with VLM-as-Judge. To identify and purge these artifacts, we employ a two-stage filtering pipeline. In the first stage, we flag examples that all evaluated models (1–10B parameters) answer incorrectly. Unanimous failure across a diverse suite of state-of-the-art models typically indicates either a data quality issue or a genuinely difficult frontier case, both of which warrant closer inspection. In the second stage, a strong VLM judge (GPT-5.2) verifies each flagged sample with access to the ground-truth answer as privileged information.
Our choice of a frontier model as a judge is motivated by prior work suggesting that models are significantly stronger verifiers than generators [33, 34, 35, 32]; we therefore expect the judge to accurately identify errors even in cases that are too challenging for contemporary models to solve autonomously. Given that we operate in a regime of abundant evaluation data across our 9 capabilities, we intentionally err on the side of caution. We adopt a stringent filtering policy, discarding any item flagged as (1) ambiguous, (2) incorrectly labeled, or (3) unsolvable due to insufficient resolution, ensuring that the resulting DatBench subset represents only the highest quality of evaluation data. The impact of this pipeline is most visible in the Spatial capability, which exhibits a 42.07% discard rate, primarily due to insufficient resolution in “in-the-wild” images. Similarly, complex expert-authored sets like ChartQA Pro (17.2% removed) and MMMU-Pro (24.3% removed) show significantly higher noise rates than standard benchmarks (cf. Appendix D in the paper for per-dataset and per-capability counts of filtered examples). While these high attrition rates reflect significant noise in frontier evaluations, we recognize that a judge might occasionally misinterpret specialized, valid reasoning as a data defect. To maintain evaluative headroom, we retain only the subset of these examples that the judge explicitly validates as correct and unambiguous. Our aggregate data surplus allows us to prioritize this high-fidelity subset, accepting the risk that a conservative filtering policy may sacrifice some valid frontier samples to ensure the remaining benchmark remains strictly noise-free.
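A compact sketch of this two-stage filter; the `judge_sample` callable stands in for the frontier-VLM judge (which sees the ground truth as privileged input), and its label set is an assumption.

```python
import numpy as np

def two_stage_quality_filter(correct: np.ndarray, samples: list, judge_sample) -> list:
    """Stage 1: flag items that every evaluated model answers incorrectly.
    Stage 2: keep a flagged item only if the judge verifies it as valid.
    correct: (n_models, n_items) binary matrix on the full image + text task;
    judge_sample: hypothetical callable returning "valid", "ambiguous",
    "mislabeled", or "low_resolution"."""
    flagged = np.flatnonzero(correct.sum(axis=0) == 0)
    keep = set(range(len(samples))) - set(flagged.tolist())
    keep |= {i for i in flagged if judge_sample(samples[i]) == "valid"}  # verified frontier items
    return [samples[i] for i in sorted(keep)]
```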
3.4. High Discrimination with Limited Compute
Problem: The Computational Burden of Comprehensive Evaluation. As VLMs grow in sophistication and expand their set of capabilities, comprehensive evaluation imposes a prohibitive computational burden. This is exacerbated by the emergence of “thinking” models; for instance, Bai et al. [38] utilize inference-time compute scaling, often generating chains-of-thought exceeding 32K tokens. Consequently, evaluating a single capability like OCR (often containing > 100K examples) can require generating over 3 billion tokens, an untenable cost for iterative research.
Selecting a representative subset of examples is a natural approach to reducing evaluation costs. The intuitive heuristic for such a selection is to preserve the model ranking induced by the full dataset, typically quantified using rank correlation measures such as Spearman’s ρ or Kendall’s τ [65, 66, 67]. While rank preservation is a necessary condition for a representative subset, it is theoretically insufficient: rank correlation is agnostic to which specific samples are retained. In practice, even random subsets can preserve global model rankings by retaining items that separate coarse capability tiers (e.g., small versus large models), while failing to retain the high-discrimination examples needed to distinguish models along the Pareto frontier. More broadly, methods that optimize solely for rank preservation face a fundamental limitation: rank correlation saturates rapidly and is often achieved by subsets whose individual samples are weakly or inconsistently informative about underlying capabilities [68, 66]. In such regimes, apparent ranking stability may be driven by spurious correlations or superficial artifacts rather than genuine reasoning ability.
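For reference, the rank-preservation check these methods optimize is typically computed as below; as argued above, a subset can score highly on it while its individual items remain weakly informative.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def rank_preservation(full_scores: np.ndarray, subset_scores: np.ndarray):
    """Compare the model ranking induced by the full benchmark with that of a
    candidate subset; both arrays hold one accuracy per model, in the same order."""
    rho, _ = spearmanr(full_scores, subset_scores)
    tau, _ = kendalltau(full_scores, subset_scores)
    return rho, tau
```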
Instead, we turn to Item Response Theory (IRT), originally formalized in [36], for inspiration. IRT posits that items differ not just in difficulty, but in item discrimination, a parameter that determines how sharply an item distinguishes between subjects of varying ability levels [37]. However, directly applying standard IRT methodologies [3] to VLM evaluation is often infeasible due to the limited number of diverse observations available per sample in the current research landscape [22, 64]. Effectively fitting IRT models typically requires stable evaluations from hundreds of diverse state-of-the-art models; without this scale, IRT models become highly sensitive to hyperparameters and are notoriously difficult to fit stably.
Consequently, simply prioritizing rank stability risks overfitting to the evaluated model suite, without guaranteeing the quality or generalizability of the underlying examples. In effect, this produces a “coarse” measuring stick: it yields a subset that is discriminative enough to recover a specific ranking but lacks the resolution to generalize to unseen models or distinguish those with similar capabilities. Therefore, the core optimization problem is not merely to maintain ranking stability, but to maximize total discrimination. By ensuring every sampled example possesses high discriminative power, we can implicitly guarantee robust ranking while maximizing the information content per inference token.
Solution: Item-Discrimination Based Subset Selection. To avoid the instability of IRT models that are sensitive to hyperparameters and sample size, we operationalize item-discrimination using the point-biserial correlation (r_pb): a robust, hyperparameter-free measure of the association between a binary item response and continuous model capability. Intuitively, r_pb measures the extent to which success on a specific question acts as a proxy for global performance. An item with high r_pb is one that strong models consistently answer correctly and weak models consistently miss; conversely, a low or negative r_pb indicates a noisy item that fails to track with underlying capability. We define total discriminative power as the sum of the discrimination of each example (item).
We select subsets by prioritizing examples with the highest r_pb to maximize information density (cf. Appendix E in the paper). As demonstrated in Figure 7a, DatBench achieves approximately 90% of the total discriminative power using only 40% of the full dataset, whereas random sampling scales linearly and provides less than half that signal at the same budget. Notably, our selection curve peaks above 1.0 before the full dataset is included; this occurs because we intentionally deprioritize “anomalous items” at the end of our selection process. These are questions with negative r_pb where weaker models outperform stronger ones—likely due to spurious text-based correlations, prompt sensitivity, or test-set leakage—which effectively introduce noise into the evaluation.
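A minimal sketch of this selection criterion, using mean accuracy over the full pool as the continuous capability proxy (the exact proxy and budget handling are simplifications of our pipeline):

```python
import numpy as np

def point_biserial(item_correct: np.ndarray, capability: np.ndarray) -> float:
    """r_pb for one item: item_correct is a binary vector over models,
    capability a continuous per-model capability score."""
    p = item_correct.mean()
    if p in (0.0, 1.0):          # saturated or frontier items carry zero discrimination
        return 0.0
    m1 = capability[item_correct == 1].mean()  # mean capability of models that solved the item
    m0 = capability[item_correct == 0].mean()  # mean capability of models that missed it
    return (m1 - m0) / capability.std() * np.sqrt(p * (1.0 - p))

def select_discriminative(correct: np.ndarray, budget: int) -> np.ndarray:
    """correct: (n_models, n_items) binary matrix. Return indices of the `budget`
    highest-r_pb items; negative-r_pb (anomalous) items sort last and are excluded
    at any restricted budget."""
    capability = correct.mean(axis=1)          # simple capability proxy: overall accuracy
    rpb = np.array([point_biserial(correct[:, j], capability)
                    for j in range(correct.shape[1])])
    return np.argsort(-rpb)[:budget]
```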
While Figure 7b shows that both random and discriminative subsets rapidly achieve high rank correlation, this similarity is deceptive. Because our model suite contains distinct performance tiers (e.g., 1B vs. 8B), the global ranking is easily recovered even by uninformative samples. Rank correlation is thus a “low-bar” metric that saturates too quickly to reflect subset quality. By maximizing discrimination, DatBench provides a higher-fidelity instrument that remains sensitive to marginal capability gains and ensures that evaluation remains stable across unseen model architectures.
4. Introducing DatBench and DatBench-Full
By applying our four-stage pipeline of MCQ transformation (Section 3.1), blind-solvability filtering (Section 3.2), quality filtering (Section 3.3), and discriminative selection (Section 3.4), we transform noisy, redundant dataset aggregations into precise evaluation artifacts. These artifacts cover nine distinct capabilities: Chart Understanding, Document Understanding, Scene OCR, Grounding, Counting, Spatial Reasoning, Math & Logic, Diagrams & Tables, and General VQA. We release two versions of the benchmark to cater to varying computational budgets.
For the final DatBench subset, we execute steps 1 through 4. However, the discrimination-based selection in Step 4 naturally discards “frontier” examples—items that all evaluated models fail—as they offer near-zero discrimination by construction. To prevent benchmark saturation and ensure evaluative headroom for future models, we manually allocate up to 20% of the DatBench subsets for these valid frontier cases, specifically those verified by our VLM-as-judge as correct and unambiguous. This strategic inclusion ensures that DatBench maintains a high difficulty ceiling while remaining a robust instrument for measuring progress at the frontier of vision-language modeling.
- DatBench: Our primary, high-efficiency subset tailored for rapid iterative development. Constructed via item-wise point-biserial correlation (r_pb), this set maintains high ranking fidelity while minimizing inference costs. We explicitly retain a partition of verified, high-quality “frontier” examples—currently unsolvable by 1B–10B models—to ensure the benchmark remains an effective measuring stick as model capabilities scale.
- DatBench-Full: The complete aggregation of all high-quality samples remaining after our systematic filtering pipeline (Steps 1–3). While these sets include all examples validated as objectively high-quality, their scale varies significantly across capabilities based on the severity of the filtering required. For capabilities such as Counting and Spatial Reasoning, where high noise and blind-solvability rates resulted in massive attrition, DatBench-Full is comparable in size to the DatBench subset. However, for most capabilities, DatBench-Full evaluation sets are an order of magnitude larger, reaching up to 50× the size of their efficient counterparts. These sets are intended for extensive, fine-grained error analysis and serve as a comprehensive resource for deep-dive capability assessment.
Usage Guide. We recommend DatBench in high-iteration contexts such as training loops and ablation studies, in which compute costs for evaluation can rapidly balloon but discriminative signal should be maximized. DatBench-Full should be reserved for final model reporting where computational constraints are relaxed and maximum coverage is desired. Collectively, these artifacts transition multimodal evaluation from a regime of noisy data to one of precise measurement.
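Both artifacts can be pulled directly from the Hugging Face Hub; the config and split names below are assumptions for illustration, so consult the dataset cards for the exact identifiers.

```python
from datasets import load_dataset

# Efficient subset for rapid iteration (config/split names are assumed; see the
# dataset card at https://huggingface.co/datasets/DatologyAI/DatBench).
datbench = load_dataset("DatologyAI/DatBench", "counting", split="test")

# Full filtered pool for final reporting.
datbench_full = load_dataset("DatologyAI/DatBench-Full", "counting", split="test")

print(len(datbench), len(datbench_full))
print(datbench[0].keys())
```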
Having established these artifacts, we provide a comprehensive statistical analysis of how DatBench transforms raw benchmark data into faithful and discriminative instruments for the efficient estimation of VLM capabilities.
DatBench discards samples that are too easy / too hard. The most immediate impact of our filtering is the removal of samples that act as statistical noise. In the General capability, model performance is significantly shifted downward from the y = x diagonal (Figure 8), a direct result of Stage 2 filtering (cf. Appendix F in the paper), which discarded 72.07% of samples solvable via language priors alone. Conversely, the Spatial Reasoning capability underwent rigorous quality filtering in Stage 3 (cf. Appendix D in the paper), with 42.07% of samples removed due to ambiguity or insufficient resolution. This systematic removal of evaluative noise shifts model assessments to a more faithful performance tier, ensuring that benchmark outcomes accurately reflect genuine multimodal reasoning.
DatBench is more discriminative. Our item-selection methodology amplifies performance differences between models, increasing measurement resolution. On the original General benchmarks, models compress into a narrow 65–80% accuracy band; on DatBench, they spread across 10–65%, a nearly 4× expansion in effective score range. This “stretching” reflects our point-biserial selection criterion (Section 3.4): by retaining only items where strong models reliably succeed and weak models reliably fail, small capability differences that were previously masked now manifest as measurable gaps. The steep slopes observed in General and Document Understanding (Figure 8) confirm this effect; equivalent spacing on the original benchmarks translates to greater separation on DatBench.
DatBench preserves discrimination power with far fewer samples. Despite aggressive filtering, ranking stability is maintained. For capabilities such as Chart Understanding and Grounding, DatBench points fall almost perfectly on the y = x line (Figure 8), confirming that the subset preserves the discriminability and model rankings from the full dataset. As shown in our Stage 4 efficiency analysis (cf. Appendix E in the paper), DatBench maintains high total discriminative power even at severely restricted budgets, whereas random sampling suffers from linear signal degradation.
Limitations and Future Directions. While our methodology offers a substantial leap forward, several avenues remain for future exploration:
- Scaling to Larger Regimes: Our current analysis focuses on models in the 1B–10B parameter range and inference traces within standard context windows. While the methodology is scale-invariant, the specific subsets of highly discriminative questions will likely shift for larger models and extended inference budgets (e.g., exceeding the 4096 tokens used in our work). Future work can apply this pipeline to larger model families and longer reasoning traces to identify the discriminative frontier for state-of-the-art systems.
- Diversity Guarantees: Our current subset selection prioritizes the highest individual discrimination scores, which implicitly relies on the inherent variety of the source data rather than an explicit diversity constraint. Consequently, this objective does not formally account for redundancy between samples; in pathological cases (e.g., duplicate but highly discriminative examples), the selection could theoretically yield a degenerate or repetitive subset. While we mitigate this through rigorous initial curation, future iterations could incorporate explicit diversity-aware objectives to ensure broader coverage of the capability space.
- Expanding Capabilities: We aim to extend our capability map beyond static images to include long-form video understanding, UI/GUI grounding, and robotics perception.
- DatBench-Live: Finally, discrimination is a moving target; questions that distinguish today’s models will eventually become trivial. We envision a dynamic, “living” benchmark where subsets are recomputed periodically as new models shift the capability distribution and new datasets emerge.
5. Diagnosing VLM Pathologies with DatBench
In this section, we leverage the high-signal artifacts produced by the DatBench pipeline to diagnose the behavioral pathologies of modern VLMs. By analyzing performance across 27 state-of-the-art models spanning the 1B–10B parameter range, we uncover fundamental trade-offs between semantic reasoning and perceptual grounding, risks and rewards of inference-time scaling, and the impact of language priors on evaluation metrics.
Takeaway 1: Capability Correlations Reveal a “Reasoning vs. Perception” Trade-off. To identify hidden relationships between tasks, we calculated Pearson correlations (r) between all capability scores across our model suite (Figure 9a). We identify a tight Reasoning Cluster in which Chart Understanding, Math, and General VQA exhibit exceptionally high pairwise correlations, such as r = 0.90 between Chart and General tasks. This analysis confirms that General VQA benchmarks, such as MMBench and MMMU-Pro, primarily test abstract reasoning capabilities that are also fundamental to Math, evidenced by a strong correlation of r = 0.76 between these two domains. Furthermore, a distinct Spatial-Semantic Trade-off exists: Grounding correlates negatively with text-heavy tasks like Document Understanding (r = −0.29) and OCR (r = −0.19). These negative relationships, alongside the inverse correlation between Math and Spatial Reasoning (r = −0.19), suggest a latent conflict in current training paradigms between high-level semantic processing and low-level perceptual fidelity. Hierarchical clustering (Figure 9b) corroborates this dichotomy, revealing two distinct clusters: reasoning (Chart, Math, General) and perception (OCR, Spatial, Diagram).
Takeaway 2: Capability Profiles Reveal Specialist-Generalist Trade-offs. This trade-off manifests in distinct model archetypes (Figure 10). GLM-4.1V-9B acts as a perception specialist, leading in diagram understanding (66.4%) and spatial reasoning (36.8%) but struggling with math (17.4%). Balanced generalists are rare: Qwen3-VL-4B is a notable exception, maintaining strong document understanding (71.0%), OCR (77.9%), and reasoning (59.9%). Most tellingly, R-4B reaches the highest math score (43.4%) at the cost of the lowest spatial performance (11.4%), suggesting that (current) reasoning-focused training can degrade visual grounding.
Takeaway 3: The “Overthinking” Penalty: Inference-Time Scaling Degrades Perception at High Cost. Comparing Thinking models to standard counterparts reveals that extra test-time compute is a double-edged sword (Figure 11a). To quantify this, we define the Thinking relative advantage as the percentage gain in accuracy of the thinking model over its instruct counterpart, normalized by the instruct baseline: (Acc_thinking − Acc_instruct) / Acc_instruct × 100. Scaling helps Math (≈ +36.8%) and Charts (≈ +10.8%) but causes massive regressions in OCR (≈ −53.5%) and Document Understanding (≈ −47.8%). This regression is also extremely computationally wasteful: while correct thinking answers use ≈ 425 tokens, incorrect attempts balloon to ≈ 1196.9 tokens, a ≈ 14× increase over non-thinking models (Figure 11b). We observed that this is due to models entering unproductive thinking loops on perceptual tasks they cannot solve. Prior work has observed a similar overthinking penalty for language models [80, 81, 82, 83].
Takeaway 4: Language Priors Mask True Multimodal Performance across Capabilities. To isolate the actual visual requirement of each task, we analyze the vision delta (V∆), defined as the performance gap between standard multimodal evaluation and a blind text-only baseline (Figure 12). Our results show that reliance on language priors varies drastically by capability, often distorting perceived progress in multimodal reasoning. Capabilities such as Counting (V∆ = 60.2%) and Grounding (V∆ = 42.3%) exhibit high vision dependency, making them the most faithful indicators of true perceptual accuracy. Conversely, Math (V∆ = 13.0%) and Spatial Reasoning (V∆ = 14.9%) show significant language prior distortion, relying heavily on textual patterns that allow models to guess correctly without the image. These findings confirm that without the rigorous filtering introduced in DatBench, i.e. discarding samples that can be solved with the language prior alone, high scores in capabilities like Math may inadvertently reward stronger language models rather than superior vision-language integration.
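A small helper that makes the V∆ diagnostic explicit, assuming per-capability accuracies are available both with images and under the blind, text-only protocol; the 20-point threshold is purely illustrative.

```python
def vision_delta(acc_with_image: dict[str, float], acc_blind: dict[str, float]) -> dict[str, float]:
    """V-delta per capability: accuracy with images minus the blind text-only baseline.
    Small deltas indicate scores dominated by language priors rather than vision."""
    return {cap: acc_with_image[cap] - acc_blind[cap] for cap in acc_with_image}

def language_prior_dominated(deltas: dict[str, float], threshold: float = 20.0) -> list[str]:
    """Capabilities whose V-delta falls below an (illustrative) threshold."""
    return [cap for cap, d in deltas.items() if d < threshold]
```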
6. Conclusion
In this work, we addressed the dual challenges of data quality and computational cost in the evaluation of Vision-Language Models (VLMs). We introduced a framework of three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. We then applied this lens to expose four critical pathologies in existing benchmarks: multiple-choice formats are both unfaithful and weakly discriminative; many VLM benchmarks can be solved without vision; incorrect and ambiguous ground truth introduces substantial noise; and existing evaluation suites are inefficient. We used these insights to distill these benchmarks into high-signal evaluation suites.
Our primary contribution, DatBench, serves as a precise, psychometrically grounded instrument for measuring multimodal capability. Motivated by Item Response Theory (IRT) and operationalizing discrimination via point-biserial correlation (r_pb), we demonstrated that maximizing total test discrimination yields subsets that are not only computationally lightweight but also significantly more robust and generalizable than those derived via random sampling or simple rank correlation. Our accompanying analysis of “thinking” models and language priors further validates that DatBench is capable of surfacing nuanced behavioral insights that are often obscured in aggregate metrics. We release two versions of the benchmark, the efficiency-focused DatBench for rapid iterative development (yielding a 13× average speedup), and the comprehensive DatBench-Full for final reporting, to standardize comparison and accelerate progress at the Pareto frontier. Our work provides a path towards evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
Get in Touch!
If you're interested in pushing the bounds of what's possible with data curation, we're looking for talented Members of Technical Staff who have experience doing data research, building research tooling, translating science into products, and building scalable data products.
If you're interested in training multimodal and/or text models faster, better, or smaller, Get in touch!
Follow us on twitter for insights (and memes) about data!
Contributions and Acknowledgements
Core Contributors
Siddharth Joshi, Haoli Yin, Rishabh Adiga, and Ricardo Monti.
for fusing modalities, wrangling the datasets, and ensuring the evaluation pipeline didn’t hallucinate.
Technical Contributors
Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, and Zhengping Wang.
the ensemble of experts who maximized our few-shot performance and ablated every hyperparameter.
Leadership
Bogdan Gaza, Ari Morcos, and Matthew Leavitt.
the ground truth oracles who steered the project and prevented collective mode collapse.
Acknowledgements
Liz Gatapia (for incredible logo design), Jacqueline Liu, Tiffanie Pham, Sylvia Hoang, Jayla Lindsey, Kylie Clement, Elise Clark
the human-in-the-loop feedback that provided essential regularization and support.
References