DatologyAI Team

Technical Deep-Dive: Image-Text Data Curation at the Billion-Sample Scale



Train Better, Faster, and Smaller with DatologyAI's Multimodal Data Curation. We train CLIP ViT-B/32 models on the DataComp CommonPool (raw baseline), exact-deduplicated & CLIP-score-filtered CommonPool (sophisticated baseline), and our retrieval-optimized curation of CommonPool. Our curation substantially outperforms both baselines on retrieval evaluations and reaches the raw baseline performance in 2.3% of the time. A ViT-S/32 trained on our curated data has substantially higher performance than train-FLOPs-matched ViT-B/32 baselines and reduces model FLOPs by 2.6x.

Note: This post is a technical deep-dive into our initial multimodal data curation results. For a brief, high-level summary of this work, we direct readers to our short companion piece.

 

tl;dr

We introduce DatologyAI’s state-of-the-art data curation pipeline. Our pipeline is scalable, productionized, and integrates a suite of approaches including model-based filtering, embedding-based filtering, synthetic data, and more. We applied our curation pipeline to image-text pairs from DataComp to obtain substantial improvements in multimodal (CLIP) model quality, training speed, and inference efficiency for pool sizes of up to 1B image-text pairs and training budgets of up to 5.1B image-text pairs.

We developed two curation recipes: one optimized for retrieval tasks, and one optimized for classification tasks. We then compared models—primarily standard CLIP-ViT-B/32—trained on our curated data to models we trained on data that were minimally curated (raw baseline), data curated using industry-standard techniques (specifically, exact image deduplication followed by CLIP score filtering to keep the top 30% of data; what we call the sophisticated baseline), and to publicly-reported CLIP model results. By using our pipeline to curate training data, we are able to:

  • Train Better Models: When training for the largest fixed compute budget (5120M samples), our model quality improvements range from a lower bound of 2.9 percentage points (pp) on classification tasks compared to the sophisticated baseline (sophisticated baseline: 56.6%; DatologyAI: 59.5%), to an upper bound of 11.3pp on retrieval tasks compared to the raw baseline (raw baseline: 64.0%; DatologyAI: 75.3%). Our most comparable curated datasets also outperform the top submissions on the DataComp Large filtering track leaderboard.

  • Train Models Faster: Compared to baselines trained for 5.1B samples, we save 97.7% on compute (43.3x training speedup) to reach the same accuracy on retrieval tasks as the raw baseline, and 92.4% on compute (13.2x training speedup) to reach the same accuracy on classification tasks as the raw baseline. Compared to the sophisticated baseline, we achieve compute savings of 96.5% (28.6x speedup) for retrieval tasks and 65.3% (2.88x speedup) for classification tasks. Our curation also yields models that are competitive with public models trained using more than 6x the compute and well-curated datasets that are over 10x larger.

  • Train Better, Smaller Models: Improve inference efficiency by training better, smaller models. We train ViT-S/32 models on our curated data, which leads to models that reduce the cost per query by up to 2.6x and are up to 8.8 percentage points better than ViT-B/32 models trained on a FLOPs-equivalent amount of sophisticated baseline data.

We also share a brief survey of data curation algorithms, details about our tech stack, and other engineering and research insights we gained in the course of building our scalable, state-of-the-art data curation pipeline.

This work demonstrates the viability of integrating a diverse set of cutting-edge data curation algorithms into a scalable, productionized pipeline, and highlights the challenges of doing so.

We’re extremely proud of what we’ve done, but this is just the start. We’re expecting significant improvements to these figures over the next few months. Getting there requires pushing the bounds of what’s possible across research and engineering areas like embedding models, synthetic data, data infrastructure, and platform engineering—and we’re very excited to do it!

Are you a deep learning fanatic, data engineer, or otherwise data-obsessed person who wants to push the bounds of what’s possible with data curation? Check out our jobs page!

Is your company interested in training multimodal models faster, better, or smaller? Sign up for our customer waitlist!

Follow us on twitter for insights (and memes) about data!

These gains aren’t limited to multimodal. Stay tuned for our follow-up release on text curation for language models!

 


1. Introduction

1.1 Why is Data Curation Important?

Contemporary deep learning model capabilities require training large models on enormous quantities of training data (Dubey et al., 2024; Gemini Team et al., 2023; NVIDIA et al., 2024; Mistral AI Team, 2024). Each additional data point requires additional compute to train on that data point, and each additional parameter requires additional compute to train and deploy. Reducing the costs required to train and deploy models could thus facilitate broader access to the benefits of machine learning.

It has been shown that a large portion of training compute is wasted training on data that are already learned (Mindermann et al., 2022) or irrelevant (Sorscher et al., 2022) for the model’s downstream application. Some data can even be misleading and can actively harm model quality (Maini et al., 2023). This means that selecting the right data can substantially improve training and/or inference efficiency by reducing the cost of training a model to a given quality, improving the quality of models trained for a given compute cost, and enabling the training of smaller models to higher quality. All we have to do is select the right data to train on!

1.2 Why is Data Curation Hard?

Unfortunately, data curation for large-scale deep learning is hard—it’s a frontier research problem. It’s a comparatively new field, experiments are costly to run at scale (Feldman and Zhang, 2020; Choe et al., 2024), and small-scale results often aren’t predictive of large-scale outcomes (Sorscher et al., 2022; Goyal et al., 2024). Furthermore, even if we understand how a single curation algorithm works, it’s unclear how different algorithms work together. This is critical, because there’s likely no single curation algorithm to rule them all. As with training efficiency algorithms more generally (Leavitt et al., 2022; Bartoldson et al., 2023; Kaddour et al., 2023), different data curation algorithms impart their benefits through different mechanisms (Hu et al., 2023; Wang et al., 2024), so understanding how to effectively combine them is necessary for achieving the best outcomes.

Data curation is also a frontier engineering problem—there’s no established playbook for implementing a curation pipeline that can scale up to the billions of images or trillions of tokens that constitute modern foundation model training datasets. While numerous open-source projects offer a viable starting point for curation (e.g. DataTrove, NeMo-Curator, fastdup), they are often rigid and demand deep expertise, significant time, and resources to customize effectively, and don’t contain the cutting-edge research innovations necessary for training the best models. Integrating cutting-edge research into large-scale dataset curation tools presents unique challenges across domains like data infrastructure, platform engineering, reliability, and security. Off-the-shelf solutions typically lack the flexibility and adaptability needed to meet specific requirements and push the boundaries of model performance, making tailored solutions essential.

These research and engineering challenges are why cutting-edge data curation has primarily been restricted to the big players who can afford to hire large in-house data teams (we counted 69 authors across the different data-related teams on OpenAI’s GPT-4 Technical Report).

1.3 Making Data Curation More Accessible

At DatologyAI, we believe that intelligent data curation is too beneficial to remain locked away in closed research labs. We believe there is tremendous societal value to enabling the training of foundation models outside a small set of large, extremely well-resourced companies. We believe that data curation is one of the most promising ways to reduce the cost of training and deploying models. We believe everyone should be empowered to train their own models on their own data. And this is why we’re excited to announce the first set of results from our state-of-the-art data curation pipeline.

1.4 What We Did and How We Did It

We built a scalable data curation pipeline that comprises a suite of curation algorithms and used it to develop retrieval- and classification-optimized curation recipes for image-text multimodal data. We then applied our curation recipes to pools of DataComp data up to sizes of 1024M samples, and trained standard CLIP models for up to 8192M samples on curated data and two baseline datasets: DataComp curated as per their CommonPool filtering (“raw baseline”) and DataComp to which we also applied exact image deduplication and CLIP score filtering (“sophisticated baseline”).

Our curation can reduce the compute needed to reach baseline performance by up to 97.7% (a 43.3x training speedup), improve model quality by up to 11.3 percentage points compared to compute-matched baselines, and reduce inference costs by up to 2.6x by training smaller models to higher quality. Our curation also yields models that are competitive with public models trained using more than 6x the compute and well-curated datasets that are over 10x larger, and outperforms the top submissions on the DataComp Large filtering track leaderboard.

1.5 A Bird’s Eye View of this Post

This post starts with a brief overview of the algorithms that constitute our curation pipeline. We then jump into the results, discussing the quality improvements, training speedups, and inference benefits of training on our retrieval- and classification-optimized datasets compared to the baselines we trained ourselves. Next, we broaden our scope and compare against external public baselines. Finally, we share our methodology: the data we use and how we curate it, how we evaluate, and a technical overview of our curation pipeline and some of the challenges involved in scaling it up.

Read on to uncover the methods, metrics, and real-world results that make our data curation pipeline a game-changer — empowering you to build faster, more efficient models without the intense resource demands of large AI labs.

2. Curation Algorithms

Graphical depictions of the different curation algorithms used in DatologyAI's data curation pipeline
Figure 1: Data Curation Algorithm Families: Our pipeline integrates a number of families of curation algorithms.

Our algorithmic curation pipeline comprises a suite of algorithms, which we group into “families” based on their known or hypothesized mechanism(s) of action, effects, and computational requirements (see Figure 1). The algorithm families we use are as follows:

  • Exact deduplication: We remove exact image duplicates across the entire data pool using SHA-512 hashes.

  • Model-based filtering: Similar to Fang et al., 2023, we trained filtering models on high-quality reference datasets known to be in-domain with respect to our evaluation tasks. We then used these filtering models to score pretraining data samples and filtered out low-scoring examples. We found that the composition of the data that the filtering model was trained on strongly affected the quality of the model trained on the filtered data.

  • Embedding-based curation: We embed samples and then leverage the geometric relations between samples in embedding space to find high- and low-value data points, such as semantic duplicates, similar to Abbas et al., 2023, Tirumala et al., 2023, and Abbas et al., 2024 (see the sketch after this list). We improve on existing approaches by reducing the need for manual hyperparameter selection and enhancing the scalability and efficiency of the algorithms.

  • Target distribution matching: Like Gadre et al., 2023 and Wang et al., 2024, we leveraged auxiliary data (for example from other high-quality datasets and/or relevant tasks) to retrieve data points from a larger corpus and upsample them during training. We found that the efficacy and efficiency of this approach are sensitive to a number of design elements, including the choice of target dataset(s), the similarity and ranking algorithm, how we account for the density of the target distribution, and the proportion of retrieved data used in the final data mix.

  • Synthetic data: We used pretrained VLMs to recaption data samples, similarly to Schuhmann et al., 2022, Lai et al., 2023, Li et al., 2023, and Nguyen et al., 2024. We want to be explicit that we did not generate new images, only new captions. We found a number of factors that substantially impacted the quality of models trained on recaptioned data, including: validating the quality of the generated captions; the choice of recaptioning model; the recaptioning model prompt; and the proportion of recaptioned samples in the dataset.
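
To make the embedding-based curation family more concrete, here is a minimal, single-machine sketch of SemDeDup-style semantic deduplication in the spirit of Abbas et al., 2023. This is not our production implementation (which runs distributed and avoids most of this manual hyperparameter tuning); the cluster count and similarity threshold below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 100,
                   sim_threshold: float = 0.95) -> np.ndarray:
    """Return indices of samples to keep after SemDeDup-style pruning.

    embeddings:     (N, D) image or image-text embeddings.
    n_clusters:     number of k-means clusters (illustrative default).
    sim_threshold:  cosine similarity above which two samples in the same
                    cluster are treated as semantic duplicates (illustrative).
    """
    # Normalize so dot products are cosine similarities.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Cluster first so we only compare samples that are already close in embedding space.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

    keep = np.ones(len(embeddings), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        sims = embeddings[idx] @ embeddings[idx].T
        # For every pair above the threshold, keep the earlier sample and drop the later one.
        _, dup_cols = np.where(np.triu(sims, k=1) > sim_threshold)
        keep[idx[dup_cols]] = False
    return np.where(keep)[0]
```

In practice the interesting questions are how to set thresholds without manual tuning, how to scale the clustering and pairwise comparisons to a billion samples, and how this step interacts with the other curation algorithms.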

Please see the Appendix for a more detailed discussion of each algorithm family.

3. Results

Data curation with DatologyAI’s methods significantly enhances training efficiency by reducing the resources required to achieve a given level of quality or improving model quality for a given computational cost. Since models in our benchmarks are not trained to convergence, the relationship between training speed and accuracy can be adjusted to prioritize one over the other. We report results with a focus on maximizing either speed or quality independently, demonstrating that our retrieval- and classification-optimized datasets provide substantial savings in computational resources while delivering high-performance outcomes in training and inference.

We first present results on the quality improvements, training speedups, and inference benefits of training on our retrieval- and classification-optimized datasets compared to the other baselines we trained ourselves. We then broaden our scope to compare against external public baselines.

To facilitate a deeper understanding of our findings, we present our results using interactive plots. We encourage you to explore these visualizations, as they offer valuable insights and multiple perspectives on our data. By interacting with these plots, you can uncover nuances and trends that may not be immediately apparent in a static presentation. We also include a plot in the Appendix that contains the full set of results for every task for every model we trained for readers who wish to go really deep. Perhaps you will find an interesting result that we missed—if so, please let us know!

3.1 Data Curation for Better Model Quality

Our retrieval- and classification-optimized curation recipes unlock substantial improvements in model quality that generalize across a broad range of training budgets and dataset scales. We trained CLIP-ViT-B/32 models on three retrieval-optimized datasets, each curated from DataComp pools of varying sizes: 256M, 512M, 1024M. For each dataset, we varied the number of training samples seen by the models, ranging from 128M to 5.12B (see Data). We report results in the text for the largest training scale (1024M pool size, 5.12B train budget), but all pool sizes and train budgets can be viewed in our interactive plots.

3.1.1 Model Quality Improvements from Retrieval-Optimized Curation

Figure 2: Retrieval-Optimized Curation Substantially Improves Retrieval Performance. We plot mean accuracy on retrieval tasks (Flickr, MSCOCO, and SugarCrepe) on the y-axis as a function of training samples on the x-axis for models trained on our retrieval-optimized dataset. Each larger point represents the final accuracy for a model trained for a given number of training samples (i.e. each larger point is a different model), and we linearly interpolate between larger points. You can control the pool size (i.e. the number of samples available for curation; see Data) using the dropdown menu at the top of the plot. Note: we also include a more detailed plot in the Appendix that contains the full set of results for every task for every model we trained.

At the 1024M pool size and 5.12B train budget, our retrieval-optimized DataComp yields an improvement of 11.3 percentage points (pp) on average across retrieval tasks over the raw baseline data (raw baseline: 64.0%; DatologyAI: 75.3%; see How We Evaluate and Figure 2). This corresponds to a 17.6% relative performance improvement.

Compared to the sophisticated baseline (65.9% mean retrieval performance) at the 1024M pool size and 5.12B train budget, our retrieval-optimized DataComp (75.3%) yields an improvement of 9.4pp, which corresponds to a 14.3% relative performance improvement.

3.1.2 Model Quality Improvements from Classification-Optimized Curation

Figure 3: Classification-Optimized Curation Substantially Improves Classification Performance. We plot mean accuracy on 25 classification tasks (see How We Evaluate) on the y-axis as a function of training samples on the x-axis for models trained on our classification-optimized dataset. Each larger point represents the final accuracy for a model trained for a given number of training samples (i.e. each larger point is a different model), and we linearly interpolate between larger points. You can control the pool size (i.e. the number of samples available for curation; see Data) using the dropdown menu at the top of the plot.

Moving to the classification setting, at the 1024M pool size and 5.12B train budget our classification-optimized curation yields an improvement of 9.1pp on average across classification tasks (see How We Evaluate) over the raw baseline (raw baseline: 50.4%; DatologyAI: 59.5%; see Figure 3). This corresponds to an 18.1% relative performance improvement.

Compared to the sophisticated data baseline (56.6%), our classification-optimized curation yields an improvement of 2.9pp on average across classification tasks. This corresponds to a 5.1% relative performance improvement.

Our retrieval- and classification-optimized curation recipes achieve substantial, consistent gains across a range of pool sizes, training budgets, and baselines, highlighting the strength of our curation pipeline and of task-specific dataset curation.

3.1.3 How Much Does One Additional Point of Accuracy Cost?

Given the desire to improve model quality, it’s important to consider how different training interventions affect the marginal cost of model quality improvements—the cost for each additional unit of accuracy. We would like to highlight that attempting to use the raw baseline data to match the accuracy improvements obtained when using curated data—especially DatologyAI’s curated data—is very expensive, and in some cases impossible.

3.1.3.1 The Diminishing Returns of Training on Minimally-Curated Data

Taking the quality of our model trained on 256M samples of the retrieval-optimized 512M pool size data as a reference (69.9% mean retrieval accuracy), we attempted to match it by training for longer on the raw baseline data—8192M samples, or 32x as long—but were unable to match it: performance reached only 63.1%, 6.8pp worse than the retrieval-optimized model (see Table 1).

Table 1: The Quality Ceiling of Uncurated Data. Training for 256M samples on our retrieval-optimized data substantially outperforms (+6.8pp) training for 8192M samples (32x as long) on the raw baseline data.

3.1.3.2 Calculating the Marginal Cost of Accuracy Improvements

The diminishing returns from extended model training on poorly curated data result in a high marginal cost for model quality improvements. To quantify this, we calculate the marginal accuracy point cost by establishing a reference accuracy. This reference point represents the model’s performance at the first checkpoint (after processing 16M samples) when trained on either our retrieval-optimized 256M→58M dataset, achieving 52% accuracy on retrieval tasks, or our classification-optimized 256M→47M dataset, achieving 19% accuracy on classification tasks. We then define 1x as the compute needed for the baseline model to reach this benchmark accuracy.

Figure 4 shows the relative cost of each marginal accuracy point when training models with and without DatologyAI curation. Using retrieval performance as an example, a 2.7pp accuracy improvement on retrieval tasks from 52% to 54.7% requires approximately 5x the compute of reaching 52% retrieval performance when using uncurated raw data. In contrast, investing the same 5x compute on our retrieval-optimized data yields an improvement of 16.4% in model quality, an improvement that’s unachievable when training on the baseline datasets. These findings underscore the substantial efficiency gains achieved by using DatologyAI's curated data compared to raw datasets.

Curation not only significantly reduces the cost per unit of accuracy but also unlocks performance levels unattainable with raw data alone—highlighting the pivotal role of targeted dataset curation in advancing model quality.


Figure 4: Data Curation Reduces the Marginal Cost of Accuracy Improvements. We show the marginal accuracy improvement (y-axis) as a function of compute cost multiple (x-axis) relative to the accuracy of the raw baseline. We compute the marginal accuracy point cost by defining a reference accuracy, which is the performance of the first checkpoint (16M samples) when training on our retrieval-optimized 256M→58M dataset (52% on retrieval tasks) or classification-optimized 256M→47M dataset (19% on classification tasks). We then define 1x as the amount of compute required for the raw baseline model to achieve reference accuracy. Accuracy gains are much cheaper when using curated data compared to the baselines. On retrieval tasks, for example, using 10x as much compute yields less than a 4.5pp improvement when training on raw baseline data, while it yields an 18.4pp improvement when training on DatologyAI curated data.

3.2 Data Curation Accelerates Model Training

Our results thus far have focused on demonstrating how our curation can improve model quality. We now shift our focus to the compute savings enabled by our curation. We measure compute savings by first training a baseline model for some fixed number of samples, then comparing the number of DatologyAI-curated samples required to reach the baseline model’s accuracy.
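
As an illustration of how such a speedup can be computed, the sketch below linearly interpolates an accuracy-vs-samples curve for a curated run (similar in spirit to the interpolation between trained models in our plots) to find how many curated samples are needed to match a baseline's final accuracy. The checkpoint values are invented for illustration; only the procedure reflects how the speedups are computed.

```python
import numpy as np

def samples_to_reach(target_acc: float, samples: np.ndarray, accs: np.ndarray) -> float:
    """Interpolate a (roughly monotonic) training curve to estimate how many
    training samples are needed to first reach `target_acc`."""
    if target_acc > accs.max():
        raise ValueError("Target accuracy is never reached on this curve.")
    return float(np.interp(target_acc, accs, samples))

# Hypothetical checkpoints for a curated run: (training samples in millions, mean accuracy %).
curated_samples = np.array([16, 64, 128, 256, 512, 1024], dtype=float)
curated_accs    = np.array([52.0, 60.0, 64.5, 69.9, 72.5, 74.0])

baseline_samples = 5120.0   # baseline trained for 5.12B samples
baseline_acc     = 64.0     # baseline's final mean retrieval accuracy (%)

needed = samples_to_reach(baseline_acc, curated_samples, curated_accs)
print(f"speedup: {baseline_samples / needed:.1f}x, "
      f"compute savings: {100 * (1 - needed / baseline_samples):.1f}%")
```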


Figure 5: Data Curation Accelerates Model Training. We plot accuracy (y-axis) as a function of training samples (x-axis). Accuracy values are the Pareto frontier across different pool sizes, i.e. we take the highest-quality model for a given training budget. You can control the curation type and evals (retrieval or classification) using the dropdown menu at the top of the plot.

3.2.1 Faster Training for Retrieval

At the largest training duration (baselines trained for 5120M samples), we save 97.7% on compute (43.3x training speedup) to reach the same accuracy on retrieval tasks—64.0% accuracy—when training on retrieval-optimized data compared to the raw baseline, and we save 96.5% on compute (28.6x speedup) compared to the sophisticated baseline (65.9% accuracy; see Figure 5).

3.2.2 Faster Training for Classification

For classification tasks, we save 92.4% on compute (13.2x training speedup) when training on classification-optimized data to reach the same accuracy as the raw baseline—50.4% accuracy—and save 77.5% (4.5x speedup) compared to the sophisticated baseline (56.6% accuracy; see Figure 5).

These reductions in computational cost underscore the value of curated data for rapid iteration in large-scale model development.

3.3 Train Smaller, Better Models

Most published scaling laws focus solely on estimating changes in model quality as parameter count and training data increase, without considering the cost of inference. However, Sardana et al., 2024 presents inference-aware scaling laws that reveal that the optimal model size can be significantly smaller than what is predicted by conventional, inference-agnostic approaches. This is particularly relevant to deployed industrial settings, where inference cost can dominate the economics of the model lifecycle. Thankfully, the improved training efficiency imparted by data curation can also be used to improve the quality of smaller models, making them viable replacements for larger models in deployment and yielding substantial inference savings.

We first note that our FLOPs calculations in the following results use the measurements for a single forward pass on a 224x224 image for a CLIP-ViT-S/32 (5.7GFLOPs) and CLIP-ViT-B/32 (14.8GFLOPs) from OpenCLIP. We multiply this by 3 when discussing training results to account for the backward pass.
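
For readers who want to reproduce the arithmetic, here is the simple calculation behind those numbers (forward-pass costs as reported by OpenCLIP; the 3x multiplier for the backward pass is the standard rough approximation described above).

```python
# Forward-pass GFLOPs for a single 224x224 image, as reported by OpenCLIP.
GFLOPS_FWD = {"ViT-S/32": 5.7, "ViT-B/32": 14.8}

def train_gflops(model: str, num_samples: float) -> float:
    """Approximate training cost: forward-pass FLOPs times 3 to include the backward pass."""
    return GFLOPS_FWD[model] * 3 * num_samples

# Inference cost ratio behind the "up to 2.6x cheaper per query" claim.
print(f"{GFLOPS_FWD['ViT-B/32'] / GFLOPS_FWD['ViT-S/32']:.2f}x")       # ~2.60x

# Training-FLOPs-matched comparison: a ViT-S/32 trained on 2048M samples costs roughly the
# same as a ViT-B/32 trained on ~789M samples (2048M * 5.7 / 14.8).
print(train_gflops("ViT-S/32", 2048e6))                                # ViT-S/32, 2048M samples
print(train_gflops("ViT-B/32", 2048e6 * 5.7 / 14.8))                   # FLOPs-matched ViT-B/32
```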

3.3.1 Train Small, Retrieve Big

We trained a CLIP-ViT-S/32 on 2048M samples of DatologyAI’s Retrieval-Optimized 512M→205M dataset (71.8% mean retrieval performance) and obtained a 13.3pp improvement (22.7% relative improvement) over a CLIP-ViT-B/32 trained on the raw baseline data for an equivalent amount of FLOPs (58.5% mean retrieval performance), and an 8.8pp improvement (14.9% relative improvement) over a ViT-B/32 trained on the sophisticated baseline data for the same amount of FLOPs (63.0% mean retrieval performance; Figure 6). This translates to an inference cost reduction of 2.6x for a substantially better model.


Figure 6: Data Curation Enables Training Better, Smaller Models. We plot model accuracy (y-axis) as a function of theoretical training FLOPs (x-axis) for different ViT-B/32 models trained using baseline data and ViT-S/32 models trained using DatologyAI’s curation. Each larger point represents the final accuracy for a model trained for a given number of training samples (i.e. each larger point is a different model), and we linearly interpolate between larger points. We calculate FLOPs using OpenCLIP’s measurements (ViT-S/32 for a single 224x224 image: 5.7GFLOPs; ViT-B/32 for a single 224x224 image: 14.8GFLOPs). You can control the curation type and evals (retrieval or classification) using the dropdown at the top of the plot.

3.3.2 Train Small, Classify Big

Similarly, training a CLIP-ViT-S/32 on DatologyAI’s classification-optimized DataComp for 512M samples (51.6% mean classification accuracy) yields a 9.2pp improvement (21.7% relative improvement) over the raw baseline trained for the same amount of FLOPs (42.2% mean classification accuracy), and a 1.7pp improvement (3.4% relative improvement) over a model trained for the same amount of FLOPs on the sophisticated baseline data (49.9% mean classification accuracy)—which again comes with an inference cost reduction of up to 2.6x (Figure 6).

These results demonstrate the promise of high-quality data curation for shrinking models, which makes it possible to serve more demand at the same cost, reduce latency, and unlock new use cases such as on-device computation and real-time operations.

3.4 A Cornucopia of Comparisons: Expanding to External Models

Up to this point, we have compared our curation exclusively to baselines we trained ourselves. We now expand our focus to external models. Making apples-to-apples comparisons to external models is complicated by a number of factors, including differences in training hyperparameters, training algorithms, evaluation datasets, and training dataset link rot (Lakic et al., 2023), but is nonetheless critical for more broadly contextualizing our results. Overall we find that our curation compares quite favorably to external CLIP models trained using a variety of data curation approaches and datasets.

3.5 Fruits of Curation: High-Quality Results without High Costs in ViT-B/32

Our primary testbed is the vanilla CLIP-ViT-B/32, which we chose because its training speed allows for rapid iteration. Our ViT-B/32 models trained on DatologyAI-curated data are competitive with models trained on substantially larger datasets and more compute.

3.5.1 Better Retrieval Quality with Decimated Datasets and Truncated Training Budgets

To demonstrate the effectiveness of our curation process, we start by comparing the retrieval performance of models trained on our Retrieval-Optimized dataset against those trained on much larger datasets.


Table 2: Comparison of DatologyAI Retrieval-Optimized ViT-B/32 models to external ViT-B/32 models. External results are all obtained from the OpenCLIP repo. Ret Avg = mean Flickr and MSCOCO recall@1 (see How We Evaluate). We note that there are ViT-B/32 models in the OpenCLIP repo that outperform the models we trained on DatologyAI-curated data (primarily those trained with 34B-sample training budgets and/or larger image sizes), but the core claim we aim to emphasize here is that models trained on DatologyAI-curated data are competitive with models trained with substantially larger batch sizes, more compute, and on much larger datasets.


Training for 5.1B samples on our Retrieval-Optimized 1024M→205M dataset yields a mean retrieval recall@1 of 61.1% (see Table 2). This exceeds the performance of a QuickGELU ViT-B/32 trained on the well-curated MetaCLIP-2.5B for 12.8B samples (59.8% mean retrieval recall@1), and an xlm-roberta-base-ViT-B/32 trained for 13B samples on LAION-5B (59.6% mean retrieval recall@1). Not only are both of these external models trained on datasets 10-20x the size for 2.5x the training budget, but they also use improved model architectures and substantially larger batch sizes.

Larger batch sizes are associated with improved CLIP training due to more in-batch negatives (Chen et al., 2020; Radford et al., 2021; Pham et al., 2021). The external baselines we compare to are all trained with batch sizes 4x-11x larger than what we use here (8k). This suggests that targeted curation increases the information density within each batch (e.g. Evans et al., 2024), enabling effective training even with a batch size as modest as 8k, and often yielding superior results.

Similarly, training a ViT-B/32 for ~2B samples on our Retrieval-Optimized 512M→102M dataset yields a mean retrieval recall@1 of 57.6%, while training the same model architecture for 12.8B samples on LAION-400M yields a mean retrieval recall@1 of 57.1%. We would like to emphasize that our curation obtains superior performance compared to training for over 6x longer on a dataset roughly 4x larger.

3.5.2 Less Data, Less Compute, Better Classification

In addition to optimizing for retrieval performance, we also evaluated our models on classification tasks to assess the versatility of our curation approach. Training on our Classification-Optimized datasets led to notable performance on classification metrics. For instance, using the Classification-Optimized 1024M→179M dataset, a ViT-B/32 model trained for 5.12B samples achieves a classification average accuracy of 59.5% and an ImageNet-1K accuracy of 63.2% (see Table 3). This performance is competitive with models trained on substantially larger datasets and with substantially more compute.

Table 3: Comparison of DatologyAI Classification-Optimized ViT-B/32 models to external ViT-B/32 models. External results are all obtained from the OpenCLIP repo. Cls Avg = mean across 25 classification datasets (see How We Evaluate). We note that there are ViT-B/32 models in the OpenCLIP repo that outperform the models we trained on DatologyAI-curated data (primarily those trained with 34B-sample training budgets and/or larger image sizes), but the core claim we aim to emphasize here is that models trained on DatologyAI-curated data are competitive with models trained with substantially larger batch sizes, more compute, and on much larger datasets.

For instance:

  1. The LAION-400M ViT-B/32 model, trained for 12.8B samples with a batch size of 32k, attained a classification average of 57.2% and an ImageNet-1K accuracy of 60.2%.

  2. The LAION-2B Roberta-ViT-B-32 model, trained for 12B samples with a batch size of 32k, attained a classification average of 59.4% and an ImageNet-1K accuracy of 61.7%.

These results demonstrate that our curation recipes can reduce the amount of compute and/or training data required to achieve a given level of model quality by an order of magnitude.

3.6 Branching Out: Apples-to-Apples with DataComp Large

We’re quite excited about how our ViT-B/32 results stack up against external baselines, so in order to make additional comparisons we look to the DataComp leaderboard. Specifically, we compare to two entries in the Large Filtering track, in which participants filter a pool size of 1.28B samples and train a CLIP-ViT-B/16 on the filtered data for 1.28B samples: Data Filtering Networks (DFNs; Fang et al., 2023), which is the top-ranked submission on the leaderboard, and the DataComp team’s own best submission, Image-based ∩ CLIP score (L/14 30%), which we refer to as “Image ∩ CLIP” for brevity (Gadre et al., 2023). In order to make our comparisons as apples-to-apples as possible, we trained CLIP-ViT-B/16 models for 1.28B samples on our Classification-Optimized 1024M→179M and Retrieval-Optimized 1024M→205M datasets.

One key difference is that the DataComp Large (1.28B) pool is 25% larger than our 1024M pool. This could affect comparisons between results obtained using the two datasets. Expanding the initial pool size can enhance the quality of the curated dataset by providing a broader range of high-quality samples for selection. This is evidenced by Figure 7, which demonstrates a clear trend: as the pool size increases, overall classification accuracy improves monotonically.


Figure 7: Larger Pool Sizes Yield Higher Quality Datasets. We plot overall classification accuracy (y-axis) as a function of pool size (x-axis) for ViT-B/32 models trained on 1024M samples of raw baseline data, sophisticated baseline data, and DatologyAI Classification-Optimized data.

3.6.1 Normalizing to CLIP Score Baselines for a More Accurate Comparison

In order to facilitate more direct comparison between our 1024M pool and the DataComp Large pool, we took the following approach: First, we applied DataComp’s CLIP score filtering to our DataComp 1024M pool to generate our own CLIP score baseline. We then subtracted our CLIP score baseline from our curation results to generate CLIP score-normalized results. We then applied the equivalent process to the DataComp leaderboard results: we subtracted the DataComp leaderboard CLIP score performance from the DFN and Image ∩ CLIP performance to obtain CLIP score-normalized results for each of those two methods.
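
In code, the normalization is just a subtraction against the CLIP score baseline from the corresponding pool. The baseline values below are back-derived from the deltas reported later in this section and should be treated as approximate; only the procedure is the point.

```python
# Average accuracy across the 38 DataComp evals (%). CLIP-score-baseline values are
# back-derived from the reported deltas (4.6pp and 3.1pp) and are approximate.
ours           = 56.68   # DatologyAI Classification-Optimized, our 1024M pool
ours_clip_base = 52.08   # CLIP score filtering applied to our 1024M pool
dfn            = 56.0    # DFN, DataComp Large filtering track
dc_clip_base   = 52.9    # DataComp Large CLIP score baseline

print(f"ours (normalized): {ours - ours_clip_base:+.1f}pp")   # +4.6pp
print(f"DFN  (normalized): {dfn - dc_clip_base:+.1f}pp")      # +3.1pp
```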

After normalizing the DataComp Large filtering track results and our curation results using the CLIP score-normalization method, we find that our curation outperforms both the DFN and Image ∩ CLIP approaches.

3.6.2 Class Warfare: Strong General and Classification Performance on DataComp

Our DatologyAI Classification-Optimized 1024M→179M dataset improves performance over our CLIP score baseline by 4.6pp on average across the 38 DataComp evaluation datasets, while DFN and Image ∩ CLIP improve over the DataComp leaderboard CLIP score baseline by 3.1pp and 0.8pp, respectively. Even without CLIP score baseline normalization, our classification-optimized dataset yields better overall performance than DFN: 56.68% average performance across all 38 DataComp tasks, compared to 56.0% for DFN. We note that DFN performs better on ImageNet-related tasks (likely due to fine-tuning the filtering model on the ImageNet train set). However, our curated dataset outperforms DFN on VTAB (58.7% vs 55.5%), a diverse subset of classification tasks (see Table 4). This indicates that our approach doesn't over-index on ImageNet, but rather maintains strong performance across a broader range of classification tasks.

Table 4: Comparison of DatologyAI Classification-Optimized Curation to DataComp Large Filtering Track Submissions. In the final two columns we report normalized evaluation results, where performance is adjusted by subtracting the CLIP score baseline of the respective dataset: DataComp CLIP Baseline for models trained on the DataComp Large dataset and DatologyAI CLIP Baseline for our models. Metrics include VTAB (mean accuracy across 12 classification datasets as specified in the Appendix and Fang et al., 2023), and 38 Evals Avg (mean accuracy across all 38 DataComp evaluation datasets). All models use the CLIP ViT-B/16 architecture and are trained on 1.28 billion samples. Results for DataComp Large models are sourced from the DataComp Large Filtering track leaderboard. *The DFN dataset size is close to 380M according to Fang et al., 2023, but the exact size is unknown.

3.6.3 Golden Retrieval: Dogging the DataComp Competition

Our DatologyAI Retrieval-Optimized 1024M→205M dataset improves retrieval performance over our CLIP score baseline by 13.36pp on average, while DFN and Image ∩ CLIP improve over the DataComp leaderboard CLIP score baseline by 6.8pp and 3.2pp, respectively. Our retrieval-optimized dataset has average retrieval performance (averaged across Flickr, MSCOCO, and WinoGavil) of 59.9%, compared to DFN’s 53.4% (see Table 5). These results indicate that our curation outperforms the best current approaches on the DataComp Large filtering leaderboard, despite the DataComp Large-derived datasets having a leg up due to a 25% larger pool size.

Our task-specific curation methods thus set new standards in both classification and retrieval performance, showcasing how targeted curation can outperform existing approaches — even when working with a smaller initial data pool.

Table 5: Comparison of DatologyAI Retrieval-Optimized Curation to DataComp Large Filtering track submissions. In the final column we report normalized evaluation results, where performance is adjusted by subtracting the CLIP score baseline of the respective dataset: DataComp CLIP Baseline for models trained on the DataComp Large dataset and DatologyAI CLIP Baseline for our models. Ret Avg is the mean performance across Flickr, MSCOCO, and WinoGavil datasets. All models use the CLIP ViT-B/16 architecture and are trained on 1.28 billion samples. Results for DataComp Large models are sourced from the DataComp Large Filtering track leaderboard. *The DFN dataset size is close to 380M according to Fang et al., 2023, but the exact size is unknown.

3.7 Cross-Pollination: Enriching Natural Images with Synthetic Captions

While our results are competitive with much larger external baselines, thus far we haven’t compared to other approaches that utilize synthetic data. We now turn to external results that train comparable model architectures (ViT-B/32 and ViT-B/16) using synthetic data. While our classification-optimized curation data yields the best VTAB performance of the datasets we compare to (see Table 6), we focus our discussion primarily on retrieval capabilities, where synthetic captions are known to provide greater benefit (Xu et al., 2024).

Table 6: Synthetic Data Comparison, Classification. We compare ViT-B/32 and ViT-B/16 models. Due to the limited evals reported for other baselines, we restrict evals to ImageNet (IN), IN Shifts, and VTAB (mean accuracy across 12 classification datasets as specified in the Appendix and Fang et al., 2023). We compare to the best-performing MetaCLIPv2 model, which employs a recaptioning model trained on a dataset with three rounds of iterative human annotation (r=3) and uses a 15% probability (p=0.15) of selecting a synthetic caption during pretraining. We also compare with LaCLIP (Fan et al., 2023) and with Nguyen et al., 2023, which keeps the top 30% of raw data from the DataComp-Large pool as ranked using a ViT-H/16 CLIP score and uses BLIP2 (Li et al., 2023) to recaption the remaining 70% of images.

3.7.1 Strong Synthetic Recaptioning Improves ViT-B/32 Retrieval Quality

We start with ViT-B/32 models and benchmark our curation against two notable approaches. The first is LaCLIP (Fan et al., 2023), which leverages LLaMA's in-context learning capabilities to generate diverse captions conditioned on the original alt-text/caption for training the CLIP model. The second is the recently released MetaCLIPv2 (Xu et al., 2024), in which the authors develop a specialized recaptioning model based on high-quality, iterative human annotations; this model then synthesizes detailed captions conditioned on both the image and the original caption. Our model, trained on the DatologyAI Retrieval-Optimized 1024M→205M dataset, achieves a Retrieval Avg score of 57.8%, outperforming MetaCLIPv2's 55.6% and LaCLIP's 53.7%—despite the latter two models training for 2.5x more samples and using 4x the batch size (see Table 7).

Table 7: Synthetic Data Comparison, Retrieval. We compare ViT-B/32 and ViT-B/16 models. Due to the limited evals reported for other baselines, we restrict evals to Retrieval Avg (mean performance across Flickr, MSCOCO, and WinoGavil datasets). We compare to the best-performing MetaCLIPv2 model, which employs a recaptioning model trained on a dataset with three rounds of iterative human annotation (r=3) and uses a 15% probability (p=0.15) of selecting a synthetic caption during pretraining. We also compare with LaCLIP (Fan et al., 2023) and with Nguyen et al., 2023, which keeps the top 30% of raw data from the DataComp-Large pool as ranked using a ViT-H/16 CLIP score and uses BLIP2 (Li et al., 2023) to recaption the remaining 70% of images.

3.7.2 Sweet 16: Driving ViT-B/16 Retrieval Performance with Synthetic Captions

We now compare our ViT-B/16 results to two external approaches. The first is LaCLIP—described above—and the second is from the BYOD (Bring Your Own Dataset) track of DataComp. Specifically, we compare to Nguyen et al., 2023's submission, which keeps the top 30% of raw data from the DataComp-Large pool as ranked using ViT-H/16 CLIP scores and uses BLIP2 (Li et al., 2023) to recaption the remaining 70% of images. Notably, our ViT-B/16 model, trained on the DatologyAI Retrieval-Optimized 1024M→205M dataset, achieves an average retrieval score of 59.9% as compared to Nguyen’s 58.8%. We also achieve better results than LaCLIP’s 57.3%, despite LaCLIP training on 10x more samples.

3.7.3 Take-aways from Synthetic Recaptioning

Our empirical results demonstrate that our automated curation pipeline outperforms comparable models like MetaCLIPv2 that rely heavily on manual human annotation—a costly and slow process. Specifically, our approach, leveraging the DatologyAI Retrieval-Optimized 1B dataset, achieves better retrieval scores without human intervention. This highlights how synthetic data, when effectively curated through an automated and scalable pipeline, can drive substantial improvements in retrieval performance for vision-language models.

4. Methods

4.1 Data

All of the experiments presented in this work use the DataComp benchmark dataset (Gadre et al., 2023), which we chose because it’s large, diverse, and well-established. We compare against two versions of DataComp: one that uses the “raw” DataComp data (raw baseline; referred to in the DataComp paper as “CommonPool”), and one with a level of curation applied to it that we would expect from more sophisticated practitioners (sophisticated baseline). Specifically, our sophisticated baseline uses exact (image hash-based) deduplication followed by CLIP score filtering (keeping the top 30% of data using a CLIP ViT-L/14; see Appendix: Algorithms or the DataComp paper for details). We also note that “raw” DataComp already has some amount of curation applied to it, including url-text deduplication, safety filtering, eval set deduplication, and face blurring.
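
For reference, here is a minimal, in-memory sketch of how such a "sophisticated baseline" can be constructed: exact image deduplication via content hashes, followed by keeping the top 30% of remaining samples by CLIP score. The column names and the pandas setting are illustrative assumptions; this is not the exact code we used to build the baseline.

```python
import hashlib
import pandas as pd

def sophisticated_baseline(df: pd.DataFrame, keep_frac: float = 0.30) -> pd.DataFrame:
    """df is assumed to have columns 'image_bytes' (raw image bytes) and
    'clip_score' (image-text similarity from a pretrained CLIP ViT-L/14)."""
    # 1) Exact image deduplication via content hashes.
    hashes = df["image_bytes"].map(lambda b: hashlib.sha512(b).hexdigest())
    df = df.assign(image_hash=hashes).drop_duplicates(subset="image_hash")

    # 2) CLIP score filtering: keep the top `keep_frac` of the remaining samples.
    cutoff = df["clip_score"].quantile(1.0 - keep_frac)
    return df[df["clip_score"] >= cutoff]
```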

In order to demonstrate the effectiveness of our curation pipeline across a breadth of dataset and compute scales, we applied our curation pipeline to different subsets of the DataComp CommonPool, which we refer to as pool sizes, ranging from 256M to 1.024B samples. We trained for budgets ranging from 128M to 5.12B total samples, depending on the pool size (Table 8).

Table 8: Train Budgets and Pool Sizes: We train CLIP models using three distinct dataset pools, each with varying sizes and corresponding training budgets. This table outlines the specific training budgets allocated for each pool size. Each pool corresponds to a specific number of samples from the DataComp CommonPool.

4.1.1 Retrieval- and Classification-Optimized Curation Recipes

Most academic data curation research aims to develop general-purpose curation algorithms that maximize average performance across a range of diverse benchmark evaluation tasks. This is a sensible approach in an academic context where end tasks are not known a priori, generality is prized, and aggressive reviewers may need placation. But it is somewhat at odds with many industrial and applied machine learning settings, in which models are trained for specific tasks that are known ahead of time (Urlana et al., 2024; Cohere Team, 2024). Because our curation product is intended for use in the latter setting, we chose not to develop a single, general-purpose curation recipe, and instead developed independent curation recipes for classification and retrieval (see How We Evaluate).

We note that the tooling and algorithmic building blocks for the classification and retrieval recipes are identical, but it is the precise configuration of the algorithmic building blocks that changes for the two recipes. Most notably, we keep our curation recipe fixed and agnostic to pool size and training duration to show that our pipeline is robust to scaling, despite recent work showing that tuning curation to the pool size and training duration is necessary to maximize its efficacy (Goyal et al., 2024).

The sizes of the datasets after curation are shown in Table 9. To clarify, we curate a dataset down from a certain pool size and then train on the curated dataset for the training budget shown in Table 8. For example, for a pool size of 256M and a training budget of 512M (2x) total train samples, our retrieval-optimized curation ends with a final dataset size of 58M samples, which we then train on for 512M total train samples (~8.8 epochs).

Table 9: Dataset Sizes After Curation. We started with the DataComp CommonPool as the raw baseline for each pool size, then curated it down using standard techniques (exact image deduplication followed by CLIP score filtering to keep the top 30% of data) to obtain the sophisticated baseline, or using our retrieval- or classification-optimized curation recipe. We show here the size of each dataset after curation. Models were trained for the sample budgets shown in Table 8.

4.2 How We Evaluate

Our evaluation datasets (evals) include many of those from DataComp (Gadre et al., 2023) with the addition of SugarCrepe (Hsieh et al., 2024), a set of evals that measure compositionality. All evals are conducted zero-shot as defined in the CLIP Benchmark code from the LAION group (see also Radford et al., 2021). We also evaluate on the 38 DataComp evaluation datasets (Gadre et al., 2023) in some comparisons with external models. More details and a complete list of evaluation datasets are provided in the Appendix. We split the evals into two separate groups, classification and retrieval+compositionality, based on the different skills we optimize our curation for.

4.2.1 Defining Classification and Retrieval Tasks

Classification tasks involve predicting the correct category label for an input image, measuring the model’s ability to recognize and categorize visual features. Different types of classification tasks include object, scene, and fine-grained classification. Retrieval tasks, on the other hand, focus on finding the most relevant text or image for a given query, requiring high alignment between image and caption to assess how well a model can match pairs of related information.
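
For concreteness, here is a minimal sketch of zero-shot classification with a CLIP model via OpenCLIP (retrieval evals work analogously by ranking captions or images using the same cosine similarity). The checkpoint tag, prompt template, and label set are illustrative assumptions; our actual evaluations follow the LAION CLIP Benchmark implementation.

```python
import torch
import open_clip

# Any OpenCLIP-compatible checkpoint works here; this public tag is just for illustration.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["dog", "cat", "car"]                        # illustrative label set
text_tokens = tokenizer([f"a photo of a {c}" for c in class_names])

@torch.no_grad()
def zero_shot_classify(images: torch.Tensor) -> torch.Tensor:
    """images: batch of preprocessed images, shape (B, 3, 224, 224).
    Returns the predicted class index for each image."""
    image_feats = model.encode_image(images)
    text_feats = model.encode_text(text_tokens)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feats.T).argmax(dim=-1)     # highest cosine similarity wins
```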

4.2.2 Defining and Excluding Noisy Evals

We observed that some evaluation datasets were particularly noisy: evaluations lacking monotonic improvement over training and showing performance close to chance for all models and training datasets. This is the same definition used by FineWeb to identify and exclude noisy evaluations. We employed this criterion because datasets that don't improve over training aren't learnable at these model and data scales, and therefore would only introduce noise into our evaluation.

It's crucial to note that this exclusion was based on observing the behavior of models trained on raw data, not the curated data. These criteria identified seven evals: Rendered SST2, CLEVR Counts, CLEVR Distance, Camelyon17, PatchCamelyon, KITTI Vehicle Distance, and FMoW. We exclude these evals when reporting average classification results unless otherwise noted.
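
A minimal sketch of that exclusion criterion, with an illustrative tolerance for "close to chance" (the exact thresholds are an assumption here):

```python
import numpy as np

def is_noisy_eval(checkpoint_accs: np.ndarray, chance_acc: float,
                  chance_margin: float = 0.02) -> bool:
    """Flag an eval as noisy: accuracy over the raw-baseline run's checkpoints shows
    no monotonic improvement and stays close to chance throughout training."""
    monotonic_improvement = bool(np.all(np.diff(checkpoint_accs) >= 0))
    near_chance = bool(np.all(np.abs(checkpoint_accs - chance_acc) <= chance_margin))
    return (not monotonic_improvement) and near_chance
```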

4.2.3 Grouping and Reporting Evals

In total, we evaluate on 25 classification evals and 3 retrieval evals. The classification evals can be further divided into 21 object classification evals and 4 scene classification evals. The retrieval evals comprise 2 standard retrieval evals (MSCOCO and Flickr) and the compositionality eval SugarCrepe (which itself has 7 sub-evals). We group SugarCrepe with retrieval tasks because it is formulated as an image-to-text retrieval task.

When reporting the quality of models trained on classification-optimized data, we report average accuracy across all classification tasks unless otherwise noted. And when reporting the quality of models trained on retrieval-optimized data, we report average performance across the two retrieval datasets and one compositionality dataset, unless otherwise noted.

4.3 Models, Training, and Infrastructure

While numerous advancements in modeling (Li et al., 2023; Li et al., 2023; Mu et al., 2022) and other components of training have emerged since the introduction of CLIP models, the focus of this work is on the effects of data curation, so we did not attempt to optimize model quality via any means such as architecture, optimizer, objective, batch size, etc. We chose to use standard models and training procedures.

We trained standard CLIP ViT-B/32 models (Radford et al., 2021) for the majority of our experiments. We trained CLIP ViT-S/32 models for experiments in which we demonstrate how data curation can be used to train smaller models that are competitive with larger models trained on less-curated data. We also trained CLIP ViT-B/16 models for a handful of experiments to broaden the set of public results against which we can compare our work.

We use the default ViT-B/32, ViT-S/32, and ViT-B/16 architectures from Release 2.24.0 of the OpenCLIP repository, all trained with identical hyperparameters: Adam optimizer (Kingma et al., 2014), with a maximum learning rate of 5.0e-4, a cosine learning rate decay scheduler, a batch size of 8096, and the InfoNCE contrastive loss (van den Oord et al., 2018).
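
For readers less familiar with CLIP training, here is a minimal sketch of the symmetric InfoNCE objective over in-batch negatives. This is a standard formulation, not our training code; OpenCLIP's implementation additionally handles, among other things, the learnable temperature and gathering features across GPUs.

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
                      logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over in-batch negatives.
    image_feats, text_feats: (B, D) embeddings from the image and text towers.
    logit_scale: temperature term (the exponential of a learned scalar in CLIP)."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.T      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; every other caption/image in the batch is a negative.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```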

All experiments were replicated across three random seeds except for those with training budgets of 1280M, 5120M, or 8192M samples, which were only trained with a single seed.

We train using PyTorch 2.1.2 with CUDA 12.1. All our model training and evaluation was conducted on an AWS SageMaker Hyperpod cluster of H100 SXM nodes (p5.48xlarge instance).

4.4 Building a Robust, Scalable Data Curation Pipeline

While many excellent open-source projects offer a solid foundation for basic curation on small, single-machine text datasets, scaling those efforts to apply cutting-edge research to multimodal datasets presents a unique set of challenges. First, the sheer size of the datasets means that we have to run our curation algorithms using a cluster of machines working together, with all of the challenges around network I/O, fault tolerance, and data consistency that emerge when we are working with distributed systems. Optimal curation performance requires us to use all of computer science, from how we lay data out on disk and object storage, to how we perform batch inference on GPUs, to how we organize and track assets and curation decisions as data flows through our systems. These different components can interact in subtle and often unexpected ways, which is precisely why we feel that a tight integration between research and engineering for data curation systems is so necessary to ensure that state-of-the-art techniques translate into practical, scalable, and efficient solutions.

Our data curation platform evolved from the requirements of our initial customers and our internal research team, focusing on scalability, portability, performance, and security. At its core, the platform is designed for Bring Your Own Cloud (BYOC) deployments, enabling seamless integration with our customers' existing infrastructure while remaining cloud-agnostic.

As we developed this stack we focused on the following high-level requirements:

  • Scalable: our pipeline needs to curate datasets at the scale required to train modern foundation models—billions of images and trillions of tokens.

  • Performant: the cost and speed of curation need to yield a substantial net savings on model training and deployment.

  • Portable: our data curation platform must be cloud-agnostic and not depend on resources available from just one cloud provider.

  • Secure: our product runs on the customer’s side, so their data must be handled securely, with all the proper controls in place.

4.4.1 Core Infrastructure

We chose Kubernetes as our primary container orchestration platform, providing a consistent foundation across different cloud environments. This decision has proven crucial for maintaining operational consistency and enabling portable deployments across diverse customer environments.

4.4.2 Data Processing Engine

Our primary data processing framework is Spark on Kubernetes using the Spark Operator, which offers finer-grained control over performance tuning. Specifically, it has enabled us to:

  • Optimize resource allocation more precisely

  • Fine-tune job configurations for maximum throughput

  • Maintain greater control over our processing infrastructure

We store our data primarily in Parquet format, optimizing for both storage efficiency and query performance. Looking ahead, we're evaluating Ray for our batch inference workloads, which promises to better serve those needs.

4.4.3 Workflow Orchestration with Flyte

A key differentiator in our stack is Flyte, which we leverage for complex workflow orchestration. Flyte has transformed how we manage our data curation pipelines by providing:

  • Dynamic pipeline composition, allowing us to reorder and recombine curation strategies

  • Strong type checking capabilities, ensuring reliable data flow between pipeline stages

  • Robust error handling and recovery mechanisms

The type system has proven particularly valuable, helping us catch potential issues early in the development cycle and ensuring smooth integration between different pipeline components.
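As a rough illustration of this pattern (not our actual pipeline code), the sketch below shows how typed Flyte tasks compose into a workflow; the task names, arguments, and placeholder bodies are hypothetical.

```python
from flytekit import task, workflow

@task
def exact_dedup(input_path: str) -> str:
    # Placeholder body: run the deduplication job and return the
    # path of the deduplicated dataset.
    return input_path + "/deduped"

@task
def model_based_filter(input_path: str, keep_fraction: float) -> str:
    # Placeholder body: score samples with a filtering model and keep
    # the top `keep_fraction` of them.
    return input_path + "/filtered"

@workflow
def curation_pipeline(raw_path: str) -> str:
    # Flyte checks that each task's output type matches the next task's input type.
    deduped = exact_dedup(input_path=raw_path)
    return model_based_filter(input_path=deduped, keep_fraction=0.3)
```

Because every task declares typed inputs and outputs, mismatches between pipeline stages are caught when the workflow is compiled rather than hours into a run.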

5. Wrapping Up

In this work, we introduced DatologyAI's state-of-the-art data curation pipeline and demonstrated its effectiveness through extensive experimentation with multimodal (CLIP) models. Our data curation pipeline can dramatically improve the efficiency of foundation model training across multiple dimensions:

  1. Better Models: When training ViT-B/32 for the largest fixed compute budget (5120M samples), our model quality improvements reach up to 11.3pp, depending on the task and baseline. Our most comparable curated datasets also outperform the top submissions on the DataComp Large filtering track leaderboard.

  2. Faster Training: Our curation enables substantial reductions in training compute requirements: Compared to baselines trained for 5.1B samples, we save up to 97.7% on compute (43.3x training speedup) to reach baseline accuracy. Our curation also yields models that are competitive with public models trained using more than 6x the compute and well-curated datasets that are over 10x larger.

  3. Smaller, Better Models: Our curation enables training of smaller models that are both better and more efficient: We train ViT-S/32 models on our curated data, which leads to models that reduce the cost per query by up to 2.6x and are up to 13 percentage points better than ViT-B/32 models trained on the same amount of raw baseline data.

Beyond these core results, we also showed that our curation pipeline, which includes synthetic recaptioning, outperforms other contemporary synthetic recaptioning approaches.

What’s Next

These results establish that data curation can be a powerful tool for improving the efficiency of foundation model training and deployment. However, this work represents just the beginning of what we can do with intelligent data curation—our results will only get better.

We’ve also been hard at work developing data curation for text, and we’re very excited about the results. Stay tuned for our follow-up releases on text curation for language models!

Get in Touch!

If you’re interested in pushing the bounds of what’s possible with data curation, we’re looking for talented Members of Technical Staff who have experience doing data research, building research tooling, translating science into products, and building scalable data products.

We’re starting to work with early customers. If you’re an enterprise AI company interested in training multimodal and/or text models faster, better, or smaller, sign up for our customer waitlist!

Follow us on Twitter for insights (and memes) about data!

Contributors

Core Contributors

Amro Abbas Josh Wills Haoli Yin

Contributors

Paul Burstein Ning Cao Aldo Carranza Alvin Deng

Priya Goyal Pratyush Maini Josh McGrath Fan Pan

Jack Urbanek

Interns

Vineeth Kada Muhammed Razzak Vishwa Shah Vishruth Veerendranath

Leadership and Advising

Matthew Leavitt Bogdan Gaza Ari Morcos

 

For attribution in academic contexts, please cite this work as

Abbas et al., "DatologyAI Technical Deep-Dive: Image-Text Data Curation at the Billion-Sample Scale", 2024.

BibTeX citation

@techreport{abbas_datologyai_2024,
	title = {{DatologyAI} {Technical} {Deep}-{Dive}: {Image}-{Text} {Data} {Curation} at the {Billion}-{Sample} {Scale}},
	url = {https://www.datologyai.com/post/productionized-multimodal-data-curation-at-the-billion-sample-scale},
	institution = {DatologyAI},
	author = {Abbas, Amro and Wills, Josh and Yin, Haoli and Burstein, Paul and Cao, Ning and Carranza, Aldo and Deng, Alvin and Goyal, Priya and Maini, Pratyush and McGrath, Joshua and Pan, Fan and Urbanek, Jack and Kada, Vineeth and Razzak, Muhammed and Shah, Vishwa and Veerendranath, Vishruth and Gaza, Bogdan and Morcos, Ari and Leavitt, Matthew},
	month = nov,
	year = {2024},
}
 

A. Appendix

A.1 The Big Plot

Figure A.1: Comprehensive results plot (aka The Big Plot). Here we enable viewing the results for all evaluations, model sizes, pool sizes, and curation types, each of which you can control using the dropdown menus at the top of the plot. Each larger point represents the final accuracy for a model trained for a given number of training samples (i.e. each larger point is a different model), and we linearly interpolate between larger points. We calculate FLOPs using OpenCLIP's measurements. Evaluations marked with a * are evaluations that we defined as noisy (see Methods: Defining and Excluding Noisy Evals).

A.2 Meet the Family: A Deeper Dive on our Curation Algorithms

A.2.1 Exact Deduplication

Exact deduplication is a common data curation practice, though there are a number of ways to define and implement it. Some works focus only on image URLs and their accompanying alternative (alt) text (which can be viewed as captions). DataComp (Gadre et al., 2023), for example, removes duplicate samples that have identical URLs and alt text, and LAION-400M uses a bloom filter to detect duplicate URL and alt text pairs. But different URLs can reference the same image, so some works deduplicate using image content. Zhu et al., 2024 use pHash to remove exact and near visual duplicates, and Awadalla et al., 2024 deduplicate images using SHA-256 hashes. Both works only remove images that occur more than ten times within a single Common Crawl snapshot. We chose SHA-512 hashes to remove exact image duplicates because SHA-512 is faster to compute than SHA-256 on 64-bit systems.
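As a rough sketch of what exact image deduplication can look like at this scale (the paths and column names are hypothetical, and this is not our production job), Spark can hash raw image bytes and drop duplicate rows directly on Parquet data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("exact-image-dedup").getOrCreate()

# Illustrative schema: one row per sample with the raw image bytes and caption.
samples = spark.read.parquet("s3://bucket/commonpool/raw/")  # hypothetical path

# Hash the image bytes with SHA-512 and keep one row per unique hash.
deduped = (
    samples
    .withColumn("image_sha512", F.sha2(F.col("image_bytes"), 512))
    .dropDuplicates(["image_sha512"])
)

deduped.write.mode("overwrite").parquet("s3://bucket/commonpool/deduped/")  # hypothetical path
```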

Because the fraction of exact duplicates increases as a function of dataset size (see Figure A.2), exact deduplication’s impact grows with dataset size and must be accounted for when attempting to extrapolate the effects of curation in small datasets to large datasets.


Figure A.2: Fraction of Exact Duplicates Grows with Dataset Size: The fraction of exact image duplicates (y-axis) as a function of dataset size (x-axis) for different subsamples of the DataComp CommonPool.

A.2.2 Model-Based Filtering

Heuristic filters, which are developed and selected through intuitive reasoning, can yield powerful insights when applied effectively, but recent results have shown that humans are often worse than random at guessing which examples a model will learn well from (Li et al., 2024). Moreover, hand-crafted filtering rules are a brittle foundation upon which to build a tool that works across diverse domains. Privacy concerns may also prevent accessing customer data directly, making it even more difficult to implement filtering methods that rely on manual inspection and trial-and-error. Using an existing model to filter data for training subsequent models is a more reliable approach, and is standard practice for identifying high-quality data points, for both multimodal (Gadre et al., 2023; Schuhmann et al., 2022; Zhu et al., 2024; Awadalla et al., 2024; Fang et al., 2023; Xu et al., 2023; Maini et al., 2023; Mahmoud et al., 2023) and text (Wenzek et al., 2019; Marion et al., 2025; Li et al., 2024) datasets. Models used for filtering are typically trained on high-quality datasets or datasets aligned with specific downstream tasks, such that the filtering model retains data more relevant to the target distribution.

Model-based filtering is typically implemented in one of a few ways. Classifier-based approaches (Li et al., 2024; Soldaini et al., 2024; the NSFW filters described for Llama 3.2's multimodal models) utilize pre-trained or fine-tuned classification models to directly assess each data sample. These models output a score or probability indicating whether a sample meets certain criteria, effectively filtering data based on discrete predictions. Alternatively, contrastive-based approaches (Fang et al., 2023; Schuhmann et al., 2022; Lai et al., 2023) are trained with a contrastive loss function, which encourages the model to map similar data points closer together and dissimilar ones further apart in the embedding space. This technique is particularly common in image-caption alignment filtering for multimodal data (Schuhmann et al., 2022; Lin et al., 2025), where the goal is to measure the semantic similarity between different modalities. However, model-based score generation for filtering can be computationally expensive at scale. Recent methods have proposed using a larger model to generate training data and distilling it into a smaller, more efficient filtering model (Penedo et al., 2024; Dubey et al., 2024), or employing active learning to selectively route samples to different model sizes and improve efficiency (Zhang et al., 2024).
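As a simplified example of the contrastive flavor, the sketch below scores image-caption pairs with a pretrained CLIP model and keeps the top fraction by cosine similarity, in the spirit of the CLIP-score filtering used in our sophisticated baseline. The checkpoint, the `pairs` structure, and the cutoff logic are illustrative; a production version would batch the work and run it on GPUs.

```python
import torch
import open_clip
from PIL import Image

# Pretrained CLIP used only as a scoring model (checkpoint choice is illustrative).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    # Cosine similarity between normalized image and text embeddings.
    image_features = model.encode_image(preprocess(image).unsqueeze(0))
    text_features = model.encode_text(tokenizer([caption]))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).item()

def filter_top_fraction(pairs, keep_fraction=0.3):
    # `pairs` is a hypothetical list of (PIL image, caption) tuples.
    scores = [clip_score(img, cap) for img, cap in pairs]
    cutoff = sorted(scores, reverse=True)[max(int(len(scores) * keep_fraction) - 1, 0)]
    return [p for p, s in zip(pairs, scores) if s >= cutoff]
```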

A.2.3 Embedding-based Curation

Embedding-based data curation has emerged as a widely used method for improving the quality of training data (Sorscher et al., 2022; Abbas et al., 2023; Tirumala et al., 2023; Abbas et al., 2024; Vo et al., 2024). These methods transform the data into an embedding space using a pre-trained encoder model and apply different curation algorithms on the embeddings.

For example, SemDeDup (Abbas et al., 2023) exploits the high-level semantic information captured by the embedding model to identify and remove semantic duplicates. A series of works (Sorscher et al., 2022; Tirumala et al., 2023; Abbas et al., 2024) use various clustering-based metrics computed in the embedding space to determine sample difficulty or relevance. Vo et al., 2024 employ a hierarchical clustering algorithm to balance concept distributions in pretraining data, resulting in more efficient training for both text and image datasets.

Embedding-based curation provides two key advantages. First, it can leverage the relationships between samples, rather than curating each sample in isolation as many algorithms do. Second, it reduces the need to implement modality-specific curation algorithms: once the data are embedded, curation becomes modality-agnostic. While it still requires an embedding model, the availability of many high-quality, open-source, pretrained embedding models makes it easy to extend embedding-based curation methods to novel domains and modalities.
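To make the idea concrete, here is a minimal SemDeDup-style sketch (not our implementation): cluster normalized embeddings, then drop points within each cluster that are nearly identical to an already-kept point. The cluster count and similarity threshold are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 100,
                   threshold: float = 0.95) -> np.ndarray:
    """Return indices of samples to keep after semantic deduplication."""
    # Normalize so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(normed)

    keep = []
    for c in range(n_clusters):
        cluster_idx = np.where(labels == c)[0]
        kept_in_cluster = []
        for i in cluster_idx:
            if not kept_in_cluster:
                kept_in_cluster.append(i)
                continue
            # Drop i if it is too similar to any already-kept point in the cluster.
            sims = normed[kept_in_cluster] @ normed[i]
            if sims.max() < threshold:
                kept_in_cluster.append(i)
        keep.extend(kept_in_cluster)
    return np.array(sorted(keep))
```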

A.2.4 Target Distribution-Matching

The distributional similarity between pretraining data and target data is recognized as a predictor of performance on the target task (Kirchenbauer et al., 2024). Therefore, when auxiliary data is available for a known target distribution (for example a specific task or evaluation), it can be leveraged to retrieve data points from a larger corpus that are close to the target distribution (Dai et al., 2019; Yalniz et al., 2019; Aharoni et al., 2020; Gururangan et al., 2020; Oquab et al., 2023; Yu et al., 2023; Gadre et al., 2023; Wang et al., 2024). These methods typically operate by using a model (neural or otherwise) to obtain embeddings for the samples in an uncurated dataset and the samples in a target dataset(s), computing a similarity score between the embeddings of the uncurated and target data, and selecting and ranking samples from the uncurated dataset that have high similarity to the target data in a way that attempts to match the target distribution. These retrieved samples are then upsampled during training or used to construct an entirely new dataset.

This approach is generally quite effective, although it requires a high-quality target dataset and/or a priori knowledge of the test distribution. The latter scenario is the norm in industrial settings, where models are trained intentionally for known tasks and applications. We found that the efficacy and efficiency of this algorithm are sensitive to a number of design elements, including the choice of target datasets, the similarity search and ranking algorithm, accounting for the density of the target distribution, and the proportion of retrieved data used in the final data mix.
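The core retrieval step can be sketched as follows. This is a simplified, in-memory version: a production system would use an approximate-nearest-neighbor index, chunk the similarity computation, and account for the density of the target distribution.

```python
import numpy as np

def retrieve_near_target(pool_emb: np.ndarray, target_emb: np.ndarray,
                         n_select: int) -> np.ndarray:
    """Score every pool sample by its maximum cosine similarity to any
    target sample and return the indices of the n_select highest-scoring
    pool samples."""
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    target = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    # (n_pool, n_target) similarity matrix; chunked in practice for large pools.
    scores = (pool @ target.T).max(axis=1)
    return np.argsort(-scores)[:n_select]
```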

A.2.5 Synthetic Data

Common Crawl in its raw form is a great source of images, but the associated captions can often be noisy, low quality, or missing important details relevant to the image. One way to improve the quality of image-caption pairs in the overall corpus is to filter out pairs whose caption quality is insufficient. But this discards millions of high-quality images simply because they have low-quality or meaningless captions like “DSCxxyyzz.jpg”. How, then, can we benefit from these images without being sabotaged by their poor captions?

In the diffusion model space, training models on synthetic captions (as in DALL-E 3) demonstrated the efficacy of using a pretrained model to improve the training of a subsequent model. Researchers found that training diffusion models on densely captioned images significantly boosts their performance.

Turning to CLIP models, recent research has approached the challenge of improving image-caption alignment in a similar way. Nguyen et al., 2024 and Lai et al., 2023 show the strength of recaptioning low-quality image-caption pairs with an image captioning model (such as BLIP-2; Li et al., 2023), and then training subsequent CLIP models on a hybrid dataset of real and synthetic captions. Such hybrid data not only allow us to use poorly-captioned images that would otherwise be filtered out, but also enable the addition of dense details and semantic relationships that natural captions often lack. Most recently, it has also been found that translating multilingual captions into English and training CLIP models on them can improve vision-language representations (Nguyen et al., 2024).
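As a rough sketch of what recaptioning with an off-the-shelf captioner looks like, BLIP-2 can be run through Hugging Face transformers. The checkpoint and generation settings here are illustrative, and this is not our production recaptioning setup.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative checkpoint choice; any BLIP-2 variant would follow the same pattern.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

@torch.no_grad()
def recaption(image: Image.Image) -> str:
    # Generate a synthetic caption for a single image.
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```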

A.3 Evaluations

A.3.1 0-shot Evaluation Methodology

All evaluations are conducted in a zero-shot manner, following the CLIP Benchmark code from the LAION group. For classification, text templates containing the class label (e.g., “A photo of a {class_label}”) are tokenized and encoded by the model's text encoder, creating embeddings that are averaged and normalized into representative class vectors. Each image is then encoded by the model's image encoder, and classification scores (logits) are calculated by measuring the similarity between the image embedding and each class embedding. Metrics such as top-k accuracy and average precision are used to assess performance, allowing the model to classify images without directly training on target classes by leveraging CLIP's joint understanding of visual and textual data.
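A minimal sketch of this procedure with OpenCLIP follows; the class names, templates, pretrained tag, and dummy image batch are illustrative.

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["dog", "cat", "airplane"]
templates = ["a photo of a {}.", "a blurry photo of a {}."]

with torch.no_grad():
    # Build one averaged, normalized text embedding per class.
    class_embeddings = []
    for name in class_names:
        texts = tokenizer([t.format(name) for t in templates])
        emb = model.encode_text(texts)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        emb = emb.mean(dim=0)
        class_embeddings.append(emb / emb.norm())
    classifier = torch.stack(class_embeddings, dim=1)  # (embed_dim, n_classes)

    # Classify a batch of images by cosine similarity to the class vectors.
    images = torch.randn(2, 3, 224, 224)  # dummy batch standing in for real data
    image_features = model.encode_image(images)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    logits = image_features @ classifier
    predictions = logits.argmax(dim=-1)
```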

For retrieval, images and their corresponding captions are encoded using the model's image and text encoders to generate embeddings, which are then normalized and stored. For text-to-image retrieval, scores are calculated by computing the similarity (via dot product) between each text embedding and all image embeddings (and vice-versa for image-to-text retrieval). Positive pairs — correct matches between images and texts — are identified based on their original associations. Metrics such as recall@k are computed by checking whether at least one correct match appears among the top-k retrieved items for each query.

Similarly, SugarCrepe (the compositionality evaluation) is formulated as an image-to-text retrieval task. Unlike standard retrieval tasks where the incorrect candidates differ significantly from the correct text, compositionality benchmarks intentionally design hard negative texts that differ minimally from the positive text (Hsieh et al., 2023). This approach tests whether the model truly understands the fine-grained atomic concepts that compose a scene.

A.3.2 Banging the Gavel on WinoGAVIL

We choose not to include WinoGAVIL (Bitton et al., 2022) as part of the retrieval evals, even though DataComp does so, because we believe that commonsense reasoning over vision-language associations belongs in its own group and involves reasoning across multiple images, which cannot be considered classical retrieval. For example, given an image/caption of a werewolf, an associated image/caption of the moon would need to be retrieved. In a traditional retrieval setting, however, this would be scored as a negative, and performing well on it would require an explicitly different curation and training paradigm.

A.3.3 Calculating Recall@k

For all our retrieval results, we report the average of the image-to-text recall@k and the text-to-image recall@k, where k=1.

Image-to-text recall@k is computed as the number of images in the dataset for which the correct text (caption) is retrieved among the top k captions, divided by the dataset size. For each image query, the top k retrieved captions are the k captions with the highest cosine similarity to the query image. We compute text-to-image recall@k following the same steps with the roles of images and captions swapped.

We note that for the MSCOCO and Flickr datasets, each image typically has multiple (usually 5) correct captions. This means for image-to-text recall@k, the model can make a correct prediction by retrieving any of the 5 captions among the top k captions. This fact results in the image-to-text recall@k being higher than text-to-image recall@k for almost all the CLIP models.
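A minimal sketch of this computation given a precomputed similarity matrix is shown below; the shapes and the boolean positive-pair matrix are illustrative.

```python
import numpy as np

def image_to_text_recall_at_k(sim: np.ndarray, positive_pairs: np.ndarray,
                              k: int = 1) -> float:
    """Image-to-text recall@k. `sim` is an (n_images, n_texts) cosine-similarity
    matrix and `positive_pairs` is a boolean matrix of the same shape marking
    correct image-caption matches (an image may have several correct captions,
    as in MSCOCO/Flickr). An image counts as correct if any of its correct
    captions appears among its top-k most similar texts."""
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the top-k texts per image
    hits = positive_pairs[np.arange(sim.shape[0])[:, None], topk].any(axis=1)
    return float(hits.mean())

# Text-to-image recall@k follows the same steps with sim.T and positive_pairs.T;
# we report the average of the two directions at k=1.
```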

A.3.4 Complete List of Evaluation Datasets

