Datology AI

🌞 Summer of Data Seminar: Diving Deep into Data Research

We're hosting a weekly data seminar series at Datology AI featuring fun and thoughtful researchers pushing the boundaries of pretraining and data curation. Are you data-obsessed yet?

Company Updates

Written by DatologyAI

Welcome to the Summer of Data Seminar series, hosted by Datology AI. Each week, we invite friends, collaborators, and curious minds doing interesting research in data and pretraining to share what they’re excited about—with a small group of us at the office (or virtually).

Whether the topic is filtering data at scale, reimagining pretraining from scratch, or simply what makes a dataset “good”, this seminar is a space to think out loud, ask questions, and hang out with others who are data-obsessed.


🧠 About The Seminar

The Summer of Data Seminar is a casual internal series where we bring in folks doing great research around:

  • Dataset design, pretraining, or scaling laws
  • Synthetic data and data-centric alignment
  • Data contamination, memorization, unlearning
  • Anything else weird and interesting about data

Each session includes a short talk (30–40 mins) followed by open-ended discussion, questions, and some good old-fashioned geeking out. We record these sessions to share with the broader community on YouTube, while keeping the live discussions cozy with just our team.

The joy of research is in sharing it. And asking the hard questions together.


📅 Event Schedule

Stay tuned for upcoming talks. Here's who we've hosted so far:

| Presenter | Date | Affiliation | Topic | Resources |
|---|---|---|---|---|
| Charlie Snell | May 5, 2025 | UC Berkeley | Scaling Test-Time Compute & Predicting Emergent Capabilities by Finetuning | |
| Maximilian Böther | May 7, 2025 | ETH Zurich | Mixtera: A Data Plane for Foundation Model Training | |
| Shizhe Diao | May 19, 2025 | NVIDIA Research | CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training | |
| Alexander Gurung | June 2, 2025 | University of Edinburgh | Learning to Reason for Long-Form Story Generation | |
| Jacob Springer | June 9, 2025 | CMU | Echo Embeddings & Overtrained Language Models Are Harder to Fine-Tune | |
| Suhas Kotha | June 16, 2025 | Stanford | Standard Fine-Tuning Inefficiently Uses Rare Data | |
| Xindi Wu | June 23, 2025 | Princeton | COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning | |
| Jaehun Jung | June 30, 2025 | UW & NVIDIA | Prismatic Synthesis & G-Vendi Score: How Data Diversification Makes R1-32B a Better Teacher than R1-671B | |
| Etash Guha | July 7, 2025 | Stanford & UW | OpenThoughts3 | |
| William Held | July 21, 2025 | Georgia Tech | Optimizing Pretraining Data Mixtures with LLM-Estimated Utility | |
| Vishaal Udandarao | July 25, 2025 | U Tübingen | ACID: Active Data Curation Effectively Distills Large-Scale Multimodal Models | |
| Rosie Zhao | July 28, 2025 | Harvard University | Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining | |
(Resource links will be added closer to the event dates.)

🎤 Want to Present?

If you’re working on something fun and would enjoy chatting about it with a bunch of thoughtful nerds, we’d love to have you join us.

To suggest a talk, send us a message (by email or DM to @datologyai) with:

  • Your name + topic
  • A rough title
  • Any weeks that are particularly good or bad for your schedule

We’ll take it from there.


We’re excited for a summer of curiosity, great conversations, and rabbit holes we didn’t expect to fall into.

Stay data-obsessed. 🤓🚀



Ready for better data?

Let’s make models better through better data, automatically.

Book a Call