Datology AI

🌞 Summer of Data Seminar: Diving Deep into Data Research

We're hosting a weekly data seminar series at Datology AI featuring fun and thoughtful researchers pushing the boundaries of pretraining and data curation. Are you data-obsessed yet?

Company Updates

Written by DatologyAI

Welcome to the Summer of Data Seminar series, hosted by Datology AI. Each week, we invite friends, collaborators, and curious minds doing interesting research in data and pretraining to share what they’re excited about—with a small group of us at the office (or virtually).

Whether the topic is filtering data at scale, reimagining pretraining from scratch, or simply what makes a dataset “good”, this seminar is a space to think out loud, ask questions, and hang out with others who are data-obsessed.


🧠 About The Seminar

The Summer of Data Seminar is a casual internal series where we bring in folks doing great research around:

  • Dataset design, pretraining, or scaling laws
  • Synthetic data and data-centric alignment
  • Data contamination, memorization, unlearning
  • Anything else weird and interesting about data

Each session includes a short talk (30–40 mins) followed by open-ended discussion, questions, and some good old-fashioned geeking out. We record these sessions to share with the broader community on YouTube, while keeping the live discussions cozy with just our team.

The joy of research is in sharing it. And asking the hard questions together.


📅 Event Schedule

Stay tuned for upcoming talks. Here's who we've hosted so far:

| Presenter | Date | Affiliation | Topic | Resources |
|---|---|---|---|---|
| Charlie Snell | May 5, 2025 | UC Berkeley | Scaling Test-Time Compute & Predicting Emergent Capabilities by Finetuning | |
| Maximilian Böther | May 7, 2025 | ETH Zurich | Mixtera: A Data Plane for Foundation Model Training | |
| Shizhe Diao | May 19, 2025 | NVIDIA Research | CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training | |
| Alexander Gurung | June 2, 2025 | University of Edinburgh | Learning to Reason for Long-Form Story Generation | |
| Jacob Springer | June 9, 2025 | CMU | Echo Embeddings & Overtrained Language Models Are Harder to Fine-Tune | |
| Suhas Kotha | June 16, 2025 | Stanford | Standard Fine-Tuning Inefficiently Uses Rare Data | |
| Xindi Wu | June 23, 2025 | Princeton | COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning | |
| Jaehun Jung | June 30, 2025 | UW & NVIDIA | Prismatic Synthesis & G-Vendi Score: How Data Diversification Makes R1-32B a Better Teacher than R1-671B | |
| Etash Guha | July 7, 2025 | Stanford & UW | OpenThoughts3 | |
| William Held | July 21, 2025 | Georgia Tech | Optimizing Pretraining Data Mixtures with LLM-Estimated Utility | |
| Vishaal Udandarao | July 25, 2025 | U Tübingen | ACID: Active Data Curation Effectively Distills Large-Scale Multimodal Models | |
| Rosie Zhao | July 28, 2025 | Harvard University | Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining | |
(Resource links will be added closer to the event dates.)

🎤 Want to Present?

If you’re working on something fun and would enjoy chatting about it with a bunch of thoughtful nerds, we’d love to have you join us.

To suggest a talk, send us a message (by email or DM to @datologyai) with:

  • Your name + topic
  • A rough title
  • Any weeks that are particularly good or bad for your schedule

We’ll take it from there.


We’re excited for a summer of curiosity, great conversations, and rabbit holes we didn’t expect to fall into.

Stay data-obsessed. 🤓🚀



Ready for better data?

Let’s make models better through better data, automatically.

Book a Call