
Jed Sundwall and Drew Breunig explore why LLM progress is getting harder by examining the foundational data products that powered AI breakthroughs. They discuss how we've consumed the "low-hanging fruit" of internet data and graphics innovations, and what this means for the future of AI development.
The conversation traces three datasets that shaped AI: MNIST (1994), the handwritten-digits dataset that became machine learning's "Hello World"; ImageNet (2008), Fei-Fei Li's image dataset that launched deep learning through AlexNet's 2012 breakthrough; and Common Crawl (2007), Gil Elbaz's web-crawling project that supplied roughly 60% of GPT-3's training data. Drew argues that great data products create ecosystems around themselves, pointing to the Enron email dataset as an example of how a single data release can generate thousands of research papers and enable countless startups. The episode concludes with a discussion of benchmarks as modern data products and the challenge of building sustainable data infrastructure for the next generation of AI systems.
Links and Resources:
- Common Crawl Foundation Event - October 22nd event at Stanford!
- Cloud-Native Geospatial Forum Conference 2026 - 6-9 October 2026 at Snowbird, Utah!
- Why LLM Advancements Have Slowed: The Low-Hanging Fruit Has Been Eaten - Drew's blog post that inspired this conversation
- Unicorns, Show Ponies, and Gazelles - Jed's vision for sustainable data organizations
- ARC AGI Benchmark - François Chollet's reasoning benchmark
- Thinking Machines Lab - Mira Murati's reproducibility research lab
- Terminal Bench - Stanford's coding agent evaluation benchmark
- Data Science at the Singularity - David Donoho's masterful paper examining the power of frictionless reproducibility
- Rethinking Dataset Discovery with DataScout - New paper on helping researchers find relevant datasets
- MNIST Dataset - The foundational machine learning dataset on Hugging Face
Key Takeaways:
1. Great data products create ecosystems - They don't just provide data; they enable entire communities and industries to flourish
2. Benchmarks are data products with intent - They encode values and shape the direction of AI development
3. We've consumed the easy wins - The internet data and graphics-hardware innovations that powered early AI breakthroughs are largely exhausted
4. The future is specialized - Progress will come from domain-specific datasets, benchmarks, and applications rather than general models
5. Data markets need new models - Traditional approaches to data sharing may not work in the AI era