
Jed Sundwall and Drew Breunig explore why LLM progress is getting harder by examining the foundational data products that powered AI breakthroughs. They discuss how we've consumed the "low-hanging fruit" of internet data and graphics innovations, and what this means for the future of AI development.
The conversation traces three datasets that shaped AI: MNIST (1994), the handwritten-digits dataset that became machine learning's "Hello World"; ImageNet (2008), Fei-Fei Li's image dataset that launched deep learning through AlexNet's 2012 breakthrough; and Common Crawl (2007), Gil Elbaz's web-crawling project that supplied roughly 60% of GPT-3's training data. Drew argues that great data products create ecosystems around themselves, pointing to the Enron email dataset as an example of how a single data release can generate thousands of research papers and enable countless startups. The episode concludes with a discussion of benchmarks as modern data products and the challenge of building sustainable data infrastructure for the next generation of AI systems.
Links and Resources:
- Common Crawl Foundation Event - October 22nd event at Stanford!
- Cloud-Native Geospatial Forum Conference 2026 - 6-9 October 2026 at Snowbird, Utah!
- Why LLM Advancements Have Slowed: The Low-Hanging Fruit Has Been Eaten - Drew's blog post that inspired this conversation
- Unicorns, Show Ponies, and Gazelles - Jed's vision for sustainable data organizations
- ARC AGI Benchmark - François Chollet's reasoning benchmark
- Thinking Machines Lab - Mira Murati's reproducibility research lab
- Terminal Bench - Stanford's coding agent evaluation benchmark
- Data Science at the Singularity - David Donoho's masterful paper examining the power of frictionless reproducibility
- Rethinking Dataset Discovery with DataScout - New paper on helping researchers find relevant datasets
- MNIST Dataset - The foundational machine learning dataset on Hugging Face
Key Takeaways:
1. Great data products create ecosystems - They don't just provide data; they enable entire communities and industries to flourish
2. Benchmarks are data products with intent - They encode values and shape the direction of AI development
3. We've consumed the easy wins - The internet data and graphics-hardware innovations that powered early AI breakthroughs are largely exhausted
4. The future is specialized - Progress will come from domain-specific datasets, benchmarks, and applications rather than general models
5. Data markets need new models - Traditional approaches to data sharing may not work in the AI era