Talking Machines by SU PARK
Su Park
9 episodes
6 days ago
Join Su Park as she invites various guests to unpack the hottest Artificial Intelligence papers off the press. Each episode dives into the newest discoveries in AI and the sci-fi-slowly-becoming-our-reality era we’re living in.
Education
How to Pick the Best Pretraining Data
17 minutes 30 seconds
8 months ago

This episode of "Talking Machines by Su Park" looks at how to choose pretraining datasets for Large Language Models, a decision that strongly affects both model performance and training cost. The discussion centers on a recent paper from the Allen Institute for AI, which proposes a way to compare candidate datasets using small-scale experiments rather than expensive full-scale training runs.


The episode highlights two main insights from the paper. First, the proposed suite of models, DataDecide, lets researchers use small-scale experiments to predict which datasets will yield the best results for larger models. This approach reaches approximately 80% accuracy in predicting performance outcomes, which reduces the need for costly trial and error. Second, the research shows which benchmarks correlate with high performance, offering guidance for future dataset selection in AI training.
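
To make the idea of "predicting the best dataset from small experiments" concrete, here is a minimal sketch of one way such accuracy could be measured: for every pair of candidate datasets, check whether the dataset that wins at small scale also wins at large scale. This is an illustration only, not the paper's code. The function name `pairwise_decision_accuracy`, the corpus names, and all scores are made up for the example.

```python
from itertools import combinations


def pairwise_decision_accuracy(small_scores, large_scores):
    """Fraction of dataset pairs ranked the same way at small and large scale.

    Both arguments map a dataset name to a benchmark score (higher is better).
    Pairs that tie at the large scale are skipped, since they have no winner.
    """
    names = sorted(small_scores)
    agree = 0
    total = 0
    for a, b in combinations(names, 2):
        small_diff = small_scores[a] - small_scores[b]
        large_diff = large_scores[a] - large_scores[b]
        if large_diff == 0:
            continue  # no clear winner at the target scale
        total += 1
        if small_diff * large_diff > 0:
            agree += 1  # same winner at both scales
    return agree / total if total else float("nan")


# Hypothetical benchmark scores for four candidate corpora (invented numbers).
small = {"corpus_a": 0.31, "corpus_b": 0.35, "corpus_c": 0.33, "corpus_d": 0.30}
large = {"corpus_a": 0.52, "corpus_b": 0.58, "corpus_c": 0.50, "corpus_d": 0.47}

accuracy = pairwise_decision_accuracy(small, large)
print(f"Pairwise decision accuracy: {accuracy:.2f}")
```

In this toy data, the small models misrank only one pair (corpus_a versus corpus_c), so five of six pairwise decisions carry over to the larger scale.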


"DataDecide: How to Predict Best Pretraining Data with Small Experiments" by Allen Institute for AI: https://arxiv.org/abs/2504.11393
