
This research formalizes reasoning in large language models as a latent variable model and applies the expectation-maximization (EM) algorithm to improve performance. The authors show that training a model to generate intermediate rationales before answering is mathematically equivalent to reward-weighted fine-tuning with binary correctness as the reward signal. A central focus of the study is the sampling distribution used to generate these rationales; the authors compare approaches such as rejection sampling and the self-taught reasoner (STaR). The paper introduces prompt posterior sampling (PPS), a technique that conditions the model on the correct answer during training to generate more effective reasoning traces. Experiments across multiple benchmarks show that PPS consistently outperforms existing methods, producing more concise and accurate rationales. Ultimately, the work argues that high-quality rationale generation is just as critical to model improvement as the underlying optimization algorithm.
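
To make the claimed equivalence concrete, the following sketch spells out the latent-variable objective; the symbols x (question), z (rationale), y (answer), q (variational distribution over rationales), and r (binary reward) are shorthand introduced here and may not match the paper's exact notation. Treating the rationale as a latent variable, the marginal likelihood of the answer is lower-bounded in the usual EM fashion:

\[
\log p_\theta(y \mid x) \;=\; \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
\;\ge\; \mathbb{E}_{z \sim q}\!\left[ \log \frac{p_\theta(z, y \mid x)}{q(z)} \right].
\]

The E-step approximates the rationale posterior \(p_\theta(z \mid x, y)\) with sampled rationales that are kept only when they lead to the correct answer; the M-step then reduces to

\[
\max_\theta \;\; \mathbb{E}_{z \sim p_{\theta_{\text{old}}}(z \mid x)}\!\left[\, r(y, z)\, \log p_\theta(z, y \mid x) \right],
\qquad r(y, z) = \mathbf{1}\{\, z \text{ yields the correct answer } y \,\},
\]

which is reward-weighted fine-tuning with a binary correctness reward. On this reading, rejection sampling, STaR, and PPS differ mainly in the proposal distribution used for z; conditioning the model on the correct answer, as PPS does, can be seen as steering that proposal toward the rationale posterior, though the exact construction is the paper's.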