YAAP (Yet Another AI Podcast)
AI21
11 episodes
2 weeks ago
Technology
All content for YAAP (Yet Another AI Podcast) is the property of AI21 and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.
RLVR Lets Models Fail Their Way to the Top
YAAP (Yet Another AI Podcast)
49 minutes
3 months ago
Think you know fine-tuning? If your answer is RLHF, you don’t. In this episode, Itay, who leads the Alignment group at AI21, gives a no-fluff crash course on RLVR (Reinforcement Learning with Verifiable Rewards), the method powering today’s smartest coding and reasoning models. He explains why RLVR beats RLHF at its own game, how “hard to solve, easy to verify” tasks unlock exploration without chaos, and the emergent behaviors you only get when models are allowed to screw up. If you want to actually understand RLVR (and use it), start here.
Key topics:
- How RLVR outsmarts RLHF in real-world training
- The “verified rewards” trick that kills reward hacking
- Emergent skills you don’t get with hand-holding: self-verification, backtracking, multi-path reasoning
- Why coding models took a giant leap forward
- Practical steps to train (and actually benefit from) RLVR models
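To make the “hard to solve, easy to verify” idea concrete, here is a minimal sketch of the kind of verifiable reward RLVR trains against, as opposed to RLHF’s learned reward model. This is not from the episode; the answer format and the extract_final_answer helper are illustrative assumptions.

    # Minimal sketch of a verifiable reward: the score comes from a programmatic
    # check, not a learned judge. The "#### <number>" answer format is an
    # assumption made for illustration.
    import re

    def extract_final_answer(completion: str) -> str | None:
        """Pull the model's final answer, e.g. '... #### 42'."""
        match = re.search(r"####\s*(-?\d+)", completion)
        return match.group(1) if match else None

    def verifiable_reward(completion: str, ground_truth: str) -> float:
        """Binary reward: 1.0 if the checkable answer matches, else 0.0.
        Hard to solve (the model must reason its way there),
        easy to verify (a string comparison)."""
        answer = extract_final_answer(completion)
        return 1.0 if answer == ground_truth else 0.0

    # During RL training, many sampled completions fail (reward 0.0); only the
    # verifiably correct ones get reinforced.
    sampled = "Try 6*7 = 41... no, recheck: 6*7 = 42. #### 42"
    print(verifiable_reward(sampled, "42"))  # 1.0

Because the reward is a programmatic check rather than a learned judge, there is little for the policy to reward-hack: a completion either verifies or it does not, and failed attempts simply earn zero, which is what lets the model fail its way to the top.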