AIs with alien motivations can still follow instructions safely on the inputs that matter. Text version here: https://joecarlsmith.com/2025/11/12/how-human-like-do-safe-ai-motivations-need-to-be/
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith Audio
1 hour 27 minutes
11 months ago
What can we learn from recent empirical demonstrations of scheming in frontier models? Text version here: https://joecarlsmith.com/2024/12/18/takes-on-alignment-faking-in-large-language-models/