Two: Ajeya Cotra on accidentally teaching AI models to deceive us

The 80,000 Hours Podcast on Artificial Intelligence (September 2023)

80,000 Hours

14 episodes

3 months ago

A compilation of ten key episodes on artificial intelligence and related topics from 80,000 Hours. Together they'll help you learn about how AI looks from a broadly longtermist, existential risk, or effective altruism flavoured point of view.

Science

Society & Culture

2 hours 49 minutes

2 years ago

Originally released in May 2023.

Imagine you are an orphaned eight-year-old whose parents left you a $1 trillion company, and no trusted adult to serve as your guide to the world. You have to hire a smart adult to run that company, guide your life the way that a parent would, and administer your vast wealth. You have to hire that adult based on a work trial or interview you come up with. You don't get to see any resumes or do reference checks. And because you're so rich, tonnes of people apply for the job — for all sorts of reasons.

Today's guest Ajeya Cotra — senior research analyst at Open Philanthropy — argues that this peculiar setup resembles the situation humanity finds itself in when training very general and very capable AI models using current deep learning methods.

Links to learn more, summary and full transcript.

As she explains, such an eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care about you while you're monitoring them, but intend to use the job to enrich themselves as soon as they think they can get away with it.

Like a child trying to judge adults, at some point humans will be required to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people and that greatly outclass humans in knowledge, experience, breadth, and speed. Tricky!

Can't we rely on how well models have performed at tasks during training to guide us? Ajeya worries that it won't work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes:

Saints — models that care about doing what we really want
Sycophants — models that just want us to say they've done a good job, even if they get that praise by taking actions they know we wouldn't want them to
Schemers — models that don't care about us or our interests at all, who are just pleasing us so long as that serves their own agenda

And according to Ajeya, there are also ways we could end up actively selecting for motivations that we don't want.

In today's interview, Ajeya and Rob discuss the above, as well as:

How to predict the motivations a neural network will develop through training
Whether AIs being trained will functionally understand that they're AIs being trained, the same way we think we understand that we're humans living on planet Earth
Stories of AI misalignment that Ajeya doesn't buy into
Analogies for AI, from octopuses to aliens to can openers
Why it's smarter to have separate planning AIs and doing AIs
The benefits of only following through on AI-generated plans that make sense to human beings
What approaches for fixing alignment problems Ajeya is most excited about, and which she thinks are overrated
How one might demo actually scary AI failure mechanisms

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris

Audio mastering: Ryan Kessler and Ben Cordell

Transcriptions: Katy Moore