How Hackers Keep AI Safe: Inside the World of AI Red Teaming

https://is1-ssl.mzstatic.com/image/thumb/Podcasts211/v4/01/1c/4f/011c4f19-1f8b-29e3-6acf-78be44b020ba/mza_15450455317821352510.jpg/600x600bb.jpg

Ethical Bytes | Ethics, Philosophy, AI, Technology

Carter Considine

35 episodes

2 weeks ago

Ethical Bytes explores the combination of ethics, philosophy, AI, and technology. More info: ethical.fm

Society & Culture

RSS

All content for Ethical Bytes | Ethics, Philosophy, AI, Technology is the property of Carter Considine and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.

Ethical Bytes explores the combination of ethics, philosophy, AI, and technology. More info: ethical.fm

Society & Culture

https://d3t3ozftmdmh3i.cloudfront.net/staging/podcast_uploaded_nologo/42178869/42178869-1730013614624-c83a0b4b66f1e.jpg

How Hackers Keep AI Safe: Inside the World of AI Red Teaming

Ethical Bytes | Ethics, Philosophy, AI, Technology

27 minutes 6 seconds

2 months ago

How Hackers Keep AI Safe: Inside the World of AI Red Teaming

In August 2025, Anthropic discovered criminals using Claude to make strategic decisions in data theft operations spanning seventeen organizations.

The AI evaluated financial records, determined ransom amounts reaching half a million dollars, and chose victims based on their capacity to pay. Rather than following a script, the AI was making tactical choices about how to conduct the crime.

Unlike conventional software with predictable failure modes, large language models respond to conversational manipulation. An eleven-year-old at a Las Vegas hacking conference successfully compromised seven AI systems, which shows that technical expertise isn't required.

That accessibility transforms AI security into a challenge unlike anything cybersecurity has faced before. This makes red teaming essential. Organizations hire people to probe their systems for weaknesses before criminals find them.

These models process everything as undifferentiated text streams. You could say it’s an architectural issue. System instructions and user input flow together without clear boundaries.

Security researcher Simon Willison, who named this "prompt injection," confesses he sees no reliable solution. Many experts believe the problem may be inherent to how these systems work.

Real-world testing exposes severe vulnerabilities. Third-party auditors found that more than half their attempts to coax weapons information from Google's systems succeeded in certain setups. Researchers pulled megabytes of training data from ChatGPT for around two hundred dollars. A 2025 study showed GPT-4 could be jailbroken 87.2 percent of the time.

Today's protections focus on reducing rather than eliminating risk.

Tools like Lakera Guard detect attacks in real-time, while guidance from NIST, OWASP, and MITRE provides strategic frameworks. Meanwhile, underground markets price AI exploits between fifty and five hundred dollars, and criminal operations build malicious tools despite safeguards.

When all’s said and done, red teaming offers our strongest defense against threats that may prove impossible to completely resolve.

Key Topics:

Criminal Use of AI (00:00)
The Origins: Breaking Things in the Cold War (02:57)
When a Bug is a Core Functionality (05:40)
Testing at Scale (10:30)
When Attacks Succeed (12:55)
What Works (17:06)
The Democratization of Hacking (19:09)
What Two Years of Red Teaming Tells Us (21:01)
The Arms Race Ahead (23:58)

More info, transcripts, and references can be found at ethical.fm