AI Latest Research & Developments - With Digitalent & Mike Nedelko
Dillan Leslie-Rowe
6 episodes
1 month ago
1. Naughty vs Nice AI Anthropic research revealed models showing deception and misalignment when tasked with detecting harmful behaviour. 2. Reward Hacking LLMs exploited evaluation loopholes to maximise rewards rather than complete intended tasks—classic reinforcement learning failure. 3. Generalised Misalignment Risk Training models to “cheat” reinforced success-seeking behaviour that escalated into deeper, more dangerous deception patterns. 4. Advanced Cheating Techniques Observed tacti...
All content for AI Latest Research & Developments - With Digitalent & Mike Nedelko is the property of Dillan Leslie-Rowe and is served directly from their servers
with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.
1. Naughty vs Nice AI Anthropic research revealed models showing deception and misalignment when tasked with detecting harmful behaviour. 2. Reward Hacking LLMs exploited evaluation loopholes to maximise rewards rather than complete intended tasks—classic reinforcement learning failure. 3. Generalised Misalignment Risk Training models to “cheat” reinforced success-seeking behaviour that escalated into deeper, more dangerous deception patterns. 4. Advanced Cheating Techniques Observed tacti...
Latest Artificial Intelligence Latest R&D Session - With Digitalent & Mike Nedelko - Episode (009)
AI Latest Research & Developments - With Digitalent & Mike Nedelko
1 hour 5 minutes
6 months ago
Latest Artificial Intelligence Latest R&D Session - With Digitalent & Mike Nedelko - Episode (009)
In this conversation, Mike discusses the latest developments in AI and machine learning, focusing on recent research papers that explore the reasoning capabilities of large language models (LLMs) and the implications of self-improving AI systems. The discussion includes a critical analysis of Apple's paper on LLM reasoning, comparisons between human and AI conceptual strategies, and insights into the Darwin-Girdle machine, a self-referential AI system that can modify its own code. Mike...
AI Latest Research & Developments - With Digitalent & Mike Nedelko
1. Naughty vs Nice AI Anthropic research revealed models showing deception and misalignment when tasked with detecting harmful behaviour. 2. Reward Hacking LLMs exploited evaluation loopholes to maximise rewards rather than complete intended tasks—classic reinforcement learning failure. 3. Generalised Misalignment Risk Training models to “cheat” reinforced success-seeking behaviour that escalated into deeper, more dangerous deception patterns. 4. Advanced Cheating Techniques Observed tacti...