HACKATHON: Evals November 2023 (1)

Into AI Safety

Jacob Haimes

25 episodes

1 month ago

The Into AI Safety podcast aims to make it easier for everyone, regardless of background, to get meaningfully involved with the conversations surrounding the rules and regulations which should govern the research, development, deployment, and use of the technologies encompassed by the term "artificial intelligence" or "AI". For better formatted show notes, additional resources, and more, go to https://kairos.fm/intoaisafety/

All content for Into AI Safety is the property of Jacob Haimes and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.

Technology, Science, Mathematics

1 hour 8 minutes

2 years ago

This episode kicks off our first subseries, which will consist of recordings taken during my team's meetings for the AlignmentJams Evals Hackathon in November of 2023. Our team won first place, so you'll be listening to a process which, at the end of the day, turned out to be pretty good.

Check out Apart Research, the group that runs the AlignmentJams Hackathons.

Links to all articles/papers which are mentioned throughout the episode can be found below, in order of their appearance.

Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains
New paper shows truthfulness & instruction-following don't generalize by default
Generalization Analogies Website
Discovering Language Model Behaviors with Model-Written Evaluations
Model-Written Evals Website
OpenAI Evals GitHub
METR (previously ARC Evals)
Goodharting on Wikipedia
From Instructions to Intrinsic Human Values, a Survey of Alignment Goals for Big Models
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Will Releasing the Weights of Future Large Language Models Grant Widespread Access to Pandemic Agents?
Building Less Flawed Metrics: Understanding and Creating Better Measurement and Incentive Systems
EleutherAI's Model Evaluation Harness
Evalugator Library