SRE Reliability Principles: The 26% Problem - Error Budgets, SLOs, Platform Engineering

Platform Engineering Playbook Podcast

vibesre

30 episodes

1 day ago

Technology

15 minutes

4 days ago

Only 26% of organizations actively use SLOs, even after a decade of Google's SRE principles being treated as gospel. We explore why adoption is so low despite 49% saying they're more relevant than ever, which principles remain timeless (error budgets, embracing risk, blameless postmortems), and how to adapt SRE to 2025's complexity: AI/ML systems, Platform Engineering collaboration, and multi-cloud chaos. Includes practical playbooks for starting from zero, fixing ignored SLOs, and ML-specific adaptations. The key insight: the SRE principles are not wrong; implementation is harder than anticipated, and the philosophy remains timeless when properly adapted.

In this episode:
- Only 26% of organizations use SLOs despite 85% adopting OpenTelemetry. Process transformation is harder than tooling, and unrealistic targets (99.99% = 52 min/year of downtime) undermine entire systems.
- Error budget fundamentals remain timeless: with a 99.999% SLO, a problem that fails 0.0002% of requests spends 20% of the quarterly budget, which turns reliability from political arguments into data-driven release decisions.
- Platform Engineering ($115K average) and SRE ($127K average) are complementary, not competitive. Platform teams build systems, SREs ensure reliability, and both can use error budget thinking for alignment.
- AI/ML systems need adapted SRE principles: data freshness SLOs, model drift detection, training pipeline reliability, and different error budget math (one LLM training failure = tens of thousands of dollars in lost compute).
- Starting from zero: pick 3-5 critical services, set one SLO per service initially (99.9% = 43 minutes/month of downtime is reasonable), automate with OpenTelemetry from day one, get cross-functional buy-in, and target a 12-month timeline.

Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience looking to level up their reliability engineering skills.

Episode URL: https://platformengineeringplaybook.com/podcasts/00025-sre-reliability-principles
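The downtime and error budget figures quoted in the show notes can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the episode: the function names are hypothetical, and it assumes a 30-day month, a 365-day year, and an SLO expressed as a simple percentage of time or requests.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of full downtime a time-based SLO permits over the window."""
    error_budget_fraction = 1 - slo_percent / 100
    return window_days * MINUTES_PER_DAY * error_budget_fraction


def budget_spent_fraction(slo_percent: float, failure_percent: float) -> float:
    """Share of the error budget consumed by a given failure percentage.

    Both percentages are assumed to be measured over the same window,
    so the window length cancels out.
    """
    error_budget_percent = 100 - slo_percent
    return failure_percent / error_budget_percent


# 99.9% over an assumed 30-day month: roughly 43.2 minutes
print(round(allowed_downtime_minutes(99.9, 30), 1))

# 99.99% over an assumed 365-day year: roughly 52.6 minutes
print(round(allowed_downtime_minutes(99.99, 365), 1))

# 99.999% SLO with 0.0002% of requests failing: about 20% of the budget
print(round(budget_spent_fraction(99.999, 0.0002) * 100, 1))
```

The same ratio is what makes an error budget usable for release decisions: once the spent share is a number, a team can agree in advance on the threshold at which feature releases pause.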