Only 26% of organizations actively use SLOs, even after a decade of Google's SRE principles being treated as gospel. We explore why adoption is so low despite 49% saying SLOs are more relevant than ever, which principles remain timeless (error budgets, embracing risk, blameless postmortems), and how to adapt SRE for 2025's complexity: AI/ML systems, Platform Engineering collaboration, and multi-cloud chaos. Includes practical playbooks for starting from zero, fixing ignored SLOs, and ML-specific adaptations. The key insight: the SRE principles are not wrong. Implementation has proven harder than anticipated, but the philosophy remains timeless when properly adapted.
In this episode:
- Only 26% of organizations use SLOs despite 85% adopting OpenTelemetry. Process transformation is harder than tooling, and unrealistic targets (99.99% = 52 min/year of downtime) undermine entire systems
- Error budget fundamentals remain timeless: with a 99.999% SLO, a problem that fails 0.0002% of requests spends 20% of the quarterly budget, transforming reliability from political arguments into data-driven release decisions
- Platform Engineering ($115K average) and SRE ($127K average) are complementary, not competitive: Platform teams build systems, SREs ensure reliability, and both can use error budget thinking for alignment
- AI/ML systems need adapted SRE principles: data freshness SLOs, model drift detection, training pipeline reliability, and different error budget math (one LLM training failure = tens of thousands in compute loss)
- Starting from zero: pick 3-5 critical services, set one SLO per service initially (99.9% = 43 minutes/month of downtime is reasonable), automate with OpenTelemetry from day one, get cross-functional buy-in, and target a 12-month timeline
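The error budget figures quoted above can be checked with a few lines of arithmetic. What follows is a minimal Python sketch, not something from the episode: the function names are hypothetical, and it assumes a 365-day year, a 30-day month, and an SLO measured either as availability over time (for the downtime figures) or as the share of successful requests (for the budget-spent figure).

```python
# Minimal sketch of error budget arithmetic. Function names are hypothetical
# and the period lengths (365-day year, 30-day month) are assumptions.

MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60   # 43,200 minutes


def allowed_downtime_minutes(slo_percent: float, period_minutes: float) -> float:
    """Downtime permitted in a period when the SLO is time-based availability."""
    return (100.0 - slo_percent) / 100.0 * period_minutes


def budget_spent_fraction(slo_percent: float, failure_percent: float) -> float:
    """Share of the error budget consumed when failure_percent of the
    period's requests fail, for a request-based SLO."""
    error_budget_percent = 100.0 - slo_percent
    return failure_percent / error_budget_percent


if __name__ == "__main__":
    yearly = allowed_downtime_minutes(99.99, MINUTES_PER_YEAR)
    monthly = allowed_downtime_minutes(99.9, MINUTES_PER_30_DAY_MONTH)
    spent = budget_spent_fraction(99.999, 0.0002)
    print(f"99.99% SLO: {yearly:.1f} min/year of allowed downtime")
    print(f"99.9% SLO: {monthly:.1f} min/month of allowed downtime")
    print(f"99.999% SLO, 0.0002% failures: {spent:.0%} of the quarterly budget")
```

Under these assumptions the results come out at roughly 52.6 minutes per year, 43.2 minutes per month, and 20% of the quarterly budget, which matches the figures in the list above.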
Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience looking to level up their reliability engineering skills.
Episode URL: https://platformengineeringplaybook.com/podcasts/00025-sre-reliability-principles