Platform Engineering Playbook Podcast

vibesre

28 episodes

1 day ago

All content for Platform Engineering Playbook Podcast is the property of vibesre and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.

Episodes (20/28)

OpenTelemetry eBPF Instrumentation: Zero-Code Observability Under 2% Overhead

What if you could achieve complete observability coverage (every HTTP request, database query, and gRPC call) without touching application code? Jordan and Alex investigate eBPF instrumentation for OpenTelemetry, revealing how kernel-level hooks deliver under 2% CPU overhead versus traditional APM agents' 10-50%. Discover the May 2025 inflection point, the TLS encryption challenge, and a practical framework for combining eBPF with SDK instrumentation. In this episode: • eBPF instrumentation achieves under 2% CPU overhead by observing kernel operations that are already happening, versus 10-50% for traditional APM agents • Grafana donated Beyla to OpenTelemetry in May 2025, making eBPF instrumentation part of the core ecosystem • eBPF captures protocol-level data (HTTP, gRPC, SQL) but cannot access application context like user IDs or feature flags; use SDKs for business-critical paths. Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience looking to level up their platform engineering skills. Episode URL: https://platformengineeringplaybook.com/podcasts/00028-opentelemetry-ebpf-instrumentation

1 day ago

14 minutes

The Open Source Observability Showdown: When "Free" Costs $12K/Month

Prometheus is free, Grafana is free, Loki is free, yet Datadog posted $2.3B in revenue and Shopify runs a 15-person team just to manage their observability stack. We decode which open source tools (Prometheus, Loki, Tempo, VictoriaMetrics) actually deliver on their promises, which hide massive operational complexity, and when the "free" option costs more than paying a vendor. Learn the decision framework that matches observability architecture to your team's operational maturity. In this episode: • Single-cluster Prometheus costs ~5 hrs/month ($750-1500 equivalent), but multi-cluster federation jumps to 40-80 hrs/month ($6K-12K); know your tier before committing • Loki delivers 5-10x cheaper storage than OpenSearch but 3-5x slower queries for complex searches; it works brilliantly for structured logs with good labels and struggles with full-text search • VictoriaMetrics reports 40-60% storage reduction vs Prometheus with better high-cardinality handling; consider it before jumping to commercial platforms. Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience who are making build vs buy decisions for observability infrastructure and looking to level up their platform engineering skills. Episode URL: https://platformengineeringplaybook.com/podcasts/00027-observability-tools-showdown

2 days ago

19 minutes

The Kubernetes Complexity Backlash: When Simpler Infrastructure Wins

Kubernetes commands 92% market share, yet 88% report year-over-year cost increases and 25% plan to shrink deployments. We unpack the 3-5x cost underestimation problem, the cargo cult adoption pattern, and when alternatives like Docker Swarm, Nomad, ECS, or PaaS platforms deliver better ROI. From the 200-node rule to 37signals' $10M+ five-year savings leaving AWS, this is your data-driven framework for right-sizing infrastructure decisions in 2025. 🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00026-kubernetes-complexity-backlash 📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub! Summary: • 88% of Kubernetes adopters report year-over-year TCO increases (Spectro Cloud 2025), with teams underestimating total costs by 3-5x when they leave out human capital ($450K-$2.25M for a 3-15 FTE platform team), training (6-month ramp-up), and tool sprawl • The 200-node rule: Kubernetes makes sense above 200 nodes with complex orchestration needs; below that, Docker Swarm (10-minute setup), HashiCorp Nomad (10K+ node scale), AWS ECS, Cloud Run (production in 15 minutes), or PaaS platforms ($400/month vs $150K/year K8s team) often win • 209 CNCF projects create analysis paralysis, with 75% inhibited by complexity and one fintech startup wasting 120 engineer-hours evaluating a service mesh it didn't need for its 30 services • Real 5-year TCO comparison: Kubernetes at 50-100 nodes costs $4.5M-$5.25M (platform team + compute + tools + training) versus PaaS at $775K-$825K (5-6x cheaper), but Kubernetes wins at 500+ nodes where PaaS per-resource costs become prohibitive • 37signals' cloud repatriation saved $10M+ over five years by leaving AWS (EKS/EC2/S3) for on-prem infrastructure ($3.2M → $1.3M annually), proving cloud and Kubernetes aren't universally optimal; they're tools with specific use cases that require matching the tool to actual scale, not aspirational scale

3 days ago

15 minutes

SRE Reliability Principles: The 26% Problem - Error Budgets, SLOs, Platform Engineering

Only 26% of organizations actively use SLOs after a decade of Google's SRE principles being gospel. We explore why adoption is so low despite 49% saying they're more relevant than ever, which principles remain timeless (error budgets, embracing risk, blameless postmortems), and how to adapt SRE for 2025's complexity of AI/ML systems, Platform Engineering collaboration, and multi-cloud chaos. Includes practical playbooks for starting from zero, fixing ignored SLOs, and ML-specific adaptations. The key insight: it's not that SRE principles are wrong. Implementation is harder than anticipated, but the philosophy remains timeless when properly adapted. In this episode: • Only 26% of organizations use SLOs despite 85% adopting OpenTelemetry; process transformation is harder than tooling, with unrealistic targets (99.99% = 52 min/year downtime) undermining entire systems • Error budget fundamentals remain timeless: a 99.999% SLO with a 0.0002% problem = 20% of the quarterly budget spent, transforming reliability from political arguments into data-driven release decisions • Platform Engineering ($115K average) and SRE ($127K average) are complementary, not competitive: Platform teams build systems, SREs ensure reliability, and both can use error budget thinking for alignment • AI/ML systems need adapted SRE principles: data freshness SLOs, model drift detection, training pipeline reliability, and different error budget math (one LLM training failure = tens of thousands in compute loss) • Starting from zero: pick 3-5 critical services, one SLO per service initially (99.9% = 43 minutes/month downtime is reasonable), automate with OpenTelemetry from day one, get cross-functional buy-in, target a 12-month timeline. Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience looking to level up their reliability engineering skills. Episode URL: https://platformengineeringplaybook.com/podcasts/00025-sre-reliability-principles
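
The downtime and error-budget figures quoted in this episode summary can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the "0.0002% problem" means 0.0002% of requests failing over the budget window.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of downtime an availability SLO permits over a period."""
    error_budget_fraction = 1 - slo_percent / 100
    return error_budget_fraction * period_days * 24 * 60


def budget_consumed_percent(slo_percent: float, observed_error_percent: float) -> float:
    """Share of the error budget used up by an observed error rate.

    Assumes both values are percentages measured over the same window.
    """
    error_budget_percent = 100 - slo_percent
    return observed_error_percent / error_budget_percent * 100


if __name__ == "__main__":
    # 99.99% over a 365-day year: roughly 52.6 minutes
    print(round(allowed_downtime_minutes(99.99, 365), 1))
    # 99.9% over a 30-day month: roughly 43.2 minutes
    print(round(allowed_downtime_minutes(99.9, 30), 1))
    # 0.0002% errors against a 99.999% SLO: roughly 20% of the budget
    print(round(budget_consumed_percent(99.999, 0.0002), 1))
```

A 99.999% SLO leaves a 0.001% error budget, so a 0.0002% error rate uses one fifth of it, which matches the 20% figure in the summary.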

4 days ago

15 minutes

Internal Developer Portal Showdown 2025: Backstage vs Port vs Cortex vs OpsLevel

Your team spent 6 months implementing Backstage. Adoption? 8%. The CFO asks: "Why didn't we buy a solution?" Here's the 2025 comparison with real pricing, real timelines, and the counterintuitive truth: commercial platforms are 8-16x cheaper than "free" Backstage for most teams. OpsLevel at $39/user/month delivers in 30-45 days. Port at $78/user/month offers flexibility without coding. Cortex at $65-69/month enforces standards. We break down the decision framework by team size: under 200? OpsLevel. 200-500? Port or OpsLevel. 500+? Backstage is viable with a dedicated platform team. The key insight: it's not open-source free vs commercial expensive; it's transparent licensing vs hidden engineering costs of $150K per 20 developers. In this episode: • Backstage costs $150K per 20 developers in hidden engineering time, or $1.5M annually for 200-person teams versus $93K-$187K for commercial platforms (8-16x cheaper) • OpsLevel ($39/user/month) delivers the fastest implementation at 30-45 days with 60% efficiency gains and automated catalog maintenance; ideal for teams under 200 engineers • Port ($78/user/month) offers a flexible "Blueprints" data model for customization without coding, with a 3-6 month implementation; best for 200-500 engineer teams needing flexibility. Perfect for senior platform engineers, SREs, and DevOps engineers with 5+ years of experience looking to level up their platform engineering skills. Episode URL: https://platformengineeringplaybook.com/podcasts/00024-internal-developer-portals-showdown
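
The 8-16x claim follows directly from the per-seat prices and the $150K-per-20-developers estimate in the summary. A minimal sketch of that arithmetic, with hypothetical function names and the assumption that commercial cost is license fees only (no implementation labor):

```python
def backstage_hidden_cost(engineers: int,
                          cost_per_block: float = 150_000,
                          block_size: int = 20) -> float:
    """Annual hidden engineering cost, scaled linearly from the
    $150K-per-20-developers estimate quoted in the summary."""
    return engineers / block_size * cost_per_block


def commercial_license_cost(engineers: int, price_per_user_month: float) -> float:
    """Annual license cost for a per-seat commercial portal."""
    return engineers * price_per_user_month * 12


if __name__ == "__main__":
    team = 200
    backstage = backstage_hidden_cost(team)        # 1,500,000
    opslevel = commercial_license_cost(team, 39)   # 93,600
    port = commercial_license_cost(team, 78)       # 187,200
    print(f"Backstage: ${backstage:,.0f}")
    print(f"OpsLevel:  ${opslevel:,.0f} ({backstage / opslevel:.0f}x cheaper)")
    print(f"Port:      ${port:,.0f} ({backstage / port:.0f}x cheaper)")
```

The linear scaling is a simplification; real engineering overhead is unlikely to grow in exact proportion to headcount.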

5 days ago

24 minutes

DNS for Platform Engineering: The Silent Killer

• CoreDNS plugin-based architecture: middleware → backend chain, Kubernetes plugin watches the API server and generates responses on-the-fly for cluster.local, forward plugin handles external queries • ndots:5 trap creates 5x DNS query amplification: api.stripe.com tries 4 search domains before the absolute query; fix by lowering to ndots:1, using FQDNs with a trailing dot, and implementing app-level caching • AWS October 19-20, 2025 outage: two DNS Enactors racing in DynamoDB DNS automation, cleanup deleted all IPs for the regional endpoint, 15+ hours of cascading failures (DynamoDB → dependent services → Slack/Atlassian/Snapchat) • Five-layer defensive playbook: (1) optimize: fix ndots, tune CoreDNS cache to 10K records/30s, latency <100ms warning; (2) failover: GSLB with health checks, TTL 60-300s for backends; (3) security: DNSSEC + DoH with internal resolvers; (4) monitoring: track p95 latency, error rates by type, top requesters; (5) testing: DNS failure game days, kill CoreDNS pods, inject latency, model failover scenarios • TTL balancing trade-off: low TTL (60-300s) enables fast failover but increases query load; high TTL (3600-86400s) improves performance but delays failover; no perfect answer, depends on SLO
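
The ndots amplification described above can be illustrated with a small model of a glibc-style resolver's search-list behavior. This is a simplified sketch, not a resolver implementation: the function name is hypothetical, and the four search domains are an assumed example of what a pod's resolv.conf might contain.

```python
from __future__ import annotations


def candidate_queries(name: str, search_domains: list[str], ndots: int) -> list[str]:
    """Return the names a glibc-style resolver would try, in order.

    A trailing dot marks a fully qualified name and skips the search list.
    Names with fewer than `ndots` dots go through the search list first.
    """
    if name.endswith("."):
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


if __name__ == "__main__":
    # Hypothetical search list for a pod in the "default" namespace
    search = ["default.svc.cluster.local", "svc.cluster.local",
              "cluster.local", "ec2.internal"]
    for ndots in (5, 1):
        attempts = candidate_queries("api.stripe.com", search, ndots)
        position = attempts.index("api.stripe.com.") + 1
        print(f"ndots:{ndots} -> absolute name is attempt {position} of {len(attempts)}")
    print(candidate_queries("api.stripe.com.", search, 5))
```

With ndots:5 the absolute name is the fifth attempt, after four failed lookups against the search domains. With ndots:1, or with a trailing dot, it is tried first, and a real resolver stops at the first successful answer.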

6 days ago

19 minutes

eBPF in Kubernetes: Kernel-Level Superpowers Without the Risk

• eBPF enables safe kernel-level visibility with <5% overhead (no restarts, no kernel modules) through verifier-checked programs attached to thousands of kernel hooks (syscalls, network events, scheduler) • Cilium processes 10M packets/sec vs iptables 1-2M packets/sec by replacing linear rule evaluation with eBPF hash table lookups and XDP programs at network driver level • Pixie auto-instruments HTTP, gRPC, DNS, and database protocols by hooking syscalls in kernel space, so it sees application traffic without code changes or language agents • Falco detects runtime threats (spawned shells, file access anomalies) through kernel-level syscall monitoring that catches attacks traditional application tools miss • Start with low-risk tools (Parca for profiling, Falco for security alerts), verify Linux 5.0+ kernel version, avoid CNI replacement until you have specific network performance needs

1 week ago

31 minutes

Time Series Language Models

• Time-Series Language Models (TSLMs) bring foundation model capabilities to infrastructure metrics, offering zero-shot anomaly detection and natural language root cause analysis • Three major players emerged in 2024-2025: Stanford’s OpenTSLM (medical focus), Datadog’s Toto (2.36 trillion observability data points), and Nixtla’s TimeGPT (commercial forecasting API) • Despite impressive benchmarks, even vendors won’t deploy TSLMs to production yet due to accuracy gaps, massive resource requirements (40-110GB VRAM), and immature tooling ecosystems • The technology works but lacks battle-testing, characterized failure modes, and production integrations—expect vendor solutions in 2026-2027, mainstream adoption by 2027+ • Action plan: Prepare, don’t implement—build skills now (time-series fundamentals, LLM concepts, infrastructure expertise), experiment in non-critical environments, and position yourself to lead when production-ready solutions arrive

1 week ago

19 minutes

Kubernetes IaC & GitOps - The Workflow Paradox

• The tool wars (ArgoCD vs Flux) are over—successful teams often run both for different use cases • Only 38% of GitOps adopters have fully automated releases despite 77% adoption • Workflow design is the real differentiator, not technology choice • Platform teams should manage infrastructure with Flux, app teams deploy with ArgoCD • Clear separation of concerns prevents tool complexity when running multiple GitOps tools

1 week ago

19 minutes

The FinOps AI Paradox: Why Smart Tools Don't Cut Costs (And What Actually Does)

• AI-powered FinOps tools work perfectly (95% accuracy, 3-minute anomaly detection, 70-85% actionable recommendations), yet 68% of identified savings never get implemented due to organizational bottlenecks, not technology limitations • Real example: AWS Cost Optimization Hub generated a perfect $200K/year savings recommendation in 12 minutes, but implementation took 7 weeks requiring CFO approval, engineering sign-off, and product team confirmation • AI cannot automate business context decisions (Black Friday traffic clusters need headroom despite low average utilization), stakeholder negotiation (VP Product doesn’t care about infrastructure costs), or application architecture changes (Lambda→Fargate migration requires 6 weeks of engineering time) • The 6% who succeed have three things: executive sponsorship (cost optimization in engineering OKRs), cross-functional accountability (teams own their cloud costs), and automated enforcement (Active Assist auto-apply, Azure Policy governance) • Decision framework: Adopt AI tools for multi-cloud billing chaos (Google FinOps Hub + FOCUS), weekly cost spike surprises (free AWS/GCP anomaly detection), or a FinOps team spending >50% of its time on reporting, but NOT if you lack tagging/ownership/process or your problem is architectural (80% of cost in 2 services) • 90-day playbook: Days 1-30 audit and show leadership the $10M/year waste, Days 31-60 run free tool POC with one team, Days 61-90 roll out to 20% of org with measured impact • Monday morning actions: Enable AWS Cost Anomaly Detection (15 min, free), calculate waste (15 hrs/week × $150/hr × team size), evaluate one AI recommendation with business context, build leadership business case with 90-day plan

1 week ago

12 minutes

The DevOps Toolchain Crisis: Why Adding Tools Makes Teams Slower

• 75% of IT professionals lose 6-15 hours per week to tool sprawl ($50K per developer annually) • Each tool switch costs 23 minutes of focus time—teams navigate 7.4 tools just to ship code • AI tools are compounding the problem: organizations adding 8-10 AI tools on top of existing DevOps stack • 66% spend MORE time fixing AI-generated code than writing it themselves • 53% of organizations solved this with Internal Developer Portals (IDPs) like Backstage • IDPs make sense at 50+ engineers with 5+ tools—below 20 engineers, overhead exceeds benefit • Port (commercial) offers fast time-to-value; Backstage (open-source) provides deep customization • 90-day playbook: audit tool usage (days 1-30), run POC (31-60), roll out to 20% of team (61-90) • Monday action: measure tools touched per day, calculate annual waste, pilot Backstage over weekend

1 week ago

11 minutes

Kubernetes Production Mastery Lesson 3: Health Checks & Probes

• Configuring liveness, readiness, and startup probes with production thresholds • Diagnosing CrashLoopBackOff and NotReady pod states systematically • Designing health endpoints that validate actual application health • Understanding the critical differences between probe types • Avoiding the five most common health check mistakes

1 week ago

18 minutes

Kubernetes Production Mastery Lesson 3: Security Foundations - RBAC & Secrets

• RBAC has 4 components: Subjects, Resources, Verbs, and Scope - understand how they connect • Always prefer namespace-scoped Roles over ClusterRoles - contain blast radius • Create dedicated ServiceAccounts per application, never bind to default • Base64 is encoding, not encryption - real secrets need Sealed Secrets or External Secrets Operator • The 5 critical misconfigurations: cluster-admin for workloads, wildcards, default ServiceAccount permissions, create on RBAC resources, auto-mounted tokens

1 week ago

42 minutes

The Cloud Repatriation Debate: When AWS Costs 10-100x More Than It Should

• AWS C5 instances cost 18x more than comparable Hetzner bare metal servers ($2,500/mo vs €190/mo for 80 cores) • 86% of CIOs are planning cloud repatriation in 2025, up from 43% five years ago - but only 8-9% do full exits • Hidden costs beyond compute: egress fees ($0.09/GB), NAT gateways ($45/mo), inter-AZ traffic add 20-40% to bills • Cloud makes sense for: unpredictable spikes, <$10K/month spend, 20+ global regions, heavy managed service use • Break-even point: 50-100 sustained servers or 12-24 months of stable workload patterns • Hybrid approach wins: base workloads on dedicated servers, burst capacity and specialized services on cloud

2 weeks ago

13 minutes

Kubernetes in 2025: The Maturity Paradox

📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub! Summary: • Service mesh ambient mode eliminates sidecars (Istio 1.24, Red Hat OpenShift 3) • AI/ML integration: Kubeflow mainstream, 85% shadow AI problem requires governance • When to use K8s: 200+ nodes, complex orchestration, multi-cloud, 5-15 FTE platform team • When to skip: <200 nodes, monoliths, limited expertise, need production in <3 months • Skills rising: platform engineering (60% forming teams), AI/ML workloads, security/compliance • Skills declining: manual kubectl ops, vendor-specific expertise (multi-cloud abstractions winning) • Alternatives gaining ground: Docker Swarm revival, Nomad, ECS/Cloud Run, PaaS ($400/mo vs $150K/yr team) • 88% cite rising costs, 42% say it’s their #1 pain point - most underestimate by 3-5x • Salary: $144K-$202K for K8s skills alone, premium for platform engineering expertise

2 weeks ago

16 minutes

Backstage in Production: The 10% Adoption Problem

• Real costs: $1.05M year 1, $900K+ ongoing; requires 7-15 FTEs, not the 1-2 most teams budget • Why adoption stalls at 10%: data quality death spiral, technical complexity barrier, plugin maze overwhelm • Backstage success criteria: 500+ engineers minimum, strong React/TypeScript skills, 3-5 year commitment, existing service catalog discipline • Alternatives comparison: Port.io (4x faster, $50K-$200K, 40-60% adoption), Cortex ($40K-$150K, 30-50% adoption), custom portals (full control) • Success pattern: 500-person company, 5 FTE team, 18 months, managed Backstage (Roadie) → 50% adoption • Failure pattern: 150-person company, 2 FTE team, 12 months, self-hosted → 8% adoption, abandoned for Port • Framework vs product distinction: Backstage requires building the product; Port/Cortex work out of the box • Decision framework: If <500 engineers, limited frontend skills, need ROI <12 months, or can’t dedicate 7+ FTE → choose alternatives

2 weeks ago

16 minutes

Platform Engineering ROI Calculator: Prove Value to Executives

• Why 60-70% of platform teams get disbanded within 18 months (hint: it’s not technical failure) • The exact ROI formula: (Total Value - Total Cost) / Total Cost × 100 • Real numbers across 5 company sizes: 233% ROI at startups, 380% at enterprises • How to translate DORA metrics to business outcomes (deployments → revenue, MTTR → SLA penalties) • CFO, CTO, and VP Eng stakeholder templates that speak their language • When NOT to build a platform team (under 100 engineers? Read this first) • Monday morning action plan: baseline metrics → quarterly ROI presentations → survival
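
The ROI formula quoted above is simple enough to express directly. In this sketch the function name is hypothetical and the dollar inputs are made-up illustrative figures, chosen only so the result lands near the 233% quoted for startups; they are not numbers from the episode.

```python
def roi_percent(total_value: float, total_cost: float) -> float:
    """ROI as defined in the summary: (Total Value - Total Cost) / Total Cost x 100."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_value - total_cost) / total_cost * 100


if __name__ == "__main__":
    # Hypothetical inputs: $600K platform cost, $2M of measured value
    print(f"{roi_percent(2_000_000, 600_000):.0f}% ROI")
```

Breaking even gives 0%, and any value below cost gives a negative ROI, which is the case executives will ask about first.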

2 weeks ago

14 minutes

Why 70% of Platform Engineering Teams Fail (And the 5 Metrics That Predict Success)

Summary: • Only 33% of platform teams have product managers, yet 52% say PMs are crucial—this 19-point gap predicts failure better than any technology choice • Spotify’s Backstage achieved 99% voluntary adoption with a PM; external adopters average 10% adoption without one • The 2024 DORA Report found platform teams decreased throughput by 8% and stability by 14%—platforms make things worse before better • The 5 predictive metrics: (1) PM exists, (2) Baseline established, (3) NPS over 20, (4) Voluntary adoption over 50%, (5) Time to value under 30 days • Decision framework: Under 10 engineers don’t build platforms; at 100+ start with 3 people (1 PM, 2 engineers)

3 weeks ago

11 minutes

Lesson 02: Resource Management - Kubernetes Production Mastery

• Requests vs limits: scheduler uses requests for placement, kubelet enforces limits at runtime—understand this distinction to prevent node overcommitment • Three QoS classes (Guaranteed, Burstable, BestEffort) determine eviction priority when nodes face resource pressure • Five-step debugging workflow: check pod status, read describe output, analyze events, inspect logs, verify resource metrics • Right-sizing methodology: start with realistic estimates, monitor P50/P95/P99 metrics, add 20% headroom, adjust based on production data • Common mistakes: no limits (unlimited burst), equal requests/limits (wastes resources), guessing values (leads to OOMKilled), ignoring JVM memory overhead
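
Two ideas from this lesson summary can be sketched in code: how the three QoS classes follow from requests and limits, and the "observe percentiles, add 20% headroom" right-sizing step. Both functions are hypothetical illustrations. The QoS check is simplified (it compares raw quantity strings and only looks at cpu and memory), and using P95 as the base for the request is an assumption, since the summary does not say which percentile to size from.

```python
import math

RESOURCES = ("cpu", "memory")


def qos_class(containers: list) -> str:
    """Classify a pod from its containers' 'requests'/'limits' dicts.

    Simplified model: quantities are compared as raw strings, so
    "500m" and "0.5" would not be treated as equal here.
    """
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        requests = c.get("requests", {})
        for resource in RESOURCES:
            limit = limits.get(resource)
            # An unset request defaults to the limit
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                return "Burstable"
    return "Guaranteed"


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile of observed usage samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggested_request(samples: list, pct: float = 95, headroom: float = 0.20) -> float:
    """Observed percentile plus headroom (P95 base is an assumption)."""
    return percentile(samples, pct) * (1 + headroom)


if __name__ == "__main__":
    print(qos_class([{}]))
    print(qos_class([{"requests": {"cpu": "100m"}}]))
    print(qos_class([{"requests": {"cpu": "1", "memory": "1Gi"},
                      "limits": {"cpu": "1", "memory": "1Gi"}}]))
    usage_mib = list(range(1, 101))  # made-up memory samples in MiB
    print(suggested_request(usage_mib))
```

The QoS class matters at eviction time: under node pressure, BestEffort pods are evicted first and Guaranteed pods last.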

3 weeks ago

19 minutes

Kubernetes Production Mastery - Lesson 01: Production Mindset

• Production mindset: Think in failure modes, not just success cases • 5 failure patterns: OOMKilled, cascading failures, config drift, silent degradation, manual toil • 6-item production readiness checklist: Resource limits, health checks, security context, observability, graceful shutdown, rollback plan • Real-world example: How skipping the checklist caused a 3-hour outage • Actionable next steps: Audit your workloads, enforce the checklist, run chaos experiments

3 weeks ago

15 minutes