DNS for Platform Engineering: The Silent Killer

Platform Engineering Playbook Podcast

vibesre

28 episodes

1 day ago

All content for Platform Engineering Playbook Podcast is the property of vibesre and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.

Technology

19 minutes

6 days ago

• CoreDNS plugin-based architecture: queries pass through a middleware → backend plugin chain. The kubernetes plugin watches the API server and generates responses on the fly for cluster.local; the forward plugin handles external queries.
• The ndots:5 trap creates 5x DNS query amplification: api.stripe.com is tried against 4 search domains before the absolute query is sent. Fix it by lowering to ndots:1, using FQDNs with a trailing dot, and implementing app-level caching.
• AWS October 19-20, 2025 outage: two DNS Enactors raced in DynamoDB's DNS automation, a cleanup deleted all IPs for the regional endpoint, and 15+ hours of cascading failures followed (DynamoDB → dependent services → Slack/Atlassian/Snapchat).
• Five-layer defensive playbook: (1) optimize: fix ndots, tune the CoreDNS cache to 10K records/30s, set a warning threshold at 100ms latency; (2) failover: GSLB with health checks, TTL of 60-300s for backends; (3) security: DNSSEC + DoH with internal resolvers; (4) monitoring: track p95 latency, error rates by type, and top requesters; (5) testing: run DNS failure game days, kill CoreDNS pods, inject latency, and model failover scenarios.
• TTL balancing trade-off: a low TTL (60-300s) enables fast failover but increases query load; a high TTL (3600-86400s) improves performance but delays failover. There is no perfect answer; the right value depends on your SLO.
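
To make the ndots:5 amplification concrete, here is a minimal sketch of how a glibc-style stub resolver orders its attempts. The function name `candidate_queries` and the four-entry search list (three Kubernetes domains plus a node-inherited one) are assumptions for illustration, not taken from the episode. Real resolvers also issue A and AAAA lookups separately, which this sketch does not model.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the names a glibc-style stub resolver would try, in order."""
    # A trailing dot marks the name as fully qualified: no search expansion.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    # Fewer dots than ndots: walk the search list first, absolute name last.
    if name.count(".") < ndots:
        return expanded + [absolute]
    # Enough dots: the absolute name goes first, search list only as fallback.
    return [absolute] + expanded


# Hypothetical search list for a pod in the "default" namespace.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]

if __name__ == "__main__":
    for attempt in candidate_queries("api.stripe.com", SEARCH, ndots=5):
        print(attempt)
```

With ndots=5 the real name is the fifth attempt; with ndots=1, or with a trailing dot, it is the first, so the resolver can stop after one successful query.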
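
The app-level caching mentioned in the ndots fix could look roughly like the sketch below. The `TTLCache` class, its fixed 30-second lifetime, and the injected `resolve` callable are all hypothetical; a production cache would more likely honor each record's own TTL and cap its size.

```python
import time
from typing import Callable


class TTLCache:
    """Tiny in-process DNS answer cache with one fixed TTL (an assumption)."""

    def __init__(
        self,
        resolve: Callable[[str], list[str]],
        ttl_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def lookup(self, name: str) -> list[str]:
        now = self._clock()
        entry = self._entries.get(name)
        # Serve from cache while the entry has not expired.
        if entry is not None and entry[0] > now:
            return entry[1]
        # Miss or expired: ask the real resolver and remember the answer.
        addresses = self._resolve(name)
        self._entries[name] = (now + self._ttl_s, addresses)
        return addresses
```

Injecting the resolver and the clock keeps the cache testable without network access; in an application, `resolve` would wrap whatever lookup call the service already uses.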
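
The TTL trade-off can be put into rough numbers. The model below is a deliberate simplification and an assumption of this sketch: failover delay is treated as health-check detection time plus one full TTL, and upstream load as one refetch per caching resolver per TTL. The function names and the example figures (10s check interval, 3 failures to trip, 1,000 resolvers) are hypothetical.

```python
def worst_case_failover_seconds(
    ttl_s: int, check_interval_s: int, failures_to_trip: int
) -> int:
    """Upper bound on how long clients may keep using a dead backend."""
    # Health checks must fail N times in a row, then cached answers age out.
    return check_interval_s * failures_to_trip + ttl_s


def upstream_queries_per_second(caching_resolvers: int, ttl_s: int) -> float:
    """Steady-state refetch rate if every resolver keeps the record hot."""
    return caching_resolvers / ttl_s


if __name__ == "__main__":
    for ttl in (60, 300, 3600, 86400):
        delay = worst_case_failover_seconds(ttl, check_interval_s=10, failures_to_trip=3)
        load = upstream_queries_per_second(1000, ttl)
        print(f"TTL {ttl:>5}s: failover <= {delay}s, about {load:.2f} queries/s")
```

Under these assumptions a 60s TTL bounds failover at 90 seconds, while a 3600s TTL stretches it past an hour in exchange for 60 times less query load.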
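
For the monitoring layer, a p95 latency check against the 100ms warning threshold might be sketched as follows. The helper names, the nearest-rank percentile method, and the choice to probe through `socket.getaddrinfo` are assumptions; a real setup would more likely read these figures from CoreDNS metrics than from an ad hoc probe.

```python
import socket
import time

WARN_MS = 100.0  # warning threshold mentioned in the notes


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer arithmetic for ceil(0.95 * n), avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def timed_lookup_ms(host: str) -> float:
    """Time one lookup through the system resolver (raises on failure)."""
    start = time.perf_counter()
    socket.getaddrinfo(host, None)
    return (time.perf_counter() - start) * 1000.0


def status(samples_ms: list[float]) -> str:
    """Return "warning" once p95 reaches the threshold, else "ok"."""
    return "warning" if p95(samples_ms) >= WARN_MS else "ok"
```

Collecting a few dozen `timed_lookup_ms` samples per interval and passing them to `status` gives a crude signal; note that local caches can make such probes look faster than what uncached clients experience.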