{"authors":[{"name":"Martin Stühmer","url":"https://daily-devops.net/authors/martin/"},{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"description":"Recent content in Operations and Infrastructure Management on Daily DevOps \u0026 .NET","favicon":"https://daily-devops.net/images/logo_hu_6465d873dfa490cf.png","feed_url":"https://daily-devops.net/tags/operations/feed.json","home_page_url":"https://daily-devops.net/tags/operations/","icon":"https://daily-devops.net/images/logo_hu_5926de77762241ba.png","items":[{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eNobody warns you about the point where Kubernetes stops behaving like Kubernetes. At 100 nodes the platform feels manageable: logs are searchable, deployments finish quickly, and most incidents resolve with a kubectl command and some patience. Cross 500 nodes and small architectural assumptions start cracking. Cross 1,000 nodes and those cracks become structural.\u003c/p\u003e\n\u003cp\u003eThe problems described here are not hypothetical. etcd database sizes that stretched backup windows into hours. Observability stacks consuming more cluster resources than the workloads they were supposed to monitor. Network overlays running fine at 200 nodes that started dropping packets at 800. If you\u0026rsquo;re planning to push past 500 nodes, or already running infrastructure at that scale and things feel increasingly fragile, read on.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-scale-cliff-why-1000-nodes-changes-everything\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#the-scale-cliff-why-1000-nodes-changes-everything\" title=\"The Scale Cliff: Why 1,000 Nodes Changes Everything\"\u003eThe Scale Cliff: Why 1,000 Nodes Changes Everything\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAt 100 nodes, Kubernetes feels manageable. Monitoring works. Logs are searchable. Network patterns make sense. 
Deployments complete in minutes. Then you cross 500 nodes and small cracks appear. By 1,000 nodes, those cracks become structural failures.\u003c/p\u003e\n\u003cp\u003eThe problem: Kubernetes components designed for graceful degradation hit hard limits at scale. etcd performance degrades non-linearly with keyspace size. Network overlay solutions that worked fine at 200 nodes saturate at 800. Observability stacks consuming 3% of cluster resources at 100 nodes consume 25% at 1,000. Cost-per-node stays flat but operational overhead per node increases exponentially.\u003c/p\u003e\n\u003cp\u003eThese aren\u0026rsquo;t bugs. They\u0026rsquo;re architectural realities. Understanding where the cliffs are lets you plan around them instead of discovering them in production outages.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"etcd-the-hidden-scaling-bottleneck\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#etcd-the-hidden-scaling-bottleneck\" title=\"etcd: The Hidden Scaling Bottleneck\"\u003eetcd: The Hidden Scaling Bottleneck\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eetcd is the single most critical component in your cluster and the first to hit scaling limits. It stores all cluster state: every pod, service, config map, secret, and custom resource. At 1,000 nodes with 200 pods per node, you\u0026rsquo;re managing 200,000+ objects. etcd wasn\u0026rsquo;t designed for that scale without careful tuning.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"performance-degradation-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#performance-degradation-patterns\" title=\"Performance Degradation Patterns\"\u003ePerformance Degradation Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eetcd performance degrades based on keyspace size, transaction rate, and storage backend latency. At small scale, these factors don\u0026rsquo;t matter. 
At mega-cluster scale, they dominate operational behavior.\u003c/p\u003e\n\u003cp\u003eSymptoms you\u0026rsquo;ll see:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eAPI server latency spikes during deployments\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ekubectl\u003c/code\u003e commands timing out intermittently\u003c/li\u003e\n\u003cli\u003eController reconciliation loops falling behind\u003c/li\u003e\n\u003cli\u003eScheduler making suboptimal placement decisions due to stale state\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe root cause is usually one of three things: etcd database size exceeding memory capacity, insufficient IOPS on the storage backend, or transaction rate overwhelming the commit pipeline.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"backup-size-and-recovery-time\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#backup-size-and-recovery-time\" title=\"Backup Size and Recovery Time\"\u003eBackup Size and Recovery Time\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eetcd backup size scales with keyspace. A 100-node cluster might produce 500MB backups. A 1,000-node cluster produces 8GB+ backups. That size creates operational problems:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBackup windows extend from minutes to hours\u003c/li\u003e\n\u003cli\u003eNetwork transfer costs increase linearly\u003c/li\u003e\n\u003cli\u003eRecovery time objectives (RTO) slip from \u0026ldquo;15 minutes\u0026rdquo; to \u0026ldquo;2+ hours\u0026rdquo;\u003c/li\u003e\n\u003cli\u003eStorage costs for retention policies multiply unexpectedly\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWorse: most backup solutions for etcd aren\u0026rsquo;t tested at mega-cluster scale. 
The tooling that works reliably at 100 nodes silently fails or creates corrupted snapshots at 1,000 nodes.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"practical-mitigation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#practical-mitigation\" title=\"Practical Mitigation\"\u003ePractical Mitigation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAKS manages etcd for you\u003c/a\u003e, but you still need to monitor and validate its health. Here\u0026rsquo;s a Terraform configuration that sets up Azure Monitor alerts for etcd-related API server latency:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_monitor_metric_alert\u0026#34; \u0026#34;etcd_latency\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-etcd-high-latency\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
scopes\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  description\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Alert when API server latency exceeds 200ms (etcd saturation signal)\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  severity\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e2\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  frequency\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT1M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  window_size\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT5M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003ecriteria\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    
metric_namespace\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Microsoft.ContainerService/managedClusters\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;apiserver_request_duration_seconds\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    aggregation\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Average\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    operator\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GreaterThan\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    threshold\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e0\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"m\"\u003e2\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003edimension\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      name\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;verb\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      operator\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Include\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      values\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;GET\u0026#34;, \u0026#34;LIST\u0026#34;, \u0026#34;PATCH\u0026#34;, \u0026#34;UPDATE\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eaction\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    action_group_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_monitor_action_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eplatform\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_monitor_metric_alert\u0026#34; \u0026#34;etcd_database_size\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-etcd-database-size-warning\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  scopes\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  description\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Alert when etcd database approaches size limits\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  severity\u003c/span\u003e            
\u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e3\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  frequency\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT5M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  window_size\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT15M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003ecriteria\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_namespace\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Microsoft.ContainerService/managedClusters\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;etcd_db_total_size_in_bytes\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    aggregation\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Average\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    operator\u003c/span\u003e         \u003cspan 
class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GreaterThan\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    threshold\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e6442450944\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # 6GB (warning threshold)\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eaction\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    action_group_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_monitor_action_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eplatform\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThese alerts won\u0026rsquo;t prevent etcd saturation, but they\u0026rsquo;ll give you advance warning before cascading failures occur. 
At scale, that early warning is the difference between a controlled maintenance window and an all-hands incident.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"network-performance-when-overlay-solutions-hit-limits\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#network-performance-when-overlay-solutions-hit-limits\" title=\"Network Performance: When Overlay Solutions Hit Limits\"\u003eNetwork Performance: When Overlay Solutions Hit Limits\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eNetwork overlay performance is invisible at small scale and catastrophic at large scale. Container Network Interface (CNI) plugins that handle 50,000 pods without issue can saturate CPU and drop packets at 200,000 pods. There is no single right answer for which CNI to use. For a full breakdown of the tradeoffs between kubenet, Azure CNI, and Azure CNI Overlay, see \u003ca href=\"/posts/aks-networking-clash/\"\u003eAKS Networking Clash: kubenet vs. CNI vs. CNI Overlay\u003c/a\u003e.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"pod-density-and-node-saturation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#pod-density-and-node-saturation\" title=\"Pod Density and Node Saturation\"\u003ePod Density and Node Saturation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure CNI Overlay\u003c/a\u003e supports up to 250 pods per node. That\u0026rsquo;s a theoretical maximum. 
Practical limits depend on network I/O patterns, pod churn rate, and service mesh overhead.\u003c/p\u003e\n\u003cp\u003eSignals that you\u0026rsquo;re approaching saturation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eNodes showing high system CPU (kernel networking overhead)\u003c/li\u003e\n\u003cli\u003eIntermittent packet loss between pods on the same node\u003c/li\u003e\n\u003cli\u003eService discovery latency increasing over time\u003c/li\u003e\n\u003cli\u003eDNS resolution failures under load\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe underlying issue: network namespace creation, iptables rule updates, and conntrack table management all scale poorly. At 200 pods per node, these operations consume negligible resources. At 250 pods per node, they dominate system CPU.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cross-node-latency-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cross-node-latency-patterns\" title=\"Cross-Node Latency Patterns\"\u003eCross-Node Latency Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eOverlay networks add encapsulation overhead. Azure CNI Overlay typically adds 100-200 microseconds per hop. At small scale, that\u0026rsquo;s noise. At mega-cluster scale, it compounds across multi-tier applications.\u003c/p\u003e\n\u003cp\u003eExample: a request traversing frontend → API gateway → backend service → database proxy touches 4 pods and makes 3 pod-to-pod hops. If those pods span nodes, you\u0026rsquo;ve added 300-600 microseconds of one-way latency from network overhead alone. 
Multiply that by 10,000 requests per second and the impact becomes measurable in user-facing metrics.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"mitigation-strategy\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#mitigation-strategy\" title=\"Mitigation Strategy\"\u003eMitigation Strategy\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003ePin latency-sensitive workloads to the same node using pod affinity\u003c/li\u003e\n\u003cli\u003eUse host networking for data-plane components (with appropriate security controls)\u003c/li\u003e\n\u003cli\u003eMonitor conntrack table utilization: \u003ccode\u003esysctl net.netfilter.nf_conntrack_count\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eSet conservative pod density limits (180-200 pods/node instead of 250)\u003c/li\u003e\n\u003cli\u003eImplement service mesh with extended Berkeley Packet Filter (eBPF) dataplane (\u003ca href=\"https://cilium.io/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eCilium\u003c/a\u003e) to reduce iptables overhead\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThese aren\u0026rsquo;t performance optimizations. 
They\u0026rsquo;re operational requirements at scale.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"observability-overhead-when-monitoring-becomes-the-problem\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#observability-overhead-when-monitoring-becomes-the-problem\" title=\"Observability Overhead: When Monitoring Becomes the Problem\"\u003eObservability Overhead: When Monitoring Becomes the Problem\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eObservability at scale creates a paradox: the systems you need to diagnose problems become the source of resource exhaustion.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"logging-cost-explosion\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#logging-cost-explosion\" title=\"Logging Cost Explosion\"\u003eLogging Cost Explosion\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eA single pod generating 100KB/day of logs costs nothing. 200,000 pods generating the same logs produce 20GB/day. Over a month, that\u0026rsquo;s 600GB. With 3x replication and 90-day retention, you\u0026rsquo;re storing about 5.4TB of log data (20GB/day × 90 days × 3 replicas).\u003c/p\u003e\n\u003cp\u003eIngesting and retaining that volume in a hot log store can cost thousands of dollars monthly. Query performance degrades. Log ingestion pipelines fall behind. The tooling designed to help you debug problems becomes unusable during incidents.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"metric-cardinality-problems\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#metric-cardinality-problems\" title=\"Metric Cardinality Problems\"\u003eMetric Cardinality Problems\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePrometheus-based monitoring hits cardinality limits around 10 million active time series. 
A 1,000-node cluster with moderate instrumentation easily reaches that threshold:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e200,000 pods × 20 metrics per pod = 4M series\u003c/li\u003e\n\u003cli\u003e1,000 nodes × 100 metrics per node = 100K series\u003c/li\u003e\n\u003cli\u003e50 services × 10K instances × 5 metrics = 2.5M series\u003c/li\u003e\n\u003cli\u003eCustom application metrics add another 3M+ series\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWhen you exceed cardinality limits, Prometheus becomes unstable. Queries time out. Dashboards fail to render. Alerting rules stop evaluating. You lose observability exactly when you need it most.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"practical-approaches\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#practical-approaches\" title=\"Practical Approaches\"\u003ePractical Approaches\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eImplement aggressive log sampling: 1% sampling of 20GB/day still gives 200MB/day of logs\u003c/li\u003e\n\u003cli\u003eUse structured logging with consistent field names to enable efficient compression\u003c/li\u003e\n\u003cli\u003eArchive cold logs to blob storage (pennies per GB vs. dollars per GB in hot storage)\u003c/li\u003e\n\u003cli\u003eDeploy federated Prometheus with careful metric filtering at scrape time\u003c/li\u003e\n\u003cli\u003eUse recording rules to pre-aggregate high-cardinality metrics\u003c/li\u003e\n\u003cli\u003eConsider managed observability services (Azure Monitor, Datadog) that handle scale for you\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe honest assessment: if your observability stack consumes more than 10% of cluster resources, it\u0026rsquo;s time to rethink your approach. 
At mega-cluster scale, that threshold is easy to exceed.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cost-spirals-small-decisions-with-exponential-consequences\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cost-spirals-small-decisions-with-exponential-consequences\" title=\"Cost Spirals: Small Decisions with Exponential Consequences\"\u003eCost Spirals: Small Decisions with Exponential Consequences\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eCost optimization at 100 nodes is optional. At 1,000 nodes, it\u0026rsquo;s mandatory. Small inefficiencies compound brutally.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"resource-overprovisioning\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#resource-overprovisioning\" title=\"Resource Overprovisioning\"\u003eResource Overprovisioning\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eTeams typically request 2x actual resource needs for safety margin. At 100 nodes, that\u0026rsquo;s wasteful but affordable. At 1,000 nodes, half of the CPU you pay for sits idle: on 8-vCPU nodes, that is roughly 4,000 unutilized cores.\u003c/p\u003e\n\u003cp\u003eWith Azure D8s_v5 nodes at ~$0.40/hour, a 1,000-node cluster costs ~$288,000/month (about $3.5M/year) in compute alone. Requesting 2x what workloads need leaves about half of that, ~$144,000 per month, paying for idle capacity. That\u0026rsquo;s real budget impact.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"storage-cost-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#storage-cost-patterns\" title=\"Storage Cost Patterns\"\u003eStorage Cost Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eEvery pod gets ephemeral storage. Most clusters also provision persistent volumes. At scale, storage costs can exceed compute costs.\u003c/p\u003e\n\u003cp\u003eExample: 200,000 pods with 10GB ephemeral storage each = 2PB of ephemeral storage. Persistent volume claims add another 500TB+. Azure Premium SSD costs $0.135/GB/month. 
You\u0026rsquo;re paying $300K+ monthly for storage alone.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"network-egress-surprises\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#network-egress-surprises\" title=\"Network Egress Surprises\"\u003eNetwork Egress Surprises\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCross-region and internet egress costs scale linearly with traffic volume. A 1,000-node cluster handling 10TB/day of egress traffic incurs $1,500/day in bandwidth costs ($45,000/month).\u003c/p\u003e\n\u003cp\u003eTeams typically discover these costs 60 days into a scale-up when the first full billing cycle completes. By then, architectural changes are expensive and disruptive.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cost-control-strategy\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cost-control-strategy\" title=\"Cost Control Strategy\"\u003eCost Control Strategy\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eImplement cluster autoscaling with aggressive scale-down policies\u003c/li\u003e\n\u003cli\u003eUse spot instances for fault-tolerant workloads (70% cost reduction)\u003c/li\u003e\n\u003cli\u003eRight-size pod resource requests using \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/vertical-pod-autoscaler\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eVPA (Vertical Pod Autoscaler)\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003eEnable Azure Hybrid Benefit for Windows nodes\u003c/li\u003e\n\u003cli\u003eDeploy regional caching layers to reduce cross-region egress\u003c/li\u003e\n\u003cli\u003eMonitor and alert on cost metrics, not just resource metrics\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eTeams defer cost optimization in favor of operational simplicity, and early on that is usually the right call. At mega-cluster scale, that priority reverses. Cost efficiency becomes a constraint you cannot ignore. 
\u003ca href=\"/posts/cost-optimization-resource-governance-aks/\"\u003eAKS Cost Optimization: Resource Governance That Actually Works\u003c/a\u003e goes deeper on VPA configuration and autoscaling policies if you want the practical implementation details.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"debugging-at-scale-finding-needles-in-exponentially-larger-haystacks\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#debugging-at-scale-finding-needles-in-exponentially-larger-haystacks\" title=\"Debugging at Scale: Finding Needles in Exponentially Larger Haystacks\"\u003eDebugging at Scale: Finding Needles in Exponentially Larger Haystacks\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDebugging a 100-node cluster means checking logs from a few thousand pods. Debugging a 1,000-node cluster means isolating the problem from millions of log lines across 200,000+ pods.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"correlation-and-isolation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#correlation-and-isolation\" title=\"Correlation and Isolation\"\u003eCorrelation and Isolation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhen a user reports an error, your troubleshooting workflow looks like this:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eIdentify the service handling the request (1 of 50+ services)\u003c/li\u003e\n\u003cli\u003eFind the pod instance that processed the request (1 of 5,000+ pod instances)\u003c/li\u003e\n\u003cli\u003eLocate the relevant log lines (1 of 10M+ log events in the time window)\u003c/li\u003e\n\u003cli\u003eCorrelate with upstream/downstream service calls\u003c/li\u003e\n\u003cli\u003eReproduce the issue in a controlled environment\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAt small scale, steps 2-3 take minutes. At mega-cluster scale, they take hours, assuming correlation IDs exist and work correctly. 
Without proper instrumentation, they\u0026rsquo;re impossible.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"reproduction-challenges\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#reproduction-challenges\" title=\"Reproduction Challenges\"\u003eReproduction Challenges\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eIssues that reproduce reliably at scale rarely reproduce in test environments. A race condition that triggers once per 100,000 requests never manifests in pre-production. Network congestion patterns that emerge at 1,000 nodes don\u0026rsquo;t exist at 10 nodes.\u003c/p\u003e\n\u003cp\u003eThis creates a diagnostic blind spot. You can observe the failure in production but can\u0026rsquo;t reproduce it for root cause analysis.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"large-scale-troubleshooting-checklist\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#large-scale-troubleshooting-checklist\" title=\"Large-Scale Troubleshooting Checklist\"\u003eLarge-Scale Troubleshooting Checklist\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eHere\u0026rsquo;s a diagnostic script I use for investigating performance degradation at scale:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Large-scale AKS cluster diagnostic script\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Run this when experiencing unexplained performance issues\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eCLUSTER_NAME\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003e1\u003c/span\u003e\u003cspan class=\"p\"\u003e:?Cluster name required\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eRESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003e2\u003c/span\u003e\u003cspan class=\"p\"\u003e:?Resource group required\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eOUTPUT_DIR\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;./diagnostics-\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003edate +%Y%m%d-%H%M%S\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Running diagnostics for cluster: 
\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003emkdir -p \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Get cluster credentials\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-credentials --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --overwrite-existing\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Node health check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking node health...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get nodes -o wide \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e/nodes.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl top nodes \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/node-resources.txt\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Metrics server unavailable\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/node-resources.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# API server latency check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking API server latency...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e i in \u003cspan class=\"o\"\u003e{\u003c/span\u003e1..5\u003cspan class=\"o\"\u003e}\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003etime\u003c/span\u003e kubectl get nodes \u0026gt; /dev/null 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep real \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/api-latency.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# etcd health indicators\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking etcd health signals...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get --raw /metrics \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep -E \u003cspan class=\"s2\"\u003e\u0026#34;apiserver_request_duration|etcd_request_duration\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/etcd-metrics.txt\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Metrics unavailable\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/etcd-metrics.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Pod distribution analysis\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Analyzing pod distribution...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get pods -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | \u0026#34;\\(.spec.nodeName)\u0026#34;\u0026#39;\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e sort \u003cspan class=\"p\"\u003e|\u003c/span\u003e uniq -c \u003cspan class=\"p\"\u003e|\u003c/span\u003e sort -rn \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/pod-distribution.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Network policy count (can cause iptables overhead)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking network policy count...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get networkpolicies -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/netpol-count.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Service endpoint count (affects kube-proxy performance)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking service endpoint count...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get endpoints -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq \u003cspan class=\"s1\"\u003e\u0026#39;[.items[].subsets[]?.addresses[]?] | length\u0026#39;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/endpoint-count.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Resource pressure signals\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Identifying pods with resource pressure...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get pods -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | select(.status.conditions[]? 
| select(.type==\u0026#34;Ready\u0026#34; and .status==\u0026#34;False\u0026#34;)) | \u0026#34;\\(.metadata.namespace)/\\(.metadata.name)\u0026#34;\u0026#39;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/not-ready-pods.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Recent events (truncated for performance)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Capturing recent cluster events...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get events -A --sort-by\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;.lastTimestamp\u0026#39;\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e tail -1000 \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/recent-events.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Node condition checks\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking for node pressure conditions...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get nodes -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | select(any(.status.conditions[]?; (.type==\u0026#34;MemoryPressure\u0026#34; or .type==\u0026#34;DiskPressure\u0026#34; or .type==\u0026#34;PIDPressure\u0026#34;) and .status==\u0026#34;True\u0026#34;)) | .metadata.name\u0026#39;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/nodes-under-pressure.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# ConfigMap and Secret count (affects etcd size)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Counting ConfigMaps and Secrets...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ConfigMaps: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get configmaps -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;Secrets: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get secrets -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt;\u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Total Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pods -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt;\u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# DNS performance check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Testing DNS resolution performance...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl run dns-test --image\u003cspan class=\"o\"\u003e=\u003c/span\u003ebusybox:1.36 --restart\u003cspan class=\"o\"\u003e=\u003c/span\u003eNever --rm -i --command -- sh -c \u003cspan class=\"s2\"\u003e\u0026#34;time nslookup 
kubernetes.default\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/dns-test.txt\u0026#34;\u003c/span\u003e 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;DNS test failed\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/dns-test.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Diagnostics complete. 
Results in: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Quick analysis:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Nodes: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get nodes --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pods -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Not Ready Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ecat \u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e/not-ready-pods.txt \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Nodes Under Pressure: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ecat \u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e/nodes-under-pressure.txt \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Review the output files for detailed diagnostics.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script collects the signals that matter at scale: API latency, pod distribution skew, resource pressure indicators, and object count metrics. It doesn\u0026rsquo;t solve problems, but it eliminates 90% of the noise.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"patterns-that-prevent-catastrophe\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#patterns-that-prevent-catastrophe\" title=\"Patterns That Prevent Catastrophe\"\u003ePatterns That Prevent Catastrophe\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAfter running mega-clusters through multiple incident cycles, a few patterns consistently prevent the worst outcomes:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eProgressive rollouts\u003c/strong\u003e: Never deploy to 1,000 nodes simultaneously. Deploy to 1 node, then 10, then 100, then all. Automate rollback triggers. 
This pattern catches 95% of scale-dependent bugs before they impact production.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBlast radius isolation\u003c/strong\u003e: Segment your cluster into failure domains using node pools, namespaces, and network policies. When something fails (and it will), contain the damage. \u003ca href=\"/posts/aks-network-policies-zero-trust/\"\u003eAKS Network Policies: The Security Layer Your Cluster Is Missing\u003c/a\u003e covers practical policy configuration if you are starting from scratch.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCapacity reservation\u003c/strong\u003e: Reserve 15-20% headroom for burst traffic and incident response. Running at 90%+ utilization saves money until you need to scale during an outage and can\u0026rsquo;t.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eImmutable infrastructure\u003c/strong\u003e: Treat nodes as cattle, not pets. Automate node replacement on a fixed schedule (weekly or monthly). This prevents subtle configuration drift that compounds into unreproducible failures.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOperational runbooks\u003c/strong\u003e: Document every common failure mode. When API server latency spikes at 2 AM, you don\u0026rsquo;t want to be reading Kubernetes source code to understand etcd compaction behavior.\u003c/p\u003e\n\u003cp\u003eThese patterns aren\u0026rsquo;t revolutionary. They\u0026rsquo;re boring, defensive engineering. At mega-cluster scale, boring wins.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"honest-takeaways\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#honest-takeaways\" title=\"Honest Takeaways\"\u003eHonest Takeaways\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning AKS at 1,000+ nodes isn\u0026rsquo;t fundamentally different from running it at 100 nodes. It\u0026rsquo;s exponentially different. Problems that self-heal at small scale cascade catastrophically at large scale. 
Architectural decisions that feel premature at 50 nodes become load-bearing at 500 nodes.\u003c/p\u003e\n\u003cp\u003eIf you\u0026rsquo;re planning to scale past 500 nodes: budget significant engineering time for operational tooling. Plan your observability strategy before your first node boots. Understand your cost model in detail. Test failure scenarios at scale before they happen in production.\u003c/p\u003e\n\u003cp\u003eIf you\u0026rsquo;re already running at scale: you know everything in this article because you\u0026rsquo;ve lived it. The value isn\u0026rsquo;t the advice. It\u0026rsquo;s knowing you\u0026rsquo;re not alone in discovering these lessons the hard way.\u003c/p\u003e\n\u003cp\u003eScale is honest. Every shortcut taken for velocity will surface eventually, usually at the worst possible moment. Budget engineering time to address that reality before you hit 500 nodes, not after. Fixing structural problems under production pressure costs significantly more than building the platform correctly from the start.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-04-01T17:00:00+01:00","id":"https://daily-devops.net/posts/aks-at-scale-mega-cluster-lessons/","language":"en","summary":"Real-world lessons from operating 1000+ node AKS clusters: etcd limits, network saturation, observability overhead, and cost spirals you need to know.","tags":["kubernetes","azure","cloud","devops","operations","infrastructure"],"title":"AKS at Scale: Hard-Won Lessons from 1000+ Node Clusters","url":"https://daily-devops.net/posts/aks-at-scale-mega-cluster-lessons/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\n\n\n\n\u003ch2 id=\"the-problem-traditional-storage-models-dont-translate-to-kubernetes\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#the-problem-traditional-storage-models-dont-translate-to-kubernetes\" title=\"The Problem: Traditional Storage Models 
Don\u0026rsquo;t Translate to Kubernetes\"\u003eThe Problem: Traditional Storage Models Don\u0026rsquo;t Translate to Kubernetes\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning stateful workloads in Kubernetes means more than deploying a database pod. Traditional storage models (provision a disk, format it, mount it, expect it to stay) collide with Kubernetes\u0026rsquo; ephemeral, distributed architecture. Pods get rescheduled, scaled, and terminated. Your database shouldn\u0026rsquo;t lose data when that happens.\u003c/p\u003e\n\u003cp\u003eThe core challenge: \u003cstrong\u003ehow do you attach persistent storage to ephemeral compute?\u003c/strong\u003e On-premises infrastructure relies on SAN devices, NFS mounts, or local disks with predictable failure domains. You know which server hosts which disk. In AKS, you work with Azure storage primitives: Managed Disks, Azure Files, blob storage. These need seamless integration with Kubernetes lifecycle management. The abstractions differ, the failure modes differ, and operational patterns require rethinking.\u003c/p\u003e\n\u003cp\u003eComplexity multiplies with backup requirements, disaster recovery expectations, and multi-cluster data synchronization. Whether migrating legacy apps that expect local RAID controllers or building cloud-native data platforms from scratch, AKS storage architecture knowledge is foundational. 
Get it wrong: data loss, performance bottlenecks, escalating cloud bills.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"pvcpv-architecture-how-storage-binds-to-pods-in-aks\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#pvcpv-architecture-how-storage-binds-to-pods-in-aks\" title=\"PVC/PV Architecture: How Storage Binds to Pods in AKS\"\u003ePVC/PV Architecture: How Storage Binds to Pods in AKS\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes abstracts storage through two key objects: \u003cstrong\u003ePersistentVolumes (PV)\u003c/strong\u003e and \u003cstrong\u003ePersistentVolumeClaims (PVC)\u003c/strong\u003e. A PV represents the actual storage resource (Azure Disk, Azure Files share). A PVC represents the request for that storage. The relationship mirrors compute abstractions: nodes are physical machines, pods are logical units consuming node resources. Similarly, PVs are physical storage, PVCs are logical requests consuming PV capacity.\u003c/p\u003e\n\u003cp\u003eThe binding flow:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eDeveloper creates a PVC specifying size, access mode, and storage class\u003c/li\u003e\n\u003cli\u003eKubernetes finds or provisions a matching PV based on the storage class\u003c/li\u003e\n\u003cli\u003ePVC binds to the PV, making it available to pods\u003c/li\u003e\n\u003cli\u003ePods reference the PVC in their volume mounts\u003c/li\u003e\n\u003cli\u003eWhen the pod terminates, the PVC remains (data persists across pod lifecycles)\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAccess modes matter:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eReadWriteOnce (RWO)\u003c/strong\u003e: Single node can mount the volume (Azure Disk)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReadWriteMany (RWX)\u003c/strong\u003e: Multiple nodes can mount simultaneously (Azure Files)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReadOnlyMany (ROX)\u003c/strong\u003e: Multiple nodes, read-only 
access\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eMost stateful apps (databases, message queues) use RWO. Azure Disks provide better IOPS and latency than Azure Files. For shared storage (parallel batch processing, shared config directories, legacy apps expecting NFS semantics), use RWX: Azure Files or third-party CSI drivers like NFS or CephFS.\u003c/p\u003e\n\u003cp\u003eCritical insight: \u003cstrong\u003ePVCs decouple storage requests from storage implementation.\u003c/strong\u003e Developers don\u0026rsquo;t need to know if they get a Premium SSD or Standard HDD. They request 100Gi of fast storage, the storage class handles provisioning. This abstraction enables platform teams to enforce policies (all production PVCs use Premium tier) without touching application manifests.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-disk-vs-azure-files-performance-cost-regional-constraints\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#azure-disk-vs-azure-files-performance-cost-regional-constraints\" title=\"Azure Disk vs. Azure Files: Performance, Cost, Regional Constraints\"\u003eAzure Disk vs. Azure Files: Performance, Cost, Regional Constraints\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eChoosing between Azure Disk and Azure Files isn\u0026rsquo;t a one-size-fits-all decision. Each has distinct performance profiles, cost implications, and operational constraints.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAzure Disk (Managed Disks):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePerformance:\u003c/strong\u003e Lower latency, higher IOPS. Premium SSDs reach 20,000 IOPS, Ultra Disks exceed that.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAccess:\u003c/strong\u003e Single-node attachment (RWO). 
Pod rescheduling to another node triggers disk detach and reattach (expect brief delay).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse cases:\u003c/strong\u003e Databases (PostgreSQL, MongoDB), stateful apps requiring low-latency I/O.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCost:\u003c/strong\u003e Pay per provisioned disk size. A 1TB Premium SSD costs more than a 1TB Standard HDD, regardless of actual usage.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegional constraints:\u003c/strong\u003e Disks are zone-specific. With availability zones, pods must schedule in the same zone as the disk.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eAzure Files (SMB/NFS):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePerformance:\u003c/strong\u003e Higher latency than disks. Premium Files tier improves performance but still trails disk I/O.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAccess:\u003c/strong\u003e Multi-node (RWX). Multiple pods across nodes can mount the same share.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse cases:\u003c/strong\u003e Shared logs, static assets, config files, legacy apps expecting NFS.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCost:\u003c/strong\u003e Pay per storage consumed plus transactions. Transaction costs surprise teams on high-throughput workloads.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegional constraints:\u003c/strong\u003e File shares are regional, not zonal. Better for cross-zone workloads, still tied to single region.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eDecision criteria:\u003c/strong\u003e Default to Azure Disk for databases and high-IOPS apps. Use Azure Files only when RWX access or legacy NFS compatibility is required. 
For backup targets or archival storage, consider blob storage with CSI drivers (experimental, improving).\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"the-disk-attachment-penalty\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#the-disk-attachment-penalty\" title=\"The Disk Attachment Penalty\"\u003eThe Disk Attachment Penalty\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eGotcha: \u003cstrong\u003edisk attachment times.\u003c/strong\u003e Pod rescheduling requires Azure to detach the disk from the old node and attach it to the new one. This takes 30 to 90 seconds. Apps that cannot tolerate this downtime need application-level replication (PostgreSQL streaming replication) or third-party solutions like Portworx.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"storage-classes--dynamic-provisioning-automating-the-lifecycle\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#storage-classes--dynamic-provisioning-automating-the-lifecycle\" title=\"Storage Classes \u0026amp; Dynamic Provisioning: Automating the Lifecycle\"\u003eStorage Classes \u0026amp; Dynamic Provisioning: Automating the Lifecycle\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStatic provisioning (manually creating PVs, hoping someone claims them) creates operational overhead. 
\u003cstrong\u003eStorage classes\u003c/strong\u003e enable dynamic provisioning: Kubernetes automatically creates a PV when a PVC is submitted.\u003c/p\u003e\n\u003cp\u003eAKS ships with built-in storage classes, including:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003edefault\u003c/code\u003e: Standard SSD Azure Disk (RWO)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003emanaged-premium\u003c/code\u003e: Premium SSD Azure Disk (RWO)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eazurefile\u003c/code\u003e: Azure Files share (RWX)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eazurefile-premium\u003c/code\u003e: Premium Azure Files share (RWX)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eYou can define custom storage classes to fine-tune parameters:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003estorage.k8s.io/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eStorageClass\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan 
class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003efast-ssd\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eprovisioner\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003edisk.csi.azure.com\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eparameters\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eskuName\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ePremium_LRS\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eManaged\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ecachingMode\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eReadOnly\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Zone redundant storage (ZRS) for higher durability\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# skuName: Premium_ZRS\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eallowVolumeExpansion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ereclaimPolicy\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eRetain\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003evolumeBindingMode\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eWaitForFirstConsumer\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eKey parameters:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ereclaimPolicy:\u003c/strong\u003e \u003ccode\u003eDelete\u003c/code\u003e removes the disk when PVC is deleted, \u003ccode\u003eRetain\u003c/code\u003e keeps it. 
For production databases, \u003ccode\u003eRetain\u003c/code\u003e prevents accidental data deletion.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003evolumeBindingMode:\u003c/strong\u003e \u003ccode\u003eWaitForFirstConsumer\u003c/code\u003e delays PV creation until pod scheduling. Critical for zone-aware clusters (Kubernetes creates the disk in the same zone as the pod).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eallowVolumeExpansion:\u003c/strong\u003e Enables PVC resizing without recreation. Azure Disks support this, but not all storage backends do.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eBest practice:\u003c/strong\u003e Create environment-specific storage classes (dev, staging, prod) with different \u003ccode\u003eskuName\u003c/code\u003e values. Dev clusters use Standard HDDs, prod uses Premium SSDs. Developers keep otherwise identical manifests across environments; only the storage class name changes.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"backup--recovery-rtorpo-implications\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#backup--recovery-rtorpo-implications\" title=\"Backup \u0026amp; Recovery: RTO/RPO Implications\"\u003eBackup \u0026amp; Recovery: RTO/RPO Implications\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes doesn\u0026rsquo;t back up data by default. With a \u003ccode\u003eDelete\u003c/code\u003e reclaim policy, running \u003ccode\u003ekubectl delete pvc\u003c/code\u003e without a recovery plan means permanent data loss.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eVelero\u003c/strong\u003e (formerly Heptio Ark) is the de facto standard for Kubernetes backup. 
It snapshots PVs, captures Kubernetes object state, stores backups in object storage (Azure Blob, S3, GCS).\u003c/p\u003e\n\u003cp\u003eExample Velero backup schedule (via CLI):\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Install Velero with Azure plugin\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero install \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --provider azure \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --bucket velero-backups \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --secret-file ./credentials-velero \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --backup-location-config \u003cspan class=\"nv\"\u003eresourceGroup\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eaks-backups-rg,storageAccount\u003cspan class=\"o\"\u003e=\u003c/span\u003eaksbackupssa\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Create a daily backup schedule for production 
namespace\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero schedule create daily-prod-backup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --schedule\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 2 * * *\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --include-namespaces production \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --snapshot-volumes \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --ttl 720h\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"rto-and-rpo-considerations\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#rto-and-rpo-considerations\" title=\"RTO And RPO Considerations\"\u003eRTO And RPO Considerations\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSnapshot-based backups (Azure Disk snapshots via Velero):\u003c/strong\u003e RPO equals backup frequency (hourly, daily). RTO equals the time to provision a new PV plus the time to restore data (5 to 30 minutes).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNative Azure Backup for AKS:\u003c/strong\u003e Microsoft-managed solution. Integrated with Azure Backup policies, but slower to restore and less granular than Velero.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eApplication-level backups (pg_dump, mongodump):\u003c/strong\u003e Bypasses Kubernetes entirely. 
Lower RTO with automated restore scripts, but requires custom orchestration.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eGotcha:\u003c/strong\u003e Velero relies on Azure Disk snapshots. Restoring a disk from Zone 1 into a cluster in Zone 2 requires a cross-zone snapshot copy, which is not instant. Test restore procedures in non-prod clusters. A backup never restored is wishful thinking.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-aks-replication-patterns-for-cross-cluster-data-synchronization\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#multi-aks-replication-patterns-for-cross-cluster-data-synchronization\" title=\"Multi-AKS Replication: Patterns for Cross-Cluster Data Synchronization\"\u003eMulti-AKS Replication: Patterns for Cross-Cluster Data Synchronization\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning stateful workloads across multiple AKS clusters, whether for HA, disaster recovery, or multi-region latency requirements, adds another layer of complexity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 1: Application-Level Replication\u003c/strong\u003e\nLet the application handle replication. PostgreSQL streaming replication, MongoDB replica sets, and Kafka replication understand their data models and replicate efficiently.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e No Kubernetes-specific dependencies. Works identically in VMs, on-premises, or managed services.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e You manage replication lag, split-brain scenarios, and failover logic.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 2: Storage-Level Replication\u003c/strong\u003e\nUse Azure NetApp Files or third-party solutions like Portworx for block or file-level replication.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e Transparent to applications. 
Works with legacy apps lacking native replication.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e Expensive. NetApp Files Premium tier and Portworx licensing (scales with node count) add significant cost.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 3: Backup-Based DR\u003c/strong\u003e\nVelero backups from primary cluster, restore to secondary on failover.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e Cost-effective (blob storage only).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e RPO equals last backup interval (hours, not seconds). RTO includes restore time (minutes to hours).\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"a-multi-region-postgresql-pattern\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#a-multi-region-postgresql-pattern\" title=\"A Multi-Region PostgreSQL Pattern\"\u003eA Multi-Region PostgreSQL Pattern\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eReal-world example:\u003c/strong\u003e Multi-region PostgreSQL deployment pattern I\u0026rsquo;ve encountered:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePrimary AKS cluster (West Europe):\u003c/strong\u003e Production traffic\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSecondary AKS cluster (North Europe):\u003c/strong\u003e Read replicas via PostgreSQL streaming replication\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eVelero backups:\u003c/strong\u003e Azure Blob in third region (East US) for regulatory compliance\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThis provides sub-second RPO within Europe (streaming replication), hourly RPO globally (Velero), 5-minute RTO for regional failover (promote read replica).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOperational reality:\u003c/strong\u003e Multi-cluster data replication is complex. 
Avoid it by using managed services (Azure Database for PostgreSQL with geo-replication) if possible. Running databases in AKS requires investment in automation, monitoring, and runbooks. Your 3 AM self will appreciate this decision.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"final-thoughts\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#final-thoughts\" title=\"Final Thoughts\"\u003eFinal Thoughts\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStorage in AKS represents a set of trade-offs requiring deliberate navigation. Azure Disk provides performance with zone-locking. Azure Files offers flexibility with latency penalties. Velero enables backups but demands operational discipline and testing. Multi-cluster replication delivers resilience with non-linear operational complexity.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"a-pragmatic-starting-point\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#a-pragmatic-starting-point\" title=\"A Pragmatic Starting Point\"\u003eA Pragmatic Starting Point\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePragmatic approach: Start with managed storage classes and Velero. Use Azure Disk for databases and high-IOPS workloads. Use Azure Files only when RWX access or legacy NFS compatibility is genuinely required. Test restore procedures quarterly, not during outages. Schedule fire drills: delete a namespace, restore from backup. 
Measure actual RTO/RPO instead of assuming SLA compliance.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-to-leave-aks-for-managed-data-services\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#when-to-leave-aks-for-managed-data-services\" title=\"When To Leave AKS For Managed Data Services\"\u003eWhen To Leave AKS For Managed Data Services\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhen stateful workload requirements outgrow AKS storage primitives (sub-second cross-region replication, disk attachment latency breaking your app, spiraling storage costs), don\u0026rsquo;t force solutions. Consider Azure managed services (Azure Database for PostgreSQL, Cosmos DB) or specialized data platforms (Confluent Cloud for Kafka, MongoDB Atlas). Sometimes the best Kubernetes storage strategy is avoiding stateful workloads in Kubernetes.\u003c/p\u003e\n\u003cp\u003eKubernetes excels at stateless orchestration. For stateful workloads, it\u0026rsquo;s capable but demands understanding the plumbing, accepting trade-offs, building operational muscle around backups, monitoring, and runbooks. Treat storage as infrastructure that will fail, not infrastructure that just works. 
Plan accordingly.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-02-04T17:00:00+01:00","id":"https://daily-devops.net/posts/storage-architecture-stateful-workloads-aks/","language":"en","summary":"PVC/PV patterns, Azure Disk vs Files trade-offs, Velero backup strategies, and cross-cluster replication for production stateful workloads in AKS.","tags":["storage","azure","kubernetes","cloud","database","reliability","operations","platform-engineering","disaster-recovery"],"title":"Storage Architecture \u0026 Stateful Workloads in AKS","url":"https://daily-devops.net/posts/storage-architecture-stateful-workloads-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eAKS cluster upgrades are routine maintenance, but executing them without dropping traffic or losing state is the operational challenge that separates theory from production reality. Every Kubernetes version upgrade involves replacing nodes, which means evicting pods, draining workloads, and hoping your assumptions about resilience hold true under pressure.\u003c/p\u003e\n\u003cp\u003eI have participated in dozens of AKS upgrades across production clusters ranging from 10 to 500+ nodes. The pattern is consistent: teams that treat upgrades as a checkbox operation eventually experience an outage. 
Teams that understand the underlying mechanics and configure explicit constraints rarely do.\u003c/p\u003e\n\u003cp\u003eThis article covers the real mechanics: how cordon and drain actually work, why Pod Disruption Budgets exist, and how to orchestrate multi-node-pool rollouts with automation that survives contact with production.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-problem-uncontrolled-node-drains-cause-cascading-failures\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#the-problem-uncontrolled-node-drains-cause-cascading-failures\" title=\"The problem: uncontrolled node drains cause cascading failures\"\u003eThe problem: uncontrolled node drains cause cascading failures\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eWhen you upgrade an AKS cluster, Azure replaces nodes with new VMs running the updated Kubernetes version. That replacement process triggers pod eviction. Without proper controls, evictions happen simultaneously across multiple nodes, stateful workloads lose quorum, and traffic drops because there are no healthy replicas left to serve requests.\u003c/p\u003e\n\u003cp\u003eThe default behavior is optimistic: Kubernetes assumes your workloads are designed for failure. But production workloads are rarely that resilient. Databases need time to transfer leadership, message queues need to flush buffers, and stateless apps still need at least one replica running to handle incoming connections.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e\u003cstrong\u003eAuthor note:\u003c/strong\u003e\u003c/em\u003e The \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-cluster\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eofficial AKS upgrade documentation\u003c/a\u003e covers the mechanics, but it does not emphasize how quickly things go wrong without proper constraints. 
I have seen a three-minute upgrade window turn into a two-hour incident because nobody configured Pod Disruption Budgets.\u003c/p\u003e\n\u003cp\u003eUncontrolled drains create several failure modes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eData loss:\u003c/strong\u003e Stateful workloads evicted before flushing state to disk or replicating to peers.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eService interruption:\u003c/strong\u003e All replicas terminated before new ones become ready.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCascading failures:\u003c/strong\u003e Dependent services time out waiting for unavailable backends, triggering retries that amplify load.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe solution is not to avoid upgrades. The solution is to control the eviction process with explicit constraints that match your workload requirements.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cordon-and-drain-mechanics-what-actually-happens\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#cordon-and-drain-mechanics-what-actually-happens\" title=\"Cordon and drain mechanics: what actually happens\"\u003eCordon and drain mechanics: what actually happens\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDraining a node follows a three-step process built on the Kubernetes Eviction API:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eCordon:\u003c/strong\u003e Mark the node as unschedulable. 
New pods will not be placed on this node, but existing pods continue running.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eEvict:\u003c/strong\u003e Send termination signals to all pods on the node, respecting grace periods and Pod Disruption Budgets (PDBs).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eWait:\u003c/strong\u003e Block until all pods have terminated or the drain timeout expires.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAKS automates this process during upgrades, but you can trigger it manually using kubectl for maintenance or troubleshooting:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl cordon \u0026lt;node-name\u0026gt;\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl drain \u0026lt;node-name\u0026gt; --ignore-daemonsets --delete-emptydir-data\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe \u003ccode\u003e--ignore-daemonsets\u003c/code\u003e flag prevents drain from failing on DaemonSet pods, which are designed to run on every node and will be recreated automatically. 
The \u003ccode\u003e--delete-emptydir-data\u003c/code\u003e flag allows drain to proceed even if pods use emptyDir volumes, which are ephemeral and will be lost.\u003c/p\u003e\n\u003cp\u003eFor AKS automated upgrades, you can configure the drain behavior per node pool using \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-node-pools-rolling\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003erolling upgrade settings\u003c/a\u003e:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks nodepool update \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name myNodePool \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --max-surge 33% \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --drain-timeout \u003cspan class=\"m\"\u003e45\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --node-soak-duration \u003cspan class=\"m\"\u003e5\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe \u003ccode\u003e--drain-timeout\u003c/code\u003e parameter (in minutes) controls how long 
AKS waits for pods to be evicted from a node. If the timeout elapses while pods are still running, the upgrade operation stops by default rather than force-killing them. The \u003ccode\u003e--node-soak-duration\u003c/code\u003e (in minutes) adds a stabilization period after each node upgrade before proceeding to the next. Microsoft recommends \u003ccode\u003e--max-surge 33%\u003c/code\u003e for production workloads.\u003c/p\u003e\n\u003cp\u003eManual drain remains useful for pre-maintenance validation, testing PDB configurations, or debugging eviction failures before committing to a full cluster upgrade.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"pod-disruption-budgets-the-safety-mechanism-you-should-always-configure\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#pod-disruption-budgets-the-safety-mechanism-you-should-always-configure\" title=\"Pod Disruption Budgets: the safety mechanism you should always configure\"\u003ePod Disruption Budgets: the safety mechanism you should always configure\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eA \u003ca href=\"https://kubernetes.io/docs/tasks/run-application/configure-pdb/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003ePod Disruption Budget\u003c/a\u003e (PDB) defines the minimum number of pods that must remain available during voluntary disruptions like node drains. 
PDBs do not prevent involuntary disruptions like node crashes or resource exhaustion, but they block evictions that would violate availability constraints.\u003c/p\u003e\n\u003cp\u003ePDBs are defined using either \u003ccode\u003eminAvailable\u003c/code\u003e or \u003ccode\u003emaxUnavailable\u003c/code\u003e:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eminAvailable:\u003c/strong\u003e The minimum number of pods (or percentage) that must remain running during a disruption.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003emaxUnavailable:\u003c/strong\u003e The maximum number of pods (or percentage) that can be unavailable during a disruption.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eExample PDB for a three-replica deployment that must keep at least two replicas running:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003epolicy/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ePodDisruptionBudget\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  
\u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003emyapp-pdb\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eminAvailable\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e2\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eselector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ematchLabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      
\u003c/span\u003e\u003cspan class=\"nt\"\u003eapp\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003emyapp\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eWith this PDB in place, drain will evict only one pod at a time, waiting for a replacement to become ready before proceeding to the next eviction. If no replacement becomes ready (for example, due to resource constraints or image pull failures), the drain blocks until the timeout expires.\u003c/p\u003e\n\u003cp\u003ePDBs are particularly critical for:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eStateful workloads:\u003c/strong\u003e Databases, message queues, and distributed systems that require quorum.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLow-replica deployments:\u003c/strong\u003e Services with two or three replicas where losing one pod reduces capacity significantly.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLong startup times:\u003c/strong\u003e Workloads that take minutes to initialize and become ready.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003ePractical PDB configuration advice:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSet \u003ccode\u003eminAvailable: 1\u003c/code\u003e for stateless services with two replicas.\u003c/li\u003e\n\u003cli\u003eSet \u003ccode\u003eminAvailable: N-1\u003c/code\u003e for N-replica stateful services that tolerate one failure (for example, three-node etcd allows \u003ccode\u003eminAvailable: 2\u003c/code\u003e).\u003c/li\u003e\n\u003cli\u003eAvoid \u003ccode\u003eminAvailable: N\u003c/code\u003e (all replicas), which blocks drain indefinitely and prevents upgrades.\u003c/li\u003e\n\u003cli\u003eUse percentages for large replica counts: \u003ccode\u003eminAvailable: 75%\u003c/code\u003e for a 10-replica deployment allows up to 2-3 pods to be 
evicted simultaneously.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAuthor tip: Before any upgrade, run \u003ccode\u003ekubectl get pdb -A\u003c/code\u003e and verify that no PDB shows \u003ccode\u003eALLOWED DISRUPTIONS\u003c/code\u003e of 0. A PDB with zero allowed disruptions blocks node drain, and your upgrade hangs until the drain timeout expires or you intervene manually.\u003c/p\u003e\n\u003cp\u003ePDBs only apply to voluntary disruptions. When a node fails, its pods go down regardless of any PDB.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"workload-categories-stateless-stateful-daemonsets\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#workload-categories-stateless-stateful-daemonsets\" title=\"Workload categories: stateless, stateful, DaemonSets\"\u003eWorkload categories: stateless, stateful, DaemonSets\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDifferent workload types require different upgrade strategies. A one-size-fits-all approach causes either unnecessary downtime (overly conservative) or unexpected failures (overly aggressive).\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"stateless-workloads\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#stateless-workloads\" title=\"Stateless workloads\"\u003eStateless workloads\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eStateless services like web frontends, API gateways, and workers can tolerate rapid eviction as long as at least one replica remains available. Configure PDBs with \u003ccode\u003eminAvailable: 1\u003c/code\u003e or \u003ccode\u003emaxUnavailable: N-1\u003c/code\u003e to allow fast rollouts while maintaining service availability.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"stateful-workloads\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#stateful-workloads\" title=\"Stateful workloads\"\u003eStateful workloads\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDatabases, message queues, and distributed storage systems require careful sequencing. 
Evicting multiple replicas simultaneously can cause quorum loss, split-brain scenarios, or data corruption.\u003c/p\u003e\n\u003cp\u003eBest practices for stateful workloads:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSet conservative PDBs that preserve quorum (for example, \u003ccode\u003eminAvailable: 2\u003c/code\u003e for a three-node cluster).\u003c/li\u003e\n\u003cli\u003eConfigure long termination grace periods (60 seconds or more) to allow state transfer and leadership handoff.\u003c/li\u003e\n\u003cli\u003eUse StatefulSets with proper readiness probes to ensure new replicas are fully initialized before old ones are terminated.\u003c/li\u003e\n\u003cli\u003eTest upgrade scenarios in staging with realistic data volumes and latency.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"daemonsets\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#daemonsets\" title=\"DaemonSets\"\u003eDaemonSets\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDaemonSets run exactly one pod per node (or per matching node). Examples include logging agents, monitoring exporters, and network plugins. Draining a node does not evict the DaemonSet pod: it stops when the old node is removed or reimaged, and the DaemonSet controller starts a new pod on the upgraded node.\u003c/p\u003e\n\u003cp\u003eDaemonSets do not require PDBs because drain skips their pods and each pod is tied to its node. Pass the \u003ccode\u003e--ignore-daemonsets\u003c/code\u003e flag during manual drain; without it, \u003ccode\u003ekubectl drain\u003c/code\u003e refuses to proceed while DaemonSet-managed pods are present.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-node-pool-rollout-strategies-graduated-risk-management\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#multi-node-pool-rollout-strategies-graduated-risk-management\" title=\"Multi-node-pool rollout strategies: graduated risk management\"\u003eMulti-node-pool rollout strategies: graduated risk management\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS supports multiple node pools within a single cluster. Each pool can have different VM sizes, availability zones, and upgrade schedules. 
Multi-node-pool architectures enable graduated rollouts that reduce risk by upgrading non-critical workloads first.\u003c/p\u003e\n\u003cp\u003eRecommended upgrade sequence:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eDev/test pools first:\u003c/strong\u003e Upgrade node pools running non-production workloads to validate the new Kubernetes version and catch compatibility issues early.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStateless application pools:\u003c/strong\u003e Upgrade pools running stateless services that can tolerate brief capacity reductions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStateful application pools last:\u003c/strong\u003e Upgrade pools running databases and stateful services only after validating the rollout on stateless workloads.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eExample multi-pool upgrade using Azure CLI:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eRESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;myResourceGroup\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eCLUSTER_NAME\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;myAKSCluster\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;1.29.2\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Configure rolling upgrade settings for production safety\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eMAX_SURGE\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;33%\u0026#34;\u003c/span\u003e        \u003cspan class=\"c1\"\u003e# Microsoft recommended for production\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eDRAIN_TIMEOUT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;45\u0026#34;\u003c/span\u003e     \u003cspan class=\"c1\"\u003e# Minutes to wait for pod eviction\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNODE_SOAK\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;5\u0026#34;\u003c/span\u003e          \u003cspan class=\"c1\"\u003e# Minutes to stabilize after each node\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Upgrade control plane first (does not affect 
workloads)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Upgrading control plane to \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks upgrade \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --kubernetes-version \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TARGET_VERSION\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --control-plane-only \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --yes\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Upgrade node pools in sequence: system -\u0026gt; stateless -\u0026gt; stateful\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNODE_POOLS\u003c/span\u003e\u003cspan class=\"o\"\u003e=(\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;system\u0026#34;\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;stateless\u0026#34;\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;stateful\u0026#34;\u003c/span\u003e\u003cspan class=\"o\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e POOL in \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eNODE_POOLS\u003c/span\u003e\u003cspan class=\"p\"\u003e[@]\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Upgrading node pool: \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan 
class=\"c1\"\u003e# Verify current node count and health\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003eaz aks nodepool show \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --query count -o tsv\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Current node count for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan 
class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e: \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Configure rolling upgrade settings before upgrade\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool update \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --max-surge \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$MAX_SURGE\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --drain-timeout \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$DRAIN_TIMEOUT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --node-soak-duration \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NODE_SOAK\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Upgrade node pool\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool upgrade \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan 
class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --kubernetes-version \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TARGET_VERSION\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --yes\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Wait for upgrade to complete\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Waiting for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e upgrade to complete...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool \u003cspan class=\"nb\"\u003ewait\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --updated\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Verify upgraded node count matches original\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003eUPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003eaz aks nodepool show \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --query count -o tsv\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CURRENT_COUNT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e !\u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$UPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: Node count mismatch for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e. 
Expected \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e, got \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eUPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Pool \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e upgraded successfully.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;---\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;All node pools upgraded to 
\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script upgrades the control plane first (which is a non-disruptive operation), then upgrades each node pool sequentially, validating node count before and after each upgrade to detect unexpected node losses.\u003c/p\u003e\n\u003cp\u003eKey operational notes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eControl plane upgrades are non-disruptive:\u003c/strong\u003e The control plane upgrade updates the Kubernetes API server and controllers but does not affect running workloads. Only node pool upgrades trigger pod evictions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eOne node pool at a time:\u003c/strong\u003e Upgrading multiple pools simultaneously multiplies risk. Sequential upgrades allow you to catch issues early and halt the rollout before affecting critical workloads.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eValidate before proceeding:\u003c/strong\u003e Check pod health, replica counts, and application metrics after each pool upgrade. 
Use kubectl, Azure Monitor, or Prometheus to verify that workloads are stable before moving to the next pool.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"planned-maintenance-windows-scheduling-upgrades-safely\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#planned-maintenance-windows-scheduling-upgrades-safely\" title=\"Planned maintenance windows: scheduling upgrades safely\"\u003ePlanned maintenance windows: scheduling upgrades safely\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eFor clusters with \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eautomatic upgrades\u003c/a\u003e enabled, AKS supports \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/planned-maintenance\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eplanned maintenance windows\u003c/a\u003e to control when upgrades occur. This prevents upgrades from starting during peak traffic periods.\u003c/p\u003e\n\u003cp\u003eConfigure a weekly maintenance window using Azure CLI:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks maintenanceconfiguration add \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name aksManagedAutoUpgradeSchedule \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --schedule-type Weekly \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --day-of-week Saturday \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --interval-weeks \u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --start-time 02:00 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --duration \u003cspan class=\"m\"\u003e4\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eMicrosoft recommends a minimum four-hour maintenance window to ensure upgrades complete without interruption. Combine this with the \u003ccode\u003estable\u003c/code\u003e auto-upgrade channel, which targets the previous minor version with latest patches, for a balance between staying current and avoiding bleeding-edge issues.\u003c/p\u003e\n\u003cp\u003eFor production clusters, I prefer manual upgrades with planned maintenance windows as a safety net. The automation handles the scheduling, but I control when the actual upgrade starts.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"automation-and-rollback-scripting-safe-upgrades\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#automation-and-rollback-scripting-safe-upgrades\" title=\"Automation and rollback: scripting safe upgrades\"\u003eAutomation and rollback: scripting safe upgrades\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAutomation reduces human error during upgrades, but only if the automation includes validation and rollback capabilities.
A fully automated upgrade script should:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eValidate current cluster state (replica counts, PDB configurations, node health).\u003c/li\u003e\n\u003cli\u003eUpgrade in stages with validation checkpoints.\u003c/li\u003e\n\u003cli\u003eDetect failures and halt or roll back automatically.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003ePractical validation checks before upgrade:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;=== Pre-Upgrade Validation ===\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check available Kubernetes versions\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Available upgrades:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-upgrades \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup
\u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --output table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify all nodes are ready (|| true: grep exits 1 when nothing matches, which would abort under pipefail)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNOTREADY\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get nodes --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"o\"\u003e{\u003c/span\u003e grep -v \u003cspan class=\"s2\"\u003e\u0026#34; Ready \u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003etrue\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"o\"\u003e}\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NOTREADY\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -gt \u003cspan class=\"m\"\u003e0\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NOTREADY\u003c/span\u003e\u003cspan class=\"s2\"\u003e nodes are not ready.
Aborting upgrade.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  kubectl get nodes \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep -v \u003cspan class=\"s2\"\u003e\u0026#34; Ready \u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ All nodes ready\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check for PDBs that would block drain\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eBLOCKED\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pdb -A -o \u003cspan class=\"nv\"\u003ejsonpath\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{\u0026#34;\\n\u0026#34;}{end}\u0026#39;\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan 
class=\"nv\"\u003e$BLOCKED\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;WARNING: The following PDBs have zero allowed disruptions:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$BLOCKED\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;These will block node drain. 
Verify this is intentional.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify PDBs exist for critical namespaces\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e NS in production\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003ePDBS\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pdb -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --no-headers 2\u0026gt;/dev/null \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$PDBS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -eq \u003cspan class=\"m\"\u003e0\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;WARNING: No PDBs configured in namespace \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eelse\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ \u003c/span\u003e\u003cspan class=\"nv\"\u003e$PDBS\u003c/span\u003e\u003cspan class=\"s2\"\u003e PDBs configured in \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify critical deployments have sufficient replicas\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking critical deployments...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e DEPLOYMENT in myapp-frontend myapp-backend\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan 
class=\"nv\"\u003eREPLICAS\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get deployment \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -n production -o \u003cspan class=\"nv\"\u003ejsonpath\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;{.status.readyReplicas}\u0026#39;\u003c/span\u003e 2\u0026gt;/dev/null \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;0\u0026#34;\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e${REPLICAS:-0}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -lt \u003cspan class=\"m\"\u003e2\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e has fewer than 2 ready replicas (\u003c/span\u003e\u003cspan class=\"nv\"\u003e${REPLICAS:-0}\u003c/span\u003e\u003cspan class=\"s2\"\u003e).
Aborting upgrade.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ \u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$REPLICAS\u003c/span\u003e\u003cspan class=\"s2\"\u003e replicas ready\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;=== Validation Complete ===\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eRollback is more complex. AKS does not support in-place downgrades. 
If an upgrade introduces breaking changes, the rollback path involves:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eRestoring from a snapshot or backup (for stateful workloads).\u003c/li\u003e\n\u003cli\u003eDeploying a new node pool with the previous Kubernetes version.\u003c/li\u003e\n\u003cli\u003eMigrating workloads to the new pool.\u003c/li\u003e\n\u003cli\u003eDeleting the upgraded pool.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThis process is slow and disruptive, which is why validation before upgrade is critical. Test upgrades in staging, validate application compatibility with the new Kubernetes version, and maintain rollback procedures even if you hope never to use them.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"practical-recommendations\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#practical-recommendations\" title=\"Practical recommendations\"\u003ePractical recommendations\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eBased on production experience, the following practices reduce upgrade-related failures:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAlways configure PDBs for production workloads.\u003c/strong\u003e Even stateless services benefit from \u003ccode\u003eminAvailable: 1\u003c/code\u003e to prevent simultaneous eviction of all replicas.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTest upgrades in staging first.\u003c/strong\u003e Validate application compatibility, verify PDB behavior, and measure upgrade duration under realistic load.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUpgrade during low-traffic windows.\u003c/strong\u003e Even with proper PDBs, upgrades reduce available capacity. Schedule upgrades when traffic is lowest to minimize user impact.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMonitor during upgrades.\u003c/strong\u003e Track pod eviction events, replica counts, and application error rates. 
Use Azure Monitor, Prometheus, or your existing observability stack to detect issues early.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAutomate validation, not just execution.\u003c/strong\u003e Scripts that upgrade without validation are worse than manual upgrades because they fail faster and more completely.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"conclusion\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#conclusion\" title=\"Conclusion\"\u003eConclusion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS cluster upgrades are unavoidable, but service disruption is not. Cordon and drain mechanics provide the foundation, Pod Disruption Budgets enforce availability constraints, and multi-node-pool rollouts allow graduated risk management. Combine these tools with validation-driven automation, and zero-downtime upgrades become reliable rather than aspirational.\u003c/p\u003e\n\u003cp\u003eThe key insight: upgrades succeed when the automation respects the constraints of your workloads, not when the automation assumes resilience that does not exist.\u003c/p\u003e\n\u003cp\u003eStart with the basics: configure PDBs for every production workload, set \u003ccode\u003e--max-surge 33%\u003c/code\u003e on your node pools, and always upgrade the control plane before node pools. Test in staging first. Monitor during the upgrade.
These practices are not optional for production clusters.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-01-28T17:00:00+01:00","id":"https://daily-devops.net/posts/cluster-upgrades-zero-downtime-aks/","language":"en","summary":"Master AKS upgrades with cordon/drain mechanics, Pod Disruption Budgets, multi-node-pool rollouts, and automation for zero-downtime operations.\n","tags":["aks","azure","kubernetes","cloud","devops","operations","reliability"],"title":"AKS Cluster Upgrades: Zero-Downtime Operations That Actually Work\n","url":"https://daily-devops.net/posts/cluster-upgrades-zero-downtime-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eAKS documentation will get you to a running cluster. It won\u0026rsquo;t tell you why your pod authenticated in staging and gets a 401 in production. It won\u0026rsquo;t explain why upgrading a 50-node cluster at 2 AM felt fine but a 300-node upgrade at noon caused cascading evictions. It won\u0026rsquo;t show you which storage class to avoid when your database needs to survive node pool replacements.\u003c/p\u003e\n\u003cp\u003eThis series covers the operational reality — the decisions that distinguish AKS clusters that run quietly in production from clusters that generate 3 AM alerts. Nine articles, each examining a specific architectural domain with the specificity that matters when something breaks.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"why-aks-operations-is-different\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#why-aks-operations-is-different\" title=\"Why AKS Operations Is Different\"\u003eWhy AKS Operations Is Different\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eMicrosoft manages the AKS control plane. 
That sounds like less work, and in some ways it is — you don\u0026rsquo;t patch etcd, you don\u0026rsquo;t replace failed control plane VMs, you don\u0026rsquo;t worry about API server certificate rotation. What it doesn\u0026rsquo;t mean is that running AKS in production is simple or that managed Kubernetes hands you a reliable platform and steps aside.\u003c/p\u003e\n\u003cp\u003eEvery node pool configuration decision is yours. Every storage class binding, every PVC lifecycle policy, every decision about which node pool hosts which workload — that\u0026rsquo;s on you. RBAC spans three separate systems simultaneously: Kubernetes RBAC, Azure RBAC, and Azure AD. A misconfiguration in any one of them produces an access failure that looks identical from the application\u0026rsquo;s perspective. The documentation will show you how to configure each system in isolation. It will not show you why they interact in non-obvious ways under specific conditions, or what the failure mode looks like when you get the federation configuration slightly wrong.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"where-networking-stops-being-managed\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#where-networking-stops-being-managed\" title=\"Where Networking Stops Being Managed\"\u003eWhere Networking Stops Being Managed\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eNetworking is another area where \u0026ldquo;managed\u0026rdquo; has a narrower meaning than the word implies. Microsoft manages the control plane networking. Your VNet, your subnets, your IP address planning, your DNS configuration, your ingress architecture — all of it is your responsibility, and the decisions compound. IP exhaustion caused by node pool scaling is a common production incident that no amount of control plane management prevents. 
Private cluster DNS resolution breaks in ways that take hours to diagnose if you haven\u0026rsquo;t encountered the pattern before.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"upgrades-the-gap-between-docs-and-reality\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#upgrades-the-gap-between-docs-and-reality\" title=\"Upgrades: The Gap Between Docs and Reality\"\u003eUpgrades: The Gap Between Docs and Reality\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eUpgrades are perhaps the clearest illustration of the gap between documentation and reality. The documentation describes upgrade mechanics accurately. What it doesn\u0026rsquo;t describe is how Pod Disruption Budget misconfigurations interact with cluster autoscaler behavior during node pool drain, why the timing of upgrades relative to workload peak matters more than most teams expect, or how a PDB that looks correct on paper blocks drain indefinitely on a cluster that\u0026rsquo;s handling real traffic. Managed Kubernetes handles the control plane upgrade. The workload upgrade is a careful orchestration problem that the platform does not solve for you.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"storage-where-managed-disappears\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#storage-where-managed-disappears\" title=\"Storage: Where \u0026ldquo;Managed\u0026rdquo; Disappears\"\u003eStorage: Where \u0026ldquo;Managed\u0026rdquo; Disappears\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eStorage is where the word \u0026ldquo;managed\u0026rdquo; disappears entirely. Azure manages the underlying disk and file services. AKS provides the CSI drivers. Everything between your application and the storage backend — PVC binding, reclaim policies, volume expansion behavior, backup orchestration, behavior during node failure or node pool deletion — is configuration you own. 
Teams that treat storage as a detail find out it isn\u0026rsquo;t when a node pool replacement deletes volumes that were bound to nodes rather than to the cluster.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cost-decisions-compound-silently\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#cost-decisions-compound-silently\" title=\"Cost Decisions Compound Silently\"\u003eCost Decisions Compound Silently\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCost is a dimension that managed Kubernetes actively obscures. The control plane costs nothing on the Free tier and a fixed hourly fee per cluster on the Standard and Premium tiers. Node pool costs scale with what you configure, and the configuration space is large: VM SKU selection, autoscaler min/max bounds, system versus user node pool separation, spot VM integration, pod density targets. None of these have obviously correct values. All of them interact. Teams that inherit clusters often inherit cost structures that made sense at a different scale or for a different workload profile, and reversing those decisions requires careful sequencing to avoid downtime.\u003c/p\u003e\n\u003cp\u003eThe happy paths in the documentation work. They work because they\u0026rsquo;re constructed to work. Production clusters encounter the edges — the configuration combinations, the scale thresholds, the timing sensitivities — that happy paths don\u0026rsquo;t cover. This series is about the edges.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"what-this-series-covers\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#what-this-series-covers\" title=\"What This Series Covers\"\u003eWhat This Series Covers\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/\"\u003ePod Identity \u0026amp; Access Control in AKS: What Actually Breaks\u003c/a\u003e\u003c/strong\u003e starts with identity because identity failures are the most common source of production incidents.
Workload Identity Federation eliminates credential lifecycle problems but introduces configuration complexity spanning three separate RBAC systems — Kubernetes RBAC, Azure RBAC, and Azure AD permissions. The article explains where credentials still leak despite federation, how layers interact and fail, and validation patterns that catch misconfigurations before they become incidents.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/\"\u003eStorage Architecture \u0026amp; Stateful Workloads in AKS\u003c/a\u003e\u003c/strong\u003e addresses what most AKS guides skip: what actually happens to your data when a node gets replaced. PVC/PV architecture, Azure Disk versus Azure Files performance trade-offs, Velero backup configurations that survive real restore scenarios, and multi-cluster replication patterns for production stateful workloads.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/\"\u003eAKS Cost Optimization: Resource Governance That Actually Works\u003c/a\u003e\u003c/strong\u003e covers the gap between \u0026ldquo;set resource limits\u0026rdquo; and actually controlling spend at scale. Pod density strategies, node pool design decisions that compound over time, spot VM integration without reliability regressions, and FinOps tagging that produces actionable cost attribution rather than unread dashboards.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/\"\u003eMulti-AKS Cluster Networking \u0026amp; Hub-Spoke Topology\u003c/a\u003e\u003c/strong\u003e examines what happens to networking when you move from one cluster to many. 
VNet peering patterns, hub-spoke routing, cross-cluster DNS resolution, shared ingress options, and — critically — the decision criteria for when mesh complexity becomes justified rather than premature.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/\"\u003eAKS Cluster Upgrades: Zero-Downtime Operations That Actually Work\u003c/a\u003e\u003c/strong\u003e covers upgrade mechanics that documentation describes optimistically. Cordon and drain behavior, Pod Disruption Budget configuration that prevents service disruption rather than theater-level protection, multi-node-pool rollout strategies, and validation-driven automation that makes upgrades reproducible rather than heroic.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/container-registry-image-security-aks/\"\u003eContainer Registry \u0026amp; Image Security in AKS Deployments\u003c/a\u003e\u003c/strong\u003e covers ACR hardening beyond the basics. A production-ready sequence: vulnerability scanning, image signing with Notation, RBAC scoping, private endpoints, policy enforcement through Azure Policy and admission controllers, and geo-replication strategies with clear trade-offs explained.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/\"\u003eAKS Disaster Recovery: Why Your Untested Backup Will Fail\u003c/a\u003e\u003c/strong\u003e addresses the gap between having backups and having a tested recovery plan. 
Velero configuration, realistic RTO/RPO targets that match business risk rather than wishful thinking, restore testing procedures that catch problems before outages, and multi-region failover steps your team can actually execute under pressure.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/\"\u003eHybrid AKS: Bridging Cloud and On-Prem with Azure Arc\u003c/a\u003e\u003c/strong\u003e covers the operational patterns for organizations running Kubernetes across cloud and on-premises simultaneously. ExpressRoute and VPN connectivity, Azure Arc for unified management across heterogeneous environments, consistent policy enforcement, DNS resolution, and identity federation without duplicating systems.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/\"\u003eAKS at Scale: Hard-Won Lessons from 1000+ Node Clusters\u003c/a\u003e\u003c/strong\u003e closes the series with what changes when clusters grow large enough that the platform itself becomes the bottleneck. etcd limits under high object churn, network saturation at scale, observability overhead that compounds with cluster size, and cost spirals that emerge from architectural decisions that seemed fine at 50 nodes.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"who-this-is-for\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#who-this-is-for\" title=\"Who This Is For\"\u003eWho This Is For\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003ePlatform engineers and infrastructure-focused developers responsible for AKS clusters in production — or teams about to inherit that responsibility. Each article assumes you\u0026rsquo;ve run AKS before and want operational depth, not introductory setup instructions.\u003c/p\u003e\n\u003cp\u003eThe series covers Terraform, Bicep, kubectl, and Azure CLI patterns throughout.
Examples are grounded in production scenarios rather than constructed to demonstrate features.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"how-these-articles-were-written\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#how-these-articles-were-written\" title=\"How These Articles Were Written\"\u003eHow These Articles Were Written\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eEach article in this series is based on production experience — clusters that handled real traffic, failed in real ways, and required real fixes under time pressure. That distinction matters for what you\u0026rsquo;ll find here and what you won\u0026rsquo;t.\u003c/p\u003e\n\u003cp\u003eProduction experience means the failure patterns are specific. Not \u0026ldquo;storage can be tricky\u0026rdquo; but which storage class binding decisions survive node pool replacements and which don\u0026rsquo;t. Not \u0026ldquo;upgrades can cause downtime\u0026rdquo; but which combination of PDB configuration and autoscaler behavior produces an indefinitely blocked drain. Not \u0026ldquo;identity is complex\u0026rdquo; but the exact configuration gap in Workload Identity Federation that causes silent auth failures in one environment and not another. The specificity isn\u0026rsquo;t for its own sake — it\u0026rsquo;s the difference between an article that confirms your intuition and one that actually changes what you configure next.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"trade-offs-over-single-right-answers\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#trade-offs-over-single-right-answers\" title=\"Trade-offs Over Single Right Answers\"\u003eTrade-offs Over Single Right Answers\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhat production experience doesn\u0026rsquo;t mean is that every approach here is the only valid one. 
Large-scale AKS operation involves genuine trade-offs — between cost and resilience, between operational simplicity and flexibility, between standardization and workload-specific tuning. The articles explain the reasoning behind recommendations rather than just stating them, because the reasoning is what lets you adapt the approach to your constraints. A node pool design that works for a batch processing workload is wrong for a latency-sensitive API, and the article on cost governance explains why rather than presenting a single correct answer.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-aks-makes-things-harder\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#when-aks-makes-things-harder\" title=\"When AKS Makes Things Harder\"\u003eWhen AKS Makes Things Harder\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eThe articles were not written to showcase features or to demonstrate that AKS has a solution for every problem. Some of them document problems that AKS makes harder than it should be, and say so directly. If a particular architectural pattern has a known failure mode at scale, that failure mode appears in the article rather than in a footnote or an FAQ three pages into the documentation. If a feature has a meaningful limitation that affects how you should configure it, that limitation is in the main text, not in a callout box labeled \u0026ldquo;note.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eThe goal is for these articles to be the thing you read before a production incident rather than the thing you find during one.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"where-to-start\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#where-to-start\" title=\"Where to Start\"\u003eWhere to Start\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRead in published order if you\u0026rsquo;re building out AKS infrastructure from scratch — identity and storage are foundational, and later articles reference earlier concepts. 
Jump to specific articles if you\u0026rsquo;re dealing with an immediate operational problem: the titles are specific enough that the right article for your situation should be obvious.\u003c/p\u003e\n\u003cp\u003eThe scale article at the end is worth reading early if your cluster is already growing or if you\u0026rsquo;re designing for growth — some architectural decisions made at 50 nodes are expensive to reverse at 500.\u003c/p\u003e\n","date_modified":"2026-05-25T23:41:10+02:00","date_published":"2026-01-21T17:00:00+01:00","id":"https://daily-devops.net/posts/aks-architecture-operations/","language":"en","summary":"Nine articles on production AKS—identity, storage, multi-cluster networking, cost governance, DR, and running 1000-node clusters in practice.","tags":["kubernetes","azure","cloud","devops","operations","platform-engineering"],"title":"AKS Architecture \u0026 Operations — The Complete Series","url":"https://daily-devops.net/posts/aks-architecture-operations/"}],"language":"en","title":"Operations and Infrastructure Management on Daily DevOps \u0026 .NET","version":"https://jsonfeed.org/version/1.1"}