{"authors":[{"name":"Martin Stühmer","url":"https://daily-devops.net/authors/martin/"},{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"description":"Recent content in Kubernetes and Container Orchestration on Daily DevOps \u0026 .NET","favicon":"https://daily-devops.net/images/logo_hu_6465d873dfa490cf.png","feed_url":"https://daily-devops.net/tags/kubernetes/feed.json","home_page_url":"https://daily-devops.net/tags/kubernetes/","icon":"https://daily-devops.net/images/logo_hu_5926de77762241ba.png","items":[{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eNobody warns you about the point where Kubernetes stops behaving like Kubernetes. At 100 nodes the platform feels manageable: logs are searchable, deployments finish quickly, and most incidents resolve with a kubectl command and some patience. Cross 500 nodes and small architectural assumptions start cracking. Cross 1,000 nodes and those cracks become structural.\u003c/p\u003e\n\u003cp\u003eThe problems described here are not hypothetical. etcd database sizes that stretched backup windows into hours. Observability stacks consuming more cluster resources than the workloads they were supposed to monitor. Network overlays running fine at 200 nodes that started dropping packets at 800. If you\u0026rsquo;re planning to push past 500 nodes, or already running infrastructure at that scale and things feel increasingly fragile, read on.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-scale-cliff-why-1000-nodes-changes-everything\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#the-scale-cliff-why-1000-nodes-changes-everything\" title=\"The Scale Cliff: Why 1,000 Nodes Changes Everything\"\u003eThe Scale Cliff: Why 1,000 Nodes Changes Everything\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAt 100 nodes, Kubernetes feels manageable. Monitoring works. Logs are searchable. Network patterns make sense. 
Deployments complete in minutes. Then you cross 500 nodes and small cracks appear. By 1,000 nodes, those cracks become structural failures.\u003c/p\u003e\n\u003cp\u003eThe problem: Kubernetes components designed for graceful degradation hit hard limits at scale. etcd performance degrades non-linearly with keyspace size. Network overlay solutions that worked fine at 200 nodes saturate at 800. Observability stacks consuming 3% of cluster resources at 100 nodes consume 25% at 1,000. Cost per node stays flat, but operational overhead per node keeps climbing as the cluster grows.\u003c/p\u003e\n\u003cp\u003eThese aren\u0026rsquo;t bugs. They\u0026rsquo;re architectural realities. Understanding where the cliffs are lets you plan around them instead of discovering them in production outages.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"etcd-the-hidden-scaling-bottleneck\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#etcd-the-hidden-scaling-bottleneck\" title=\"etcd: The Hidden Scaling Bottleneck\"\u003eetcd: The Hidden Scaling Bottleneck\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eetcd is the single most critical component in your cluster and the first to hit scaling limits. It stores all cluster state: every pod, service, config map, secret, and custom resource. At 1,000 nodes with 200 pods per node, you\u0026rsquo;re managing 200,000+ pod objects alone. etcd wasn\u0026rsquo;t designed for that scale without careful tuning.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"performance-degradation-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#performance-degradation-patterns\" title=\"Performance Degradation Patterns\"\u003ePerformance Degradation Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eetcd performance degrades based on keyspace size, transaction rate, and storage backend latency. At small scale, these factors don\u0026rsquo;t matter. 
At mega-cluster scale, they dominate operational behavior.\u003c/p\u003e\n\u003cp\u003eSymptoms you\u0026rsquo;ll see:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eAPI server latency spikes during deployments\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ekubectl\u003c/code\u003e commands timing out intermittently\u003c/li\u003e\n\u003cli\u003eController reconciliation loops falling behind\u003c/li\u003e\n\u003cli\u003eScheduler making suboptimal placement decisions due to stale state\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe root cause is usually one of three things: etcd database size exceeding memory capacity, insufficient IOPS on the storage backend, or transaction rate overwhelming the commit pipeline.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"backup-size-and-recovery-time\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#backup-size-and-recovery-time\" title=\"Backup Size and Recovery Time\"\u003eBackup Size and Recovery Time\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eetcd backup size scales with keyspace. A 100-node cluster might produce 500MB backups. A 1,000-node cluster produces 8GB+ backups. That size creates operational problems:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBackup windows extend from minutes to hours\u003c/li\u003e\n\u003cli\u003eNetwork transfer costs increase linearly\u003c/li\u003e\n\u003cli\u003eRecovery time objectives (RTO) slip from \u0026ldquo;15 minutes\u0026rdquo; to \u0026ldquo;2+ hours\u0026rdquo;\u003c/li\u003e\n\u003cli\u003eStorage costs for retention policies multiply unexpectedly\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWorse: most backup solutions for etcd aren\u0026rsquo;t tested at mega-cluster scale. 
The tooling that works reliably at 100 nodes silently fails or creates corrupted snapshots at 1,000 nodes.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"practical-mitigation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#practical-mitigation\" title=\"Practical Mitigation\"\u003ePractical Mitigation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAKS manages etcd for you\u003c/a\u003e, but you still need to monitor and validate its health. Here\u0026rsquo;s a Terraform configuration that sets up Azure Monitor alerts for etcd-related API server latency:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_monitor_metric_alert\u0026#34; \u0026#34;etcd_latency\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-etcd-high-latency\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
scopes\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  description\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Alert when API server latency exceeds 200ms (etcd saturation signal)\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  severity\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e2\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  frequency\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT1M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  window_size\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT5M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003ecriteria\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    
metric_namespace\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Microsoft.ContainerService/managedClusters\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;apiserver_request_duration_seconds\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    aggregation\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Average\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    operator\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GreaterThan\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    threshold\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e0\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"m\"\u003e2\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003edimension\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      name\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;verb\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      operator\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Include\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      values\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;GET\u0026#34;, \u0026#34;LIST\u0026#34;, \u0026#34;PATCH\u0026#34;, \u0026#34;UPDATE\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eaction\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    action_group_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_monitor_action_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eplatform\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_monitor_metric_alert\u0026#34; \u0026#34;etcd_database_size\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-etcd-database-size-warning\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  scopes\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  description\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Alert when etcd database approaches size limits\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  severity\u003c/span\u003e            
\u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e3\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  frequency\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT5M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  window_size\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT15M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003ecriteria\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_namespace\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Microsoft.ContainerService/managedClusters\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;etcd_db_total_size_in_bytes\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    aggregation\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Average\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    operator\u003c/span\u003e         \u003cspan 
class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GreaterThan\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    threshold\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e6442450944\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # 6GB (warning threshold)\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eaction\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    action_group_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_monitor_action_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eplatform\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThese alerts won\u0026rsquo;t prevent etcd saturation, but they\u0026rsquo;ll give you advance warning before cascading failures occur. 
At scale, that early warning is the difference between a controlled maintenance window and an all-hands incident.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"network-performance-when-overlay-solutions-hit-limits\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#network-performance-when-overlay-solutions-hit-limits\" title=\"Network Performance: When Overlay Solutions Hit Limits\"\u003eNetwork Performance: When Overlay Solutions Hit Limits\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eNetwork overlay performance is invisible at small scale and catastrophic at large scale. Container Network Interface (CNI) plugins that handle 50,000 pods without issue can saturate CPU and drop packets at 200,000 pods. There is no single right answer for which CNI to use. For a full breakdown of the tradeoffs between kubenet, Azure CNI, and Azure CNI Overlay, see \u003ca href=\"/posts/aks-networking-clash/\"\u003eAKS Networking Clash: kubenet vs. CNI vs. CNI Overlay\u003c/a\u003e.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"pod-density-and-node-saturation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#pod-density-and-node-saturation\" title=\"Pod Density and Node Saturation\"\u003ePod Density and Node Saturation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure CNI Overlay\u003c/a\u003e supports up to 250 pods per node. That\u0026rsquo;s a theoretical maximum. 
Practical limits depend on network I/O patterns, pod churn rate, and service mesh overhead.\u003c/p\u003e\n\u003cp\u003eSignals that you\u0026rsquo;re approaching saturation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eNodes showing high system CPU (kernel networking overhead)\u003c/li\u003e\n\u003cli\u003eIntermittent packet loss between pods on the same node\u003c/li\u003e\n\u003cli\u003eService discovery latency increasing over time\u003c/li\u003e\n\u003cli\u003eDNS resolution failures under load\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe underlying issue: network namespace creation, iptables rule updates, and conntrack table management all scale poorly. At 200 pods per node, these operations consume negligible resources. At 250 pods per node, they dominate system CPU.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cross-node-latency-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cross-node-latency-patterns\" title=\"Cross-Node Latency Patterns\"\u003eCross-Node Latency Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eOverlay networks add encapsulation overhead. Azure CNI Overlay typically adds 100-200 microseconds per hop. At small scale, that\u0026rsquo;s noise. At mega-cluster scale, it compounds across multi-tier applications.\u003c/p\u003e\n\u003cp\u003eExample: a request traversing frontend → API gateway → backend service → database proxy touches 4 pods. If those pods span nodes, you\u0026rsquo;ve added 400-800 microseconds of latency from network overhead alone. 
Multiply that by 10,000 requests per second and the impact becomes measurable in user-facing metrics.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"mitigation-strategy\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#mitigation-strategy\" title=\"Mitigation Strategy\"\u003eMitigation Strategy\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003ePin latency-sensitive workloads to the same node using pod affinity\u003c/li\u003e\n\u003cli\u003eUse host networking for data-plane components (with appropriate security controls)\u003c/li\u003e\n\u003cli\u003eMonitor conntrack table utilization: \u003ccode\u003esysctl net.netfilter.nf_conntrack_count\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eSet conservative pod density limits (180-200 pods/node instead of 250)\u003c/li\u003e\n\u003cli\u003eImplement service mesh with extended Berkeley Packet Filter (eBPF) dataplane (\u003ca href=\"https://cilium.io/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eCilium\u003c/a\u003e) to reduce iptables overhead\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThese aren\u0026rsquo;t performance optimizations. 
They\u0026rsquo;re operational requirements at scale.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"observability-overhead-when-monitoring-becomes-the-problem\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#observability-overhead-when-monitoring-becomes-the-problem\" title=\"Observability Overhead: When Monitoring Becomes the Problem\"\u003eObservability Overhead: When Monitoring Becomes the Problem\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eObservability at scale creates a paradox: the systems you need to diagnose problems become the source of resource exhaustion.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"logging-cost-explosion\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#logging-cost-explosion\" title=\"Logging Cost Explosion\"\u003eLogging Cost Explosion\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eA single pod generating 100KB/day of logs costs nothing. 200,000 pods generating the same logs produce 20GB/day. Over a month, that\u0026rsquo;s 600GB. With 3x replication and 90-day retention, you\u0026rsquo;re storing about 5.4TB of log data (20GB/day × 90 days × 3 replicas).\u003c/p\u003e\n\u003cp\u003eStorage costs for that volume run into thousands of dollars monthly. Query performance degrades. Log ingestion pipelines fall behind. The tooling designed to help you debug problems becomes unusable during incidents.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"metric-cardinality-problems\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#metric-cardinality-problems\" title=\"Metric Cardinality Problems\"\u003eMetric Cardinality Problems\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePrometheus-based monitoring hits cardinality limits around 10 million active time series. 
A 1,000-node cluster with moderate instrumentation easily exceeds that threshold:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e200,000 pods × 20 metrics per pod = 4M series\u003c/li\u003e\n\u003cli\u003e1,000 nodes × 100 metrics per node = 100K series\u003c/li\u003e\n\u003cli\u003e50 services × 10K instances × 5 metrics = 2.5M series\u003c/li\u003e\n\u003cli\u003eCustom application metrics add another 3M+ series\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWhen you exceed cardinality limits, Prometheus becomes unstable. Queries time out. Dashboards fail to render. Alerting rules stop evaluating. You lose observability exactly when you need it most.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"practical-approaches\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#practical-approaches\" title=\"Practical Approaches\"\u003ePractical Approaches\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eImplement aggressive log sampling: 1% sampling still gives roughly 200MB/day of logs\u003c/li\u003e\n\u003cli\u003eUse structured logging with consistent field names to enable efficient compression\u003c/li\u003e\n\u003cli\u003eArchive cold logs to blob storage (pennies per GB vs. dollars per GB in hot storage)\u003c/li\u003e\n\u003cli\u003eDeploy federated Prometheus with careful metric filtering at scrape time\u003c/li\u003e\n\u003cli\u003eUse recording rules to pre-aggregate high-cardinality metrics\u003c/li\u003e\n\u003cli\u003eConsider managed observability services (Azure Monitor, Datadog) that handle scale for you\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe honest assessment: if your observability stack consumes more than 10% of cluster resources, it\u0026rsquo;s time to rethink your approach. 
At mega-cluster scale, that threshold is easy to exceed.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cost-spirals-small-decisions-with-exponential-consequences\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cost-spirals-small-decisions-with-exponential-consequences\" title=\"Cost Spirals: Small Decisions with Exponential Consequences\"\u003eCost Spirals: Small Decisions with Exponential Consequences\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eCost optimization at 100 nodes is optional. At 1,000 nodes, it\u0026rsquo;s mandatory. Small inefficiencies compound brutally.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"resource-overprovisioning\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#resource-overprovisioning\" title=\"Resource Overprovisioning\"\u003eResource Overprovisioning\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eTeams typically request 2x their actual resource needs as a safety margin. At 100 nodes, that\u0026rsquo;s wasteful but affordable. At 1,000 nodes, half of the cluster\u0026rsquo;s capacity sits idle: on 8-vCPU nodes, that is roughly 4,000 of 8,000 vCPUs you pay for but never use.\u003c/p\u003e\n\u003cp\u003eWith Azure D8s_v5 nodes at ~$0.40/hour, a 1,000-node cluster costs ~$288,000/month (roughly $3.5 million/year) in compute alone. If 50% of that capacity is overprovisioned, about $144,000 per month pays for idle resources. That\u0026rsquo;s real budget impact.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"storage-cost-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#storage-cost-patterns\" title=\"Storage Cost Patterns\"\u003eStorage Cost Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eEvery pod gets ephemeral storage. Most clusters also provision persistent volumes. At scale, storage costs can exceed compute costs.\u003c/p\u003e\n\u003cp\u003eExample: 200,000 pods with 10GB ephemeral storage each = 2PB of ephemeral storage. Persistent volume claims add another 500TB+. Azure Premium SSD costs $0.135/GB/month. 
You\u0026rsquo;re paying $300K+ monthly for storage alone.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"network-egress-surprises\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#network-egress-surprises\" title=\"Network Egress Surprises\"\u003eNetwork Egress Surprises\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCross-region and internet egress costs scale linearly with traffic volume. A 1,000-node cluster handling 10TB/day of egress traffic incurs $1,500/day in bandwidth costs ($45,000/month).\u003c/p\u003e\n\u003cp\u003eTeams typically discover these costs 60 days into a scale-up when the first full billing cycle completes. By then, architectural changes are expensive and disruptive.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cost-control-strategy\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cost-control-strategy\" title=\"Cost Control Strategy\"\u003eCost Control Strategy\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eImplement cluster autoscaling with aggressive scale-down policies\u003c/li\u003e\n\u003cli\u003eUse spot instances for fault-tolerant workloads (70% cost reduction)\u003c/li\u003e\n\u003cli\u003eRight-size pod resource requests using \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/vertical-pod-autoscaler\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eVPA (Vertical Pod Autoscaler)\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003eEnable Azure Hybrid Benefit for Windows nodes\u003c/li\u003e\n\u003cli\u003eDeploy regional caching layers to reduce cross-region egress\u003c/li\u003e\n\u003cli\u003eMonitor and alert on cost metrics, not just resource metrics\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eTeams defer cost optimization in favor of operational simplicity, and early on that is usually the right call. At mega-cluster scale, that priority reverses. Cost efficiency becomes a constraint you cannot ignore. 
\u003ca href=\"/posts/cost-optimization-resource-governance-aks/\"\u003eAKS Cost Optimization: Resource Governance That Actually Works\u003c/a\u003e goes deeper on VPA configuration and autoscaling policies if you want the practical implementation details.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"debugging-at-scale-finding-needles-in-exponentially-larger-haystacks\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#debugging-at-scale-finding-needles-in-exponentially-larger-haystacks\" title=\"Debugging at Scale: Finding Needles in Exponentially Larger Haystacks\"\u003eDebugging at Scale: Finding Needles in Exponentially Larger Haystacks\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDebugging a 100-node cluster means checking logs from a few thousand pods. Debugging a 1,000-node cluster means isolating the problem from millions of log lines across 200,000+ pods.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"correlation-and-isolation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#correlation-and-isolation\" title=\"Correlation and Isolation\"\u003eCorrelation and Isolation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhen a user reports an error, your troubleshooting workflow looks like this:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eIdentify the service handling the request (1 of 50+ services)\u003c/li\u003e\n\u003cli\u003eFind the pod instance that processed the request (1 of 5,000+ pod instances)\u003c/li\u003e\n\u003cli\u003eLocate the relevant log lines (1 of 10M+ log events in the time window)\u003c/li\u003e\n\u003cli\u003eCorrelate with upstream/downstream service calls\u003c/li\u003e\n\u003cli\u003eReproduce the issue in a controlled environment\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAt small scale, steps 2-3 take minutes. At mega-cluster scale, they take hours, assuming correlation IDs exist and work correctly. 
Without proper instrumentation, they\u0026rsquo;re impossible.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"reproduction-challenges\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#reproduction-challenges\" title=\"Reproduction Challenges\"\u003eReproduction Challenges\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eIssues that reproduce reliably at scale rarely reproduce in test environments. A race condition that triggers once per 100,000 requests never manifests in pre-production. Network congestion patterns that emerge at 1,000 nodes don\u0026rsquo;t exist at 10 nodes.\u003c/p\u003e\n\u003cp\u003eThis creates a diagnostic blind spot. You can observe the failure in production but can\u0026rsquo;t reproduce it for root cause analysis.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"large-scale-troubleshooting-checklist\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#large-scale-troubleshooting-checklist\" title=\"Large-Scale Troubleshooting Checklist\"\u003eLarge-Scale Troubleshooting Checklist\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eHere\u0026rsquo;s a diagnostic script I use for investigating performance degradation at scale:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Large-scale AKS cluster diagnostic script\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Run this when experiencing unexplained performance issues\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eCLUSTER_NAME\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003e1\u003c/span\u003e\u003cspan class=\"p\"\u003e:?Cluster name required\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eRESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003e2\u003c/span\u003e\u003cspan class=\"p\"\u003e:?Resource group required\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eOUTPUT_DIR\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;./diagnostics-\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003edate +%Y%m%d-%H%M%S\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Running diagnostics for cluster: 
\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003emkdir -p \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Get cluster credentials\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-credentials --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --overwrite-existing\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Node health check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking node health...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get nodes -o wide \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e/nodes.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl top nodes \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/node-resources.txt\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Metrics server unavailable\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/node-resources.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# API server latency check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking API server latency...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e i in \u003cspan class=\"o\"\u003e{\u003c/span\u003e1..5\u003cspan class=\"o\"\u003e}\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003etime\u003c/span\u003e kubectl get nodes \u0026gt; /dev/null 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep real \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/api-latency.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# etcd health indicators\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking etcd health signals...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get --raw /metrics \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep -E \u003cspan class=\"s2\"\u003e\u0026#34;apiserver_request_duration|etcd_request_duration\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/etcd-metrics.txt\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Metrics unavailable\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/etcd-metrics.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Pod distribution analysis\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Analyzing pod distribution...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get pods -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | \u0026#34;\\(.spec.nodeName)\u0026#34;\u0026#39;\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e sort \u003cspan class=\"p\"\u003e|\u003c/span\u003e uniq -c \u003cspan class=\"p\"\u003e|\u003c/span\u003e sort -rn \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/pod-distribution.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Network policy count (can cause iptables overhead)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking network policy count...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get networkpolicies -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/netpol-count.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Service endpoint count (affects kube-proxy performance)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking service endpoint count...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get endpoints -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq \u003cspan class=\"s1\"\u003e\u0026#39;[.items[].subsets[]?.addresses[]?] | length\u0026#39;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/endpoint-count.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Pods that are not Ready (a common symptom of resource pressure)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Identifying pods that are not Ready...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get pods -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | select(.status.conditions[]? 
| select(.type==\u0026#34;Ready\u0026#34; and .status==\u0026#34;False\u0026#34;)) | \u0026#34;\\(.metadata.namespace)/\\(.metadata.name)\u0026#34;\u0026#39;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/not-ready-pods.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Recent events (truncated for performance)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Capturing recent cluster events...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get events -A --sort-by\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;.lastTimestamp\u0026#39;\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e tail -1000 \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/recent-events.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Node condition checks\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking for node pressure conditions...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get nodes -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | select(any(.status.conditions[]?; (.type==\u0026#34;MemoryPressure\u0026#34; or .type==\u0026#34;DiskPressure\u0026#34; or .type==\u0026#34;PIDPressure\u0026#34;) and .status==\u0026#34;True\u0026#34;)) | .metadata.name\u0026#39;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/nodes-under-pressure.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# ConfigMap and Secret count (affects etcd size)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Counting ConfigMaps and Secrets...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ConfigMaps: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get configmaps -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;Secrets: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get secrets -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt;\u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Total Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pods -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt;\u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# DNS performance check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Testing DNS resolution performance...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl run dns-test --image\u003cspan class=\"o\"\u003e=\u003c/span\u003ebusybox:1.36 --restart\u003cspan class=\"o\"\u003e=\u003c/span\u003eNever --rm -i --command -- sh -c \u003cspan class=\"s2\"\u003e\u0026#34;time nslookup 
kubernetes.default\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/dns-test.txt\u0026#34;\u003c/span\u003e 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;DNS test failed\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/dns-test.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Diagnostics complete. 
Results in: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Quick analysis:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Nodes: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get nodes --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pods -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Not Ready Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ecat \u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e/not-ready-pods.txt \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Nodes Under Pressure: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ecat \u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e/nodes-under-pressure.txt \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Review the output files for detailed diagnostics.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script collects the signals that matter at scale: API latency, pod distribution skew, resource pressure indicators, and object count metrics. It doesn\u0026rsquo;t solve problems, but it eliminates 90% of the noise.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"patterns-that-prevent-catastrophe\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#patterns-that-prevent-catastrophe\" title=\"Patterns That Prevent Catastrophe\"\u003ePatterns That Prevent Catastrophe\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAfter running mega-clusters through multiple incident cycles, a few patterns consistently prevent the worst outcomes:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eProgressive rollouts\u003c/strong\u003e: Never deploy to 1,000 nodes simultaneously. Deploy to 1 node, then 10, then 100, then all. Automate rollback triggers. 
This pattern catches 95% of scale-dependent bugs before they impact production.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBlast radius isolation\u003c/strong\u003e: Segment your cluster into failure domains using node pools, namespaces, and network policies. When something fails (and it will), contain the damage. \u003ca href=\"/posts/aks-network-policies-zero-trust/\"\u003eAKS Network Policies: The Security Layer Your Cluster Is Missing\u003c/a\u003e covers practical policy configuration if you are starting from scratch.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCapacity reservation\u003c/strong\u003e: Reserve 15-20% headroom for burst traffic and incident response. Running at 90%+ utilization saves money until you need to scale during an outage and can\u0026rsquo;t.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eImmutable infrastructure\u003c/strong\u003e: Treat nodes as cattle, not pets. Automate node replacement on a fixed schedule (weekly or monthly). This prevents subtle configuration drift that compounds into unreproducible failures.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOperational runbooks\u003c/strong\u003e: Document every common failure mode. When API server latency spikes at 2 AM, you don\u0026rsquo;t want to be reading Kubernetes source code to understand etcd compaction behavior.\u003c/p\u003e\n\u003cp\u003eThese patterns aren\u0026rsquo;t revolutionary. They\u0026rsquo;re boring, defensive engineering. At mega-cluster scale, boring wins.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"honest-takeaways\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#honest-takeaways\" title=\"Honest Takeaways\"\u003eHonest Takeaways\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning AKS at 1,000+ nodes isn\u0026rsquo;t fundamentally different from running it at 100 nodes. It\u0026rsquo;s exponentially different. Problems that self-heal at small scale cascade catastrophically at large scale. 
Architectural decisions that feel premature at 50 nodes become load-bearing at 500 nodes.\u003c/p\u003e\n\u003cp\u003eIf you\u0026rsquo;re planning to scale past 500 nodes: budget significant engineering time for operational tooling. Plan your observability strategy before your first node boots. Understand your cost model in detail. Test failure scenarios at scale before they happen in production.\u003c/p\u003e\n\u003cp\u003eIf you\u0026rsquo;re already running at scale: you know everything in this article because you\u0026rsquo;ve lived it. The value isn\u0026rsquo;t the advice. It\u0026rsquo;s knowing you\u0026rsquo;re not alone in discovering these lessons the hard way.\u003c/p\u003e\n\u003cp\u003eScale is honest. Every shortcut taken for velocity will surface eventually, usually at the worst possible moment. Budget engineering time to address that reality before you hit 500 nodes, not after. Fixing structural problems under production pressure costs significantly more than getting the architecture right from the start.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-04-01T17:00:00+01:00","id":"https://daily-devops.net/posts/aks-at-scale-mega-cluster-lessons/","language":"en","summary":"Real-world lessons from operating 1000+ node AKS clusters: etcd limits, network saturation, observability overhead, and cost spirals you need to know.","tags":["kubernetes","azure","cloud","devops","operations","infrastructure"],"title":"AKS at Scale: Hard-Won Lessons from 1000+ Node Clusters","url":"https://daily-devops.net/posts/aks-at-scale-mega-cluster-lessons/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\n\n\n\n\u003ch2 id=\"the-problem-cloud-and-on-prem-as-operational-silos\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#the-problem-cloud-and-on-prem-as-operational-silos\" title=\"The Problem: Cloud and On-Prem as Operational Silos\"\u003eThe Problem: Cloud and On-Prem as 
Operational Silos\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eMost organizations don\u0026rsquo;t run purely in the cloud. Legacy systems, compliance requirements, data gravity, and latency concerns keep critical workloads on-premises indefinitely. Running AKS in Azure alongside on-prem Kubernetes clusters multiplies management overhead: two separate control planes to patch, two policy frameworks to keep in sync, two identity configurations to audit, and two observability stacks generating alerts nobody wants to correlate manually.\u003c/p\u003e\n\u003cp\u003eThe temptation is to build custom tooling that bridges the gap. That usually ends as a fragile script collection that only one person on the team understands. Azure Arc changes the equation: it extends Azure\u0026rsquo;s management plane to any Kubernetes cluster without migrating workloads.\u003c/p\u003e\n\u003cp\u003eThis article covers the practical pieces: network connectivity options, Azure Arc for unified management, DNS resolution across environment boundaries, policy enforcement, and identity federation.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"connectivity-models-getting-traffic-between-environments\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#connectivity-models-getting-traffic-between-environments\" title=\"Connectivity Models: Getting Traffic Between Environments\"\u003eConnectivity Models: Getting Traffic Between Environments\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eBefore you can manage hybrid Kubernetes deployments, you need reliable network connectivity. 
Three primary patterns exist, each with distinct trade-offs.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"expressroute-dedicated-private-connectivity\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#expressroute-dedicated-private-connectivity\" title=\"ExpressRoute: Dedicated Private Connectivity\"\u003eExpressRoute: Dedicated Private Connectivity\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/expressroute/expressroute-introduction\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eExpressRoute\u003c/a\u003e provides a dedicated, private connection between on-premises and Azure, bypassing the public internet entirely. Latency is predictable, throughput is consistent, and the connection doesn\u0026rsquo;t compete with general internet traffic.\u003c/p\u003e\n\u003cp\u003eThe operational reality: provisioning takes weeks, requires coordination with a connectivity provider, and demands solid Border Gateway Protocol (BGP) knowledge from your network team. Cost is significant. For production workloads with compliance requirements or sustained high-bandwidth data transfer, those trade-offs are usually acceptable. For a dev/test environment or proof-of-concept, they aren\u0026rsquo;t.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"site-to-site-vpn-cost-effective-alternative\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#site-to-site-vpn-cost-effective-alternative\" title=\"Site-to-Site VPN: Cost-Effective Alternative\"\u003eSite-to-Site VPN: Cost-Effective Alternative\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eSite-to-Site (S2S) VPN creates encrypted tunnels over the public internet. Setup takes hours rather than weeks, cost is a fraction of ExpressRoute, and it works without engaging a connectivity provider.\u003c/p\u003e\n\u003cp\u003eThe catch is performance variability. Throughput degrades under load, latency spikes during congestion periods, and encryption overhead adds up. 
For proof-of-concept environments, dev/test workloads, or bursty low-volume traffic, S2S VPN is the pragmatic choice. For production databases replicating continuously across the boundary, it usually isn\u0026rsquo;t enough.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"vnet-peering-cloud-only-hybrid\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#vnet-peering-cloud-only-hybrid\" title=\"VNet Peering: Cloud-Only Hybrid\"\u003eVNet Peering: Cloud-Only Hybrid\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-peering-overview\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eVNet peering\u003c/a\u003e connects Azure VNets across regions or subscription boundaries. If both sides run in Azure and you\u0026rsquo;re drawing a line between subscriptions rather than between cloud and datacenter, this is the simplest option: no gateways, no BGP, no provider contracts.\u003c/p\u003e\n\u003cp\u003eIt doesn\u0026rsquo;t solve the on-prem connectivity problem. Peering only works between Azure VNets.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"infrastructure-as-code-expressroute--aks\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#infrastructure-as-code-expressroute--aks\" title=\"Infrastructure as Code: ExpressRoute \u0026#43; AKS\"\u003eInfrastructure as Code: ExpressRoute + AKS\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhatever connectivity model you choose, infrastructure repeatability matters from day one. Deploying gateways, subnets, route tables, and AKS clusters manually works once and creates problems on the second environment. 
The Terraform configuration below covers the full stack: ExpressRoute gateway, private DNS zone, and AKS with Azure CNI and private cluster enabled.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Terraform configuration for ExpressRoute + AKS hybrid connectivity\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Variables and provider configuration assumed\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_resource_group\u0026#34; \u0026#34;hybrid\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;hybrid-aks-rg\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;westeurope\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# 
Virtual Network for AKS and hybrid connectivity\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_virtual_network\u0026#34; \u0026#34;aks_vnet\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-vnet\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  address_space\u003c/span\u003e       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.1.0.0/16\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Subnet for AKS nodes\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_subnet\u0026#34; \u0026#34;aks_nodes\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-nodes-subnet\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  address_prefixes\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.1.1.0/24\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Gateway subnet for ExpressRoute\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_subnet\u0026#34; \u0026#34;gateway\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GatewaySubnet\u0026#34;\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # Name must be exactly this\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
virtual_network_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  address_prefixes\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.1.255.0/27\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Public IP for ExpressRoute Gateway\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_public_ip\u0026#34; \u0026#34;er_gateway_ip\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;er-gateway-pip\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  allocation_method\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Static\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  sku\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# ExpressRoute Gateway\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_virtual_network_gateway\u0026#34; 
\u0026#34;er_gateway\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;er-gateway\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  type\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ExpressRoute\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  sku\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eip_configuration\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    name\u003c/span\u003e                          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;gateway-ip-config\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    public_ip_address_id\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_public_ip\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eer_gateway_ip\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    private_ip_address_allocation\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Dynamic\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    subnet_id\u003c/span\u003e                     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_subnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003egateway\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Connection from ExpressRoute Gateway to the pre-provisioned circuit\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Configure var.expressroute_circuit_id with your existing circuit resource ID:\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# var.expressroute_circuit_id = \u0026#34;/subscriptions/.../expressRouteCircuits/...\u0026#34;\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_virtual_network_gateway_connection\u0026#34; \u0026#34;onprem\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;er-onprem-connection\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e                   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e        \u003cspan 
class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  type\u003c/span\u003e                       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ExpressRoute\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_gateway_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network_gateway\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eer_gateway\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  express_route_circuit_id\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eexpressroute_circuit_id\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Private DNS Zone for internal services\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_private_dns_zone\u0026#34; \u0026#34;internal\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;internal.azure.local\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Link DNS zone to VNet\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_private_dns_zone_virtual_network_link\u0026#34; \u0026#34;aks\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;aks-vnet-link\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  private_dns_zone_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_private_dns_zone\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003einternal\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_id\u003c/span\u003e    \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  registration_enabled\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# AKS cluster with ExpressRoute connectivity\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_kubernetes_cluster\u0026#34; \u0026#34;aks\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;hybrid-aks\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  dns_prefix\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;hybrid-aks\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  kubernetes_version\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;1.31\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003edefault_node_pool\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;system\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    node_count\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e3\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    vm_size\u003c/span\u003e             \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard_D4s_v5\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    vnet_subnet_id\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_subnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_nodes\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    auto_scaling_enabled\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    min_count\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e3\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    max_count\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e10\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eidentity\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    type\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;SystemAssigned\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003enetwork_profile\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    network_plugin\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;azure\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    network_policy\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;calico\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    service_cidr\u003c/span\u003e       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;10.2.0.0/16\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    dns_service_ip\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;10.2.0.10\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    load_balancer_sku\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  private_cluster_enabled\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  depends_on\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"p\"\u003e[\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003eazurerm_virtual_network_gateway\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eer_gateway\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Route table for on-prem traffic via ExpressRoute\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_route_table\u0026#34; \u0026#34;onprem_routes\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;onprem-routes\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan 
class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eroute\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    name\u003c/span\u003e                   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;to-onprem-datacenter\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    address_prefix\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;10.0.0.0/8\u0026#34;\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # On-prem network range\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    next_hop_type\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;VirtualNetworkGateway\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan 
class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Associate route table with AKS subnet\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_subnet_route_table_association\u0026#34; \u0026#34;aks_routes\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  subnet_id\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_subnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_nodes\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  route_table_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_route_table\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eonprem_routes\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis Terraform configuration establishes the foundation for hybrid connectivity: ExpressRoute gateway, private DNS, and AKS with network policies. 
Customize address ranges, SKUs, and routing rules for your environment.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-arc-unified-kubernetes-management\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#azure-arc-unified-kubernetes-management\" title=\"Azure Arc: Unified Kubernetes Management\"\u003eAzure Arc: Unified Kubernetes Management\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAzure Arc extends Azure management to any Kubernetes cluster: on-prem, edge locations, or other clouds. It registers external clusters as Azure resources, enabling centralized management without forcing workload migration.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"what-arc-provides\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#what-arc-provides\" title=\"What Arc Provides\"\u003eWhat Arc Provides\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eArc-enabled Kubernetes clusters gain:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eUnified inventory\u003c/strong\u003e: View all clusters in Azure Resource Manager\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003ePolicy enforcement\u003c/strong\u003e: Azure Policy extends to Arc clusters\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGitOps deployment\u003c/strong\u003e: Flux configurations apply consistently\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMonitoring integration\u003c/strong\u003e: Azure Monitor collects metrics and logs\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRBAC integration\u003c/strong\u003e: Azure AD for cluster authentication\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eArc doesn\u0026rsquo;t move workloads to Azure. 
It extends Azure\u0026rsquo;s control plane to wherever your clusters run.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"onboarding-an-on-prem-cluster\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#onboarding-an-on-prem-cluster\" title=\"Onboarding an On-Prem Cluster\"\u003eOnboarding an On-Prem Cluster\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eConnecting an existing Kubernetes cluster to Arc requires cluster admin access and network connectivity to Azure endpoints.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Azure Arc onboarding script\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Requires: Azure CLI, kubectl, cluster admin kubeconfig\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eRESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;hybrid-infra-rg\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eCLUSTER_NAME\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;onprem-k8s-01\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eLOCATION\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;westeurope\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Login and set subscription\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz login\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz account \u003cspan class=\"nb\"\u003eset\u003c/span\u003e --subscription \u003cspan class=\"s2\"\u003e\u0026#34;your-subscription-id\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Create resource group if needed\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz group create --name \u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e --location \u003cspan class=\"nv\"\u003e$LOCATION\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Register Arc providers\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider register --namespace Microsoft.Kubernetes\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider register --namespace Microsoft.KubernetesConfiguration\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider register --namespace Microsoft.ExtendedLocation\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check registration state (registration can take several minutes; re-run until it reports Registered)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider show -n Microsoft.Kubernetes -o table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider show -n Microsoft.KubernetesConfiguration -o table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Install Arc extensions\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz extension add --name connectedk8s\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz extension add --name k8s-configuration\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Connect cluster to Arc\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz connectedk8s connect \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group \u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --location \u003cspan class=\"nv\"\u003e$LOCATION\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --tags \u003cspan class=\"nv\"\u003eenvironment\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eproduction \u003cspan class=\"nv\"\u003edatacenter\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eonprem\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify connection\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz connectedk8s show \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group \u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --query \u003cspan class=\"s2\"\u003e\u0026#34;connectivityStatus\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eOnce connected, the cluster appears in the Azure portal alongside AKS clusters. 
Management operations (viewing workloads, applying policies, deploying via GitOps) work identically.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"policy-enforcement-across-environments\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#policy-enforcement-across-environments\" title=\"Policy Enforcement Across Environments\"\u003ePolicy Enforcement Across Environments\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure Policy for Kubernetes applies consistent governance rules across AKS and Arc clusters. Define policies once, enforce everywhere.\u003c/p\u003e\n\u003cp\u003eExample policy: require resource limits on all pods.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# pod-resource-limits-policy.yaml\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003econstraints.gatekeeper.sh/v1beta1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eK8sRequiredResources\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003erequire-pod-resource-limits\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ematch\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ekinds\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"nt\"\u003eapiGroups\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ekinds\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan 
class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Pod\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003estaging\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eparameters\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003elimits\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003ecpu\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan 
class=\"l\"\u003ememory\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003erequests\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003ecpu\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003ememory\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eApply this policy through Azure Policy, and it is enforced on both AKS and Arc-connected on-prem clusters. No duplicated configuration, no drift between environments.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"gitops-single-source-of-truth\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#gitops-single-source-of-truth\" title=\"GitOps: Single Source of Truth\"\u003eGitOps: Single Source of Truth\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eArc supports Flux-based GitOps configurations. Define cluster state in Git, and Flux reconciles each cluster against it. The \u003ccode\u003eaz k8s-configuration flux create\u003c/code\u003e command links your Git repository to a cluster; run it once per cluster, with \u003ccode\u003e--cluster-type managedClusters\u003c/code\u003e for AKS and \u003ccode\u003e--cluster-type connectedClusters\u003c/code\u003e for Arc. Changes sync automatically. 
Configuration drift gets corrected within minutes.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"dns-and-service-discovery-hybrid-resolution-without-complexity\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#dns-and-service-discovery-hybrid-resolution-without-complexity\" title=\"DNS and Service Discovery: Hybrid Resolution Without Complexity\"\u003eDNS and Service Discovery: Hybrid Resolution Without Complexity\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHybrid deployments need service discovery across boundaries. Pods in AKS must resolve on-prem services, and vice versa.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"approach-1-azure-private-dns-with-conditional-forwarding\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#approach-1-azure-private-dns-with-conditional-forwarding\" title=\"Approach 1: Azure Private DNS with Conditional Forwarding\"\u003eApproach 1: Azure Private DNS with Conditional Forwarding\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCreate a Private DNS zone in Azure, link it to your VNet, and configure on-prem DNS servers to conditionally forward queries for Azure domains to a resolver inside the VNet, such as an Azure DNS Private Resolver inbound endpoint or a DNS forwarder VM. Azure\u0026rsquo;s built-in resolver at 168.63.129.16 answers only queries that originate inside a VNet, so on-prem servers cannot forward to it directly. AKS clusters inherit VNet DNS configuration automatically. On-prem services get custom DNS entries pointing to ExpressRoute or VPN endpoints.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"approach-2-coredns-custom-forwarding\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#approach-2-coredns-custom-forwarding\" title=\"Approach 2: CoreDNS Custom Forwarding\"\u003eApproach 2: CoreDNS Custom Forwarding\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eFor cluster-level control, patch the CoreDNS ConfigMap to forward specific domain queries to on-prem DNS servers. 
This is the right approach when on-prem services use a domain suffix that doesn\u0026rsquo;t overlap with Azure Private DNS zones, or when you need different forwarding behavior per cluster.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# CoreDNS custom configmap - forward internal corporate domain to on-prem resolver\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ev1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eConfigMap\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ecoredns-custom\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ekube-system\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003edata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ecorp.server\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e\u003cspan class=\"sd\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e    corp.example.com:53 {\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        errors\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        cache 30\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        forward . 
10.0.0.53 {\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e            prefer_udp\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        }\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e    }\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eApply with \u003ccode\u003ekubectl apply -f coredns-custom.yaml\u003c/code\u003e. AKS detects the \u003ccode\u003ecoredns-custom\u003c/code\u003e ConfigMap automatically; if the new server block does not take effect, restart the CoreDNS pods with \u003ccode\u003ekubectl -n kube-system rollout restart deployment coredns\u003c/code\u003e. For the reverse path, configure on-prem DNS to forward \u003ccode\u003e*.privatelink.blob.core.windows.net\u003c/code\u003e and similar zones to a resolver inside the VNet (for example, an Azure DNS Private Resolver inbound endpoint), which in turn queries Azure\u0026rsquo;s virtual resolver at \u003ccode\u003e168.63.129.16\u003c/code\u003e. That address is not reachable from on-prem networks, so it cannot be used as a forwarding target directly.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor note:\u003c/strong\u003e DNS is usually where hybrid setups produce the most subtle and hardest-to-debug failures. A pod resolves a name correctly in testing, then silently times out in production because the CoreDNS cache held a stale entry across a VPN reconnect. Keep TTLs short for cross-boundary records and verify the full resolver chain with \u003ccode\u003enslookup\u003c/code\u003e from inside the cluster, not just from a workstation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eKey principle:\u003c/strong\u003e Avoid split-horizon DNS designs where the same name resolves differently depending on source location. 
Use Azure Private DNS as the primary zone authority where possible, and fall back to conditional forwarding only for domains you don\u0026rsquo;t control.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"identity-across-boundaries-federation-without-duplication\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#identity-across-boundaries-federation-without-duplication\" title=\"Identity Across Boundaries: Federation Without Duplication\"\u003eIdentity Across Boundaries: Federation Without Duplication\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHybrid deployments shouldn\u0026rsquo;t duplicate identity systems. Azure AD (now Microsoft Entra ID) integration extends to Arc clusters, providing centralized authentication and significantly reducing the number of credential systems to maintain.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"service-principals-for-cross-environment-access\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#service-principals-for-cross-environment-access\" title=\"Service Principals for Cross-Environment Access\"\u003eService Principals for Cross-Environment Access\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eApplications running on-prem that need access to Azure services (Key Vault, storage accounts, managed databases) can use Azure AD service principals with certificate-based authentication. 
Create a service principal, assign the appropriate role, and mount the certificate as a Kubernetes secret in the on-prem pod.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Create a service principal with a self-signed certificate and assign Key Vault access\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz ad sp create-for-rbac \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"s2\"\u003e\u0026#34;onprem-app-sp\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --role \u003cspan class=\"s2\"\u003e\u0026#34;Key Vault Secrets User\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --create-cert \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --scopes \u003cspan class=\"s2\"\u003e\u0026#34;/subscriptions/\u0026lt;sub-id\u0026gt;/resourceGroups/\u0026lt;rg\u0026gt;/providers/Microsoft.KeyVault/vaults/\u0026lt;vault\u0026gt;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis works reliably, but carries ongoing maintenance: certificate rotation, secret distribution across on-prem clusters, and audit trails that span two systems. 
For new workloads, federated credentials are worth the initial setup complexity.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"federated-credentials-for-workload-identity\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#federated-credentials-for-workload-identity\" title=\"Federated Credentials for Workload Identity\"\u003eFederated Credentials for Workload Identity\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eWorkload identity federation\u003c/a\u003e allows on-prem Kubernetes service accounts to authenticate as Azure AD identities without long-lived secrets. The on-prem cluster\u0026rsquo;s OIDC issuer endpoint issues tokens for service accounts; Azure AD trusts that issuer and exchanges the projected token for an Azure AD access token.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Register the on-prem cluster\u0026#39;s OIDC issuer with an Azure AD app registration\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz ad app federated-credential create \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --id \u0026lt;app-registration-id\u0026gt; \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --parameters \u003cspan class=\"s1\"\u003e\u0026#39;{\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e    \u0026#34;name\u0026#34;: 
\u0026#34;onprem-k8s-workload\u0026#34;,\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e    \u0026#34;issuer\u0026#34;: \u0026#34;https://\u0026lt;your-onprem-oidc-issuer\u0026gt;\u0026#34;,\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e    \u0026#34;subject\u0026#34;: \u0026#34;system:serviceaccount:production:my-app\u0026#34;,\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e    \u0026#34;audiences\u0026#34;: [\u0026#34;api://AzureADTokenExchange\u0026#34;]\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e  }\u0026#39;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe on-prem cluster needs to expose its OIDC discovery document at a publicly reachable (or Azure-reachable) endpoint. That\u0026rsquo;s the step that most commonly blocks initial setup. Verify the discovery document is accessible before spending time debugging token exchange errors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor note:\u003c/strong\u003e Migrating workloads from service principal secrets to federated credentials removes certificate rotation as a recurring task entirely. Secret sprawl across on-prem clusters was one of the more uncomfortable findings in the security reviews I\u0026rsquo;ve participated in. 
Federated credentials make the problem structurally impossible rather than just less likely.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"operational-consistency-making-hybrid-work-long-term\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#operational-consistency-making-hybrid-work-long-term\" title=\"Operational Consistency: Making Hybrid Work Long-Term\"\u003eOperational Consistency: Making Hybrid Work Long-Term\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHybrid deployments fail when operational practices diverge between environments. Consistency requires deliberate effort.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"monitoring-and-observability\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#monitoring-and-observability\" title=\"Monitoring and Observability\"\u003eMonitoring and Observability\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eUse \u003ca href=\"https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure Monitor Container Insights\u003c/a\u003e for both AKS and Arc clusters. 
Install the extension on Arc-connected clusters explicitly (AKS picks it up automatically with the add-on flag):\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz k8s-extension create \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name azuremonitor-containers \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-type connectedClusters \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-name onprem-k8s-01 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group hybrid-infra-rg \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --extension-type Microsoft.AzureMonitor.Containers \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --configuration-settings \u003cspan class=\"nv\"\u003elogAnalyticsWorkspaceResourceID\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u0026lt;workspace-resource-id\u0026gt;\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eMetrics, logs, and cluster health flow to a single Log Analytics workspace regardless of where the cluster runs. 
A simple Kusto Query Language (KQL) query surfaces cumulative container restart counts for pods seen in the last 24 hours, across all environments at once:\u003c/p\u003e\n\u003cpre tabindex=\"0\"\u003e\u003ccode class=\"language-kql\" data-lang=\"kql\"\u003eKubePodInventory\n| where TimeGenerated \u0026gt; ago(24h)\n// ContainerRestartCount is cumulative and repeated in every inventory record, so take the max per container before summing\n| summarize ContainerRestarts=max(ContainerRestartCount) by ClusterName, Namespace, Name, ContainerName\n| summarize Restarts=sum(ContainerRestarts) by ClusterName, Namespace\n| order by Restarts desc\n\u003c/code\u003e\u003c/pre\u003e\u003cp\u003eHaving AKS and on-prem clusters reporting to the same workspace makes cross-environment incident correlation significantly faster.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"update-management\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#update-management\" title=\"Update Management\"\u003eUpdate Management\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/agent-upgrade\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure Arc cluster autoupgrade\u003c/a\u003e reduces the operational gap between AKS (where upgrades are automated and well understood) and self-managed on-prem clusters (where upgrades have historically been postponed due to complexity). You can define upgrade channels, schedule maintenance windows, and receive notifications through the same Azure portal used for AKS fleet management.\u003c/p\u003e\n\u003cp\u003eThis doesn\u0026rsquo;t eliminate the need for upgrade validation in staging environments. But it removes the operational friction that leads to on-prem clusters running three minor versions behind production AKS.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cost-and-resource-tracking\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#cost-and-resource-tracking\" title=\"Cost and Resource Tracking\"\u003eCost and Resource Tracking\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eArc-enabled clusters report resource utilization to Azure. 
Tag clusters consistently with environment, cost-center, and region labels using \u003ccode\u003eaz connectedk8s update\u003c/code\u003e. Use Azure Cost Management to track total Kubernetes spend across cloud and on-prem, enabling accurate chargeback and budget planning.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"key-takeaways\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#key-takeaways\" title=\"Key Takeaways\"\u003eKey Takeaways\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHybrid AKS deployments succeed when you:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eChoose the right connectivity\u003c/strong\u003e: ExpressRoute for production, S2S VPN for dev/test, VNet peering for Azure-only scenarios\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse Azure Arc for unified management\u003c/strong\u003e: Extend Azure\u0026rsquo;s control plane rather than building parallel tooling\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eEnforce policies consistently\u003c/strong\u003e: Azure Policy + GitOps eliminate configuration drift\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSimplify DNS\u003c/strong\u003e: Azure Private DNS with conditional forwarding avoids complexity\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eFederate identity\u003c/strong\u003e: Azure AD integration reduces secret sprawl and management overhead\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMonitor everything in one place\u003c/strong\u003e: Azure Monitor provides visibility across environments\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eHybrid infrastructure doesn\u0026rsquo;t have to mean duplicated effort. Arc, proper networking, and consistent operational practices make multi-environment Kubernetes manageable.\u003c/p\u003e\n\u003cp\u003eThe goal isn\u0026rsquo;t cloud purity. 
It\u0026rsquo;s operational efficiency wherever your workloads run.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-03-25T17:00:00+01:00","id":"https://daily-devops.net/posts/hybrid-aks-on-prem-azure-arc/","language":"en","summary":"Practical patterns for connecting AKS to on-prem: ExpressRoute, VPN connectivity, Azure Arc management, DNS resolution, and identity federation.","tags":["hybrid","azure","kubernetes","cloud","devops","onprem","infrastructure"],"title":"Hybrid AKS: Bridging Cloud and On-Prem with Azure Arc","url":"https://daily-devops.net/posts/hybrid-aks-on-prem-azure-arc/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eYour cluster will fail. The question is not if, but when, and whether you can recover before customers notice. Most organizations discover their backup strategy does not work during an actual outage, when recovery time matters most and manual heroics cannot save you.\u003c/p\u003e\n\u003cp\u003eIf you run Azure Kubernetes Service (AKS) in production, you need a recovery plan that engineers can execute half asleep at 2 AM. 
We will go through what to back up, how Velero works in day-to-day operations, when Azure Backup for AKS is enough, and how to design realistic failover with measurable Recovery Time Objective (RTO) and Recovery Point Objective (RPO).\u003c/p\u003e\n\u003cp\u003eThe goal is simple: repeatable recovery procedures you have already tested, not a document that looks good in Confluence but fails during an incident.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-problem-untested-recovery-fails-when-it-matters\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#the-problem-untested-recovery-fails-when-it-matters\" title=\"The problem: Untested recovery fails when it matters\"\u003eThe problem: Untested recovery fails when it matters\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eEvery Kubernetes cluster accumulates state that must survive failures. Application data lives in persistent volumes. Cluster configuration exists in custom resource definitions. Workload definitions sit in YAML manifests scattered across repositories. Identity mappings, secrets, network policies, and RBAC rules define how services authenticate and communicate. Losing any of these components means downtime, data loss, and manual reconstruction under time pressure.\u003c/p\u003e\n\u003cp\u003eThe real risk is not having a backup strategy. The real risk is discovering your backup strategy does not work during an actual incident, when recovery time directly determines customer impact and business cost.\u003c/p\u003e\n\u003cp\u003eOperational reality: Most teams test backup creation but never test restoration. A backup you have never restored is a backup that will fail when you need it. 
Recovery procedures that require manual steps tend to fail during high-pressure incidents, when engineers make mistakes and documentation is incomplete.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"what-needs-backup-understanding-cluster-state\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#what-needs-backup-understanding-cluster-state\" title=\"What needs backup: Understanding cluster state\"\u003eWhat needs backup: Understanding cluster state\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes clusters contain multiple layers of state that require different backup approaches.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"application-data-persistent-volumes\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#application-data-persistent-volumes\" title=\"Application data: Persistent volumes\"\u003eApplication data: Persistent volumes\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePersistent volumes hold databases, file storage, configuration data, and application state. Losing persistent volume data typically means permanent data loss unless you maintain application-level replication or external backups. Azure Disks and Azure Files both support snapshot-based backup, but snapshots alone do not capture the Kubernetes metadata required to restore volumes to the correct pods in the correct namespaces.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cluster-configuration-custom-resources-and-crds\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#cluster-configuration-custom-resources-and-crds\" title=\"Cluster configuration: Custom resources and CRDs\"\u003eCluster configuration: Custom resources and CRDs\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCustom Resource Definitions (CRDs) extend Kubernetes with domain-specific objects. Operators, service meshes, monitoring stacks, and policy engines all define CRDs that control cluster behavior. Losing CRDs means losing the schemas that the operators and controllers in your cluster depend on. 
Restoring CRDs without the corresponding custom resource objects leaves your cluster in an inconsistent state.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"application-definitions-workload-manifests\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#application-definitions-workload-manifests\" title=\"Application definitions: Workload manifests\"\u003eApplication definitions: Workload manifests\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDeployments, StatefulSets, Services, ConfigMaps, and Secrets define what runs in your cluster. Most teams store these manifests in Git, but cluster state drifts from Git over time due to manual changes, automated rollouts, and operator modifications. A restore from Git alone may not reproduce the actual production state.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"identity-and-access-rbac-and-service-accounts\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#identity-and-access-rbac-and-service-accounts\" title=\"Identity and access: RBAC and service accounts\"\u003eIdentity and access: RBAC and service accounts\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRole-based access control (RBAC), ServiceAccounts, and Azure AD integration define who can access which resources. Losing RBAC configuration means losing security boundaries and breaking automated workflows that depend on specific service account permissions.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"network-configuration-policies-and-ingress-rules\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#network-configuration-policies-and-ingress-rules\" title=\"Network configuration: Policies and ingress rules\"\u003eNetwork configuration: Policies and ingress rules\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eNetwork policies, ingress controllers, and DNS mappings control how traffic flows into and within your cluster. 
Restoring workloads without restoring network configuration results in unreachable services and broken traffic routing.\u003c/p\u003e\n\u003cp\u003eA complete backup strategy captures all of these layers and validates that restoration procedures actually work.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"velero-production-backup-workflows\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#velero-production-backup-workflows\" title=\"Velero: Production backup workflows\"\u003eVelero: Production backup workflows\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eVelero is the de facto standard for Kubernetes backup and restore. It runs as a controller inside your cluster, captures cluster state and persistent volume snapshots, and stores backups in object storage.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"how-velero-works\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#how-velero-works\" title=\"How Velero works\"\u003eHow Velero works\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eVelero operates in two phases: backup and restore. During backup, Velero queries the Kubernetes API for resources matching your backup selectors, serializes those resources to JSON, and uploads the result to cloud object storage (Azure Blob Storage for AKS). For persistent volumes, Velero either triggers Azure Disk snapshots or performs file-level backups with Restic (newer Velero releases also support Kopia for this).\u003c/p\u003e\n\u003cp\u003eDuring restore, Velero downloads the backup manifest, applies resources to the target cluster, and restores persistent volume data from snapshots or file-level backup archives. 
Velero handles dependency ordering and namespace mapping automatically.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"backup-scheduling-and-retention\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#backup-scheduling-and-retention\" title=\"Backup scheduling and retention\"\u003eBackup scheduling and retention\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eProduction backup strategies require automated scheduling and retention policies. Velero supports cron-based schedules and configurable retention windows.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# Velero backup schedule - Helm values\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003evelero\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedules\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003edaily\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"c\"\u003e# Run full backup daily at 2 AM UTC\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedule\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 2 * * *\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003etemplate\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ettl\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;720h\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Retain backups for 30 days\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003eincludedNamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan 
class=\"l\"\u003estaging\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003esnapshotVolumes\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ehourly-critical\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"c\"\u003e# Run hourly backup for critical namespaces\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedule\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 * * * *\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003etemplate\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ettl\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;168h\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Retain backups for 7 days\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003eincludedNamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003elabelSelector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e          \u003c/span\u003e\u003cspan class=\"nt\"\u003ematchLabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e            \u003c/span\u003e\u003cspan class=\"nt\"\u003ebackup-frequency\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ehourly\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003esnapshotVolumes\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eFor many teams, this minimal Terraform baseline is easier to maintain than a large, custom module. It creates the storage account and container Velero needs.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_storage_account\u0026#34; \u0026#34;velero\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;velerobackup${var.environment}\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eresource_group_name\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  account_tier\u003c/span\u003e             \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  account_replication_type\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GRS\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_storage_container\u0026#34; \u0026#34;velero\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;velero\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  storage_account_name\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_storage_account\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003evelero\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  container_access_type\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;private\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThen install Velero with Helm and pass only four required values: provider (\u003ccode\u003eazure\u003c/code\u003e), storage account name, blob container name, and resource group. Keep advanced tuning for later once backups and restores are stable.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"testing-restore-procedures\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#testing-restore-procedures\" title=\"Testing restore procedures\"\u003eTesting restore procedures\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eBackup creation means nothing without verified restore capability. Production-grade DR requires regular restore testing in isolated environments.\u003c/p\u003e\n\u003cp\u003eRestore testing workflow:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eCreate a test AKS cluster in a separate resource group\u003c/li\u003e\n\u003cli\u003eInstall Velero with access to production backup storage\u003c/li\u003e\n\u003cli\u003eExecute restore operation for a representative namespace\u003c/li\u003e\n\u003cli\u003eValidate application functionality and data integrity\u003c/li\u003e\n\u003cli\u003eDocument restoration time and any issues encountered\u003c/li\u003e\n\u003cli\u003eDestroy test cluster\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eRun this workflow monthly at minimum. Quarterly is too infrequent because configuration drift and Velero version updates will cause surprises. 
Teams that skip restore testing discover broken procedures during actual outages.\u003c/p\u003e\n\u003cp\u003eCommon restore failures: Missing CRDs (restore CRDs before custom resources), incorrect namespace mappings (use Velero namespace mapping features), persistent volume availability zones (Azure Disks are zone-locked), and missing secrets (external secret management requires separate backup).\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-native-backup-when-to-use-it\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#azure-native-backup-when-to-use-it\" title=\"Azure native backup: When to use it\"\u003eAzure native backup: When to use it\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAzure Backup for AKS launched in 2023 and provides Azure-native cluster backup without deploying Velero. It integrates with Azure Backup vaults and uses the same portal experience as VM and database backups.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"azure-backup-vs-velero\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#azure-backup-vs-velero\" title=\"Azure Backup vs Velero\"\u003eAzure Backup vs Velero\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure Backup works well for organizations heavily invested in Azure tooling who want unified backup management across all Azure resources. It handles backup scheduling, retention, and monitoring through familiar Azure interfaces.\u003c/p\u003e\n\u003cp\u003eLimitations compared to Velero: Less flexibility in backup selectors and namespace filtering, fewer options for cross-region backup replication, and vendor lock-in to Azure. Velero supports multi-cloud scenarios and offers more granular control over what gets backed up.\u003c/p\u003e\n\u003cp\u003eRecommendation: Use Azure Backup if your organization already standardizes on Azure Backup for other resources and you do not require multi-cloud portability. 
Use Velero if you need maximum flexibility, cross-region replication control, or multi-cloud backup capability.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-region-failover-designing-for-actual-recovery\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#multi-region-failover-designing-for-actual-recovery\" title=\"Multi-region failover: Designing for actual recovery\"\u003eMulti-region failover: Designing for actual recovery\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eSingle-region deployments create single points of failure. Multi-region architectures provide genuine disaster recovery capability but introduce complexity in state synchronization, traffic routing, and recovery orchestration.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"failover-architecture-patterns\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#failover-architecture-patterns\" title=\"Failover architecture patterns\"\u003eFailover architecture patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eActive-passive:\u003c/strong\u003e Primary region handles all traffic. Secondary region remains idle but receives regular backup replication. During failover, you restore backups to the secondary cluster and redirect traffic. Recovery time depends on backup restore speed and DNS propagation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eActive-active:\u003c/strong\u003e Both regions handle production traffic simultaneously. Application state synchronizes continuously (database replication, event streaming, or shared storage). During regional failure, traffic shifts to the remaining region. Recovery time depends on health check detection and DNS/load balancer failover speed.\u003c/p\u003e\n\u003cp\u003eActive-passive costs less but requires longer recovery time. 
Active-active provides faster failover but doubles infrastructure cost and requires application-level state synchronization.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"dns-failover-automation\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#dns-failover-automation\" title=\"DNS failover automation\"\u003eDNS failover automation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDNS-based failover redirects traffic between regions by updating DNS records to point at healthy endpoints. Azure Traffic Manager and Azure Front Door both provide automatic failover based on health probes.\u003c/p\u003e\n\u003cp\u003eUse a small script first, then expand it over time. This keeps incident handling understandable for on-call engineers.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/usr/bin/env bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSECONDARY_RG\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;rg-aks-westus\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSECONDARY_CLUSTER\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;aks-dr-westus\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"nv\"\u003eTM_RG\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;rg-networking\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eTM_PROFILE\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;tm-aks-prod\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;1) Connect to secondary cluster\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-credentials -g \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$SECONDARY_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$SECONDARY_CLUSTER\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --overwrite-existing\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl cluster-info\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;2) Trigger restore from latest Velero backup\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero restore create dr-\u003cspan class=\"k\"\u003e$(\u003c/span\u003edate +%Y%m%d-%H%M\u003cspan class=\"k\"\u003e)\u003c/span\u003e --from-backup \u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get backups.velero.io -n velero --sort-by=.metadata.creationTimestamp -o name \u003cspan class=\"p\"\u003e|\u003c/span\u003e tail -n1 \u003cspan class=\"p\"\u003e|\u003c/span\u003e cut -d/ -f2\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --wait\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;3) Switch Traffic Manager endpoint\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz network traffic-manager endpoint update --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --profile-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_PROFILE\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name endpoint-eastus --type azureEndpoints --endpoint-status Disabled\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz network traffic-manager endpoint update --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --profile-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_PROFILE\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name endpoint-westus --type azureEndpoints --endpoint-status Enabled\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script is intentionally small. 
Add pre-checks and post-checks later, but start with a version every engineer can understand quickly during an outage.\u003c/p\u003e\n\u003cp\u003eThe script automates the critical failover steps, but a human should start it and verify the result of each stage before moving on. Fully automated failover without human approval risks unnecessary region switches during transient failures.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"state-synchronization-strategies\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#state-synchronization-strategies\" title=\"State synchronization strategies\"\u003eState synchronization strategies\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eMulti-region architectures require careful state management. Databases need replication (Azure SQL geo-replication, Cosmos DB multi-region writes). Object storage needs cross-region replication (Azure Blob Storage GRS). Message queues require either regional isolation or cross-region synchronization (Azure Service Bus premium tier supports geo-replication).\u003c/p\u003e\n\u003cp\u003eStateless services fail over easily. Stateful services need a replication strategy planned during the design phase, not during incident response.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"rto-and-rpo-calculating-realistic-targets\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#rto-and-rpo-calculating-realistic-targets\" title=\"RTO and RPO: Calculating realistic targets\"\u003eRTO and RPO: Calculating realistic targets\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRecovery Time Objective (RTO) measures how long systems can be down before business impact becomes unacceptable. 
Recovery Point Objective (RPO) measures how much data loss is acceptable.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"calculating-rto\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#calculating-rto\" title=\"Calculating RTO\"\u003eCalculating RTO\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRTO includes: detection time (how long until you know there is a problem), decision time (how long to decide failover is necessary), restore time (how long to restore from backup or switch regions), and validation time (how long to confirm restoration worked).\u003c/p\u003e\n\u003cp\u003eExample calculation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eDetection: 5 minutes (health check interval)\u003c/li\u003e\n\u003cli\u003eDecision: 10 minutes (incident escalation and approval)\u003c/li\u003e\n\u003cli\u003eRestore: 45 minutes (Velero restore for 500GB cluster)\u003c/li\u003e\n\u003cli\u003eValidation: 15 minutes (smoke tests and traffic verification)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTotal RTO: 75 minutes\u003c/strong\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf business requirements demand 30-minute RTO, your current backup-based approach will not meet SLOs. You need active-active architecture or pre-warmed standby clusters.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"calculating-rpo\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#calculating-rpo\" title=\"Calculating RPO\"\u003eCalculating RPO\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRPO depends on backup frequency. Hourly backups mean up to 60 minutes of data loss. 
If your application cannot tolerate 60 minutes of data loss, you need more frequent backups or continuous replication.\u003c/p\u003e\n\u003cp\u003eExample calculation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBackup frequency: Every 4 hours\u003c/li\u003e\n\u003cli\u003eLast backup: 2 hours ago\u003c/li\u003e\n\u003cli\u003eRegional failure occurs now\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eData loss: 2 hours\u003c/strong\u003e (time since last backup; the worst case is the full 4-hour interval)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf business requirements demand a 15-minute RPO, 4-hour backup intervals will not meet SLOs. You need backups at least every 15 minutes, application-level replication, or continuous event streaming to a secondary region.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"designing-for-slos-without-over-engineering\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#designing-for-slos-without-over-engineering\" title=\"Designing for SLOs without over-engineering\"\u003eDesigning for SLOs without over-engineering\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eMany teams over-engineer DR solutions trying to achieve zero data loss and instant failover without understanding actual business requirements. 
A 4-hour RTO may be acceptable for internal tooling but catastrophic for customer-facing APIs.\u003c/p\u003e\n\u003cp\u003ePractical use case:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eInternal reporting API: 2-hour RTO and 1-hour RPO can be enough, active-passive is usually fine.\u003c/li\u003e\n\u003cli\u003eCustomer checkout API: 15-minute RTO and near-zero RPO usually require active-active plus database replication.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe recurring theme is business impact, not architecture fashion.\u003c/p\u003e\n\u003cp\u003eStart by identifying actual business impact:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWhat revenue is lost per hour of downtime?\u003c/li\u003e\n\u003cli\u003eWhat customer commitments exist in SLAs?\u003c/li\u003e\n\u003cli\u003eWhat regulatory requirements mandate specific recovery times?\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThen design the minimum viable DR solution that meets those requirements. Do not build active-active multi-region architecture with continuous replication if business requirements allow 2-hour RTO and 1-hour RPO. That level of complexity costs significant engineering time and operational overhead.\u003c/p\u003e\n\u003cp\u003eConversely, do not assume daily backups suffice for production systems without validating business tolerance for 24-hour data loss.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"best-practices-what-actually-works\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#best-practices-what-actually-works\" title=\"Best practices: What actually works\"\u003eBest practices: What actually works\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eTest restore procedures regularly.\u003c/strong\u003e Monthly restore testing in isolated environments catches broken procedures before actual incidents. 
Quarterly testing is too infrequent.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAutomate backup verification.\u003c/strong\u003e Run automated restore tests that verify backup integrity and measure restoration time. Manual testing does not scale and gets skipped under time pressure.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDocument recovery procedures.\u003c/strong\u003e Runbooks that sit in Confluence do not get updated and will be wrong during incidents. Store recovery procedures as executable scripts in version control and test them regularly.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSeparate backup storage from cluster infrastructure.\u003c/strong\u003e Do not store backups in the same region or subscription as the cluster. Regional Azure outages impact all resources in that region including backup storage.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePlan for partial failures.\u003c/strong\u003e Not every incident requires full cluster restore. Design procedures for restoring individual namespaces, specific workloads, or single persistent volumes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUse infrastructure as code for cluster rebuild.\u003c/strong\u003e Terraform or Bicep definitions for cluster creation enable rapid cluster recreation when restoration is not the best recovery path.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMonitor backup jobs.\u003c/strong\u003e Failed backups are worthless. Alert on backup failures and missing backup runs. 
Do not discover backup gaps during recovery.\u003c/p\u003e\n\u003cp\u003eIf you are defining a monthly DR game day, include three quick checks every time:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eCan we restore one namespace end to end in a clean test cluster?\u003c/li\u003e\n\u003cli\u003eCan we switch traffic and run smoke tests in less than our RTO?\u003c/li\u003e\n\u003cli\u003eCan we prove data freshness is inside the RPO window?\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eIf one answer is no, your DR posture is weaker than your dashboard suggests.\u003c/p\u003e\n\u003cp\u003eCommon mistakes: Storing backups in same region as cluster (regional failure loses backups and cluster), never testing restore procedures (broken backups discovered during incidents), manual recovery procedures (humans make mistakes under pressure), and no RTO/RPO measurement (cannot tell if recovery meets business requirements).\u003c/p\u003e\n\u003cp\u003eAuthor note: I have participated in exactly two real disaster recovery situations involving Kubernetes clusters. In the first incident, backup restoration worked but took 3 hours longer than documented because volume snapshot region restrictions were not tested. In the second incident, backups existed but CRD restoration failed because CRD versions changed between backup and restore. Both incidents would have been prevented by regular restore testing. Do not learn this lesson during a production outage.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"conclusion\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#conclusion\" title=\"Conclusion\"\u003eConclusion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDisaster recovery for AKS requires deliberate planning, regular testing, and honest assessment of recovery capabilities. Velero provides proven backup and restore workflows. Azure native backup offers simplified management for Azure-focused organizations. 
Multi-region architectures enable faster recovery but increase complexity and cost.\u003c/p\u003e\n\u003cp\u003eThe real test is not having a backup strategy documented in Confluence. The real test is whether you can restore your cluster from backup in under 60 minutes during an actual regional outage at 2 AM when half your team is asleep and the incident commander is asking for status updates.\u003c/p\u003e\n\u003cp\u003eBuild repeatable procedures. Test them monthly. Automate everything you can. Measure actual RTO and RPO. Add one more rule: if a step cannot be executed from version-controlled scripts, it is probably not ready for production incidents.\u003c/p\u003e\n\u003cp\u003eRelated reading for AKS operations maturity: \u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/\"\u003eAKS Cluster Upgrades Without Downtime\u003c/a\u003e.\u003c/p\u003e","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-03-11T17:00:00+01:00","id":"https://daily-devops.net/posts/disaster-recovery-business-continuity-aks/","language":"en","summary":"AKS outages happen. Build a tested DR plan with Velero, realistic RTO/RPO targets, and multi-region failover steps your team can run under pressure.","tags":["disaster-recovery","azure","kubernetes","cloud","devops","reliability","compliance"],"title":"AKS Disaster Recovery: Why Your Untested Backup Will Fail","url":"https://daily-devops.net/posts/disaster-recovery-business-continuity-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eYour Azure Kubernetes Service (AKS) cluster is running smoothly. Deployments are automated. Teams ship features daily. Everything looks secure, until you discover that a container image pulled from your Azure Container Registry contains a critical vulnerability that\u0026rsquo;s been actively exploited in the wild for weeks.\u003c/p\u003e\n\u003cp\u003eThis isn\u0026rsquo;t hypothetical. 
Supply chain attacks targeting container registries have become a primary attack vector. An unvetted image in production can expose sensitive data, allow lateral movement within your cluster, or provide an entry point for ransomware. The worst part: the vulnerability might not be in your code at all. It could be in a base image dependency you didn\u0026rsquo;t even know you were using.\u003c/p\u003e\n\u003cp\u003eContainer registry security isn\u0026rsquo;t optional. It\u0026rsquo;s foundational to your entire Kubernetes security posture. And ACR (Azure Container Registry) provides the tools you need to enforce it, if you configure them correctly.\u003c/p\u003e\n\u003cp\u003eAuthor note: I have seen teams invest heavily in runtime controls while treating the registry as a passive artifact store. That gap usually shows up during incident response, not during happy-path deployments.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"image-scanning--vulnerability-management\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#image-scanning--vulnerability-management\" title=\"Image Scanning \u0026amp; Vulnerability Management\"\u003eImage Scanning \u0026amp; Vulnerability Management\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eThe first line of defense is knowing what\u0026rsquo;s actually in your images before they reach production. Image scanning tools like Trivy, Microsoft Defender for Containers (formerly Azure Defender), and Anchore analyze container layers for known vulnerabilities (CVEs), malware, and configuration issues.\u003c/p\u003e\n\u003cp\u003eBut scanning alone isn\u0026rsquo;t enough. 
You need a policy-based approach that blocks vulnerable images from being deployed.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"microsoft-defender-for-containers\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#microsoft-defender-for-containers\" title=\"Microsoft Defender for Containers\"\u003eMicrosoft Defender for Containers\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eMicrosoft Defender for Containers integrates with ACR and provides agentless vulnerability assessment for container images when the plan and required extension are enabled. In practice, you get push-triggered assessment plus recurring reassessment over time. Findings are surfaced as recommendations in Microsoft Defender for Cloud, with severity and remediation guidance.\u003c/p\u003e\n\u003cp\u003eThe critical configuration: set up alerting and response workflows. A scan report that nobody reads is worthless. Configure alerts to notify your DevOps team when high or critical vulnerabilities are detected. Better yet, integrate with your CI/CD pipeline to fail builds that introduce new vulnerabilities above a defined threshold.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"trivy-integration\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#trivy-integration\" title=\"Trivy Integration\"\u003eTrivy Integration\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eTrivy is an open-source vulnerability scanner that\u0026rsquo;s lightweight, fast, and highly accurate. 
It scans container images, filesystem artifacts, and even Infrastructure as Code templates for vulnerabilities and misconfigurations.\u003c/p\u003e\n\u003cp\u003eIntegrate Trivy into your CI/CD pipeline as a gate:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Scan image before pushing to ACR\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003etrivy image --severity HIGH,CRITICAL --exit-code \u003cspan class=\"m\"\u003e1\u003c/span\u003e myapp:latest\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# If vulnerabilities are found, build fails\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Otherwise, push to ACR\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz acr login --name myregistry\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003edocker tag myapp:latest myregistry.azurecr.io/myapp:latest\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003edocker push myregistry.azurecr.io/myapp:latest\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis approach ensures that only images meeting your security threshold reach your registry in the first place.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"image-signing--verification\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#image-signing--verification\" title=\"Image Signing \u0026amp; 
Verification\"\u003eImage Signing \u0026amp; Verification\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eVulnerability scanning tells you what\u0026rsquo;s in an image. Image signing tells you who built it and whether it\u0026rsquo;s been tampered with. This is supply chain security at its core.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"notation-and-notary-v2\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#notation-and-notary-v2\" title=\"Notation and Notary v2\"\u003eNotation and Notary v2\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure Container Registry supports Notary v2 (via the Notation CLI), which implements the CNCF Notary specification for signing and verifying container artifacts. When you sign an image, you\u0026rsquo;re cryptographically attesting that:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eThe image was built by a trusted party (your CI/CD system)\u003c/li\u003e\n\u003cli\u003eThe image hasn\u0026rsquo;t been modified since it was signed\u003c/li\u003e\n\u003cli\u003eThe image meets specific criteria (e.g., passed security scans)\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eHere\u0026rsquo;s a practical workflow:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eIn CI/CD:\u003c/strong\u003e After building and scanning an image, sign it using Notation\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eIn ACR:\u003c/strong\u003e Store signatures alongside the image\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eIn AKS:\u003c/strong\u003e Use Azure Policy or admission controllers (OPA Gatekeeper, Kyverno) to verify signatures before allowing pod creation\u003c/li\u003e\n\u003c/ol\u003e\n\n\n\n\n\u003ch3 id=\"cosign-alternative\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#cosign-alternative\" title=\"Cosign Alternative\"\u003eCosign Alternative\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCosign, part of the Sigstore project, is another popular option for image signing. 
It\u0026rsquo;s simpler to set up than Notary v2 and integrates well with Kubernetes admission controllers. Both tools are open source, so the choice between Notation and Cosign often comes down to your broader toolchain: Notation if you want close integration with ACR and Azure Key Vault, Cosign if you already rely on the Sigstore ecosystem.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"rbac-for-registry-access\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#rbac-for-registry-access\" title=\"RBAC for Registry Access\"\u003eRBAC for Registry Access\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eWho can push images to your registry? Who can pull them? These questions matter more than you might think.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"azure-rbac-for-acr\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#azure-rbac-for-acr\" title=\"Azure RBAC for ACR\"\u003eAzure RBAC for ACR\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eACR supports Azure role-based access control (RBAC) with granular permissions:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAcrPull:\u003c/strong\u003e Read-only access to pull images\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAcrPush:\u003c/strong\u003e Ability to push and pull images\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAcrDelete:\u003c/strong\u003e Permission to delete images\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eOwner/Contributor:\u003c/strong\u003e Full management rights\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIn practice, your setup should look like this:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eCI/CD service principals:\u003c/strong\u003e AcrPush role (can build and push images)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAKS node pools:\u003c/strong\u003e AcrPull role via managed identity (can pull images for workloads)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDevelopers:\u003c/strong\u003e No direct registry access (deployments go through 
CI/CD)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSecurity team:\u003c/strong\u003e Reader role for auditing\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThis model ensures that production images flow through controlled pipelines, not from developer laptops.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"aks-managed-identity-for-acr-access\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#aks-managed-identity-for-acr-access\" title=\"AKS Managed Identity for ACR Access\"\u003eAKS Managed Identity for ACR Access\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eInstead of storing registry credentials in Kubernetes secrets, use AKS managed identity to grant pull access. This eliminates credential management overhead and reduces the risk of credential leakage.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Attach ACR to AKS cluster using managed identity\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks update \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name myakscluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myresourcegroup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --attach-acr myregistry\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eNow your AKS nodes can pull images from ACR without any credentials stored in the cluster.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"private-endpoints-network-isolation\"\u003e\u003ca 
href=\"/posts/container-registry-image-security-aks/#private-endpoints-network-isolation\" title=\"Private Endpoints: Network Isolation\"\u003ePrivate Endpoints: Network Isolation\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eBy default, Azure Container Registry is accessible over the public internet. Even with RBAC, this creates an unnecessary attack surface. Private endpoints solve this by routing registry traffic through your Azure virtual network.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"terraform-example-acr-with-private-endpoint\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#terraform-example-acr-with-private-endpoint\" title=\"Terraform Example: ACR with Private Endpoint\"\u003eTerraform Example: ACR with Private Endpoint\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eHere\u0026rsquo;s a practical Terraform configuration for deploying ACR with a private endpoint:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Azure Container Registry with Premium SKU (required for private endpoints)\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_container_registry\u0026#34; \u0026#34;acr\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;myacrregistry\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e 
\u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003erg\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003erg\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  sku\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Premium\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  admin_enabled\u003c/span\u003e       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003efalse\u003c/span\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e  # Disable public network access\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  public_network_access_enabled\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003efalse\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Private endpoint for ACR\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_private_endpoint\u0026#34; \u0026#34;acr_pe\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;acr-private-endpoint\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003erg\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003erg\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
subnet_id\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_subnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eprivate_endpoints\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eprivate_service_connection\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    name\u003c/span\u003e                           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;acr-connection\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    private_connection_resource_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_container_registry\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eacr\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    is_manual_connection\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003efalse\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    subresource_names\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;registry\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eprivate_dns_zone_group\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    name\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;acr-dns-zone-group\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    private_dns_zone_ids\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"k\"\u003eazurerm_private_dns_zone\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eacr\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Private DNS zone for ACR\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_private_dns_zone\u0026#34; \u0026#34;acr\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;privatelink.azurecr.io\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003erg\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Link DNS zone to VNet\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_private_dns_zone_virtual_network_link\u0026#34; \u0026#34;acr\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;acr-dns-link\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003erg\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  private_dns_zone_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_private_dns_zone\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eacr\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_id\u003c/span\u003e    \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003evnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eWith this configuration:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eACR is only accessible from within your VNet (or peered VNets)\u003c/li\u003e\n\u003cli\u003eDNS resolution automatically routes registry traffic through the private endpoint\u003c/li\u003e\n\u003cli\u003ePublic internet access to your registry is completely disabled\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThis is especially critical if your AKS cluster handles 
sensitive workloads or regulated data.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"policy-enforcement-with-gatekeeper\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#policy-enforcement-with-gatekeeper\" title=\"Policy Enforcement with Gatekeeper\"\u003ePolicy Enforcement with Gatekeeper\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eScanning and signing only matter if you enforce them. Kubernetes admission controllers intercept pod creation requests and enforce policies before workloads are admitted to the cluster.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"opa-gatekeeper-for-image-source-enforcement\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#opa-gatekeeper-for-image-source-enforcement\" title=\"OPA Gatekeeper for Image Source Enforcement\"\u003eOPA Gatekeeper for Image Source Enforcement\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eOpen Policy Agent (OPA) Gatekeeper is a common admission controller for policy enforcement in Kubernetes. The following example enforces that workloads only pull images from approved registries. 
It does not verify cryptographic signatures by itself:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# Gatekeeper ConstraintTemplate for an approved-registry allowlist (no signature verification)\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003etemplates.gatekeeper.sh/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eConstraintTemplate\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eacrverifiedimages\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ecrd\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003enames\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eAcrVerifiedImages\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003evalidation\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003eopenAPIV3Schema\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"w\"\u003e          \u003c/span\u003e\u003cspan class=\"nt\"\u003etype\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eobject\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e          \u003c/span\u003e\u003cspan class=\"nt\"\u003eproperties\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e            \u003c/span\u003e\u003cspan class=\"nt\"\u003eallowedRegistries\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e              \u003c/span\u003e\u003cspan class=\"nt\"\u003etype\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003earray\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e              \u003c/span\u003e\u003cspan class=\"nt\"\u003eitems\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e                \u003c/span\u003e\u003cspan class=\"nt\"\u003etype\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003estring\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003etargets\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e- \u003cspan class=\"nt\"\u003etarget\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eadmission.k8s.gatekeeper.sh\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003erego\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e\u003cspan class=\"sd\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        package acrverifiedimages\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        violation[{\u0026#34;msg\u0026#34;: msg}] {\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          container := input.review.object.spec.containers[_]\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          not 
registry_allowed(container.image)\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          msg := sprintf(\u0026#34;Container image \u0026#39;%v\u0026#39; is not from an allowed registry\u0026#34;, [container.image])\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        }\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        registry_allowed(image) {\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          allowed := input.parameters.allowedRegistries[_]\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          startswith(image, allowed)\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        }\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nn\"\u003e---\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# Constraint to enforce ACR-only images\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003econstraints.gatekeeper.sh/v1beta1\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eAcrVerifiedImages\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003erequire-acr-images\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ematch\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ekinds\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan 
class=\"nt\"\u003eapiGroups\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ekinds\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Pod\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"s2\"\u003e\u0026#34;production\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eparameters\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003eallowedRegistries\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"s2\"\u003e\u0026#34;myacrregistry.azurecr.io/\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis policy ensures that containers in pods in the \u003ccode\u003eproduction\u003c/code\u003e namespace can only use images from your approved ACR instance. Any attempt to deploy an image from Docker Hub, another public registry, or an unknown source will be rejected at admission time. As written, the rule only inspects \u003ccode\u003espec.containers\u003c/code\u003e; init containers and ephemeral containers are not checked, so add matching rules for \u003ccode\u003espec.initContainers\u003c/code\u003e and \u003ccode\u003espec.ephemeralContainers\u003c/code\u003e before relying on it.\u003c/p\u003e\n\u003cp\u003eFor signature verification, extend this pattern with Ratify and policy enforcement so signatures and trust policies are validated before admission. AKS Image Integrity also exists, but it is currently a preview feature with notable production limitations.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"common-mistakes-with-policy-enforcement\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#common-mistakes-with-policy-enforcement\" title=\"Common Mistakes with Policy Enforcement\"\u003eCommon Mistakes with Policy Enforcement\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eCalling a registry-allowlist policy \u0026ldquo;image verification\u0026rdquo; even though no signature verification happens\u003c/li\u003e\n\u003cli\u003eEnforcing policies only in \u003ccode\u003eproduction\u003c/code\u003e and leaving staging unrestricted\u003c/li\u003e\n\u003cli\u003eEnabling admission control without a break-glass process for incident response\u003c/li\u003e\n\u003cli\u003eForgetting to version and test policy changes like application code\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"multi-region-replication-distribution-strategy\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#multi-region-replication-distribution-strategy\" 
title=\"Multi-Region Replication: Distribution Strategy\"\u003eMulti-Region Replication: Distribution Strategy\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eIf your AKS workloads span multiple Azure regions, you need a registry replication strategy that balances availability, performance, and cost.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"geo-replication-in-acr\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#geo-replication-in-acr\" title=\"Geo-Replication in ACR\"\u003eGeo-Replication in ACR\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eACR Premium SKU supports geo-replication, allowing you to maintain a single registry name while automatically replicating images to multiple Azure regions. This reduces latency for image pulls and provides failover capabilities.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Enable geo-replication to multiple regions\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz acr replication create \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --registry myregistry \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --location westeurope\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz acr replication create \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --registry myregistry \u003cspan 
class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --location eastus\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eNow when an AKS cluster in West Europe pulls an image, it\u0026rsquo;s served from the local replica. If that replica becomes unavailable, ACR automatically fails over to another region.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"replication-patterns\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#replication-patterns\" title=\"Replication Patterns\"\u003eReplication Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eSingle-region workloads:\u003c/strong\u003e No replication needed. Keep it simple.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMulti-region with low traffic:\u003c/strong\u003e Geo-replication provides good balance between availability and cost.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMulti-region with high traffic or strict latency requirements:\u003c/strong\u003e Consider dedicated ACR instances per region with automated image promotion pipelines. This gives you more control over what images are available in each region and when they\u0026rsquo;re promoted.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDisaster recovery:\u003c/strong\u003e Geo-replication is not a backup strategy. If an image is accidentally deleted, it\u0026rsquo;s deleted from all replicas. 
Implement immutability policies (supported in ACR Premium) to prevent accidental deletion of critical images.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"practical-implementation-checklist\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#practical-implementation-checklist\" title=\"Practical Implementation Checklist\"\u003ePractical Implementation Checklist\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eIf you\u0026rsquo;re implementing ACR security for an existing AKS deployment, here\u0026rsquo;s the order of operations I\u0026rsquo;d recommend:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eEnable Microsoft Defender for Containers\u003c/strong\u003e on your ACR instance (quick win, no code changes)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSet up RBAC\u003c/strong\u003e to limit who can push/pull images (reduces blast radius)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eIntegrate Trivy or equivalent scanning\u003c/strong\u003e into your CI/CD pipeline (prevents new vulnerabilities from entering the registry)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eConfigure private endpoints\u003c/strong\u003e if your workloads are in a VNet (reduces attack surface)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eImplement image signing\u003c/strong\u003e with Notation or Cosign (establishes trust boundary)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDeploy Gatekeeper or Kyverno\u003c/strong\u003e to enforce policies at admission time (prevents policy violations from reaching runtime)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eEnable geo-replication\u003c/strong\u003e if needed (improves availability and performance)\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThis sequence minimizes risk while keeping deployments flowing. Don\u0026rsquo;t try to implement everything at once. 
Layered security is iterative.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"what-this-actually-prevents\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#what-this-actually-prevents\" title=\"What This Actually Prevents\"\u003eWhat This Actually Prevents\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eLet\u0026rsquo;s ground this in real scenarios:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eScenario 1: Compromised base image\u003c/strong\u003e\u003cbr\u003e\nYour application uses a popular Node.js base image. A critical vulnerability is discovered (e.g., a Log4Shell-style flaw). With vulnerability scanning enabled, you\u0026rsquo;re alerted within hours. With policy enforcement, existing vulnerable images can\u0026rsquo;t be deployed until patched.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eScenario 2: Rogue developer\u003c/strong\u003e\u003cbr\u003e\nA developer with push access to ACR tries to deploy an unsigned image from their laptop. With signature verification enforced via Gatekeeper, the deployment is rejected at admission time. Your cluster never runs unverified code.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eScenario 3: Supply chain attack\u003c/strong\u003e\u003cbr\u003e\nAn attacker compromises your CI/CD pipeline and attempts to push a backdoored image to ACR. With RBAC properly configured, the service principal has limited scope. With private endpoints enabled, the attacker can\u0026rsquo;t even access your registry from outside your network.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eScenario 4: Accidental public exposure\u003c/strong\u003e\u003cbr\u003e\nA misconfiguration would otherwise expose your ACR to the public internet. With public network access disabled and private endpoints enforced, there\u0026rsquo;s no route to your registry from outside your VNet, so configuration mistakes don\u0026rsquo;t result in exposure.\u003c/p\u003e\n\u003cp\u003eThese aren\u0026rsquo;t theoretical. 
They\u0026rsquo;re patterns I\u0026rsquo;ve seen in production environments that failed to implement registry security correctly.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"conclusion-security-without-friction\"\u003e\u003ca href=\"/posts/container-registry-image-security-aks/#conclusion-security-without-friction\" title=\"Conclusion: Security Without Friction\"\u003eConclusion: Security Without Friction\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eThe goal isn\u0026rsquo;t to lock down everything so tightly that deployments become painful. The goal is to build security into your workflow so seamlessly that it doesn\u0026rsquo;t slow down your teams.\u003c/p\u003e\n\u003cp\u003eImage scanning, signing, RBAC, private endpoints, and policy enforcement work together to create a defense-in-depth strategy. No single control is perfect. But layered together, they make successful attacks exponentially harder while keeping legitimate deployments fast.\u003c/p\u003e\n\u003cp\u003eStart with the quick wins: enable Defender for Containers, configure RBAC, integrate scanning into CI/CD. Then progressively layer on signing, private endpoints, and policy enforcement as your security maturity grows.\u003c/p\u003e\n\u003cp\u003eYour AKS cluster is only as secure as the images running inside it. Treat your container registry as the trust boundary it actually is.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-03-04T09:00:00+01:00","id":"https://daily-devops.net/posts/container-registry-image-security-aks/","language":"en","summary":"ACR security is foundational. 
Learn practical hardening: image scanning, signing, RBAC, private endpoints, and policy enforcement for AKS clusters.","tags":["kubernetes","azure","cloud","devops","security"],"title":"Container Registry \u0026 Image Security in AKS Deployments","url":"https://daily-devops.net/posts/container-registry-image-security-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eYour first production Azure Kubernetes Service (AKS) cluster often feels manageable for months, sometimes for years. Then demand grows and a second cluster appears. Regional resiliency might require it. Team isolation might require it. Compliance boundaries might require it.\u003c/p\u003e\n\u003cp\u003eThe hard part is not creating cluster number two. The hard part is networking between clusters in a way your team can operate at 2 a.m.\u003c/p\u003e\n\u003cp\u003eThis guide focuses on practical multi-cluster AKS networking: connectivity models, DNS (Domain Name System), ingress patterns, and the trade-offs that matter in production.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"why-single-clusters-hit-their-limits\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#why-single-clusters-hit-their-limits\" title=\"Why Single Clusters Hit Their Limits\"\u003eWhy Single Clusters Hit Their Limits\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eSingle-cluster architectures work until they stop being a sensible risk boundary. Three constraints usually force the move:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eScale ceilings.\u003c/strong\u003e Azure CNI Overlay supports large cluster sizes, including documented scale targets up to 5,000 nodes per cluster in current AKS guidance. 
Verify current limits before architecture decisions because limits evolve over time (\u003ca href=\"https://learn.microsoft.com/azure/aks/quotas-skus-regions\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAKS scale limits\u003c/a\u003e).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFailure domain isolation.\u003c/strong\u003e Control plane failures are uncommon, but when they happen the impact is serious. Multi-cluster design contains incidents. A failure in cluster A should not automatically break cluster B.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTeam and workload separation.\u003c/strong\u003e Different compliance requirements, service level objectives, and release cadence often require separate clusters. Shared clusters can become an organizational bottleneck.\u003c/p\u003e\n\u003cp\u003eOnce you commit to multiple clusters, networking becomes the core design problem. Services in cluster A need controlled access to cluster B. Shared infrastructure such as DNS, observability, and data platforms must stay reachable. This must still be simple enough to run day to day.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"connectivity-models-vnet-peering-vs-private-link\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#connectivity-models-vnet-peering-vs-private-link\" title=\"Connectivity Models: VNet Peering vs Private Link\"\u003eConnectivity Models: VNet Peering vs Private Link\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eTwo patterns handle most Azure multi-cluster scenarios: Virtual Network (VNet) peering and Private Link. 
Both are valid, but they solve different problems.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"vnet-peering-direct-layer-3-connectivity\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#vnet-peering-direct-layer-3-connectivity\" title=\"VNet Peering: Direct Layer 3 Connectivity\"\u003eVNet Peering: Direct Layer 3 Connectivity\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eVNet peering creates bidirectional connectivity between virtual networks over the Azure backbone. Traffic stays private, latency is low, and throughput is high (\u003ca href=\"https://learn.microsoft.com/azure/virtual-network/virtual-network-peering-overview\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eVirtual network peering overview\u003c/a\u003e).\u003c/p\u003e\n\u003cp\u003eFor multi-cluster AKS, peering allows direct IP connectivity between pods and services, assuming routing and policies allow it.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUse peering when:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eClusters are in the same region or a paired region\u003c/li\u003e\n\u003cli\u003eYou need low latency between workloads\u003c/li\u003e\n\u003cli\u003eYou move significant data volume between clusters\u003c/li\u003e\n\u003cli\u003eYou want simple routing with minimal translation overhead\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePeering limitations:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eAddress spaces cannot overlap\u003c/li\u003e\n\u003cli\u003ePeering is not transitive\u003c/li\u003e\n\u003cli\u003eSecurity controls must be correct on both sides\u003c/li\u003e\n\u003cli\u003eCross-region transfer costs can become noticeable\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003ePeering is still the default starting point for most environments because it is predictable and easy to reason about.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"private-link-service-endpoint-connectivity\"\u003e\u003ca 
href=\"/posts/multi-aks-cluster-networking-hub-spoke/#private-link-service-endpoint-connectivity\" title=\"Private Link: Service Endpoint Connectivity\"\u003ePrivate Link: Service Endpoint Connectivity\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePrivate Link exposes selected services through private endpoints. Instead of full network reachability, consumers connect only to what you explicitly publish (\u003ca href=\"https://learn.microsoft.com/azure/private-link/private-link-overview\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eWhat is Azure Private Link\u003c/a\u003e).\u003c/p\u003e\n\u003cp\u003eIn AKS, this is commonly used to expose internal services through an internal load balancer and Private Link Service. Consumer networks do not need full peering to the provider VNet.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUse Private Link when:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eYou need strict service-level exposure across boundaries\u003c/li\u003e\n\u003cli\u003eYou cannot avoid overlapping IP ranges\u003c/li\u003e\n\u003cli\u003eYou want narrow, auditable connectivity contracts\u003c/li\u003e\n\u003cli\u003eYou want to reduce broad peering relationships\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePrivate Link trade-offs:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSlightly higher latency than direct peering\u003c/li\u003e\n\u003cli\u003eMore setup and lifecycle management\u003c/li\u003e\n\u003cli\u003eService-specific by design, not full network connectivity\u003c/li\u003e\n\u003cli\u003eEndpoint cost accumulates as service count grows\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf your goal is broad cluster-to-cluster communication, peering is simpler. 
If your goal is controlled service publishing, Private Link is often the better boundary.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"hub-spoke-topology-centralized-connectivity\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#hub-spoke-topology-centralized-connectivity\" title=\"Hub-Spoke Topology: Centralized Connectivity\"\u003eHub-Spoke Topology: Centralized Connectivity\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHub-spoke is the topology that usually wins once cluster count grows. Instead of a full mesh, each cluster VNet connects to a central hub.\u003c/p\u003e\n\u003cdiv class=\"mermaid\"\u003egraph TB\n  Hub[\"Hub VNet\u003cbr/\u003e(Shared)\"]\n  SpokeA[\"Spoke A\u003cbr/\u003e(Prod)\"]\n  SpokeB[\"Spoke B\u003cbr/\u003e(Dev)\"]\n  SpokeC[\"Spoke C\u003cbr/\u003e(Stage)\"]\n\n  Hub --\u003e SpokeA\n  Hub --\u003e SpokeB\n  Hub --\u003e SpokeC\n\u003c/div\u003e\n\n\u003cp\u003eEach spoke VNet hosts one AKS cluster. The hub carries shared services such as firewalling, gateway connectivity, DNS forwarding, and centralized observability.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"why-hub-spoke-works\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#why-hub-spoke-works\" title=\"Why Hub-Spoke Works\"\u003eWhy Hub-Spoke Works\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eSimplified management.\u003c/strong\u003e A full mesh requires $N\\times(N-1)/2$ peerings. Hub-spoke usually needs one peering per spoke.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCentralized policy enforcement.\u003c/strong\u003e Spoke egress can pass through hub security controls. Policy, logging, and compliance become easier to govern.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCost allocation clarity.\u003c/strong\u003e Shared services stay in the hub. Team-owned workload costs stay in spokes. 
Chargeback becomes easier.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFailure domain separation.\u003c/strong\u003e Spoke incidents are usually isolated. Hub incidents affect connectivity and must be treated as critical.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"practical-implementation-with-terraform\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#practical-implementation-with-terraform\" title=\"Practical Implementation with Terraform\"\u003ePractical Implementation with Terraform\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eThis Terraform excerpt shows the core peering pattern:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Hub VNet with shared services\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003emodule\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;hub_vnet\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  source\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;./modules/vnet\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;hub-vnet\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  address_space\u003c/span\u003e       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;10.0.0.0/16\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehub\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  subnets\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    firewall\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      address_prefixes\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.0.1.0/24\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    gateway\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      address_prefixes\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.0.2.0/24\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    shared-services\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      address_prefixes\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.0.10.0/24\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Spoke VNet for production AKS cluster\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"k\"\u003emodule\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;spoke_prod_vnet\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  source\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;./modules/vnet\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;spoke-prod-vnet\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  address_space\u003c/span\u003e       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.1.0.0/16\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003espoke_prod\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  subnets\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    aks-nodes\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      address_prefixes\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.1.0.0/19\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Peering: Spoke to Hub\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_virtual_network_peering\u0026#34; \u0026#34;spoke_prod_to_hub\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;spoke-prod-to-hub\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003espoke_prod\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_name\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003emodule\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003espoke_prod_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  remote_virtual_network_id\u003c/span\u003e    \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003emodule\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehub_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  allow_virtual_network_access\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  allow_forwarded_traffic\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  allow_gateway_transit\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003efalse\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  use_remote_gateways\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Peering: Hub to Spoke\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_virtual_network_peering\u0026#34; \u0026#34;hub_to_spoke_prod\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;hub-to-spoke-prod\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan 
class=\"k\"\u003ehub\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_name\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003emodule\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehub_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  remote_virtual_network_id\u003c/span\u003e    \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003emodule\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003espoke_prod_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  allow_virtual_network_access\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  allow_forwarded_traffic\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  allow_gateway_transit\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
use_remote_gateways\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003efalse\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003e\u003cstrong\u003eKey configuration points:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003eallow_forwarded_traffic = true\u003c/code\u003e accepts traffic forwarded by a hub firewall or router, which spoke-to-spoke communication through the hub depends on\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eallow_gateway_transit = true\u003c/code\u003e (hub side) allows spokes to use the hub\u0026rsquo;s VPN or ExpressRoute gateway\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003euse_remote_gateways = true\u003c/code\u003e (spoke side) uses the hub gateway for on-premises connectivity; the hub must already have a gateway deployed\u003c/li\u003e\n\u003cli\u003eAddress spaces must not overlap; plan your CIDR ranges before deployment\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"hub-spoke-trade-offs\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#hub-spoke-trade-offs\" title=\"Hub-Spoke Trade-Offs\"\u003eHub-Spoke Trade-Offs\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eLatency.\u003c/strong\u003e Spoke-to-spoke paths include an extra hop through the hub. Usually this is acceptable, but very latency-sensitive paths should be measured.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHub as a critical dependency.\u003c/strong\u003e If core hub components fail, cross-spoke and on-premises connectivity can fail with them. Critical environments should plan for redundancy.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAdded infrastructure complexity.\u003c/strong\u003e You now own central routing, firewalling, and gateway operations. 
For two or three clusters, direct peering may still be simpler.\u003c/p\u003e\n\u003cp\u003eUse hub-spoke when you have several clusters, need central governance, or depend on shared network services.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"dns-resolution-across-clusters\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#dns-resolution-across-clusters\" title=\"DNS Resolution Across Clusters\"\u003eDNS Resolution Across Clusters\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDNS is where many multi-cluster designs fail quietly. Connectivity may exist while name resolution does not.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"the-dns-challenge\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#the-dns-challenge\" title=\"The DNS Challenge\"\u003eThe DNS Challenge\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eEach AKS cluster runs its own CoreDNS service. By default, it resolves cluster-local names such as \u003ccode\u003e.svc.cluster.local\u003c/code\u003e. Cross-cluster discovery needs explicit design.\u003c/p\u003e\n\u003cp\u003eYou need answers to two questions:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eHow does cluster A resolve service names from cluster B?\u003c/li\u003e\n\u003cli\u003eHow does this remain accurate as services change over time?\u003c/li\u003e\n\u003c/ol\u003e\n\n\n\n\n\u003ch3 id=\"approach-1-dns-forwarding-with-custom-coredns-configuration\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#approach-1-dns-forwarding-with-custom-coredns-configuration\" title=\"Approach 1: DNS Forwarding with Custom CoreDNS Configuration\"\u003eApproach 1: DNS Forwarding with Custom CoreDNS Configuration\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eYou can extend CoreDNS to forward specific zones to resolvers in another cluster.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ev1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eConfigMap\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ecoredns-custom\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ekube-system\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003edata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  
\u003c/span\u003e\u003cspan class=\"nt\"\u003eclusterb.server\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e\u003cspan class=\"sd\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e    clusterb.local:53 {\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        errors\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        cache 30\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        forward . 10.2.0.10\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e    }\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis forwards queries for \u003ccode\u003eclusterb.local\u003c/code\u003e to the resolver at \u003ccode\u003e10.2.0.10\u003c/code\u003e in cluster B. That address must be reachable from cluster A, and cluster B\u0026rsquo;s CoreDNS must be configured to answer for the \u003ccode\u003eclusterb.local\u003c/code\u003e zone, because by default it only serves \u003ccode\u003ecluster.local\u003c/code\u003e. 
Services become reachable by names such as \u003ccode\u003eservice-name.namespace.svc.clusterb.local\u003c/code\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLimitations:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eManual configuration in each cluster\u003c/li\u003e\n\u003cli\u003eResolver endpoints must stay reachable\u003c/li\u003e\n\u003cli\u003eFragile if upstream DNS endpoints change\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"approach-2-external-dns-with-shared-zone\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#approach-2-external-dns-with-shared-zone\" title=\"Approach 2: External DNS with Shared Zone\"\u003eApproach 2: External DNS with Shared Zone\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eA more scalable pattern is running ExternalDNS in each cluster and writing records into a shared Azure Private DNS zone.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ev1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eService\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eapi-service\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eannotations\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003eexternal-dns.alpha.kubernetes.io/hostname\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eapi.shared.internal\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"c\"\u003e# Assumption: 10.1.5.100 is a private VNet address, so AKS needs an internal load balancer\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003eservice.beta.kubernetes.io/azure-load-balancer-internal\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;true\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003etype\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan 
class=\"l\"\u003eLoadBalancer\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eloadBalancerIP\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e10.1.5.100\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eports\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e- \u003cspan class=\"nt\"\u003eport\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e443\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003etargetPort\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e8443\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eExternalDNS creates records such as \u003ccode\u003eapi.shared.internal\u003c/code\u003e and updates them as service endpoints change.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBenefits:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eAutomatic DNS management\u003c/li\u003e\n\u003cli\u003eCentralized control through Azure 
DNS\u003c/li\u003e\n\u003cli\u003eWorks across clusters without manual forwarding rules\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eTrade-offs:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eRequires ExternalDNS operations in every cluster\u003c/li\u003e\n\u003cli\u003eAdds a small DNS zone cost\u003c/li\u003e\n\u003cli\u003eNaming conventions are required to avoid collisions\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eFor most production teams, this is the pragmatic default because it scales and removes manual DNS drift. In AKS, this model aligns well with Private DNS zone integration and standard cluster DNS behavior (\u003ca href=\"https://learn.microsoft.com/azure/aks/concepts-network\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAKS networking concepts\u003c/a\u003e).\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"shared-ingress-architectures\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#shared-ingress-architectures\" title=\"Shared Ingress Architectures\"\u003eShared Ingress Architectures\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eYou can expose multi-cluster services in two common ways: centralized ingress in a hub, or distributed ingress behind a global load balancer.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"centralized-ingress-in-hub-vnet\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#centralized-ingress-in-hub-vnet\" title=\"Centralized Ingress in Hub VNet\"\u003eCentralized Ingress in Hub VNet\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRun ingress in the hub VNet, for example with NGINX, Azure Application Gateway, or Envoy. 
External traffic enters once and is routed to spokes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAdvantages:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSingle public IP for all clusters\u003c/li\u003e\n\u003cli\u003eCentralized TLS termination and certificate management\u003c/li\u003e\n\u003cli\u003eSimplified firewall rules (only hub ingress needs public exposure)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eLimitations:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eHub becomes a bottleneck for all ingress traffic\u003c/li\u003e\n\u003cli\u003eAdditional latency (traffic routes hub → spoke)\u003c/li\u003e\n\u003cli\u003eHub failure impacts all clusters\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eUse centralized hub ingress when operational simplicity and unified policy enforcement outweigh performance concerns.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"distributed-ingress-with-azure-front-door\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#distributed-ingress-with-azure-front-door\" title=\"Distributed Ingress with Azure Front Door\"\u003eDistributed Ingress with Azure Front Door\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRun ingress in each spoke and front it with Azure Front Door or Traffic Manager. 
Routing decisions can use health, latency, and geographic criteria (\u003ca href=\"https://learn.microsoft.com/azure/frontdoor/front-door-overview\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure Front Door overview\u003c/a\u003e).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAdvantages:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eHigh availability (cluster failures don\u0026rsquo;t take down all ingress)\u003c/li\u003e\n\u003cli\u003eLower latency (traffic routes directly to closest cluster)\u003c/li\u003e\n\u003cli\u003eScalable ingress capacity (not bottlenecked on hub)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eLimitations:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eMultiple public IPs to manage\u003c/li\u003e\n\u003cli\u003eDistributed certificate management (mitigated with cert-manager and Let\u0026rsquo;s Encrypt)\u003c/li\u003e\n\u003cli\u003eRequires global load balancer (Azure Front Door, Traffic Manager)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eFor high availability and regional resilience, distributed ingress is often the better long-term model.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"service-mesh-considerations-when-complexity-is-worth-it\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#service-mesh-considerations-when-complexity-is-worth-it\" title=\"Service Mesh Considerations: When Complexity Is Worth It\"\u003eService Mesh Considerations: When Complexity Is Worth It\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eService meshes such as Istio, Linkerd, or Consul can solve real problems, but they also add a major operational layer.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"what-service-mesh-solves\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#what-service-mesh-solves\" title=\"What Service Mesh Solves\"\u003eWhat Service Mesh Solves\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eCross-cluster service 
discovery.\u003c/strong\u003e Meshes can federate service catalogs, letting cluster A discover and route to services in cluster B without manual DNS configuration.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTraffic shifting and canary deployments.\u003c/strong\u003e Route a percentage of traffic from cluster A to a new version in cluster B for testing before full cutover.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMutual TLS and zero-trust networking.\u003c/strong\u003e Encrypt all inter-service traffic and enforce identity-based policies across cluster boundaries.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eObservability.\u003c/strong\u003e Centralized metrics, tracing, and logging for requests flowing between clusters.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-service-mesh-is-not-worth-it\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#when-service-mesh-is-not-worth-it\" title=\"When Service Mesh Is Not Worth It\"\u003eWhen Service Mesh Is Not Worth It\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eMost multi-cluster environments do not need a mesh on day one. 
Managing control planes, sidecar upgrades, and mesh debugging is expensive in terms of engineering time.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsider service mesh only when:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eYou\u0026rsquo;re running 5+ clusters with complex inter-cluster traffic patterns\u003c/li\u003e\n\u003cli\u003eZero-trust networking with mTLS is a hard requirement\u003c/li\u003e\n\u003cli\u003eAdvanced traffic management (gradual rollouts, A/B testing across clusters) is core to your deployment strategy\u003c/li\u003e\n\u003cli\u003eYour team has service mesh expertise or dedicated platform engineering resources\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAuthor note: in most organizations I have worked with, peering plus ExternalDNS plus standard ingress handled the majority of real requirements with far less cognitive load.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"the-pragmatic-alternative-keep-the-baseline-simple\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#the-pragmatic-alternative-keep-the-baseline-simple\" title=\"The Pragmatic Alternative: Keep the Baseline Simple\"\u003eThe Pragmatic Alternative: Keep the Baseline Simple\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eBefore adding a mesh, validate whether baseline Kubernetes networking already meets your goals. Start with clean CIDR planning, network policies, ExternalDNS, and a proven ingress setup.\u003c/p\u003e\n\u003cp\u003eThis baseline is proven and easier to run. Add mesh capabilities only when a measurable requirement demands them.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cost-and-operational-simplicity\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#cost-and-operational-simplicity\" title=\"Cost and Operational Simplicity\"\u003eCost and Operational Simplicity\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eMulti-cluster architecture increases both spend and operational load. 
Design intentionally so cost and complexity stay proportional to business value.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cost-drivers\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#cost-drivers\" title=\"Cost Drivers\"\u003eCost Drivers\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eData transfer between regions.\u003c/strong\u003e Cross-region peering incurs egress charges. High-volume replication paths can become a significant monthly cost. Validate current pricing in the Azure bandwidth and networking pricing pages before committing to traffic-heavy topologies (\u003ca href=\"https://azure.microsoft.com/pricing/details/bandwidth/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure bandwidth pricing\u003c/a\u003e).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eShared infrastructure.\u003c/strong\u003e Hub-spoke designs require gateway, firewall, and DNS components. These costs usually scale with hub count, not spoke count.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDuplicated platform components.\u003c/strong\u003e More clusters often mean duplicated logging, metrics, and ingress layers. Consolidate where this does not weaken isolation.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"operational-overhead\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#operational-overhead\" title=\"Operational Overhead\"\u003eOperational Overhead\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eConfiguration drift.\u003c/strong\u003e More clusters create more drift opportunities. GitOps tools such as Flux or Argo CD help enforce consistency.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUpgrade coordination.\u003c/strong\u003e Upgrading many clusters is not linear work. Standardize upgrade pipelines and validate in staging first.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eIncident response.\u003c/strong\u003e Cross-cluster incidents are harder to debug. 
Centralized logs and tracing are mandatory, not optional.\u003c/p\u003e\n\u003cp\u003eBalance isolation against complexity. Extra clusters without clear boundaries usually become operational debt.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"conclusion-start-simple-scale-deliberately\"\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/#conclusion-start-simple-scale-deliberately\" title=\"Conclusion: Start Simple, Scale Deliberately\"\u003eConclusion: Start Simple, Scale Deliberately\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eMulti-cluster AKS solves real problems: scale boundaries, failure isolation, and team autonomy. It also introduces networking complexity that is easy to underestimate.\u003c/p\u003e\n\u003cp\u003eFor most teams, this sequence works well:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eStart with peering and clean IP planning\u003c/li\u003e\n\u003cli\u003eMove to hub-spoke when cluster count or governance requirements grow\u003c/li\u003e\n\u003cli\u003eUse ExternalDNS for shared service discovery\u003c/li\u003e\n\u003cli\u003eChoose centralized or distributed ingress based on availability and latency goals\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eService mesh can be valuable, but only when its capabilities are tied to concrete requirements that justify the overhead.\u003c/p\u003e\n\u003cp\u003eDesign with the fewest moving parts that satisfy your constraints. Every extra layer raises troubleshooting effort and incident duration.\u003c/p\u003e\n\u003cp\u003eBuild for your current scale, then add components when measurable pain proves the need. 
That is the operationally honest path.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-02-25T18:30:00+01:00","id":"https://daily-devops.net/posts/multi-aks-cluster-networking-hub-spoke/","language":"en","summary":"Practical multi-cluster AKS networking with VNet peering, hub-spoke routing, DNS, shared ingress, and clear criteria to keep mesh complexity in check.","tags":["networking","azure","kubernetes","cloud","devops","architecture"],"title":"Multi-AKS Cluster Networking \u0026 Hub-Spoke Topology","url":"https://daily-devops.net/posts/multi-aks-cluster-networking-hub-spoke/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eCNI Overlay solves IP exhaustion by keeping pod IPs in an internal overlay network. Excellent for resource efficiency. The problem? Your observability stack just lost visibility into half your traffic. Pod IPs get masked behind node IPs through SNAT, and debugging network issues becomes a puzzle where half the pieces are missing.\u003c/p\u003e\n\u003cp\u003eWhen a pod makes an outbound connection to an Azure service, NSG logs show the node IP as the source. Try correlating that with application logs to identify which specific pod initiated the connection, and you\u0026rsquo;ll discover your traditional tooling is useless. The pod IP exists only inside the cluster. 
From outside, it\u0026rsquo;s invisible.\u003c/p\u003e\n\u003cp\u003eIf you run CNI Overlay in production, you need observability patterns that work with this reality: Container Insights for metadata enrichment, network flow correlation via KQL queries, SNAT port tracking, and distributed tracing.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-root-cause-snat-changes-everything\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#the-root-cause-snat-changes-everything\" title=\"The Root Cause: SNAT Changes Everything\"\u003eThe Root Cause: SNAT Changes Everything\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eIn traditional Azure CNI, each pod receives a VNet-routable IP address. Network flows are straightforward to track. Correlation is direct.\u003c/p\u003e\n\u003cp\u003eCNI Overlay changes this. Pods receive IPs from an internal overlay network (typically \u003ccode\u003e10.244.0.0/16\u003c/code\u003e) that exist only within the cluster. When a pod communicates with anything outside the cluster, the traffic undergoes Source Network Address Translation (SNAT). The pod\u0026rsquo;s internal IP gets replaced with the node\u0026rsquo;s IP before leaving the cluster.\u003c/p\u003e\n\u003cp\u003eFrom the perspective of Azure Network Watcher or NSG Flow Logs, all outbound traffic from pods on a node appears to originate from that single node IP. You lose pod-level granularity. This isn\u0026rsquo;t a bug. It\u0026rsquo;s how overlay networking works. But it breaks every observability pattern you\u0026rsquo;ve built for traditional CNI.\u003c/p\u003e\n\u003cp\u003eThe challenge is correlation. Application logs contain pod IPs. Network logs contain node IPs. Connecting these requires additional context that standard tooling doesn\u0026rsquo;t provide. Microsoft\u0026rsquo;s documentation glosses over this. 
They\u0026rsquo;ll tell you Container Insights \u0026ldquo;solves observability,\u0026rdquo; but won\u0026rsquo;t mention you\u0026rsquo;re about to spend weeks building KQL queries to answer \u0026ldquo;which pod is talking to this IP?\u0026rdquo;\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"container-insights-your-first-layer-of-defense\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#container-insights-your-first-layer-of-defense\" title=\"Container Insights: Your First Layer of Defense\"\u003eContainer Insights: Your First Layer of Defense\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eContainer Insights is the Azure-native solution for AKS observability. For CNI Overlay clusters, it\u0026rsquo;s mandatory if you want to maintain sanity during production incidents. It\u0026rsquo;s the only thing that maintains the pod-to-node relationship that network logs lose.\u003c/p\u003e\n\u003cp\u003eContainer Insights deploys a DaemonSet (\u003ccode\u003eama-logs\u003c/code\u003e) on every node that scrapes metrics from kubelet and collects stdout/stderr logs. Crucially, it enriches data with Kubernetes metadata: pod name, namespace, node name, labels, annotations. This enables correlation between application logs and network flows.\u003c/p\u003e\n\u003cp\u003eWhen you query Container Insights logs, you can join pod identity with node identity, bridging the gap between application-level events and network-level events. 
Without this enrichment, you\u0026rsquo;re stuck running kubectl commands during incidents while your cluster burns.\u003c/p\u003e\n\u003cp\u003eHere\u0026rsquo;s a practical Terraform configuration for enabling Container Insights on an AKS cluster with CNI Overlay:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_log_analytics_workspace\u0026#34; \u0026#34;aks_monitoring\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-logs-${var.environment}\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  sku\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PerGB2018\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  retention_in_days\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e30\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_kubernetes_cluster\u0026#34; \u0026#34;aks\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-cluster-${var.environment}\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan 
class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  dns_prefix\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-${var.environment}\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003enetwork_profile\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    network_plugin\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azure\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    network_plugin_mode\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;overlay\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    pod_cidr\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;10.244.0.0/16\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eoms_agent\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    log_analytics_workspace_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_log_analytics_workspace\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_monitoring\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003emonitor_metrics\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    annotations_allowed\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003enull\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    labels_allowed\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003enull\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Optional: Data Collection Rule for cost control\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_monitor_data_collection_rule\u0026#34; \u0026#34;aks_container_insights\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;MSCI-${azurerm_kubernetes_cluster.aks.name}\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003edestinations\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003elog_analytics\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"n\"\u003e      workspace_resource_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_log_analytics_workspace\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_monitoring\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ciworkspace\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003edata_flow\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    streams\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Microsoft-ContainerInsights-Group-Default\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    destinations\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;ciworkspace\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003edata_sources\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003eextension\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      streams\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Microsoft-ContainerInsights-Group-Default\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      extension_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ContainerInsights\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      name\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ContainerInsightsExtension\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"budgeting-for-ingestion-cost\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#budgeting-for-ingestion-cost\" title=\"Budgeting For Ingestion 
Cost\"\u003eBudgeting For Ingestion Cost\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eA 100-node cluster generates roughly 50-100 GB of logs per month. That\u0026rsquo;s $10-20/month in ingestion costs alone. A 200-node cluster with verbose logging can push $500-1000/month in Log Analytics costs. The optional Data Collection Rule provides granular control to filter out noisy namespaces (kube-system) or low-value metrics before the bill surprises you.\u003c/p\u003e\n\u003cp\u003eCommon mistakes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eEnabling Container Insights without DCRs on large clusters, then discovering a $2000 Azure Monitor bill\u003c/li\u003e\n\u003cli\u003eSetting retention to 365 days without calculating cost ($0.10/GB/month beyond 31 days)\u003c/li\u003e\n\u003cli\u003eCollecting metrics at 15-second intervals when 60-second suffices for 95% of use cases\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eOnce deployed, Container Insights starts populating several key tables in Log Analytics:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003eContainerLog\u003c/code\u003e: Application logs (stdout/stderr)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ePerf\u003c/code\u003e: Performance metrics and resource usage\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eKubePodInventory\u003c/code\u003e: Pod metadata and lifecycle events\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eKubeNodeInventory\u003c/code\u003e: Node metadata and capacity information\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThese tables are your foundation for correlation queries.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"network-observability-flow-logs-and-nsg-logs\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#network-observability-flow-logs-and-nsg-logs\" title=\"Network Observability: Flow Logs and NSG Logs\"\u003eNetwork Observability: Flow Logs and NSG Logs\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eContainer Insights gives you pod-level visibility inside the 
cluster. But what about traffic leaving the cluster? This is where NSG Flow Logs come into play, and where your observability problems begin in earnest.\u003c/p\u003e\n\u003cp\u003eFlow Logs capture network traffic metadata: source IP, destination IP, port, protocol, allow/deny. For CNI Overlay, Flow Logs show node IPs as the source for all outbound pod traffic. You\u0026rsquo;ve lost pod-level attribution the moment traffic leaves the cluster.\u003c/p\u003e\n\u003cp\u003eThe correlation happens through timestamps and Log Analytics queries. When a pod generates outbound traffic, Container Insights logs the event with the pod\u0026rsquo;s identity and node. Flow Logs capture the same event with the node\u0026rsquo;s IP and destination. Join these datasets on node name and timestamp to reconstruct which pod initiated which connection.\u003c/p\u003e\n\u003cp\u003eAuthor note: This works in theory. In practice, timestamp-based correlation is fragile. Flow Logs have variable latency (5-10 minutes), Container Insights has ingestion delays, and timestamp precision issues mean you\u0026rsquo;ll occasionally join the wrong events. 
For critical debugging, correlation IDs in application logs are more reliable.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"joining-pod-identity-to-flow-logs-with-kql\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#joining-pod-identity-to-flow-logs-with-kql\" title=\"Joining Pod Identity To Flow Logs With KQL\"\u003eJoining Pod Identity To Flow Logs With KQL\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eHere\u0026rsquo;s a practical KQL query that demonstrates this correlation for debugging outbound connectivity issues:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-csharp\" data-lang=\"csharp\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003elet\u003c/span\u003e \u003cspan class=\"n\"\u003epodName\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u003c/span\u003e \u003cspan class=\"s\"\u003e\u0026#34;my-application-pod-xyz\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003elet\u003c/span\u003e \u003cspan class=\"n\"\u003etimeRange\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u003c/span\u003e \u003cspan class=\"n\"\u003eago\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e\u003cspan class=\"n\"\u003eh\u003c/span\u003e\u003cspan class=\"p\"\u003e);\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003elet\u003c/span\u003e \u003cspan class=\"n\"\u003epodNode\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u003c/span\u003e \u003cspan class=\"n\"\u003eKubePodInventory\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan 
class=\"k\"\u003ewhere\u003c/span\u003e \u003cspan class=\"n\"\u003eTimeGenerated\u003c/span\u003e \u003cspan class=\"p\"\u003e\u0026gt;=\u003c/span\u003e \u003cspan class=\"n\"\u003etimeRange\u003c/span\u003e \u003cspan class=\"n\"\u003eand\u003c/span\u003e \u003cspan class=\"n\"\u003eName\u003c/span\u003e \u003cspan class=\"p\"\u003e==\u003c/span\u003e \u003cspan class=\"n\"\u003epodName\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"n\"\u003eproject\u003c/span\u003e \u003cspan class=\"n\"\u003eTimeGenerated\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003ePodName\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eName\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eNamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eNodeName\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eComputer\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"n\"\u003etake\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003elet\u003c/span\u003e \u003cspan class=\"n\"\u003enodeIP\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u003c/span\u003e \u003cspan class=\"n\"\u003eKubeNodeInventory\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"k\"\u003ewhere\u003c/span\u003e \u003cspan class=\"n\"\u003eTimeGenerated\u003c/span\u003e \u003cspan class=\"p\"\u003e\u0026gt;=\u003c/span\u003e \u003cspan 
class=\"n\"\u003etimeRange\u003c/span\u003e \u003cspan class=\"n\"\u003eand\u003c/span\u003e \u003cspan class=\"n\"\u003eComputer\u003c/span\u003e \u003cspan class=\"k\"\u003ein\u003c/span\u003e \u003cspan class=\"p\"\u003e((\u003c/span\u003e\u003cspan class=\"n\"\u003epodNode\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"n\"\u003eproject\u003c/span\u003e \u003cspan class=\"n\"\u003eNodeName\u003c/span\u003e\u003cspan class=\"p\"\u003e))\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"n\"\u003eextend\u003c/span\u003e \u003cspan class=\"n\"\u003eNodeIP\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u003c/span\u003e \u003cspan class=\"n\"\u003etostring\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003eparse_json\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003eStatus\u003c/span\u003e\u003cspan class=\"p\"\u003e).\u003c/span\u003e\u003cspan class=\"n\"\u003eaddresses\u003c/span\u003e\u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"m\"\u003e0\u003c/span\u003e\u003cspan class=\"p\"\u003e].\u003c/span\u003e\u003cspan class=\"n\"\u003eaddress\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"n\"\u003eproject\u003c/span\u003e \u003cspan class=\"n\"\u003eNodeName\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eComputer\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eNodeIP\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"n\"\u003etake\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\u003cspan 
class=\"p\"\u003e;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003elet\u003c/span\u003e \u003cspan class=\"n\"\u003epodNodeInfo\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u003c/span\u003e \u003cspan class=\"n\"\u003epodNode\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"k\"\u003ejoin\u003c/span\u003e \u003cspan class=\"n\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003einner\u003c/span\u003e \u003cspan class=\"n\"\u003enodeIP\u003c/span\u003e \u003cspan class=\"k\"\u003eon\u003c/span\u003e \u003cspan class=\"n\"\u003eNodeName\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003epodNodeInfo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"k\"\u003ejoin\u003c/span\u003e \u003cspan class=\"n\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003einner\u003c/span\u003e \u003cspan class=\"p\"\u003e(\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"n\"\u003eAzureNetworkAnalytics_CL\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"k\"\u003ewhere\u003c/span\u003e \u003cspan class=\"n\"\u003eTimeGenerated\u003c/span\u003e \u003cspan class=\"p\"\u003e\u0026gt;=\u003c/span\u003e \u003cspan class=\"n\"\u003etimeRange\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"p\"\u003e|\u003c/span\u003e 
\u003cspan class=\"n\"\u003eproject\u003c/span\u003e \u003cspan class=\"n\"\u003eFlowTime\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eTimeGenerated\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eSourceIP\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eSrcIP_s\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eDestinationIP\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eDestIP_s\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eDestinationPort\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eDestPort_d\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eProtocol\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eL7Protocol_s\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eFlowDirection\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eFlowDirection_s\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eDecision\u003c/span\u003e\u003cspan class=\"p\"\u003e=\u003c/span\u003e\u003cspan class=\"n\"\u003eFlowStatus_s\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e \u003cspan class=\"k\"\u003eon\u003c/span\u003e \u003cspan class=\"err\"\u003e$\u003c/span\u003e\u003cspan class=\"n\"\u003eleft\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eNodeIP\u003c/span\u003e \u003cspan class=\"p\"\u003e==\u003c/span\u003e \u003cspan class=\"err\"\u003e$\u003c/span\u003e\u003cspan class=\"n\"\u003eright\u003c/span\u003e\u003cspan 
class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eSourceIP\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e \u003cspan class=\"n\"\u003eproject\u003c/span\u003e \u003cspan class=\"n\"\u003ePodName\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eNamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eNodeName\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eNodeIP\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eFlowTime\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eDestinationIP\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eDestinationPort\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eProtocol\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eFlowDirection\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"n\"\u003eDecision\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe limitation: the join matches flows from every pod on the same node, because they all share the node IP after SNAT. If you have 50 pods on that node, you get 50 potential sources. To narrow this down, correlate timestamps with application-level logs. 
Without correlation IDs, you\u0026rsquo;re guessing based on timing.\u003c/p\u003e\n\u003cp\u003eCommon mistakes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eAssuming timestamp correlation is accurate to the second (Flow Logs can be off by minutes)\u003c/li\u003e\n\u003cli\u003eNot accounting for pod restarts that change pod-to-node mapping mid-incident\u003c/li\u003e\n\u003cli\u003eForgetting that pods on the same node share the same source IP\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"snat-tracking-and-port-exhaustion\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#snat-tracking-and-port-exhaustion\" title=\"SNAT Tracking and Port Exhaustion\"\u003eSNAT Tracking and Port Exhaustion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eSNAT doesn\u0026rsquo;t just mask IPs. It introduces a finite resource constraint: SNAT ports. With Load Balancer outbound SNAT, each frontend public IP provides 64,000 ports, and that pool is divided among the nodes behind it. In CNI Overlay, all pods on a node share that node\u0026rsquo;s allocation. Under heavy load, you exhaust SNAT ports, causing intermittent connection failures that are nearly impossible to diagnose.\u003c/p\u003e\n\u003cp\u003eThe real risk: SNAT port exhaustion looks exactly like network instability, DNS issues, or backend degradation. You\u0026rsquo;ll spend hours troubleshooting the wrong layer while your SNAT ports silently hit 100%.\u003c/p\u003e\n\u003cp\u003eAzure Monitor provides \u003ccode\u003eAllocatedSnatPorts\u003c/code\u003e and \u003ccode\u003eUsedSnatPorts\u003c/code\u003e metrics at the Load Balancer level, but you need to enable them explicitly. 
Microsoft\u0026rsquo;s quickstart documentation conveniently omits this.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"what-snat-exhaustion-actually-looks-like\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#what-snat-exhaustion-actually-looks-like\" title=\"What SNAT Exhaustion Actually Looks Like\"\u003eWhat SNAT Exhaustion Actually Looks Like\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eSymptoms when approaching exhaustion:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eConnections timing out during peak traffic, but only for some pods\u003c/li\u003e\n\u003cli\u003eSporadic DNS resolution failures\u003c/li\u003e\n\u003cli\u003eError logs showing \u0026ldquo;unable to bind to port\u0026rdquo; or \u0026ldquo;connection refused\u0026rdquo;\u003c/li\u003e\n\u003cli\u003eRetries succeeding randomly (because a SNAT port became available)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAuthor note: I\u0026rsquo;ve seen teams spend 6+ hours investigating \u0026ldquo;intermittent Azure Storage failures\u0026rdquo; before someone checked SNAT port metrics and found 99% utilization. The fix took 10 minutes (deploy NAT Gateway). The diagnosis took half a day because SNAT exhaustion wasn\u0026rsquo;t on their radar.\u003c/p\u003e\n\u003cp\u003eMitigation strategies:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eConnection pooling\u003c/strong\u003e: Reuse connections in your application code. Critical in CNI Overlay.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eHTTP keep-alive\u003c/strong\u003e: Reuse TCP connections. A single pod making 1000 requests/second without keep-alive exhausts SNAT ports in under a minute.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNAT Gateway\u003c/strong\u003e: Deploy NAT Gateway for outbound connectivity. Provides 64,512 SNAT ports per public IP (multiple IPs supported). 
Not optional for high-throughput clusters.\u003c/li\u003e\n\u003c/ol\u003e\n\n\n\n\n\u003ch3 id=\"when-to-reach-for-nat-gateway\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#when-to-reach-for-nat-gateway\" title=\"When To Reach For NAT Gateway\"\u003eWhen To Reach For NAT Gateway\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eNAT Gateway is the most effective solution. Load Balancer SNAT works for dev clusters, but production clusters handling thousands of outbound requests per second will exhaust ports.\u003c/p\u003e\n\u003cp\u003eUse for: Production clusters with 20+ nodes, clusters making frequent outbound API calls, workloads with poor connection reuse.\u003c/p\u003e\n\u003cp\u003eTrade-offs: $35/month plus $0.045/GB egress (cheaper than debugging SNAT exhaustion at 2 AM). Doesn\u0026rsquo;t solve observability, only port exhaustion.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"debugging-at-scale-correlation-ids-and-distributed-tracing\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#debugging-at-scale-correlation-ids-and-distributed-tracing\" title=\"Debugging at Scale: Correlation IDs and Distributed Tracing\"\u003eDebugging at Scale: Correlation IDs and Distributed Tracing\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eWhen you operate a cluster with hundreds of pods across dozens of nodes, manual correlation becomes impossible. You need structured logging and distributed tracing.\u003c/p\u003e\n\u003cp\u003eCorrelation IDs are the simplest effective pattern. Generate a unique ID at request entry point and propagate it through your entire call chain. When debugging, filter all logs by correlation ID to see the complete request flow across pods and services.\u003c/p\u003e\n\u003cp\u003eDistributed tracing models requests as traces with parent-child relationships between spans. OpenTelemetry is the current standard. 
Instrument your applications to emit traces to Azure Monitor Application Insights, Jaeger, or Tempo.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"why-traces-survive-snat\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#why-traces-survive-snat\" title=\"Why Traces Survive SNAT\"\u003eWhy Traces Survive SNAT\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eThe value in CNI Overlay: traces maintain pod identity regardless of SNAT. A trace records which pod initiated a call, which pod received it, and operation duration. Network-level SNAT becomes irrelevant because the trace exists at the application layer.\u003c/p\u003e\n\u003cp\u003eThe recurring theme: correlation IDs and distributed tracing aren\u0026rsquo;t \u0026ldquo;nice to have\u0026rdquo; in CNI Overlay. They\u0026rsquo;re operational requirements. Without them, you\u0026rsquo;re correlating timestamps across data sources with different ingestion latencies and hoping you got the right pod. That\u0026rsquo;s not observability. That\u0026rsquo;s guessing.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"practical-recommendations-for-production\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#practical-recommendations-for-production\" title=\"Practical Recommendations for Production\"\u003ePractical Recommendations for Production\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eEnable Container Insights from day one.\u003c/strong\u003e In CNI Overlay, pod-to-node mapping is invisible from network logs. You\u0026rsquo;re flying blind without it. Budget $50-200/month for monitoring, or budget significantly more for postmortems explaining why you couldn\u0026rsquo;t identify which pod caused the outage.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConfigure log retention based on compliance, not convenience.\u003c/strong\u003e Thirty days is reasonable for most use cases. 
A 100-node cluster at 365-day retention can cost $500-1000/month just for storage.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eImplement structured logging with correlation IDs across all applications.\u003c/strong\u003e This is not optional. JSON logging with consistent field names makes querying possible. Include: timestamp, log level, correlation ID, message. Container Insights adds pod metadata automatically, so don\u0026rsquo;t duplicate it.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSet up alerts for SNAT port usage before production.\u003c/strong\u003e Monitor \u003ccode\u003eUsedSnatPorts\u003c/code\u003e and alert at 80% capacity. Better yet, deploy NAT Gateway proactively. The cost ($35/month) is trivial compared to a production outage from SNAT exhaustion.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUse distributed tracing for multi-service architectures.\u003c/strong\u003e Overhead is low (1-5% CPU) and debugging value is high. Start with critical paths. Without tracing, debugging cascade failures in CNI Overlay is nearly impossible.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDocument your correlation queries.\u003c/strong\u003e Keep these queries in version control alongside your infrastructure code. Tribal knowledge doesn\u0026rsquo;t scale.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-hard-truth-about-cni-overlay-observability\"\u003e\u003ca href=\"/posts/observability-logging-aks-cni-overlay/#the-hard-truth-about-cni-overlay-observability\" title=\"The Hard Truth About CNI Overlay Observability\"\u003eThe Hard Truth About CNI Overlay Observability\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eCNI Overlay makes AKS operationally better by solving IP exhaustion. But it makes observability harder by hiding pod IPs behind node IPs. This isn\u0026rsquo;t a flaw. It\u0026rsquo;s a tradeoff. 
The solution isn\u0026rsquo;t to avoid CNI Overlay (it\u0026rsquo;s the right choice for most clusters) but to build your observability stack with this reality in mind before production.\u003c/p\u003e\n\u003cp\u003eContainer Insights provides the metadata layer that network logs lack. Flow Logs give network-level visibility even though pod IPs are masked. Distributed tracing maintains request context regardless of SNAT. Correlation IDs make manual debugging feasible when automated tools fall short.\u003c/p\u003e\n\u003cp\u003eNone of this is automatic. You have to configure it deliberately, budget for it appropriately, and train your team to use it. Microsoft\u0026rsquo;s CNI Overlay documentation presents it as \u0026ldquo;simpler\u0026rdquo; than traditional CNI. What they don\u0026rsquo;t mention is that you\u0026rsquo;re trading networking simplicity for observability complexity. That\u0026rsquo;s a good trade for large clusters, but it\u0026rsquo;s still a trade.\u003c/p\u003e\n\u003cp\u003eThe operational reality: I\u0026rsquo;ve debugged production incidents in CNI Overlay clusters where the initial response was \u0026ldquo;we can\u0026rsquo;t see which pod is causing this.\u0026rdquo; That\u0026rsquo;s only true if you haven\u0026rsquo;t built the correlation infrastructure upfront. With Container Insights, structured logging, and distributed tracing in place, CNI Overlay observability is no harder than traditional CNI. It\u0026rsquo;s just different, and it requires deliberate tooling investment before the first incident, not during it.\u003c/p\u003e\n\u003cp\u003eThe honest assessment: CNI Overlay is the right choice for most production AKS clusters. The IP efficiency gains are significant. 
But if your organization isn\u0026rsquo;t prepared to invest in proper observability tooling (Container Insights, distributed tracing, structured logging with correlation IDs), you\u0026rsquo;ll regret choosing CNI Overlay the first time you debug an outbound connectivity issue at 3 AM. Plan accordingly.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-02-18T17:00:00+01:00","id":"https://daily-devops.net/posts/observability-logging-aks-cni-overlay/","language":"en","summary":"CNI Overlay hides pod IPs behind nodes, breaking observability. Practical patterns for log aggregation, network flows, and debugging at scale.","tags":["observability","azure","kubernetes","cloud","devops","monitoring"],"title":"Observability in AKS CNI Overlay: When Pod IPs Hide Behind Nodes","url":"https://daily-devops.net/posts/observability-logging-aks-cni-overlay/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eAKS costs are brutally simple: node sizing, pod density, workload sprawl, and reserved capacity. If you don\u0026rsquo;t have visibility and governance, your cloud bill will punch you in the face—usually when it\u0026rsquo;s too late to react without pain. I\u0026rsquo;ve watched teams scramble to cut costs after the invoice lands, breaking production in the process. This guide is for practitioners who want to avoid that mess. No theory, no vendor fluff: just what actually works to keep AKS costs under control without sacrificing reliability.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-problem-visibility-prevents-shocks\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#the-problem-visibility-prevents-shocks\" title=\"The problem: visibility prevents shocks\"\u003eThe problem: visibility prevents shocks\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eThe real risk is flying blind on cost. 
Most organizations throw infrastructure at problems, then act shocked when the bill arrives. This delay creates a vicious cycle: teams over-provision to avoid risk, costs spiral, finance panics, and engineers are told to \u0026ldquo;just cut spend\u0026rdquo;—usually with zero context. If you don\u0026rsquo;t know what your resources will cost before you deploy, you\u0026rsquo;re setting yourself up for failure. In AKS, node cost is king. If you don\u0026rsquo;t understand node sizing, pod density, and workload distribution, you\u0026rsquo;re not in control: you\u0026rsquo;re just hoping for the best.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"pod-density-vs-node-size\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#pod-density-vs-node-size\" title=\"Pod density vs. node size\"\u003ePod density vs. node size\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003ePod density: the number of pods per node. Higher density slashes per-pod cost, but if you push it too far, a single node failure can wipe out half your workloads. Lower density means more nodes, more overhead, and more Azure spend for the same business value. Most teams get this wrong by guessing or copying defaults.\u003c/p\u003e\n\u003cp\u003eHere\u0026rsquo;s the honest assessment: Small nodes with low pod density are easy to replace, but you pay for the privilege. Large nodes with high pod density are efficient, but when they go down, you feel it. For most real-world workloads, start with medium-sized nodes (4-8 vCPUs, 16-32 GB RAM) and target 20-30 pods per node. Don\u0026rsquo;t trust vendor sizing calculators: watch your actual pod usage and adjust. Memory hogs need lower density. Stateless microservices? 
Push the density, but monitor for pain.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"the-blast-radius-of-a-single-node\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#the-blast-radius-of-a-single-node\" title=\"The Blast Radius Of A Single Node\"\u003eThe Blast Radius Of A Single Node\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eNode size hits your Azure bill directly. A Standard_D4s_v5 (4 vCPU, 16 GB) is about $140/month in West Europe. A D8s_v5 (8 vCPU, 32 GB) is about $280/month. If you run 12 pods at 0.5 vCPU and 2 GB RAM each (6 vCPU and 24 GB in total), you can fit them on two D4s_v5 or one D8s_v5. The cost is identical, but the blast radius is not. Lose the big node, lose everything on it. The real cost difference? System overhead. Kubernetes always takes a cut for kubelet, container runtime, and system pods. Small nodes waste more on overhead. Big nodes are more efficient, but you pay in risk.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"node-pool-stratification\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#node-pool-stratification\" title=\"Node-pool stratification\"\u003eNode-pool stratification\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eMixing workloads in a single node pool is a rookie mistake. Batch jobs, web services, databases, and background workers all have different needs. If you lump them together, you guarantee wasted money and operational headaches. Stratify your node pools. Optimize each for cost, performance, and reliability. 
No exceptions.\u003c/p\u003e\n\u003cp\u003eCreate separate node pools for:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSystem workloads:\u003c/strong\u003e Small, reliable nodes for kube-system and monitoring\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eProduction services:\u003c/strong\u003e Standard nodes for user-facing apps\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eBatch/background jobs:\u003c/strong\u003e Spot or big nodes for interruptible stuff\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStateful services:\u003c/strong\u003e Nodes with local SSD for low-latency storage\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eBut here\u0026rsquo;s the catch: Node-pool stratification is useless if you don\u0026rsquo;t enforce it. Use node affinity and taints/tolerations. If you skip this, your pods will land wherever they want, and your cost model is toast.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"enforcing-pool-separation-with-taints\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#enforcing-pool-separation-with-taints\" title=\"Enforcing Pool Separation With Taints\"\u003eEnforcing Pool Separation With Taints\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eExample configuration for a production node pool with cost labels in Terraform:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_kubernetes_cluster_node_pool\u0026#34; \u0026#34;production\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;production\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  kubernetes_cluster_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  vm_size\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard_D4s_v5\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  node_count\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e3\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"n\"\u003etags\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    environment\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;production\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    workload\u003c/span\u003e    \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;web-services\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    cost-center\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;engineering\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    managed-by\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;terraform\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"n\"\u003enode_labels\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    \u0026#34;workload-type\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;production\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    \u0026#34;node-pool\u0026#34;\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;production\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"n\"\u003enode_taints\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"p\"\u003e[\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    \u0026#34;workload\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003eproduction\u003c/span\u003e\u003cspan class=\"err\"\u003e:\u003c/span\u003e\u003cspan class=\"k\"\u003eNoSchedule\u003c/span\u003e\u003cspan class=\"err\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_kubernetes_cluster_node_pool\u0026#34; \u0026#34;batch\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;batch\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  kubernetes_cluster_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  vm_size\u003c/span\u003e              
\u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard_D8s_v5\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  priority\u003c/span\u003e             \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Spot\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  eviction_policy\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Delete\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  spot_max_price\u003c/span\u003e       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"err\"\u003e-\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # Pay up to regular price\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  node_count\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  enable_auto_scaling\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  min_count\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  max_count\u003c/span\u003e   
         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e5\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"n\"\u003etags\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    environment\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;production\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    workload\u003c/span\u003e    \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;batch-processing\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    cost-center\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;data-engineering\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    managed-by\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;terraform\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"n\"\u003enode_labels\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    \u0026#34;workload-type\u0026#34;\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;batch\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    \u0026#34;node-pool\u0026#34;\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;batch\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    \u0026#34;kubernetes.azure.com/scalesetpriority\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;spot\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"n\"\u003enode_taints\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    \u0026#34;workload\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003ebatch\u003c/span\u003e\u003cspan class=\"err\"\u003e:\u003c/span\u003e\u003cspan class=\"k\"\u003eNoSchedule\u003c/span\u003e\u003cspan class=\"err\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    \u0026#34;kubernetes.azure.com/scalesetpriority\u003c/span\u003e\u003cspan 
class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003espot\u003c/span\u003e\u003cspan class=\"err\"\u003e:\u003c/span\u003e\u003cspan class=\"k\"\u003eNoSchedule\u003c/span\u003e\u003cspan class=\"err\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eTag your node pools with cost-center and workload. If you don\u0026rsquo;t, finance will treat your AKS bill as a black box, and you\u0026rsquo;ll lose every argument about budget. Azure Cost Management can only help if you give it the data.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"spot-vms-and-reserved-instances\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#spot-vms-and-reserved-instances\" title=\"Spot VMs and reserved instances\"\u003eSpot VMs and reserved instances\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eSpot VMs are the cloud\u0026rsquo;s version of a fire sale: up to 90% off, but you can get kicked out with 30 seconds notice. Use them for workloads that can be interrupted—batch jobs, CI/CD runners, dev environments, stateless background stuff. If you put production or stateful workloads on spot VMs without bulletproof redundancy, you\u0026rsquo;re asking for trouble. 
The 30-second eviction is real, and Azure doesn\u0026rsquo;t care about your SLAs.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-to-commit-to-reserved-instances\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#when-to-commit-to-reserved-instances\" title=\"When To Commit To Reserved Instances\"\u003eWhen To Commit To Reserved Instances\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eReserved instances are the opposite: predictable discounts (up to 72% for 3-year commitments), but you pay whether you use them or not. Only buy what you know you\u0026rsquo;ll need for the next 1-3 years. Overcommit, and you\u0026rsquo;re stuck.\u003c/p\u003e\n\u003cp\u003eCost model comparison for a Standard_D4s_v5 node in West Europe:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003ePay-as-you-go: ~$140/month\u003c/li\u003e\n\u003cli\u003e1-year reserved: ~$100/month (29% discount)\u003c/li\u003e\n\u003cli\u003e3-year reserved: ~$65/month (54% discount)\u003c/li\u003e\n\u003cli\u003eSpot VM (typical): ~$20-40/month (70-85% discount, variable)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eRisk models for spot vs. reserved instances:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSpot VMs: high discount, high operational risk, suitable for interruptible workloads\u003c/li\u003e\n\u003cli\u003e1-year reserved: moderate discount, low risk, suitable for proven steady-state capacity\u003c/li\u003e\n\u003cli\u003e3-year reserved: high discount, moderate risk (commitment lock-in), suitable for long-term stable workloads\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAuthor tip: Spot VMs for batch, dev, and test. 1-year reserved for production baseline, but only after 3-6 months of real usage data. Avoid 3-year commitments unless you\u0026rsquo;re absolutely sure. 
Cloud changes fast: don\u0026rsquo;t lock yourself in.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"resource-requests-and-limits\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#resource-requests-and-limits\" title=\"Resource requests and limits\"\u003eResource requests and limits\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eResource requests and limits are where most teams burn money. Over-requesting is rampant: people set requests based on worst-case guesses, not real data. The result? Nodes look full, but are actually running at 20-30% utilization. Kubernetes blocks new pods, you scale out, and your CFO wonders why the bill keeps climbing.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"under-requesting-hurts-just-as-much\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#under-requesting-hurts-just-as-much\" title=\"Under-Requesting Hurts Just As Much\"\u003eUnder-Requesting Hurts Just As Much\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eUnder-requesting is just as bad: pods get scheduled, but then fight for resources, get throttled, or OOMKilled. Kubernetes thinks there\u0026rsquo;s room, but your workloads suffer.\u003c/p\u003e\n\u003cp\u003eThe fix is not rocket science, but it requires discipline. Monitor actual usage with Azure Monitor or Prometheus. Set requests to the 95th percentile of real usage, not what you wish was true. Example: If a pod averages 200m CPU and 500Mi memory, but spikes to 400m/1Gi, set requests to 250m/600Mi, limits to 500m/1.5Gi. Review and adjust regularly. If you don\u0026rsquo;t, you\u0026rsquo;re just guessing—and paying for it.\u003c/p\u003e\n\u003cp\u003eIncorrect resource settings create a cascading waste problem. Over-requested resources kill node packing, so you run more nodes than you need. More nodes mean more cost, more operational pain, and more complexity. Don\u0026rsquo;t be lazy: measure, set, and review. 
It\u0026rsquo;s the only way.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cost-attribution-and-chargeback\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#cost-attribution-and-chargeback\" title=\"Cost attribution and chargeback\"\u003eCost attribution and chargeback\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eCost attribution is not optional. If you can\u0026rsquo;t answer \u0026ldquo;who owns this cost?\u0026rdquo;, you\u0026rsquo;re going to lose every budget discussion. Finance hates black boxes. Tag your node pools, disks, and load balancers with cost-center, team, and workload. If you don\u0026rsquo;t, your AKS bill is just a giant question mark.\u003c/p\u003e\n\u003cp\u003ePractical implementation steps:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eTag all AKS resources (node pools, disks, load balancers) with cost-center, workload, environment, and managed-by tags\u003c/li\u003e\n\u003cli\u003eConfigure Azure Cost Management to group costs by these tags\u003c/li\u003e\n\u003cli\u003eExport cost data monthly and distribute reports to team leads\u003c/li\u003e\n\u003cli\u003eReview high-cost workloads and investigate optimization opportunities\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eKubernetes-native attribution is possible. Use Kubecost or OpenCost to estimate per-pod cost based on real resource requests and node pricing. Aggregate by namespace, label, or deployment. If you don\u0026rsquo;t have this visibility, you\u0026rsquo;re just hoping your spend is \u0026ldquo;reasonable\u0026rdquo;. 
See the \u003ca href=\"https://docs.kubecost.com/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eKubecost documentation\u003c/a\u003e and \u003ca href=\"https://www.opencost.io/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eOpenCost project\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003eExample Azure CLI query for cost by tag:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Query month-to-date subscription costs grouped by tag (the name value should be your tag key, such as cost-center)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz costmanagement query \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --type Usage \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --timeframe MonthToDate \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --scope \u003cspan class=\"s2\"\u003e\u0026#34;/subscriptions/\u0026lt;subscription-id\u0026gt;\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --dataset-aggregation \u003cspan class=\"s1\"\u003e\u0026#39;{\u0026#34;totalCost\u0026#34;:{\u0026#34;name\u0026#34;:\u0026#34;Cost\u0026#34;,\u0026#34;function\u0026#34;:\u0026#34;Sum\u0026#34;}}\u0026#39;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --dataset-grouping \u003cspan class=\"nv\"\u003ename\u003c/span\u003e\u003cspan 
class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Tags\u0026#34;\u003c/span\u003e \u003cspan class=\"nv\"\u003etype\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Tag\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --query \u003cspan class=\"s1\"\u003e\u0026#39;properties.rows[]\u0026#39;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --output table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Export detailed cost breakdown to CSV\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz costmanagement query \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --type Usage \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --timeframe MonthToDate \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --scope \u003cspan class=\"s2\"\u003e\u0026#34;/subscriptions/\u0026lt;subscription-id\u0026gt;\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --dataset-aggregation \u003cspan 
class=\"s1\"\u003e\u0026#39;{\u0026#34;totalCost\u0026#34;:{\u0026#34;name\u0026#34;:\u0026#34;Cost\u0026#34;,\u0026#34;function\u0026#34;:\u0026#34;Sum\u0026#34;}}\u0026#39;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --dataset-grouping \u003cspan class=\"nv\"\u003ename\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Tags\u0026#34;\u003c/span\u003e \u003cspan class=\"nv\"\u003etype\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Tag\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --output json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.properties.rows[] | @csv\u0026#39;\u003c/span\u003e \u0026gt; aks-costs.csv\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"chargeback-or-showback-first\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#chargeback-or-showback-first\" title=\"Chargeback Or Showback First\"\u003eChargeback Or Showback First\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eChargeback or showback? Chargeback means teams pay for what they use. Showback just shows them the numbers. Start with showback—it\u0026rsquo;s less political. But if you want real accountability, move to chargeback once teams have seen the data and had a chance to optimize. 
Don\u0026rsquo;t try to force chargeback on day one unless you like chaos.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"practical-checklist\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#practical-checklist\" title=\"Practical checklist\"\u003ePractical checklist\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eChecklist for teams who want to stop wasting money:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eMeasure before optimizing:\u003c/strong\u003e 30 days of real pod usage before you touch requests/limits.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTag everything:\u003c/strong\u003e Cost-center, workload, environment, no exceptions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStratify node pools:\u003c/strong\u003e Production, batch, system: separate or pay the price.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRight-size nodes:\u003c/strong\u003e 20-30 pods per node for most. Adjust for reality, not theory.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse spot VMs for batch:\u003c/strong\u003e 70-85% discounts, but only for workloads that can die anytime.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReserve baseline capacity:\u003c/strong\u003e 1-year reserved, but only after you know your steady state.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSet accurate requests:\u003c/strong\u003e 95th percentile of real usage, not guesses.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eEnable autoscaling:\u003c/strong\u003e Let the cluster scale up on pending pods and scale down when nodes sit underutilized.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReview monthly:\u003c/strong\u003e Export cost data, find the top spenders, and dig in.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAutomate reporting:\u003c/strong\u003e Monthly cost breakdowns by team, sent automatically. No excuses.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eCost optimization never ends. Workload patterns shift, prices change, new VM types appear. Review every month. 
If you don\u0026rsquo;t, your bill will creep up and nobody will notice until it\u0026rsquo;s a crisis.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"conclusion\"\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/#conclusion\" title=\"Conclusion\"\u003eConclusion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS cost optimization is not about slashing spend after the fact. It\u0026rsquo;s about ruthless visibility, honest governance, and making the right calls up front. If you don\u0026rsquo;t understand pod density, node-pool design, spot VM risk, and resource requests, you\u0026rsquo;re just hoping for a good outcome. Hope is not a strategy.\u003c/p\u003e\n\u003cp\u003eTag everything, export cost data, and make sure the people who control resources see the numbers. Technical optimization plus organizational accountability is the only way to keep your cloud bill from spiraling out of control. Ignore this, and you\u0026rsquo;ll be back here next quarter, wondering where all the money went.\u003c/p\u003e","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-02-11T17:00:00+01:00","id":"https://daily-devops.net/posts/cost-optimization-resource-governance-aks/","language":"en","summary":"How to control AKS costs with pod density, node-pool design, spot VMs, and FinOps tagging—without sacrificing reliability or operational control.","tags":["kubernetes","azure","cloud","devops"],"title":"AKS Cost Optimization: Resource Governance That Actually Works","url":"https://daily-devops.net/posts/cost-optimization-resource-governance-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\n\n\n\n\u003ch2 id=\"the-problem-traditional-storage-models-dont-translate-to-kubernetes\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#the-problem-traditional-storage-models-dont-translate-to-kubernetes\" title=\"The Problem: Traditional Storage Models Don\u0026rsquo;t Translate to 
Kubernetes\"\u003eThe Problem: Traditional Storage Models Don\u0026rsquo;t Translate to Kubernetes\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning stateful workloads in Kubernetes means more than deploying a database pod. Traditional storage models (provision a disk, format it, mount it, expect it to stay) collide with Kubernetes\u0026rsquo; ephemeral, distributed architecture. Pods get rescheduled, scaled, and terminated. Your database shouldn\u0026rsquo;t lose data when that happens.\u003c/p\u003e\n\u003cp\u003eThe core challenge: \u003cstrong\u003ehow do you attach persistent storage to ephemeral compute?\u003c/strong\u003e On-premises infrastructure relies on SAN devices, NFS mounts, or local disks with predictable failure domains. You know which server hosts which disk. In AKS, you work with Azure storage primitives: Managed Disks, Azure Files, blob storage. These need seamless integration with Kubernetes lifecycle management. The abstractions differ, the failure modes differ, and operational patterns require rethinking.\u003c/p\u003e\n\u003cp\u003eComplexity multiplies with backup requirements, disaster recovery expectations, and multi-cluster data synchronization. Whether migrating legacy apps that expect local RAID controllers or building cloud-native data platforms from scratch, AKS storage architecture knowledge is foundational. 
Get it wrong: data loss, performance bottlenecks, escalating cloud bills.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"pvcpv-architecture-how-storage-binds-to-pods-in-aks\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#pvcpv-architecture-how-storage-binds-to-pods-in-aks\" title=\"PVC/PV Architecture: How Storage Binds to Pods in AKS\"\u003ePVC/PV Architecture: How Storage Binds to Pods in AKS\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes abstracts storage through two key objects: \u003cstrong\u003ePersistentVolumes (PV)\u003c/strong\u003e and \u003cstrong\u003ePersistentVolumeClaims (PVC)\u003c/strong\u003e. A PV represents the actual storage resource (Azure Disk, Azure Files share). A PVC represents the request for that storage. The relationship mirrors compute abstractions: nodes are physical machines, pods are logical units consuming node resources. Similarly, PVs are physical storage, PVCs are logical requests consuming PV capacity.\u003c/p\u003e\n\u003cp\u003eThe binding flow:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eDeveloper creates a PVC specifying size, access mode, and storage class\u003c/li\u003e\n\u003cli\u003eKubernetes finds or provisions a matching PV based on the storage class\u003c/li\u003e\n\u003cli\u003ePVC binds to the PV, making it available to pods\u003c/li\u003e\n\u003cli\u003ePods reference the PVC in their volume mounts\u003c/li\u003e\n\u003cli\u003eWhen the pod terminates, the PVC remains (data persists across pod lifecycles)\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAccess modes matter:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eReadWriteOnce (RWO)\u003c/strong\u003e: Single node can mount the volume (Azure Disk)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReadWriteMany (RWX)\u003c/strong\u003e: Multiple nodes can mount simultaneously (Azure Files)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReadOnlyMany (ROX)\u003c/strong\u003e: Multiple nodes, read-only 
access\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eMost stateful apps (databases, message queues) use RWO. Azure Disks provide better IOPS and latency than Azure Files. For shared storage (parallel batch processing, shared config directories, legacy apps expecting NFS semantics), use RWX: Azure Files or third-party CSI drivers like NFS or CephFS.\u003c/p\u003e\n\u003cp\u003eCritical insight: \u003cstrong\u003ePVCs decouple storage requests from storage implementation.\u003c/strong\u003e Developers don\u0026rsquo;t need to know if they get a Premium SSD or Standard HDD. They request 100Gi of fast storage, the storage class handles provisioning. This abstraction enables platform teams to enforce policies (all production PVCs use Premium tier) without touching application manifests.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-disk-vs-azure-files-performance-cost-regional-constraints\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#azure-disk-vs-azure-files-performance-cost-regional-constraints\" title=\"Azure Disk vs. Azure Files: Performance, Cost, Regional Constraints\"\u003eAzure Disk vs. Azure Files: Performance, Cost, Regional Constraints\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eChoosing between Azure Disk and Azure Files isn\u0026rsquo;t a one-size-fits-all decision. Each has distinct performance profiles, cost implications, and operational constraints.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAzure Disk (Managed Disks):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePerformance:\u003c/strong\u003e Lower latency, higher IOPS. Premium SSDs reach 20,000 IOPS, Ultra Disks exceed that.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAccess:\u003c/strong\u003e Single-node attachment (RWO). 
Pod rescheduling to another node triggers disk detach and reattach (expect brief delay).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse cases:\u003c/strong\u003e Databases (PostgreSQL, MongoDB), stateful apps requiring low-latency I/O.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCost:\u003c/strong\u003e Pay per provisioned disk size. A 1TB Premium SSD costs more than a 1TB Standard HDD, regardless of actual usage.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegional constraints:\u003c/strong\u003e Disks are zone-specific. With availability zones, pods must schedule in the same zone as the disk.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eAzure Files (SMB/NFS):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePerformance:\u003c/strong\u003e Higher latency than disks. Premium Files tier improves performance but still trails disk I/O.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAccess:\u003c/strong\u003e Multi-node (RWX). Multiple pods across nodes can mount the same share.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse cases:\u003c/strong\u003e Shared logs, static assets, config files, legacy apps expecting NFS.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCost:\u003c/strong\u003e Pay per storage consumed plus transactions. Transaction costs surprise teams on high-throughput workloads.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegional constraints:\u003c/strong\u003e File shares are regional, not zonal. Better for cross-zone workloads, still tied to single region.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eDecision criteria:\u003c/strong\u003e Default to Azure Disk for databases and high-IOPS apps. Use Azure Files only when RWX access or legacy NFS compatibility is required. 
For backup targets or archival storage, consider blob storage with CSI drivers (experimental, improving).\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"the-disk-attachment-penalty\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#the-disk-attachment-penalty\" title=\"The Disk Attachment Penalty\"\u003eThe Disk Attachment Penalty\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eGotcha: \u003cstrong\u003edisk attachment times.\u003c/strong\u003e Pod rescheduling requires Azure to detach the disk from the old node and attach it to the new one. This takes 30 to 90 seconds. Apps that cannot tolerate this downtime need application-level replication (PostgreSQL streaming replication) or third-party solutions like Portworx.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"storage-classes--dynamic-provisioning-automating-the-lifecycle\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#storage-classes--dynamic-provisioning-automating-the-lifecycle\" title=\"Storage Classes \u0026amp; Dynamic Provisioning: Automating the Lifecycle\"\u003eStorage Classes \u0026amp; Dynamic Provisioning: Automating the Lifecycle\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStatic provisioning (manually creating PVs, hoping someone claims them) creates operational overhead. 
\u003cstrong\u003eStorage classes\u003c/strong\u003e enable dynamic provisioning: Kubernetes automatically creates a PV when a PVC is submitted.\u003c/p\u003e\n\u003cp\u003eAKS ships with default storage classes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003edefault\u003c/code\u003e: Standard SSD Azure Disk (RWO)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003emanaged-premium\u003c/code\u003e: Premium SSD Azure Disk (RWO)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eazurefile\u003c/code\u003e: Azure Files share (RWX)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eazurefile-premium\u003c/code\u003e: Premium Azure Files share (RWX)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eYou can define custom storage classes to fine-tune parameters:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003estorage.k8s.io/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eStorageClass\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan 
class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003efast-ssd\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eprovisioner\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003edisk.csi.azure.com\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eparameters\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eskuName\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ePremium_LRS\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eManaged\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ecachingMode\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eReadOnly\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Zone redundant storage (ZRS) for higher durability\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# skuName: Premium_ZRS\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eallowVolumeExpansion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ereclaimPolicy\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eRetain\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003evolumeBindingMode\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eWaitForFirstConsumer\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eKey parameters:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ereclaimPolicy:\u003c/strong\u003e \u003ccode\u003eDelete\u003c/code\u003e removes the disk when PVC is deleted, \u003ccode\u003eRetain\u003c/code\u003e keeps it. 
For production databases, \u003ccode\u003eRetain\u003c/code\u003e prevents accidental data deletion.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003evolumeBindingMode:\u003c/strong\u003e \u003ccode\u003eWaitForFirstConsumer\u003c/code\u003e delays PV creation until pod scheduling. Critical for zone-aware clusters (Kubernetes creates the disk in the same zone as the pod).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eallowVolumeExpansion:\u003c/strong\u003e Enables PVC resizing without recreation. Azure Disks support this; not all storage backends do.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eBest practice:\u003c/strong\u003e Create environment-specific storage classes (dev, staging, prod) with different \u003ccode\u003eskuName\u003c/code\u003e values. Dev clusters use Standard HDDs, prod uses Premium SSDs. Application manifests stay otherwise identical across environments; only the storage class name changes.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"backup--recovery-rtorpo-implications\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#backup--recovery-rtorpo-implications\" title=\"Backup \u0026amp; Recovery: RTO/RPO Implications\"\u003eBackup \u0026amp; Recovery: RTO/RPO Implications\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes doesn\u0026rsquo;t back up data by default. Running \u003ccode\u003ekubectl delete pvc\u003c/code\u003e without a recovery plan means permanent data loss.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eVelero\u003c/strong\u003e (formerly Heptio Ark) is the de facto standard for Kubernetes backup. 
It snapshots PVs, captures Kubernetes object state, stores backups in object storage (Azure Blob, S3, GCS).\u003c/p\u003e\n\u003cp\u003eExample Velero backup schedule (via CLI):\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Install Velero with Azure plugin\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero install \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --provider azure \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --bucket velero-backups \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --secret-file ./credentials-velero \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --backup-location-config \u003cspan class=\"nv\"\u003eresourceGroup\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eaks-backups-rg,storageAccount\u003cspan class=\"o\"\u003e=\u003c/span\u003eaksbackupssa\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Create a daily backup schedule for production 
namespace\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero schedule create daily-prod-backup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --schedule\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 2 * * *\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --include-namespaces production \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --snapshot-volumes \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --ttl 720h\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"rto-and-rpo-considerations\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#rto-and-rpo-considerations\" title=\"RTO And RPO Considerations\"\u003eRTO And RPO Considerations\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eRTO/RPO considerations:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSnapshot-based backups (Azure Disk snapshots via Velero):\u003c/strong\u003e RPO equals backup frequency (hourly, daily). RTO equals time to provision new PV plus restore data (5 to 30 minutes).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNative Azure Backup for AKS:\u003c/strong\u003e Microsoft managed solution. Integrated with Azure Backup policies, slower restores and less granular than Velero.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eApplication-level backups (pg_dump, mongodump):\u003c/strong\u003e Bypasses Kubernetes entirely. 
Lower RTO with automated restore scripts, but requires custom orchestration.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eGotcha:\u003c/strong\u003e Velero relies on Azure Disk snapshots. If the disk lives in Zone 1 and you restore to a cluster in Zone 2, the restore requires a cross-zone snapshot copy (not instant). Test restore procedures in non-prod clusters. A backup never restored is wishful thinking.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-aks-replication-patterns-for-cross-cluster-data-synchronization\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#multi-aks-replication-patterns-for-cross-cluster-data-synchronization\" title=\"Multi-AKS Replication: Patterns for Cross-Cluster Data Synchronization\"\u003eMulti-AKS Replication: Patterns for Cross-Cluster Data Synchronization\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning stateful workloads across multiple AKS clusters, whether for HA, disaster recovery, or multi-region latency requirements, adds another layer of complexity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 1: Application-Level Replication\u003c/strong\u003e\nLet the application handle replication. PostgreSQL streaming replication, MongoDB replica sets, and Kafka replication understand their data models and replicate efficiently.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e No Kubernetes-specific dependencies. Works identically in VMs, on-premises, or managed services.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e You manage replication lag, split-brain scenarios, and failover logic.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 2: Storage-Level Replication\u003c/strong\u003e\nUse Azure NetApp Files or third-party solutions like Portworx for block or file-level replication.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e Transparent to applications. 
Works with legacy apps lacking native replication.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e Expensive. NetApp Files Premium tier and Portworx licensing (scales with node count) add significant cost.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 3: Backup-Based DR\u003c/strong\u003e\nTake Velero backups from the primary cluster and restore them to the secondary on failover.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e Cost-effective (blob storage only).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e RPO equals the backup interval (hours, not seconds). RTO includes restore time (minutes to hours).\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"a-multi-region-postgresql-pattern\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#a-multi-region-postgresql-pattern\" title=\"A Multi-Region PostgreSQL Pattern\"\u003eA Multi-Region PostgreSQL Pattern\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eReal-world example:\u003c/strong\u003e A multi-region PostgreSQL deployment pattern I\u0026rsquo;ve encountered:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePrimary AKS cluster (West Europe):\u003c/strong\u003e Production traffic\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSecondary AKS cluster (North Europe):\u003c/strong\u003e Read replicas via PostgreSQL streaming replication\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eVelero backups:\u003c/strong\u003e Azure Blob in third region (East US) for regulatory compliance\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThis provides sub-second RPO within Europe (streaming replication), hourly RPO globally (Velero), and a 5-minute RTO for regional failover (promote read replica).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOperational reality:\u003c/strong\u003e Multi-cluster data replication is complex. 
Avoid it by using managed services (Azure Database for PostgreSQL with geo-replication) if possible. Running databases in AKS requires investment in automation, monitoring, and runbooks. Your 3 AM self will appreciate this decision.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"final-thoughts\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#final-thoughts\" title=\"Final Thoughts\"\u003eFinal Thoughts\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStorage in AKS represents a set of trade-offs requiring deliberate navigation. Azure Disk provides performance with zone-locking. Azure Files offers flexibility with latency penalties. Velero enables backups but demands operational discipline and testing. Multi-cluster replication delivers resilience with non-linear operational complexity.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"a-pragmatic-starting-point\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#a-pragmatic-starting-point\" title=\"A Pragmatic Starting Point\"\u003eA Pragmatic Starting Point\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePragmatic approach: Start with managed storage classes and Velero. Use Azure Disk for databases and high-IOPS workloads. Use Azure Files only when RWX access or legacy NFS compatibility is genuinely required. Test restore procedures quarterly, not during outages. Schedule fire drills: delete a namespace, restore from backup. 
Measure actual RTO/RPO instead of assuming SLA compliance.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-to-leave-aks-for-managed-data-services\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#when-to-leave-aks-for-managed-data-services\" title=\"When To Leave AKS For Managed Data Services\"\u003eWhen To Leave AKS For Managed Data Services\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhen stateful workload requirements outgrow AKS storage primitives (sub-second cross-region replication, disk attachment latency breaking your app, spiraling storage costs), don\u0026rsquo;t force solutions. Consider Azure managed services (Azure Database for PostgreSQL, Cosmos DB) or specialized data platforms (Confluent Cloud for Kafka, MongoDB Atlas). Sometimes the best Kubernetes storage strategy is avoiding stateful workloads in Kubernetes.\u003c/p\u003e\n\u003cp\u003eKubernetes excels at stateless orchestration. For stateful workloads, it\u0026rsquo;s capable but demands understanding the plumbing, accepting trade-offs, building operational muscle around backups, monitoring, and runbooks. Treat storage as infrastructure that will fail, not infrastructure that just works. 
Plan accordingly.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-02-04T17:00:00+01:00","id":"https://daily-devops.net/posts/storage-architecture-stateful-workloads-aks/","language":"en","summary":"PVC/PV patterns, Azure Disk vs Files trade-offs, Velero backup strategies, and cross-cluster replication for production stateful workloads in AKS.","tags":["storage","azure","kubernetes","cloud","database","reliability","operations","platform-engineering","disaster-recovery"],"title":"Storage Architecture \u0026 Stateful Workloads in AKS","url":"https://daily-devops.net/posts/storage-architecture-stateful-workloads-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eAKS cluster upgrades are routine maintenance, but executing them without dropping traffic or losing state is the operational challenge that separates theory from production reality. Every Kubernetes version upgrade involves replacing nodes, which means evicting pods, draining workloads, and hoping your assumptions about resilience hold true under pressure.\u003c/p\u003e\n\u003cp\u003eI have participated in dozens of AKS upgrades across production clusters ranging from 10 to 500+ nodes. The pattern is consistent: teams that treat upgrades as a checkbox operation eventually experience an outage. 
Teams that understand the underlying mechanics and configure explicit constraints rarely do.\u003c/p\u003e\n\u003cp\u003eThis article covers the real mechanics: how cordon and drain actually work, why Pod Disruption Budgets exist, and how to orchestrate multi-node-pool rollouts with automation that survives contact with production.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-problem-uncontrolled-node-drains-cause-cascading-failures\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#the-problem-uncontrolled-node-drains-cause-cascading-failures\" title=\"The problem: uncontrolled node drains cause cascading failures\"\u003eThe problem: uncontrolled node drains cause cascading failures\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eWhen you upgrade an AKS cluster, Azure replaces nodes with new VMs running the updated Kubernetes version. That replacement process triggers pod eviction. Without proper controls, evictions happen simultaneously across multiple nodes, stateful workloads lose quorum, and traffic drops because there are no healthy replicas left to serve requests.\u003c/p\u003e\n\u003cp\u003eThe default behavior is optimistic: Kubernetes assumes your workloads are designed for failure. But production workloads are rarely that resilient. Databases need time to transfer leadership, message queues need to flush buffers, and stateless apps still need at least one replica running to handle incoming connections.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e\u003cstrong\u003eAuthor note:\u003c/strong\u003e\u003c/em\u003e The \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-cluster\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eofficial AKS upgrade documentation\u003c/a\u003e covers the mechanics, but it does not emphasize how quickly things go wrong without proper constraints. 
I have seen a three-minute upgrade window turn into a two-hour incident because nobody configured Pod Disruption Budgets.\u003c/p\u003e\n\u003cp\u003eUncontrolled drains create several failure modes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eData loss:\u003c/strong\u003e Stateful workloads evicted before flushing state to disk or replicating to peers.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eService interruption:\u003c/strong\u003e All replicas terminated before new ones become ready.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCascading failures:\u003c/strong\u003e Dependent services time out waiting for unavailable backends, triggering retries that amplify load.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe solution is not to avoid upgrades. The solution is to control the eviction process with explicit constraints that match your workload requirements.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cordon-and-drain-mechanics-what-actually-happens\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#cordon-and-drain-mechanics-what-actually-happens\" title=\"Cordon and drain mechanics: what actually happens\"\u003eCordon and drain mechanics: what actually happens\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDraining a node follows a three-step process:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eCordon:\u003c/strong\u003e Mark the node as unschedulable. 
New pods will not be placed on this node, but existing pods continue running.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eEvict:\u003c/strong\u003e Send termination signals to all pods on the node, respecting grace periods and Pod Disruption Budgets (PDBs).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eWait:\u003c/strong\u003e Block until all pods have terminated or the drain timeout expires.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAKS automates this process during upgrades, but you can trigger it manually using kubectl for maintenance or troubleshooting:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl cordon \u0026lt;node-name\u0026gt;\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl drain \u0026lt;node-name\u0026gt; --ignore-daemonsets --delete-emptydir-data\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe \u003ccode\u003e--ignore-daemonsets\u003c/code\u003e flag prevents drain from failing on DaemonSet pods, which are designed to run on every node and will be recreated automatically. 
The \u003ccode\u003e--delete-emptydir-data\u003c/code\u003e flag allows drain to proceed even if pods use emptyDir volumes, which are ephemeral and will be lost.\u003c/p\u003e\n\u003cp\u003eFor AKS automated upgrades, you can configure the drain behavior per node pool using \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-node-pools-rolling\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003erolling upgrade settings\u003c/a\u003e:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks nodepool update \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name myNodePool \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --max-surge 33% \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --drain-timeout \u003cspan class=\"m\"\u003e45\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --node-soak-duration \u003cspan class=\"m\"\u003e5\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe \u003ccode\u003e--drain-timeout\u003c/code\u003e parameter (in minutes) controls how long 
AKS waits for pods to be evicted from each node. If the timeout elapses, AKS by default stops the upgrade operation rather than force-killing the remaining pods. The \u003ccode\u003e--node-soak-duration\u003c/code\u003e (in minutes) adds a stabilization period after each node upgrade before proceeding to the next. Microsoft recommends \u003ccode\u003e--max-surge 33%\u003c/code\u003e for production workloads.\u003c/p\u003e\n\u003cp\u003eManual drain remains useful for pre-maintenance validation, testing PDB configurations, or debugging eviction failures before committing to a full cluster upgrade.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"pod-disruption-budgets-the-safety-mechanism-you-should-always-configure\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#pod-disruption-budgets-the-safety-mechanism-you-should-always-configure\" title=\"Pod Disruption Budgets: the safety mechanism you should always configure\"\u003ePod Disruption Budgets: the safety mechanism you should always configure\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eA \u003ca href=\"https://kubernetes.io/docs/tasks/run-application/configure-pdb/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003ePod Disruption Budget\u003c/a\u003e (PDB) defines the minimum number of pods that must remain available during voluntary disruptions like node drains. 
PDBs do not prevent involuntary disruptions like node crashes or resource exhaustion, but they block evictions that would violate availability constraints.\u003c/p\u003e\n\u003cp\u003ePDBs are defined using either \u003ccode\u003eminAvailable\u003c/code\u003e or \u003ccode\u003emaxUnavailable\u003c/code\u003e:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eminAvailable:\u003c/strong\u003e The minimum number of pods (or percentage) that must remain running during a disruption.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003emaxUnavailable:\u003c/strong\u003e The maximum number of pods (or percentage) that can be unavailable during a disruption.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eExample PDB for a three-replica deployment that must keep at least two replicas running:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003epolicy/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ePodDisruptionBudget\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  
\u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003emyapp-pdb\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eminAvailable\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e2\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eselector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ematchLabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      
\u003c/span\u003e\u003cspan class=\"nt\"\u003eapp\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003emyapp\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eWith this PDB in place, drain will evict only one pod at a time, waiting for a replacement to become ready before proceeding to the next eviction. If no replacement becomes ready (for example, due to resource constraints or image pull failures), the drain blocks until the timeout expires.\u003c/p\u003e\n\u003cp\u003ePDBs are particularly critical for:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eStateful workloads:\u003c/strong\u003e Databases, message queues, and distributed systems that require quorum.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLow-replica deployments:\u003c/strong\u003e Services with two or three replicas where losing one pod reduces capacity significantly.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLong startup times:\u003c/strong\u003e Workloads that take minutes to initialize and become ready.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003ePractical PDB configuration advice:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSet \u003ccode\u003eminAvailable: 1\u003c/code\u003e for stateless services with two replicas.\u003c/li\u003e\n\u003cli\u003eSet \u003ccode\u003eminAvailable: N-1\u003c/code\u003e for N-replica stateful services that tolerate one failure (for example, three-node etcd allows \u003ccode\u003eminAvailable: 2\u003c/code\u003e).\u003c/li\u003e\n\u003cli\u003eAvoid \u003ccode\u003eminAvailable: N\u003c/code\u003e (all replicas), which blocks drain indefinitely and prevents upgrades.\u003c/li\u003e\n\u003cli\u003eUse percentages for large replica counts: \u003ccode\u003eminAvailable: 75%\u003c/code\u003e for a 10-replica deployment is rounded up to 8 required pods, so at most 2 pods can be 
evicted simultaneously.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAuthor tip: Before any upgrade, run \u003ccode\u003ekubectl get pdb -A\u003c/code\u003e and verify that no PDB shows zero under \u003ccode\u003eALLOWED DISRUPTIONS\u003c/code\u003e. A PDB with zero allowed disruptions blocks node drain, and your upgrade hangs until the drain timeout expires or you intervene manually.\u003c/p\u003e\n\u003cp\u003ePDBs only apply to voluntary disruptions. A node failure is not subject to PDBs: every pod on the failed node becomes unavailable at once.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"workload-categories-stateless-stateful-daemonsets\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#workload-categories-stateless-stateful-daemonsets\" title=\"Workload categories: stateless, stateful, DaemonSets\"\u003eWorkload categories: stateless, stateful, DaemonSets\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDifferent workload types require different upgrade strategies. A one-size-fits-all approach causes either unnecessary downtime (overly conservative) or unexpected failures (overly aggressive).\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"stateless-workloads\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#stateless-workloads\" title=\"Stateless workloads\"\u003eStateless workloads\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eStateless services like web frontends, API gateways, and workers can tolerate rapid eviction as long as at least one replica remains available. Configure PDBs with \u003ccode\u003eminAvailable: 1\u003c/code\u003e or \u003ccode\u003emaxUnavailable: N-1\u003c/code\u003e to allow fast rollouts while maintaining service availability.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"stateful-workloads\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#stateful-workloads\" title=\"Stateful workloads\"\u003eStateful workloads\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDatabases, message queues, and distributed storage systems require careful sequencing. 
Evicting multiple replicas simultaneously can cause quorum loss, split-brain scenarios, or data corruption.\u003c/p\u003e\n\u003cp\u003eBest practices for stateful workloads:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSet conservative PDBs that preserve quorum (for example, \u003ccode\u003eminAvailable: 2\u003c/code\u003e for a three-node cluster).\u003c/li\u003e\n\u003cli\u003eConfigure long termination grace periods (60+ seconds) to allow state transfer and leadership handoff.\u003c/li\u003e\n\u003cli\u003eUse StatefulSets with proper readiness probes to ensure new replicas are fully initialized before old ones are terminated.\u003c/li\u003e\n\u003cli\u003eTest upgrade scenarios in staging with realistic data volumes and latency.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"daemonsets\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#daemonsets\" title=\"DaemonSets\"\u003eDaemonSets\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDaemonSets run exactly one pod per node (or per matching node). Examples include logging agents, monitoring exporters, and network plugins. A drain with \u003ccode\u003e--ignore-daemonsets\u003c/code\u003e leaves the DaemonSet pod in place; it stops when the node itself is reimaged or removed, and the DaemonSet controller recreates it on the upgraded node.\u003c/p\u003e\n\u003cp\u003eDaemonSets do not require PDBs because drain skips DaemonSet-managed pods and the workload is designed to tolerate single-node failures. Use the \u003ccode\u003e--ignore-daemonsets\u003c/code\u003e flag during manual drain to skip these pods.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-node-pool-rollout-strategies-graduated-risk-management\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#multi-node-pool-rollout-strategies-graduated-risk-management\" title=\"Multi-node-pool rollout strategies: graduated risk management\"\u003eMulti-node-pool rollout strategies: graduated risk management\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS supports multiple node pools within a single cluster. Each pool can have different VM sizes, availability zones, and upgrade schedules. 
Multi-node-pool architectures enable graduated rollouts that reduce risk by upgrading non-critical workloads first.\u003c/p\u003e\n\u003cp\u003eRecommended upgrade sequence:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eDev/test pools first:\u003c/strong\u003e Upgrade node pools running non-production workloads to validate the new Kubernetes version and catch compatibility issues early.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStateless application pools:\u003c/strong\u003e Upgrade pools running stateless services that can tolerate brief capacity reductions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStateful application pools last:\u003c/strong\u003e Upgrade pools running databases and stateful services only after validating the rollout on stateless workloads.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eExample multi-pool upgrade using Azure CLI:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eRESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;myResourceGroup\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eCLUSTER_NAME\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;myAKSCluster\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;1.29.2\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Configure rolling upgrade settings for production safety\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eMAX_SURGE\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;33%\u0026#34;\u003c/span\u003e        \u003cspan class=\"c1\"\u003e# Microsoft recommended for production\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eDRAIN_TIMEOUT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;45\u0026#34;\u003c/span\u003e     \u003cspan class=\"c1\"\u003e# Minutes to wait for pod eviction\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNODE_SOAK\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;5\u0026#34;\u003c/span\u003e          \u003cspan class=\"c1\"\u003e# Minutes to stabilize after each node\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Upgrade control plane first (does not affect 
workloads)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Upgrading control plane to \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks upgrade \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --kubernetes-version \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TARGET_VERSION\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --control-plane-only \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --yes\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Upgrade node pools in sequence: system -\u0026gt; stateless -\u0026gt; stateful\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNODE_POOLS\u003c/span\u003e\u003cspan class=\"o\"\u003e=(\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;system\u0026#34;\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;stateless\u0026#34;\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;stateful\u0026#34;\u003c/span\u003e\u003cspan class=\"o\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e POOL in \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eNODE_POOLS\u003c/span\u003e\u003cspan class=\"p\"\u003e[@]\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Upgrading node pool: \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan 
class=\"c1\"\u003e# Record the current node count for the post-upgrade comparison\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003eaz aks nodepool show \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --query count -o tsv\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Current node count for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan 
class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e: \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Configure rolling upgrade settings before upgrade\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool update \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --max-surge \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$MAX_SURGE\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --drain-timeout \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$DRAIN_TIMEOUT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --node-soak-duration \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NODE_SOAK\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Upgrade node pool\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool upgrade \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan 
class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --kubernetes-version \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TARGET_VERSION\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --yes\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Wait for upgrade to complete\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Waiting for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e upgrade to complete...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool \u003cspan class=\"nb\"\u003ewait\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --updated\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Verify upgraded node count matches original\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003eUPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003eaz aks nodepool show \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --query count -o tsv\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CURRENT_COUNT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e !\u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$UPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: Node count mismatch for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e. 
Expected \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e, got \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eUPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Pool \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e upgraded successfully.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;---\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;All node pools upgraded to 
\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script upgrades the control plane first (which is a non-disruptive operation), then upgrades each node pool sequentially, validating node count before and after each upgrade to detect unexpected node losses.\u003c/p\u003e\n\u003cp\u003eKey operational notes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eControl plane upgrades are non-disruptive:\u003c/strong\u003e The control plane upgrade updates the Kubernetes API server and controllers but does not affect running workloads. Only node pool upgrades trigger pod evictions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eOne node pool at a time:\u003c/strong\u003e Upgrading multiple pools simultaneously multiplies risk. Sequential upgrades allow you to catch issues early and halt the rollout before affecting critical workloads.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eValidate before proceeding:\u003c/strong\u003e Check pod health, replica counts, and application metrics after each pool upgrade. 
Use kubectl, Azure Monitor, or Prometheus to verify that workloads are stable before moving to the next pool.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"planned-maintenance-windows-scheduling-upgrades-safely\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#planned-maintenance-windows-scheduling-upgrades-safely\" title=\"Planned maintenance windows: scheduling upgrades safely\"\u003ePlanned maintenance windows: scheduling upgrades safely\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eFor clusters with \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eautomatic upgrades\u003c/a\u003e enabled, AKS supports \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/planned-maintenance\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eplanned maintenance windows\u003c/a\u003e to control when upgrades occur. This prevents upgrades from starting during peak traffic periods.\u003c/p\u003e\n\u003cp\u003eConfigure a weekly maintenance window using Azure CLI:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks maintenanceconfiguration add \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name aksManagedAutoUpgradeSchedule \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --schedule-type Weekly \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --day-of-week Saturday \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --interval-weeks \u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --start-time 02:00 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --duration \u003cspan class=\"m\"\u003e4\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eMicrosoft recommends a maintenance window of at least four hours so that upgrades can complete without interruption. Combine this with the \u003ccode\u003estable\u003c/code\u003e auto-upgrade channel, which targets the previous minor version with the latest patches, for a balance between staying current and avoiding bleeding-edge issues.\u003c/p\u003e\n\u003cp\u003eFor production clusters, I prefer manual upgrades with planned maintenance windows as a safety net. The automation handles the scheduling, but I control when the actual upgrade starts.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"automation-and-rollback-scripting-safe-upgrades\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#automation-and-rollback-scripting-safe-upgrades\" title=\"Automation and rollback: scripting safe upgrades\"\u003eAutomation and rollback: scripting safe upgrades\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAutomation reduces human error during upgrades, but only if the automation includes validation and rollback capabilities.
A fully automated upgrade script should:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eValidate current cluster state (replica counts, PDB configurations, node health).\u003c/li\u003e\n\u003cli\u003eUpgrade in stages with validation checkpoints.\u003c/li\u003e\n\u003cli\u003eDetect failures and halt or rollback automatically.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003ePractical validation checks before upgrade:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;=== Pre-Upgrade Validation ===\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check available Kubernetes versions\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Available upgrades:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-upgrades \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup 
\u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --output table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify all nodes are ready (awk, unlike grep -v, exits 0 when nothing matches, so set -e and pipefail do not abort a healthy run)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNOTREADY\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get nodes --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e awk \u003cspan class=\"s1\"\u003e\u0026#39;$2 != \u0026#34;Ready\u0026#34;\u0026#39;\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NOTREADY\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -gt \u003cspan class=\"m\"\u003e0\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NOTREADY\u003c/span\u003e\u003cspan class=\"s2\"\u003e nodes are not ready.
Aborting upgrade.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  kubectl get nodes \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep -v \u003cspan class=\"s2\"\u003e\u0026#34; Ready \u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ All nodes ready\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check for PDBs that would block drain\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eBLOCKED\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pdb -A -o \u003cspan class=\"nv\"\u003ejsonpath\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{\u0026#34;\\n\u0026#34;}{end}\u0026#39;\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan 
class=\"nv\"\u003e$BLOCKED\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;WARNING: The following PDBs have zero allowed disruptions:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$BLOCKED\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;These will block node drain. 
Verify this is intentional.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify PDBs exist for critical namespaces\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e NS in production\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003ePDBS\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pdb -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --no-headers 2\u0026gt;/dev/null \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$PDBS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -eq \u003cspan class=\"m\"\u003e0\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;WARNING: No PDBs configured in namespace \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eelse\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ \u003c/span\u003e\u003cspan class=\"nv\"\u003e$PDBS\u003c/span\u003e\u003cspan class=\"s2\"\u003e PDBs configured in \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify critical deployments have sufficient replicas\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking critical deployments...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e DEPLOYMENT in myapp-frontend myapp-backend\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan 
class=\"nv\"\u003eREPLICAS\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get deployment \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -n production -o \u003cspan class=\"nv\"\u003ejsonpath\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;{.status.readyReplicas}\u0026#39;\u003c/span\u003e 2\u0026gt;/dev/null \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;0\u0026#34;\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003eREPLICAS\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eREPLICAS\u003c/span\u003e\u003cspan class=\"k\"\u003e:-\u003c/span\u003e\u003cspan class=\"nv\"\u003e0\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e  \u003cspan class=\"c1\"\u003e# readyReplicas is omitted from status when no pods are ready\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$REPLICAS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -lt \u003cspan class=\"m\"\u003e2\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e has fewer than 2 ready replicas (\u003c/span\u003e\u003cspan class=\"nv\"\u003e$REPLICAS\u003c/span\u003e\u003cspan class=\"s2\"\u003e).
Aborting upgrade.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ \u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$REPLICAS\u003c/span\u003e\u003cspan class=\"s2\"\u003e replicas ready\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;=== Validation Complete ===\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eRollback is more complex. AKS does not support in-place downgrades. 
If an upgrade introduces breaking changes, the rollback path involves:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eRestoring from a snapshot or backup (for stateful workloads).\u003c/li\u003e\n\u003cli\u003eDeploying a new node pool with the previous Kubernetes version.\u003c/li\u003e\n\u003cli\u003eMigrating workloads to the new pool.\u003c/li\u003e\n\u003cli\u003eDeleting the upgraded pool.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThis process is slow and disruptive, which is why validation before upgrade is critical. Test upgrades in staging, validate application compatibility with the new Kubernetes version, and maintain rollback procedures even if you hope never to use them.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"practical-recommendations\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#practical-recommendations\" title=\"Practical recommendations\"\u003ePractical recommendations\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eBased on production experience, the following practices reduce upgrade-related failures:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAlways configure PDBs for production workloads.\u003c/strong\u003e Even stateless services benefit from \u003ccode\u003eminAvailable: 1\u003c/code\u003e to prevent simultaneous eviction of all replicas.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTest upgrades in staging first.\u003c/strong\u003e Validate application compatibility, verify PDB behavior, and measure upgrade duration under realistic load.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUpgrade during low-traffic windows.\u003c/strong\u003e Even with proper PDBs, upgrades reduce available capacity. Schedule upgrades when traffic is lowest to minimize user impact.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMonitor during upgrades.\u003c/strong\u003e Track pod eviction events, replica counts, and application error rates. 
Use Azure Monitor, Prometheus, or your existing observability stack to detect issues early.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAutomate validation, not just execution.\u003c/strong\u003e Scripts that upgrade without validation are worse than manual upgrades because they fail faster and more completely.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"conclusion\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#conclusion\" title=\"Conclusion\"\u003eConclusion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS cluster upgrades are unavoidable, but service disruption is not. Cordon and drain mechanics provide the foundation, Pod Disruption Budgets enforce availability constraints, and multi-node-pool rollouts allow graduated risk management. Combine these tools with validation-driven automation, and zero-downtime upgrades become reliable rather than aspirational.\u003c/p\u003e\n\u003cp\u003eThe key insight: upgrades succeed when the automation respects the constraints of your workloads, not when the automation assumes resilience that does not exist.\u003c/p\u003e\n\u003cp\u003eStart with the basics: configure PDBs for every production workload, set \u003ccode\u003e--max-surge 33%\u003c/code\u003e on your node pools, and always upgrade control plane before node pools. Test in staging first. Monitor during the upgrade. 
These practices are not optional for production clusters.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-01-28T17:00:00+01:00","id":"https://daily-devops.net/posts/cluster-upgrades-zero-downtime-aks/","language":"en","summary":"Master AKS upgrades with cordon/drain mechanics, Pod Disruption Budgets, multi-node-pool rollouts, and automation for zero-downtime operations.\n","tags":["aks","azure","kubernetes","cloud","devops","operations","reliability"],"title":"AKS Cluster Upgrades: Zero-Downtime Operations That Actually Work\n","url":"https://daily-devops.net/posts/cluster-upgrades-zero-downtime-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eYour pod authenticates successfully in staging. Production fails with a cryptic 401. The service account exists, the managed identity is configured, Azure RBAC looks correct. Three hours later, you discover the federated credential subject doesn\u0026rsquo;t match the namespace you deployed to.\u003c/p\u003e\n\u003cp\u003eThis is the new reality of AKS authentication. Workload Identity Federation eliminates the credential lifecycle nightmares we dealt with for years: secrets expiring at 2 AM, credentials leaking into logs, service principals with subscription-wide access because someone took a shortcut during initial setup. 
But it replaces those problems with configuration complexity that spans three separate RBAC systems.\u003c/p\u003e\n\u003cp\u003eThis article covers what actually breaks: where credentials still leak despite federation, how Kubernetes RBAC, Azure RBAC, and Azure AD permissions interact (and fail), and the validation patterns that catch misconfigurations before they become production incidents.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-problem-with-pod-level-credentials\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#the-problem-with-pod-level-credentials\" title=\"The problem with pod-level credentials\"\u003eThe problem with pod-level credentials\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eTraditional approaches to AKS pod authentication relied on passing Azure service principal credentials directly to workloads. Teams stored client secrets in Kubernetes secrets, mounted them as environment variables, and hoped developers wouldn\u0026rsquo;t log them accidentally. This pattern had obvious weaknesses:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCredential lifecycle management:\u003c/strong\u003e Secrets expire. When they do, workloads fail unpredictably. Rotation requires redeploying pods or restarting containers, creating operational overhead and deployment windows for what should be a background task.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBlast radius:\u003c/strong\u003e A compromised pod credential grants full access to whatever Azure resources the service principal can reach. There\u0026rsquo;s no inherent scoping to the pod, namespace, or even cluster. The credential works from anywhere—your laptop, an attacker\u0026rsquo;s server, a developer\u0026rsquo;s local environment.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eObservability gaps:\u003c/strong\u003e When authentication fails, you get a generic 401. Was the secret wrong? Expired? Never properly mounted? 
The pod doesn\u0026rsquo;t know, and your logs won\u0026rsquo;t tell you until you start instrumenting credential fetching yourself.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAudit trails:\u003c/strong\u003e Service principal credentials obscure which workload actually made an Azure API call. All requests appear to come from the same identity, making it impossible to trace blast radius during incidents or satisfy compliance requirements for request attribution.\u003c/p\u003e\n\u003cp\u003eWorkload Identity Federation addresses these architectural issues, but introduces new operational complexity.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"workload-identity-vs-managed-identity-vs-service-accounts\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#workload-identity-vs-managed-identity-vs-service-accounts\" title=\"Workload Identity vs. Managed Identity vs. Service Accounts\"\u003eWorkload Identity vs. Managed Identity vs. Service Accounts\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eUnderstanding when to use each identity type prevents misconfiguration and operational failures.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"workload-identity-federation\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#workload-identity-federation\" title=\"Workload Identity Federation\"\u003eWorkload Identity Federation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWorkload Identity Federation maps Kubernetes service accounts to Azure AD identities through OpenID Connect (OIDC). 
The AKS cluster acts as an OIDC issuer, pods authenticate using their service account tokens, and Azure AD validates those tokens to grant Azure resource access.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eWhen to use it:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003ePods need access to Azure resources (Storage, Key Vault, Cosmos DB, etc.)\u003c/li\u003e\n\u003cli\u003eYou want credential-free authentication without managing secrets\u003c/li\u003e\n\u003cli\u003eYou need per-workload identity isolation within the same cluster\u003c/li\u003e\n\u003cli\u003eCompliance requires audit trails showing which pod made which Azure API call\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eWhen not to use it:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003ePods only communicate within Kubernetes—use standard Kubernetes service accounts\u003c/li\u003e\n\u003cli\u003eYou\u0026rsquo;re running on non-AKS infrastructure—Managed Identity or service principals may be better fits\u003c/li\u003e\n\u003cli\u003eYour workload runs outside of Azure AD tenant boundaries\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"managed-identity\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#managed-identity\" title=\"Managed Identity\"\u003eManaged Identity\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eManaged Identities work at the node or cluster level. 
The Azure platform manages credentials automatically, and workloads running on those resources inherit the identity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eWhen to use it:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eNode-level access patterns (monitoring agents, logging daemons, backup solutions)\u003c/li\u003e\n\u003cli\u003eCluster-wide operations (DNS, ingress controllers, cluster autoscaler)\u003c/li\u003e\n\u003cli\u003eWorkloads where per-pod identity isolation isn\u0026rsquo;t required\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eWhen not to use it:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eMultiple workloads on the same node need different Azure permissions\u003c/li\u003e\n\u003cli\u003eYou need audit trails distinguishing between pod-level actions\u003c/li\u003e\n\u003cli\u003eYou\u0026rsquo;re implementing least privilege at the workload level, not the node level\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"kubernetes-service-accounts\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#kubernetes-service-accounts\" title=\"Kubernetes Service Accounts\"\u003eKubernetes Service Accounts\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eService accounts provide identity within Kubernetes. 
They control access to Kubernetes API resources through RBAC, but have no inherent Azure permissions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eWhen to use them:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWorkloads that only interact with Kubernetes APIs\u003c/li\u003e\n\u003cli\u003eRBAC policies scoped to namespaces, pods, or specific Kubernetes resources\u003c/li\u003e\n\u003cli\u003eAs the foundation for Workload Identity Federation (every federated identity maps to a service account)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eWhen not to use them:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWorkloads need Azure resource access: layer Workload Identity Federation on top\u003c/li\u003e\n\u003cli\u003eCross-cluster identity is required: service accounts are namespaced objects that exist only within a single cluster\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"rbac-layering-where-permissions-actually-fail\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#rbac-layering-where-permissions-actually-fail\" title=\"RBAC layering: Where permissions actually fail\"\u003eRBAC layering: Where permissions actually fail\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS identity and access control spans three separate permission layers. Each layer has different failure modes, and misalignment between layers is a common cause of production authentication failures.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"layer-1-kubernetes-rbac\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#layer-1-kubernetes-rbac\" title=\"Layer 1: Kubernetes RBAC\"\u003eLayer 1: Kubernetes RBAC\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eKubernetes RBAC controls access to Kubernetes API resources. This includes pods, services, deployments, config maps, and secrets. 
Permissions are scoped to namespaces or cluster-wide, defined through roles and role bindings.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCommon failures:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eService account lacks permission to read secrets it needs to fetch from the Kubernetes API\u003c/li\u003e\n\u003cli\u003eA controller workload can\u0026rsquo;t create pods because its service account is missing the \u003ccode\u003ecreate\u003c/code\u003e verb on \u003ccode\u003epods\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eMonitoring workload can\u0026rsquo;t list nodes because it\u0026rsquo;s assigned a namespace-scoped role instead of a cluster role\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eValidation:\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check what a service account can do\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl auth can-i --list --as\u003cspan class=\"o\"\u003e=\u003c/span\u003esystem:serviceaccount:NAMESPACE:SERVICE_ACCOUNT_NAME\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check specific permission in the workload\u0026#39;s namespace\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl auth can-i get secrets --as\u003cspan class=\"o\"\u003e=\u003c/span\u003esystem:serviceaccount:production:my-workload -n production\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"layer-2-azure-rbac\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#layer-2-azure-rbac\" title=\"Layer 2: Azure RBAC\"\u003eLayer 2: Azure 
RBAC\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure RBAC controls access to Azure resources. Even with Workload Identity properly configured, pods fail to access Azure resources if the federated identity lacks appropriate Azure role assignments.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCommon failures:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWorkload Identity is configured correctly, but the Azure identity has no role assignments—pod can\u0026rsquo;t read from Storage\u003c/li\u003e\n\u003cli\u003eIdentity has \u003ccode\u003eReader\u003c/code\u003e role when it needs \u003ccode\u003eStorage Blob Data Reader\u003c/code\u003e—Azure API returns 403\u003c/li\u003e\n\u003cli\u003eRole assigned at wrong scope (subscription vs resource group vs specific resource)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eValidation:\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# List role assignments for a managed identity\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz role assignment list --assignee \u0026lt;managed-identity-client-id\u0026gt; --output table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify specific permission\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz role assignment list --assignee \u0026lt;managed-identity-client-id\u0026gt; \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --scope 
/subscriptions/\u0026lt;sub-id\u0026gt;/resourceGroups/\u0026lt;rg\u0026gt;/providers/Microsoft.Storage/storageAccounts/\u0026lt;account\u0026gt;\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"layer-3-azure-ad-permissions\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#layer-3-azure-ad-permissions\" title=\"Layer 3: Azure AD permissions\"\u003eLayer 3: Azure AD permissions\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eSome Azure services require Azure AD directory permissions in addition to Azure RBAC. Microsoft Graph API calls, reading Azure AD groups, and certain Key Vault operations require directory-level permissions that aren\u0026rsquo;t managed through RBAC.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCommon failures:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWorkload can authenticate to Azure AD but can\u0026rsquo;t call Graph API—missing \u003ccode\u003eUser.Read.All\u003c/code\u003e directory permission\u003c/li\u003e\n\u003cli\u003eKey Vault access configured with access policies instead of RBAC, but identity isn\u0026rsquo;t in the access policy list\u003c/li\u003e\n\u003cli\u003eCross-tenant scenarios where the identity exists in a different Azure AD tenant than the resource\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eValidation:\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check Azure AD application permissions (if using app registration)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz ad app permission list --id \u0026lt;app-id\u0026gt;\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# For Key Vault access policies\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz keyvault show --name \u0026lt;vault-name\u0026gt; --query properties.accessPolicies\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch2 id=\"common-misconfigurations-that-lead-to-security-breaches\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#common-misconfigurations-that-lead-to-security-breaches\" title=\"Common misconfigurations that lead to security breaches\"\u003eCommon misconfigurations that lead to security breaches\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eWorkload Identity Federation reduces credential exposure, but doesn\u0026rsquo;t eliminate configuration mistakes that create security vulnerabilities.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"over-permissioned-service-principals\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#over-permissioned-service-principals\" title=\"Over-permissioned service principals\"\u003eOver-permissioned service principals\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eTeams often grant broad permissions to simplify initial setup, then never revisit those permissions. A workload that only needs to read from one storage container ends up with \u003ccode\u003eContributor\u003c/code\u003e on the entire subscription.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMitigation:\u003c/strong\u003e Start with minimal permissions. Grant access to specific resources, not resource groups or subscriptions. 
Use managed identities with RBAC roles scoped to individual blob containers, queues, or Key Vault secrets rather than blanket \u003ccode\u003eContributor\u003c/code\u003e or \u003ccode\u003eOwner\u003c/code\u003e roles.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"credential-exposure-in-logs-and-traces\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#credential-exposure-in-logs-and-traces\" title=\"Credential exposure in logs and traces\"\u003eCredential exposure in logs and traces\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eEven with Workload Identity, tokens can leak. Application logging frameworks sometimes log HTTP headers, distributed tracing may capture authorization headers, and crash dumps may contain in-memory tokens.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMitigation:\u003c/strong\u003e Configure logging libraries to redact authorization headers. Review telemetry configurations to ensure tokens aren\u0026rsquo;t captured in traces. Use structured logging with explicit field filtering rather than logging entire request objects.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"identity-drift-between-environments\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#identity-drift-between-environments\" title=\"Identity drift between environments\"\u003eIdentity drift between environments\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDevelopment clusters use one set of identities, staging uses another, and production uses a third. Workloads behave differently across environments because the underlying identities have different permissions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMitigation:\u003c/strong\u003e Use infrastructure as code (Terraform, Bicep, ARM) to define identities and role assignments consistently. Version control your identity configurations alongside application deployments. 
Validate permissions in CI/CD pipelines before deploying to production.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"missing-federation-trust-relationships\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#missing-federation-trust-relationships\" title=\"Missing federation trust relationships\"\u003eMissing federation trust relationships\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWorkload Identity requires a trust relationship between the Kubernetes service account and the Azure managed identity. If the federated credential isn\u0026rsquo;t configured, or its issuer or subject doesn\u0026rsquo;t match the service account, the failure only surfaces at runtime: the pod starts normally and gets a valid Kubernetes token, but Azure AD rejects it during token exchange.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMitigation:\u003c/strong\u003e Automate federated credential creation as part of your cluster provisioning process. Validate that service account annotations match the correct Azure identity. Use admission controllers to enforce annotation standards and prevent deployment of workloads with missing or incorrect identity configurations.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"validation-patterns-how-to-audit-identity-configurations-safely\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#validation-patterns-how-to-audit-identity-configurations-safely\" title=\"Validation patterns: How to audit identity configurations safely\"\u003eValidation patterns: How to audit identity configurations safely\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eProactive validation catches misconfigurations before they cause production failures.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"pre-deployment-validation\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#pre-deployment-validation\" title=\"Pre-deployment validation\"\u003ePre-deployment validation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eBefore deploying a workload, validate that the Kubernetes RBAC, federation, and Azure RBAC pieces are correctly configured:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eKubernetes service account exists and has necessary 
Kubernetes RBAC permissions\u003c/li\u003e\n\u003cli\u003eAzure managed identity exists and has federated credential linking to the service account\u003c/li\u003e\n\u003cli\u003eAzure managed identity has required Azure RBAC role assignments on target resources\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e\u003cstrong\u003eExample validation script (Bash):\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNAMESPACE\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;production\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSERVICE_ACCOUNT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;my-workload\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eMANAGED_IDENTITY_CLIENT_ID\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;00000000-0000-0000-0000-000000000000\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSTORAGE_ACCOUNT_ID\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;/subscriptions/\u0026lt;sub-id\u0026gt;/resourceGroups/\u0026lt;rg\u0026gt;/providers/Microsoft.Storage/storageAccounts/\u0026lt;account\u0026gt;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# 1. Verify service account exists\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get serviceaccount \u003cspan class=\"nv\"\u003e$SERVICE_ACCOUNT\u003c/span\u003e -n \u003cspan class=\"nv\"\u003e$NAMESPACE\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# 2. Verify service account has Workload Identity annotation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eANNOTATION\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get serviceaccount \u003cspan class=\"nv\"\u003e$SERVICE_ACCOUNT\u003c/span\u003e -n \u003cspan class=\"nv\"\u003e$NAMESPACE\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  -o \u003cspan class=\"nv\"\u003ejsonpath\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;{.metadata.annotations.azure\\.workload\\.identity/client-id}\u0026#39;\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$ANNOTATION\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e !\u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$MANAGED_IDENTITY_CLIENT_ID\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: Service account annotation mismatch\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# 3. 
Verify Azure role assignment\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eROLE_COUNT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003eaz role assignment list \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --assignee \u003cspan class=\"nv\"\u003e$MANAGED_IDENTITY_CLIENT_ID\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --scope \u003cspan class=\"nv\"\u003e$STORAGE_ACCOUNT_ID\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --query \u003cspan class=\"s2\"\u003e\u0026#34;length([?roleDefinitionName==\u0026#39;Storage Blob Data Reader\u0026#39;])\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --output tsv\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$ROLE_COUNT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -eq \u003cspan class=\"s2\"\u003e\u0026#34;0\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: Missing Storage Blob Data Reader role assignment\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Validation passed\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"runtime-verification\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#runtime-verification\" title=\"Runtime verification\"\u003eRuntime verification\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eOnce deployed, monitor workloads for authentication failures. 
Azure Monitor, Application Insights, and Kubernetes events provide signals when identity issues occur.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eKey metrics to track:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eAzure AD token acquisition failures (4xx responses from Azure AD endpoints)\u003c/li\u003e\n\u003cli\u003eAzure RBAC authorization failures (403 responses from Azure resource APIs)\u003c/li\u003e\n\u003cli\u003eKubernetes RBAC denials (audit log events with \u003ccode\u003eForbidden\u003c/code\u003e responses)\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"periodic-audits\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#periodic-audits\" title=\"Periodic audits\"\u003ePeriodic audits\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eIdentity configurations drift over time. Regular audits catch permissions that have grown beyond initial requirements or identities that no longer align with current workload needs.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAudit checklist:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eList all managed identities and their role assignments—remove unused identities\u003c/li\u003e\n\u003cli\u003eReview role assignments for over-privileged access—scope down to specific resources\u003c/li\u003e\n\u003cli\u003eValidate federated credentials still match deployed service accounts—remove orphaned federations\u003c/li\u003e\n\u003cli\u003eCheck for service accounts with Workload Identity annotations but no corresponding Azure identity\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"practical-configuration-minimal-working-example\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#practical-configuration-minimal-working-example\" title=\"Practical configuration: Minimal working example\"\u003ePractical configuration: Minimal working example\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHere\u0026rsquo;s a complete Workload Identity configuration showing the 
Kubernetes and Azure components required for a pod to access Azure Storage.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eKubernetes manifest (pod with Workload Identity):\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ev1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eServiceAccount\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003estorage-reader\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eannotations\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003eazure.workload.identity/client-id\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;00000000-0000-0000-0000-000000000000\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nn\"\u003e---\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ev1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ePod\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003estorage-reader-pod\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003elabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003eazure.workload.identity/use\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;true\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eserviceAccountName\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003estorage-reader\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003econtainers\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e- \u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eapp\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003eimage\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003emyregistry.azurecr.io/storage-app:latest\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003eenv\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e- \u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eAZURE_CLIENT_ID\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan 
class=\"nt\"\u003evalue\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;00000000-0000-0000-0000-000000000000\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e- \u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eAZURE_TENANT_ID\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003evalue\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;00000000-0000-0000-0000-000000000000\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003e\u003cstrong\u003eKey configuration points:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eService account must have the \u003ccode\u003eazure.workload.identity/client-id\u003c/code\u003e annotation set to the client ID of the Azure managed identity\u003c/li\u003e\n\u003cli\u003ePod must have the \u003ccode\u003eazure.workload.identity/use: \u0026quot;true\u0026quot;\u003c/code\u003e label\u003c/li\u003e\n\u003cli\u003ePod must reference the service account via \u003ccode\u003eserviceAccountName\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eContainer environment variables give the Azure SDK the client and tenant IDs; the Workload Identity webhook normally injects these, so setting them explicitly is optional\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eAzure RBAC assignment (Terraform):\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv 
class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Managed identity for the workload\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_user_assigned_identity\u0026#34; \u0026#34;storage_reader\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;storage-reader-identity\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Federated credential linking Kubernetes SA to Azure identity\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_federated_identity_credential\u0026#34; \u0026#34;storage_reader\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;storage-reader-federation\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  parent_id\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_user_assigned_identity\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003estorage_reader\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  audience\u003c/span\u003e          
  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;api://AzureADTokenExchange\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  issuer\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eoidc_issuer_url\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  subject\u003c/span\u003e             \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;system:serviceaccount:production:storage-reader\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Grant Storage Blob Data Reader to the identity\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_role_assignment\u0026#34; \u0026#34;storage_reader\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  scope\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e 
\u003cspan class=\"k\"\u003eazurerm_storage_account\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003edata\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  role_definition_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Storage Blob Data Reader\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  principal_id\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_user_assigned_identity\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003estorage_reader\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eprincipal_id\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003e\u003cstrong\u003eCritical details:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003eaudience\u003c/code\u003e must be \u003ccode\u003e[\u0026quot;api://AzureADTokenExchange\u0026quot;]\u003c/code\u003e for Workload Identity\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eissuer\u003c/code\u003e must match the AKS cluster\u0026rsquo;s OIDC issuer URL exactly\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003esubject\u003c/code\u003e format is \u003ccode\u003esystem:serviceaccount:NAMESPACE:SERVICE_ACCOUNT_NAME\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eRole assignment scope should be as narrow as possible—specific storage account, not resource group\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 
id=\"final-thoughts\"\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/#final-thoughts\" title=\"Final thoughts\"\u003eFinal thoughts\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eWorkload Identity Federation solves credential lifecycle and audit trail problems that plagued earlier AKS authentication patterns. It doesn\u0026rsquo;t eliminate configuration complexity or RBAC layering challenges. Understanding how Kubernetes RBAC, Azure RBAC, and Azure AD permissions interact is essential. Knowing where credentials still leak despite federation, what misconfigurations create security vulnerabilities, and how to validate configurations before they fail in production separates functioning workloads from 3 AM incidents.\u003c/p\u003e\n\u003cp\u003eStart with minimal permissions. Automate identity provisioning and role assignments through infrastructure as code. Validate configurations before deployment. Monitor for authentication failures and audit identity drift over time. These patterns prevent the majority of identity-related failures in production AKS environments.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-01-21T17:00:00+01:00","id":"https://daily-devops.net/posts/pod-identity-access-control-aks/","language":"en","summary":"Workload Identity Federation changed how AKS handles authentication. Credential leaks, RBAC failures, identity drift: what breaks and how to fix it.","tags":["identity","azure","kubernetes","cloud","devops","rbac","security"],"title":"Pod Identity \u0026 Access Control in AKS: What Actually Breaks","url":"https://daily-devops.net/posts/pod-identity-access-control-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eAKS documentation will get you to a running cluster. It won\u0026rsquo;t tell you why your pod authenticated in staging and gets a 401 in production. 
It won\u0026rsquo;t explain why upgrading a 50-node cluster at 2 AM felt fine but a 300-node upgrade at noon caused cascading evictions. It won\u0026rsquo;t show you which storage class to avoid when your database needs to survive node pool replacements.\u003c/p\u003e\n\u003cp\u003eThis series covers the operational reality — the decisions that distinguish AKS clusters that run quietly in production from clusters that generate 3 AM alerts. Nine articles, each examining a specific architectural domain with the specificity that matters when something breaks.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"why-aks-operations-is-different\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#why-aks-operations-is-different\" title=\"Why AKS Operations Is Different\"\u003eWhy AKS Operations Is Different\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eMicrosoft manages the AKS control plane. That sounds like less work, and in some ways it is — you don\u0026rsquo;t patch etcd, you don\u0026rsquo;t replace failed control plane VMs, you don\u0026rsquo;t worry about API server certificate rotation. What it doesn\u0026rsquo;t mean is that running AKS in production is simple or that managed Kubernetes hands you a reliable platform and steps aside.\u003c/p\u003e\n\u003cp\u003eEvery node pool configuration decision is yours. Every storage class binding, every PVC lifecycle policy, every decision about which node pool hosts which workload — that\u0026rsquo;s on you. RBAC spans three separate systems simultaneously: Kubernetes RBAC, Azure RBAC, and Azure AD. A misconfiguration in any one of them produces an access failure that looks identical from the application\u0026rsquo;s perspective. The documentation will show you how to configure each system in isolation. 
It will not show you why they interact in non-obvious ways under specific conditions, or what the failure mode looks like when you get the federation configuration slightly wrong.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"where-networking-stops-being-managed\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#where-networking-stops-being-managed\" title=\"Where Networking Stops Being Managed\"\u003eWhere Networking Stops Being Managed\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eNetworking is another area where \u0026ldquo;managed\u0026rdquo; has a narrower meaning than the word implies. Microsoft manages the control plane networking. Your VNet, your subnets, your IP address planning, your DNS configuration, your ingress architecture — all of it is your responsibility, and the decisions compound. IP exhaustion caused by node pool scaling is a common production incident that no amount of control plane management prevents. Private cluster DNS resolution breaks in ways that take hours to diagnose if you haven\u0026rsquo;t encountered the pattern before.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"upgrades-the-gap-between-docs-and-reality\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#upgrades-the-gap-between-docs-and-reality\" title=\"Upgrades: The Gap Between Docs and Reality\"\u003eUpgrades: The Gap Between Docs and Reality\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eUpgrades are perhaps the clearest illustration of the gap between documentation and reality. The documentation describes upgrade mechanics accurately. What it doesn\u0026rsquo;t describe is how Pod Disruption Budget misconfigurations interact with cluster autoscaler behavior during node pool drain, why the timing of upgrades relative to workload peak matters more than most teams expect, or how a PDB that looks correct on paper blocks drain indefinitely on a cluster that\u0026rsquo;s handling real traffic. Managed Kubernetes handles the control plane upgrade. 
The workload upgrade is a careful orchestration problem that the platform does not solve for you.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"storage-where-managed-disappears\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#storage-where-managed-disappears\" title=\"Storage: Where \u0026ldquo;Managed\u0026rdquo; Disappears\"\u003eStorage: Where \u0026ldquo;Managed\u0026rdquo; Disappears\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eStorage is where the word \u0026ldquo;managed\u0026rdquo; disappears entirely. Azure manages the underlying disk and file services. AKS provides the CSI drivers. Everything between your application and the storage backend — PVC binding, reclaim policies, volume expansion behavior, backup orchestration, behavior during node failure or node pool deletion — is configuration you own. Teams that treat storage as a detail find out it isn\u0026rsquo;t when a node pool replacement deletes volumes that were bound to nodes rather than to the cluster.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cost-decisions-compound-silently\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#cost-decisions-compound-silently\" title=\"Cost Decisions Compound Silently\"\u003eCost Decisions Compound Silently\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCost is a dimension that managed Kubernetes actively obscures. The control plane is free at most tiers. Node pool costs scale with what you configure, and the configuration space is large: VM SKU selection, autoscaler min/max bounds, system versus user node pool separation, spot VM integration, pod density targets. None of these have obviously correct values. All of them interact. Teams that inherit clusters often inherit cost structures that made sense at a different scale or for a different workload profile, and reversing those decisions requires careful sequencing to avoid downtime.\u003c/p\u003e\n\u003cp\u003eThe happy paths in the documentation work. They work because they\u0026rsquo;re constructed to work. 
Production clusters encounter the edges — the configuration combinations, the scale thresholds, the timing sensitivities — that happy paths don\u0026rsquo;t cover. This series is about the edges.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"what-this-series-covers\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#what-this-series-covers\" title=\"What This Series Covers\"\u003eWhat This Series Covers\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/pod-identity-access-control-aks/\"\u003ePod Identity \u0026amp; Access Control in AKS: What Actually Breaks\u003c/a\u003e\u003c/strong\u003e starts with identity because identity failures are the most common source of production incidents. Workload Identity Federation eliminates credential lifecycle problems but introduces configuration complexity spanning three separate RBAC systems — Kubernetes RBAC, Azure RBAC, and Azure AD permissions. The article explains where credentials still leak despite federation, how layers interact and fail, and validation patterns that catch misconfigurations before they become incidents.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/\"\u003eStorage Architecture \u0026amp; Stateful Workloads in AKS\u003c/a\u003e\u003c/strong\u003e addresses what most AKS guides skip: what actually happens to your data when a node gets replaced. PVC/PV architecture, Azure Disk versus Azure Files performance trade-offs, Velero backup configurations that survive real restore scenarios, and multi-cluster replication patterns for production stateful workloads.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/cost-optimization-resource-governance-aks/\"\u003eAKS Cost Optimization: Resource Governance That Actually Works\u003c/a\u003e\u003c/strong\u003e covers the gap between \u0026ldquo;set resource limits\u0026rdquo; and actually controlling spend at scale. 
Pod density strategies, node pool design decisions that compound over time, spot VM integration without reliability regressions, and FinOps tagging that produces actionable cost attribution rather than unread dashboards.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/multi-aks-cluster-networking-hub-spoke/\"\u003eMulti-AKS Cluster Networking \u0026amp; Hub-Spoke Topology\u003c/a\u003e\u003c/strong\u003e examines what happens to networking when you move from one cluster to many. VNet peering patterns, hub-spoke routing, cross-cluster DNS resolution, shared ingress options, and — critically — the decision criteria for when mesh complexity becomes justified rather than premature.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/\"\u003eAKS Cluster Upgrades: Zero-Downtime Operations That Actually Work\u003c/a\u003e\u003c/strong\u003e covers upgrade mechanics that documentation describes optimistically. Cordon and drain behavior, Pod Disruption Budget configuration that prevents service disruption rather than theater-level protection, multi-node-pool rollout strategies, and validation-driven automation that makes upgrades reproducible rather than heroic.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/container-registry-image-security-aks/\"\u003eContainer Registry \u0026amp; Image Security in AKS Deployments\u003c/a\u003e\u003c/strong\u003e covers ACR hardening beyond the basics. 
A production-ready sequence: vulnerability scanning, image signing with Notation, RBAC scoping, private endpoints, policy enforcement through Azure Policy and admission controllers, and geo-replication strategies with clear trade-offs explained.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/\"\u003eAKS Disaster Recovery: Why Your Untested Backup Will Fail\u003c/a\u003e\u003c/strong\u003e addresses the gap between having backups and having a tested recovery plan. Velero configuration, realistic RTO/RPO targets that match business risk rather than wishful thinking, restore testing procedures that catch problems before outages, and multi-region failover steps your team can actually execute under pressure.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/\"\u003eHybrid AKS: Bridging Cloud and On-Prem with Azure Arc\u003c/a\u003e\u003c/strong\u003e covers the operational patterns for organizations running Kubernetes across cloud and on-premises simultaneously. ExpressRoute and VPN connectivity, Azure Arc for unified management across heterogeneous environments, consistent policy enforcement, DNS resolution, and identity federation without duplicating systems.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/\"\u003eAKS at Scale: Hard-Won Lessons from 1000+ Node Clusters\u003c/a\u003e\u003c/strong\u003e closes the series with what changes when clusters grow large enough that the platform itself becomes the bottleneck. 
etcd limits under high object churn, network saturation at scale, observability overhead that compounds with cluster size, and cost spirals that emerge from architectural decisions that seemed fine at 50 nodes.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"who-this-is-for\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#who-this-is-for\" title=\"Who This Is For\"\u003eWho This Is For\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003ePlatform engineers and infrastructure-focused developers responsible for AKS clusters in production, or teams about to inherit that responsibility. Each article assumes you\u0026rsquo;ve run AKS before and want operational depth, not introductory setup instructions.\u003c/p\u003e\n\u003cp\u003eThe series covers Terraform, Bicep, kubectl, and Azure CLI patterns throughout. Examples are grounded in production scenarios rather than constructed to demonstrate features.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"how-these-articles-were-written\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#how-these-articles-were-written\" title=\"How These Articles Were Written\"\u003eHow These Articles Were Written\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eEach article in this series is based on production experience: clusters that handled real traffic, failed in real ways, and required real fixes under time pressure. That distinction matters for what you\u0026rsquo;ll find here and what you won\u0026rsquo;t.\u003c/p\u003e\n\u003cp\u003eProduction experience means the failure patterns are specific. Not \u0026ldquo;storage can be tricky\u0026rdquo; but which storage class binding decisions survive node pool replacements and which don\u0026rsquo;t. Not \u0026ldquo;upgrades can cause downtime\u0026rdquo; but which combination of PDB configuration and autoscaler behavior produces an indefinitely blocked drain. 
Not \u0026ldquo;identity is complex\u0026rdquo; but the exact configuration gap in Workload Identity Federation that causes silent auth failures in one environment and not another. The specificity isn\u0026rsquo;t for its own sake — it\u0026rsquo;s the difference between an article that confirms your intuition and one that actually changes what you configure next.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"trade-offs-over-single-right-answers\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#trade-offs-over-single-right-answers\" title=\"Trade-offs Over Single Right Answers\"\u003eTrade-offs Over Single Right Answers\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhat production experience doesn\u0026rsquo;t mean is that every approach here is the only valid one. Large-scale AKS operation involves genuine trade-offs — between cost and resilience, between operational simplicity and flexibility, between standardization and workload-specific tuning. The articles explain the reasoning behind recommendations rather than just stating them, because the reasoning is what lets you adapt the approach to your constraints. A node pool design that works for a batch processing workload is wrong for a latency-sensitive API, and the article on cost governance explains why rather than presenting a single correct answer.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-aks-makes-things-harder\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#when-aks-makes-things-harder\" title=\"When AKS Makes Things Harder\"\u003eWhen AKS Makes Things Harder\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eThe articles were not written to showcase features or to demonstrate that AKS has a solution for every problem. Some of them document problems that AKS makes harder than it should be, and say so directly. If a particular architectural pattern has a known failure mode at scale, that failure mode appears in the article rather than in a footnote or an FAQ three pages into the documentation. 
If a feature has a meaningful limitation that affects how you should configure it, that limitation is in the main text, not in a callout box labeled \u0026ldquo;note.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eThe goal is for these articles to be the thing you read before a production incident rather than the thing you find during one.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"where-to-start\"\u003e\u003ca href=\"/posts/aks-architecture-operations/#where-to-start\" title=\"Where to Start\"\u003eWhere to Start\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRead in published order if you\u0026rsquo;re building out AKS infrastructure from scratch — identity and storage are foundational, and later articles reference earlier concepts. Jump to specific articles if you\u0026rsquo;re dealing with an immediate operational problem: the titles are specific enough that the right article for your situation should be obvious.\u003c/p\u003e\n\u003cp\u003eThe scale article at the end is worth reading early if your cluster is already growing or if you\u0026rsquo;re designing for growth — some architectural decisions made at 50 nodes are expensive to reverse at 500.\u003c/p\u003e\n","date_modified":"2026-05-25T23:41:10+02:00","date_published":"2026-01-21T17:00:00+01:00","id":"https://daily-devops.net/posts/aks-architecture-operations/","language":"en","summary":"Nine articles on production AKS—identity, storage, multi-cluster networking, cost governance, DR, and running 1000-node clusters in practice.","tags":["kubernetes","azure","cloud","devops","operations","platform-engineering"],"title":"AKS Architecture \u0026 Operations — The Complete Series","url":"https://daily-devops.net/posts/aks-architecture-operations/"},{"authors":[{"name":"Martin Stühmer","url":"https://daily-devops.net/authors/martin/"}],"content_html":"\u003cp\u003eKubernetes has transitioned from a technical option to an assumed default. 
In organizations and projects I\u0026rsquo;ve worked with, discussions no longer start with whether Kubernetes is appropriate. They start with migration timelines. I\u0026rsquo;ve sat through planning sessions where the question wasn\u0026rsquo;t \u0026ldquo;Should we use Kubernetes?\u0026rdquo; but rather \u0026ldquo;When can we have everything moved over?\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eThis shift isn\u0026rsquo;t driven by application requirements. It\u0026rsquo;s driven by narrative. Consulting decks and reference architectures present \u003cem\u003e\u003cstrong\u003eKubernetes as a universal platform\u003c/strong\u003e\u003c/em\u003e that absorbs governance, security, scalability, observability, recovery, and operational responsibility. The implicit promise: once your software runs on Kubernetes, the hard parts are handled. I\u0026rsquo;ve watched teams adopt this belief wholesale, only to discover the gaps six months into production.\u003c/p\u003e\n\u003cp\u003eThat promise is incomplete. Kubernetes primarily addresses \u003cstrong\u003eone phase\u003c/strong\u003e: runtime orchestration. Most architectural risk, cost overruns, and operational failures occur \u003cstrong\u003ebefore\u003c/strong\u003e runtime, during design and delivery, or \u003cstrong\u003eafter\u003c/strong\u003e runtime, when incidents happen and systems evolve. I\u0026rsquo;ve debugged production incidents where Kubernetes ran flawlessly while the system failed spectacularly because architectural problems existed upstream and downstream of container orchestration.\u003c/p\u003e\n\u003cp\u003eTreating Kubernetes as a lifecycle platform rather than a runtime component introduces complexity that stays invisible during planning and becomes unavoidable in production. The demos look clean. The reference architectures are elegant. 
Then you hit reality.\u003c/p\u003e\n\u003cp\u003eTwo questions matter: Not whether Kubernetes works (it does, consistently, in its domain), but where its responsibility ends and whether your organization can handle what lies beyond those boundaries.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"kubernetes-in-the-net-reality\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#kubernetes-in-the-net-reality\" title=\"Kubernetes in the .NET Reality\"\u003eKubernetes in the .NET Reality\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes clusters rarely host a single, clean workload type in practice. They become convergence points: ASP.NET Core APIs, background workers, event-driven processors, migrated Windows Services, and platform components all sharing infrastructure. I\u0026rsquo;ve inherited clusters running everything from modern microservices to decade-old .NET Framework services wrapped in Windows containers, all competing for the same resources.\u003c/p\u003e\n\u003cp\u003eFor stateless, Linux-based ASP.NET Core services, Kubernetes is genuinely strong. Deployments are predictable. Rollouts are controlled. Health checks integrate cleanly. 
You implement a simple health endpoint:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-csharp\" data-lang=\"csharp\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"kt\"\u003evar\u003c/span\u003e \u003cspan class=\"n\"\u003ebuilder\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u003c/span\u003e \u003cspan class=\"n\"\u003eWebApplication\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eCreateBuilder\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003eargs\u003c/span\u003e\u003cspan class=\"p\"\u003e);\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003ebuilder\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eServices\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eAddHealthChecks\u003c/span\u003e\u003cspan class=\"p\"\u003e();\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"kt\"\u003evar\u003c/span\u003e \u003cspan class=\"n\"\u003eapp\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u003c/span\u003e \u003cspan class=\"n\"\u003ebuilder\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eBuild\u003c/span\u003e\u003cspan class=\"p\"\u003e();\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003eapp\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eMapHealthChecks\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"s\"\u003e\u0026#34;/health\u0026#34;\u003c/span\u003e\u003cspan 
class=\"p\"\u003e);\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003eapp\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eRun\u003c/span\u003e\u003cspan class=\"p\"\u003e();\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThen you deploy 3 replicas and Kubernetes does what you asked: it keeps exactly 3 running, rolling out updates without downtime, removing failed pods from traffic automatically. You push a new image and watch the update complete—no manual intervention, no traffic loss, no coordination overhead.\u003c/p\u003e\n\u003cp\u003eThis is where Kubernetes works exactly as intended: the application exposes its state honestly, and the platform responds intelligently. Three replicas means three replicas, constantly. A pod fails, it gets replaced within seconds. A rolling update happens seamlessly because Kubernetes orchestrates the transition and the application cooperates through its health endpoint. The first time you watch this happen without manually managing anything, it feels like magic.\u003c/p\u003e\n\u003cp\u003eThis experience—predictable, reliable, hands-off—becomes the template in your mind for how Kubernetes should work everywhere.\u003c/p\u003e\n\u003cp\u003eThe mistake begins when this success gets generalized. I\u0026rsquo;ve seen this pattern repeatedly: success with stateless APIs leads to confidence that everything belongs in Kubernetes. Then the complexity arrives.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"governance-structure-without-enforcement\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#governance-structure-without-enforcement\" title=\"Governance: Structure Without Enforcement\"\u003eGovernance: Structure Without Enforcement\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes offers namespaces, labels, and RBAC. 
These are primitives, not governance. Real enterprise governance requires enforceable policy, auditability, cost attribution, and environmental separation. In Azure-centric environments, these concerns traditionally live at the subscription, management group, and Azure Policy layer, where they\u0026rsquo;re auditable, mandatory, and enforced at the platform level.\u003c/p\u003e\n\u003cp\u003eIntroducing Kubernetes adds a second governance plane. Without deliberate policy enforcement, clusters drift. I\u0026rsquo;ve seen production and experimental workloads coexist in the same cluster because namespace isolation felt sufficient. It wasn\u0026rsquo;t. Cost attribution becomes opaque. Who actually paid for that node pool? Which business unit owns this? When incidents happen, these questions waste critical time.\u003c/p\u003e\n\u003cp\u003eIn one organization, we discovered experimental ML workloads running on production infrastructure because someone had \u003ccode\u003ekubectl\u003c/code\u003e access and \u0026ldquo;just needed to test something quickly.\u0026rdquo; The namespace separation existed. The policy enforcement didn\u0026rsquo;t.\u003c/p\u003e\n\u003cp\u003eKubernetes doesn\u0026rsquo;t prevent this drift. It accelerates it by making deployment so frictionless that governance becomes an afterthought.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"identity-kubernetes-stops-where-entra-id-starts\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#identity-kubernetes-stops-where-entra-id-starts\" title=\"Identity: Kubernetes Stops Where Entra ID Starts\"\u003eIdentity: Kubernetes Stops Where Entra ID Starts\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003e.NET applications rely on Entra ID (formerly Azure AD) for authentication, authorization, managed identities, and conditional access. Kubernetes has no native concept of enterprise identity. It doesn\u0026rsquo;t integrate with Entra ID\u0026rsquo;s policy layer, conditional access rules, or compliance tracking. 
This isn\u0026rsquo;t a limitation; it\u0026rsquo;s architectural reality.\u003c/p\u003e\n\u003cp\u003eKubernetes RBAC governs access to cluster resources: who can deploy pods, create services, read secrets. But application identity—the identity your code runs under, the services it authenticates to, the permissions it holds—that\u0026rsquo;s entirely separate. Kubernetes facilitates the technical handshake (workload identity token exchange), but the authority making identity decisions lives outside the cluster in Entra ID. Your application integrates with Entra ID directly, not through Kubernetes.\u003c/p\u003e\n\u003cp\u003eThis boundary is invisible until you\u0026rsquo;re three months into production and security asks about conditional access policies, device compliance rules, or audit trails. Kubernetes doesn\u0026rsquo;t track any of that. It can\u0026rsquo;t. The identity system is external, and Kubernetes merely provides the plumbing to connect to it.\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;ve worked with teams who expected Kubernetes to handle enterprise identity because it handled everything else. It doesn\u0026rsquo;t. That realization typically arrives when security reviews surface the integration gaps.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"networking-where-kubernetes-abstraction-fails-first\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#networking-where-kubernetes-abstraction-fails-first\" title=\"Networking: Where Kubernetes Abstraction Fails First\"\u003eNetworking: Where Kubernetes Abstraction Fails First\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eNetworking is where Kubernetes myths collapse fastest. I\u0026rsquo;ve seen the most preventable production incidents here. Kubernetes introduces its own networking model, but it doesn\u0026rsquo;t replace enterprise networking. It operates \u003cstrong\u003einside\u003c/strong\u003e it. 
This distinction matters when things go wrong.\u003c/p\u003e\n\u003cp\u003eIn Azure-based architectures, your first line of defense exists outside the cluster:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eVirtual networks and subnet isolation\u003c/li\u003e\n\u003cli\u003eUser-defined routing (UDR)\u003c/li\u003e\n\u003cli\u003eAzure Firewall or Network Virtual Appliance (NVA)\u003c/li\u003e\n\u003cli\u003eApplication Gateway or Front Door with Web Application Firewall (WAF)\u003c/li\u003e\n\u003cli\u003ePrivate endpoints and service endpoints\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIngress controllers route traffic. They don\u0026rsquo;t defend the network. They\u0026rsquo;re application-layer components running inside pods, not hardened network appliances.\u003c/p\u003e\n\u003cp\u003eTreating Kubernetes ingress as your security perimeter shifts responsibility from hardened network controls to application-level components that were never designed to absorb hostile traffic at scale. I\u0026rsquo;ve seen this assumption lead to security incidents where attackers bypassed ingress controllers by targeting services directly once they gained cluster access.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"azure-cni-and-ip-exhaustion\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#azure-cni-and-ip-exhaustion\" title=\"Azure CNI and IP Exhaustion\"\u003eAzure CNI and IP Exhaustion\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWith Azure CNI, every pod consumes a real IP address from your virtual network subnet. Scaling pods means scaling IP consumption linearly. Poor subnet sizing surfaces late—usually in production when teams suddenly can\u0026rsquo;t scale further and the error message is cryptic. Kubernetes schedules pods until the network says no, then fails silently.\u003c/p\u003e\n\u003cp\u003eThis isn\u0026rsquo;t a Kubernetes failure. It\u0026rsquo;s a networking responsibility that Kubernetes exposes. 
I\u0026rsquo;ve debugged this scenario more times than I\u0026rsquo;d like to admit, always with the same root cause: network planning happened before anyone calculated peak pod counts under load.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"east-west-traffic-and-lateral-movement\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#east-west-traffic-and-lateral-movement\" title=\"East-West Traffic and Lateral Movement\"\u003eEast-West Traffic and Lateral Movement\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eKubernetes networking is flat by default. Every pod can reach every other pod within the cluster. Network policies are optional and frequently incomplete. In organizations without dedicated platform teams, they\u0026rsquo;re often absent entirely.\u003c/p\u003e\n\u003cp\u003eFor multi-service .NET systems, this makes lateral movement trivial once any single pod is compromised. An attacker who gains access to a frontend pod can immediately probe backend services, database connections, and internal APIs. Kubernetes provides the mechanism (network policies) but doesn\u0026rsquo;t enforce discipline. I worked on an incident response where a compromised pod accessed 12 different internal services before we detected it. Network policies existed in the repository. They weren\u0026rsquo;t applied.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"egress-control\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#egress-control\" title=\"Egress Control\"\u003eEgress Control\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eIngress gets constant attention: WAF rules, TLS certificates, rate limiting. Egress almost never does. By default, all pods can reach the internet: any destination, any port. In regulated environments, that\u0026rsquo;s unacceptable. Egress control requires forced routing through Azure Firewall and explicit allow-listing of destinations.\u003c/p\u003e\n\u003cp\u003eKubernetes has no native concept of allowed destinations. 
You build this external to the cluster, then spend weeks troubleshooting why perfectly valid application calls fail because someone forgot to allow-list a critical API endpoint.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"security-responsibility-is-concentrated-not-removed\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#security-responsibility-is-concentrated-not-removed\" title=\"Security: Responsibility Is Concentrated, Not Removed\"\u003eSecurity: Responsibility Is Concentrated, Not Removed\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes provides security mechanisms. Almost none are enabled by default. A .NET application on Azure App Service benefits from opinionated defaults: automatic image scanning, encrypted secrets, preconfigured network isolation, integrated runtime monitoring.\u003c/p\u003e\n\u003cp\u003eIn Kubernetes, every guarantee requires deliberate recreation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eImage provenance through admission controllers and policy enforcement\u003c/li\u003e\n\u003cli\u003eSecret handling through external secret stores (Azure Key Vault integration)\u003c/li\u003e\n\u003cli\u003eNetwork segmentation through network policies and firewall rules\u003c/li\u003e\n\u003cli\u003eRuntime monitoring through service mesh sidecars or host-level agents\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eEach added controller or sidecar increases capability and attack surface simultaneously. I\u0026rsquo;ve reviewed Kubernetes configurations where security controls outnumbered application pods. The cluster became a security platform that happened to run some software.\u003c/p\u003e\n\u003cp\u003eKubernetes doesn\u0026rsquo;t reduce security effort. 
It concentrates it into your platform team, assuming you have one.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cicd-and-supply-chain-kubernetes-consumes-trust\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#cicd-and-supply-chain-kubernetes-consumes-trust\" title=\"CI/CD and Supply Chain: Kubernetes Consumes Trust\"\u003eCI/CD and Supply Chain: Kubernetes Consumes Trust\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes consumes artifacts. It doesn\u0026rsquo;t produce trust. CI pipelines, artifact promotion, image immutability, and signing decisions all happen long before Kubernetes schedules a pod. A broken supply chain can\u0026rsquo;t be repaired at runtime. If a malicious image makes it to your registry, Kubernetes will happily deploy it.\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;ve worked with a team who discovered their CI pipeline had been compromised for three weeks. Kubernetes deployed every malicious image perfectly—on schedule, with zero-downtime rolling updates. The orchestration worked flawlessly. The supply chain didn\u0026rsquo;t. Kubernetes enforces desired state but doesn\u0026rsquo;t validate how that state was produced. That validation is your responsibility in your build pipelines, artifact registries, and admission controllers.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"observability-infrastructure-metrics-are-not-insight\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#observability-infrastructure-metrics-are-not-insight\" title=\"Observability: Infrastructure Metrics Are Not Insight\"\u003eObservability: Infrastructure Metrics Are Not Insight\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes emits metrics and logs: CPU usage per pod, memory consumption, network I/O. These describe platform health, not system behavior. 
.NET systems require application-level observability—distributed tracing across service boundaries, dependency tracking to external systems, structured logging with correlation IDs.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-csharp\" data-lang=\"csharp\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003ebuilder\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eServices\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eAddOpenTelemetry\u003c/span\u003e\u003cspan class=\"p\"\u003e()\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eWithTracing\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003et\u003c/span\u003e \u003cspan class=\"p\"\u003e=\u0026gt;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e        \u003cspan class=\"n\"\u003et\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eAddAspNetCoreInstrumentation\u003c/span\u003e\u003cspan class=\"p\"\u003e()\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e         \u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eAddHttpClientInstrumentation\u003c/span\u003e\u003cspan class=\"p\"\u003e());\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eWithout integration into Azure Monitor and Application Insights, incidents become reconstruction exercises. 
I\u0026rsquo;ve sat in war rooms where Kubernetes dashboards stayed green—all pods healthy, all nodes operational—while users experienced cascading timeouts. Pod restarts hide underlying failures instead of surfacing them. A pod that crashes and restarts every 30 seconds looks \u0026ldquo;healthy\u0026rdquo; to Kubernetes if it passes health checks between crashes.\u003c/p\u003e\n\u003cp\u003eObservability requires design. You bring it, or you debug blind.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"scalability-kubernetes-scales-pods-not-systems\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#scalability-kubernetes-scales-pods-not-systems\" title=\"Scalability: Kubernetes Scales Pods, Not Systems\"\u003eScalability: Kubernetes Scales Pods, Not Systems\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes scales replicas, not architectures. Database contention, synchronous dependencies, external API limits—they all remain regardless of how many pod copies you create. Kubernetes can amplify bottlenecks just as effectively as it amplifies capacity.\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;ve watched auto-scaling create 50 pod replicas, all waiting for the same database connection pool that maxed out at 100 connections. More pods didn\u0026rsquo;t solve the problem—they made it worse by consuming resources while waiting.\u003c/p\u003e\n\u003cp\u003eEvent-driven scaling improves this, but only with architectural redesign. Kubernetes enables the \u003cstrong\u003emechanism\u003c/strong\u003e for elasticity—you can scale replicas based on external signals. But the architecture determines whether that mechanism translates into actual scalability. Scaling 50 pods won\u0026rsquo;t help if they\u0026rsquo;re all waiting on the same bottleneck. 
That\u0026rsquo;s a design problem, not an orchestration problem.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"backup-and-recovery-kubernetes-stops-completely\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#backup-and-recovery-kubernetes-stops-completely\" title=\"Backup and Recovery: Kubernetes Stops Completely\"\u003eBackup and Recovery: Kubernetes Stops Completely\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes restarts containers. It doesn\u0026rsquo;t restore systems. State lives outside the cluster in databases, message queues, caches, and storage accounts. Backup and recovery remain responsibilities of data platforms and operational processes. Kubernetes has no concept of business continuity or disaster recovery beyond \u0026ldquo;restart the pod.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eHigh availability masks failure. It doesn\u0026rsquo;t undo it. A corrupted database doesn\u0026rsquo;t care how many pod replicas exist or how fast Kubernetes can reschedule them. I\u0026rsquo;ve responded to incidents where Kubernetes performed perfectly—immediate failover, health-driven routing—while the underlying data corruption spread across all replicas.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"windows-containers-on-kubernetes-a-strong-architectural-smell\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#windows-containers-on-kubernetes-a-strong-architectural-smell\" title=\"Windows Containers on Kubernetes: A Strong Architectural Smell\"\u003eWindows Containers on Kubernetes: A Strong Architectural Smell\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eWindows containers are supported but introduce slower startup times (minutes versus seconds), limited ecosystem support, and operational asymmetry—separate node pools, different update cadence, higher costs. 
They\u0026rsquo;re frequently used to avoid refactoring legacy workloads, turning Kubernetes into a compatibility layer rather than a platform.\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;ve seen .NET Framework applications from 2010 wrapped in Windows containers and deployed to Kubernetes because \u0026ldquo;we\u0026rsquo;re moving to cloud-native.\u0026rdquo; The workload hadn\u0026rsquo;t changed. The infrastructure complexity increased dramatically. They function, they complicate operations, and they rarely age well.\u003c/p\u003e\n\u003cp\u003eEvery Windows container deployment I\u0026rsquo;ve reviewed eventually became a maintenance burden. The startup time alone makes scaling problematic. Windows licensing costs amplify infrastructure expenses. And the operational split between Linux and Windows node pools fragments your platform team\u0026rsquo;s expertise.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cost-and-organizational-economics\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#cost-and-organizational-economics\" title=\"Cost and Organizational Economics\"\u003eCost and Organizational Economics\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes isn\u0026rsquo;t cost-neutral—a realization that typically arrives 3-6 months after initial deployment when finance asks why cloud costs doubled. 
It shifts cost visibility from infrastructure to organization: platform teams grow from 2 to 8 people, node pools sit idle waiting for burst capacity that happens twice a month, Windows nodes amplify costs through licensing and compute, observability instrumentation adds runtime overhead and egress costs.\u003c/p\u003e\n\u003cp\u003eTechnical efficiency—improved resource utilization through bin-packing and scheduling—often comes at \u003cstrong\u003eorganizational expense\u003c/strong\u003e: larger platform teams, slower iteration velocity (every change needs cluster-wide validation), distributed debugging complexity (which of the 15 services in the trace actually caused the timeout?).\u003c/p\u003e\n\u003cp\u003eThe calculation isn\u0026rsquo;t universal. It depends on workload mix, team structure, organizational tolerance for operational complexity. For companies running 200+ microservices with dedicated SRE teams, Kubernetes pays dividends. For companies running 8 services with 3 developers, it\u0026rsquo;s often overhead.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"conclusion-kubernetes-concentrates-architectural-responsibility\"\u003e\u003ca href=\"/posts/kubernetes-not-platform-strategy/#conclusion-kubernetes-concentrates-architectural-responsibility\" title=\"Conclusion: Kubernetes Concentrates Architectural Responsibility\"\u003eConclusion: Kubernetes Concentrates Architectural Responsibility\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes is powerful and, in specific scenarios, the right choice: stateless Linux-based APIs with clean 12-factor design, event-driven background workers that scale horizontally, organizations with dedicated platform teams who can absorb operational complexity, and standardized workload portfolios where 80%+ of applications fit predictable patterns.\u003c/p\u003e\n\u003cp\u003eOutside these boundaries, Kubernetes doesn\u0026rsquo;t remove responsibility. It concentrates it. 
The responsibilities I\u0026rsquo;ve outlined (governance, identity, networking, security, observability, backup) don\u0026rsquo;t disappear. They become explicit architectural decisions that someone on your team must own, implement, and maintain.\u003c/p\u003e\n\u003cp\u003eKubernetes is not governance. That lives at the subscription, policy, and organizational level. It\u0026rsquo;s not identity. That authority is Entra ID. It\u0026rsquo;s not the security perimeter. That\u0026rsquo;s the network, the firewall, and the defense-in-depth controls you build around the cluster. It\u0026rsquo;s not backup and recovery. That responsibility belongs to data platforms and business continuity planning. It\u0026rsquo;s not observability. That\u0026rsquo;s an application design concern requiring deliberate instrumentation.\u003c/p\u003e\n\u003cp\u003eKubernetes orchestrates workloads, and it does this extremely well.\u003c/p\u003e\n\u003cp\u003eFrom an architect\u0026rsquo;s perspective—someone who has designed, deployed, and maintained these systems in production—Kubernetes can be the most visible component of a hosting solution but never the \u003cstrong\u003ewhole\u003c/strong\u003e solution. The promise that it absorbs the software lifecycle is marketing, not engineering reality.\u003c/p\u003e\n\u003cp\u003eThat distinction isn\u0026rsquo;t theoretical. It\u0026rsquo;s operational reality I\u0026rsquo;ve experienced across multiple organizations, multiple industries, multiple failure modes.\u003c/p\u003e\n\u003cp\u003eThe question isn\u0026rsquo;t whether Kubernetes works—it does, consistently, predictably, within its domain. 
The question is whether your organization can handle everything Kubernetes \u003cstrong\u003edoesn\u0026rsquo;t\u003c/strong\u003e do, and whether the complexity trade-off makes sense for your specific context, team capability, and workload characteristics.\u003c/p\u003e\n\u003cp\u003eAnswer that question honestly before committing your platform strategy.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-01-13T17:00:00+01:00","id":"https://daily-devops.net/posts/kubernetes-not-platform-strategy/","language":"en","summary":"Kubernetes orchestrates containers brilliantly. But governance, identity, and recovery live elsewhere—and ignoring those boundaries breaks production.\n","tags":["kubernetes","architecture","platform-engineering","dotnet","cloudnative"],"title":"Kubernetes Is Not a Platform Strategy\n","url":"https://daily-devops.net/posts/kubernetes-not-platform-strategy/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eNetwork segmentation is a fundamental security control for modern Kubernetes environments. AKS supports multiple networking models such as kubenet, Azure CNI, and overlay CNIs. The networking model matters, but the decisive factor for enforcing isolation and compliance is the consistent application of network policies.\u003c/p\u003e\n\u003cp\u003eThis article describes how network policies work in AKS, the available engines, practical examples, and recommended practices for enforcing a zero-trust posture within a cluster.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"why-network-policies-matter\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#why-network-policies-matter\" title=\"Why network policies matter\"\u003eWhy network policies matter\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes permissively allows pod-to-pod communication by default, which simplifies operations but increases risk. 
Without network policies, an attacker or a compromised workload can move laterally, access internal services, exfiltrate data, or generate unintended traffic. Network policies let you express explicit allow rules, reducing the cluster attack surface and supporting compliance requirements.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"aks-network-policy-engines\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#aks-network-policy-engines\" title=\"AKS network policy engines\"\u003eAKS network policy engines\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS offers two commonly used network policy implementations. Choose based on feature needs and operational constraints.\u003c/p\u003e\n\u003cp\u003eAKS also supports Cilium as a network policy and dataplane option. Evaluate Cilium if you require advanced eBPF-based dataplane features or different dataplane capabilities (see Microsoft Docs).\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"azure-network-policies\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#azure-network-policies\" title=\"Azure Network Policies\"\u003eAzure Network Policies\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eNative AKS integration.\u003c/li\u003e\n\u003cli\u003eRequires Azure CNI (see Microsoft Docs: Use network policies in AKS).\u003c/li\u003e\n\u003cli\u003eHigh performance and deep integration with Azure networking.\u003c/li\u003e\n\u003cli\u003ePolicies are enforced by Azure\u0026rsquo;s policy manager.\u003c/li\u003e\n\u003cli\u003eBest suited for organizations that prefer a managed, Azure-native solution.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"calico-network-policies\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#calico-network-policies\" title=\"Calico Network Policies\"\u003eCalico Network Policies\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eOpen-source and widely adopted.\u003c/li\u003e\n\u003cli\u003eSupports advanced features such as egress controls and global 
policies.\u003c/li\u003e\n\u003cli\u003eWorks with Azure CNI and kubenet (see Microsoft Docs: Use network policies in AKS).\u003c/li\u003e\n\u003cli\u003eSuitable for complex architectures, multi-cloud deployments, or teams that need granular L3/L4 controls.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"how-network-policies-work\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#how-network-policies-work\" title=\"How network policies work\"\u003eHow network policies work\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eNetwork policies declare allowed traffic in terms of pod selectors, namespace selectors, ports, and protocols. A policy can specify ingress rules, egress rules, or both. Importantly, once any policy selects a pod, traffic in the direction that policy covers (ingress, egress, or both) is denied unless a rule explicitly allows it. That default-deny behavior is the basis for predictable and auditable isolation.\u003c/p\u003e\n\u003cp\u003eNote: Network policy is commonly set at cluster creation (for example: \u003ccode\u003eaz aks create --network-plugin azure --network-policy azure\u003c/code\u003e). You can enable or change the network policy engine on an existing cluster (for example: \u003ccode\u003eaz aks update --resource-group myRG --name myAKSCluster --network-policy calico\u003c/code\u003e). However, changing the network policy can trigger node-pool reimaging and temporary disruption.\u003c/p\u003e\n\u003cp\u003ePractical maintenance steps when changing network policies:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eTest the change in a staging cluster first. 
Example create command for a disposable test cluster:\u003c/li\u003e\n\u003c/ul\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks create -g myRG -n test-cluster --network-plugin azure --network-policy calico --node-count \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cul\u003e\n\u003cli\u003eWhen rolling changes through production, update one node pool at a time and verify workloads before proceeding.\u003c/li\u003e\n\u003cli\u003eBefore making changes, cordon and drain affected nodes to allow graceful eviction:\u003c/li\u003e\n\u003c/ul\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl cordon \u0026lt;node-name\u0026gt;\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl drain \u0026lt;node-name\u0026gt; --ignore-daemonsets --delete-emptydir-data\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cul\u003e\n\u003cli\u003eAfter the update, validate workloads and then uncordon nodes: \u003ccode\u003ekubectl uncordon \u0026lt;node-name\u0026gt;\u003c/code\u003e.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003ePlan a maintenance window for these operations and automate the rollback or node-pool recreation path if validation fails.\u003c/p\u003e\n\u003cp\u003eNote: Kubernetes NetworkPolicy is an L3/L4 mechanism. It controls IP- and port-level access between pods and namespaces. 
For L7 (HTTP/FQDN) filtering you need an engine that explicitly supports L7 policies (for example, Cilium\u0026rsquo;s L7 features) or a service-mesh / proxy-based approach.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"practical-example-allow-only-specific-traffic\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#practical-example-allow-only-specific-traffic\" title=\"Practical example: Allow only specific traffic\"\u003ePractical example: Allow only specific traffic\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eThis policy allows only requests from pods labeled role=app to pods labeled role=backend on TCP port 8080 in the production namespace.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yml\" data-lang=\"yml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003enetworking.k8s.io/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eNetworkPolicy\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e 
\u003c/span\u003e\u003cspan class=\"l\"\u003eallow-app-to-backend\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003epodSelector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ematchLabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003erole\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ebackend\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eingress\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e- \u003cspan class=\"nt\"\u003efrom\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"nt\"\u003epodSelector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e            \u003c/span\u003e\u003cspan class=\"nt\"\u003ematchLabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e              \u003c/span\u003e\u003cspan class=\"nt\"\u003erole\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eapp\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003eports\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"nt\"\u003eprotocol\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eTCP\u003c/span\u003e\u003cspan 
class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e          \u003c/span\u003e\u003cspan class=\"nt\"\u003eport\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e8080\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eBecause this policy selects the backend pods for ingress, all other inbound traffic to them is blocked unless another policy allows it. This approach supports a least-privilege model for intra-cluster communication.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"how-to-validate-policies\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#how-to-validate-policies\" title=\"How to validate policies\"\u003eHow to validate policies\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eQuick validation steps you can run in a test cluster:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eCreate a small test cluster with Calico enabled:\u003c/li\u003e\n\u003c/ol\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks create -g myRG -n test-calico --network-plugin azure --network-policy calico --node-count \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003col start=\"2\"\u003e\n\u003cli\u003eDeploy two lightweight pods and verify connectivity:\u003c/li\u003e\n\u003c/ol\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl run client --image\u003cspan class=\"o\"\u003e=\u003c/span\u003ebusybox --restart\u003cspan 
class=\"o\"\u003e=\u003c/span\u003eNever -- sleep \u003cspan class=\"m\"\u003e3600\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# the server runs busybox httpd (assumed to ship in the image) so that port 8080 has a listener\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl run server --image\u003cspan class=\"o\"\u003e=\u003c/span\u003ebusybox --restart\u003cspan class=\"o\"\u003e=\u003c/span\u003eNever -- httpd -f -p \u003cspan class=\"m\"\u003e8080\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get pods -o wide\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl \u003cspan class=\"nb\"\u003eexec\u003c/span\u003e -it client -- /bin/sh\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# from inside the client pod try to reach the server pod IP (replace \u0026lt;server-pod-ip\u0026gt;):\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003enc -zv \u0026lt;server-pod-ip\u0026gt; \u003cspan class=\"m\"\u003e8080\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003col start=\"3\"\u003e\n\u003cli\u003eApply your NetworkPolicy and repeat the test. Use \u003ccode\u003ekubectl describe networkpolicy \u0026lt;name\u0026gt;\u003c/code\u003e to inspect selectors and rules.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThese steps are intended for validation only. 
Do not run them against production clusters.\u003c/p\u003e\n\u003cp\u003eCI validation snippet (example):\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# apply policy and run quick connectivity check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl apply -f mypolicy.yaml\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl run client --image\u003cspan class=\"o\"\u003e=\u003c/span\u003ebusybox --restart\u003cspan class=\"o\"\u003e=\u003c/span\u003eNever -- sleep \u003cspan class=\"m\"\u003e3600\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# the server runs busybox httpd (assumed to ship in the image) so that port 8080 has a listener\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl run server --image\u003cspan class=\"o\"\u003e=\u003c/span\u003ebusybox --restart\u003cspan class=\"o\"\u003e=\u003c/span\u003eNever -- httpd -f -p \u003cspan class=\"m\"\u003e8080\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# wait until both pods are Ready so the server has a pod IP\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl \u003cspan class=\"nb\"\u003ewait\u003c/span\u003e --for\u003cspan class=\"o\"\u003e=\u003c/span\u003econdition\u003cspan class=\"o\"\u003e=\u003c/span\u003eReady pod/client pod/server --timeout\u003cspan class=\"o\"\u003e=\u003c/span\u003e60s\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSERVER_IP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pod -l \u003cspan class=\"nv\"\u003erun\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eserver -o \u003cspan class=\"nv\"\u003ejsonpath\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;{.items[0].status.podIP}\u0026#39;\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl \u003cspan class=\"nb\"\u003eexec\u003c/span\u003e client -- nc -zv \u003cspan class=\"nv\"\u003e$SERVER_IP\u003c/span\u003e \u003cspan 
class=\"m\"\u003e8080\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eCI security guidance:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003ePrefer ephemeral test clusters created by the pipeline and destroyed after the run. If that is not possible, create a Kubernetes ServiceAccount with minimal RBAC instead of storing a full-cluster admin \u003ccode\u003eKUBECONFIG\u003c/code\u003e in secrets.\u003c/li\u003e\n\u003cli\u003eUse a least-privilege service principal or OIDC-based login for Azure authentication and scope credentials to the smallest resource group or cluster role necessary. Avoid exposing long-lived admin credentials in CI secrets.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"namespace-isolation\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#namespace-isolation\" title=\"Namespace isolation\"\u003eNamespace isolation\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eNamespaces help organize workloads but do not enforce network isolation by themselves. 
To implement namespace-level segmentation, apply a default-deny policy in each namespace: it blocks ingress to every pod there, including traffic from pods in the same namespace, until another policy explicitly allows a flow.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yml\" data-lang=\"yml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003enetworking.k8s.io/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eNetworkPolicy\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003edeny-cross-namespace\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan 
class=\"nt\"\u003epodSelector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e{}\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eingress\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e[]\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch2 id=\"egress-control\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#egress-control\" title=\"Egress control\"\u003eEgress control\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eOutbound traffic is often overlooked, yet many compromises involve unfiltered egress. Use egress policies to permit only required external destinations. 
Example: allow DNS to a specific external resolver. Because the policy selects every pod in its namespace, all other egress, including lookups through the cluster DNS service, is blocked until another policy allows it. The example opens UDP port 53 only; DNS also falls back to TCP port 53 for large or truncated responses.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yml\" data-lang=\"yml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003enetworking.k8s.io/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eNetworkPolicy\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eallow-egress-dns\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003epodSelector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e 
\u003c/span\u003e{}\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003epolicyTypes\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e- \u003cspan class=\"l\"\u003eEgress\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eegress\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e- \u003cspan class=\"nt\"\u003eto\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"nt\"\u003eipBlock\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e            \u003c/span\u003e\u003cspan class=\"nt\"\u003ecidr\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e8.8.8.8\u003c/span\u003e\u003cspan class=\"l\"\u003e/32\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003eports\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"nt\"\u003eprotocol\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eUDP\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e          \u003c/span\u003e\u003cspan class=\"nt\"\u003eport\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e53\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch2 id=\"choosing-the-right-engine\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#choosing-the-right-engine\" title=\"Choosing the right engine\"\u003eChoosing the right engine\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eFeature comparison at a glance:\u003c/p\u003e\n\u003ctable class=\"striped\"\u003e\n\t\u003cthead\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003cth\u003eFeature\u003c/th\u003e\n\t\t\t\t\t\u003cth style=\"text-align: right\"\u003eAzure Network Policies\u003c/th\u003e\n\t\t\t\t\t\u003cth style=\"text-align: right\"\u003eCalico\u003c/th\u003e\n\t\t\t\t\t\u003cth style=\"text-align: right\"\u003eCilium\u003c/th\u003e\n\t\t\t\u003c/tr\u003e\n\t\u003c/thead\u003e\n\t\u003ctbody\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003eAKS integration\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eVery good\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: 
right\"\u003eGood\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eGood\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003ePerformance\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eHigh\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eHigh\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eHigh\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003eComplexity\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eLow\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eMedium\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eMedium\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003eAdvanced egress\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eNo\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eYes\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eYes\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003eGlobal policies\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eNo\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eYes\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eYes\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003eMulti-cloud support\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eNo\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eYes\u003c/td\u003e\n\t\t\t\t\t\u003ctd style=\"text-align: right\"\u003eYes\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eNote on Cilium: Cilium provides an eBPF-based dataplane and supports advanced L7 features and cluster/global policy CRDs. 
Many of Cilium\u0026rsquo;s advanced capabilities rely on Linux eBPF support; feature parity on Windows nodes is limited. Check the AKS and Cilium documentation for supported scenarios and any AKS-specific integration steps.\u003c/p\u003e\n\u003cp\u003eRecommendation: use Azure Network Policies if you need a managed Azure-native solution and do not require advanced Calico features. Choose Calico if you need advanced egress controls, global policies, or multi-cloud consistency. Consider Cilium if you need L7-aware policies or an eBPF-based dataplane.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"best-practices\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#best-practices\" title=\"Best practices\"\u003eBest practices\u003c/a\u003e\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003eStart with a default-deny posture. Block traffic first, then explicitly allow required flows.\u003c/li\u003e\n\u003cli\u003eOrganize policies per namespace to simplify governance and reduce accidental exposure.\u003c/li\u003e\n\u003cli\u003eVersion and test policies as part of CI pipelines. Tools such as Kyverno or Gatekeeper help validate and enforce policy changes before they reach production.\u003c/li\u003e\n\u003cli\u003eInstrument and visualize traffic flows using Azure Monitor, Calico UI, or third-party observability tools. Visibility is critical for troubleshooting and verification.\u003c/li\u003e\n\u003cli\u003eCombine network policies with Pod Security Standards to protect workloads and reduce risk at multiple layers.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAuthor tip: Test policy changes in a disposable staging cluster and automate policy validation in CI pipelines. This reduces surprises during production rollouts and helps detect overly broad or blocking rules early.\u003c/p\u003e\n\u003cp\u003eAuthor note: I will be honest: when I first started working with AKS network policies, I found the default behaviour a bit surprising, and you probably will too. The simple rule of thumb I use is: start small, test often, and iterate. 
If you take nothing else from this article, just run the validation steps in a throwaway cluster and you\u0026rsquo;ll learn quickly what gets blocked and what does not.\u003c/p\u003e\n\u003cp\u003eKnown limitations and version notes\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWindows node support and feature parity can differ from Linux; check the AKS Windows guidance for details. (See Microsoft Docs.)\u003c/li\u003e\n\u003cli\u003eSome advanced Calico features may require specific Calico versions; refer to the Calico and AKS release notes before adopting L7 or global policy features.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eValidate NetworkPolicy\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eon\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"l\"\u003epush]\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ejobs\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan 
class=\"nt\"\u003evalidate-policy\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003eruns-on\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eubuntu-latest\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003esteps\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eCheckout\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003euses\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eactions/checkout@v4\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan 
class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eSet up kubectl\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003euses\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eazure/setup-kubectl@v3\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eApply policy and test connectivity\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003eenv\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e          \u003c/span\u003e\u003cspan class=\"nt\"\u003eKUBECONFIG\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003e${{ secrets.KUBECONFIG }}\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        
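\u003c/span\u003e\u003cspan class=\"nt\"\u003erun\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e\u003cspan class=\"sd\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          # KUBECONFIG must be a file path; this assumes the secret holds the kubeconfig content\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          printf \u0026#39;%s\u0026#39; \u0026#34;$KUBECONFIG\u0026#34; \u0026gt; \u0026#34;$RUNNER_TEMP/kubeconfig\u0026#34;\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          export KUBECONFIG=\u0026#34;$RUNNER_TEMP/kubeconfig\u0026#34;\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          kubectl apply -f mypolicy.yaml\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          kubectl run client --image=busybox --restart=Never -- sleep 3600\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          # the server runs busybox httpd (assumed to ship in the image) so that port 8080 has a listener\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          kubectl run server --image=busybox --restart=Never -- httpd -f -p 8080\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          kubectl wait --for=condition=Ready pod/client pod/server --timeout=60s\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          SERVER_IP=$(kubectl get pod -l run=server -o jsonpath=\u0026#39;{.items[0].status.podIP}\u0026#39;)\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e          kubectl exec client -- nc -zv $SERVER_IP 8080 || exit 1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eA few quick, practical pointers: name your test namespace \u003ccode\u003enp-test\u003c/code\u003e, use labels like \u003ccode\u003eapp=demo\u003c/code\u003e and \u003ccode\u003erole=backend\u003c/code\u003e, and store a least-privilege \u003ccode\u003eKUBECONFIG\u003c/code\u003e in your CI secrets. 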
These tiny, somewhat mundane conventions make reproducible tests a lot easier.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"conclusion\"\u003e\u003ca href=\"/posts/aks-network-policies-zero-trust/#conclusion\" title=\"Conclusion\"\u003eConclusion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eNetwork policies are a foundational control for securing AKS clusters. They enable a zero-trust approach inside the cluster, reduce the attack surface, separate workloads, and allow precise control of inbound and outbound traffic. Whether you adopt Azure Network Policies or Calico, apply policies consistently, automate testing and deployment, and maintain visibility to ensure the cluster remains secure and auditable.\u003c/p\u003e","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2025-12-10T11:45:00+01:00","id":"https://daily-devops.net/posts/aks-network-policies-zero-trust/","language":"en","summary":"Learn why AKS Network Policies are essential for Zero Trust, pod isolation, and Kubernetes security—plus how to implement them the right way.","tags":["networking","azure","cloud","kubernetes","platform-engineering"],"title":"AKS Network Policies: The Security Layer Your Cluster Is Missing","url":"https://daily-devops.net/posts/aks-network-policies-zero-trust/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eSelecting the right network model is arguably one of the most critical architectural decisions you will make when deploying a Kubernetes cluster on Azure Kubernetes Service (AKS). This choice ripples through nearly every aspect of your cluster\u0026rsquo;s lifecycle, influencing how pods communicate, how efficiently you use your IP address space, which Azure services integrate seamlessly with your workloads, and ultimately, how well your infrastructure scales to meet future demands. 
It affects scalability, security posture, operational cost, performance characteristics, available integration options, and your long-term operational flexibility.\u003c/p\u003e\n\u003cp\u003eFor many years, AKS administrators have largely found themselves choosing between two well-established options: \u003cstrong\u003ekubenet\u003c/strong\u003e and \u003cstrong\u003eAzure CNI\u003c/strong\u003e. Each brought distinct tradeoffs to the table. kubenet offered simplicity and IP efficiency at the cost of limited integration, while Azure CNI provided rich enterprise capabilities but introduced significant IP consumption challenges that required careful VNet planning. With the introduction of \u003cstrong\u003eAzure CNI Overlay\u003c/strong\u003e, Microsoft has addressed these historical limitations by adding a genuinely modern option that thoughtfully combines IP efficiency with comprehensive enterprise networking capabilities.\u003c/p\u003e\n\u003cp\u003eThis article walks through a comprehensive, practical comparison of all three networking models. We\u0026rsquo;ll examine how each one works under the hood, explore the genuine strengths and limitations of each approach, and ultimately provide you with the guidance you need to make an informed decision about which model best suits your specific organizational requirements and technical constraints.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"why-the-network-model-actually-matters\"\u003e\u003ca href=\"/posts/aks-networking-clash/#why-the-network-model-actually-matters\" title=\"Why the Network Model Actually Matters\"\u003eWhy the Network Model Actually Matters\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eYour choice of network model influences practically every layer of your cluster. How pods receive IP addresses, how they communicate with each other and the VNet, performance and latency characteristics, security boundaries, and policy enforcement all hinge on this decision. 
So does your ability to integrate with Azure services, your scalability ceiling, your cluster density potential, and ultimately your VNet planning complexity.\u003c/p\u003e\n\u003cp\u003eChanging this decision later is difficult and sometimes impossible. It\u0026rsquo;s not a setting you adjust casually after launch. Getting it right from the start matters considerably.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"kubenet-simplicity-at-the-cost-of-integration\"\u003e\u003ca href=\"/posts/aks-networking-clash/#kubenet-simplicity-at-the-cost-of-integration\" title=\"kubenet: Simplicity at the Cost of Integration\"\u003ekubenet: Simplicity at the Cost of Integration\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubenet is effectively legacy for new projects. Microsoft maintains it for existing clusters, but no production workloads should start with it today.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"how-it-works\"\u003e\u003ca href=\"/posts/aks-networking-clash/#how-it-works\" title=\"How it works\"\u003eHow it works\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ekubenet is the simplest networking approach for AKS. Each node receives a single VNet-routable IP address, but pods get their IPs from a separate, non-routable CIDR range that exists only within the cluster. When pods need to communicate outside the cluster, traffic goes through network address translation (NAT) and user-defined routes (UDRs) that you manage yourself. This fundamental separation is both kubenet\u0026rsquo;s defining feature and its core limitation.\u003c/p\u003e\n\u003cp\u003eKubenet maxes out at 400 nodes. For modern clusters, that\u0026rsquo;s a hard ceiling you\u0026rsquo;ll hit faster than you expect.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"strengths\"\u003e\u003ca href=\"/posts/aks-networking-clash/#strengths\" title=\"Strengths\"\u003eStrengths\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eThe appeal is genuine. Kubenet is IP efficient—you consume very few VNet IPs because pods sit in their own address space. 
It\u0026rsquo;s simple to understand and straightforward to configure, which makes it attractive for teams new to Kubernetes or environments where networking should stay uncomplicated. Operationally, that translates to lower cost and less day-to-day overhead.\u003c/p\u003e\n\u003cp\u003eThe downside? Isolation.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"limitations\"\u003e\u003ca href=\"/posts/aks-networking-clash/#limitations\" title=\"Limitations\"\u003eLimitations\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eBecause pods aren\u0026rsquo;t directly routable in the VNet, they remain isolated from your broader Azure networking ecosystem. NAT adds overhead and troubleshooting complexity. Integration with Azure networking features—Network Security Groups, Private Link, Azure Firewall—remains limited. For enterprise deployments or hybrid scenarios where your cluster needs to participate seamlessly in existing infrastructure, these limitations become real constraints.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-to-use-it\"\u003e\u003ca href=\"/posts/aks-networking-clash/#when-to-use-it\" title=\"When to use it\"\u003eWhen to use it\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ekubenet works well in specific contexts: development and test environments where simplicity matters more than features, small clusters running non-critical workloads, or scenarios with minimal networking requirements. 
Beyond those cases, you\u0026rsquo;re better served exploring alternatives.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-cni-enterprise-integration-comes-with-a-price\"\u003e\u003ca href=\"/posts/aks-networking-clash/#azure-cni-enterprise-integration-comes-with-a-price\" title=\"Azure CNI: Enterprise Integration Comes with a Price\"\u003eAzure CNI: Enterprise Integration Comes with a Price\u003c/a\u003e\u003c/h2\u003e\n\n\n\n\n\u003ch3 id=\"how-it-works-1\"\u003e\u003ca href=\"/posts/aks-networking-clash/#how-it-works-1\" title=\"How it works\"\u003eHow it works\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure CNI (Container Network Interface) represents a fundamental shift from kubenet. Instead of isolating pods in a separate address space, this model assigns each pod a direct, fully routable IP address from your VNet subnet. Pods become first-class participants in your Azure network, capable of direct communication with any VNet resource without NAT or additional routing rules. Traffic flows directly with minimal overhead, resulting in transparent and predictable networking.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"strengths-1\"\u003e\u003ca href=\"/posts/aks-networking-clash/#strengths-1\" title=\"Strengths\"\u003eStrengths\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eThe advantages become apparent in enterprise environments where network visibility matters. Pods hold genuine VNet addresses, so they participate fully in your security frameworks, policy enforcement, and monitoring. Network Security Groups apply directly to pods. Private Link connections work seamlessly. Azure Firewall can inspect traffic properly. Your monitoring tools see pods as native VNet resources. This transparency is invaluable in regulated industries or zero-trust architectures where every network flow must be visible and controllable. 
Performance is excellent too—no NAT overhead means direct, efficient communication.\u003c/p\u003e\n\u003cp\u003eThe trade-off is real: you need substantial VNet address space.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"limitations-1\"\u003e\u003ca href=\"/posts/aks-networking-clash/#limitations-1\" title=\"Limitations\"\u003eLimitations\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure CNI has a substantial appetite for IP addresses. Every pod needs its own VNet IP, which exhausts address space quickly in larger clusters or with high pod density. A 100-node cluster with 200 pods per node consumes 20,000 pod IPs alone, so you need at least a /17 VNet subnet (32,768 addresses) just for pods. For organizations with limited IP space or managing many clusters in a constrained range, this becomes a genuine scaling constraint.\u003c/p\u003e\n\u003cp\u003eCommon mistakes with Azure CNI: Teams underestimate pod density and provision subnets too small. A /19 feels generous until you hit 250 pods/node on 50 nodes (12,500 pod IPs against 8,192 addresses). Then you\u0026rsquo;re recreating the entire cluster. Plan your pod count ceiling carefully—don\u0026rsquo;t guess.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-to-use-it-1\"\u003e\u003ca href=\"/posts/aks-networking-clash/#when-to-use-it-1\" title=\"When to use it\"\u003eWhen to use it\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eChoose Azure CNI when network governance, compliance, and performance take priority over IP efficiency. Production workloads in regulated industries, hybrid environments, and zero-trust architectures all benefit from its full integration story. If your organization can accommodate the IP consumption and your workloads demand strong visibility, Azure CNI delivers consistently.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"how-it-works-2\"\u003e\u003ca href=\"/posts/aks-networking-clash/#how-it-works-2\" title=\"How it works\"\u003eHow it works\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eNodes still receive VNet IP addresses as in standard Azure CNI. 
But pods operate within a separate overlay network with its own CIDR range, decoupled from the VNet. Pod traffic routes through a lightweight overlay stack that handles encapsulation transparently. Despite this separation, full Azure CNI functionality remains available—pods retain integration benefits with Azure services and security constructs.\u003c/p\u003e\n\u003cp\u003eThe math changes dramatically. A 1,000-node cluster with 200 pods/node needs roughly 200,000 pod IPs (1,000 nodes × 200 pods/node). With traditional Azure CNI, all of them come out of the VNet, which means at least a /14 subnet (262,144 addresses). With Overlay, the VNet only has to hold the node IPs, about 1,000 addresses that fit in a /22, while pod IPs come from the private overlay CIDR and never consume VNet space. That\u0026rsquo;s roughly a 200x reduction in VNet consumption compared to traditional CNI\u0026rsquo;s flat model.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"strengths-2\"\u003e\u003ca href=\"/posts/aks-networking-clash/#strengths-2\" title=\"Strengths\"\u003eStrengths\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure CNI Overlay combines the best of both models. It maintains high IP efficiency similar to kubenet—run large numbers of pods without exhausting VNet address space. Simultaneously, it delivers full enterprise integration like Azure CNI—direct compatibility with Network Security Groups, Private Link, Azure Firewall, and monitoring solutions. Large-scale clusters work without complex subnet planning. Organizations with limited IP space or managing many clusters get a significant scaling advantage. Microsoft explicitly recommends this as the standard for new production clusters, reflecting the platform\u0026rsquo;s evolution.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"limitations-2\"\u003e\u003ca href=\"/posts/aks-networking-clash/#limitations-2\" title=\"Limitations\"\u003eLimitations\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eOverlay adds minor latency: plan for 100-200 microseconds extra per pod-to-external hop due to SNAT. For latency-sensitive workloads (high-frequency trading, real-time gaming), this matters. 
Classic Azure CNI eliminates this cost entirely.\u003c/p\u003e\n\u003cp\u003eDebugging pod-to-external traffic is harder. You\u0026rsquo;ll need to understand SNAT translation. Classic Azure CNI shows pod IPs in network traces; Overlay hides them behind node IPs. Budget extra engineering for network troubleshooting. Most teams underestimate this operational cost.\u003c/p\u003e\n\u003cp\u003eRegional limitations remain: Windows Server 2019 pod support rolled out Q4 2024, but DCsv2 Confidential Computing VMs are unsupported on Overlay (use DCAsv5 instead). Check your region\u0026rsquo;s feature matrix before committing.\u003c/p\u003e\n\u003cp\u003eCommon mistakes: Forgetting that Overlay configuration can\u0026rsquo;t be changed post-deployment. Teams have recreated entire clusters after discovering pod density requirements too late. Finalize your pod count ceiling before cluster creation.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-to-use-it-2\"\u003e\u003ca href=\"/posts/aks-networking-clash/#when-to-use-it-2\" title=\"When to use it\"\u003eWhen to use it\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eChoose Azure CNI Overlay if: (1) Your cluster will exceed 1,000 nodes, (2) IP space is scarce, or (3) Pod density baseline exceeds 100 pods/node. For smaller clusters with abundant IP space, classic Azure CNI remains valid.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCritical operational consideration:\u003c/strong\u003e Overlay networking impacts your observability strategy. Direct pod IP logging doesn\u0026rsquo;t work. Your monitoring tools must track node IPs and SNAT mappings instead. Prometheus scrapes will show node targets, not pod targets. Container registries see pod IPs translate through node IPs. 
Budget extra engineering for network observability—this is where most teams get blindsided.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"putting-it-all-in-perspective-a-practical-comparison\"\u003e\u003ca href=\"/posts/aks-networking-clash/#putting-it-all-in-perspective-a-practical-comparison\" title=\"Putting It All in Perspective: A Practical Comparison\"\u003ePutting It All in Perspective: A Practical Comparison\u003c/a\u003e\u003c/h2\u003e\n\u003ctable class=\"striped\"\u003e\n\t\u003cthead\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003cth\u003eFeature\u003c/th\u003e\n\t\t\t\t\t\u003cth\u003ekubenet\u003c/th\u003e\n\t\t\t\t\t\u003cth\u003eAzure CNI\u003c/th\u003e\n\t\t\t\t\t\u003cth\u003eAzure CNI Overlay\u003c/th\u003e\n\t\t\t\u003c/tr\u003e\n\t\u003c/thead\u003e\n\t\u003ctbody\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003eMax Nodes\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003e400\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003e1,000+\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003e5,000\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003ePod IP Source\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003ePod CIDR\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eVNet subnet\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eOverlay CIDR\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003eIP Efficiency\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eHigh\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eLow (20,000+ IPs/100 nodes)\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eHigh (8,000 IPs/100 nodes)\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003eRouting\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eNAT + UDR\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eDirect\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eOverlay (SNAT 
egress)\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003ePerformance\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eGood (+latency)\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eExcellent\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eHigh (+100-200μs NAT)\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003eAzure Integration\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eLimited\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eFull\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eFull\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003eComplexity\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eLow\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eHigh\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eMedium\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\t\t\u003ctr\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003eProduction-Ready?\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eLegacy only\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003eYes (IP constraints)\u003c/td\u003e\n\t\t\t\t\t\u003ctd\u003e\u003cstrong\u003eYes (default)\u003c/strong\u003e\u003c/td\u003e\n\t\t\t\u003c/tr\u003e\n\t\u003c/tbody\u003e\n\u003c/table\u003e\n\n\n\n\n\u003ch2 id=\"making-the-right-choice-for-your-constraints\"\u003e\u003ca href=\"/posts/aks-networking-clash/#making-the-right-choice-for-your-constraints\" title=\"Making the Right Choice for Your Constraints\"\u003eMaking the Right Choice for Your Constraints\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAs of Q4 2025, Microsoft recommends CNI Overlay for all new AKS clusters. Kubenet remains only for legacy migration scenarios. Traditional Azure CNI (flat model) is now positioned as \u0026ldquo;advanced use only.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eYour decision depends on your specific constraints. 
Here\u0026rsquo;s what that means practically:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLimited IP address space?\u003c/strong\u003e Overlay is your only option. A 500-node cluster with traditional CNI burns 125,000 VNet IPs (500 nodes × 250 pods/node). Overlay uses maybe 500 IPs for nodes, 8,000 for the private CIDR. That\u0026rsquo;s the difference between feasible and impossible.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRegulated industry requiring direct pod traceability?\u003c/strong\u003e Traditional Azure CNI gives you pod IPs you can trace end-to-end. Overlay requires you to reverse-engineer SNAT mappings. Compliance frameworks sometimes demand the former. Check your audit requirements before deciding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDevelopment or proof-of-concept?\u003c/strong\u003e Kubenet is still reasonable here. Simplicity wins. Just don\u0026rsquo;t ship it to production.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eNew production cluster with no prior constraints?\u003c/strong\u003e Overlay. Default assumption. End of discussion. The platform matured past the point where you need to second-guess this.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-path-forward-understanding-the-real-trade-offs\"\u003e\u003ca href=\"/posts/aks-networking-clash/#the-path-forward-understanding-the-real-trade-offs\" title=\"The Path Forward: Understanding the Real Trade-offs\"\u003eThe Path Forward: Understanding the Real Trade-offs\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS networking boils down to this: kubenet is dead for production. Azure CNI works only if you have VNet space to burn. Overlay is the pragmatic default.\u003c/p\u003e\n\u003cp\u003eKubenet was the starting point in 2017. Azure CNI added enterprise features in 2019. But both forced uncomfortable choices: either accept a 400-node ceiling with poor observability, or reserve a /11 subnet that might bankrupt your IP planning. 
Neither worked for real clusters at scale.\u003c/p\u003e\n\u003cp\u003eOverlay changed that equation. Yes, you lose direct pod IP traceability. Yes, you add 100-200 microseconds of latency. But you get 5,000-node clusters with IP efficiency that makes sense. You get subnet planning that doesn\u0026rsquo;t require reserving a VNet address for every pod. You get a path forward that doesn\u0026rsquo;t require architectural compromise.\u003c/p\u003e\n\u003cp\u003eThe trade-off is honest: latency and debugging complexity for scalability and IP efficiency. For most organizations, that\u0026rsquo;s the right trade.\u003c/p\u003e\n\u003cp\u003eIf you\u0026rsquo;re building new infrastructure on AKS, start with Overlay. If you\u0026rsquo;re running the math on existing clusters and wondering whether to migrate, Overlay is probably cheaper than the subnet expansion you\u0026rsquo;re otherwise facing. Plan your observability around SNAT mappings from day one. Budget engineering time for network troubleshooting. But build forward knowing the constraint that has limited AKS clusters for five years is finally solved.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2025-12-03T11:45:00+01:00","id":"https://daily-devops.net/posts/aks-networking-clash/","language":"en","summary":"Azure CNI Overlay beats kubenet's 400-node ceiling and classic CNI's IP exhaustion. Compare all three AKS network models before the cluster locks in.","tags":["networking","azure","cloud","kubernetes","platform-engineering"],"title":"AKS Networking Clash: kubenet vs. CNI vs. CNI Overlay","url":"https://daily-devops.net/posts/aks-networking-clash/"}],"language":"en","title":"Kubernetes and Container Orchestration on Daily DevOps \u0026 .NET","version":"https://jsonfeed.org/version/1.1"}