{"authors":[{"name":"Martin Stühmer","url":"https://daily-devops.net/authors/martin/"},{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"description":"Recent content in Infrastructure Engineering and Cloud Operations on Daily DevOps \u0026 .NET","favicon":"https://daily-devops.net/images/logo_hu_6465d873dfa490cf.png","feed_url":"https://daily-devops.net/tags/infrastructure/feed.json","home_page_url":"https://daily-devops.net/tags/infrastructure/","icon":"https://daily-devops.net/images/logo_hu_5926de77762241ba.png","items":[{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eNobody warns you about the point where Kubernetes stops behaving like Kubernetes. At 100 nodes the platform feels manageable: logs are searchable, deployments finish quickly, and most incidents resolve with a kubectl command and some patience. Cross 500 nodes and small architectural assumptions start cracking. Cross 1,000 nodes and those cracks become structural.\u003c/p\u003e\n\u003cp\u003eThe problems described here are not hypothetical. etcd database sizes that stretched backup windows into hours. Observability stacks consuming more cluster resources than the workloads they were supposed to monitor. Network overlays running fine at 200 nodes that started dropping packets at 800. If you\u0026rsquo;re planning to push past 500 nodes, or already running infrastructure at that scale and things feel increasingly fragile, read on.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-scale-cliff-why-1000-nodes-changes-everything\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#the-scale-cliff-why-1000-nodes-changes-everything\" title=\"The Scale Cliff: Why 1,000 Nodes Changes Everything\"\u003eThe Scale Cliff: Why 1,000 Nodes Changes Everything\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAt 100 nodes, Kubernetes feels manageable. Monitoring works. Logs are searchable. Network patterns make sense. 
Deployments complete in minutes. Then you cross 500 nodes and small cracks appear. By 1,000 nodes, those cracks become structural failures.\u003c/p\u003e\n\u003cp\u003eThe problem: Kubernetes components designed for graceful degradation hit hard limits at scale. etcd performance degrades non-linearly with keyspace size. Network overlay solutions that worked fine at 200 nodes saturate at 800. Observability stacks consuming 3% of cluster resources at 100 nodes consume 25% at 1,000. Cost-per-node stays flat but operational overhead per node increases exponentially.\u003c/p\u003e\n\u003cp\u003eThese aren\u0026rsquo;t bugs. They\u0026rsquo;re architectural realities. Understanding where the cliffs are lets you plan around them instead of discovering them in production outages.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"etcd-the-hidden-scaling-bottleneck\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#etcd-the-hidden-scaling-bottleneck\" title=\"etcd: The Hidden Scaling Bottleneck\"\u003eetcd: The Hidden Scaling Bottleneck\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eetcd is the single most critical component in your cluster and the first to hit scaling limits. It stores all cluster state: every pod, service, config map, secret, and custom resource. At 1,000 nodes with 200 pods per node, you\u0026rsquo;re managing 200,000+ objects. etcd wasn\u0026rsquo;t designed for that scale without careful tuning.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"performance-degradation-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#performance-degradation-patterns\" title=\"Performance Degradation Patterns\"\u003ePerformance Degradation Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eetcd performance degrades based on keyspace size, transaction rate, and storage backend latency. At small scale, these factors don\u0026rsquo;t matter. 
At mega-cluster scale, they dominate operational behavior.\u003c/p\u003e\n\u003cp\u003eSymptoms you\u0026rsquo;ll see:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eAPI server latency spikes during deployments\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ekubectl\u003c/code\u003e commands timing out intermittently\u003c/li\u003e\n\u003cli\u003eController reconciliation loops falling behind\u003c/li\u003e\n\u003cli\u003eScheduler making suboptimal placement decisions due to stale state\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe root cause is usually one of three things: etcd database size exceeding memory capacity, insufficient IOPS on the storage backend, or transaction rate overwhelming the commit pipeline.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"backup-size-and-recovery-time\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#backup-size-and-recovery-time\" title=\"Backup Size and Recovery Time\"\u003eBackup Size and Recovery Time\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eetcd backup size scales with keyspace. A 100-node cluster might produce 500MB backups. A 1,000-node cluster produces 8GB+ backups. That size creates operational problems:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBackup windows extend from minutes to hours\u003c/li\u003e\n\u003cli\u003eNetwork transfer costs increase linearly\u003c/li\u003e\n\u003cli\u003eRecovery time objectives (RTO) slip from \u0026ldquo;15 minutes\u0026rdquo; to \u0026ldquo;2+ hours\u0026rdquo;\u003c/li\u003e\n\u003cli\u003eStorage costs for retention policies multiply unexpectedly\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWorse: most backup solutions for etcd aren\u0026rsquo;t tested at mega-cluster scale. 
The tooling that works reliably at 100 nodes silently fails or creates corrupted snapshots at 1,000 nodes.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"practical-mitigation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#practical-mitigation\" title=\"Practical Mitigation\"\u003ePractical Mitigation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAKS manages etcd for you\u003c/a\u003e, but you still need to monitor and validate its health. Here\u0026rsquo;s a Terraform configuration that sets up Azure Monitor alerts for etcd-related API server latency:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_monitor_metric_alert\u0026#34; \u0026#34;etcd_latency\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-etcd-high-latency\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
scopes\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  description\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Alert when API server latency exceeds 200ms (etcd saturation signal)\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  severity\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e2\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  frequency\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT1M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  window_size\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT5M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003ecriteria\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    
metric_namespace\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Microsoft.ContainerService/managedClusters\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;apiserver_request_duration_seconds\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    aggregation\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Average\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    operator\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GreaterThan\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    threshold\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e0\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"m\"\u003e2\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003edimension\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      name\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;verb\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      operator\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Include\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e      values\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;GET\u0026#34;, \u0026#34;LIST\u0026#34;, \u0026#34;PATCH\u0026#34;, \u0026#34;UPDATE\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eaction\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    action_group_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_monitor_action_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eplatform\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_monitor_metric_alert\u0026#34; \u0026#34;etcd_database_size\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-etcd-database-size-warning\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  scopes\u003c/span\u003e              \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"k\"\u003eazurerm_kubernetes_cluster\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003emain\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  description\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Alert when etcd database approaches size limits\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  severity\u003c/span\u003e            
\u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e3\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  frequency\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT5M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  window_size\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;PT15M\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003ecriteria\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_namespace\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Microsoft.ContainerService/managedClusters\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    metric_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;etcd_db_total_size_in_bytes\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    aggregation\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Average\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    operator\u003c/span\u003e         \u003cspan 
class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GreaterThan\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    threshold\u003c/span\u003e        \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e6442450944\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # 6GB (warning threshold)\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eaction\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    action_group_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_monitor_action_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eplatform\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThese alerts won\u0026rsquo;t prevent etcd saturation, but they\u0026rsquo;ll give you advance warning before cascading failures occur. 
At scale, that early warning is the difference between a controlled maintenance window and an all-hands incident.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"network-performance-when-overlay-solutions-hit-limits\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#network-performance-when-overlay-solutions-hit-limits\" title=\"Network Performance: When Overlay Solutions Hit Limits\"\u003eNetwork Performance: When Overlay Solutions Hit Limits\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eNetwork overlay performance is invisible at small scale and catastrophic at large scale. Container Network Interface (CNI) plugins that handle 50,000 pods without issue can saturate CPU and drop packets at 200,000 pods. There is no single right answer for which CNI to use. For a full breakdown of the tradeoffs between kubenet, Azure CNI, and Azure CNI Overlay, see \u003ca href=\"/posts/aks-networking-clash/\"\u003eAKS Networking Clash: kubenet vs. CNI vs. CNI Overlay\u003c/a\u003e.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"pod-density-and-node-saturation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#pod-density-and-node-saturation\" title=\"Pod Density and Node Saturation\"\u003ePod Density and Node Saturation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure CNI Overlay\u003c/a\u003e supports up to 250 pods per node. That\u0026rsquo;s a theoretical maximum. 
Practical limits depend on network I/O patterns, pod churn rate, and service mesh overhead.\u003c/p\u003e\n\u003cp\u003eSignals that you\u0026rsquo;re approaching saturation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eNodes showing high system CPU (kernel networking overhead)\u003c/li\u003e\n\u003cli\u003eIntermittent packet loss between pods on the same node\u003c/li\u003e\n\u003cli\u003eService discovery latency increasing over time\u003c/li\u003e\n\u003cli\u003eDNS resolution failures under load\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe underlying issue: network namespace creation, iptables rule updates, and conntrack table management all scale poorly. At 200 pods per node, these operations consume negligible resources. At 250 pods per node, they dominate system CPU.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cross-node-latency-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cross-node-latency-patterns\" title=\"Cross-Node Latency Patterns\"\u003eCross-Node Latency Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eOverlay networks add encapsulation overhead. Azure CNI Overlay typically adds 100-200 microseconds per hop. At small scale, that\u0026rsquo;s noise. At mega-cluster scale, it compounds across multi-tier applications.\u003c/p\u003e\n\u003cp\u003eExample: a request traversing frontend → API gateway → backend service → database proxy touches 4 pods. If those pods span nodes, you\u0026rsquo;ve added 400-800 microseconds of latency from network overhead alone. 
Multiply that by 10,000 requests per second and the impact becomes measurable in user-facing metrics.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"mitigation-strategy\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#mitigation-strategy\" title=\"Mitigation Strategy\"\u003eMitigation Strategy\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003ePin latency-sensitive workloads to the same node using pod affinity\u003c/li\u003e\n\u003cli\u003eUse host networking for data-plane components (with appropriate security controls)\u003c/li\u003e\n\u003cli\u003eMonitor conntrack table utilization: \u003ccode\u003esysctl net.netfilter.nf_conntrack_count\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eSet conservative pod density limits (180-200 pods/node instead of 250)\u003c/li\u003e\n\u003cli\u003eImplement service mesh with extended Berkeley Packet Filter (eBPF) dataplane (\u003ca href=\"https://cilium.io/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eCilium\u003c/a\u003e) to reduce iptables overhead\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThese aren\u0026rsquo;t performance optimizations. 
They\u0026rsquo;re operational requirements at scale.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"observability-overhead-when-monitoring-becomes-the-problem\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#observability-overhead-when-monitoring-becomes-the-problem\" title=\"Observability Overhead: When Monitoring Becomes the Problem\"\u003eObservability Overhead: When Monitoring Becomes the Problem\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eObservability at scale creates a paradox: the systems you need to diagnose problems become the source of resource exhaustion.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"logging-cost-explosion\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#logging-cost-explosion\" title=\"Logging Cost Explosion\"\u003eLogging Cost Explosion\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eA single pod generating 100KB/day of logs costs nothing. 200,000 pods generating the same logs produce 20GB/day. Over a month, that\u0026rsquo;s 600GB. With 3x replication and 90-day retention (20GB × 90 days × 3 copies), you\u0026rsquo;re storing 5.4TB of log data.\u003c/p\u003e\n\u003cp\u003eStorage costs for that volume run into thousands of dollars monthly. Query performance degrades. Log ingestion pipelines fall behind. The tooling designed to help you debug problems becomes unusable during incidents.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"metric-cardinality-problems\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#metric-cardinality-problems\" title=\"Metric Cardinality Problems\"\u003eMetric Cardinality Problems\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePrometheus-based monitoring hits cardinality limits around 10 million active time series. 
A 1,000-node cluster with moderate instrumentation easily exceeds that threshold:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e200,000 pods × 20 metrics per pod = 4M series\u003c/li\u003e\n\u003cli\u003e1,000 nodes × 100 metrics per node = 100K series\u003c/li\u003e\n\u003cli\u003e50 services × 10K instances × 5 metrics = 2.5M series\u003c/li\u003e\n\u003cli\u003eCustom application metrics add another 3M+ series\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWhen you exceed cardinality limits, Prometheus becomes unstable. Queries time out. Dashboards fail to render. Alerting rules stop evaluating. You lose observability exactly when you need it most.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"practical-approaches\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#practical-approaches\" title=\"Practical Approaches\"\u003ePractical Approaches\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eImplement aggressive log sampling: 1% sampling of 20GB/day still yields 200MB/day of logs\u003c/li\u003e\n\u003cli\u003eUse structured logging with consistent field names to enable efficient compression\u003c/li\u003e\n\u003cli\u003eArchive cold logs to blob storage (pennies per GB vs. dollars per GB in hot storage)\u003c/li\u003e\n\u003cli\u003eDeploy federated Prometheus with careful metric filtering at scrape time\u003c/li\u003e\n\u003cli\u003eUse recording rules to pre-aggregate high-cardinality metrics\u003c/li\u003e\n\u003cli\u003eConsider managed observability services (Azure Monitor, Datadog) that handle scale for you\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe honest assessment: if your observability stack consumes more than 10% of cluster resources, it\u0026rsquo;s time to rethink your approach. 
At mega-cluster scale, that threshold is easy to exceed.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cost-spirals-small-decisions-with-exponential-consequences\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cost-spirals-small-decisions-with-exponential-consequences\" title=\"Cost Spirals: Small Decisions with Exponential Consequences\"\u003eCost Spirals: Small Decisions with Exponential Consequences\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eCost optimization at 100 nodes is optional. At 1,000 nodes, it\u0026rsquo;s mandatory. Small inefficiencies compound brutally.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"resource-overprovisioning\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#resource-overprovisioning\" title=\"Resource Overprovisioning\"\u003eResource Overprovisioning\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eTeams typically request 2x actual resource needs as a safety margin. At 100 nodes, that\u0026rsquo;s wasteful but affordable. At 1,000 nodes, half of the capacity you reserve sits idle. On 8-vCPU nodes such as the D8s_v5, that is roughly 4,000 vCPUs you pay for but never use.\u003c/p\u003e\n\u003cp\u003eWith Azure D8s_v5 nodes at ~$0.40/hour, a 1,000-node cluster costs ~$288,000/month (about $3.5M/year) in compute alone. If half of that capacity is overprovisioned, roughly $144,000 per month is wasted. That\u0026rsquo;s real budget impact.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"storage-cost-patterns\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#storage-cost-patterns\" title=\"Storage Cost Patterns\"\u003eStorage Cost Patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eEvery pod gets ephemeral storage. Most clusters also provision persistent volumes. At scale, storage costs can exceed compute costs.\u003c/p\u003e\n\u003cp\u003eExample: 200,000 pods with 10GB ephemeral storage each = 2PB of ephemeral storage. Persistent volume claims add another 500TB+. Azure Premium SSD costs $0.135/GB/month. 
You\u0026rsquo;re paying $300K+ monthly for storage alone.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"network-egress-surprises\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#network-egress-surprises\" title=\"Network Egress Surprises\"\u003eNetwork Egress Surprises\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCross-region and internet egress costs scale linearly with traffic volume. A 1,000-node cluster handling 10TB/day of egress traffic incurs $1,500/day in bandwidth costs ($45,000/month).\u003c/p\u003e\n\u003cp\u003eTeams typically discover these costs 60 days into a scale-up when the first full billing cycle completes. By then, architectural changes are expensive and disruptive.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cost-control-strategy\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#cost-control-strategy\" title=\"Cost Control Strategy\"\u003eCost Control Strategy\u003c/a\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003eImplement cluster autoscaling with aggressive scale-down policies\u003c/li\u003e\n\u003cli\u003eUse spot instances for fault-tolerant workloads (70% cost reduction)\u003c/li\u003e\n\u003cli\u003eRight-size pod resource requests using \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/vertical-pod-autoscaler\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eVPA (Vertical Pod Autoscaler)\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003eEnable Azure Hybrid Benefit for Windows nodes\u003c/li\u003e\n\u003cli\u003eDeploy regional caching layers to reduce cross-region egress\u003c/li\u003e\n\u003cli\u003eMonitor and alert on cost metrics, not just resource metrics\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eTeams defer cost optimization in favor of operational simplicity, and early on that is usually the right call. At mega-cluster scale, that priority reverses. Cost efficiency becomes a constraint you cannot ignore. 
\u003ca href=\"/posts/cost-optimization-resource-governance-aks/\"\u003eAKS Cost Optimization: Resource Governance That Actually Works\u003c/a\u003e goes deeper on VPA configuration and autoscaling policies if you want the practical implementation details.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"debugging-at-scale-finding-needles-in-exponentially-larger-haystacks\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#debugging-at-scale-finding-needles-in-exponentially-larger-haystacks\" title=\"Debugging at Scale: Finding Needles in Exponentially Larger Haystacks\"\u003eDebugging at Scale: Finding Needles in Exponentially Larger Haystacks\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDebugging a 100-node cluster means checking logs from a few thousand pods. Debugging a 1,000-node cluster means isolating the problem from millions of log lines across 200,000+ pods.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"correlation-and-isolation\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#correlation-and-isolation\" title=\"Correlation and Isolation\"\u003eCorrelation and Isolation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhen a user reports an error, your troubleshooting workflow looks like this:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eIdentify the service handling the request (1 of 50+ services)\u003c/li\u003e\n\u003cli\u003eFind the pod instance that processed the request (1 of 5,000+ pod instances)\u003c/li\u003e\n\u003cli\u003eLocate the relevant log lines (1 of 10M+ log events in the time window)\u003c/li\u003e\n\u003cli\u003eCorrelate with upstream/downstream service calls\u003c/li\u003e\n\u003cli\u003eReproduce the issue in a controlled environment\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAt small scale, steps 2-3 take minutes. At mega-cluster scale, they take hours, assuming correlation IDs exist and work correctly. 
Without proper instrumentation, they\u0026rsquo;re impossible.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"reproduction-challenges\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#reproduction-challenges\" title=\"Reproduction Challenges\"\u003eReproduction Challenges\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eIssues that reproduce reliably at scale rarely reproduce in test environments. A race condition that triggers once per 100,000 requests never manifests in pre-production. Network congestion patterns that emerge at 1,000 nodes don\u0026rsquo;t exist at 10 nodes.\u003c/p\u003e\n\u003cp\u003eThis creates a diagnostic blind spot. You can observe the failure in production but can\u0026rsquo;t reproduce it for root cause analysis.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"large-scale-troubleshooting-checklist\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#large-scale-troubleshooting-checklist\" title=\"Large-Scale Troubleshooting Checklist\"\u003eLarge-Scale Troubleshooting Checklist\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eHere\u0026rsquo;s a diagnostic script I use for investigating performance degradation at scale:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Large-scale AKS cluster diagnostic script\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Run this when experiencing unexplained performance issues\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eCLUSTER_NAME\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003e1\u003c/span\u003e\u003cspan class=\"p\"\u003e:?Cluster name required\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eRESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003e2\u003c/span\u003e\u003cspan class=\"p\"\u003e:?Resource group required\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eOUTPUT_DIR\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;./diagnostics-\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003edate +%Y%m%d-%H%M%S\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Running diagnostics for cluster: 
\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003emkdir -p \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Get cluster credentials\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-credentials --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --overwrite-existing\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Node health check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking node health...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get nodes -o wide \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e/nodes.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl top nodes \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/node-resources.txt\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Metrics server unavailable\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/node-resources.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# API server latency check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking API server latency...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e i in \u003cspan class=\"o\"\u003e{\u003c/span\u003e1..5\u003cspan class=\"o\"\u003e}\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003etime\u003c/span\u003e kubectl get nodes \u0026gt; /dev/null 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep real \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/api-latency.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# etcd health indicators\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking etcd health signals...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get --raw /metrics \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep -E \u003cspan class=\"s2\"\u003e\u0026#34;apiserver_request_duration|etcd_request_duration\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/etcd-metrics.txt\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Metrics unavailable\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/etcd-metrics.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Pod distribution analysis\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Analyzing pod distribution...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get pods -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | \u0026#34;\\(.spec.nodeName)\u0026#34;\u0026#39;\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e sort \u003cspan class=\"p\"\u003e|\u003c/span\u003e uniq -c \u003cspan class=\"p\"\u003e|\u003c/span\u003e sort -rn \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/pod-distribution.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Network policy count (can cause iptables overhead)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking network policy count...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get networkpolicies -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/netpol-count.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Service endpoint count (affects kube-proxy performance)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking service endpoint count...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# The ? operators skip Endpoints objects that have no subsets or no ready addresses\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get endpoints -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq \u003cspan class=\"s1\"\u003e\u0026#39;[.items[].subsets[]?.addresses[]?] | length\u0026#39;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/endpoint-count.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Pods that are not Ready (often a symptom of resource pressure)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Identifying pods that are not Ready...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get pods -A -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | select(.status.conditions[]? 
| select(.type==\u0026#34;Ready\u0026#34; and .status==\u0026#34;False\u0026#34;)) | \u0026#34;\\(.metadata.namespace)/\\(.metadata.name)\u0026#34;\u0026#39;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/not-ready-pods.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Recent events (truncated for performance)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Capturing recent cluster events...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get events -A --sort-by\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;.lastTimestamp\u0026#39;\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e tail -1000 \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/recent-events.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Node condition checks\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking for node pressure conditions...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl get nodes -o json \u003cspan class=\"p\"\u003e|\u003c/span\u003e jq -r \u003cspan class=\"s1\"\u003e\u0026#39;.items[] | select(.status.conditions[]? | select(.type==\u0026#34;MemoryPressure\u0026#34; or .type==\u0026#34;DiskPressure\u0026#34; or .type==\u0026#34;PIDPressure\u0026#34;) | select(.status==\u0026#34;True\u0026#34;)) | .metadata.name\u0026#39;\u003c/span\u003e \u003cspan class=\"p\"\u003e|\u003c/span\u003e sort -u \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/nodes-under-pressure.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# ConfigMap and Secret count (affects etcd size)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Counting ConfigMaps and Secrets...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ConfigMaps: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get configmaps -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;Secrets: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get secrets -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt;\u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Total Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pods -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u0026gt;\u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/object-counts.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# DNS performance check\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Testing DNS resolution performance...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl run dns-test --image\u003cspan class=\"o\"\u003e=\u003c/span\u003ebusybox:1.36 --restart\u003cspan class=\"o\"\u003e=\u003c/span\u003eNever --rm -i --command -- sh -c \u003cspan class=\"s2\"\u003e\u0026#34;time nslookup 
kubernetes.default\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/dns-test.txt\u0026#34;\u003c/span\u003e 2\u0026gt;\u003cspan class=\"p\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;DNS test failed\u0026#34;\u003c/span\u003e \u0026gt; \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e/dns-test.txt\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Diagnostics complete. 
Results in: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Quick analysis:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Nodes: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get nodes --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pods -A --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Not Ready Pods: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ecat \u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e/not-ready-pods.txt \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Nodes Under Pressure: \u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ecat \u003cspan class=\"nv\"\u003e$OUTPUT_DIR\u003c/span\u003e/nodes-under-pressure.txt \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Review the output files for detailed diagnostics.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script collects the signals that matter at scale: API latency, pod distribution skew, resource pressure indicators, and object count metrics. It doesn\u0026rsquo;t solve problems, but it eliminates 90% of the noise.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"patterns-that-prevent-catastrophe\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#patterns-that-prevent-catastrophe\" title=\"Patterns That Prevent Catastrophe\"\u003ePatterns That Prevent Catastrophe\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAfter running mega-clusters through multiple incident cycles, a few patterns consistently prevent the worst outcomes:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eProgressive rollouts\u003c/strong\u003e: Never deploy to 1,000 nodes simultaneously. Deploy to 1 node, then 10, then 100, then all. Automate rollback triggers. 
This pattern catches 95% of scale-dependent bugs before they reach the whole fleet.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBlast radius isolation\u003c/strong\u003e: Segment your cluster into failure domains using node pools, namespaces, and network policies. When something fails (and it will), contain the damage. \u003ca href=\"/posts/aks-network-policies-zero-trust/\"\u003eAKS Network Policies: The Security Layer Your Cluster Is Missing\u003c/a\u003e covers practical policy configuration if you are starting from scratch.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCapacity reservation\u003c/strong\u003e: Reserve 15-20% headroom for burst traffic and incident response. Running at 90%+ utilization saves money until you need to scale during an outage and can\u0026rsquo;t.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eImmutable infrastructure\u003c/strong\u003e: Treat nodes as cattle, not pets. Automate node replacement on a fixed schedule (weekly or monthly). This prevents subtle configuration drift that compounds into unreproducible failures.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOperational runbooks\u003c/strong\u003e: Document every common failure mode. When API server latency spikes at 2 AM, you don\u0026rsquo;t want to be reading Kubernetes source code to understand etcd compaction behavior.\u003c/p\u003e\n\u003cp\u003eThese patterns aren\u0026rsquo;t revolutionary. They\u0026rsquo;re boring, defensive engineering. At mega-cluster scale, boring wins.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"honest-takeaways\"\u003e\u003ca href=\"/posts/aks-at-scale-mega-cluster-lessons/#honest-takeaways\" title=\"Honest Takeaways\"\u003eHonest Takeaways\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning AKS at 1,000+ nodes isn\u0026rsquo;t fundamentally different from running it at 100 nodes. It\u0026rsquo;s exponentially different. Problems that self-heal at small scale cascade catastrophically at large scale. 
Architectural decisions that feel premature at 50 nodes become load-bearing at 500 nodes.\u003c/p\u003e\n\u003cp\u003eIf you\u0026rsquo;re planning to scale past 500 nodes: budget significant engineering time for operational tooling. Plan your observability strategy before your first node boots. Understand your cost model in detail. Test failure scenarios at scale before they happen in production.\u003c/p\u003e\n\u003cp\u003eIf you\u0026rsquo;re already running at scale: you know everything in this article because you\u0026rsquo;ve lived it. The value isn\u0026rsquo;t the advice. It\u0026rsquo;s knowing you\u0026rsquo;re not alone in discovering these lessons the hard way.\u003c/p\u003e\n\u003cp\u003eScale is honest. Every shortcut taken for velocity will surface eventually, usually at the worst possible moment. Budget engineering time to address that reality before you hit 500 nodes, not after. Fixing structural problems under production pressure costs significantly more than getting the architecture right from the start.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-04-01T17:00:00+01:00","id":"https://daily-devops.net/posts/aks-at-scale-mega-cluster-lessons/","language":"en","summary":"Real-world lessons from operating 1000+ node AKS clusters: etcd limits, network saturation, observability overhead, and cost spirals you need to know.","tags":["kubernetes","azure","cloud","devops","operations","infrastructure"],"title":"AKS at Scale: Hard-Won Lessons from 1000+ Node Clusters","url":"https://daily-devops.net/posts/aks-at-scale-mega-cluster-lessons/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\n\n\n\n\u003ch2 id=\"the-problem-cloud-and-on-prem-as-operational-silos\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#the-problem-cloud-and-on-prem-as-operational-silos\" title=\"The Problem: Cloud and On-Prem as Operational Silos\"\u003eThe Problem: Cloud and On-Prem as 
Operational Silos\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eMost organizations don\u0026rsquo;t run purely in the cloud. Legacy systems, compliance requirements, data gravity, and latency concerns keep critical workloads on-premises indefinitely. Running AKS in Azure alongside on-prem Kubernetes clusters multiplies management overhead: two separate control planes to patch, two policy frameworks to keep in sync, two identity configurations to audit, and two observability stacks generating alerts nobody wants to correlate manually.\u003c/p\u003e\n\u003cp\u003eThe temptation is to build custom tooling that bridges the gap. That usually ends as a fragile script collection that only one person on the team understands. Azure Arc changes the equation: it extends Azure\u0026rsquo;s management plane to any Kubernetes cluster without migrating workloads.\u003c/p\u003e\n\u003cp\u003eThis article covers the practical pieces: network connectivity options, Azure Arc for unified management, DNS resolution across environment boundaries, policy enforcement, and identity federation.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"connectivity-models-getting-traffic-between-environments\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#connectivity-models-getting-traffic-between-environments\" title=\"Connectivity Models: Getting Traffic Between Environments\"\u003eConnectivity Models: Getting Traffic Between Environments\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eBefore you can manage hybrid Kubernetes deployments, you need reliable network connectivity. 
Three primary patterns exist, each with distinct trade-offs.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"expressroute-dedicated-private-connectivity\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#expressroute-dedicated-private-connectivity\" title=\"ExpressRoute: Dedicated Private Connectivity\"\u003eExpressRoute: Dedicated Private Connectivity\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/expressroute/expressroute-introduction\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eExpressRoute\u003c/a\u003e provides a dedicated, private connection between on-premises and Azure, bypassing the public internet entirely. Latency is predictable, throughput is consistent, and the connection doesn\u0026rsquo;t compete with general internet traffic.\u003c/p\u003e\n\u003cp\u003eThe operational reality: provisioning takes weeks, requires coordination with a connectivity provider, and demands solid Border Gateway Protocol (BGP) knowledge from your network team. Cost is significant. For production workloads with compliance requirements or sustained high-bandwidth data transfer, those trade-offs are usually acceptable. For a dev/test environment or proof-of-concept, they aren\u0026rsquo;t.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"site-to-site-vpn-cost-effective-alternative\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#site-to-site-vpn-cost-effective-alternative\" title=\"Site-to-Site VPN: Cost-Effective Alternative\"\u003eSite-to-Site VPN: Cost-Effective Alternative\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eSite-to-Site (S2S) VPN creates encrypted tunnels over the public internet. Setup takes hours rather than weeks, cost is a fraction of ExpressRoute, and it works without engaging a connectivity provider.\u003c/p\u003e\n\u003cp\u003eThe catch is performance variability. Throughput degrades under load, latency spikes during congestion periods, and encryption overhead adds up. 
For proof-of-concept environments, dev/test workloads, or bursty low-volume traffic, S2S VPN is the pragmatic choice. For production databases replicating continuously across the boundary, it usually isn\u0026rsquo;t enough.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"vnet-peering-cloud-only-hybrid\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#vnet-peering-cloud-only-hybrid\" title=\"VNet Peering: Cloud-Only Hybrid\"\u003eVNet Peering: Cloud-Only Hybrid\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-peering-overview\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eVNet peering\u003c/a\u003e connects Azure VNets across regions or subscription boundaries. If both sides run in Azure and you\u0026rsquo;re drawing a line between subscriptions rather than between cloud and datacenter, this is the simplest option: no gateways, no BGP, no provider contracts.\u003c/p\u003e\n\u003cp\u003eIt doesn\u0026rsquo;t solve the on-prem connectivity problem. Peering only works between Azure VNets.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"infrastructure-as-code-expressroute--aks\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#infrastructure-as-code-expressroute--aks\" title=\"Infrastructure as Code: ExpressRoute \u0026#43; AKS\"\u003eInfrastructure as Code: ExpressRoute + AKS\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhatever connectivity model you choose, infrastructure repeatability matters from day one. Deploying gateways, subnets, route tables, and AKS clusters manually works once and creates problems on the second environment. 
The Terraform configuration below covers the full stack: ExpressRoute gateway, private DNS zone, and AKS with Azure CNI and private cluster enabled.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Terraform configuration for ExpressRoute + AKS hybrid connectivity\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Variables and provider configuration assumed\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_resource_group\u0026#34; \u0026#34;hybrid\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;hybrid-aks-rg\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;westeurope\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# 
Virtual Network for AKS and hybrid connectivity\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_virtual_network\u0026#34; \u0026#34;aks_vnet\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-vnet\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  address_space\u003c/span\u003e       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.1.0.0/16\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Subnet for AKS nodes\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_subnet\u0026#34; \u0026#34;aks_nodes\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;aks-nodes-subnet\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  address_prefixes\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.1.0.0/22\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # /22 leaves room for 10 nodes at roughly 31 IPs each (Azure CNI, default 30 pods per node)\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Gateway subnet for ExpressRoute\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_subnet\u0026#34; \u0026#34;gateway\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GatewaySubnet\u0026#34;\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # Name must be exactly this\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
virtual_network_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  address_prefixes\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;10.1.255.0/27\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Public IP for ExpressRoute Gateway\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_public_ip\u0026#34; \u0026#34;er_gateway_ip\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;er-gateway-pip\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  allocation_method\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Static\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  sku\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# ExpressRoute Gateway\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_virtual_network_gateway\u0026#34; 
\u0026#34;er_gateway\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;er-gateway\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  type\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ExpressRoute\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  sku\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eip_configuration\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    name\u003c/span\u003e                          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;gateway-ip-config\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    public_ip_address_id\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_public_ip\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eer_gateway_ip\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    private_ip_address_allocation\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Dynamic\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    subnet_id\u003c/span\u003e                     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_subnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003egateway\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Connection from ExpressRoute Gateway to the pre-provisioned circuit\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Configure var.expressroute_circuit_id with your existing circuit resource ID:\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# var.expressroute_circuit_id = \u0026#34;/subscriptions/.../expressRouteCircuits/...\u0026#34;\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_virtual_network_gateway_connection\u0026#34; \u0026#34;onprem\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;er-onprem-connection\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e                   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e        \u003cspan 
class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  type\u003c/span\u003e                       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ExpressRoute\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_gateway_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network_gateway\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eer_gateway\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  express_route_circuit_id\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eexpressroute_circuit_id\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Private DNS Zone for internal services\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_private_dns_zone\u0026#34; \u0026#34;internal\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;internal.azure.local\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Link DNS zone to VNet\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_private_dns_zone_virtual_network_link\u0026#34; \u0026#34;aks\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;aks-vnet-link\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  private_dns_zone_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_private_dns_zone\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003einternal\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  virtual_network_id\u003c/span\u003e    \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_virtual_network\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_vnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  registration_enabled\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# AKS cluster with ExpressRoute connectivity\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_kubernetes_cluster\u0026#34; \u0026#34;aks\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;hybrid-aks\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  dns_prefix\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;hybrid-aks\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  kubernetes_version\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;1.31\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003edefault_node_pool\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;system\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    node_count\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e3\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    vm_size\u003c/span\u003e             \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard_D4s_v5\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    vnet_subnet_id\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_subnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_nodes\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    auto_scaling_enabled\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    min_count\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e3\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    max_count\u003c/span\u003e           \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"m\"\u003e10\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eidentity\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    type\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;SystemAssigned\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003enetwork_profile\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    network_plugin\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;azure\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    network_policy\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;calico\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    service_cidr\u003c/span\u003e       \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;10.2.0.0/16\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    dns_service_ip\u003c/span\u003e     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;10.2.0.10\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    load_balancer_sku\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  private_cluster_enabled\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"kt\"\u003etrue\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  depends_on\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan 
class=\"p\"\u003e[\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003eazurerm_virtual_network_gateway\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eer_gateway\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"p\"\u003e]\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Route table for on-prem traffic via ExpressRoute\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_route_table\u0026#34; \u0026#34;onprem_routes\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;onprem-routes\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e            \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan 
class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_resource_group\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ehybrid\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eroute\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    name\u003c/span\u003e                   \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;to-onprem-datacenter\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    address_prefix\u003c/span\u003e         \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;10.0.0.0/8\u0026#34;\u003c/span\u003e\u003cspan class=\"c1\"\u003e  # On-prem network range\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e    next_hop_type\u003c/span\u003e          \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;VirtualNetworkGateway\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  }\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\u003cspan 
class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Associate route table with AKS subnet\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_subnet_route_table_association\u0026#34; \u0026#34;aks_routes\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  subnet_id\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_subnet\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eaks_nodes\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  route_table_id\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_route_table\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eonprem_routes\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eid\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis Terraform configuration establishes the foundation for hybrid connectivity: ExpressRoute gateway, private DNS, and AKS with network policies. 
Customize address ranges, SKUs, and routing rules for your environment. In particular, narrow the \u003ccode\u003e10.0.0.0/8\u003c/code\u003e on-prem prefix so it does not overlap the \u003ccode\u003e10.2.0.0/16\u003c/code\u003e service CIDR or your VNet address space. Also note that Azure accepts only a VPN-type gateway as the next hop of a user-defined route; with ExpressRoute, on-prem prefixes are learned over BGP instead.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-arc-unified-kubernetes-management\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#azure-arc-unified-kubernetes-management\" title=\"Azure Arc: Unified Kubernetes Management\"\u003eAzure Arc: Unified Kubernetes Management\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAzure Arc extends Azure management to any Kubernetes cluster: on-prem, edge locations, or other clouds. It registers external clusters as Azure resources, enabling centralized management without forcing workload migration.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"what-arc-provides\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#what-arc-provides\" title=\"What Arc Provides\"\u003eWhat Arc Provides\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eArc-enabled Kubernetes clusters gain:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eUnified inventory\u003c/strong\u003e: View all clusters in Azure Resource Manager\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003ePolicy enforcement\u003c/strong\u003e: Azure Policy extends to Arc clusters\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGitOps deployment\u003c/strong\u003e: Flux configurations apply consistently\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMonitoring integration\u003c/strong\u003e: Azure Monitor collects metrics and logs\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRBAC integration\u003c/strong\u003e: Azure AD for cluster authentication\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eArc doesn\u0026rsquo;t move workloads to Azure. 
It extends Azure\u0026rsquo;s control plane to wherever your clusters run.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"onboarding-an-on-prem-cluster\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#onboarding-an-on-prem-cluster\" title=\"Onboarding an On-Prem Cluster\"\u003eOnboarding an On-Prem Cluster\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eConnecting an existing Kubernetes cluster to Arc requires cluster admin access and network connectivity to Azure endpoints.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Azure Arc onboarding script\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Requires: Azure CLI, kubectl, cluster admin kubeconfig\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eRESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;hybrid-infra-rg\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eCLUSTER_NAME\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;onprem-k8s-01\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eLOCATION\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;westeurope\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Login and set subscription\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz login\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz account \u003cspan class=\"nb\"\u003eset\u003c/span\u003e --subscription \u003cspan class=\"s2\"\u003e\u0026#34;your-subscription-id\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Create resource group if needed\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz group create --name \u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e --location \u003cspan class=\"nv\"\u003e$LOCATION\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Register Arc providers\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider register --namespace Microsoft.Kubernetes\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider register --namespace Microsoft.KubernetesConfiguration\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider register --namespace Microsoft.ExtendedLocation\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check registration state; re-run until both report Registered (can take several minutes)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider show -n Microsoft.Kubernetes -o table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz provider show -n Microsoft.KubernetesConfiguration -o table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Install Arc extensions\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz extension add --name connectedk8s\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz extension add --name k8s-configuration\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Connect cluster to Arc\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz connectedk8s connect \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group \u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --location \u003cspan class=\"nv\"\u003e$LOCATION\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --tags \u003cspan class=\"nv\"\u003eenvironment\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eproduction \u003cspan class=\"nv\"\u003edatacenter\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eonprem\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify connection\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz connectedk8s show \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group \u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --query \u003cspan class=\"s2\"\u003e\u0026#34;connectivityStatus\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eOnce connected, the cluster appears in the Azure portal alongside AKS clusters. 
Management operations (viewing workloads, applying policies, deploying via GitOps) work identically.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"policy-enforcement-across-environments\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#policy-enforcement-across-environments\" title=\"Policy Enforcement Across Environments\"\u003ePolicy Enforcement Across Environments\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure Policy for Kubernetes applies consistent governance rules across AKS and Arc clusters. Define policies once, enforce everywhere.\u003c/p\u003e\n\u003cp\u003eExample policy: require resource limits on all pods.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# pod-resource-limits-policy.yaml\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003econstraints.gatekeeper.sh/v1beta1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eK8sRequiredResources\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003erequire-pod-resource-limits\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ematch\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ekinds\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"nt\"\u003eapiGroups\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ekinds\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan 
class=\"p\"\u003e[\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;Pod\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e]\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003estaging\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eparameters\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003elimits\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003ecpu\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan 
class=\"l\"\u003ememory\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003erequests\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003ecpu\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e- \u003cspan class=\"l\"\u003ememory\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eAssign this constraint through Azure Policy, together with its constraint template, and it is enforced on both AKS and Arc-connected on-prem clusters. No duplicated configuration, no drift between environments.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"gitops-single-source-of-truth\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#gitops-single-source-of-truth\" title=\"GitOps: Single Source of Truth\"\u003eGitOps: Single Source of Truth\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eArc supports Flux-based GitOps configurations. Define cluster state in Git, and Flux reconciles each cluster against it. The \u003ccode\u003eaz k8s-configuration flux create\u003c/code\u003e command links a Git repository to one cluster; run it for each AKS and Arc cluster, pointing at the same repository. Changes sync automatically. 
Configuration drift gets corrected within minutes.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"dns-and-service-discovery-hybrid-resolution-without-complexity\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#dns-and-service-discovery-hybrid-resolution-without-complexity\" title=\"DNS and Service Discovery: Hybrid Resolution Without Complexity\"\u003eDNS and Service Discovery: Hybrid Resolution Without Complexity\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHybrid deployments need service discovery across boundaries. Pods in AKS must resolve on-prem services, and vice versa.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"approach-1-azure-private-dns-with-conditional-forwarding\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#approach-1-azure-private-dns-with-conditional-forwarding\" title=\"Approach 1: Azure Private DNS with Conditional Forwarding\"\u003eApproach 1: Azure Private DNS with Conditional Forwarding\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCreate a Private DNS zone in Azure, link it to your VNet, and configure on-prem DNS servers to conditionally forward queries for Azure domains to a resolver inside the VNet, such as an Azure DNS Private Resolver inbound endpoint or a DNS forwarder VM. Azure\u0026rsquo;s platform resolver at 168.63.129.16 answers only from inside a VNet, so on-prem servers cannot query it directly; the in-VNet resolver relays to it. AKS clusters inherit VNet DNS configuration automatically. On-prem services get DNS records that point to addresses reachable over ExpressRoute or VPN.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"approach-2-coredns-custom-forwarding\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#approach-2-coredns-custom-forwarding\" title=\"Approach 2: CoreDNS Custom Forwarding\"\u003eApproach 2: CoreDNS Custom Forwarding\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eFor cluster-level control, add a \u003ccode\u003ecoredns-custom\u003c/code\u003e ConfigMap that forwards specific domain queries to on-prem DNS servers. 
This is the right approach when on-prem services use a domain suffix that doesn\u0026rsquo;t overlap with Azure Private DNS zones, or when you need different forwarding behavior per cluster.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# CoreDNS custom configmap - forward internal corporate domain to on-prem resolver\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ev1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eConfigMap\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ecoredns-custom\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ekube-system\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003edata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ecorp.server\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"p\"\u003e|\u003c/span\u003e\u003cspan class=\"sd\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e    corp.example.com:53 {\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        errors\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        cache 30\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        forward . 
10.0.0.53 {\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e            prefer_udp\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e        }\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"sd\"\u003e    }\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eApply with \u003ccode\u003ekubectl apply -f coredns-custom.yaml\u003c/code\u003e. AKS merges the \u003ccode\u003ecoredns-custom\u003c/code\u003e ConfigMap into the CoreDNS configuration; if the change does not take effect, restart CoreDNS with \u003ccode\u003ekubectl -n kube-system rollout restart deployment coredns\u003c/code\u003e. For the reverse path, configure on-prem DNS to forward \u003ccode\u003e*.privatelink.blob.core.windows.net\u003c/code\u003e and similar zones to a resolver inside the VNet, which relays them to Azure\u0026rsquo;s virtual resolver at \u003ccode\u003e168.63.129.16\u003c/code\u003e; that address is not reachable from on-prem networks directly.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor note:\u003c/strong\u003e DNS is usually where hybrid setups produce the most subtle and hardest-to-debug failures. A pod resolves a name correctly in testing, then silently times out in production because the CoreDNS cache held a stale entry across a VPN reconnect. Keep TTLs short for cross-boundary records and verify the full resolver chain with \u003ccode\u003enslookup\u003c/code\u003e from inside the cluster, not just from a workstation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eKey principle:\u003c/strong\u003e Avoid split-horizon DNS designs where the same name resolves differently depending on source location. 
Use Azure Private DNS as the primary zone authority where possible, and fall back to conditional forwarding only for domains you don\u0026rsquo;t control.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"identity-across-boundaries-federation-without-duplication\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#identity-across-boundaries-federation-without-duplication\" title=\"Identity Across Boundaries: Federation Without Duplication\"\u003eIdentity Across Boundaries: Federation Without Duplication\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHybrid deployments shouldn\u0026rsquo;t duplicate identity systems. Azure AD (now Microsoft Entra ID) integration extends to Arc clusters, providing centralized authentication and significantly reducing the number of credential systems to maintain.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"service-principals-for-cross-environment-access\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#service-principals-for-cross-environment-access\" title=\"Service Principals for Cross-Environment Access\"\u003eService Principals for Cross-Environment Access\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eApplications running on-prem that need access to Azure services (Key Vault, storage accounts, managed databases) can use Azure AD service principals with certificate-based authentication. 
Create a service principal, assign the appropriate role, and mount the certificate as a Kubernetes secret in the on-prem pod.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Create service principal with a self-signed certificate and assign Key Vault access\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz ad sp create-for-rbac \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"s2\"\u003e\u0026#34;onprem-app-sp\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --create-cert \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --role \u003cspan class=\"s2\"\u003e\u0026#34;Key Vault Secrets User\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --scopes \u003cspan class=\"s2\"\u003e\u0026#34;/subscriptions/\u0026lt;sub-id\u0026gt;/resourceGroups/\u0026lt;rg\u0026gt;/providers/Microsoft.KeyVault/vaults/\u0026lt;vault\u0026gt;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis works reliably, but carries ongoing maintenance: certificate rotation, secret distribution across on-prem clusters, and audit trails that span two systems. 
For new workloads, federated credentials are worth the initial setup complexity.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"federated-credentials-for-workload-identity\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#federated-credentials-for-workload-identity\" title=\"Federated Credentials for Workload Identity\"\u003eFederated Credentials for Workload Identity\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eWorkload identity federation\u003c/a\u003e allows on-prem Kubernetes service accounts to authenticate as Azure AD identities without long-lived secrets. The on-prem cluster\u0026rsquo;s OIDC issuer endpoint issues tokens for service accounts; Azure AD trusts that issuer and exchanges the projected token for an Azure AD access token.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Register the on-prem cluster\u0026#39;s OIDC issuer with an Azure AD app registration\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz ad app federated-credential create \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --id \u0026lt;app-registration-id\u0026gt; \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --parameters \u003cspan class=\"s1\"\u003e\u0026#39;{\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e    \u0026#34;name\u0026#34;: 
\u0026#34;onprem-k8s-workload\u0026#34;,\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e    \u0026#34;issuer\u0026#34;: \u0026#34;https://\u0026lt;your-onprem-oidc-issuer\u0026gt;\u0026#34;,\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e    \u0026#34;subject\u0026#34;: \u0026#34;system:serviceaccount:production:my-app\u0026#34;,\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e    \u0026#34;audiences\u0026#34;: [\u0026#34;api://AzureADTokenExchange\u0026#34;]\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s1\"\u003e  }\u0026#39;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe on-prem cluster needs to expose its OIDC discovery document at a publicly reachable (or Azure-reachable) endpoint. That\u0026rsquo;s the step that most commonly blocks initial setup. Verify the discovery document is accessible before spending time debugging token exchange errors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor note:\u003c/strong\u003e Migrating workloads from service principal secrets to federated credentials removes certificate rotation as a recurring task entirely. Secret sprawl across on-prem clusters was one of the more uncomfortable findings in the security reviews I\u0026rsquo;ve participated in. 
Federated credentials make the problem structurally impossible rather than just less likely.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"operational-consistency-making-hybrid-work-long-term\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#operational-consistency-making-hybrid-work-long-term\" title=\"Operational Consistency: Making Hybrid Work Long-Term\"\u003eOperational Consistency: Making Hybrid Work Long-Term\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHybrid deployments fail when operational practices diverge between environments. Consistency requires deliberate effort.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"monitoring-and-observability\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#monitoring-and-observability\" title=\"Monitoring and Observability\"\u003eMonitoring and Observability\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eUse \u003ca href=\"https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure Monitor Container Insights\u003c/a\u003e for both AKS and Arc clusters. 
Install the extension on Arc-connected clusters explicitly (AKS picks it up automatically with the add-on flag):\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz k8s-extension create \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name azuremonitor-containers \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-type connectedClusters \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-name onprem-k8s-01 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group hybrid-infra-rg \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --extension-type Microsoft.AzureMonitor.Containers \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --configuration-settings \u003cspan class=\"nv\"\u003elogAnalyticsWorkspaceResourceID\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u0026lt;workspace-resource-id\u0026gt;\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eMetrics, logs, and cluster health flow to a single Log Analytics workspace regardless of where the cluster runs. 
A simple Kusto Query Language (KQL) query surfaces pod restart counts across all environments at once:\u003c/p\u003e\n\u003cpre tabindex=\"0\"\u003e\u003ccode class=\"language-kql\" data-lang=\"kql\"\u003eKubePodInventory\n| where TimeGenerated \u0026gt; ago(24h)\n// ContainerRestartCount is cumulative per container, so take its highest value before summing\n| summarize ContainerRestarts=max(ContainerRestartCount) by ClusterName, Namespace, Name, ContainerName\n| summarize Restarts=sum(ContainerRestarts) by ClusterName, Namespace\n| order by Restarts desc\n\u003c/code\u003e\u003c/pre\u003e\u003cp\u003eHaving AKS and on-prem clusters reporting to the same workspace makes cross-environment incident correlation significantly faster.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"update-management\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#update-management\" title=\"Update Management\"\u003eUpdate Management\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/agent-upgrade\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eAzure Arc cluster autoupgrade\u003c/a\u003e reduces the operational gap between AKS (where upgrades are automated and well-understood) and self-managed on-prem clusters (where upgrades have historically been postponed due to complexity). You can define upgrade channels, schedule maintenance windows, and receive notifications through the same Azure portal used for AKS fleet management.\u003c/p\u003e\n\u003cp\u003eThis doesn\u0026rsquo;t eliminate the need for upgrade validation in staging environments. But it removes the operational friction that leads to on-prem clusters running three minor versions behind production AKS.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cost-and-resource-tracking\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#cost-and-resource-tracking\" title=\"Cost and Resource Tracking\"\u003eCost and Resource Tracking\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eArc-enabled clusters report resource utilization to Azure. 
Tag clusters consistently with environment, cost-center, and region labels using \u003ccode\u003eaz connectedk8s update\u003c/code\u003e. Use Azure Cost Management to track total Kubernetes spend across cloud and on-prem, enabling accurate chargeback and budget planning.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"key-takeaways\"\u003e\u003ca href=\"/posts/hybrid-aks-on-prem-azure-arc/#key-takeaways\" title=\"Key Takeaways\"\u003eKey Takeaways\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eHybrid AKS deployments succeed when you:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eChoose the right connectivity\u003c/strong\u003e: ExpressRoute for production, S2S VPN for dev/test, VNet peering for Azure-only scenarios\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse Azure Arc for unified management\u003c/strong\u003e: Extend Azure\u0026rsquo;s control plane rather than building parallel tooling\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eEnforce policies consistently\u003c/strong\u003e: Azure Policy + GitOps eliminate configuration drift\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSimplify DNS\u003c/strong\u003e: Azure Private DNS with conditional forwarding avoids complexity\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eFederate identity\u003c/strong\u003e: Azure AD integration reduces secret sprawl and management overhead\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMonitor everything in one place\u003c/strong\u003e: Azure Monitor provides visibility across environments\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eHybrid infrastructure doesn\u0026rsquo;t have to mean duplicated effort. Arc, proper networking, and consistent operational practices make multi-environment Kubernetes manageable.\u003c/p\u003e\n\u003cp\u003eThe goal isn\u0026rsquo;t cloud purity. 
It\u0026rsquo;s operational efficiency wherever your workloads run.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-03-25T17:00:00+01:00","id":"https://daily-devops.net/posts/hybrid-aks-on-prem-azure-arc/","language":"en","summary":"Practical patterns for connecting AKS to on-prem: ExpressRoute, VPN connectivity, Azure Arc management, DNS resolution, and identity federation.","tags":["hybrid","azure","kubernetes","cloud","devops","onprem","infrastructure"],"title":"Hybrid AKS: Bridging Cloud and On-Prem with Azure Arc","url":"https://daily-devops.net/posts/hybrid-aks-on-prem-azure-arc/"}],"language":"en","title":"Infrastructure Engineering and Cloud Operations on Daily DevOps \u0026 .NET","version":"https://jsonfeed.org/version/1.1"}