AKS Architecture & Operations — The Complete Series
AKS documentation will get you to a running cluster. It won’t tell you why the pod that authenticates fine in staging gets a 401 in production. It won’t explain why upgrading a 50-node cluster at 2 AM felt fine but a 300-node upgrade at noon caused cascading evictions. It won’t show you which storage class to avoid when your database needs to survive node pool replacements.
This series covers the operational reality — the decisions that distinguish AKS clusters that run quietly in production from clusters that generate 3 AM alerts. Nine articles, each examining a specific architectural domain with the specificity that matters when something breaks.
Why AKS Operations Is Different
Microsoft manages the AKS control plane. That sounds like less work, and in some ways it is — you don’t patch etcd, you don’t replace failed control plane VMs, you don’t worry about API server certificate rotation. What it doesn’t mean is that running AKS in production is simple or that managed Kubernetes hands you a reliable platform and steps aside.
Every node pool configuration decision is yours. Every storage class binding, every PVC lifecycle policy, every decision about which node pool hosts which workload — that’s on you. RBAC spans three separate systems simultaneously: Kubernetes RBAC, Azure RBAC, and Azure AD. A misconfiguration in any one of them produces an access failure that looks identical from the application’s perspective. The documentation will show you how to configure each system in isolation. It will not show you why they interact in non-obvious ways under specific conditions, or what the failure mode looks like when you get the federation configuration slightly wrong.
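The federation gap described above usually comes down to two string fields. A minimal Terraform sketch (resource names, namespace, and service account are hypothetical): the issuer must be the OIDC issuer URL of the specific cluster, and the subject must match the Kubernetes namespace and service account exactly, so a credential copied from a staging configuration silently fails against production's issuer.

```hcl
# Hypothetical names throughout -- adapt to your own identity and cluster resources.
resource "azurerm_federated_identity_credential" "payments" {
  name                = "payments-prod"
  resource_group_name = azurerm_resource_group.platform.name
  parent_id           = azurerm_user_assigned_identity.payments.id

  # Fixed audience for AKS workload identity token exchange.
  audience = ["api://AzureADTokenExchange"]

  # Must be THIS cluster's OIDC issuer. Staging and production issuers differ,
  # which is why a credential copied between environments authenticates in one
  # and returns 401s in the other.
  issuer = azurerm_kubernetes_cluster.prod.oidc_issuer_url

  # Must match the namespace and service account name exactly.
  subject = "system:serviceaccount:payments:payments-sa"
}
```

Validation is mechanical: compare the issuer against `az aks show --query oidcIssuerProfile.issuerUrl` and the subject against the annotated service account before any workload depends on the identity.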
Networking is another area where “managed” has a narrower meaning than the word implies. Microsoft manages the control plane networking. Your VNet, your subnets, your IP address planning, your DNS configuration, your ingress architecture — all of it is your responsibility, and the decisions compound. IP exhaustion caused by node pool scaling is a common production incident that no amount of control plane management prevents. Private cluster DNS resolution breaks in ways that take hours to diagnose if you haven’t encountered the pattern before.
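The subnet math behind that incident class is worth doing before the pool exists. A back-of-the-envelope sketch with hypothetical numbers, assuming traditional Azure CNI, where each node pre-allocates one IP per pod slot plus one for itself:

```shell
#!/usr/bin/env bash
# Traditional Azure CNI: every node claims one IP per pod slot, plus its own.
# Numbers here are hypothetical -- substitute your pool's autoscaler max and --max-pods.
max_nodes=50    # autoscaler max for the node pool
max_pods=30     # --max-pods per node (the Azure CNI default)

ips_needed=$(( max_nodes * (max_pods + 1) ))
echo "IPs required at full scale: $ips_needed"
```

For these numbers that is 1,550 addresses: more than a /22 provides (1,019 usable after Azure reserves five per subnet), so this pool needs at least a /21, and that is before accounting for other pools or services sharing the subnet.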
Upgrades are perhaps the clearest illustration of the gap between documentation and reality. The documentation describes upgrade mechanics accurately. What it doesn’t describe is how Pod Disruption Budget misconfigurations interact with cluster autoscaler behavior during node pool drain, why the timing of upgrades relative to workload peak matters more than most teams expect, or how a PDB that looks correct on paper blocks drain indefinitely on a cluster that’s handling real traffic. Managed Kubernetes handles the control plane upgrade. The workload upgrade is a careful orchestration problem that the platform does not solve for you.
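The blocked-drain pattern is easy to reproduce. A deliberately broken sketch (names hypothetical): if the Deployment behind this selector runs two replicas, the budget below permits zero voluntary disruptions, and every cordon-and-drain during an upgrade waits indefinitely.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb            # hypothetical
spec:
  # With a 2-replica Deployment this allows 0 evictions: drain blocks forever.
  # Either run more replicas than minAvailable, or use maxUnavailable: 1 instead.
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

`kubectl get pdb` exposes the problem directly: an ALLOWED DISRUPTIONS column showing 0 is the number to check before any upgrade window opens.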
Storage is where the word “managed” disappears entirely. Azure manages the underlying disk and file services. AKS provides the CSI drivers. Everything between your application and the storage backend — PVC binding, reclaim policies, volume expansion behavior, backup orchestration, behavior during node failure or node pool deletion — is configuration you own. Teams that treat storage as a detail find out it isn’t when a node pool replacement deletes volumes that were bound to nodes rather than to the cluster.
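That failure mode usually traces back to `reclaimPolicy`. The built-in AKS storage classes default to Delete, which removes the underlying Azure Disk when the PVC goes away. A sketch of a Retain-based class (the name and SKU are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-retain   # illustrative name
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
# Retain keeps the PV and the Azure Disk after the PVC is deleted; Delete
# (the default in the built-in classes) removes both.
reclaimPolicy: Retain
allowVolumeExpansion: true
# Bind the volume in whatever zone the pod actually schedules into, rather
# than provisioning before the scheduler has chosen a node.
volumeBindingMode: WaitForFirstConsumer
```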
Cost is a dimension that managed Kubernetes actively obscures. The control plane is free at most tiers. Node pool costs scale with what you configure, and the configuration space is large: VM SKU selection, autoscaler min/max bounds, system versus user node pool separation, spot VM integration, pod density targets. None of these have obviously correct values. All of them interact. Teams that inherit clusters often inherit cost structures that made sense at a different scale or for a different workload profile, and reversing those decisions requires careful sequencing to avoid downtime.
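Most of those knobs live on the node pool resource itself. A hedged Terraform sketch of a spot pool for interruption-tolerant work (names and bounds are hypothetical, and attribute names vary across `azurerm` provider majors; newer versions rename `enable_auto_scaling` to `auto_scaling_enabled`):

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "spot_batch" {
  name                  = "spotbatch"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.prod.id   # hypothetical reference
  vm_size               = "Standard_D4s_v5"

  priority        = "Spot"
  eviction_policy = "Delete"
  spot_max_price  = -1          # cap at the current on-demand price

  enable_auto_scaling = true
  min_count           = 0       # scale to zero when no batch work is queued
  max_count           = 10

  # Azure taints spot nodes automatically; only workloads that tolerate the
  # taint (and eviction) land here, keeping latency-sensitive services off.
  node_taints = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
}
```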
The happy paths in the documentation work. They work because they’re constructed to work. Production clusters encounter the edges — the configuration combinations, the scale thresholds, the timing sensitivities — that happy paths don’t cover. This series is about the edges.
What This Series Covers
Pod Identity & Access Control in AKS: What Actually Breaks starts with identity because identity failures are the most common source of production incidents. Workload Identity Federation eliminates credential lifecycle problems but introduces configuration complexity spanning three separate RBAC systems — Kubernetes RBAC, Azure RBAC, and Azure AD permissions. The article explains where credentials still leak despite federation, how the three layers interact and fail, and validation patterns that catch misconfigurations before they become incidents.
Storage Architecture & Stateful Workloads in AKS addresses what most AKS guides skip: what actually happens to your data when a node gets replaced. PVC/PV architecture, Azure Disk versus Azure Files performance trade-offs, Velero backup configurations that survive real restore scenarios, and multi-cluster replication patterns for production stateful workloads.
AKS Cost Optimization: Resource Governance That Actually Works covers the gap between “set resource limits” and actually controlling spend at scale. Pod density strategies, node pool design decisions that compound over time, spot VM integration without reliability regressions, and FinOps tagging that produces actionable cost attribution rather than unread dashboards.
Multi-AKS Cluster Networking & Hub-Spoke Topology examines what happens to networking when you move from one cluster to many. VNet peering patterns, hub-spoke routing, cross-cluster DNS resolution, shared ingress options, and — critically — the decision criteria for when mesh complexity becomes justified rather than premature.
AKS Cluster Upgrades: Zero-Downtime Operations That Actually Work covers upgrade mechanics that documentation describes optimistically. Cordon and drain behavior, Pod Disruption Budget configuration that actually prevents service disruption rather than providing theater-level protection, multi-node-pool rollout strategies, and validation-driven automation that makes upgrades reproducible rather than heroic.
Container Registry & Image Security in AKS Deployments covers ACR hardening beyond the basics. A production-ready sequence: vulnerability scanning, image signing with Notation, RBAC scoping, private endpoints, policy enforcement through Azure Policy and admission controllers, and geo-replication strategies with clear trade-offs explained.
AKS Disaster Recovery: Why Your Untested Backup Will Fail addresses the gap between having backups and having a tested recovery plan. Velero configuration, realistic RTO/RPO targets that match business risk rather than wishful thinking, restore testing procedures that catch problems before outages, and multi-region failover steps your team can actually execute under pressure.
Hybrid AKS: Bridging Cloud and On-Prem with Azure Arc covers the operational patterns for organizations running Kubernetes across cloud and on-premises simultaneously. ExpressRoute and VPN connectivity, Azure Arc for unified management across heterogeneous environments, consistent policy enforcement, DNS resolution, and identity federation without duplicating systems.
AKS at Scale: Hard-Won Lessons from 1000+ Node Clusters closes the series with what changes when clusters grow large enough that the platform itself becomes the bottleneck. etcd limits under high object churn, network saturation at scale, observability overhead that compounds with cluster size, and cost spirals that emerge from architectural decisions that seemed fine at 50 nodes.
Who This Is For
Platform engineers and infrastructure-focused developers responsible for AKS clusters in production — or teams about to inherit that responsibility. Each article assumes you’ve run AKS before and want operational depth, not introductory setup instructions.
The series covers Terraform, Bicep, kubectl, and Azure CLI patterns throughout. Examples are grounded in production scenarios rather than constructed to demonstrate features.
How These Articles Were Written
Each article in this series is based on production experience — clusters that handled real traffic, failed in real ways, and required real fixes under time pressure. That distinction matters for what you’ll find here and what you won’t.
Production experience means the failure patterns are specific. Not “storage can be tricky” but which storage class binding decisions survive node pool replacements and which don’t. Not “upgrades can cause downtime” but which combination of PDB configuration and autoscaler behavior produces an indefinitely blocked drain. Not “identity is complex” but the exact configuration gap in Workload Identity Federation that causes silent auth failures in one environment and not another. The specificity isn’t for its own sake — it’s the difference between an article that confirms your intuition and one that actually changes what you configure next.
What production experience doesn’t mean is that every approach here is the only valid one. Large-scale AKS operation involves genuine trade-offs — between cost and resilience, between operational simplicity and flexibility, between standardization and workload-specific tuning. The articles explain the reasoning behind recommendations rather than just stating them, because the reasoning is what lets you adapt the approach to your constraints. A node pool design that works for a batch processing workload is wrong for a latency-sensitive API, and the article on cost governance explains why rather than presenting a single correct answer.
The articles were not written to showcase features or to demonstrate that AKS has a solution for every problem. Some of them document problems that AKS makes harder than it should be, and say so directly. If a particular architectural pattern has a known failure mode at scale, that failure mode appears in the article rather than in a footnote or an FAQ three pages into the documentation. If a feature has a meaningful limitation that affects how you should configure it, that limitation is in the main text, not in a callout box labeled “note.”
The goal is for these articles to be the thing you read before a production incident rather than the thing you find during one.
Where to Start
Read in published order if you’re building out AKS infrastructure from scratch — identity and storage are foundational, and later articles reference earlier concepts. Jump to specific articles if you’re dealing with an immediate operational problem: the titles are specific enough that the right article for your situation should be obvious.
The scale article at the end is worth reading early if your cluster is already growing or if you’re designing for growth — some architectural decisions made at 50 nodes are expensive to reverse at 500.
