Azure Kubernetes Service (AKS)

Azure Kubernetes Service reduces the operational burden of running Kubernetes by managing the control plane — API server, etcd, scheduler — so you don’t have to. What it doesn’t manage is everything else: node pool configuration, workload identity, storage class selection, networking topology, upgrade timing, cost governance, and security posture. The “managed” label covers a narrow slice of what actually requires operational attention.

The articles in this collection address the decisions AKS leaves to platform engineers. Identity articles cover Workload Identity Federation: why it’s more complex than service account tokens, and where credentials still leak despite federation. Storage articles examine what happens to PVCs when node pools get replaced and which storage classes survive real restore scenarios. Networking articles cover multi-cluster hub-spoke topologies and the point at which mesh complexity becomes justified rather than premature.

Cluster upgrades are a recurring theme because the documentation describes them optimistically. The upgrade articles cover cordon and drain behavior, Pod Disruption Budget configuration that actually prevents downtime rather than just satisfying a checkbox, and multi-node-pool rollout strategies that make upgrades reproducible rather than heroic one-off events.
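To make the Pod Disruption Budget point concrete, here is a minimal Python sketch of the arithmetic behind a budget: how many voluntary evictions a drain is allowed to perform at a given moment. It is a simplified model rather than the Kubernetes controller itself. The function names and parameters are hypothetical, and it assumes the documented rule that percentage values round up to the nearest whole pod.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, total: int) -> int:
    """Turn an integer or an 'N%' string into a pod count (percentages round up)."""
    if isinstance(value, str):
        if not value.endswith("%"):
            raise ValueError("string values must look like '50%'")
        percent = int(value[:-1])
        # Integer ceiling division avoids floating-point surprises.
        return (total * percent + 99) // 100
    return value


def allowed_disruptions(
    expected_pods: int,
    healthy_pods: int,
    min_available: Optional[IntOrPercent] = None,
    max_unavailable: Optional[IntOrPercent] = None,
) -> int:
    """Simplified model: evictions permitted right now under a single budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = resolve(min_available, expected_pods)
    else:
        desired_healthy = expected_pods - resolve(max_unavailable, expected_pods)
    # A budget never yields a negative allowance; zero means drain must wait.
    return max(0, healthy_pods - max(0, desired_healthy))


if __name__ == "__main__":
    # A budget equal to the replica count leaves no room for eviction,
    # so a node drain stalls until it times out or someone intervenes.
    print(allowed_disruptions(3, 3, min_available=3))
    # Allowing one pod to be unavailable lets the drain proceed one pod at a time.
    print(allowed_disruptions(3, 3, max_unavailable=1))
```

The two calls at the bottom contrast a budget that satisfies the checkbox but blocks every drain with one that lets an upgrade make progress. A single-replica workload with min_available=1 falls into the first case, which is why replica count and budget have to be designed together.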

Cost governance gets specific attention because resource limits are not a cost strategy. Node pool design decisions, spot VM integration without reliability regressions, and FinOps tagging that produces actionable attribution are each covered with the trade-offs named explicitly.

At scale — clusters above a few hundred nodes — AKS behavior changes in ways that smaller clusters don’t expose. etcd limits under high object churn, network saturation, observability overhead that compounds with cluster size, and cost patterns that emerge from early architectural decisions are addressed in articles grounded in production experience rather than constructed scenarios.

AKS Cluster Upgrades: Zero-Downtime Operations That Actually Work

AKS cluster upgrades involve node replacement and pod eviction, which can cause service disruption without proper controls. This article explains cordon and drain mechanics, Pod Disruption Budget configuration, and multi-node-pool rollout strategies with validation-driven automation for reliable zero-downtime upgrades.