AKS Cluster Upgrades: Zero-Downtime Operations That Actually Work
AKS cluster upgrades are routine maintenance, but executing them without dropping traffic or losing state is the operational challenge that separates theory from production reality. Every Kubernetes version upgrade involves replacing nodes, which means evicting pods, draining workloads, and hoping your assumptions about resilience hold true under pressure.
I have participated in dozens of AKS upgrades across production clusters ranging from 10 to 500+ nodes. The pattern is consistent: teams that treat upgrades as a checkbox operation eventually experience an outage. Teams that understand the underlying mechanics and configure explicit constraints rarely do.
This article covers the real mechanics: how cordon and drain actually work, why Pod Disruption Budgets exist, and how to orchestrate multi-node-pool rollouts with automation that survives contact with production.
The problem: uncontrolled node drains cause cascading failures
When you upgrade an AKS cluster, Azure replaces nodes with new VMs running the updated Kubernetes version. That replacement process triggers pod eviction. Without proper controls, evictions happen simultaneously across multiple nodes, stateful workloads lose quorum, and traffic drops because there are no healthy replicas left to serve requests.
The default behavior is optimistic: Kubernetes assumes your workloads are designed for failure. But production workloads are rarely that resilient. Databases need time to transfer leadership, message queues need to flush buffers, and stateless apps still need at least one replica running to handle incoming connections.
Author note: The official AKS upgrade documentation covers the mechanics, but it does not emphasize how quickly things go wrong without proper constraints. I have seen a three-minute upgrade window turn into a two-hour incident because nobody configured Pod Disruption Budgets.
Uncontrolled drains create several failure modes:
- Data loss: Stateful workloads evicted before flushing state to disk or replicating to peers.
- Service interruption: All replicas terminated before new ones become ready.
- Cascading failures: Dependent services time out waiting for unavailable backends, triggering retries that amplify load.
The solution is not to avoid upgrades. The solution is to control the eviction process with explicit constraints that match your workload requirements.
Cordon and drain mechanics: what actually happens
The Kubernetes eviction API follows a three-step process when draining a node:
- Cordon: Mark the node as unschedulable. New pods will not be placed on this node, but existing pods continue running.
- Evict: Send termination signals to all pods on the node, respecting grace periods and Pod Disruption Budgets (PDBs).
- Wait: Block until all pods have terminated or the drain timeout expires.
AKS automates this process during upgrades, but you can trigger it manually using kubectl for maintenance or troubleshooting:
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
The --ignore-daemonsets flag prevents drain from failing on DaemonSet pods, which are designed to run on every node and will be recreated automatically. The --delete-emptydir-data flag allows drain to proceed even if pods use emptyDir volumes, which are ephemeral and will be lost.
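When you cordon or drain nodes by hand, remember that the node stays unschedulable until you explicitly uncordon it. A quick check and cleanup, reusing the same placeholder node name as above:
# A cordoned node shows SchedulingDisabled in its STATUS column
kubectl get nodes
# Return the node to service once maintenance or testing is finished
kubectl uncordon <node-name>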
For AKS automated upgrades, you can configure the drain behavior per node pool using rolling upgrade settings:
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name myNodePool \
  --max-surge 33% \
  --drain-timeout 45 \
  --node-soak-duration 5
The --drain-timeout parameter (in minutes) controls how long AKS waits for pods to terminate before force-killing them. The --node-soak-duration (in minutes) adds a stabilization period after each node upgrade before proceeding to the next. Microsoft recommends --max-surge 33% for production workloads.
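After updating a pool, it is worth confirming the settings actually took effect before relying on them during an upgrade. A minimal check, reusing the resource names from the example above (the exact field names under upgradeSettings can vary slightly between CLI versions):
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name myNodePool \
  --query upgradeSettings \
  --output json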
Manual drain remains useful for pre-maintenance validation, testing PDB configurations, or debugging eviction failures before committing to a full cluster upgrade.
Pod Disruption Budgets: the safety mechanism you should always configure
A Pod Disruption Budget (PDB) defines the minimum number of pods that must remain available during voluntary disruptions like node drains. PDBs do not prevent involuntary disruptions like node crashes or resource exhaustion, but they block evictions that would violate availability constraints.
PDBs are defined using either minAvailable or maxUnavailable:
- minAvailable: The minimum number of pods (or percentage) that must remain running during a disruption.
- maxUnavailable: The maximum number of pods (or percentage) that can be unavailable during a disruption.
Example PDB for a three-replica deployment that must keep at least two replicas running:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
With this PDB in place, drain will evict only one pod at a time, waiting for a replacement to become ready before proceeding to the next eviction. If no replacement becomes ready (for example, due to resource constraints or image pull failures), the drain blocks until the timeout expires.
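Once the PDB is applied, you can confirm what it will actually permit before any drain starts. Using the names from the manifest above:
# ALLOWED DISRUPTIONS shows how many pods a drain may evict right now
kubectl get pdb myapp-pdb -n production
# describe adds the current/desired healthy counts and recent disruption events
kubectl describe pdb myapp-pdb -n production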
PDBs are particularly critical for:
- Stateful workloads: Databases, message queues, and distributed systems that require quorum.
- Low-replica deployments: Services with two or three replicas where losing one pod reduces capacity significantly.
- Long startup times: Workloads that take minutes to initialize and become ready.
Practical PDB configuration advice:
- Set minAvailable: 1 for stateless services with two replicas.
- Set minAvailable: N-1 for N-replica stateful services that tolerate one failure (for example, three-node etcd allows minAvailable: 2).
- Avoid minAvailable: N (all replicas), which blocks drain indefinitely and prevents upgrades.
- Use percentages for large replica counts: minAvailable: 75% for a 10-replica deployment allows up to 2-3 pods to be evicted simultaneously.
Author tip: Before any upgrade, run kubectl get pdb -A and verify that no PDB has ALLOWED DISRUPTIONS showing zero. A PDB with zero allowed disruptions will block node drain indefinitely, and your upgrade will hang until the drain timeout expires or you manually intervene.
PDBs only apply to voluntary disruptions. Involuntary disruptions such as node crashes or resource exhaustion bypass PDBs entirely, and the affected pods go down immediately.
Workload categories: stateless, stateful, DaemonSets
Different workload types require different upgrade strategies. A one-size-fits-all approach causes either unnecessary downtime (overly conservative) or unexpected failures (overly aggressive).
Stateless workloads
Stateless services like web frontends, API gateways, and workers can tolerate rapid eviction as long as at least one replica remains available. Configure PDBs with minAvailable: 1 or maxUnavailable: N-1 to allow fast rollouts while maintaining service availability.
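For simple cases you do not even need a manifest; kubectl can create the budget imperatively. A minimal sketch for a two-replica stateless service, assuming a hypothetical deployment labeled app=myapp-frontend in the production namespace:
kubectl create poddisruptionbudget myapp-frontend-pdb \
  --namespace production \
  --selector app=myapp-frontend \
  --min-available 1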
Stateful workloads
Databases, message queues, and distributed storage systems require careful sequencing. Evicting multiple replicas simultaneously can cause quorum loss, split-brain scenarios, or data corruption.
Best practices for stateful workloads:
- Set conservative PDBs that preserve quorum (for example, minAvailable: 2 for a three-node cluster).
- Configure long grace periods (60+ seconds) to allow state transfer and leadership handoff (see the sketch after this list).
- Use StatefulSets with proper readiness probes to ensure new replicas are fully initialized before old ones are terminated.
- Test upgrade scenarios in staging with realistic data volumes and latency.
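The grace-period advice above usually means raising terminationGracePeriodSeconds in the pod template. A minimal sketch with kubectl patch, assuming a hypothetical StatefulSet named my-db in the production namespace:
# Give the database 120 seconds between SIGTERM and SIGKILL for state transfer and leadership handoff
kubectl patch statefulset my-db -n production --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":120}}}}'
Note that changing the pod template triggers its own rolling restart, so apply this well before the maintenance window rather than during it.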
DaemonSets
DaemonSets run one pod on every node (or on every node that matches their node selector). Examples include logging agents, monitoring exporters, and network plugins. Drain skips DaemonSet pods rather than evicting them; the pod terminates when the old node is removed and is recreated automatically on the replacement node.
DaemonSets do not require PDBs because they are designed to tolerate single-node failures. Use the --ignore-daemonsets flag during manual drain to skip these pods.
Multi-node-pool rollout strategies: graduated risk management
AKS supports multiple node pools within a single cluster. Each pool can have different VM sizes, availability zones, and upgrade schedules. Multi-node-pool architectures enable graduated rollouts that reduce risk by upgrading non-critical workloads first.
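Before planning the sequence, it helps to see which pools exist and which version each one is currently running. A quick inventory, reusing the cluster names from the earlier examples:
az aks nodepool list \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --query "[].{name:name, mode:mode, version:orchestratorVersion, count:count}" \
  --output table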
Recommended upgrade sequence:
- Dev/test pools first: Upgrade node pools running non-production workloads to validate the new Kubernetes version and catch compatibility issues early.
- Stateless application pools: Upgrade pools running stateless services that can tolerate brief capacity reductions.
- Stateful application pools last: Upgrade pools running databases and stateful services only after validating the rollout on stateless workloads.
Example multi-pool upgrade using Azure CLI:
#!/bin/bash
set -euo pipefail
RESOURCE_GROUP="myResourceGroup"
CLUSTER_NAME="myAKSCluster"
TARGET_VERSION="1.29.2"
# Configure rolling upgrade settings for production safety
MAX_SURGE="33%" # Microsoft recommended for production
DRAIN_TIMEOUT="45" # Minutes to wait for pod eviction
NODE_SOAK="5" # Minutes to stabilize after each node
# Upgrade control plane first (does not affect workloads)
echo "Upgrading control plane to ${TARGET_VERSION}..."
az aks upgrade \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --kubernetes-version "$TARGET_VERSION" \
  --control-plane-only \
  --yes
# Upgrade node pools in sequence: system -> stateless -> stateful
NODE_POOLS=("system" "stateless" "stateful")
for POOL in "${NODE_POOLS[@]}"; do
  echo "Upgrading node pool: ${POOL}..."
  # Verify current node count and health
  CURRENT_COUNT=$(az aks nodepool show \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$POOL" \
    --query count -o tsv)
  echo "Current node count for ${POOL}: ${CURRENT_COUNT}"
  # Configure rolling upgrade settings before upgrade
  az aks nodepool update \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$POOL" \
    --max-surge "$MAX_SURGE" \
    --drain-timeout "$DRAIN_TIMEOUT" \
    --node-soak-duration "$NODE_SOAK"
  # Upgrade node pool
  az aks nodepool upgrade \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$POOL" \
    --kubernetes-version "$TARGET_VERSION" \
    --yes
  # Wait for upgrade to complete
  echo "Waiting for ${POOL} upgrade to complete..."
  az aks nodepool wait \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$POOL" \
    --updated
  # Verify upgraded node count matches original
  UPGRADED_COUNT=$(az aks nodepool show \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$POOL" \
    --query count -o tsv)
  if [ "$CURRENT_COUNT" != "$UPGRADED_COUNT" ]; then
    echo "ERROR: Node count mismatch for ${POOL}. Expected ${CURRENT_COUNT}, got ${UPGRADED_COUNT}"
    exit 1
  fi
  echo "Pool ${POOL} upgraded successfully."
  echo "---"
done
echo "All node pools upgraded to ${TARGET_VERSION}."
This script upgrades the control plane first (which is a non-disruptive operation), then upgrades each node pool sequentially, validating node count before and after each upgrade to detect unexpected node losses.
Key operational notes:
- Control plane upgrades are non-disruptive: The control plane upgrade updates the Kubernetes API server and controllers but does not affect running workloads. Only node pool upgrades trigger pod evictions.
- One node pool at a time: Upgrading multiple pools simultaneously multiplies risk. Sequential upgrades allow you to catch issues early and halt the rollout before affecting critical workloads.
- Validate before proceeding: Check pod health, replica counts, and application metrics after each pool upgrade. Use kubectl, Azure Monitor, or Prometheus to verify that workloads are stable before moving to the next pool.
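To make that per-pool validation concrete, two kubectl checks cover the basics. AKS labels each node with agentpool=<pool name>; the pool name here (stateless) matches the script above:
# Nodes in the upgraded pool should be Ready and report the target version
kubectl get nodes -l agentpool=stateless
# Surface any pods that did not come back cleanly after the pool upgrade
kubectl get pods -A --field-selector status.phase!=Running,status.phase!=Succeeded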
Planned maintenance windows: scheduling upgrades safely
For clusters with automatic upgrades enabled, AKS supports planned maintenance windows to control when upgrades occur. This prevents upgrades from starting during peak traffic periods.
Configure a weekly maintenance window using Azure CLI:
az aks maintenanceconfiguration add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name aksManagedAutoUpgradeSchedule \
  --schedule-type Weekly \
  --day-of-week Saturday \
  --start-time 02:00 \
  --duration 4
Microsoft recommends a minimum four-hour maintenance window to ensure upgrades complete without interruption. Combine this with the stable auto-upgrade channel, which targets the previous minor version with latest patches, for a balance between staying current and avoiding bleeding-edge issues.
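If you do rely on automatic upgrades, the channel is set on the cluster itself. A minimal example, reusing the resource names from earlier:
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --auto-upgrade-channel stable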
For production clusters, I prefer manual upgrades with planned maintenance windows as a safety net. The automation handles the scheduling, but I control when the actual upgrade starts.
Automation and rollback: scripting safe upgrades
Automation reduces human error during upgrades, but only if the automation includes validation and rollback capabilities. A fully automated upgrade script should:
- Validate current cluster state (replica counts, PDB configurations, node health).
- Upgrade in stages with validation checkpoints.
- Detect failures and halt or rollback automatically.
Practical validation checks before upgrade:
#!/bin/bash
set -euo pipefail
echo "=== Pre-Upgrade Validation ==="
# Check available Kubernetes versions
echo "Available upgrades:"
az aks get-upgrades \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --output table
# Verify all nodes are ready (the || true stops set -e/pipefail from aborting when grep finds no NotReady nodes)
NOTREADY=$(kubectl get nodes --no-headers | { grep -v " Ready " || true; } | wc -l)
if [ "$NOTREADY" -gt 0 ]; then
  echo "ERROR: $NOTREADY nodes are not ready. Aborting upgrade."
  kubectl get nodes | grep -v " Ready "
  exit 1
fi
echo "✓ All nodes ready"
# Check for PDBs that would block drain
BLOCKED=$(kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}')
if [ -n "$BLOCKED" ]; then
  echo "WARNING: The following PDBs have zero allowed disruptions:"
  echo "$BLOCKED"
  echo "These will block node drain. Verify this is intentional."
fi
# Verify PDBs exist for critical namespaces
for NS in production; do
  PDBS=$(kubectl get pdb -n "$NS" --no-headers 2>/dev/null | wc -l)
  if [ "$PDBS" -eq 0 ]; then
    echo "WARNING: No PDBs configured in namespace $NS"
  else
    echo "✓ $PDBS PDBs configured in $NS"
  fi
done
# Verify critical deployments have sufficient replicas
echo "Checking critical deployments..."
for DEPLOYMENT in myapp-frontend myapp-backend; do
  REPLICAS=$(kubectl get deployment "$DEPLOYMENT" -n production -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo "0")
  REPLICAS=${REPLICAS:-0}  # readyReplicas is omitted from status while zero pods are ready
  if [ "$REPLICAS" -lt 2 ]; then
    echo "ERROR: $DEPLOYMENT has fewer than 2 ready replicas ($REPLICAS). Aborting upgrade."
    exit 1
  fi
  echo "✓ $DEPLOYMENT: $REPLICAS replicas ready"
done
echo "=== Validation Complete ==="
Rollback is more complex. AKS does not support in-place downgrades. If an upgrade introduces breaking changes, the rollback path involves:
- Restoring from a snapshot or backup (for stateful workloads).
- Deploying a new node pool with the previous Kubernetes version.
- Migrating workloads to the new pool.
- Deleting the upgraded pool.
This process is slow and disruptive, which is why validation before upgrade is critical. Test upgrades in staging, validate application compatibility with the new Kubernetes version, and maintain rollback procedures even if you hope never to use them.
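For reference, the node-pool-swap portion of that rollback path looks roughly like this. Treat it as a sketch: the pool names and the previous version are placeholders, and the replacement pool must stay within the version skew the control plane supports.
# 1. Add a pool pinned to the previous, known-good version (placeholder version number)
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name rollback \
  --node-count 3 \
  --kubernetes-version 1.28.5
# 2. Cordon every node in the upgraded pool first so evicted pods land on the rollback pool
#    (AKS labels each node with agentpool=<pool name>)
for NODE in $(kubectl get nodes -l agentpool=stateful -o name); do
  kubectl cordon "$NODE"
done
# 3. Drain the upgraded pool so workloads reschedule onto the rollback pool
for NODE in $(kubectl get nodes -l agentpool=stateful -o name); do
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
done
# 4. Remove the upgraded pool once workloads are healthy on the rollback pool
az aks nodepool delete \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name stateful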
Practical recommendations
Based on production experience, the following practices reduce upgrade-related failures:
- Always configure PDBs for production workloads. Even stateless services benefit from minAvailable: 1 to prevent simultaneous eviction of all replicas.
- Test upgrades in staging first. Validate application compatibility, verify PDB behavior, and measure upgrade duration under realistic load.
- Upgrade during low-traffic windows. Even with proper PDBs, upgrades reduce available capacity. Schedule upgrades when traffic is lowest to minimize user impact.
- Monitor during upgrades. Track pod eviction events, replica counts, and application error rates. Use Azure Monitor, Prometheus, or your existing observability stack to detect issues early (a starting point is sketched after this list).
- Automate validation, not just execution. Scripts that upgrade without validation are worse than manual upgrades because they fail faster and more completely.
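The monitoring item above does not require new tooling to get started; two terminals with kubectl go a long way. A minimal sketch, assuming a hypothetical critical deployment named myapp-backend:
# Stream warning events cluster-wide while nodes are being replaced
kubectl get events -A --field-selector type=Warning --watch
# In a second terminal, watch ready replica counts for a critical deployment
kubectl get deployment myapp-backend -n production --watch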
Conclusion
AKS cluster upgrades are unavoidable, but service disruption is not. Cordon and drain mechanics provide the foundation, Pod Disruption Budgets enforce availability constraints, and multi-node-pool rollouts allow graduated risk management. Combine these tools with validation-driven automation, and zero-downtime upgrades become reliable rather than aspirational.
The key insight: upgrades succeed when the automation respects the constraints of your workloads, not when the automation assumes resilience that does not exist.
Start with the basics: configure PDBs for every production workload, set --max-surge 33% on your node pools, and always upgrade the control plane before node pools. Test in staging first. Monitor during the upgrade. These practices are not optional for production clusters.
