{"authors":[{"name":"Martin Stühmer","url":"https://daily-devops.net/authors/martin/"},{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"description":"Recent content in System Reliability and Resilience on Daily DevOps \u0026 .NET","favicon":"https://daily-devops.net/images/logo_hu_6465d873dfa490cf.png","feed_url":"https://daily-devops.net/tags/reliability/feed.json","home_page_url":"https://daily-devops.net/tags/reliability/","icon":"https://daily-devops.net/images/logo_hu_5926de77762241ba.png","items":[{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eYour cluster will fail. The question is not if, but when, and whether you can recover before customers notice. Most organizations discover their backup strategy does not work during an actual outage, when recovery time matters most and manual heroics cannot save you.\u003c/p\u003e\n\u003cp\u003eIf you run Azure Kubernetes Service (AKS) in production, you need a recovery plan that engineers can execute half asleep at 2 AM. We will go through what to back up, how Velero works in day-to-day operations, when Azure Backup for AKS is enough, and how to design realistic failover with measurable Recovery Time Objective (RTO) and Recovery Point Objective (RPO).\u003c/p\u003e\n\u003cp\u003eThe goal is simple: repeatable recovery procedures you have already tested, not a document that looks good in Confluence but fails during an incident.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-problem-untested-recovery-fails-when-it-matters\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#the-problem-untested-recovery-fails-when-it-matters\" title=\"The problem: Untested recovery fails when it matters\"\u003eThe problem: Untested recovery fails when it matters\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eEvery Kubernetes cluster accumulates state that must survive failures. Application data lives in persistent volumes. 
Cluster configuration exists in custom resource definitions. Workload definitions sit in YAML manifests scattered across repositories. Identity mappings, secrets, network policies, and RBAC rules define how services authenticate and communicate. Losing any of these components means downtime, data loss, and manual reconstruction under time pressure.\u003c/p\u003e\n\u003cp\u003eThe real risk is not the absence of a backup strategy. It is discovering that your backup strategy does not work during an actual incident, when recovery time directly determines customer impact and business cost.\u003c/p\u003e\n\u003cp\u003eOperational reality: Most teams test backup creation but never test restoration. A backup you have never restored is a backup you cannot trust when you need it. Recovery procedures that depend on manual steps are likely to fail during high-pressure incidents, when engineers make mistakes and documentation is incomplete.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"what-needs-backup-understanding-cluster-state\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#what-needs-backup-understanding-cluster-state\" title=\"What needs backup: Understanding cluster state\"\u003eWhat needs backup: Understanding cluster state\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes clusters contain multiple layers of state that require different backup approaches.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"application-data-persistent-volumes\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#application-data-persistent-volumes\" title=\"Application data: Persistent volumes\"\u003eApplication data: Persistent volumes\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePersistent volumes hold databases, file storage, configuration data, and application state. Losing persistent volume data typically means permanent data loss unless you maintain application-level replication or external backups. 
Azure Disks and Azure Files both support snapshot-based backup, but snapshots alone do not capture the Kubernetes metadata required to restore volumes to the correct pods in the correct namespaces.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cluster-configuration-custom-resources-and-crds\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#cluster-configuration-custom-resources-and-crds\" title=\"Cluster configuration: Custom resources and CRDs\"\u003eCluster configuration: Custom resources and CRDs\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCustom Resource Definitions extend Kubernetes with domain-specific objects. Operators, service meshes, monitoring stacks, and policy engines all define Custom Resource Definitions (CRDs) that control cluster behavior. Losing CRDs means losing the schema and logic that your cluster depends on. Restoring CRDs without the corresponding custom resource objects leaves your cluster in an inconsistent state.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"application-definitions-workload-manifests\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#application-definitions-workload-manifests\" title=\"Application definitions: Workload manifests\"\u003eApplication definitions: Workload manifests\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDeployments, StatefulSets, Services, ConfigMaps, and Secrets define what runs in your cluster. Most teams store these manifests in Git, but cluster state drifts from Git over time due to manual changes, automated rollouts, and operator modifications. 
Restoring from Git alone may not reflect actual production state.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"identity-and-access-rbac-and-service-accounts\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#identity-and-access-rbac-and-service-accounts\" title=\"Identity and access: RBAC and service accounts\"\u003eIdentity and access: RBAC and service accounts\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRole-based access control, ServiceAccounts, and Azure AD integration define who can access what resources. Losing role-based access control (RBAC) configuration means losing security boundaries and breaking automated workflows that depend on specific service account permissions.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"network-configuration-policies-and-ingress-rules\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#network-configuration-policies-and-ingress-rules\" title=\"Network configuration: Policies and ingress rules\"\u003eNetwork configuration: Policies and ingress rules\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eNetwork policies, ingress controllers, and DNS mappings control how traffic flows into and within your cluster. Restoring workloads without restoring network configuration results in unreachable services and broken traffic routing.\u003c/p\u003e\n\u003cp\u003eA complete backup strategy captures all of these layers and validates that restoration procedures actually work.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"velero-production-backup-workflows\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#velero-production-backup-workflows\" title=\"Velero: Production backup workflows\"\u003eVelero: Production backup workflows\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eVelero is the de facto standard for Kubernetes backup and restore. 
It runs as a controller inside your cluster, captures cluster state and persistent volume snapshots, and stores backups in object storage.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"how-velero-works\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#how-velero-works\" title=\"How Velero works\"\u003eHow Velero works\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eVelero operates in two phases: backup and restore. During backup, Velero queries the Kubernetes API for resources matching your backup selectors, serializes those resources to JSON, and uploads the result to cloud object storage (Azure Blob Storage for AKS). For persistent volumes, Velero either triggers Azure Disk snapshots or performs file-level backups through its file system backup feature (Restic, or Kopia in newer Velero releases).\u003c/p\u003e\n\u003cp\u003eDuring restore, Velero downloads the backup manifest, applies resources to the target cluster, and restores persistent volume data from snapshots or file-level backups. Velero restores resources in a defined order so that dependencies exist before the objects that need them, and it can remap namespaces when you configure namespace mappings.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"backup-scheduling-and-retention\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#backup-scheduling-and-retention\" title=\"Backup scheduling and retention\"\u003eBackup scheduling and retention\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eProduction backup strategies require automated scheduling and retention policies. 
Velero supports cron-based schedules and configurable retention windows.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# Velero backup schedule - Helm values\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003evelero\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedules\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003edaily\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"c\"\u003e# Run full backup daily at 2 AM UTC\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedule\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 2 * * *\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003etemplate\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ettl\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;720h\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Retain backups for 30 days\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003eincludedNamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"l\"\u003estaging\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003esnapshotVolumes\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ehourly-critical\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"c\"\u003e# Run hourly backup for critical namespaces\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedule\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 * * * *\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003etemplate\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ettl\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;168h\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Retain backups for 7 days\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003eincludedNamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003elabelSelector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e          \u003c/span\u003e\u003cspan class=\"nt\"\u003ematchLabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e            \u003c/span\u003e\u003cspan class=\"nt\"\u003ebackup-frequency\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ehourly\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003esnapshotVolumes\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eFor many teams, this 
minimal Terraform baseline is easier to maintain than a large, custom module. It creates the storage account and container Velero needs.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_storage_account\u0026#34; \u0026#34;velero\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;velerobackup${var.environment}\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eresource_group_name\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  account_tier\u003c/span\u003e             \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
account_replication_type\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GRS\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_storage_container\u0026#34; \u0026#34;velero\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;velero\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  storage_account_name\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_storage_account\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003evelero\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  container_access_type\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;private\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThen install Velero with Helm and pass only four required values: provider (\u003ccode\u003eazure\u003c/code\u003e), storage account name, 
blob container name, and resource group. Keep advanced tuning for later once backups and restores are stable.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"testing-restore-procedures\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#testing-restore-procedures\" title=\"Testing restore procedures\"\u003eTesting restore procedures\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eBackup creation means nothing without verified restore capability. Production-grade DR requires regular restore testing in isolated environments.\u003c/p\u003e\n\u003cp\u003eRestore testing workflow:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eCreate a test AKS cluster in a separate resource group\u003c/li\u003e\n\u003cli\u003eInstall Velero with access to production backup storage\u003c/li\u003e\n\u003cli\u003eExecute restore operation for a representative namespace\u003c/li\u003e\n\u003cli\u003eValidate application functionality and data integrity\u003c/li\u003e\n\u003cli\u003eDocument restoration time and any issues encountered\u003c/li\u003e\n\u003cli\u003eDestroy test cluster\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eRun this workflow monthly at minimum. Quarterly is too infrequent because configuration drift and Velero version updates will cause surprises. 
Teams that skip restore testing discover broken procedures during actual outages.\u003c/p\u003e\n\u003cp\u003eCommon restore failures: Missing CRDs (restore CRDs before custom resources), incorrect namespace mappings (use Velero namespace mapping features), persistent volume availability zones (Azure Disks are zone-locked), and missing secrets (external secret management requires separate backup).\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-native-backup-when-to-use-it\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#azure-native-backup-when-to-use-it\" title=\"Azure native backup: When to use it\"\u003eAzure native backup: When to use it\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAzure Backup for AKS launched in 2023 and provides Azure-native cluster backup without deploying Velero. It integrates with Azure Backup vaults and uses the same portal experience as VM and database backups.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"azure-backup-vs-velero\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#azure-backup-vs-velero\" title=\"Azure Backup vs Velero\"\u003eAzure Backup vs Velero\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure Backup works well for organizations heavily invested in Azure tooling who want unified backup management across all Azure resources. It handles backup scheduling, retention, and monitoring through familiar Azure interfaces.\u003c/p\u003e\n\u003cp\u003eLimitations compared to Velero: Less flexibility in backup selectors and namespace filtering, fewer options for cross-region backup replication, and vendor lock-in to Azure. Velero supports multi-cloud scenarios and offers more granular control over what gets backed up.\u003c/p\u003e\n\u003cp\u003eRecommendation: Use Azure Backup if your organization already standardizes on Azure Backup for other resources and you do not require multi-cloud portability. 
Use Velero if you need maximum flexibility, cross-region replication control, or multi-cloud backup capability.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-region-failover-designing-for-actual-recovery\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#multi-region-failover-designing-for-actual-recovery\" title=\"Multi-region failover: Designing for actual recovery\"\u003eMulti-region failover: Designing for actual recovery\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eSingle-region deployments create single points of failure. Multi-region architectures provide genuine disaster recovery capability but introduce complexity in state synchronization, traffic routing, and recovery orchestration.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"failover-architecture-patterns\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#failover-architecture-patterns\" title=\"Failover architecture patterns\"\u003eFailover architecture patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eActive-passive:\u003c/strong\u003e Primary region handles all traffic. Secondary region remains idle but receives regular backup replication. During failover, you restore backups to the secondary cluster and redirect traffic. Recovery time depends on backup restore speed and DNS propagation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eActive-active:\u003c/strong\u003e Both regions handle production traffic simultaneously. Application state synchronizes continuously (database replication, event streaming, or shared storage). During regional failure, traffic shifts to the remaining region. Recovery time depends on health check detection and DNS/load balancer failover speed.\u003c/p\u003e\n\u003cp\u003eActive-passive costs less but requires longer recovery time. 
Active-active provides faster failover but doubles infrastructure cost and requires application-level state synchronization.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"dns-failover-automation\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#dns-failover-automation\" title=\"DNS failover automation\"\u003eDNS failover automation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDNS-based failover redirects traffic between regions by updating DNS records to point at healthy endpoints. Azure Traffic Manager and Azure Front Door both provide automatic failover based on health probes.\u003c/p\u003e\n\u003cp\u003eUse a small script first, then expand it over time. This keeps incident handling understandable for on-call engineers.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/usr/bin/env bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSECONDARY_RG\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;rg-aks-westus\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSECONDARY_CLUSTER\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;aks-dr-westus\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"nv\"\u003eTM_RG\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;rg-networking\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eTM_PROFILE\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;tm-aks-prod\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;1) Connect to secondary cluster\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-credentials -g \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$SECONDARY_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$SECONDARY_CLUSTER\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --overwrite-existing\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl cluster-info\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;2) Trigger restore from latest Velero backup\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero restore create dr-\u003cspan class=\"k\"\u003e$(\u003c/span\u003edate +%Y%m%d-%H%M\u003cspan class=\"k\"\u003e)\u003c/span\u003e --from-backup \u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get backups.velero.io -n velero --sort-by=.metadata.creationTimestamp -o name \u003cspan class=\"p\"\u003e|\u003c/span\u003e tail -n1 \u003cspan class=\"p\"\u003e|\u003c/span\u003e cut -d/ -f2\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --wait  \u003cspan class=\"c1\"\u003e# assumes Velero runs in the velero namespace; --wait blocks until the restore finishes\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;3) Switch Traffic Manager endpoint\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz network traffic-manager endpoint update --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --profile-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_PROFILE\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name endpoint-westus --type azureEndpoints --endpoint-status Enabled\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz network traffic-manager endpoint update --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --profile-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_PROFILE\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name endpoint-eastus --type azureEndpoints --endpoint-status Disabled\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script is intentionally small. 
Add pre-checks and post-checks later, but start with a version every engineer can understand quickly during an outage.\u003c/p\u003e\n\u003cp\u003eThis script automates the critical failover steps, but it does not replace human judgment: an engineer should decide to run it and verify the outcome of each step. Fully automated failover without human approval risks unnecessary region switches during transient failures.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"state-synchronization-strategies\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#state-synchronization-strategies\" title=\"State synchronization strategies\"\u003eState synchronization strategies\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eMulti-region architectures require careful state management. Databases need replication (Azure SQL geo-replication, Cosmos DB multi-region writes). Object storage needs cross-region replication (Azure Blob Storage GRS). Message queues require either regional isolation or cross-region synchronization (Azure Service Bus premium tier supports geo-replication).\u003c/p\u003e\n\u003cp\u003eStateless services fail over easily. Stateful services need a replication strategy that is planned during the design phase, not during incident response.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"rto-and-rpo-calculating-realistic-targets\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#rto-and-rpo-calculating-realistic-targets\" title=\"RTO and RPO: Calculating realistic targets\"\u003eRTO and RPO: Calculating realistic targets\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRecovery Time Objective (RTO) measures how long systems can be down before business impact becomes unacceptable. 
Recovery Point Objective (RPO) measures how much data loss is acceptable.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"calculating-rto\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#calculating-rto\" title=\"Calculating RTO\"\u003eCalculating RTO\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRTO includes: detection time (how long until you know there is a problem), decision time (how long to decide failover is necessary), restore time (how long to restore from backup or switch regions), and validation time (how long to confirm restoration worked).\u003c/p\u003e\n\u003cp\u003eExample calculation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eDetection: 5 minutes (health check interval)\u003c/li\u003e\n\u003cli\u003eDecision: 10 minutes (incident escalation and approval)\u003c/li\u003e\n\u003cli\u003eRestore: 45 minutes (Velero restore for 500GB cluster)\u003c/li\u003e\n\u003cli\u003eValidation: 15 minutes (smoke tests and traffic verification)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTotal RTO: 75 minutes\u003c/strong\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf business requirements demand 30-minute RTO, your current backup-based approach will not meet SLOs. You need active-active architecture or pre-warmed standby clusters.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"calculating-rpo\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#calculating-rpo\" title=\"Calculating RPO\"\u003eCalculating RPO\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRPO depends on backup frequency. Hourly backups mean up to 60 minutes of data loss. 
If your application cannot tolerate 60 minutes of data loss, you need more frequent backups or continuous replication.\u003c/p\u003e\n\u003cp\u003eExample calculation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBackup frequency: Every 4 hours\u003c/li\u003e\n\u003cli\u003eLast backup: 2 hours ago\u003c/li\u003e\n\u003cli\u003eRegional failure occurs now\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eData loss: 2 hours\u003c/strong\u003e (time since last backup; the worst case with this schedule is just under 4 hours)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf business requirements demand a 15-minute RPO, 4-hour backup intervals will not meet SLOs, and neither will hourly backups. You need backups at least every 15 minutes, application-level replication, or continuous event streaming to a secondary region.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"designing-for-slos-without-over-engineering\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#designing-for-slos-without-over-engineering\" title=\"Designing for SLOs without over-engineering\"\u003eDesigning for SLOs without over-engineering\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eMany teams over-engineer DR solutions trying to achieve zero data loss and instant failover without understanding actual business requirements. 
A 4-hour RTO may be acceptable for internal tooling but catastrophic for customer-facing APIs.\u003c/p\u003e\n\u003cp\u003ePractical use case:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eInternal reporting API: 2-hour RTO and 1-hour RPO can be enough, active-passive is usually fine.\u003c/li\u003e\n\u003cli\u003eCustomer checkout API: 15-minute RTO and near-zero RPO usually require active-active plus database replication.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe recurring theme is business impact, not architecture fashion.\u003c/p\u003e\n\u003cp\u003eStart by identifying actual business impact:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWhat revenue is lost per hour of downtime?\u003c/li\u003e\n\u003cli\u003eWhat customer commitments exist in SLAs?\u003c/li\u003e\n\u003cli\u003eWhat regulatory requirements mandate specific recovery times?\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThen design the minimum viable DR solution that meets those requirements. Do not build active-active multi-region architecture with continuous replication if business requirements allow 2-hour RTO and 1-hour RPO. That level of complexity costs significant engineering time and operational overhead.\u003c/p\u003e\n\u003cp\u003eConversely, do not assume daily backups suffice for production systems without validating business tolerance for 24-hour data loss.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"best-practices-what-actually-works\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#best-practices-what-actually-works\" title=\"Best practices: What actually works\"\u003eBest practices: What actually works\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eTest restore procedures regularly.\u003c/strong\u003e Monthly restore testing in isolated environments catches broken procedures before actual incidents. 
Quarterly testing is too infrequent.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAutomate backup verification.\u003c/strong\u003e Run automated restore tests that verify backup integrity and measure restoration time. Manual testing does not scale and gets skipped under time pressure.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDocument recovery procedures.\u003c/strong\u003e Runbooks that sit in Confluence do not get updated and will be wrong during incidents. Store recovery procedures as executable scripts in version control and test them regularly.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSeparate backup storage from cluster infrastructure.\u003c/strong\u003e Do not store backups in the same region or subscription as the cluster. Regional Azure outages impact all resources in that region including backup storage.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePlan for partial failures.\u003c/strong\u003e Not every incident requires full cluster restore. Design procedures for restoring individual namespaces, specific workloads, or single persistent volumes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUse infrastructure as code for cluster rebuild.\u003c/strong\u003e Terraform or Bicep definitions for cluster creation enable rapid cluster recreation when restoration is not the best recovery path.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMonitor backup jobs.\u003c/strong\u003e Failed backups are worthless. Alert on backup failures and missing backup runs. 
Do not discover backup gaps during recovery.\u003c/p\u003e\n\u003cp\u003eIf you are defining a monthly DR game day, include three quick checks every time:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eCan we restore one namespace end to end in a clean test cluster?\u003c/li\u003e\n\u003cli\u003eCan we switch traffic and run smoke tests in less than our RTO?\u003c/li\u003e\n\u003cli\u003eCan we prove data freshness is inside the RPO window?\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eIf one answer is no, your DR posture is weaker than your dashboard suggests.\u003c/p\u003e\n\u003cp\u003eCommon mistakes: Storing backups in same region as cluster (regional failure loses backups and cluster), never testing restore procedures (broken backups discovered during incidents), manual recovery procedures (humans make mistakes under pressure), and no RTO/RPO measurement (cannot tell if recovery meets business requirements).\u003c/p\u003e\n\u003cp\u003eAuthor note: I have participated in exactly two real disaster recovery situations involving Kubernetes clusters. In the first incident, backup restoration worked but took 3 hours longer than documented because volume snapshot region restrictions were not tested. In the second incident, backups existed but CRD restoration failed because CRD versions changed between backup and restore. Both incidents would have been prevented by regular restore testing. Do not learn this lesson during a production outage.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"conclusion\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#conclusion\" title=\"Conclusion\"\u003eConclusion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDisaster recovery for AKS requires deliberate planning, regular testing, and honest assessment of recovery capabilities. Velero provides proven backup and restore workflows. Azure native backup offers simplified management for Azure-focused organizations. 
Multi-region architectures enable faster recovery but increase complexity and cost.\u003c/p\u003e\n\u003cp\u003eThe real test is not having a backup strategy documented in Confluence. The real test is whether you can restore your cluster from backup in under 60 minutes during an actual regional outage at 2 AM when half your team is asleep and the incident commander is asking for status updates.\u003c/p\u003e\n\u003cp\u003eBuild repeatable procedures. Test them monthly. Automate everything you can. Measure actual RTO and RPO. Add one more rule: if a step cannot be executed from version-controlled scripts, it is probably not ready for production incidents.\u003c/p\u003e\n\u003cp\u003eRelated reading for AKS operations maturity: \u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/\"\u003eAKS Cluster Upgrades Without Downtime\u003c/a\u003e.\u003c/p\u003e","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-03-11T17:00:00+01:00","id":"https://daily-devops.net/posts/disaster-recovery-business-continuity-aks/","language":"en","summary":"AKS outages happen. 
Build a tested DR plan with Velero, realistic RTO/RPO targets, and multi-region failover steps your team can run under pressure.","tags":["disaster-recovery","azure","kubernetes","cloud","devops","reliability","compliance"],"title":"AKS Disaster Recovery: Why Your Untested Backup Will Fail","url":"https://daily-devops.net/posts/disaster-recovery-business-continuity-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\n\n\n\n\u003ch2 id=\"the-problem-traditional-storage-models-dont-translate-to-kubernetes\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#the-problem-traditional-storage-models-dont-translate-to-kubernetes\" title=\"The Problem: Traditional Storage Models Don\u0026rsquo;t Translate to Kubernetes\"\u003eThe Problem: Traditional Storage Models Don\u0026rsquo;t Translate to Kubernetes\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning stateful workloads in Kubernetes means more than deploying a database pod. Traditional storage models (provision a disk, format it, mount it, expect it to stay) collide with Kubernetes\u0026rsquo; ephemeral, distributed architecture. Pods get rescheduled, scaled, and terminated. Your database shouldn\u0026rsquo;t lose data when that happens.\u003c/p\u003e\n\u003cp\u003eThe core challenge: \u003cstrong\u003ehow do you attach persistent storage to ephemeral compute?\u003c/strong\u003e On-premises infrastructure relies on SAN devices, NFS mounts, or local disks with predictable failure domains. You know which server hosts which disk. In AKS, you work with Azure storage primitives: Managed Disks, Azure Files, blob storage. These need seamless integration with Kubernetes lifecycle management. The abstractions differ, the failure modes differ, and operational patterns require rethinking.\u003c/p\u003e\n\u003cp\u003eComplexity multiplies with backup requirements, disaster recovery expectations, and multi-cluster data synchronization. 
Whether migrating legacy apps that expect local RAID controllers or building cloud-native data platforms from scratch, AKS storage architecture knowledge is foundational. Get it wrong: data loss, performance bottlenecks, escalating cloud bills.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"pvcpv-architecture-how-storage-binds-to-pods-in-aks\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#pvcpv-architecture-how-storage-binds-to-pods-in-aks\" title=\"PVC/PV Architecture: How Storage Binds to Pods in AKS\"\u003ePVC/PV Architecture: How Storage Binds to Pods in AKS\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes abstracts storage through two key objects: \u003cstrong\u003ePersistentVolumes (PV)\u003c/strong\u003e and \u003cstrong\u003ePersistentVolumeClaims (PVC)\u003c/strong\u003e. A PV represents the actual storage resource (Azure Disk, Azure Files share). A PVC represents the request for that storage. The relationship mirrors compute abstractions: nodes are physical machines, pods are logical units consuming node resources. 
Similarly, PVs are physical storage, PVCs are logical requests consuming PV capacity.\u003c/p\u003e\n\u003cp\u003eThe binding flow:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eDeveloper creates a PVC specifying size, access mode, and storage class\u003c/li\u003e\n\u003cli\u003eKubernetes finds or provisions a matching PV based on the storage class\u003c/li\u003e\n\u003cli\u003ePVC binds to the PV, making it available to pods\u003c/li\u003e\n\u003cli\u003ePods reference the PVC in their volume mounts\u003c/li\u003e\n\u003cli\u003eWhen the pod terminates, the PVC remains (data persists across pod lifecycles)\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAccess modes matter:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eReadWriteOnce (RWO)\u003c/strong\u003e: Single node can mount the volume (Azure Disk)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReadWriteMany (RWX)\u003c/strong\u003e: Multiple nodes can mount simultaneously (Azure Files)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReadOnlyMany (ROX)\u003c/strong\u003e: Multiple nodes, read-only access\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eMost stateful apps (databases, message queues) use RWO. Azure Disks provide better IOPS and latency than Azure Files. For shared storage (parallel batch processing, shared config directories, legacy apps expecting NFS semantics), use RWX: Azure Files or third-party CSI drivers like NFS or CephFS.\u003c/p\u003e\n\u003cp\u003eCritical insight: \u003cstrong\u003ePVCs decouple storage requests from storage implementation.\u003c/strong\u003e Developers don\u0026rsquo;t need to know if they get a Premium SSD or Standard HDD. They request 100Gi of fast storage, the storage class handles provisioning. 
This abstraction enables platform teams to enforce policies (all production PVCs use Premium tier) without touching application manifests.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-disk-vs-azure-files-performance-cost-regional-constraints\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#azure-disk-vs-azure-files-performance-cost-regional-constraints\" title=\"Azure Disk vs. Azure Files: Performance, Cost, Regional Constraints\"\u003eAzure Disk vs. Azure Files: Performance, Cost, Regional Constraints\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eChoosing between Azure Disk and Azure Files isn\u0026rsquo;t a one-size-fits-all decision. Each has distinct performance profiles, cost implications, and operational constraints.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAzure Disk (Managed Disks):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePerformance:\u003c/strong\u003e Lower latency, higher IOPS. Premium SSDs reach 20,000 IOPS, Ultra Disks exceed that.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAccess:\u003c/strong\u003e Single-node attachment (RWO). Pod rescheduling to another node triggers disk detach and reattach (expect brief delay).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse cases:\u003c/strong\u003e Databases (PostgreSQL, MongoDB), stateful apps requiring low-latency I/O.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCost:\u003c/strong\u003e Pay per provisioned disk size. A 1TB Premium SSD costs more than a 1TB Standard HDD, regardless of actual usage.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegional constraints:\u003c/strong\u003e Disks are zone-specific. With availability zones, pods must schedule in the same zone as the disk.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eAzure Files (SMB/NFS):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePerformance:\u003c/strong\u003e Higher latency than disks. 
Premium Files tier improves performance but still trails disk I/O.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAccess:\u003c/strong\u003e Multi-node (RWX). Multiple pods across nodes can mount the same share.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse cases:\u003c/strong\u003e Shared logs, static assets, config files, legacy apps expecting NFS.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCost:\u003c/strong\u003e Pay per storage consumed plus transactions. Transaction costs surprise teams on high-throughput workloads.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegional constraints:\u003c/strong\u003e File shares are regional, not zonal. Better for cross-zone workloads, still tied to single region.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eDecision criteria:\u003c/strong\u003e Default to Azure Disk for databases and high-IOPS apps. Use Azure Files only when RWX access or legacy NFS compatibility is required. For backup targets or archival storage, consider blob storage with CSI drivers (experimental, improving).\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"the-disk-attachment-penalty\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#the-disk-attachment-penalty\" title=\"The Disk Attachment Penalty\"\u003eThe Disk Attachment Penalty\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eGotcha: \u003cstrong\u003edisk attachment times.\u003c/strong\u003e Pod rescheduling requires Azure to detach the disk from the old node and attach it to the new one. This takes 30 to 90 seconds. 
Apps that cannot tolerate this downtime need application-level replication (PostgreSQL streaming replication) or third-party solutions like Portworx.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"storage-classes--dynamic-provisioning-automating-the-lifecycle\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#storage-classes--dynamic-provisioning-automating-the-lifecycle\" title=\"Storage Classes \u0026amp; Dynamic Provisioning: Automating the Lifecycle\"\u003eStorage Classes \u0026amp; Dynamic Provisioning: Automating the Lifecycle\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStatic provisioning (manually creating PVs, hoping someone claims them) creates operational overhead. \u003cstrong\u003eStorage classes\u003c/strong\u003e enable dynamic provisioning: Kubernetes automatically creates a PV when a PVC is submitted.\u003c/p\u003e\n\u003cp\u003eAKS ships with default storage classes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003edefault\u003c/code\u003e: Standard HDD Azure Disk (RWO)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003emanaged-premium\u003c/code\u003e: Premium SSD Azure Disk (RWO)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eazurefile\u003c/code\u003e: Azure Files share (RWX)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eazurefile-premium\u003c/code\u003e: Premium Azure Files share (RWX)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eYou can define custom storage classes to fine-tune parameters:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003estorage.k8s.io/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eStorageClass\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003efast-ssd\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eprovisioner\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003edisk.csi.azure.com\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eparameters\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eskuName\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ePremium_LRS\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eManaged\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ecachingMode\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eReadOnly\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Zone redundant storage (ZRS) for higher durability\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# skuName: Premium_ZRS\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eallowVolumeExpansion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ereclaimPolicy\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eRetain\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003evolumeBindingMode\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eWaitForFirstConsumer\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eKey parameters:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ereclaimPolicy:\u003c/strong\u003e \u003ccode\u003eDelete\u003c/code\u003e removes the disk when PVC is deleted, \u003ccode\u003eRetain\u003c/code\u003e keeps it. For production databases, \u003ccode\u003eRetain\u003c/code\u003e prevents accidental data deletion.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003evolumeBindingMode:\u003c/strong\u003e \u003ccode\u003eWaitForFirstConsumer\u003c/code\u003e delays PV creation until pod scheduling. Critical for zone-aware clusters (Kubernetes creates the disk in the same zone as the pod).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eallowVolumeExpansion:\u003c/strong\u003e Enables PVC resizing without recreation. Azure Disks support this, not all storage backends do.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eBest practice:\u003c/strong\u003e Create environment-specific storage classes (dev, staging, prod) with different \u003ccode\u003eskuName\u003c/code\u003e values. Dev clusters use Standard HDDs, prod uses Premium SSDs. Developers use identical manifests across environments, only the storage class name changes.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"backup--recovery-rtorpo-implications\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#backup--recovery-rtorpo-implications\" title=\"Backup \u0026amp; Recovery: RTO/RPO Implications\"\u003eBackup \u0026amp; Recovery: RTO/RPO Implications\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes doesn\u0026rsquo;t backup data by default. 
Running \u003ccode\u003ekubectl delete pvc\u003c/code\u003e without a recovery plan means permanent data loss.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eVelero\u003c/strong\u003e (formerly Heptio Ark) is the de facto standard for Kubernetes backup. It snapshots PVs, captures Kubernetes object state, stores backups in object storage (Azure Blob, S3, GCS).\u003c/p\u003e\n\u003cp\u003eExample Velero backup schedule (via CLI):\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Install Velero with Azure plugin\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero install \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --provider azure \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --bucket velero-backups \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --secret-file ./credentials-velero \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --backup-location-config \u003cspan class=\"nv\"\u003eresourceGroup\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eaks-backups-rg,storageAccount\u003cspan class=\"o\"\u003e=\u003c/span\u003eaksbackupssa\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Create a daily backup schedule for production namespace\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero schedule create daily-prod-backup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --schedule\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 2 * * *\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --include-namespaces production \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --snapshot-volumes \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --ttl 720h\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"rto-and-rpo-considerations\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#rto-and-rpo-considerations\" title=\"RTO And RPO Considerations\"\u003eRTO And RPO Considerations\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eRTO/RPO considerations:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSnapshot-based backups (Azure Disk snapshots via Velero):\u003c/strong\u003e RPO equals backup frequency (hourly, daily). RTO equals time to provision new PV plus restore data (5 to 30 minutes).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNative Azure Backup for AKS:\u003c/strong\u003e Microsoft managed solution. 
It integrates with Azure Backup policies, but restores are slower and less granular than with Velero.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eApplication-level backups (pg_dump, mongodump):\u003c/strong\u003e These bypass Kubernetes entirely. They can lower RTO when paired with automated restore scripts, but they require custom orchestration.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eGotcha:\u003c/strong\u003e Velero relies on Azure Disk snapshots. Restoring a disk from Zone 1 into a cluster in Zone 2 requires a cross-zone snapshot copy, which is not instant. Test restore procedures in non-prod clusters. A backup never restored is wishful thinking.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-aks-replication-patterns-for-cross-cluster-data-synchronization\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#multi-aks-replication-patterns-for-cross-cluster-data-synchronization\" title=\"Multi-AKS Replication: Patterns for Cross-Cluster Data Synchronization\"\u003eMulti-AKS Replication: Patterns for Cross-Cluster Data Synchronization\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning stateful workloads across multiple AKS clusters, whether for HA, disaster recovery, or multi-region latency requirements, adds another layer of complexity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 1: Application-Level Replication\u003c/strong\u003e\nLet the application handle replication. PostgreSQL streaming replication, MongoDB replica sets, and Kafka replication understand their data models and replicate efficiently.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e No Kubernetes-specific dependencies. 
Works identically in VMs, on-premises, or managed services.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e You manage replication lag, split-brain scenarios, and failover logic.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 2: Storage-Level Replication\u003c/strong\u003e\nUse Azure NetApp Files or third-party solutions like Portworx for block or file-level replication.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e Transparent to applications. Works with legacy apps lacking native replication.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e Expensive. NetApp Files Premium tier and Portworx licensing (scales with node count) add significant cost.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 3: Backup-Based DR\u003c/strong\u003e\nVelero backups from primary cluster, restore to secondary on failover.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e Cost-effective (blob storage only).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e RPO equals last backup interval (hours, not seconds). 
RTO includes restore time (minutes to hours).\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"a-multi-region-postgresql-pattern\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#a-multi-region-postgresql-pattern\" title=\"A Multi-Region PostgreSQL Pattern\"\u003eA Multi-Region PostgreSQL Pattern\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eReal-world example:\u003c/strong\u003e A multi-region PostgreSQL deployment pattern I\u0026rsquo;ve encountered:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePrimary AKS cluster (West Europe):\u003c/strong\u003e Production traffic\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSecondary AKS cluster (North Europe):\u003c/strong\u003e Read replicas via PostgreSQL streaming replication\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eVelero backups:\u003c/strong\u003e Azure Blob in third region (East US) for regulatory compliance\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThis provides sub-second RPO within Europe (streaming replication), hourly RPO globally (Velero), and a 5-minute RTO for regional failover (promote the read replica).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOperational reality:\u003c/strong\u003e Multi-cluster data replication is complex. Avoid it by using managed services (Azure Database for PostgreSQL with geo-replication) if possible. Running databases in AKS requires investment in automation, monitoring, and runbooks. Your 3 AM self will appreciate this decision.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"final-thoughts\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#final-thoughts\" title=\"Final Thoughts\"\u003eFinal Thoughts\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStorage in AKS represents a set of trade-offs requiring deliberate navigation. Azure Disk provides performance with zone-locking. Azure Files offers flexibility with latency penalties. Velero enables backups but demands operational discipline and testing. 
Multi-cluster replication delivers resilience with non-linear operational complexity.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"a-pragmatic-starting-point\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#a-pragmatic-starting-point\" title=\"A Pragmatic Starting Point\"\u003eA Pragmatic Starting Point\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePragmatic approach: Start with managed storage classes and Velero. Use Azure Disk for databases and high-IOPS workloads. Use Azure Files only when RWX access or legacy NFS compatibility is genuinely required. Test restore procedures quarterly, not during outages. Schedule fire drills: delete a namespace, restore from backup. Measure actual RTO/RPO instead of assuming SLA compliance.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-to-leave-aks-for-managed-data-services\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#when-to-leave-aks-for-managed-data-services\" title=\"When To Leave AKS For Managed Data Services\"\u003eWhen To Leave AKS For Managed Data Services\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhen stateful workload requirements outgrow AKS storage primitives (sub-second cross-region replication, disk attachment latency breaking your app, spiraling storage costs), don\u0026rsquo;t force solutions. Consider Azure managed services (Azure Database for PostgreSQL, Cosmos DB) or specialized data platforms (Confluent Cloud for Kafka, MongoDB Atlas). Sometimes the best Kubernetes storage strategy is avoiding stateful workloads in Kubernetes.\u003c/p\u003e\n\u003cp\u003eKubernetes excels at stateless orchestration. For stateful workloads, it\u0026rsquo;s capable but demands understanding the plumbing, accepting trade-offs, building operational muscle around backups, monitoring, and runbooks. Treat storage as infrastructure that will fail, not infrastructure that just works. 
Plan accordingly.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-02-04T17:00:00+01:00","id":"https://daily-devops.net/posts/storage-architecture-stateful-workloads-aks/","language":"en","summary":"PVC/PV patterns, Azure Disk vs Files trade-offs, Velero backup strategies, and cross-cluster replication for production stateful workloads in AKS.","tags":["storage","azure","kubernetes","cloud","database","reliability","operations","platform-engineering","disaster-recovery"],"title":"Storage Architecture \u0026 Stateful Workloads in AKS","url":"https://daily-devops.net/posts/storage-architecture-stateful-workloads-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eAKS cluster upgrades are routine maintenance, but executing them without dropping traffic or losing state is the operational challenge that separates theory from production reality. Every Kubernetes version upgrade involves replacing nodes, which means evicting pods, draining workloads, and hoping your assumptions about resilience hold true under pressure.\u003c/p\u003e\n\u003cp\u003eI have participated in dozens of AKS upgrades across production clusters ranging from 10 to 500+ nodes. The pattern is consistent: teams that treat upgrades as a checkbox operation eventually experience an outage. 
Teams that understand the underlying mechanics and configure explicit constraints rarely do.\u003c/p\u003e\n\u003cp\u003eThis article covers the real mechanics: how cordon and drain actually work, why Pod Disruption Budgets exist, and how to orchestrate multi-node-pool rollouts with automation that survives contact with production.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-problem-uncontrolled-node-drains-cause-cascading-failures\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#the-problem-uncontrolled-node-drains-cause-cascading-failures\" title=\"The problem: uncontrolled node drains cause cascading failures\"\u003eThe problem: uncontrolled node drains cause cascading failures\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eWhen you upgrade an AKS cluster, Azure replaces nodes with new VMs running the updated Kubernetes version. That replacement process triggers pod eviction. Without proper controls, evictions happen simultaneously across multiple nodes, stateful workloads lose quorum, and traffic drops because there are no healthy replicas left to serve requests.\u003c/p\u003e\n\u003cp\u003eThe default behavior is optimistic: Kubernetes assumes your workloads are designed for failure. But production workloads are rarely that resilient. Databases need time to transfer leadership, message queues need to flush buffers, and stateless apps still need at least one replica running to handle incoming connections.\u003c/p\u003e\n\u003cp\u003e\u003cem\u003e\u003cstrong\u003eAuthor note:\u003c/strong\u003e\u003c/em\u003e The \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-cluster\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eofficial AKS upgrade documentation\u003c/a\u003e covers the mechanics, but it does not emphasize how quickly things go wrong without proper constraints. 
I have seen a three-minute upgrade window turn into a two-hour incident because nobody configured Pod Disruption Budgets.\u003c/p\u003e\n\u003cp\u003eUncontrolled drains create several failure modes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eData loss:\u003c/strong\u003e Stateful workloads evicted before flushing state to disk or replicating to peers.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eService interruption:\u003c/strong\u003e All replicas terminated before new ones become ready.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCascading failures:\u003c/strong\u003e Dependent services time out waiting for unavailable backends, triggering retries that amplify load.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe solution is not to avoid upgrades. The solution is to control the eviction process with explicit constraints that match your workload requirements.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"cordon-and-drain-mechanics-what-actually-happens\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#cordon-and-drain-mechanics-what-actually-happens\" title=\"Cordon and drain mechanics: what actually happens\"\u003eCordon and drain mechanics: what actually happens\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDraining a node follows a three-step process built on the Kubernetes eviction API:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eCordon:\u003c/strong\u003e Mark the node as unschedulable. 
New pods will not be placed on this node, but existing pods continue running.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eEvict:\u003c/strong\u003e Send termination signals to all pods on the node, respecting grace periods and Pod Disruption Budgets (PDBs).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eWait:\u003c/strong\u003e Block until all pods have terminated or the drain timeout expires.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAKS automates this process during upgrades, but you can trigger it manually using kubectl for maintenance or troubleshooting:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl cordon \u0026lt;node-name\u0026gt;\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl drain \u0026lt;node-name\u0026gt; --ignore-daemonsets --delete-emptydir-data\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe \u003ccode\u003e--ignore-daemonsets\u003c/code\u003e flag prevents drain from failing on DaemonSet pods, which are designed to run on every node and will be recreated automatically. 
The \u003ccode\u003e--delete-emptydir-data\u003c/code\u003e flag allows drain to proceed even if pods use emptyDir volumes, which are ephemeral and will be lost.\u003c/p\u003e\n\u003cp\u003eFor AKS automated upgrades, you can configure the drain behavior per node pool using \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-node-pools-rolling\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003erolling upgrade settings\u003c/a\u003e:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks nodepool update \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name myNodePool \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --max-surge 33% \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --drain-timeout \u003cspan class=\"m\"\u003e45\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --node-soak-duration \u003cspan class=\"m\"\u003e5\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe \u003ccode\u003e--drain-timeout\u003c/code\u003e parameter (in minutes) controls how long 
AKS waits for pod evictions on a node to complete; if the timeout is exceeded, AKS stops the upgrade operation rather than force-killing the remaining pods. The \u003ccode\u003e--node-soak-duration\u003c/code\u003e (in minutes) adds a stabilization period after each node upgrade before proceeding to the next. Microsoft recommends \u003ccode\u003e--max-surge 33%\u003c/code\u003e for production workloads.\u003c/p\u003e\n\u003cp\u003eManual drain remains useful for pre-maintenance validation, testing PDB configurations, or debugging eviction failures before committing to a full cluster upgrade.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"pod-disruption-budgets-the-safety-mechanism-you-should-always-configure\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#pod-disruption-budgets-the-safety-mechanism-you-should-always-configure\" title=\"Pod Disruption Budgets: the safety mechanism you should always configure\"\u003ePod Disruption Budgets: the safety mechanism you should always configure\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eA \u003ca href=\"https://kubernetes.io/docs/tasks/run-application/configure-pdb/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003ePod Disruption Budget\u003c/a\u003e (PDB) defines the minimum number of pods that must remain available during voluntary disruptions like node drains. 
PDBs do not prevent involuntary disruptions like node crashes or resource exhaustion, but they block evictions that would violate availability constraints.\u003c/p\u003e\n\u003cp\u003ePDBs are defined using either \u003ccode\u003eminAvailable\u003c/code\u003e or \u003ccode\u003emaxUnavailable\u003c/code\u003e:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eminAvailable:\u003c/strong\u003e The minimum number of pods (or percentage) that must remain running during a disruption.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003emaxUnavailable:\u003c/strong\u003e The maximum number of pods (or percentage) that can be unavailable during a disruption.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eExample PDB for a three-replica deployment that must keep at least two replicas running:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003epolicy/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ePodDisruptionBudget\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  
\u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003emyapp-pdb\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003enamespace\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003espec\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eminAvailable\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"m\"\u003e2\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eselector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ematchLabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      
\u003c/span\u003e\u003cspan class=\"nt\"\u003eapp\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003emyapp\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eWith this PDB in place, drain will evict only one pod at a time, waiting for a replacement to become ready before proceeding to the next eviction. If no replacement becomes ready (for example, due to resource constraints or image pull failures), the drain blocks until the timeout expires.\u003c/p\u003e\n\u003cp\u003ePDBs are particularly critical for:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eStateful workloads:\u003c/strong\u003e Databases, message queues, and distributed systems that require quorum.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLow-replica deployments:\u003c/strong\u003e Services with two or three replicas where losing one pod reduces capacity significantly.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLong startup times:\u003c/strong\u003e Workloads that take minutes to initialize and become ready.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003ePractical PDB configuration advice:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSet \u003ccode\u003eminAvailable: 1\u003c/code\u003e for stateless services with two replicas.\u003c/li\u003e\n\u003cli\u003eSet \u003ccode\u003eminAvailable: N-1\u003c/code\u003e for N-replica stateful services that tolerate one failure (for example, three-node etcd allows \u003ccode\u003eminAvailable: 2\u003c/code\u003e).\u003c/li\u003e\n\u003cli\u003eAvoid \u003ccode\u003eminAvailable: N\u003c/code\u003e (all replicas), which blocks drain indefinitely and prevents upgrades.\u003c/li\u003e\n\u003cli\u003eUse percentages for large replica counts: \u003ccode\u003eminAvailable: 75%\u003c/code\u003e for a 10-replica deployment rounds up to 8 pods that must stay available, so at most 2 pods can be 
evicted simultaneously.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAuthor tip: Before any upgrade, run \u003ccode\u003ekubectl get pdb -A\u003c/code\u003e and verify that no PDB has \u003ccode\u003eALLOWED DISRUPTIONS\u003c/code\u003e showing zero. A PDB with zero allowed disruptions blocks node drain, and your upgrade will hang until the drain timeout expires or you manually intervene.\u003c/p\u003e\n\u003cp\u003ePDBs only apply to voluntary disruptions. A node failure is not bound by any PDB: every pod on the failed node becomes unavailable at once.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"workload-categories-stateless-stateful-daemonsets\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#workload-categories-stateless-stateful-daemonsets\" title=\"Workload categories: stateless, stateful, DaemonSets\"\u003eWorkload categories: stateless, stateful, DaemonSets\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDifferent workload types require different upgrade strategies. A one-size-fits-all approach causes either unnecessary downtime (overly conservative) or unexpected failures (overly aggressive).\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"stateless-workloads\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#stateless-workloads\" title=\"Stateless workloads\"\u003eStateless workloads\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eStateless services like web frontends, API gateways, and workers can tolerate rapid eviction as long as at least one replica remains available. Configure PDBs with \u003ccode\u003eminAvailable: 1\u003c/code\u003e or \u003ccode\u003emaxUnavailable: N-1\u003c/code\u003e to allow fast rollouts while maintaining service availability.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"stateful-workloads\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#stateful-workloads\" title=\"Stateful workloads\"\u003eStateful workloads\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDatabases, message queues, and distributed storage systems require careful sequencing. 
Evicting multiple replicas simultaneously can cause quorum loss, split-brain scenarios, or data corruption.\u003c/p\u003e\n\u003cp\u003eBest practices for stateful workloads:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eSet conservative PDBs that preserve quorum (for example, \u003ccode\u003eminAvailable: 2\u003c/code\u003e for a three-node cluster).\u003c/li\u003e\n\u003cli\u003eConfigure long grace periods (60+ seconds) to allow state transfer and leadership handoff.\u003c/li\u003e\n\u003cli\u003eUse StatefulSets with proper readiness probes to ensure new replicas are fully initialized before old ones are terminated.\u003c/li\u003e\n\u003cli\u003eTest upgrade scenarios in staging with realistic data volumes and latency.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"daemonsets\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#daemonsets\" title=\"DaemonSets\"\u003eDaemonSets\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDaemonSets run exactly one pod per node (or per matching node). Examples include logging agents, monitoring exporters, and network plugins. The DaemonSet pod goes away together with the old node, and the DaemonSet controller creates a new pod on the replacement node after the upgrade.\u003c/p\u003e\n\u003cp\u003eDaemonSets do not require PDBs because they are designed to tolerate single-node failures. Use the \u003ccode\u003e--ignore-daemonsets\u003c/code\u003e flag during manual drain to skip these pods.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-node-pool-rollout-strategies-graduated-risk-management\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#multi-node-pool-rollout-strategies-graduated-risk-management\" title=\"Multi-node-pool rollout strategies: graduated risk management\"\u003eMulti-node-pool rollout strategies: graduated risk management\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS supports multiple node pools within a single cluster. Each pool can have different VM sizes, availability zones, and upgrade schedules. 
Multi-node-pool architectures enable graduated rollouts that reduce risk by upgrading non-critical workloads first.\u003c/p\u003e\n\u003cp\u003eRecommended upgrade sequence:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eDev/test pools first:\u003c/strong\u003e Upgrade node pools running non-production workloads to validate the new Kubernetes version and catch compatibility issues early.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStateless application pools:\u003c/strong\u003e Upgrade pools running stateless services that can tolerate brief capacity reductions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStateful application pools last:\u003c/strong\u003e Upgrade pools running databases and stateful services only after validating the rollout on stateless workloads.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eExample multi-pool upgrade using Azure CLI:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eRESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;myResourceGroup\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eCLUSTER_NAME\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;myAKSCluster\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;1.29.2\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Configure rolling upgrade settings for production safety\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eMAX_SURGE\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;33%\u0026#34;\u003c/span\u003e        \u003cspan class=\"c1\"\u003e# Microsoft recommended for production\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eDRAIN_TIMEOUT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;45\u0026#34;\u003c/span\u003e     \u003cspan class=\"c1\"\u003e# Minutes to wait for pod eviction\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNODE_SOAK\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;5\u0026#34;\u003c/span\u003e          \u003cspan class=\"c1\"\u003e# Minutes to stabilize after each node\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Upgrade control plane first (does not affect 
workloads)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Upgrading control plane to \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks upgrade \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --kubernetes-version \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TARGET_VERSION\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --control-plane-only \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --yes\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Upgrade node pools in sequence: system -\u0026gt; stateless -\u0026gt; stateful\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNODE_POOLS\u003c/span\u003e\u003cspan class=\"o\"\u003e=(\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;system\u0026#34;\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;stateless\u0026#34;\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;stateful\u0026#34;\u003c/span\u003e\u003cspan class=\"o\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e POOL in \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eNODE_POOLS\u003c/span\u003e\u003cspan class=\"p\"\u003e[@]\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Upgrading node pool: \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan 
class=\"c1\"\u003e# Verify current node count and health\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003eaz aks nodepool show \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --query count -o tsv\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Current node count for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan 
class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e: \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Configure rolling upgrade settings before upgrade\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool update \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --max-surge \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$MAX_SURGE\u003c/span\u003e\u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --drain-timeout \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$DRAIN_TIMEOUT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --node-soak-duration \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NODE_SOAK\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Upgrade node pool\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool upgrade \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan 
class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --kubernetes-version \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TARGET_VERSION\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --yes\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Wait for upgrade to complete\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Waiting for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e upgrade to complete...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  az aks nodepool \u003cspan class=\"nb\"\u003ewait\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --updated\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# Verify upgraded node count matches original\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003eUPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003eaz aks nodepool show \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$RESOURCE_GROUP\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --cluster-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CLUSTER_NAME\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e    --name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$POOL\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    --query count -o tsv\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$CURRENT_COUNT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e !\u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$UPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: Node count mismatch for \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e. 
Expected \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eCURRENT_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e, got \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eUPGRADED_COUNT\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Pool \u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003ePOOL\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e upgraded successfully.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;---\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;All node pools upgraded to 
\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eTARGET_VERSION\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script upgrades the control plane first (which is a non-disruptive operation), then upgrades each node pool sequentially, validating node count before and after each upgrade to detect unexpected node losses.\u003c/p\u003e\n\u003cp\u003eKey operational notes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eControl plane upgrades are non-disruptive:\u003c/strong\u003e The control plane upgrade updates the Kubernetes API server and controllers but does not affect running workloads. Only node pool upgrades trigger pod evictions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eOne node pool at a time:\u003c/strong\u003e Upgrading multiple pools simultaneously multiplies risk. Sequential upgrades allow you to catch issues early and halt the rollout before affecting critical workloads.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eValidate before proceeding:\u003c/strong\u003e Check pod health, replica counts, and application metrics after each pool upgrade. 
Use kubectl, Azure Monitor, or Prometheus to verify that workloads are stable before moving to the next pool.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"planned-maintenance-windows-scheduling-upgrades-safely\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#planned-maintenance-windows-scheduling-upgrades-safely\" title=\"Planned maintenance windows: scheduling upgrades safely\"\u003ePlanned maintenance windows: scheduling upgrades safely\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eFor clusters with \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eautomatic upgrades\u003c/a\u003e enabled, AKS supports \u003ca href=\"https://learn.microsoft.com/en-us/azure/aks/planned-maintenance\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eplanned maintenance windows\u003c/a\u003e to control when upgrades occur. This prevents upgrades from starting during peak traffic periods.\u003c/p\u003e\n\u003cp\u003eConfigure a weekly maintenance window using Azure CLI:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks maintenanceconfiguration add \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --cluster-name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name aksManagedAutoUpgradeSchedule \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --schedule-type Weekly \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --day-of-week Saturday \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --interval-weeks \u003cspan class=\"m\"\u003e1\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --start-time 02:00 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --duration \u003cspan class=\"m\"\u003e4\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eA weekly schedule needs both \u003ccode\u003e--day-of-week\u003c/code\u003e and \u003ccode\u003e--interval-weeks\u003c/code\u003e; an interval of 1 repeats the window every week. Microsoft recommends a minimum four-hour maintenance window to ensure upgrades complete without interruption. Combine this with the \u003ccode\u003estable\u003c/code\u003e auto-upgrade channel, which targets the latest patch release of the previous minor version, for a balance between staying current and avoiding bleeding-edge issues.\u003c/p\u003e\n\u003cp\u003eFor production clusters, I prefer manual upgrades with planned maintenance windows as a safety net. The automation handles the scheduling, but I control when the actual upgrade starts.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"automation-and-rollback-scripting-safe-upgrades\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#automation-and-rollback-scripting-safe-upgrades\" title=\"Automation and rollback: scripting safe upgrades\"\u003eAutomation and rollback: scripting safe upgrades\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAutomation reduces human error during upgrades, but only if the automation includes validation and rollback capabilities. 
A fully automated upgrade script should:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eValidate current cluster state (replica counts, PDB configurations, node health).\u003c/li\u003e\n\u003cli\u003eUpgrade in stages with validation checkpoints.\u003c/li\u003e\n\u003cli\u003eDetect failures and halt or roll back automatically.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003ePractical validation checks before an upgrade:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/bin/bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;=== Pre-Upgrade Validation ===\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check available Kubernetes versions\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Available upgrades:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-upgrades \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --resource-group myResourceGroup 
\u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --name myAKSCluster \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --output table\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify all nodes are ready\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# grep exits 1 when no line matches (all nodes Ready); || true keeps set -e and pipefail from aborting the script\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eNOTREADY\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get nodes --no-headers \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep -cv \u003cspan class=\"s2\"\u003e\u0026#34; Ready \u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003etrue\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NOTREADY\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -gt \u003cspan class=\"m\"\u003e0\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NOTREADY\u003c/span\u003e\u003cspan class=\"s2\"\u003e nodes are not ready. 
Aborting upgrade.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  kubectl get nodes \u003cspan class=\"p\"\u003e|\u003c/span\u003e grep -v \u003cspan class=\"s2\"\u003e\u0026#34; Ready \u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ All nodes ready\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Check for PDBs that would block drain\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eBLOCKED\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pdb -A -o \u003cspan class=\"nv\"\u003ejsonpath\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{\u0026#34;\\n\u0026#34;}{end}\u0026#39;\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan 
class=\"nv\"\u003e$BLOCKED\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;WARNING: The following PDBs have zero allowed disruptions:\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$BLOCKED\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;These will block node drain. 
Verify this is intentional.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify PDBs exist for critical namespaces\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e NS in production\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003ePDBS\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get pdb -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --no-headers 2\u0026gt;/dev/null \u003cspan class=\"p\"\u003e|\u003c/span\u003e wc -l\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$PDBS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -eq \u003cspan class=\"m\"\u003e0\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan 
class=\"s2\"\u003e\u0026#34;WARNING: No PDBs configured in namespace \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eelse\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ \u003c/span\u003e\u003cspan class=\"nv\"\u003e$PDBS\u003c/span\u003e\u003cspan class=\"s2\"\u003e PDBs configured in \u003c/span\u003e\u003cspan class=\"nv\"\u003e$NS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Verify critical deployments have sufficient replicas\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Checking critical deployments...\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e DEPLOYMENT in myapp-frontend myapp-backend\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003edo\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan 
class=\"nv\"\u003eREPLICAS\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get deployment \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -n production -o \u003cspan class=\"nv\"\u003ejsonpath\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s1\"\u003e\u0026#39;{.status.readyReplicas}\u0026#39;\u003c/span\u003e 2\u0026gt;/dev/null \u003cspan class=\"o\"\u003e||\u003c/span\u003e \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;0\u0026#34;\u003c/span\u003e\u003cspan class=\"k\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"c1\"\u003e# readyReplicas is omitted when no replica is ready, so default an empty result to 0\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nv\"\u003eREPLICAS\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"si\"\u003e${\u003c/span\u003e\u003cspan class=\"nv\"\u003eREPLICAS\u003c/span\u003e\u003cspan class=\"k\"\u003e:-\u003c/span\u003e\u003cspan class=\"m\"\u003e0\u003c/span\u003e\u003cspan class=\"si\"\u003e}\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"o\"\u003e[\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$REPLICAS\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -lt \u003cspan class=\"m\"\u003e2\u003c/span\u003e \u003cspan class=\"o\"\u003e]\u003c/span\u003e\u003cspan class=\"p\"\u003e;\u003c/span\u003e \u003cspan class=\"k\"\u003ethen\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;ERROR: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e has fewer than 2 ready replicas (\u003c/span\u003e\u003cspan class=\"nv\"\u003e$REPLICAS\u003c/span\u003e\u003cspan class=\"s2\"\u003e). 
Aborting upgrade.\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eexit\u003c/span\u003e \u003cspan class=\"m\"\u003e1\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"k\"\u003efi\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  \u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;✓ \u003c/span\u003e\u003cspan class=\"nv\"\u003e$DEPLOYMENT\u003c/span\u003e\u003cspan class=\"s2\"\u003e: \u003c/span\u003e\u003cspan class=\"nv\"\u003e$REPLICAS\u003c/span\u003e\u003cspan class=\"s2\"\u003e replicas ready\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003edone\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;=== Validation Complete ===\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eRollback is more complex. AKS does not support in-place downgrades. 
If an upgrade introduces breaking changes, the rollback path involves:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eRestoring from a snapshot or backup (for stateful workloads).\u003c/li\u003e\n\u003cli\u003eDeploying a new node pool with the previous Kubernetes version.\u003c/li\u003e\n\u003cli\u003eMigrating workloads to the new pool.\u003c/li\u003e\n\u003cli\u003eDeleting the upgraded pool.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThis process is slow and disruptive, which is why validation before upgrade is critical. Test upgrades in staging, validate application compatibility with the new Kubernetes version, and maintain rollback procedures even if you hope never to use them.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"practical-recommendations\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#practical-recommendations\" title=\"Practical recommendations\"\u003ePractical recommendations\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eBased on production experience, the following practices reduce upgrade-related failures:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAlways configure PDBs for production workloads.\u003c/strong\u003e Even stateless services benefit from \u003ccode\u003eminAvailable: 1\u003c/code\u003e to prevent simultaneous eviction of all replicas.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTest upgrades in staging first.\u003c/strong\u003e Validate application compatibility, verify PDB behavior, and measure upgrade duration under realistic load.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUpgrade during low-traffic windows.\u003c/strong\u003e Even with proper PDBs, upgrades reduce available capacity. Schedule upgrades when traffic is lowest to minimize user impact.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMonitor during upgrades.\u003c/strong\u003e Track pod eviction events, replica counts, and application error rates. 
Use Azure Monitor, Prometheus, or your existing observability stack to detect issues early.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAutomate validation, not just execution.\u003c/strong\u003e Scripts that upgrade without validation are worse than manual upgrades because they fail faster and more completely.\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch2 id=\"conclusion\"\u003e\u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/#conclusion\" title=\"Conclusion\"\u003eConclusion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAKS cluster upgrades are unavoidable, but service disruption is not. Cordon and drain mechanics provide the foundation, Pod Disruption Budgets enforce availability constraints, and multi-node-pool rollouts allow graduated risk management. Combine these tools with validation-driven automation, and zero-downtime upgrades become reliable rather than aspirational.\u003c/p\u003e\n\u003cp\u003eThe key insight: upgrades succeed when the automation respects the constraints of your workloads, not when the automation assumes resilience that does not exist.\u003c/p\u003e\n\u003cp\u003eStart with the basics: configure PDBs for every production workload, set \u003ccode\u003e--max-surge 33%\u003c/code\u003e on your node pools, and always upgrade the control plane before node pools. Test in staging first. Monitor during the upgrade. 
These practices are not optional for production clusters.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-01-28T17:00:00+01:00","id":"https://daily-devops.net/posts/cluster-upgrades-zero-downtime-aks/","language":"en","summary":"Master AKS upgrades with cordon/drain mechanics, Pod Disruption Budgets, multi-node-pool rollouts, and automation for zero-downtime operations.\n","tags":["aks","azure","kubernetes","cloud","devops","operations","reliability"],"title":"AKS Cluster Upgrades: Zero-Downtime Operations That Actually Work\n","url":"https://daily-devops.net/posts/cluster-upgrades-zero-downtime-aks/"},{"authors":[{"name":"Martin Stühmer","url":"https://daily-devops.net/authors/martin/"}],"content_html":"\u003cp\u003eIn Swabia, southern Germany, there is another cultural practice that outsiders often misunderstand or quietly ignore until it becomes unavoidable. It is called Stoßlüften.\u003c/p\u003e\n\u003cp\u003eTranslated literally, it means \u0026ldquo;shock ventilation.\u0026rdquo; The idea is simple and non-negotiable. Several times a day, regardless of season, you open all windows fully for a few minutes. In winter. In rain. In freezing temperatures. Then you close them again.\u003c/p\u003e\n\u003cp\u003eNo tilted windows. No half measures. No \u0026ldquo;we\u0026rsquo;ll do it later.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eThe goal is not comfort. The goal is system health.\u003c/p\u003e\n\u003cp\u003eAnd once again, this mindset maps disturbingly well to how we should treat long-running software systems.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"what-stoßlüften-actually-solves\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#what-sto%c3%9fl%c3%bcften-actually-solves\" title=\"What Stoßlüften Actually Solves\"\u003eWhat Stoßlüften Actually Solves\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStoßlüften is not about temperature control. 
It is about air quality.\u003c/p\u003e\n\u003cp\u003eKeeping windows slightly open all day feels reasonable. It avoids discomfort. It avoids confrontation with reality. It also does absolutely nothing to remove stale air, humidity, or long-term buildup. Over time, the room feels heavy. Mold appears quietly. The damage is discovered too late.\u003c/p\u003e\n\u003cp\u003eSwabians learned this the hard way. The solution was not better perfume. It was short, aggressive, intentional intervention.\u003c/p\u003e\n\u003cp\u003eThat distinction matters.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-software-equivalent-of-stale-air\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#the-software-equivalent-of-stale-air\" title=\"The Software Equivalent of Stale Air\"\u003eThe Software Equivalent of Stale Air\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eIn software systems, stale air takes many forms, and they\u0026rsquo;re often invisible until catastrophe hits.\u003c/p\u003e\n\u003cp\u003eConsider a long-running ASP.NET Core service that hasn\u0026rsquo;t been redeployed in eight months. It\u0026rsquo;s stable, right? The monitoring shows green. Latency is acceptable. But inside, subtle decay is accumulating:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eMemory pressure\u003c/strong\u003e: A Garbage Collector tuned optimally for 100 concurrent users now serves 800. Heap fragmentation increases. Full collections pause the application for 200ms, 300ms, sometimes 500ms. But \u0026ldquo;it doesn\u0026rsquo;t crash,\u0026rdquo; so nobody investigates.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eConnection pools\u003c/strong\u003e: Database connection strings are cached. A DBA migrated the database to a new cluster and updated DNS, but the service still holds stale connection references. The connection pool wastes resources on dead connections. 
Some queries mysteriously slow down until they time out.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTemporal cache\u003c/strong\u003e: An in-memory cache stores \u0026ldquo;permanent\u0026rdquo; reference data. A new region was added six months ago. The cache has never been cleared. Old entries are queried frequently; new entries are missing.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eHardware drift\u003c/strong\u003e: The service was deployed on Intel Xeon E5 processors. Your cloud provider migrated to AMD EPYC. Both are x86-64, but the microarchitecture and the supported instruction-set extensions differ. Some optimizations no longer apply. Latency jitter increases without explanation.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eNothing is technically broken. Monitoring is green. Latency is acceptable. Everyone feels slightly uncomfortable, but nobody can point to a single failure.\u003c/p\u003e\n\u003cp\u003eThis is the most dangerous state a system can be in.\u003c/p\u003e\n\u003cp\u003eLike a poorly ventilated room, everything still works. Until it doesn\u0026rsquo;t.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"why-small-open-windows-dont-work\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#why-small-open-windows-dont-work\" title=\"Why Small Open Windows Don\u0026rsquo;t Work\"\u003eWhy Small Open Windows Don\u0026rsquo;t Work\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eMany teams believe incremental improvements are enough. A small refactor here. A minor dependency update there. A single flag cleaned up during a feature sprint. These adjustments feel responsible, but they don\u0026rsquo;t meaningfully reset the system.\u003c/p\u003e\n\u003cp\u003eThe problem is structural. Incremental fixes optimize for comfort (avoiding downtime) rather than for outcome (system health). They reduce immediate discomfort but leave stale state untouched. A \u003ccode\u003eFileSystemWatcher\u003c/code\u003e still holds old file references. Memory fragmentation still accumulates. 
Cached data still sits in memory indefinitely.\u003c/p\u003e\n\u003cp\u003eStoßlüften works differently. It is deliberate and complete. You don\u0026rsquo;t optimize for comfort during the process. You optimize for outcome. The system must prove it can start fresh, not just continue indefinitely. Fresh air replaces stale air quickly. This completeness is why it succeeds where partial measures fail.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"restarts-rebuilds-and-reality\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#restarts-rebuilds-and-reality\" title=\"Restarts, Rebuilds, and Reality\"\u003eRestarts, Rebuilds, and Reality\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eOne of the clearest expressions of Stoßlüften in software is restarting services on purpose. Not because they crashed. Not because alerts fired. But because long-lived state is a liability.\u003c/p\u003e\n\u003cp\u003eTeams that never restart services accumulate invisible risk. What looks stable—green metrics, acceptable latency—is often just decay that hasn\u0026rsquo;t been measured yet. Consider what happens in a Kubernetes cluster when pods run for months without intentional resets:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eWithout regular restarts:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eA \u003ccode\u003eFileSystemWatcher\u003c/code\u003e monitoring a config directory holds an open file handle. When the config is deleted, the watcher doesn\u0026rsquo;t detect it. New instances read fresh config, old instances don\u0026rsquo;t. Configuration drift is invisible.\u003c/li\u003e\n\u003cli\u003eA background task crashes after 6 hours. The pod stays alive but the task loop is dead. No alerts fire. Work silently backs up for days.\u003c/li\u003e\n\u003cli\u003eMemory fragmentation becomes pathological. The heap fragments to 40%. Simple allocations start failing. 
Response times degrade silently by 30-40% before anyone connects the dots.\u003c/li\u003e\n\u003cli\u003eInfrastructure migrates to a new subnet. Old instances reference stale gateway IPs. Requests time out randomly. Debugging becomes a nightmare because the failure is intermittent and invisible.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eWith regular restarts (every 24-72 hours):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eConfig mismatches surface immediately. New instances must read fresh config or fail to start. Inconsistency becomes visible rather than silent.\u003c/li\u003e\n\u003cli\u003eDead task loops are discovered during the next startup. The problem is surfaced while it\u0026rsquo;s still manageable.\u003c/li\u003e\n\u003cli\u003eMemory is reclaimed and fragmentation resets. Degradation is measured in days, not months.\u003c/li\u003e\n\u003cli\u003eNetwork connectivity is re-established from scratch. Stale routing tables disappear. The system proves it can reconnect.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eFresh air hurts briefly. Stale air hurts later—and in production, later often means 3am on a Sunday.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"stoßlüften-is-not-chaos-engineering\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#sto%c3%9fl%c3%bcften-is-not-chaos-engineering\" title=\"Stoßlüften Is Not Chaos Engineering\"\u003eStoßlüften Is Not Chaos Engineering\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eThis is not about randomness or stress for its own sake.\u003c/p\u003e\n\u003cp\u003eStoßlüften is predictable. Scheduled. Expected. Everyone knows it will happen. Windows open. Windows close. Life continues.\u003c/p\u003e\n\u003cp\u003eThe software equivalent is controlled disruption. Planned redeployments. Regular dependency refresh cycles. Explicit cleanup phases. Intentional cache invalidation. 
Rebuilding environments from scratch instead of patching them indefinitely.\u003c/p\u003e\n\u003cp\u003eNone of this is exciting. That is precisely why it works.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"why-teams-avoid-it\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#why-teams-avoid-it\" title=\"Why Teams Avoid It\"\u003eWhy Teams Avoid It\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStoßlüften is uncomfortable. Especially in winter.\u003c/p\u003e\n\u003cp\u003eIt interrupts the illusion of stability. It creates a brief moment where the system is exposed. People feel the cold and question whether this is really necessary.\u003c/p\u003e\n\u003cp\u003eSoftware teams do the same thing. They avoid actions that temporarily increase risk, even if those actions reduce long-term risk dramatically. They prefer slow suffocation over short discomfort.\u003c/p\u003e\n\u003cp\u003eUntil mold shows up. Or outages. Or security incidents. Or the realization that nobody knows how the system actually starts anymore.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"a-practical-translation\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#a-practical-translation\" title=\"A Practical Translation\"\u003eA Practical Translation\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStoßlüften in software does not mean reckless change. It means building intentional reset points into your systems and enforcing them with discipline.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"service-restarts\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#service-restarts\" title=\"Service Restarts\"\u003eService Restarts\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRestart services regularly via orchestration. 
In Kubernetes, it\u0026rsquo;s a single command:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Restart all pods in a deployment via a rolling update (pace follows maxSurge/maxUnavailable)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl rollout restart deployment/api-service -n production\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eSee the \u003ca href=\"https://kubernetes.io/docs/reference/kubectl/generated/kubectl_rollout/kubectl_rollout_restart/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eofficial kubectl rollout restart documentation\u003c/a\u003e for more options.\u003c/p\u003e\n\u003cp\u003eThis forces your system to prove it can start cleanly. Every day. Without exception. If a pod fails to start, you discover it during a planned restart, not at 3am when users are affected. If it succeeds, you\u0026rsquo;ve just validated that all your startup assumptions still hold true.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"environment-rebuilds\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#environment-rebuilds\" title=\"Environment Rebuilds\"\u003eEnvironment Rebuilds\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRebuild environments from code, not from manual patches. 
If your production infrastructure has undocumented changes scattered across SSH sessions and Slack messages, you\u0026rsquo;ve created a disaster waiting to happen.\u003c/p\u003e\n\u003cp\u003eStore everything in \u003ca href=\"https://www.terraform.io/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eTerraform\u003c/a\u003e, \u003ca href=\"https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/overview\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eBicep\u003c/a\u003e, or \u003ca href=\"https://aws.amazon.com/cloudformation/\" target=\"_blank\" rel=\"noopener external noreferrer\"\u003eCloudFormation\u003c/a\u003e. Every configuration change goes through code review and staging validation. When something breaks, you rebuild identically in 10 minutes from version control. When you discover a performance bottleneck, you update the code, get peer review, test in staging, then apply with confidence. The previous state is in git history. Rollback is one command away.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cache-and-state-management\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#cache-and-state-management\" title=\"Cache and State Management\"\u003eCache and State Management\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDo not rely on in-process caches that accumulate for months. They become invisible knowledge that only exists in memory. Instead, use distributed caches with explicit expiration times. Set TTLs (Time-To-Live values) to hours, not days. Force the cache to refresh regularly. Every 2-24 hours, the system reaches back to its source of truth instead of trusting what memory told it.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"feature-flag-discipline\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#feature-flag-discipline\" title=\"Feature Flag Discipline\"\u003eFeature Flag Discipline\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRemove flags aggressively. 
I\u0026rsquo;ve worked on systems where three-year-old feature flags were still active. The code paths they protected were theoretically unreachable, but nobody was certain enough to delete them. They accumulated like technical sediment.\u003c/p\u003e\n\u003cp\u003eEstablish a rhythm: \u003cstrong\u003eEvery quarter, audit all active flags.\u003c/strong\u003e Answer one question: \u0026ldquo;Is this flag still serving a purpose?\u0026rdquo; If the answer is no, delete it the same day. Dead code paths with unclear purposes are a slow poison. Kill them before they spread.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"force-reproducibility\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#force-reproducibility\" title=\"Force Reproducibility\"\u003eForce Reproducibility\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eThe final check: Force systems to prove they can start cleanly. Implement startup validation that runs every time your application boots. Three questions:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eCan you read essential configuration?\u003c/li\u003e\n\u003cli\u003eCan you connect to the database?\u003c/li\u003e\n\u003cli\u003eAre critical external services online?\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf any check fails, the pod doesn\u0026rsquo;t become \u0026ldquo;ready.\u0026rdquo; Kubernetes doesn\u0026rsquo;t route traffic to it. The problem surfaces immediately. No silent degradation. No invisible failures that accumulate for months. The system has to prove it\u0026rsquo;s healthy to be allowed to serve traffic.\u003c/p\u003e\n\u003cp\u003eIf your production environment cannot be recreated without tribal knowledge, you are not ventilating. You are masking smells. 
And masked smells always get worse.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"final-thought\"\u003e\u003ca href=\"/posts/stossluften-and-software-systems/#final-thought\" title=\"Final Thought\"\u003eFinal Thought\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eSwabians do not Stoßlüften because they enjoy cold air. They do it because ignoring air quality is more expensive in the long run.\u003c/p\u003e\n\u003cp\u003eThe same applies to software systems. Stability is not about avoiding disruption. It is about choosing the right kind of disruption at the right time.\u003c/p\u003e\n\u003cp\u003eKehrwoche teaches us to clean regularly.\nStoßlüften teaches us to reset deliberately.\u003c/p\u003e\n\u003cp\u003eBoth are boring. Both are effective. And both exist because people learned that slow decay is harder to fix than brief discomfort.\u003c/p\u003e\n\u003cp\u003eOpen the windows.\nLet the stale assumptions out.\nClose them again.\u003c/p\u003e\n\u003cp\u003eYour system will breathe easier afterward.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-01-16T11:30:00+01:00","id":"https://daily-devops.net/posts/stossluften-and-software-systems/","language":"en","summary":"Hidden decay slips past green dashboards: intentional resets, rebuilds, and reproducibility checks expose what monitoring quietly keeps hiding.\n","tags":["technicaldebt","architecture","devops","reliability"],"title":"Stoßlüften: The Architecture of Intentional Resets","url":"https://daily-devops.net/posts/stossluften-and-software-systems/"}],"language":"en","title":"System Reliability and Resilience on Daily DevOps \u0026 .NET","version":"https://jsonfeed.org/version/1.1"}