{"authors":[{"name":"Martin Stühmer","url":"https://daily-devops.net/authors/martin/"},{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"description":"Recent content in Disaster Recovery for Cloud \u0026 .NET Systems on Daily DevOps \u0026 .NET","favicon":"https://daily-devops.net/images/logo_hu_6465d873dfa490cf.png","feed_url":"https://daily-devops.net/tags/disaster-recovery/feed.json","home_page_url":"https://daily-devops.net/tags/disaster-recovery/","icon":"https://daily-devops.net/images/logo_hu_5926de77762241ba.png","items":[{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\u003cp\u003eYour cluster will fail. The question is not if, but when, and whether you can recover before customers notice. Most organizations discover their backup strategy does not work during an actual outage, when recovery time matters most and manual heroics cannot save you.\u003c/p\u003e\n\u003cp\u003eIf you run Azure Kubernetes Service (AKS) in production, you need a recovery plan that engineers can execute half asleep at 2 AM. We will go through what to back up, how Velero works in day-to-day operations, when Azure Backup for AKS is enough, and how to design realistic failover with measurable Recovery Time Objective (RTO) and Recovery Point Objective (RPO).\u003c/p\u003e\n\u003cp\u003eThe goal is simple: repeatable recovery procedures you have already tested, not a document that looks good in Confluence but fails during an incident.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"the-problem-untested-recovery-fails-when-it-matters\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#the-problem-untested-recovery-fails-when-it-matters\" title=\"The problem: Untested recovery fails when it matters\"\u003eThe problem: Untested recovery fails when it matters\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eEvery Kubernetes cluster accumulates state that must survive failures. 
Application data lives in persistent volumes. Cluster configuration exists in custom resource definitions. Workload definitions sit in YAML manifests scattered across repositories. Identity mappings, secrets, network policies, and RBAC rules define how services authenticate and communicate. Losing any of these components means downtime, data loss, and manual reconstruction under time pressure.\u003c/p\u003e\n\u003cp\u003eThe real risk is not the absence of a backup strategy. It is discovering that your backup strategy does not work during an actual incident, when recovery time directly determines customer impact and business cost.\u003c/p\u003e\n\u003cp\u003eOperational reality: Most teams test backup creation but never test restoration. A backup you have never restored is a backup you cannot rely on when you need it. Recovery procedures that require manual steps will fail during high-pressure incidents when engineers make mistakes and documentation is incomplete.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"what-needs-backup-understanding-cluster-state\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#what-needs-backup-understanding-cluster-state\" title=\"What needs backup: Understanding cluster state\"\u003eWhat needs backup: Understanding cluster state\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes clusters contain multiple layers of state that require different backup approaches.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"application-data-persistent-volumes\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#application-data-persistent-volumes\" title=\"Application data: Persistent volumes\"\u003eApplication data: Persistent volumes\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePersistent volumes hold databases, file storage, configuration data, and application state. Losing persistent volume data typically means permanent data loss unless you maintain application-level replication or external backups. 
Azure Disks and Azure Files both support snapshot-based backup, but snapshots alone do not capture the Kubernetes metadata required to restore volumes to the correct pods in the correct namespaces.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"cluster-configuration-custom-resources-and-crds\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#cluster-configuration-custom-resources-and-crds\" title=\"Cluster configuration: Custom resources and CRDs\"\u003eCluster configuration: Custom resources and CRDs\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eCustom Resource Definitions extend Kubernetes with domain-specific objects. Operators, service meshes, monitoring stacks, and policy engines all define Custom Resource Definitions (CRDs) that control cluster behavior. Losing CRDs means losing the schema and logic that your cluster depends on. Restoring CRDs without the corresponding custom resource objects leaves your cluster in an inconsistent state.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"application-definitions-workload-manifests\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#application-definitions-workload-manifests\" title=\"Application definitions: Workload manifests\"\u003eApplication definitions: Workload manifests\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDeployments, StatefulSets, Services, ConfigMaps, and Secrets define what runs in your cluster. Most teams store these manifests in Git, but cluster state drifts from Git over time due to manual changes, automated rollouts, and operator modifications. 
Restoring from Git alone may not reflect actual production state.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"identity-and-access-rbac-and-service-accounts\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#identity-and-access-rbac-and-service-accounts\" title=\"Identity and access: RBAC and service accounts\"\u003eIdentity and access: RBAC and service accounts\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRole-based access control (RBAC), ServiceAccounts, and Microsoft Entra ID (formerly Azure AD) integration define who can access which resources. Losing RBAC configuration means losing security boundaries and breaking automated workflows that depend on specific service account permissions.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"network-configuration-policies-and-ingress-rules\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#network-configuration-policies-and-ingress-rules\" title=\"Network configuration: Policies and ingress rules\"\u003eNetwork configuration: Policies and ingress rules\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eNetwork policies, ingress controllers, and DNS mappings control how traffic flows into and within your cluster. Restoring workloads without restoring network configuration results in unreachable services and broken traffic routing.\u003c/p\u003e\n\u003cp\u003eA complete backup strategy captures all of these layers and validates that restoration procedures actually work.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"velero-production-backup-workflows\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#velero-production-backup-workflows\" title=\"Velero: Production backup workflows\"\u003eVelero: Production backup workflows\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eVelero is the de facto standard for Kubernetes backup and restore. 
It runs as a controller inside your cluster, captures cluster state and persistent volume snapshots, and stores backups in object storage.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"how-velero-works\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#how-velero-works\" title=\"How Velero works\"\u003eHow Velero works\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eVelero operates in two phases: backup and restore. During backup, Velero queries the Kubernetes API for resources matching your backup selectors, serializes those resources to JSON, and uploads the result to cloud object storage (Azure Blob Storage for AKS). For persistent volumes, Velero either triggers Azure Disk snapshots or performs file-level backups through its file system backup feature (Restic in older releases, Kopia in newer ones).\u003c/p\u003e\n\u003cp\u003eDuring restore, Velero downloads the backup manifest, applies resources to the target cluster, and restores persistent volume data from snapshots or file-level backup repositories. Velero restores resources in a fixed priority order (for example, CRDs and namespaces before workloads) and can remap namespaces when you pass the \u003ccode\u003e--namespace-mappings\u003c/code\u003e flag.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"backup-scheduling-and-retention\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#backup-scheduling-and-retention\" title=\"Backup scheduling and retention\"\u003eBackup scheduling and retention\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eProduction backup strategies require automated scheduling and retention policies. 
Velero supports cron-based schedules and configurable retention windows.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c\"\u003e# Velero backup schedule - Helm values\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003evelero\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedules\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003edaily\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"c\"\u003e# Run full backup daily at 2 AM UTC\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedule\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 2 * * *\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003etemplate\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ettl\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;720h\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Retain backups for 30 days\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003eincludedNamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"l\"\u003estaging\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003esnapshotVolumes\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e    \u003c/span\u003e\u003cspan class=\"nt\"\u003ehourly-critical\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"c\"\u003e# Run hourly backup for critical namespaces\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003eschedule\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 * * * *\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e      \u003c/span\u003e\u003cspan class=\"nt\"\u003etemplate\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003ettl\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;168h\u0026#34;\u003c/span\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Retain backups for 7 days\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003eincludedNamespaces\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e- \u003cspan class=\"l\"\u003eproduction\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003elabelSelector\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e          \u003c/span\u003e\u003cspan class=\"nt\"\u003ematchLabels\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e            \u003c/span\u003e\u003cspan class=\"nt\"\u003ebackup-frequency\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ehourly\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e        \u003c/span\u003e\u003cspan class=\"nt\"\u003esnapshotVolumes\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eFor many teams, this 
minimal Terraform baseline is easier to maintain than a large, custom module. It creates the storage account and container Velero needs.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-hcl\" data-lang=\"hcl\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_storage_account\u0026#34; \u0026#34;velero\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                     \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;velerobackup${var.environment}\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  resource_group_name\u003c/span\u003e      \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003eresource_group_name\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  location\u003c/span\u003e                 \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003evar\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003elocation\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  account_tier\u003c/span\u003e             \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;Standard\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  
account_replication_type\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;GRS\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eresource\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;azurerm_storage_container\u0026#34; \u0026#34;velero\u0026#34;\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  name\u003c/span\u003e                  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;velero\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  storage_account_name\u003c/span\u003e  \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"k\"\u003eazurerm_storage_account\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003evelero\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"k\"\u003ename\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003e  container_access_type\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;private\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThen install Velero with Helm and pass only four required values: provider (\u003ccode\u003eazure\u003c/code\u003e), storage account name, 
blob container name, and resource group. Keep advanced tuning for later once backups and restores are stable.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"testing-restore-procedures\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#testing-restore-procedures\" title=\"Testing restore procedures\"\u003eTesting restore procedures\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eBackup creation means nothing without verified restore capability. Production-grade DR requires regular restore testing in isolated environments.\u003c/p\u003e\n\u003cp\u003eRestore testing workflow:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eCreate a test AKS cluster in a separate resource group\u003c/li\u003e\n\u003cli\u003eInstall Velero with read-only access to production backup storage\u003c/li\u003e\n\u003cli\u003eExecute a restore operation for a representative namespace\u003c/li\u003e\n\u003cli\u003eValidate application functionality and data integrity\u003c/li\u003e\n\u003cli\u003eDocument restoration time and any issues encountered\u003c/li\u003e\n\u003cli\u003eDestroy the test cluster\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eRun this workflow monthly at minimum. Quarterly is too infrequent because configuration drift and Velero version updates accumulate between tests and cause surprises. 
Teams that skip restore testing discover broken procedures during actual outages.\u003c/p\u003e\n\u003cp\u003eCommon restore failures: Missing CRDs (restore CRDs before custom resources), incorrect namespace mappings (use Velero namespace mapping features), persistent volume availability zones (Azure Disks are zone-locked), and missing secrets (external secret management requires separate backup).\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-native-backup-when-to-use-it\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#azure-native-backup-when-to-use-it\" title=\"Azure native backup: When to use it\"\u003eAzure native backup: When to use it\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAzure Backup for AKS launched in 2023 and provides Azure-native cluster backup without deploying Velero. It integrates with Azure Backup vaults and uses the same portal experience as VM and database backups.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"azure-backup-vs-velero\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#azure-backup-vs-velero\" title=\"Azure Backup vs Velero\"\u003eAzure Backup vs Velero\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eAzure Backup works well for organizations heavily invested in Azure tooling who want unified backup management across all Azure resources. It handles backup scheduling, retention, and monitoring through familiar Azure interfaces.\u003c/p\u003e\n\u003cp\u003eLimitations compared to Velero: Less flexibility in backup selectors and namespace filtering, fewer options for cross-region backup replication, and vendor lock-in to Azure. Velero supports multi-cloud scenarios and offers more granular control over what gets backed up.\u003c/p\u003e\n\u003cp\u003eRecommendation: Use Azure Backup if your organization already standardizes on Azure Backup for other resources and you do not require multi-cloud portability. 
Use Velero if you need maximum flexibility, cross-region replication control, or multi-cloud backup capability.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-region-failover-designing-for-actual-recovery\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#multi-region-failover-designing-for-actual-recovery\" title=\"Multi-region failover: Designing for actual recovery\"\u003eMulti-region failover: Designing for actual recovery\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eSingle-region deployments create single points of failure. Multi-region architectures provide genuine disaster recovery capability but introduce complexity in state synchronization, traffic routing, and recovery orchestration.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"failover-architecture-patterns\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#failover-architecture-patterns\" title=\"Failover architecture patterns\"\u003eFailover architecture patterns\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eActive-passive:\u003c/strong\u003e Primary region handles all traffic. Secondary region remains idle but receives regular backup replication. During failover, you restore backups to the secondary cluster and redirect traffic. Recovery time depends on backup restore speed and DNS propagation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eActive-active:\u003c/strong\u003e Both regions handle production traffic simultaneously. Application state synchronizes continuously (database replication, event streaming, or shared storage). During regional failure, traffic shifts to the remaining region. Recovery time depends on health check detection and DNS/load balancer failover speed.\u003c/p\u003e\n\u003cp\u003eActive-passive costs less but requires longer recovery time. 
Active-active provides faster failover but doubles infrastructure cost and requires application-level state synchronization.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"dns-failover-automation\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#dns-failover-automation\" title=\"DNS failover automation\"\u003eDNS failover automation\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eDNS-based failover redirects traffic between regions by updating DNS records to point at healthy endpoints. Azure Traffic Manager and Azure Front Door both provide automatic failover based on health probes.\u003c/p\u003e\n\u003cp\u003eUse a small script first, then expand it over time. This keeps incident handling understandable for on-call engineers.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"cp\"\u003e#!/usr/bin/env bash\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eset\u003c/span\u003e -euo pipefail\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSECONDARY_RG\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;rg-aks-westus\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eSECONDARY_CLUSTER\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;aks-dr-westus\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"nv\"\u003eTM_RG\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;rg-networking\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nv\"\u003eTM_PROFILE\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;tm-aks-prod\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;1) Connect to secondary cluster\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz aks get-credentials -g \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$SECONDARY_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e -n \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$SECONDARY_CLUSTER\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --overwrite-existing\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003ekubectl cluster-info\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;2) Trigger restore from latest Velero backup\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero restore create dr-\u003cspan class=\"k\"\u003e$(\u003c/span\u003edate +%Y%m%d-%H%M\u003cspan class=\"k\"\u003e)\u003c/span\u003e --from-backup \u003cspan 
class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"k\"\u003e$(\u003c/span\u003ekubectl get backups.velero.io --all-namespaces --sort-by=.metadata.creationTimestamp -o name \u003cspan class=\"p\"\u003e|\u003c/span\u003e tail -n1 \u003cspan class=\"p\"\u003e|\u003c/span\u003e cut -d/ -f2\u003cspan class=\"k\"\u003e)\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --wait\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003eecho\u003c/span\u003e \u003cspan class=\"s2\"\u003e\u0026#34;3) Switch Traffic Manager endpoint\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz network traffic-manager endpoint update --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --profile-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_PROFILE\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name endpoint-eastus --type azureEndpoints --endpoint-status Disabled\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003eaz network traffic-manager endpoint update --resource-group \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_RG\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --profile-name \u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e\u003cspan class=\"nv\"\u003e$TM_PROFILE\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;\u003c/span\u003e --name endpoint-westus --type azureEndpoints --endpoint-status Enabled\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis script is intentionally small. 
Add pre-checks and post-checks later, but start with a version every engineer can understand quickly during an outage.\u003c/p\u003e\n\u003cp\u003eThis script automates the critical failover steps, but a human should still trigger it and verify the outcome of each stage. Fully automated failover without human approval risks unnecessary region switches during transient failures.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"state-synchronization-strategies\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#state-synchronization-strategies\" title=\"State synchronization strategies\"\u003eState synchronization strategies\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eMulti-region architectures require careful state management. Databases need replication (Azure SQL geo-replication, Cosmos DB multi-region writes). Object storage needs cross-region replication (Azure Blob Storage GRS). Message queues require either regional isolation or cross-region synchronization (Azure Service Bus premium tier supports geo-replication).\u003c/p\u003e\n\u003cp\u003eStateless services fail over easily. Stateful services require a replication strategy that is planned during the design phase, not during incident response.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"rto-and-rpo-calculating-realistic-targets\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#rto-and-rpo-calculating-realistic-targets\" title=\"RTO and RPO: Calculating realistic targets\"\u003eRTO and RPO: Calculating realistic targets\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRecovery Time Objective (RTO) measures how long systems can be down before business impact becomes unacceptable. 
Recovery Point Objective (RPO) measures how much data loss is acceptable.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"calculating-rto\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#calculating-rto\" title=\"Calculating RTO\"\u003eCalculating RTO\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRTO includes: detection time (how long until you know there is a problem), decision time (how long to decide failover is necessary), restore time (how long to restore from backup or switch regions), and validation time (how long to confirm restoration worked).\u003c/p\u003e\n\u003cp\u003eExample calculation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eDetection: 5 minutes (health check interval)\u003c/li\u003e\n\u003cli\u003eDecision: 10 minutes (incident escalation and approval)\u003c/li\u003e\n\u003cli\u003eRestore: 45 minutes (Velero restore for 500GB cluster)\u003c/li\u003e\n\u003cli\u003eValidation: 15 minutes (smoke tests and traffic verification)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTotal RTO: 75 minutes\u003c/strong\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf business requirements demand 30-minute RTO, your current backup-based approach will not meet SLOs. You need active-active architecture or pre-warmed standby clusters.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"calculating-rpo\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#calculating-rpo\" title=\"Calculating RPO\"\u003eCalculating RPO\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eRPO depends on backup frequency. Hourly backups mean up to 60 minutes of data loss. 
If your application cannot tolerate 60 minutes of data loss, you need more frequent backups or continuous replication.\u003c/p\u003e\n\u003cp\u003eExample calculation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBackup frequency: Every 4 hours\u003c/li\u003e\n\u003cli\u003eLast backup: 2 hours ago\u003c/li\u003e\n\u003cli\u003eRegional failure occurs now\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eData loss: 2 hours\u003c/strong\u003e (time since last backup; the worst case is just under 4 hours)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf business requirements demand a 15-minute RPO, 4-hour backup intervals will not meet SLOs, and neither will hourly backups. You need backups at least every 15 minutes, application-level replication, or continuous event streaming to a secondary region.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"designing-for-slos-without-over-engineering\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#designing-for-slos-without-over-engineering\" title=\"Designing for SLOs without over-engineering\"\u003eDesigning for SLOs without over-engineering\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eMany teams over-engineer DR solutions trying to achieve zero data loss and instant failover without understanding actual business requirements. 
A 4-hour RTO may be acceptable for internal tooling but catastrophic for customer-facing APIs.\u003c/p\u003e\n\u003cp\u003ePractical use case:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eInternal reporting API: 2-hour RTO and 1-hour RPO can be enough, active-passive is usually fine.\u003c/li\u003e\n\u003cli\u003eCustomer checkout API: 15-minute RTO and near-zero RPO usually require active-active plus database replication.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe recurring theme is business impact, not architecture fashion.\u003c/p\u003e\n\u003cp\u003eStart by identifying actual business impact:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWhat revenue is lost per hour of downtime?\u003c/li\u003e\n\u003cli\u003eWhat customer commitments exist in SLAs?\u003c/li\u003e\n\u003cli\u003eWhat regulatory requirements mandate specific recovery times?\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThen design the minimum viable DR solution that meets those requirements. Do not build active-active multi-region architecture with continuous replication if business requirements allow 2-hour RTO and 1-hour RPO. That level of complexity costs significant engineering time and operational overhead.\u003c/p\u003e\n\u003cp\u003eConversely, do not assume daily backups suffice for production systems without validating business tolerance for 24-hour data loss.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"best-practices-what-actually-works\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#best-practices-what-actually-works\" title=\"Best practices: What actually works\"\u003eBest practices: What actually works\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eTest restore procedures regularly.\u003c/strong\u003e Monthly restore testing in isolated environments catches broken procedures before actual incidents. 
Quarterly testing is too infrequent.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAutomate backup verification.\u003c/strong\u003e Run automated restore tests that verify backup integrity and measure restoration time. Manual testing does not scale and gets skipped under time pressure.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDocument recovery procedures.\u003c/strong\u003e Runbooks that sit in Confluence do not get updated and will be wrong during incidents. Store recovery procedures as executable scripts in version control and test them regularly.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSeparate backup storage from cluster infrastructure.\u003c/strong\u003e Do not store backups in the same region or subscription as the cluster. Regional Azure outages impact all resources in that region including backup storage.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePlan for partial failures.\u003c/strong\u003e Not every incident requires full cluster restore. Design procedures for restoring individual namespaces, specific workloads, or single persistent volumes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUse infrastructure as code for cluster rebuild.\u003c/strong\u003e Terraform or Bicep definitions for cluster creation enable rapid cluster recreation when restoration is not the best recovery path.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMonitor backup jobs.\u003c/strong\u003e Failed backups are worthless. Alert on backup failures and missing backup runs. 
Do not discover backup gaps during recovery.\u003c/p\u003e\n\u003cp\u003eIf you are defining a monthly DR game day, include three quick checks every time:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eCan we restore one namespace end to end in a clean test cluster?\u003c/li\u003e\n\u003cli\u003eCan we switch traffic and run smoke tests in less than our RTO?\u003c/li\u003e\n\u003cli\u003eCan we prove data freshness is inside the RPO window?\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eIf one answer is no, your DR posture is weaker than your dashboard suggests.\u003c/p\u003e\n\u003cp\u003eCommon mistakes: Storing backups in same region as cluster (regional failure loses backups and cluster), never testing restore procedures (broken backups discovered during incidents), manual recovery procedures (humans make mistakes under pressure), and no RTO/RPO measurement (cannot tell if recovery meets business requirements).\u003c/p\u003e\n\u003cp\u003eAuthor note: I have participated in exactly two real disaster recovery situations involving Kubernetes clusters. In the first incident, backup restoration worked but took 3 hours longer than documented because volume snapshot region restrictions were not tested. In the second incident, backups existed but CRD restoration failed because CRD versions changed between backup and restore. Both incidents would have been prevented by regular restore testing. Do not learn this lesson during a production outage.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"conclusion\"\u003e\u003ca href=\"/posts/disaster-recovery-business-continuity-aks/#conclusion\" title=\"Conclusion\"\u003eConclusion\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eDisaster recovery for AKS requires deliberate planning, regular testing, and honest assessment of recovery capabilities. Velero provides proven backup and restore workflows. Azure native backup offers simplified management for Azure-focused organizations. 
Multi-region architectures enable faster recovery but increase complexity and cost.\u003c/p\u003e\n\u003cp\u003eThe real test is not having a backup strategy documented in Confluence. The real test is whether you can restore your cluster from backup in under 60 minutes during an actual regional outage at 2 AM when half your team is asleep and the incident commander is asking for status updates.\u003c/p\u003e\n\u003cp\u003eBuild repeatable procedures. Test them monthly. Automate everything you can. Measure actual RTO and RPO. Add one more rule: if a step cannot be executed from version-controlled scripts, it is probably not ready for production incidents.\u003c/p\u003e\n\u003cp\u003eRelated reading for AKS operations maturity: \u003ca href=\"/posts/cluster-upgrades-zero-downtime-aks/\"\u003eAKS Cluster Upgrades Without Downtime\u003c/a\u003e.\u003c/p\u003e","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-03-11T17:00:00+01:00","id":"https://daily-devops.net/posts/disaster-recovery-business-continuity-aks/","language":"en","summary":"AKS outages happen. 
Build a tested DR plan with Velero, realistic RTO/RPO targets, and multi-region failover steps your team can run under pressure.","tags":["disaster-recovery","azure","kubernetes","cloud","devops","reliability","compliance"],"title":"AKS Disaster Recovery: Why Your Untested Backup Will Fail","url":"https://daily-devops.net/posts/disaster-recovery-business-continuity-aks/"},{"authors":[{"name":"Jendrik Brack","url":"https://daily-devops.net/authors/jendrik/"}],"content_html":"\n\n\n\n\u003ch2 id=\"the-problem-traditional-storage-models-dont-translate-to-kubernetes\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#the-problem-traditional-storage-models-dont-translate-to-kubernetes\" title=\"The Problem: Traditional Storage Models Don\u0026rsquo;t Translate to Kubernetes\"\u003eThe Problem: Traditional Storage Models Don\u0026rsquo;t Translate to Kubernetes\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning stateful workloads in Kubernetes means more than deploying a database pod. Traditional storage models (provision a disk, format it, mount it, expect it to stay) collide with Kubernetes\u0026rsquo; ephemeral, distributed architecture. Pods get rescheduled, scaled, and terminated. Your database shouldn\u0026rsquo;t lose data when that happens.\u003c/p\u003e\n\u003cp\u003eThe core challenge: \u003cstrong\u003ehow do you attach persistent storage to ephemeral compute?\u003c/strong\u003e On-premises infrastructure relies on SAN devices, NFS mounts, or local disks with predictable failure domains. You know which server hosts which disk. In AKS, you work with Azure storage primitives: Managed Disks, Azure Files, blob storage. These need seamless integration with Kubernetes lifecycle management. The abstractions differ, the failure modes differ, and operational patterns require rethinking.\u003c/p\u003e\n\u003cp\u003eComplexity multiplies with backup requirements, disaster recovery expectations, and multi-cluster data synchronization. 
Whether migrating legacy apps that expect local RAID controllers or building cloud-native data platforms from scratch, AKS storage architecture knowledge is foundational. Get it wrong: data loss, performance bottlenecks, escalating cloud bills.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"pvcpv-architecture-how-storage-binds-to-pods-in-aks\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#pvcpv-architecture-how-storage-binds-to-pods-in-aks\" title=\"PVC/PV Architecture: How Storage Binds to Pods in AKS\"\u003ePVC/PV Architecture: How Storage Binds to Pods in AKS\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes abstracts storage through two key objects: \u003cstrong\u003ePersistentVolumes (PV)\u003c/strong\u003e and \u003cstrong\u003ePersistentVolumeClaims (PVC)\u003c/strong\u003e. A PV represents the actual storage resource (Azure Disk, Azure Files share). A PVC represents the request for that storage. The relationship mirrors compute abstractions: nodes are physical machines, pods are logical units consuming node resources. 
Similarly, PVs are physical storage, PVCs are logical requests consuming PV capacity.\u003c/p\u003e\n\u003cp\u003eThe binding flow:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eDeveloper creates a PVC specifying size, access mode, and storage class\u003c/li\u003e\n\u003cli\u003eKubernetes finds or provisions a matching PV based on the storage class\u003c/li\u003e\n\u003cli\u003ePVC binds to the PV, making it available to pods\u003c/li\u003e\n\u003cli\u003ePods reference the PVC in their volume mounts\u003c/li\u003e\n\u003cli\u003eWhen the pod terminates, the PVC remains (data persists across pod lifecycles)\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAccess modes matter:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eReadWriteOnce (RWO)\u003c/strong\u003e: Single node can mount the volume (Azure Disk)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReadWriteMany (RWX)\u003c/strong\u003e: Multiple nodes can mount simultaneously (Azure Files)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReadOnlyMany (ROX)\u003c/strong\u003e: Multiple nodes, read-only access\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eMost stateful apps (databases, message queues) use RWO. Azure Disks provide better IOPS and latency than Azure Files. For shared storage (parallel batch processing, shared config directories, legacy apps expecting NFS semantics), use RWX: Azure Files or third-party CSI drivers like NFS or CephFS.\u003c/p\u003e\n\u003cp\u003eCritical insight: \u003cstrong\u003ePVCs decouple storage requests from storage implementation.\u003c/strong\u003e Developers don\u0026rsquo;t need to know if they get a Premium SSD or Standard HDD. They request 100Gi of fast storage, the storage class handles provisioning. 
This abstraction enables platform teams to enforce policies (all production PVCs use Premium tier) without touching application manifests.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"azure-disk-vs-azure-files-performance-cost-regional-constraints\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#azure-disk-vs-azure-files-performance-cost-regional-constraints\" title=\"Azure Disk vs. Azure Files: Performance, Cost, Regional Constraints\"\u003eAzure Disk vs. Azure Files: Performance, Cost, Regional Constraints\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eChoosing between Azure Disk and Azure Files isn\u0026rsquo;t a one-size-fits-all decision. Each has distinct performance profiles, cost implications, and operational constraints.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAzure Disk (Managed Disks):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePerformance:\u003c/strong\u003e Lower latency, higher IOPS. Premium SSDs reach 20,000 IOPS, Ultra Disks exceed that.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAccess:\u003c/strong\u003e Single-node attachment (RWO). Pod rescheduling to another node triggers disk detach and reattach (expect brief delay).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse cases:\u003c/strong\u003e Databases (PostgreSQL, MongoDB), stateful apps requiring low-latency I/O.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCost:\u003c/strong\u003e Pay per provisioned disk size. A 1TB Premium SSD costs more than a 1TB Standard HDD, regardless of actual usage.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegional constraints:\u003c/strong\u003e Disks are zone-specific. With availability zones, pods must schedule in the same zone as the disk.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eAzure Files (SMB/NFS):\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePerformance:\u003c/strong\u003e Higher latency than disks. 
Premium Files tier improves performance but still trails disk I/O.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAccess:\u003c/strong\u003e Multi-node (RWX). Multiple pods across nodes can mount the same share.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse cases:\u003c/strong\u003e Shared logs, static assets, config files, legacy apps expecting NFS.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCost:\u003c/strong\u003e Pay per storage consumed plus transactions. Transaction costs surprise teams on high-throughput workloads.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRegional constraints:\u003c/strong\u003e File shares are regional, not zonal. Better for cross-zone workloads, still tied to single region.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eDecision criteria:\u003c/strong\u003e Default to Azure Disk for databases and high-IOPS apps. Use Azure Files only when RWX access or legacy NFS compatibility is required. For backup targets or archival storage, consider blob storage with CSI drivers (experimental, improving).\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"the-disk-attachment-penalty\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#the-disk-attachment-penalty\" title=\"The Disk Attachment Penalty\"\u003eThe Disk Attachment Penalty\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eGotcha: \u003cstrong\u003edisk attachment times.\u003c/strong\u003e Pod rescheduling requires Azure to detach the disk from the old node and attach it to the new one. This takes 30 to 90 seconds. 
Apps that cannot tolerate this downtime need application-level replication (PostgreSQL streaming replication) or third-party solutions like Portworx.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"storage-classes--dynamic-provisioning-automating-the-lifecycle\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#storage-classes--dynamic-provisioning-automating-the-lifecycle\" title=\"Storage Classes \u0026amp; Dynamic Provisioning: Automating the Lifecycle\"\u003eStorage Classes \u0026amp; Dynamic Provisioning: Automating the Lifecycle\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStatic provisioning (manually creating PVs, hoping someone claims them) creates operational overhead. \u003cstrong\u003eStorage classes\u003c/strong\u003e enable dynamic provisioning: Kubernetes automatically creates a PV when a PVC is submitted.\u003c/p\u003e\n\u003cp\u003eAKS ships with default storage classes:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003edefault\u003c/code\u003e: Standard SSD Azure Disk (RWO)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003emanaged-premium\u003c/code\u003e: Premium SSD Azure Disk (RWO)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eazurefile\u003c/code\u003e: Azure Files share (RWX)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eazurefile-premium\u003c/code\u003e: Premium Azure Files share (RWX)\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eYou can define custom storage classes to fine-tune parameters:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-yaml\" data-lang=\"yaml\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eapiVersion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003estorage.k8s.io/v1\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eStorageClass\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003emetadata\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ename\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003efast-ssd\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eprovisioner\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003edisk.csi.azure.com\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eparameters\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003eskuName\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003ePremium_LRS\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ekind\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eManaged\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"nt\"\u003ecachingMode\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eReadOnly\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# Zone redundant storage (ZRS) for higher durability\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"w\"\u003e  \u003c/span\u003e\u003cspan class=\"c\"\u003e# skuName: Premium_ZRS\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003eallowVolumeExpansion\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"kc\"\u003etrue\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003ereclaimPolicy\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eRetain\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nt\"\u003evolumeBindingMode\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\u003cspan class=\"w\"\u003e \u003c/span\u003e\u003cspan class=\"l\"\u003eWaitForFirstConsumer\u003c/span\u003e\u003cspan class=\"w\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eKey parameters:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ereclaimPolicy:\u003c/strong\u003e \u003ccode\u003eDelete\u003c/code\u003e removes the disk when PVC is deleted, \u003ccode\u003eRetain\u003c/code\u003e keeps it. For production databases, \u003ccode\u003eRetain\u003c/code\u003e prevents accidental data deletion.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003evolumeBindingMode:\u003c/strong\u003e \u003ccode\u003eWaitForFirstConsumer\u003c/code\u003e delays PV creation until pod scheduling. Critical for zone-aware clusters (Kubernetes creates the disk in the same zone as the pod).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eallowVolumeExpansion:\u003c/strong\u003e Enables PVC resizing without recreation. Azure Disks support this, not all storage backends do.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eBest practice:\u003c/strong\u003e Create environment-specific storage classes (dev, staging, prod) with different \u003ccode\u003eskuName\u003c/code\u003e values. Dev clusters use Standard HDDs, prod uses Premium SSDs. Developers use identical manifests across environments, only the storage class name changes.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"backup--recovery-rtorpo-implications\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#backup--recovery-rtorpo-implications\" title=\"Backup \u0026amp; Recovery: RTO/RPO Implications\"\u003eBackup \u0026amp; Recovery: RTO/RPO Implications\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eKubernetes doesn\u0026rsquo;t backup data by default. 
Running \u003ccode\u003ekubectl delete pvc\u003c/code\u003e without a recovery plan means permanent data loss.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eVelero\u003c/strong\u003e (formerly Heptio Ark) is the de facto standard for Kubernetes backup. It snapshots PVs, captures Kubernetes object state, stores backups in object storage (Azure Blob, S3, GCS).\u003c/p\u003e\n\u003cp\u003eExample Velero backup schedule (via CLI):\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Install Velero with Azure plugin\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero install \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --provider azure \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --bucket velero-backups \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --secret-file ./credentials-velero \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --backup-location-config \u003cspan class=\"nv\"\u003eresourceGroup\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003eaks-backups-rg,storageAccount\u003cspan class=\"o\"\u003e=\u003c/span\u003eaksbackupssa\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan 
class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Create a daily backup schedule for production namespace\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003evelero schedule create daily-prod-backup \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --schedule\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;0 2 * * *\u0026#34;\u003c/span\u003e \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --include-namespaces production \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --snapshot-volumes \u003cspan class=\"se\"\u003e\\\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e  --ttl 720h\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\n\n\n\n\u003ch3 id=\"rto-and-rpo-considerations\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#rto-and-rpo-considerations\" title=\"RTO And RPO Considerations\"\u003eRTO And RPO Considerations\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eRTO/RPO considerations:\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSnapshot-based backups (Azure Disk snapshots via Velero):\u003c/strong\u003e RPO equals backup frequency (hourly, daily). RTO equals time to provision new PV plus restore data (5 to 30 minutes).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNative Azure Backup for AKS:\u003c/strong\u003e Microsoft managed solution. 
Integrated with Azure Backup policies, but restores are slower and less granular than with Velero.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eApplication-level backups (pg_dump, mongodump):\u003c/strong\u003e Bypass Kubernetes entirely. Lower RTO with automated restore scripts, but they require custom orchestration.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eGotcha:\u003c/strong\u003e Velero relies on Azure Disk snapshots. If the disk lives in Zone 1 and you restore into a cluster in Zone 2, the restore requires a cross-zone snapshot copy, which is not instant. Test restore procedures in non-prod clusters. A backup never restored is wishful thinking.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"multi-aks-replication-patterns-for-cross-cluster-data-synchronization\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#multi-aks-replication-patterns-for-cross-cluster-data-synchronization\" title=\"Multi-AKS Replication: Patterns for Cross-Cluster Data Synchronization\"\u003eMulti-AKS Replication: Patterns for Cross-Cluster Data Synchronization\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eRunning stateful workloads across multiple AKS clusters, whether for HA, disaster recovery, or multi-region latency requirements, adds another layer of complexity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 1: Application-Level Replication\u003c/strong\u003e\nLet the application handle replication. PostgreSQL streaming replication, MongoDB replica sets, and Kafka replication understand their data models and replicate efficiently.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e No Kubernetes-specific dependencies. 
Works identically in VMs, on-premises, or managed services.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e You manage replication lag, split-brain scenarios, and failover logic.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 2: Storage-Level Replication\u003c/strong\u003e\nUse Azure NetApp Files or third-party solutions like Portworx for block or file-level replication.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e Transparent to applications. Works with legacy apps lacking native replication.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e Expensive. NetApp Files Premium tier and Portworx licensing (scales with node count) add significant cost.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePattern 3: Backup-Based DR\u003c/strong\u003e\nVelero backups from primary cluster, restore to secondary on failover.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePros:\u003c/strong\u003e Cost-effective (blob storage only).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCons:\u003c/strong\u003e RPO equals last backup interval (hours, not seconds). 
RTO includes restore time (minutes to hours).\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\n\n\u003ch3 id=\"a-multi-region-postgresql-pattern\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#a-multi-region-postgresql-pattern\" title=\"A Multi-Region PostgreSQL Pattern\"\u003eA Multi-Region PostgreSQL Pattern\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eReal-world example:\u003c/strong\u003e A multi-region PostgreSQL deployment pattern I\u0026rsquo;ve encountered:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePrimary AKS cluster (West Europe):\u003c/strong\u003e Production traffic\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSecondary AKS cluster (North Europe):\u003c/strong\u003e Read replicas via PostgreSQL streaming replication\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eVelero backups:\u003c/strong\u003e Azure Blob in third region (East US) for regulatory compliance\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThis provides sub-second RPO within Europe (streaming replication), hourly RPO globally (Velero), and a 5-minute RTO for regional failover (promote read replica).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOperational reality:\u003c/strong\u003e Multi-cluster data replication is complex. Avoid it by using managed services (Azure Database for PostgreSQL with geo-replication) if possible. Running databases in AKS requires investment in automation, monitoring, and runbooks. Your 3 AM self will appreciate this decision.\u003c/p\u003e\n\n\n\n\n\u003ch2 id=\"final-thoughts\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#final-thoughts\" title=\"Final Thoughts\"\u003eFinal Thoughts\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eStorage in AKS represents a set of trade-offs requiring deliberate navigation. Azure Disk provides performance with zone-locking. Azure Files offers flexibility with latency penalties. Velero enables backups but demands operational discipline and testing. 
Multi-cluster replication delivers resilience with non-linear operational complexity.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"a-pragmatic-starting-point\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#a-pragmatic-starting-point\" title=\"A Pragmatic Starting Point\"\u003eA Pragmatic Starting Point\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003ePragmatic approach: Start with managed storage classes and Velero. Use Azure Disk for databases and high-IOPS workloads. Use Azure Files only when RWX access or legacy NFS compatibility is genuinely required. Test restore procedures quarterly, not during outages. Schedule fire drills: delete a namespace, restore from backup. Measure actual RTO/RPO instead of assuming SLA compliance.\u003c/p\u003e\n\n\n\n\n\u003ch3 id=\"when-to-leave-aks-for-managed-data-services\"\u003e\u003ca href=\"/posts/storage-architecture-stateful-workloads-aks/#when-to-leave-aks-for-managed-data-services\" title=\"When To Leave AKS For Managed Data Services\"\u003eWhen To Leave AKS For Managed Data Services\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eWhen stateful workload requirements outgrow AKS storage primitives (sub-second cross-region replication, disk attachment latency breaking your app, spiraling storage costs), don\u0026rsquo;t force solutions. Consider Azure managed services (Azure Database for PostgreSQL, Cosmos DB) or specialized data platforms (Confluent Cloud for Kafka, MongoDB Atlas). Sometimes the best Kubernetes storage strategy is avoiding stateful workloads in Kubernetes.\u003c/p\u003e\n\u003cp\u003eKubernetes excels at stateless orchestration. For stateful workloads, it\u0026rsquo;s capable but demands understanding the plumbing, accepting trade-offs, and building operational muscle around backups, monitoring, and runbooks. Treat storage as infrastructure that will fail, not infrastructure that just works. 
Plan accordingly.\u003c/p\u003e\n","date_modified":"2026-05-26T10:22:03+02:00","date_published":"2026-02-04T17:00:00+01:00","id":"https://daily-devops.net/posts/storage-architecture-stateful-workloads-aks/","language":"en","summary":"PVC/PV patterns, Azure Disk vs Files trade-offs, Velero backup strategies, and cross-cluster replication for production stateful workloads in AKS.","tags":["storage","azure","kubernetes","cloud","database","reliability","operations","platform-engineering","disaster-recovery"],"title":"Storage Architecture \u0026 Stateful Workloads in AKS","url":"https://daily-devops.net/posts/storage-architecture-stateful-workloads-aks/"}],"language":"en","title":"Disaster Recovery for Cloud \u0026 .NET Systems on Daily DevOps \u0026 .NET","version":"https://jsonfeed.org/version/1.1"}