AKS Disaster Recovery: Why Your Untested Backup Will Fail
Your cluster will fail. The question is not if, but when, and whether you can recover before customers notice. Most organizations discover their backup strategy does not work during an actual outage, when recovery time matters most and manual heroics cannot save you.
If you run Azure Kubernetes Service (AKS) in production, you need a recovery plan that engineers can execute half asleep at 2 AM. We will go through what to back up, how Velero works in day-to-day operations, when Azure Backup for AKS is enough, and how to design realistic failover with measurable Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
The goal is simple: repeatable recovery procedures you have already tested, not a document that looks good in Confluence but fails during an incident.
The problem: Untested recovery fails when it matters
Every Kubernetes cluster accumulates state that must survive failures. Application data lives in persistent volumes. Cluster configuration exists in custom resource definitions. Workload definitions sit in YAML manifests scattered across repositories. Identity mappings, secrets, network policies, and RBAC rules define how services authenticate and communicate. Losing any of these components means downtime, data loss, and manual reconstruction under time pressure.
The real risk is not having a backup strategy. The real risk is discovering your backup strategy does not work during an actual incident, when recovery time directly determines customer impact and business cost.
Operational reality: Most teams test backup creation but never test restoration. A backup you have never restored is a backup that will fail when you need it. Recovery procedures that require manual steps will fail during high-pressure incidents when engineers make mistakes and documentation is incomplete.
What needs backup: Understanding cluster state
Kubernetes clusters contain multiple layers of state that require different backup approaches.
Application data: Persistent volumes
Persistent volumes hold databases, file storage, configuration data, and application state. Losing persistent volume data typically means permanent data loss unless you maintain application-level replication or external backups. Azure Disks and Azure Files both support snapshot-based backup, but snapshots alone do not capture the Kubernetes metadata required to restore volumes to the correct pods in the correct namespaces.
Cluster configuration: Custom resources and CRDs
Custom Resource Definitions (CRDs) extend Kubernetes with domain-specific objects. Operators, service meshes, monitoring stacks, and policy engines all define CRDs that control cluster behavior. Losing CRDs means losing the schema and logic that your cluster depends on. Restoring CRDs without the corresponding custom resource objects leaves your cluster in an inconsistent state.
Application definitions: Workload manifests
Deployments, StatefulSets, Services, ConfigMaps, and Secrets define what runs in your cluster. Most teams store these manifests in Git, but cluster state drifts from Git over time due to manual changes, automated rollouts, and operator modifications. Restoring from Git alone may not reflect actual production state.
Identity and access: RBAC and service accounts
Role-based access control (RBAC), ServiceAccounts, and Azure AD integration define who can access which resources. Losing RBAC configuration means losing security boundaries and breaking automated workflows that depend on specific service account permissions.
Network configuration: Policies and ingress rules
Network policies, ingress controllers, and DNS mappings control how traffic flows into and within your cluster. Restoring workloads without restoring network configuration results in unreachable services and broken traffic routing.
A complete backup strategy captures all of these layers and validates that restoration procedures actually work.
Velero: Production backup workflows
Velero is the de facto standard for Kubernetes backup and restore. It runs as a controller inside your cluster, captures cluster state and persistent volume snapshots, and stores backups in object storage.
How Velero works
Velero operates in two phases: backup and restore. During backup, Velero queries the Kubernetes API for resources matching your backup selectors, serializes those resources to JSON, and uploads the result to cloud object storage (Azure Blob Storage for AKS). For persistent volumes, Velero either triggers Azure Disk snapshots or performs file-level backups through its file-system backup feature (Restic or Kopia uploader, depending on Velero version).
During restore, Velero downloads the backup manifest, applies resources to the target cluster, and restores persistent volume data from snapshots or file-system backup archives. Velero restores resources in a defined priority order and supports namespace mapping out of the box.
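The two phases map directly onto Velero's API objects. As an illustration, a one-off backup of the production namespace can be declared as a Backup custom resource; the object name here is a placeholder:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: adhoc-production   # placeholder name
  namespace: velero
spec:
  includedNamespaces:
    - production
  snapshotVolumes: true    # snapshot PVs via the volume snapshotter
  ttl: 720h0m0s            # retain for 30 days
```

In practice most teams create one-off backups with the `velero backup create` CLI instead, but the CLI generates exactly this kind of object.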
Backup scheduling and retention
Production backup strategies require automated scheduling and retention policies. Velero supports cron-based schedules and configurable retention windows.
# Velero backup schedule - Helm values
velero:
  schedules:
    daily:
      # Run full backup daily at 2 AM UTC
      schedule: "0 2 * * *"
      template:
        ttl: "720h" # Retain backups for 30 days
        includedNamespaces:
          - production
          - staging
        snapshotVolumes: true
    hourly-critical:
      # Run hourly backup for critical namespaces
      schedule: "0 * * * *"
      template:
        ttl: "168h" # Retain backups for 7 days
        includedNamespaces:
          - production
        labelSelector:
          matchLabels:
            backup-frequency: hourly
        snapshotVolumes: true
For many teams, this minimal Terraform baseline is easier to maintain than a large, custom module. It creates the storage account and container Velero needs.
resource "azurerm_storage_account" "velero" {
  name                     = "velerobackup${var.environment}"
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
}

resource "azurerm_storage_container" "velero" {
  name                  = "velero"
  storage_account_name  = azurerm_storage_account.velero.name
  container_access_type = "private"
}
Then install Velero with Helm and pass only four required values: provider (azure), storage account name, blob container name, and resource group. Keep advanced tuning for later once backups and restores are stable.
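A minimal values file for those four settings might look like the following sketch for the vmware-tanzu/velero chart. The storage account and resource group names are placeholders, and the exact value paths vary between chart versions, so verify against the chart you install:

```yaml
# Hedged sketch; check against your velero chart version's values schema.
configuration:
  backupStorageLocation:
    - name: default
      provider: azure
      bucket: velero                      # blob container name
      config:
        resourceGroup: rg-velero-prod     # placeholder
        storageAccount: velerobackupprod  # placeholder
```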
Testing restore procedures
Backup creation means nothing without verified restore capability. Production-grade DR requires regular restore testing in isolated environments.
Restore testing workflow:
- Create a test AKS cluster in a separate resource group
- Install Velero with access to production backup storage
- Execute restore operation for a representative namespace
- Validate application functionality and data integrity
- Document restoration time and any issues encountered
- Destroy test cluster
Run this workflow monthly at minimum. Quarterly is too infrequent because configuration drift and Velero version updates will cause surprises. Teams that skip restore testing discover broken procedures during actual outages.
Common restore failures: Missing CRDs (restore CRDs before custom resources), incorrect namespace mappings (use Velero namespace mapping features), persistent volume availability zones (Azure Disks are zone-locked), and missing secrets (external secret management requires separate backup).
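As a sketch of the namespace-mapping approach, a Restore object can redirect a production backup into a scratch namespace so the test never touches live workloads. The restore and backup names below are placeholders:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-test-production      # placeholder name
  namespace: velero
spec:
  backupName: daily-20240101020000   # placeholder backup name
  includedNamespaces:
    - production
  namespaceMapping:
    production: restore-test         # restore into a scratch namespace
  restorePVs: true                   # also restore persistent volume data
```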
Azure native backup: When to use it
Azure Backup for AKS launched in 2023 and provides Azure-native cluster backup without deploying Velero. It integrates with Azure Backup vaults and uses the same portal experience as VM and database backups.
Azure Backup vs Velero
Azure Backup works well for organizations heavily invested in Azure tooling who want unified backup management across all Azure resources. It handles backup scheduling, retention, and monitoring through familiar Azure interfaces.
Limitations compared to Velero: Less flexibility in backup selectors and namespace filtering, fewer options for cross-region backup replication, and vendor lock-in to Azure. Velero supports multi-cloud scenarios and offers more granular control over what gets backed up.
Recommendation: Use Azure Backup if your organization already standardizes on Azure Backup for other resources and you do not require multi-cloud portability. Use Velero if you need maximum flexibility, cross-region replication control, or multi-cloud backup capability.
Multi-region failover: Designing for actual recovery
Single-region deployments create single points of failure. Multi-region architectures provide genuine disaster recovery capability but introduce complexity in state synchronization, traffic routing, and recovery orchestration.
Failover architecture patterns
Active-passive: Primary region handles all traffic. Secondary region remains idle but receives regular backup replication. During failover, you restore backups to the secondary cluster and redirect traffic. Recovery time depends on backup restore speed and DNS propagation.
Active-active: Both regions handle production traffic simultaneously. Application state synchronizes continuously (database replication, event streaming, or shared storage). During regional failure, traffic shifts to the remaining region. Recovery time depends on health check detection and DNS/load balancer failover speed.
Active-passive costs less but requires longer recovery time. Active-active provides faster failover but doubles infrastructure cost and requires application-level state synchronization.
DNS failover automation
DNS-based failover redirects traffic between regions by updating DNS records to point at healthy endpoints. Azure Traffic Manager and Azure Front Door both provide automatic failover based on health probes.
Use a small script first, then expand it over time. This keeps incident handling understandable for on-call engineers.
#!/usr/bin/env bash
set -euo pipefail
SECONDARY_RG="rg-aks-westus"
SECONDARY_CLUSTER="aks-dr-westus"
TM_RG="rg-networking"
TM_PROFILE="tm-aks-prod"
echo "1) Connect to secondary cluster"
az aks get-credentials -g "$SECONDARY_RG" -n "$SECONDARY_CLUSTER" --overwrite-existing
kubectl cluster-info
echo "2) Trigger Velero restore"
# Pass the backup name explicitly; picking "latest" automatically is easy to get wrong under pressure
BACKUP_NAME="${1:?usage: $0 <velero-backup-name>}"
velero restore create "dr-$(date +%Y%m%d-%H%M)" --from-backup "$BACKUP_NAME" --wait
echo "3) Switch Traffic Manager endpoint"
az network traffic-manager endpoint update --resource-group "$TM_RG" --profile-name "$TM_PROFILE" --name endpoint-eastus --type azureEndpoints --endpoint-status Disabled
az network traffic-manager endpoint update --resource-group "$TM_RG" --profile-name "$TM_PROFILE" --name endpoint-westus --type azureEndpoints --endpoint-status Enabled
This script is intentionally small: add pre-checks and post-checks later, but start with a version every engineer can understand quickly during an outage. It automates the critical failover steps while still requiring human verification at each stage. Fully automated failover without human approval risks unnecessary region switches during transient failures.
State synchronization strategies
Multi-region architectures require careful state management. Databases need replication (Azure SQL geo-replication, Cosmos DB multi-region writes). Object storage needs cross-region replication (Azure Blob Storage GRS). Message queues require either regional isolation or cross-region synchronization (Azure Service Bus premium tier supports geo-replication).
Stateless services fail over easily. Stateful services require replication strategy planning during design phase, not during incident response.
RTO and RPO: Calculating realistic targets
Recovery Time Objective (RTO) measures how long systems can be down before business impact becomes unacceptable. Recovery Point Objective (RPO) measures how much data loss is acceptable.
Calculating RTO
RTO includes: detection time (how long until you know there is a problem), decision time (how long to decide failover is necessary), restore time (how long to restore from backup or switch regions), and validation time (how long to confirm restoration worked).
Example calculation:
- Detection: 5 minutes (health check interval)
- Decision: 10 minutes (incident escalation and approval)
- Restore: 45 minutes (Velero restore for 500GB cluster)
- Validation: 15 minutes (smoke tests and traffic verification)
- Total RTO: 75 minutes
If business requirements demand 30-minute RTO, your current backup-based approach will not meet SLOs. You need active-active architecture or pre-warmed standby clusters.
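The arithmetic is trivial, but scripting it means every game day records the same numbers the same way. A minimal sketch using the example values above (the 30-minute target is a hypothetical business requirement):

```shell
#!/usr/bin/env bash
# Sum the RTO phases from the example above (all values in minutes).
detection=5
decision=10
restore=45
validation=15
rto=$((detection + decision + restore + validation))
echo "Estimated RTO: ${rto} minutes"

target=30   # hypothetical business requirement
if (( rto > target )); then
  echo "RTO target of ${target} minutes NOT met"
fi
```

Replace the hardcoded phase values with measured numbers from your restore tests, not guesses.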
Calculating RPO
RPO depends on backup frequency. Hourly backups mean up to 60 minutes of data loss. If your application cannot tolerate 60 minutes of data loss, you need more frequent backups or continuous replication.
Example calculation:
- Backup frequency: Every 4 hours
- Last backup: 2 hours ago
- Regional failure occurs now
- Data loss: 2 hours (time since last backup)
If business requirements demand 15-minute RPO, 4-hour backup intervals will not meet SLOs. You need hourly backups, application-level replication, or continuous event streaming to secondary region.
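The same logic applies to RPO: compare the age of the last backup against the business target. A minimal sketch with the example values above (the 15-minute target is a hypothetical requirement):

```shell
#!/usr/bin/env bash
# Worst-case and actual data loss for the example above (minutes).
backup_interval=240    # backups every 4 hours
last_backup_age=120    # last backup ran 2 hours ago
rpo_target=15          # hypothetical business requirement

echo "Worst-case data loss: ${backup_interval} minutes"
echo "Data loss if the region fails now: ${last_backup_age} minutes"
if (( last_backup_age > rpo_target )); then
  echo "RPO target of ${rpo_target} minutes NOT met"
fi
```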
Designing for SLOs without over-engineering
Many teams over-engineer DR solutions trying to achieve zero data loss and instant failover without understanding actual business requirements. A 4-hour RTO may be acceptable for internal tooling but catastrophic for customer-facing APIs.
Practical use case:
- Internal reporting API: a 2-hour RTO and 1-hour RPO can be enough; active-passive is usually fine.
- Customer checkout API: a 15-minute RTO and near-zero RPO usually require active-active plus database replication.
The recurring theme is business impact, not architecture fashion.
Start by identifying actual business impact:
- What revenue is lost per hour of downtime?
- What customer commitments exist in SLAs?
- What regulatory requirements mandate specific recovery times?
Then design the minimum viable DR solution that meets those requirements. Do not build active-active multi-region architecture with continuous replication if business requirements allow 2-hour RTO and 1-hour RPO. That level of complexity costs significant engineering time and operational overhead.
Conversely, do not assume daily backups suffice for production systems without validating business tolerance for 24-hour data loss.
Best practices: What actually works
Test restore procedures regularly. Monthly restore testing in isolated environments catches broken procedures before actual incidents. Quarterly testing is too infrequent.
Automate backup verification. Run automated restore tests that verify backup integrity and measure restoration time. Manual testing does not scale and gets skipped under time pressure.
Document recovery procedures. Runbooks that sit in Confluence do not get updated and will be wrong during incidents. Store recovery procedures as executable scripts in version control and test them regularly.
Separate backup storage from cluster infrastructure. Do not store backups in the same region or subscription as the cluster. Regional Azure outages impact all resources in that region including backup storage.
Plan for partial failures. Not every incident requires full cluster restore. Design procedures for restoring individual namespaces, specific workloads, or single persistent volumes.
Use infrastructure as code for cluster rebuild. Terraform or Bicep definitions for cluster creation enable rapid cluster recreation when restoration is not the best recovery path.
Monitor backup jobs. Failed backups are worthless. Alert on backup failures and missing backup runs. Do not discover backup gaps during recovery.
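A backup-freshness check can be a few lines of shell. In a real pipeline the completion timestamp would come from `velero backup get` or your backup vault's API; here it is hardcoded so the logic stays obvious, and the threshold is an assumption. The script uses GNU `date -d`:

```shell
#!/usr/bin/env bash
# Hedged sketch: alert when the newest backup is older than a threshold.
newest_backup="2024-01-01T02:00:00Z"   # hypothetical completion timestamp
max_age_minutes=90                     # alert threshold (assumption)

newest_epoch=$(date -u -d "$newest_backup" +%s)
now_epoch=$(date -u +%s)
age_minutes=$(( (now_epoch - newest_epoch) / 60 ))

if (( age_minutes > max_age_minutes )); then
  echo "ALERT: newest backup is ${age_minutes} minutes old (threshold ${max_age_minutes})"
fi
```

Wire the alert line into whatever paging system you already use; a freshness check nobody sees is as useless as no check at all.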
If you are defining a monthly DR game day, include three quick checks every time:
- Can we restore one namespace end to end in a clean test cluster?
- Can we switch traffic and run smoke tests in less than our RTO?
- Can we prove data freshness is inside the RPO window?
If one answer is no, your DR posture is weaker than your dashboard suggests.
Common mistakes: Storing backups in same region as cluster (regional failure loses backups and cluster), never testing restore procedures (broken backups discovered during incidents), manual recovery procedures (humans make mistakes under pressure), and no RTO/RPO measurement (cannot tell if recovery meets business requirements).
Author note: I have participated in exactly two real disaster recovery situations involving Kubernetes clusters. In the first incident, backup restoration worked but took 3 hours longer than documented because volume snapshot region restrictions were not tested. In the second incident, backups existed but CRD restoration failed because CRD versions changed between backup and restore. Both incidents would have been prevented by regular restore testing. Do not learn this lesson during a production outage.
Conclusion
Disaster recovery for AKS requires deliberate planning, regular testing, and honest assessment of recovery capabilities. Velero provides proven backup and restore workflows. Azure native backup offers simplified management for Azure-focused organizations. Multi-region architectures enable faster recovery but increase complexity and cost.
The real test is not having a backup strategy documented in Confluence. The real test is whether you can restore your cluster from backup in under 60 minutes during an actual regional outage at 2 AM when half your team is asleep and the incident commander is asking for status updates.
Build repeatable procedures. Test them monthly. Automate everything you can. Measure actual RTO and RPO. Add one more rule: if a step cannot be executed from version-controlled scripts, it is probably not ready for production incidents.
Related reading for AKS operations maturity: AKS Cluster Upgrades Without Downtime.