Storage Architecture & Stateful Workloads in AKS

The Problem: Traditional Storage Models Don’t Translate to Kubernetes

Running stateful workloads in Kubernetes means more than deploying a database pod. Traditional storage models (provision a disk, format it, mount it, expect it to stay) collide with Kubernetes’ ephemeral, distributed architecture. Pods get rescheduled, scaled, and terminated. Your database shouldn’t lose data when that happens.

The core challenge: how do you attach persistent storage to ephemeral compute? On-premises infrastructure relies on SAN devices, NFS mounts, or local disks with predictable failure domains. You know which server hosts which disk. In AKS, you work with Azure storage primitives: Managed Disks, Azure Files, blob storage. These need seamless integration with Kubernetes lifecycle management. The abstractions differ, the failure modes differ, and operational patterns require rethinking.

Complexity multiplies with backup requirements, disaster recovery expectations, and multi-cluster data synchronization. Whether migrating legacy apps that expect local RAID controllers or building cloud-native data platforms from scratch, AKS storage architecture knowledge is foundational. Get it wrong: data loss, performance bottlenecks, escalating cloud bills.

PVC/PV Architecture: How Storage Binds to Pods in AKS

Kubernetes abstracts storage through two key objects: PersistentVolumes (PV) and PersistentVolumeClaims (PVC). A PV represents the actual storage resource (Azure Disk, Azure Files share). A PVC represents the request for that storage. The relationship mirrors compute abstractions: nodes are physical machines, pods are logical units consuming node resources. Similarly, PVs are physical storage, PVCs are logical requests consuming PV capacity.

The binding flow:

  1. Developer creates a PVC specifying size, access mode, and storage class
  2. Kubernetes finds or provisions a matching PV based on the storage class
  3. PVC binds to the PV, making it available to pods
  4. Pods reference the PVC in their volume mounts
  5. When the pod terminates, the PVC remains (data persists across pod lifecycles)

Access modes matter:

  • ReadWriteOnce (RWO): Single node can mount the volume (Azure Disk)
  • ReadWriteMany (RWX): Multiple nodes can mount simultaneously (Azure Files)
  • ReadOnlyMany (ROX): Multiple nodes, read-only access

Most stateful apps (databases, message queues) use RWO. Azure Disks provide better IOPS and latency than Azure Files. For shared storage (parallel batch processing, shared config directories, legacy apps expecting NFS semantics), use RWX: Azure Files or third-party CSI drivers like NFS or CephFS.
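
For the shared-storage case, a minimal RWX claim against the built-in azurefile class could look like this (the claim name is illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets             # hypothetical claim for shared files
spec:
  accessModes:
    - ReadWriteMany               # multiple nodes mount the share simultaneously
  storageClassName: azurefile
  resources:
    requests:
      storage: 50Gi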

Critical insight: PVCs decouple storage requests from storage implementation. Developers don’t need to know if they get a Premium SSD or Standard HDD. They request 100Gi of fast storage; the storage class handles provisioning. This abstraction enables platform teams to enforce policies (all production PVCs use Premium tier) without touching application manifests.
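
A minimal sketch of that flow, assuming the built-in managed-premium storage class (the data-pvc and postgres names are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc                  # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce               # single-node attach, backed by an Azure Disk
  storageClassName: managed-premium
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres                  # hypothetical workload
spec:
  containers:
    - name: postgres
      image: postgres:16
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc       # the pod consumes the claim, never the disk directly

The pod only ever references the claim; whether that claim resolves to a Premium SSD or a Standard HDD is the storage class's decision, which is exactly the decoupling described above.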

Azure Disk vs. Azure Files: Performance, Cost, Regional Constraints

Choosing between Azure Disk and Azure Files isn’t a one-size-fits-all decision. Each has distinct performance profiles, cost implications, and operational constraints.

Azure Disk (Managed Disks):

  • Performance: Lower latency, higher IOPS. Premium SSDs reach 20,000 IOPS; Ultra Disks go well beyond that.
  • Access: Single-node attachment (RWO). Pod rescheduling to another node triggers disk detach and reattach (expect brief delay).
  • Use cases: Databases (PostgreSQL, MongoDB), stateful apps requiring low-latency I/O.
  • Cost: Pay per provisioned disk size. A 1TB Premium SSD costs more than a 1TB Standard HDD, regardless of actual usage.
  • Regional constraints: Disks are zone-specific. With availability zones, pods must schedule in the same zone as the disk.

Azure Files (SMB/NFS):

  • Performance: Higher latency than disks. Premium Files tier improves performance but still trails disk I/O.
  • Access: Multi-node (RWX). Multiple pods across nodes can mount the same share.
  • Use cases: Shared logs, static assets, config files, legacy apps expecting NFS.
  • Cost: Pay per storage consumed plus transactions. Transaction costs surprise teams on high-throughput workloads.
  • Regional constraints: File shares are regional, not zonal. Better for cross-zone workloads, but still tied to a single region.

Decision criteria: Default to Azure Disk for databases and high-IOPS apps. Use Azure Files only when RWX access or legacy NFS compatibility is required. For backup targets or archival storage, consider blob storage with CSI drivers (experimental, improving).

Gotcha: disk attachment times. Pod rescheduling requires Azure to detach the disk from the old node and attach it to the new one. This takes 30 to 90 seconds. Apps that cannot tolerate this downtime need application-level replication (PostgreSQL streaming replication) or third-party solutions like Portworx.

Storage Classes & Dynamic Provisioning: Automating the Lifecycle

Static provisioning (manually creating PVs, hoping someone claims them) creates operational overhead. Storage classes enable dynamic provisioning: Kubernetes automatically creates a PV when a PVC is submitted.

AKS ships with default storage classes:

  • default: Standard HDD Azure Disk (RWO)
  • managed-premium: Premium SSD Azure Disk (RWO)
  • azurefile: Azure Files share (RWX)
  • azurefile-premium: Premium Azure Files share (RWX)
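
To check which classes (and which one is marked as default) your cluster actually ships with:

# List the storage classes available in the cluster
kubectl get storageclass

# Inspect a class's provisioner, parameters, and reclaim policy
kubectl describe storageclass managed-premium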

You can define custom storage classes to fine-tune parameters:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
  kind: Managed
  cachingMode: ReadOnly
  # Zone redundant storage (ZRS) for higher durability
  # skuName: Premium_ZRS
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Key parameters:

  • reclaimPolicy: Delete removes the disk when the PVC is deleted; Retain keeps it. For production databases, Retain prevents accidental data deletion.
  • volumeBindingMode: WaitForFirstConsumer delays PV creation until pod scheduling. Critical for zone-aware clusters (Kubernetes creates the disk in the same zone as the pod).
  • allowVolumeExpansion: Enables PVC resizing without recreation. Azure Disks support this, not all storage backends do.
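
As a quick sketch of expansion in practice, assuming a claim named db-data bound to the fast-ssd class above (the name is hypothetical):

# Request a larger size on the live PVC; the Azure Disk CSI driver grows the underlying disk
kubectl patch pvc db-data --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"512Gi"}}}}'

# Follow the resize via the PVC's events and conditions
kubectl describe pvc db-data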

Best practice: Create environment-specific storage classes (dev, staging, prod) with different skuName values. Dev clusters use Standard HDDs; prod uses Premium SSDs. Developers use near-identical manifests across environments; only the storage class name changes.
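
A minimal sketch of that pattern (class names and tiers are illustrative); only skuName and reclaimPolicy differ between environments:

# Dev cluster: cheap Standard HDD, fine to delete with the claim
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: app-data-dev
provisioner: disk.csi.azure.com
parameters:
  skuName: Standard_LRS
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
# Prod cluster: Premium SSD, retained if the claim is deleted
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: app-data-prod
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer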

Backup & Recovery: RTO/RPO Implications

Kubernetes doesn’t back up data by default. Running kubectl delete pvc without a recovery plan means permanent data loss once the Delete reclaim policy removes the underlying disk.

Velero (formerly Heptio Ark) is the de facto standard for Kubernetes backup. It snapshots PVs, captures Kubernetes object state, stores backups in object storage (Azure Blob, S3, GCS).

Example Velero backup schedule (via CLI):

# Install Velero with Azure plugin
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=aks-backups-rg,storageAccount=aksbackupssa

# Create a daily backup schedule for production namespace
velero schedule create daily-prod-backup \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --snapshot-volumes \
  --ttl 720h

RTO/RPO considerations:

  • Snapshot-based backups (Azure Disk snapshots via Velero): RPO equals backup frequency (hourly, daily). RTO equals time to provision new PV plus restore data (5 to 30 minutes).
  • Native Azure Backup for AKS: Microsoft-managed solution. Integrated with Azure Backup policies; restores are slower and less granular than with Velero.
  • Application-level backups (pg_dump, mongodump): Bypasses the Kubernetes storage layer entirely. Lower RTO with automated restore scripts, but requires custom orchestration (see the CronJob sketch after this list).
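
A hedged sketch of that application-level approach: a CronJob that runs pg_dump nightly and writes compressed dumps to a dedicated claim (the pg-conn secret, backups-pvc claim, and all names are hypothetical):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-dump-nightly
spec:
  schedule: "0 1 * * *"                  # daily at 01:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16
              command: ["/bin/sh", "-c"]
              args:
                - 'pg_dump "$PGURL" | gzip > /backup/dump-$(date +%F).sql.gz'
              env:
                - name: PGURL
                  valueFrom:
                    secretKeyRef:        # assumed secret holding the connection string
                      name: pg-conn
                      key: url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backups-pvc   # hypothetical claim used as the dump target

Shipping the dumps onward to blob storage (and pruning old ones) is the custom orchestration the bullet above alludes to; that part is deliberately left out here.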

Gotcha: Velero relies on Azure Disk snapshots. If the disk lives in Zone 1 and you restore into a cluster in Zone 2, the snapshot must be copied across zones (not instant). Test restore procedures in non-prod clusters. A backup that has never been restored is wishful thinking.
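
A restore drill can be as simple as the following (the backup name is illustrative; schedules produce backups named <schedule>-<timestamp>). Run it against a non-prod cluster and time it:

# List available backups
velero backup get

# Restore the production namespace from a specific backup and wait for completion
velero restore create prod-restore-drill \
  --from-backup daily-prod-backup-20240101020000 \
  --include-namespaces production \
  --wait

# Check the outcome and compare the elapsed time against your RTO target
velero restore describe prod-restore-drill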

Multi-AKS Replication: Patterns for Cross-Cluster Data Synchronization

Running stateful workloads across multiple AKS clusters—whether for HA, disaster recovery, or multi-region latency requirements—adds another layer of complexity.

Pattern 1: Application-Level Replication

Let the application handle replication. PostgreSQL streaming replication, MongoDB replica sets, and Kafka replication understand their data models and replicate efficiently.

  • Pros: No Kubernetes-specific dependencies. Works identically in VMs, on-premises, or managed services.
  • Cons: You manage replication lag, split-brain scenarios, and failover logic.

Pattern 2: Storage-Level Replication

Use Azure NetApp Files or third-party solutions like Portworx for block- or file-level replication.

  • Pros: Transparent to applications. Works with legacy apps lacking native replication.
  • Cons: Expensive. NetApp Files Premium tier and Portworx licensing (scales with node count) add significant cost.

Pattern 3: Backup-Based DR

Take Velero backups from the primary cluster and restore to the secondary on failover (a minimal sketch follows the pros and cons).

  • Pros: Cost-effective (blob storage only).
  • Cons: RPO equals last backup interval (hours, not seconds). RTO includes restore time (minutes to hours).
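
One way to wire this up, assuming Velero runs in both clusters and points at the same blob container: the secondary treats the backup location as read-only until failover, then restores on demand (as in the drill shown earlier).

# On the secondary cluster: mark the shared backup location read-only
kubectl patch backupstoragelocation default \
  --namespace velero \
  --type merge \
  --patch '{"spec":{"accessMode":"ReadOnly"}}'

# On failover: restore the workloads from the most recent backup
velero restore create dr-failover --from-backup <latest-prod-backup>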

Real-world example: Multi-region PostgreSQL deployment pattern I’ve encountered:

  • Primary AKS cluster (West Europe): Production traffic
  • Secondary AKS cluster (North Europe): Read replicas via PostgreSQL streaming replication
  • Velero backups: Azure Blob in third region (East US) for regulatory compliance

This provides sub-second RPO within Europe (streaming replication), hourly RPO globally (Velero), and a 5-minute RTO for regional failover (promoting the read replica).

Operational reality: Multi-cluster data replication is complex. Avoid it by using managed services (Azure Database for PostgreSQL with geo-replication) if possible. Running databases in AKS requires investment in automation, monitoring, and runbooks. Your 3 AM self will appreciate this decision.

Final Thoughts

Storage in AKS represents a set of trade-offs requiring deliberate navigation. Azure Disk provides performance with zone-locking. Azure Files offers flexibility with latency penalties. Velero enables backups but demands operational discipline and testing. Multi-cluster replication delivers resilience with non-linear operational complexity.

Pragmatic approach: Start with managed storage classes and Velero. Use Azure Disk for databases and high-IOPS workloads. Use Azure Files only when RWX access or legacy NFS compatibility is genuinely required. Test restore procedures quarterly, not during outages. Schedule fire drills: delete a namespace, restore from backup. Measure actual RTO/RPO instead of assuming SLA compliance.

When stateful workload requirements outgrow AKS storage primitives (sub-second cross-region replication, disk attachment latency breaking your app, spiraling storage costs), don’t force solutions. Consider Azure managed services (Azure Database for PostgreSQL, Cosmos DB) or specialized data platforms (Confluent Cloud for Kafka, MongoDB Atlas). Sometimes the best Kubernetes storage strategy is avoiding stateful workloads in Kubernetes.

Kubernetes excels at stateless orchestration. For stateful workloads, it’s capable, but it demands understanding the plumbing, accepting the trade-offs, and building operational muscle around backups, monitoring, and runbooks. Treat storage as infrastructure that will fail, not infrastructure that just works. Plan accordingly.
