AKS at Scale: Hard-Won Lessons from 1,000+ Node Clusters
Running AKS at 1,000+ nodes exposes limits in etcd, networking, observability, and cost that vendor documentation rarely covers. This article shares operational lessons from mega-cluster deployments: where the scaling cliffs are and how to plan around them before production outages force your hand.
Storage Architecture & Stateful Workloads in AKS
Running stateful workloads on Kubernetes requires understanding PersistentVolume architecture, Azure storage trade-offs, and backup strategies. Drawing on production experience, this article covers PVC/PV patterns, Azure Disk vs Azure Files performance profiles, Velero backup configurations, and multi-cluster replication.
AKS Cluster Upgrades: Zero-Downtime Operations That Actually Work
AKS cluster upgrades involve node replacement and pod eviction, which can cause service disruption without proper controls. This article explains cordon and drain mechanics, Pod Disruption Budget configuration, and multi-node-pool rollout strategies with validation-driven automation for reliable zero-downtime upgrades.