Operations covers the gap between a working deployment and a system that stays working. Architecture and code decisions determine what’s possible; operational discipline determines what actually happens in production. These articles focus on the practical mechanics: what breaks in practice and how to prevent it, rather than what to do in theory.
Cluster and service upgrades appear throughout because they concentrate operational risk into a window where everything that was stable can stop being stable. The articles examine how to make upgrades reproducible: Pod Disruption Budgets that actually protect services, node pool rollout strategies, validation gates, and the difference between upgrade processes that work at 50 nodes and those that hold at 500.
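To make the point about Pod Disruption Budgets concrete, here is a minimal sketch of the arithmetic behind a budget. It is a simplified model that assumes integer values only, not the Kubernetes implementation; the `Budget` class and `allowed_disruptions` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    """Hypothetical, simplified stand-in for a budget spec (integers only)."""
    min_available: Optional[int] = None
    max_unavailable: Optional[int] = None


def allowed_disruptions(budget: Budget, expected_pods: int, healthy_pods: int) -> int:
    """Return how many voluntary evictions the budget permits right now."""
    if (budget.min_available is None) == (budget.max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if budget.min_available is not None:
        desired_healthy = budget.min_available
    else:
        desired_healthy = expected_pods - budget.max_unavailable
    # Evictions are only allowed while healthy pods exceed the required floor.
    return max(0, healthy_pods - desired_healthy)


# A budget that demands every replica stay up permits no evictions,
# so a node drain during an upgrade would wait indefinitely.
strict = allowed_disruptions(Budget(min_available=3), expected_pods=3, healthy_pods=3)
# Allowing one pod to be unavailable lets the drain proceed one pod at a time.
rolling = allowed_disruptions(Budget(max_unavailable=1), expected_pods=3, healthy_pods=3)
```

Under these assumptions, a budget that is too strict stalls the upgrade, while a budget evaluated when a pod is already unhealthy allows nothing further, which is the protective behavior a service depends on.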
Monitoring and observability content addresses what production visibility actually requires. Health checks that confirm a process is running are not the same as monitoring that detects degradation before users report it. The articles cover metric collection, structured logging with compliance requirements, and dashboards that reflect operational reality rather than just confirming green status.
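The difference between a health check and degradation detection can be sketched in a few lines. This is an illustrative example, not a recommended alerting rule: the function names, the 95th percentile choice, and the 1.5x tolerance are all assumptions made for the sketch.

```python
from statistics import quantiles


def is_alive(status_code: int) -> bool:
    """Health check: only confirms that the process answers."""
    return status_code == 200


def p95(latencies_ms):
    """95th percentile; quantiles(n=20) returns 19 cut points, the last is p95."""
    return quantiles(latencies_ms, n=20)[-1]


def is_degraded(recent_ms, baseline_ms, tolerance=1.5):
    """Degradation check: recent tail latency compared against a baseline window."""
    return p95(recent_ms) > tolerance * p95(baseline_ms)


# Hypothetical samples: the service still returns 200, but one request
# in five has become four times slower than the baseline.
baseline = [100.0] * 50
recent = [100.0] * 40 + [400.0] * 10

alive = is_alive(200)
degraded = is_degraded(recent, baseline)
```

In this sketch the service passes its health check while the latency comparison flags it, which is the situation a dashboard that only shows green status would miss.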
Incident response articles examine the gap between incident runbooks and what teams actually do under pressure. They cover GitHub Actions workflows for automated response, runbook design that works when the author isn’t available, and post-incident analysis patterns that produce durable improvements rather than temporary fixes.
Cost governance is an operational concern, not just a finance concern. Resource limits set at deployment become costs that compound over months. The articles cover FinOps tagging, node pool rightsizing, spot VM integration, and the specific architectural decisions that create cost spirals when clusters grow.
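How an over-sized resource setting turns into compounding cost can be shown with simple arithmetic. The sketch below is a rough model under stated assumptions: capacity reserved per pod is billed whether or not it is used, and the price, replica counts, and growth rate are hypothetical figures, not taken from any provider.

```python
def idle_spend(reserved_cores, used_cores, replicas, price_per_core_month,
               months, monthly_growth=0.0):
    """Total spend on reserved-but-unused CPU over a number of months.

    monthly_growth is the fractional increase in replica count per month
    (an assumed stand-in for cluster growth).
    """
    idle_per_pod = max(0.0, reserved_cores - used_cores)
    current_replicas = float(replicas)
    total = 0.0
    for _ in range(months):
        total += idle_per_pod * current_replicas * price_per_core_month
        current_replicas *= 1 + monthly_growth
    return total


# Hypothetical workload: 2 cores reserved, 0.5 used, 10 replicas,
# 20 currency units per core per month, over one year.
flat = idle_spend(2.0, 0.5, 10, 20.0, months=12)
growing = idle_spend(2.0, 0.5, 10, 20.0, months=12, monthly_growth=0.10)
```

With a fixed replica count the idle spend grows linearly; once the replica count grows each month, the same per-pod over-reservation is multiplied by every new replica, which is one way a cost spiral can start.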

