Observability in AKS CNI Overlay: When Pod IPs Hide Behind Nodes
CNI Overlay solves IP exhaustion by keeping pod IPs in an internal overlay network. Excellent for resource efficiency. The problem? Your observability stack just lost visibility into half your traffic. Pod IPs get masked behind node IPs through SNAT, and debugging network issues becomes a puzzle with key pieces missing.
When a pod makes an outbound connection to an Azure service, NSG logs show the node IP as the source. Try correlating that with application logs to identify which specific pod initiated the connection, and you’ll discover your traditional tooling is useless. The pod IP exists only inside the cluster. From outside, it’s invisible.
If you run CNI Overlay in production, you need observability patterns that work with this reality: Container Insights for metadata enrichment, network flow correlation via KQL queries, SNAT port tracking, and distributed tracing.
The Root Cause: SNAT Changes Everything
In traditional Azure CNI, each pod receives a VNet-routable IP address. Network flows are straightforward to track. Correlation is direct.
CNI Overlay changes this. Pods receive IPs from an internal overlay network (typically 10.244.0.0/16) that exist only within the cluster. When a pod communicates with anything outside the cluster, the traffic undergoes Source Network Address Translation (SNAT). The pod’s internal IP gets replaced with the node’s IP before leaving the cluster.
From the perspective of Azure Network Watcher or NSG Flow Logs, all outbound traffic from pods on a node appears to originate from that single node IP. You lose pod-level granularity. This isn’t a bug. It’s how overlay networking works. But it breaks every observability pattern you’ve built for traditional CNI.
The challenge is correlation. Application logs contain pod IPs. Network logs contain node IPs. Connecting these requires additional context that standard tooling doesn’t provide. Microsoft’s documentation glosses over this. They’ll tell you Container Insights “solves observability,” but won’t mention you’re about to spend weeks building KQL queries to answer “which pod is talking to this IP?”
Container Insights: Your First Layer of Defense
Container Insights is the Azure-native solution for AKS observability. For CNI Overlay clusters, it’s mandatory if you want to maintain sanity during production incidents. It’s the only thing that maintains the pod-to-node relationship that network logs lose.
Container Insights deploys a DaemonSet (ama-logs) on every node that scrapes metrics from kubelet and collects stdout/stderr logs. Crucially, it enriches data with Kubernetes metadata: pod name, namespace, node name, labels, annotations. This enables correlation between application logs and network flows.
When you query Container Insights logs, you can join pod identity with node identity, bridging the gap between application-level events and network-level events. Without this enrichment, you’re stuck running kubectl commands during incidents while your cluster burns.
Here’s a practical Terraform configuration for enabling Container Insights on an AKS cluster with CNI Overlay:
resource "azurerm_log_analytics_workspace" "aks_monitoring" {
name = "aks-logs-${var.environment}"
location = azurerm_resource_group.aks.location
resource_group_name = azurerm_resource_group.aks.name
sku = "PerGB2018"
retention_in_days = 30
}
resource "azurerm_kubernetes_cluster" "aks" {
name = "aks-cluster-${var.environment}"
location = azurerm_resource_group.aks.location
resource_group_name = azurerm_resource_group.aks.name
dns_prefix = "aks-${var.environment}"
network_profile {
network_plugin = "azure"
network_plugin_mode = "overlay"
pod_cidr = "10.244.0.0/16"
}
oms_agent {
log_analytics_workspace_id = azurerm_log_analytics_workspace.aks_monitoring.id
}
monitor_metrics {
annotations_allowed = null
labels_allowed = null
}
}
# Optional: Data Collection Rule for cost control
resource "azurerm_monitor_data_collection_rule" "aks_container_insights" {
name = "MSCI-${azurerm_kubernetes_cluster.aks.name}"
location = azurerm_resource_group.aks.location
resource_group_name = azurerm_resource_group.aks.name
destinations {
log_analytics {
workspace_resource_id = azurerm_log_analytics_workspace.aks_monitoring.id
name = "ciworkspace"
}
}
data_flow {
streams = ["Microsoft-ContainerInsights-Group-Default"]
destinations = ["ciworkspace"]
}
data_sources {
extension {
streams = ["Microsoft-ContainerInsights-Group-Default"]
extension_name = "ContainerInsights"
name = "ContainerInsightsExtension"
}
}
}
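One gotcha: a DCR by itself does nothing until it is associated with the cluster. A minimal association sketch (the association name follows Microsoft's convention for Container Insights; adjust if yours differs):

# Attach the DCR to the cluster; without this association the rule never takes effect
resource "azurerm_monitor_data_collection_rule_association" "aks_container_insights" {
  name                    = "ContainerInsightsExtension"
  target_resource_id      = azurerm_kubernetes_cluster.aks.id
  data_collection_rule_id = azurerm_monitor_data_collection_rule.aks_container_insights.id
}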
A 100-node cluster generates roughly 50-100 GB of logs per month, and a 200-node cluster with verbose logging can push Log Analytics costs to $500-1000/month. The optional Data Collection Rule provides granular control to filter out noisy namespaces (kube-system) or low-value metrics before the bill surprises you.
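To get a 60-second collection interval and namespace filtering, the ContainerInsights extension accepts a settings document inside the DCR. A sketch of the data_sources block with those settings (the dataCollectionSettings keys follow Microsoft's Container Insights DCR schema; verify against the current docs before relying on them):

  data_sources {
    extension {
      streams        = ["Microsoft-ContainerInsights-Group-Default"]
      extension_name = "ContainerInsights"
      name           = "ContainerInsightsExtension"
      extension_json = jsonencode({
        dataCollectionSettings = {
          interval               = "1m"           # 60-second collection interval
          namespaceFilteringMode = "Exclude"      # drop noisy namespaces entirely
          namespaces             = ["kube-system"]
        }
      })
    }
  }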
Common mistakes:
- Enabling Container Insights without DCRs on large clusters, then discovering a $2000 Azure Monitor bill
- Setting retention to 365 days without calculating cost ($0.10/GB/month beyond 31 days)
- Collecting metrics at 15-second intervals when 60-second suffices for 95% of use cases
Once deployed, Container Insights starts populating several key tables in Log Analytics:
- ContainerLog: Application logs (stdout/stderr)
- Perf: Performance metrics and resource usage
- KubePodInventory: Pod metadata and lifecycle events
- KubeNodeInventory: Node metadata and capacity information
These tables are your foundation for correlation queries.
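Here's what that foundation looks like in practice: a quick join that attaches pod identity to raw log lines. This is a sketch using column names as they appear in Log Analytics (ContainerLog joined to KubePodInventory on ContainerID); if you've enabled ContainerLogV2, the pod fields are already on the log record.

// Enrich raw stdout/stderr lines with pod, namespace, and node identity
ContainerLog
| where TimeGenerated >= ago(1h)
| join kind=inner (
    KubePodInventory
    | where TimeGenerated >= ago(1h)
    | project ContainerID, PodName = Name, Namespace, NodeName = Computer, PodIP = PodIp
    | distinct ContainerID, PodName, Namespace, NodeName, PodIP
) on ContainerID
| project TimeGenerated, PodName, Namespace, NodeName, PodIP, LogEntry
| order by TimeGenerated desc
| take 100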
Network Observability: Flow Logs and NSG Logs
Container Insights gives you pod-level visibility inside the cluster. But what about traffic leaving the cluster? This is where NSG Flow Logs come into play, and where your observability problems begin in earnest.
Flow Logs capture network traffic metadata: source IP, destination IP, port, protocol, allow/deny. For CNI Overlay, Flow Logs show node IPs as the source for all outbound pod traffic. You’ve lost pod-level attribution the moment traffic leaves the cluster.
The correlation happens through timestamps and Log Analytics queries. When a pod generates outbound traffic, Container Insights logs the event with the pod’s identity and node. Flow Logs capture the same event with the node’s IP and destination. Join these datasets on node name and timestamp to reconstruct which pod initiated which connection.
Author note: This works in theory. In practice, timestamp-based correlation is fragile. Flow Logs have variable latency (5-10 minutes), Container Insights has ingestion delays, and timestamp precision issues mean you’ll occasionally join the wrong events. For critical debugging, correlation IDs in application logs are more reliable.
Here’s a practical KQL query that demonstrates this correlation for debugging outbound connectivity issues:
let podName = "my-application-pod-xyz";
let timeRange = ago(1h);
let podNode = KubePodInventory
| where TimeGenerated >= timeRange and Name == podName
| project TimeGenerated, PodName = Name, Namespace, NodeName = Computer
| take 1;
let nodeIP = KubeNodeInventory
| where TimeGenerated >= timeRange and Computer in ((podNode | project NodeName))
| extend NodeIP = tostring(parse_json(Status).addresses[0].address)
| project NodeName = Computer, NodeIP
| take 1;
let podNodeInfo = podNode | join kind=inner nodeIP on NodeName;
podNodeInfo
| join kind=inner (
AzureNetworkAnalytics_CL
| where TimeGenerated >= timeRange
| project FlowTime=TimeGenerated, SourceIP=SrcIP_s, DestinationIP=DestIP_s, DestinationPort=DestPort_d, Protocol=L7Protocol_s, FlowDirection=FlowDirection_s, Decision=FlowStatus_s
) on $left.NodeIP == $right.SourceIP
| project PodName, Namespace, NodeName, NodeIP, FlowTime, DestinationIP, DestinationPort, Protocol, FlowDirection, Decision
The limitation: all pods on the same node appear in the results because they share the node IP after SNAT. If you have 50 pods on that node, you get 50 potential sources. To narrow this down, correlate timestamps with application-level logs. Without correlation IDs, you’re guessing based on timing.
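A sketch for at least enumerating the candidates: list every pod that was scheduled on the suspect node around the time of the flow (the node name and timestamp below are placeholders).

// Which pods were on this node when the flow was captured?
let suspectNode = "aks-nodepool1-12345678-vmss000002";  // placeholder: node from the flow log
let flowTime = datetime(2024-01-15T14:30:00Z);          // placeholder: timestamp of the flow
KubePodInventory
| where TimeGenerated between ((flowTime - 10m) .. (flowTime + 10m))
| where Computer == suspectNode
| distinct Name, Namespace, PodStatus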
Common mistakes:
- Assuming timestamp correlation is accurate to the second (Flow Logs can be off by minutes)
- Not accounting for pod restarts that change pod-to-node mapping mid-incident
- Forgetting that pods on the same node share the same source IP
SNAT Tracking and Port Exhaustion
SNAT doesn’t just mask IPs. It introduces a finite resource constraint: SNAT ports. A load balancer frontend IP provides roughly 64,000 SNAT ports, pre-allocated in chunks across the nodes in the backend pool, and in CNI Overlay every pod on a node shares that node’s allocation. Under heavy load, you exhaust SNAT ports, causing intermittent connection failures that are nearly impossible to diagnose.
The real risk: SNAT port exhaustion looks exactly like network instability, DNS issues, or backend degradation. You’ll spend hours troubleshooting the wrong layer while your SNAT ports silently hit 100%.
Azure Monitor provides AllocatedSnatPorts and UsedSnatPorts metrics at the Load Balancer level, but you need to enable them explicitly. Microsoft’s quickstart documentation conveniently omits this.
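Once those metrics are flowing into Log Analytics (a diagnostic setting on the cluster's outbound load balancer routes them to the AzureMetrics table), a rough utilization check looks like the sketch below; verify the metric names and dimensions against what actually lands in your workspace.

// Approximate SNAT port utilization from Load Balancer metrics
AzureMetrics
| where TimeGenerated >= ago(1h)
| where MetricName in ("AllocatedSnatPorts", "UsedSnatPorts")
| summarize Value = avg(Average) by bin(TimeGenerated, 5m), MetricName
| evaluate pivot(MetricName, avg(Value))
| extend UtilizationPct = round(100.0 * UsedSnatPorts / AllocatedSnatPorts, 1)
| order by TimeGenerated desc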
Symptoms when approaching exhaustion:
- Connections timing out during peak traffic, but only for some pods
- Sporadic DNS resolution failures
- Error logs showing “unable to bind to port” or “connection refused”
- Retries succeeding randomly (because a SNAT port became available)
Author note: I’ve seen teams spend 6+ hours investigating “intermittent Azure Storage failures” before someone checked SNAT port metrics and found 99% utilization. The fix took 10 minutes (deploy NAT Gateway). The diagnosis took half a day because SNAT exhaustion wasn’t on their radar.
Mitigation strategies:
- Connection pooling: Reuse connections in your application code. Critical in CNI Overlay.
- HTTP keep-alive: Reuse TCP connections. A single pod making 1000 requests/second without keep-alive exhausts SNAT ports in under a minute.
- NAT Gateway: Deploy NAT Gateway for outbound connectivity. Provides 64,000 SNAT ports per public IP (multiple IPs supported). Not optional for high-throughput clusters.
NAT Gateway is the most effective solution. Load Balancer SNAT works for dev clusters, but production clusters handling thousands of outbound requests per second will exhaust ports.
Use for: Production clusters with 20+ nodes, clusters making frequent outbound API calls, workloads with poor connection reuse.
Trade-offs: $35/month plus $0.045/GB egress (cheaper than debugging SNAT exhaustion at 2 AM). Doesn’t solve observability, only port exhaustion.
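A minimal Terraform sketch of the NAT Gateway approach, assuming the cluster's nodes sit in a subnet you manage (the azurerm_subnet.aks_nodes reference is illustrative). The cluster's network_profile also needs outbound_type = "userAssignedNATGateway" so AKS routes outbound traffic through the gateway instead of the load balancer.

resource "azurerm_public_ip" "nat" {
  name                = "aks-nat-ip-${var.environment}"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_nat_gateway" "aks" {
  name                    = "aks-natgw-${var.environment}"
  location                = azurerm_resource_group.aks.location
  resource_group_name     = azurerm_resource_group.aks.name
  sku_name                = "Standard"
  idle_timeout_in_minutes = 4
}

resource "azurerm_nat_gateway_public_ip_association" "aks" {
  nat_gateway_id       = azurerm_nat_gateway.aks.id
  public_ip_address_id = azurerm_public_ip.nat.id
}

# Every outbound flow from nodes (and therefore pods) in this subnet now SNATs through the gateway
resource "azurerm_subnet_nat_gateway_association" "aks" {
  subnet_id      = azurerm_subnet.aks_nodes.id   # illustrative: the node subnet
  nat_gateway_id = azurerm_nat_gateway.aks.id
}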
Debugging at Scale: Correlation IDs and Distributed Tracing
When you operate a cluster with hundreds of pods across dozens of nodes, manual correlation becomes impossible. You need structured logging and distributed tracing.
Correlation IDs are the simplest effective pattern. Generate a unique ID at request entry point and propagate it through your entire call chain. When debugging, filter all logs by correlation ID to see the complete request flow across pods and services.
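Once a correlation ID is in every log line, the debugging query becomes a single filter. A sketch assuming JSON logs with a correlationId field (the field name is whatever your logging convention dictates):

// Full request flow for one correlation ID, across every pod that touched it
let correlationId = "3f2a9c1e-0000-0000-0000-000000000000";  // placeholder
ContainerLog
| where TimeGenerated >= ago(1h)
| extend Parsed = parse_json(LogEntry)
| where tostring(Parsed.correlationId) == correlationId
| join kind=inner (
    KubePodInventory
    | where TimeGenerated >= ago(1h)
    | project ContainerID, PodName = Name, Namespace
    | distinct ContainerID, PodName, Namespace
) on ContainerID
| project TimeGenerated, PodName, Namespace, Level = tostring(Parsed.level), Message = tostring(Parsed.message)
| order by TimeGenerated asc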
Distributed tracing models requests as traces with parent-child relationships between spans. OpenTelemetry is the current standard. Instrument your applications to emit traces to Azure Monitor Application Insights, Jaeger, or Tempo.
The value in CNI Overlay: traces maintain pod identity regardless of SNAT. A trace records which pod initiated a call, which pod received it, and operation duration. Network-level SNAT becomes irrelevant because the trace exists at the application layer.
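With workspace-based Application Insights, for example, reconstructing a trace is one query and pod identity rides along in the role instance. A sketch: the table and column names assume the workspace-based schema, and AppRoleInstance carrying the pod name depends on how your OpenTelemetry resource attributes are configured.

// One distributed trace, end to end; SNAT never enters the picture
let operationId = "<operation-id-from-your-trace>";  // placeholder
union withsource=SourceTable AppRequests, AppDependencies
| where OperationId == operationId
| project TimeGenerated, SourceTable, AppRoleName, AppRoleInstance, Name, DurationMs, Success
| order by TimeGenerated asc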
The recurring theme: correlation IDs and distributed tracing aren’t “nice to have” in CNI Overlay. They’re operational requirements. Without them, you’re correlating timestamps across data sources with different ingestion latencies and hoping you got the right pod. That’s not observability. That’s guessing.
Practical Recommendations for Production
Enable Container Insights from day one. In CNI Overlay, pod-to-node mapping is invisible from network logs. You’re flying blind without it. Budget $50-200/month for monitoring, or budget significantly more for postmortems explaining why you couldn’t identify which pod caused the outage.
Configure log retention based on compliance, not convenience. Thirty days is reasonable for most use cases. A 100-node cluster at 365-day retention can cost $500-1000/month just for storage.
Implement structured logging with correlation IDs across all applications. This is not optional. JSON logging with consistent field names makes querying possible. Include timestamp, log level, correlation ID, and message. Container Insights adds pod metadata automatically; don’t duplicate it.
Set up alerts for SNAT port usage before production. Monitor UsedSnatPorts and alert at 80% capacity. Better yet, deploy NAT Gateway proactively. The cost ($35/month) is trivial compared to a production outage from SNAT exhaustion.
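A hedged Terraform sketch of that alert. The scope has to point at the AKS-managed outbound load balancer (named kubernetes in the node resource group), so the reference below is illustrative, and the static threshold assumes a single frontend IP's 64,000 ports.

resource "azurerm_monitor_metric_alert" "snat_usage" {
  name                = "aks-snat-port-usage-${var.environment}"
  resource_group_name = azurerm_resource_group.aks.name
  scopes              = [data.azurerm_lb.aks_outbound.id]   # illustrative: the managed "kubernetes" LB
  description         = "SNAT port usage above ~80% on the cluster's outbound load balancer"
  frequency           = "PT5M"
  window_size         = "PT15M"
  severity            = 2

  criteria {
    metric_namespace = "Microsoft.Network/loadBalancers"
    metric_name      = "UsedSnatPorts"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 51200   # ~80% of 64,000 ports on one frontend IP
  }

  action {
    action_group_id = azurerm_monitor_action_group.oncall.id   # illustrative action group
  }
}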
Use distributed tracing for multi-service architectures. Overhead is low (1-5% CPU), debugging value is high. Start with critical paths. Without tracing, debugging cascade failures in CNI Overlay is nearly impossible.
Document your correlation queries. Keep these queries in version control alongside your infrastructure code. Tribal knowledge doesn’t scale.
The Hard Truth About CNI Overlay Observability
CNI Overlay makes AKS operationally better by solving IP exhaustion. But it makes observability harder by hiding pod IPs behind node IPs. This isn’t a flaw. It’s a tradeoff. The solution isn’t to avoid CNI Overlay (it’s the right choice for most clusters) but to build your observability stack with this reality in mind before production.
Container Insights provides the metadata layer that network logs lack. Flow Logs give network-level visibility even though pod IPs are masked. Distributed tracing maintains request context regardless of SNAT. Correlation IDs make manual debugging feasible when automated tools fall short.
None of this is automatic. You have to configure it deliberately, budget for it appropriately, and train your team to use it. Microsoft’s CNI Overlay documentation presents it as “simpler” than traditional CNI. What they don’t mention is that you’re trading networking simplicity for observability complexity. That’s a good trade for large clusters, but it’s still a trade.
The operational reality: I’ve debugged production incidents in CNI Overlay clusters where the initial response was “we can’t see which pod is causing this.” That’s only true if you haven’t built the correlation infrastructure upfront. With Container Insights, structured logging, and distributed tracing in place, CNI Overlay observability is no harder than traditional CNI. It’s just different, and it requires deliberate tooling investment before the first incident, not during it.
The honest assessment: CNI Overlay is the right choice for most production AKS clusters. The IP efficiency gains are significant. But if your organization isn’t prepared to invest in proper observability tooling (Container Insights, distributed tracing, structured logging with correlation IDs), you’ll regret choosing CNI Overlay the first time you debug an outbound connectivity issue at 3 AM. Plan accordingly.