Hybrid AKS: Bridging Cloud and On-Prem with Azure Arc

The Problem: Cloud and On-Prem as Operational Silos

Most organizations don’t run purely in the cloud. Legacy systems, compliance requirements, data gravity, and latency concerns keep critical workloads on-premises indefinitely. Running AKS in Azure alongside on-prem Kubernetes clusters multiplies management overhead: two separate control planes to patch, two policy frameworks to keep in sync, two identity configurations to audit, and two observability stacks generating alerts nobody wants to correlate manually.

The temptation is to build custom tooling that bridges the gap. That usually ends as a fragile script collection that only one person on the team understands. Azure Arc changes the equation: it extends Azure’s management plane to any Kubernetes cluster without migrating workloads.

This article covers the practical pieces: network connectivity options, Azure Arc for unified management, DNS resolution across environment boundaries, policy enforcement, and identity federation.

Connectivity Models: Getting Traffic Between Environments

Before you can manage hybrid Kubernetes deployments, you need reliable network connectivity. Three primary patterns exist, each with distinct trade-offs.

ExpressRoute: Dedicated Private Connectivity

ExpressRoute provides a dedicated, private connection between on-premises and Azure, bypassing the public internet entirely. Latency is predictable, throughput is consistent, and the connection doesn’t compete with general internet traffic.

The operational reality: provisioning takes weeks, requires coordination with a connectivity provider, and demands solid Border Gateway Protocol (BGP) knowledge from your network team. Cost is significant. For production workloads with compliance requirements or sustained high-bandwidth data transfer, those trade-offs are usually acceptable. For a dev/test environment or proof-of-concept, they aren’t.

Site-to-Site VPN: Cost-Effective Alternative

Site-to-Site (S2S) VPN creates encrypted tunnels over the public internet. Setup takes hours rather than weeks, cost is a fraction of ExpressRoute, and it works without engaging a connectivity provider.

The catch is performance variability. Throughput degrades under load, latency spikes during congestion periods, and encryption overhead adds up. For proof-of-concept environments, dev/test workloads, or bursty low-volume traffic, S2S VPN is the pragmatic choice. For production databases replicating continuously across the boundary, it usually isn’t enough.

VNet Peering: Cloud-Only Hybrid

VNet peering connects Azure VNets across regions or subscription boundaries. If both sides run in Azure and you’re drawing a line between subscriptions rather than between cloud and datacenter, this is the simplest option: no gateways, no BGP, no provider contracts.

It doesn’t solve the on-prem connectivity problem. Peering only works between Azure VNets.
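For the Azure-only case, peering is a few lines of Terraform. A minimal sketch, assuming two existing VNet resources (the `hub` and `spoke` names and the resource group reference are placeholders, not part of the configuration later in this article); peering must be created in both directions before traffic flows:

```hcl
# Peering is directional: create it on both sides.
# "hub" and "spoke" are placeholder azurerm_virtual_network resources.
resource "azurerm_virtual_network_peering" "hub_to_spoke" {
  name                      = "hub-to-spoke"
  resource_group_name       = azurerm_resource_group.hybrid.name
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.spoke.id
  allow_forwarded_traffic   = true
}

resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                      = "spoke-to-hub"
  resource_group_name       = azurerm_resource_group.hybrid.name
  virtual_network_name      = azurerm_virtual_network.spoke.name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true
}
```

The address spaces of peered VNets must not overlap, which is worth planning for before the first peering, not after.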

Infrastructure as Code: ExpressRoute + AKS

Whatever connectivity model you choose, infrastructure repeatability matters from day one. Deploying gateways, subnets, route tables, and AKS clusters manually works once and creates problems on the second environment. The Terraform configuration below covers the full stack: ExpressRoute gateway, private DNS zone, and AKS with Azure CNI and private cluster enabled.

# Terraform configuration for ExpressRoute + AKS hybrid connectivity
# Variables and provider configuration assumed

resource "azurerm_resource_group" "hybrid" {
  name     = "hybrid-aks-rg"
  location = "westeurope"
}

# Virtual Network for AKS and hybrid connectivity
resource "azurerm_virtual_network" "aks_vnet" {
  name                = "aks-vnet"
  address_space       = ["10.1.0.0/16"]
  location            = azurerm_resource_group.hybrid.location
  resource_group_name = azurerm_resource_group.hybrid.name
}

# Subnet for AKS nodes
resource "azurerm_subnet" "aks_nodes" {
  name                 = "aks-nodes-subnet"
  resource_group_name  = azurerm_resource_group.hybrid.name
  virtual_network_name = azurerm_virtual_network.aks_vnet.name
  address_prefixes     = ["10.1.1.0/24"]
}

# Gateway subnet for ExpressRoute
resource "azurerm_subnet" "gateway" {
  name                 = "GatewaySubnet"  # Name must be exactly this
  resource_group_name  = azurerm_resource_group.hybrid.name
  virtual_network_name = azurerm_virtual_network.aks_vnet.name
  address_prefixes     = ["10.1.255.0/27"]
}

# Public IP for ExpressRoute Gateway
resource "azurerm_public_ip" "er_gateway_ip" {
  name                = "er-gateway-pip"
  location            = azurerm_resource_group.hybrid.location
  resource_group_name = azurerm_resource_group.hybrid.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

# ExpressRoute Gateway
resource "azurerm_virtual_network_gateway" "er_gateway" {
  name                = "er-gateway"
  location            = azurerm_resource_group.hybrid.location
  resource_group_name = azurerm_resource_group.hybrid.name
  type                = "ExpressRoute"
  sku                 = "Standard"

  ip_configuration {
    name                          = "gateway-ip-config"
    public_ip_address_id          = azurerm_public_ip.er_gateway_ip.id
    private_ip_address_allocation = "Dynamic"
    subnet_id                     = azurerm_subnet.gateway.id
  }
}

# Connection from ExpressRoute Gateway to the pre-provisioned circuit
# Configure var.expressroute_circuit_id with your existing circuit resource ID:
# var.expressroute_circuit_id = "/subscriptions/.../expressRouteCircuits/..."
resource "azurerm_virtual_network_gateway_connection" "onprem" {
  name                       = "er-onprem-connection"
  location                   = azurerm_resource_group.hybrid.location
  resource_group_name        = azurerm_resource_group.hybrid.name
  type                       = "ExpressRoute"
  virtual_network_gateway_id = azurerm_virtual_network_gateway.er_gateway.id
  express_route_circuit_id   = var.expressroute_circuit_id
}

# Private DNS Zone for internal services
resource "azurerm_private_dns_zone" "internal" {
  name                = "internal.azure.local"
  resource_group_name = azurerm_resource_group.hybrid.name
}

# Link DNS zone to VNet
resource "azurerm_private_dns_zone_virtual_network_link" "aks" {
  name                  = "aks-vnet-link"
  resource_group_name   = azurerm_resource_group.hybrid.name
  private_dns_zone_name = azurerm_private_dns_zone.internal.name
  virtual_network_id    = azurerm_virtual_network.aks_vnet.id
  registration_enabled  = true
}

# AKS cluster with ExpressRoute connectivity
resource "azurerm_kubernetes_cluster" "aks" {
  name                = "hybrid-aks"
  location            = azurerm_resource_group.hybrid.location
  resource_group_name = azurerm_resource_group.hybrid.name
  dns_prefix          = "hybrid-aks"
  kubernetes_version  = "1.31"

  default_node_pool {
    name                 = "system"
    node_count           = 3
    vm_size              = "Standard_D4s_v5"
    vnet_subnet_id       = azurerm_subnet.aks_nodes.id
    auto_scaling_enabled = true
    min_count            = 3
    max_count            = 10
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin     = "azure"
    network_policy     = "calico"
    service_cidr       = "10.2.0.0/16"
    dns_service_ip     = "10.2.0.10"
    load_balancer_sku  = "standard"
  }

  private_cluster_enabled = true

  depends_on = [
    azurerm_virtual_network_gateway.er_gateway
  ]
}

# Route table for on-prem traffic via ExpressRoute
resource "azurerm_route_table" "onprem_routes" {
  name                = "onprem-routes"
  location            = azurerm_resource_group.hybrid.location
  resource_group_name = azurerm_resource_group.hybrid.name

  route {
    name                   = "to-onprem-datacenter"
    address_prefix         = "10.0.0.0/8"  # On-prem network range
    next_hop_type          = "VirtualNetworkGateway"
  }
}

# Associate route table with AKS subnet
resource "azurerm_subnet_route_table_association" "aks_routes" {
  subnet_id      = azurerm_subnet.aks_nodes.id
  route_table_id = azurerm_route_table.onprem_routes.id
}

This Terraform configuration establishes the foundation for hybrid connectivity: ExpressRoute gateway, private DNS, and AKS with network policies. Customize address ranges, SKUs, and routing rules for your environment.

Azure Arc: Unified Kubernetes Management

Azure Arc extends Azure management to any Kubernetes cluster: on-prem, edge locations, or other clouds. It registers external clusters as Azure resources, enabling centralized management without forcing workload migration.

What Arc Provides

Arc-enabled Kubernetes clusters gain:

  • Unified inventory: View all clusters in Azure Resource Manager
  • Policy enforcement: Azure Policy extends to Arc clusters
  • GitOps deployment: Flux configurations apply consistently
  • Monitoring integration: Azure Monitor collects metrics and logs
  • RBAC integration: Azure AD for cluster authentication

Arc doesn’t move workloads to Azure. It extends Azure’s control plane to wherever your clusters run.

Onboarding an On-Prem Cluster

Connecting an existing Kubernetes cluster to Arc requires cluster admin access and network connectivity to Azure endpoints.

#!/bin/bash
# Azure Arc onboarding script
# Requires: Azure CLI, kubectl, cluster admin kubeconfig

RESOURCE_GROUP="hybrid-infra-rg"
CLUSTER_NAME="onprem-k8s-01"
LOCATION="westeurope"

# Login and set subscription
az login
az account set --subscription "your-subscription-id"

# Create resource group if needed
az group create --name $RESOURCE_GROUP --location $LOCATION

# Register Arc providers
az provider register --namespace Microsoft.Kubernetes
az provider register --namespace Microsoft.KubernetesConfiguration
az provider register --namespace Microsoft.ExtendedLocation

# Wait for registration (can take several minutes)
az provider show -n Microsoft.Kubernetes -o table
az provider show -n Microsoft.KubernetesConfiguration -o table

# Install Arc extensions
az extension add --name connectedk8s
az extension add --name k8s-configuration

# Connect cluster to Arc
az connectedk8s connect \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION \
  --tags environment=production datacenter=onprem

# Verify connection
az connectedk8s show \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --query "connectivityStatus"

Once connected, the cluster appears in the Azure portal alongside AKS clusters. Management operations (viewing workloads, applying policies, deploying via GitOps) work identically.

Policy Enforcement Across Environments

Azure Policy for Kubernetes applies consistent governance rules across AKS and Arc clusters. Define policies once, enforce everywhere.

Example policy: require resource limits and requests on all pods. Under the hood, the Azure Policy add-on installs OPA Gatekeeper and manages constraints like the following on each cluster:

# pod-resource-limits-policy.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: require-pod-resource-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces:
      - production
      - staging
  parameters:
    limits:
      - cpu
      - memory
    requests:
      - cpu
      - memory

Apply this policy through Azure Policy, and it enforces on both AKS and Arc-connected on-prem clusters. No duplicated configuration, no drift between environments.
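Assignment happens once, at a scope that covers both cluster types. A sketch with placeholder IDs; the built-in definition ID for the resource-limits policy is elided here and can be looked up with `az policy definition list`:

```shell
# Assign a Kubernetes policy definition at subscription scope so it
# applies to both AKS and Arc-connected clusters in that subscription.
# <policy-definition-id> and <sub-id> are placeholders.
az policy assignment create \
  --name "require-pod-resource-limits" \
  --policy "<policy-definition-id>" \
  --scope "/subscriptions/<sub-id>"
```

Scoping the assignment to a management group instead of a subscription extends the same policy across every subscription underneath it.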

GitOps: Single Source of Truth

Arc supports Flux-based GitOps configurations. Define cluster state in Git, and Arc ensures compliance across environments. The az k8s-configuration flux create command links your Git repository to both AKS and Arc clusters. Changes sync automatically. Configuration drift gets corrected within minutes.
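A sketch of the Arc side of that command, using the cluster from the onboarding script above; the repository URL, configuration name, and kustomization path are placeholders:

```shell
# Attach a Flux GitOps configuration to an Arc-connected cluster.
# Repository URL and paths are illustrative placeholders.
az k8s-configuration flux create \
  --resource-group hybrid-infra-rg \
  --cluster-name onprem-k8s-01 \
  --cluster-type connectedClusters \
  --name cluster-config \
  --namespace flux-system \
  --url https://github.com/example/cluster-config \
  --branch main \
  --kustomization name=apps path=./apps prune=true
```

For an AKS cluster the same command applies with --cluster-type managedClusters, which is what keeps the two environments on one Git source of truth.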

DNS and Service Discovery: Hybrid Resolution Without Complexity

Hybrid deployments need service discovery across boundaries. Pods in AKS must resolve on-prem services, and vice versa.

Approach 1: Azure Private DNS with Conditional Forwarding

Create a Private DNS zone in Azure and link it to your VNet; AKS clusters inherit the VNet DNS configuration automatically. For the reverse direction, configure on-prem DNS servers to forward queries for Azure domains to a resolver inside the VNet, such as an Azure DNS Private Resolver inbound endpoint or a forwarder VM, which in turn queries Azure's internal resolver at 168.63.129.16. That address is a virtual IP reachable only from within Azure, so on-prem servers cannot target it directly over ExpressRoute or VPN. On-prem services get custom DNS entries pointing to ExpressRoute or VPN endpoints.
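On the on-prem side, this is a conditional forwarder stanza. A BIND sketch; 10.1.2.4 is a placeholder for the IP of a Private Resolver inbound endpoint (or forwarder VM) inside the VNet, since 168.63.129.16 itself is reachable only from within Azure:

```
// named.conf fragment on an on-prem BIND server.
// 10.1.2.4 is a placeholder for a resolver inside the Azure VNet.
zone "internal.azure.local" {
    type forward;
    forward only;
    forwarders { 10.1.2.4; };
};
```

The zone name matches the Private DNS zone created in the Terraform configuration earlier; Windows DNS servers express the same thing as a conditional forwarder rule.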

Approach 2: CoreDNS Custom Forwarding

For cluster-level control, patch the CoreDNS ConfigMap to forward specific domain queries to on-prem DNS servers. This is the right approach when on-prem services use a domain suffix that doesn’t overlap with Azure Private DNS zones, or when you need different forwarding behavior per cluster.

# CoreDNS custom configmap - forward internal corporate domain to on-prem resolver
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  corp.server: |
    corp.example.com:53 {
        errors
        cache 30
        forward . 10.0.0.53 {
            prefer_udp
        }
    }

Apply with kubectl apply -f coredns-custom.yaml. AKS detects the coredns-custom ConfigMap automatically. For the reverse path, configure on-prem DNS to forward *.privatelink.blob.core.windows.net and similar zones to a Private Resolver inbound endpoint or forwarder VM inside the VNet, since Azure's internal resolver at 168.63.129.16 is not reachable from outside Azure.

Author note: DNS is usually where hybrid setups produce the most subtle and hardest-to-debug failures. A pod resolves a name correctly in testing, then silently times out in production because the CoreDNS cache held a stale entry across a VPN reconnect. Keep TTLs short for cross-boundary records and verify the full resolver chain with nslookup from inside the cluster, not just from a workstation.
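Checking resolution from inside the cluster is a one-liner; a throwaway busybox pod exercises the full CoreDNS forward path rather than the workstation's resolver (image tag and domain are illustrative):

```shell
# Run a disposable pod and resolve an on-prem name through CoreDNS.
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup corp.example.com
```

If this resolves but a workstation lookup fails (or vice versa), the problem is in the forwarder chain, not the record itself.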

Key principle: Avoid split-horizon DNS designs where the same name resolves differently depending on source location. Use Azure Private DNS as the primary zone authority where possible, and fall back to conditional forwarding only for domains you don’t control.

Identity Across Boundaries: Federation Without Duplication

Hybrid deployments shouldn’t duplicate identity systems. Azure AD (now Microsoft Entra ID) integration extends to Arc clusters, providing centralized authentication and significantly reducing the number of credential systems to maintain.

Service Principals for Cross-Environment Access

Applications running on-prem that need access to Azure services (Key Vault, storage accounts, managed databases) can use Azure AD service principals with certificate-based authentication. Create a service principal, assign the appropriate role, and mount the certificate as a Kubernetes secret in the on-prem pod.

# Create a service principal with a certificate credential
# and assign Key Vault access
az ad sp create-for-rbac \
  --name "onprem-app-sp" \
  --create-cert \
  --role "Key Vault Secrets User" \
  --scopes "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault>"

This works reliably, but carries ongoing maintenance: certificate rotation, secret distribution across on-prem clusters, and audit trails that span two systems. For new workloads, federated credentials are worth the initial setup complexity.
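Distributing the certificate to an on-prem cluster is a standard secret mount. With `--create-cert`, the CLI writes a combined certificate-and-key PEM locally (the path appears in the command output); the secret and namespace names below are illustrative:

```shell
# Store the service principal certificate as a Kubernetes secret;
# the pod spec then mounts it as a volume. Paths are illustrative.
kubectl create secret generic onprem-app-sp-cert \
  --namespace production \
  --from-file=sp-cert.pem=./sp-cert.pem
```

Every such secret is one more artifact to rotate and audit, which is exactly the overhead federated credentials remove.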

Federated Credentials for Workload Identity

Workload identity federation allows on-prem Kubernetes service accounts to authenticate as Azure AD identities without long-lived secrets. The on-prem cluster’s OIDC issuer endpoint issues tokens for service accounts; Azure AD trusts that issuer and exchanges the projected token for an Azure AD access token.

# Register the on-prem cluster's OIDC issuer with an Azure AD app registration
az ad app federated-credential create \
  --id <app-registration-id> \
  --parameters '{
    "name": "onprem-k8s-workload",
    "issuer": "https://<your-onprem-oidc-issuer>",
    "subject": "system:serviceaccount:production:my-app",
    "audiences": ["api://AzureADTokenExchange"]
  }'

The on-prem cluster needs to expose its OIDC discovery document at a publicly reachable (or Azure-reachable) endpoint. That’s the step that most commonly blocks initial setup. Verify the discovery document is accessible before spending time debugging token exchange errors.
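On the cluster side, the pod needs a token projected with the matching audience. A minimal sketch; the service account name matches the subject in the federated credential above, and the image is a placeholder:

```yaml
# Pod fragment: project a service account token with the
# AzureADTokenExchange audience. Application code (or the Azure SDK)
# exchanges this token for an Azure AD access token.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: production
spec:
  serviceAccountName: my-app
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      volumeMounts:
        - name: azure-token
          mountPath: /var/run/secrets/azure
          readOnly: true
  volumes:
    - name: azure-token
      projected:
        sources:
          - serviceAccountToken:
              audience: api://AzureADTokenExchange
              expirationSeconds: 3600
              path: token
```

The kubelet refreshes the projected token automatically before it expires, so nothing long-lived ever lands on disk.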

Author note: Migrating workloads from service principal secrets to federated credentials removes certificate rotation as a recurring task entirely. Secret sprawl across on-prem clusters was one of the more uncomfortable findings in the security reviews I’ve participated in. Federated credentials make the problem structurally impossible rather than just less likely.

Operational Consistency: Making Hybrid Work Long-Term

Hybrid deployments fail when operational practices diverge between environments. Consistency requires deliberate effort.

Monitoring and Observability

Use Azure Monitor Container Insights for both AKS and Arc clusters. On AKS, enable the monitoring add-on (az aks enable-addons --addons monitoring); on Arc-connected clusters, install the extension explicitly:

az k8s-extension create \
  --name azuremonitor-containers \
  --cluster-type connectedClusters \
  --cluster-name onprem-k8s-01 \
  --resource-group hybrid-infra-rg \
  --extension-type Microsoft.AzureMonitor.Containers \
  --configuration-settings logAnalyticsWorkspaceResourceID=<workspace-resource-id>

Metrics, logs, and cluster health flow to a single Log Analytics workspace regardless of where the cluster runs. A simple Kusto Query Language (KQL) query surfaces pod restart counts across all environments at once:

KubePodInventory
| where TimeGenerated > ago(24h)
| summarize Restarts=sum(ContainerRestartCount) by ClusterName, Namespace
| order by Restarts desc

Having AKS and on-prem clusters reporting to the same workspace makes cross-environment incident correlation significantly faster.

Update Management

Azure Arc auto-upgrades its own agents by default, and AKS offers auto-upgrade channels with planned maintenance windows. Kubernetes version upgrades on self-managed on-prem clusters remain your responsibility, but Arc surfaces every cluster's version in the same Azure portal used for AKS fleet management, narrowing the gap between AKS (where upgrades are automated and well-understood) and on-prem clusters (where upgrades have historically been postponed due to complexity).

This doesn't eliminate the need for upgrade validation in staging environments. But the shared visibility removes the quiet drift that leads to on-prem clusters running three minor versions behind production AKS.

Cost and Resource Tracking

Arc-enabled clusters report resource utilization to Azure. Tag clusters consistently with environment, cost-center, and region labels using az connectedk8s update. Use Azure Cost Management to track total Kubernetes spend across cloud and on-prem, enabling accurate chargeback and budget planning.
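A sketch of the tagging step, assuming az connectedk8s update accepts --tags as described above (tag keys and values are illustrative; az resource tag works as a generic fallback for any Azure resource):

```shell
# Apply consistent cost-tracking tags to an Arc-connected cluster.
# Tag names and values are illustrative conventions, not requirements.
az connectedk8s update \
  --name onprem-k8s-01 \
  --resource-group hybrid-infra-rg \
  --tags environment=production cost-center=platform region=eu-west
```

Tags applied this way flow into Azure Cost Management filters, so the same chargeback report covers AKS and on-prem clusters.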

Key Takeaways

Hybrid AKS deployments succeed when you:

  1. Choose the right connectivity: ExpressRoute for production, S2S VPN for dev/test, VNet peering for Azure-only scenarios
  2. Use Azure Arc for unified management: Extend Azure’s control plane rather than building parallel tooling
  3. Enforce policies consistently: Azure Policy + GitOps eliminate configuration drift
  4. Simplify DNS: Azure Private DNS with conditional forwarding avoids complexity
  5. Federate identity: Azure AD integration reduces secret sprawl and management overhead
  6. Monitor everything in one place: Azure Monitor provides visibility across environments

Hybrid infrastructure doesn’t have to mean duplicated effort. Arc, proper networking, and consistent operational practices make multi-environment Kubernetes manageable.

The goal isn’t cloud purity. It’s operational efficiency wherever your workloads run.
