Green Dashboard, Dead Application
Your application just crashed in production. Azure App Service kept routing traffic to the failing instance for ninety seconds. Users saw timeouts. Your monitoring dashboard stayed green because the web server responded with HTTP 200 while the database connection pool was exhausted.
I’ve watched this exact scenario play out at three different organizations in the past year. Each time, the post-mortem revealed the same root cause: health checks that verified the process was breathing without checking whether it could actually do its job. ISO/IEC 27001 Control A.17.2.1 exists precisely for this reason—availability is a security control, not an operational afterthought.
Why availability is a security control
ISO 27001 treats availability as a core pillar of information security alongside confidentiality and integrity. Control A.17.2.1 explicitly requires organizations to implement “information processing facilities” with sufficient redundancy to meet availability requirements. Redundancy without health awareness, though? That’s a dangerous illusion of resilience.
Control A.12.1.4 mandates environmental isolation to prevent development instability from affecting production. Health checks enforce this separation. An unhealthy instance—whether due to misconfiguration, dependency failure, or environmental contamination—should never receive production traffic. Period.
Then there’s A.12.6.1, which requires timely identification of technical vulnerabilities. A failed health check signals exactly that: a vulnerability in real-time. Unreachable Key Vault? Expired certificate? Overloaded message queue? These are security vulnerabilities that proper health checks expose before they cascade into complete system failure.
Teams treating health checks as operational monitoring miss the security implications entirely. Availability failures create security incidents. Degraded systems leak information through error messages, bypass authentication under load, or fail to log security events. Catching degradation early prevents these failures from becoming breaches.
The fatal pattern: “Is the website responding?”
Most applications I encounter implement health monitoring at the infrastructure layer only. Load balancers ping an endpoint, the endpoint returns HTTP 200 if the web server process is running, and everyone assumes the system works. This approach fails catastrophically because it conflates process health with application health.
Here’s the code I see everywhere:
// Program.cs - The "is it alive?" anti-pattern
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();
var app = builder.Build();
app.MapControllers();
// "Health check" that checks nothing meaningful
app.MapGet("/health", () => Results.Ok("Healthy"));
app.Run();
This endpoint happily reports “healthy” when:
- The database connection pool is exhausted
- Azure Key Vault is unreachable (configuration secrets unavailable)
- The Redis cache is down (session state lost)
- Service Bus queue is full (messages dropped)
- Application Insights ingestion is failing (no telemetry)
- Certificate validation is failing (external API calls rejected)
Load balancers keep routing traffic to instances reporting “healthy” while the application cannot serve a single request. Users experience timeouts and errors. Your monitoring shows 100% uptime. Meanwhile, your organization violates ISO 27001 availability requirements while believing the system is compliant.
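A probe that actually exercises a dependency catches every failure in that list. As a minimal sketch (the "Default" connection-string name, the query, and the timeout are assumptions, and the health check middleware covered later is the better tool), even a hand-rolled endpoint improves on the anti-pattern by opening a pooled connection and running a trivial query:

```csharp
// Illustrative only: a /health endpoint that exercises the database
// instead of merely proving the web server process is alive.
// Assumes Microsoft.Data.SqlClient and a "Default" connection string.
using Microsoft.Data.SqlClient;

app.MapGet("/health", async (IConfiguration config) =>
{
    try
    {
        await using var connection = new SqlConnection(config.GetConnectionString("Default"));
        await connection.OpenAsync();   // fails when the pool or server is unavailable
        await using var command = new SqlCommand("SELECT 1", connection);
        command.CommandTimeout = 2;     // seconds - keep the probe cheap
        await command.ExecuteScalarAsync();
        return Results.Ok("Healthy");
    }
    catch
    {
        // Signal load balancers to stop routing traffic to this instance
        return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
    }
});
```

An exhausted connection pool or unreachable database now surfaces as a 503 within seconds instead of a green dashboard.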
The information disclosure vulnerability
There’s something worse than inadequate health checks: health checks that leak configuration details.
// DO NOT DO THIS - Security vulnerability
app.MapGet("/health", async (ApplicationDbContext db, IConfiguration config) =>
{
    try
    {
        await db.Database.CanConnectAsync();
        return Results.Ok(new
        {
            Status = "Healthy",
            Database = config["ConnectionStrings:Default"], // Exposed!
            KeyVault = config["Azure:KeyVault:Uri"],        // Exposed!
            Version = Assembly.GetExecutingAssembly().GetName().Version,
            Environment = builder.Environment.EnvironmentName,
            MachineName = Environment.MachineName           // Internal infrastructure details
        });
    }
    catch (Exception ex)
    {
        return Results.Ok(new
        {
            Status = "Unhealthy",
            Error = ex.ToString() // Stack trace exposure
        });
    }
});
This violates Control A.9.4.5 by exposing internal configuration URIs and infrastructure topology. Unauthenticated health endpoints should return minimal information—detailed diagnostics belong behind authentication.
The correct implementation: Comprehensive health checks
ASP.NET Core’s health check middleware provides everything you need: dependency validation, startup verification, and runtime degradation detection. Done right, health monitoring transforms from a checkbox exercise into an actual security control.
Basic health check registration
Start with the infrastructure:
// Program.cs
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy("Application process is running"));

var app = builder.Build();

// Liveness endpoint - "Is the process alive?"
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = registration => registration.Name == "self",
    AllowCachingResponses = false
});

// Readiness endpoint - "Can the application serve requests?"
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = _ => true,
    AllowCachingResponses = false
});

app.Run();
Two endpoints, two different purposes:
- /health/live answers "Is the process running?" Orchestrators use this to restart crashed instances.
- /health/ready answers "Can the application serve requests?" Load balancers use this to route traffic.
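The same split maps directly onto orchestrator probe configuration. As a hedged sketch for Kubernetes (the port, periods, and thresholds are illustrative assumptions, not recommendations), the two endpoints wire up like this:

```yaml
# Illustrative Kubernetes probe wiring - adjust port and thresholds to your environment
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3   # restart the container after roughly 30s of failures
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2   # stop routing traffic quickly on dependency failure
```

A failing readiness probe removes the pod from service endpoints without restarting it; only a failing liveness probe triggers a restart.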
Dependency-specific health checks
Now add checks for actual dependencies:
using Microsoft.Extensions.Diagnostics.HealthChecks;

builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    .AddDbContextCheck<ApplicationDbContext>("database",
        failureStatus: HealthStatus.Unhealthy,
        tags: ["db", "ready"])
    .AddSqlServer("sqlserver", options =>
    {
        options.ConnectionString = builder.Configuration["ConnectionStrings:SqlServer"]!;
        options.Timeout = 100;
    })
    .AddRedis("redis", options =>
    {
        options.ConnectionString = builder.Configuration["ConnectionStrings:Redis"]!;
    })
    .AddServiceBusQueue("servicebus-orders", options =>
    {
        options.FullyQualifiedNamespace = builder.Configuration["Azure:ServiceBus:Namespace"]!;
        options.QueueName = "orders";
    });
These checks use health check packages from my open-source collection. The NetEvolve.HealthChecks.* packages provide a configuration-first approach: you can configure health checks via code as shown above, or through appsettings.json:
{
  "HealthChecks": {
    "SqlServer": {
      "sqlserver": {
        "ConnectionString": "Server=tcp:localhost,1433;Database=master;...",
        "Timeout": 100
      }
    },
    "Redis": {
      "redis": {
        "ConnectionString": "localhost:6379"
      }
    }
  }
}
The tags parameter on the DbContext check matters here. Tags control which checks run for which endpoint. The self check has no tags, so the liveness endpoint selects it by name and stays independent of external dependencies. Dependency checks tagged ready are meant for the readiness endpoint; once tags are in place, tighten its predicate from _ => true to a filter on the ready tag.
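With tags in place, the readiness mapping becomes a one-line filter (the same predicate reappears in the secure endpoint section later):

```csharp
// Readiness now runs only checks explicitly tagged "ready";
// the untagged "self" check remains exclusive to /health/live.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("ready"),
    AllowCachingResponses = false
});
```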
Extended Azure health checks
For Azure-specific services, the NetEvolve.HealthChecks.Azure.* packages cover most scenarios out of the box:
builder.Services.AddHealthChecks()
    .AddApplicationInsights("appinsights", options =>
    {
        options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"]!;
    })
    .AddBlobServiceClient("blob-storage", options =>
    {
        options.AccountName = builder.Configuration["Azure:Storage:AccountName"]!;
    })
    .AddServiceBusQueue("servicebus-orders", options =>
    {
        options.FullyQualifiedNamespace = builder.Configuration["Azure:ServiceBus:Namespace"]!;
        options.QueueName = "orders";
    });
Application Insights failures degrade observability but shouldn’t stop the application from serving requests—the packages handle this distinction properly.
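When you register such a check yourself, the distinction maps directly onto HealthCheckResult.Degraded. A minimal sketch, assuming a hypothetical IsIngestionReachable() probe helper:

```csharp
// A telemetry outage degrades observability but must not pull the
// instance out of rotation, so failure maps to Degraded, not Unhealthy.
builder.Services.AddHealthChecks()
    .AddCheck("appinsights-ingestion", () =>
        IsIngestionReachable()   // hypothetical reachability probe
            ? HealthCheckResult.Healthy("Telemetry ingestion reachable")
            : HealthCheckResult.Degraded("Telemetry ingestion unavailable"),
        tags: ["ready"]);
```

Because the readiness endpoint maps Degraded to HTTP 200, load balancers keep routing traffic while monitoring still records the degradation.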
Startup health checks
ISO 27001 Control A.12.1.4 requires environment separation. Startup health checks enforce this—they prevent misconfigured deployments from ever receiving traffic:
builder.Services.AddHealthChecks()
    .AddCheck("startup-configuration", () =>
    {
        var required = new[]
        {
            "ConnectionStrings:Default",
            "Azure:KeyVault:Uri",
            "Azure:ServiceBus:ConnectionString"
        };
        var missing = required
            .Where(key => string.IsNullOrEmpty(builder.Configuration[key]))
            .ToList();
        return missing.Any()
            ? HealthCheckResult.Unhealthy($"Missing configuration: {string.Join(", ", missing)}")
            : HealthCheckResult.Healthy("All required configuration present");
    }, tags: new[] { "startup" });

var app = builder.Build();

// Run startup checks before accepting traffic
var startupHealthCheck = app.Services.GetRequiredService<HealthCheckService>();
var startupResult = await startupHealthCheck.CheckHealthAsync(
    registration => registration.Tags.Contains("startup"));

if (startupResult.Status != HealthStatus.Healthy)
{
    foreach (var entry in startupResult.Entries.Where(e => e.Value.Status != HealthStatus.Healthy))
    {
        app.Logger.LogCritical(
            "Startup health check '{CheckName}' failed: {Description}",
            entry.Key,
            entry.Value.Description);
    }

    throw new InvalidOperationException(
        "Application failed startup health checks. See logs for details.");
}

app.Run();
Misconfigured instances never start. Deployment pipelines fail fast with clear error messages instead of deploying broken configurations to production.
Secure health check UI
Public health endpoints should expose minimal information. Keep the detailed diagnostics behind authentication:
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("ready"),
    AllowCachingResponses = false,
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status200OK,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    },
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";
        await context.Response.WriteAsJsonAsync(new
        {
            status = report.Status.ToString()
            // No detailed information in the unauthenticated endpoint
        });
    }
});

// Detailed diagnostics require authentication
app.MapHealthChecks("/health/details", new HealthCheckOptions
{
    Predicate = _ => true,
    AllowCachingResponses = false,
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";
        await context.Response.WriteAsJsonAsync(new
        {
            status = report.Status.ToString(),
            duration = report.TotalDuration,
            checks = report.Entries.Select(e => new
            {
                name = e.Key,
                status = e.Value.Status.ToString(),
                description = e.Value.Description,
                duration = e.Value.Duration,
                tags = e.Value.Tags
            })
        });
    }
}).RequireAuthorization("HealthCheckPolicy");

// Authorization policy - register this on builder.Services before builder.Build()
builder.Services.AddAuthorization(options =>
{
    options.AddPolicy("HealthCheckPolicy", policy =>
        policy.RequireRole("Administrator", "HealthCheckReader"));
});
The /health/ready endpoint returns minimal status for load balancers. /health/details requires authorization and returns the full picture for operations teams.
Integration with Azure Monitor and alerting
Health checks only become a security control when you connect them to alerting. Azure Monitor provides the infrastructure:
using Azure.Monitor.OpenTelemetry.AspNetCore;

builder.Services.AddOpenTelemetry()
    .UseAzureMonitor(options =>
    {
        options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
    })
    .WithMetrics(metrics => metrics
        .AddMeter("Microsoft.AspNetCore.HealthChecks"));

// Publish health check results as metrics
builder.Services.AddHealthChecks()
    .AddCheck("database", /* ... */)
    .AddCheck("keyvault", /* ... */);

builder.Services.Configure<HealthCheckPublisherOptions>(options =>
{
    options.Delay = TimeSpan.FromSeconds(5);
    options.Period = TimeSpan.FromSeconds(30);
});

builder.Services.AddSingleton<IHealthCheckPublisher, ApplicationInsightsPublisher>();

public class ApplicationInsightsPublisher : IHealthCheckPublisher
{
    private readonly TelemetryClient _telemetryClient;

    // TelemetryClient is provided by the Application Insights SDK registration
    public ApplicationInsightsPublisher(TelemetryClient telemetryClient)
    {
        _telemetryClient = telemetryClient;
    }

    public Task PublishAsync(HealthReport report, CancellationToken cancellationToken)
    {
        foreach (var entry in report.Entries)
        {
            _telemetryClient.TrackMetric(
                $"HealthCheck.{entry.Key}",
                entry.Value.Status == HealthStatus.Healthy ? 1 : 0,
                new Dictionary<string, string>
                {
                    ["Status"] = entry.Value.Status.ToString(),
                    ["Description"] = entry.Value.Description ?? string.Empty
                });
        }

        return Task.CompletedTask;
    }
}
This publishes health check results to Application Insights every thirty seconds. Create Azure Monitor alerts based on these metrics:
# Azure CLI - Create alert rule for database health check failures
az monitor metrics alert create \
  --name "Database Health Check Failed" \
  --resource-group "production-rg" \
  --scopes "/subscriptions/{subscription-id}/resourceGroups/production-rg/providers/Microsoft.Insights/components/myapp-appinsights" \
  --condition "max HealthCheck.database < 1" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action "security-team-action-group" \
  --description "Database health check has failed - potential availability impact"
Alert failures notify security and operations teams before users notice. That’s Control A.17.2.1 in action.
GitHub Actions deployment gates
Health checks should gate your deployments. Don’t let a deployment complete until the application proves it’s healthy:
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Azure App Service
        uses: azure/webapps-deploy@v2
        with:
          app-name: myapp-production
          package: ./publish

      - name: Wait for deployment
        run: sleep 30

      - name: Verify startup health
        run: |
          for i in {1..10}; do
            response=$(curl -s -o /dev/null -w "%{http_code}" https://myapp-production.azurewebsites.net/health/live)
            if [ "$response" -eq 200 ]; then
              echo "Liveness check passed"
              break
            fi
            if [ "$i" -eq 10 ]; then
              echo "Liveness check failed after 10 attempts"
              exit 1
            fi
            sleep 10
          done

      - name: Verify application readiness
        run: |
          for i in {1..20}; do
            response=$(curl -s -o /dev/null -w "%{http_code}" https://myapp-production.azurewebsites.net/health/ready)
            if [ "$response" -eq 200 ]; then
              echo "Readiness check passed - deployment successful"
              exit 0
            fi
            if [ "$i" -eq 20 ]; then
              echo "Readiness check failed - rolling back deployment"
              exit 1
            fi
            sleep 15
          done

      - name: Rollback on failure
        if: failure()
        run: |
          # Trigger Azure App Service deployment slot swap back to previous version
          az webapp deployment slot swap \
            --resource-group production-rg \
            --name myapp-production \
            --slot staging \
            --target-slot production
This workflow deploys, waits for startup, verifies liveness, then checks readiness. If readiness fails within five minutes, the deployment rolls back automatically. Unhealthy deployments never receive production traffic.
What I’ve learned
After fifteen years implementing monitoring systems across enterprise environments, these patterns consistently separate teams that catch failures early from those that discover them via angry user reports:
1. Separate liveness from readiness. Orchestrators need to know if the process crashed. Load balancers need to know if the application can serve requests. These are different questions requiring different endpoints.
2. Tag health checks by purpose. Use tags to control which checks run for liveness, readiness, and startup verification. Not all checks apply to all scenarios.
3. Use appropriate failure statuses. Database failures are Unhealthy. Cache failures are Degraded. Telemetry failures are Degraded. Choose statuses that reflect actual impact on request handling.
4. Authenticate detailed diagnostics. Public endpoints return minimal status. Detailed information requires authorization. This prevents information disclosure while enabling troubleshooting.
5. Implement startup health checks. Fail deployments immediately when configuration is invalid. Don’t wait for runtime failures to discover environment separation violations.
6. Publish health metrics to monitoring systems. Health checks are worthless without alerting. Integrate with Azure Monitor, Application Insights, or your monitoring platform of choice.
7. Automate deployment verification. Health checks in CI/CD pipelines prevent broken deployments from reaching production. Automated rollback on health check failure implements Control A.12.1.4.
Health checks are not optional observability features. They are security controls that implement ISO 27001 availability requirements. Every team I’ve seen treat them as afterthoughts eventually discovers this the hard way—during an incident when degraded instances serve errors to users while reporting “healthy” to monitoring systems.
Your availability posture depends on checking what actually matters, not just whether the process is breathing.
