Health Check Improvements

Priority: LOW Estimated Time: 4-6 hours Status: IMPLEMENTED (2026-01-06)

← Back to Enhancements


Overview

Enhance the /health endpoint with comprehensive dependency validation, readiness vs liveness probes, and Kubernetes-compatible health checks.


Implementation Summary

Implemented Features:

  • ✅ Liveness probe at /health with shutdown state tracking
  • ✅ Readiness probe at /health/ready with dependency checks
  • ✅ Redis health check with circuit breaker status
  • ✅ JWKS cache health check with expiration tracking
  • ✅ Session count health check with utilization metrics
  • ✅ Comprehensive test coverage

Implementation Date: 2026-01-06 Actual Effort: 4-6 hours


Implemented Solution

1. Liveness Probe (GET /health)

Basic health check that returns 200 OK if the application is running.

Features:

  • Returns status: "ok" when healthy
  • Returns status: "shutting_down" with 503 during graceful shutdown
  • Includes server version in response

Example Response:

json
{
  "status": "ok",
  "version": "0.1.3"
}

During Shutdown:

json
HTTP/1.1 503 Service Unavailable

{
  "status": "shutting_down",
  "version": "0.1.3"
}
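
A minimal sketch of how the liveness handler can be wired up, assuming an Express router and a shutdown flag set by the server's signal handlers (isShuttingDown and markShuttingDown are illustrative names, not the actual implementation):

typescript
import { Router } from 'express';

const router = Router();

// Illustrative shutdown flag; the real server would set this from its
// SIGTERM/SIGINT handler before draining connections.
let isShuttingDown = false;
export function markShuttingDown(): void {
  isShuttingDown = true;
}

const SERVER_VERSION = process.env.npm_package_version ?? 'unknown';

// Liveness probe: report 503 once graceful shutdown has begun so the
// orchestrator stops routing traffic to this instance.
router.get('/health', (_req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({ status: 'shutting_down', version: SERVER_VERSION });
  }
  res.json({ status: 'ok', version: SERVER_VERSION });
});

export default router;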

2. Readiness Probe (GET /health/ready)

Comprehensive readiness check that validates critical dependencies.

Features:

  • Redis connectivity check with circuit breaker status
  • JWKS cache validation with expiration tracking
  • Active session count with utilization metrics
  • Returns 200 when ready, 503 when degraded

Example Response (Healthy):

json
HTTP/1.1 200 OK

{
  "status": "ready",
  "version": "0.1.3",
  "checks": {
    "redis": {
      "healthy": true,
      "connected": true,
      "circuitBreaker": {
        "state": "closed",
        "failureCount": 0,
        "successCount": 100,
        "lastFailure": null,
        "nextRetry": null
      }
    },
    "jwks": {
      "healthy": true,
      "cached": true,
      "isExpired": false,
      "fetchedAt": "2026-01-06T10:00:00.000Z",
      "expiresAt": "2026-01-06T11:00:00.000Z",
      "cacheAge": 1800000
    },
    "sessions": {
      "healthy": true,
      "activeSessions": 42,
      "maxSessions": 10000,
      "utilizationPercent": 0.42
    }
  }
}

Example Response (Degraded):

json
HTTP/1.1 503 Service Unavailable

{
  "status": "degraded",
  "version": "0.1.3",
  "checks": {
    "redis": {
      "healthy": false,
      "connected": false,
      "circuitBreaker": {
        "state": "open",
        "failureCount": 5,
        "successCount": 0,
        "lastFailure": "2026-01-06T10:00:00.000Z",
        "nextRetry": "2026-01-06T10:01:00.000Z"
      }
    },
    "jwks": {
      "healthy": true,
      "cached": true,
      "isExpired": false
    },
    "sessions": {
      "healthy": true,
      "activeSessions": 42,
      "maxSessions": 10000,
      "utilizationPercent": 0.42
    }
  }
}
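
A sketch of how the readiness handler might aggregate these checks into a single 200/503 response (the check functions below are stand-in stubs for the project's actual Redis, JWKS, and session checks):

typescript
import { Router } from 'express';

// Stand-in stubs; the real checks return the richer objects shown in the
// example responses above.
interface Check { healthy: boolean; [key: string]: unknown }
async function checkRedis(): Promise<Check> { return { healthy: true }; }
async function checkJwks(): Promise<Check> { return { healthy: true }; }
async function checkSessions(): Promise<Check> { return { healthy: true }; }

const router = Router();

// Ready only when every dependency check passes; otherwise return 503 so
// load balancers and Kubernetes stop sending traffic to this instance.
router.get('/health/ready', async (_req, res) => {
  const [redis, jwks, sessions] = await Promise.all([
    checkRedis(),
    checkJwks(),
    checkSessions()
  ]);

  const checks = { redis, jwks, sessions };
  const ready = Object.values(checks).every((c) => c.healthy);

  res.status(ready ? 200 : 503).json({
    status: ready ? 'ready' : 'degraded',
    version: process.env.npm_package_version ?? 'unknown',
    checks
  });
});

export default router;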

Configuration

Environment Variables

Added MCP_SESSION_MAX_TOTAL to configure session limit:

bash
# Maximum concurrent sessions (Optional)
# Health check reports degraded status when this limit is exceeded
# Default: 10000
MCP_SESSION_MAX_TOTAL=10000

File: .env.example
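
A small sketch of how the limit might be read and folded into the session check (the names here are illustrative, not the actual implementation):

typescript
// Parse the limit from the environment, falling back to the documented default.
const maxSessions = Number(process.env.MCP_SESSION_MAX_TOTAL ?? 10000);

// Utilization as the percentage reported in the readiness payload,
// e.g. 42 active sessions out of 10000 -> 0.42.
export function sessionUtilizationPercent(activeSessions: number): number {
  return (activeSessions / maxSessions) * 100;
}

// The session check reports degraded once the limit is exceeded.
export function sessionsHealthy(activeSessions: number): boolean {
  return activeSessions <= maxSessions;
}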


Original Proposed Solution

Below is the original proposed solution. The actual implementation is simpler and more pragmatic, focusing on the critical checks needed for production deployment.


Implementation

Note: Redis is included in the local development environment when using ./scripts/local (part of the Docker stack).

1. Health Check Service

Create src/services/health-check.ts:

typescript
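// Note: this proposal assumes the project's redis client, sessionStore and
// config modules are imported here, along with a small parseMemoryUsage()
// helper for the Redis INFO output; those imports are not shown.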
interface ComponentHealth {
  status: 'healthy' | 'degraded' | 'unhealthy';
  message?: string;
  latency?: number;
  details?: Record<string, any>;
}

interface HealthCheckResult {
  status: 'up' | 'down' | 'degraded';
  timestamp: string;
  uptime: number;
  version: string;
  components: Record<string, ComponentHealth>;
}

export class HealthCheckService {
  async checkRedis(): Promise<ComponentHealth> {
    const start = Date.now();

    try {
      await redis.ping();
      const latency = Date.now() - start;

      // Check Redis memory usage
      const info = await redis.info('memory');
      const memoryUsed = parseMemoryUsage(info);

      return {
        status: latency < 100 ? 'healthy' : 'degraded',
        message: `Redis responding in ${latency}ms`,
        latency,
        details: {
          memoryUsedMb: memoryUsed,
          connected: true
        }
      };
    } catch (error) {
      return {
        status: 'unhealthy',
        message: `Redis connection failed: ${error.message}`,
        details: { connected: false }
      };
    }
  }

  async checkOIDC(): Promise<ComponentHealth> {
    const start = Date.now();

    try {
      // Fetch JWKS to verify OIDC provider is reachable
      const response = await fetch(config.oidc.jwksUri, {
        signal: AbortSignal.timeout(5000)
      });

      const latency = Date.now() - start;

      if (!response.ok) {
        return {
          status: 'degraded',
          message: `OIDC provider returned ${response.status}`,
          latency,
          details: { statusCode: response.status }
        };
      }

      return {
        status: latency < 1000 ? 'healthy' : 'degraded',
        message: `OIDC provider responding in ${latency}ms`,
        latency,
        details: {
          issuer: config.oidc.issuer,
          reachable: true
        }
      };
    } catch (error) {
      return {
        status: 'unhealthy',
        message: `OIDC provider unreachable: ${error.message}`,
        details: { reachable: false }
      };
    }
  }

  async checkSessions(): Promise<ComponentHealth> {
    try {
      const activeSessions = await sessionStore.getSessionCount();
      const sessionKeys = await redis.keys('session:*');

      return {
        status: 'healthy',
        message: `${activeSessions} active sessions`,
        details: {
          active: activeSessions,
          inRedis: sessionKeys.length
        }
      };
    } catch (error) {
      return {
        status: 'degraded',
        message: `Session check failed: ${error.message}`
      };
    }
  }

  async getHealthStatus(): Promise<HealthCheckResult> {
    const components = await Promise.all([
      this.checkRedis().then(r => ['redis', r] as const),
      this.checkOIDC().then(r => ['oidc', r] as const),
      this.checkSessions().then(r => ['sessions', r] as const)
    ]);

    const componentMap = Object.fromEntries(components);

    // Determine overall status
    const hasUnhealthy = Object.values(componentMap).some(c => c.status === 'unhealthy');
    const hasDegraded = Object.values(componentMap).some(c => c.status === 'degraded');

    const overallStatus = hasUnhealthy ? 'down' : hasDegraded ? 'degraded' : 'up';

    return {
      status: overallStatus,
      timestamp: new Date().toISOString(),
      uptime: process.uptime(),
      version: process.env.npm_package_version ?? 'unknown',
      components: componentMap
    };
  }
}

2. Health Endpoints

Update src/routes/health.ts:

typescript
import { Router } from 'express';
import { HealthCheckService } from '../services/health-check.js';
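// Also assumed to be imported here (not shown): the shared redis client and
// jwksClient referenced by the readiness and startup probes below.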

const router = Router();
const healthCheck = new HealthCheckService();

// Detailed health check
router.get('/health', async (req, res) => {
  const result = await healthCheck.getHealthStatus();

  const statusCode = result.status === 'up' ? 200 : result.status === 'degraded' ? 200 : 503;

  res.status(statusCode).json(result);
});

// Liveness probe (Kubernetes)
// "Is the application running?" (not hung/deadlocked)
router.get('/health/live', (req, res) => {
  // Simple check - if we can respond, we're alive
  res.json({
    status: 'up',
    timestamp: new Date().toISOString()
  });
});

// Readiness probe (Kubernetes)
// "Is the application ready to serve traffic?"
router.get('/health/ready', async (req, res) => {
  // Check critical dependencies
  const redis = await healthCheck.checkRedis();

  if (redis.status === 'unhealthy') {
    return res.status(503).json({
      status: 'not ready',
      reason: 'Redis unavailable',
      details: redis
    });
  }

  res.json({
    status: 'ready',
    timestamp: new Date().toISOString()
  });
});

// Startup probe (Kubernetes)
// "Has the application finished starting?"
router.get('/health/startup', async (req, res) => {
  // Check if all initialization is complete
  const isInitialized = jwksClient.isInitialized() && redis.status === 'ready';

  if (!isInitialized) {
    return res.status(503).json({
      status: 'starting',
      message: 'Application still initializing'
    });
  }

  res.json({
    status: 'started',
    timestamp: new Date().toISOString()
  });
});

export default router;

Kubernetes Integration

Deployment Configuration

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seed-mcp-server
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: seed
        image: seed-mcp:latest
        ports:
        - containerPort: 3000

        # Startup probe - allow up to 60s for app to start
        startupProbe:
          httpGet:
            path: /health/startup
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 12  # 60 seconds total

        # Liveness probe - restart if unhealthy
        livenessProbe:
          httpGet:
            path: /health/live
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Readiness probe - remove from service if not ready
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3

Response Examples

Healthy System

json
GET /health

{
  "status": "up",
  "timestamp": "2026-01-05T12:00:00Z",
  "uptime": 86400,
  "version": "0.1.3",
  "components": {
    "redis": {
      "status": "healthy",
      "message": "Redis responding in 15ms",
      "latency": 15,
      "details": {
        "memoryUsedMb": 45,
        "connected": true
      }
    },
    "oidc": {
      "status": "healthy",
      "message": "OIDC provider responding in 250ms",
      "latency": 250,
      "details": {
        "issuer": "https://auth.example.com",
        "reachable": true
      }
    },
    "sessions": {
      "status": "healthy",
      "message": "42 active sessions",
      "details": {
        "active": 42,
        "inRedis": 42
      }
    }
  }
}

Degraded System

json
GET /health

HTTP/1.1 200 OK

{
  "status": "degraded",
  "timestamp": "2026-01-05T12:00:00Z",
  "uptime": 86400,
  "version": "0.1.3",
  "components": {
    "redis": {
      "status": "degraded",
      "message": "Redis responding in 450ms",
      "latency": 450,
      "details": {
        "memoryUsedMb": 980,
        "connected": true
      }
    },
    "oidc": {
      "status": "healthy",
      "message": "OIDC provider responding in 200ms",
      "latency": 200
    }
  }
}

Unhealthy System

json
GET /health

HTTP/1.1 503 Service Unavailable

{
  "status": "down",
  "timestamp": "2026-01-05T12:00:00Z",
  "uptime": 86400,
  "version": "0.1.3",
  "components": {
    "redis": {
      "status": "unhealthy",
      "message": "Redis connection failed: Connection timeout",
      "details": {
        "connected": false
      }
    }
  }
}

Configuration

Add to .env.example:

bash
# Health Checks
HEALTH_CHECK_TIMEOUT_MS=5000              # Timeout for dependency checks
HEALTH_CHECK_REDIS_ENABLED=true
HEALTH_CHECK_OIDC_ENABLED=true
HEALTH_CHECK_SESSIONS_ENABLED=true

Monitoring Integration

Prometheus Metrics

typescript
export const healthCheckDuration = new promClient.Histogram({
  name: 'health_check_duration_seconds',
  help: 'Health check duration in seconds',
  labelNames: ['component', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

export const componentStatus = new promClient.Gauge({
  name: 'component_healthy',
  help: 'Component health status (1=healthy, 0=unhealthy)',
  labelNames: ['component']
});
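
These metrics would be updated as each check runs; below is a sketch of a wrapper that does so, assuming the metrics above and the ComponentHealth shape from the proposed service (timedCheck and the import path are illustrative):

typescript
// Assumed import path for the metrics defined above.
import { healthCheckDuration, componentStatus } from './metrics.js';

// Minimal local copy of the component result shape used by the service.
type ComponentHealth = { status: 'healthy' | 'degraded' | 'unhealthy' };

// Time a single component check and record its duration and health status.
export async function timedCheck(
  component: string,
  check: () => Promise<ComponentHealth>
): Promise<ComponentHealth> {
  const stopTimer = healthCheckDuration.startTimer({ component });
  const result = await check();
  stopTimer({ status: result.status });

  componentStatus.labels(component).set(result.status === 'healthy' ? 1 : 0);
  return result;
}

// Usage: const redisHealth = await timedCheck('redis', () => healthCheck.checkRedis());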

Alerting Rules

promql
# Alert if Redis is unhealthy
component_healthy{component="redis"} == 0

# Alert if health checks are slow
health_check_duration_seconds{component="redis"} > 1

# Alert if any component is down for 5 minutes
avg_over_time(component_healthy[5m]) < 0.5

Load Balancer Integration

AWS Application Load Balancer

hcl
resource "aws_lb_target_group" "seed" {
  health_check {
    enabled             = true
    path                = "/health/ready"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

NGINX

nginx
upstream seed_backend {
  server seed-1:3000 max_fails=3 fail_timeout=30s;
  server seed-2:3000 max_fails=3 fail_timeout=30s;
  server seed-3:3000 max_fails=3 fail_timeout=30s;
}

# Health check
location /health/ready {
  proxy_pass http://seed_backend;
  proxy_connect_timeout 5s;
  proxy_read_timeout 5s;
}

Testing

Unit Tests

typescript
describe('Health Check Service', () => {
  it('should report healthy when all components are up', async () => {
    mockRedis.ping.mockResolvedValue('PONG');
    mockFetch.mockResolvedValue({ ok: true, status: 200 });

    const result = await healthCheck.getHealthStatus();

    expect(result.status).toBe('up');
    expect(result.components.redis.status).toBe('healthy');
    expect(result.components.oidc.status).toBe('healthy');
  });

  it('should report unhealthy when Redis is down', async () => {
    mockRedis.ping.mockRejectedValue(new Error('Connection refused'));

    const result = await healthCheck.getHealthStatus();

    expect(result.status).toBe('down');
    expect(result.components.redis.status).toBe('unhealthy');
  });
});

Integration Tests

typescript
describe('Health Endpoints', () => {
  it('GET /health should return 200 when healthy', async () => {
    const response = await request(app).get('/health');

    expect(response.status).toBe(200);
    expect(response.body.status).toBe('up');
  });

  it('GET /health/ready should return 503 when Redis is down', async () => {
    await stopRedis();

    const response = await request(app).get('/health/ready');

    expect(response.status).toBe(503);
    expect(response.body.status).toBe('not ready');
  });
});

Advanced Features

Cached Health Checks

Cache health check results to reduce load:

typescript
class CachedHealthCheck {
  private cache: HealthCheckResult | null = null;
  private cacheExpiry: number = 0;
  private cacheTtl: number = 5000; // 5 seconds

  async getHealthStatus(): Promise<HealthCheckResult> {
    const now = Date.now();

    if (this.cache && now < this.cacheExpiry) {
      return this.cache;
    }

    const result = await this.performHealthCheck();
    this.cache = result;
    this.cacheExpiry = now + this.cacheTtl;

    return result;
  }

  // Delegates to the HealthCheckService defined earlier in this proposal.
  private performHealthCheck(): Promise<HealthCheckResult> {
    return new HealthCheckService().getHealthStatus();
  }
}

Dependency Graph

Track which components depend on others:

typescript
const dependencies = {
  'api': ['redis', 'oidc'],
  'mcp': ['redis', 'api'],
  'oauth': ['oidc', 'redis']
};

// Calculate cascading failures
function calculateImpact(failedComponent: string): string[] {
  return Object.entries(dependencies)
    .filter(([, deps]) => deps.includes(failedComponent))
    .map(([component]) => component);
}
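
With this map, for example, calculateImpact('redis') returns ['api', 'mcp', 'oauth'], since every component that lists redis as a dependency would be affected by a Redis outage.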

Open Questions

  1. Synthetic Monitoring: Should health checks include end-to-end request simulation?
  2. External Dependencies: Should we check external APIs (beyond OIDC)?
  3. Alerting Integration: Direct integration with PagerDuty/Opsgenie?
  4. Performance Impact: What's acceptable latency for health checks?

Related Enhancements

  • Distributed Tracing - Trace health check requests
  • Audit Logging - Log health check failures
  • Monitoring - Prometheus metrics integration

← Back to Enhancements
