Health Check Improvements

Priority: LOW Estimated Time: 4-6 hours Status: IMPLEMENTED (2026-01-06)

← Back to Enhancements


Overview

Enhance the /health endpoint with comprehensive dependency validation, readiness vs liveness probes, and Kubernetes-compatible health checks.


Implementation Summary

Implemented Features:

  • ✅ Liveness probe at /health with shutdown state tracking
  • ✅ Readiness probe at /health/ready with dependency checks
  • ✅ Redis health check with circuit breaker status
  • ✅ JWKS cache health check with expiration tracking
  • ✅ Session count health check with utilization metrics
  • ✅ Comprehensive test coverage

Implementation Date: 2026-01-06 Actual Effort: 4-6 hours


Implemented Solution

1. Liveness Probe (GET /health)

Basic health check that returns 200 OK if the application is running.

Features:

  • Returns status: "ok" when healthy
  • Returns status: "shutting_down" with 503 during graceful shutdown
  • Includes server version in response

Example Response:

json
{
  "status": "ok",
  "version": "0.1.3"
}

During Shutdown:

json
HTTP/1.1 503 Service Unavailable

{
  "status": "shutting_down",
  "version": "0.1.3"
}
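
A minimal sketch of how the liveness handler can be wired up, assuming an Express router and a shutdown flag set by the server's signal handlers (isShuttingDown and markShuttingDown are illustrative names, not the actual implementation):

typescript
import { Router } from 'express';

const router = Router();

// Illustrative shutdown flag; the real server would set this from its
// SIGTERM/SIGINT handler before draining connections.
let isShuttingDown = false;
export function markShuttingDown(): void {
  isShuttingDown = true;
}

const SERVER_VERSION = process.env.npm_package_version ?? 'unknown';

// Liveness probe: report 503 once graceful shutdown has begun so the
// orchestrator stops routing traffic to this instance.
router.get('/health', (_req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({ status: 'shutting_down', version: SERVER_VERSION });
  }
  res.json({ status: 'ok', version: SERVER_VERSION });
});

export default router;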

2. Readiness Probe (GET /health/ready)

Comprehensive readiness check that validates critical dependencies.

Features:

  • Redis connectivity check with circuit breaker status
  • JWKS cache validation with expiration tracking
  • Active session count with utilization metrics
  • Returns 200 when ready, 503 when degraded

Example Response (Healthy):

json
HTTP/1.1 200 OK

{
  "status": "ready",
  "version": "0.1.3",
  "checks": {
    "redis": {
      "healthy": true,
      "connected": true,
      "circuitBreaker": {
        "state": "closed",
        "failureCount": 0,
        "successCount": 100,
        "lastFailure": null,
        "nextRetry": null
      }
    },
    "jwks": {
      "healthy": true,
      "cached": true,
      "isExpired": false,
      "fetchedAt": "2026-01-06T10:00:00.000Z",
      "expiresAt": "2026-01-06T11:00:00.000Z",
      "cacheAge": 1800000
    },
    "sessions": {
      "healthy": true,
      "activeSessions": 42,
      "maxSessions": 10000,
      "utilizationPercent": 0.42
    }
  }
}

Example Response (Degraded):

json
HTTP/1.1 503 Service Unavailable

{
  "status": "degraded",
  "version": "0.1.3",
  "checks": {
    "redis": {
      "healthy": false,
      "connected": false,
      "circuitBreaker": {
        "state": "open",
        "failureCount": 5,
        "successCount": 0,
        "lastFailure": "2026-01-06T10:00:00.000Z",
        "nextRetry": "2026-01-06T10:01:00.000Z"
      }
    },
    "jwks": {
      "healthy": true,
      "cached": true,
      "isExpired": false
    },
    "sessions": {
      "healthy": true,
      "activeSessions": 42,
      "maxSessions": 10000,
      "utilizationPercent": 0.42
    }
  }
}
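
A sketch of how the readiness handler might aggregate these checks into a single 200/503 response (the check functions below are stand-in stubs for the project's actual Redis, JWKS, and session checks):

typescript
import { Router } from 'express';

// Stand-in stubs; the real checks return the richer objects shown in the
// example responses above.
interface Check { healthy: boolean; [key: string]: unknown }
async function checkRedis(): Promise<Check> { return { healthy: true }; }
async function checkJwks(): Promise<Check> { return { healthy: true }; }
async function checkSessions(): Promise<Check> { return { healthy: true }; }

const router = Router();

// Ready only when every dependency check passes; otherwise return 503 so
// load balancers and Kubernetes stop sending traffic to this instance.
router.get('/health/ready', async (_req, res) => {
  const [redis, jwks, sessions] = await Promise.all([
    checkRedis(),
    checkJwks(),
    checkSessions()
  ]);

  const checks = { redis, jwks, sessions };
  const ready = Object.values(checks).every((c) => c.healthy);

  res.status(ready ? 200 : 503).json({
    status: ready ? 'ready' : 'degraded',
    version: process.env.npm_package_version ?? 'unknown',
    checks
  });
});

export default router;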

Configuration

Environment Variables

Added MCP_SESSION_MAX_TOTAL to configure session limit:

bash
# Maximum concurrent sessions (Optional)
# Health check reports degraded status when this limit is exceeded
# Default: 10000
MCP_SESSION_MAX_TOTAL=10000

File: .env.example
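
A small sketch of how the limit might be read and folded into the session check (the names here are illustrative, not the actual implementation):

typescript
// Parse the limit from the environment, falling back to the documented default.
const maxSessions = Number(process.env.MCP_SESSION_MAX_TOTAL ?? 10000);

// Utilization as the percentage reported in the readiness payload,
// e.g. 42 active sessions out of 10000 -> 0.42.
export function sessionUtilizationPercent(activeSessions: number): number {
  return (activeSessions / maxSessions) * 100;
}

// The session check reports degraded once the limit is exceeded.
export function sessionsHealthy(activeSessions: number): boolean {
  return activeSessions <= maxSessions;
}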


Original Proposed Solution

Below is the original proposed solution. The actual implementation is simpler and more pragmatic, focusing on the critical checks needed for production deployment.


Implementation

Note: Redis is included in the local development environment when using ./scripts/local (part of the Docker stack).

1. Health Check Service

Create src/services/health-check.ts:

typescript
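// Note: this proposal assumes the project's redis client, sessionStore and
// config modules are imported here, along with a small parseMemoryUsage()
// helper for the Redis INFO output; those imports are not shown.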
interface ComponentHealth {
  status: 'healthy' | 'degraded' | 'unhealthy';
  message?: string;
  latency?: number;
  details?: Record<string, any>;
}

interface HealthCheckResult {
  status: 'up' | 'down' | 'degraded';
  timestamp: string;
  uptime: number;
  version: string;
  components: Record<string, ComponentHealth>;
}

export class HealthCheckService {
  async checkRedis(): Promise<ComponentHealth> {
    const start = Date.now();

    try {
      await redis.ping();
      const latency = Date.now() - start;

      // Check Redis memory usage
      const info = await redis.info('memory');
      const memoryUsed = parseMemoryUsage(info);

      return {
        status: latency < 100 ? 'healthy' : 'degraded',
        message: `Redis responding in ${latency}ms`,
        latency,
        details: {
          memoryUsedMb: memoryUsed,
          connected: true
        }
      };
    } catch (error) {
      return {
        status: 'unhealthy',
        message: `Redis connection failed: ${error.message}`,
        details: { connected: false }
      };
    }
  }

  async checkOIDC(): Promise<ComponentHealth> {
    const start = Date.now();

    try {
      // Fetch JWKS to verify OIDC provider is reachable
      const response = await fetch(config.oidc.jwksUri, {
        signal: AbortSignal.timeout(5000)
      });

      const latency = Date.now() - start;

      if (!response.ok) {
        return {
          status: 'degraded',
          message: `OIDC provider returned ${response.status}`,
          latency,
          details: { statusCode: response.status }
        };
      }

      return {
        status: latency < 1000 ? 'healthy' : 'degraded',
        message: `OIDC provider responding in ${latency}ms`,
        latency,
        details: {
          issuer: config.oidc.issuer,
          reachable: true
        }
      };
    } catch (error) {
      return {
        status: 'unhealthy',
        message: `OIDC provider unreachable: ${error.message}`,
        details: { reachable: false }
      };
    }
  }

  async checkSessions(): Promise<ComponentHealth> {
    try {
      const activeSessions = await sessionStore.getSessionCount();
      const sessionKeys = await redis.keys('session:*');

      return {
        status: 'healthy',
        message: `${activeSessions} active sessions`,
        details: {
          active: activeSessions,
          inRedis: sessionKeys.length
        }
      };
    } catch (error) {
      return {
        status: 'degraded',
        message: `Session check failed: ${error.message}`
      };
    }
  }

  async getHealthStatus(): Promise<HealthCheckResult> {
    const components = await Promise.all([
      this.checkRedis().then(r => ['redis', r] as const),
      this.checkOIDC().then(r => ['oidc', r] as const),
      this.checkSessions().then(r => ['sessions', r] as const)
    ]);

    const componentMap = Object.fromEntries(components);

    // Determine overall status
    const hasUnhealthy = Object.values(componentMap).some(c => c.status === 'unhealthy');
    const hasDegraded = Object.values(componentMap).some(c => c.status === 'degraded');

    const overallStatus = hasUnhealthy ? 'down' : hasDegraded ? 'degraded' : 'up';

    return {
      status: overallStatus,
      timestamp: new Date().toISOString(),
      uptime: process.uptime(),
      version: process.env.npm_package_version ?? 'unknown',
      components: componentMap
    };
  }
}

2. Health Endpoints

Update src/routes/health.ts:

typescript
import { Router } from 'express';
import { HealthCheckService } from '../services/health-check.js';
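// Also assumed to be imported here (not shown): the shared redis client and
// jwksClient referenced by the readiness and startup probes below.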

const router = Router();
const healthCheck = new HealthCheckService();

// Detailed health check
router.get('/health', async (req, res) => {
  const result = await healthCheck.getHealthStatus();

  const statusCode = result.status === 'up' ? 200 : result.status === 'degraded' ? 200 : 503;

  res.status(statusCode).json(result);
});

// Liveness probe (Kubernetes)
// "Is the application running?" (not hung/deadlocked)
router.get('/health/live', (req, res) => {
  // Simple check - if we can respond, we're alive
  res.json({
    status: 'up',
    timestamp: new Date().toISOString()
  });
});

// Readiness probe (Kubernetes)
// "Is the application ready to serve traffic?"
router.get('/health/ready', async (req, res) => {
  // Check critical dependencies
  const redis = await healthCheck.checkRedis();

  if (redis.status === 'unhealthy') {
    return res.status(503).json({
      status: 'not ready',
      reason: 'Redis unavailable',
      details: redis
    });
  }

  res.json({
    status: 'ready',
    timestamp: new Date().toISOString()
  });
});

// Startup probe (Kubernetes)
// "Has the application finished starting?"
router.get('/health/startup', async (req, res) => {
  // Check if all initialization is complete
  const isInitialized = jwksClient.isInitialized() && redis.status === 'ready';

  if (!isInitialized) {
    return res.status(503).json({
      status: 'starting',
      message: 'Application still initializing'
    });
  }

  res.json({
    status: 'started',
    timestamp: new Date().toISOString()
  });
});

export default router;

Kubernetes Integration

Deployment Configuration

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seed-mcp-server
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: seed
        image: seed-mcp:latest
        ports:
        - containerPort: 3000

        # Startup probe - allow up to 60s for app to start
        startupProbe:
          httpGet:
            path: /health/startup
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 12  # 60 seconds total

        # Liveness probe - restart if unhealthy
        livenessProbe:
          httpGet:
            path: /health/live
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Readiness probe - remove from service if not ready
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3

Response Examples

Healthy System

json
GET /health

{
  "status": "up",
  "timestamp": "2026-01-05T12:00:00Z",
  "uptime": 86400,
  "version": "0.1.3",
  "components": {
    "redis": {
      "status": "healthy",
      "message": "Redis responding in 15ms",
      "latency": 15,
      "details": {
        "memoryUsedMb": 45,
        "connected": true
      }
    },
    "oidc": {
      "status": "healthy",
      "message": "OIDC provider responding in 250ms",
      "latency": 250,
      "details": {
        "issuer": "https://auth.example.com",
        "reachable": true
      }
    },
    "sessions": {
      "status": "healthy",
      "message": "42 active sessions",
      "details": {
        "active": 42,
        "inRedis": 42
      }
    }
  }
}

Degraded System

json
GET /health

HTTP/1.1 200 OK

{
  "status": "degraded",
  "timestamp": "2026-01-05T12:00:00Z",
  "uptime": 86400,
  "version": "0.1.3",
  "components": {
    "redis": {
      "status": "degraded",
      "message": "Redis responding in 450ms",
      "latency": 450,
      "details": {
        "memoryUsedMb": 980,
        "connected": true
      }
    },
    "oidc": {
      "status": "healthy",
      "message": "OIDC provider responding in 200ms",
      "latency": 200
    }
  }
}

Unhealthy System

json
GET /health

HTTP/1.1 503 Service Unavailable

{
  "status": "down",
  "timestamp": "2026-01-05T12:00:00Z",
  "uptime": 86400,
  "version": "0.1.3",
  "components": {
    "redis": {
      "status": "unhealthy",
      "message": "Redis connection failed: Connection timeout",
      "details": {
        "connected": false
      }
    }
  }
}

Configuration

Add to .env.example:

bash
# Health Checks
HEALTH_CHECK_TIMEOUT_MS=5000              # Timeout for dependency checks
HEALTH_CHECK_REDIS_ENABLED=true
HEALTH_CHECK_OIDC_ENABLED=true
HEALTH_CHECK_SESSIONS_ENABLED=true

Monitoring Integration

Prometheus Metrics

typescript
export const healthCheckDuration = new promClient.Histogram({
  name: 'health_check_duration_seconds',
  help: 'Health check duration in seconds',
  labelNames: ['component', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

export const componentStatus = new promClient.Gauge({
  name: 'component_healthy',
  help: 'Component health status (1=healthy, 0=unhealthy)',
  labelNames: ['component']
});
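
These metrics would be updated as each check runs; below is a sketch of a wrapper that does so, assuming the metrics above and the ComponentHealth shape from the proposed service (timedCheck and the import path are illustrative):

typescript
// Assumed import path for the metrics defined above.
import { healthCheckDuration, componentStatus } from './metrics.js';

// Minimal local copy of the component result shape used by the service.
type ComponentHealth = { status: 'healthy' | 'degraded' | 'unhealthy' };

// Time a single component check and record its duration and health status.
export async function timedCheck(
  component: string,
  check: () => Promise<ComponentHealth>
): Promise<ComponentHealth> {
  const stopTimer = healthCheckDuration.startTimer({ component });
  const result = await check();
  stopTimer({ status: result.status });

  componentStatus.labels(component).set(result.status === 'healthy' ? 1 : 0);
  return result;
}

// Usage: const redisHealth = await timedCheck('redis', () => healthCheck.checkRedis());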

Alerting Rules

promql
# Alert if Redis is unhealthy
component_healthy{component="redis"} == 0

# Alert if health checks are slow
health_check_duration_seconds{component="redis"} > 1

# Alert if any component is down for 5 minutes
avg_over_time(component_healthy[5m]) < 0.5

Load Balancer Integration

AWS Application Load Balancer

hcl
resource "aws_lb_target_group" "seed" {
  health_check {
    enabled             = true
    path                = "/health/ready"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

NGINX

nginx
upstream seed_backend {
  server seed-1:3000 max_fails=3 fail_timeout=30s;
  server seed-2:3000 max_fails=3 fail_timeout=30s;
  server seed-3:3000 max_fails=3 fail_timeout=30s;
}

# Health check
location /health/ready {
  proxy_pass http://seed_backend;
  proxy_connect_timeout 5s;
  proxy_read_timeout 5s;
}

Testing

Unit Tests

typescript
describe('Health Check Service', () => {
  it('should report healthy when all components are up', async () => {
    mockRedis.ping.mockResolvedValue('PONG');
    mockFetch.mockResolvedValue({ ok: true, status: 200 });

    const result = await healthCheck.getHealthStatus();

    expect(result.status).toBe('up');
    expect(result.components.redis.status).toBe('healthy');
    expect(result.components.oidc.status).toBe('healthy');
  });

  it('should report unhealthy when Redis is down', async () => {
    mockRedis.ping.mockRejectedValue(new Error('Connection refused'));

    const result = await healthCheck.getHealthStatus();

    expect(result.status).toBe('down');
    expect(result.components.redis.status).toBe('unhealthy');
  });
});

Integration Tests

typescript
describe('Health Endpoints', () => {
  it('GET /health should return 200 when healthy', async () => {
    const response = await request(app).get('/health');

    expect(response.status).toBe(200);
    expect(response.body.status).toBe('up');
  });

  it('GET /health/ready should return 503 when Redis is down', async () => {
    await stopRedis();

    const response = await request(app).get('/health/ready');

    expect(response.status).toBe(503);
    expect(response.body.status).toBe('not ready');
  });
});

Advanced Features

Cached Health Checks

Cache health check results to reduce load:

typescript
class CachedHealthCheck {
  private cache: HealthCheckResult | null = null;
  private cacheExpiry: number = 0;
  private cacheTtl: number = 5000; // 5 seconds

  async getHealthStatus(): Promise<HealthCheckResult> {
    const now = Date.now();

    if (this.cache && now < this.cacheExpiry) {
      return this.cache;
    }

    const result = await this.performHealthCheck();
    this.cache = result;
    this.cacheExpiry = now + this.cacheTtl;

    return result;
  }

  // Delegates to the HealthCheckService defined earlier in this proposal.
  private performHealthCheck(): Promise<HealthCheckResult> {
    return new HealthCheckService().getHealthStatus();
  }
}

Dependency Graph

Track which components depend on others:

typescript
const dependencies = {
  'api': ['redis', 'oidc'],
  'mcp': ['redis', 'api'],
  'oauth': ['oidc', 'redis']
};

// Calculate cascading failures
function calculateImpact(failedComponent: string): string[] {
  return Object.entries(dependencies)
    .filter(([, deps]) => deps.includes(failedComponent))
    .map(([component]) => component);
}
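
With this map, for example, calculateImpact('redis') returns ['api', 'mcp', 'oauth'], since every component that lists redis as a dependency would be affected by a Redis outage.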

Open Questions

  1. Synthetic Monitoring: Should health checks include end-to-end request simulation?
  2. External Dependencies: Should we check external APIs (beyond OIDC)?
  3. Alerting Integration: Direct integration with PagerDuty/Opsgenie?
  4. Performance Impact: What's acceptable latency for health checks?

Related Enhancements

  • Distributed Tracing - Trace health check requests
  • Audit Logging - Log health check failures
  • Monitoring - Prometheus metrics integration

← Back to Enhancements
