Health Check Improvements
Priority: LOW | Estimated Time: 4-6 hours | Status: ✅ IMPLEMENTED (2026-01-06)
Overview
Enhance the /health endpoint with comprehensive dependency validation, readiness vs liveness probes, and Kubernetes-compatible health checks.
Implementation Summary
Implemented Features:
- ✅ Liveness probe at `/health` with shutdown state tracking
- ✅ Readiness probe at `/health/ready` with dependency checks
- ✅ Redis health check with circuit breaker status
- ✅ JWKS cache health check with expiration tracking
- ✅ Session count health check with utilization metrics
- ✅ Comprehensive test coverage
Implementation Date: 2026-01-06 | Actual Effort: 4-6 hours
Implemented Solution
1. Liveness Probe (GET /health)
Basic health check that returns 200 OK if the application is running.
Features:
- Returns `status: "ok"` when healthy
- Returns `status: "shutting_down"` with 503 during graceful shutdown (see the sketch below)
- Includes the server version in the response
Example Response:
{
"status": "ok",
"version": "0.1.3"
}
During Shutdown:
HTTP/1.1 503 Service Unavailable
{
"status": "shutting_down",
"version": "0.1.3"
}
2. Readiness Probe (GET /health/ready)
Comprehensive readiness check that validates critical dependencies.
Features:
- Redis connectivity check with circuit breaker status
- JWKS cache validation with expiration tracking
- Active session count with utilization metrics
- Returns 200 when ready, 503 when degraded (see the sketch below)
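A minimal sketch of that aggregation, with placeholder check functions standing in for the real Redis, JWKS cache, and session checks:

```typescript
import { Router } from 'express';

// Placeholder check results; the real checks return the fields shown in the
// example responses below.
type CheckResult = { healthy: boolean } & Record<string, unknown>;
type CheckFn = () => Promise<CheckResult>;

// Assumption: each dependency exposes a check function like these stubs.
const checks: Record<string, CheckFn> = {
  redis: async () => ({ healthy: true, connected: true }),
  jwks: async () => ({ healthy: true, isExpired: false }),
  sessions: async () => ({ healthy: true, activeSessions: 0 })
};

const readiness = Router();

readiness.get('/health/ready', async (_req, res) => {
  const results: Record<string, CheckResult> = {};
  for (const [name, run] of Object.entries(checks)) {
    results[name] = await run();
  }
  // Ready only when every dependency reports healthy; otherwise 503/degraded
  const ready = Object.values(results).every((r) => r.healthy);
  res.status(ready ? 200 : 503).json({
    status: ready ? 'ready' : 'degraded',
    checks: results
  });
});

export default readiness;
```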
Example Response (Healthy):
HTTP/1.1 200 OK
{
"status": "ready",
"version": "0.1.3",
"checks": {
"redis": {
"healthy": true,
"connected": true,
"circuitBreaker": {
"state": "closed",
"failureCount": 0,
"successCount": 100,
"lastFailure": null,
"nextRetry": null
}
},
"jwks": {
"healthy": true,
"cached": true,
"isExpired": false,
"fetchedAt": "2026-01-06T10:00:00.000Z",
"expiresAt": "2026-01-06T11:00:00.000Z",
"cacheAge": 1800000
},
"sessions": {
"healthy": true,
"activeSessions": 42,
"maxSessions": 10000,
"utilizationPercent": 0.42
}
}
}
Example Response (Degraded):
HTTP/1.1 503 Service Unavailable
{
"status": "degraded",
"version": "0.1.3",
"checks": {
"redis": {
"healthy": false,
"connected": false,
"circuitBreaker": {
"state": "open",
"failureCount": 5,
"successCount": 0,
"lastFailure": "2026-01-06T10:00:00.000Z",
"nextRetry": "2026-01-06T10:01:00.000Z"
}
},
"jwks": {
"healthy": true,
"cached": true,
"isExpired": false
},
"sessions": {
"healthy": true,
"activeSessions": 42,
"maxSessions": 10000,
"utilizationPercent": 0.42
}
}
}
Configuration
Environment Variables
Added MCP_SESSION_MAX_TOTAL to configure the session limit:
# Maximum concurrent sessions (Optional)
# Health check reports degraded status when this limit is exceeded
# Default: 10000
MCP_SESSION_MAX_TOTAL=10000
File: .env.example
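A sketch of how the session check can consume this limit; `getActiveSessionCount` is a placeholder for however the session store exposes its current count.

```typescript
// Assumption: the session limit comes from MCP_SESSION_MAX_TOTAL, defaulting
// to 10000 as documented above.
const MAX_SESSIONS = Number(process.env.MCP_SESSION_MAX_TOTAL ?? 10000);

// Placeholder; the real implementation queries the session store.
async function getActiveSessionCount(): Promise<number> {
  return 42;
}

export async function checkSessions() {
  const activeSessions = await getActiveSessionCount();
  return {
    // Degraded once the configured limit is exceeded
    healthy: activeSessions <= MAX_SESSIONS,
    activeSessions,
    maxSessions: MAX_SESSIONS,
    utilizationPercent: (activeSessions / MAX_SESSIONS) * 100
  };
}
```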
Original Proposed Solution
Below is the original proposed solution. The actual implementation is simpler and more pragmatic, focusing on the critical checks needed for production deployment.
Implementation
Note: Redis is included in the local development environment when using ./scripts/local (part of the Docker stack).
1. Health Check Service
Create src/services/health-check.ts:
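// Note: assumes project-local modules for the Redis client, OIDC config, and
// session store, plus a parseMemoryUsage() helper, are imported here; the
// exact import paths are implementation-specific and omitted.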
interface ComponentHealth {
status: 'healthy' | 'degraded' | 'unhealthy';
message?: string;
latency?: number;
details?: Record<string, any>;
}
interface HealthCheckResult {
status: 'up' | 'down' | 'degraded';
timestamp: string;
uptime: number;
version: string;
components: Record<string, ComponentHealth>;
}
export class HealthCheckService {
async checkRedis(): Promise<ComponentHealth> {
const start = Date.now();
try {
await redis.ping();
const latency = Date.now() - start;
// Check Redis memory usage
const info = await redis.info('memory');
const memoryUsed = parseMemoryUsage(info);
return {
status: latency < 100 ? 'healthy' : 'degraded',
message: `Redis responding in ${latency}ms`,
latency,
details: {
memoryUsedMb: memoryUsed,
connected: true
}
};
} catch (error) {
return {
status: 'unhealthy',
message: `Redis connection failed: ${error instanceof Error ? error.message : String(error)}`,
details: { connected: false }
};
}
}
async checkOIDC(): Promise<ComponentHealth> {
const start = Date.now();
try {
// Fetch JWKS to verify OIDC provider is reachable
const response = await fetch(config.oidc.jwksUri, {
signal: AbortSignal.timeout(5000)
});
const latency = Date.now() - start;
if (!response.ok) {
return {
status: 'degraded',
message: `OIDC provider returned ${response.status}`,
latency,
details: { statusCode: response.status }
};
}
return {
status: latency < 1000 ? 'healthy' : 'degraded',
message: `OIDC provider responding in ${latency}ms`,
latency,
details: {
issuer: config.oidc.issuer,
reachable: true
}
};
} catch (error) {
return {
status: 'unhealthy',
message: `OIDC provider unreachable: ${error instanceof Error ? error.message : String(error)}`,
details: { reachable: false }
};
}
}
async checkSessions(): Promise<ComponentHealth> {
try {
const activeSessions = await sessionStore.getSessionCount();
const sessionKeys = await redis.keys('session:*');
return {
status: 'healthy',
message: `${activeSessions} active sessions`,
details: {
active: activeSessions,
inRedis: sessionKeys.length
}
};
} catch (error) {
return {
status: 'degraded',
message: `Session check failed: ${error instanceof Error ? error.message : String(error)}`
};
}
}
async getHealthStatus(): Promise<HealthCheckResult> {
const components = await Promise.all([
this.checkRedis().then(r => ['redis', r] as const),
this.checkOIDC().then(r => ['oidc', r] as const),
this.checkSessions().then(r => ['sessions', r] as const)
]);
const componentMap = Object.fromEntries(components);
// Determine overall status
const hasUnhealthy = Object.values(componentMap).some(c => c.status === 'unhealthy');
const hasDegraded = Object.values(componentMap).some(c => c.status === 'degraded');
const overallStatus = hasUnhealthy ? 'down' : hasDegraded ? 'degraded' : 'up';
return {
status: overallStatus,
timestamp: new Date().toISOString(),
uptime: process.uptime(),
version: process.env.npm_package_version ?? 'unknown',
components: componentMap
};
}
}
2. Health Endpoints
Update src/routes/health.ts:
import { Router } from 'express';
import { HealthCheckService } from '../services/health-check.js';
const router = Router();
const healthCheck = new HealthCheckService();
// Detailed health check
router.get('/health', async (req, res) => {
const result = await healthCheck.getHealthStatus();
const statusCode = result.status === 'up' ? 200 : result.status === 'degraded' ? 200 : 503;
res.status(statusCode).json(result);
});
// Liveness probe (Kubernetes)
// "Is the application running?" (not hung/deadlocked)
router.get('/health/live', (req, res) => {
// Simple check - if we can respond, we're alive
res.json({
status: 'up',
timestamp: new Date().toISOString()
});
});
// Readiness probe (Kubernetes)
// "Is the application ready to serve traffic?"
router.get('/health/ready', async (req, res) => {
// Check critical dependencies
const redis = await healthCheck.checkRedis();
if (redis.status === 'unhealthy') {
return res.status(503).json({
status: 'not ready',
reason: 'Redis unavailable',
details: redis
});
}
res.json({
status: 'ready',
timestamp: new Date().toISOString()
});
});
// Startup probe (Kubernetes)
// "Has the application finished starting?"
router.get('/health/startup', async (req, res) => {
// Check if all initialization is complete
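// (jwksClient and the redis client referenced below are assumed to be
// imported at module scope; their module paths are implementation-specific)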
const isInitialized = jwksClient.isInitialized() && redis.status === 'ready';
if (!isInitialized) {
return res.status(503).json({
status: 'starting',
message: 'Application still initializing'
});
}
res.json({
status: 'started',
timestamp: new Date().toISOString()
});
});
export default router;
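The router would then be mounted on the Express app; a minimal sketch (the entry-point file name is an assumption):

```typescript
// src/app.ts (sketch): mount the health routes at the application root
import express from 'express';
import healthRouter from './routes/health.js';

const app = express();
app.use(healthRouter);
app.listen(3000);
```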
Kubernetes Integration
Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: seed-mcp-server
spec:
replicas: 3
template:
spec:
containers:
- name: seed
image: seed-mcp:latest
ports:
- containerPort: 3000
# Startup probe - allow up to 60s for app to start
startupProbe:
httpGet:
path: /health/startup
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 12 # 60 seconds total
# Liveness probe - restart if unhealthy
livenessProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe - remove from service if not ready
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3
Response Examples
Healthy System
GET /health
{
"status": "up",
"timestamp": "2026-01-05T12:00:00Z",
"uptime": 86400,
"version": "0.1.3",
"components": {
"redis": {
"status": "healthy",
"message": "Redis responding in 15ms",
"latency": 15,
"details": {
"memoryUsedMb": 45,
"connected": true
}
},
"oidc": {
"status": "healthy",
"message": "OIDC provider responding in 250ms",
"latency": 250,
"details": {
"issuer": "https://auth.example.com",
"reachable": true
}
},
"sessions": {
"status": "healthy",
"message": "42 active sessions",
"details": {
"active": 42,
"inRedis": 42
}
}
}
}
Degraded System
GET /health
HTTP/1.1 200 OK
{
"status": "degraded",
"timestamp": "2026-01-05T12:00:00Z",
"uptime": 86400,
"version": "0.1.3",
"components": {
"redis": {
"status": "degraded",
"message": "Redis responding in 450ms",
"latency": 450,
"details": {
"memoryUsedMb": 980,
"connected": true
}
},
"oidc": {
"status": "healthy",
"message": "OIDC provider responding in 200ms",
"latency": 200
}
}
}
Unhealthy System
GET /health
HTTP/1.1 503 Service Unavailable
{
"status": "down",
"timestamp": "2026-01-05T12:00:00Z",
"uptime": 86400,
"version": "0.1.3",
"components": {
"redis": {
"status": "unhealthy",
"message": "Redis connection failed: Connection timeout",
"details": {
"connected": false
}
}
}
}
Configuration
Add to .env.example:
# Health Checks
HEALTH_CHECK_TIMEOUT_MS=5000 # Timeout for dependency checks
HEALTH_CHECK_REDIS_ENABLED=true
HEALTH_CHECK_OIDC_ENABLED=true
HEALTH_CHECK_SESSIONS_ENABLED=true
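A sketch of how these flags could be read and used to gate individual checks; the config object shape is an assumption and the defaults mirror the values above.

```typescript
// Assumption: one config object read at startup; checks consult these flags,
// e.g. run checkRedis() only when redisEnabled is true.
const healthCheckConfig = {
  timeoutMs: Number(process.env.HEALTH_CHECK_TIMEOUT_MS ?? 5000),
  redisEnabled: process.env.HEALTH_CHECK_REDIS_ENABLED !== 'false',
  oidcEnabled: process.env.HEALTH_CHECK_OIDC_ENABLED !== 'false',
  sessionsEnabled: process.env.HEALTH_CHECK_SESSIONS_ENABLED !== 'false'
};
```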
Monitoring Integration
Prometheus Metrics
export const healthCheckDuration = new promClient.Histogram({
name: 'health_check_duration_seconds',
help: 'Health check duration in seconds',
labelNames: ['component', 'status'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
export const componentStatus = new promClient.Gauge({
name: 'component_healthy',
help: 'Component health status (1=healthy, 0=unhealthy)',
labelNames: ['component']
});
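These metrics still have to be updated from the health check path; one way to wire them in, assuming the HealthCheckService sketched earlier, is a small wrapper such as:

```typescript
// Sketch: record per-component status and duration on every health check.
// Assumes the HealthCheckService and the metric objects defined above.
async function getInstrumentedHealthStatus(service: HealthCheckService) {
  const result = await service.getHealthStatus();
  for (const [component, health] of Object.entries(result.components)) {
    componentStatus.set({ component }, health.status === 'healthy' ? 1 : 0);
    healthCheckDuration.observe(
      { component, status: health.status },
      (health.latency ?? 0) / 1000 // checks report latency in milliseconds
    );
  }
  return result;
}
```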
Alerting Rules
# Alert if Redis is unhealthy
component_healthy{component="redis"} == 0
# Alert if health checks are slow
health_check_duration_seconds{component="redis"} > 1
# Alert if any component is down for 5 minutes
avg_over_time(component_healthy[5m]) < 0.5
Load Balancer Integration
AWS Application Load Balancer
resource "aws_lb_target_group" "seed" {
health_check {
enabled = true
path = "/health/ready"
protocol = "HTTP"
matcher = "200"
interval = 30
timeout = 5
healthy_threshold = 2
unhealthy_threshold = 3
}
}
NGINX
upstream seed_backend {
server seed-1:3000 max_fails=3 fail_timeout=30s;
server seed-2:3000 max_fails=3 fail_timeout=30s;
server seed-3:3000 max_fails=3 fail_timeout=30s;
}
# Health check
location /health/ready {
proxy_pass http://seed_backend;
proxy_connect_timeout 5s;
proxy_read_timeout 5s;
}Testing
Unit Tests
describe('Health Check Service', () => {
it('should report healthy when all components are up', async () => {
mockRedis.ping.mockResolvedValue('PONG');
mockFetch.mockResolvedValue({ ok: true, status: 200 });
const result = await healthCheck.getHealthStatus();
expect(result.status).toBe('up');
expect(result.components.redis.status).toBe('healthy');
expect(result.components.oidc.status).toBe('healthy');
});
it('should report unhealthy when Redis is down', async () => {
mockRedis.ping.mockRejectedValue(new Error('Connection refused'));
const result = await healthCheck.getHealthStatus();
expect(result.status).toBe('down');
expect(result.components.redis.status).toBe('unhealthy');
});
});
Integration Tests
describe('Health Endpoints', () => {
it('GET /health should return 200 when healthy', async () => {
const response = await request(app).get('/health');
expect(response.status).toBe(200);
expect(response.body.status).toBe('up');
});
it('GET /health/ready should return 503 when Redis is down', async () => {
await stopRedis();
const response = await request(app).get('/health/ready');
expect(response.status).toBe(503);
expect(response.body.status).toBe('not ready');
});
});
Advanced Features
Cached Health Checks
Cache health check results to reduce load:
class CachedHealthCheck {
private inner = new HealthCheckService(); // performs the real checks
private cache: HealthCheckResult | null = null;
private cacheExpiry: number = 0;
private cacheTtl: number = 5000; // 5 seconds
async getHealthStatus(): Promise<HealthCheckResult> {
const now = Date.now();
if (this.cache && now < this.cacheExpiry) {
return this.cache;
}
// Cache miss or expired entry: run the real checks and refresh the cache
const result = await this.inner.getHealthStatus();
this.cache = result;
this.cacheExpiry = now + this.cacheTtl;
return result;
}
}
Dependency Graph
Track which components depend on others:
const dependencies = {
'api': ['redis', 'oidc'],
'mcp': ['redis', 'api'],
'oauth': ['oidc', 'redis']
};
// Calculate cascading failures
function calculateImpact(failedComponent: string): string[] {
return Object.entries(dependencies)
.filter(([, deps]) => deps.includes(failedComponent))
.map(([component]) => component);
}
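For example, with the dependency map above, a Redis failure cascades to every component that lists it:

```typescript
calculateImpact('redis'); // => ['api', 'mcp', 'oauth']
```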
Open Questions
- Synthetic Monitoring: Should health checks include end-to-end request simulation?
- External Dependencies: Should we check external APIs (beyond OIDC)?
- Alerting Integration: Direct integration with PagerDuty/Opsgenie?
- Performance Impact: What's acceptable latency for health checks?
Related Enhancements
- Distributed Tracing - Trace health check requests
- Audit Logging - Log health check failures
- Monitoring - Prometheus metrics integration