Observability & Monitoring
Seed implements comprehensive observability through Prometheus metrics and structured logging with Winston, enabling monitoring, alerting, and debugging of production systems.
Overview
Monitoring Strategy:
- Metrics (Prometheus) - Quantitative data for dashboards and alerts
- Logging (Winston) - Qualitative context for debugging and auditing
- Structured Events - Categorized logging for easy querying
Prometheus Metrics
Seed exposes metrics in Prometheus exposition format at the /metrics endpoint when enabled.
Configuration
Enable/Disable Metrics:
METRICS_ENABLED=true  # Default: true

Access Control:
- /metrics endpoint is public when enabled (no authentication)
- Recommended: Use network-level access control (firewall rules)
- Only enable in environments with protected network access
File: src/config/metrics.ts
export const metricsConfig = {
enabled: process.env.METRICS_ENABLED !== "false",
path: "/metrics",
};

Default Metrics
Collected automatically when metrics are enabled:
Node.js Process Metrics:
# CPU usage
process_cpu_user_seconds_total
process_cpu_system_seconds_total
# Memory usage
process_resident_memory_bytes
process_heap_bytes
# Event loop lag
nodejs_eventloop_lag_seconds
# Garbage collection
nodejs_gc_duration_seconds

Labels Applied to All Metrics:
{
app: "seed",
version: "0.1.3"
}

HTTP Metrics
Request Duration Histogram:
http_request_duration_seconds{method, route, status_code}

Labels:
- method: HTTP method (GET, POST, DELETE)
- route: Route path (/mcp, /oauth/token, etc.)
- status_code: HTTP status code (200, 401, 429, etc.)
Buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5] seconds
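A minimal sketch of how a histogram like this can be defined and observed from Express middleware with prom-client (illustrative; Seed's actual wiring lives in src/services/metrics.ts and src/middleware/metrics.ts and may differ):

import { Histogram } from "prom-client";
import type { NextFunction, Request, Response } from "express";

// Histogram matching the buckets listed above.
const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});

// Express middleware: start a timer per request and record it when the response finishes.
export function httpMetricsMiddleware(req: Request, res: Response, next: NextFunction): void {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    end({
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    });
  });
  next();
}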
Example Queries:
# 95th percentile response time by route
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
# Average response time for MCP endpoint
rate(http_request_duration_seconds_sum{route="/mcp"}[5m])
/ rate(http_request_duration_seconds_count{route="/mcp"}[5m])
# Requests taking between 0.5 and 1 second
http_request_duration_seconds_bucket{le="1"}
  - ignoring(le) http_request_duration_seconds_bucket{le="0.5"}

Request Counter:
http_request_total{method, route, status_code}

Example Queries:
# Request rate by route
rate(http_request_total[5m])
# Error rate (5xx responses)
rate(http_request_total{status_code=~"5.."}[5m])
/ rate(http_request_total[5m])
# Requests per minute
increase(http_request_total[1m])

MCP Metrics
Active Sessions Gauge:
mcp_sessions_active

Tracks the number of currently active MCP sessions.
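A sketch of how the gauge is typically kept in sync with the session store (the hook names here are hypothetical):

import { Gauge } from "prom-client";

const mcpSessionsActive = new Gauge({
  name: "mcp_sessions_active",
  help: "Number of currently active MCP sessions",
});

// Hypothetical hooks invoked by the session store on create/expire/terminate.
export function onSessionOpened(): void {
  mcpSessionsActive.inc();
}

export function onSessionClosed(): void {
  mcpSessionsActive.dec();
}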
Example Queries:
# Current active sessions
mcp_sessions_active
# Maximum concurrent sessions in last hour
max_over_time(mcp_sessions_active[1h])
# Alert: Too many active sessions
mcp_sessions_active > 1000

Session Lifecycle Counter:
mcp_sessions_total{status}

Labels:
- status: Session lifecycle event
  - created: New session initialized
  - expired: Session expired via Redis TTL
  - terminated: Session explicitly terminated via DELETE
Example Queries:
# Session creation rate
rate(mcp_sessions_total{status="created"}[5m])
# Session churn (terminated + expired)
rate(mcp_sessions_total{status=~"expired|terminated"}[5m])
# Ratio of sessions created to sessions ended (rough churn indicator)
sum(mcp_sessions_total{status="created"})
/ sum(mcp_sessions_total{status=~"expired|terminated"})

Tool Invocation Counter:
mcp_tool_invocations_total{tool, status}

Labels:
- tool: Tool name (random-number, echo-message, etc.)
- status: Invocation result (success, error)
Example Queries:
# Tool usage by type
rate(mcp_tool_invocations_total[5m])
# Tool error rate
rate(mcp_tool_invocations_total{status="error"}[5m])
/ rate(mcp_tool_invocations_total[5m])
# Most used tools
topk(10, sum by (tool) (
rate(mcp_tool_invocations_total[5m])
))

Tool Duration Histogram:
mcp_tool_duration_seconds{tool}

Buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1] seconds
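A sketch of wrapping a tool handler so each call updates both the invocation counter and the duration histogram (the helper name is assumed; the actual instrumentation may be wired differently):

import { Counter, Histogram } from "prom-client";

const toolInvocations = new Counter({
  name: "mcp_tool_invocations_total",
  help: "MCP tool invocations by tool and status",
  labelNames: ["tool", "status"],
});

const toolDuration = new Histogram({
  name: "mcp_tool_duration_seconds",
  help: "MCP tool execution duration in seconds",
  labelNames: ["tool"],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});

// Wraps an async tool handler so every call is counted and timed.
export async function instrumentTool<T>(tool: string, handler: () => Promise<T>): Promise<T> {
  const end = toolDuration.startTimer({ tool });
  try {
    const result = await handler();
    toolInvocations.inc({ tool, status: "success" });
    return result;
  } catch (err) {
    toolInvocations.inc({ tool, status: "error" });
    throw err;
  } finally {
    end();
  }
}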
Example Queries:
# 99th percentile tool duration
histogram_quantile(0.99,
rate(mcp_tool_duration_seconds_bucket[5m])
)
# Slow tools (>100ms)
histogram_quantile(0.95,
rate(mcp_tool_duration_seconds_bucket{tool=~".*"}[5m])
) > 0.1

Authentication Metrics
Authentication Attempts Counter:
auth_attempts_total{result}

Labels:
- result: success or failure
Example Queries:
# Authentication success rate
rate(auth_attempts_total{result="success"}[5m])
/ rate(auth_attempts_total[5m])
# Failed authentication rate
rate(auth_attempts_total{result="failure"}[5m])
# Alert: High authentication failure rate
rate(auth_attempts_total{result="failure"}[5m]) > 10

Token Validation Duration:
auth_token_validation_duration_seconds

Buckets: [0.001, 0.005, 0.01, 0.05, 0.1] seconds
Example Queries:
# 95th percentile validation time
histogram_quantile(0.95,
rate(auth_token_validation_duration_seconds_bucket[5m])
)
# Alert: Slow token validation
histogram_quantile(0.95,
rate(auth_token_validation_duration_seconds_bucket[5m])
) > 0.05

JWKS Metrics
JWKS Refresh Counter:
jwks_refresh_total{result}

Labels:
- result: success or failure
Example Queries:
# JWKS refresh rate
rate(jwks_refresh_total[5m])
# JWKS refresh failure rate
rate(jwks_refresh_total{result="failure"}[5m])
# Alert: JWKS refresh failures
rate(jwks_refresh_total{result="failure"}[5m]) > 0

JWKS Cache Performance:
jwks_cache_hits_total
jwks_cache_misses_total

Example Queries:
# Cache hit rate
rate(jwks_cache_hits_total[5m])
/ (rate(jwks_cache_hits_total[5m]) + rate(jwks_cache_misses_total[5m]))
# Alert: Low cache hit rate
rate(jwks_cache_hits_total[5m])
/ (rate(jwks_cache_hits_total[5m]) + rate(jwks_cache_misses_total[5m]))
< 0.9

Redis Metrics
Redis Operations Counter:
redis_operations_total{operation, result}

Labels:
- operation: get, set, del, zadd, zremrangebyscore, zcard, etc.
- result: success or failure
Example Queries:
# Redis operation rate by type
rate(redis_operations_total[5m])
# Redis error rate
rate(redis_operations_total{result="failure"}[5m])
/ rate(redis_operations_total[5m])
# Alert: High Redis error rate
rate(redis_operations_total{result="failure"}[5m])
/ rate(redis_operations_total[5m]) > 0.01

Redis Operation Duration:
redis_operation_duration_seconds{operation}

Buckets: [0.001, 0.005, 0.01, 0.05, 0.1] seconds
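A sketch of a small wrapper that records both Redis metrics for every operation (illustrative; the real store code may instrument calls differently):

import { Counter, Histogram } from "prom-client";

const redisOperations = new Counter({
  name: "redis_operations_total",
  help: "Redis operations by command and result",
  labelNames: ["operation", "result"],
});

const redisOperationDuration = new Histogram({
  name: "redis_operation_duration_seconds",
  help: "Redis operation duration in seconds",
  labelNames: ["operation"],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1],
});

// Wraps any Redis call so it is counted and timed under a consistent operation label.
export async function withRedisMetrics<T>(operation: string, fn: () => Promise<T>): Promise<T> {
  const end = redisOperationDuration.startTimer({ operation });
  try {
    const result = await fn();
    redisOperations.inc({ operation, result: "success" });
    return result;
  } catch (err) {
    redisOperations.inc({ operation, result: "failure" });
    throw err;
  } finally {
    end();
  }
}

A call site would then look like withRedisMetrics("get", () => redis.get(sessionKey)), keeping the operation label aligned with the Redis command.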
Example Queries:
# 95th percentile Redis latency
histogram_quantile(0.95,
rate(redis_operation_duration_seconds_bucket[5m])
)
# Alert: Slow Redis operations
histogram_quantile(0.95,
rate(redis_operation_duration_seconds_bucket[5m])
) > 0.05

Rate Limiting Metrics
Rate Limit Violations Counter:
rate_limit_hits_total{type, reason}

Labels:
- type: Endpoint type (mcp, dcr)
- reason: Limit type exceeded
  - rate_limit_exceeded: Per-IP limit
  - global_rate_limit_exceeded: Global limit
Example Queries:
# Rate limit violations by endpoint
rate(rate_limit_hits_total[5m])
# Per-IP vs global limit violations
sum by (reason) (rate(rate_limit_hits_total[5m]))
# Alert: High rate limit violations
rate(rate_limit_hits_total{type="mcp"}[5m]) > 1

Rate Limit Usage Histogram:
rate_limit_usage_ratio{endpoint}

Buckets: [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 1.0]
Tracks usage as ratio of maximum allowed (0.0 to 1.0).
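A sketch of observing the ratio after each rate-limit check (the helper is hypothetical; count and limit come from the limiter's own state):

import { Histogram } from "prom-client";

const rateLimitUsage = new Histogram({
  name: "rate_limit_usage_ratio",
  help: "Rate limit usage as a fraction of the allowed maximum",
  labelNames: ["endpoint"],
  buckets: [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 1.0],
});

// Called after each rate-limit check with the current request count and configured limit.
export function recordRateLimitUsage(endpoint: string, count: number, limit: number): void {
  rateLimitUsage.observe({ endpoint }, Math.min(count / limit, 1));
}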
Example Queries:
# Average rate limit usage
rate(rate_limit_usage_ratio_sum[5m])
/ rate(rate_limit_usage_ratio_count[5m])
# Requests near the limit (P95 usage > 90%)
histogram_quantile(0.95,
rate(rate_limit_usage_ratio_bucket[5m])
) > 0.9
# Alert: Sustained high usage on the MCP endpoint
rate(rate_limit_usage_ratio_sum{endpoint="mcp"}[5m])
/ rate(rate_limit_usage_ratio_count{endpoint="mcp"}[5m]) > 0.8

DCR Metrics
Dynamic Client Registration Counter:
dcr_registrations_total{result}

Labels:
- result: success or failure
Example Queries:
# DCR registration rate
rate(dcr_registrations_total[5m])
# DCR failure rate
rate(dcr_registrations_total{result="failure"}[5m])
/ rate(dcr_registrations_total[5m])
# Alert: High DCR failure rate
rate(dcr_registrations_total{result="failure"}[5m])
/ rate(dcr_registrations_total[5m]) > 0.1

OAuth Flow Metrics
✅ IMPLEMENTED (2026-01-07) - Comprehensive metrics for OAuth 2.1 authorization and token flows.
OAuth Authorization Requests Counter:
oauth_authorization_requests_total{result}

Labels:
- result: Authorization result
  - success: Request successfully proxied to IdP
  - error: Validation error (invalid code_challenge, etc.)
  - invalid_client: Client not found in DCR store
Example Queries:
# Authorization success rate
rate(oauth_authorization_requests_total{result="success"}[5m])
/ rate(oauth_authorization_requests_total[5m])
# Invalid client rate
rate(oauth_authorization_requests_total{result="invalid_client"}[5m])
# Alert: High authorization error rate
rate(oauth_authorization_requests_total{result="error"}[5m])
/ rate(oauth_authorization_requests_total[5m]) > 0.05

OAuth Token Exchanges Counter:
oauth_token_exchanges_total{grant_type, result}

Labels:
- grant_type: Grant type used
  - authorization_code: Initial code exchange
  - refresh_token: Token refresh
- result: Exchange result (success, failure)
Example Queries:
# Token exchange success rate by grant type
rate(oauth_token_exchanges_total{result="success"}[5m])
/ rate(oauth_token_exchanges_total[5m])
# Refresh token usage rate
rate(oauth_token_exchanges_total{grant_type="refresh_token"}[5m])
# Alert: High token exchange failure rate
rate(oauth_token_exchanges_total{result="failure"}[5m])
/ rate(oauth_token_exchanges_total[5m]) > 0.01

OAuth Token Exchange Duration Histogram:
oauth_token_exchange_duration_seconds{grant_type}

Buckets: [0.1, 0.5, 1, 2, 5] seconds
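A sketch of counting and timing the proxied token exchange (the function shape is assumed; the real proxy handler may differ):

import { Counter, Histogram } from "prom-client";

const tokenExchanges = new Counter({
  name: "oauth_token_exchanges_total",
  help: "OAuth token exchanges proxied to the IdP",
  labelNames: ["grant_type", "result"],
});

const tokenExchangeDuration = new Histogram({
  name: "oauth_token_exchange_duration_seconds",
  help: "Duration of token exchanges with the IdP in seconds",
  labelNames: ["grant_type"],
  buckets: [0.1, 0.5, 1, 2, 5],
});

// exchangeWithIdp stands in for the call that forwards the token request upstream.
export async function recordTokenExchange<T>(
  grantType: "authorization_code" | "refresh_token",
  exchangeWithIdp: () => Promise<T>,
): Promise<T> {
  const end = tokenExchangeDuration.startTimer({ grant_type: grantType });
  try {
    const tokens = await exchangeWithIdp();
    tokenExchanges.inc({ grant_type: grantType, result: "success" });
    return tokens;
  } catch (err) {
    tokenExchanges.inc({ grant_type: grantType, result: "failure" });
    throw err;
  } finally {
    end();
  }
}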
Example Queries:
# P99 token exchange latency for authorization_code
histogram_quantile(0.99,
rate(oauth_token_exchange_duration_seconds_bucket{grant_type="authorization_code"}[5m])
)
# Average IdP response time
rate(oauth_token_exchange_duration_seconds_sum[5m])
/ rate(oauth_token_exchange_duration_seconds_count[5m])
# Alert: Slow IdP token endpoint
histogram_quantile(0.95,
rate(oauth_token_exchange_duration_seconds_bucket[5m])
) > 2

Token Refresh Metrics
✅ IMPLEMENTED (2026-01-07) - Metrics for automatic token refresh operations.
Token Refresh Attempts Counter:
token_refresh_attempts_total{type, result}

Labels:
- type: Refresh type
  - proactive: Token refreshed before expiration (5-minute buffer)
  - reactive: Token refreshed after auth failure
- result: Refresh outcome
  - success: Token successfully refreshed
  - failure: Refresh failed
  - skipped: No refresh token available
Example Queries:
# Token refresh success rate
rate(token_refresh_attempts_total{result="success"}[5m])
/ rate(token_refresh_attempts_total{result!="skipped"}[5m])
# Proactive vs reactive refresh ratio
rate(token_refresh_attempts_total{type="proactive"}[5m])
/ rate(token_refresh_attempts_total{type="reactive"}[5m])
# Alert: High refresh failure rate
rate(token_refresh_attempts_total{result="failure"}[5m])
/ rate(token_refresh_attempts_total{result!="skipped"}[5m]) > 0.1

Token Refresh Duration Histogram:
token_refresh_duration_seconds{result}

Buckets: [0.1, 0.5, 1, 2, 5] seconds
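A sketch of how both refresh metrics could be recorded around a refresh attempt, including the proactive 5-minute buffer and the skipped case (names and control flow are illustrative, not the actual implementation):

import { Counter, Histogram } from "prom-client";

const refreshAttempts = new Counter({
  name: "token_refresh_attempts_total",
  help: "Token refresh attempts by type and result",
  labelNames: ["type", "result"],
});

const refreshDuration = new Histogram({
  name: "token_refresh_duration_seconds",
  help: "Token refresh duration in seconds",
  labelNames: ["result"],
  buckets: [0.1, 0.5, 1, 2, 5],
});

const REFRESH_BUFFER_MS = 5 * 60 * 1000; // proactive refresh window before expiry

// refreshWithIdp stands in for the real refresh_token grant call.
export async function maybeRefreshToken(
  token: { expiresAt: number; refreshToken?: string },
  type: "proactive" | "reactive",
  refreshWithIdp: (refreshToken: string) => Promise<void>,
): Promise<void> {
  if (!token.refreshToken) {
    refreshAttempts.inc({ type, result: "skipped" });
    return;
  }
  if (type === "proactive" && Date.now() < token.expiresAt - REFRESH_BUFFER_MS) {
    return; // still comfortably valid; nothing to refresh or record
  }
  const end = refreshDuration.startTimer();
  try {
    await refreshWithIdp(token.refreshToken);
    refreshAttempts.inc({ type, result: "success" });
    end({ result: "success" });
  } catch (err) {
    refreshAttempts.inc({ type, result: "failure" });
    end({ result: "failure" });
    throw err;
  }
}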
Example Queries:
# P95 token refresh latency
histogram_quantile(0.95,
rate(token_refresh_duration_seconds_bucket[5m])
)
# Average refresh duration for successful refreshes
rate(token_refresh_duration_seconds_sum{result="success"}[5m])
/ rate(token_refresh_duration_seconds_count{result="success"}[5m])
# Alert: Slow token refresh
histogram_quantile(0.95,
rate(token_refresh_duration_seconds_bucket{result="success"}[5m])
) > 2

Pending Tokens Claimed Counter:
pending_tokens_claimed_total

Tracks how many pending tokens (stored by user ID during OAuth flow) are successfully claimed by MCP sessions.
Example Queries:
# Pending token claim rate
rate(pending_tokens_claimed_total[5m])
# Token claim efficiency (claims vs token exchanges)
rate(pending_tokens_claimed_total[5m])
/ rate(oauth_token_exchanges_total{grant_type="authorization_code",result="success"}[5m])

Circuit Breaker Metrics
✅ IMPLEMENTED (2026-01-06) - Metrics for circuit breaker pattern protecting Redis connections.
Circuit Breaker State Gauge:
circuit_breaker_state{name}

Values:
- 0: Closed (normal operation)
- 1: Half-open (testing recovery)
- 2: Open (failing fast)

Labels:
- name: Circuit breaker name (redis)
Example Queries:
# Current circuit breaker state
circuit_breaker_state{name="redis"}
# Alert: Circuit breaker open
circuit_breaker_state{name="redis"} == 2

Circuit Breaker Failures Counter:
circuit_breaker_failures_total{name}

Example Queries:
# Failure rate
rate(circuit_breaker_failures_total{name="redis"}[5m])
# Alert: High failure rate
rate(circuit_breaker_failures_total{name="redis"}[5m]) > 1

Circuit Breaker Successes Counter:
circuit_breaker_successes_total{name}

Example Queries:
# Success rate
rate(circuit_breaker_successes_total{name="redis"}[5m])
/ (rate(circuit_breaker_successes_total{name="redis"}[5m]) + rate(circuit_breaker_failures_total{name="redis"}[5m]))

Circuit Breaker State Changes Counter:
circuit_breaker_state_changes_total{name, from_state, to_state}

Labels:
- from_state: Previous state (closed, half_open, open)
- to_state: New state (closed, half_open, open)
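A sketch of updating the state gauge and the transition counter together whenever the breaker changes state (illustrative):

import { Counter, Gauge } from "prom-client";

const breakerState = new Gauge({
  name: "circuit_breaker_state",
  help: "Circuit breaker state (0=closed, 1=half-open, 2=open)",
  labelNames: ["name"],
});

const breakerStateChanges = new Counter({
  name: "circuit_breaker_state_changes_total",
  help: "Circuit breaker state transitions",
  labelNames: ["name", "from_state", "to_state"],
});

const STATE_VALUES = { closed: 0, half_open: 1, open: 2 } as const;
type BreakerState = keyof typeof STATE_VALUES;

// Called by the breaker on every transition, e.g. onBreakerStateChange("redis", "closed", "open").
export function onBreakerStateChange(name: string, from: BreakerState, to: BreakerState): void {
  breakerState.set({ name }, STATE_VALUES[to]);
  breakerStateChanges.inc({ name, from_state: from, to_state: to });
}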
Example Queries:
# State change rate
rate(circuit_breaker_state_changes_total[5m])
# Transitions to open state (service degradation)
rate(circuit_breaker_state_changes_total{to_state="open"}[5m])
# Alert: Frequent state changes (flapping)
rate(circuit_breaker_state_changes_total[5m]) > 0.5

Metrics Endpoint
Route: GET /metrics
Response Format: Prometheus exposition format
# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="POST",route="/mcp",status_code="200",le="0.01"} 45
http_request_duration_seconds_bucket{method="POST",route="/mcp",status_code="200",le="0.05"} 89
...
http_request_duration_seconds_sum{method="POST",route="/mcp",status_code="200"} 12.5
http_request_duration_seconds_count{method="POST",route="/mcp",status_code="200"} 100

Content-Type: application/openmetrics-text; version=1.0.0; charset=utf-8
Implementation:
// GET /metrics
export async function getMetrics(): Promise<string> {
if (!config.metrics.enabled) {
return "# Metrics disabled\n";
}
return await register.metrics();
}

Structured Logging
Seed uses Winston for structured JSON logging with categorized events.
Log Configuration
Log Level:
LOG_LEVEL=info  # debug, info, warn, error

Log Format:
LOG_FORMAT=json  # json (production) or simple (development)

Implementation:
export const logger = winston.createLogger({
level: process.env.LOG_LEVEL ?? "info",
format: winston.format.json(),
defaultMeta: {
service: "seed",
version: "0.1.3",
},
transports: [
new winston.transports.Console({
format: process.env.LOG_FORMAT === "simple"
? winston.format.combine(
winston.format.colorize(),
winston.format.simple()
)
: winston.format.json(),
}),
],
});

Log Structure
JSON Format (production):
{
"timestamp": "2026-01-06T12:00:00.000Z",
"level": "info",
"message": "Auth event",
"event": "token_validated",
"userId": "user|12345",
"method": "jwt",
"ip": "192.168.1.100",
"category": "authentication",
"service": "seed",
"version": "0.1.3"
}

Simple Format (development):
info: Auth event {"event":"token_validated","userId":"user|12345"}

Log Categories
Logs are categorized for easy filtering and querying:
Categories:
- authentication - JWT validation, auth failures
- mcp - MCP tool invocations, session lifecycle
- security - Origin validation, security events
- rate_limiting - Rate limit violations
- oauth - OAuth flows, DCR events
Query Examples (using log aggregation tools):
# Find all authentication failures
category:authentication AND level:warn
# Find MCP tool errors
category:mcp AND success:false
# Find security events
category:security AND severity:high

Authentication Logging
Function: logAuthEvent(event, details)
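A plausible shape for this helper, layered on the Winston logger configured above (an assumption; the actual implementation in src/services/logger.ts may differ):

import { logger } from "./logger"; // assumed import path for the Winston logger shown above

// Emits a categorized auth event; details are merged into the structured log entry.
export function logAuthEvent(event: string, details: Record<string, unknown> = {}): void {
  logger.info("Auth event", {
    event,
    ...details,
    category: "authentication",
  });
}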
Successful Authentication:
logAuthEvent("token_validated", {
userId: payload.sub,
method: "jwt",
ip: req.ip,
});

Output:
{
"timestamp": "2026-01-06T12:00:00Z",
"level": "info",
"message": "Auth event",
"event": "token_validated",
"userId": "user|12345",
"method": "jwt",
"ip": "192.168.1.100",
"category": "authentication"
}

Failed Authentication:
logAuthEvent("token_rejected", {
reason: "expired_token",
ip: req.ip,
details: error.message,
});

Output:
{
"timestamp": "2026-01-06T12:00:00Z",
"level": "info",
"message": "Auth event",
"event": "token_rejected",
"reason": "expired_token",
"ip": "192.168.1.100",
"details": "Token expired at 2026-01-05T12:00:00Z",
"category": "authentication"
}

MCP Tool Logging
Function: logToolInvocation(tool, sessionId, userId, success, duration)
Example:
logToolInvocation(
"random-number",
"session-123",
"user|12345",
true,
0.005
);

Output:
{
"timestamp": "2026-01-06T12:00:00Z",
"level": "info",
"message": "Tool invocation",
"tool": "random-number",
"sessionId": "session-123",
"userId": "user|12345",
"success": true,
"duration": 0.005,
"category": "mcp"
}

Use Cases:
- Audit trail of tool usage
- Performance analysis per tool
- Error investigation
- User activity tracking
Security Event Logging
Function: logSecurityEvent(event, severity, details)
Origin Validation Failure:
logSecurityEvent("origin_blocked", "medium", {
origin: "https://evil.com",
path: "/mcp",
ip: "192.168.1.100",
});

Output:
{
"timestamp": "2026-01-06T12:00:00Z",
"level": "warn",
"message": "Security event",
"event": "origin_blocked",
"severity": "medium",
"origin": "https://evil.com",
"path": "/mcp",
"ip": "192.168.1.100",
"category": "security"
}

Severity Levels:
- low: Informational security events
- medium: Potential attacks or policy violations
- high: Active attacks or critical security failures
- critical: System compromise or data breach
Session Lifecycle Logging
Session Creation:
logger.info("MCP session created", {
sessionId: "session-123",
userId: "user|12345",
capabilities: ["tools", "prompts", "resources"],
category: "mcp",
});

Session Access (with TTL refresh):
logger.debug("Session accessed", {
sessionId: "session-123",
ttlRefreshed: true,
category: "mcp",
});

Session Termination:
logger.info("MCP session terminated", {
sessionId: "session-123",
reason: "client_request",
duration: 3600,
category: "mcp",
});

Session Expiration:
logger.info("MCP session expired", {
sessionId: "session-123",
reason: "ttl_expired",
lastAccessedAt: "2026-01-05T12:00:00Z",
category: "mcp",
});

Rate Limiting Logging
Request Allowed:
logger.debug("Rate limit check passed", {
endpoint: "mcp",
ip: "192.168.1.100",
count: 25,
limit: 100,
category: "rate_limiting",
});

Request Blocked:
logger.warn("Rate limit exceeded", {
endpoint: "mcp",
ip: "192.168.1.100",
count: 101,
limit: 100,
retryAfter: 45,
category: "rate_limiting",
});

Error Logging
Standard Errors:
logger.error("Redis connection failed", {
error: error.message,
stack: error.stack,
operation: "get",
key: "session:123",
});

Structured Error Context:
{
"timestamp": "2026-01-06T12:00:00Z",
"level": "error",
"message": "Redis connection failed",
"error": "ECONNREFUSED",
"stack": "Error: ECONNREFUSED\n at ...",
"operation": "get",
"key": "session:123",
"service": "seed"
}

Child Loggers
Create contextual child loggers for scoped logging:
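The helper itself can be a thin wrapper over Winston's logger.child; a minimal sketch (an assumption; the actual helper is in src/services/logger.ts):

import { logger } from "./logger"; // assumed import path for the Winston logger

// Returns a logger that automatically attaches the given context to every entry.
export function createChildLogger(context: Record<string, unknown>) {
  return logger.child(context);
}

Usage then looks like this: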
// Create child logger with context
const toolLogger = createChildLogger({
sessionId: "session-123",
userId: "user|12345",
});
// All logs from this logger include context
toolLogger.info("Tool invocation started", { tool: "random-number" });
toolLogger.info("Tool invocation completed", { duration: 0.005 });

Output:
{
"timestamp": "2026-01-06T12:00:00Z",
"level": "info",
"message": "Tool invocation started",
"tool": "random-number",
"sessionId": "session-123",
"userId": "user|12345",
"service": "seed"
}

Monitoring Best Practices
Recommended Alerts
High Priority:
# Auth failure spike
rate(auth_attempts_total{result="failure"}[5m]) > 10
# High error rate
rate(http_request_total{status_code=~"5.."}[5m])
/ rate(http_request_total[5m]) > 0.01
# JWKS refresh failures
rate(jwks_refresh_total{result="failure"}[5m]) > 0
# Redis connection failures
rate(redis_operations_total{result="failure"}[5m])
/ rate(redis_operations_total[5m]) > 0.05
# Circuit breaker open
circuit_breaker_state{name="redis"} == 2
# High OAuth token exchange failure rate
rate(oauth_token_exchanges_total{result="failure"}[5m])
/ rate(oauth_token_exchanges_total[5m]) > 0.01
# High token refresh failure rate
rate(token_refresh_attempts_total{result="failure"}[5m])
/ rate(token_refresh_attempts_total{result!="skipped"}[5m]) > 0.1

Medium Priority:
# Slow API responses
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 1
# High rate limit violations
rate(rate_limit_hits_total[5m]) > 1
# High session churn
rate(mcp_sessions_total{status=~"expired|terminated"}[5m])
/ rate(mcp_sessions_total{status="created"}[5m]) > 0.5
# Slow IdP token endpoint
histogram_quantile(0.95,
rate(oauth_token_exchange_duration_seconds_bucket[5m])
) > 2
# Slow token refresh operations
histogram_quantile(0.95,
rate(token_refresh_duration_seconds_bucket{result="success"}[5m])
) > 2
# High OAuth authorization error rate
rate(oauth_authorization_requests_total{result="error"}[5m])
/ rate(oauth_authorization_requests_total[5m]) > 0.05

Dashboard Recommendations
HTTP Overview Dashboard:
- Request rate by endpoint
- Error rate over time
- P95 response time
- Status code distribution
MCP Dashboard:
- Active sessions gauge
- Session creation/termination rate
- Tool invocation rate by tool
- Tool duration P95 by tool
OAuth Dashboard (✅ Added 2026-01-07):
- OAuth authorization requests by result (stacked area)
- OAuth token exchanges by grant type and result (stacked area)
- Token exchange duration percentiles (P50/P95/P99)
- Token refresh success rate by type (proactive vs reactive)
- Token refresh duration histogram
- Pending token claim rate
Security Dashboard:
- Authentication success/failure rate
- Origin validation blocks
- Rate limit violations by endpoint
- Security events by severity
Infrastructure Dashboard:
- Redis latency P95
- Redis error rate
- Redis circuit breaker state
- JWKS cache hit rate
- Process memory/CPU usage
Log Aggregation
Recommended Setup:
- Centralized log aggregation (ELK, Loki, Datadog)
- Index by category field for fast filtering
- Retention policy based on compliance requirements
- Alerts on error spikes and security events
Query Patterns:
# Failed authentications in last hour
category:authentication AND event:token_rejected AND @timestamp:[now-1h TO now]
# MCP tool errors
category:mcp AND success:false
# Security events (high severity)
category:security AND severity:high

Health Checks
✅ IMPLEMENTED (2026-01-06) - Seed provides comprehensive Kubernetes-compatible health check endpoints with liveness and readiness probes.
Liveness Probe
Endpoint: GET /health
Purpose: Determines if the application is running (not hung or deadlocked).
Response (Healthy):
HTTP/1.1 200 OK
{
"status": "ok",
"version": "0.1.3"
}

Response (Shutting Down):
HTTP/1.1 503 Service Unavailable
{
"status": "shutting_down",
"version": "0.1.3"
}Implementation: Returns "shutting_down" during graceful shutdown when SIGTERM/SIGINT is received. See src/routes/health.ts for implementation details.
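A condensed sketch of that behavior, assuming an Express app and a module-level flag (the real handler in src/routes/health.ts may differ):

import express from "express";

const app = express();
let shuttingDown = false;

// Flip the flag on SIGTERM/SIGINT so the liveness probe starts reporting shutting_down.
for (const signal of ["SIGTERM", "SIGINT"] as const) {
  process.on(signal, () => {
    shuttingDown = true;
  });
}

app.get("/health", (_req, res) => {
  res.status(shuttingDown ? 503 : 200).json({
    status: shuttingDown ? "shutting_down" : "ok",
    version: "0.1.3",
  });
});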
Use Case: Kubernetes liveness probe to restart unhealthy pods.
Readiness Probe
Endpoint: GET /health/ready
Purpose: Determines if the application is ready to serve traffic.
Checks Performed:
- Redis Connectivity - Validates Redis connection with circuit breaker status
- JWKS Cache - Checks JWKS cache is populated and not expired
- Session Capacity - Verifies active sessions are below maximum threshold
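A simplified sketch of a handler that aggregates these checks (the check helpers here are placeholders; the real implementation in src/routes/health.ts returns the richer per-check objects shown below):

import express from "express";

// Placeholder checks; the real versions inspect the Redis circuit breaker,
// the JWKS cache timestamps, and the active session count against the configured maximum.
async function checkRedis(): Promise<{ healthy: boolean }> {
  return { healthy: true };
}
function checkJwksCache(): { healthy: boolean } {
  return { healthy: true };
}
async function checkSessionCapacity(): Promise<{ healthy: boolean }> {
  return { healthy: true };
}

const app = express();

app.get("/health/ready", async (_req, res) => {
  const checks = {
    redis: await checkRedis(),
    jwks: checkJwksCache(),
    sessions: await checkSessionCapacity(),
  };
  const ready = Object.values(checks).every((check) => check.healthy);
  res.status(ready ? 200 : 503).json({
    status: ready ? "ready" : "degraded",
    version: "0.1.3",
    checks,
  });
});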
Response (Ready):
HTTP/1.1 200 OK
{
"status": "ready",
"version": "0.1.3",
"checks": {
"redis": {
"healthy": true,
"connected": true,
"circuitBreaker": {
"state": "closed",
"failureCount": 0,
"successCount": 100,
"lastFailure": null,
"nextRetry": null
}
},
"jwks": {
"healthy": true,
"cached": true,
"isExpired": false,
"fetchedAt": "2026-01-06T10:00:00.000Z",
"expiresAt": "2026-01-06T11:00:00.000Z",
"cacheAge": 1800000
},
"sessions": {
"healthy": true,
"activeSessions": 42,
"maxSessions": 10000,
"utilizationPercent": 0.42
}
}
}

Response (Degraded):
HTTP/1.1 503 Service Unavailable
{
"status": "degraded",
"version": "0.1.3",
"checks": {
"redis": {
"healthy": false,
"connected": false,
"circuitBreaker": {
"state": "open",
"failureCount": 5,
"successCount": 0,
"lastFailure": "2026-01-06T10:00:00.000Z",
"nextRetry": "2026-01-06T10:01:00.000Z"
}
},
"jwks": {
"healthy": true,
"cached": true,
"isExpired": false
},
"sessions": {
"healthy": true,
"activeSessions": 42,
"maxSessions": 10000,
"utilizationPercent": 0.42
}
}
}

Implementation: Readiness probe includes comprehensive dependency health checks implemented in src/routes/health.ts:
- Redis health check with circuit breaker pattern integration
- JWKS cache validation with expiration tracking
- Session capacity monitoring with configurable thresholds
Use Case: Kubernetes readiness probe to remove unhealthy pods from service load balancer.
Kubernetes Integration
apiVersion: apps/v1
kind: Deployment
metadata:
name: seed-mcp-server
spec:
replicas: 3
template:
spec:
containers:
- name: seed
image: seed-mcp:latest
ports:
- containerPort: 3000
# Liveness probe - restart if unhealthy
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe - remove from service if not ready
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3

Health Check Configuration
# Maximum concurrent sessions (Optional)
# Health check reports degraded status when this limit is exceeded
# Default: 10000
MCP_SESSION_MAX_TOTAL=10000

File: .env.example
Configuration Reference
Environment Variables
# Metrics
METRICS_ENABLED=true # Enable Prometheus metrics
# Logging
LOG_LEVEL=info # debug, info, warn, error
LOG_FORMAT=json # json (prod) or simple (dev)
# Health Checks
MCP_SESSION_MAX_TOTAL=10000  # Max sessions before degraded status

Metrics Collection
Scrape Configuration (Prometheus):
scrape_configs:
- job_name: 'seed'
static_configs:
- targets: ['seed:3000']
metrics_path: '/metrics'
scrape_interval: 15s

Implementation Files
- Metrics Service: src/services/metrics.ts - Prometheus metrics definitions
- Logger Service: src/services/logger.ts - Winston logger with helper functions
- Metrics Middleware: src/middleware/metrics.ts - HTTP metrics collection
- Metrics Config: src/config/metrics.ts - Metrics configuration
- Logging Config: src/config/logging.ts - Logging configuration
- Metrics Route: src/routes/metrics.ts - /metrics endpoint handler
Related Documentation
- Authentication Flow - Auth metrics and logging
- Rate Limiting - Rate limit metrics
- MCP Server Design - MCP metrics and logging
- Sessions - Session lifecycle logging
- Security - Security event logging