Observability & Monitoring

Seed implements comprehensive observability through Prometheus metrics and structured logging with Winston, enabling monitoring, alerting, and debugging of production systems.

Overview

Monitoring Strategy:

  • Metrics (Prometheus) - Quantitative data for dashboards and alerts
  • Logging (Winston) - Qualitative context for debugging and auditing
  • Structured Events - Categorized logging for easy querying

Prometheus Metrics

Seed exposes metrics in the Prometheus exposition format at the /metrics endpoint when enabled.

Configuration

Enable/Disable Metrics:

bash
METRICS_ENABLED=true  # Default: true

Access Control:

  • /metrics endpoint is public when enabled (no authentication)
  • Recommended: Use network-level access control (firewall rules)
  • Only enable in environments with protected network access

File: src/config/metrics.ts

typescript
export const metricsConfig = {
  enabled: process.env.METRICS_ENABLED !== "false",
  path: "/metrics",
};

Default Metrics

Collected automatically when metrics are enabled:

Node.js Process Metrics:

promql
# CPU usage
process_cpu_user_seconds_total
process_cpu_system_seconds_total

# Memory usage
process_resident_memory_bytes
process_heap_bytes

# Event loop lag
nodejs_eventloop_lag_seconds

# Garbage collection
nodejs_gc_duration_seconds

Labels Applied to All Metrics:

typescript
{
  app: "seed",
  version: "0.1.3"
}

HTTP Metrics

Request Duration Histogram:

typescript
http_request_duration_seconds{method, route, status_code}

Labels:

  • method: HTTP method (GET, POST, DELETE)
  • route: Route path (/mcp, /oauth/token, etc.)
  • status_code: HTTP status code (200, 401, 429, etc.)

Buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5] seconds

Example Queries:

promql
# 95th percentile response time by route
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)

# Average response time for the MCP endpoint
sum(rate(http_request_duration_seconds_sum{route="/mcp"}[5m]))
/ sum(rate(http_request_duration_seconds_count{route="/mcp"}[5m]))

# Requests slower than 1 second
sum(rate(http_request_duration_seconds_count[5m]))
- sum(rate(http_request_duration_seconds_bucket{le="1"}[5m]))

Request Counter:

typescript
http_request_total{method, route, status_code}

Example Queries:

promql
# Request rate by route
rate(http_request_total[5m])

# Error rate (5xx responses)
sum(rate(http_request_total{status_code=~"5.."}[5m]))
/ sum(rate(http_request_total[5m]))

# Requests per minute
increase(http_request_total[1m])

MCP Metrics

Active Sessions Gauge:

typescript
mcp_sessions_active

Tracks the number of currently active MCP sessions.

Example Queries:

promql
# Current active sessions
mcp_sessions_active

# Maximum concurrent sessions in last hour
max_over_time(mcp_sessions_active[1h])

# Alert: Too many active sessions
mcp_sessions_active > 1000

Session Lifecycle Counter:

typescript
mcp_sessions_total{status}

Labels:

  • status: Session lifecycle event
    • created: New session initialized
    • expired: Session expired via Redis TTL
    • terminated: Session explicitly terminated via DELETE

Example Queries:

promql
# Session creation rate
rate(mcp_sessions_total{status="created"}[5m])

# Session churn (terminated + expired)
rate(mcp_sessions_total{status=~"expired|terminated"}[5m])

# Ratio of sessions created to sessions ended
sum(mcp_sessions_total{status="created"})
/ sum(mcp_sessions_total{status=~"expired|terminated"})

Tool Invocation Counter:

typescript
mcp_tool_invocations_total{tool, status}

Labels:

  • tool: Tool name (random-number, echo-message, etc.)
  • status: Invocation result (success, error)

Example Queries:

promql
# Tool usage by tool
sum by (tool) (rate(mcp_tool_invocations_total[5m]))

# Tool error rate by tool
sum by (tool) (rate(mcp_tool_invocations_total{status="error"}[5m]))
/ sum by (tool) (rate(mcp_tool_invocations_total[5m]))

# Most used tools
topk(10, sum by (tool) (
  rate(mcp_tool_invocations_total[5m])
))

Tool Duration Histogram:

typescript
mcp_tool_duration_seconds{tool}

Buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1] seconds

Example Queries:

promql
# 99th percentile tool duration by tool
histogram_quantile(0.99,
  sum by (le, tool) (rate(mcp_tool_duration_seconds_bucket[5m]))
)

# Slow tools (P95 above 100ms)
histogram_quantile(0.95,
  sum by (le, tool) (rate(mcp_tool_duration_seconds_bucket[5m]))
) > 0.1

Authentication Metrics

Authentication Attempts Counter:

typescript
auth_attempts_total{result}

Labels:

  • result: success or failure

Example Queries:

promql
# Authentication success rate
sum(rate(auth_attempts_total{result="success"}[5m]))
/ sum(rate(auth_attempts_total[5m]))

# Failed authentication rate
rate(auth_attempts_total{result="failure"}[5m])

# Alert: High authentication failure rate
rate(auth_attempts_total{result="failure"}[5m]) > 10

Token Validation Duration:

typescript
auth_token_validation_duration_seconds

Buckets: [0.001, 0.005, 0.01, 0.05, 0.1] seconds

Example Queries:

promql
# 95th percentile validation time
histogram_quantile(0.95,
  rate(auth_token_validation_duration_seconds_bucket[5m])
)

# Alert: Slow token validation
histogram_quantile(0.95,
  rate(auth_token_validation_duration_seconds_bucket[5m])
) > 0.05

JWKS Metrics

JWKS Refresh Counter:

typescript
jwks_refresh_total{result}

Labels:

  • result: success or failure

Example Queries:

promql
# JWKS refresh rate
rate(jwks_refresh_total[5m])

# JWKS refresh failure rate
rate(jwks_refresh_total{result="failure"}[5m])

# Alert: JWKS refresh failures
rate(jwks_refresh_total{result="failure"}[5m]) > 0

JWKS Cache Performance:

typescript
jwks_cache_hits_total
jwks_cache_misses_total

Example Queries:

promql
# Cache hit rate
rate(jwks_cache_hits_total[5m])
/ (rate(jwks_cache_hits_total[5m]) + rate(jwks_cache_misses_total[5m]))

# Alert: Low cache hit rate
rate(jwks_cache_hits_total[5m])
/ (rate(jwks_cache_hits_total[5m]) + rate(jwks_cache_misses_total[5m]))
< 0.9

Redis Metrics

Redis Operations Counter:

typescript
redis_operations_total{operation, result}

Labels:

  • operation: get, set, del, zadd, zremrangebyscore, zcard, etc.
  • result: success or failure

Example Queries:

promql
# Redis operation rate by type
rate(redis_operations_total[5m])

# Redis error rate
sum(rate(redis_operations_total{result="failure"}[5m]))
/ sum(rate(redis_operations_total[5m]))

# Alert: High Redis error rate
sum(rate(redis_operations_total{result="failure"}[5m]))
/ sum(rate(redis_operations_total[5m])) > 0.01

Redis Operation Duration:

typescript
redis_operation_duration_seconds{operation}

Buckets: [0.001, 0.005, 0.01, 0.05, 0.1] seconds

Example Queries:

promql
# 95th percentile Redis latency
histogram_quantile(0.95,
  rate(redis_operation_duration_seconds_bucket[5m])
)

# Alert: Slow Redis operations
histogram_quantile(0.95,
  rate(redis_operation_duration_seconds_bucket[5m])
) > 0.05

Rate Limiting Metrics

Rate Limit Violations Counter:

typescript
rate_limit_hits_total{type, reason}

Labels:

  • type: Endpoint type (mcp, dcr)
  • reason: Limit type exceeded
    • rate_limit_exceeded: Per-IP limit
    • global_rate_limit_exceeded: Global limit

Example Queries:

promql
# Rate limit violations by endpoint
rate(rate_limit_hits_total[5m])

# Per-IP vs global limit violations
sum by (reason) (rate(rate_limit_hits_total[5m]))

# Alert: High rate limit violations
rate(rate_limit_hits_total{type="mcp"}[5m]) > 1

Rate Limit Usage Histogram:

typescript
rate_limit_usage_ratio{endpoint}

Buckets: [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 1.0]

Each observation records usage as a ratio of the allowed maximum (0.0 to 1.0).

Example Queries:

promql
# Average usage ratio across requests
rate(rate_limit_usage_ratio_sum[5m])
/ rate(rate_limit_usage_ratio_count[5m])

# Share of requests above 90% of the limit
1 - (
  rate(rate_limit_usage_ratio_bucket{le="0.9"}[5m])
  / ignoring(le) rate(rate_limit_usage_ratio_count[5m])
)

# Alert: Sustained high usage on the MCP endpoint
(
  rate(rate_limit_usage_ratio_sum{endpoint="mcp"}[5m])
  / rate(rate_limit_usage_ratio_count{endpoint="mcp"}[5m])
) > 0.8

DCR Metrics

Dynamic Client Registration Counter:

typescript
dcr_registrations_total{result}

Labels:

  • result: success or failure

Example Queries:

promql
# DCR registration rate
rate(dcr_registrations_total[5m])

# DCR failure rate
sum(rate(dcr_registrations_total{result="failure"}[5m]))
/ sum(rate(dcr_registrations_total[5m]))

# Alert: High DCR failure rate
sum(rate(dcr_registrations_total{result="failure"}[5m]))
/ sum(rate(dcr_registrations_total[5m])) > 0.1

OAuth Flow Metrics

IMPLEMENTED (2026-01-07) - Comprehensive metrics for OAuth 2.1 authorization and token flows.

OAuth Authorization Requests Counter:

typescript
oauth_authorization_requests_total{result}

Labels:

  • result: Authorization result
    • success: Request successfully proxied to IdP
    • error: Validation error (invalid code_challenge, etc.)
    • invalid_client: Client not found in DCR store

Example Queries:

promql
# Authorization success rate
sum(rate(oauth_authorization_requests_total{result="success"}[5m]))
/ sum(rate(oauth_authorization_requests_total[5m]))

# Invalid client rate
rate(oauth_authorization_requests_total{result="invalid_client"}[5m])

# Alert: High authorization error rate
sum(rate(oauth_authorization_requests_total{result="error"}[5m]))
/ sum(rate(oauth_authorization_requests_total[5m])) > 0.05

OAuth Token Exchanges Counter:

typescript
oauth_token_exchanges_total{grant_type, result}

Labels:

  • grant_type: Grant type used
    • authorization_code: Initial code exchange
    • refresh_token: Token refresh
  • result: Exchange result (success, failure)

Example Queries:

promql
# Token exchange success rate by grant type
sum by (grant_type) (rate(oauth_token_exchanges_total{result="success"}[5m]))
/ sum by (grant_type) (rate(oauth_token_exchanges_total[5m]))

# Refresh token usage rate
rate(oauth_token_exchanges_total{grant_type="refresh_token"}[5m])

# Alert: High token exchange failure rate
sum(rate(oauth_token_exchanges_total{result="failure"}[5m]))
/ sum(rate(oauth_token_exchanges_total[5m])) > 0.01

OAuth Token Exchange Duration Histogram:

typescript
oauth_token_exchange_duration_seconds{grant_type}

Buckets: [0.1, 0.5, 1, 2, 5] seconds

Example Queries:

promql
# P99 token exchange latency for authorization_code
histogram_quantile(0.99,
  rate(oauth_token_exchange_duration_seconds_bucket{grant_type="authorization_code"}[5m])
)

# Average IdP response time
rate(oauth_token_exchange_duration_seconds_sum[5m])
/ rate(oauth_token_exchange_duration_seconds_count[5m])

# Alert: Slow IdP token endpoint
histogram_quantile(0.95,
  rate(oauth_token_exchange_duration_seconds_bucket[5m])
) > 2

Token Refresh Metrics

IMPLEMENTED (2026-01-07) - Metrics for automatic token refresh operations.

Token Refresh Attempts Counter:

typescript
token_refresh_attempts_total{type, result}

Labels:

  • type: Refresh type
    • proactive: Token refreshed before expiration (5-min buffer)
    • reactive: Token refreshed after auth failure
  • result: Refresh outcome
    • success: Token successfully refreshed
    • failure: Refresh failed
    • skipped: No refresh token available

Example Queries:

promql
# Token refresh success rate
sum(rate(token_refresh_attempts_total{result="success"}[5m]))
/ sum(rate(token_refresh_attempts_total{result!="skipped"}[5m]))

# Proactive vs reactive refresh ratio
sum(rate(token_refresh_attempts_total{type="proactive"}[5m]))
/ sum(rate(token_refresh_attempts_total{type="reactive"}[5m]))

# Alert: High refresh failure rate
sum(rate(token_refresh_attempts_total{result="failure"}[5m]))
/ sum(rate(token_refresh_attempts_total{result!="skipped"}[5m])) > 0.1

Token Refresh Duration Histogram:

typescript
token_refresh_duration_seconds{result}

Buckets: [0.1, 0.5, 1, 2, 5] seconds

Example Queries:

promql
# P95 token refresh latency
histogram_quantile(0.95,
  rate(token_refresh_duration_seconds_bucket[5m])
)

# Average refresh duration for successful refreshes
rate(token_refresh_duration_seconds_sum{result="success"}[5m])
/ rate(token_refresh_duration_seconds_count{result="success"}[5m])

# Alert: Slow token refresh
histogram_quantile(0.95,
  rate(token_refresh_duration_seconds_bucket{result="success"}[5m])
) > 2

Pending Tokens Claimed Counter:

typescript
pending_tokens_claimed_total

Tracks how many pending tokens (stored by user ID during OAuth flow) are successfully claimed by MCP sessions.

Example Queries:

promql
# Pending token claim rate
rate(pending_tokens_claimed_total[5m])

# Token claim efficiency (claims vs successful code exchanges)
sum(rate(pending_tokens_claimed_total[5m]))
/ sum(rate(oauth_token_exchanges_total{grant_type="authorization_code",result="success"}[5m]))

Circuit Breaker Metrics

IMPLEMENTED (2026-01-06) - Metrics for circuit breaker pattern protecting Redis connections.

Circuit Breaker State Gauge:

typescript
circuit_breaker_state{name}

Values:

  • 0: Closed (normal operation)
  • 1: Half-open (testing recovery)
  • 2: Open (failing fast)

Labels:

  • name: Circuit breaker name (redis)

Example Queries:

promql
# Current circuit breaker state
circuit_breaker_state{name="redis"}

# Alert: Circuit breaker open
circuit_breaker_state{name="redis"} == 2

Circuit Breaker Failures Counter:

typescript
circuit_breaker_failures_total{name}

Example Queries:

promql
# Failure rate
rate(circuit_breaker_failures_total{name="redis"}[5m])

# Alert: High failure rate
rate(circuit_breaker_failures_total{name="redis"}[5m]) > 1

Circuit Breaker Successes Counter:

typescript
circuit_breaker_successes_total{name}

Example Queries:

promql
# Success rate
rate(circuit_breaker_successes_total{name="redis"}[5m])
/ (rate(circuit_breaker_successes_total{name="redis"}[5m]) + rate(circuit_breaker_failures_total{name="redis"}[5m]))

Circuit Breaker State Changes Counter:

typescript
circuit_breaker_state_changes_total{name, from_state, to_state}

Labels:

  • from_state: Previous state (closed, half_open, open)
  • to_state: New state (closed, half_open, open)

Example Queries:

promql
# State change rate
rate(circuit_breaker_state_changes_total[5m])

# Transitions to open state (service degradation)
rate(circuit_breaker_state_changes_total{to_state="open"}[5m])

# Alert: Frequent state changes (flapping)
rate(circuit_breaker_state_changes_total[5m]) > 0.5

Metrics Endpoint

Route: GET /metrics

Response Format: Prometheus exposition format

# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="POST",route="/mcp",status_code="200",le="0.01"} 45
http_request_duration_seconds_bucket{method="POST",route="/mcp",status_code="200",le="0.05"} 89
...
http_request_duration_seconds_sum{method="POST",route="/mcp",status_code="200"} 12.5
http_request_duration_seconds_count{method="POST",route="/mcp",status_code="200"} 100

Content-Type: application/openmetrics-text; version=1.0.0; charset=utf-8

Implementation:

typescript
// GET /metrics handler
import { register } from "prom-client";
import { metricsConfig } from "../config/metrics";

export async function getMetrics(): Promise<string> {
  if (!metricsConfig.enabled) {
    return "# Metrics disabled\n";
  }
  return register.metrics();
}

Structured Logging

Seed uses Winston for structured JSON logging with categorized events.

Log Configuration

Log Level:

bash
LOG_LEVEL=info  # debug, info, warn, error

Log Format:

bash
LOG_FORMAT=json    # json (production) or simple (development)

Implementation:

typescript
export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL ?? "info",
  format: winston.format.json(),
  defaultMeta: {
    service: "seed",
    version: "0.1.3",
  },
  transports: [
    new winston.transports.Console({
      format: process.env.LOG_FORMAT === "simple"
        ? winston.format.combine(
            winston.format.colorize(),
            winston.format.simple()
          )
        : winston.format.json(),
    }),
  ],
});

Log Structure

JSON Format (production):

json
{
  "timestamp": "2026-01-06T12:00:00.000Z",
  "level": "info",
  "message": "Auth event",
  "event": "token_validated",
  "userId": "user|12345",
  "method": "jwt",
  "ip": "192.168.1.100",
  "category": "authentication",
  "service": "seed",
  "version": "0.1.3"
}

Simple Format (development):

info: Auth event {"event":"token_validated","userId":"user|12345"}

Log Categories

Logs are categorized for easy filtering and querying:

Categories:

  • authentication - JWT validation, auth failures
  • mcp - MCP tool invocations, session lifecycle
  • security - Origin validation, security events
  • rate_limiting - Rate limit violations
  • oauth - OAuth flows, DCR events

Query Examples (using log aggregation tools):

# Find all authentication failures
category:authentication AND level:warn

# Find MCP tool errors
category:mcp AND success:false

# Find security events
category:security AND severity:high

Authentication Logging

Function: logAuthEvent(event, details)

Successful Authentication:

typescript
logAuthEvent("token_validated", {
  userId: payload.sub,
  method: "jwt",
  ip: req.ip,
});

Output:

json
{
  "timestamp": "2026-01-06T12:00:00Z",
  "level": "info",
  "message": "Auth event",
  "event": "token_validated",
  "userId": "user|12345",
  "method": "jwt",
  "ip": "192.168.1.100",
  "category": "authentication"
}

Failed Authentication:

typescript
logAuthEvent("token_rejected", {
  reason: "expired_token",
  ip: req.ip,
  details: error.message,
});

Output:

json
{
  "timestamp": "2026-01-06T12:00:00Z",
  "level": "info",
  "message": "Auth event",
  "event": "token_rejected",
  "reason": "expired_token",
  "ip": "192.168.1.100",
  "details": "Token expired at 2026-01-05T12:00:00Z",
  "category": "authentication"
}

MCP Tool Logging

Function: logToolInvocation(tool, sessionId, userId, success, duration)

Example:

typescript
logToolInvocation(
  "random-number",
  "session-123",
  "user|12345",
  true,
  0.005
);

Output:

json
{
  "timestamp": "2026-01-06T12:00:00Z",
  "level": "info",
  "message": "Tool invocation",
  "tool": "random-number",
  "sessionId": "session-123",
  "userId": "user|12345",
  "success": true,
  "duration": 0.005,
  "category": "mcp"
}

Use Cases:

  • Audit trail of tool usage
  • Performance analysis per tool
  • Error investigation
  • User activity tracking

Security Event Logging

Function: logSecurityEvent(event, severity, details)

Origin Validation Failure:

typescript
logSecurityEvent("origin_blocked", "medium", {
  origin: "https://evil.com",
  path: "/mcp",
  ip: "192.168.1.100",
});

Output:

json
{
  "timestamp": "2026-01-06T12:00:00Z",
  "level": "warn",
  "message": "Security event",
  "event": "origin_blocked",
  "severity": "medium",
  "origin": "https://evil.com",
  "path": "/mcp",
  "ip": "192.168.1.100",
  "category": "security"
}

Severity Levels:

  • low: Informational security events
  • medium: Potential attacks or policy violations
  • high: Active attacks or critical security failures
  • critical: System compromise or data breach
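One reasonable mapping from these severities onto Winston log levels, so that alerting on level:error catches high and critical events (this mapping is an assumption, not necessarily Seed's exact behavior):

```typescript
type Severity = "low" | "medium" | "high" | "critical";

// low stays informational; medium warns; high and critical are errors
export function severityToLevel(severity: Severity): "info" | "warn" | "error" {
  switch (severity) {
    case "low":
      return "info";
    case "medium":
      return "warn";
    case "high":
    case "critical":
      return "error";
  }
}
```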

Session Lifecycle Logging

Session Creation:

typescript
logger.info("MCP session created", {
  sessionId: "session-123",
  userId: "user|12345",
  capabilities: ["tools", "prompts", "resources"],
  category: "mcp",
});

Session Access (with TTL refresh):

typescript
logger.debug("Session accessed", {
  sessionId: "session-123",
  ttlRefreshed: true,
  category: "mcp",
});

Session Termination:

typescript
logger.info("MCP session terminated", {
  sessionId: "session-123",
  reason: "client_request",
  duration: 3600,
  category: "mcp",
});

Session Expiration:

typescript
logger.info("MCP session expired", {
  sessionId: "session-123",
  reason: "ttl_expired",
  lastAccessedAt: "2026-01-05T12:00:00Z",
  category: "mcp",
});

Rate Limiting Logging

Request Allowed:

typescript
logger.debug("Rate limit check passed", {
  endpoint: "mcp",
  ip: "192.168.1.100",
  count: 25,
  limit: 100,
  category: "rate_limiting",
});

Request Blocked:

typescript
logger.warn("Rate limit exceeded", {
  endpoint: "mcp",
  ip: "192.168.1.100",
  count: 101,
  limit: 100,
  retryAfter: 45,
  category: "rate_limiting",
});

Error Logging

Standard Errors:

typescript
logger.error("Redis connection failed", {
  error: error.message,
  stack: error.stack,
  operation: "get",
  key: "session:123",
});

Structured Error Context:

json
{
  "timestamp": "2026-01-06T12:00:00Z",
  "level": "error",
  "message": "Redis connection failed",
  "error": "ECONNREFUSED",
  "stack": "Error: ECONNREFUSED\n    at ...",
  "operation": "get",
  "key": "session:123",
  "service": "seed"
}

Child Loggers

Create contextual child loggers for scoped logging:

typescript
// Create child logger with context
const toolLogger = createChildLogger({
  sessionId: "session-123",
  userId: "user|12345",
});

// All logs from this logger include context
toolLogger.info("Tool invocation started", { tool: "random-number" });
toolLogger.info("Tool invocation completed", { duration: 0.005 });

Output:

json
{
  "timestamp": "2026-01-06T12:00:00Z",
  "level": "info",
  "message": "Tool invocation started",
  "tool": "random-number",
  "sessionId": "session-123",
  "userId": "user|12345",
  "service": "seed"
}

Monitoring Best Practices

High-Priority Alerts:

promql
# Auth failure spike
rate(auth_attempts_total{result="failure"}[5m]) > 10

# High error rate
sum(rate(http_request_total{status_code=~"5.."}[5m]))
/ sum(rate(http_request_total[5m])) > 0.01

# JWKS refresh failures
rate(jwks_refresh_total{result="failure"}[5m]) > 0

# Redis connection failures
sum(rate(redis_operations_total{result="failure"}[5m]))
/ sum(rate(redis_operations_total[5m])) > 0.05

# Circuit breaker open
circuit_breaker_state{name="redis"} == 2

# High OAuth token exchange failure rate
sum(rate(oauth_token_exchanges_total{result="failure"}[5m]))
/ sum(rate(oauth_token_exchanges_total[5m])) > 0.01

# High token refresh failure rate
sum(rate(token_refresh_attempts_total{result="failure"}[5m]))
/ sum(rate(token_refresh_attempts_total{result!="skipped"}[5m])) > 0.1

Medium-Priority Alerts:

promql
# Slow API responses
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
) > 1

# High rate limit violations
rate(rate_limit_hits_total[5m]) > 1

# High session churn
sum(rate(mcp_sessions_total{status=~"expired|terminated"}[5m]))
/ sum(rate(mcp_sessions_total{status="created"}[5m])) > 0.5

# Slow IdP token endpoint
histogram_quantile(0.95,
  rate(oauth_token_exchange_duration_seconds_bucket[5m])
) > 2

# Slow token refresh operations
histogram_quantile(0.95,
  rate(token_refresh_duration_seconds_bucket{result="success"}[5m])
) > 2

# High OAuth authorization error rate
sum(rate(oauth_authorization_requests_total{result="error"}[5m]))
/ sum(rate(oauth_authorization_requests_total[5m])) > 0.05

Dashboard Recommendations

HTTP Overview Dashboard:

  • Request rate by endpoint
  • Error rate over time
  • P95 response time
  • Status code distribution

MCP Dashboard:

  • Active sessions gauge
  • Session creation/termination rate
  • Tool invocation rate by tool
  • Tool duration P95 by tool

OAuth Dashboard (✅ Added 2026-01-07):

  • OAuth authorization requests by result (stacked area)
  • OAuth token exchanges by grant type and result (stacked area)
  • Token exchange duration percentiles (P50/P95/P99)
  • Token refresh success rate by type (proactive vs reactive)
  • Token refresh duration histogram
  • Pending token claim rate

Security Dashboard:

  • Authentication success/failure rate
  • Origin validation blocks
  • Rate limit violations by endpoint
  • Security events by severity

Infrastructure Dashboard:

  • Redis latency P95
  • Redis error rate
  • Redis circuit breaker state
  • JWKS cache hit rate
  • Process memory/CPU usage

Log Aggregation

Recommended Setup:

  • Centralized log aggregation (ELK, Loki, Datadog)
  • Index by category field for fast filtering
  • Retention policy based on compliance requirements
  • Alerts on error spikes and security events

Query Patterns:

# Failed authentications in last hour
category:authentication AND event:token_rejected AND @timestamp:[now-1h TO now]

# MCP tool errors
category:mcp AND success:false

# Security events (high severity)
category:security AND severity:high

Health Checks

IMPLEMENTED (2026-01-06) - Seed provides comprehensive Kubernetes-compatible health check endpoints with liveness and readiness probes.

Liveness Probe

Endpoint: GET /health

Purpose: Determines if the application is running (not hung or deadlocked).

Response (Healthy):

json
HTTP/1.1 200 OK

{
  "status": "ok",
  "version": "0.1.3"
}

Response (Shutting Down):

json
HTTP/1.1 503 Service Unavailable

{
  "status": "shutting_down",
  "version": "0.1.3"
}

Implementation: Returns "shutting_down" during graceful shutdown when SIGTERM/SIGINT is received. See src/routes/health.ts for implementation details.

Use Case: Kubernetes liveness probe to restart unhealthy pods.
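The shutdown-aware behavior can be sketched as a flag flipped by signal handlers (a simplification of src/routes/health.ts):

```typescript
let shuttingDown = false;

// Flip the flag on termination signals so the liveness probe starts
// failing before connections are drained
process.on("SIGTERM", () => { shuttingDown = true; });
process.on("SIGINT", () => { shuttingDown = true; });

export function healthHandler(): { status: number; body: { status: string; version: string } } {
  return shuttingDown
    ? { status: 503, body: { status: "shutting_down", version: "0.1.3" } }
    : { status: 200, body: { status: "ok", version: "0.1.3" } };
}
```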

Readiness Probe

Endpoint: GET /health/ready

Purpose: Determines if the application is ready to serve traffic.

Checks Performed:

  1. Redis Connectivity - Validates Redis connection with circuit breaker status
  2. JWKS Cache - Checks JWKS cache is populated and not expired
  3. Session Capacity - Verifies active sessions are below maximum threshold
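The aggregation behind the probe can be sketched as: run every check, and mark the whole probe degraded if any check fails. The check functions below are hypothetical stand-ins for the Redis, JWKS, and session checks:

```typescript
interface CheckResult {
  healthy: boolean;
  [detail: string]: unknown;
}

// Each checker returns healthy plus whatever detail it wants surfaced
type Checker = () => Promise<CheckResult>;

export async function readinessHandler(
  checkers: Record<string, Checker>
): Promise<{ status: number; body: object }> {
  const checks: Record<string, CheckResult> = {};
  for (const [name, check] of Object.entries(checkers)) {
    try {
      checks[name] = await check();
    } catch (err) {
      // A throwing check counts as unhealthy rather than crashing the probe
      checks[name] = { healthy: false, error: String(err) };
    }
  }
  const ready = Object.values(checks).every((c) => c.healthy);
  return {
    status: ready ? 200 : 503,
    body: { status: ready ? "ready" : "degraded", version: "0.1.3", checks },
  };
}
```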

Response (Ready):

json
HTTP/1.1 200 OK

{
  "status": "ready",
  "version": "0.1.3",
  "checks": {
    "redis": {
      "healthy": true,
      "connected": true,
      "circuitBreaker": {
        "state": "closed",
        "failureCount": 0,
        "successCount": 100,
        "lastFailure": null,
        "nextRetry": null
      }
    },
    "jwks": {
      "healthy": true,
      "cached": true,
      "isExpired": false,
      "fetchedAt": "2026-01-06T10:00:00.000Z",
      "expiresAt": "2026-01-06T11:00:00.000Z",
      "cacheAge": 1800000
    },
    "sessions": {
      "healthy": true,
      "activeSessions": 42,
      "maxSessions": 10000,
      "utilizationPercent": 0.42
    }
  }
}

Response (Degraded):

json
HTTP/1.1 503 Service Unavailable

{
  "status": "degraded",
  "version": "0.1.3",
  "checks": {
    "redis": {
      "healthy": false,
      "connected": false,
      "circuitBreaker": {
        "state": "open",
        "failureCount": 5,
        "successCount": 0,
        "lastFailure": "2026-01-06T10:00:00.000Z",
        "nextRetry": "2026-01-06T10:01:00.000Z"
      }
    },
    "jwks": {
      "healthy": true,
      "cached": true,
      "isExpired": false
    },
    "sessions": {
      "healthy": true,
      "activeSessions": 42,
      "maxSessions": 10000,
      "utilizationPercent": 0.42
    }
  }
}

Implementation: Readiness probe includes comprehensive dependency health checks implemented in src/routes/health.ts:

  • Redis health check with circuit breaker pattern integration
  • JWKS cache validation with expiration tracking
  • Session capacity monitoring with configurable thresholds

Use Case: Kubernetes readiness probe to remove unhealthy pods from service load balancer.

Kubernetes Integration

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seed-mcp-server
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: seed
        image: seed-mcp:latest
        ports:
        - containerPort: 3000

        # Liveness probe - restart if unhealthy
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Readiness probe - remove from service if not ready
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3

Health Check Configuration

bash
# Maximum concurrent sessions (Optional)
# Health check reports degraded status when this limit is exceeded
# Default: 10000
MCP_SESSION_MAX_TOTAL=10000

File: .env.example


Configuration Reference

Environment Variables

bash
# Metrics
METRICS_ENABLED=true             # Enable Prometheus metrics

# Logging
LOG_LEVEL=info                   # debug, info, warn, error
LOG_FORMAT=json                  # json (prod) or simple (dev)

# Health Checks
MCP_SESSION_MAX_TOTAL=10000      # Max sessions before degraded status

Metrics Collection

Scrape Configuration (Prometheus):

yaml
scrape_configs:
  - job_name: 'seed'
    static_configs:
      - targets: ['seed:3000']
    metrics_path: '/metrics'
    scrape_interval: 15s

Implementation Files

  • Metrics Service: src/services/metrics.ts - Prometheus metrics definitions
  • Logger Service: src/services/logger.ts - Winston logger with helper functions
  • Metrics Middleware: src/middleware/metrics.ts - HTTP metrics collection
  • Metrics Config: src/config/metrics.ts - Metrics configuration
  • Logging Config: src/config/logging.ts - Logging configuration
  • Metrics Route: src/routes/metrics.ts - /metrics endpoint handler

Released under the MIT License.