
Redis Resilience & Circuit Breaker Pattern

Overview

The Seed server handles Redis connection failures with the Circuit Breaker pattern, so the service degrades gracefully during Redis outages instead of failing requests.

Implementation Status

Implemented (2026-01-06)

Architecture

Circuit Breaker Service

Located in src/services/circuit-breaker.ts

The circuit breaker implements three states:

  • Closed: Normal operation, all requests forwarded to Redis
  • Open: Fast-fail mode, uses fallback/cache immediately
  • Half-Open: Recovery testing with limited request attempts

Configuration:

typescript
{
  failureThreshold: 5,        // Open after N failures
  resetTimeout: 30000,        // Try recovery after 30s
  monitoringWindow: 60000,    // Track failures over 1 minute
  halfOpenMaxAttempts: 3      // Max test attempts in half-open
}
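
The three states and the four configuration values above can be sketched as a small state machine. This is an illustrative sketch only, not the contents of src/services/circuit-breaker.ts: the class name, method names, the injectable clock, and the rule that a single successful probe closes the circuit are all assumptions.

```typescript
// Hypothetical sketch of a three-state breaker; all names here are assumptions.
type CircuitState = "closed" | "open" | "half-open";

interface CircuitBreakerConfig {
  failureThreshold: number;    // failures within the window that open the circuit
  resetTimeout: number;        // ms to stay open before allowing probe requests
  monitoringWindow: number;    // ms over which failures are counted
  halfOpenMaxAttempts: number; // probe requests allowed while half-open
}

class SketchCircuitBreaker {
  private config: CircuitBreakerConfig;
  private now: () => number;
  private state: CircuitState = "closed";
  private failureTimes: number[] = [];
  private openedAt = 0;
  private halfOpenAttempts = 0;

  // `now` is injectable so the timing logic can be exercised without waiting.
  constructor(config: CircuitBreakerConfig, now: () => number = () => Date.now()) {
    this.config = config;
    this.now = now;
  }

  getState(): CircuitState {
    // Move lazily from open to half-open once resetTimeout has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.config.resetTimeout) {
      this.state = "half-open";
      this.halfOpenAttempts = 0;
    }
    return this.state;
  }

  // Returns true if the caller may attempt the real operation.
  allowRequest(): boolean {
    const state = this.getState();
    if (state === "closed") return true;
    if (state === "open") return false;
    if (this.halfOpenAttempts >= this.config.halfOpenMaxAttempts) return false;
    this.halfOpenAttempts += 1;
    return true;
  }

  recordSuccess(): void {
    // Assumption: one successful probe is enough to close the circuit.
    if (this.getState() === "half-open") {
      this.state = "closed";
      this.failureTimes = [];
    }
  }

  recordFailure(): void {
    const t = this.now();
    if (this.getState() === "half-open") {
      this.open(t); // a failed probe reopens the circuit immediately
      return;
    }
    // Keep only failures inside the monitoring window, then add this one.
    this.failureTimes = this.failureTimes.filter((ts) => t - ts < this.config.monitoringWindow);
    this.failureTimes.push(t);
    if (this.failureTimes.length >= this.config.failureThreshold) this.open(t);
  }

  private open(at: number): void {
    this.state = "open";
    this.openedAt = at;
    this.failureTimes = [];
  }
}
```

With the configuration shown above, five failures within one minute open the circuit, and the first call to allowRequest() made 30 seconds or more later is treated as a half-open probe.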

Redis Integration

Located in src/services/redis.ts

executeRedisOperation Wrapper:

typescript
export async function executeRedisOperation<T>(
  operation: () => Promise<T>,
  fallback: () => T,
  operationName: string,
): Promise<T>

All Redis operations are wrapped with this function to provide:

  • Automatic circuit breaker protection
  • Consistent error logging
  • Graceful fallback behavior
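
A wrapper with this signature might be structured roughly as follows. This is a sketch under assumptions, not the code in src/services/redis.ts: the BreakerLike interface, the module-level stub breaker, and the function name executeRedisOperationSketch are hypothetical, and the log shape simply mirrors the failure log documented in the Logging section.

```typescript
// Hypothetical sketch; the real wrapper lives in src/services/redis.ts.
interface BreakerLike {
  allowRequest(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

// Assumption: a stub breaker that opens after 5 consecutive failures and never
// recovers. It exists only to show the control flow of the wrapper.
let consecutiveFailures = 0;
const breaker: BreakerLike = {
  allowRequest: () => consecutiveFailures < 5,
  recordSuccess: () => { consecutiveFailures = 0; },
  recordFailure: () => { consecutiveFailures += 1; },
};

async function executeRedisOperationSketch<T>(
  operation: () => Promise<T>,
  fallback: () => T,
  operationName: string,
): Promise<T> {
  if (!breaker.allowRequest()) {
    // Circuit open: fail fast without touching Redis.
    return fallback();
  }
  try {
    const result = await operation();
    breaker.recordSuccess();
    return result;
  } catch (error) {
    breaker.recordFailure();
    console.error(JSON.stringify({
      level: "error",
      message: "Redis operation failed",
      operation: operationName,
      error: error instanceof Error ? error.message : String(error),
      category: "redis",
    }));
    // The caller receives the fallback value instead of an exception.
    return fallback();
  }
}
```

The key property is that callers never see a rejected promise from Redis; they receive either the real result or the fallback value.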

Subsystem Integration

1. Session Store

src/services/session-store.ts

Fallback Behavior:

  • get(): Returns null when Redis unavailable (session not found)
  • set(): Logs warning, continues without persistence
  • delete(): Logs warning, relies on TTL expiration
  • touch(): Logs warning, session may expire earlier

Impact: Sessions cannot be read or persisted during outages, so requests are handled as if no session exists, but the service remains available.
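
The fallback behavior listed above could look roughly like this in a store class. Everything except the get/set/delete method names is an assumption made for illustration: FakeRedis stands in for a real client so the example runs without a server, guarded is a simplified stand-in for executeRedisOperation with no breaker state, and the session: key prefix and SessionData shape are hypothetical.

```typescript
// Hypothetical sketch; not the contents of src/services/session-store.ts.
interface SessionData { userId: string }

// In-memory stand-in for a Redis client; flip `available` to simulate an outage.
class FakeRedis {
  available = true;
  private data = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    this.assertUp();
    return this.data.get(key) ?? null;
  }
  async set(key: string, value: string): Promise<void> {
    this.assertUp();
    this.data.set(key, value);
  }
  async del(key: string): Promise<void> {
    this.assertUp();
    this.data.delete(key);
  }
  private assertUp(): void {
    if (!this.available) throw new Error("Connection timeout");
  }
}

// Simplified stand-in for executeRedisOperation: only the fallback path, no breaker.
async function guarded<T>(operation: () => Promise<T>, fallback: () => T, name: string): Promise<T> {
  try {
    return await operation();
  } catch {
    console.warn(`Redis unavailable during ${name}; using fallback`);
    return fallback();
  }
}

class SketchSessionStore {
  private redis: FakeRedis;
  constructor(redis: FakeRedis) { this.redis = redis; }

  // Falls back to null, which callers treat as "session not found".
  get(id: string): Promise<SessionData | null> {
    return guarded<SessionData | null>(async () => {
      const raw = await this.redis.get(`session:${id}`);
      return raw === null ? null : (JSON.parse(raw) as SessionData);
    }, () => null, "session-get");
  }

  // Falls back to a no-op: the request continues without persistence.
  set(id: string, data: SessionData): Promise<void> {
    return guarded<void>(() => this.redis.set(`session:${id}`, JSON.stringify(data)), () => undefined, "session-set");
  }

  // Falls back to a no-op: the key is left for TTL expiration to clean up.
  delete(id: string): Promise<void> {
    return guarded<void>(() => this.redis.del(`session:${id}`), () => undefined, "session-delete");
  }
}
```

Note that a set() made during an outage is silently dropped, so a later get() for that session still returns null after Redis recovers.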

2. Token Store

src/services/token-store.ts

Fallback Behavior:

  • get(): Returns null (requires re-authentication)
  • set(): Logs error, tokens not persisted
  • setPending(): Logs error, user may need to retry OAuth flow
  • claimPending(): Returns null (pending tokens lost)

Impact: Users may need to re-authenticate during outages, but service stays up.

3. Distributed Rate Limiting

src/middleware/distributed-rate-limit.ts

Fallback Behavior:

  • Rate limit checks return allowed: true (fail-open)
  • Tracking operations log warnings but don't block requests

Impact: Rate limiting is not enforced during outages. This is an accepted trade-off that favors availability.
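
A fail-open check can be sketched as below. The CounterBackend interface, the function name, and the remaining field are assumptions for illustration; only the allowed: true fallback comes from the behavior described above.

```typescript
// Hypothetical sketch; not the contents of src/middleware/distributed-rate-limit.ts.
interface RateLimitResult {
  allowed: boolean;
  remaining: number | null; // null when the count is unknown (Redis unavailable)
}

// Assumed counter backend: `increment` resolves to the request count in the
// current window and rejects when Redis is unreachable.
interface CounterBackend {
  increment(key: string, windowMs: number): Promise<number>;
}

async function checkRateLimitSketch(
  backend: CounterBackend,
  key: string,
  limit: number,
  windowMs: number,
): Promise<RateLimitResult> {
  try {
    const count = await backend.increment(key, windowMs);
    return { allowed: count <= limit, remaining: Math.max(0, limit - count) };
  } catch {
    // Fail-open: without a count we let the request through rather than block it.
    console.warn(`Rate limit check skipped for ${key}: Redis unavailable`);
    return { allowed: true, remaining: null };
  }
}
```

The alternative, fail-closed, would reject requests whenever Redis is down, which would turn a Redis outage into a full service outage.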

Observability

Metrics

Four Prometheus metrics exposed at /metrics:

# Circuit breaker state (0=closed, 1=half-open, 2=open)
circuit_breaker_state{name="redis"} 0

# Total failures recorded
circuit_breaker_failures_total{name="redis"} 5

# Total successful operations
circuit_breaker_successes_total{name="redis"} 1000

# State transitions
circuit_breaker_state_changes_total{name="redis",from_state="closed",to_state="open"} 1
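
Because the state gauge is numeric, the breaker's state name has to be mapped to a value before it is exported. A minimal sketch of that mapping, using the encoding documented above (the constant and function names are hypothetical, and a real service would more likely use a metrics library than format the line by hand):

```typescript
// Hypothetical sketch of the state-to-gauge mapping.
type CircuitState = "closed" | "half-open" | "open";

const STATE_GAUGE_VALUE: Record<CircuitState, number> = {
  closed: 0,
  "half-open": 1,
  open: 2,
};

// Renders one sample line in the Prometheus text exposition format.
function renderStateMetric(name: string, state: CircuitState): string {
  return `circuit_breaker_state{name="${name}"} ${STATE_GAUGE_VALUE[state]}`;
}
```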

Health Checks

Readiness Probe: /health/ready

Returns 503 (Service Unavailable) when:

  • Redis is disconnected, OR
  • Circuit breaker is in "open" state

Response includes detailed circuit breaker stats:

json
{
  "status": "degraded",
  "version": "0.1.3",
  "checks": {
    "redis": {
      "healthy": false,
      "connected": false,
      "circuitBreaker": {
        "state": "open",
        "failureCount": 5,
        "successCount": 0,
        "lastFailure": "2026-01-06T16:30:45.123Z",
        "nextRetry": "2026-01-06T16:31:15.123Z"
      }
    }
  }
}
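
The readiness rule above (503 when Redis is disconnected or the circuit is open) can be expressed as a small pure function. This is a sketch, not the handler's actual code: the function and interface names are hypothetical, and the 200 status and "ok" label for the healthy case are assumptions, since only the degraded response is documented here.

```typescript
// Hypothetical sketch of the readiness decision.
type CircuitState = "closed" | "half-open" | "open";

interface BreakerStats {
  state: CircuitState;
  failureCount: number;
  successCount: number;
  lastFailure: string | null; // ISO 8601 timestamp
  nextRetry: string | null;   // ISO 8601 timestamp
}

interface ReadinessResult {
  httpStatus: number;
  body: {
    status: string;
    version: string;
    checks: { redis: { healthy: boolean; connected: boolean; circuitBreaker: BreakerStats } };
  };
}

function buildReadinessSketch(connected: boolean, stats: BreakerStats, version: string): ReadinessResult {
  // Not ready when Redis is disconnected OR the circuit is open.
  // Under this rule a half-open circuit still counts as ready.
  const healthy = connected && stats.state !== "open";
  return {
    httpStatus: healthy ? 200 : 503,                 // 200 is an assumption
    body: {
      status: healthy ? "ok" : "degraded",           // "ok" is an assumption
      version,
      checks: { redis: { healthy, connected, circuitBreaker: stats } },
    },
  };
}
```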

Logging

All Redis operations log with structured context:

Success:

json
{
  "level": "info",
  "message": "Redis operation completed",
  "operation": "session-get",
  "category": "redis"
}

Failure:

json
{
  "level": "error",
  "message": "Redis operation failed",
  "operation": "token-set",
  "error": "Connection timeout",
  "category": "redis"
}

Circuit Opened:

json
{
  "level": "error",
  "message": "Circuit breaker opened",
  "name": "redis",
  "failureCount": 5,
  "resetTimeout": 30000,
  "category": "circuit-breaker"
}

Testing

Unit Tests

Circuit breaker tests: src/services/circuit-breaker.test.ts

Coverage:

  • ✅ Normal operation (closed state)
  • ✅ Failure threshold and circuit opening
  • ✅ Half-open state transitions
  • ✅ Recovery after timeout
  • ✅ Statistics tracking

Integration Considerations

Existing tests need updates to account for circuit breaker fallback behavior:

  • Tests expecting Redis errors need to expect fallback values instead
  • Mock executeRedisOperation to control circuit breaker behavior
  • Test both success and fallback paths

Operational Guidelines

Monitoring

Alert on:

  1. circuit_breaker_state{name="redis"} == 2 (circuit open)
  2. A sustained increase in the rate of circuit_breaker_failures_total, for example rate(circuit_breaker_failures_total{name="redis"}[5m])
  3. /health/ready returning 503

Recovery

The circuit breaker automatically attempts recovery after resetTimeout (30s), so no manual intervention is normally required.

Manual Recovery: If needed, restart the service to reset circuit breaker state.

Capacity Planning

With circuit breaker in place:

  • Service continues during Redis outages
  • Graceful degradation maintains core functionality
  • Users may have to re-authenticate and may lose their sessions, and rate limiting is not enforced

Released under the MIT License.