Redis Resilience & Circuit Breaker Pattern
Overview
The Seed server handles Redis connection failures with the Circuit Breaker pattern, so the service degrades gracefully during Redis outages instead of failing outright.
Implementation Status
✅ Implemented (2026-01-06)
Architecture
Circuit Breaker Service
Located in src/services/circuit-breaker.ts
The circuit breaker implements three states:
- Closed: Normal operation, all requests forwarded to Redis
- Open: Fast-fail mode, uses fallback/cache immediately
- Half-Open: Recovery testing with limited request attempts
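The three states can be read as a small state machine. The sketch below is a minimal illustration under stated assumptions, not the code in src/services/circuit-breaker.ts: the class name CircuitBreakerSketch, its method names, and the injectable now clock are hypothetical. It counts failures since the last success rather than over a monitoring window, and it assumes a single successful trial request closes the circuit.

```typescript
// Hypothetical sketch of the closed / open / half-open state machine.
type State = "closed" | "open" | "half-open";

interface BreakerConfig {
  failureThreshold: number;    // open after this many failures
  resetTimeout: number;        // ms to wait before probing Redis again
  halfOpenMaxAttempts: number; // trial requests allowed while half-open
}

class CircuitBreakerSketch {
  private readonly config: BreakerConfig;
  private readonly now: () => number;
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;
  private halfOpenAttempts = 0;

  // `now` is injectable so the timing logic can be exercised without waiting.
  constructor(config: BreakerConfig, now: () => number = () => Date.now()) {
    this.config = config;
    this.now = now;
  }

  getState(): State {
    // An open circuit becomes half-open once resetTimeout has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.config.resetTimeout) {
      this.state = "half-open";
      this.halfOpenAttempts = 0;
    }
    return this.state;
  }

  // Returns true if the caller may attempt the Redis operation.
  allowRequest(): boolean {
    const state = this.getState();
    if (state === "closed") return true;
    if (state === "open") return false;
    if (this.halfOpenAttempts >= this.config.halfOpenMaxAttempts) return false;
    this.halfOpenAttempts += 1;
    return true;
  }

  recordSuccess(): void {
    // Assumption: any success closes the circuit and clears the failure count.
    this.state = "closed";
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial in half-open re-opens immediately; otherwise wait for the threshold.
    if (this.state === "half-open" || this.failures >= this.config.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

With failureThreshold: 5 and resetTimeout: 30000, the fifth failure in a row opens the circuit in this sketch, and allowRequest() returns false until 30 seconds have passed. After that, up to halfOpenMaxAttempts trial requests are let through.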
Configuration:
{
  failureThreshold: 5,      // Open after N failures
  resetTimeout: 30000,      // Try recovery after 30s
  monitoringWindow: 60000,  // Track failures over 1 minute
  halfOpenMaxAttempts: 3    // Max test attempts in half-open
}
Redis Integration
Located in src/services/redis.ts
executeRedisOperation Wrapper:
export async function executeRedisOperation<T>(
  operation: () => Promise<T>,
  fallback: () => T,
  operationName: string,
): Promise<T>
All Redis operations are wrapped with this function to provide:
- Automatic circuit breaker protection
- Consistent error logging
- Graceful fallback behavior
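To make these three guarantees concrete, here is a hedged sketch of how such a wrapper could be built. It is not the code in src/services/redis.ts: BreakerLike, makeWrapper, executeRedisOperationSketch, and the in-memory logEntries sink are hypothetical names introduced for illustration. Only the three-argument signature comes from this document.

```typescript
// Minimal breaker surface assumed by this sketch (hypothetical, not the real service API).
interface BreakerLike {
  allowRequest(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

interface LogEntry {
  level: "info" | "error";
  message: string;
  operation: string;
  category: "redis";
  error?: string;
}

// In-memory sink standing in for the server's structured logger.
const logEntries: LogEntry[] = [];

function makeWrapper(breaker: BreakerLike) {
  return async function executeRedisOperationSketch<T>(
    operation: () => Promise<T>,
    fallback: () => T,
    operationName: string,
  ): Promise<T> {
    // Fast-fail: skip Redis entirely while the breaker refuses requests.
    if (!breaker.allowRequest()) {
      return fallback();
    }
    try {
      const result = await operation();
      breaker.recordSuccess();
      logEntries.push({
        level: "info",
        message: "Redis operation completed",
        operation: operationName,
        category: "redis",
      });
      return result;
    } catch (err) {
      // Any thrown error counts as a failure and is answered with the fallback.
      breaker.recordFailure();
      logEntries.push({
        level: "error",
        message: "Redis operation failed",
        operation: operationName,
        error: err instanceof Error ? err.message : String(err),
        category: "redis",
      });
      return fallback();
    }
  };
}
```

Because the fallback is synchronous and supplied by the caller, each subsystem decides what "degraded" means for it, while the breaker bookkeeping and logging stay in one place.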
Subsystem Integration
1. Session Store
src/services/session-store.ts
Fallback Behavior:
- get(): Returns null when Redis is unavailable (session not found)
- set(): Logs a warning, continues without persistence
- delete(): Logs a warning, relies on TTL expiration
- touch(): Logs a warning, session may expire earlier
Impact: Sessions become stateless during outages but service remains available.
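The session fallbacks above could look roughly like the following. This is a sketch under assumptions, not the contents of src/services/session-store.ts: SessionStoreSketch, KeyValueClient, the session: key prefix, and the boolean returned by set() are all hypothetical. The wrapper is injected as a RedisRunner whose signature mirrors executeRedisOperation.

```typescript
// Same shape as the executeRedisOperation signature; the implementation is injected.
type RedisRunner = <T>(
  operation: () => Promise<T>,
  fallback: () => T,
  operationName: string,
) => Promise<T>;

// Hypothetical minimal client surface; the real Redis client type is not shown here.
interface KeyValueClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

class SessionStoreSketch {
  private readonly client: KeyValueClient;
  private readonly run: RedisRunner;
  // Collected warnings stand in for the server's logger.
  readonly warnings: string[] = [];

  constructor(client: KeyValueClient, run: RedisRunner) {
    this.client = client;
    this.run = run;
  }

  // During an outage the fallback reports "session not found".
  async get(id: string): Promise<string | null> {
    return this.run<string | null>(
      () => this.client.get(`session:${id}`),
      () => null,
      "session-get",
    );
  }

  // Resolves to false when the write was skipped, after recording a warning.
  async set(id: string, data: string): Promise<boolean> {
    return this.run<boolean>(
      async () => {
        await this.client.set(`session:${id}`, data);
        return true;
      },
      () => {
        this.warnings.push(`session ${id} not persisted`);
        return false;
      },
      "session-set",
    );
  }
}
```

Callers see the same return types whether or not Redis is reachable, which is what lets request handling continue during an outage.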
2. Token Store
src/services/token-store.ts
Fallback Behavior:
- get(): Returns null (requires re-authentication)
- set(): Logs an error, tokens not persisted
- setPending(): Logs an error, user may need to retry the OAuth flow
- claimPending(): Returns null (pending tokens lost)
Impact: Users may need to re-authenticate during outages, but service stays up.
3. Distributed Rate Limiting
src/middleware/distributed-rate-limit.ts
Fallback Behavior:
- Rate limit checks return allowed: true (fail-open)
- Tracking operations log warnings but don't block requests
Impact: Rate limiting is disabled during outages, which is accepted as the cost of maintaining availability.
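A fail-open check can be sketched as follows. The function name checkRateLimitSketch, the increment callback, and the remaining field are hypothetical and are not taken from src/middleware/distributed-rate-limit.ts. A plain try/catch stands in for the circuit breaker wrapper to keep the example short.

```typescript
interface RateLimitResult {
  allowed: boolean;
  remaining: number | null; // null when the count is unknown (Redis unavailable)
}

// `increment` is a hypothetical callback that bumps and returns the counter for a key.
async function checkRateLimitSketch(
  increment: (key: string) => Promise<number>,
  key: string,
  limit: number,
): Promise<RateLimitResult> {
  try {
    const count = await increment(key);
    return { allowed: count <= limit, remaining: Math.max(0, limit - count) };
  } catch {
    // Fail-open: if the counter cannot be read, let the request through.
    return { allowed: true, remaining: null };
  }
}
```

The opposite choice, fail-closed, would reject every request while Redis is down. This document opts for fail-open because availability is prioritized over enforcement.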
Observability
Metrics
Four Prometheus metrics exposed at /metrics:
# Circuit breaker state (0=closed, 1=half-open, 2=open)
circuit_breaker_state{name="redis"} 0
# Total failures recorded
circuit_breaker_failures_total{name="redis"} 5
# Total successful operations
circuit_breaker_successes_total{name="redis"} 1000
# State transitions
circuit_breaker_state_changes_total{name="redis",from_state="closed",to_state="open"} 1
Health Checks
Readiness Probe: /health/ready
Returns 503 (Service Unavailable) when:
- Redis is disconnected, OR
- Circuit breaker is in "open" state
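The two conditions above reduce to a single boolean. The sketch below illustrates that rule only. The function name evaluateReadiness, the 200 status for the healthy case, and the "ok" status string are assumptions; this document only shows the 503 and "degraded" case.

```typescript
type BreakerState = "closed" | "half-open" | "open";

interface ReadinessResult {
  httpStatus: 200 | 503;
  status: "ok" | "degraded";
}

// Not ready when Redis is disconnected OR the circuit breaker is open.
// Half-open is treated as ready here, since only "open" is listed as a 503 condition.
function evaluateReadiness(redisConnected: boolean, breakerState: BreakerState): ReadinessResult {
  const healthy = redisConnected && breakerState !== "open";
  return healthy
    ? { httpStatus: 200, status: "ok" }
    : { httpStatus: 503, status: "degraded" };
}
```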
Response includes detailed circuit breaker stats:
{
  "status": "degraded",
  "version": "0.1.3",
  "checks": {
    "redis": {
      "healthy": false,
      "connected": false,
      "circuitBreaker": {
        "state": "open",
        "failureCount": 5,
        "successCount": 0,
        "lastFailure": "2026-01-06T16:30:45.123Z",
        "nextRetry": "2026-01-06T16:31:15.123Z"
      }
    }
  }
}
Logging
All Redis operations log with structured context:
Success:
{
  "level": "info",
  "message": "Redis operation completed",
  "operation": "session-get",
  "category": "redis"
}
Failure:
{
  "level": "error",
  "message": "Redis operation failed",
  "operation": "token-set",
  "error": "Connection timeout",
  "category": "redis"
}
Circuit Opened:
{
  "level": "error",
  "message": "Circuit breaker opened",
  "name": "redis",
  "failureCount": 5,
  "resetTimeout": 30000,
  "category": "circuit-breaker"
}
Testing
Unit Tests
Circuit breaker tests: src/services/circuit-breaker.test.ts
Coverage:
- ✅ Normal operation (closed state)
- ✅ Failure threshold and circuit opening
- ✅ Half-open state transitions
- ✅ Recovery after timeout
- ✅ Statistics tracking
Integration Considerations
Existing tests need updates to account for circuit breaker fallback behavior:
- Tests expecting Redis errors need to expect fallback values instead
- Mock executeRedisOperation to control circuit breaker behavior
- Test both success and fallback paths
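One framework-agnostic way to cover both paths is to inject the wrapper and swap in two mocks. This is a sketch only: passThrough, forceFallback, getTokenSketch, and runFallbackTests are hypothetical names, and the project's actual test runner and mocking approach are not specified here.

```typescript
// Same shape as the executeRedisOperation signature.
type RedisRunner = <T>(
  operation: () => Promise<T>,
  fallback: () => T,
  operationName: string,
) => Promise<T>;

// Mock that behaves like a closed circuit: the operation always runs.
const passThrough: RedisRunner = (operation) => operation();

// Mock that behaves like an open circuit: the operation is never invoked.
const forceFallback: RedisRunner = async (_operation, fallback) => fallback();

// Hypothetical function under test, written to accept the wrapper as a dependency.
async function getTokenSketch(
  run: RedisRunner,
  lookup: () => Promise<string | null>,
): Promise<string | null> {
  return run<string | null>(lookup, () => null, "token-get");
}

async function runFallbackTests(): Promise<string[]> {
  const results: string[] = [];

  // Success path: the value comes from the (fake) Redis lookup.
  const viaRedis = await getTokenSketch(passThrough, async () => "token-123");
  results.push(viaRedis === "token-123" ? "success path: pass" : "success path: FAIL");

  // Fallback path: expect the fallback value, not a thrown Redis error.
  const viaFallback = await getTokenSketch(forceFallback, async () => {
    throw new Error("Redis should not be called while the circuit is open");
  });
  results.push(viaFallback === null ? "fallback path: pass" : "fallback path: FAIL");

  return results;
}
```

In a real suite the same two mocks could be installed through the test framework's module mocking instead of dependency injection.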
Operational Guidelines
Monitoring
Alert on:
- circuit_breaker_state{name="redis"} == 2 (circuit open)
- High circuit_breaker_failures_total rate
- /health/ready returning 503
Recovery
The circuit breaker automatically attempts recovery once resetTimeout (30s) has elapsed since it opened. No manual intervention is required.
Manual Recovery: If needed, restart the service to reset circuit breaker state.
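When reading the readiness response during an incident, the expected time of the next automatic recovery attempt can be derived from lastFailure and resetTimeout. The helper below is a hypothetical illustration; it assumes the timer starts at the last recorded failure, which matches the sample response earlier in this document.

```typescript
// Returns the ISO timestamp at which the breaker is expected to try Redis again.
function nextRetryAt(lastFailureIso: string, resetTimeoutMs: number): string {
  return new Date(Date.parse(lastFailureIso) + resetTimeoutMs).toISOString();
}

// Values taken from the sample readiness response: 16:30:45.123Z plus 30s.
const expectedRetry = nextRetryAt("2026-01-06T16:30:45.123Z", 30000);
// expectedRetry === "2026-01-06T16:31:15.123Z"
```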
Capacity Planning
With circuit breaker in place:
- Service continues during Redis outages
- Graceful degradation maintains core functionality
- Users may experience: re-authentication, missing sessions, no rate limiting