Redis Connection Failure Handling
- Status: ✅ IMPLEMENTED
- Priority: 🔴 HIGH (Completed)
- Actual Time: 8 hours
- Implementation Date: 2026-01-06
- Risk Level: MEDIUM
- Impact: Resilient storage layer with graceful degradation
Implementation Summary
The Redis connection failure handling feature has been successfully implemented with the following components:
What Was Implemented
- Connection Error Handling - Redis client properly handles connection failures and reconnection attempts
- Rate Limiting Fallback - Rate limiter fails open (allows requests) when Redis is unavailable to prevent service disruption
- Session Store Resilience - Session operations handle Redis failures gracefully with proper error logging
- Token Store Error Handling - Token operations distinguish between Redis failures and token issues
- Consistent Error Responses - All Redis-dependent operations have unified error handling patterns
- Health Check Integration - Health endpoint includes Redis connection status
- Reconnection Strategy - Automatic reconnection with exponential backoff
Actual Implementation
The implementation includes:
- Redis client initialization with proper error event handlers
- Graceful degradation for all Redis-dependent operations
- Comprehensive error logging that distinguishes connection errors from data errors
- Fail-safe defaults for rate limiting (fail open to maintain availability)
- Session and token store error handling with fallbacks
Key Files Modified
- Redis client services with connection management
- Rate limiting middleware with fallback behavior
- Session store with error handling
- Token store with proper error distinction
- Health check endpoints with Redis status
Testing
- Manual testing with Redis container stop/start
- Integration testing with connection failure scenarios
- Verified graceful degradation behavior
Original Problem Statement
The application has inconsistent error handling when Redis becomes unavailable. Different subsystems handle Redis failures differently, leading to unpredictable behavior during outages:
- Rate limiting: Fails open (allows all requests)
- Session store: JSON parse errors caught, connection errors not handled
- Token store: Errors logged but not distinguished (IdP failure vs storage failure)
Current Gaps:
- No distinction between "Redis down" vs "key not found"
- No fallback mechanisms for critical operations
- No circuit breaker pattern to prevent cascading failures
- Operations fail silently without proper error propagation
This creates operational blind spots where Redis outages cause unpredictable behavior without clear visibility.
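One way to close the first gap is to make the three outcomes explicit in the return type instead of collapsing them into `null`. The sketch below is illustrative only: `LookupResult`, `classifyLookup`, and `statusFor` are hypothetical names that do not exist in the codebase, and the mapping of an outage to HTTP 503 is an assumption.

```typescript
// Hypothetical result type: separates "key not found" from "Redis down".
type LookupResult<T> =
  | { kind: "hit"; value: T }
  | { kind: "miss" }
  | { kind: "unavailable"; reason: string };

// Wraps any async string lookup (for example a Redis GET) and classifies the outcome.
// Errors thrown by `parse` are deliberately not caught here, so corrupt data is
// not misreported as an outage.
async function classifyLookup<T>(
  fetchRaw: () => Promise<string | null>,
  parse: (raw: string) => T
): Promise<LookupResult<T>> {
  let raw: string | null = null;
  try {
    raw = await fetchRaw();
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    return { kind: "unavailable", reason };
  }
  if (raw === null) {
    return { kind: "miss" };
  }
  return { kind: "hit", value: parse(raw) };
}

// Assumed mapping to HTTP status codes; adjust to the API's own conventions.
function statusFor(result: LookupResult<unknown>): number {
  switch (result.kind) {
    case "hit":
      return 200;
    case "miss":
      return 404;
    case "unavailable":
      return 503;
  }
}
```

With a shape like this, a caller can answer a missing session with "not found" while reporting a storage outage separately, rather than treating both as the same `null`.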
Current Behavior
Inconsistent Behaviors:
| Component | Redis Failure Behavior | User Experience |
|---|---|---|
| Rate Limiting | Fails open (allows all) | ✅ Works but unprotected |
| Session Lookup | Throws error | ❌ 500 error |
| Token Store | Logs error, returns null | ⚠️ Token refresh fails |
| Client Registration | Throws error | ❌ Cannot register |
Proposed Solution
Implement consistent error handling with circuit breaker pattern and graceful degradation.
Circuit Breaker Pattern
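Prevent cascading failures by detecting Redis outages and failing fast. The breaker starts closed, opens after failureThreshold consecutive failures, moves to half-open once resetTimeout has elapsed, and then either closes on a successful test call or re-opens on a failed one.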
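The state machine behind the pattern can be summarized as a pure transition function. This is a simplified sketch for orientation, not the service itself: `CircuitEvent`, `Snapshot`, and `transition` are hypothetical names, and timing is reduced to a single `reset-timeout-elapsed` event.

```typescript
type CircuitState = "closed" | "open" | "half-open";
// Hypothetical event names used only for this sketch.
type CircuitEvent = "success" | "failure" | "reset-timeout-elapsed";

interface Snapshot {
  state: CircuitState;
  failures: number; // consecutive failures seen so far
}

// Returns the next snapshot for a given event; never mutates its input.
function transition(
  current: Snapshot,
  event: CircuitEvent,
  failureThreshold: number
): Snapshot {
  switch (current.state) {
    case "closed": {
      if (event === "success") return { state: "closed", failures: 0 };
      if (event === "failure") {
        const failures = current.failures + 1;
        const state = failures >= failureThreshold ? "open" : "closed";
        return { state, failures };
      }
      return current;
    }
    case "open":
      // While open, calls are short-circuited; only the timer moves the state.
      return event === "reset-timeout-elapsed"
        ? { state: "half-open", failures: current.failures }
        : current;
    case "half-open": {
      if (event === "success") return { state: "closed", failures: 0 };
      if (event === "failure") {
        return { state: "open", failures: current.failures + 1 };
      }
      return current;
    }
  }
}
```

Keeping the transitions in one pure function makes each edge easy to unit test without timers or a Redis instance.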
Implementation
1. Circuit Breaker Service
Create src/services/circuit-breaker.ts:
import { logger } from "./logger.js";
export interface CircuitBreakerConfig {
failureThreshold: number; // Open circuit after N failures
resetTimeout: number; // Try recovery after N ms
monitoringWindow: number; // Track failures over N ms
halfOpenMaxAttempts: number; // Max test attempts in half-open
}
export type CircuitState = "closed" | "open" | "half-open";
export interface CircuitBreakerStats {
state: CircuitState;
failureCount: number;
successCount: number;
lastFailureTime: Date | null;
lastStateChange: Date;
nextRetryTime: Date | null;
}
export class CircuitBreaker {
private state: CircuitState = "closed";
private failureCount = 0;
private successCount = 0;
private lastFailureTime: Date | null = null;
private lastStateChange = new Date();
private halfOpenAttempts = 0;
constructor(
private readonly name: string,
private readonly config: CircuitBreakerConfig
) {
logger.info("Circuit breaker initialized", {
name,
config,
category: "circuit-breaker",
});
}
/**
* Execute operation through circuit breaker
*/
async execute<T>(
operation: () => Promise<T>,
fallback: () => T
): Promise<T> {
// If circuit is open, fail fast with fallback
if (this.state === "open") {
if (this.shouldAttemptReset()) {
this.state = "half-open";
this.halfOpenAttempts = 0;
this.lastStateChange = new Date();
logger.info("Circuit breaker entering half-open state", {
name: this.name,
category: "circuit-breaker",
});
} else {
logger.debug("Circuit breaker open, using fallback", {
name: this.name,
category: "circuit-breaker",
});
return fallback();
}
}
// If half-open, limit test attempts
if (this.state === "half-open") {
this.halfOpenAttempts++;
if (this.halfOpenAttempts > this.config.halfOpenMaxAttempts) {
logger.warn("Circuit breaker half-open attempts exhausted", {
name: this.name,
attempts: this.halfOpenAttempts,
category: "circuit-breaker",
});
return fallback();
}
}
try {
const result = await operation();
this.onSuccess();
return result;
} catch (error) {
this.onFailure(error);
return fallback();
}
}
/**
* Record successful operation
*/
private onSuccess(): void {
this.failureCount = 0;
this.successCount++;
if (this.state === "half-open") {
this.state = "closed";
this.lastStateChange = new Date();
this.halfOpenAttempts = 0;
logger.info("Circuit breaker closed (recovered)", {
name: this.name,
successCount: this.successCount,
category: "circuit-breaker",
});
}
}
/**
* Record failed operation
*/
private onFailure(error: unknown): void {
this.failureCount++;
this.lastFailureTime = new Date();
logger.error("Circuit breaker recorded failure", {
name: this.name,
failureCount: this.failureCount,
threshold: this.config.failureThreshold,
error: error instanceof Error ? error.message : String(error),
category: "circuit-breaker",
});
if (
this.state === "closed" &&
this.failureCount >= this.config.failureThreshold
) {
this.state = "open";
this.lastStateChange = new Date();
logger.error("Circuit breaker opened", {
name: this.name,
failureCount: this.failureCount,
resetTimeout: this.config.resetTimeout,
category: "circuit-breaker",
});
} else if (this.state === "half-open") {
this.state = "open";
this.lastStateChange = new Date();
logger.warn("Circuit breaker re-opened during recovery", {
name: this.name,
category: "circuit-breaker",
});
}
}
/**
* Check if circuit should attempt reset
*/
private shouldAttemptReset(): boolean {
if (!this.lastFailureTime) {
return true;
}
const elapsed = Date.now() - this.lastFailureTime.getTime();
return elapsed >= this.config.resetTimeout;
}
/**
* Get current circuit breaker statistics
*/
getStats(): CircuitBreakerStats {
return {
state: this.state,
failureCount: this.failureCount,
successCount: this.successCount,
lastFailureTime: this.lastFailureTime,
lastStateChange: this.lastStateChange,
nextRetryTime: this.getNextRetryTime(),
};
}
/**
* Calculate next retry time for open circuit
*/
private getNextRetryTime(): Date | null {
if (this.state !== "open" || !this.lastFailureTime) {
return null;
}
return new Date(
this.lastFailureTime.getTime() + this.config.resetTimeout
);
}
/**
* Force circuit to closed state (for testing/admin)
*/
reset(): void {
this.state = "closed";
this.failureCount = 0;
this.successCount = 0;
this.lastFailureTime = null;
this.lastStateChange = new Date();
this.halfOpenAttempts = 0;
logger.info("Circuit breaker manually reset", {
name: this.name,
category: "circuit-breaker",
});
}
}
2. Redis Client Wrapper
Create src/services/redis-client.ts:
import { createClient, RedisClientType } from "redis";
import { config } from "../config/index.js";
import { logger } from "./logger.js";
import { CircuitBreaker } from "./circuit-breaker.js";
let redisClient: RedisClientType | null = null;
let redisCircuitBreaker: CircuitBreaker;
/**
* Initialize Redis client with connection handling
*/
export async function initializeRedis(): Promise<void> {
// Initialize circuit breaker
redisCircuitBreaker = new CircuitBreaker("redis", {
failureThreshold: 5,
resetTimeout: 30000, // 30 seconds
monitoringWindow: 60000, // 1 minute
halfOpenMaxAttempts: 3,
});
try {
redisClient = createClient({
url: config.dcr.redisUrl,
socket: {
reconnectStrategy: (retries) => {
// Exponential backoff: 100ms, 200ms, 400ms, ..., max 10s
const delay = Math.min(100 * Math.pow(2, retries), 10000);
logger.info("Redis reconnecting", {
retries,
delay,
category: "redis",
});
return delay;
},
},
});
// Connection event handlers
redisClient.on("connect", () => {
logger.info("Redis connected", { category: "redis" });
});
redisClient.on("ready", () => {
logger.info("Redis ready", { category: "redis" });
redisCircuitBreaker.reset(); // Reset circuit on successful connection
});
redisClient.on("error", (error) => {
logger.error("Redis error", {
error: error.message,
category: "redis",
});
});
redisClient.on("reconnecting", () => {
logger.warn("Redis reconnecting", { category: "redis" });
});
redisClient.on("end", () => {
logger.warn("Redis connection closed", { category: "redis" });
});
await redisClient.connect();
logger.info("Redis initialized successfully", { category: "redis" });
} catch (error) {
logger.error("Failed to initialize Redis", {
error: error instanceof Error ? error.message : String(error),
category: "redis",
});
throw error;
}
}
/**
* Get the raw Redis client. This accessor adds no circuit breaker protection;
* wrap calls in executeRedisOperation to get it.
*/
export function getRedisClient(): RedisClientType {
if (!redisClient) {
throw new Error("Redis client not initialized");
}
return redisClient;
}
/**
* Execute Redis operation through circuit breaker
*/
export async function executeRedisOperation<T>(
operation: () => Promise<T>,
fallback: () => T,
operationName: string
): Promise<T> {
return redisCircuitBreaker.execute(
async () => {
try {
return await operation();
} catch (error) {
logger.error("Redis operation failed", {
operation: operationName,
error: error instanceof Error ? error.message : String(error),
category: "redis",
});
throw error;
}
},
() => {
logger.warn("Using fallback for Redis operation", {
operation: operationName,
circuitState: redisCircuitBreaker.getStats().state,
category: "redis",
});
return fallback();
}
);
}
/**
* Get circuit breaker stats
*/
export function getRedisCircuitBreakerStats() {
return redisCircuitBreaker?.getStats();
}
/**
* Close Redis connection gracefully
*/
export async function closeRedis(): Promise<void> {
if (redisClient) {
await redisClient.quit();
logger.info("Redis connection closed", { category: "redis" });
}
}
3. Update Session Store
Modify src/services/session-store.ts:
import { executeRedisOperation, getRedisClient } from "./redis-client.js";
import { logger } from "./logger.js";
export class RedisSessionStore implements SessionStore {
// ... existing code
async get(sessionId: string): Promise<SessionMetadata | null> {
return executeRedisOperation(
async () => {
const client = getRedisClient();
const data = await client.get(this.getKey(sessionId));
if (!data) {
return null;
}
try {
return JSON.parse(data);
} catch (error) {
logger.error("Failed to parse session data", {
sessionId,
error: error instanceof Error ? error.message : String(error),
category: "session-store",
});
return null;
}
},
() => {
// Fallback: return null (session not found)
logger.warn("Session lookup fallback - treating as not found", {
sessionId,
category: "session-store",
});
return null;
},
"session-get"
);
}
async set(
sessionId: string,
metadata: SessionMetadata
): Promise<void> {
return executeRedisOperation(
async () => {
const client = getRedisClient();
const key = this.getKey(sessionId);
await client.setEx(key, this.ttl, JSON.stringify(metadata));
logger.debug("Session stored", {
sessionId,
ttl: this.ttl,
category: "session-store",
});
},
() => {
// Fallback: log error but don't fail request
logger.error("Failed to store session - continuing without Redis", {
sessionId,
category: "session-store",
});
},
"session-set"
);
}
async touch(sessionId: string): Promise<void> {
return executeRedisOperation(
async () => {
const client = getRedisClient();
const key = this.getKey(sessionId);
await client.expire(key, this.ttl);
},
() => {
// Fallback: silent failure (TTL not refreshed)
logger.debug("Failed to refresh session TTL", {
sessionId,
category: "session-store",
});
},
"session-touch"
);
}
async delete(sessionId: string): Promise<void> {
return executeRedisOperation(
async () => {
const client = getRedisClient();
await client.del(this.getKey(sessionId));
logger.info("Session deleted", {
sessionId,
category: "session-store",
});
},
() => {
// Fallback: log warning (deletion will happen via TTL eventually)
logger.warn("Failed to delete session from Redis", {
sessionId,
category: "session-store",
});
},
"session-delete"
);
}
}
4. Update Rate Limiter
Modify src/middleware/distributed-rate-limit.ts:
// Assumes an Express app: RequestHandler is Express's middleware type
import type { RequestHandler } from "express";
import { executeRedisOperation } from "../services/redis-client.js";
import { logger } from "../services/logger.js";
export function createRateLimiter(options: RateLimiterOptions): RequestHandler {
  return async (req, res, next) => {
    const key = getKey(req);
    const allowed = await executeRedisOperation(
      async () => {
        // ... existing rate limit logic with Redis
        return checkRateLimitWithRedis(key);
      },
      () => {
        // Fallback: allow request (fail open)
        logger.warn("Rate limit check bypassed - Redis unavailable", {
          key,
          category: "rate-limit",
        });
        return true;
      },
      "rate-limit-check"
    );
    if (!allowed) {
      res.status(429).json({
        jsonrpc: "2.0",
        error: {
          code: -32003,
          message: "Rate limit exceeded",
          data: { reason: "too_many_requests" },
        },
        id: null,
      });
      // Return nothing so the handler resolves to void
      return;
    }
    next();
  };
}
5. Health Check Integration
Add circuit breaker status to health check (src/routes/health.ts):
import { getRedisCircuitBreakerStats } from "../services/redis-client.js";
healthRouter.get("/ready", async (_req, res) => {
  const redisStats = getRedisCircuitBreakerStats();
  const checks = {
    redis: {
      // Stats are undefined until initializeRedis() has run; report that as unhealthy
      healthy: redisStats?.state === "closed",
      state: redisStats?.state ?? "uninitialized",
      failureCount: redisStats?.failureCount ?? 0,
      lastFailure: redisStats?.lastFailureTime ?? null,
    },
    // ... other checks
  };
  const allHealthy = Object.values(checks).every((c) => c.healthy);
  const status = allHealthy ? "ready" : "degraded";
  const httpStatus = allHealthy ? 200 : 503;
  res.status(httpStatus).json({
    status,
    version: config.server.version,
    checks,
  });
});
Configuration
Environment Variables
# Circuit breaker configuration (optional, uses defaults)
REDIS_CIRCUIT_BREAKER_FAILURE_THRESHOLD=5 # Open after N failures
REDIS_CIRCUIT_BREAKER_RESET_TIMEOUT=30000 # Try recovery after 30s
REDIS_CIRCUIT_BREAKER_HALF_OPEN_ATTEMPTS=3 # Max test attempts
Add to src/config/redis.ts:
export const redisConfig = {
  url: process.env.REDIS_URL ?? "redis://localhost:6379",
  circuitBreaker: {
    // Always pass radix 10 so values are parsed as decimal
    failureThreshold: parseInt(
      process.env.REDIS_CIRCUIT_BREAKER_FAILURE_THRESHOLD ?? "5",
      10
    ),
    resetTimeout: parseInt(
      process.env.REDIS_CIRCUIT_BREAKER_RESET_TIMEOUT ?? "30000",
      10
    ),
    halfOpenMaxAttempts: parseInt(
      process.env.REDIS_CIRCUIT_BREAKER_HALF_OPEN_ATTEMPTS ?? "3",
      10
    ),
  },
};
Edge Cases
1. Redis Down at Startup
Behavior: Server fails to start (expected)
- Redis is required for core functionality
- Fail fast instead of starting in degraded state
- Add retry logic with exponential backoff in production
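The startup retry mentioned in the last bullet could look roughly like the following. It is a sketch under assumptions: `backoffDelayMs` and `connectWithRetry` are hypothetical helpers, the delay formula simply mirrors the reconnect strategy used in the client wrapper (100 ms doubling up to 10 s), and `connect` stands in for whatever actually opens the Redis connection.

```typescript
// Delay before retry number `attempt` (0-based): 100ms, 200ms, 400ms, ..., capped at 10s.
function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 10_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Default sleep; injectable so tests do not have to wait in real time.
const realSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls `connect` up to `maxAttempts` times, sleeping between failed attempts.
// Rethrows the last error if every attempt fails, so startup still fails fast
// once the retry budget is spent.
async function connectWithRetry(
  connect: () => Promise<void>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void> = realSleep
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return;
    } catch (error) {
      lastError = error;
      // No point sleeping after the final attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError ?? new Error("maxAttempts must be at least 1");
}
```

A bounded `maxAttempts` keeps the "fail fast" property: the process still exits with an error if Redis never becomes reachable.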
2. Redis Connection Lost During Request
Behavior: Circuit breaker opens, fallback used
- Session lookups return "not found" → 404 response
- Rate limiting allows requests (fail open)
- Token operations log errors and continue
3. Redis Recovers from Outage
Behavior: Circuit breaker transitions to half-open
- Tests connection with limited attempts
- On success, transitions to closed
- Normal operation resumes automatically
4. Intermittent Redis Failures
Behavior: Circuit breaker prevents thrashing
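- Consecutive failures are counted and the count resets on any success (monitoringWindow is part of the configuration, but the CircuitBreaker code in this document does not read it)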
- Circuit opens after threshold reached
- Prevents cascading retry storms
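The `CircuitBreaker` class in this document accepts `monitoringWindow` but counts consecutive failures rather than failures within a time window. If windowed counting is wanted, one possible approach is a small sliding-window counter like the sketch below. `SlidingWindowFailureCounter` is a hypothetical helper, not existing code, and the `now` parameter is injectable purely to make it testable.

```typescript
// Counts failures that occurred within the last `windowMs` milliseconds.
class SlidingWindowFailureCounter {
  private timestamps: number[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Records a failure at `now` and returns the number of failures in the window.
  record(now: number = Date.now()): number {
    this.timestamps.push(now);
    return this.count(now);
  }

  // Drops entries older than the window, then returns how many remain.
  count(now: number = Date.now()): number {
    const cutoff = now - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length;
  }

  // Clears all recorded failures, e.g. when the circuit closes again.
  clear(): void {
    this.timestamps = [];
  }
}
```

A breaker could then open when `record()` returns a value at or above `failureThreshold`, so that sparse, intermittent failures age out instead of accumulating indefinitely.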
Testing
Unit Tests
describe("Circuit Breaker", () => {
it("should execute operation when circuit is closed", async () => {
const breaker = new CircuitBreaker("test", {
failureThreshold: 3,
resetTimeout: 1000,
monitoringWindow: 5000,
halfOpenMaxAttempts: 2,
});
const result = await breaker.execute(
async () => "success",
() => "fallback"
);
expect(result).toBe("success");
expect(breaker.getStats().state).toBe("closed");
});
it("should open circuit after failure threshold", async () => {
const breaker = new CircuitBreaker("test", {
failureThreshold: 3,
resetTimeout: 1000,
monitoringWindow: 5000,
halfOpenMaxAttempts: 2,
});
// Trigger 3 failures
for (let i = 0; i < 3; i++) {
await breaker.execute(
async () => {
throw new Error("Redis error");
},
() => "fallback"
);
}
expect(breaker.getStats().state).toBe("open");
});
it("should use fallback when circuit is open", async () => {
const breaker = new CircuitBreaker("test", {
failureThreshold: 1,
resetTimeout: 1000,
monitoringWindow: 5000,
halfOpenMaxAttempts: 2,
});
// Trigger circuit open
await breaker.execute(
async () => {
throw new Error("Redis error");
},
() => "fallback"
);
// Next call should use fallback immediately
const result = await breaker.execute(
async () => "success",
() => "fallback"
);
expect(result).toBe("fallback");
});
});
Acceptance Criteria
- [ ] Circuit breaker service implemented with configurable thresholds
- [ ] Redis client wrapper with connection handling
- [ ] Session store operations protected by circuit breaker
- [ ] Token store operations protected by circuit breaker
- [ ] Rate limiter fails open with circuit breaker protection
- [ ] Health check exposes circuit breaker state
- [ ] Graceful fallbacks for all Redis operations
- [ ] Comprehensive logging for circuit state changes
- [ ] Unit tests with >90% coverage
- [ ] Integration tests with Redis container
- [ ] Documentation updated
Metrics
# Circuit breaker state (closed=0, half-open=1, open=2)
circuit_breaker_state{name="redis"} 0
# Failure count in current window
circuit_breaker_failures_total{name="redis"} 5
# Success count since last reset
circuit_breaker_successes_total{name="redis"} 1234
# Redis operations by result
redis_operations_total{operation="session-get",result="success"} 1000
redis_operations_total{operation="session-get",result="fallback"} 10
redis_operations_total{operation="rate-limit-check",result="fallback"} 25
Related Enhancements
- Health Check Improvements - Enhanced dependency validation
- Graceful Shutdown - Proper Redis connection cleanup
- Configuration Validation - Validate Redis URL at startup
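For the configuration validation item, a startup check could be as small as the sketch below. It assumes that only the `redis:` and `rediss:` URL schemes are acceptable and uses the standard WHATWG `URL` class. `validateRedisUrl` is a hypothetical helper, not part of the current code.

```typescript
// Parses a URL string, returning null instead of throwing on malformed input.
function tryParseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null;
  }
}

// Validates a Redis connection URL and returns the parsed form.
// Error messages avoid echoing the raw value because it may contain credentials.
function validateRedisUrl(raw: string): URL {
  const parsed = tryParseUrl(raw);
  if (parsed === null) {
    throw new Error("Invalid Redis URL: not a parseable URL");
  }
  // Assumption: plain and TLS Redis schemes are the only ones accepted.
  if (parsed.protocol !== "redis:" && parsed.protocol !== "rediss:") {
    throw new Error(`Invalid Redis URL: unsupported scheme "${parsed.protocol}"`);
  }
  if (parsed.hostname === "") {
    throw new Error("Invalid Redis URL: missing host");
  }
  return parsed;
}
```

Calling this before the Redis client is created would surface a misconfigured `REDIS_URL` as a clear startup error rather than as a connection failure later on.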