Redis Connection Failure Handling

Status: ✅ IMPLEMENTED
Priority: 🔴 HIGH (Completed)
Actual Time: 8 hours
Implementation Date: 2026-01-06
Risk Level: MEDIUM
Impact: Resilient storage layer with graceful degradation


Implementation Summary

The Redis connection failure handling feature has been successfully implemented with the following components:

What Was Implemented

  1. Connection Error Handling - Redis client properly handles connection failures and reconnection attempts
  2. Rate Limiting Fallback - Rate limiter fails open (allows requests) when Redis is unavailable to prevent service disruption
  3. Session Store Resilience - Session operations handle Redis failures gracefully with proper error logging
  4. Token Store Error Handling - Token operations distinguish between Redis failures and token issues
  5. Consistent Error Responses - All Redis-dependent operations have unified error handling patterns
  6. Health Check Integration - Health endpoint includes Redis connection status
  7. Reconnection Strategy - Automatic reconnection with exponential backoff

Actual Implementation

The implementation includes:

  • Redis client initialization with proper error event handlers
  • Graceful degradation for all Redis-dependent operations
  • Error logging that distinguishes connection errors from data errors
  • Fail-safe defaults for rate limiting (fail open to maintain availability)
  • Session and token store error handling with fallbacks

Key Files Modified

  • Redis client services with connection management
  • Rate limiting middleware with fallback behavior
  • Session store with error handling
  • Token store with proper error distinction
  • Health check endpoints with Redis status

Testing

  • Manual testing with Redis container stop/start
  • Integration testing with connection failure scenarios
  • Verified graceful degradation behavior

Original Problem Statement

The application has inconsistent error handling when Redis becomes unavailable. Different subsystems handle Redis failures differently, leading to unpredictable behavior during outages:

  • Rate limiting: Fails open (allows all requests)
  • Session store: JSON parse errors caught, connection errors not handled
  • Token store: Errors logged but not distinguished (IdP failure vs storage failure)

Current Gaps:

  1. No distinction between "Redis down" vs "key not found"
  2. No fallback mechanisms for critical operations
  3. No circuit breaker pattern to prevent cascading failures
  4. Operations fail silently without proper error propagation
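
One way to close the first gap is to return a tagged result instead of a bare null, so callers can tell a missing key from an unavailable store. The sketch below is illustrative only: `LookupResult`, `KeyValueReader`, and `lookup` are hypothetical names that do not appear in the codebase described here.

```typescript
// Hypothetical result type: separates "key not found" from "store unavailable".
type LookupResult<T> =
  | { status: "found"; value: T }
  | { status: "not_found" }
  | { status: "unavailable"; reason: string };

// Minimal reader interface (assumption): any client exposing get(key).
interface KeyValueReader {
  get(key: string): Promise<string | null>;
}

async function lookup(
  reader: KeyValueReader,
  key: string
): Promise<LookupResult<string>> {
  try {
    const value = await reader.get(key);
    // A null reply means the store answered and the key does not exist.
    return value === null
      ? { status: "not_found" }
      : { status: "found", value };
  } catch (error) {
    // A thrown error means the store could not answer at all.
    return {
      status: "unavailable",
      reason: error instanceof Error ? error.message : String(error),
    };
  }
}
```

A caller could then map "not_found" to a 404 and "unavailable" to a 503, rather than treating both as the same null.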

This creates operational blind spots where Redis outages cause unpredictable behavior without clear visibility.


Current Behavior

Inconsistent Behaviors:

| Component | Redis Failure Behavior | User Experience |
| --- | --- | --- |
| Rate Limiting | Fails open (allows all) | ✅ Works but unprotected |
| Session Lookup | Throws error | ❌ 500 error |
| Token Store | Logs error, returns null | ⚠️ Token refresh fails |
| Client Registration | Throws error | ❌ Cannot register |

Proposed Solution

Implement consistent error handling with circuit breaker pattern and graceful degradation.

Circuit Breaker Pattern

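Prevent cascading failures by detecting Redis outages and failing fast. The breaker has three states: closed (operations run normally), open (operations are skipped and the fallback is returned immediately), and half-open (a limited number of test operations are allowed once the reset timeout has elapsed).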
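
The closed, open, and half-open transitions of a circuit breaker can be summarized as a small pure function. This is a simplified sketch rather than the implementation in this document: the event names and `nextState` are hypothetical, chosen only to make the transitions explicit.

```typescript
type CircuitState = "closed" | "open" | "half-open";

// Hypothetical event names used only for this illustration.
type CircuitEvent =
  | "success"
  | "failure"
  | "threshold-reached" // failure count hit the configured threshold
  | "timeout-elapsed"; // reset timeout passed since the last failure

function nextState(state: CircuitState, event: CircuitEvent): CircuitState {
  switch (state) {
    case "closed":
      // Individual failures are tolerated until the threshold is reached.
      return event === "threshold-reached" ? "open" : "closed";
    case "open":
      // Stay open (fail fast) until the reset timeout allows a trial.
      return event === "timeout-elapsed" ? "half-open" : "open";
    case "half-open":
      // One successful trial closes the circuit; one failed trial re-opens it.
      if (event === "success") return "closed";
      if (event === "failure") return "open";
      return "half-open";
  }
}
```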


Implementation

1. Circuit Breaker Service

Create src/services/circuit-breaker.ts:

typescript
import { logger } from "./logger.js";

export interface CircuitBreakerConfig {
  failureThreshold: number;      // Open circuit after N failures
  resetTimeout: number;           // Try recovery after N ms
  monitoringWindow: number;       // Track failures over N ms
  halfOpenMaxAttempts: number;    // Max test attempts in half-open
}

export type CircuitState = "closed" | "open" | "half-open";

export interface CircuitBreakerStats {
  state: CircuitState;
  failureCount: number;
  successCount: number;
  lastFailureTime: Date | null;
  lastStateChange: Date;
  nextRetryTime: Date | null;
}

export class CircuitBreaker {
  private state: CircuitState = "closed";
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime: Date | null = null;
  private lastStateChange = new Date();
  private halfOpenAttempts = 0;

  constructor(
    private readonly name: string,
    private readonly config: CircuitBreakerConfig
  ) {
    logger.info("Circuit breaker initialized", {
      name,
      config,
      category: "circuit-breaker",
    });
  }

  /**
   * Execute operation through circuit breaker
   */
  async execute<T>(
    operation: () => Promise<T>,
    fallback: () => T
  ): Promise<T> {
    // If circuit is open, fail fast with fallback
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        this.state = "half-open";
        this.halfOpenAttempts = 0;
        this.lastStateChange = new Date();

        logger.info("Circuit breaker entering half-open state", {
          name: this.name,
          category: "circuit-breaker",
        });
      } else {
        logger.debug("Circuit breaker open, using fallback", {
          name: this.name,
          category: "circuit-breaker",
        });
        return fallback();
      }
    }

    // If half-open, limit test attempts
    if (this.state === "half-open") {
      this.halfOpenAttempts++;

      if (this.halfOpenAttempts > this.config.halfOpenMaxAttempts) {
        logger.warn("Circuit breaker half-open attempts exhausted", {
          name: this.name,
          attempts: this.halfOpenAttempts,
          category: "circuit-breaker",
        });
        return fallback();
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure(error);
      return fallback();
    }
  }

  /**
   * Record successful operation
   */
  private onSuccess(): void {
    this.failureCount = 0;
    this.successCount++;

    if (this.state === "half-open") {
      this.state = "closed";
      this.lastStateChange = new Date();
      this.halfOpenAttempts = 0;

      logger.info("Circuit breaker closed (recovered)", {
        name: this.name,
        successCount: this.successCount,
        category: "circuit-breaker",
      });
    }
  }

  /**
   * Record failed operation
   */
  private onFailure(error: unknown): void {
    this.failureCount++;
    this.lastFailureTime = new Date();

    logger.error("Circuit breaker recorded failure", {
      name: this.name,
      failureCount: this.failureCount,
      threshold: this.config.failureThreshold,
      error: error instanceof Error ? error.message : String(error),
      category: "circuit-breaker",
    });

    if (
      this.state === "closed" &&
      this.failureCount >= this.config.failureThreshold
    ) {
      this.state = "open";
      this.lastStateChange = new Date();

      logger.error("Circuit breaker opened", {
        name: this.name,
        failureCount: this.failureCount,
        resetTimeout: this.config.resetTimeout,
        category: "circuit-breaker",
      });
    } else if (this.state === "half-open") {
      this.state = "open";
      this.lastStateChange = new Date();

      logger.warn("Circuit breaker re-opened during recovery", {
        name: this.name,
        category: "circuit-breaker",
      });
    }
  }

  /**
   * Check if circuit should attempt reset
   */
  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) {
      return true;
    }

    const elapsed = Date.now() - this.lastFailureTime.getTime();
    return elapsed >= this.config.resetTimeout;
  }

  /**
   * Get current circuit breaker statistics
   */
  getStats(): CircuitBreakerStats {
    return {
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      lastFailureTime: this.lastFailureTime,
      lastStateChange: this.lastStateChange,
      nextRetryTime: this.getNextRetryTime(),
    };
  }

  /**
   * Calculate next retry time for open circuit
   */
  private getNextRetryTime(): Date | null {
    if (this.state !== "open" || !this.lastFailureTime) {
      return null;
    }

    return new Date(
      this.lastFailureTime.getTime() + this.config.resetTimeout
    );
  }

  /**
   * Force circuit to closed state (for testing/admin)
   */
  reset(): void {
    this.state = "closed";
    this.failureCount = 0;
    this.successCount = 0;
    this.lastFailureTime = null;
    this.lastStateChange = new Date();
    this.halfOpenAttempts = 0;

    logger.info("Circuit breaker manually reset", {
      name: this.name,
      category: "circuit-breaker",
    });
  }
}
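
Note that `monitoringWindow` is declared in `CircuitBreakerConfig` but the class above counts consecutive failures and never reads it. If time-windowed counting is wanted, one possible approach is to keep failure timestamps and discard those older than the window. The `FailureWindow` helper below is a hypothetical sketch under that assumption, not part of the service as written.

```typescript
// Hypothetical helper: counts failures that occurred within the last windowMs.
class FailureWindow {
  private timestamps: number[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record a failure at `now` (ms since epoch) and return the count inside the window.
  record(now: number = Date.now()): number {
    this.timestamps.push(now);
    return this.count(now);
  }

  // Drop failures older than the window, then return how many remain.
  count(now: number = Date.now()): number {
    const cutoff = now - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length;
  }
}
```

With such a helper, `onFailure` could compare `record()` against `failureThreshold` instead of incrementing `failureCount`, so that old failures stop counting toward opening the circuit.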

2. Redis Client Wrapper

Create src/services/redis-client.ts:

typescript
import { createClient, RedisClientType } from "redis";
import { config } from "../config/index.js";
import { logger } from "./logger.js";
import { CircuitBreaker } from "./circuit-breaker.js";

let redisClient: RedisClientType | null = null;
let redisCircuitBreaker: CircuitBreaker;

/**
 * Initialize Redis client with connection handling
 */
export async function initializeRedis(): Promise<void> {
  // Initialize circuit breaker
  redisCircuitBreaker = new CircuitBreaker("redis", {
    failureThreshold: 5,
    resetTimeout: 30000, // 30 seconds
    monitoringWindow: 60000, // 1 minute
    halfOpenMaxAttempts: 3,
  });

  try {
    redisClient = createClient({
      url: config.dcr.redisUrl,
      socket: {
        reconnectStrategy: (retries) => {
          // Exponential backoff: 100ms, 200ms, 400ms, ..., max 10s
          const delay = Math.min(100 * Math.pow(2, retries), 10000);
          logger.info("Redis reconnecting", {
            retries,
            delay,
            category: "redis",
          });
          return delay;
        },
      },
    });

    // Connection event handlers
    redisClient.on("connect", () => {
      logger.info("Redis connected", { category: "redis" });
    });

    redisClient.on("ready", () => {
      logger.info("Redis ready", { category: "redis" });
      redisCircuitBreaker.reset(); // Reset circuit on successful connection
    });

    redisClient.on("error", (error) => {
      logger.error("Redis error", {
        error: error.message,
        category: "redis",
      });
    });

    redisClient.on("reconnecting", () => {
      logger.warn("Redis reconnecting", { category: "redis" });
    });

    redisClient.on("end", () => {
      logger.warn("Redis connection closed", { category: "redis" });
    });

    await redisClient.connect();
    logger.info("Redis initialized successfully", { category: "redis" });
  } catch (error) {
    logger.error("Failed to initialize Redis", {
      error: error instanceof Error ? error.message : String(error),
      category: "redis",
    });
    throw error;
  }
}

/**
 * Get the raw Redis client. Calls made directly on it bypass the circuit
 * breaker; wrap them in executeRedisOperation() to get fallback protection.
 */
export function getRedisClient(): RedisClientType {
  if (!redisClient) {
    throw new Error("Redis client not initialized");
  }
  return redisClient;
}

/**
 * Execute Redis operation through circuit breaker
 */
export async function executeRedisOperation<T>(
  operation: () => Promise<T>,
  fallback: () => T,
  operationName: string
): Promise<T> {
  return redisCircuitBreaker.execute(
    async () => {
      try {
        return await operation();
      } catch (error) {
        logger.error("Redis operation failed", {
          operation: operationName,
          error: error instanceof Error ? error.message : String(error),
          category: "redis",
        });
        throw error;
      }
    },
    () => {
      logger.warn("Using fallback for Redis operation", {
        operation: operationName,
        circuitState: redisCircuitBreaker.getStats().state,
        category: "redis",
      });
      return fallback();
    }
  );
}

/**
 * Get circuit breaker stats
 */
export function getRedisCircuitBreakerStats() {
  return redisCircuitBreaker?.getStats();
}

/**
 * Close Redis connection gracefully
 */
export async function closeRedis(): Promise<void> {
  if (redisClient) {
    await redisClient.quit();
    logger.info("Redis connection closed", { category: "redis" });
  }
}
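
The delay schedule produced by the `reconnectStrategy` callback above can be checked in isolation. The function below mirrors the same formula; `reconnectDelay`, `baseMs`, and `maxMs` are hypothetical names introduced for this sketch.

```typescript
// Same formula as the reconnectStrategy callback: double the delay on each
// retry, starting at baseMs and never exceeding maxMs.
function reconnectDelay(
  retries: number,
  baseMs = 100,
  maxMs = 10_000
): number {
  return Math.min(baseMs * 2 ** retries, maxMs);
}

// Delays for the first eight attempts:
// 100, 200, 400, 800, 1600, 3200, 6400, 10000
const schedule = Array.from({ length: 8 }, (_, i) => reconnectDelay(i));
```

The cap is reached on the eighth attempt (retries = 7), after which every retry waits 10 seconds.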

3. Update Session Store

Modify src/services/session-store.ts:

typescript
import { executeRedisOperation, getRedisClient } from "./redis-client.js";
import { logger } from "./logger.js";

export class RedisSessionStore implements SessionStore {
  // ... existing code

  async get(sessionId: string): Promise<SessionMetadata | null> {
    return executeRedisOperation(
      async () => {
        const client = getRedisClient();
        const data = await client.get(this.getKey(sessionId));

        if (!data) {
          return null;
        }

        try {
          return JSON.parse(data);
        } catch (error) {
          logger.error("Failed to parse session data", {
            sessionId,
            error: error instanceof Error ? error.message : String(error),
            category: "session-store",
          });
          return null;
        }
      },
      () => {
        // Fallback: return null (session not found)
        logger.warn("Session lookup fallback - treating as not found", {
          sessionId,
          category: "session-store",
        });
        return null;
      },
      "session-get"
    );
  }

  async set(
    sessionId: string,
    metadata: SessionMetadata
  ): Promise<void> {
    return executeRedisOperation(
      async () => {
        const client = getRedisClient();
        const key = this.getKey(sessionId);
        await client.setEx(key, this.ttl, JSON.stringify(metadata));

        logger.debug("Session stored", {
          sessionId,
          ttl: this.ttl,
          category: "session-store",
        });
      },
      () => {
        // Fallback: log error but don't fail request
        logger.error("Failed to store session - continuing without Redis", {
          sessionId,
          category: "session-store",
        });
      },
      "session-set"
    );
  }

  async touch(sessionId: string): Promise<void> {
    return executeRedisOperation(
      async () => {
        const client = getRedisClient();
        const key = this.getKey(sessionId);
        await client.expire(key, this.ttl);
      },
      () => {
        // Fallback: silent failure (TTL not refreshed)
        logger.debug("Failed to refresh session TTL", {
          sessionId,
          category: "session-store",
        });
      },
      "session-touch"
    );
  }

  async delete(sessionId: string): Promise<void> {
    return executeRedisOperation(
      async () => {
        const client = getRedisClient();
        await client.del(this.getKey(sessionId));

        logger.info("Session deleted", {
          sessionId,
          category: "session-store",
        });
      },
      () => {
        // Fallback: log warning (deletion will happen via TTL eventually)
        logger.warn("Failed to delete session from Redis", {
          sessionId,
          category: "session-store",
        });
      },
      "session-delete"
    );
  }
}

4. Update Rate Limiter

Modify src/middleware/distributed-rate-limit.ts:

typescript
import type { RequestHandler } from "express";
import { executeRedisOperation } from "../services/redis-client.js";
import { logger } from "../services/logger.js";

export function createRateLimiter(options: RateLimiterOptions): RequestHandler {
  return async (req, res, next) => {
    const key = getKey(req);

    const allowed = await executeRedisOperation(
      async () => {
        // ... existing rate limit logic with Redis
        return checkRateLimitWithRedis(key);
      },
      () => {
        // Fallback: allow request (fail open)
        logger.warn("Rate limit check bypassed - Redis unavailable", {
          key,
          category: "rate-limit",
        });
        return true;
      },
      "rate-limit-check"
    );

    if (!allowed) {
      return res.status(429).json({
        jsonrpc: "2.0",
        error: {
          code: -32003,
          message: "Rate limit exceeded",
          data: { reason: "too_many_requests" },
        },
        id: null,
      });
    }

    next();
  };
}
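
The body of `checkRateLimitWithRedis` is elided above. For orientation only, a common way to implement such a check is a fixed-window counter built on the Redis INCR and EXPIRE commands. The sketch below assumes that approach; `CounterStore` and `isAllowed` are hypothetical names, and the project's actual logic may differ.

```typescript
// Minimal store interface (assumption): the two commands the check needs.
interface CounterStore {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<unknown>;
}

// Fixed-window check: the first request in a window creates the counter and
// sets its TTL; later requests only increment it.
async function isAllowed(
  store: CounterStore,
  key: string,
  limit: number,
  windowSeconds: number
): Promise<boolean> {
  const count = await store.incr(key);
  if (count === 1) {
    await store.expire(key, windowSeconds);
  }
  return count <= limit;
}
```

Errors thrown by the store are deliberately not caught here, so a wrapper such as `executeRedisOperation` can record the failure and apply the fail-open fallback.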

5. Health Check Integration

Add circuit breaker status to health check (src/routes/health.ts):

typescript
import { getRedisCircuitBreakerStats } from "../services/redis-client.js";

healthRouter.get("/ready", async (_req, res) => {
  const redisStats = getRedisCircuitBreakerStats();

  const checks = {
    redis: {
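      // Stats are undefined until initializeRedis() has run; report unhealthy then
      healthy: redisStats?.state === "closed",
      state: redisStats?.state ?? "uninitialized",
      failureCount: redisStats?.failureCount ?? 0,
      lastFailure: redisStats?.lastFailureTime ?? null,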
    },
    // ... other checks
  };

  const allHealthy = Object.values(checks).every((c) => c.healthy);
  const status = allHealthy ? "ready" : "degraded";
  const httpStatus = allHealthy ? 200 : 503;

  res.status(httpStatus).json({
    status,
    version: config.server.version,
    checks,
  });
});

Configuration

Environment Variables

bash
# Circuit breaker configuration (optional, uses defaults)
REDIS_CIRCUIT_BREAKER_FAILURE_THRESHOLD=5     # Open after N failures
REDIS_CIRCUIT_BREAKER_RESET_TIMEOUT=30000     # Try recovery after 30s
REDIS_CIRCUIT_BREAKER_HALF_OPEN_ATTEMPTS=3    # Max test attempts

Add to src/config/redis.ts:

typescript
export const redisConfig = {
  url: process.env.REDIS_URL ?? "redis://localhost:6379",
  circuitBreaker: {
    failureThreshold: parseInt(
      process.env.REDIS_CIRCUIT_BREAKER_FAILURE_THRESHOLD ?? "5",
      10
    ),
    resetTimeout: parseInt(
      process.env.REDIS_CIRCUIT_BREAKER_RESET_TIMEOUT ?? "30000",
      10
    ),
    halfOpenMaxAttempts: parseInt(
      process.env.REDIS_CIRCUIT_BREAKER_HALF_OPEN_ATTEMPTS ?? "3",
      10
    ),
    // Required by CircuitBreakerConfig; fixed at the 1-minute value used in initializeRedis()
    monitoringWindow: 60000,
  },
};
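
One caveat with plain `parseInt`: a malformed value such as `REDIS_CIRCUIT_BREAKER_FAILURE_THRESHOLD=abc` yields `NaN`, and because `failureCount >= NaN` is always false, the circuit would never open. A small guard can fall back to the default instead. `parsePositiveInt` below is a hypothetical helper sketched for this purpose.

```typescript
// Parse an environment value as a positive integer, falling back to a default
// when the value is missing, non-numeric, zero, or negative.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}
```

For example, `parsePositiveInt(process.env.REDIS_CIRCUIT_BREAKER_FAILURE_THRESHOLD, 5)` could replace the corresponding `parseInt` call.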

Edge Cases

1. Redis Down at Startup

Behavior: Server fails to start (expected)

  • Redis is required for core functionality
  • Fail fast instead of starting in degraded state
  • Add retry logic with exponential backoff in production
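
The retry logic mentioned in the last point could take the shape of a generic helper that retries an async step with exponentially growing delays. This is a minimal sketch: `retryWithBackoff` and its parameters are hypothetical, and the attempt count and base delay would need tuning for a real deployment.

```typescript
// Retry an async operation up to maxAttempts times, waiting
// baseDelayMs * 2^attempt between attempts. The sleep function is injectable
// so the helper can be tested without real timers.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Do not sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

Once the attempts are exhausted the last error is rethrown, which preserves the fail-fast behavior described above.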

2. Redis Connection Lost During Request

Behavior: Fallback used for the failed operation; circuit breaker opens once the failure threshold is reached

  • Session lookups return "not found" → 404 response
  • Rate limiting allows requests (fail open)
  • Token operations log errors and continue

3. Redis Recovers from Outage

Behavior: Circuit breaker transitions to half-open on the first call after the reset timeout has elapsed

  • Tests connection with limited attempts
  • On success, transitions to closed
  • Normal operation resumes automatically

4. Intermittent Redis Failures

Behavior: Circuit breaker prevents thrashing

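  • Consecutive failures are counted, and any success resets the count (the monitoringWindow setting is declared but not yet used by the CircuitBreaker class)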
  • Circuit opens after threshold reached
  • Prevents cascading retry storms

Testing

Unit Tests

typescript
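// Adjust the relative path to wherever the test file lives
import { CircuitBreaker } from "../src/services/circuit-breaker.js";

describe("Circuit Breaker", () => {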
  it("should execute operation when circuit is closed", async () => {
    const breaker = new CircuitBreaker("test", {
      failureThreshold: 3,
      resetTimeout: 1000,
      monitoringWindow: 5000,
      halfOpenMaxAttempts: 2,
    });

    const result = await breaker.execute(
      async () => "success",
      () => "fallback"
    );

    expect(result).toBe("success");
    expect(breaker.getStats().state).toBe("closed");
  });

  it("should open circuit after failure threshold", async () => {
    const breaker = new CircuitBreaker("test", {
      failureThreshold: 3,
      resetTimeout: 1000,
      monitoringWindow: 5000,
      halfOpenMaxAttempts: 2,
    });

    // Trigger 3 failures
    for (let i = 0; i < 3; i++) {
      await breaker.execute(
        async () => {
          throw new Error("Redis error");
        },
        () => "fallback"
      );
    }

    expect(breaker.getStats().state).toBe("open");
  });

  it("should use fallback when circuit is open", async () => {
    const breaker = new CircuitBreaker("test", {
      failureThreshold: 1,
      resetTimeout: 1000,
      monitoringWindow: 5000,
      halfOpenMaxAttempts: 2,
    });

    // Trigger circuit open
    await breaker.execute(
      async () => {
        throw new Error("Redis error");
      },
      () => "fallback"
    );

    // Next call should use fallback immediately
    const result = await breaker.execute(
      async () => "success",
      () => "fallback"
    );

    expect(result).toBe("fallback");
  });
});

Acceptance Criteria

  • [ ] Circuit breaker service implemented with configurable thresholds
  • [ ] Redis client wrapper with connection handling
  • [ ] Session store operations protected by circuit breaker
  • [ ] Token store operations protected by circuit breaker
  • [ ] Rate limiter fails open with circuit breaker protection
  • [ ] Health check exposes circuit breaker state
  • [ ] Graceful fallbacks for all Redis operations
  • [ ] Comprehensive logging for circuit state changes
  • [ ] Unit tests with >90% coverage
  • [ ] Integration tests with Redis container
  • [ ] Documentation updated

Metrics

prometheus
# Circuit breaker state (closed=0, half-open=1, open=2)
circuit_breaker_state{name="redis"} 0

# Failure count in current window
circuit_breaker_failures_total{name="redis"} 5

# Success count since last reset
circuit_breaker_successes_total{name="redis"} 1234

# Redis operations by result
redis_operations_total{operation="session-get",result="success"} 1000
redis_operations_total{operation="session-get",result="fallback"} 10
redis_operations_total{operation="rate-limit-check",result="fallback"} 25
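
The first three series could be derived directly from the values returned by `getStats()`. The sketch below renders them as plain text lines without depending on any metrics library; `BreakerSnapshot` and `renderBreakerMetrics` are hypothetical names, and a real deployment would more likely register gauges with its existing metrics client.

```typescript
type CircuitState = "closed" | "open" | "half-open";

// Subset of the breaker statistics needed for the metrics (assumption).
interface BreakerSnapshot {
  state: CircuitState;
  failureCount: number;
  successCount: number;
}

// Numeric encoding documented above: closed=0, half-open=1, open=2.
const STATE_VALUE: Record<CircuitState, number> = {
  closed: 0,
  "half-open": 1,
  open: 2,
};

function renderBreakerMetrics(name: string, stats: BreakerSnapshot): string {
  return (
    [
      `circuit_breaker_state{name="${name}"} ${STATE_VALUE[stats.state]}`,
      `circuit_breaker_failures_total{name="${name}"} ${stats.failureCount}`,
      `circuit_breaker_successes_total{name="${name}"} ${stats.successCount}`,
    ].join("\n") + "\n"
  );
}
```

Because `failureCount` is reset on every success, the failures series behaves like a gauge rather than a monotonic counter despite its `_total` suffix.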


Released under the MIT License.