JWKS Discovery Retry

Feature | Status | Priority | Implementation Date
JWKS Discovery Retry | ✅ IMPLEMENTED | Medium | 2026-01-07

Overview

Implemented a retry mechanism with exponential backoff for JWKS (JSON Web Key Set) discovery failures. When the IdP (identity provider) is unavailable during the initial JWKS fetch, the server now schedules background retries instead of failing outright, so authentication succeeds once the IdP recovers.

Problem Statement

Before this enhancement, JWKS discovery failures had these issues:

  1. Server fails to start - If the IdP is unavailable at startup, the entire server fails to start
  2. No retry mechanism - Discovery errors are thrown immediately and never retried
  3. Authentication blocked - All authentication requests fail until a manual restart
  4. Poor resilience - The IdP is a single point of failure during maintenance or outages

Solution

Retry Strategy

Implemented exponential backoff retry mechanism:

  • Attempt 1: 1 second delay
  • Attempt 2: 2 seconds delay
  • Attempt 3: 4 seconds delay
  • Attempt 4: 8 seconds delay
  • Attempt 5+: 16 seconds delay (capped)

Key Features

  1. Background Retries - Initial failure triggers automatic background retry scheduling
  2. Exponential Backoff - Delays increase exponentially to avoid overwhelming IdP
  3. Automatic Recovery - Successful fetch cancels pending retries
  4. Graceful Degradation - Authentication fails gracefully with clear errors until JWKS available
  5. Clean Shutdown - Retry timers properly cleaned up on server stop

Implementation Details

src/services/jwks.ts

Added retry state tracking:

typescript
let discoveryRetryTimer: ReturnType<typeof setTimeout> | null = null;
let isRetryingDiscovery = false;

Exponential backoff calculation:

typescript
function getRetryDelay(attempt: number): number {
  const baseDelay = 1000; // 1 second
  const maxDelay = 16000; // 16 seconds
  const delay = baseDelay * Math.pow(2, attempt);
  return Math.min(delay, maxDelay);
}
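
To make the schedule concrete, the following sketch lists the delays produced by the formula above and the total time spent waiting before a given number of retries has fired. `retrySchedule` and `cumulativeDelay` are hypothetical helpers introduced here for illustration; they are not part of `src/services/jwks.ts`.

```typescript
// Same formula as getRetryDelay above, repeated so this snippet runs on its own.
function getRetryDelay(attempt: number): number {
  const baseDelay = 1000; // 1 second
  const maxDelay = 16000; // 16 seconds
  return Math.min(baseDelay * Math.pow(2, attempt), maxDelay);
}

// Hypothetical helper: delays (in ms) for the first `count` retries.
function retrySchedule(count: number): number[] {
  return Array.from({ length: count }, (_, attempt) => getRetryDelay(attempt));
}

// Hypothetical helper: total wait (in ms) accumulated over the first `count` retries.
function cumulativeDelay(count: number): number {
  return retrySchedule(count).reduce((sum, delay) => sum + delay, 0);
}

console.log(retrySchedule(6)); // [1000, 2000, 4000, 8000, 16000, 16000]
console.log(cumulativeDelay(10)); // 111000
```

Under this schedule the first ten retries span 111 seconds of waiting, not counting the time each request takes.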

Enhanced fetchJwks with retry scheduling:

typescript
async function fetchJwks(): Promise<JSONWebKeySet["keys"]> {
  try {
    // ... fetch logic ...

    // Cancel ongoing retries on success
    if (discoveryRetryTimer) {
      clearTimeout(discoveryRetryTimer);
      discoveryRetryTimer = null;
      isRetryingDiscovery = false;
    }

    return keys;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    logger.error("JWKS fetch failed", { error: message });

    // Schedule background retry if not already retrying
    if (!isRetryingDiscovery) {
      scheduleDiscoveryRetry(0);
    }

    throw error; // Still throw to inform caller
  }
}

Background retry scheduler:

typescript
function scheduleDiscoveryRetry(attempt: number): void {
  isRetryingDiscovery = true;
  const delay = getRetryDelay(attempt);

  logger.warn("Scheduling JWKS discovery retry", {
    attempt: attempt + 1,
    retryIn: `${String(delay)}ms`,
  });

  discoveryRetryTimer = setTimeout(() => {
    void (async () => {
      try {
        await refreshKeys();
        logger.info("JWKS discovery retry succeeded", { attempt: attempt + 1 });
        isRetryingDiscovery = false;
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        logger.error("JWKS discovery retry failed", {
          attempt: attempt + 1,
          error: message,
        });
        // Schedule next retry with incremented counter
        scheduleDiscoveryRetry(attempt + 1);
      }
    })();
  }, delay);
}
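
The interaction between the initial fetch and the background scheduler can be exercised in isolation. The sketch below is a simplified model, not the module's code: it assumes a hypothetical `createRetrier` factory that receives the fetch function and the timer as parameters (the real module uses module-level state and `setTimeout`), so the retry chain can be driven without waiting on real timers.

```typescript
type RetryCallback = () => Promise<void>;
type Schedule = (callback: RetryCallback, delayMs: number) => void;

// Hypothetical factory modelling fetchJwks + scheduleDiscoveryRetry.
function createRetrier(fetchKeys: () => Promise<string[]>, schedule: Schedule) {
  let retrying = false;
  const scheduledDelays: number[] = [];

  function scheduleRetry(attempt: number): void {
    retrying = true;
    const delay = Math.min(1000 * Math.pow(2, attempt), 16000);
    scheduledDelays.push(delay);
    schedule(async () => {
      try {
        await fetchKeys();
        retrying = false; // success ends the retry chain
      } catch {
        scheduleRetry(attempt + 1); // failure schedules the next, longer delay
      }
    }, delay);
  }

  async function fetchWithRetry(): Promise<string[]> {
    try {
      return await fetchKeys();
    } catch (error) {
      if (!retrying) scheduleRetry(0); // only one retry chain at a time
      throw error; // the caller still sees the failure
    }
  }

  return { fetchWithRetry, scheduledDelays, isRetrying: (): boolean => retrying };
}

// Drive the model: the initial fetch and two retries fail, the third retry succeeds.
async function demo(): Promise<{ delays: number[]; retrying: boolean }> {
  let calls = 0;
  const flakyFetch = async (): Promise<string[]> => {
    calls += 1;
    if (calls <= 3) throw new Error("IdP unavailable");
    return ["key-1"];
  };
  const pending: RetryCallback[] = [];
  const retrier = createRetrier(flakyFetch, (callback) => {
    pending.push(callback); // fake timer: queue instead of waiting
  });

  await retrier.fetchWithRetry().catch(() => undefined);
  while (pending.length > 0) {
    const next = pending.shift();
    if (next) await next();
  }
  return { delays: retrier.scheduledDelays, retrying: retrier.isRetrying() };
}

demo().then((result) => console.log(result)); // delays: [1000, 2000, 4000], retrying: false
```

Injecting the timer is only a convenience for this sketch; it keeps the example deterministic and says nothing about how the project's own tests are written.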

Updated cleanup functions:

typescript
function clearCache(): void {
  // ... existing cleanup ...
  if (discoveryRetryTimer) {
    clearTimeout(discoveryRetryTimer);
    discoveryRetryTimer = null;
  }
  isRetryingDiscovery = false;
}

function stop(): void {
  // ... existing cleanup ...
  if (discoveryRetryTimer) {
    clearTimeout(discoveryRetryTimer);
    discoveryRetryTimer = null;
    isRetryingDiscovery = false;
  }
  logger.info("JWKS service stopped");
}
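
One interleaving worth checking against the real module: if `stop()` runs while a retry's `refreshKeys()` call is still in flight, the timer has already fired, so `clearTimeout` has nothing to cancel, and a later failure would reach `scheduleDiscoveryRetry` again unless something guards against it. Whether `refreshKeys` already handles this is not shown here. A minimal sketch of one possible guard follows, assuming a hypothetical `stopped` flag and an injectable timer used only so the example runs without real delays:

```typescript
type RetryCallback = () => Promise<void>;
type Schedule = (callback: RetryCallback, delayMs: number) => void;

// Hypothetical factory; `stopped` is not a variable in the original module.
function createStoppableRetrier(refresh: () => Promise<void>, schedule: Schedule) {
  let stopped = false;
  let timersArmed = 0;

  function scheduleRetry(attempt: number): void {
    if (stopped) return; // never re-arm a timer after stop()
    timersArmed += 1;
    const delay = Math.min(1000 * Math.pow(2, attempt), 16000);
    schedule(async () => {
      try {
        await refresh();
      } catch {
        scheduleRetry(attempt + 1); // becomes a no-op once stopped
      }
    }, delay);
  }

  return {
    scheduleRetry,
    stop: (): void => {
      stopped = true;
    },
    timersArmed: (): number => timersArmed,
  };
}

// Simulate shutdown arriving while a retry request is in flight.
async function demoStopDuringRetry(): Promise<number> {
  const pending: RetryCallback[] = [];
  let requestStop: () => void = () => undefined;
  const refresh = async (): Promise<void> => {
    requestStop(); // stop() is called mid-request
    throw new Error("IdP unavailable");
  };
  const retrier = createStoppableRetrier(refresh, (callback) => {
    pending.push(callback);
  });
  requestStop = retrier.stop;

  retrier.scheduleRetry(0);
  const first = pending.shift();
  if (first) await first();
  return retrier.timersArmed();
}

demoStopDuringRetry().then((count) => console.log(count)); // 1
```

Only the first timer is armed; the failure that arrives after `stop()` does not schedule another retry.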

Test Coverage

Added 6 comprehensive test cases in src/services/jwks.test.ts:

  1. Schedule background retry - Verifies retry scheduled when initial fetch fails
  2. Exponential backoff - Tests retry delays: 1s → 2s → 4s → 8s
  3. Stop retrying on success - Confirms retries cancel when fetch succeeds
  4. Delay cap at 16 seconds - Validates maximum delay limit
  5. Clean up on clearCache - Ensures timers cleared properly
  6. Clean up on stop - Verifies graceful shutdown cancels retries

Test Results:

  • 759 total tests passing (6 new tests added)
  • 92.92% overall coverage (increased from 92.64%)
  • JWKS service coverage: 96.15%

Usage Example

Server Startup with Unavailable IdP

typescript
// Server starts successfully even if IdP unavailable
const server = app.listen(3000);
// Log: "Seed MCP server running on http://localhost:3000/mcp"

// Background: JWKS discovery fails
// Log: "JWKS fetch failed"
// Log: "Scheduling JWKS discovery retry", { attempt: 1, retryIn: "1000ms" }

// 1 second later: First retry fails
// Log: "JWKS discovery retry failed", { attempt: 1 }
// Log: "Scheduling JWKS discovery retry", { attempt: 2, retryIn: "2000ms" }

// 2 seconds later: Second retry succeeds (IdP back online)
// Log: "JWKS discovery retry succeeded", { attempt: 2 }

// Authentication now works normally

Authentication During Retry Period

typescript
// Before IdP recovers: Auth fails with clear error
const response = await request(app)
  .post("/mcp")
  .set("Authorization", "Bearer valid-jwt");

// Returns: 401 with descriptive error about JWKS unavailability

Configuration

No new environment variables required. Uses existing JWKS configuration:

bash
# Existing configuration continues to work
OIDC_ISSUER=https://auth.example.com
OIDC_JWKS_URL=                          # Optional: Override auto-discovery
OIDC_JWKS_CACHE_TTL_MS=3600000           # 1 hour cache
OIDC_JWKS_REFRESH_BEFORE_EXPIRY_MS=300000 # Refresh 5 min before expiry

Operational Benefits

  1. Higher Availability - Server continues running during IdP outages
  2. Automatic Recovery - No manual intervention required when IdP recovers
  3. Reduced Load - Exponential backoff prevents overwhelming recovering IdP
  4. Better Observability - Structured logging tracks retry attempts and delays
  5. Graceful Degradation - Clear error messages during outage period

Monitoring Recommendations

Track JWKS retry patterns in logs:

typescript
// Success after retries
{
  "level": "info",
  "message": "JWKS discovery retry succeeded",
  "attempt": 3
}

// Ongoing failures (investigate IdP)
{
  "level": "error",
  "message": "JWKS discovery retry failed",
  "attempt": 5,
  "error": "Failed to fetch OIDC discovery from https://auth.example.com/.well-known/openid-configuration: 503 Service Unavailable"
}

// Scheduled retries
{
  "level": "warn",
  "message": "Scheduling JWKS discovery retry",
  "attempt": 6,
  "retryIn": "16000ms"
}

Alert Recommendations:

  • Alert if JWKS retry attempts exceed 10 within 5 minutes
  • Alert if all retries fail for > 5 minutes (indicates IdP outage)
  • Monitor IdP availability independently to correlate with retry patterns
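
The first alert rule can be expressed as a small predicate over parsed log entries. This is a sketch under assumptions: the `LogEntry` shape and its `timestamp` field (epoch milliseconds) are hypothetical, since the log examples above show only `level`, `message`, `attempt`, and `error`. In practice the rule would more likely live in the log aggregation or alerting system.

```typescript
interface LogEntry {
  timestamp: number; // assumed field: epoch milliseconds
  level: string;
  message: string;
  attempt?: number;
}

const WINDOW_MS = 5 * 60 * 1000; // 5 minutes
const MAX_FAILURES = 10;

// True when more than MAX_FAILURES retry failures fall inside the window ending at `now`.
function exceedsRetryThreshold(entries: LogEntry[], now: number): boolean {
  const recentFailures = entries.filter(
    (entry) =>
      entry.message === "JWKS discovery retry failed" &&
      entry.timestamp <= now &&
      now - entry.timestamp <= WINDOW_MS,
  );
  return recentFailures.length > MAX_FAILURES;
}

// Example: eleven failures, 16 seconds apart, the latest one at `now`.
const now = Date.now();
const failures: LogEntry[] = Array.from({ length: 11 }, (_, i) => ({
  timestamp: now - i * 16000,
  level: "error",
  message: "JWKS discovery retry failed",
  attempt: 11 - i,
}));

console.log(exceedsRetryThreshold(failures, now)); // true
console.log(exceedsRetryThreshold(failures.slice(0, 10), now)); // false
```

With delays capped at 16 seconds, the eleventh consecutive failure lands roughly 127 seconds into an outage (1 + 2 + 4 + 8 + 7 × 16 seconds of waiting, ignoring request time), so this rule fires well before the 5-minute outage rule.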

Limitations

  1. No persistent state - Retry counter resets on server restart
  2. In-memory only - Retry state not shared across server instances
  3. Infinite retries - Will retry indefinitely until success (by design)
  4. No circuit breaker - Unlike Redis operations, no fail-fast after N attempts

These limitations are acceptable because:

  • JWKS discovery is critical for authentication
  • Infinite retries ensure eventual recovery once the IdP is restored
  • Server startup succeeds even with initial failure
  • Each instance handles retries independently

Implementation Metrics

  • Estimated Effort: 2-3 hours
  • Actual Effort: 2.5 hours
  • Files Modified: 2
    • src/services/jwks.ts - Core retry implementation
    • src/services/jwks.test.ts - Test coverage
  • Lines Added: ~100 (including tests)
  • Test Coverage Impact: +0.28% overall coverage

Future Enhancements

Potential improvements not currently planned:

  1. Configurable retry limits - Max attempts before giving up
  2. Persistent retry state - Share retry state via Redis across instances
  3. Health check integration - Expose retry status in /health/ready
  4. Metrics tracking - Prometheus metrics for retry attempts and success rate
  5. Circuit breaker pattern - Fail-fast after N consecutive failures
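
As an illustration of the first item, a retry limit could be layered on the existing delay calculation. This is a hypothetical sketch, not planned behavior: `nextRetryDelay` and `maxAttempts` are invented names, and returning `null` is just one way to tell a scheduler to stop.

```typescript
// Hypothetical: delay in ms for the given zero-based attempt,
// or null once `maxAttempts` retries have been used up.
function nextRetryDelay(attempt: number, maxAttempts: number): number | null {
  if (attempt >= maxAttempts) return null;
  return Math.min(1000 * Math.pow(2, attempt), 16000);
}

console.log(nextRetryDelay(0, 3)); // 1000
console.log(nextRetryDelay(2, 3)); // 4000
console.log(nextRetryDelay(3, 3)); // null
```

A scheduler using this would skip re-arming its timer on `null` and would need some other path, such as a restart or a manual trigger, to resume discovery.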

Released under the MIT License.