JWKS Discovery Retry
| Feature | Status | Priority | Implementation Date |
|---|---|---|---|
| JWKS Discovery Retry | ✅ IMPLEMENTED | Medium | 2026-01-07 |
Overview
Implemented a retry mechanism with exponential backoff for JWKS (JSON Web Key Set) discovery failures. When the IdP is unavailable during the initial JWKS fetch, the server now schedules background retries instead of failing outright, so authentication succeeds once the IdP recovers.
Problem Statement
Before this enhancement, JWKS discovery failures caused the following issues:
- Server failed to start - If the IdP was unavailable during startup, the entire server startup failed
- No retry mechanism - Discovery errors were thrown immediately with no retry
- Authentication blocked - All authentication requests failed until a manual restart
- Poor resilience - The IdP was a single point of failure during maintenance or outages
Solution
Retry Strategy
Retries use exponential backoff: the delay before each retry doubles, starting at 1 second and capped at 16 seconds:
- Attempt 1: 1 second delay
- Attempt 2: 2 seconds delay
- Attempt 3: 4 seconds delay
- Attempt 4: 8 seconds delay
- Attempt 5+: 16 seconds delay (capped)
Key Features
- Background Retries - Initial failure triggers automatic background retry scheduling
- Exponential Backoff - Delays increase exponentially to avoid overwhelming IdP
- Automatic Recovery - Successful fetch cancels pending retries
- Graceful Degradation - Authentication fails gracefully with clear errors until JWKS available
- Clean Shutdown - Retry timers properly cleaned up on server stop
Implementation Details
src/services/jwks.ts
Added retry state tracking:
let discoveryRetryTimer: ReturnType<typeof setTimeout> | null = null;
let isRetryingDiscovery = false;
Exponential backoff calculation:
// `attempt` is zero-based: attempt 0 waits 1s, attempt 4 and later wait 16s
function getRetryDelay(attempt: number): number {
  const baseDelay = 1000; // 1 second
  const maxDelay = 16000; // 16 seconds
  const delay = baseDelay * Math.pow(2, attempt);
  return Math.min(delay, maxDelay);
}
Enhanced fetchJwks with retry scheduling:
async function fetchJwks(): Promise<JSONWebKeySet["keys"]> {
  try {
    // ... fetch logic ...
    // Cancel ongoing retries on success
    if (discoveryRetryTimer) {
      clearTimeout(discoveryRetryTimer);
      discoveryRetryTimer = null;
      isRetryingDiscovery = false;
    }
    return keys;
  } catch (error) {
    // Normalize the thrown value into a loggable message
    const message = error instanceof Error ? error.message : String(error);
    logger.error("JWKS fetch failed", { error: message });
    // Schedule background retry if not already retrying
    if (!isRetryingDiscovery) {
      scheduleDiscoveryRetry(0);
    }
    throw error; // Still throw to inform caller
  }
}
Background retry scheduler:
function scheduleDiscoveryRetry(attempt: number): void {
  isRetryingDiscovery = true;
  const delay = getRetryDelay(attempt);
  logger.warn("Scheduling JWKS discovery retry", {
    attempt: attempt + 1,
    retryIn: `${String(delay)}ms`,
  });
  discoveryRetryTimer = setTimeout(() => {
    void (async () => {
      try {
        await refreshKeys();
        logger.info("JWKS discovery retry succeeded", { attempt: attempt + 1 });
        isRetryingDiscovery = false;
      } catch (error) {
        // Normalize the thrown value into a loggable message
        const message = error instanceof Error ? error.message : String(error);
        logger.error("JWKS discovery retry failed", {
          attempt: attempt + 1,
          error: message,
        });
        // Schedule next retry with incremented counter
        scheduleDiscoveryRetry(attempt + 1);
      }
    })();
  }, delay);
}
Updated cleanup functions:
function clearCache(): void {
  // ... existing cleanup ...
  if (discoveryRetryTimer) {
    clearTimeout(discoveryRetryTimer);
    discoveryRetryTimer = null;
  }
  isRetryingDiscovery = false;
}

function stop(): void {
  // ... existing cleanup ...
  if (discoveryRetryTimer) {
    clearTimeout(discoveryRetryTimer);
    discoveryRetryTimer = null;
    isRetryingDiscovery = false;
  }
  logger.info("JWKS service stopped");
}
Test Coverage
Added 6 test cases in src/services/jwks.test.ts:
- Schedule background retry - Verifies retry scheduled when initial fetch fails
- Exponential backoff - Tests retry delays: 1s → 2s → 4s → 8s
- Stop retrying on success - Confirms retries cancel when fetch succeeds
- Delay cap at 16 seconds - Validates maximum delay limit
- Clean up on clearCache - Ensures timers cleared properly
- Clean up on stop - Verifies graceful shutdown cancels retries
Test Results:
- 759 total tests passing (6 new tests added)
- 92.92% overall coverage (increased from 92.64%)
- JWKS service coverage: 96.15%
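The backoff and delay-cap cases listed above can be exercised without real timers by injecting the scheduler. The following is a minimal sketch of that idea, not the contents of src/services/jwks.test.ts: `FakeClock`, `createRetryLoop`, and `simulateOutage` are hypothetical names introduced for illustration, and the retry attempt is modeled as a synchronous callback, whereas the real service awaits `refreshKeys()`.

```typescript
// Hypothetical, self-contained sketch; these names are illustrative and are
// not taken from src/services/jwks.ts or its tests.
type Schedule = (callback: () => void, delayMs: number) => void;

// Same formula as getRetryDelay: 1s base, doubling, capped at 16s.
function getRetryDelay(attempt: number): number {
  return Math.min(1000 * Math.pow(2, attempt), 16000);
}

// Fake clock that records requested delays and runs callbacks on demand.
class FakeClock {
  readonly delays: number[] = [];
  private pending: Array<() => void> = [];

  schedule: Schedule = (callback, delayMs) => {
    this.delays.push(delayMs);
    this.pending.push(callback);
  };

  // Run the oldest pending callback; returns false when nothing is queued.
  runNext(): boolean {
    const callback = this.pending.shift();
    if (!callback) return false;
    callback();
    return true;
  }
}

// Retry loop: when an attempt fails, schedule the next one with a longer delay.
function createRetryLoop(attempt: () => boolean, schedule: Schedule) {
  const retry = (n: number): void => {
    schedule(() => {
      if (!attempt()) retry(n + 1);
    }, getRetryDelay(n));
  };
  return { start: (): void => retry(0) };
}

// Simulate an IdP whose next `failures` retries fail before one succeeds,
// and return every delay that was requested along the way.
function simulateOutage(failures: number): number[] {
  let remaining = failures;
  const attempt = (): boolean => {
    if (remaining > 0) {
      remaining -= 1;
      return false;
    }
    return true;
  };
  const clock = new FakeClock();
  createRetryLoop(attempt, clock.schedule).start();
  while (clock.runNext()) {
    // Drain callbacks until the loop stops rescheduling.
  }
  return clock.delays;
}
```

With six failing retries, `simulateOutage(6)` records delays of 1000, 2000, 4000, 8000, 16000, 16000, and 16000 ms, which covers both the doubling sequence and the 16-second cap. No further delay is recorded after the successful attempt, mirroring the "stop retrying on success" case.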
Usage Example
Server Startup with Unavailable IdP
// Server starts successfully even if IdP unavailable
const server = app.listen(3000);
// Log: "Seed MCP server running on http://localhost:3000/mcp"
// Background: JWKS discovery fails
// Log: "JWKS fetch failed"
// Log: "Scheduling JWKS discovery retry", { attempt: 1, retryIn: "1000ms" }
// 1 second later: First retry fails
// Log: "JWKS discovery retry failed", { attempt: 1 }
// Log: "Scheduling JWKS discovery retry", { attempt: 2, retryIn: "2000ms" }
// 2 seconds later: Second retry succeeds (IdP back online)
// Log: "JWKS discovery retry succeeded", { attempt: 2 }
// Authentication now works normally
Authentication During Retry Period
// Before IdP recovers: Auth fails with clear error
const response = await request(app)
  .post("/mcp")
  .set("Authorization", "Bearer valid-jwt");
// Returns: 401 with descriptive error about JWKS unavailability
Configuration
No new environment variables required. Uses existing JWKS configuration:
# Existing configuration continues to work
OIDC_ISSUER=https://auth.example.com
OIDC_JWKS_URL= # Optional: Override auto-discovery
OIDC_JWKS_CACHE_TTL_MS=3600000 # 1 hour cache
OIDC_JWKS_REFRESH_BEFORE_EXPIRY_MS=300000 # Refresh 5 min before expiry
Operational Benefits
- Higher Availability - Server continues running during IdP outages
- Automatic Recovery - No manual intervention required when IdP recovers
- Reduced Load - Exponential backoff prevents overwhelming recovering IdP
- Better Observability - Structured logging tracks retry attempts and delays
- Graceful Degradation - Clear error messages during outage period
Monitoring Recommendations
Track JWKS retry patterns in logs:
// Success after retries
{
  "level": "info",
  "message": "JWKS discovery retry succeeded",
  "attempt": 3
}
// Ongoing failures (investigate IdP)
{
  "level": "error",
  "message": "JWKS discovery retry failed",
  "attempt": 5,
  "error": "Failed to fetch OIDC discovery from https://auth.example.com/.well-known/openid-configuration: 503 Service Unavailable"
}
// Scheduled retries
{
  "level": "warn",
  "message": "Scheduling JWKS discovery retry",
  "attempt": 6,
  "retryIn": "16000ms"
}
Alert Recommendations:
- Alert if JWKS retry attempts exceed 10 within 5 minutes
- Alert if all retries fail for > 5 minutes (indicates IdP outage)
- Monitor IdP availability independently to correlate with retry patterns
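To relate these thresholds to the backoff schedule, it helps to estimate how many retries a continuous outage produces in a given window. The sketch below assumes every fetch fails instantly; real attempts add network latency, so actual counts will be somewhat lower. `retriesWithin` and `elapsedAtRetry` are hypothetical helpers for this calculation, not part of the service.

```typescript
// Hypothetical helpers for reasoning about alert thresholds; not part of
// src/services/jwks.ts. Assumes every fetch fails instantly.

// Same formula as getRetryDelay: 1s base, doubling, capped at 16s.
function getRetryDelay(attempt: number): number {
  return Math.min(1000 * Math.pow(2, attempt), 16000);
}

// Count the retries that fire within `windowMs` of the initial failure.
function retriesWithin(windowMs: number): number {
  let elapsed = 0;
  let count = 0;
  while (elapsed + getRetryDelay(count) <= windowMs) {
    elapsed += getRetryDelay(count);
    count += 1;
  }
  return count;
}

// Milliseconds after the initial failure at which retry `n` (1-based) fires.
function elapsedAtRetry(n: number): number {
  let elapsed = 0;
  for (let attempt = 0; attempt < n; attempt += 1) {
    elapsed += getRetryDelay(attempt);
  }
  return elapsed;
}
```

Under these assumptions, `retriesWithin(5 * 60 * 1000)` evaluates to 21 and `elapsedAtRetry(11)` to 127000. The "more than 10 attempts within 5 minutes" alert would therefore fire roughly two minutes into a continuous outage, ahead of the 5-minute outage alert.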
Limitations
- No persistent state - Retry counter resets on server restart
- In-memory only - Retry state not shared across server instances
- Infinite retries - Will retry indefinitely until success (by design)
- No circuit breaker - Unlike Redis operations, no fail-fast after N attempts
These limitations are acceptable because:
- JWKS discovery is critical for authentication
- Infinite retries ensure eventual recovery once the IdP is restored
- Server startup succeeds even with initial failure
- Each instance handles retries independently
Related Features
- JWKS Key Rotation - Graceful handling of key rotation during active requests
- Health Check Improvements - Health endpoints report JWKS cache status
- Configuration Validation - Validates OIDC configuration at startup
- Redis Connection Failure - Similar retry pattern for Redis
Implementation Metrics
- Estimated Effort: 2-3 hours
- Actual Effort: 2.5 hours
- Files Modified: 2
  - src/services/jwks.ts - Core retry implementation
  - src/services/jwks.test.ts - Test coverage
- Lines Added: ~100 (including tests)
- Test Coverage Impact: +0.28 percentage points overall coverage (92.64% to 92.92%)
Future Enhancements
Potential improvements not currently planned:
- Configurable retry limits - Max attempts before giving up
- Persistent retry state - Share retry state via Redis across instances
- Health check integration - Expose retry status in /health/ready
- Metrics tracking - Prometheus metrics for retry attempts and success rate
- Circuit breaker pattern - Fail-fast after N consecutive failures
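As an illustration of the first item, a retry limit could be layered on top of the existing delay calculation. This is only a sketch of one possible shape: `RetryPolicy`, `nextRetryDelay`, and the `maxAttempts` option are hypothetical and do not exist in the current implementation, which retries indefinitely by design.

```typescript
// Hypothetical sketch of a configurable retry limit; none of these names
// exist in the current implementation.
interface RetryPolicy {
  baseDelayMs: number;
  maxDelayMs: number;
  // Maximum number of retries; undefined keeps the retry-forever behavior.
  maxAttempts?: number;
}

// Defaults mirror the current schedule: 1s base, 16s cap, no limit.
const defaultPolicy: RetryPolicy = { baseDelayMs: 1000, maxDelayMs: 16000 };

// Returns the delay before retry `attempt` (zero-based), or null once the
// policy's limit has been reached and the caller should stop retrying.
function nextRetryDelay(
  attempt: number,
  policy: RetryPolicy = defaultPolicy,
): number | null {
  if (policy.maxAttempts !== undefined && attempt >= policy.maxAttempts) {
    return null;
  }
  return Math.min(policy.baseDelayMs * Math.pow(2, attempt), policy.maxDelayMs);
}
```

A scheduler built on this would stop rescheduling when `nextRetryDelay` returns null. Leaving `maxAttempts` unset preserves the current behavior, so such a limit could be introduced as an opt-in.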