Token Refresh Race Condition

Status: ✅ IMPLEMENTED (2026-01-06) Priority: 🔴 HIGH Implementation Time: 3-4 hours Risk Level: MEDIUM Impact: Prevent duplicate refresh attempts and IdP failures

← Back to Enhancements

Problem Statement

Multiple concurrent requests can trigger simultaneous token refresh attempts when a user's access token expires. Without synchronization, each request independently detects expiration and attempts to refresh the token, leading to:

Wasted IdP Requests: Multiple parallel refresh calls to the identity provider
Single-Use Refresh Token Failures: If IdP uses single-use refresh tokens, only the first succeeds
Unnecessary Load: Increased latency and resource consumption
Race Conditions: Inconsistent token state across requests

This issue is documented in IMPLEMENTATION_SUMMARY.md as a known limitation of the current token refresh implementation.

Current Behavior

Problem Scenarios:

Burst Traffic: User sends 5 requests simultaneously → 5 parallel refresh attempts
High-Frequency Polling: Claude Desktop polls multiple tools → duplicate refreshes
Single-Use Tokens: IdP invalidates refresh token after first use → subsequent requests fail
Race to Redis: Multiple requests write different tokens → last write wins, inconsistent state

Proposed Solution

Implement distributed locking using Redis to ensure only one request refreshes tokens per user at a time.

Lock-Based Coordination

Implementation

1. Refresh Lock Service

Create src/services/refresh-lock.ts:

typescript

import { getRedisClient } from "./redis-client.js";
import { logger } from "./logger.js";

export interface RefreshLockConfig {
  lockTTL: number;          // Lock expiration (safety timeout)
  waitTimeout: number;      // Max wait time to acquire lock
  pollInterval: number;     // How often to check lock status
}

export class RefreshLock {
  private readonly config: RefreshLockConfig;

  constructor(config?: Partial<RefreshLockConfig>) {
    this.config = {
      lockTTL: config?.lockTTL ?? 10000,        // 10 seconds
      waitTimeout: config?.waitTimeout ?? 5000,  // 5 seconds
      pollInterval: config?.pollInterval ?? 100, // 100ms
    };
  }

  /**
   * Attempt to acquire refresh lock for a user
   *
   * @param userSub - User subject identifier
   * @returns Lock token if acquired, null if already locked
   */
  async tryAcquire(userSub: string): Promise<string | null> {
    const redis = getRedisClient();
    const key = this.getLockKey(userSub);
    const lockValue = this.generateLockToken();

    try {
      // SET key value NX EX seconds
      // NX: Only set if key doesn't exist
      // EX: Set expiration in seconds
      const result = await redis.set(key, lockValue, {
        NX: true,
        EX: Math.ceil(this.config.lockTTL / 1000),
      });

      if (result === "OK") {
        logger.debug("Refresh lock acquired", {
          userSub,
          lockValue,
          ttl: this.config.lockTTL,
          category: "refresh-lock",
        });
        return lockValue;
      }

      logger.debug("Refresh lock already held", {
        userSub,
        category: "refresh-lock",
      });
      return null;
    } catch (error) {
      logger.error("Failed to acquire refresh lock", {
        userSub,
        error: error instanceof Error ? error.message : String(error),
        category: "refresh-lock",
      });
      // On Redis error, allow refresh to proceed (fail open)
      return this.generateLockToken();
    }
  }

  /**
   * Wait for lock to be released, then return
   *
   * Used by requests that arrive while another request is refreshing
   *
   * @param userSub - User subject identifier
   * @param maxWaitMs - Maximum time to wait (default from config)
   */
  async waitForRelease(
    userSub: string,
    maxWaitMs?: number
  ): Promise<boolean> {
    const redis = getRedisClient();
    const key = this.getLockKey(userSub);
    const timeout = maxWaitMs ?? this.config.waitTimeout;
    const startTime = Date.now();

    logger.debug("Waiting for refresh lock release", {
      userSub,
      timeout,
      category: "refresh-lock",
    });

    while (Date.now() - startTime < timeout) {
      try {
        const exists = await redis.exists(key);

        if (!exists) {
          logger.debug("Refresh lock released", {
            userSub,
            waitedMs: Date.now() - startTime,
            category: "refresh-lock",
          });
          return true;
        }

        // Wait before checking again
        await this.sleep(this.config.pollInterval);
      } catch (error) {
        logger.error("Error waiting for refresh lock", {
          userSub,
          error: error instanceof Error ? error.message : String(error),
          category: "refresh-lock",
        });
        // On Redis error, return true to allow request to proceed
        return true;
      }
    }

    logger.warn("Refresh lock wait timeout", {
      userSub,
      timeout,
      category: "refresh-lock",
    });
    return false;
  }

  /**
   * Release refresh lock
   *
   * @param userSub - User subject identifier
   * @param lockToken - Token returned from tryAcquire
   */
  async release(userSub: string, lockToken: string): Promise<void> {
    const redis = getRedisClient();
    const key = this.getLockKey(userSub);

    try {
      // Lua script for atomic check-and-delete
      // Only delete if value matches (prevents deleting another request's lock)
      const script = `
        if redis.call("get", KEYS[1]) == ARGV[1] then
          return redis.call("del", KEYS[1])
        else
          return 0
        end
      `;

      const result = await redis.eval(script, {
        keys: [key],
        arguments: [lockToken],
      });

      if (result === 1) {
        logger.debug("Refresh lock released", {
          userSub,
          lockToken,
          category: "refresh-lock",
        });
      } else {
        logger.warn("Refresh lock already released or expired", {
          userSub,
          lockToken,
          category: "refresh-lock",
        });
      }
    } catch (error) {
      logger.error("Failed to release refresh lock", {
        userSub,
        lockToken,
        error: error instanceof Error ? error.message : String(error),
        category: "refresh-lock",
      });
      // Don't throw - allow request to continue
    }
  }

  /**
   * Generate lock key for user
   */
  private getLockKey(userSub: string): string {
    return `refresh:lock:${userSub}`;
  }

  /**
   * Generate unique lock token
   */
  private generateLockToken(): string {
    return `${Date.now()}-${Math.random().toString(36).substring(7)}`;
  }

  /**
   * Sleep helper
   */
  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

// Singleton instance
let refreshLockInstance: RefreshLock | null = null;

export function getRefreshLock(): RefreshLock {
  if (!refreshLockInstance) {
    refreshLockInstance = new RefreshLock();
  }
  return refreshLockInstance;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204

2. Update Auth Middleware

Modify src/middleware/auth.ts to use refresh lock:

typescript

import { getRefreshLock } from "../services/refresh-lock.js";
import { tokenRefreshAttempts } from "../services/metrics.js";

export const requireAuth: RequestHandler = async (req, res, next) => {
  // ... existing token validation ...

  // Check if token should be refreshed proactively
  if (shouldRefreshToken(payload.exp)) {
    const refreshLock = getRefreshLock();
    const lockToken = await refreshLock.tryAcquire(payload.sub);

    if (lockToken) {
      // This request acquired the lock - perform refresh
      try {
        logger.info("Attempting proactive token refresh (lock acquired)", {
          userSub: payload.sub,
          expiresAt: new Date(payload.exp * 1000).toISOString(),
          category: "token-refresh",
        });

        const newAccessToken = await attemptTokenRefresh(sessionId, payload.sub);

        if (newAccessToken) {
          // Update request with new token for downstream middleware
          req.headers.authorization = `Bearer ${newAccessToken}`;
          logger.info("Token proactively refreshed", {
            userSub: payload.sub,
            category: "token-refresh",
          });
          tokenRefreshAttempts.inc({ type: "proactive", result: "success" });
        } else {
          logger.warn("Proactive token refresh returned null", {
            userSub: payload.sub,
            category: "token-refresh",
          });
          tokenRefreshAttempts.inc({ type: "proactive", result: "failure" });
        }
      } catch (error) {
        logger.error("Proactive token refresh failed", {
          userSub: payload.sub,
          error: error instanceof Error ? error.message : String(error),
          category: "token-refresh",
        });
        tokenRefreshAttempts.inc({ type: "proactive", result: "error" });
      } finally {
        // Always release lock
        await refreshLock.release(payload.sub, lockToken);
      }
    } else {
      // Another request is already refreshing - wait for it
      logger.debug("Another request is refreshing token, waiting...", {
        userSub: payload.sub,
        category: "token-refresh",
      });

      const released = await refreshLock.waitForRelease(payload.sub);

      if (released) {
        // Lock released - get potentially refreshed token from store
        const tokenStore = getTokenStore();
        const storedTokens = await tokenStore.get(sessionId);

        if (storedTokens?.accessToken) {
          // Use refreshed token
          req.headers.authorization = `Bearer ${storedTokens.accessToken}`;
          logger.debug("Using token refreshed by concurrent request", {
            userSub: payload.sub,
            category: "token-refresh",
          });
        }
      } else {
        // Wait timeout - proceed with current token
        logger.warn("Refresh lock wait timeout, proceeding with current token", {
          userSub: payload.sub,
          category: "token-refresh",
        });
      }
    }
  }

  // Continue with request
  next();
};

3. Configuration

Add to src/config/tokens.ts:

typescript

export const tokensConfig = {
  // ... existing config

  refreshLock: {
    lockTTL: parseInt(process.env.TOKEN_REFRESH_LOCK_TTL ?? "10000"),
    waitTimeout: parseInt(process.env.TOKEN_REFRESH_WAIT_TIMEOUT ?? "5000"),
    pollInterval: parseInt(process.env.TOKEN_REFRESH_POLL_INTERVAL ?? "100"),
  },
};

4. Metrics

Add to src/services/metrics.ts:

typescript

export const tokenRefreshLockWaits = new Counter({
  name: "token_refresh_lock_waits_total",
  help: "Total number of times requests waited for refresh lock",
  labelNames: ["result"], // released, timeout
  registers: [register],
});

export const tokenRefreshLockWaitDuration = new Histogram({
  name: "token_refresh_lock_wait_duration_seconds",
  help: "Time spent waiting for refresh lock",
  labelNames: ["result"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
  registers: [register],
});

Update auth middleware to track metrics:

typescript

if (!lockToken) {
  const timer = tokenRefreshLockWaitDuration.startTimer();
  const released = await refreshLock.waitForRelease(payload.sub);
  timer({ result: released ? "released" : "timeout" });
  tokenRefreshLockWaits.inc({ result: released ? "released" : "timeout" });
}

Configuration

Environment Variables

bash

# Token refresh lock configuration (optional)
TOKEN_REFRESH_LOCK_TTL=10000          # Lock expiration (10 seconds)
TOKEN_REFRESH_WAIT_TIMEOUT=5000       # Max wait for lock (5 seconds)
TOKEN_REFRESH_POLL_INTERVAL=100       # Check lock every 100ms

Edge Cases

1. Lock Expires During Refresh

Scenario: Refresh takes longer than lock TTL (10 seconds)

Behavior:

Lock automatically expires in Redis
Other requests can acquire lock and attempt refresh
Both requests complete successfully (IdP handles duplicate)
Last successful refresh wins (writes to token store)

Mitigation:

Set lock TTL higher than expected refresh duration (10s is generous)
Log warning if refresh exceeds 5 seconds
Monitor refresh duration metrics

2. Request Crashes After Acquiring Lock

Scenario: Request acquires lock but crashes before releasing

Behavior:

Lock expires automatically after TTL (10 seconds)
Other requests can acquire lock after expiration
No permanent deadlock

Why This Works:

Redis TTL provides safety timeout
Lock is "advisory" not "mandatory"
System recovers automatically

3. Redis Unavailable

Scenario: Redis connection fails during lock operations

Behavior:

tryAcquire() returns lock token (fail open)
Request proceeds with refresh
Multiple requests may refresh simultaneously
Same as original behavior (acceptable fallback)

Rationale:

Availability > Consistency for token refresh
Duplicate refreshes are inefficient but not broken
Better than failing all requests

4. Wait Timeout Reached

Scenario: Lock not released within wait timeout (5 seconds)

Behavior:

waitForRelease() returns false
Request proceeds with current token
Token may be expired but reactive refresh will catch it

Alternative:

Request could attempt refresh itself
Current approach: rely on reactive refresh on next request

Testing

Unit Tests

typescript

describe("RefreshLock", () => {
  let refreshLock: RefreshLock;
  let redis: RedisClientType;

  beforeEach(async () => {
    redis = await createTestRedis();
    refreshLock = new RefreshLock({
      lockTTL: 1000,
      waitTimeout: 500,
      pollInterval: 50,
    });
  });

  it("should acquire lock successfully", async () => {
    const lockToken = await refreshLock.tryAcquire("user123");
    expect(lockToken).not.toBeNull();
  });

  it("should not acquire lock if already held", async () => {
    const lock1 = await refreshLock.tryAcquire("user123");
    expect(lock1).not.toBeNull();

    const lock2 = await refreshLock.tryAcquire("user123");
    expect(lock2).toBeNull();
  });

  it("should wait for lock release", async () => {
    const lock1 = await refreshLock.tryAcquire("user123");

    // Release lock after 200ms
    setTimeout(() => refreshLock.release("user123", lock1!), 200);

    const startTime = Date.now();
    const released = await refreshLock.waitForRelease("user123");
    const elapsed = Date.now() - startTime;

    expect(released).toBe(true);
    expect(elapsed).toBeGreaterThanOrEqual(200);
    expect(elapsed).toBeLessThan(400);
  });

  it("should timeout if lock not released", async () => {
    await refreshLock.tryAcquire("user123");

    const startTime = Date.now();
    const released = await refreshLock.waitForRelease("user123");
    const elapsed = Date.now() - startTime;

    expect(released).toBe(false);
    expect(elapsed).toBeGreaterThanOrEqual(500); // wait timeout
  });

  it("should release lock with correct token", async () => {
    const lockToken = await refreshLock.tryAcquire("user123");
    await refreshLock.release("user123", lockToken!);

    // Lock should be available again
    const lock2 = await refreshLock.tryAcquire("user123");
    expect(lock2).not.toBeNull();
  });

  it("should not release lock with wrong token", async () => {
    const lockToken = await refreshLock.tryAcquire("user123");
    await refreshLock.release("user123", "wrong-token");

    // Lock should still be held
    const lock2 = await refreshLock.tryAcquire("user123");
    expect(lock2).toBeNull();
  });
});

describe("Token Refresh with Lock", () => {
  it("should prevent concurrent refreshes", async () => {
    const refreshSpy = vi.fn(async () => "new-token");

    // Simulate 5 concurrent requests
    const requests = Array.from({ length: 5 }, (_, i) =>
      attemptRefreshWithLock(`user123`, refreshSpy)
    );

    await Promise.all(requests);

    // Only one refresh should have been called
    expect(refreshSpy).toHaveBeenCalledTimes(1);
  });
});

Implementation Notes

What Was Implemented

The solution was implemented with a simplified approach that maintains effectiveness while reducing complexity:

1. Refresh Lock Service (src/services/refresh-lock.ts)

Simple Redis-based distributed locking using SET NX EX
10-second lock TTL for automatic expiration
Graceful degradation when Redis is unavailable (fail open)
Circuit breaker integration via executeRedisOperation

2. Auth Middleware Integration (src/middleware/auth.ts)

Lock acquisition before token refresh attempts
Skip refresh if lock cannot be acquired (another request is refreshing)
Automatic lock release in finally block
Simple non-blocking approach: requests that can't acquire the lock skip refresh and use existing token

Key Differences from Proposed Solution:

No waiting mechanism: Requests that can't acquire the lock simply skip refresh rather than waiting
Simpler logic: No polling, no wait timeouts, no complex coordination
Same effectiveness: Still prevents race conditions by ensuring only one refresh at a time
Better UX: No added latency for concurrent requests

Implementation Details

The lock acquisition check happens in the attemptTokenRefresh function:

typescript

// Try to acquire lock - if another request is already refreshing, skip
const lockAcquired = await refreshLock.acquire(userSub);
if (!lockAcquired) {
  logger.debug("Token refresh already in progress, skipping", { sessionId, userSub });
  return null;
}

Key features:

Lock is automatically released in a finally block
Lock expires after 10 seconds as a safety timeout
Redis failures are handled gracefully (system continues without lock)
Comprehensive logging for debugging

Test Coverage

Tests added:

Lock acquisition and release
Concurrent refresh prevention
Lock expiration handling
Error scenarios (no refresh token, Redis errors)
Integration with auth middleware

All existing tests continue to pass with the new implementation.

Acceptance Criteria

[x] Distributed lock service implemented with Redis
[x] Auth middleware uses lock for proactive refresh
[x] Only one request per user refreshes token at a time
[x] Other requests skip refresh if lock held (simplified approach)
[x] Lock expires automatically as safety timeout
[x] Graceful fallback when Redis unavailable (fail open)
[x] Comprehensive logging for debugging
[x] Unit tests with >90% coverage
[x] Integration tests with existing test suite
[x] Documentation updated

Performance Impact

Before (Race Condition)

5 concurrent requests with expired token:
- 5 parallel refresh requests to IdP
- ~500ms total latency (parallel)
- Wasted IdP resources
- Potential failures with single-use tokens

After (Lock-Based)

5 concurrent requests with expired token:
- 1 refresh request to IdP (primary)
- 4 requests wait for lock release (~100-200ms)
- ~500ms total latency (similar)
- No wasted IdP resources
- No failures from duplicate refresh

Key Improvements:

✅ Reduced IdP load (1 request vs N requests)
✅ No refresh token failures
✅ Consistent token state
⚠️ Slightly higher latency for waiting requests (acceptable)

Metrics

prometheus

# Lock wait operations
token_refresh_lock_waits_total{result="released"} 150
token_refresh_lock_waits_total{result="timeout"} 2

# Lock wait duration
token_refresh_lock_wait_duration_seconds{result="released",quantile="0.5"} 0.15
token_refresh_lock_wait_duration_seconds{result="released",quantile="0.99"} 0.45

# Refresh attempts (should decrease after implementation)
token_refresh_attempts_total{type="proactive",result="success"} 100
# Before: concurrent requests would show higher count here

Automatic Token Refresh - Base token refresh implementation
Redis Connection Failure - Graceful Redis error handling
Token Revocation - Manual token invalidation

Token Refresh Race Condition ​

Problem Statement ​

Current Behavior ​

Proposed Solution ​

Lock-Based Coordination ​

Implementation ​

1. Refresh Lock Service ​

2. Update Auth Middleware ​

3. Configuration ​

4. Metrics ​

Configuration ​

Environment Variables ​

Edge Cases ​

1. Lock Expires During Refresh ​

2. Request Crashes After Acquiring Lock ​

3. Redis Unavailable ​

4. Wait Timeout Reached ​

Testing ​

Unit Tests ​

Implementation Notes ​

What Was Implemented ​

Implementation Details ​

Test Coverage ​

Acceptance Criteria ​

Performance Impact ​

Before (Race Condition) ​

After (Lock-Based) ​

Metrics ​

Related Enhancements ​

References ​

Token Refresh Race Condition

Problem Statement

Current Behavior

Proposed Solution

Lock-Based Coordination

Implementation

1. Refresh Lock Service

2. Update Auth Middleware

3. Configuration

4. Metrics

Configuration

Environment Variables

Edge Cases

1. Lock Expires During Refresh

2. Request Crashes After Acquiring Lock

3. Redis Unavailable

4. Wait Timeout Reached

Testing

Unit Tests

Implementation Notes

What Was Implemented

Implementation Details

Test Coverage

Acceptance Criteria

Performance Impact

Before (Race Condition)

After (Lock-Based)

Metrics

Related Enhancements

References