
Token Refresh Metrics

Status: ✅ IMPLEMENTED
Priority: 🟢 MEDIUM (Completed)
Actual Time: 2-3 hours
Implementation Date: 2026-01-07
Risk Level: LOW
Impact: Enhanced observability for OAuth token refresh operations



Implementation Summary

Token refresh metrics are now exposed as Prometheus metrics. They cover refresh attempts (proactive and reactive), refresh latency, and pending token claims.

What Was Implemented

  1. Token Refresh Attempts Counter - Tracks all refresh attempts by type (proactive/reactive) and result (success/failure/skipped) in src/services/metrics.ts
  2. Token Refresh Duration Histogram - Measures refresh operation latency with buckets [0.1, 0.5, 1, 2, 5] seconds
  3. Pending Tokens Claimed Counter - Tracks when pending tokens are successfully claimed during MCP session creation
  4. Metrics Integration - Instrumented src/middleware/auth.ts and src/mcp/mcp.ts
  5. Comprehensive Tests - Added 6 test cases in src/services/metrics.test.ts
  6. Grafana Dashboard - Added 3 new panels to grafana/seed-mcp-server-dashboard.json
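
To make the bucket boundaries in item 2 concrete, the sketch below counts a few latencies into cumulative buckets, which is how a Prometheus histogram exposes its `_bucket{le="..."}` series. The sample durations and the helper name `cumulativeBucketCounts` are illustrative assumptions, not code from src/services/metrics.ts.

```typescript
// Bucket upper bounds in seconds, as listed for the duration histogram.
const BUCKETS: readonly number[] = [0.1, 0.5, 1, 2, 5];

// Returns cumulative counts keyed by upper bound, plus "+Inf" for all
// observations. Each bucket counts every observation <= its upper bound.
function cumulativeBucketCounts(
  durationsSeconds: readonly number[],
  buckets: readonly number[] = BUCKETS,
): Record<string, number> {
  const result: Record<string, number> = {};
  for (const upperBound of buckets) {
    result[String(upperBound)] = durationsSeconds.filter(
      (d) => d <= upperBound,
    ).length;
  }
  result["+Inf"] = durationsSeconds.length;
  return result;
}

// Hypothetical latencies, not measurements from the real service.
const sampleDurations = [0.08, 0.3, 0.45, 1.2, 7.5];
const sampleCounts = cumulativeBucketCounts(sampleDurations);
// sampleCounts: { "0.1": 1, "0.5": 3, "1": 3, "2": 4, "5": 4, "+Inf": 5 }
```

In this sample the 7.5 s refresh only shows up in the `+Inf` bucket, so anything slower than 5 s cannot be resolved more finely with these boundaries.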

Key Files Modified

  • src/services/metrics.ts - Metric definitions
  • src/middleware/auth.ts - Refresh operation instrumentation
  • src/mcp/mcp.ts - Pending token claim tracking
  • src/services/metrics.test.ts - Test coverage
  • src/middleware/auth.test.ts - Updated metrics mock
  • grafana/seed-mcp-server-dashboard.json - Dashboard panels

Testing

  • 6 comprehensive test cases covering all metrics
  • Mock validation for proper metric instrumentation
  • Test coverage: >90% for metrics service
  • All tests passing via npm run validate

Original Problem Statement

Before this change, token refresh operations were logged but not exposed as Prometheus metrics, which limited operational visibility:

  • Cannot track refresh success/failure rates
  • No visibility into proactive vs reactive refresh patterns
  • Missing duration metrics for IdP response times
  • No alerting on refresh failures

Proposed Solution

Add comprehensive Prometheus metrics for token refresh operations.

Metrics to Add

typescript
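// src/services/metrics.ts
// Assumes prom-client and its default registry; the real file may create its own Registry.
import { Counter, Histogram, register } from 'prom-client';
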
export const tokenRefreshAttempts = new Counter({
  name: 'token_refresh_attempts_total',
  help: 'Total token refresh attempts',
  labelNames: ['type', 'result'], // type: proactive/reactive; result: success/failure/skipped
  registers: [register],
});

export const tokenRefreshDuration = new Histogram({
  name: 'token_refresh_duration_seconds',
  help: 'Token refresh operation duration',
  labelNames: ['result'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register],
});

export const pendingTokensClaimed = new Counter({
  name: 'pending_tokens_claimed_total',
  help: 'Total pending tokens claimed by sessions',
  registers: [register],
});

// Proposed only: this counter is not among the items listed in the Implementation Summary.
export const tokenStoreOperations = new Counter({
  name: 'token_store_operations_total',
  help: 'Token store operations',
  labelNames: ['operation', 'result'], // get/set/delete, success/failure
  registers: [register],
});

Usage Example

typescript
// In attemptTokenRefresh
const timer = tokenRefreshDuration.startTimer();
try {
  const newToken = await refreshTokenFromIdP();
  tokenRefreshAttempts.inc({ type: 'proactive', result: 'success' });
  timer({ result: 'success' });
  return newToken;
} catch (error) {
  tokenRefreshAttempts.inc({ type: 'proactive', result: 'failure' });
  timer({ result: 'failure' });
  throw error;
}
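
The example above hard-codes `type: 'proactive'`. One way to share the same instrumentation between proactive and reactive paths is a small wrapper, sketched here under assumptions: `withRefreshMetrics` and the two interfaces are hypothetical names, and the interfaces only describe the `inc(labels)` and `startTimer()` calls already used above, so the metric objects can be replaced by fakes in tests. The `skipped` result mentioned in the summary is not handled by this sketch.

```typescript
type RefreshType = "proactive" | "reactive";
type RefreshResult = "success" | "failure";

// Minimal shapes for the two metric calls used in the usage example.
interface AttemptsCounter {
  inc(labels: { type: RefreshType; result: RefreshResult }): void;
}
interface DurationHistogram {
  // startTimer() returns a function that stops the timer and records
  // the elapsed time with the given labels.
  startTimer(): (labels?: { result: RefreshResult }) => number;
}

// Runs `refresh`, then records one attempt and one duration observation,
// labelled by outcome, whether the call resolved or threw.
async function withRefreshMetrics<T>(
  type: RefreshType,
  attempts: AttemptsCounter,
  duration: DurationHistogram,
  refresh: () => Promise<T>,
): Promise<T> {
  const endTimer = duration.startTimer();
  let result: RefreshResult = "failure";
  try {
    const value = await refresh();
    result = "success";
    return value;
  } finally {
    // Runs on both paths; a thrown error still propagates to the caller.
    attempts.inc({ type, result });
    endTimer({ result });
  }
}
```

A call site could then look like `withRefreshMetrics('reactive', tokenRefreshAttempts, tokenRefreshDuration, () => refreshTokenFromIdP())`, assuming the metric objects from src/services/metrics.ts satisfy these shapes.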

Sample Queries

promql
# Refresh success rate (aggregate both sides so the label sets match)
sum(rate(token_refresh_attempts_total{result="success"}[5m]))
/ sum(rate(token_refresh_attempts_total[5m]))

# P99 refresh latency across all results
histogram_quantile(0.99,
  sum by (le) (rate(token_refresh_duration_seconds_bucket[5m]))
)

# Proactive share of all refresh attempts
sum(rate(token_refresh_attempts_total{type="proactive"}[5m]))
/ sum(rate(token_refresh_attempts_total[5m]))
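
For readers interpreting the P99 panel, it may help to see roughly how a quantile is derived from cumulative bucket counts. The sketch below is a simplified version of the linear interpolation that `histogram_quantile` applies to classic histograms; it omits several edge cases, and the function name and bucket counts are hypothetical.

```typescript
// One cumulative bucket: `le` is the upper bound (Infinity for +Inf),
// `count` is the number of observations <= le.
interface Bucket {
  le: number;
  count: number;
}

function estimateQuantile(q: number, buckets: readonly Bucket[]): number {
  if (q < 0 || q > 1) throw new RangeError("q must be within [0, 1]");
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN;
  const last = sorted[sorted.length - 1];
  if (last.le !== Infinity || last.count === 0) return NaN;

  // Rank of the requested quantile among all observations.
  const rank = q * last.count;
  const idx = sorted.findIndex((b) => b.count >= rank);

  // If the rank falls in +Inf, the best available answer is the highest
  // finite upper bound.
  if (idx === sorted.length - 1) return sorted[sorted.length - 2].le;

  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const prevCount = idx === 0 ? 0 : sorted[idx - 1].count;
  const upper = sorted[idx].le;
  const inBucket = sorted[idx].count - prevCount;

  // Assume observations are spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - prevCount) / inBucket);
}

// Hypothetical cumulative counts for the [0.1, 0.5, 1, 2, 5] buckets.
const exampleBuckets: Bucket[] = [
  { le: 0.1, count: 10 },
  { le: 0.5, count: 60 },
  { le: 1, count: 90 },
  { le: 2, count: 98 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];
const p99 = estimateQuantile(0.99, exampleBuckets); // approximately 3.5
```

Because the estimate is interpolated within a bucket, its precision depends on bucket width: here the P99 falls in the 2 s to 5 s bucket, so the reported value is only a rough position inside that range.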

Acceptance Criteria

  • [x] Token refresh attempt counter with type and result labels
  • [x] Token refresh duration histogram
  • [x] Pending token claim counter
  • [x] Metrics integrated into existing refresh logic
  • [x] Comprehensive test coverage (>90%)
  • [x] Documentation with example queries
  • [x] Grafana dashboard panels

Estimated Effort

2-3 hours - Add metrics, integrate into code, documentation


Released under the MIT License.