
Health Check Rate Limiting

| Feature | Status | Priority | Decision Date |
| --- | --- | --- | --- |
| Health Check Rate Limiting | ❌ NOT PLANNED | Low | 2026-01-07 |

Overview

Rate limiting was intentionally NOT implemented for the health check endpoints (/health and /health/ready). Monitoring infrastructure needs these endpoints to be highly available, and that operational requirement outweighs the minimal security benefit rate limiting would provide.

Decision Rationale

Operational Requirements

  1. High Availability Required

    • Kubernetes liveness and readiness probes make frequent health check requests
    • Monitoring systems (Prometheus, Datadog, etc.) poll health endpoints every few seconds
    • A rate-limited probe looks like a failed probe, so rate limiting could cause false positives that trigger alerts and auto-scaling events
    • These spurious failures could also lead to unnecessary pod restarts or service degradation
  2. No Sensitive Data Exposure

    • Health responses contain minimal information (status, version, dependency states)
    • No user data, tokens, or sensitive configuration exposed
    • Information disclosed is already available through other means
  3. Low Security Risk

    • Timing attacks on health endpoints have negligible impact
    • Response times don't reveal sensitive operational details
    • Endpoint cannot be abused for data exfiltration or privilege escalation
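
The probe-frequency argument above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a limiter whose global counter is shared across replicas (as a Redis-backed limiter's would be). The replica count, probe interval, window, and limit are all hypothetical values chosen for illustration, not figures from this project.

```typescript
// Hypothetical probe settings; none of these numbers come from the project.
interface ProbeSchedule {
  replicas: number; // pods receiving probes
  probesPerPod: number; // e.g. liveness + readiness = 2
  periodSeconds: number; // interval between probes of one kind
}

// Requests that routine probes alone generate within one rate-limit window.
function probeRequestsPerWindow(schedule: ProbeSchedule, windowMs: number): number {
  const probesPerWindow = Math.floor(windowMs / (schedule.periodSeconds * 1000));
  return schedule.replicas * schedule.probesPerPod * probesPerWindow;
}

// True when routine probing would by itself exceed a shared (global) limit.
function probesExceedLimit(
  schedule: ProbeSchedule,
  windowMs: number,
  globalMax: number,
): boolean {
  return probeRequestsPerWindow(schedule, windowMs) > globalMax;
}

const schedule: ProbeSchedule = { replicas: 20, probesPerPod: 2, periodSeconds: 5 };
console.log(probeRequestsPerWindow(schedule, 60000)); // 480
console.log(probesExceedLimit(schedule, 60000, 300)); // true
```

With these assumed numbers, probes alone produce 480 requests per minute before any monitoring system polls the endpoint, so a global limit of 300 would reject legitimate probes in every window.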

Alternative Mitigations

More appropriate protections against health endpoint abuse:

  1. Network-Level Rate Limiting

    • Firewall rules limiting requests per IP
    • DDoS protection at CDN/load balancer layer
    • More effective for preventing abuse without operational impact
  2. Monitoring and Alerting

    • Track health endpoint request patterns via existing metrics
    • Alert on anomalous traffic patterns
    • Respond to actual attacks rather than blocking legitimate monitoring
  3. Access Control

    • Restrict health endpoint access to monitoring systems via network policies
    • Use separate internal health endpoints for Kubernetes vs. external monitoring
    • Implement IP allowlists at infrastructure layer if needed

Technical Context

Current Implementation

Health check endpoints in src/routes/health.ts:

typescript
// Liveness probe - returns 503 during shutdown
healthRouter.get("/", (_req, res) => {
  if (isShuttingDown) {
    res.status(503).json({
      status: "shutting_down",
      version: config.server.version,
    });
    return;
  }

  res.json({
    status: "ok",
    version: config.server.version,
  });
});

// Readiness probe - validates dependencies
healthRouter.get("/ready", async (_req, res) => {
  // Checks Redis connectivity, JWKS cache, session capacity
});
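
The readiness handler body is elided above. As a rough illustration of the general pattern only, the sketch below shows how a set of dependency check results might be folded into a single response. The function name, the dependency keys, and the "ready"/"not_ready" status strings are hypothetical and are not taken from src/routes/health.ts.

```typescript
// Hypothetical shape: dependency name -> whether its check passed.
type DependencyStates = Record<string, boolean>;

interface ReadinessReport {
  httpStatus: 200 | 503;
  body: { status: "ready" | "not_ready"; dependencies: DependencyStates };
}

// Ready only if every dependency check passed; otherwise report 503 so the
// orchestrator stops routing traffic to this instance.
function buildReadinessReport(dependencies: DependencyStates): ReadinessReport {
  const allHealthy = Object.keys(dependencies).every(
    (name) => dependencies[name] === true,
  );
  return {
    httpStatus: allHealthy ? 200 : 503,
    body: { status: allHealthy ? "ready" : "not_ready", dependencies },
  };
}

// Illustrative input only; the real checks and their names may differ.
const degraded = buildReadinessReport({ redis: false, jwksCache: true });
console.log(degraded.httpStatus); // 503
```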

Why Implementation Was Attempted

The absence of rate limiting on health endpoints was identified in Gap Analysis section 8.2 as a potential timing attack vector. The initial implementation added distributed rate limiting via Redis-backed middleware:

typescript
// ❌ This approach was rejected
if (config.rateLimit.enabled) {
  healthRouter.use(
    createDistributedRateLimiter({
      windowMs: config.rateLimit.health.windowMs,
      maxRequests: config.rateLimit.health.maxRequests,
      globalMax: config.rateLimit.health.globalMax,
      keyPrefix: "ratelimit:health:",
      endpointType: "health",
    }),
  );
}

Problems Discovered During Implementation

  1. Test Infrastructure Breakage

    • Existing tests in src/app.test.ts and src/config/helmet.test.ts broke when rate limiting was added
    • Required extensive mocking of Redis operations
    • Integration tests expect health endpoint to always be available
  2. Circular Dependency

    • Rate limiting requires Redis
    • Health check monitors Redis availability
    • Rate limiting the health check therefore creates a circular dependency: a Redis failure can prevent the very health check that should report it from responding
  3. Operational Impact

    • Kubernetes makes health check requests every few seconds
    • Rate limits could cause legitimate health checks to fail
    • False positives trigger unnecessary pod restarts
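
The circular dependency described above can be shown with a small sketch. It assumes a hypothetical synchronous CounterStore interface standing in for Redis (a real Redis client is asynchronous, and this is not the project's createDistributedRateLimiter). When the store is unreachable, the limiter has only two options, and neither is satisfactory for a health endpoint.

```typescript
// Hypothetical counter store standing in for Redis; not the project's API.
interface CounterStore {
  increment(key: string): number; // throws when the store is unreachable
}

type FailureMode = "fail_closed" | "fail_open";

// Returns true when the request may proceed.
function allowRequest(
  store: CounterStore,
  key: string,
  maxRequests: number,
  onStoreError: FailureMode,
): boolean {
  try {
    return store.increment(key) <= maxRequests;
  } catch {
    // Store is down: fail_closed blocks the health check that would have
    // reported the outage; fail_open lets it through but enforces no limit.
    return onStoreError === "fail_open";
  }
}

// Simulates a Redis outage.
const unreachableStore: CounterStore = {
  increment(): number {
    throw new Error("store unreachable");
  },
};

console.log(allowRequest(unreachableStore, "ratelimit:health:10.0.0.1", 10, "fail_closed")); // false
console.log(allowRequest(unreachableStore, "ratelimit:health:10.0.0.1", 10, "fail_open")); // true
```

Failing closed hides the outage from the orchestrator, while failing open means the limiter provides no protection exactly when the system is already degraded.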

Monitoring Recommendations

If health endpoint abuse becomes a concern:

  1. Monitor Health Endpoint Metrics

    typescript
    // Track requests per IP via existing logging
    logger.info("Health check request", {
      ip: req.ip,
      userAgent: req.get("user-agent"),
      endpoint: req.path,
    });
  2. Alert on Anomalous Patterns

    • Sudden spike in health check requests from single IP
    • Requests from unexpected user agents
    • Geographic anomalies (requests from unexpected regions)
  3. Network-Level Protection

    • Implement rate limiting at load balancer/CDN layer
    • Use DDoS protection services (Cloudflare, AWS Shield)
    • Restrict health endpoint access via network policies
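
The "sudden spike from a single IP" alert in the list above could be detected along the following lines. This is a minimal sketch, assuming an in-memory sliding window; the HealthRequestTracker class, the window length, and the threshold are hypothetical, and a real deployment would more likely derive this signal from logs or metrics than from application state.

```typescript
// Hypothetical in-memory tracker for per-IP health check request counts.
class HealthRequestTracker {
  private readonly timestampsByIp = new Map<string, number[]>();
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records a request and returns true when this IP has made more than
  // `threshold` requests within the sliding window ending at nowMs.
  record(ip: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.timestampsByIp.get(ip) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.timestampsByIp.set(ip, recent);
    return recent.length > this.threshold;
  }
}

// Four requests within 10 seconds against a threshold of three.
const tracker = new HealthRequestTracker(10000, 3);
const flags = [0, 1000, 2000, 3000].map((t) => tracker.record("203.0.113.7", t));
console.log(flags); // [false, false, false, true]
```

Because the tracker only observes and flags traffic, it never rejects a request, which keeps it consistent with the decision not to rate limit the endpoints themselves.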


Released under the MIT License.