
Distributed Tracing

Priority: 🔵 LONG-TERM | Estimated Time: 8 hours | Risk Level: LOW | Impact: Observability, debugging, performance monitoring

← Back to Enhancements


Issue Description

Current observability:

  • ✅ Prometheus metrics
  • ✅ Winston logging
  • ❌ No distributed tracing
  • ❌ No request flow visualization
  • ❌ Difficult to debug cross-service issues

Proposed Solution

Implement OpenTelemetry for distributed tracing:

  • Trace requests through MCP server
  • Correlate logs and traces
  • Measure operation latency
  • Visualize in Jaeger or Grafana Tempo

Implementation

Note: Redis is included in the local development environment when using ./scripts/local (part of the Docker stack), so Redis operations will appear in local traces once instrumentation is enabled.

Install Dependencies

bash
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http

Tracing Setup

Create src/services/tracing.ts:

typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";

// Enable OpenTelemetry diagnostics in development
if (process.env.NODE_ENV === "development") {
  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO);
}

const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
});

export const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "seed-mcp-server",
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version ?? "0.1.3",
  }),
  traceExporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false },
      "@opentelemetry/instrumentation-http": { enabled: true },
      "@opentelemetry/instrumentation-express": { enabled: true },
    }),
  ],
});

export function initTracing(): void {
  if (process.env.OTEL_ENABLED !== "false") {
    sdk.start();
    console.log("OpenTelemetry tracing initialized");
  }
}

export function shutdownTracing(): Promise<void> {
  return sdk.shutdown();
}

Server Integration

Update src/index.ts:

typescript
import { initTracing, shutdownTracing } from "./services/tracing.js";

// Initialize tracing as early as possible.
// Note: with ESM, static imports are hoisted and run before this call, so for
// complete auto-instrumentation coverage consider starting the SDK at module
// load time or preloading it with `node --import` / `--require`.
initTracing();

import { app } from "./app.js";
import { config } from "./config/index.js";
import { logger } from "./services/logger.js";

// ... rest of server setup

// Graceful shutdown
process.on("SIGTERM", async () => {
  logger.info("SIGTERM received, shutting down gracefully");

  await shutdownTracing();

  server.close(() => {
    logger.info("Server closed");
    process.exit(0);
  });
});

Custom Spans

Add custom spans for MCP operations in src/mcp/tools/index.ts:

typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("seed-mcp-tools");

server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
  // Create span for tool execution
  return await tracer.startActiveSpan(
    `tool.execute.${request.params.name}`,
    async (span) => {
      try {
        span.setAttributes({
          "tool.name": request.params.name,
          "user.id": extra.userContext?.sub ?? "anonymous",
          "session.id": extra.sessionId ?? "unknown",
        });

        const result = await executeTool(request.params.name, request.params.arguments);

        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : "Unknown error",
        });
        span.recordException(error as Error);
        throw error;
      } finally {
        span.end();
      }
    }
  );
});

Configuration

Add to .env.example:

env
# OpenTelemetry Tracing
OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
OTEL_SERVICE_NAME=seed-mcp-server

Local Development with Jaeger

Add to docker-compose.dev.yml:

yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "4318:4318"    # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

Start Jaeger:

bash
docker-compose -f docker-compose.dev.yml up jaeger

Access Jaeger UI at http://localhost:16686


Trace Visualization

After implementing tracing, you'll see:

  • Request flow through middleware
  • JWT validation timing
  • Tool execution duration
  • Redis operations
  • HTTP calls to OIDC provider

Example Trace

GET /mcp [200ms]
  ├─ middleware.auth [50ms]
  │   └─ jwks.getKey [45ms]
  │       └─ http.fetch [40ms]
  ├─ mcp.handleRequest [140ms]
  │   ├─ tool.execute.system_status [130ms]
  │   │   ├─ redis.get [20ms]
  │   │   └─ metrics.collect [100ms]
  │   └─ session.touch [10ms]
  │       └─ redis.set [8ms]

Production Integration

Grafana Cloud

typescript
const traceExporter = new OTLPTraceExporter({
  url: "https://otlp-gateway-prod-us-central-0.grafana.net/otlp/v1/traces",
  headers: {
    "Authorization": `Basic ${Buffer.from(`${process.env.GRAFANA_INSTANCE_ID}:${process.env.GRAFANA_API_KEY}`).toString("base64")}`,
  },
});

Grafana Tempo

yaml
# docker-compose.prod.yml
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    ports:
      - "4318:4318"  # OTLP HTTP

Jaeger

env
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger-collector:4318/v1/traces

Custom Instrumentation

Business Logic Spans

typescript
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("seed-mcp-server");

export async function processBusinessLogic(): Promise<void> {
  const span = tracer.startSpan("business.process");

  try {
    // Your logic here
    span.addEvent("Processing started");

    await doWork();

    span.addEvent("Processing completed");
  } finally {
    span.end();
  }
}
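
One caveat with tracer.startSpan: it does not make the new span active, so spans created inside doWork() (for example by auto-instrumented Redis or HTTP calls) are not automatically parented to it. When nesting matters, startActiveSpan is the simpler option; a sketch reusing the hypothetical doWork() from above:

typescript
export async function processBusinessLogic(): Promise<void> {
  // startActiveSpan makes the span current for the duration of the callback,
  // so child spans created inside doWork() attach to it automatically.
  await tracer.startActiveSpan("business.process", async (span) => {
    try {
      span.addEvent("Processing started");
      await doWork();
      span.addEvent("Processing completed");
    } finally {
      span.end();
    }
  });
}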

Adding Attributes

typescript
span.setAttributes({
  "user.id": userId,
  "operation.type": "data_processing",
  "batch.size": batchSize,
});

Recording Events

typescript
span.addEvent("cache_hit", {
  "cache.key": key,
  "cache.ttl": ttl,
});

Log-Trace Correlation

Update Winston logger to include trace context in src/services/logger.ts:

typescript
import { trace, context } from "@opentelemetry/api";
import { createLogger, format } from "winston";

const logFormat = format.printf(({ level, message, timestamp, ...metadata }) => {
  const span = trace.getSpan(context.active());
  const traceId = span?.spanContext().traceId;
  const spanId = span?.spanContext().spanId;

  return JSON.stringify({
    timestamp,
    level,
    message,
    traceId,
    spanId,
    ...metadata,
  });
});

export const logger = createLogger({
  format: format.combine(
    format.timestamp(),
    logFormat
  ),
  transports: [/* ... */],
});

Now logs include trace IDs for correlation in Grafana.
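
With this format, a log line would look roughly like the following (IDs, message, and metadata are illustrative):

json
{
  "timestamp": "2024-01-15T12:34:56.789Z",
  "level": "info",
  "message": "Tool executed",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "tool": "system_status"
}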


Performance Impact

OpenTelemetry overhead:

  • CPU: ~1-2% additional overhead
  • Memory: ~10MB additional memory usage
  • Network: ~100KB/minute of trace data

Sampling

Reduce overhead with sampling:

typescript
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

export const sdk = new NodeSDK({
  // ... other config
  sampler: new TraceIdRatioBasedSampler(0.1), // Sample 10% of traces
});
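
If the server is called by upstream services that already make sampling decisions, a parent-based sampler keeps those traces intact: the ratio applies only to root spans, while child spans follow their parent's decision. A sketch of that variant (the OTEL_SAMPLE_RATIO variable is an assumption, not part of the existing configuration):

typescript
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Hypothetical knob: sampling ratio taken from an env var, defaulting to 10%.
const ratio = Number(process.env.OTEL_SAMPLE_RATIO ?? "0.1");

export const sdk = new NodeSDK({
  // ... other config
  // Apply the ratio only to root spans; child spans inherit the parent's decision.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(ratio),
  }),
});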

Monitoring Tracing Health

Add Prometheus metrics in src/services/metrics.ts:

typescript
export const tracingSpans = new promClient.Counter({
  name: "tracing_spans_total",
  help: "Total number of spans created",
  labelNames: ["operation"],
});

export const tracingErrors = new promClient.Counter({
  name: "tracing_errors_total",
  help: "Total number of tracing errors",
});
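
One way to keep these counters in step with real span activity is a small wrapper around startActiveSpan. A sketch (the helper name and the ./metrics.js import path are assumptions, not existing project code):

typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";
import { tracingSpans, tracingErrors } from "./metrics.js";

const tracer = trace.getTracer("seed-mcp-server");

// Hypothetical helper: runs fn inside a span and keeps the counters above
// in sync with span creation and failures.
export async function withTracedSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    tracingSpans.inc({ operation: name });
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      tracingErrors.inc();
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : "Unknown error",
      });
      throw error;
    } finally {
      span.end();
    }
  });
}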

Set up alerts:

promql
# High error rate in tracing
rate(tracing_errors_total[5m]) > 10

# No traces being generated (possible issue)
rate(tracing_spans_total[5m]) == 0

Query Examples

Find Slow Requests

In Jaeger:

duration > 500ms

Find Failed Requests

error=true

Find Tool Executions by User

user.id="user-123" AND tool.name="system_status"

Trace Specific Session

session.id="mcp-abc-123"

Dashboard Creation

Grafana Dashboard Panels

1. Request Latency by Endpoint

promql
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket{job="seed-mcp"}[5m])
)

2. Trace Count

promql
rate(tracing_spans_total[5m])

3. Error Rate

promql
rate(tracing_errors_total[5m])

Testing

Create tests in src/services/tracing.test.ts:

typescript
import { trace, context } from "@opentelemetry/api";

describe("Distributed Tracing", () => {
  it("should create spans for tool execution", async () => {
    const tracer = trace.getTracer("test");
    const span = tracer.startSpan("test-span");

    expect(span).toBeDefined();
    expect(span.spanContext().traceId).toBeTruthy();

    span.end();
  });

  it("should include trace context in logs", async () => {
    const tracer = trace.getTracer("test");
    const span = tracer.startSpan("test-span");
    context.with(trace.setSpan(context.active(), span), () => {
      const activeSpan = trace.getSpan(context.active());
      expect(activeSpan?.spanContext().traceId).toBe(span.spanContext().traceId);
    });
    span.end();
  });
});

Troubleshooting

No Traces in Jaeger

  1. Check that OTEL_ENABLED is not set to false
  2. Verify OTLP endpoint is reachable
  3. Check Jaeger collector logs
  4. Enable OTel diagnostics:
    typescript
    diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

High Memory Usage

  1. Reduce the sampling rate (see Sampling above)
  2. Limit span attributes and bound the exporter queue (see the sketch below)
  3. Check for span leaks (spans that are never ended)
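
For the queue bound mentioned above, the batch span processor accepts explicit limits. A minimal sketch against the exporter from src/services/tracing.ts; the numbers are illustrative, not recommendations:

typescript
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";

export const sdk = new NodeSDK({
  // ... other config
  // Cap how many spans may sit in memory awaiting export; excess spans are dropped.
  spanProcessor: new BatchSpanProcessor(traceExporter, {
    maxQueueSize: 1024,
    maxExportBatchSize: 256,
    scheduledDelayMillis: 5000,
  }),
});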

Missing Spans

  1. Ensure tracer.startActiveSpan is awaited
  2. Check span.end() is called in finally block
  3. Verify context propagation


← Back to Enhancements
