
Distributed Tracing

Priority: 🔵 LONG-TERM | Estimated Time: 8 hours | Risk Level: LOW | Impact: Observability, debugging, performance monitoring

← Back to Enhancements


Issue Description

Current observability:

  • ✅ Prometheus metrics
  • ✅ Winston logging
  • ❌ No distributed tracing
  • ❌ No request flow visualization
  • ❌ Difficult to debug cross-service issues

Proposed Solution

Implement OpenTelemetry for distributed tracing:

  • Trace requests through MCP server
  • Correlate logs and traces
  • Measure operation latency
  • Visualize in Jaeger or Grafana Tempo

Implementation

Note: Redis is included in the local development environment when using ./scripts/local (part of the Docker stack), so Redis operations will appear in local traces once instrumentation is enabled.

Install Dependencies

bash
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http

Tracing Setup

Create src/services/tracing.ts:

typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";

// Enable OpenTelemetry diagnostics in development
if (process.env.NODE_ENV === "development") {
  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO);
}

const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
});

export const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "seed-mcp-server",
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version ?? "0.1.3",
  }),
  traceExporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false },
      "@opentelemetry/instrumentation-http": { enabled: true },
      "@opentelemetry/instrumentation-express": { enabled: true },
    }),
  ],
});

export function initTracing(): void {
  if (process.env.OTEL_ENABLED !== "false") {
    sdk.start();
    console.log("OpenTelemetry tracing initialized");
  }
}

export function shutdownTracing(): Promise<void> {
  return sdk.shutdown();
}

Server Integration

Update src/index.ts:

typescript
import { initTracing, shutdownTracing } from "./services/tracing.js";

// Initialize tracing as early as possible.
// Note: with ESM, static imports are hoisted and run before this call, so for
// complete auto-instrumentation coverage consider starting the SDK at module
// load time or preloading it with `node --import` / `--require`.
initTracing();

import { app } from "./app.js";
import { config } from "./config/index.js";
import { logger } from "./services/logger.js";

// ... rest of server setup

// Graceful shutdown
process.on("SIGTERM", async () => {
  logger.info("SIGTERM received, shutting down gracefully");

  await shutdownTracing();

  server.close(() => {
    logger.info("Server closed");
    process.exit(0);
  });
});

Custom Spans

Add custom spans for MCP operations in src/mcp/tools/index.ts:

typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("seed-mcp-tools");

server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
  // Create span for tool execution
  return await tracer.startActiveSpan(
    `tool.execute.${request.params.name}`,
    async (span) => {
      try {
        span.setAttributes({
          "tool.name": request.params.name,
          "user.id": extra.userContext?.sub ?? "anonymous",
          "session.id": extra.sessionId ?? "unknown",
        });

        const result = await executeTool(request.params.name, request.params.arguments);

        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : "Unknown error",
        });
        span.recordException(error as Error);
        throw error;
      } finally {
        span.end();
      }
    }
  );
});

Configuration

Add to .env.example:

env
# OpenTelemetry Tracing
OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
OTEL_SERVICE_NAME=seed-mcp-server

Local Development with Jaeger

Add to docker-compose.dev.yml:

yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "4318:4318"    # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

Start Jaeger:

bash
docker-compose -f docker-compose.dev.yml up jaeger

Access Jaeger UI at http://localhost:16686


Trace Visualization

After implementing tracing, you'll see:

  • Request flow through middleware
  • JWT validation timing
  • Tool execution duration
  • Redis operations
  • HTTP calls to OIDC provider

Example Trace

GET /mcp [200ms]
  ├─ middleware.auth [50ms]
  │   └─ jwks.getKey [45ms]
  │       └─ http.fetch [40ms]
  ├─ mcp.handleRequest [140ms]
  │   ├─ tool.execute.system_status [130ms]
  │   │   ├─ redis.get [20ms]
  │   │   └─ metrics.collect [100ms]
  │   └─ session.touch [10ms]
  │       └─ redis.set [8ms]

Production Integration

Grafana Cloud

typescript
const traceExporter = new OTLPTraceExporter({
  url: "https://otlp-gateway-prod-us-central-0.grafana.net/otlp/v1/traces",
  headers: {
    "Authorization": `Basic ${Buffer.from(`${process.env.GRAFANA_INSTANCE_ID}:${process.env.GRAFANA_API_KEY}`).toString("base64")}`,
  },
});

Grafana Tempo

yaml
# docker-compose.prod.yml
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
    ports:
      - "4318:4318"  # OTLP HTTP

Jaeger

env
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger-collector:4318/v1/traces

Custom Instrumentation

Business Logic Spans

typescript
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("seed-mcp-server");

export async function processBusinessLogic(): Promise<void> {
  const span = tracer.startSpan("business.process");

  try {
    // Your logic here
    span.addEvent("Processing started");

    await doWork();

    span.addEvent("Processing completed");
  } finally {
    span.end();
  }
}
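
One caveat with tracer.startSpan: it does not make the new span active, so spans created inside doWork() (for example by auto-instrumented Redis or HTTP calls) are not automatically parented to it. When nesting matters, startActiveSpan is the simpler option; a sketch reusing the hypothetical doWork() from above:

typescript
export async function processBusinessLogic(): Promise<void> {
  // startActiveSpan makes the span current for the duration of the callback,
  // so child spans created inside doWork() attach to it automatically.
  await tracer.startActiveSpan("business.process", async (span) => {
    try {
      span.addEvent("Processing started");
      await doWork();
      span.addEvent("Processing completed");
    } finally {
      span.end();
    }
  });
}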

Adding Attributes

typescript
span.setAttributes({
  "user.id": userId,
  "operation.type": "data_processing",
  "batch.size": batchSize,
});

Recording Events

typescript
span.addEvent("cache_hit", {
  "cache.key": key,
  "cache.ttl": ttl,
});

Log-Trace Correlation

Update Winston logger to include trace context in src/services/logger.ts:

typescript
import { trace, context } from "@opentelemetry/api";
import { createLogger, format } from "winston";

const logFormat = format.printf(({ level, message, timestamp, ...metadata }) => {
  const span = trace.getSpan(context.active());
  const traceId = span?.spanContext().traceId;
  const spanId = span?.spanContext().spanId;

  return JSON.stringify({
    timestamp,
    level,
    message,
    traceId,
    spanId,
    ...metadata,
  });
});

export const logger = createLogger({
  format: format.combine(
    format.timestamp(),
    logFormat
  ),
  transports: [/* ... */],
});

Now logs include trace IDs for correlation in Grafana.
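
With this format, a log line would look roughly like the following (IDs, message, and metadata are illustrative):

json
{
  "timestamp": "2024-01-15T12:34:56.789Z",
  "level": "info",
  "message": "Tool executed",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "tool": "system_status"
}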


Performance Impact

OpenTelemetry overhead:

  • CPU: ~1-2% additional overhead
  • Memory: ~10MB additional memory usage
  • Network: ~100KB/minute of trace data

Sampling

Reduce overhead with sampling:

typescript
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

export const sdk = new NodeSDK({
  // ... other config
  sampler: new TraceIdRatioBasedSampler(0.1), // Sample 10% of traces
});
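
If the server is called by upstream services that already make sampling decisions, a parent-based sampler keeps those traces intact: the ratio applies only to root spans, while child spans follow their parent's decision. A sketch of that variant (the OTEL_SAMPLE_RATIO variable is an assumption, not part of the existing configuration):

typescript
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Hypothetical knob: sampling ratio taken from an env var, defaulting to 10%.
const ratio = Number(process.env.OTEL_SAMPLE_RATIO ?? "0.1");

export const sdk = new NodeSDK({
  // ... other config
  // Apply the ratio only to root spans; child spans inherit the parent's decision.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(ratio),
  }),
});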

Monitoring Tracing Health

Add Prometheus metrics in src/services/metrics.ts:

typescript
export const tracingSpans = new promClient.Counter({
  name: "tracing_spans_total",
  help: "Total number of spans created",
  labelNames: ["operation"],
});

export const tracingErrors = new promClient.Counter({
  name: "tracing_errors_total",
  help: "Total number of tracing errors",
});
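
One way to keep these counters in step with real span activity is a small wrapper around startActiveSpan. A sketch (the helper name and the ./metrics.js import path are assumptions, not existing project code):

typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";
import { tracingSpans, tracingErrors } from "./metrics.js";

const tracer = trace.getTracer("seed-mcp-server");

// Hypothetical helper: runs fn inside a span and keeps the counters above
// in sync with span creation and failures.
export async function withTracedSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    tracingSpans.inc({ operation: name });
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      tracingErrors.inc();
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : "Unknown error",
      });
      throw error;
    } finally {
      span.end();
    }
  });
}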

Set up alerts:

promql
# High error rate in tracing
rate(tracing_errors_total[5m]) > 10

# No traces being generated (possible issue)
rate(tracing_spans_total[5m]) == 0

Query Examples

Find Slow Requests

In Jaeger:

duration > 500ms

Find Failed Requests

error=true

Find Tool Executions by User

user.id="user-123" AND tool.name="system_status"

Trace Specific Session

session.id="mcp-abc-123"

Dashboard Creation

Grafana Dashboard Panels

1. Request Latency by Endpoint

promql
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket{job="seed-mcp"}[5m])
)

2. Trace Count

promql
rate(tracing_spans_total[5m])

3. Error Rate

promql
rate(tracing_errors_total[5m])

Testing

Create tests in src/services/tracing.test.ts:

typescript
import { trace, context } from "@opentelemetry/api";

describe("Distributed Tracing", () => {
  it("should create spans for tool execution", async () => {
    const tracer = trace.getTracer("test");
    const span = tracer.startSpan("test-span");

    expect(span).toBeDefined();
    expect(span.spanContext().traceId).toBeTruthy();

    span.end();
  });

  it("should include trace context in logs", async () => {
    const tracer = trace.getTracer("test");
    const span = tracer.startSpan("test-span");
    context.with(trace.setSpan(context.active(), span), () => {
      const activeSpan = trace.getSpan(context.active());
      expect(activeSpan?.spanContext().traceId).toBe(span.spanContext().traceId);
    });
    span.end();
  });
});

Troubleshooting

No Traces in Jaeger

  1. Check that OTEL_ENABLED is not set to false
  2. Verify OTLP endpoint is reachable
  3. Check Jaeger collector logs
  4. Enable OTel diagnostics:
    typescript
    diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

High Memory Usage

  1. Reduce the sampling rate (see Sampling above)
  2. Limit span attributes and bound the exporter queue (see the sketch below)
  3. Check for span leaks (spans that are never ended)
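
For the queue bound mentioned above, the batch span processor accepts explicit limits. A minimal sketch against the exporter from src/services/tracing.ts; the numbers are illustrative, not recommendations:

typescript
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";

export const sdk = new NodeSDK({
  // ... other config
  // Cap how many spans may sit in memory awaiting export; excess spans are dropped.
  spanProcessor: new BatchSpanProcessor(traceExporter, {
    maxQueueSize: 1024,
    maxExportBatchSize: 256,
    scheduledDelayMillis: 5000,
  }),
});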

Missing Spans

  1. Ensure tracer.startActiveSpan is awaited
  2. Check span.end() is called in finally block
  3. Verify context propagation


← Back to Enhancements
