Distributed Tracing
Priority: 🔵 LONG-TERM
Estimated Time: 8 hours
Risk Level: LOW
Impact: Observability, debugging, performance monitoring
Issue Description
Current observability:
- ✅ Prometheus metrics
- ✅ Winston logging
- ❌ No distributed tracing
- ❌ No request flow visualization
- ❌ Difficult to debug cross-service issues
Proposed Solution
Implement OpenTelemetry for distributed tracing:
- Trace requests through MCP server
- Correlate logs and traces
- Measure operation latency
- Visualize in Jaeger or Grafana Tempo
Implementation
Note: Redis is included in the local development environment when using ./scripts/local (part of the Docker stack).
Install Dependencies
npm install @opentelemetry/api \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http
Tracing Setup
Create src/services/tracing.ts:
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";
// Enable OpenTelemetry diagnostics in development
if (process.env.NODE_ENV === "development") {
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO);
}
const traceExporter = new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
});
export const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: "seed-mcp-server",
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version ?? "0.1.3",
}),
traceExporter,
instrumentations: [
getNodeAutoInstrumentations({
"@opentelemetry/instrumentation-fs": { enabled: false },
"@opentelemetry/instrumentation-http": { enabled: true },
"@opentelemetry/instrumentation-express": { enabled: true },
}),
],
});
export function initTracing(): void {
if (process.env.OTEL_ENABLED !== "false") {
sdk.start();
console.log("OpenTelemetry tracing initialized");
}
}
export function shutdownTracing(): Promise<void> {
return sdk.shutdown();
}
Server Integration
Update src/index.ts:
import { initTracing, shutdownTracing } from "./services/tracing.js";
// Initialize tracing BEFORE the rest of the application is loaded.
// Note: if the project compiles to ESM, static imports are hoisted ahead of
// this call, so load tracing via `node --require`/`--import` instead, or turn
// the imports below into dynamic `await import(...)` calls.
initTracing();
import { app } from "./app.js";
import { config } from "./config/index.js";
import { logger } from "./services/logger.js";
// ... rest of server setup
// Graceful shutdown
process.on("SIGTERM", async () => {
logger.info("SIGTERM received, shutting down gracefully");
await shutdownTracing();
server.close(() => {
logger.info("Server closed");
process.exit(0);
});
});
Custom Spans
Add custom spans for MCP operations in src/mcp/tools/index.ts:
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("seed-mcp-tools");
server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
// Create span for tool execution
return await tracer.startActiveSpan(
`tool.execute.${request.params.name}`,
async (span) => {
try {
span.setAttributes({
"tool.name": request.params.name,
"user.id": extra.userContext?.sub ?? "anonymous",
"session.id": extra.sessionId ?? "unknown",
});
const result = await executeTool(request.params.name, request.params.arguments);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error instanceof Error ? error.message : "Unknown error",
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
}
);
});
Configuration
Add to .env.example:
# OpenTelemetry Tracing
OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
OTEL_SERVICE_NAME=seed-mcp-server
Local Development with Jaeger
Add to docker-compose.dev.yml:
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # Jaeger UI
- "4318:4318" # OTLP HTTP receiver
environment:
- COLLECTOR_OTLP_ENABLED=true
Start Jaeger:
docker-compose -f docker-compose.dev.yml up jaeger
Access the Jaeger UI at http://localhost:16686
Trace Visualization
After implementing, you'll see:
- Request flow through middleware
- JWT validation timing
- Tool execution duration
- Redis operations
- HTTP calls to OIDC provider
Example Trace
GET /mcp [200ms]
├─ middleware.auth [50ms]
│ └─ jwks.getKey [45ms]
│ └─ http.fetch [40ms]
├─ mcp.handleRequest [140ms]
│ ├─ tool.execute.system_status [130ms]
│ │ ├─ redis.get [20ms]
│ │ └─ metrics.collect [100ms]
│ └─ session.touch [10ms]
│ └─ redis.set [8ms]
Production Integration
Grafana Cloud
const traceExporter = new OTLPTraceExporter({
url: "https://otlp-gateway-prod-us-central-0.grafana.net/otlp/v1/traces",
headers: {
"Authorization": `Basic ${Buffer.from(`${process.env.GRAFANA_INSTANCE_ID}:${process.env.GRAFANA_API_KEY}`).toString("base64")}`,
},
});
Grafana Tempo
# docker-compose.prod.yml
tempo:
image: grafana/tempo:latest
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./tempo-config.yaml:/etc/tempo.yaml
ports:
- "4318:4318" # OTLP HTTPJaeger
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger-collector:4318/v1/traces
Custom Instrumentation
Business Logic Spans
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("seed-mcp-server");

export async function processBusinessLogic(): Promise<void> {
  const span = tracer.startSpan("business.process");
  try {
    span.addEvent("Processing started");
    await doWork(); // your logic here
    span.addEvent("Processing completed");
  } catch (error) {
    // Mark the span as failed so the error is visible in the trace view
    span.recordException(error as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
Adding Attributes
span.setAttributes({
"user.id": userId,
"operation.type": "data_processing",
"batch.size": batchSize,
});
Recording Events
span.addEvent("cache_hit", {
"cache.key": key,
"cache.ttl": ttl,
});
Log-Trace Correlation
Update Winston logger to include trace context in src/services/logger.ts:
import { trace, context } from "@opentelemetry/api";
import { createLogger, format } from "winston";
const logFormat = format.printf(({ level, message, timestamp, ...metadata }) => {
const span = trace.getSpan(context.active());
const traceId = span?.spanContext().traceId;
const spanId = span?.spanContext().spanId;
return JSON.stringify({
timestamp,
level,
message,
traceId,
spanId,
...metadata,
});
});
export const logger = createLogger({
format: format.combine(
format.timestamp(),
logFormat
),
transports: [/* ... */],
});
Now logs include trace IDs for correlation in Grafana.
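As a usage sketch (the touchSession helper is hypothetical; the session.touch span name mirrors the example trace above), any log written while a span is active automatically carries that span's IDs:

import { trace } from "@opentelemetry/api";
import { logger } from "./logger.js";

const tracer = trace.getTracer("seed-mcp-server");

export async function touchSession(sessionId: string): Promise<void> {
  await tracer.startActiveSpan("session.touch", async (span) => {
    span.setAttribute("session.id", sessionId);
    // Emitted with the traceId/spanId of "session.touch", so Grafana can
    // jump from this log line to the owning trace.
    logger.info("Refreshing session TTL");
    span.end();
  });
}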
Performance Impact
OpenTelemetry overhead:
- CPU: ~1-2% additional overhead
- Memory: ~10MB additional memory usage
- Network: ~100KB/minute of trace data
Sampling
Reduce overhead with sampling:
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";
export const sdk = new NodeSDK({
// ... other config
sampler: new TraceIdRatioBasedSampler(0.1), // Sample 10% of traces
});
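The ratio sampler above decides independently for each new trace. If child spans should always follow the decision already made for their trace root, the ratio sampler can be wrapped in a ParentBasedSampler; a minimal sketch:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Roots are sampled at 10%; child spans inherit their parent's decision,
// so a trace is either recorded completely or not at all.
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

export const sdk = new NodeSDK({
  // ... other config
  sampler,
});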
Monitoring Tracing Health
Add Prometheus metrics in src/services/metrics.ts:
export const tracingSpans = new promClient.Counter({
name: "tracing_spans_total",
help: "Total number of spans created",
labelNames: ["operation"],
});
export const tracingErrors = new promClient.Counter({
name: "tracing_errors_total",
help: "Total number of tracing errors",
});
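These counters are declared but not yet incremented anywhere. One way to wire them up (a sketch only; the MetricsSpanProcessor class and its file location are assumptions, not existing code) is a small SpanProcessor that counts spans as they start and errored spans as they end, registered with the SDK alongside the exporter (the exact NodeSDK option name, spanProcessor or spanProcessors, depends on the @opentelemetry/sdk-node version):

import { SpanStatusCode, type Context } from "@opentelemetry/api";
import type { ReadableSpan, Span, SpanProcessor } from "@opentelemetry/sdk-trace-base";
import { tracingSpans, tracingErrors } from "./metrics.js";

export class MetricsSpanProcessor implements SpanProcessor {
  onStart(span: Span, _parentContext: Context): void {
    // Count every span at creation, labelled by its operation name.
    tracingSpans.inc({ operation: span.name });
  }

  onEnd(span: ReadableSpan): void {
    // Count spans that finished in an error state.
    if (span.status.code === SpanStatusCode.ERROR) {
      tracingErrors.inc();
    }
  }

  forceFlush(): Promise<void> {
    return Promise.resolve();
  }

  shutdown(): Promise<void> {
    return Promise.resolve();
  }
}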
Set up alerts:
# High error rate in tracing
rate(tracing_errors_total[5m]) > 10
# No traces being generated (possible issue)
rate(tracing_spans_total[5m]) == 0
Query Examples
Find Slow Requests
In Jaeger:
duration > 500ms
Find Failed Requests
error=true
Find Tool Executions by User
user.id="user-123" AND tool.name="system_status"Trace Specific Session
session.id="mcp-abc-123"Dashboard Creation
Grafana Dashboard Panels
1. Request Latency by Endpoint
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{job="seed-mcp"}[5m])
)
2. Trace Count
rate(tracing_spans_total[5m])
3. Error Rate
rate(tracing_errors_total[5m])
Testing
Create tests in src/services/tracing.test.ts:
import { trace, context } from "@opentelemetry/api";
describe("Distributed Tracing", () => {
it("should create spans for tool execution", async () => {
const tracer = trace.getTracer("test");
const span = tracer.startSpan("test-span");
expect(span).toBeDefined();
expect(span.spanContext().traceId).toBeTruthy();
span.end();
});
it("should include trace context in logs", async () => {
const span = tracer.startSpan("test-span");
context.with(trace.setSpan(context.active(), span), () => {
const activeSpan = trace.getSpan(context.active());
expect(activeSpan?.spanContext().traceId).toBe(span.spanContext().traceId);
});
span.end();
});
});
Troubleshooting
No Traces in Jaeger
- Check OTEL_ENABLED is true
- Verify OTLP endpoint is reachable
- Check Jaeger collector logs
- Enable OTel diagnostics:
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
High Memory Usage
- Reduce sampling rate
- Limit span attributes
- Check for span leaks (not ended)
Missing Spans
- Ensure tracer.startActiveSpan is awaited
- Check span.end() is called in finally block
- Verify context propagation (see the sketch below)
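If spans appear as disconnected roots, the active context is usually being lost across an async boundary. A minimal propagation sketch (the function and span names are illustrative):

import { context, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("seed-mcp-server");

export async function withPropagatedContext(): Promise<void> {
  const parent = tracer.startSpan("parent.operation");
  try {
    // Everything inside this callback sees "parent.operation" as the active
    // span, so new spans (including auto-instrumented ones) are recorded as
    // its children instead of orphaned roots.
    await context.with(trace.setSpan(context.active(), parent), async () => {
      const child = tracer.startSpan("child.operation");
      child.end();
    });
  } finally {
    parent.end();
  }
}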
Related Enhancements
- Audit Logging - Correlate audit events with traces
- See Observability - Current monitoring setup