Instrumenting WebSockets with OpenTelemetry #

You added OpenTelemetry, your REST routes show clean traces, and yet every WebSocket interaction collapses into a single span that ends at 101 Switching Protocols. A handler that takes 800 ms to process an inbound frame is invisible — there is no span, no latency record, nothing linking it back to the connection that opened hours ago. This page shows how to trace post-upgrade WebSocket traffic with @opentelemetry/api and the ws package: a span on the upgrade, a long-lived connection context, and one child span per message so a slow handler becomes a node in your trace tree.

Root cause #

Auto-instrumentation hooks the request/response lifecycle. The http instrumentation wraps emit('request'), starts a span, and ends it when the response finishes. A WebSocket upgrade is an HTTP request — so you get exactly one span covering the handshake, and it closes the moment the socket switches protocols. Everything after that travels as raw WebSocket frames over the same TCP socket, never touching the HTTP request emitter again. The auto-instrumentation has no concept of a “frame,” so the entire duplex conversation — potentially thousands of messages — falls outside any span.

Three consequences follow:

  • No per-message latency. A handler blocking the event loop for 800 ms produces zero telemetry. You see CPU rise in metrics but cannot attribute it to a specific message type or tenant.
  • No causal link to the upgrade. Even if you manually start a span per message, it floats as an orphan root unless you stash the connection’s trace context and re-activate it for each frame.
  • Broken traces across fan-out. When a message triggers a publish to Redis and another node delivers it to a subscriber, the consumer-side span starts a fresh trace. The two halves of the same logical flow never join, because W3C traceparent context does not ride along the Redis payload by default.

The fix is to treat the connection as a context carrier. Start a span when the upgrade arrives, extract traceparent from the upgrade headers, store the resulting Context on the socket, and use context.with() to re-enter it every time a frame lands. For the Redis Pub/Sub fan-out boundary, you serialize the active context into the message envelope with propagation.inject and extract it again on the consumer side, stitching the two spans into one trace.
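To make the carried value concrete: a traceparent header follows the W3C Trace Context layout of version, trace ID, parent span ID, and flags, joined by hyphens. The sketch below parses the version 00 layout by hand purely as an illustration; in real code propagation.extract does this for you. The names parseTraceparent and ParsedTraceparent are hypothetical and not part of @opentelemetry/api.

```typescript
// Illustrative only: the OpenTelemetry propagator performs this parsing for you.
interface ParsedTraceparent {
  version: string;
  traceId: string;  // 32 lowercase hex chars, shared by every span in the trace
  parentId: string; // 16 lowercase hex chars, the span that sent the request
  sampled: boolean; // lowest bit of the trace-flags byte
}

// W3C Trace Context, version 00 layout: "00-<trace-id>-<parent-id>-<trace-flags>".
// This sketch is strict about that layout and rejects anything longer.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): ParsedTraceparent | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" is invalid, as are all-zero trace IDs and parent IDs.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Two spans belong to the same trace exactly when their traceId values match, which is the property the rest of this page works to preserve across frames and across Redis.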

Figure: WebSocket span tree under OpenTelemetry. An upgrade span holds connection context; each inbound frame opens a child span, and a publish span links across the Redis fan-out boundary to a consumer span. Spans shown: ws.upgrade (holds connection Context, traceparent); ws.message (type=chat.send); ws.message (slow handler, 800 ms); redis.publish (inject traceparent); ws.deliver (other node, same trace). The Redis fan-out boundary sits between redis.publish and ws.deliver, with context carried in the payload.

Resolution #

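The pattern below uses @opentelemetry/api directly, so it works with whatever SDK and exporter you have already configured. It assumes that SDK registers a context manager (the Node SDK does so by default); without one, context.active() always returns the root context and the nesting shown here has no effect. Each non-obvious line is annotated. The key pieces are: a span per upgrade, the connection’s Context stored on the socket, context.with() to re-activate it per frame, and propagation.inject / extract to cross the Redis boundary.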

import { WebSocketServer, WebSocket, RawData } from 'ws';
import { IncomingMessage } from 'http';
import {
  trace, context, propagation, SpanStatusCode, Context, Span,
} from '@opentelemetry/api';

// Assumed to exist elsewhere in your application: `server` (your http.Server),
// `parse`, `handle`, `redis`, `redisSub`, and `deliverToLocalSubscribers`.

const tracer = trace.getTracer('websocket-gateway');

// Carrier shape for W3C traceparent: a plain string map, used for both inject and extract.
type Carrier = Record<string, string>;

// We attach the connection's Context + its long-lived span to each socket.
interface TracedSocket extends WebSocket {
  otelContext?: Context;
  connSpan?: Span;
}

const wss = new WebSocketServer({ noServer: true });

// Hook the raw upgrade so we can read headers BEFORE the protocol switches.
server.on('upgrade', (req: IncomingMessage, socket, head) => {
  // Extract any inbound traceparent (e.g. from a gateway hop or a non-browser
  // client; the browser WebSocket API cannot set custom handshake headers).
  // If none is present, extract() returns the active context unchanged.
  const parentCtx = propagation.extract(context.active(), req.headers);

  // Start the upgrade span INSIDE the extracted parent so it nests correctly.
  const connSpan = tracer.startSpan('ws.upgrade', {
    attributes: {
      'ws.path': req.url ?? '',
      'net.peer.ip': req.socket.remoteAddress ?? '',
    },
  }, parentCtx);

  // Bind the new span into a Context we will re-enter for every frame.
  const otelContext = trace.setSpan(parentCtx, connSpan);

  wss.handleUpgrade(req, socket, head, (ws: TracedSocket) => {
    ws.otelContext = otelContext; // travels with the connection for its lifetime
    ws.connSpan = connSpan;
    wss.emit('connection', ws, req);
  });
});

wss.on('connection', (ws: TracedSocket) => {
  ws.on('message', (raw: RawData) => {
    // Re-enter the connection's context so the message span becomes a CHILD
    // of ws.upgrade instead of a detached root span.
    context.with(ws.otelContext!, () => {
      const msg = parse(raw);
      // RawData may be a Buffer, an ArrayBuffer, or a Buffer[]; count bytes, not characters.
      const bytes = Array.isArray(raw)
        ? raw.reduce((n, chunk) => n + chunk.byteLength, 0)
        : raw.byteLength;
      const span = tracer.startSpan(`ws.message ${msg.type}`, {
        attributes: { 'ws.message.type': msg.type, 'ws.message.bytes': bytes },
      });

      // Activate THIS span so downstream work (incl. the Redis publish) nests under it.
      context.with(trace.setSpan(context.active(), span), async () => {
        try {
          // Awaiting keeps the span open until an async handler settles, so a
          // slow handler shows its real duration whether it is sync or async.
          await handle(ws, msg);
        } catch (err) {
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
        } finally {
          span.end(); // ends per-message; ws.upgrade stays open
        }
      });
    });
  });

  ws.on('close', () => ws.connSpan?.end()); // close the long-lived span on disconnect
});

// --- Crossing the Redis fan-out boundary ---
// On publish: inject the CURRENT context into the message envelope.
function publishToChannel(channel: string, body: unknown) {
  const carrier: Carrier = {};
  propagation.inject(context.active(), carrier); // writes traceparent into carrier
  redis.publish(channel, JSON.stringify({ body, _otel: carrier }));
}

// On the consuming node: extract the carrier and start a delivery span that
// continues the publisher's trace.
redisSub.on('message', (_channel: string, payload: string) => {
  const { body, _otel } = JSON.parse(payload);
  // Re-hydrate the trace; an envelope without _otel leaves the context unchanged.
  const ctx = propagation.extract(context.active(), (_otel ?? {}) as Carrier);
  const span = tracer.startSpan('ws.deliver', {}, ctx); // child of the publisher's span, same trace
  // Activate the delivery span so any work done during delivery nests under it.
  context.with(trace.setSpan(ctx, span), () => {
    try {
      deliverToLocalSubscribers(body);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
    } finally {
      span.end(); // end the span even if delivery throws
    }
  });
});

The decisive moves are ws.otelContext = otelContext (storing the connection context so it outlives the request) and the nested context.with() calls that re-activate it per frame. Without them, span parentage breaks and your message spans scatter into single-span traces. The _otel carrier on the Redis envelope is what makes the Redis Pub/Sub fan-out appear as one continuous trace rather than two unrelated ones.
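Because the envelope crosses a process boundary, the consumer should not assume every payload is well formed: messages from older nodes or other publishers may lack the _otel field entirely. The sketch below shows one way to encode and decode the envelope defensively as plain functions. The names encodeEnvelope, decodeEnvelope, and Envelope are hypothetical helpers, not part of any library, and the carrier is assumed to be whatever propagation.inject wrote.

```typescript
// Hypothetical helpers; the `_otel` field name follows the envelope described above.
type Carrier = Record<string, string>;

interface Envelope<T> {
  body: T;
  _otel: Carrier;
}

// Serialize the payload together with the carrier that propagation.inject() filled in.
function encodeEnvelope<T>(body: T, carrier: Carrier): string {
  const envelope: Envelope<T> = { body, _otel: carrier };
  return JSON.stringify(envelope);
}

// Parse a payload, tolerating malformed JSON and envelopes without trace context.
// Returns null when the payload is not an envelope at all.
function decodeEnvelope<T>(payload: string): Envelope<T> | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(payload);
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null || !('body' in parsed)) return null;

  // Keep only string-valued carrier entries; anything else is dropped.
  const rawCarrier = (parsed as { _otel?: unknown })._otel;
  const carrier: Carrier = {};
  if (typeof rawCarrier === 'object' && rawCarrier !== null) {
    for (const [key, value] of Object.entries(rawCarrier)) {
      if (typeof value === 'string') carrier[key] = value;
    }
  }
  return { body: (parsed as { body: T }).body, _otel: carrier };
}
```

On the consumer, the decoded _otel map is what you would hand to propagation.extract. An empty carrier leaves the context unchanged, so a payload without trace context produces a new root span instead of an exception.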

Operational checklist #

  • Confirm ws.upgrade spans appear and end on close, not on 101
  • Verify ws.message spans are children of ws.upgrade
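  • Add ws.message.type as a span attribute so slow handlers can be grouped by message type
  • Bound the long-lived ws.upgrade span, or end it early and use span links, if your backend handles multi-hour spans poorly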
  • Test the Redis path end-to-end: publish on node A, assert the ws.deliver span on node B shares the same trace_id

FAQ #

Why does auto-instrumentation miss WebSocket messages? #

HTTP auto-instrumentation spans the request/response cycle. The upgrade is one request, so it gets one span that ends at the protocol switch. Subsequent frames are raw WebSocket data on the same socket and never re-enter the HTTP request emitter, so no instrumentation fires. You must start spans manually per frame, as shown in WebSocket Observability & Monitoring.

Should the connection span stay open for the whole connection? #

It can, and that gives the cleanest parent for every message span — but a multi-hour span is awkward for some backends and delays export. The alternative is to end ws.upgrade quickly and use span links to relate each per-message root back to the connection. Pick based on how your tracing backend handles long-running spans.
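The span-link alternative might look like the sketch below. It relies on the root and links options that @opentelemetry/api accepts in startSpan; the interfaces ending in Like are local structural stand-ins so the sketch compiles on its own, and finishUpgrade and startLinkedMessageSpan are hypothetical helper names. In real code you would import Tracer, Span, and SpanContext from @opentelemetry/api instead.

```typescript
// Local structural stand-ins for the subset of @opentelemetry/api used here.
interface SpanContextLike { traceId: string; spanId: string; traceFlags: number; }
interface SpanLike { spanContext(): SpanContextLike; end(): void; }
interface SpanOptionsLike {
  root?: boolean;                         // start a new trace, ignoring any active parent
  links?: { context: SpanContextLike }[]; // non-parent references to other spans
  attributes?: Record<string, string | number | boolean>;
}
interface TracerLike { startSpan(name: string, options?: SpanOptionsLike): SpanLike; }

// Capture the upgrade span's identity, then end the span right away so it
// exports promptly instead of staying open for the life of the connection.
function finishUpgrade(connSpan: SpanLike): SpanContextLike {
  const connCtx = connSpan.spanContext();
  connSpan.end();
  return connCtx; // store this on the socket instead of a live Context
}

// Start each message as its own root span, linked back to the upgrade span.
function startLinkedMessageSpan(
  tracer: TracerLike,
  type: string,
  connCtx: SpanContextLike,
): SpanLike {
  return tracer.startSpan(`ws.message ${type}`, {
    root: true,
    links: [{ context: connCtx }],
    attributes: { 'ws.message.type': type },
  });
}
```

The trade-off is that each message becomes its own trace, so you navigate from message to connection through the link rather than through the parent tree. How prominently links are displayed varies by backend.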

How do I avoid flooding the exporter on high-traffic sockets? #

Sample. Use head sampling to keep a fraction of message spans, or tail-based sampling to keep only spans exceeding a latency threshold. Either way, set span attribute and event limits so a misbehaving client cannot exhaust memory. This becomes critical as you grow connection counts — see Scaling Real-Time Infrastructure.
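The two strategies differ in when the decision is made, which the sketch below illustrates with two plain functions. This is not the SDK's algorithm: headSample and tailSample are hypothetical names, and the bucketing scheme is an assumption chosen for clarity. In production you would more likely configure the SDK's ratio-based sampler for head sampling and run tail sampling in the OpenTelemetry Collector.

```typescript
// Head sampling sketch: decide up front from the trace ID alone, so every
// span in the same trace gets the same decision on every node.
function headSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex chars as an unsigned 32-bit bucket in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 0x100000000;
}

// Tail sampling sketch: decide after the span has finished, keeping only
// spans whose duration exceeds the latency threshold.
function tailSample(durationMs: number, thresholdMs: number): boolean {
  return durationMs > thresholdMs;
}
```

Head sampling is cheap but blind to outcome, so it can drop the one slow message you cared about. Tail sampling keeps it, at the cost of buffering spans until they finish.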

Does this work with Socket.IO instead of raw ws? #

The principle is identical, but Socket.IO wraps frames in its own protocol and exposes socket.on(event) rather than a single message event. Start the per-message span inside each event handler and store the context on the Socket.IO socket object the same way. The Redis adapter path also needs the _otel carrier injected into the adapter payload.

Where do these spans fit in the broader connection lifecycle? #

They sit on top of the connection management layer. The upgrade span begins exactly where authorization and handshake handling end, so it composes cleanly with the rest of Backend WebSocket Connection Management — heartbeats, reconnection, and teardown each become spannable events on the same connection context.
