Instrumenting WebSockets with OpenTelemetry #
You added OpenTelemetry, your REST routes show clean traces, and yet every WebSocket interaction collapses into a single span that ends at 101 Switching Protocols. A handler that takes 800 ms to process an inbound frame is invisible — there is no span, no latency record, nothing linking it back to the connection that opened hours ago. This page shows how to trace post-upgrade WebSocket traffic with @opentelemetry/api and the ws package: a span on the upgrade, a long-lived connection context, and one child span per message so a slow handler becomes a node in your trace tree.
Root cause #
Auto-instrumentation hooks the request/response lifecycle. The http instrumentation wraps emit('request'), starts a span, and ends it when the response finishes. A WebSocket upgrade is an HTTP request, so you get at most one span covering the handshake, and it closes the moment the socket switches protocols. Everything after that travels as raw WebSocket frames over the same TCP socket, never touching the HTTP request emitter again. The auto-instrumentation has no concept of a “frame,” so the entire duplex conversation, potentially thousands of messages, falls outside any span.
Three consequences follow:
- No per-message latency. A handler blocking the event loop for 800 ms produces zero telemetry. You see CPU rise in metrics but cannot attribute it to a specific message type or tenant.
- No causal link to the upgrade. Even if you manually start a span per message, it floats as an orphan root unless you stash the connection’s trace context and re-activate it for each frame.
- Broken traces across fan-out. When a message triggers a publish to Redis and another node delivers it to a subscriber, the consumer-side span starts a fresh trace. The two halves of the same logical flow never join, because the W3C traceparent context does not ride along the Redis payload by default.
The fix is to treat the connection as a context carrier. Start a span when the upgrade arrives, extract traceparent from the upgrade headers, store the resulting Context on the socket, and use context.with() to re-enter it every time a frame lands. For the Redis Pub/Sub fan-out boundary, you inject the active context into the message envelope on the publishing side and extract it on the consumer side, stitching the two spans into one trace.
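To make concrete what actually travels in the carrier, here is a minimal sketch of a parser for a version-00 W3C traceparent value. It is illustrative only: propagation.extract already does this for you, and the names TraceParent and parseTraceparent are hypothetical helpers, not part of @opentelemetry/api.

```typescript
// Parsed form of a version-00 W3C `traceparent` header value.
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the parent span)
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Layout: version "00", trace id, parent span id, trace flags, joined by "-".
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed values and for all-zero ids, which are invalid.
function parseTraceparent(value: string | undefined): TraceParent | null {
  if (!value) return null;
  const m = TRACEPARENT_RE.exec(value.trim());
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

A helper like this is mostly useful in tests and log lines, for example to assert that the value a client sent on the upgrade request carries the same trace id as the spans you later see in the backend.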
Resolution #
The pattern below uses @opentelemetry/api directly so it works with any SDK/exporter you already configured. Each non-obvious line is annotated. The keys are: a span per upgrade, the connection’s Context stored on the socket, context.with() to re-activate it per frame, and propagation.inject / extract to cross the Redis boundary.
import { WebSocketServer, WebSocket, RawData } from 'ws';
import { IncomingMessage } from 'http';
import {
  trace, context, propagation, SpanStatusCode, Context, Span,
} from '@opentelemetry/api';

// Assumed to exist elsewhere in your app: `server` (an http.Server), `redis` and
// `redisSub` (publisher/subscriber clients), plus parse(), handle(), and
// deliverToLocalSubscribers().

const tracer = trace.getTracer('websocket-gateway');

// Carrier shape for W3C traceparent: plain string map, both ways.
type Carrier = Record<string, string>;

// We attach the connection's Context + its long-lived span to each socket.
interface TracedSocket extends WebSocket {
  otelContext?: Context;
  connSpan?: Span;
}

const wss = new WebSocketServer({ noServer: true });

// Hook the raw upgrade so we can read headers BEFORE the protocol switches.
server.on('upgrade', (req: IncomingMessage, socket, head) => {
  // Extract any inbound traceparent (e.g. a browser that injected one, or a
  // gateway hop). Falls back to active context if none present.
  const parentCtx = propagation.extract(context.active(), req.headers as Carrier);

  // Start the upgrade span INSIDE the extracted parent so it nests correctly.
  const connSpan = tracer.startSpan('ws.upgrade', {
    attributes: {
      'ws.path': req.url ?? '',
      'net.peer.ip': req.socket.remoteAddress ?? '',
    },
  }, parentCtx);

  // Bind the new span into a Context we will re-enter for every frame.
  const otelContext = trace.setSpan(parentCtx, connSpan);

  wss.handleUpgrade(req, socket, head, (ws: TracedSocket) => {
    ws.otelContext = otelContext; // travels with the connection for its lifetime
    ws.connSpan = connSpan;
    wss.emit('connection', ws, req);
  });
});

wss.on('connection', (ws: TracedSocket) => {
  ws.on('message', (raw: RawData) => {
    // Re-enter the connection's context so the message span becomes a CHILD
    // of ws.upgrade instead of a detached root span.
    context.with(ws.otelContext ?? context.active(), () => {
      const msg = parse(raw);
      // RawData is Buffer | ArrayBuffer | Buffer[]; count bytes, not UTF-16 units.
      const bytes = Array.isArray(raw)
        ? raw.reduce((n, chunk) => n + chunk.length, 0)
        : raw.byteLength;
      const span = tracer.startSpan(`ws.message ${msg.type}`, {
        attributes: { 'ws.message.type': msg.type, 'ws.message.bytes': bytes },
      });
      // Activate THIS span so downstream work (incl. the Redis publish) nests under it.
      return context.with(trace.setSpan(context.active(), span), async () => {
        try {
          // await so an async handler is timed to completion; harmless if handle() is sync
          await handle(ws, msg); // a slow handler here now shows its real duration
        } catch (err) {
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
        } finally {
          span.end(); // ends per-message; ws.upgrade stays open
        }
      });
    });
  });

  ws.on('close', () => ws.connSpan?.end()); // close the long-lived span on disconnect
});

// --- Crossing the Redis fan-out boundary ---
// On publish: inject the CURRENT context into the message envelope.
function publishToChannel(channel: string, body: unknown) {
  const carrier: Carrier = {};
  propagation.inject(context.active(), carrier); // writes traceparent into carrier
  redis.publish(channel, JSON.stringify({ body, _otel: carrier }));
}

// On the consuming node: extract the carrier and start a child delivery span.
redisSub.on('message', (_channel: string, payload: string) => {
  const { body, _otel } = JSON.parse(payload);
  // Tolerate envelopes without a carrier; extract() then leaves the context unchanged.
  const ctx = propagation.extract(context.active(), (_otel ?? {}) as Carrier); // re-hydrate trace
  context.with(ctx, () => {
    const span = tracer.startSpan('ws.deliver'); // same trace as the publisher
    try {
      deliverToLocalSubscribers(body);
    } finally {
      span.end(); // end the span even if delivery throws
    }
  });
});
The decisive moves are ws.otelContext = otelContext (storing the connection context so it outlives the request) and the nested context.with() calls that re-activate it per frame. Without them, span parentage breaks and your message spans scatter into single-span traces. The _otel carrier on the Redis envelope is what makes the Redis Pub/Sub fan-out appear as one continuous trace rather than two unrelated ones.
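During a rolling deploy, some publishers may not yet attach _otel, and a malformed payload should not crash the subscriber. The following is a hedged sketch of envelope helpers that tolerate both cases. encodeEnvelope and decodeEnvelope are hypothetical names, and the inject parameter stands in for a call such as (c) => propagation.inject(context.active(), c), so the helpers stay independent of any tracing library.

```typescript
type Carrier = Record<string, string>;

interface Envelope<T> {
  body: T;
  _otel: Carrier; // empty when the publisher attached no trace context
}

// Serialize a body together with whatever the injected function writes into the carrier.
function encodeEnvelope<T>(body: T, inject: (carrier: Carrier) => void): string {
  const carrier: Carrier = {};
  inject(carrier);
  return JSON.stringify({ body, _otel: carrier });
}

// Parse defensively: returns null for non-JSON or body-less payloads, and an
// empty carrier when _otel is missing or not an object.
function decodeEnvelope<T>(payload: string): Envelope<T> | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(payload);
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null || !('body' in parsed)) return null;
  const raw = (parsed as { _otel?: unknown })._otel;
  const carrier: Carrier = {};
  if (typeof raw === 'object' && raw !== null) {
    for (const [key, value] of Object.entries(raw)) {
      if (typeof value === 'string') carrier[key] = value; // keep string entries only
    }
  }
  return { body: (parsed as { body: T }).body, _otel: carrier };
}
```

With an empty carrier, extraction finds no parent and the delivery span simply starts its own trace, which is the same behavior you had before adding the carrier.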
Operational checklist #
- Confirm ws.upgrade spans appear and end on close, not on 101
- Verify ws.message spans are children of ws.upgrade
- Add ws.message.type as a span attribute
- Bound the long-lived ws.upgrade span
- Test the Redis path end-to-end: publish on node A, assert the ws.deliver span on node B shares the same trace_id
FAQ #
Why does auto-instrumentation miss WebSocket messages? #
HTTP auto-instrumentation spans the request/response cycle. The upgrade is one request, so it gets one span that ends at the protocol switch. Subsequent frames are raw WebSocket data on the same socket and never re-enter the HTTP request emitter, so no instrumentation fires. You must start spans manually per frame, as shown in WebSocket Observability & Monitoring.
Should the connection span stay open for the whole connection? #
It can, and that gives the cleanest parent for every message span — but a multi-hour span is awkward for some backends and delays export. The alternative is to end ws.upgrade quickly and use span links to relate each per-message root back to the connection. Pick based on how your tracing backend handles long-running spans.
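As a rough sketch of the span-link alternative: capture the connection span's SpanContext before ending ws.upgrade, then start each message span as a new root that links back to it. The interfaces below are local, structural stand-ins that mirror the root and links span options of @opentelemetry/api so the example is self-contained; startLinkedMessageSpan is a hypothetical helper name.

```typescript
// Minimal structural stand-ins for the parts of the tracing API used here.
interface SpanContextLike { traceId: string; spanId: string; traceFlags: number; }
interface LinkLike { context: SpanContextLike; }
interface SpanLike { spanContext(): SpanContextLike; end(): void; }
interface SpanOptionsLike {
  root?: boolean;                      // start a new trace, ignoring any active parent
  links?: LinkLike[];                  // related spans in other traces
  attributes?: Record<string, string>;
}
interface TracerLike {
  startSpan(name: string, options?: SpanOptionsLike): SpanLike;
}

// Start a per-message ROOT span that links to the already-ended connection span.
function startLinkedMessageSpan(
  tracer: TracerLike,
  connection: SpanContextLike, // saved via connSpan.spanContext() at upgrade time
  messageType: string,
): SpanLike {
  return tracer.startSpan(`ws.message ${messageType}`, {
    root: true,
    links: [{ context: connection }],
    attributes: { 'ws.message.type': messageType },
  });
}
```

The trade-off is that each message becomes its own trace, so you navigate from message to connection through the link rather than through parent/child nesting. How well that reads depends on how your backend renders links.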
How do I avoid flooding the exporter on high-traffic sockets? #
Sample. Use head sampling to keep a fraction of message spans, or tail-based sampling to keep only spans exceeding a latency threshold. Either way, set span attribute and event limits so a misbehaving client cannot exhaust memory. This becomes critical as you grow connection counts — see Scaling Real-Time Infrastructure.
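The keep-or-drop decision can be sketched as a pure function: always keep slow or failed message spans, and keep a deterministic fraction of the rest based on the trace id. This is a simplified illustration, not the algorithm any particular SDK or collector uses; FinishedSpan, keepFraction, and shouldExport are hypothetical names, and in practice the latency-based part usually lives in a tail-sampling stage after spans have finished.

```typescript
// The facts about a finished span that the decision needs.
interface FinishedSpan {
  traceId: string;    // 32 hex characters
  durationMs: number;
  error: boolean;
}

// Deterministic ratio check: the same trace id always gets the same answer,
// so every span of a trace is kept or dropped together.
function keepFraction(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 0xffffffff, NaN if malformed
  return bucket < Math.floor(ratio * 0x100000000);  // NaN compares false, so malformed ids drop
}

// Keep every slow or failed span, plus a baseline fraction of normal ones.
function shouldExport(span: FinishedSpan, slowMs: number, baseRatio: number): boolean {
  if (span.error || span.durationMs >= slowMs) return true;
  return keepFraction(span.traceId, baseRatio);
}
```

For example, shouldExport(span, 250, 0.01) would retain all message spans at or above 250 ms and roughly one percent of the fast ones, assuming trace ids are uniformly distributed.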
Does this work with Socket.IO instead of raw ws? #
The principle is identical, but Socket.IO wraps frames in its own protocol and exposes socket.on(event) rather than a single message event. Start the per-message span inside each event handler and store the context on the Socket.IO socket object the same way. The Redis adapter path also needs the _otel carrier injected into the adapter payload.
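One way to avoid repeating the span boilerplate in every event handler is a small wrapper. The sketch below is library-agnostic and makes assumptions: traced is a hypothetical helper, and startSpan is a function you supply that starts a span in the connection's stored context (for instance by calling tracer.startSpan inside context.with). Like the ws example, it records handler errors on the span rather than rethrowing them.

```typescript
// The slice of a span this wrapper needs.
interface SpanLike {
  recordException(error: Error): void;
  end(): void;
}

type StartSpan = (name: string) => SpanLike;

// Wrap an event handler so each invocation gets its own span, ended when the
// handler settles, whether it is synchronous or returns a promise.
function traced<A extends unknown[]>(
  event: string,
  startSpan: StartSpan,
  handler: (...args: A) => void | Promise<void>,
): (...args: A) => Promise<void> {
  return async (...args: A): Promise<void> => {
    const span = startSpan(`ws.message ${event}`);
    try {
      await handler(...args);
    } catch (err) {
      span.recordException(err instanceof Error ? err : new Error(String(err)));
    } finally {
      span.end();
    }
  };
}
```

Registration would then look like socket.on('chat', traced('chat', startSpan, onChat)), where onChat is your existing handler.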
Where do these spans fit in the broader connection lifecycle? #
They sit on top of the connection management layer. The upgrade span begins exactly where authorization and handshake handling end, so it composes cleanly with the rest of Backend WebSocket Connection Management — heartbeats, reconnection, and teardown each become spannable events on the same connection context.
Related #
- WebSocket Observability & Monitoring — the parent area covering tracing, metrics, and structured logging for real-time systems.
- Backend WebSocket Connection Management — connection lifecycle, heartbeats, and teardown that your spans annotate.
- Redis Pub/Sub Fan-Out — the fan-out boundary you stitch traces across with an injected carrier.
- Scaling Real-Time Infrastructure — sampling and span-volume concerns once connection counts climb.