WebSocket Observability & Monitoring #

A real-time fleet without telemetry is undebuggable. Consider a 2 a.m. page: users report “the dashboard froze,” but every HTTP health check is green, CPU is flat, and your access logs are silent because a WebSocket connection opens once and then says nothing for hours. Without an active-connection gauge you cannot tell whether 50,000 sockets quietly evaporated; without a close-code distribution you cannot tell a deploy-driven 1001 drain from a 1006 abnormal closure storm; without heartbeat RTT histograms you cannot see the one node where pongs are arriving 8 seconds late. Request/response observability tooling assumes short-lived requests — long-lived bidirectional sockets fall straight through the gap. This guide wires prom-client metrics, OpenTelemetry spans, and structured per-connection logs into the ws connection, message, and close events so that the next outage is a query, not a guess.

Prerequisites #

Before instrumenting, you need a stable connection lifecycle to hang metrics on. Heartbeat ping/pong must already be running so you can measure RTT and detect zombie sockets — see Connection Lifecycle & Heartbeats. The instrumentation below assumes the foundational Backend WebSocket Connection Management patterns are in place: a single WebSocketServer, deterministic session IDs, and explicit RFC 6455 close codes. If you run more than one node — and once you instrument, you will see exactly how unevenly load spreads — read Scaling Real-Time Infrastructure, because per-node gauges only make sense when each pod exposes its own /metrics endpoint scraped independently.

You will also need a Prometheus server (or any OpenMetrics scraper) that can reach the pods, a Grafana instance for dashboards, and an OTel collector endpoint if you adopt distributed tracing across the upgrade.

Figure: WebSocket telemetry pipeline. A WebSocket server (connect / msg / close events, /metrics endpoint, spans) emits metrics, spans, and logs. Prometheus scrapes /metrics, an OTel Collector receives spans and logs, and both feed Grafana dashboards and Alertmanager for on-call paging.

Applying RED and USE to WebSockets #

The RED method (Rate, Errors, Duration) and USE method (Utilization, Saturation, Errors) were written for request/response services, but they map cleanly onto sockets once you reframe the unit. For RED, the “request” is a message or a connection event: Rate becomes connect/disconnect rate and message throughput, Errors become abnormal close codes, and Duration becomes connection lifetime and heartbeat RTT. For USE, the resource is the socket fleet itself: Utilization is the active-connection gauge against capacity, Saturation is bufferedAmount backpressure and CLOSE_WAIT socket counts, and Errors are the same abnormal closures. Instrument all of these as named Prometheus series so a single dashboard answers “is it the clients, the network, or this one node?”

Core implementation #

The pattern is to declare the metric instruments once at module scope, then mutate them inside the ws lifecycle handlers. Gauges go up on connection and down on close; counters increment on connect, on close (labeled by code), and on each message; histograms observe heartbeat RTT and send-buffer depth.

import http, { IncomingMessage } from 'http';
import { randomUUID } from 'crypto';
import { WebSocketServer, WebSocket } from 'ws';
import { Registry, Gauge, Counter, Histogram, collectDefaultMetrics } from 'prom-client';
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Assumption: `log` is your structured logger (pino or similar); console stands in here.
const log = console;

// Per-connection fields this guide attaches to each socket.
type InstrumentedSocket = WebSocket & { sessionId?: string; recordPing?: () => void };

const registry = new Registry();
collectDefaultMetrics({ register: registry }); // event-loop lag, GC, fd count

// USE: live count of OPEN sockets on THIS node. The single most important number.
const activeConnections = new Gauge({
  name: 'ws_connections_active',
  help: 'Currently open WebSocket connections',
  registers: [registry],
});

// RED: connect/disconnect rate is the derivative of these counters.
const connectsTotal = new Counter({
  name: 'ws_connects_total',
  help: 'Total accepted WebSocket connections',
  registers: [registry],
});

// RED errors: close-code distribution. Keep `code` low-cardinality (it is bounded by RFC 6455).
const closesTotal = new Counter({
  name: 'ws_closes_total',
  help: 'Total closed connections by RFC 6455 close code',
  labelNames: ['code'] as const,
  registers: [registry],
});

// RED rate: message throughput, split by direction.
const messagesTotal = new Counter({
  name: 'ws_messages_total',
  help: 'WebSocket messages processed',
  labelNames: ['direction'] as const, // 'in' | 'out'
  registers: [registry],
});

// RED/USE duration: heartbeat round-trip time. Buckets in seconds.
const heartbeatRtt = new Histogram({
  name: 'ws_heartbeat_rtt_seconds',
  help: 'Ping/pong round-trip latency',
  buckets: [0.005, 0.025, 0.1, 0.25, 1, 2.5, 5, 10],
  registers: [registry],
});

// USE saturation: bufferedAmount sampled at close, flagging slow consumers.
const sendBufferBytes = new Histogram({
  name: 'ws_send_buffer_bytes',
  help: 'socket.bufferedAmount sampled at close (nonzero values flag slow consumers)',
  buckets: [0, 16_384, 262_144, 1_048_576, 8_388_608],
  registers: [registry],
});

const tracer = trace.getTracer('ws-server');
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket: InstrumentedSocket, req: IncomingMessage) => {
  // Start a span that represents the whole connection lifetime. This listing starts a
  // root span; parenting it to a traceparent header carried on the HTTP upgrade request
  // is covered in the propagation note below.
  const span = tracer.startSpan('ws.connection', {
    attributes: { 'ws.path': req.url ?? '/', 'net.peer.ip': req.socket.remoteAddress ?? '' },
  });

  const sessionId = randomUUID();
  socket.sessionId = sessionId;
  activeConnections.inc(); // gauge up
  connectsTotal.inc(); // counter up

  // Structured per-connection log line: one event, machine-parseable fields.
  log.info({ event: 'ws_open', sessionId, traceId: span.spanContext().traceId, path: req.url });

  let lastPingAt = 0;
  socket.on('pong', () => {
    if (lastPingAt) heartbeatRtt.observe((Date.now() - lastPingAt) / 1000);
  });
  // Your heartbeat scheduler calls this instead of socket.ping() so the send time is stamped.
  socket.recordPing = () => { lastPingAt = Date.now(); socket.ping(); };

  socket.on('message', () => {
    messagesTotal.inc({ direction: 'in' });
    // ... route the message; instrument outbound sends with { direction: 'out' }.
  });

  socket.on('close', (code) => {
    activeConnections.dec(); // gauge down: CRITICAL on every path
    closesTotal.inc({ code: String(code) }); // bounded label cardinality
    if (socket.bufferedAmount > 0) sendBufferBytes.observe(socket.bufferedAmount);
    span.setAttribute('ws.close_code', code);
    // 1000 (normal), 1001 (going away), and 1005 (no status sent) are routine closes;
    // everything else, including 1002/1003 protocol errors and 1006, marks the span as failed.
    if (![1000, 1001, 1005].includes(code)) span.setStatus({ code: SpanStatusCode.ERROR });
    span.end();
    log.info({ event: 'ws_close', sessionId, code, bufferedAtClose: socket.bufferedAmount });
  });

  socket.on('error', (err) => {
    span.recordException(err);
    log.error({ event: 'ws_error', sessionId, err: err.message });
    // Note: 'error' is followed by 'close', so the gauge decrement happens there, not here.
  });
});

// Expose for Prometheus to scrape.
http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', registry.contentType);
    res.end(await registry.metrics());
  } else {
    res.writeHead(404).end();
  }
}).listen(9090);
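The listing leaves the heartbeat scheduler that calls recordPing to you. A minimal sketch of one sweep might look like the following. It assumes each socket also carries an isAlive flag that is set to true on connection and again in the pong handler; that flag, the HeartbeatSocket interface, and the function names are hypothetical and not part of the listing above.

```typescript
// Minimal shape the sweep needs from a socket. The real `ws` WebSocket provides
// terminate(); isAlive and recordPing are per-connection additions (assumptions).
interface HeartbeatSocket {
  isAlive: boolean;      // hypothetical flag: set true on connect and in the 'pong' handler
  recordPing(): void;    // stamps the send time, then sends the ping frame
  terminate(): void;     // hard-close without a closing handshake
}

// One heartbeat sweep: drop sockets that never answered the previous ping,
// then ping the rest. Returns how many sockets were terminated.
function heartbeatSweep(sockets: Iterable<HeartbeatSocket>): number {
  let terminated = 0;
  for (const socket of sockets) {
    if (!socket.isAlive) {
      socket.terminate(); // the 'close' handler still runs, so the gauge decrements there
      terminated++;
      continue;
    }
    socket.isAlive = false; // flipped back to true when the pong arrives
    socket.recordPing();
  }
  return terminated;
}

// Hypothetical wiring: sweep on a fixed interval until the returned function is called.
function startHeartbeat(
  getSockets: () => Iterable<HeartbeatSocket>,
  intervalMs = 30_000,
): () => void {
  const timer = setInterval(() => heartbeatSweep(getSockets()), intervalMs);
  return () => clearInterval(timer);
}
```

Because zombie sockets are terminated rather than decremented directly, the close handler stays the only place that touches ws_connections_active.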

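The OpenTelemetry span is started on connection and ended on close, so its duration is the connection lifetime: a 9-hour socket produces a 9-hour span that you can correlate with any message spans nested beneath it. To propagate trace context across the upgrade, extract the traceparent header from req.headers inside handleUpgrade and use it as the parent context when starting the span; that links the caller's upgrade request to the server-side connection in a single trace. Note that the browser WebSocket API cannot set custom request headers, so for browser clients the header has to be added by something in front of the server, or the context has to travel another way (a query parameter, for example).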
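To make the propagation step concrete, here is a small sketch of what a W3C traceparent header carries and how it could be validated. In a real service you would more likely let the registered OpenTelemetry propagator do this, via propagation.extract(context.active(), req.headers) from @opentelemetry/api, and pass the resulting context as the third argument to tracer.startSpan. The parseTraceparent function and ParentTraceContext type below are illustrative names, not part of any library.

```typescript
// Parsed form of a W3C Trace Context `traceparent` header value.
interface ParentTraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the caller's span)
  sampled: boolean; // bit 0 of the trace-flags byte
}

// version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
// Future versions may append extra fields, hence the optional tail.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})(-.*)?$/;

function parseTraceparent(header: string | string[] | undefined): ParentTraceContext | null {
  const value = Array.isArray(header) ? header[0] : header;
  if (!value) return null;
  const match = TRACEPARENT.exec(value.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags, extra] = match;
  if (version === 'ff') return null;          // version ff is reserved as invalid
  if (version === '00' && extra) return null; // version 00 has exactly four fields
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null; // all-zero ids are invalid
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Usage inside an upgrade handler (req is the http.IncomingMessage):
//   const parent = parseTraceparent(req.headers['traceparent']);
//   if (parent) log.info({ event: 'ws_upgrade', parentTraceId: parent.traceId });
```

Returning null for anything malformed means a bad header degrades to a fresh root span instead of failing the upgrade.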

Configuration reference #

| Metric | Type | Labels | Alert threshold |
| --- | --- | --- | --- |
| ws_connections_active | gauge | node | sustained == 0 while traffic expected, or > capacity * 0.9 |
| ws_connects_total | counter | node | rate(...[1m]) spike vs baseline (reconnect storm) |
| ws_closes_total | counter | code | rate(ws_closes_total{code="1006"}[5m]) > 0.1 |
| ws_messages_total | counter | direction | throughput drop to 0 for 2m (silent stall) |
| ws_heartbeat_rtt_seconds | histogram | (none) | p99 > 5s for 5m (degraded network segment) |
| ws_send_buffer_bytes | histogram | (none) | p95 > 1 MiB (slow-consumer backpressure) |
| process_open_fds | gauge | (none) | within 20% of ulimit -n (fd / CLOSE_WAIT leak) |

Edge cases & gotchas #

High-cardinality labels. Never label a metric with sessionId, userId, or remote IP: each unique value creates its own time series, and on a busy fleet the resulting series count can exhaust Prometheus memory. Keep labels bounded: code (about a dozen RFC 6455 values, plus any application codes you define in the 3000-4999 range), direction (two), node (your pod count). Per-connection detail belongs in logs and spans, never in metric labels.
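Close codes are sent by the peer, so a misbehaving client can in principle report any value in the 3000-4999 range. One way to keep the code label bounded regardless is to collapse everything outside a known set into a few buckets. The sketch below is an assumption about how you might do that; closeCodeLabel and the bucket names are hypothetical.

```typescript
// Codes defined by RFC 6455 plus later IANA registrations (1012-1014).
const KNOWN_CLOSE_CODES = new Set<number>([
  1000, 1001, 1002, 1003, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015,
]);

// Map a raw close code to a label value drawn from a small, fixed set.
function closeCodeLabel(code: number): string {
  if (KNOWN_CLOSE_CODES.has(code)) return String(code);
  if (code >= 3000 && code <= 3999) return '3xxx'; // registered library/framework codes
  if (code >= 4000 && code <= 4999) return '4xxx'; // private application codes
  return 'other';                                  // reserved or out-of-range values
}

// Usage in the close handler:
//   closesTotal.inc({ code: closeCodeLabel(code) });
```

If your application defines a handful of 4xxx codes that you alert on, adding them to the known set keeps them visible individually while everything else stays bucketed.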

Gauge drift on crash. ws_connections_active is a process-local value that you manually inc/dec. If a pod is SIGKILLed, the gauge never decrements, but because Prometheus scrapes per-pod and the pod is gone, the series simply goes stale and disappears, which is correct. The real bug is decrementing in both the error and close handlers: since error is followed by close, every errored connection would be subtracted twice and the gauge would drift negative. Decrement on close only.
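If you would rather make a double decrement impossible than rely on handler discipline, one option is to derive the gauge from a set of open session IDs instead of calling inc/dec directly. This is a sketch under that assumption; ConnectionTracker and its onChange hook are hypothetical names, and with prom-client the hook could be (n) => activeConnections.set(n).

```typescript
// Tracks open sessions in a Set so that releasing the same session twice
// (for example from both 'error' and 'close') cannot push the count negative.
class ConnectionTracker {
  private readonly open = new Set<string>();
  private readonly onChange: (count: number) => void;

  // `onChange` is a hypothetical hook that receives the new count after each change.
  constructor(onChange: (count: number) => void = () => {}) {
    this.onChange = onChange;
  }

  add(sessionId: string): void {
    this.open.add(sessionId);
    this.onChange(this.open.size);
  }

  // Returns true only the first time a given session is released.
  release(sessionId: string): boolean {
    const removed = this.open.delete(sessionId);
    if (removed) this.onChange(this.open.size);
    return removed;
  }

  get count(): number {
    return this.open.size;
  }
}
```

The trade-off is one Set entry per live connection in exchange for a gauge that is always equal to the number of sessions actually tracked.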

Scrape during restart / rollout. A rolling deploy briefly runs old and new pods side by side, so a fleet-wide sum of ws_connections_active can count a client twice while its old connection is still draining and its reconnect is already open on a new pod. Use sum by (node) in dashboards and expect the total to bulge during a deploy. Also, the first scrape after a pod starts shows near-zero connections; give alert rules a for: 2m clause so startup transients do not page anyone.

CLOSE_WAIT accumulation. Sockets whose remote peer has closed its side but which your process has not yet close()d pile up in CLOSE_WAIT. They do not appear in ws_connections_active (the ws close event fired) but they do consume file descriptors. Watch process_open_fds and the OS-level ss count together; a divergence between them is your leak signal.

Verification #

Scrape the endpoint directly and confirm the series exist with sane values:

curl -s localhost:9090/metrics | grep -E '^ws_(connections_active|connects_total|closes_total)'
# ws_connections_active 1287
# ws_connects_total 90431
# ws_closes_total{code="1000"} 89144
# ws_closes_total{code="1006"} 12

Count real kernel socket states to cross-check the gauge against the OS, and look for CLOSE_WAIT leaks:

ss -tan 'sport = :8080' | awk '{print $1}' | sort | uniq -c
# 1287 ESTAB
# 3 CLOSE-WAIT <- should stay near zero; growth = unclosed sockets

Then assert behavior in PromQL on the Prometheus side:

# Abnormal-closure rate per node over 5m — should hover near zero.
sum by (node) (rate(ws_closes_total{code="1006"}[5m]))

# Heartbeat p99 RTT in seconds — flags a degraded network segment.
histogram_quantile(0.99, sum by (le) (rate(ws_heartbeat_rtt_seconds_bucket[5m])))

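# Connect rate vs disconnect rate: graph both queries on one panel; a reconnect storm shows them climbing together.
# (Joining them with `and` would return only the left-hand series, so keep them as two queries.)
sum(rate(ws_connects_total[1m]))
sum(rate(ws_closes_total[1m]))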

Guides in this area #

For distributed tracing across the upgrade and span propagation from browser to backend, see Instrumenting WebSockets with OpenTelemetry.

For the full prom-client registry setup, the /metrics endpoint, and the Prometheus scrape config, see Exporting WebSocket Metrics to Prometheus.

FAQ #

Why not just use access logs like a normal HTTP service? #

Because a WebSocket connection logs once at upgrade and then goes silent for the entire session — minutes to hours. Access logs capture the handshake but tell you nothing about message throughput, heartbeat latency, or how many sockets are alive right now. You need a live gauge and rate counters, not request lines.

Should the active-connection count be a gauge or a counter? #

A gauge. The number goes both up (on connect) and down (on disconnect), which is the defining property of a gauge. Counters only ever increase — use those for cumulative totals like ws_connects_total and ws_closes_total, and let PromQL rate() derive the per-second connect/disconnect rate from them.

How do I track heartbeat RTT without a high-cardinality explosion? #

Use a single histogram (ws_heartbeat_rtt_seconds) with fixed buckets and no per-connection labels. Each ping observes one value into the shared histogram; you read percentiles back with histogram_quantile. Per-socket latency belongs in a span attribute, not a metric label.
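To see why fixed buckets are enough to read a percentile back, here is a simplified sketch of the estimate that histogram_quantile performs: find the bucket containing the target rank, then interpolate linearly inside it. It is an illustration only and omits details of the real PromQL function; estimateQuantile, Bucket, and the sample counts are made up for the example.

```typescript
// One cumulative bucket, as exposed in ws_heartbeat_rtt_seconds_bucket{le="..."}.
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted.length ? sorted[sorted.length - 1].count : 0;
  if (total === 0 || q < 0 || q > 1) return NaN;
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  const bucket = sorted[i];
  // Rank falls in the +Inf bucket: the best available answer is the highest finite bound.
  if (!Number.isFinite(bucket.le)) return i > 0 ? sorted[i - 1].le : NaN;
  const lowerBound = i > 0 ? sorted[i - 1].le : 0;
  const lowerCount = i > 0 ? sorted[i - 1].count : 0;
  const inBucket = bucket.count - lowerCount;
  if (inBucket === 0) return bucket.le;
  return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Made-up cumulative counts over the bucket bounds used in this guide.
const rttBuckets: Bucket[] = [
  { le: 0.005, count: 100 }, { le: 0.025, count: 600 }, { le: 0.1, count: 900 },
  { le: 0.25, count: 960 }, { le: 1, count: 985 }, { le: 2.5, count: 995 },
  { le: 5, count: 1000 }, { le: 10, count: 1000 }, { le: Infinity, count: 1000 },
];
const p99 = estimateQuantile(0.99, rttBuckets); // about 1.75 s: halfway through the 1 to 2.5 s bucket
```

The estimate can only be as precise as the bucket that contains it, which is why it helps to have a bucket boundary at the alert threshold (5 seconds here).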

Does this work behind AWS ALB or Nginx? #

Yes — the metrics are emitted by your Node process, independent of the proxy. The one thing the proxy affects is close-code distribution: an idle-timeout cut at the load balancer surfaces as a 1006 abnormal closure rather than a clean 1000, so watch ws_closes_total{code="1006"} after tuning proxy timeouts.

What changes for Socket.IO vs raw ws? #

Socket.IO multiplexes namespaces and rooms over one connection and may fall back to HTTP long-polling, so a single transport reconnect can churn your connect/disconnect counters. Instrument at the Engine.IO transport level for the raw socket gauge, and add separate counters for Socket.IO-level connect/disconnect events if you need application-layer visibility.
