Connection Lifecycle & Heartbeats #

A WebSocket that looks OPEN is not necessarily alive. When a client’s network drops without a TCP FIN (a phone leaving Wi-Fi, a laptop sleeping mid-tunnel, a load balancer silently reaping an idle flow), the kernel can keep the socket in ESTABLISHED for minutes, or indefinitely if the server has nothing to send and TCP keepalive is off. Your server holds a file descriptor, a session entry, and a pub/sub subscription for a peer that will never send another byte. Multiply that across a fleet and you get descriptor exhaustion, ghost message fan-out, and presence lists that lie. The only reliable fix is an application-layer liveness probe: a ping-pong heartbeat that proves the peer is still there, plus a deterministic lifecycle so every socket moves through OPEN → ACTIVE → DRAINING → CLOSED predictably.

This guide covers the full server-side lifecycle in Node.js with the ws package: validating the upgrade, running a missed-pong heartbeat loop, detecting state drift, and tearing connections down cleanly with the right close codes. It anchors the broader Backend WebSocket Connection Management area.

Prerequisites #

Before the heartbeat logic matters, the surrounding infrastructure has to be correct:

  • A reverse proxy that forwards the Upgrade and Connection headers and does not impose an idle-read timeout shorter than your heartbeat interval. See the nginx setup in Configuring NGINX for WebSocket Upgrades.
  • Node-affinity in multi-node deployments so a socket and its session state stay on the same process — see Load Balancer Sticky Sessions.
  • A reconnection policy on the client so that a terminated zombie is re-established rather than silently lost — see Auto-Reconnection Strategies.
  • The ws package (npm i ws) and Node 18+. The code below imports randomUUID from the built-in crypto module, so it does not depend on a global crypto object.

The liveness problem #

The hard part of this topic is not sending a ping; it is deciding when a silent socket is dead. The diagram below shows the states a server-side socket moves through, and how the missed-pong counter is the only edge that catches a half-open TCP connection.

[Figure: WebSocket liveness state machine. A socket moves from OPEN (upgrade validated) to ACTIVE (routing messages). Every 30s a heartbeat tick sends a ping and increments the missed counter; a pong resets the counter to 0. When the counter crosses the threshold the socket becomes TERMINATED and its file descriptor is reclaimed. A half-open TCP socket never sends a pong, so the counter climbs.]
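The transitions in the diagram can be sketched as a pure function model, independent of any socket library. This is a minimal illustration, not the guide's implementation: the type and function names are hypothetical, and the threshold check uses the same "already at the limit when the tick arrives" rule as the heartbeat loop later in this guide.

```typescript
// Hypothetical pure model of the missed-pong counter; names are illustrative.
type LivenessState = { phase: 'ACTIVE' | 'TERMINATED'; missed: number };

const MAX_MISSED = 2; // assumed threshold, mirrors MAX_MISSED_PONGS in this guide

// Heartbeat tick: terminate if the limit is already reached, otherwise
// count the ping that is about to be sent as missed until a pong says otherwise.
function onTick(s: LivenessState): LivenessState {
  if (s.phase === 'TERMINATED') return s;
  if (s.missed >= MAX_MISSED) return { phase: 'TERMINATED', missed: s.missed };
  return { phase: 'ACTIVE', missed: s.missed + 1 };
}

// Pong received: the peer is alive, so the counter resets.
function onPong(s: LivenessState): LivenessState {
  return s.phase === 'ACTIVE' ? { phase: 'ACTIVE', missed: 0 } : s;
}

// A healthy peer answers every ping and stays ACTIVE with a zero counter.
let healthy: LivenessState = { phase: 'ACTIVE', missed: 0 };
for (let i = 0; i < 5; i++) healthy = onPong(onTick(healthy));

// A half-open peer never answers: two ticks raise the counter, the third terminates.
let zombie: LivenessState = { phase: 'ACTIVE', missed: 0 };
for (let i = 0; i < 3; i++) zombie = onTick(zombie);
```

Modelling the counter this way makes the threshold semantics easy to unit-test before wiring them to real sockets.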

Core implementation #

The lifecycle has three responsibilities: validate the upgrade and allocate session state, run the heartbeat loop, and tear down deterministically. The block below wires all three together. In the ws library the connection event fires after the handshake completes — the socket is already OPEN, and there is no server-side open event — so origin checks and session allocation happen synchronously at the top of the handler.

import { WebSocketServer, WebSocket } from 'ws';
import { IncomingMessage } from 'http';
import { randomUUID } from 'crypto';

const HEARTBEAT_INTERVAL_MS = 30_000; // ping cadence; must be < proxy idle timeout
const MAX_MISSED_PONGS = 2; // tolerate brief loss before terminating
const ALLOWED_ORIGINS = new Set(['https://app.example.com']);

interface SessionMetadata {
  status: 'ACTIVE' | 'DRAINING';
  connectedAt: number;
  seq: number; // monotonic server sequence for drift detection
  missedPongs: number; // reset to 0 on every pong frame
  channels: string[]; // pub/sub subscriptions to release on teardown
}

// Augment the socket so per-connection state travels with it.
type Live = WebSocket & { sessionId?: string; meta?: SessionMetadata };

const sessions = new Map<string, SessionMetadata>();
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket: Live, req: IncomingMessage) => {
  // 1. Validate the upgrade synchronously: unvalidated sockets are reaped at once.
  if (!ALLOWED_ORIGINS.has(req.headers.origin ?? '')) {
    socket.terminate(); // destroys the TCP socket with no close frame; frees the fd immediately
    return;
  }

  // 2. Allocate deterministic session state.
  const sessionId = randomUUID();
  const meta: SessionMetadata = {
    status: 'ACTIVE',
    connectedAt: Date.now(),
    seq: 0,
    missedPongs: 0,
    channels: [],
  };
  sessions.set(sessionId, meta);
  socket.sessionId = sessionId;
  socket.meta = meta;

  // 3. A pong proves liveness, so clear the counter.
  socket.on('pong', () => { meta.missedPongs = 0; });

  socket.on('error', (err) => {
    console.error('socket error', { sessionId, err });
    socket.terminate();
  });

  socket.on('close', (code) => {
    console.info('socket closed', { sessionId, code });
    sessions.delete(sessionId);
  });
});

// 4. One shared interval probes EVERY socket, far cheaper than a timer per connection.
const heartbeat = setInterval(() => {
  for (const socket of wss.clients as Set<Live>) {
    if (!socket.meta) continue;
    if (socket.meta.missedPongs >= MAX_MISSED_PONGS) {
      socket.terminate(); // zombie: never answered our pings
      continue;
    }
    socket.meta.missedPongs++; // optimistically count this tick as missed...
    socket.ping(); // ...the pong handler clears it if the peer replies
  }
}, HEARTBEAT_INTERVAL_MS);

wss.on('close', () => clearInterval(heartbeat));

A single shared setInterval over wss.clients is the canonical ws pattern: it scales to tens of thousands of sockets without allocating a timer per connection. The counter is incremented before the ping is sent and reset only by an inbound pong, so a half-open socket, which can never produce a pong, reaches MAX_MISSED_PONGS and is terminated on the following tick. Measured from the moment the peer goes silent, that takes between MAX_MISSED_PONGS × HEARTBEAT_INTERVAL_MS and (MAX_MISSED_PONGS + 1) × HEARTBEAT_INTERVAL_MS (60 to 90s here), depending on where in the interval the peer died.
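The timing of the loop above can be worked through with a small helper. This is a sketch under the assumption that the loop checks the counter before incrementing it, as the code above does; the function name is hypothetical.

```typescript
// Hypothetical helper: bounds on how long a silent peer survives the heartbeat loop.
function detectionWindowMs(
  intervalMs: number,
  maxMissedPongs: number,
): { min: number; max: number } {
  // The peer must leave maxMissedPongs pings unanswered, and the tick after
  // that performs the terminating check.
  // Best case: the peer dies just before a tick, so the first unanswered ping goes out at once.
  // Worst case: the peer dies just after answering a ping, so a full interval passes first.
  return {
    min: maxMissedPongs * intervalMs,
    max: (maxMissedPongs + 1) * intervalMs,
  };
}

const bounds = detectionWindowMs(30_000, 2);
console.log(bounds); // { min: 60000, max: 90000 }
```

The one-interval spread comes purely from where in the cycle the peer disappears, so shortening the interval tightens both the bound and the uncertainty.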

Graceful teardown and drift detection #

Termination is the violent path, reserved for peers that can no longer answer. When you close a socket whose peer is still reachable (a deploy drain or a detected state divergence), send a close frame with an explicit RFC 6455 code, and release subscriptions first so no message is fanned out to a socket on its way down.

import { pubSub } from './pubsub';
// Live, SessionMetadata and sessions are the definitions from the previous block.

export async function gracefulTeardown(
  socket: Live,
  meta: SessionMetadata & { id: string },
  lastClientSeq: number,
): Promise<void> {
  meta.status = 'DRAINING';
  try {
    // Surface state drift to the client before closing so it can resync.
    if (lastClientSeq !== meta.seq && socket.bufferedAmount < 1024 * 1024) {
      socket.send(JSON.stringify({ type: 'STATE_DRIFT', serverSeq: meta.seq, clientSeq: lastClientSeq }));
    }
    await pubSub.unsubscribe(meta.channels); // release BEFORE close: no ghost deliveries
    sessions.delete(meta.id);
    socket.close(1000, 'graceful teardown'); // 1000 = normal closure
  } catch (err) {
    console.error('teardown failed', { id: meta.id, err });
    socket.close(1011, 'server error during teardown'); // 1011 = internal error
  }
}
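For the deploy-drain case mentioned above, one possible shape is a two-phase drain: a polite close for every open socket, then a forced terminate for anything still not closed after a grace period. The sketch below is an assumption-laden illustration rather than part of this guide's implementation: DrainableSocket, beginDrain and forceStragglers are hypothetical names, the socket is described structurally so the block runs without the ws package, and close code 1001 ("going away") is chosen here as the RFC 6455 code for a server shutting down.

```typescript
// Minimal structural view of a WebSocket so the sketch stays self-contained.
interface DrainableSocket {
  readyState: number; // 1 = OPEN, 3 = CLOSED in the WebSocket API
  close(code?: number, reason?: string): void;
  terminate(): void;
}

const STATE_OPEN = 1;
const STATE_CLOSED = 3;

// Phase 1: start the closing handshake on every socket that is still open.
function beginDrain(sockets: Iterable<DrainableSocket>): void {
  for (const s of sockets) {
    if (s.readyState === STATE_OPEN) s.close(1001, 'server shutting down');
  }
}

// Phase 2: after the grace period, force-close anything that never finished.
// Returns how many sockets needed the forced path.
function forceStragglers(sockets: Iterable<DrainableSocket>): number {
  let forced = 0;
  for (const s of sockets) {
    if (s.readyState !== STATE_CLOSED) {
      s.terminate();
      forced++;
    }
  }
  return forced;
}

// Usage sketch (assumes a ws server named wss and a 5s grace period):
//   beginDrain(wss.clients);
//   setTimeout(() => forceStragglers(wss.clients), 5_000);
```

Keeping the two phases as separate synchronous functions leaves the choice of grace period and timer to the caller.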

Configuration reference #

| Parameter | Type | Default | Production value | Notes |
| --- | --- | --- | --- | --- |
| HEARTBEAT_INTERVAL_MS | number (ms) | 30000 | 25000–30000 | Keep below the proxy idle timeout (nginx proxy_read_timeout defaults to 60s). |
| MAX_MISSED_PONGS | number | 2 | 2–3 | Detection latency is between interval × this and interval × (this + 1). Lower = faster cleanup, more false positives on lossy links. |
| maxPayload | number (bytes) | 104857600 | 1048576 | Cap on WebSocketServer to reject oversized frames before buffering. |
| bufferedAmount guard | number (bytes) | | 1048576 | Skip non-critical sends above this to avoid unbounded backpressure. |
| Close code (normal) | number | 1000 | 1000 | Use for drains and clean shutdowns. |
| Close code (error) | number | | 1011 | Use when teardown throws or an internal fault aborts the connection. |
| clientTracking | boolean | true | true | Required for the server to populate wss.clients, which the heartbeat loop iterates. |

Edge cases & gotchas #

  • terminate() vs close(). terminate() destroys the underlying TCP socket immediately, with no closing handshake, and frees the descriptor; use it only for zombies and unvalidated peers. close() performs the RFC 6455 closing handshake and lets in-flight frames flush; use it for every clean exit.
  • The pong arrives but the peer’s application is wedged. A pong is generated automatically by the peer’s WebSocket implementation, not by its application code, so a client whose application logic is hung can still look alive. If you need application-level liveness, send a JSON heartbeat the peer’s handler must echo, not a protocol-level ping.
  • Interval drift under load. A single setInterval can slip when the event loop is saturated, stretching real detection time well past the nominal detection window. Track the wall-clock delta between ticks and alert if it exceeds the configured interval.
  • Timers keeping the process alive. Forgetting clearInterval on wss.on('close') (or per-socket timers in older patterns) keeps the event loop referenced, so the process never exits cleanly on shutdown.
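The interval-drift check described in the list above can be sketched as a small tracker that is called once per heartbeat tick. The function name and the 1.5× tolerance are assumptions for illustration; tune the tolerance to your own alerting needs.

```typescript
// Hypothetical tick-lag tracker; the default 1.5x tolerance is an assumption.
function createTickLagMonitor(expectedMs: number, tolerance = 1.5) {
  let last: number | null = null;
  // Call once per heartbeat tick with the current time in milliseconds.
  return (nowMs: number): { deltaMs: number; late: boolean } => {
    // The first call has no previous tick, so it is treated as on time.
    const deltaMs = last === null ? expectedMs : nowMs - last;
    last = nowMs;
    return { deltaMs, late: deltaMs > expectedMs * tolerance };
  };
}

const observeTick = createTickLagMonitor(30_000);
observeTick(0);                      // first tick establishes the baseline
const onTime = observeTick(30_020);  // 30.02s later: within tolerance
const stalled = observeTick(90_000); // 59.98s later: the loop was blocked
```

Inside the heartbeat callback the tracker would typically be fed a monotonic clock such as performance.now(), with the late flag forwarded to whatever metrics or alerting the deployment already uses.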

Verification #

Confirm the lifecycle behaves under real conditions:

# Count established TCP connections on the server port (-H drops the header line so it is not counted).
ss -Htn state established '( sport = :8080 )' | wc -l

# Simulate a half-open client: connect, then drop the route and watch for termination.
sudo iptables -A OUTPUT -p tcp --dport 8080 -j DROP # blackhole the client path
# ...the server should terminate the socket within (MAX_MISSED_PONGS + 1) × interval
sudo iptables -D OUTPUT -p tcp --dport 8080 -j DROP # restore

Chrome DevTools’ Network → WS → Messages panel lists data frames only and does not display protocol-level ping and pong control frames, so confirm the heartbeat with a packet capture or by logging the server’s pong events; a healthy socket shows one ping and one pong per interval. Assert that ss established counts return to baseline after the iptables blackhole, and that no descriptor leak accumulates across reconnect cycles. Feed the missed-pong counter and termination reasons into your metrics pipeline as described in WebSocket Observability & Monitoring.

FAQ #

How fast will a dead connection be detected? #

Detection time is between MAX_MISSED_PONGS × HEARTBEAT_INTERVAL_MS and one interval more than that, depending on when within the interval the peer died. With the defaults above (2 × 30s) a half-open socket is terminated 60 to 90 seconds after it goes silent. Drop the interval to 10s or the threshold to 1 for faster cleanup, at the cost of more false terminations on lossy mobile links.

Does this work behind AWS ALB or nginx? #

Yes, provided the proxy’s idle timeout is longer than your heartbeat interval. ALB’s default idle timeout is 60s and nginx’s proxy_read_timeout is 60s, so a 30s interval keeps the flow active. If you raise the interval above the proxy timeout, the proxy reaps an idle connection before the next ping is sent, so the heartbeat never gets the chance to keep it open.
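A startup check can catch a heartbeat interval that drifts too close to the proxy timeout after a config change. This is a minimal sketch: the function name is hypothetical, and the 0.5 safety factor is an assumption chosen so that one delayed tick still lands inside the timeout, not a requirement of ALB or nginx.

```typescript
// Hypothetical startup check; the 0.5 safety factor is an assumption, not a proxy rule.
function heartbeatFitsProxy(
  heartbeatMs: number,
  proxyIdleTimeoutMs: number,
  safetyFactor = 0.5,
): boolean {
  // Require the ping cadence to use at most a fraction of the proxy's idle budget.
  return heartbeatMs <= proxyIdleTimeoutMs * safetyFactor;
}

const defaultFits = heartbeatFitsProxy(30_000, 60_000); // 30s against a 60s timeout
const tooSlow = heartbeatFitsProxy(45_000, 60_000);     // 45s leaves too little margin
```

Failing fast at boot when the check returns false is usually cheaper than diagnosing connections that the proxy drops at exactly its timeout.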

Should I use protocol ping frames or an application-level heartbeat? #

Protocol-level ping/pong (what socket.ping() sends) is cheaper and is answered automatically by the peer’s WebSocket implementation, so it confirms that the network path and the peer’s WebSocket stack are alive but not that the peer’s application logic is responsive. If a hung application must be detected, send a JSON message the handler explicitly echoes. Many systems run both.
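An application-level heartbeat of the kind described above might look like the following. The message shapes (APP_PING, APP_PONG, nonce) and function names are hypothetical and not part of any protocol; the point of the sketch is only that the reply is produced by application code, so it cannot arrive unless that code is actually running.

```typescript
// Hypothetical message shapes; the field names are illustrative only.
type AppPing = { type: 'APP_PING'; nonce: string };
type AppPong = { type: 'APP_PONG'; nonce: string };

// Server side: build the probe, to be sent with socket.send(JSON.stringify(...)).
function makeAppPing(nonce: string): AppPing {
  return { type: 'APP_PING', nonce };
}

// Client side: the message handler must run to produce this reply.
// Returns the serialized reply, or null if the message is not a heartbeat probe.
function answerAppPing(raw: string): string | null {
  try {
    const msg = JSON.parse(raw);
    if (msg && msg.type === 'APP_PING' && typeof msg.nonce === 'string') {
      const reply: AppPong = { type: 'APP_PONG', nonce: msg.nonce };
      return JSON.stringify(reply);
    }
  } catch {
    // Not JSON: ignore, it is not a heartbeat probe.
  }
  return null;
}

// Server side: accept a reply only if it echoes the outstanding nonce.
function isMatchingPong(raw: string, expectedNonce: string): boolean {
  try {
    const msg = JSON.parse(raw);
    return Boolean(msg) && msg.type === 'APP_PONG' && msg.nonce === expectedNonce;
  } catch {
    return false;
  }
}
```

A matching reply can then reset the same missed counter that protocol pongs reset, or a separate one if the two kinds of liveness should be reported independently.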

What changes for Socket.IO vs raw ws? #

Socket.IO ships its own heartbeat on top of the WebSocket frame (pingInterval/pingTimeout in the engine options), so you do not write the loop yourself — but the same tuning rules apply relative to proxy timeouts. With raw ws you own the interval, the counter, and the terminate call, which is what this guide implements.
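For comparison, the Socket.IO side reduces to two options. The sketch below keeps the options in a plain object so it runs without the socket.io package; the values are illustrative, chosen to stay under a 60s proxy idle timeout, and the commented usage assumes an existing httpServer.

```typescript
// Heartbeat options passed to the Socket.IO server constructor.
// Values are illustrative, picked to fit under a 60s proxy idle timeout.
const socketIoHeartbeat = {
  pingInterval: 25_000, // ms between server-initiated pings
  pingTimeout: 20_000,  // ms to wait for the pong before closing the connection
};

// Rough worst case for noticing a silent peer: one full interval until the
// next ping, plus the timeout spent waiting for its pong.
const socketIoDetectionMs =
  socketIoHeartbeat.pingInterval + socketIoHeartbeat.pingTimeout;

// Usage sketch (assumes the socket.io package and an existing httpServer):
//   import { Server } from 'socket.io';
//   const io = new Server(httpServer, socketIoHeartbeat);
```

Compared with the raw ws loop, the trade-off is a single timeout per ping instead of a tolerated count of missed pongs.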

Why increment the missed-pong counter before sending the ping? #

It makes the loop self-correcting: the tick optimistically assumes the pong will not arrive, and the pong handler clears the counter when it does. A socket that never answers simply never gets reset, so it crosses the threshold without any extra bookkeeping or per-socket timers.
