Backend WebSocket Connection Management #

A WebSocket connection is a long-lived, stateful TCP socket, and that single fact reshapes every backend assumption you carry over from stateless HTTP. Connections live for minutes or hours, pin themselves to one process, hold a file descriptor and kernel buffers the whole time, and fail silently when a mobile network drops out mid-session. This area is the operational reference for engineers running WebSocket servers in production: how a socket is born, kept alive, authenticated, routed across nodes, observed, and torn down without leaking resources or dropping messages. If you are debugging zombie sockets, sizing a fleet for a million concurrent peers, or chasing why connections die behind a load balancer, start here.

[Figure: Backend WebSocket connection management overview. Clients connect (wss upgrade) through a load balancer with sticky sessions and auth to one of several WebSocket nodes, each holding a registry and a ping loop; nodes share state through a Redis pub/sub broker (fan-out) and emit metrics and traces to an observability stack.]

Infrastructure baseline #

Before any application code runs, the kernel and the proxy in front of it must be configured to keep long-lived sockets alive. The two most common production incidents — connections dropping at exactly 60 seconds, and EMFILE: too many open files under load — are both infrastructure misconfigurations, not bugs in your handler.

The reverse proxy must forward the Upgrade/Connection headers and set read/send timeouts that comfortably exceed your heartbeat interval. If proxy_read_timeout is shorter than the gap between frames, the proxy silently closes idle-but-healthy connections.

# nginx.conf: WebSocket proxy headers and timeout alignment
# (map belongs in the http context; location goes inside a server block)
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

location /ws {
    proxy_pass http://ws_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 90s; # must exceed HEARTBEAT_INTERVAL_MS
    proxy_send_timeout 90s;
}

Each open socket consumes a file descriptor. The default soft limit of 1024 caps a single process at roughly a thousand connections — far below what one Node.js process can actually serve. Raise the descriptor limit and tune TCP keepalive so the kernel itself reaps half-open connections that the application layer never hears about.

# Container / host runtime tuning (apply before the process starts)
ulimit -n 1048576 # per-process FD ceiling
sysctl -w net.ipv4.tcp_keepalive_time=300 # idle seconds before keepalive probe
sysctl -w net.ipv4.tcp_keepalive_intvl=60 # interval between probes
sysctl -w net.core.somaxconn=4096 # accept-queue depth for connection spikes
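
To put the keepalive values above in perspective, the worst-case time for the kernel to declare a silent peer dead is the idle period plus one probe interval per unanswered probe. The sketch below works that arithmetic. It assumes `net.ipv4.tcp_keepalive_probes` is left at the Linux default of 9, since the tuning above does not set it, and the function and field names are illustrative. These sysctls only affect sockets that have `SO_KEEPALIVE` enabled.

```typescript
// Worst-case seconds before the kernel gives up on a silent peer.
// Assumption: tcp_keepalive_probes stays at the Linux default of 9.
interface KeepaliveSettings {
  timeSec: number;  // net.ipv4.tcp_keepalive_time
  intvlSec: number; // net.ipv4.tcp_keepalive_intvl
  probes: number;   // net.ipv4.tcp_keepalive_probes
}

function keepaliveDetectionSec(s: KeepaliveSettings): number {
  // Idle period first, then `probes` unanswered probes spaced `intvlSec` apart.
  return s.timeSec + s.intvlSec * s.probes;
}

const tuned = keepaliveDetectionSec({ timeSec: 300, intvlSec: 60, probes: 9 });
console.log(`kernel reaps a dead peer after about ${tuned}s`); // 300 + 60 * 9 = 840
```

At 840 seconds (14 minutes), kernel keepalive is a safety net that sits well behind the application heartbeat rather than a replacement for it.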

Terminate TLS at the edge (wss://) so the application process handles plaintext frames — terminating TLS in-process burns CPU you would rather spend on message dispatch. Detailed TLS hardening and cipher selection live with Security & TLS Configuration.

Core mechanism: the connection registry #

Every multi-connection WebSocket server is built around one data structure: a registry mapping a stable client identity to its live socket plus liveness metadata. Sends, broadcasts, and teardown all index into it. Without a heartbeat sweep over that registry you accumulate zombie sockets — TCP connections the OS believes are open but whose peer vanished (laptop lid closed, phone lost signal). They never fire a close event, so they leak descriptors until the process exhausts its limit.

The registry below pairs an application-level ping/pong heartbeat with an idle sweep. The heartbeat doubles as your dead-peer detector and your round-trip latency probe; the deeper mechanics are covered in Connection Lifecycle & Heartbeats.

// Production connection registry with heartbeat sweep and error boundaries
import { WebSocketServer, WebSocket } from 'ws';
import { randomUUID } from 'node:crypto';

interface Conn { ws: WebSocket; lastSeen: number; alive: boolean }
const registry = new Map<string, Conn>();

const HEARTBEAT_INTERVAL_MS = 30_000; // how often we ping live sockets
const IDLE_TIMEOUT_MS = 75_000;       // backstop: no pong for ~2.5 intervals

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws: WebSocket, req) => {
  // x-client-id is client-supplied and unauthenticated: treat it as a hint only
  const header = req.headers['x-client-id'];
  const clientId = typeof header === 'string' && header ? header : randomUUID();

  // Same id reconnected: drop the stale socket instead of leaking it
  registry.get(clientId)?.ws.terminate();
  registry.set(clientId, { ws, lastSeen: Date.now(), alive: true });

  // pong is the peer's reply to our ping: proof the socket is two-way alive
  ws.on('pong', () => {
    const entry = registry.get(clientId);
    if (entry?.ws === ws) { entry.alive = true; entry.lastSeen = Date.now(); }
  });

  ws.on('error', (err: Error) => {
    console.error(`[WS_ERROR] ${clientId}: ${err.message}`);
    cleanup(clientId, ws); // errors don't always precede a close
  });

  ws.on('close', () => cleanup(clientId, ws));
});

function cleanup(id: string, ws?: WebSocket): void {
  const entry = registry.get(id);
  // Ignore late events from a socket that has already been replaced
  if (!entry || (ws && entry.ws !== ws)) return;
  registry.delete(id);  // delete first so re-entrant close/error events are no-ops
  // Listeners stay attached: an 'error' event with no listener would throw
  entry.ws.terminate(); // hard-close; do not wait for the closing handshake
}

// Single sweep drives both liveness probing and zombie reaping
const sweep = setInterval(() => {
  const now = Date.now();
  for (const [id, entry] of registry) {
    if (!entry.alive || now - entry.lastSeen > IDLE_TIMEOUT_MS) {
      console.warn(`[ZOMBIE] terminating ${id}`);
      cleanup(id); // missed the previous ping → presumed dead
      continue;
    }
    entry.alive = false; // cleared here, re-set only by a pong
    entry.ws.ping();
  }
}, HEARTBEAT_INTERVAL_MS);

process.on('SIGTERM', () => {
  clearInterval(sweep);
  for (const id of registry.keys()) cleanup(id);
  process.exit(0);
});

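The alive flag flips false on every sweep and is set true only by an incoming pong, so a socket that misses one full interval is terminated on the next pass: at most 2 × HEARTBEAT_INTERVAL_MS (60 seconds) after the peer vanished. Because lastSeen is refreshed only by pongs, the IDLE_TIMEOUT_MS comparison acts purely as a backstop behind the alive check. Driving everything from one shared interval keeps detection bounded and predictable, and avoids the per-socket timers that a timeout-only design has to create and cancel.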
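
The latency-probe role of the heartbeat can be sketched separately from the registry. The class below is a minimal illustration, not part of the `ws` API: `HeartbeatRtt` and its method names are hypothetical, and it assumes the caller invokes `onPingSent` right before `ws.ping()` and `onPong` from the socket's `pong` handler. The clock is injectable so the logic can be exercised without real timers.

```typescript
// Tracks heartbeat round-trip time per connection (illustrative helper).
class HeartbeatRtt {
  private sentAt = new Map<string, number>();
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Call immediately before sending a ping to this connection.
  onPingSent(id: string): void {
    this.sentAt.set(id, this.now());
  }

  // Returns the RTT in ms, or undefined for a pong with no outstanding ping
  // (peers are allowed to send unsolicited pongs).
  onPong(id: string): number | undefined {
    const sent = this.sentAt.get(id);
    if (sent === undefined) return undefined;
    this.sentAt.delete(id);
    return this.now() - sent;
  }

  // Call from cleanup so entries do not outlive their socket.
  forget(id: string): void {
    this.sentAt.delete(id);
  }
}
```

Feeding the returned value into a histogram gives a per-node view of round-trip latency without any extra traffic on the wire.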

Scaling & architecture #

A single Node.js process comfortably holds tens of thousands of connections, but one process is a single point of failure and a hard ceiling. Horizontal scaling breaks the convenient assumption that the sender and the recipient share memory: client A is pinned to node 1, client B to node 3, and a direct entry.ws.send() only reaches sockets in the local registry. Cross-node delivery requires a message broker. Each node subscribes to the broker and publishes outbound payloads to it; the broker fans them out so the node that actually owns the target socket performs the final send. Broker selection, fan-out topology, and delivery guarantees are the subject of Scaling Real-Time Infrastructure.

[Figure: Pub/sub fan-out across WebSocket nodes. A publish enters a Redis pub/sub broker (channel: ws fan-out) that fans the message out to three subscribed WebSocket nodes; only the node whose local registry owns the target socket (node 1, registry A) delivers it, while nodes 2 and N find no match.]

Two routing strategies trade CPU against bandwidth. Broadcast publishes every message to every node, which then filters by local ownership: simple, but every node pays to inspect every message. Consistent hashing (or a directory lookup) routes a message only to the node that owns the target, which scales far better for directed messages but needs an ownership map. Pin each client to one node with Load Balancer Sticky Sessions so reconnects and fallback transports return to the node that holds their session state, and partition channels with Server-Side Routing Patterns for multi-tenant isolation and per-channel rate limiting.
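
As a rough illustration of the consistent-hashing option, the sketch below maps a client id to an owning node with a hash ring. It is one possible shape rather than a prescribed design: `HashRing`, the FNV-1a style hash over UTF-16 code units, the node names, and the 64 virtual points per node are all assumptions, and it only routes correctly if clients are placed on nodes by the same function (or the result is stored in a directory).

```typescript
// FNV-1a style 32-bit hash: small and dependency-free, adequate for a sketch.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class HashRing {
  private ring: { point: number; node: string }[] = [];
  private replicas: number;

  constructor(nodes: string[], replicas = 64) {
    this.replicas = replicas;
    for (const node of nodes) this.add(node);
  }

  // Each node contributes `replicas` virtual points to smooth the distribution.
  add(node: string): void {
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  remove(node: string): void {
    this.ring = this.ring.filter((e) => e.node !== node);
  }

  // Owner = first ring point at or after the key's hash, wrapping around.
  ownerOf(clientId: string): string | undefined {
    if (this.ring.length === 0) return undefined;
    const h = fnv1a(clientId);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].node;
  }
}
```

A publisher could then target a per-node channel derived from `ownerOf(targetId)` instead of the shared fan-out channel. Removing a node only remaps the clients that node owned, which is the property that makes this preferable to a plain modulo over the node count.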

// Cross-node delivery: subscribe to the broker, deliver to the owning socket
// (registry and WebSocket come from the connection registry above)
import { Redis } from 'ioredis';

// A connection in subscriber mode cannot publish; use a second client for that
const sub = new Redis(process.env.REDIS_URL!);
const FANOUT_CHANNEL = 'ws:fanout';

await sub.subscribe(FANOUT_CHANNEL);
sub.on('message', (_channel, raw) => {
  let msg: { targetId: string; payload: string };
  try {
    msg = JSON.parse(raw);
  } catch {
    console.error('[FANOUT] dropping malformed message');
    return; // a bad payload must not crash the subscriber
  }
  const entry = registry.get(msg.targetId); // is this socket local to me?
  if (entry?.ws.readyState === WebSocket.OPEN) {
    entry.ws.send(msg.payload); // only the owning node sends
  }
  // no local match → another node owns it; safely ignore
});
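
The publishing half of the same flow might look like the sketch below: try the local registry first, and hand the message to the broker only when this node does not own the socket. The `LocalSocket` and `Publisher` interfaces are stand-ins defined here so the example is self-contained; in practice they would be the `ws` socket and a second ioredis client reserved for publishing, and `sendToClient` is a hypothetical name. The map here holds sockets directly, a simplification of the registry entries above.

```typescript
// Minimal shapes so the sketch stands alone (assumptions, not library types).
interface LocalSocket { readyState: number; send(data: string): void }
interface Publisher { publish(channel: string, message: string): Promise<unknown> }

const SOCKET_OPEN = 1; // numeric value of WebSocket.OPEN
const FANOUT_CHANNEL = 'ws:fanout';

function sendToClient(
  local: Map<string, LocalSocket>,
  pub: Publisher,
  targetId: string,
  payload: string,
): 'local' | 'published' {
  const sock = local.get(targetId);
  if (sock && sock.readyState === SOCKET_OPEN) {
    sock.send(payload); // fast path: this node owns the socket
    return 'local';
  }
  // Not ours: publish the same envelope the subscriber expects to parse.
  pub
    .publish(FANOUT_CHANNEL, JSON.stringify({ targetId, payload }))
    .catch((err: Error) => console.error(`[FANOUT] publish failed: ${err.message}`));
  return 'published';
}
```

Skipping the broker for locally owned sockets saves a round trip on the common case. The publishing node will still receive its own published message through its subscription and ignore it for lack of a local match.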

During rolling deploys, drain rather than kill: signal clients to reconnect, close with code 1001, and wait for the registry to empty before exiting so the load balancer can shift them to healthy nodes.

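// Graceful drain on deploy: give clients a reconnect hint, then let them go
// deadlineMs is a hypothetical default; keep it below your orchestrator's kill timeout
async function drain(deadlineMs = 10_000): Promise<void> {
  for (const { ws } of registry.values()) {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify({ type: 'SERVER_SHUTDOWN', reconnectInMs: 2_000 }));
      ws.close(1001, 'server restarting'); // 1001 = going away
    }
  }
  const startedAt = Date.now();
  await new Promise<void>((resolve) => {
    const t = setInterval(() => {
      // Past the deadline, force out clients that never completed the close
      if (Date.now() - startedAt >= deadlineMs) {
        for (const id of [...registry.keys()]) cleanup(id);
      }
      if (registry.size === 0) { clearInterval(t); resolve(); }
    }, 500);
  });
}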

Clients should rejoin with exponential backoff plus jitter so a fleet restart does not trigger a thundering-herd reconnect; that client-side logic lives in Auto-Reconnection Strategies.
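
For orientation, the backoff-plus-jitter rule can be sketched in a few lines. This is the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name, the 1-second base, and the 30-second cap are illustrative assumptions, and the random source is injectable so the behaviour can be checked deterministically.

```typescript
// Full-jitter exponential backoff:
// delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
function reconnectDelayMs(
  attempt: number, // 0 for the first retry
  baseMs = 1_000,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Example: print the delay chosen for the first five attempts.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait ${reconnectDelayMs(attempt)}ms`);
}
```

Because every client draws its own delay, a fleet that disconnects at the same instant spreads its reconnects across the whole window instead of arriving in synchronized waves.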

Observability checklist #

You cannot operate what you cannot see, and WebSocket failures are quiet — a stalled fan-out or a slow client backing up its send buffer produces no error log, only degraded latency. Instrument these named signals from day one:

  • ws_connections_active
  • ws_connections_opened_total / ws_connections_closed_total
  • ws_close_total{code}: counter labelled by close code (1001, 1006, 1011); a spike in 1006 means connections are ending abnormally, without a close frame
  • ws_heartbeat_rtt_seconds
  • ws_messages_sent_total / ws_messages_failed_total
  • ws_send_buffer_bytes (ws.bufferedAmount)
  • ws_fanout_lag_seconds
  • Trace context: propagate traceparent

Wire these to a collector and dashboard them per node — the standardized exporters and span conventions are detailed in WebSocket Observability & Monitoring.
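
To make one of these signals concrete, here is a minimal in-memory sketch of the `ws_close_total{code}` counter that renders Prometheus text exposition lines. It is a teaching aid under stated assumptions: `CloseCodeCounter` is a hypothetical name, and a production service would normally use an established metrics client library rather than hand-rolled rendering.

```typescript
// In-memory counter for close codes, rendered in Prometheus text format.
class CloseCodeCounter {
  private counts = new Map<number, number>();

  // Call from the socket's 'close' handler with the close code it reports.
  record(code: number): void {
    this.counts.set(code, (this.counts.get(code) ?? 0) + 1);
  }

  // One line per code, sorted numerically so the output is stable.
  render(): string {
    const lines = ['# TYPE ws_close_total counter'];
    const codes = [...this.counts.keys()].sort((a, b) => a - b);
    for (const code of codes) {
      lines.push(`ws_close_total{code="${code}"} ${this.counts.get(code)}`);
    }
    return lines.join('\n') + '\n';
  }
}
```

Hooked up as `ws.on('close', (code) => closes.record(code))`, the rendered output makes a rise in abnormal closures visible per node as soon as it is scraped.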

Failure modes #

| Failure | Symptom | Root cause | Mitigation |
| --- | --- | --- | --- |
| Zombie sockets | Active-connection gauge climbs, never falls; descriptors leak | Peer vanished without a TCP FIN; no close event fires | Application ping/pong sweep terminates sockets that miss a heartbeat |
| 60-second drops | Connections die on a fixed interval when idle | Proxy proxy_read_timeout shorter than heartbeat gap | Set proxy timeout above HEARTBEAT_INTERVAL_MS; keep frames flowing |
| EMFILE under load | New connections refused once a node fills | File-descriptor soft limit (1024) far below real capacity | Raise ulimit -n; alert on ws_connections_active nearing ceiling |
| Cross-node black hole | Messages reach some clients, silently not others | Node sends only to its local registry; no broker fan-out | Publish to a pub/sub broker; deliver from the owning node |
| Slow-consumer OOM | Heap grows, then the process crashes under broadcast | bufferedAmount accumulates faster than a slow client drains | Cap ws_send_buffer_bytes; disconnect over threshold |
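
The slow-consumer mitigation can be sketched as a guard in front of every send. The example below is a minimal illustration: `BufferedSocket` declares only the members it needs (`bufferedAmount`, `send`, and `terminate`, which the `ws` socket provides), while `sendWithBackpressure` and the 1 MiB threshold are hypothetical choices to tune against real message sizes.

```typescript
// Only the socket members this guard relies on.
interface BufferedSocket {
  bufferedAmount: number; // bytes queued but not yet flushed to the network
  send(data: string): void;
  terminate(): void;
}

const MAX_BUFFERED_BYTES = 1_048_576; // hypothetical 1 MiB ceiling per socket

function sendWithBackpressure(
  sock: BufferedSocket,
  data: string,
  maxBuffered = MAX_BUFFERED_BYTES,
): 'sent' | 'dropped' {
  if (sock.bufferedAmount > maxBuffered) {
    // The client is not draining; cut it loose before its queue eats the heap.
    sock.terminate();
    return 'dropped';
  }
  sock.send(data);
  return 'sent';
}
```

A hard terminate is used here instead of a graceful close because a close frame would queue behind the same backlog the client is failing to drain. The client can then reconnect and resynchronize.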

FAQ #

Why do my WebSocket connections drop after 60 seconds? #

Almost always the reverse proxy, not your code. nginx defaults proxy_read_timeout to 60 seconds and closes any connection idle longer than that — including healthy ones between messages. Set the proxy timeout above your heartbeat interval and send application-level pings so the socket is never idle past the limit.

How many WebSocket connections can one Node.js process handle? #

Tens of thousands per process once the file-descriptor limit is raised — the default soft limit of 1024 is the first ceiling you hit, not memory or CPU. Each idle connection costs a descriptor plus a few kilobytes of kernel and heap buffers, so the practical bound is roughly memory divided by per-connection state. Raise ulimit -n, watch ws_connections_active, and scale out to more nodes well before a single process saturates.
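
The "memory divided by per-connection state" bound can be worked as a small calculation. The sketch below takes the lower of the descriptor ceiling and the memory ceiling. All inputs in the example (64 reserved descriptors, a 2 GiB budget, 20 KiB per connection) are hypothetical figures for illustration, so measure your own per-connection footprint before relying on the result.

```typescript
interface CapacityInputs {
  fdLimit: number;            // ulimit -n for the process
  reservedFds: number;        // descriptors kept for logs, broker, listeners
  memoryBudgetBytes: number;  // memory you are willing to spend on connections
  bytesPerConnection: number; // measured kernel + heap cost per socket
}

// Practical ceiling = whichever resource runs out first.
function estimateMaxConnections(c: CapacityInputs): number {
  const byDescriptors = Math.max(0, c.fdLimit - c.reservedFds);
  const byMemory = Math.floor(c.memoryBudgetBytes / c.bytesPerConnection);
  return Math.min(byDescriptors, byMemory);
}

const GiB = 1024 ** 3;
const KiB = 1024;

// With the default soft limit, descriptors are the bottleneck.
console.log(estimateMaxConnections({
  fdLimit: 1024, reservedFds: 64, memoryBudgetBytes: 2 * GiB, bytesPerConnection: 20 * KiB,
})); // 960

// Once the limit is raised, memory becomes the bound.
console.log(estimateMaxConnections({
  fdLimit: 1_048_576, reservedFds: 64, memoryBudgetBytes: 2 * GiB, bytesPerConnection: 20 * KiB,
})); // 104857
```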

Do I need sticky sessions for WebSockets? #

For the raw protocol, only the HTTP upgrade must land on a node that accepts it; after 101 Switching Protocols the TCP connection stays pinned to that process for its lifetime regardless. Sticky sessions matter when a reconnect or a fallback transport must return to the same node to recover session state, and they make rolling deploys predictable. See Load Balancer Sticky Sessions for ALB and HAProxy specifics.

How do I broadcast a message to clients connected to different servers? #

A local send() only reaches sockets in that process’s registry. Put a pub/sub broker (Redis, NATS, or Kafka) between nodes: every node subscribes, publishes outbound messages to the broker, and the node that owns the target socket performs the final delivery. The fan-out and delivery-guarantee patterns are covered in Scaling Real-Time Infrastructure.

What changes for Socket.IO versus raw ws? #

Socket.IO layers reconnection, rooms, acknowledgements, and a multi-node adapter on top of the protocol, so several mechanisms here come built in — but it also adds a custom framing and handshake that demand its own sticky-session and proxy configuration. The connection-registry, heartbeat, and observability principles are identical; the wire format and the adapter API differ.
