Handling WebSocket disconnects gracefully #

You landed here because a socket “closed” but nothing cleaned up: the client UI still shows a live cursor, the server still counts the user as online, and RSS keeps creeping upward while ss -tnp fills with CLOSE_WAIT entries. The disconnect itself is not the bug — the bug is that your teardown path never ran, or ran twice, or fired against a socket reference that had already been nulled. This page covers how to make disconnect handling deterministic on both ends so that every drop, whether a clean 1000 or an abnormal 1006, leaves the system in a known-good state.

Typical symptoms that bring engineers here:

  • Client renders stale data despite an active network interface.
  • Server RSS grows linearly alongside CLOSE_WAIT socket accumulation.
  • Duplicate event listeners fire after a reconnect because old references were never detached.
  • Heartbeat pings get TCP-level ACKs while application state stays frozen.

Root cause #

There are two failure surfaces, and most outages touch both.

On the transport layer, a TCP connection can go half-open: the peer’s FIN or RST never reaches your host (NAT rebinding, a load balancer reaping an idle flow, a laptop lid closing). TCP keepalive would eventually notice, but it is off unless the socket enables it, and on Linux the first probe defaults to roughly two hours of idle time. For that window the socket still looks ESTABLISHED: bytes will never arrive, but your onclose handler is not invoked because, from the runtime’s perspective, nothing happened. (CLOSE_WAIT is the opposite case: the FIN did arrive, and your process has not closed its end yet.) When the runtime does notice the drop, it reports 1006 Abnormal Closure, a code that exists precisely for this case: the connection ended without a close frame being exchanged.

On the application layer, the close event is treated as the single source of truth even though it is not guaranteed to fire. Code hangs heartbeat intervals, state-manager subscriptions, and in-flight fetches off the socket’s lifetime, then assumes onclose will tear all of that down. When onclose is missed, those timers and subscriptions leak. When onclose and onerror both fire for the same drop, teardown runs twice, double-decrementing presence counts or scheduling two reconnects.
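The double-teardown failure can be ruled out with a small run-once wrapper. The following is a minimal sketch, not a library API: `runOnce`, `teardownOnce`, and the `presenceCount` counter are hypothetical names used only to illustrate the idea.

```typescript
// Wrap a teardown function so it executes at most once, no matter how many
// events (close, error, heartbeat timeout) end up calling it.
function runOnce(fn: () => void): () => void {
  let done = false;
  return () => {
    if (done) return; // second and later calls are no-ops
    done = true;
    fn();
  };
}

// Hypothetical presence counter, starting with one user online.
let presenceCount = 1;
const teardownOnce = runOnce(() => {
  presenceCount -= 1;
});

teardownOnce(); // e.g. triggered by 'error'
teardownOnce(); // e.g. triggered by 'close' for the same drop
console.log(presenceCount); // 0, not -1
```

Whichever event arrives first does the work; the other becomes a harmless no-op.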

The fix is to stop trusting network events as your only signal. You enforce an application-level liveness timeout via WebSocket ping/pong heartbeats, make teardown idempotent, and ensure the reconnect path always starts from a fully cleaned slate. This is the contract that the broader auto-reconnection strategies depend on — a reconnect loop that fires against a dirty state will stack duplicate listeners on every cycle.

[Figure: WebSocket disconnect state machine. A live socket missing two heartbeats is terminated, runs teardown once, then enters a clean reconnect. States: OPEN (isAlive = true) → SUSPECT (missed 1 pong) → TERMINATE (missed 2 pongs) → TEARDOWN (runs once) → RECONNECT (clean slate).]
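The state machine can also be written down as a pure transition function, which makes the allowed paths easy to test. This is a sketch under assumptions: the state names come from the diagram, but the event names (`pong`, `missedPong`, `closed`, `cleaned`) and the SUSPECT → OPEN recovery on a pong are illustrative choices, not something the diagram spells out.

```typescript
type LivenessState = 'OPEN' | 'SUSPECT' | 'TERMINATE' | 'TEARDOWN' | 'RECONNECT';
type LivenessEvent = 'pong' | 'missedPong' | 'closed' | 'cleaned';

// Return the next state for a given event; unknown combinations keep the current state.
function nextState(state: LivenessState, event: LivenessEvent): LivenessState {
  switch (state) {
    case 'OPEN':
      if (event === 'missedPong') return 'SUSPECT';
      if (event === 'closed') return 'TEARDOWN'; // clean close skips the suspect phase
      return 'OPEN';
    case 'SUSPECT':
      if (event === 'pong') return 'OPEN'; // peer answered in time
      if (event === 'missedPong') return 'TERMINATE';
      if (event === 'closed') return 'TEARDOWN';
      return 'SUSPECT';
    case 'TERMINATE':
      return event === 'closed' ? 'TEARDOWN' : 'TERMINATE';
    case 'TEARDOWN':
      return event === 'cleaned' ? 'RECONNECT' : 'TEARDOWN';
    case 'RECONNECT':
      return 'RECONNECT';
  }
}

// Two missed pongs, then the forced close, then cleanup.
const events: LivenessEvent[] = ['missedPong', 'missedPong', 'closed', 'cleaned'];
const finalState = events.reduce(nextState, 'OPEN' as LivenessState);
console.log(finalState); // RECONNECT
```

Because TEARDOWN is a single state with a single exit, teardown cannot be entered twice for the same socket.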

Resolution #

The server enforces liveness and tears down exactly once; the client mirrors that contract and captures any state it needs before nulling its socket reference. Both halves below are annotated on the non-obvious lines.

// server.ts (Node.js, ws library)
import { WebSocketServer, WebSocket } from 'ws';

// App-specific hooks, assumed to be defined elsewhere in your codebase.
declare function cleanupUserSessions(userId: string): void;
declare function broadcastStateDelta(delta: { type: 'DISCONNECT'; userId: string }): void;

const HEARTBEAT_INTERVAL_MS = 30_000; // must stay well under proxy idle timeout
const MAX_MISSED_PONGS = 2; // SUSPECT after one miss, TERMINATE after two

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket: WebSocket & { userId?: string }) => {
  let missedPongs = 0; // pings sent since the last pong; 0 means the peer is alive
  let tornDown = false; // guard so cleanup runs at most once per socket

  const teardown = () => {
    if (tornDown) return; // close + error can both fire for one drop
    tornDown = true;
    clearInterval(pingInterval); // stop the heartbeat timer first
    if (socket.userId) {
      cleanupUserSessions(socket.userId); // release presence/state
      broadcastStateDelta({ type: 'DISCONNECT', userId: socket.userId });
    }
  };

  const pingInterval = setInterval(() => {
    if (missedPongs >= MAX_MISSED_PONGS) return socket.terminate(); // two missed pongs -> kill the half-open socket
    missedPongs += 1; // counts as missed until the pong handler resets it
    socket.ping(); // RFC 6455 ping; the peer's WebSocket stack answers with a pong automatically
  }, HEARTBEAT_INTERVAL_MS);

  socket.on('pong', () => { missedPongs = 0; }); // proof the peer is still reachable
  socket.on('close', teardown);
  socket.on('error', (err) => {
    console.error(`WS error for ${socket.userId ?? 'anon'}: ${err.message}`);
    socket.terminate(); // force-close; the resulting 'close' event calls teardown
  });
});

// client.ts (browser; uses the built-in WebSocket, not the ws package)
export class GracefulWSManager {
  private ws: WebSocket | null = null;

  constructor(private readonly url: string) {}

  connect(): void {
    this.cleanup(); // never open a second socket over a live one
    this.ws = new WebSocket(this.url);
    this.ws.onopen = () => this.flushSyncQueue();
    this.ws.onmessage = (e) => this.processMessage(e.data);
    this.ws.onclose = (e) => this.handleDisconnect(e.code);
    this.ws.onerror = () => this.handleDisconnect(1006); // browsers fire error before a 1006 close
  }

  private handleDisconnect(code: number): void {
    console.warn(`WS closed with code ${code}`);
    this.cleanup(); // detach handlers + null the ref BEFORE scheduling
    this.scheduleReconnect(); // backoff/jitter lives in the reconnect strategy
  }

  private cleanup(): void {
    if (!this.ws) return; // idempotent: safe to call from close and error paths
    this.ws.onopen = this.ws.onmessage = null;
    this.ws.onclose = this.ws.onerror = null; // detach so a late close can't re-enter handleDisconnect
    const { readyState } = this.ws;
    if (readyState === WebSocket.OPEN || readyState === WebSocket.CONNECTING) {
      // Clean close frame when the socket is live; for a socket still connecting,
      // close() aborts the attempt so it cannot open later with no handlers attached.
      this.ws.close(1000, 'Graceful teardown');
    }
    this.ws = null; // any later access throws loudly instead of acting on a zombie
  }

  private scheduleReconnect(): void { /* see exponential backoff with jitter */ }

  private processMessage(data: string): void {
    try { JSON.parse(data); /* dispatch to state handlers */ }
    catch (err) { console.error('Sync parse error:', err); } // never let a bad frame kill the loop
  }

  private flushSyncQueue(): void { /* drain buffered outbound messages */ }
}

The load-bearing detail in handleDisconnect is ordering: detach handlers and null this.ws first, then schedule the retry. If you instead schedule before cleanup, a late close event can re-enter handleDisconnect and queue a second reconnect. Delegate the actual retry timing to exponential backoff with jitter rather than a raw setTimeout, so a server restart doesn’t trigger a synchronized thundering herd.
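To make the backoff idea concrete, here is a minimal sketch of one common variant, "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name `backoffDelayMs` and the default values (500 ms base, 30 s cap) are assumptions for illustration; the linked reconnect strategy may choose differently.

```typescript
// Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number, // 0 for the first retry, 1 for the second, ...
  baseMs: number = 500, // hypothetical starting ceiling
  capMs: number = 30_000, // hypothetical upper bound on the ceiling
  random: () => number = Math.random, // injectable so tests can be deterministic
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// With a fixed "random" value of 0.5 the growth and the cap are easy to see.
const half = () => 0.5;
console.log(backoffDelayMs(0, 500, 30_000, half)); // 250
console.log(backoffDelayMs(3, 500, 30_000, half)); // 2000
console.log(backoffDelayMs(10, 500, 30_000, half)); // 15000 (ceiling capped at 30000)
```

Inside scheduleReconnect, the returned value would be the delay passed to the timer, with the attempt counter reset to zero once a connection opens successfully.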

One more alignment that prevents most “phantom” disconnects: set your reverse proxy idle timeout to at least 2× the heartbeat interval, so the proxy never reaps a connection between pings.

location /ws/ {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 120s;  # >= 2x HEARTBEAT_INTERVAL_MS so a live socket is never reaped mid-cycle
    proxy_send_timeout 120s;
    proxy_buffering off;      # stream frames immediately instead of holding them in a buffer
}
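The 2× rule is simple arithmetic, so it can be checked mechanically, for example at deploy time. The sketch below is illustrative only: `IdleHop` and `findTightTimeouts` are hypothetical names, and the timeout values have to be supplied by hand from your own configuration.

```typescript
// One network hop (proxy, load balancer) that reaps idle connections.
interface IdleHop {
  name: string;
  idleTimeoutMs: number;
}

// Return the names of hops whose idle timeout is below 2x the heartbeat interval.
function findTightTimeouts(heartbeatMs: number, hops: IdleHop[]): string[] {
  return hops
    .filter((hop) => hop.idleTimeoutMs < 2 * heartbeatMs)
    .map((hop) => hop.name);
}

const tight = findTightTimeouts(30_000, [
  { name: 'nginx', idleTimeoutMs: 120_000 }, // 4x the heartbeat: fine
  { name: 'edge-lb', idleTimeoutMs: 45_000 }, // 1.5x the heartbeat: too tight
]);
console.log(tight); // [ 'edge-lb' ]
```

An empty result means every listed hop leaves room for at least two heartbeats before it would reap the connection.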

Operational checklist #

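  • Server teardown is guarded so close and error cannot both run cleanup for the same socket.
  • Client cleanup() detaches every handler and nulls the socket reference before a reconnect is scheduled.
  • processMessage wraps parsing in try/catch so a malformed frame cannot kill the message loop.
  • ss -tnp | grep :8080 shows no growing CLOSE_WAIT count (ss prints the state as CLOSE-WAIT).
  • A chaos test (tc qdisc … loss 100%) leaves socket counts back at baseline after the heartbeat has terminated the dead connections.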
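The CLOSE_WAIT check in the list above is easier to track over time if the ss output is reduced to per-state counts. The following is a rough sketch: `countSocketStates` is a hypothetical helper, it assumes the state is the first column of each row (as in the default `ss -tn` layout), and the sample text is hand-written for illustration rather than captured from a real host.

```typescript
// Count sockets per TCP state from the text output of `ss -tn`.
function countSocketStates(ssOutput: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of ssOutput.split('\n')) {
    const state = line.trim().split(/\s+/)[0];
    if (!state || state === 'State') continue; // skip blank lines and the header row
    const key = state.toUpperCase().replace(/_/g, '-'); // accept CLOSE_WAIT and CLOSE-WAIT
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Hand-written sample in the shape of `ss -tn` output.
const sample = [
  'State      Recv-Q Send-Q Local Address:Port  Peer Address:Port',
  'ESTAB      0      0      10.0.0.5:8080       10.0.0.9:51234',
  'CLOSE-WAIT 1      0      10.0.0.5:8080       10.0.0.7:40112',
  'CLOSE-WAIT 1      0      10.0.0.5:8080       10.0.0.8:40990',
].join('\n');

const stateCounts = countSocketStates(sample);
console.log(stateCounts['CLOSE-WAIT']); // 2
```

Sampling this count before and after a chaos run gives a number to compare against baseline instead of eyeballing terminal output.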

FAQ #

Why does my onclose handler never fire on a dropped Wi-Fi connection? #

Because no close frame was exchanged. When the network vanishes, the peer’s FIN/RST never reaches your process, so the runtime has no event to dispatch. Only an application-level heartbeat timeout detects this — that is why the server terminates the socket after two missed pongs instead of waiting on onclose.
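On the client the same idea needs a different mechanism, because browsers do not expose ping/pong frames to page scripts; liveness has to be inferred from application messages instead. Below is a minimal sketch of such a watchdog. `LivenessWatchdog` is a hypothetical class, and timestamps are passed in explicitly so the logic stays independent of any timer.

```typescript
// Tracks when the last message arrived and reports whether the peer has gone quiet.
class LivenessWatchdog {
  private lastSeenMs: number;

  constructor(private readonly timeoutMs: number, nowMs: number) {
    this.lastSeenMs = nowMs;
  }

  // Call from onmessage (any inbound frame proves the link is alive).
  markAlive(nowMs: number): void {
    this.lastSeenMs = nowMs;
  }

  // Call from a periodic timer; true means "treat the socket as dead".
  isExpired(nowMs: number): boolean {
    return nowMs - this.lastSeenMs > this.timeoutMs;
  }
}

const watchdog = new LivenessWatchdog(60_000, 0);
watchdog.markAlive(50_000); // a message arrived at t = 50 s
console.log(watchdog.isExpired(100_000)); // false: only 50 s of silence
console.log(watchdog.isExpired(110_001)); // true: more than 60 s of silence
```

When isExpired returns true, the client would run the same disconnect path it uses for a close event rather than wait for the browser to report one.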

What is the difference between close code 1000 and 1006? #

1000 is a normal closure: both sides exchanged close frames cleanly. 1006 (Abnormal Closure) is generated locally by the runtime when the connection dropped without a close frame, for example after a crash, a killed proxy connection, or lost network. You will never receive 1006 from the peer, because RFC 6455 reserves it and forbids sending it in a close frame; it is your own runtime telling you the closing handshake never completed.
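The distinction is easy to encode for logging or metrics. RFC 6455 reserves 1005, 1006, and 1015 as codes that must not be sent in a close frame, so seeing one always means the local runtime produced it. The helper names below (`isLocallyGenerated`, `closeLabel`) are hypothetical, and the labels are just one possible wording.

```typescript
// Codes RFC 6455 reserves for local reporting; they never travel on the wire.
function isLocallyGenerated(code: number): boolean {
  return code === 1005 || code === 1006 || code === 1015;
}

// Short label for logs; anything not special-cased is reported by number.
function closeLabel(code: number): string {
  if (code === 1000) return 'normal closure';
  if (code === 1006) return 'abnormal closure (no close frame)';
  return isLocallyGenerated(code) ? `local-only code ${code}` : `close code ${code}`;
}

console.log(closeLabel(1000)); // normal closure
console.log(closeLabel(1006)); // abnormal closure (no close frame)
console.log(isLocallyGenerated(1001)); // false: 1001 (Going Away) can be sent by a peer
```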

Do I need both a close handler and an error handler? #

Yes, but make teardown idempotent. In browsers an error typically fires immediately before a 1006 close; on the ws server an error may fire without a subsequent close. Wiring both and guarding cleanup with a tornDown flag (server) or a null check (client) ensures recovery runs regardless of which event arrives, and never twice.

Does this work behind AWS ALB? #

Yes, but the ALB has its own idle timeout (default 60s) that is independent of nginx. Set your HEARTBEAT_INTERVAL_MS below that timeout and confirm the value in the load balancer’s attributes (the idle timeout is configured on the ALB itself, not on the target group); otherwise the ALB silently reaps idle sockets and you get a steady drip of 1006 closes that the application heartbeat would otherwise prevent.

Why is CLOSE_WAIT piling up even though clients reconnect fine? #

CLOSE_WAIT means the peer closed but your process has not yet called close() on its end of the socket. It almost always points to a teardown path that exits early: an exception thrown before clearInterval/socket.terminate(), or a code path that skips cleanup. Run the chaos test from the checklist and watch the counts; they should return to baseline within a few heartbeat intervals, once the server has terminated the dead sockets.
