Auto-Reconnection Strategies #
A WebSocket connection is a single TCP socket, and TCP sockets die: a laptop sleeps, a load balancer hits its idle timeout, a backend node rolls during a deploy, or a phone hands off from WiFi to LTE. When that happens the browser fires a close event and the application is offline until something re-establishes the socket. The naive fix — call new WebSocket(url) inside onclose — is also the fastest way to take down your own backend: when a server restarts, every client that was attached to it fires onclose within the same few milliseconds and reconnects in lockstep. That synchronized retry storm (a thundering herd) hammers the freshly-booted node until it falls over again, producing a reconnect loop that looks exactly like an outage to your users.
Resilient reconnection is therefore not “retry on close.” It is a small state machine that retries on a randomized, capped, exponentially-growing schedule, that buffers outgoing messages while offline, and that tears down cleanly so a deliberate disconnect never triggers a phantom reconnect. This guide builds that machine in TypeScript and gives you the configuration knobs and verification steps to run it in production.
Prerequisites #
This page assumes the server-side connection handling described in Backend WebSocket Connection Management is already in place. Specifically you need:
- A working server that closes sockets with meaningful RFC 6455 close codes so the client can distinguish “retry” (1006, 1011, 1012) from “do not retry” (1008 policy violation, 4001 auth failure).
- Heartbeats configured per Connection Lifecycle & Heartbeats, so half-open sockets surface as a close event instead of hanging forever.
- If you run more than one backend node, Load Balancer Sticky Sessions or a shared pub/sub layer, so a reconnecting client can recover state on whichever node it lands on.
The reconnection state machine #
The hardest part of reconnection is concurrency: a close event, a manual disconnect() call, and a network-change event can all fire while a retry timer is already pending. Without a single source of truth they race, and you end up with two live sockets or a retry that fires after the user has navigated away. A finite state machine collapses that ambiguity into four states with explicit transitions.
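One way to make those transitions explicit is a lookup table that every state change must pass through. The sketch below uses the four state names from this guide; the `TRANSITIONS` table and the `transition` helper are hypothetical, and the set of legal moves is an assumption to adapt to your own lifecycle.

```typescript
type ConnState = "CONNECTED" | "BACKOFF" | "RECONNECTING" | "CLOSED";

// Assumed legal transitions (illustrative, not taken from a spec).
const TRANSITIONS: Record<ConnState, readonly ConnState[]> = {
  CLOSED: ["RECONNECTING"],                         // a connect attempt starts
  RECONNECTING: ["CONNECTED", "BACKOFF", "CLOSED"], // open, failed attempt, or fatal code / disconnect
  CONNECTED: ["BACKOFF", "CLOSED"],                 // unexpected close, or fatal code / disconnect
  BACKOFF: ["RECONNECTING", "CLOSED"],              // retry timer fired, or disconnect
};

/** Returns `next` if the move is legal; otherwise stays in `current`. */
function transition(current: ConnState, next: ConnState): ConnState {
  return TRANSITIONS[current].includes(next) ? next : current;
}
```

With a guard like this, a stale callback that tries to move a CLOSED machine straight to CONNECTED is ignored instead of resurrecting the connection.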
Core implementation #
The class below wraps a single WebSocket, owns the reconnect timer, buffers outgoing messages while offline, and refuses to retry after a deliberate close. It depends only on the browser-native WebSocket API and setTimeout, so it runs unchanged in modern browsers and in Node 22+, where a global WebSocket is enabled by default (or in earlier Node versions with the ws package’s WebSocket export).
```typescript
type ConnState = "CONNECTED" | "BACKOFF" | "RECONNECTING" | "CLOSED";

interface ReconnectOptions {
  baseDelayMs: number; // first retry delay before jitter
  maxDelayMs: number;  // hard cap on any single delay
  maxRetries: number;  // give up (-> CLOSED) after this many failures
  jitterRatio: number; // 0..1 fraction of the delay randomized away
}

const DEFAULTS: ReconnectOptions = {
  baseDelayMs: 1_000,
  maxDelayMs: 30_000,
  maxRetries: 10,
  jitterRatio: 0.5,
};

// Close codes that mean "the server rejected us on purpose": never retry these.
const FATAL_CLOSE_CODES = new Set([1008, 4001, 4003]);

export class ReconnectingSocket {
  private ws: WebSocket | null = null;
  private state: ConnState = "CLOSED";
  private attempt = 0;
  private timer: ReturnType<typeof setTimeout> | null = null;
  private readonly opts: ReconnectOptions;
  private readonly outbox: string[] = []; // buffered while offline
  private readonly MAX_OUTBOX = 1_000;

  constructor(
    private readonly url: string,
    private readonly onMessage: (data: unknown) => void,
    opts: Partial<ReconnectOptions> = {},
  ) {
    this.opts = { ...DEFAULTS, ...opts };
    this.connect();
  }

  /**
   * Jittered exponential backoff: the delay doubles per attempt, is capped,
   * then randomized. jitterRatio 1.0 is full jitter; 0.5 keeps at least half.
   */
  private backoffDelay(): number {
    const exp = this.opts.baseDelayMs * 2 ** this.attempt;
    const capped = Math.min(exp, this.opts.maxDelayMs);
    // Spread the herd: subtract a random slice up to jitterRatio of the delay.
    return capped - Math.random() * this.opts.jitterRatio * capped;
  }

  private connect(): void {
    this.timer = null; // any pending retry timer has fired by now
    // A socket is being opened, so this is RECONNECTING (initial attempt included).
    this.state = "RECONNECTING";
    const ws = new WebSocket(this.url);
    this.ws = ws;

    ws.onopen = () => {
      this.state = "CONNECTED";
      this.attempt = 0; // reset the curve on every success
      this.flushOutbox();
    };

    ws.onmessage = (ev) => this.onMessage(ev.data);

    ws.onclose = (ev) => {
      this.ws = null;
      // A deliberate disconnect() nulls onclose first, so reaching here means
      // the close was unexpected. Honor fatal codes and the retry ceiling.
      if (FATAL_CLOSE_CODES.has(ev.code) || this.attempt >= this.opts.maxRetries) {
        this.state = "CLOSED";
        return;
      }
      this.scheduleReconnect();
    };

    // onerror fires before onclose; let onclose own the retry decision.
    ws.onerror = () => ws.close();
  }

  private scheduleReconnect(): void {
    this.state = "BACKOFF";
    const delay = this.backoffDelay(); // computed before the increment, so the first retry uses 2^0
    this.attempt++;
    this.timer = setTimeout(() => this.connect(), delay);
  }

  /** Queue when offline; the message replays in order once reconnected. */
  send(payload: unknown): void {
    const frame = JSON.stringify(payload);
    if (this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(frame);
      return;
    }
    if (this.outbox.length >= this.MAX_OUTBOX) this.outbox.shift(); // drop oldest
    this.outbox.push(frame);
  }

  private flushOutbox(): void {
    while (this.outbox.length && this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(this.outbox.shift()!);
    }
  }

  /** Deliberate teardown: detach onclose so it cannot schedule a reconnect. */
  disconnect(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = null;
    this.state = "CLOSED";
    if (this.ws) {
      this.ws.onclose = null;
      this.ws.close(1000, "client disconnect");
      this.ws = null;
    }
  }
}
```
Two lines carry most of the correctness. In disconnect(), nulling ws.onclose before calling close() is what stops a user-initiated teardown from re-entering the retry path — skip it and every logout triggers a reconnect loop. In onopen, resetting this.attempt to 0 is what makes the backoff curve restart after a successful recovery rather than staying near the cap forever.
Because queued messages may be delivered more than once if a socket dies mid-send, give each application payload a stable ID and dedupe on the server. That ID is the foundation for message delivery guarantees once you move beyond best-effort buffering.
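As a minimal sketch of that idea, the server can remember recently seen IDs and drop repeats. The `Envelope` shape, the `RecentIds` class, and the `handle` helper are hypothetical names introduced here for illustration, and the in-memory bounded set assumes a single server process; a multi-node deployment would need a shared store.

```typescript
interface Envelope<T> {
  id: string; // stable across retries of the same logical message
  body: T;
}

/** Remembers the most recent `capacity` IDs and rejects repeats. */
class RecentIds {
  private readonly seen = new Set<string>();
  private readonly capacity: number;

  constructor(capacity: number = 10_000) {
    this.capacity = capacity;
  }

  /** Returns true the first time an ID is seen, false for a duplicate. */
  accept(id: string): boolean {
    if (this.seen.has(id)) return false;
    this.seen.add(id);
    if (this.seen.size > this.capacity) {
      // Sets iterate in insertion order, so the first value is the oldest.
      const oldest = this.seen.values().next().value as string;
      this.seen.delete(oldest);
    }
    return true;
  }
}

/** Applies a message only if its ID has not been seen recently. */
function handle<T>(seen: RecentIds, msg: Envelope<T>, apply: (body: T) => void): void {
  if (seen.accept(msg.id)) apply(msg.body); // duplicates are dropped
}
```

The client attaches the ID once, when the payload is first queued, so a frame replayed from the outbox carries the same ID as the original send.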
Configuration reference #
| Parameter | Type | Default | Production value | Notes |
|---|---|---|---|---|
| baseDelayMs | number | 1000 | 500–1000 | First retry delay before jitter. Too low re-creates the herd. |
| maxDelayMs | number | 30000 | 30000–60000 | Hard cap on any single delay. Keep under your LB idle timeout so an idle reconnect never out-waits the proxy. |
| jitterRatio | number | 0.5 | 0.5–1.0 | Fraction of the delay randomized away. 1.0 is AWS “full jitter”; below 0.3 herds still form. |
| maxRetries | number | 10 | 10–∞ | Failures before entering CLOSED. Use a high value plus a manual “Reconnect” control for long outages. |
| FATAL_CLOSE_CODES | Set<number> | {1008, 4001, 4003} | app-specific | Codes that must NOT retry: policy and auth rejections. Retrying them just burns attempts. |
The math: with baseDelayMs=1000, maxDelayMs=30000, jitterRatio=0.5, attempt 0 waits 0.5–1s, attempt 3 waits 4–8s, and from attempt 5 onward every client waits a uniformly random 15–30s. That randomized spread is what flattens the reconnect spike; raising jitterRatio to 1.0 (full jitter) widens each window to the whole range from 0 to the capped delay. See exponential backoff with jitter for WebSocket reconnects for the load-distribution analysis.
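Those ranges can be checked with a few lines of arithmetic. The sketch below mirrors the formula used by backoffDelay() in the class above; `BackoffParams` and `delayRange` are hypothetical helper names added only for this check.

```typescript
interface BackoffParams {
  baseDelayMs: number;
  maxDelayMs: number;
  jitterRatio: number;
}

/** Lower and upper bound of the delay the formula can produce for an attempt. */
function delayRange(attempt: number, p: BackoffParams): [number, number] {
  const capped = Math.min(p.baseDelayMs * 2 ** attempt, p.maxDelayMs);
  return [capped * (1 - p.jitterRatio), capped];
}

const params: BackoffParams = { baseDelayMs: 1_000, maxDelayMs: 30_000, jitterRatio: 0.5 };
for (const attempt of [0, 3, 5]) {
  const [min, max] = delayRange(attempt, params);
  console.log(`attempt ${attempt}: ${min}-${max} ms`);
}
// attempt 0: 500-1000 ms
// attempt 3: 4000-8000 ms
// attempt 5: 15000-30000 ms
```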
Edge cases & gotchas #
- The phantom reconnect on logout. Calling socket.close() without first detaching onclose re-enters the retry path. Always null the handler in deliberate teardown, as disconnect() does above.
- Backoff that never resets. If attempt is never reset, a client that has been through one long outage keeps its delay pinned at the cap for every later drop. Reset it in onopen, after a confirmed open, and not when a reconnect is merely scheduled, or the delay never grows past the base.
- WiFi-to-cellular handoff wastes the backoff window. On mobile, the device may regain connectivity 20 seconds into a 30-second timer. Listen for navigator.connection’s change event where the browser supports it (and window’s online event) and force an immediate reconnect instead of waiting out a stale timer.
- Multi-tab herds within one origin. Ten browser tabs all reconnect independently after a server restart. Elect a single leader tab with the BroadcastChannel API to hold the live socket and fan results out to the others, or accept the per-tab jitter as your only spreading mechanism.
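For the handoff case, a small listener can cut a stale backoff timer short. This is a sketch under assumptions: `wakeOnOnline` is a hypothetical helper, and `reconnectNow` stands for a method the class above does not define, which would clear the pending timer and call connect() immediately.

```typescript
/**
 * Calls `reconnectNow` whenever `target` emits an "online" event.
 * Returns a function that removes the listener again.
 */
function wakeOnOnline(target: EventTarget, reconnectNow: () => void): () => void {
  const handler = (): void => reconnectNow();
  target.addEventListener("online", handler);
  return () => target.removeEventListener("online", handler);
}

// In a browser this might be wired up as (reconnectNow is hypothetical):
//   const stop = wakeOnOnline(window, () => socket.reconnectNow());
//   ...and stop() would be called as part of deliberate teardown.
```

Returning the unsubscribe function keeps teardown symmetric with disconnect(), so a wake-up event cannot revive a socket the user closed on purpose.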
Verification #
Confirm the backoff curve and clean teardown before trusting this in production.
- Watch the socket states. While killing the server, run ss -tnp | grep :8080 on the client host (or chrome://net-internals). You should see the socket move to CLOSE_WAIT/gone, then a single new SYN per backoff interval, never a tight burst.
- Assert the delays in DevTools. Open the Network panel, filter to WS, throttle to Offline, and confirm successive reconnect attempts are spaced by growing, non-identical intervals (jitter makes them differ run to run).
- Prove the no-retry path. Call disconnect() and verify with ss that no new socket appears. Then have the server close with 1008 and confirm the client lands in CLOSED without retrying.
- Load-test the herd. Boot N test clients against one node, restart it, and graph reconnect timestamps. With full jitter the reconnect rate should be a flat plateau, not a spike. Alert when more than ~15% of sessions hit maxDelayMs; that signals a structural outage, not transient noise.
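To judge the load-test result numerically instead of by eye, the reconnect timestamps can be bucketed and the busiest bucket compared with the average. The helpers below (`bucketCounts`, `peakToMean`) are hypothetical names for a rough sketch; what ratio counts as a spike is a judgment call for your own capacity budget.

```typescript
/** Counts reconnects per fixed-width time bucket, starting at the earliest timestamp. */
function bucketCounts(timestampsMs: number[], bucketMs: number): number[] {
  if (timestampsMs.length === 0) return [];
  const start = timestampsMs.reduce((a, b) => Math.min(a, b));
  const end = timestampsMs.reduce((a, b) => Math.max(a, b));
  const counts: number[] = new Array(Math.floor((end - start) / bucketMs) + 1).fill(0);
  for (const t of timestampsMs) {
    const i = Math.floor((t - start) / bucketMs);
    counts[i] = (counts[i] ?? 0) + 1;
  }
  return counts;
}

/** Peak bucket divided by mean bucket: near 1 is a plateau, a large value is a spike. */
function peakToMean(counts: number[]): number {
  if (counts.length === 0) return 0;
  const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
  const peak = counts.reduce((a, b) => Math.max(a, b));
  return peak / mean;
}

// Six reconnects over three one-second buckets: [3, 1, 2], so the ratio is 3 / 2 = 1.5.
console.log(peakToMean(bucketCounts([0, 10, 20, 1000, 2000, 2999], 1000)));
```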
Guides in this area #
- Handling WebSocket disconnects gracefully: diagnosing half-open and CLOSE_WAIT sockets, and the atomic teardown sequence that keeps a deliberate close from cascading into resource exhaustion.
- Exponential backoff with jitter for WebSocket reconnects: the full-jitter vs equal-jitter math, why constant or pure-exponential retries herd, and how to tune the curve to your reconnect-rate budget.
FAQ #
How is exponential backoff different from just retrying every few seconds? #
A fixed interval keeps every disconnected client synchronized: they all retry at the same cadence, so a server restart produces a recurring spike at exactly that interval. Exponential backoff grows the delay after each failure so total load tapers off during a long outage, and the jitter term de-synchronizes clients so their retries spread across each window instead of landing at the same instant.
What close codes should trigger a reconnect, and which should not? #
Retry on the transient codes: 1006 (abnormal closure, the common “connection dropped”), 1011 (server error), and 1012 (server restarting). Do not retry on deliberate rejections: 1008 (policy violation) and your application’s auth-failure codes such as 4001 — retrying those just burns attempts against a server that will keep refusing you. Surface those to the user instead.
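That rule reduces to a small lookup. The sketch below reuses the fatal codes listed in this guide; `classifyClose` and the `CloseDecision` type are hypothetical names, and treating every unlisted code as retryable is an assumption that matches the class above.

```typescript
// Codes taken from this guide; extend the set with your own 4xxx application codes.
const FATAL_CLOSE_CODES: ReadonlySet<number> = new Set([1008, 4001, 4003]);

type CloseDecision = "retry" | "give-up";

/** Maps a close code to a decision; anything not listed as fatal is retried. */
function classifyClose(code: number): CloseDecision {
  return FATAL_CLOSE_CODES.has(code) ? "give-up" : "retry";
}

for (const code of [1006, 1011, 1012, 1008, 4001]) {
  console.log(code, classifyClose(code));
}
```

A "give-up" result is the point at which to surface an error or a sign-in prompt instead of scheduling another attempt.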
Does this work behind AWS ALB or nginx? #
Yes, with one constraint: keep maxDelayMs below the proxy’s idle timeout (ALB defaults to 60s, nginx proxy_read_timeout to 60s). Otherwise the proxy may cut an idle connection while your client is still waiting out a long backoff, adding a confusing extra round of reconnects. With multiple backend nodes you also need sticky sessions or a shared pub/sub layer so the reconnecting client can recover state.
Should the message buffer survive a page reload? #
The in-memory outbox does not survive reload — that is usually correct, since stale queued mutations replayed minutes later cause more harm than the loss. If you genuinely need durable delivery, persist outgoing messages to IndexedDB with a stable ID and reconcile against server acknowledgements rather than relying on the in-memory queue.
What changes for Socket.IO vs raw ws? #
Socket.IO ships its own reconnection manager with reconnectionDelay, reconnectionDelayMax, and a built-in randomizationFactor (its jitter), so you configure rather than implement the curve. The state-machine and message-buffering concerns here still apply; you are tuning Socket.IO’s manager instead of writing your own setTimeout loop.
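As a rough illustration, the defaults from this guide could be expressed with Socket.IO client options as follows. The mapping in the comments is approximate, since Socket.IO applies its randomization with its own formula, and the variable name `socketIoOptions` is hypothetical.

```typescript
// Plain options object; in an application it would be passed as io(url, socketIoOptions).
const socketIoOptions = {
  reconnection: true,
  reconnectionAttempts: 10,     // roughly corresponds to maxRetries
  reconnectionDelay: 1_000,     // roughly corresponds to baseDelayMs
  reconnectionDelayMax: 30_000, // roughly corresponds to maxDelayMs
  randomizationFactor: 0.5,     // roughly corresponds to jitterRatio
};

console.log(socketIoOptions);
```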
Related #
- Handling WebSocket disconnects gracefully: diagnose and atomically tear down half-open sockets.
- Exponential backoff with jitter for WebSocket reconnects: the jitter math and reconnect-rate tuning in depth.
- Connection Lifecycle & Heartbeats: heartbeats that turn dead sockets into the close events this machine reacts to.
- Load Balancer Sticky Sessions: keep reconnecting clients on a node that holds their session state.
- Message Delivery Guarantees: move past best-effort buffering to acknowledged, at-least-once delivery.