Auto-Reconnection Strategies #

A WebSocket connection is a single TCP socket, and TCP sockets die: a laptop sleeps, a load balancer hits its idle timeout, a backend node rolls during a deploy, or a phone hands off from WiFi to LTE. When that happens the browser fires a close event and the application is offline until something re-establishes the socket. The naive fix — call new WebSocket(url) inside onclose — is also the fastest way to take down your own backend: when a server restarts, every client that was attached to it fires onclose within the same few milliseconds and reconnects in lockstep. That synchronized retry storm (a thundering herd) hammers the freshly-booted node until it falls over again, producing a reconnect loop that looks exactly like an outage to your users.

Resilient reconnection is therefore not “retry on close.” It is a small state machine that retries on a randomized, capped, exponentially-growing schedule, that buffers outgoing messages while offline, and that tears down cleanly so a deliberate disconnect never triggers a phantom reconnect. This guide builds that machine in TypeScript and gives you the configuration knobs and verification steps to run it in production.

Prerequisites #

This page assumes the server-side connection handling described in Backend WebSocket Connection Management is already in place. Specifically you need:

  • A working server that closes sockets with meaningful RFC 6455 close codes so the client can distinguish “retry” (1011, 1012, and the 1006 that the client reports locally when a connection drops without a close frame) from “do not retry” (1008 policy violation, 4001 auth failure).
  • Heartbeats configured per Connection Lifecycle & Heartbeats, so half-open sockets surface as a close event instead of hanging forever.
  • If you run more than one backend node, Load Balancer Sticky Sessions or a shared pub/sub layer, so a reconnecting client can recover state on whichever node it lands on.

The reconnection state machine #

The hardest part of reconnection is concurrency: a close event, a manual disconnect() call, and a network-change event can all fire while a retry timer is already pending. Without a single source of truth they race, and you end up with two live sockets or a retry that fires after the user has navigated away. A finite state machine collapses that ambiguity into four states with explicit transitions.

[Figure: WebSocket reconnection state machine. States: CONNECTED (socket OPEN), BACKOFF (timer pending), RECONNECTING (handshake), CLOSED (terminal). Transitions: CONNECTED to BACKOFF on close (for example 1006); BACKOFF to RECONNECTING when the timer fires; RECONNECTING to CONNECTED on open; RECONNECTING to BACKOFF on a failed attempt with retries left; close code 1008 or reaching the retry maximum leads to CLOSED; a manual disconnect from any state goes directly to CLOSED.]
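The transitions described above can be written down as a pure function, which makes them easy to unit-test separately from any socket. This is a minimal sketch, not the implementation used later on this page: the `ConnEvent` shape, the `retriesLeft` flag, and the `transition` and `afterClose` names are assumptions made for illustration.

```typescript
type ConnState = "CONNECTED" | "BACKOFF" | "RECONNECTING" | "CLOSED";

// Hypothetical event shape; `retriesLeft` is computed by the caller.
type ConnEvent =
  | { kind: "close"; code: number; retriesLeft: boolean }
  | { kind: "timer" }
  | { kind: "open" }
  | { kind: "disconnect" };

// Codes treated as deliberate rejections (illustrative; tune per application).
const FATAL = new Set([1008, 4001]);

function afterClose(code: number, retriesLeft: boolean): ConnState {
  return FATAL.has(code) || !retriesLeft ? "CLOSED" : "BACKOFF";
}

function transition(state: ConnState, ev: ConnEvent): ConnState {
  // A manual disconnect wins from every state.
  if (ev.kind === "disconnect") return "CLOSED";
  switch (state) {
    case "CONNECTED":
      return ev.kind === "close" ? afterClose(ev.code, ev.retriesLeft) : state;
    case "BACKOFF":
      return ev.kind === "timer" ? "RECONNECTING" : state;
    case "RECONNECTING":
      if (ev.kind === "open") return "CONNECTED";
      if (ev.kind === "close") return afterClose(ev.code, ev.retriesLeft);
      return state;
    case "CLOSED":
      return state; // terminal: nothing leaves CLOSED
  }
}
```

Events that make no sense in the current state (a stray timer while CONNECTED, for instance) leave the state unchanged, which is what keeps racing callbacks from creating a second socket.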

Core implementation #

The class below wraps a single WebSocket, owns the reconnect timer, buffers outgoing messages while offline, and refuses to retry after a deliberate close. It depends only on the browser-native WebSocket API and setTimeout, so it runs unchanged in modern browsers and in Node 22+, where a global WebSocket is enabled by default (or in earlier Node versions with the ws package’s WebSocket export).

type ConnState = "CONNECTED" | "BACKOFF" | "RECONNECTING" | "CLOSED";

interface ReconnectOptions {
  baseDelayMs: number; // first retry delay before jitter
  maxDelayMs: number; // hard cap on any single delay
  maxRetries: number; // give up (-> CLOSED) after this many failures
  jitterRatio: number; // 0..1 fraction of the delay randomized away
}

const DEFAULTS: ReconnectOptions = {
  baseDelayMs: 1_000,
  maxDelayMs: 30_000,
  maxRetries: 10,
  jitterRatio: 0.5,
};

// Close codes that mean "the server rejected us on purpose": never retry these.
const FATAL_CLOSE_CODES = new Set([1008, 4001, 4003]);

export class ReconnectingSocket {
  private ws: WebSocket | null = null;
  private state: ConnState = "CLOSED";
  private attempt = 0;
  private timer: ReturnType<typeof setTimeout> | null = null;
  private readonly opts: ReconnectOptions;
  private readonly outbox: string[] = []; // buffered while offline
  private readonly MAX_OUTBOX = 1_000;

  constructor(
    private readonly url: string,
    private readonly onMessage: (data: unknown) => void,
    opts: Partial<ReconnectOptions> = {},
  ) {
    this.opts = { ...DEFAULTS, ...opts };
    this.connect();
  }

  /** Jittered exponential backoff: delay grows 2^n, is capped, then randomized. */
  private backoffDelay(): number {
    const exp = this.opts.baseDelayMs * 2 ** this.attempt;
    const capped = Math.min(exp, this.opts.maxDelayMs);
    // Spread the herd: subtract a random slice up to jitterRatio of the delay.
    // jitterRatio = 1.0 makes this "full jitter" (uniform between 0 and capped).
    return capped - Math.random() * this.opts.jitterRatio * capped;
  }

  private connect(): void {
    // The backoff timer (if any) has fired; the handshake is now in flight.
    // BACKOFF is set only by scheduleReconnect(), while a timer is pending.
    this.timer = null;
    this.state = "RECONNECTING";
    const ws = new WebSocket(this.url);
    this.ws = ws;

    ws.onopen = () => {
      this.state = "CONNECTED";
      this.attempt = 0; // reset the curve on every success
      this.flushOutbox();
    };

    ws.onmessage = (ev) => this.onMessage(ev.data);

    ws.onclose = (ev) => {
      this.ws = null;
      // A deliberate disconnect() nulls onclose first, so reaching here means
      // the close was unexpected. Honor fatal codes and the retry ceiling.
      if (FATAL_CLOSE_CODES.has(ev.code) || this.attempt >= this.opts.maxRetries) {
        this.state = "CLOSED";
        return;
      }
      this.scheduleReconnect();
    };

    // onerror fires before onclose; let onclose own the retry decision.
    ws.onerror = () => ws.close();
  }

  private scheduleReconnect(): void {
    this.state = "BACKOFF";
    const delay = this.backoffDelay();
    this.attempt++;
    this.timer = setTimeout(() => this.connect(), delay);
  }

  /** Queue when offline; the message replays in order once reconnected. */
  send(payload: unknown): void {
    const frame = JSON.stringify(payload);
    if (this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(frame);
      return;
    }
    if (this.outbox.length >= this.MAX_OUTBOX) this.outbox.shift(); // drop oldest
    this.outbox.push(frame);
  }

  private flushOutbox(): void {
    while (this.outbox.length && this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(this.outbox.shift()!);
    }
  }

  /** Deliberate teardown: detach onclose so it cannot schedule a reconnect. */
  disconnect(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = null;
    this.state = "CLOSED";
    if (this.ws) {
      this.ws.onclose = null;
      this.ws.close(1000, "client disconnect");
      this.ws = null;
    }
  }
}

Two lines carry most of the correctness. In disconnect(), nulling ws.onclose before calling close() is what stops a user-initiated teardown from re-entering the retry path — skip it and every logout triggers a reconnect loop. In onopen, resetting this.attempt to 0 is what makes the backoff curve restart after a successful recovery rather than staying near the cap forever.

Because queued messages may be delivered more than once if a socket dies mid-send, give each application payload a stable ID and dedupe on the server. That ID is the foundation for message delivery guarantees once you move beyond best-effort buffering.
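One way to picture the server-side dedupe is a bounded set of recently seen IDs. The sketch below is an assumption-heavy illustration rather than a prescribed design: the `Envelope` shape, the `Deduper` class, the per-client sequence-based ID format, and the capacity are all hypothetical, and a production system would more likely keep this state in a shared store.

```typescript
// Hypothetical envelope: `id` is assigned once by the client and reused on replay.
interface Envelope<T> {
  id: string;
  body: T;
}

class Deduper {
  private readonly seen = new Set<string>();
  private readonly capacity: number;

  constructor(capacity = 10_000) {
    this.capacity = capacity;
  }

  /** Returns true the first time an ID is seen, false for a replay. */
  accept(id: string): boolean {
    if (this.seen.has(id)) return false;
    this.seen.add(id);
    if (this.seen.size > this.capacity) {
      // A Set iterates in insertion order, so the first value is the oldest.
      const oldest = this.seen.values().next().value as string;
      this.seen.delete(oldest);
    }
    return true;
  }
}

// Client side: build the ID from a client identifier plus a local sequence number.
let seq = 0;
function wrap<T>(clientId: string, body: T): Envelope<T> {
  return { id: `${clientId}-${seq++}`, body };
}
```

The bound matters: an unbounded set grows for the life of the process, while a bounded one only needs to remember IDs for as long as a client could plausibly replay them.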

Configuration reference #

| Parameter | Type | Default | Production value | Notes |
| --- | --- | --- | --- | --- |
| baseDelayMs | number | 1000 | 500–1000 | First retry delay before jitter. Too low re-creates the herd. |
| maxDelayMs | number | 30000 | 30000–60000 | Hard cap on any single delay. Keep under your LB idle timeout so an idle reconnect never out-waits the proxy. |
| jitterRatio | number | 0.5 | 0.5–1.0 | Fraction of the delay randomized away. 1.0 is AWS “full jitter”; below 0.3 herds still form. |
| maxRetries | number | 10 | 10 | Failures before entering CLOSED. Use a high value plus a manual “Reconnect” control for long outages. |
| FATAL_CLOSE_CODES | Set<number> | {1008, 4001, 4003} | app-specific | Codes that must NOT retry: policy and auth rejections. Retrying them just burns attempts. |

The math: with base=1000, maxDelay=30000, jitterRatio=0.5, attempt 0 waits 0.5–1s, attempt 3 waits 4–8s, and from attempt 5 onward every client waits a uniformly random 15–30s. That random spread is what flattens the reconnect spike; raising jitterRatio to 1.0 widens it to the full 0–30s range. See exponential backoff with jitter for WebSocket reconnects for the load-distribution analysis.
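The figures above can be reproduced with a few lines of arithmetic. The sketch below assumes the same formula as the class on this page (cap the exponential, then subtract up to jitterRatio of it); `delayRange` and `BackoffParams` are names introduced here for illustration.

```typescript
interface BackoffParams {
  baseDelayMs: number;
  maxDelayMs: number;
  jitterRatio: number;
}

/** Bounds of the randomized delay for a zero-based attempt number. */
function delayRange(attempt: number, p: BackoffParams): { minMs: number; maxMs: number } {
  const capped = Math.min(p.baseDelayMs * 2 ** attempt, p.maxDelayMs);
  // Math.random() is in [0, 1), so minMs is approached but never reached.
  return { minMs: capped * (1 - p.jitterRatio), maxMs: capped };
}

const params: BackoffParams = { baseDelayMs: 1_000, maxDelayMs: 30_000, jitterRatio: 0.5 };
for (const n of [0, 3, 5]) {
  const { minMs, maxMs } = delayRange(n, params);
  console.log(`attempt ${n}: ${minMs / 1000}-${maxMs / 1000}s`);
}
// attempt 0: 0.5-1s
// attempt 3: 4-8s
// attempt 5: 15-30s
```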

Edge cases & gotchas #

  • The phantom reconnect on logout. Calling socket.close() without first detaching onclose re-enters the retry path. Always null the handler in deliberate teardown, as disconnect() does above.
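  • Backoff that never resets. If attempt is not reset in onopen, the counter left over from one long outage carries into the next drop, so a flapping network keeps the delay pinned at the cap and eventually exhausts maxRetries. Reset only after a confirmed open; resetting when a retry is scheduled instead would stop the delay from ever growing.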
  • WiFi-to-cellular handoff wastes the backoff window. On mobile, the device may regain connectivity 20 seconds into a 30-second timer. Listen for navigator.connection’s change event (and window’s online event) and force an immediate reconnect instead of waiting out a stale timer.
  • Multi-tab herds within one origin. Ten browser tabs all reconnect independently after a server restart. Elect a single leader tab with the BroadcastChannel API to hold the live socket and fan results out to the others, or accept the per-tab jitter as your only spreading mechanism.
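For the handoff case in the list above, one possible shape is a retry timer that can be fired early when the browser reports connectivity. The class on this page does not include this; `RetryTimer` and `fireNow` are hypothetical names, and `navigator.connection` is treated as optional because it is not available in every browser.

```typescript
/** A cancellable retry timer that can be short-circuited (hypothetical helper). */
class RetryTimer {
  private handle: ReturnType<typeof setTimeout> | null = null;
  private task: (() => void) | null = null;

  schedule(delayMs: number, task: () => void): void {
    this.cancel();
    this.task = task;
    this.handle = setTimeout(() => this.fire(), delayMs);
  }

  /** Run the pending task immediately; a no-op when nothing is pending. */
  fireNow(): void {
    this.fire();
  }

  cancel(): void {
    if (this.handle !== null) clearTimeout(this.handle);
    this.handle = null;
    this.task = null;
  }

  get pending(): boolean {
    return this.task !== null;
  }

  private fire(): void {
    const task = this.task;
    this.cancel(); // clear first so the task can safely schedule again
    task?.();
  }
}

const retryTimer = new RetryTimer();

// Browser-only wiring, written defensively so it is a no-op elsewhere.
type Listen = (type: string, cb: () => void) => void;
const host = globalThis as unknown as {
  addEventListener?: Listen;
  navigator?: { connection?: { addEventListener?: Listen } };
};
host.addEventListener?.("online", () => retryTimer.fireNow());
host.navigator?.connection?.addEventListener?.("change", () => retryTimer.fireNow());
```

Because `fireNow()` does nothing when no retry is pending, a connectivity event that arrives while the socket is healthy cannot open a second connection.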

Verification #

Confirm the backoff curve and clean teardown before trusting this in production.

  • Watch the socket states. While killing the server, run ss -tnp | grep :8080 on the client host (or chrome://net-internals). You should see the socket move to CLOSE_WAIT/gone, then a single new SYN per backoff interval — never a tight burst.
  • Assert the delays in DevTools. Open the Network panel, filter to WS, throttle to Offline, and confirm successive reconnect attempts are spaced by growing, non-identical intervals (jitter makes them differ run to run).
  • Prove the no-retry path. Call disconnect() and verify with ss that no new socket appears. Then have the server close with 1008 and confirm the client lands in CLOSED without retrying.
  • Load-test the herd. Boot N test clients against one node, restart it, and graph reconnect timestamps. With full jitter the reconnect rate should be a flat plateau, not a spike. Alert when more than ~15% of sessions hit maxDelayMs — that signals a structural outage, not transient noise.
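Before running a real load test, the effect of jitter on the herd can be approximated offline. The sketch below is a simulation under stated assumptions, not a measurement: it models only the first retry of each client with a 1000 ms base delay, uses a small seeded generator so runs are repeatable, and the names `makeRng` and `peakPerBucket` are introduced here for illustration.

```typescript
/** Small deterministic LCG so runs are repeatable (not for cryptographic use). */
function makeRng(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s * 1664525 + 1013904223) % 2 ** 32;
    return s / 2 ** 32; // in [0, 1)
  };
}

/** Largest number of clients whose first retry lands in the same time bucket. */
function peakPerBucket(
  clients: number,
  jitterRatio: number,
  bucketMs: number,
  rng: () => number,
): number {
  const baseDelayMs = 1_000;
  const buckets = new Map<number, number>();
  for (let i = 0; i < clients; i++) {
    const delay = baseDelayMs - rng() * jitterRatio * baseDelayMs;
    const bucket = Math.floor(delay / bucketMs);
    buckets.set(bucket, (buckets.get(bucket) ?? 0) + 1);
  }
  return Math.max(...Array.from(buckets.values()));
}

const noJitter = peakPerBucket(1_000, 0, 100, makeRng(42));
const halfJitter = peakPerBucket(1_000, 0.5, 100, makeRng(42));
console.log({ noJitter, halfJitter });
```

With no jitter every client lands in the same 100 ms bucket, so the peak equals the client count; with jitterRatio 0.5 the same clients are spread over roughly half a second and the peak drops accordingly.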

FAQ #

How is exponential backoff different from just retrying every few seconds? #

A fixed interval keeps every disconnected client synchronized: they all retry at the same cadence, so a server restart produces a recurring spike at exactly that interval. Exponential backoff grows the delay after each failure so total load tapers off during a long outage, and the jitter term de-synchronizes clients so no two retry at the same instant.

What close codes should trigger a reconnect, and which should not? #

Retry on the transient codes: 1006 (abnormal closure, the common “connection dropped”), 1011 (server error), and 1012 (server restarting). Do not retry on deliberate rejections: 1008 (policy violation) and your application’s auth-failure codes such as 4001 — retrying those just burns attempts against a server that will keep refusing you. Surface those to the user instead.

Does this work behind AWS ALB or nginx? #

Yes, with one constraint: keep maxDelayMs below the proxy’s idle timeout (ALB defaults to 60s, nginx proxy_read_timeout to 60s). Otherwise the proxy may cut an idle connection while your client is still waiting out a long backoff, adding a confusing extra round of reconnects. With multiple backend nodes you also need sticky sessions or a shared pub/sub layer so the reconnecting client can recover state.

Should the message buffer survive a page reload? #

The in-memory outbox does not survive reload — that is usually correct, since stale queued mutations replayed minutes later cause more harm than the loss. If you genuinely need durable delivery, persist outgoing messages to IndexedDB with a stable ID and reconcile against server acknowledgements rather than relying on the in-memory queue.

What changes for Socket.IO vs raw ws? #

Socket.IO ships its own reconnection manager with reconnectionDelay, reconnectionDelayMax, and a built-in randomizationFactor (its jitter), so you configure rather than implement the curve. The state-machine and message-buffering concerns here still apply; you are tuning Socket.IO’s manager instead of writing your own setTimeout loop.
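As a rough illustration, the curve from this page maps onto Socket.IO client options along these lines. The option names are the ones Socket.IO documents; the values, the local `SocketIoReconnectOptions` interface (a hand-written subset, not the library's own type), and the URL in the commented usage line are assumptions for the example. Socket.IO applies its randomization differently from the subtract-only jitter used here, so treat the values as a starting point.

```typescript
// Hand-written subset of the Socket.IO client options, for illustration only.
interface SocketIoReconnectOptions {
  reconnection: boolean;
  reconnectionAttempts: number;
  reconnectionDelay: number; // ms, plays the role of baseDelayMs
  reconnectionDelayMax: number; // ms, plays the role of maxDelayMs
  randomizationFactor: number; // 0..1, plays the role of jitterRatio
}

const reconnectOptions: SocketIoReconnectOptions = {
  reconnection: true,
  reconnectionAttempts: 10,
  reconnectionDelay: 1_000,
  reconnectionDelayMax: 30_000,
  randomizationFactor: 0.5,
};

// Usage, assuming socket.io-client is installed and the URL is yours:
// import { io } from "socket.io-client";
// const socket = io("https://example.com", reconnectOptions);
```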
