Exponential backoff with jitter for WebSocket reconnects

Your WebSocket fleet was fine until a single server restart took ten thousand clients offline at the same instant. Every one of them fired onclose, waited the same fixed delay, and slammed the load balancer in a synchronized wave. The backend never recovers because each retry burst arrives in lockstep. This is the thundering herd, and the fix is exponential backoff with jitter — not a longer fixed delay.

If you are landing here from a search like “WebSocket reconnect storm” or “clients reconnect all at once after deploy”, the symptom is a sawtooth in ws_connections_active: a cliff to zero, then a vertical spike that overwhelms accept queues, exhausts file descriptors, and trips your CPU autoscaler into a feedback loop. The retries themselves become the outage.

Root cause

A WebSocket close is a synchronizing event. When a server process dies, every connected client observes the disconnect within the same TCP RTT window — often inside a few hundred milliseconds. If each client then reconnects after a constant delay (say 1000 ms), all of them retry at T + 1000 ms, again at T + 2000 ms, and so on. The clients are phase-locked. Fixed-interval retry does not spread load; it preserves the exact correlation that caused the spike.
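To make the phase lock concrete, here is a minimal sketch showing that a fixed delay carries the initial spread of disconnect times through every retry round unchanged. The three client timings and the helper names are illustrative assumptions, not measurements.

```typescript
// Time of the k-th retry (k starts at 1) for a client that noticed the drop
// at `observedAtMs` and always waits the same fixed delay.
function fixedRetryTime(observedAtMs: number, fixedDelayMs: number, k: number): number {
  return observedAtMs + k * fixedDelayMs;
}

// Width of the window in which a set of clients fire.
function spread(timesMs: number[]): number {
  return Math.max(...timesMs) - Math.min(...timesMs);
}

// Hypothetical clients that observe the same server death within 150 ms.
const observedAtMs: number[] = [0, 80, 150];

// Their third retry with a 1000 ms fixed delay: 3000, 3080 and 3150 ms.
const thirdRetryMs: number[] = observedAtMs.map((t) => fixedRetryTime(t, 1000, 3));

// The retry burst is exactly as narrow as the original disconnect window.
const initialSpreadMs = spread(observedAtMs);
const thirdRetrySpreadMs = spread(thirdRetryMs);
```

However many rounds you run, the burst width stays at the width of the original disconnect window, which is why a longer fixed delay does not help.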

Plain exponential backoff fixes the rate but not the phase. The delay for attempt n is:

delay = min(base * 2^n, cap)

With base = 500 ms and cap = 30 s, the successive delays are 500, 1000, 2000, 4000, 8000, 16000, then 30000 ms (the uncapped 32000 ms is clamped to the cap). The intervals grow, which lowers the average request rate, but every client computes the same deterministic delay, so they still fire in tight bands. You have stretched the herd, not dispersed it. Ten thousand clients still arrive together; they just arrive together less often.
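The schedule is easy to reproduce. This sketch evaluates the formula for attempts 0 through 6 with the same base and cap; the function and variable names are placeholders chosen for the example.

```typescript
// Capped exponential backoff with no jitter: delay = min(base * 2^n, cap).
function exponentialDelay(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Delays for attempts 0..6 with base = 500 ms and cap = 30 s.
const schedule: number[] = [];
for (let n = 0; n <= 6; n++) {
  schedule.push(exponentialDelay(n, 500, 30_000));
}
// schedule is [500, 1000, 2000, 4000, 8000, 16000, 30000]

// Every client computes this same list, so a fleet that dropped together
// also retries together at each step.
const identicalForEveryClient = schedule.join(",") === schedule.slice().join(",");
```

Nothing in the function depends on the client, which is the whole problem: the output is a pure function of the attempt number.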

The missing ingredient is randomness. Jitter decorrelates clients by spreading each one’s wake-up time across a window instead of pinning it to a point. AWS’s analysis of this problem compares several variants; the two most relevant here are:

Full jitter: delay = random(0, min(cap, base * 2^n)). Each client picks a uniform random delay anywhere from zero up to the capped exponential ceiling. This produces the flattest arrival distribution and the lowest contention, at the cost of occasionally retrying very quickly.
Equal jitter: temp = min(cap, base * 2^n); delay = temp/2 + random(0, temp/2). Half the interval is fixed, half is random. This guarantees a minimum wait (you never hammer instantly) while still decorrelating the upper half.
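Both variants translate directly into code. The sketch below assumes an injectable random source (`rng`, defaulting to `Math.random`) so the functions can be exercised deterministically; that parameter and the function names are additions for illustration.

```typescript
// A random source returning a value in [0, 1), like Math.random.
type Rng = () => number;

// Shared capped exponential ceiling: min(cap, base * 2^n).
function backoffCeiling(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Full jitter: uniform over [0, ceiling).
function fullJitter(attempt: number, baseMs: number, capMs: number, rng: Rng = Math.random): number {
  return rng() * backoffCeiling(attempt, baseMs, capMs);
}

// Equal jitter: fixed lower half plus a uniform draw over the upper half,
// so the result is always in [ceiling / 2, ceiling).
function equalJitter(attempt: number, baseMs: number, capMs: number, rng: Rng = Math.random): number {
  const temp = backoffCeiling(attempt, baseMs, capMs);
  return temp / 2 + rng() * (temp / 2);
}

// With attempt 3 the ceiling is 4000 ms. A draw of 0 shows each variant's floor.
const fullFloorMs = fullJitter(3, 500, 30_000, () => 0);   // 0
const equalFloorMs = equalJitter(3, 500, 30_000, () => 0); // 2000
```

The floor is the practical difference: full jitter can return a delay close to zero, while equal jitter never goes below half the ceiling.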

For reconnect storms, full jitter is almost always the right default: for a given ceiling it spreads clients over the widest window, which minimizes the peak concurrent reconnect count, the metric that actually melts your accept queue. In the simulations published with the AWS analysis, the jittered variants needed far fewer total calls than un-jittered backoff, and full jitter did slightly less work than equal jitter.
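You can see the effect on peak concurrency with a toy simulation. This is a rough sketch under simplifying assumptions, not a reproduction of the AWS simulation: it models a single retry round, puts all 10,000 clients on the same attempt, counts arrivals in 100 ms buckets, and uses a small seeded generator so the run is repeatable.

```typescript
// Small linear congruential generator so the run is repeatable.
// (Constants from Numerical Recipes; Math.random would work equally well.)
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // in [0, 1)
  };
}

type Strategy = (ceilingMs: number, rng: () => number) => number;

const noJitter: Strategy = (c) => c;
const equalJitter: Strategy = (c, rng) => c / 2 + rng() * (c / 2);
const fullJitter: Strategy = (c, rng) => rng() * c;

// Largest number of clients whose retry lands in any single bucket.
function peakPerBucket(strategy: Strategy, clients: number, ceilingMs: number, bucketMs: number): number {
  const rng = makeRng(42);
  const counts: number[] = [];
  let peak = 0;
  for (let i = 0; i < clients; i++) {
    const bucket = Math.floor(strategy(ceilingMs, rng) / bucketMs);
    const n = (counts[bucket] || 0) + 1;
    counts[bucket] = n;
    if (n > peak) peak = n;
  }
  return peak;
}

// Attempt 3: ceiling = min(30000, 500 * 2^3) = 4000 ms.
const ceilingMs = Math.min(30_000, 500 * 2 ** 3);
const peaks = {
  none: peakPerBucket(noJitter, 10_000, ceilingMs, 100),  // all 10,000 in one bucket
  equal: peakPerBucket(equalJitter, 10_000, ceilingMs, 100),
  full: peakPerBucket(fullJitter, 10_000, ceilingMs, 100),
};
```

Without jitter every client lands in the same bucket. Equal jitter spreads them over the upper half of the window (20 buckets here) and full jitter over the whole window (40 buckets), so its tallest bucket is roughly half the height.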

This builds directly on a clean teardown: jitter only helps if every client genuinely starts from a closed socket. If you skip the graceful disconnect handling step, orphaned sockets and duplicate listeners will fire overlapping reconnect timers and defeat the backoff math entirely.

Resolution

The implementation below computes a capped exponential ceiling, applies full jitter, and enforces a maximum-retry ceiling so a permanently dead endpoint stops retrying instead of looping forever. Every reconnect is scheduled from a single timer that is cleared on success.

const BASE_DELAY_MS = 500;      // first-attempt exponential base
const MAX_DELAY_MS = 30_000;    // cap so 2^n cannot run away
const MAX_RETRIES = 10;         // ceiling before we surface a hard failure

export class ReconnectingSocket {
  private ws: WebSocket | null = null;
  private attempt = 0;
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(private readonly url: string) {}

  connect(): void {
    this.ws = new WebSocket(this.url);
    this.ws.onopen = () => {
      this.attempt = 0;            // reset the curve once we are healthy again
    };
    this.ws.onclose = () => this.scheduleReconnect();
    this.ws.onerror = () => this.ws?.close(); // funnel errors into onclose
  }

  /** delay = random(0, min(cap, base * 2^attempt)) — full jitter. */
  private nextDelay(): number {
    const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** this.attempt);
    return Math.random() * ceiling; // uniform over [0, ceiling): decorrelates clients
  }

  private scheduleReconnect(): void {
    if (this.timer) return;        // guard against double-fire (close + error race)
    if (this.attempt >= MAX_RETRIES) {
      this.onGiveUp();             // hard stop: do not retry a dead endpoint forever
      return;
    }
    const delay = this.nextDelay();
    this.attempt += 1;             // advance the curve BEFORE scheduling
    this.timer = setTimeout(() => {
      this.timer = null;
      this.connect();
    }, delay);
  }

  /** Surface a terminal failure so the UI can prompt a manual retry. */
  private onGiveUp(): void {
    console.error(`WebSocket gave up after ${MAX_RETRIES} attempts`);
  }

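  close(): void {
    if (this.timer) clearTimeout(this.timer); // cancel any pending reconnect
    this.timer = null;
    if (this.ws) {
      this.ws.onclose = null; // detach first: an intentional close must not schedule a reconnect
      this.ws.onerror = null;
      this.ws.close(1000, 'client closing');
      this.ws = null;
    }
  }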
}

Two details carry most of the value. First, nextDelay draws from [0, ceiling) rather than returning ceiling directly — that single Math.random() is what breaks the phase lock across the fleet. Second, the if (this.timer) return guard prevents the common bug where onerror and onclose both fire for the same drop and schedule two overlapping reconnects, which silently doubles your effective retry rate.
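The guard is easy to verify without a real socket. The sketch below pulls the scheduling logic into a hypothetical `ReconnectScheduler` with an injected timer function; the class, the fake timer, and the hard-coded 500 ms / 30 s values are assumptions made for the test, not part of the implementation above.

```typescript
type TimerFn = (fn: () => void, ms: number) => void;

class ReconnectScheduler {
  attempt = 0;
  private pending = false;
  private readonly reconnect: () => void;
  private readonly setTimer: TimerFn;

  constructor(reconnect: () => void, setTimer: TimerFn) {
    this.reconnect = reconnect;
    this.setTimer = setTimer;
  }

  schedule(): void {
    if (this.pending) return; // a second signal for the same drop is ignored
    this.pending = true;
    const ceiling = Math.min(30_000, 500 * 2 ** this.attempt);
    this.attempt += 1;
    this.setTimer(() => {
      this.pending = false;
      this.reconnect();
    }, Math.random() * ceiling); // full jitter
  }
}

// Fake timer that records callbacks instead of waiting.
const queued: Array<() => void> = [];
const delays: number[] = [];
const fakeTimer: TimerFn = (fn, ms) => {
  queued.push(fn);
  delays.push(ms);
};

let reconnects = 0;
const scheduler = new ReconnectScheduler(() => { reconnects += 1; }, fakeTimer);

scheduler.schedule(); // first signal for a drop
scheduler.schedule(); // second signal for the same drop
const timersAfterDoubleFire = queued.length; // only one timer was scheduled

queued.forEach((fn) => fn()); // let the timer fire: one reconnect happens
scheduler.schedule();         // a later drop schedules again
```

Injecting the timer is a design choice for testability. In production code you would pass `setTimeout` through, or keep the guard inline as in the class above.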

If your reconnect needs to resume an in-flight session — replaying unacknowledged messages rather than starting cold — coordinate this backoff with your message delivery guarantees layer so the post-reconnect handshake re-sends only what the server has not yet acked.

Operational checklist

Confirm `BASE_DELAY_MS`, `MAX_DELAY_MS`, and `MAX_RETRIES` are named constants, not inline magic numbers, so they are tunable per environment.
Verify full jitter is applied (`random(0, ceiling)`), not just exponential growth: log a sample of computed delays and confirm they span the full `[0, ceiling)` range.
Add the double-fire guard and prove it: trigger `onerror` followed by `onclose` and assert only one timer is scheduled.
Reset `attempt` to 0 inside `onopen`, not `onclose`, so the delay grows across consecutive failures and clears only once a connection is actually established.
Load-test a simulated mass disconnect (kill the server with N clients attached) and watch `ws_connections_active` recover as a flat ramp, not a spike.
Define behavior after `MAX_RETRIES`: surface a UI banner or a manual-retry button rather than silently going dark.
Ensure `close()` clears the pending timer to avoid a reconnect firing after intentional teardown.
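For the jitter check in particular, a quick sampling script is usually enough. This sketch assumes the same constants as the implementation and uses arbitrary 10% and 90% thresholds; tune both to taste.

```typescript
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30_000;

function ceilingFor(attempt: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
}

// Full jitter, as in nextDelay(): uniform over [0, ceiling).
function fullJitterDelay(attempt: number): number {
  return Math.random() * ceilingFor(attempt);
}

interface DelaySample {
  ceiling: number;
  min: number;
  max: number;
}

// Draw many delays for one attempt number and record the extremes.
function sampleDelays(attempt: number, draws: number): DelaySample {
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < draws; i++) {
    const d = fullJitterDelay(attempt);
    if (d < min) min = d;
    if (d > max) max = d;
  }
  return { ceiling: ceilingFor(attempt), min, max };
}

// Attempt 4 has a ceiling of 8000 ms. With full jitter the sample should reach
// close to both ends of the range and never touch the ceiling itself.
const sample = sampleDelays(4, 1000);
const spansRange =
  sample.min < 0.1 * sample.ceiling &&
  sample.max > 0.9 * sample.ceiling &&
  sample.max < sample.ceiling;
```

A purely exponential implementation fails this check immediately because `min` and `max` both equal the ceiling, and an equal-jitter implementation fails it because `min` never drops below half the ceiling.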

FAQ

Why not just use a longer fixed delay instead of jitter?

A longer fixed delay lowers the average request rate but keeps every client phase-locked, so they still arrive in synchronized bursts — just less often. The peak concurrent reconnect count, which is what saturates accept queues and file descriptors, stays high. Jitter is what flattens the peak; backoff alone only stretches the spacing between identical peaks.

Full jitter or equal jitter: which should I default to?

Default to full jitter for reconnect storms. It produces the lowest peak concurrency, which is the metric that protects your accept queue. Choose equal jitter only when you specifically want to guarantee a minimum wait (for example, to avoid a near-instant retry hammering a rate limiter), accepting a slightly taller arrival peak in exchange.

Where do I get the exponential ceiling formula from?

The capped ceiling is min(cap, base * 2^attempt). The full-jitter delay is then a uniform random value in [0, ceiling). This is the formulation popularized by AWS’s “Exponential Backoff And Jitter” analysis, whose simulations showed full jitter doing the least total work of the variants it compared.

Should I reset the attempt counter on close or on open?

On open. Resetting inside onopen means the counter only clears once a connection is genuinely established. Resetting in onclose zeroes the counter before every retry, so the delay never grows and the backoff is effectively disabled. Never resetting leaves a long-lived client pinned at the cap after its first few drops and eventually exhausts MAX_RETRIES. Both break the intended curve.
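A small trace makes the difference visible. The sketch below replays a sequence of socket events and records the backoff ceiling used for each retry under either reset policy; the `SocketEvent` type and the `ceilings` helper are hypothetical names for this illustration.

```typescript
type SocketEvent = "open" | "close";

// Ceiling used for each retry, given when the attempt counter is reset.
function ceilings(
  events: SocketEvent[],
  resetOn: SocketEvent,
  baseMs: number = 500,
  capMs: number = 30_000,
): number[] {
  let attempt = 0;
  const out: number[] = [];
  for (const e of events) {
    if (e === resetOn) attempt = 0; // the policy under test
    if (e === "close") {
      out.push(Math.min(capMs, baseMs * 2 ** attempt));
      attempt += 1;
    }
  }
  return out;
}

// Server is down: four failed attempts in a row, no open in between.
const outage: SocketEvent[] = ["close", "close", "close", "close"];

const resetOnOpen = ceilings(outage, "open");   // [500, 1000, 2000, 4000]
const resetOnClose = ceilings(outage, "close"); // [500, 500, 500, 500]

// Once a connection is established, the curve starts over for the next drop.
const afterRecovery = ceilings(["close", "close", "open", "close"], "open"); // [500, 1000, 500]
```

Resetting on close pins every retry to the base ceiling, which is a fixed-window retry with extra steps.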

Does this belong on the client or the server?

The reconnect curve lives on the client, because the client is the one initiating connections. The server’s job is to survive the herd: bounded accept backlogs, connection rate limiting at the load balancer, and fast process startup. Client-side jitter and server-side admission control are complementary — you need both to ride out a fleet-wide restart.
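On the server side, one common way to rate-limit new connections is a token bucket. The sketch below is a generic illustration, not a description of any particular load balancer: the class name, the capacity, and the refill rate are all assumptions, and the clock is passed in explicitly so the behavior is easy to test.

```typescript
// Admits up to `capacity` connections in a burst, then `refillPerSec` per second.
class ConnectionBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Returns true if a new connection may be accepted at time `nowMs`.
  tryAccept(nowMs: number): boolean {
    const elapsedMs = Math.max(0, nowMs - this.lastRefillMs);
    this.tokens = Math.min(this.capacity, this.tokens + (elapsedMs / 1000) * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false; // shed load; the client backs off and retries
    this.tokens -= 1;
    return true;
  }
}

// Burst of 2, refilling 1 connection per second.
const bucket = new ConnectionBucket(2, 1, 0);
const decisions: boolean[] = [
  bucket.tryAccept(0),    // true
  bucket.tryAccept(0),    // true
  bucket.tryAccept(0),    // false: burst exhausted
  bucket.tryAccept(1000), // true: one token refilled after a second
];
```

Rejected clients fall back into their jittered backoff, so the two mechanisms reinforce each other: the bucket bounds the accept rate while jitter keeps the rejected retries from re-synchronizing.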

Handling WebSocket Disconnects Gracefully — clean teardown so each reconnect starts from a fully closed socket.
WebSocket Auto-Reconnection Strategies — the broader reconnection design this backoff slots into.
Message Delivery Guarantees — replaying unacknowledged messages after a reconnect completes.
Backend WebSocket Connection Management — server-side lifecycle, heartbeats, and admission control that absorb the herd.
