Configuring AWS ALB for WebSocket sticky sessions #

You deployed a fleet of WebSocket nodes behind an AWS Application Load Balancer, and now you are chasing intermittent 400 Bad Request errors on reconnect and sudden 502 Bad Gateway drops on active sessions. Presence flickers, optimistic updates roll back, and in-memory session state vanishes the moment a client reconnects. The pattern is almost always the same: the ALB routed the initial upgrade to Target A, the socket dropped, and the reconnect landed on Target B — which has never heard of this client. This page fixes that with ALB target-group stickiness plus connection draining, and shows the backend cookie injection that makes WebSocket upgrades stick. It belongs to the broader topic of Load Balancer Sticky Sessions for WebSocket.

Root cause #

An ALB defaults to round-robin routing for HTTP/HTTPS traffic. It terminates TLS, inspects Layer 7 headers, and hands each new request to the next healthy target in rotation, with no regard for which target served that client before. A WebSocket connection is a single long-lived TCP connection after the 101 Switching Protocols handshake, so while a socket is open its routing is perfectly stable: the ALB never re-routes mid-stream. The instability appears on reconnect. Every reconnect is a brand-new HTTP request, and without target-group stickiness the ALB picks a target with no memory of the prior session.
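To make the drift concrete, here is a toy simulation. It is not the ALB's real algorithm, and `pickRoundRobin`, `pickSticky`, and the `pins` map are hypothetical names used only for illustration. It shows a reconnect landing on a different target under plain rotation, and returning to the original target once a cookie value pins it.

```typescript
// Toy model of reconnect routing. This is NOT the ALB's real algorithm; it only
// illustrates why a reconnect can land on a different target without stickiness.
type Target = "A" | "B";

const targets: Target[] = ["A", "B"];
let cursor = 0; // rotation position shared by all incoming requests

// Without stickiness: every new HTTP request (including a reconnect) takes
// the next target in rotation.
function pickRoundRobin(): Target {
  const target = targets[cursor % targets.length];
  cursor += 1;
  return target;
}

// With stickiness: a request carrying a known cookie value goes back to its
// pinned target; otherwise fall back to rotation and record the pin.
const pins = new Map<string, Target>(); // hypothetical cookie value -> target

function pickSticky(cookie: string): Target {
  const pinned = pins.get(cookie);
  if (pinned !== undefined) return pinned;
  const target = pickRoundRobin();
  pins.set(cookie, target);
  return target;
}

const first = pickRoundRobin(); // initial upgrade
const reconnect = pickRoundRobin(); // reconnect with no cookie
console.log(first, reconnect); // the two differ, so node-local state is lost

const stickyFirst = pickSticky("client-1"); // first request records the pin
pickRoundRobin(); // unrelated traffic advances the rotation
const stickyReconnect = pickSticky("client-1");
console.log(stickyFirst, stickyReconnect); // same target both times
```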

That breaks any node-local state — in-process subscription maps, per-connection auth context, and presence tracking. ALB target-group stickiness solves this by setting a routing cookie. With app_cookie stickiness, the backend emits a cookie in the upgrade response, and the ALB pins every later request bearing that cookie to the same target. The catch specific to WebSockets: the ws library writes its own raw 101 response and never goes through Express’s res.setHeader, so a naive Set-Cookie call is silently dropped. You have to inject the header into the raw upgrade response yourself.
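The splice itself can be isolated as a pure function, which makes it easy to unit test apart from the server. The sketch below is one way to write it; `injectSetCookie` is a hypothetical helper name, and it accepts both strings and Buffers on the assumption that the handshake may reach `socket.write` in either form depending on the `ws` version.

```typescript
// Splice a Set-Cookie header into a raw HTTP 101 response head.
// Accepts string or Buffer because a socket.write caller may pass either.
function injectSetCookie(chunk: string | Buffer, cookie: string): string | Buffer {
  // latin1 maps bytes 1:1 to characters, so a Buffer round-trips unchanged.
  const text = typeof chunk === "string" ? chunk : chunk.toString("latin1");
  // Only touch a 101 status line; leave WebSocket frames and other writes alone.
  if (!text.startsWith("HTTP/1.1 101")) return chunk;
  const end = text.indexOf("\r\n\r\n");
  if (end === -1) return chunk; // header block is not complete in this chunk
  const patched = text.slice(0, end) + `\r\nSet-Cookie: ${cookie}` + text.slice(end);
  return typeof chunk === "string" ? patched : Buffer.from(patched, "latin1");
}

const handshake =
  "HTTP/1.1 101 Switching Protocols\r\nUpgrade: websocket\r\nConnection: Upgrade\r\n\r\n";
const patchedHandshake = injectSetCookie(
  handshake,
  "WS_SESSION_ID=abc; Path=/; SameSite=None; Secure; HttpOnly",
);
console.log(JSON.stringify(patchedHandshake)); // shows the CRLFs explicitly
```

The `ws` server also emits a `headers` event before it writes the handshake, with the response headers as an array of strings, so pushing a `Set-Cookie:` line there may avoid patching `socket.write` at all. Check that against the `ws` version you run before relying on it.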

The second failure, 502 during deploys, is a draining problem. When a target deregisters during a rolling deployment, the ALB stops sending it new connections but, by default, waits only 300s before force-closing existing ones. If deregistration_delay is shorter than your real WebSocket session lifetime, in-flight sockets are killed mid-stream and clients see a 502. Distinguishing routing drift from application bugs is the first step in backend WebSocket connection management; for multi-node fan-out you will eventually pair stickiness with horizontal scaling on Kubernetes and a shared session store.
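The drain window only helps if the application uses it. One common pattern, sketched here under assumptions, is to close sockets with code 1001 (Going Away) when the process is told to stop, so clients reconnect to a healthy target instead of being cut off when the delay expires. `Closable` and `closeAll` are hypothetical names; the `ws` package's WebSocket exposes a compatible `close(code, reason)` method.

```typescript
// Minimal shape of a socket for this sketch. The `ws` package's WebSocket has a
// compatible close(code, reason) method, but nothing here depends on `ws`.
interface Closable {
  close(code?: number, reason?: string): void;
}

// Close every open socket with 1001 (Going Away) and report how many were closed.
// Iterates over a snapshot so sockets that remove themselves while closing
// do not disturb the loop.
function closeAll(sockets: Iterable<Closable>, reason = "server draining"): number {
  let closed = 0;
  for (const socket of [...sockets]) {
    socket.close(1001, reason);
    closed += 1;
  }
  return closed;
}

// Demo with fake sockets that record the close call they receive.
const calls: Array<[number | undefined, string | undefined]> = [];
const fakeSocket: Closable = { close: (code, reason) => { calls.push([code, reason]); } };
console.log(closeAll(new Set([fakeSocket]))); // 1
```

In a real server this would typically run from a `SIGTERM` handler, for example `process.on("SIGTERM", () => closeAll(wss.clients))`, after the target has started deregistering.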

Figure: ALB reconnect routing without and with stickiness. Without a stickiness cookie a reconnect lands on a different target and loses state; with the app cookie (backend Set-Cookie on the 101) the ALB pins the client back to its original target.

Resolution #

Configure app_cookie stickiness and a draining window on the target group, then make the backend emit the routing cookie inside the raw 101 response. The annotated blocks below are the full working set.

# Terraform: target group with app-cookie stickiness + draining
resource "aws_lb_target_group" "ws_sticky" {
  name     = "ws-sticky-tg"
  port     = 8080
  protocol = "HTTP" # ALB terminates TLS; targets speak HTTP
  vpc_id   = var.vpc_id

  stickiness {
    type            = "app_cookie"    # backend-controlled cookie, not lb_cookie
    cookie_name     = "WS_SESSION_ID" # must match the name the backend sets
    cookie_duration = 86400           # 24h pin; cap to your session TTL
    enabled         = true
  }

  # Drain longer than your max WebSocket session so deploys don't 502.
  deregistration_delay = "360" # seconds; exceeds WS idle timeout below

  health_check {
    path                = "/ws-health" # plain HTTP endpoint, NOT the upgrade path
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

# Idle timeout must exceed your heartbeat interval or the ALB kills idle sockets.
resource "aws_lb" "ws" {
  name               = "ws-alb"
  load_balancer_type = "application"
  idle_timeout       = 300 # > heartbeat ping interval (e.g. 30s)
  # ...subnets, security_groups omitted
}

// Node.js (TypeScript): emit the routing cookie inside the raw 101 response
import express from "express";
import { createServer } from "http";
import { WebSocketServer } from "ws";
import { randomUUID } from "crypto";

const STICKY_COOKIE = "WS_SESSION_ID"; // must match Terraform cookie_name
const app = express();
const server = createServer(app);
const wss = new WebSocketServer({ noServer: true });

app.get("/ws-health", (_req, res) => res.sendStatus(200)); // ALB health check target

server.on("upgrade", (request, socket, head) => {
  if (!socket.writable || socket.destroyed) { // guard against half-open sockets
    socket.destroy();
    return;
  }

  const sessionId = randomUUID(); // one routing id per new session

  // The `ws` library writes its own raw 101 and bypasses res.setHeader, so we
  // patch socket.write ONCE to splice Set-Cookie into the handshake response.
  const rawWrite = socket.write.bind(socket);
  let injected = false;
  socket.write = function (chunk: any, enc?: any, cb?: any) {
    // The handshake may arrive as a string or a Buffer, so accept both.
    // latin1 maps bytes 1:1 to characters, so nothing is altered in transit.
    const text =
      typeof chunk === "string" ? chunk : Buffer.isBuffer(chunk) ? chunk.toString("latin1") : "";
    if (!injected && text.startsWith("HTTP/1.1 101")) {
      injected = true;
      const cookie =
        `${STICKY_COOKIE}=${sessionId}; Path=/; SameSite=None; Secure; HttpOnly`;
      // Insert the header just before the terminating CRLFCRLF of the 101 response.
      const patched = text.replace("\r\n\r\n", `\r\nSet-Cookie: ${cookie}\r\n\r\n`);
      const done = typeof enc === "function" ? enc : cb; // callback may be the 2nd arg
      return rawWrite(patched, "latin1", done);
    }
    return rawWrite(chunk, enc, cb);
  };

  wss.handleUpgrade(request, socket, head, (ws) => {
    wss.emit("connection", ws, request); // hand off the now-cookied socket
  });
});

server.listen(8080, () => console.log("WS server on :8080"));

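One nuance to internalize: stickiness only applies once the client holds the cookies. The very first connection carries none, so the default algorithm routes it. When the target answers with WS_SESSION_ID, the ALB adds its own AWSALBAPP cookie that records the target, and later requests that send those cookies back are pinned. Stickiness therefore only kicks in on the next reconnect, and only for clients that store and resend cookies. For affinity that survives a target failure entirely, back the per-connection state with Redis so any node can rehydrate a client. That hybrid is the durable pattern once you scale past a single target group.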
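The upgrade handler above mints a fresh id on every upgrade, including reconnects. If the id should stay stable for a returning client (an assumption about your session model, not something the ALB requires), the existing value can be read from the request's `Cookie` header first. `readCookie` and `resolveSessionId` are hypothetical helpers in this sketch.

```typescript
import { randomUUID } from "node:crypto";

// Extract one cookie value from a raw Cookie request header.
// Returns undefined when the header or the named cookie is absent.
function readCookie(header: string | undefined, name: string): string | undefined {
  if (!header) return undefined;
  for (const part of header.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip malformed pairs with no "="
    if (part.slice(0, eq).trim() === name) {
      return part.slice(eq + 1).trim();
    }
  }
  return undefined;
}

// Reuse the id a reconnecting client presents; mint one only when it is
// missing or empty.
function resolveSessionId(cookieHeader: string | undefined, mint: () => string): string {
  return readCookie(cookieHeader, "WS_SESSION_ID") || mint();
}

console.log(resolveSessionId("theme=dark; WS_SESSION_ID=abc123", randomUUID)); // abc123
console.log(resolveSessionId(undefined, randomUUID)); // a freshly minted UUID
```

Inside the upgrade handler the header would come from `request.headers.cookie`.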

Verify the header lands before you ship:

# Confirm the 101 carries Set-Cookie. curl won't finish the handshake, but the
# response headers arrive; a 101 with Set-Cookie means stickiness will engage.
# --http1.1 is needed because the Upgrade handshake is HTTP/1.1 only, and
# --max-time stops curl from hanging on the connection the 101 leaves open.
curl -v --include --http1.1 --max-time 5 \
  -H 'Upgrade: websocket' -H 'Connection: Upgrade' \
  -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
  -H 'Sec-WebSocket-Version: 13' \
  https://api.example.com/ws 2>&1 | grep -i 'set-cookie'

Operational checklist #

  • deregistration_delay (360s) is greater than your longest WebSocket session and idle timeout, so rolling deploys drain instead of 502
  • ALB idle_timeout (300s) is greater than your heartbeat ping interval (e.g. 30s), so idle sockets are not killed
  • Health check targets a plain HTTP /ws-health route, never the upgrade path (a 101 is not a 200)
  • curl against the upgrade endpoint returns 101 with the WS_SESSION_ID Set-Cookie
  • Cookie carries SameSite=None; Secure
  • CloudWatch alarms fire on HTTPCode_Target_5XX_Count > 0 and TargetResponseTime > 5s
  • Athena query over ALB access logs shows the same target_ip for consecutive reconnects from one client_ip
  • Rollback plan: setting stickiness.enabled = false returns the target group to default routing
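The timing items in this checklist can be checked mechanically before a deploy. The sketch below is one hedged way to do that; `WsTimings` and `checkTimings` are hypothetical names, the values must be supplied by you (nothing here reads them from AWS), and the 3600-second ceiling reflects the maximum deregistration delay a target group accepts.

```typescript
interface WsTimings {
  heartbeatSeconds: number; // application-level ping interval
  albIdleTimeoutSeconds: number; // aws_lb idle_timeout
  deregistrationDelaySeconds: number; // target group deregistration_delay
  maxSessionSeconds: number; // longest session you expect to drain
}

// Return a list of human-readable problems; an empty list means the timings
// are consistent with the checklist.
function checkTimings(t: WsTimings): string[] {
  const problems: string[] = [];
  if (t.heartbeatSeconds >= t.albIdleTimeoutSeconds) {
    problems.push("heartbeat must be shorter than the ALB idle_timeout");
  }
  if (t.deregistrationDelaySeconds < t.maxSessionSeconds) {
    problems.push("deregistration_delay is shorter than the longest session");
  }
  if (t.deregistrationDelaySeconds > 3600) {
    problems.push("deregistration_delay exceeds the 3600s maximum");
  }
  return problems;
}

// The values used on this page, with a 360s longest session assumed.
const pageTimings: WsTimings = {
  heartbeatSeconds: 30,
  albIdleTimeoutSeconds: 300,
  deregistrationDelaySeconds: 360,
  maxSessionSeconds: 360,
};
console.log(checkTimings(pageTimings)); // []
```

If your sessions run longer than an hour, the second check cannot pass on its own, which is the point at which draining has to be paired with an application-level close and a shared session store.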

FAQ #

Why does Set-Cookie never appear on my WebSocket upgrade response? #

Because the ws library writes its own raw 101 Switching Protocols response directly to the socket and never touches Express’s response object. Calling res.setHeader or res.cookie has no effect on a hijacked upgrade. You must splice Set-Cookie into the raw handshake bytes, as the patched socket.write above does.

Should I use app_cookie or lb_cookie stickiness? #

Use app_cookie when your backend can emit a deterministic routing cookie tied to a real session id, because the cookie then follows an expiry policy you control. Use lb_cookie only if you cannot modify the backend; the ALB generates an opaque AWSALB cookie with a duration you set on the target group, but you lose the ability to align the cookie lifetime with your session TTL.

My health checks keep marking targets unhealthy. Why? #

You are almost certainly pointing the health check at the upgrade path. A WebSocket endpoint replies 101, not 200, so the ALB matcher fails. Expose a separate plain HTTP route like /ws-health that returns 200 and point the health check there.

Do I still get 502s during deploys after enabling stickiness? #

Stickiness does not stop deploy 502s — draining does. If deregistration_delay is shorter than an active session, the ALB force-closes the socket when the target deregisters. Set the drain window above your maximum session length so connections close on their own terms.

Does stickiness alone keep state consistent across nodes? #

No. Stickiness keeps a reconnecting client on the same target, but the first connection and any target failure still route elsewhere. For true consistency, pair it with a shared store (Redis) so any node can rehydrate a client’s state, which is the standard approach when scaling WebSockets horizontally.
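A minimal sketch of that pairing follows. It assumes a key-value store with string values and a TTL; `SessionStore`, `MemoryStore`, `persist`, `rehydrate`, and the `ws:session:` key prefix are all hypothetical, and the in-memory class is only a stand-in so the example runs without Redis.

```typescript
interface SessionState {
  userId: string;
  subscriptions: string[];
}

// Hypothetical minimal store contract; a Redis client would sit behind it.
interface SessionStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in so the sketch runs without Redis. It ignores the TTL.
class MemoryStore implements SessionStore {
  private data = new Map<string, string>();
  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }
  async set(key: string, value: string, _ttlSeconds: number): Promise<void> {
    this.data.set(key, value);
  }
}

const sessionKey = (id: string): string => `ws:session:${id}`;

// Parse and validate stored state; anything malformed is treated as absent.
function decodeState(raw: string | null): SessionState | undefined {
  if (raw === null) return undefined;
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed?.userId === "string" && Array.isArray(parsed?.subscriptions)) {
      return { userId: parsed.userId, subscriptions: parsed.subscriptions.map(String) };
    }
  } catch {
    // invalid JSON falls through to undefined
  }
  return undefined;
}

async function persist(store: SessionStore, id: string, state: SessionState, ttlSeconds = 86400): Promise<void> {
  await store.set(sessionKey(id), JSON.stringify(state), ttlSeconds);
}

async function rehydrate(store: SessionStore, id: string): Promise<SessionState | undefined> {
  return decodeState(await store.get(sessionKey(id)));
}

// Any node holding the same store can rebuild the client's state by session id.
const store = new MemoryStore();
persist(store, "abc", { userId: "u1", subscriptions: ["room:1"] })
  .then(() => rehydrate(store, "abc"))
  .then((state) => console.log(state));
```

The session id here would be the WS_SESSION_ID cookie value, so a client that lands on a new target after a failure can still be matched to its stored state.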
