Configuring AWS ALB for WebSocket sticky sessions
Identifying WebSocket State Desync Behind AWS ALB
Engineers frequently observe intermittent 400 Bad Request errors during reconnection attempts and sudden 502 Bad Gateway drops in active sessions. Real-time state divergence occurs when clients reconnect across distributed nodes. This typically manifests when the ALB routes the initial HTTP upgrade to Target A, but a later reconnect (or any plain HTTP request that precedes the upgrade, such as a long-polling fallback) lands on Target B because session affinity is missing. Frames on an already established WebSocket connection stay on the target that returned the 101 response, so the drift appears between connections rather than within one.
ALB access logs (delivered to S3 by default) and CloudWatch metrics will reveal failed upgrade requests, dropped connections, and TargetConnectionErrorCount spikes. Understanding Load Balancer Sticky Sessions is critical to isolating routing anomalies from application-level bugs.
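Diagnostic Steps:
- Query ALB logs for routing failures (this assumes the access logs are forwarded to a CloudWatch Logs group; the ALB itself only delivers them to S3, and the log group name is a placeholder):
  aws logs tail /aws/application-load-balancer/your-lb --filter-pattern '?TargetConnectionError ?WebSocketUpgradeFailed'
- Inspect live packet flow for protocol switching. Run this on a target, where traffic from the ALB arrives as plain HTTP on port 8080; traffic on port 443 is encrypted and cannot be matched with grep:
  tcpdump -i eth0 -A -l -n port 8080 | grep -E 'HTTP/1.1 101|Sec-WebSocket-Accept'
- Verify routing drift by comparing the client and target fields (client:port and target:port) across consecutive connections from the same client using ALB access logs.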
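The access-log comparison in the last step can be scripted. The following is a minimal sketch, assuming the documented ALB access log layout in which the space-separated fields begin with type, time, elb, client:port, and target:port. The function name `findRoutingDrift` and the sample lines are hypothetical, and the sample lines are truncated after the fifth field.

```javascript
// Sketch: list clients whose WebSocket handshakes landed on more than one target.
// Assumes the ALB access log field order: type, time, elb, client:port, target:port.
function findRoutingDrift(logLines) {
  const targetsByClient = new Map();
  for (const line of logLines) {
    const fields = line.trim().split(' ');
    if (fields.length < 5) continue;
    const [type, , , client, target] = fields;
    // Only WebSocket entries are relevant ("ws" or "wss").
    if (type !== 'ws' && type !== 'wss') continue;
    // "-" means the load balancer never reached a target for this request.
    if (target === '-') continue;
    const clientIp = client.slice(0, client.lastIndexOf(':'));
    if (!targetsByClient.has(clientIp)) targetsByClient.set(clientIp, new Set());
    targetsByClient.get(clientIp).add(target);
  }
  // Keep only clients that were served by two or more distinct targets.
  const drift = {};
  for (const [clientIp, targets] of targetsByClient) {
    if (targets.size > 1) drift[clientIp] = [...targets].sort();
  }
  return drift;
}

// Hypothetical sample lines, truncated after the fifth field for brevity.
const sampleLines = [
  'wss 2024-01-01T00:00:00.000000Z app/my-lb/abc 203.0.113.10:50001 10.0.1.5:8080',
  'wss 2024-01-01T00:05:00.000000Z app/my-lb/abc 203.0.113.10:50002 10.0.2.7:8080',
  'wss 2024-01-01T00:06:00.000000Z app/my-lb/abc 198.51.100.4:40000 10.0.1.5:8080',
];
console.log(findRoutingDrift(sampleLines));
```

Grouping by client IP is a simplification: several users behind one NAT gateway share an address, so a reported drift is a lead to investigate rather than proof of a routing fault.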
ALB Default Routing vs WebSocket Statefulness
AWS ALB defaults to round-robin routing for HTTP/HTTPS traffic. Unlike Network Load Balancers, the ALB terminates TLS and inspects Layer 7 headers. Once a target returns 101, the ALB keeps that WebSocket connection on the same target, but without explicit stickiness configuration nothing binds a new handshake to the target that served the previous one. When a client reconnects or sends an initial handshake, the ALB lacks a routing cookie and distributes the connection across healthy targets according to its routing algorithm.
This behavior breaks in-memory session state, Pub/Sub routing, and presence tracking. Proper Backend WebSocket Connection Management requires deterministic routing to maintain state consistency across the cluster.
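To make the failure mode concrete, here is a toy model of round-robin selection across two targets that each keep session state in memory. It is only an illustration under simplified assumptions, not the ALB's actual algorithm; the target names and the `createRoundRobin` helper are hypothetical.

```javascript
// Toy model of round-robin selection; not the ALB's actual implementation.
function createRoundRobin(targets) {
  let next = 0;
  return () => targets[next++ % targets.length];
}

// Each hypothetical target keeps its own in-memory session map.
const sessions = { 'target-a': new Map(), 'target-b': new Map() };
const pick = createRoundRobin(['target-a', 'target-b']);

// First handshake: the chosen target stores the client's state locally.
const first = pick();
sessions[first].set('client-1', { room: 'lobby' });

// Reconnect without a routing cookie: the next target in rotation is chosen.
const second = pick();
const stateFound = sessions[second].has('client-1');

console.log(first, second, stateFound); // target-a target-b false
```

The reconnect reaches a target that has never seen the client, which is the state loss that cookie-based stickiness is meant to avoid.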
Root Cause Indicators:
- Missing Set-Cookie header in the 101 Switching Protocols response.
- ALB Target Group stickiness.enabled defaults to false.
- Cross-target frame routing observed via ActiveConnectionCount spikes per target.
- Backend state stores (Redis/Memory) receive requests from mismatched connection IDs.
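The first indicator can be checked mechanically against a captured handshake response. The sketch below assumes you already have the raw response text (for example from a packet capture); the function name `handshakeSetsCookie` and the sample responses are hypothetical.

```javascript
// Sketch: check whether a raw handshake response is a 101 that sets a given cookie.
function handshakeSetsCookie(rawResponse, cookieName) {
  // Headers end at the first blank line; anything after it is frame data.
  const headerBlock = rawResponse.split('\r\n\r\n')[0];
  const [statusLine, ...headerLines] = headerBlock.split('\r\n');
  if (!/^HTTP\/1\.1 101\b/.test(statusLine)) return false;
  return headerLines.some((line) => {
    const separator = line.indexOf(':');
    if (separator === -1) return false;
    // Header names are case-insensitive; cookie names are not.
    const name = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    return name === 'set-cookie' && value.startsWith(`${cookieName}=`);
  });
}

// Hypothetical captured responses.
const withCookie =
  'HTTP/1.1 101 Switching Protocols\r\nUpgrade: websocket\r\n' +
  'Connection: Upgrade\r\nSet-Cookie: WS_SESSION_ID=abc; Path=/\r\n\r\n';
const withoutCookie =
  'HTTP/1.1 101 Switching Protocols\r\nUpgrade: websocket\r\n' +
  'Connection: Upgrade\r\n\r\n';

console.log(handshakeSetsCookie(withCookie, 'WS_SESSION_ID'));    // true
console.log(handshakeSetsCookie(withoutCookie, 'WS_SESSION_ID')); // false
```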
Exact ALB Configuration & Backend Cookie Implementation
Implement application-level cookie stickiness. The backend must explicitly set the routing cookie during the HTTP upgrade response. The ALB target group must be configured to recognize and route based on this exact cookie name.
Terraform Target Group Configuration:
resource "aws_lb_target_group" "ws_sticky" {
  name     = "ws-sticky-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  stickiness {
    type            = "app_cookie"
    cookie_name     = "WS_SESSION_ID"
    cookie_duration = 86400
    enabled         = true
  }

  health_check {
    path                = "/ws-health"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}
Node.js Backend Upgrade Handler:
const express = require('express');
const { createServer } = require('http');
const { WebSocketServer } = require('ws');
const crypto = require('crypto');

const app = express();
// Health check endpoint expected by the target group configuration
app.get('/ws-health', (req, res) => res.sendStatus(200));

const server = createServer(app);
const wss = new WebSocketServer({ noServer: true });

// ws emits 'headers' just before it writes the 101 response, so the routing
// cookie can be appended here without patching socket.write.
wss.on('headers', (headers, request) => {
  const sessionId = crypto.randomUUID();
  headers.push(`Set-Cookie: WS_SESSION_ID=${sessionId}; Path=/; SameSite=None; Secure; HttpOnly`);
});

server.on('upgrade', (request, socket, head) => {
  // Error boundary: validate socket state before handing it to ws
  if (!socket.writable || socket.destroyed) {
    console.error('Socket destroyed before 101 response: aborting upgrade');
    socket.destroy();
    return;
  }
  wss.handleUpgrade(request, socket, head, (ws) => {
    // The 'connection' event belongs to the server, not the individual socket
    wss.emit('connection', ws, request);
  });
});

server.listen(8080, () => console.log('WS server listening on port 8080'));
Error Boundary Note: The cookie must be added in the headers listener, which ws invokes before it writes the 101 status line. Overriding socket.write after calling handleUpgrade is too late, because the handshake response has already been written, and appending to that chunk would place the header after the blank line that terminates the header block. If the socket closes prematurely, destroy it immediately to prevent memory leaks and zombie connections.
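The handler above mints a fresh ID on every upgrade. One possible refinement, sketched here as an assumption rather than a requirement of the ALB, is to reuse the WS_SESSION_ID value a reconnecting client already sends. The helper name `resolveSessionId` is hypothetical.

```javascript
const { randomUUID } = require('node:crypto');

// Sketch: reuse the client's existing routing cookie when one is present,
// otherwise mint a new ID. WS_SESSION_ID matches the target group setting.
function resolveSessionId(cookieHeader, cookieName = 'WS_SESSION_ID') {
  for (const pair of (cookieHeader || '').split(';')) {
    const separator = pair.indexOf('=');
    if (separator === -1) continue;
    const name = pair.slice(0, separator).trim();
    const value = pair.slice(separator + 1).trim();
    if (name === cookieName && value !== '') return value;
  }
  return randomUUID();
}

console.log(resolveSessionId('theme=dark; WS_SESSION_ID=abc-123')); // abc-123
```

In the upgrade path this would be called as resolveSessionId(request.headers.cookie). The value comes from the client, so it should be treated as a routing hint only and never as proof of identity.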
Validation, Draining, & CI/CD Guardrails
Prevent state drift by enforcing connection draining timeouts that exceed WebSocket idle timeouts. Validate cookie propagation in staging environments and monitor ALB metrics continuously. Implement automated tests that verify the Set-Cookie header is present on the 101 response and that subsequent requests route to the same target.
Production Guardrails:
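- Set deregistration_delay.timeout_seconds = 360 in the ALB target group to gracefully drain active WebSocket connections during deployments.
- Deploy CloudWatch Alarms on TargetResponseTime > 5s and HTTPCode_Target_5XX_Count > 0.
- Run integration test. The handshake needs the Sec-WebSocket-Key and Sec-WebSocket-Version headers to be accepted, -i prints response headers to stdout so grep can see them, and --max-time ends the request after the 101:
  curl -s -i --http1.1 --max-time 5 -H 'Connection: Upgrade' -H 'Upgrade: websocket' -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' https://your-domain.com/ws | grep -i 'Set-Cookie: WS_SESSION_ID'
- Enforce SameSite=None; Secure cookie attributes to prevent cross-origin routing failures in modern browsers.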
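The draining rule from this section can be expressed as a small CI check. This is a sketch under stated assumptions: the function name `checkDrainWindow` and its parameter names are hypothetical, the idle timeout value is whatever your application enforces, and the 0 to 3600 second range reflects the documented limits for the deregistration delay attribute.

```javascript
// Sketch of a CI guardrail: the drain window should outlast idle WebSocket connections.
function checkDrainWindow({ deregistrationDelaySeconds, idleTimeoutSeconds }) {
  const problems = [];
  if (!Number.isInteger(deregistrationDelaySeconds) ||
      deregistrationDelaySeconds < 0 || deregistrationDelaySeconds > 3600) {
    problems.push('deregistration delay must be an integer between 0 and 3600 seconds');
  }
  if (!(deregistrationDelaySeconds > idleTimeoutSeconds)) {
    problems.push(
      `deregistration delay (${deregistrationDelaySeconds}s) must exceed ` +
      `the WebSocket idle timeout (${idleTimeoutSeconds}s)`
    );
  }
  return problems; // an empty array means the configuration passes
}

console.log(checkDrainWindow({ deregistrationDelaySeconds: 360, idleTimeoutSeconds: 300 })); // []
console.log(checkDrainWindow({ deregistrationDelaySeconds: 360, idleTimeoutSeconds: 600 }).length); // 1
```

A pipeline step could fail the deployment whenever the returned array is non-empty.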
Automated Routing Validation Script:
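#!/bin/bash
# Verify that repeated handshakes carrying the routing cookie are accepted.
# This checks status codes only; confirming the serving target requires the
# backend to expose its identity or a comparison against ALB access logs.
for i in {1..5}; do
  # --max-time bounds the request because a successful 101 keeps the socket open
  RESPONSE=$(curl -s -o /dev/null -w '%{http_code}' --http1.1 --max-time 5 \
    -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
    -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
    -H 'Cookie: WS_SESSION_ID=test-123' https://api.example.com/ws)
  if [[ "$RESPONSE" != "101" ]]; then
    echo "FAIL: Unexpected status $RESPONSE on iteration $i"
    exit 1
  fi
done
echo "PASS: All 5 handshakes returned 101"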