Horizontal Scaling of WebSockets on Kubernetes #

A rolling deploy ships at 14:00. Kubernetes terminates the old ReplicaSet pod-by-pod, and every long-lived WebSocket pinned to those pods dies the instant the container receives SIGTERM. Thirty thousand browsers simultaneously fire onclose, trigger reconnect storms, and re-handshake against the new pods — a self-inflicted thundering herd on every deploy. Meanwhile your CPU-based HorizontalPodAutoscaler sees idle sockets, reads 8% CPU, and scales the Deployment down into the storm.

This is the core mismatch: Kubernetes primitives assume stateless, short-lived requests. WebSockets are stateful and long-lived. Running them on Kubernetes works well, but only once you stop autoscaling on CPU, drain connections on pod shutdown instead of killing them, and bound voluntary disruptions with a budget. This guide walks through the Deployment manifest, the autoscaling signal, and the graceful-shutdown handler that make horizontal scaling safe.

Prerequisites #

Before scaling out, the connection layer below the pod must already be correct:

  • A working upgrade path through the Ingress. The ingress controller (NGINX, Contour, or an ALB Ingress) must forward Upgrade/Connection headers and use a long enough idle timeout. See configuring Nginx for WebSocket upgrades.
  • Externalized state. Any shared state must live in Redis or a database so a connection can land on any pod. If you still depend on in-memory affinity, read Load Balancer Sticky Sessions first — affinity and aggressive autoscaling fight each other during scale-down.
  • Server-side lifecycle handling. Heartbeats, idle eviction, and clean close semantics belong to the app, covered under Backend WebSocket Connection Management.
  • A metrics endpoint the pod already exposes (/metrics) reporting active connection count, since that number drives autoscaling.
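As a rough illustration of the last prerequisite, the sketch below renders an active-connection gauge in the Prometheus text exposition format. The metric name `ws_active_connections` and the helper `renderMetrics` are hypothetical; use whatever name your scraper and autoscaler query expect.

```typescript
// Hypothetical metric name (an assumption); align it with your Prometheus query.
const METRIC_NAME = "ws_active_connections";

/**
 * Render the active-connection gauge in the Prometheus text exposition format.
 * The caller supplies the current count, for example the size of the
 * WebSocket server's client set.
 */
function renderMetrics(activeConnections: number): string {
  return [
    `# HELP ${METRIC_NAME} Number of currently open WebSocket connections.`,
    `# TYPE ${METRIC_NAME} gauge`,
    `${METRIC_NAME} ${activeConnections}`,
    "", // the exposition format expects a trailing newline
  ].join("\n");
}
```

A `/metrics` handler would return this string with `Content-Type: text/plain; version=0.0.4`, passing in the live connection count on every scrape.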

Why CPU-Based HPA Is the Wrong Signal #

A WebSocket that has completed its handshake costs almost no CPU while idle — it is a file descriptor and a small buffer waiting on the kernel. A single pod can hold tens of thousands of mostly-silent connections at 5–10% CPU. CPU therefore tells you nothing about how full a pod is. Worse, it flaps: a burst of broadcast traffic spikes CPU, the HPA adds pods, the burst ends, CPU collapses, and the HPA scales back down — terminating pods that still hold thousands of live connections.

The correct signal is active connection count per pod. You scale up when pods approach their connection ceiling (file-descriptor limits, memory per connection) and you never scale down purely because traffic went quiet. The diagram below shows the data path and the autoscaling control loop driven by that metric.
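To make the signal concrete, here is a minimal sketch of the replica arithmetic an average-value target implies: total connections divided by the per-pod target, rounded up and clamped. The function name and the min/max bounds are illustrative assumptions, and the real HPA additionally applies a tolerance band and stabilization windows that this sketch ignores.

```typescript
/**
 * Approximate the replica count implied by a per-pod connection target.
 * Simplified: the real autoscaler also applies tolerance and stabilization.
 */
function desiredReplicas(
  totalConnections: number,
  targetPerPod: number,
  minReplicas: number,
  maxReplicas: number,
): number {
  if (targetPerPod <= 0) throw new RangeError("targetPerPod must be positive");
  const raw = Math.ceil(totalConnections / targetPerPod);
  // Clamp to the configured bounds so quiet periods never drop below the floor.
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

// 30,000 connections at a target of 8,000 per pod needs 4 pods.
console.log(desiredReplicas(30_000, 8_000, 3, 50)); // 4
```

Note that CPU never appears in the calculation: a pod holding 7,900 silent sockets counts as nearly full even though it is almost idle.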

Figure: Connection-count autoscaling on Kubernetes. Clients connect through an Ingress to a Service that load-balances across WebSocket pods; an exporter feeds active connection counts to KEDA, which scales the Deployment. Scale on active connections, never on idle CPU.

Core implementation #

Two pieces work together: the Deployment manifest that tells Kubernetes how to start and stop the pod, and the application code that drains connections when SIGTERM arrives.

The Deployment manifest #

The critical knobs are terminationGracePeriodSeconds (how long Kubernetes waits before SIGKILL), a preStop hook that delays SIGTERM while the pod's removal from the load-balancer rotation propagates, and a readiness probe that returns unready the moment draining begins so the Service stops sending new connections.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ws-server
spec:
  replicas: 3
  selector:
    matchLabels: { app: ws-server }
  template:
    metadata:
      labels: { app: ws-server }
    spec:
      # Must exceed (preStop sleep + drain timeout) so the app finishes draining
      # before Kubernetes sends SIGKILL. 120s here = 5s sleep + 110s drain budget
      # + ~5s margin; the preStop sleep counts against the grace period.
      terminationGracePeriodSeconds: 120
      containers:
        - name: ws-server
          image: registry.example.com/ws-server:1.8.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # Sleep so endpoint removal propagates to the Ingress/Service
                # BEFORE the app starts refusing new connections. Avoids a race
                # where kube-proxy still routes new clients to a draining pod.
                command: ["/bin/sh", "-c", "sleep 5"]
          readinessProbe:
            # Returns 503 once SIGTERM flips the draining flag, pulling the pod
            # out of Service endpoints so no NEW connections arrive.
            httpGet: { path: /readyz, port: 8080 }
            periodSeconds: 2
            failureThreshold: 1
          livenessProbe:
            # Stays healthy during drain: do NOT let liveness kill a draining pod.
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 10
            failureThreshold: 3
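The timing constraint in the manifest comments is simple arithmetic, and it may be worth checking in CI or at startup. The sketch below is one way to do that; the `ShutdownBudget` shape and function name are hypothetical. It relies on the fact that the grace-period clock starts when termination begins, so the preStop sleep consumes part of it.

```typescript
interface ShutdownBudget {
  gracePeriodSeconds: number;  // terminationGracePeriodSeconds
  preStopSleepSeconds: number; // duration of the preStop sleep
  drainTimeoutMs: number;      // app-side drain budget
}

/** Seconds left for final cleanup after the preStop sleep and a full drain. */
function remainingMarginSeconds(b: ShutdownBudget): number {
  return b.gracePeriodSeconds - b.preStopSleepSeconds - b.drainTimeoutMs / 1000;
}

const margin = remainingMarginSeconds({
  gracePeriodSeconds: 120,
  preStopSleepSeconds: 5,
  drainTimeoutMs: 110_000,
});
console.log(margin); // 5
```

A zero or negative margin means SIGKILL can arrive before the drain finishes, so either raise the grace period or shorten the drain timeout.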

The SIGTERM graceful-drain handler #

When the container gets SIGTERM, the app must (1) mark itself unready, (2) stop accepting new upgrades, (3) ask existing clients to reconnect with a close frame, and (4) wait for them to drain before exiting.

import { WebSocketServer } from "ws";
import http from "node:http";

const DRAIN_TIMEOUT_MS = 110_000; // must fit inside terminationGracePeriodSeconds
const GOING_AWAY = 1001; // RFC 6455: endpoint is "going away"

let draining = false;
const server = http.createServer((req, res) => {
  // Readiness probe: report unready the instant draining begins so the
  // Service removes this pod from endpoints and routes new clients elsewhere.
  if (req.url === "/readyz") {
    res.writeHead(draining ? 503 : 200).end();
    return;
  }
  if (req.url === "/healthz") {
    res.writeHead(200).end(); // liveness stays green throughout the drain
    return;
  }
  res.writeHead(404).end();
});

const wss = new WebSocketServer({ server });

// Upgrades that slip in after draining starts are told to go away immediately,
// so late arrivals reconnect elsewhere instead of lingering until the deadline.
wss.on("connection", (ws) => {
  if (draining) ws.close(GOING_AWAY, "server draining");
});

function gracefulDrain(): void {
  if (draining) return;
  draining = true; // /readyz now returns 503, so the Service sends no new connections

  // Tell every live client we are going away with a clean close frame so the
  // client's reconnect logic (backoff + jitter) kicks in against a healthy pod.
  for (const ws of wss.clients) {
    ws.close(GOING_AWAY, "server draining");
  }

  // Poll until all sockets have closed, then exit cleanly.
  const deadline = Date.now() + DRAIN_TIMEOUT_MS;
  const tick = setInterval(() => {
    if (wss.clients.size === 0 || Date.now() > deadline) {
      clearInterval(tick);
      // Force-close any stragglers that ignored the close frame, then exit.
      for (const ws of wss.clients) ws.terminate();
      server.close(() => process.exit(0));
    }
  }, 1_000);
}

// SIGTERM is what Kubernetes sends on pod termination AFTER the preStop hook.
process.on("SIGTERM", gracefulDrain);

server.listen(8080);

The sequence on the wire is: preStop sleep 5 runs first (endpoint removal propagates), then Kubernetes sends SIGTERM, gracefulDrain flips readiness and closes sockets, clients reconnect against pods that are still in rotation, and the process exits well before the 120s grace period elapses.
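The drain only avoids a reconnect storm if clients spread their reconnects out. As a sketch of the client side, assuming exponential backoff with "full jitter" (a common choice, not necessarily what your client library does), the delay before each attempt could be computed like this. The function name, base, and cap are illustrative.

```typescript
/**
 * Delay before reconnect attempt `attempt` (0-based): a uniformly random
 * value in [0, min(capMs, baseMs * 2^attempt)).
 * `random` is injectable so the function can be tested deterministically.
 */
function reconnectDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// With a fixed "random" of 0.5, the first attempt waits half of the 500 ms base.
console.log(reconnectDelayMs(0, 500, 30_000, () => 0.5)); // 250
```

In a browser client this would typically run inside the onclose handler, with the attempt counter reset once a connection stays open.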

Configuration reference #

| Parameter | Type | Default | Production value | Notes |
| --- | --- | --- | --- | --- |
| terminationGracePeriodSeconds | int (s) | 30 | 120 | Must exceed preStop sleep + drain timeout, or SIGKILL cuts the drain short. |
| preStop sleep | int (s) | none | 5 | Lets endpoint removal reach the Ingress before the app refuses traffic. |
| DRAIN_TIMEOUT_MS | int (ms) | n/a | 110000 | App-side drain budget; keep preStop sleep + drain timeout a few seconds below the grace period for cleanup. |
| KEDA value (conns/pod) | int | n/a | 8000 | Target active connections per pod; set below the fd/memory ceiling. |
| HPA stabilizationWindowSeconds (down) | int (s) | 300 | 600+ | Long window stops scale-down flapping on quiet traffic. |
| PDB minAvailable | int / % | n/a | 80% | Caps voluntary disruptions so a node drain can't evict every pod at once. |
| Ingress idle timeout | int (s) | varies | ≥ heartbeat × 2 | If it falls below the heartbeat interval, the proxy silently kills live sockets. |
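For orientation, the autoscaling and disruption rows above might translate into manifests along these lines. This is a sketch under assumptions, not a drop-in configuration: the Prometheus address, the metric name `ws_active_connections`, its `app` label, and the replica bounds are all hypothetical. The floor of 5 replicas is chosen because percentage budgets round up, so 80% of 5 still leaves room for one eviction.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ws-server
spec:
  scaleTargetRef:
    name: ws-server            # the Deployment to scale
  minReplicaCount: 5           # assumption: floor that keeps the PDB satisfiable
  maxReplicaCount: 50          # assumption
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumption
        query: sum(ws_active_connections{app="ws-server"})     # hypothetical metric
        threshold: "8000"      # target active connections per pod
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ws-server
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: ws-server
```

With KEDA managing the replica count, the `replicas` field in the Deployment only sets the initial size.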

Edge cases & gotchas #

  • The SIGTERM/endpoint-removal race. SIGTERM and endpoint removal are dispatched concurrently. Without the preStop sleep, the app can start refusing connections while kube-proxy on some nodes still routes new clients to it — they get a connection reset. The sleep buys time for endpoint propagation; it is not optional.
  • Ingress idle timeout silently kills sockets. Many controllers default to a 60s proxy read timeout. A WebSocket with a 30s heartbeat survives; one with a 90s heartbeat gets disconnected mid-connection with no app-level error. Set the timeout to at least twice your heartbeat interval, mirroring the upgrade-header rules in configuring Nginx for WebSocket upgrades.
  • Sticky routing during scale-down. If the Ingress uses session affinity, scaling down can repeatedly hash reconnecting clients back to terminating pods. Either disable affinity for WebSocket Services that externalize state, or make affinity respect readiness — see Load Balancer Sticky Sessions.
  • Liveness probes that kill draining pods. If your liveness probe shares the readiness handler and you return 503 while draining, the kubelet can treat the container as dead and restart it mid-drain. Keep /healthz and /readyz separate, as shown above.
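For the idle-timeout gotcha, an ingress-nginx setup could raise the proxy timeouts with annotations roughly like the fragment below. It assumes a 30s heartbeat (so anything at or above 60s satisfies the twice-the-heartbeat rule), a hypothetical hostname, and a Service named ws-server listening on port 8080; other controllers use different settings.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ws-server
  annotations:
    # Seconds of silence tolerated in each direction before the proxy closes
    # the connection. 120 assumes a 30s heartbeat with generous headroom.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
spec:
  ingressClassName: nginx
  rules:
    - host: ws.example.com        # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ws-server   # assumption: Service name and port
                port:
                  number: 8080
```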

Verification #

Confirm a rolling deploy drains instead of mass-disconnecting:

# Watch the rollout; pods should leave gradually, not all at once.
kubectl rollout status deployment/ws-server --watch

# Count established sockets on a terminating pod; rerun to watch the number
# shrink during drain. Requires ss in the image; -H drops the header line.
kubectl exec ws-server-old-xxxx -- ss -Htn state established 'sport = :8080' | wc -l

# Confirm the readiness flip pulled the pod from Service endpoints.
kubectl get endpointslices -l kubernetes.io/service-name=ws-server -o wide

# Verify the PodDisruptionBudget is allowing the right number of disruptions.
kubectl get pdb ws-server -o jsonpath='{.status.disruptionsAllowed}'

During the drain window, ss on the terminating pod should show the connection count falling toward zero while the new pods’ counts rise — clients are reconnecting, not erroring. If ss stays flat at SIGTERM, your handler never received the signal (check that the container’s PID 1 forwards signals).

Guides in this area #

Autoscaling WebSockets on Kubernetes with KEDA walks the full ScaledObject definition that turns active-connection-count into a scaling decision, including the Prometheus trigger and stabilization windows that stop the HPA from flapping.

FAQ #

Why not just autoscale WebSocket pods on CPU? #

An idle WebSocket consumes negligible CPU, so CPU never reflects how many connections a pod holds. You can be at 95% of your file-descriptor ceiling while sitting at 8% CPU. Scale on active connection count per pod instead, and gate scale-down with a long stabilization window so a quiet period doesn’t terminate full pods.

How do I stop a rolling deploy from disconnecting every client? #

Combine a preStop sleep (so the pod leaves the load balancer before it refuses traffic), a SIGTERM handler that sends RFC 6455 close code 1001 and waits for sockets to drain, and a terminationGracePeriodSeconds larger than your drain timeout. Clients then reconnect against healthy pods using normal backoff instead of all hitting onclose simultaneously.

What does the preStop sleep actually accomplish? #

When a pod is deleted, Kubernetes starts container shutdown and removes the pod from Service endpoints in parallel. Because SIGTERM is only sent after the preStop hook finishes, the sleep delays the app's own shutdown long enough for endpoint removal to propagate through kube-proxy and the Ingress, closing the window where new clients could still be routed to a pod that is about to terminate.

Does this work behind an AWS ALB or NGINX Ingress? #

Yes, with two caveats: the controller must forward Upgrade/Connection headers, and its idle timeout must be at least twice your heartbeat interval or it will silently drop live sockets. ALB target deregistration delay plays the same role as the preStop sleep, so tune both together.

What does a PodDisruptionBudget protect against here? #

A PDB caps voluntary disruptions such as kubectl drain and cluster autoscaler scale-downs. Setting minAvailable: 80% means a single node drain can't evict every WebSocket pod simultaneously, so connections always have healthy pods to reconnect to. Percentages round up to whole pods, so at very small replica counts (for example 3) an 80% budget allows no evictions at all and blocks node drains. A PDB does not protect against involuntary disruptions like node crashes.
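The effect of a percentage budget is easy to misjudge at small replica counts, because the required number of pods is rounded up. The sketch below reproduces that arithmetic; the function name is hypothetical and it models only the minAvailable-percentage case.

```typescript
/**
 * Evictions a PDB with a percentage minAvailable would currently allow.
 * The required pod count is rounded UP to a whole pod.
 */
function disruptionsAllowed(
  healthyPods: number,
  desiredReplicas: number,
  minAvailablePercent: number,
): number {
  // Multiply before dividing to keep the intermediate value exact for integers.
  const required = Math.ceil((desiredReplicas * minAvailablePercent) / 100);
  return Math.max(0, healthyPods - required);
}

console.log(disruptionsAllowed(3, 3, 80));   // 0: 2.4 rounds up to 3 required
console.log(disruptionsAllowed(10, 10, 80)); // 2
```

If the result is 0 while every pod is healthy, a node drain will wait indefinitely, so pair a percentage budget with a replica floor high enough to leave at least one eviction available.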
