Autoscaling WebSockets on Kubernetes with KEDA

Your WebSocket pods are pinned at 60,000 sockets each, but the HorizontalPodAutoscaler refuses to add replicas because average CPU sits at 8%. Idle-but-connected sockets consume file descriptors, heap, and the per-connection memory of your framework, none of which shows up on a CPU graph. By the time CPU finally spikes, the fleet is already past its connection ceiling, and new clients get ECONNREFUSED or land on a pod that is OOMKilled mid-handshake. The fix is to stop scaling on CPU and start scaling on the metric that actually tracks load: active connection count.

Root cause

The default HorizontalPodAutoscaler scales on resource metrics — CPU and memory utilization averaged across the pods in a Deployment. That model assumes work is proportional to CPU. For request/response HTTP it usually is. For long-lived WebSockets it is not.

A WebSocket connection that has completed its upgrade and is sitting idle does almost no CPU work. It costs:

One file descriptor (and a slot against the pod’s nofile ulimit).
A kernel socket buffer pair (send/receive), typically tens of KB each.
Per-connection heap in your runtime — buffers, the ws socket object, any subscription/presence state you attach.

A Node.js pod can hold tens of thousands of these before memory pressure appears, and CPU stays near idle the entire time because no frames are flowing. So the HPA's CPU signal is flat right up until the moment a pod runs out of descriptors or heap, at which point accept() starts failing (typically with EMFILE once the descriptor limit is hit) and the process fails abruptly instead of degrading gracefully. Memory-based HPA is only marginally better: heap fragmentation and GC make the utilization number noisy and lagging, and you still cannot directly answer the question "are we near the connection ceiling?"

The metric that is proportional to load is the number of active connections per pod. If each pod is sized for, say, 8,000 connections, you want Kubernetes to add a replica whenever the fleet-wide average crosses a target like 6,000. CPU and memory cannot express that. You need an external metric, and that is exactly what KEDA provides: it acts as a metrics adapter that feeds an external metric (here, queried from Prometheus) into the same HPA machinery, so you can scale on ws_connections_active instead of CPU.

This is also why connection-count scaling pairs with load balancer sticky sessions: the autoscaler changes the number of pods, but the load balancer still has to pin each long-lived socket to one pod, and respect that pinning across scale events.

Resolution

First, the application must export the metric. KEDA scales nothing without a real number to read, so expose a gauge that increments on connection and decrements on close. With prom-client in a Node.js service running ws:

import { WebSocketServer } from 'ws';
import { Gauge, register } from 'prom-client';
import http from 'node:http';

// A gauge, not a counter: it must go DOWN when a socket closes.
const activeConnections = new Gauge({
  name: 'ws_connections_active',
  help: 'Currently open WebSocket connections on this pod',
});

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket) => {
  activeConnections.inc();              // one more open socket
  socket.on('close', () => {
    activeConnections.dec();            // decrement so KEDA sees drains
  });
});

// Prometheus scrape endpoint on a separate port from the WS traffic.
http.createServer(async (_req, res) => {
  res.setHeader('Content-Type', register.contentType);
  res.end(await register.metrics());   // exposes ws_connections_active
}).listen(9090);

Then point KEDA at the Prometheus query. The ScaledObject below targets an average of 6,000 connections per pod. With the trigger's default AverageValue metric type, the HPA that KEDA manages computes desired replicas as ceil(sum(ws_connections_active) / threshold), clamped to minReplicaCount and maxReplicaCount, so a fleet holding 30,000 sockets settles on 5 pods.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ws-gateway-scaler
  namespace: realtime
spec:
  scaleTargetRef:
    name: ws-gateway          # the Deployment to scale
  minReplicaCount: 3          # never below 3 for HA + spread
  maxReplicaCount: 40         # cap so a metric bug can't fan out infinitely
  cooldownPeriod: 300         # only applies when scaling to zero; no effect with minReplicaCount: 3
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Long window: do NOT yank pods that still hold live sockets.
          stabilizationWindowSeconds: 600
          policies:
            - type: Pods
              value: 1        # remove at most 1 pod per period
              periodSeconds: 120
        scaleUp:
          stabilizationWindowSeconds: 0   # add capacity immediately
          policies:
            - type: Pods
              value: 4
              periodSeconds: 60
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: ws_connections_active   # deprecated in recent KEDA releases; can likely be omitted
        # Average target per replica; KEDA divides the sum by this.
        threshold: "6000"
        query: sum(ws_connections_active{app="ws-gateway"})
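
As a rough check on the replica arithmetic, the calculation can be sketched in a few lines of JavaScript. This is an illustration, not KEDA's or the HPA's actual code: desiredReplicas is a hypothetical helper, it ignores the HPA's tolerance band, stabilization windows, and rate-limit policies, and its defaults simply mirror the values in the ScaledObject.

```javascript
// Hypothetical helper: approximates the replica count for an AverageValue target.
// Ignores the HPA tolerance band, stabilization windows, and scale policies.
function desiredReplicas(totalConnections, { threshold = 6000, min = 3, max = 40 } = {}) {
  const raw = Math.ceil(totalConnections / threshold); // sum divided by per-pod target
  return Math.min(max, Math.max(min, raw));            // clamp to min/max replicas
}

console.log(desiredReplicas(30000));   // 5
console.log(desiredReplicas(9000));    // 3  (raw result 2, raised to the minimum)
console.log(desiredReplicas(500000));  // 40 (raw result 84, capped at the maximum)
```

The last call shows why maxReplicaCount matters: a gauge that reports a wildly wrong number still cannot push the Deployment past the cap.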

The dangerous direction is scale-down. When the HPA decides to remove a pod, Kubernetes sends SIGTERM while the pod still holds live sockets, and without a drain they all break at once. A long scaleDown.stabilizationWindowSeconds (here 600s) stops the autoscaler from thrashing pods in and out on brief dips, but it does not drain them. You also need a preStop hook and a terminationGracePeriodSeconds long enough for clients to migrate. During shutdown (triggered from the preStop hook or the application's SIGTERM handler), stop accepting new connections, send a WebSocket close frame to every open socket, and wait until the gauge hits zero or the grace period expires. Pair that drain with load balancer sticky sessions so reconnecting clients land on a surviving pod rather than the one being torn down. This connection-count approach is the Kubernetes-specific piece of scaling real-time infrastructure; the broader horizontal-scaling concerns live in Horizontal Scaling on Kubernetes.
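
The application side of that drain could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not a drop-in implementation: drainConnections is a hypothetical helper, the server argument is assumed to behave like a ws 8 WebSocketServer (close() stops accepting new connections without dropping existing ones, and clients is a Set of open sockets), and the 25-second default is an arbitrary value that must stay below your terminationGracePeriodSeconds.

```javascript
// Hypothetical helper: politely close every socket, wait, then force the rest.
// Assumes `server` looks like a ws 8 WebSocketServer: close() stops new
// connections, and `clients` is a Set whose members have close() and terminate().
async function drainConnections(server, { graceMs = 25000, pollMs = 250 } = {}) {
  server.close();                                 // stop accepting new sockets
  for (const socket of server.clients) {
    socket.close(1001, 'server shutting down');   // 1001 = "going away"
  }
  const deadline = Date.now() + graceMs;
  while (server.clients.size > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  let forced = 0;
  for (const socket of server.clients) {          // whatever ignored the close frame
    socket.terminate();
    forced += 1;
  }
  return { forced };
}

// Possible wiring, assuming `wss` is the WebSocketServer from the exporter example:
// process.on('SIGTERM', () => drainConnections(wss).then(() => process.exit(0)));
```

Sending close code 1001 gives well-behaved clients a chance to reconnect on their own schedule, and the forced count is worth logging: a consistently high number suggests the grace period is too short.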

Operational checklist

Confirm ws_connections_active is a gauge that decrements on close; a counter that only goes up will scale you to maxReplicaCount and stay there.
Verify Prometheus actually scrapes the metrics port (/metrics returns the gauge, ServiceMonitor or scrape config matches the pod labels).
Set threshold below your real per-pod ceiling with headroom (target 6k when a pod survives 10k) so reconnect storms have room.
Configure terminationGracePeriodSeconds ≥ your drain time and add a preStop hook that close-frames sockets and waits for the gauge to drain.
Keep scaleDown.stabilizationWindowSeconds long (≥ 300s) and scaleUp near zero so you add capacity fast and shed it slowly.
Set minReplicaCount ≥ 2–3 for availability and maxReplicaCount as a circuit breaker against a runaway metric.
Load-test a scale-down event and confirm clients reconnect to other pods without a thundering herd against the new replicas.
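
To see why the headroom and minimum-replica items matter together, it can help to work out what a single pod loss does to the survivors. The sketch below is a back-of-the-envelope illustration only: loadAfterPodLoss is a hypothetical helper, it assumes displaced clients reconnect evenly across the remaining pods (a real load balancer may skew this), and the 10,000 ceiling is the example figure from the checklist.

```javascript
// Hypothetical helper: per-pod load after `lost` pods disappear, assuming
// their clients reconnect evenly across the survivors.
function loadAfterPodLoss(replicas, perPodConnections, lost = 1) {
  const survivors = replicas - lost;
  if (survivors <= 0) return Infinity;            // nobody left to absorb the load
  return Math.ceil((replicas * perPodConnections) / survivors);
}

const ceiling = 10000; // assumed per-pod ceiling from the checklist example
for (const replicas of [2, 3, 5]) {
  const load = loadAfterPodLoss(replicas, 6000);
  console.log(replicas, load, load <= ceiling ? 'fits' : 'over ceiling');
}
// 2 12000 over ceiling
// 3 9000 fits
// 5 7500 fits
```

Under these assumptions, two pods at the 6,000 target cannot survive losing one, while three can, which is one argument for keeping minReplicaCount at 3 rather than 2.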

FAQ

Why not just use CPU-based HPA for WebSocket pods?

Because idle-but-connected sockets burn file descriptors and heap while doing almost no CPU work. CPU utilization stays flat until a pod hits its connection ceiling and starts failing accept(), which is far too late to add capacity. Active-connection count is the only signal that rises in step with actual load.

Does KEDA replace the HorizontalPodAutoscaler?

No. KEDA creates and manages an HPA under the hood and feeds it an external metric. You still get standard HPA behavior — including the scaleDown/scaleUp behavior block — but the scale signal comes from your Prometheus query instead of CPU or memory.

What happens to open connections when KEDA scales down a pod?

Kubernetes sends SIGTERM and those sockets close. Use a long scaleDown.stabilizationWindowSeconds to avoid thrashing, a preStop hook that close-frames clients and drains, and a terminationGracePeriodSeconds long enough for the drain. Sticky sessions then route reconnects to surviving pods.

Can I scale on a metric other than raw connection count?

Yes. Any Prometheus expression works — messages/sec per pod, send-buffer backlog, or a derived ws_connections_active / pod_connection_capacity ratio. Connection count is the cleanest starting point because it maps directly to the descriptor and memory limits that actually break a pod.

What changes for Socket.IO vs raw ws?

The pattern is identical: increment a gauge on connection, decrement on disconnect. Socket.IO adds polling fallbacks and a heartbeat layer, so make sure you count only fully-upgraded WebSocket transports (or accept that long-poll sessions count too) and that your drain logic disconnects clients through Socket.IO's own API (for example socket.disconnect(true)) rather than only sending a raw close frame.
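
One way to count only upgraded transports is sketched below. Treat it as an assumption-laden illustration rather than the canonical approach: trackSocketIoConnections is a hypothetical helper, io is assumed to be a Socket.IO server instance, and gauge is any object with inc() and dec(), such as the prom-client Gauge from the exporter example. It relies on socket.conn.transport.name and the Engine.IO "upgrade" event to detect WebSocket transports.

```javascript
// Hypothetical helper: count a Socket.IO client only once it is on a
// WebSocket transport, and decrement only if it was actually counted.
function trackSocketIoConnections(io, gauge) {
  io.on('connection', (socket) => {
    let counted = false;
    const countOnce = () => {
      if (!counted) {
        counted = true;
        gauge.inc();
      }
    };
    if (socket.conn.transport.name === 'websocket') {
      countOnce();                              // connected directly over WebSocket
    } else {
      socket.conn.once('upgrade', countOnce);   // started on polling, upgraded later
    }
    socket.on('disconnect', () => {
      if (counted) {
        counted = false;
        gauge.dec();                            // long-poll-only sessions never touch the gauge
      }
    });
  });
}

// Possible wiring, assuming `io` is a socket.io Server and `activeConnections`
// is the prom-client Gauge from the exporter example:
// trackSocketIoConnections(io, activeConnections);
```

The counted flag keeps the gauge from going negative when a client disconnects before it ever upgrades.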

Horizontal Scaling on Kubernetes — the parent area covering pod topology, fd limits, and rollout strategy for WebSocket fleets.
Scaling Real-Time Infrastructure — fan-out, presence, and delivery guarantees that sit alongside autoscaling.
Load Balancer Sticky Sessions — pinning long-lived sockets to a pod, which scale events must respect.
WebSocket Observability & Monitoring — exporting the metrics KEDA reads, including ws_connections_active.
