Implementing WebSocket Ping/Pong in Node.js #

You shipped a ws-based Node.js server, and after a few hours in production something is wrong. Clients receive stale state despite an apparently live connection. Server-side ws.on('close') events stop firing for sockets that the client clearly abandoned. Resident memory climbs 15–20% over 24 hours with no traffic spike, file-descriptor counts drift upward, and the reverse proxy logs 504 Gateway Timeout on long-idle connections. If you landed here searching for why your WebSocket server leaks dead connections, the answer is almost always a missing application-level heartbeat. This guide is part of Connection Lifecycle & Heartbeats and shows the exact ping/pong pattern that kills zombie sockets deterministically.

Confirm the diagnosis before changing code:

lsof -i :8080 | wc -l returns a count far larger than your real active-client number.
A Node.js heap snapshot shows detached WebSocket objects retained in the ws library’s internal client set.
Reverse-proxy access logs show 504 on connections idle longer than the proxy’s read timeout.

Root cause #

A TCP connection that dies abruptly (laptop lid closed, NAT mapping expired, Wi-Fi dropped, proxy reaped the idle socket) does not send a FIN or RST. The peer simply stops responding. The local kernel has no way to know the connection is gone until it tries to send data and the retransmission timer eventually gives up. TCP keepalive is opt-in per socket (SO_KEEPALIVE), and even when it is enabled on Linux the first probe is governed by tcp_keepalive_time, which defaults to 7200 seconds (2 hours). That is far longer than typical reverse proxy idle timeouts (60–300s), so the proxy silently terminates the upstream socket while your Node process still believes it is open.
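To put a number on that gap, the sketch below computes how long a dead, idle peer can go unnoticed when kernel keepalive is the only liveness check. It assumes keepalive is enabled on the socket and uses the stock Linux values for two companion settings not mentioned above, tcp_keepalive_intvl (75 s) and tcp_keepalive_probes (9). The type and function names are illustrative.

```typescript
// Worst-case time for kernel TCP keepalive to declare an idle peer dead.
interface KeepaliveSettings {
  timeSec: number;  // tcp_keepalive_time: idle seconds before the first probe
  intvlSec: number; // tcp_keepalive_intvl: seconds between unanswered probes
  probes: number;   // tcp_keepalive_probes: unanswered probes before giving up
}

// Assumed stock Linux defaults; verify with `sysctl net.ipv4.tcp_keepalive_time` etc.
const LINUX_DEFAULTS: KeepaliveSettings = { timeSec: 7200, intvlSec: 75, probes: 9 };

function keepaliveDetectionSec(s: KeepaliveSettings): number {
  // Idle wait, then every probe must go unanswered before the kernel gives up.
  return s.timeSec + s.intvlSec * s.probes;
}

const detectionSec = keepaliveDetectionSec(LINUX_DEFAULTS);
console.log(`dead idle peer detected after ~${detectionSec}s`); // 7200 + 75 * 9 = 7875
```

With these values detection takes 7875 seconds, a little over two hours, against proxy idle timeouts measured in tens or hundreds of seconds.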

The WebSocket protocol (RFC 6455 §5.5.2–5.5.3) defines Ping and Pong control frames precisely to bridge this gap with an application-level liveness check. Critically, the ws library does not send pings for you. Unless you drive a heartbeat yourself, the only thing that can close a half-open socket is the kernel keepalive, two hours away at best, and for that whole time the socket sits in the server’s client set holding a file descriptor, a buffer, and any per-connection state you attached. Multiply by a steady trickle of dropped clients and you get the unbounded growth in the symptoms above. This is the server-side mirror of the client-side problem covered in handling WebSocket disconnects gracefully: both ends need an active liveness signal because TCP alone will not tell you in time.

Resolution #

The pattern below sends a ping frame every HEARTBEAT_INTERVAL_MS. Each interval first checks whether the previous ping was answered: if isAlive is still false, no pong came back and the socket is terminated. A secondary setTimeout, armed with each ping, shortens detection: if no pong arrives within PONG_TIMEOUT_MS, the socket is terminated without waiting for the next interval. The example uses TypeScript with the ws package and a typed augmentation for the per-socket bookkeeping fields:

import { WebSocketServer, WebSocket } from 'ws';

const HEARTBEAT_INTERVAL_MS = 30_000; // ping cadence
const PONG_TIMEOUT_MS = 10_000;       // grace before declaring a socket dead

// Augment ws.WebSocket with our per-connection liveness bookkeeping.
interface LiveSocket extends WebSocket {
  isAlive: boolean;
  heartbeat?: NodeJS.Timeout;
  pongTimer?: NodeJS.Timeout;
}

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (raw: WebSocket) => {
  const ws = raw as LiveSocket;
  ws.isAlive = true; // assume healthy at handshake

  // A pong proves the peer is reachable and still processing WebSocket frames.
  ws.on('pong', () => {
    ws.isAlive = true;
    if (ws.pongTimer) clearTimeout(ws.pongTimer); // cancel the death sentence
  });

  ws.heartbeat = setInterval(() => {
    // Last cycle's pong never arrived -> the socket is a zombie.
    if (!ws.isAlive) {
      ws.terminate(); // destroy TCP immediately; do NOT wait for a close handshake
      return;
    }

    ws.isAlive = false; // flips back to true only when a pong is received
    try {
      ws.ping(); // RFC 6455 control frame; ws auto-replies pong on the peer
      // Fast path: terminate if no pong arrives within PONG_TIMEOUT_MS.
      ws.pongTimer = setTimeout(() => {
        if (!ws.isAlive) ws.terminate();
      }, PONG_TIMEOUT_MS);
    } catch {
      ws.terminate(); // defensive: some ws versions throw synchronously if the socket is not open
    }
  }, HEARTBEAT_INTERVAL_MS);

  // Timers outlive the socket and keep the event loop pinned unless cleared.
  ws.on('close', () => {
    if (ws.heartbeat) clearInterval(ws.heartbeat);
    if (ws.pongTimer) clearTimeout(ws.pongTimer);
  });
});

Three details make this production-safe. ws.ping() is wrapped in try/catch because, depending on the ws version, pinging a socket that is no longer open can throw synchronously instead of emitting an error event. The close handler clears both timers: an orphaned setInterval keeps the Node event loop alive and is itself a memory leak. And unresponsive sockets are killed with ws.terminate(), not ws.close(): close() initiates a closing handshake that a zombie will never complete, so the dead socket keeps its file descriptor and state while the handshake waits.
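A common variant, similar to the heartbeat example in the ws README, replaces the per-socket timers with one shared sweep over every tracked socket. The sketch below shows only the sweep logic against a minimal interface so it can be exercised without a network. `Trackable` and `sweep` are illustrative names, and it assumes a pong handler elsewhere sets isAlive back to true.

```typescript
// Minimal shape the sweep needs; a ws socket with an added isAlive flag fits it.
interface Trackable {
  isAlive: boolean;
  ping(): void;
  terminate(): void;
}

// One pass: reap sockets that never answered the previous ping, ping the rest.
// Returns how many sockets were terminated in this pass.
function sweep(sockets: Iterable<Trackable>): number {
  let reaped = 0;
  for (const socket of sockets) {
    if (!socket.isAlive) {
      socket.terminate(); // no pong since the last pass
      reaped += 1;
      continue;
    }
    socket.isAlive = false; // the 'pong' handler must flip this back to true
    try {
      socket.ping();
    } catch {
      socket.terminate(); // defensive: treat a failed ping as a dead socket
      reaped += 1;
    }
  }
  return reaped;
}
```

In a real server this would run from a single setInterval over wss.clients, cleared when the server closes. The trade-off is latency: without a per-ping timeout, a socket that dies just after answering a ping is reaped up to two intervals later.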

One environment note: set your reverse proxy’s idle timeout safely above HEARTBEAT_INTERVAL_MS + PONG_TIMEOUT_MS so the proxy never reaps a connection between heartbeats. For nginx that means proxy_read_timeout and proxy_send_timeout larger than 40s for the values above. A common choice is to set them very high (e.g. 86400s) and rely on the application heartbeat as the sole liveness authority.

Operational checklist #

`proxy_read_timeout` / `proxy_send_timeout` (or ALB idle timeout) is strictly greater than `HEARTBEAT_INTERVAL_MS + PONG_TIMEOUT_MS`.
Both `clearInterval(heartbeat)` and `clearTimeout(pongTimer)` run in the `close` handler, verified with a heap snapshot showing no retained timers.
Zombie sockets are killed with `terminate()`, never `close()`.
An active-connection gauge (`ws_connections_active`) is plotted against heap size; divergence flags zombie accumulation.
A load test injects network partitions (`tc netem`, Toxiproxy, or `iptables -j DROP`) and confirms sockets are reaped within `HEARTBEAT_INTERVAL_MS + PONG_TIMEOUT_MS` of the partition.
`ping()` is wrapped in `try/catch` so a synchronous throw cannot crash the connection loop.
Rollback plan: heartbeat interval and pong timeout are config-driven so they can be tuned without a redeploy.
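The first and last checklist items can be enforced together at startup. The sketch below is one way to do that, not a prescribed layout: the environment variable names WS_HEARTBEAT_INTERVAL_MS and WS_PONG_TIMEOUT_MS are hypothetical, and the proxy idle timeout is assumed to be known to the process and passed in by the caller.

```typescript
interface HeartbeatConfig {
  intervalMs: number;
  pongTimeoutMs: number;
}

// Reads heartbeat settings from an env-like map and rejects unsafe combinations.
function loadHeartbeatConfig(
  env: Record<string, string | undefined>,
  proxyIdleTimeoutMs: number,
): HeartbeatConfig {
  const readMs = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw === '') return fallback;
    const value = Number(raw);
    if (!Number.isInteger(value) || value <= 0) {
      throw new Error(`${key} must be a positive integer, got "${raw}"`);
    }
    return value;
  };

  // Defaults match the values used in this guide; variable names are hypothetical.
  const config: HeartbeatConfig = {
    intervalMs: readMs('WS_HEARTBEAT_INTERVAL_MS', 30_000),
    pongTimeoutMs: readMs('WS_PONG_TIMEOUT_MS', 10_000),
  };

  // The proxy must outlast one full ping-plus-grace window.
  if (proxyIdleTimeoutMs <= config.intervalMs + config.pongTimeoutMs) {
    throw new Error('proxy idle timeout must exceed intervalMs + pongTimeoutMs');
  }
  return config;
}
```

In a Node process the first argument would typically be process.env. Environment variables still require a restart to take effect, so a setup that truly avoids restarts would need some other config source.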

FAQ #

Why not just rely on TCP keepalive instead of ping/pong? #

TCP keepalive defaults to a 2-hour idle time on Linux (tcp_keepalive_time = 7200), which is longer than the idle timeout of typical reverse proxies and NAT devices. By the time the kernel probes, the proxy has already dropped the upstream socket. You can lower the keepalive socket options, but a keepalive reply only shows that the peer’s kernel is reachable. A pong shows that the peer’s WebSocket endpoint is still processing frames.

What’s the difference between `ws.close()` and `ws.terminate()`? #

close() sends a Close control frame and waits for the peer to acknowledge with its own Close. A dead peer will never complete that handshake, so the socket lingers until something times out. terminate() immediately destroys the underlying TCP socket and emits close locally. Use terminate() for any socket you’ve already judged unresponsive.
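For sockets whose health is unknown, such as during a server shutdown, the two calls can be combined: attempt the handshake, then force the issue after a grace period. This is a minimal sketch. `Closable`, `closeOrTerminate`, and the 5-second default are illustrative choices, and the interface lists only the methods the sketch relies on.

```typescript
// Minimal shape needed; ws sockets expose close(), terminate(), and once().
interface Closable {
  close(code?: number, reason?: string): void;
  terminate(): void;
  once(event: 'close', listener: () => void): unknown;
}

// Try a clean close first; fall back to terminate() if the peer never answers.
function closeOrTerminate(socket: Closable, graceMs = 5_000): void {
  const timer = setTimeout(() => socket.terminate(), graceMs);
  // If the handshake completes, 'close' fires and the fallback is cancelled.
  socket.once('close', () => clearTimeout(timer));
  socket.close(1001, 'server shutting down'); // 1001 = "going away" (RFC 6455 §7.4.1)
}
```

A socket the heartbeat has already marked dead should still go straight to terminate(); the grace period only helps peers that might answer.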

Does this work behind AWS ALB? #

Yes, with one caveat: set the ALB idle timeout above HEARTBEAT_INTERVAL_MS + PONG_TIMEOUT_MS. The ALB still drops a connection that is idle longer than its timeout, so your heartbeat cadence must keep bytes flowing more often than that. The application logic itself is unchanged.

Should the client send pings too, or only the server? #

The server-driven heartbeat shown here is sufficient to detect dead clients, because RFC 6455 requires every endpoint to answer a ping with a pong, and both browsers and the ws library do so automatically. The client should still detect a dead server. That is the reconnection side of the problem, covered in handling WebSocket disconnects gracefully.
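On the client, dead-server detection comes down to a silence deadline: if nothing has been heard from the server for longer than one interval plus some grace, assume it is gone. The sketch below shows only that bookkeeping, with the clock passed in explicitly. `ServerWatchdog` is a hypothetical name, and what counts as "heard from" is left to the caller.

```typescript
// Tracks how long the server has been silent; the caller supplies timestamps.
class ServerWatchdog {
  private lastSeenMs: number;
  private readonly maxSilenceMs: number;

  constructor(maxSilenceMs: number, nowMs: number) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastSeenMs = nowMs;
  }

  // Call whenever the server proves it is alive (a ping, a message, the open event).
  touch(nowMs: number): void {
    this.lastSeenMs = nowMs;
  }

  // True once the server has been silent for longer than the allowed window.
  isExpired(nowMs: number): boolean {
    return nowMs - this.lastSeenMs > this.maxSilenceMs;
  }
}
```

A Node client built on ws could call touch(Date.now()) from its open and ping events, poll isExpired on a timer, and terminate and reconnect when it returns true. Browsers do not expose ping frames to JavaScript, so a browser client would need an application-level message to touch on instead.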

How do I pick the interval and timeout values? #

Start with a 30s interval and 10s timeout. Lower the interval if you sit behind an aggressive proxy (some default to 60s), but watch CPU and bandwidth at high connection counts — each ping is a frame per socket per interval. Keep the values in config so you can tune them against real proxy timeouts without a redeploy.
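A rough cost estimate helps when choosing the interval. The sketch below counts only WebSocket frame bytes for empty-payload pings and pongs. It ignores TCP/IP and TLS overhead, which in practice is larger than the frames themselves, so treat the byte figure as a lower bound. The function name is illustrative.

```typescript
// Empty-payload ping from the server: 2-byte header, unmasked.
const PING_FRAME_BYTES = 2;
// Empty-payload pong from the client: 2-byte header + 4-byte masking key.
const PONG_FRAME_BYTES = 6;

// Steady-state heartbeat load for a given connection count and ping cadence.
function heartbeatLoad(connections: number, intervalMs: number) {
  const pingsPerSec = connections / (intervalMs / 1000);
  return {
    pingsPerSec,
    frameBytesPerSec: pingsPerSec * (PING_FRAME_BYTES + PONG_FRAME_BYTES),
  };
}

const load = heartbeatLoad(100_000, 30_000);
console.log(`${Math.round(load.pingsPerSec)} pings/s, ${Math.round(load.frameBytesPerSec)} frame bytes/s`);
```

At 100,000 connections and a 30s interval this comes to roughly 3,333 pings per second. Halving the interval doubles that rate, which is where the CPU cost mentioned above comes from.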

Connection Lifecycle & Heartbeats — the parent guide covering the full open/idle/close lifecycle and where heartbeats fit.
Handling WebSocket Disconnects Gracefully — the client-side counterpart: detecting and recovering from a dead server.
Backend WebSocket Connection Management — how the connection registry, file descriptors, and heartbeats interact at scale.
