Building a WebSocket Presence System with Redis

You need an authoritative “who is online” roster that survives a fleet of WebSocket nodes, a hard kill -9, and a user with five tabs open. A naive in-memory Set per process answers wrong the instant traffic spans two nodes, and a per-key EXPIRE deletes silently so you never learn whom to un-render. This page builds the complete implementation: a Redis sorted set scored by heartbeat time, a sweep that evicts stale entries and tells you who they were, per-user reference counting for multi-tab sessions, and a last-seen timestamp for the roster UI. It is the hands-on build behind Presence & Online Tracking.

Root cause

Presence is hard for one reason: a clean WebSocket close is a courtesy, not a guarantee. A browser tab killed from Task Manager, a laptop lid slammed shut, a kill -9 on the client, or a yanked network cable all skip the close frame entirely. When the TCP connection goes silent as well (the pulled cable, the suspended laptop), the server’s ws 'close' event does not fire until a timeout or failed ping tears the socket down, if it fires at all. Any presence scheme that depends on an explicit leave will therefore leak a ghost: a user shown online long after they left, because nobody told the system.

The fix is to make presence lease-based instead of event-based. A live connection must keep proving it is alive; absence of proof is the only reliable signal of departure. In Redis terms, every heartbeat writes the current timestamp as the score of the user’s member in a sorted set:

ZADD presence:online GT <now_ms> <userId>

The score is a self-expiring clock. A user whose newest score is older than now - TTL has missed enough heartbeats that we declare them gone, regardless of whether a close ever arrived. That single decision, score-as-TTL, is what makes the system crash-proof. The remaining work is counting connections correctly when one user holds several, and emitting clean diffs so clients never see a flicker. The lease rides on the same liveness signal described in Connection Lifecycle & Heartbeats; presence and dead-connection detection share one ping/pong cadence.
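As a minimal sketch of the score-as-TTL rule, the check below treats a score as a lease that lapses TTL milliseconds after it was written. The names `refreshLease` and `isLeaseLive` and the in-memory `Map` are illustrative assumptions standing in for the sorted set, and the 45-second TTL is only an example value.

```typescript
// In-memory stand-in for the sorted set: userId -> last heartbeat (epoch ms).
// Hypothetical helper names; a real deployment keeps this state in Redis.
type Leases = Map<string, number>;

// Mirrors ZADD ... GT: add new members, but only ever move a score forward.
function refreshLease(leases: Leases, userId: string, nowMs: number): void {
  const current = leases.get(userId);
  if (current === undefined || nowMs > current) {
    leases.set(userId, nowMs);
  }
}

// A lease is live while its score is no older than now - ttl.
function isLeaseLive(leases: Leases, userId: string, nowMs: number, ttlMs: number): boolean {
  const score = leases.get(userId);
  return score !== undefined && score >= nowMs - ttlMs;
}

const leases: Leases = new Map();
const TTL_MS = 45_000; // example value

refreshLease(leases, "u1", 1_000_000);
refreshLease(leases, "u1", 990_000); // late, out-of-order write: ignored

console.log(isLeaseLive(leases, "u1", 1_030_000, TTL_MS)); // 30 s since the last beat
console.log(isLeaseLive(leases, "u1", 1_050_000, TTL_MS)); // 50 s since the last beat
```

The first check passes because the last beat is 30 seconds old; the second fails because 50 seconds exceeds the 45-second lease, with no close event involved.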

[Figure: Heartbeat lease, sweep, and diff lifecycle. A socket heartbeat (every 15s) raises its score in the presence:online sorted set via ZADD GT; a periodic sweep job removes members older than the TTL with ZREMRANGEBYSCORE, and the resulting join and leave diffs are published on presence:diff to every node.]

Resolution

The service below is one self-contained module that runs on every node. It tracks how many sockets a user holds (the refcount), writes the heartbeat score, sweeps the dead, records last-seen, and publishes join/leave diffs over a dedicated channel. Wire onOpen / onHeartbeat / onClose into your ws server’s events and call startPresence once at boot.

import Redis from "ioredis";
import type { WebSocket } from "ws"; // used only as a type

const ONLINE_KEY = "presence:online"; // ZSET: member=userId, score=last heartbeat (ms)
const REFCOUNT_KEY = "presence:refcount"; // HASH: userId -> open socket count (whole fleet)
const LASTSEEN_KEY = "presence:lastseen"; // HASH: userId -> last-seen epoch ms
const DIFF_CHANNEL = "presence:diff"; // Pub/Sub channel carrying join/leave events

const HEARTBEAT_INTERVAL_MS = 15_000; // each socket refreshes its score this often
const PRESENCE_TTL_MS = 45_000; // ~3 missed beats before a user is declared gone
const SWEEP_INTERVAL_MS = 10_000; // how often THIS node runs the expiry sweep

const redis = new Redis(process.env.REDIS_URL!);
const pub = redis.duplicate(); // publishing must not share the subscriber socket
const sub = redis.duplicate(); // a subscribed connection cannot issue regular commands

type Diff = { type: "join" | "leave"; userId: string; at: number };

// A socket opened. Increment the per-user refcount; only the FIRST tab is a real "join".
async function onOpen(userId: string): Promise<void> {
  const now = Date.now();
  // HINCRBY is atomic in Redis, so concurrent opens on any node can't both think they're first.
  const count = await redis.hincrby(REFCOUNT_KEY, userId, 1);
  await redis.zadd(ONLINE_KEY, "GT", now, userId); // GT raises the score, never lowers it
  if (count === 1) {
    await publishDiff({ type: "join", userId, at: now }); // exactly one join per user
  }
}

// Heartbeat tick for a live socket: just bump the score so the lease doesn't lapse.
async function onHeartbeat(userId: string): Promise<void> {
  await redis.zadd(ONLINE_KEY, "GT", Date.now(), userId);
}

// A socket closed cleanly. Decrement the refcount; only the LAST tab is a real "leave".
async function onClose(userId: string): Promise<void> {
  const now = Date.now();
  const count = await redis.hincrby(REFCOUNT_KEY, userId, -1);
  if (count <= 0) {
    // No tabs left on this fleet for this user: remove them and stamp last-seen.
    // Note: the HINCRBY above and this MULTI are separate round trips, so an open
    // that lands between them can be clobbered; a Lua script would close that gap.
    await redis
      .multi()
      .hdel(REFCOUNT_KEY, userId) // clean up so the hash can't go negative
      .zrem(ONLINE_KEY, userId)
      .hset(LASTSEEN_KEY, userId, now)
      .exec();
    await publishDiff({ type: "leave", userId, at: now });
  }
}

// The crash-proof part: anyone whose newest score is older than the TTL is gone, no close needed.
async function sweepExpired(): Promise<void> {
  const cutoff = Date.now() - PRESENCE_TTL_MS;
  // Read stale members BEFORE deleting so we know whom to announce as leaving.
  const dead = await redis.zrangebyscore(ONLINE_KEY, 0, cutoff);
  if (dead.length === 0) return;
  const now = Date.now();
  const pipe = redis.pipeline();
  pipe.zremrangebyscore(ONLINE_KEY, 0, cutoff); // evict every stale member in one pass
  for (const userId of dead) {
    pipe.hdel(REFCOUNT_KEY, userId); // a ghost's refcount is garbage; discard it
    pipe.hset(LASTSEEN_KEY, userId, now); // record last-seen for the roster UI
  }
  await pipe.exec();
  for (const userId of dead) {
    await publishDiff({ type: "leave", userId, at: now });
  }
}

async function publishDiff(diff: Diff): Promise<void> {
  await pub.publish(DIFF_CHANNEL, JSON.stringify(diff));
}

// "is X online?" and "who is online?" both read straight from the shared sorted set.
async function isOnline(userId: string): Promise<boolean> {
  const score = await redis.zscore(ONLINE_KEY, userId);
  return score !== null && Number(score) >= Date.now() - PRESENCE_TTL_MS;
}

async function onlineUsers(): Promise<string[]> {
  return redis.zrangebyscore(ONLINE_KEY, Date.now() - PRESENCE_TTL_MS, "+inf");
}

async function lastSeen(userId: string): Promise<number | null> {
  const ts = await redis.hget(LASTSEEN_KEY, userId);
  return ts ? Number(ts) : null;
}

// Boot once per node: subscribe to diffs and start this node's sweep timer.
function startPresence(broadcast: (d: Diff, all: WebSocket[]) => void, sockets: WebSocket[]): NodeJS.Timeout {
  sub.subscribe(DIFF_CHANNEL).catch((err) => console.error("presence subscribe failed", err));
  sub.on("message", (_chan, payload) => broadcast(JSON.parse(payload) as Diff, sockets));
  // Catch sweep failures so a Redis hiccup doesn't surface as an unhandled rejection.
  return setInterval(() => {
    sweepExpired().catch((err) => console.error("presence sweep failed", err));
  }, SWEEP_INTERVAL_MS);
}

export { HEARTBEAT_INTERVAL_MS, onOpen, onHeartbeat, onClose, isOnline, onlineUsers, lastSeen, startPresence };

Three details carry the design. The GT flag on ZADD means two nodes holding tabs for the same user can race their heartbeats freely — the newest timestamp wins and no write ever drags the score backward. The refcount hash keyed on userId collapses many sockets into one presence entry, so onOpen/onClose only emit a diff at the true 0↔1 boundary. And reading dead before ZREMRANGEBYSCORE is what turns silent expiry into an announceable leave — the sweep is both the eviction and the source of the leave event.
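One way to wire the three handlers into a socket’s events is sketched below. It assumes heartbeats arrive as pong replies to server-sent pings, which this page does not spell out. `SocketLike`, `PresenceHooks`, and `attachPresence` are hypothetical names, and the structural socket type mirrors only the small part of the ws API (`ping`, the `pong` and `close` events) that the sketch touches.

```typescript
// Minimal structural view of a ws socket: only what this sketch needs.
interface SocketLike {
  ping(): void;
  on(event: "pong" | "close", listener: () => void): unknown;
}

// The three handlers exported by the presence module.
interface PresenceHooks {
  onOpen(userId: string): Promise<void>;
  onHeartbeat(userId: string): Promise<void>;
  onClose(userId: string): Promise<void>;
}

// Hypothetical helper: binds one authenticated socket to the presence hooks.
// Returns a function that stops the ping timer (it is also called on close).
function attachPresence(
  socket: SocketLike,
  userId: string,
  hooks: PresenceHooks,
  heartbeatMs: number,
): () => void {
  const logError = (err: unknown): void => console.error("presence error", err);

  hooks.onOpen(userId).catch(logError);

  // Ping on a fixed cadence; each pong proves the peer is alive and renews the lease.
  const timer = setInterval(() => socket.ping(), heartbeatMs);
  socket.on("pong", () => {
    hooks.onHeartbeat(userId).catch(logError);
  });

  const stop = (): void => clearInterval(timer);
  socket.on("close", () => {
    stop();
    hooks.onClose(userId).catch(logError);
  });
  return stop;
}
```

In a real server this would run inside the connection handler once the user is authenticated, passing the module’s three handlers and its 15-second heartbeat interval.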

Operational checklist

  • Connect a user, then kill -9 the client process (not a clean close) and confirm they vanish from redis-cli ZRANGEBYSCORE presence:online 0 +inf within TTL + sweep interval.
  • Open three tabs for one user, close two, and verify redis-cli HGET presence:refcount <userId> reads 1.
  • Close the last tab and assert exactly one leave arrives on redis-cli SUBSCRIBE presence:diff, plus a fresh presence:lastseen entry.
  • Run two nodes, connect the same user to each, and confirm onOpen emits exactly one join.
  • Crash a node mid-session and verify the orphaned refcount is reaped by the sweep’s HDEL.
  • Trim or TTL the presence:lastseen hash so it does not grow without bound.
  • At fleet scale, gate sweepExpired behind a short Redis lock (SET sweep:lock <id> NX PX 8000) so only one node sweeps per interval.
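The lock in the last checklist item could be sketched as follows, assuming one sweep per interval across the fleet is enough. `LockClient` and `sweepIfLeader` are hypothetical names; the `set(key, value, "PX", ms, "NX")` call shape is modelled on ioredis, and the key and 8000 ms expiry come from the checklist.

```typescript
// Minimal slice of a Redis client needed for the lock (call shape modelled on ioredis).
interface LockClient {
  set(key: string, value: string, px: "PX", ms: number, nx: "NX"): Promise<"OK" | null>;
}

const SWEEP_LOCK_KEY = "sweep:lock";
const SWEEP_LOCK_MS = 8_000; // shorter than the 10 s sweep interval, so it lapses before the next round

// Hypothetical wrapper: run `sweep` only if this node wins the lock.
// Resolves to true when this node performed the sweep.
async function sweepIfLeader(
  client: LockClient,
  nodeId: string,
  sweep: () => Promise<void>,
): Promise<boolean> {
  // SET ... NX succeeds for exactly one caller until the PX expiry passes.
  const won = await client.set(SWEEP_LOCK_KEY, nodeId, "PX", SWEEP_LOCK_MS, "NX");
  if (won !== "OK") return false; // another node is sweeping this round
  await sweep();
  return true; // the lock is left to expire on its own rather than being deleted
}
```

Letting the lock expire instead of deleting it keeps the sketch short and avoids a node deleting a lock it no longer owns. It also means at most one sweep runs per 8-second window, whichever node wins.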

FAQ

How does this avoid ghost presence after a crash?

Presence is leased, not declared. Every heartbeat re-stamps the user’s score with ZADD GT now, and the sweep treats any score older than now - PRESENCE_TTL_MS as gone. A kill -9 simply stops the heartbeats; the next sweep finds the stale score and evicts it with ZREMRANGEBYSCORE. You never depend on a close event firing, which is the one thing a hard crash guarantees will not happen.

Why a refcount instead of just removing the user on close?

A user with five tabs fires five close events as they navigate away. Removing on the first close would flicker them offline while four sockets are still live. The refcount hash counts open sockets per userId, so a leave diff only fires when the count hits zero — the genuine last-tab departure. Just remember the count must live in Redis, not process memory, or two nodes will each think they own the only tab.
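The boundary rule can be seen in a small in-memory model. This is a sketch only: `RefCounter` is a hypothetical class, and its `Map` stands in for the Redis hash, which is where the count has to live in production.

```typescript
type PresenceEvent = { type: "join" | "leave"; userId: string };

// In-memory model of the refcount rule; illustration only.
class RefCounter {
  private counts = new Map<string, number>();

  // Returns a join event only on the 0 -> 1 transition.
  open(userId: string): PresenceEvent | null {
    const next = (this.counts.get(userId) ?? 0) + 1;
    this.counts.set(userId, next);
    return next === 1 ? { type: "join", userId } : null;
  }

  // Returns a leave event once the count drops to zero or below, as onClose does.
  close(userId: string): PresenceEvent | null {
    const next = (this.counts.get(userId) ?? 0) - 1;
    if (next > 0) {
      this.counts.set(userId, next);
      return null;
    }
    this.counts.delete(userId); // never store zero or negative counts
    return { type: "leave", userId };
  }
}

const counter = new RefCounter();
const events: PresenceEvent[] = [];
for (let tab = 0; tab < 5; tab++) {
  const e = counter.open("u1");
  if (e) events.push(e);
}
for (let tab = 0; tab < 5; tab++) {
  const e = counter.close("u1");
  if (e) events.push(e);
}
console.log(events.map((e) => e.type)); // one join, then one leave
```

Five opens and five closes collapse into two events, which is what keeps the roster from flickering while tabs come and go.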

How accurate is the last-seen timestamp?

presence:lastseen is stamped at the moment of departure: either a clean onClose that drained the refcount, or the sweep that evicted a stale score. For ghosts, last-seen reflects when the sweep ran, not the exact instant the cable was pulled. The sweep only fires after the lease has lapsed, so the stamp can trail the real departure by up to PRESENCE_TTL_MS + SWEEP_INTERVAL_MS. If you need the true last-activity moment, stamp last-seen from the final observed heartbeat score instead of Date.now() in the sweep.
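A sketch of that alternative: fetch the stale members together with their scores and use each score as the last-seen value. It assumes the client returns WITHSCORES replies as a flat `[member, score, member, score, ...]` string array, as ioredis does; `pairsFromWithScores` and `StaleEntry` are hypothetical names.

```typescript
type StaleEntry = { userId: string; lastHeartbeatMs: number };

// Converts a flat WITHSCORES reply into typed entries.
function pairsFromWithScores(flat: string[]): StaleEntry[] {
  const entries: StaleEntry[] = [];
  for (let i = 0; i + 1 < flat.length; i += 2) {
    const userId = flat[i];
    const score = flat[i + 1];
    if (userId === undefined || score === undefined) break;
    entries.push({ userId, lastHeartbeatMs: Number(score) });
  }
  return entries;
}

// In the sweep, the read would become (assumption, not executed here):
//   const flat = await redis.zrangebyscore(ONLINE_KEY, 0, cutoff, "WITHSCORES");
// and each entry's lastHeartbeatMs would replace `now` in the HSET to presence:lastseen.
const sample = pairsFromWithScores(["u3", "1718880000000", "u7", "1718880001500"]);
console.log(sample);
```

The heartbeat score errs in the other direction: it can precede the real departure by up to one heartbeat interval, which is usually a much smaller error than the sweep-time stamp.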

Does this work behind a load balancer with sticky sessions?

Yes. Stickiness pins a connection to a node for the life of the socket, but presence questions and diffs still cross node boundaries — that is exactly why the store lives in Redis rather than process memory. The refcount, sorted set, and diff channel are all shared, so the answer is identical no matter which node a query lands on. The broader scaling picture lives under Scaling Real-Time Infrastructure.

What is the worst-case delay before a user shows offline?

PRESENCE_TTL_MS + SWEEP_INTERVAL_MS — about 55 seconds with the defaults. The TTL absorbs missed heartbeats from GC pauses and brief network blips so you do not flap users offline, and the sweep interval bounds how long a stale score sits before eviction. Shrink both for snappier detection at the cost of more Redis writes and sweep passes.
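The bound can be worked out from the three timing constants. The sketch below measures delay from the moment the client actually disappears and assumes heartbeats and sweeps fire exactly on schedule; `offlineDelayBounds` and `Timing` are hypothetical names.

```typescript
type Timing = { heartbeatMs: number; ttlMs: number; sweepMs: number };

function offlineDelayBounds(t: Timing): { bestMs: number; worstMs: number } {
  return {
    // Client dies just before its next beat was due, and a sweep runs the moment the lease lapses.
    bestMs: t.ttlMs - t.heartbeatMs,
    // Client dies right after a beat, and the lease lapses just after a sweep has run.
    worstMs: t.ttlMs + t.sweepMs,
  };
}

const defaults: Timing = { heartbeatMs: 15_000, ttlMs: 45_000, sweepMs: 10_000 };
console.log(offlineDelayBounds(defaults)); // { bestMs: 30000, worstMs: 55000 }
```

With the defaults, a vanished user stays visible for somewhere between 30 and 55 seconds. When several nodes run their own sweep timers the effective gap between sweeps shrinks, so real delays tend toward the lower end.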
