Presence & Online Status Tracking Across a WebSocket Fleet #
A single Node process can answer “is user X online?” by checking a Map of its own sockets. The moment you run more than one process, that answer is a lie. User X is connected to node B, but the request asking about them lands on node A, whose in-memory map has never heard of them. The dreaded symptom: a roster that shows a teammate as offline on one tab and online on another, refreshing differently depending on which backend the load balancer happened to pick. Per-node memory cannot see the whole fleet, so presence has to live somewhere every node can read and write.
This page builds a shared presence store on Redis: each connection writes a TTL-bounded heartbeat keyed by user, a background sweep expires the dead ones, and every node broadcasts the join/leave diffs so connected clients keep an accurate roster. It is the presence layer that sits on top of the Scaling Real-Time Infrastructure pillar.
Prerequisites #
Before wiring up presence, the connection plumbing underneath it needs to be solid:
- A heartbeat already running on each socket. Presence TTLs piggyback on the same liveness signal you use to detect dead connections. If you have not built that yet, start with Connection Lifecycle & Heartbeats — presence expiry is only as accurate as your ping/pong cadence.
- A way to fan out events to every node. Presence diffs (one user joined, another left) must reach clients connected to other processes. That delivery uses the same channel as Redis Pub/Sub Fan-Out; presence is a specialized producer for that bus.
- A reachable Redis instance (single node or Cluster) with ioredis installed: npm i ioredis. Use a separate logical DB or key prefix so presence keys do not collide with caching.
- Stable user identity per socket. Each connection must carry an authenticated userId by the time it is open, so presence is keyed on the human, not the socket.
How presence lives in a shared store #
Every node owns its own sockets but writes their liveness into one Redis sorted set, scored by the timestamp of the last heartbeat. Reads and the expiry sweep run against that shared set, so any node can answer the global question.
Two ideas do all the work. The sorted set (ZADD presence:online <timestamp> <userId>) keeps every online user scored by their freshest heartbeat, so “who is online” is a range query and “is X online?” is a single ZSCORE compared against now - TTL (an entry can be stale but not yet swept). The sweep treats any member whose score is older than now - TTL as gone: it lists them with ZRANGEBYSCORE, removes them in one ZREMRANGEBYSCORE, then announces those departures. No per-key EXPIRE is needed, because the score is the expiry clock.
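To make the two ideas concrete, here is a minimal in-memory model of the same structure, with a plain Map standing in for the Redis sorted set. The class and method names (PresenceModel, heartbeat, sweep) are illustrative only and not part of any library; timestamps are passed in explicitly so the behaviour is easy to trace by hand.

```typescript
// In-memory stand-in for the presence:online sorted set (member -> score).
// Hypothetical names; the roughly equivalent Redis command is noted on each method.
class PresenceModel {
  private scores = new Map<string, number>();

  constructor(private readonly ttlMs: number) {}

  // ZADD presence:online <now> <userId>
  heartbeat(userId: string, now: number): void {
    this.scores.set(userId, now);
  }

  // ZSCORE, compared against the cutoff so a stale-but-unswept entry reads as offline.
  isOnline(userId: string, now: number): boolean {
    const score = this.scores.get(userId);
    return score !== undefined && score > now - this.ttlMs;
  }

  // ZRANGEBYSCORE presence:online (cutoff +inf
  onlineUsers(now: number): string[] {
    const cutoff = now - this.ttlMs;
    const online: string[] = [];
    this.scores.forEach((score, userId) => {
      if (score > cutoff) online.push(userId);
    });
    return online;
  }

  // ZRANGEBYSCORE 0 cutoff, then ZREMRANGEBYSCORE 0 cutoff. Returns who left.
  sweep(now: number): string[] {
    const cutoff = now - this.ttlMs;
    const dead: string[] = [];
    this.scores.forEach((score, userId) => {
      if (score <= cutoff) dead.push(userId);
    });
    for (const userId of dead) this.scores.delete(userId);
    return dead;
  }
}

const model = new PresenceModel(45_000);
model.heartbeat("alice", 0);
model.heartbeat("bob", 30_000);
// alice's score (0) is older than 50_000 - 45_000, bob's (30_000) is not.
const departed = model.sweep(50_000); // ["alice"]
```

The real service differs mainly in that the map lives in Redis, so every node sees the same scores.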
Core implementation #
The presence service below runs identically on every node. It records heartbeats, runs a periodic sweep, and emits join/leave diffs. The diff broadcast publishes to a Redis channel so every node forwards it to its own connected clients.
```typescript
import Redis from "ioredis";

const ONLINE_KEY = "presence:online";     // ZSET: member=userId, score=last heartbeat (ms)
const LASTSEEN_KEY = "presence:lastseen"; // HASH: userId -> last-seen epoch ms
const DIFF_CHANNEL = "presence:diff";     // Pub/Sub channel for join/leave events

const HEARTBEAT_INTERVAL_MS = 15_000; // how often each socket refreshes its score
const PRESENCE_TTL_MS = 45_000;       // 3 missed heartbeats before a user is "gone"
const SWEEP_INTERVAL_MS = 10_000;     // how often this node runs the expiry sweep

const redis = new Redis(process.env.REDIS_URL!);
const pub = redis.duplicate(); // publishing must not share the subscriber connection
const sub = redis.duplicate(); // a connection in subscriber mode cannot run other commands

type Diff = { type: "join" | "leave"; userId: string; at: number };

// Called when a socket opens AND on every heartbeat tick for that socket.
async function recordHeartbeat(userId: string): Promise<void> {
  const now = Date.now();
  // "GT" (Redis 6.2+) only raises an existing score, never lowers it, so heartbeats
  // that race across nodes are safe. New members are still added.
  // ZADD returns the number of NEW members added (1 = this user was not in the set).
  const added = await redis.zadd(ONLINE_KEY, "GT", now, userId);
  if (added === 1) {
    await publishDiff({ type: "join", userId, at: now });
  }
}

// Called when a socket closes cleanly. Hard crashes are handled by the sweep instead.
// NOTE: stillConnectedLocally only covers sockets on THIS node. If the user also has a
// tab on another node, the ZREM below still fires a leave, and that node's next
// heartbeat re-adds the user and emits a join.
async function recordDisconnect(userId: string, stillConnectedLocally: boolean): Promise<void> {
  if (stillConnectedLocally) return; // user has another live tab on THIS node: keep them online
  const now = Date.now();
  await redis.zrem(ONLINE_KEY, userId);
  await redis.hset(LASTSEEN_KEY, userId, now);
  await publishDiff({ type: "leave", userId, at: now });
}

// Expiry sweep: any user whose newest heartbeat is older than the TTL is considered offline.
async function sweepExpired(): Promise<void> {
  const cutoff = Date.now() - PRESENCE_TTL_MS;
  // Read the stale members and delete them in one MULTI/EXEC transaction, so neither a
  // heartbeat nor another node's sweep can slip in between the read and the delete.
  const res = await redis
    .multi()
    .zrangebyscore(ONLINE_KEY, 0, cutoff)
    .zremrangebyscore(ONLINE_KEY, 0, cutoff)
    .exec();
  if (!res) return; // transaction was aborted
  const [err, stale] = res[0];
  if (err) throw err;
  const dead = stale as string[];
  if (dead.length === 0) return;

  const at = Date.now();
  const pipe = redis.pipeline();
  for (const userId of dead) {
    pipe.hset(LASTSEEN_KEY, userId, at); // stamp last-seen for the roster UI
  }
  await pipe.exec();
  for (const userId of dead) {
    await publishDiff({ type: "leave", userId, at });
  }
}

async function publishDiff(diff: Diff): Promise<void> {
  await pub.publish(DIFF_CHANNEL, JSON.stringify(diff));
}

// Each node subscribes once and fans the diff out to ITS local sockets (your own broadcast fn).
function startPresence(broadcastToLocalClients: (d: Diff) => void): NodeJS.Timeout {
  sub.subscribe(DIFF_CHANNEL).catch((err) => console.error("presence subscribe failed", err));
  sub.on("message", (_channel, payload) => {
    broadcastToLocalClients(JSON.parse(payload) as Diff);
  });
  // Only this node's sweep timer runs here; with many nodes, see the gotcha on sweep contention.
  return setInterval(() => {
    // Catch errors so a Redis hiccup does not become an unhandled rejection.
    sweepExpired().catch((err) => console.error("presence sweep failed", err));
  }, SWEEP_INTERVAL_MS);
}

// Snapshot query for a client that just connected and needs the full roster.
async function onlineUsers(): Promise<string[]> {
  return redis.zrangebyscore(ONLINE_KEY, Date.now() - PRESENCE_TTL_MS, "+inf");
}

export { HEARTBEAT_INTERVAL_MS, recordHeartbeat, recordDisconnect, startPresence, onlineUsers };
```
The GT flag on ZADD is the quiet hero: two nodes can hold tabs for the same user, and whichever heartbeat is newest wins without either clobbering the other backward in time. The added === 1 check turns a raw write into a clean join diff exactly once, and the sweep turns silence into a leave diff without any client telling us it left.
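The GT behaviour is easy to misread, so the sketch below models it with a Map: a new member is always added (returning 1), while an existing member's score only ever moves forward. zaddGt is a hypothetical helper that mimics ZADD with the GT flag for a single member; it is not an ioredis API.

```typescript
// Models `ZADD key GT score member` for one member against an in-memory map.
// Returns 1 if the member was newly added, 0 otherwise (like ZADD without CH).
function zaddGt(set: Map<string, number>, score: number, member: string): number {
  const current = set.get(member);
  if (current === undefined) {
    set.set(member, score); // GT does not prevent adding new members
    return 1;
  }
  if (score > current) {
    set.set(member, score); // update only when the new score is greater
  }
  return 0; // an existing member is never counted as "added"
}

const online = new Map<string, number>();
const joins: string[] = [];

// Nodes A and B both hold a tab for the same user; B's write arrives late.
const writes = [
  { node: "A", at: 1_000 },
  { node: "A", at: 16_000 },
  { node: "B", at: 9_000 }, // older timestamp arriving after a newer one
];
for (const w of writes) {
  if (zaddGt(online, w.at, "alice") === 1) joins.push(`join via node ${w.node}`);
}
// joins holds a single entry, and alice's score stays at 16_000.
```

Only the first write produces a join, and the late write from node B leaves the score untouched, which is the property recordHeartbeat relies on.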
Configuration reference #
| Parameter | Type | Default | Production value | Notes |
|---|---|---|---|---|
| HEARTBEAT_INTERVAL_MS | number (ms) | 15_000 | 15_000–30_000 | Drives both liveness and the presence score refresh. Lower = faster detection, more Redis writes. |
| PRESENCE_TTL_MS | number (ms) | 45_000 | ~3 × heartbeat | Tolerate 2–3 missed beats before declaring offline, to absorb GC pauses and network blips. |
| SWEEP_INTERVAL_MS | number (ms) | 10_000 | 10_000–15_000 | Worst-case offline latency ≈ TTL + sweep interval. Keep below TTL for snappy departures. |
| ONLINE_KEY | string | presence:online | per-tenant prefixed | Sorted set of online users. Prefix with tenant/room for multi-tenant fleets. |
| LASTSEEN_KEY | string | presence:lastseen | hash, capped | Last-seen timestamps. Trim or TTL old entries so it does not grow unbounded. |
| DIFF_CHANNEL | string | presence:diff | per-room channel | Scope channels so a node only receives diffs it cares about. |
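The relationships between the three timing parameters can be checked mechanically. The sketch below is one way to do that; PresenceConfig, worstCaseOfflineMs, missedBeatsBeforeExpiry and checkConfig are hypothetical names introduced here, and the two rules it enforces are the ones stated in the table (TTL longer than one heartbeat, sweep interval below TTL).

```typescript
interface PresenceConfig {
  heartbeatIntervalMs: number;
  presenceTtlMs: number;
  sweepIntervalMs: number;
}

// Worst-case delay between a user's last heartbeat and the leave diff.
function worstCaseOfflineMs(c: PresenceConfig): number {
  return c.presenceTtlMs + c.sweepIntervalMs;
}

// Number of consecutive missed heartbeats after which the score is stale.
function missedBeatsBeforeExpiry(c: PresenceConfig): number {
  return Math.floor(c.presenceTtlMs / c.heartbeatIntervalMs);
}

// Returns human-readable warnings; an empty array means the config looks sane.
function checkConfig(c: PresenceConfig): string[] {
  const warnings: string[] = [];
  if (c.presenceTtlMs <= c.heartbeatIntervalMs) {
    warnings.push("TTL must exceed the heartbeat interval or live users expire between beats");
  }
  if (c.sweepIntervalMs >= c.presenceTtlMs) {
    warnings.push("sweep interval should stay below the TTL for timely departures");
  }
  return warnings;
}

const defaults: PresenceConfig = {
  heartbeatIntervalMs: 15_000,
  presenceTtlMs: 45_000,
  sweepIntervalMs: 10_000,
};

const latencyMs = worstCaseOfflineMs(defaults);        // 45_000 + 10_000 = 55_000
const missedBeats = missedBeatsBeforeExpiry(defaults); // 45_000 / 15_000 = 3
```

Running checkConfig at startup is a cheap guard against a deploy that shortens the TTL without also shortening the heartbeat.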
Edge cases & gotchas #
- Clock skew across nodes. Scores are wall-clock timestamps, so a node whose clock runs fast can keep a user “fresh” too long, or one running slow can expire a live user early. Run NTP on every box, and prefer comparing against Date.now() on the same node that sweeps rather than mixing per-node clocks. For hard guarantees, score with a Redis-server timestamp via a small Lua script so all nodes share one clock.
- Ghost presence on a hard crash. A kill -9 or yanked network cable never fires recordDisconnect, so the user lingers in the sorted set until the TTL lapses. This is by design (the sweep is your only reliable cleanup for ungraceful exits), but it means offline detection is bounded by PRESENCE_TTL_MS + SWEEP_INTERVAL_MS, not instant. Size the TTL for the latency your UI can tolerate.
- Thundering reconnect after a deploy. Rolling a node drops thousands of sockets that all reconnect within a second, each firing recordHeartbeat. Most are re-joins of users still scored in the set, so ZADD GT returns 0 and emits no diff, which is what you want. But the snapshot reads can stampede; debounce roster snapshots and let clients reuse the diff stream rather than re-querying onlineUsers() per reconnect.
- Duplicate sweeps in a large fleet. Every node running its own sweep is correct but wasteful: N nodes do N identical ZREMRANGEBYSCORE passes. The removal is idempotent, so later passes are no-ops, but two nodes that read the stale list before either deletes will both announce the same leave unless the read and delete run atomically (for example in MULTI/EXEC). At scale, elect a single sweeper with a short Redis lock (SET sweep:lock <id> NX PX 8000) so only one node sweeps per interval.
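The single-sweeper idea from the last bullet can be sketched as follows. TryLock and sweepIfLeader are hypothetical names; the lock call is abstracted behind a function so the example runs without a Redis server. With ioredis the lock function would presumably wrap redis.set(key, value, "PX", ttlMs, "NX") and check for an "OK" reply, but treat that mapping as an assumption to verify against your client version.

```typescript
// Attempts to take the lock; resolves true only for the caller that acquired it.
// Stands in for `SET sweep:lock <id> NX PX <ttlMs>`.
type TryLock = (key: string, value: string, ttlMs: number) => Promise<boolean>;

// Runs the sweep only if this node wins the lock for the current interval.
async function sweepIfLeader(
  tryLock: TryLock,
  nodeId: string,
  sweep: () => Promise<void>,
  lockTtlMs = 8_000, // shorter than the sweep interval, as in the text above
): Promise<boolean> {
  const acquired = await tryLock("sweep:lock", nodeId, lockTtlMs);
  if (!acquired) return false; // another node is sweeping this interval
  await sweep();
  return true; // the lock is not released; it simply expires
}

// In-memory fake with the same NX + expiry semantics, for local experiments.
function makeFakeLock(now: () => number): TryLock {
  const expiresAt = new Map<string, number>();
  return async (key, _value, ttlMs) => {
    const expiry = expiresAt.get(key);
    if (expiry !== undefined && expiry > now()) return false; // still held
    expiresAt.set(key, now() + ttlMs);
    return true;
  };
}
```

Letting the lock expire rather than deleting it keeps the sketch short and avoids releasing a lock another node has since acquired; the cost is that at most one sweep runs per lock TTL.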
Verification #
Confirm presence behaves under real disconnects, not just clean ones:
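- With two nodes running, connect user A to node 1 and query redis-cli ZSCORE presence:online userA; it should return the last heartbeat timestamp.
- kill -9 the client process (not a clean close) and watch redis-cli ZRANGEBYSCORE presence:online 0 +inf: user A must disappear within TTL + sweep interval.
- Subscribe with redis-cli SUBSCRIBE presence:diff and assert exactly one join on connect and exactly one leave on disconnect.
- Open two tabs for the same user on the same node, close one, and confirm the user stays online (stillConnectedLocally suppresses the leave). Repeat with the tabs on different nodes: stillConnectedLocally cannot see the other node, so with recordDisconnect as written expect a leave followed by a join on the other node's next heartbeat, with GT keeping the newest score.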
Guides in this area #
- Building a WebSocket Presence System with Redis walks the full implementation end to end — the sorted-set schema, the heartbeat loop, the sweep job, and the diff broadcast wired into a running fleet.
FAQ #
Why use a sorted set instead of per-key TTLs with EXPIRE? #
Per-key EXPIRE deletes a key silently when it lapses. Redis emits no event by default; keyspace notifications are opt-in and fire only when Redis actually deletes the key, which can lag the TTL, so in practice you would end up polling for absence. A sorted set scored by heartbeat timestamp lets the sweep list the expired users with one ZRANGEBYSCORE and drop them with one ZREMRANGEBYSCORE, so you know exactly whom to announce. It also makes “who is online right now” a cheap range query instead of a SCAN.
How fast does a user show as offline after a crash? #
Worst case is PRESENCE_TTL_MS + SWEEP_INTERVAL_MS — with the defaults, about 55 seconds. A hard crash never sends a clean close, so the only signal is the absence of further heartbeats, which the sweep detects on its next pass. Shrink both values for faster detection at the cost of more Redis traffic.
Does this work behind AWS ALB or sticky sessions? #
Yes, and it is precisely the case sticky sessions cannot solve alone. Stickiness keeps one user pinned to one node, but presence queries and diffs still cross node boundaries, which is why the store lives in Redis. Pair this with Load Balancer Sticky Sessions for connection routing and let Redis own the global roster.
What about a user with multiple devices or tabs? #
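Key presence on userId, not socket id, and the sorted set naturally collapses multiple connections into one online entry. On disconnect, only emit a leave when no local socket for that user remains. That check is per node: recordDisconnect as written removes the user immediately, so a tab still open on another node shows up as a leave followed by a join at that node's next heartbeat. To avoid the flap, skip the immediate ZREM and let the sweep expire the user only if no other node refreshes the score; the GT flag keeps the newest heartbeat authoritative across devices.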
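The stillConnectedLocally argument to recordDisconnect has to come from somewhere. One way to derive it is a small per-node registry of sockets by user, sketched below. LocalSockets is a hypothetical helper, and socket ids are assumed to be unique strings assigned by your WebSocket layer.

```typescript
// Tracks which sockets THIS node holds for each user.
class LocalSockets {
  private byUser = new Map<string, Set<string>>();

  add(userId: string, socketId: string): void {
    let sockets = this.byUser.get(userId);
    if (!sockets) {
      sockets = new Set<string>();
      this.byUser.set(userId, sockets);
    }
    sockets.add(socketId);
  }

  // Removes the socket and reports whether the user still has another
  // socket on this node (the value to pass as stillConnectedLocally).
  remove(userId: string, socketId: string): boolean {
    const sockets = this.byUser.get(userId);
    if (!sockets) return false;
    sockets.delete(socketId);
    if (sockets.size === 0) {
      this.byUser.delete(userId); // avoid leaking empty sets
      return false;
    }
    return true;
  }
}

const local = new LocalSockets();
local.add("alice", "sock-1");
local.add("alice", "sock-2");
const afterFirstClose = local.remove("alice", "sock-1");  // true: sock-2 is still open
const afterSecondClose = local.remove("alice", "sock-2"); // false: no local sockets left
```

In a close handler this would be used roughly as recordDisconnect(userId, local.remove(userId, socketId)). It says nothing about sockets held by other nodes.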
How do I scale the diff broadcast to many rooms? #
Scope the Redis channel per room or tenant (presence:diff:<roomId>) so a node only subscribes to the rooms it actually serves. This rides on the same delivery mechanism as Redis Pub/Sub Fan-Out; presence is just one structured event type flowing over that bus.
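As a rough illustration of per-room scoping, the helper below derives channel names from the presence:diff:<roomId> pattern and works out which channels a node should subscribe to or drop as its set of active rooms changes. diffChannel and planSubscriptions are hypothetical names; the resulting lists would be passed to your subscriber connection's subscribe and unsubscribe calls.

```typescript
const DIFF_CHANNEL_PREFIX = "presence:diff:";

function diffChannel(roomId: string): string {
  return DIFF_CHANNEL_PREFIX + roomId;
}

// Compares the channels currently subscribed with the rooms that have at least
// one local socket, and returns the changes needed to bring them in line.
function planSubscriptions(
  subscribed: ReadonlySet<string>,
  activeRooms: readonly string[],
): { subscribe: string[]; unsubscribe: string[] } {
  const wanted = new Set<string>(activeRooms.map(diffChannel));
  const subscribe = Array.from(wanted).filter((ch) => !subscribed.has(ch));
  const unsubscribe = Array.from(subscribed).filter((ch) => !wanted.has(ch));
  return { subscribe, unsubscribe };
}

const plan = planSubscriptions(
  new Set(["presence:diff:lobby", "presence:diff:old-room"]),
  ["lobby", "design"],
);
// plan.subscribe   -> ["presence:diff:design"]
// plan.unsubscribe -> ["presence:diff:old-room"]
```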
Related #
- Building a WebSocket Presence System with Redis — the hands-on build of the schema, sweep, and diff broadcast described here.
- Redis Pub/Sub Fan-Out — the cross-node delivery layer that carries presence diffs to every connected client.
- Connection Lifecycle & Heartbeats — the ping/pong cadence that presence TTLs piggyback on for liveness.
- Load Balancer Sticky Sessions — connection routing that complements a shared presence store across the fleet.