Redis Streams vs Pub/Sub for WebSocket Fan-Out #
You already bridge your WebSocket nodes with Redis Pub/Sub Fan-Out, and it works — until a node restarts mid-deploy and every client connected to it silently loses the messages published during the gap. Now you are staring at a Redis docs page weighing PUBLISH/SUBSCRIBE against XADD/XREADGROUP, trying to decide whether the upgrade to Streams is worth the added moving parts. This guide is the decision: it lays out exactly what pub/sub buys you, what Streams buys you, and the latency, memory, ordering, and reconnect-replay trade-offs that should drive the call — then shows a Streams consumer-group fan-out you can drop into a node.
Root cause #
The two primitives sit at opposite ends of a durability spectrum, and the difference is structural, not a tuning knob.
Pub/sub is fire-and-forget. When a node calls PUBLISH ws:room:42 <envelope>, Redis delivers that envelope to whatever clients are subscribed at that instant and then forgets it. There is no log, no offset, no acknowledgement. A subscriber that connects one millisecond later, or a node restarting through a deploy, never sees the message — it was never stored. The PUBLISH return value tells you how many subscribers received it, which is the only receipt you will ever get. This is why a Redis pub/sub broadcast is perfect for live cursors and typing indicators and structurally wrong for anything where a dropped message is a correctness bug.
Streams are an append-only log. XADD ws:room:42 * field value appends an entry with a monotonically increasing ID (<ms>-<seq>) and keeps it until you trim. Consumers read by ID range with XREAD, or — the interesting part for fan-out — join a consumer group with XREADGROUP. A group tracks a last-delivered ID and a Pending Entries List (PEL) of messages handed out but not yet XACK-ed. A node that crashes after reading but before acknowledging can reclaim those exact entries on restart with XAUTOCLAIM. That gives you at-least-once delivery, replay from any offset, and natural backpressure: slow consumers simply read fewer entries per XREADGROUP call rather than being drowned by an unthrottled push.
The cost is real. Streams consume memory proportional to retained entries, demand a trimming strategy (MAXLEN/MINID), and force you to think about acknowledgement, idempotency, and stuck-consumer recovery — concerns pub/sub lets you ignore entirely because it never remembers anything.
Side-by-side #
| Dimension | Pub/Sub | Streams |
|---|---|---|
| Delivery semantics | At-most-once (fire-and-forget) | At-least-once with consumer groups |
| Persistence / replay | None — gone after publish | Append-only log, replay by ID |
| Missed while offline? | Lost forever | Retained until trimmed |
| Ordering | Per-channel, not durable | Strict per-stream by entry ID |
| Acknowledgement | None | XACK + Pending Entries List |
| Backpressure | Push, can overwhelm slow subs | Pull, COUNT bounds each read |
| Latency | Lowest (no write, no fsync path) | Low, plus one write + read round-trip |
| Memory cost | ~0 (no storage) | Grows with retained entries |
| Crash recovery | Nothing to recover | XAUTOCLAIM reclaims the PEL |
| Operational complexity | Minimal | Trimming, idempotency, claim loops |
The single sentence that decides it: if a subscriber being momentarily absent is allowed to mean a message is gone forever, pub/sub is correct and cheaper. If absence must mean delayed, not lost, you need a log — and on Redis that log is Streams. For the formal delivery-semantics framing behind this, see message delivery guarantees.
Resolution #
Here is a Streams-based fan-out for a single node. It runs a consumer-group reader loop that pulls new entries, delivers them to local WebSocket sockets, acknowledges, and on startup reclaims anything a previous crash left pending. Each node uses a distinct consumer name but shares one group so the PEL is per-consumer and crash recovery is targeted.
import { WebSocketServer, WebSocket } from 'ws';
import Redis from 'ioredis';
const STREAM_KEY = 'ws:room:42'; // one stream per room
const GROUP = 'ws-fanout'; // shared group across nodes
const CONSUMER = process.env.NODE_ID ?? crypto.randomUUID(); // unique per node
const BLOCK_MS = 5000; // long-poll window per read
const BATCH = 50; // backpressure: cap entries/read
const MAXLEN = 10_000; // trim cap to bound memory
const redis = new Redis(process.env.REDIS_URL!); // reader: blocks in XREADGROUP
const writer = new Redis(process.env.REDIS_URL!); // writer: XADD path, never blocks
const localSockets = new Set<WebSocket>(); // sockets owned by THIS node
// Create the group once; ignore BUSYGROUP if it already exists. MKSTREAM lets the
// group exist before the first XADD. '$' = deliver only entries added from now on.
async function ensureGroup() {
try {
await redis.xgroup('CREATE', STREAM_KEY, GROUP, '$', 'MKSTREAM');
} catch (err: any) {
if (!String(err?.message).includes('BUSYGROUP')) throw err;
}
}
// Publish path: append an entry. '*' lets Redis assign the monotonic ID, and
// MAXLEN '~' trims approximately (cheap) so the log cannot grow unbounded.
function publish(payload: unknown) {
return writer.xadd(STREAM_KEY, 'MAXLEN', '~', MAXLEN, '*', 'data', JSON.stringify(payload));
}
// Deliver one stream entry to every open local socket, then report whether
// delivery succeeded so the caller can decide to XACK.
function deliver(fields: string[]): boolean {
const idx = fields.indexOf('data');
if (idx === -1) return true; // malformed: ack to skip
const frame = fields[idx + 1];
for (const ws of localSockets) {
if (ws.readyState === WebSocket.OPEN) ws.send(frame); // local send only
}
return true;
}
// On startup, reclaim entries this consumer (by name) was handed but never
// acked — the crash-recovery path. XAUTOCLAIM walks the PEL from id '0'.
async function reclaimPending() {
let cursor = '0-0';
do {
const [next, entries] = await redis.xautoclaim(
STREAM_KEY, GROUP, CONSUMER, 60_000, cursor, 'COUNT', BATCH,
) as [string, [string, string[]][]];
for (const [id, fields] of entries) {
if (deliver(fields)) await redis.xack(STREAM_KEY, GROUP, id);
}
cursor = next;
} while (cursor !== '0-0'); // '0-0' = scan complete
}
// Main loop: block for new entries, deliver, ack. '>' means "messages never
// delivered to any consumer in this group" — the live tail.
async function consume() {
while (true) {
const res = await redis.xreadgroup(
'GROUP', GROUP, CONSUMER, 'COUNT', BATCH, 'BLOCK', BLOCK_MS, 'STREAMS', STREAM_KEY, '>',
) as [string, [string, string[]][]][] | null;
if (!res) continue; // BLOCK timed out — loop again
for (const [, entries] of res) {
for (const [id, fields] of entries) {
if (deliver(fields)) await redis.xack(STREAM_KEY, GROUP, id); // at-least-once
}
}
}
}
const wss = new WebSocketServer({ port: 8080 });
wss.on('connection', (ws) => {
localSockets.add(ws);
ws.on('message', (data) => publish(JSON.parse(data.toString())));
ws.on('close', () => localSockets.delete(ws));
});
await ensureGroup();
await reclaimPending(); // recover before tailing live
consume().catch((err) => { console.error('consumer loop died', err); process.exit(1); });
Two things make this safe. First, XACK runs only after a successful local send, so a crash between read and send leaves the entry in the PEL for XAUTOCLAIM to recover — that is what makes it at-least-once rather than at-most-once. Second, BLOCK plus a bounded COUNT is the backpressure mechanism: a node that falls behind reads at most BATCH entries per cycle and the rest wait in the log, instead of being pushed at it faster than it can flush to sockets. Because delivery is at-least-once, your client frames must be idempotent (carry an entry/message ID the client dedupes on) — a redelivery after reclaim will otherwise show as a duplicate.
Operational checklist #
- Confirm
XADDusesMAXLEN '~' <cap>(or aMINID - Verify each node sets a unique consumer name but the same group, so the Pending Entries List is partitioned per node and
XAUTOCLAIM - Run
reclaimPending()(orXAUTOCLAIM) on startup before entering the live> - Alarm on PEL depth (
XPENDING ws:room:42 ws-fanout - Load-test the
BLOCK/COUNTpair under your real message rate; too small aCOUNT - Document the rollback: if Streams overhead is not justified, the fallback is the simpler pub/sub broadcast
FAQ #
Is Redis Streams slower than pub/sub for fan-out? #
Marginally, and almost never enough to matter for WebSocket traffic. Pub/sub skips the write entirely, so its tail latency is the lowest Redis offers. Streams add an XADD write plus an XREADGROUP read, typically a sub-millisecond round-trip on a local instance. The latency you actually feel in production is dominated by network and socket flush, not by which Redis primitive carries the message. Choose on durability needs first; if both satisfy correctness, then prefer pub/sub for the lowest latency.
How do I stop a Redis Stream from eating all my memory? #
Trim it. Pass MAXLEN '~' <cap> on every XADD to keep roughly the last N entries (the ~ makes trimming approximate and cheap), or use MINID to drop entries older than a timestamp. Without trimming, the log is append-only forever and will grow until Redis hits maxmemory. Size the cap to your worst-case reconnect-replay window — how far back a returning client legitimately needs to catch up — not to “keep everything.”
What happens to unacknowledged messages if a node crashes? #
They stay in that consumer’s Pending Entries List. When the node restarts under the same consumer name, XAUTOCLAIM (or XCLAIM after XPENDING) reclaims entries idle longer than your threshold and redelivers them, so nothing read-but-not-acked is lost. This is the entire point of consumer groups over plain XREAD. If the node never comes back, run a reaper that claims the dead consumer’s PEL onto a live one so its backlog still drains.
Can I keep pub/sub for some channels and Streams for others? #
Yes, and you usually should. Route ephemeral, high-frequency, loss-tolerant traffic (cursors, presence pings, typing) over pub/sub fan-out for minimum latency and zero storage, and route must-not-drop traffic (chat history, order updates, notifications) over Streams. They share the same Redis and the same node process; only the transport per message class differs. Pick per topic, not per system.
Do consumer groups distribute messages or copy them to every node? #
Within one group, each entry goes to exactly one consumer — that is load distribution, not fan-out, and it is the opposite of what broadcasting to all sockets needs. For WebSocket fan-out you want every node to deliver to its own local sockets, so give each node its own group (every group sees every entry) while keeping one consumer per group for crash recovery. Use a single shared group only when you want competing consumers to split the work, such as a pool of workers persisting messages.
Related #
- Redis Pub/Sub Fan-Out for WebSockets — the parent area: bridging nodes with fire-and-forget pub/sub, the cheaper default.
- Scaling WebSocket Broadcast with Redis Pub/Sub — the pub/sub broadcast loop with batching and backpressure, and your rollback path from Streams.
- Message Delivery Guarantees — at-least-once delivery, acknowledgements, and idempotency, the semantics Streams enforce.
- Scaling Real-Time Infrastructure — fan-out, presence, delivery guarantees, and horizontal scaling across the fleet.
Back to Redis Pub/Sub Fan-Out