Real-Time Protocol Selection & Architecture #

Picking a real-time transport is an architecture decision, not a library choice. Get it wrong and you inherit silent disconnects behind corporate proxies, half-open sockets that survive a load-balancer restart, or a WebRTC stack you cannot debug at 2 a.m. This guide is for full-stack engineers deciding which protocol to run, then standing it up correctly: the infrastructure that has to be in place first, the WebSocket Upgrade handshake annotated line by line, how to select between WebSocket, Server-Sent Events, and WebRTC, how to degrade when the network fights you, and which signals to watch in production. Read it top to bottom before your first deploy; come back to it the next time a transport “works in dev but not in the data center.”

Figure: Real-time transport selection and request path. A browser client (WebSocket / fetch) connects through TLS termination and an nginx reverse proxy that forwards the Upgrade headers to a backend transport: WebSocket (full-duplex), SSE (server to client), or WebRTC (peer media). When the Upgrade is blocked, the fallback ladder is WebSocket, then SSE, then HTTP long-poll; detect at connect and retry with backoff.

Infrastructure baseline #

Before any transport works reliably, three things must be true at the edge: TLS terminates correctly, the reverse proxy forwards the Upgrade handshake instead of swallowing it, and the kernel allows enough open file descriptors for your connection count. Skip any one and you will chase a “works locally” ghost.

Serve real-time traffic over TLS (wss://, not ws://). Plaintext WebSocket connections are routinely mangled by intercepting proxies; TLS also stops the Upgrade from being stripped mid-path. Use modern protocol versions and a tight cipher set, and keep the same certificate chain your HTTPS site uses — the Security & TLS Configuration section covers rotation and origin pinning in depth.

# nginx: terminate TLS and forward the WebSocket Upgrade intact
# (ssl_certificate, ssl_certificate_key, and the backend_ws upstream are assumed to be defined elsewhere)
map $http_upgrade $connection_upgrade {
    default upgrade;   # client asked to upgrade -> pass it through
    ''      close;     # plain HTTP request -> let keep-alive close normally
}

server {
    listen 443 ssl;
    server_name realtime.example.com;

    ssl_protocols TLSv1.2 TLSv1.3;   # no TLS 1.0/1.1
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    ssl_prefer_server_ciphers on;

    location /ws/ {
        proxy_pass http://backend_ws;
        proxy_http_version 1.1;                         # 1.0 cannot carry Upgrade
        proxy_set_header Upgrade $http_upgrade;         # required header pair
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 3600s;                       # must exceed the heartbeat interval
        proxy_send_timeout 3600s;
        proxy_buffering off;                            # never buffer a streaming socket
    }
}

On the host, raise the descriptor ceiling so a single node can hold tens of thousands of sockets:

# allow many concurrent sockets per worker process
ulimit -n 1048576 # current shell session only; persist via /etc/security/limits.conf or systemd LimitNOFILE
echo 'fs.file-max = 2097152' >> /etc/sysctl.conf
sysctl -p

If you are running nginx specifically for upgrade handling, the Browser Compatibility & Polyfills section drills into the exact proxy quirks that break legacy clients.

Core mechanism: the WebSocket Upgrade handshake #

A WebSocket connection is an HTTP/1.1 request that asks the server to switch protocols. The client sends GET with Upgrade: websocket, Connection: Upgrade, and a random Sec-WebSocket-Key. The server proves it understood by hashing that key with the fixed GUID 258EAFA5-E914-47DA-95CA-C5AB0DC85B11, base64-encoding the SHA-1 digest into Sec-WebSocket-Accept, and replying 101 Switching Protocols. After 101, the TCP byte stream is no longer HTTP — it is framed WebSocket data flowing both directions. Subprotocol negotiation, extension headers, and the framing format are dissected in Protocol Handshake Mechanics.
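The accept-key derivation described above can be reproduced in a few lines, which is handy when a handshake fails and you want to check a proxy or server by hand. This is a minimal sketch using Node's built-in crypto module; `computeAccept` is a hypothetical helper name, and the sample key is the one used in RFC 6455.

```typescript
import { createHash } from 'node:crypto';

// Fixed GUID defined by RFC 6455; it is identical for every handshake.
const WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Hypothetical helper: derive Sec-WebSocket-Accept from Sec-WebSocket-Key.
function computeAccept(secWebSocketKey: string): string {
  return createHash('sha1')
    .update(secWebSocketKey + WS_GUID) // key and GUID are concatenated as text
    .digest('base64');                 // base64 of the raw SHA-1 digest
}

// Sample key from RFC 6455.
console.log(computeAccept('dGhlIHNhbXBsZSBub25jZQ=='));
// prints "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```

In practice a library such as ws performs this step for you; the sketch is only meant to make the 101 response verifiable.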

This is also the right moment to authenticate. The Upgrade request still carries cookies and headers, so validate the caller before you complete the handshake — see WebSocket Authentication & Authorization for token verification and origin enforcement at this boundary.

// Node.js: validate the Upgrade manually, then hand off to the ws server.
import { createServer } from 'node:http';
import { WebSocketServer } from 'ws';
import { authenticateUpgrade } from './auth';

// noServer: we own the upgrade handshake so we can reject early.
const wss = new WebSocketServer({ noServer: true });
const server = createServer();

server.on('upgrade', async (req, socket, head) => {
  // The Upgrade is still an HTTP request: headers + cookies are available here.
  let principal;
  try {
    principal = await authenticateUpgrade(req); // returns null if invalid
  } catch {
    principal = null; // a thrown auth error is a rejection, not an unhandled crash
  }
  if (!principal) {
    // Reject BEFORE 101 so no WebSocket session is ever created.
    socket.write('HTTP/1.1 401 Unauthorized\r\n\r\n');
    socket.destroy();
    return;
  }

  // Origin check stops cross-site script-driven connections (CSWSH).
  const origin = req.headers.origin;
  if (origin && !isAllowedOrigin(origin)) {
    socket.write('HTTP/1.1 403 Forbidden\r\n\r\n');
    socket.destroy();
    return;
  }

  // ws computes Sec-WebSocket-Accept and emits the 101 response for us.
  wss.handleUpgrade(req, socket, head, (ws) => {
    (ws as any).principal = principal; // attach identity for later authz
    wss.emit('connection', ws, req);
  });
});

function isAllowedOrigin(origin: string): boolean {
  const ALLOWED = new Set(['https://app.example.com']);
  return ALLOWED.has(origin);
}

server.listen(8080);

The key insight: everything you can check at the HTTP layer — auth, origin, rate limits — must be checked during upgrade, because once 101 is sent the request semantics are gone.

Scaling & architecture: transport selection and fallbacks #

Pick the transport from the data-flow shape, not from familiarity. Use WebSocket when both sides push messages (chat, presence, multiplayer, live editing). Use SSE when the server streams to a mostly passive client (notifications, dashboards, log tails) — it rides plain HTTP, auto-reconnects, and needs no special framing. Use WebRTC data channels only when you need peer-to-peer or sub-50 ms media-grade latency; the signaling and NAT-traversal cost is real. The trade-off matrix lives in the WebSocket vs SSE vs WebRTC Comparison.

Figure: Real-time transport decision tree. If both sides push frequently (chat, presence, editing), use WebSocket (wss://). Otherwise, if you need peer-to-peer paths or media-grade latency (sub-50 ms), use a WebRTC data channel. Otherwise, if the stream is server-to-client only (notifications, dashboards), use SSE (event-stream). Otherwise, use plain HTTP request/response with fetch.

Figure: Runtime fallback ladder. Try WebSocket first and use it if the 101 arrives. If it is blocked or times out, try SSE and use it if the stream opens. If that is blocked too, drop to HTTP long-poll, which always works. Detect at connect, and retry with backoff and jitter.
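The decision tree can also be written down as a small function, which is sometimes useful in design reviews or as a lint on new feature proposals. This is only a sketch of the tree as described; the `DataFlow` fields and the `recommendTransport` name are illustrative assumptions, not part of any library.

```typescript
type Recommendation = 'websocket' | 'webrtc' | 'sse' | 'http';

// Hypothetical description of a feature's data-flow shape.
interface DataFlow {
  bothSidesPush: boolean;     // chat, presence, live editing
  peerToPeerOrMedia: boolean; // peer-to-peer paths or media-grade latency
  serverStreamOnly: boolean;  // notifications, dashboards, log tails
}

// Walks the questions in the same order as the decision tree.
function recommendTransport(flow: DataFlow): Recommendation {
  if (flow.bothSidesPush) return 'websocket';
  if (flow.peerToPeerOrMedia) return 'webrtc';
  if (flow.serverStreamOnly) return 'sse';
  return 'http'; // plain request/response is enough
}

console.log(
  recommendTransport({ bothSidesPush: false, peerToPeerOrMedia: false, serverStreamOnly: true }),
);
// prints "sse"
```

The order of the checks matters: a feature where both sides push frequently resolves to WebSocket before the peer-to-peer question is asked, exactly as in the tree.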

Corporate proxies, captive portals, and some mobile carriers strip the Upgrade header or kill idle sockets. Production clients therefore negotiate a transport at connect time and fall back down the ladder, retrying with exponential backoff and jitter so a regional outage does not produce a thundering-herd reconnect storm.

// Client: detect the best available transport, fall back gracefully.
type Transport = 'websocket' | 'sse' | 'longpoll';

const CONNECT_TIMEOUT_MS = 5_000; // if 101/stream does not arrive, fall back

// tryTransport is application code (not shown): it resolves true once the
// transport is connected and false on failure or timeout.
declare function tryTransport(transport: Transport, url: string, timeoutMs: number): Promise<boolean>;

function selectTransport(): Transport {
  // Feature detection is necessary but not sufficient: a proxy can still block
  // a runtime-supported transport, so the connect attempt must also time out.
  if (typeof WebSocket !== 'undefined') return 'websocket';
  if (typeof EventSource !== 'undefined') return 'sse';
  return 'longpoll';
}

async function connectWithFallback(url: string): Promise<Transport> {
  const order: Transport[] = ['websocket', 'sse', 'longpoll'];
  const start = order.indexOf(selectTransport());
  for (const transport of order.slice(start)) {
    if (await tryTransport(transport, url, CONNECT_TIMEOUT_MS)) return transport;
    // else loop: the network blocked this one, drop to the next rung.
  }
  throw new Error('No usable real-time transport');
}
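The ladder above makes a single pass; the retry behavior mentioned earlier (exponential backoff with jitter) could be layered on top roughly as follows. This is a sketch of the common "full jitter" variant, not a prescribed policy: the function names, the 500 ms base, the 30 s cap, and the 8-attempt limit are all assumptions chosen for illustration.

```typescript
// Full-jitter backoff: a uniform random delay in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retries `connect` until it succeeds or the hard attempt cap is reached.
async function retryWithBackoff<T>(
  connect: () => Promise<T>,
  maxAttempts: number = 8,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Do not sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

A caller might wrap the ladder as `retryWithBackoff(() => connectWithFallback(url))`. Because each client draws its own random delay, clients that were disconnected at the same instant spread their reconnects over the whole window instead of arriving together.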

Horizontally, a single node cannot hold every connection, so a broadcast on node A must reach a subscriber pinned to node B. The two patterns are sticky sessions (pin a client to one node and keep state in memory) and a shared message bus (publish to Redis or NATS and let every node fan out to its local sockets). Sticky sessions are simpler but make rolling deploys and rebalancing painful; a Pub/Sub bus decouples routing from connection placement. The full fan-out, presence, and delivery-guarantee treatment lives under Scaling Real-Time Infrastructure.

// Cross-node broadcast via Redis Pub/Sub: publish once, every node fans out locally.
import { createClient } from 'redis';

const pub = createClient({ url: process.env.REDIS_URL });
const sub = pub.duplicate();
await Promise.all([pub.connect(), sub.connect()]);

// The pattern matches every tenant channel, so each node receives every message;
// fanOutToLocalSockets (application code) delivers only to this node's sockets.
await sub.pSubscribe('rt:tenant:*', (message, channel) => {
  const tenantId = channel.split(':')[2];
  fanOutToLocalSockets(tenantId, JSON.parse(message)); // only this node's sockets
});

export async function broadcast(tenantId: string, payload: unknown): Promise<void> {
  await pub.publish(`rt:tenant:${tenantId}`, JSON.stringify(payload));
}
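The snippet above calls `fanOutToLocalSockets` without defining it. One possible shape, assuming each node keeps an in-memory map from tenant to its own open sockets, is sketched below. The `LocalSocket` interface, the `registerSocket` helper, and the return value are assumptions for illustration; `send` and `readyState` mirror the members of the same name on a ws socket.

```typescript
// Minimal socket shape used by this sketch.
interface LocalSocket {
  readyState: number;
  send(data: string): void;
}

const SOCKET_OPEN = 1; // numeric value of WebSocket.OPEN

// tenantId -> sockets connected to THIS node only.
const localSockets = new Map<string, Set<LocalSocket>>();

// Registers a socket and returns a callback to run from its 'close' handler.
function registerSocket(tenantId: string, socket: LocalSocket): () => void {
  const existing = localSockets.get(tenantId);
  const sockets = existing ?? new Set<LocalSocket>();
  if (!existing) localSockets.set(tenantId, sockets);
  sockets.add(socket);
  return () => {
    sockets.delete(socket);
    if (sockets.size === 0) localSockets.delete(tenantId);
  };
}

// Delivers a payload to this node's open sockets; returns how many were sent.
function fanOutToLocalSockets(tenantId: string, payload: unknown): number {
  const sockets = localSockets.get(tenantId);
  if (!sockets) return 0; // no subscribers for this tenant on this node
  const frame = JSON.stringify(payload); // serialize once, not per socket
  let delivered = 0;
  for (const socket of sockets) {
    if (socket.readyState !== SOCKET_OPEN) continue; // skip closing/closed sockets
    socket.send(frame);
    delivered++;
  }
  return delivered;
}
```

Returning early for tenants with no local sockets keeps the cost of receiving another node's broadcast close to a single map lookup.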

Observability checklist #

Real-time systems fail quietly — a wedged socket looks identical to an idle one until you measure it. Instrument these from day one:

  • ws_connections_active
  • ws_handshake_duration_ms: histogram from upgrade received to 101
  • ws_handshake_failures_total: counter labeled by reason (auth, origin, timeout)
  • ws_messages_sent_total / ws_messages_received_total
  • ws_send_buffer_bytes: per-socket bufferedAmount
  • ws_close_total: counter labeled by close code (1000, 1006, 1011)
  • ws_heartbeat_misses_total
  • Log fields on every close: client_id, close_code, duration_ms, bytes_in, bytes_out
  • A trace span per connection (ws.connection)
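As a rough illustration of the close-time signals in this checklist, the sketch below builds the structured log record and bumps a per-close-code counter. It is deliberately library-free: the in-memory `closeTotals` map stands in for whatever metrics client you use, and `ConnectionStats` and `recordClose` are hypothetical names.

```typescript
// Per-connection counters the server is assumed to maintain while the socket is open.
interface ConnectionStats {
  clientId: string;
  openedAtMs: number;
  bytesIn: number;
  bytesOut: number;
}

// Field names follow the log fields listed in the checklist.
interface CloseLog {
  client_id: string;
  close_code: number;
  duration_ms: number;
  bytes_in: number;
  bytes_out: number;
}

// Stand-in for the ws_close_total counter, keyed by close code.
const closeTotals = new Map<number, number>();

function recordClose(
  stats: ConnectionStats,
  closeCode: number,
  nowMs: number = Date.now(),
): CloseLog {
  closeTotals.set(closeCode, (closeTotals.get(closeCode) ?? 0) + 1);
  return {
    client_id: stats.clientId,
    close_code: closeCode,
    duration_ms: nowMs - stats.openedAtMs,
    bytes_in: stats.bytesIn,
    bytes_out: stats.bytesOut,
  };
}

const entry = recordClose(
  { clientId: 'c-1', openedAtMs: 1_000, bytesIn: 10, bytesOut: 20 },
  1006,
  4_500,
);
console.log(JSON.stringify(entry));
// prints {"client_id":"c-1","close_code":1006,"duration_ms":3500,"bytes_in":10,"bytes_out":20}
```

Emitting the record from the socket's close handler means every abnormal closure (for example code 1006) leaves both a countable metric and a searchable log line.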

Failure modes #

| Failure | Symptom | Root cause | Mitigation |
| --- | --- | --- | --- |
| Upgrade stripped at proxy | Client gets 200/426, never 101 | Reverse proxy not forwarding Upgrade/Connection headers | Add the nginx map + proxy_set_header block; verify with curl -i -H 'Upgrade: websocket' |
| Half-open / zombie socket | Server thinks client is connected; no traffic | NAT/firewall dropped the flow silently; no heartbeat | Server-side ping with a missed-pong threshold, then terminate() |
| Idle disconnect at the edge | Sockets die at a fixed interval (e.g. 60s) | Proxy/LB proxy_read_timeout shorter than heartbeat | Set proxy idle timeout above the heartbeat interval (3600s) |
| Reconnect storm | Backend CPU and broker saturate after a blip | Synchronized client reconnects without jitter | Exponential backoff with jitter and a hard retry cap |
| Slow-consumer back-pressure | Node memory climbs, latency rises | bufferedAmount grows faster than the client drains | Watch ws_send_buffer_bytes; drop or disconnect over a threshold |
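The half-open socket mitigation (server-side ping with a missed-pong threshold, then terminate()) might look like the sketch below. `HeartbeatMonitor` is a hypothetical class and the threshold of two missed pongs is an assumption; `ping()` and `terminate()` match the method names on a ws socket, so a real socket should satisfy the `PingableSocket` interface.

```typescript
// Minimal socket shape used by this sketch.
interface PingableSocket {
  ping(): void;
  terminate(): void;
}

class HeartbeatMonitor {
  // socket -> pings sent since the last pong
  private readonly unanswered = new Map<PingableSocket, number>();
  private readonly maxMissed: number;

  constructor(maxMissed: number = 2) {
    this.maxMissed = maxMissed;
  }

  track(socket: PingableSocket): void {
    this.unanswered.set(socket, 0);
  }

  untrack(socket: PingableSocket): void {
    this.unanswered.delete(socket);
  }

  // Call from the socket's 'pong' handler.
  onPong(socket: PingableSocket): void {
    if (this.unanswered.has(socket)) this.unanswered.set(socket, 0);
  }

  // Call on a timer (for example every 30 s). Returns how many sockets were terminated.
  sweep(): number {
    let terminated = 0;
    for (const [socket, missed] of this.unanswered) {
      if (missed >= this.maxMissed) {
        socket.terminate(); // hard close: a half-open peer will never answer a close frame
        this.unanswered.delete(socket);
        terminated++;
        continue;
      }
      this.unanswered.set(socket, missed + 1);
      socket.ping();
    }
    return terminated;
  }
}
```

With a ws server this would typically be wired as `monitor.track(ws)` on connection, `ws.on('pong', () => monitor.onPong(ws))`, `ws.on('close', () => monitor.untrack(ws))`, and a `setInterval(() => monitor.sweep(), 30_000)`; each terminated socket is a natural place to increment ws_heartbeat_misses_total.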

FAQ #

Should I just use Socket.IO instead of choosing a transport myself? #

Socket.IO bundles a transport ladder (WebSocket with HTTP long-poll fallback) and a reconnection layer, which is convenient. The cost is a non-standard wire protocol, a heavier client, and harder debugging when something goes wrong at the network layer. If you need raw control, browser-native WebSocket plus the ws package on the server keeps the protocol transparent and easy to inspect with curl and DevTools. Choose Socket.IO for speed-to-ship; choose raw ws for control and observability.

How do I confirm my reverse proxy is forwarding the Upgrade correctly? #

Send a handshake by hand and look for 101:

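# --http1.1: the Upgrade handshake is HTTP/1.1 only; over TLS curl may otherwise negotiate HTTP/2
curl -i -N --http1.1 -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Version: 13' \
  -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
  https://realtime.example.com/ws/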

A 101 Switching Protocols with a Sec-WebSocket-Accept header means the proxy passed the upgrade through. A 200, 400, or 426 means a hop is stripping or rejecting the headers — check every proxy in the path.

Does this work behind AWS ALB? #

Yes. ALB supports WebSocket upgrades natively and keeps the connection on a single target. Its default idle timeout is 60s, so any connection that stays silent longer than that is closed: either raise the idle timeout or send heartbeats more often than once a minute. For SSE the same idle-timeout rule applies. WebRTC media does not flow through the ALB at all; only the signaling channel does.

When is WebRTC actually worth the complexity? #

Only when you need peer-to-peer data paths or media-grade latency that a server relay cannot provide — live audio/video, low-latency game state between players, or screen sharing. For everything server-mediated (chat, dashboards, notifications), a WebSocket or SSE connection is simpler to deploy, authenticate, and observe.
