System Design · Apr 22, 2026 · 13 min read

Designing a live chat — from WebSocket to delivery receipts

WhatsApp, Slack, Discord, Messenger all solve the same core loop: deliver a message to the right people, confirm arrival, handle offline, show who's there. Here's the five-decision framework that shapes every live-chat product, with the end-to-end data flow you can step through.

The end-to-end flow, from the moment you hit send to the moment the recipient's phone buzzes, runs through these components:

  • Sender (web / mobile)
  • WS Gateway (sticky session)
  • Message Service (fanout + order)
  • Messages DB (append-only log)
  • Presence Cache (Redis, TTL)
  • Push Service (APNs, FCM)
  • Recipient (web / mobile)

Step 1 of 6: the sender composes and hits send. The client optimistically shows the message as 'sending' and emits it over the open WebSocket.

Every live chat product — WhatsApp, Slack, Discord, Messenger, iMessage — solves the same loop. You type. The server confirms delivery. The recipient's client (online or offline) eventually shows the message. Presence and read receipts ride on the same infrastructure.

The decisions underneath that loop are a surprisingly small set. Get them right and the rest scales. Get them wrong and you get the kind of bugs users call "Slack feels slow today" even though the server is fine.

tl;dr

Five decisions shape every chat product. (1) Transport: WebSocket is the default; SSE as a read-only fallback; long-polling only for blocked networks. (2) Message ordering: monotonic sequence numbers from the server (Slack, Discord) vs. client timestamps (rarely works at scale). (3) Delivery contract: at-least-once with client dedupe (every production chat), at-most-once (loses messages), exactly-once (impossible across unreliable networks). (4) Presence granularity: online/offline only (Discord), last-seen-only (older products), typing + online + last-seen (Slack, WhatsApp, Messenger). Cheaper is usually better. (5) Offline story: push notification + resume-on-reconnect with sequence-number replay, not "keep the socket open forever". Everything else (emoji reactions, threads, mentions) is application logic on top.

Decision 1 — Transport

The browser has several ways to keep a connection open. For chat, there's really one answer. WebSocket. Full-duplex, low overhead, battle-tested in every major chat product.

The reasoning:

  • Bidirectional — the client needs to send messages, and the server needs to push them. Short polling and SSE fail one side.
  • Low latency — no HTTP request/response overhead per message.
  • Broadly supported — Chrome, Safari, Firefox, WebView all have it. Corporate proxies occasionally block it, which is why you still need an SSE or long-polling fallback for the ~1% of users whose network is hostile.

Slack, Discord, Messenger, WhatsApp Web, and Linear's message system all use WebSocket as the primary transport. Pick it and move on.

The actual engineering work is around the connection lifecycle:

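  • Exponential backoff on reconnect: start at 1s, cap at 30s, and add random jitter so that clients dropped by the same outage do not all reconnect in lockstep (the thundering-herd problem).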
  • Application-level heartbeat every 20–30s: browsers do not expose the protocol-level WebSocket ping to application code, and a dead TCP connection can sit in the connected state for hours. Your server sends {type: "ping"}, the client responds with {type: "pong"}, and if no pong arrives within 60s the server closes the connection and the client reconnects.
  • Resume-with-sequence on reconnect — the client tells the server "I last saw message #42"; the server flushes everything after 42 from the log. No reconnect-and-miss window.
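
The lifecycle rules above can be sketched as two small helpers. This is a minimal illustration under stated assumptions, not any product's client code: `backoffDelay` and `HeartbeatMonitor` are hypothetical names, the jitter strategy is an assumption (the list only fixes the 1s start and 30s cap), and the random source and clock are passed in so the logic can be exercised without real timers.

```javascript
// Reconnect delay: doubles from 1s per attempt, capped at 30s.
// "Full jitter" (an assumption for this sketch) picks a random point in
// [0, cappedDelay) so clients do not all reconnect at the same instant.
function backoffDelay(attempt, random = Math.random) {
  const baseMs = 1000;
  const capMs = 30000;
  const capped = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * capped);
}

// Tracks the last pong and reports when the connection should be treated as dead.
class HeartbeatMonitor {
  constructor(timeoutMs = 60000, now = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastPongAt = now();
  }

  // Call when a {type: "pong"} frame arrives.
  onPong() {
    this.lastPongAt = this.now();
  }

  // True once no pong has been seen for timeoutMs: close and reconnect.
  isDead() {
    return this.now() - this.lastPongAt >= this.timeoutMs;
  }
}
```

In a real client, `backoffDelay(attempt)` would feed a `setTimeout` before each reconnect attempt, and `attempt` would reset to 0 once a connection succeeds.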

Discord's engineering blog describes their gateway as holding millions of concurrent WebSocket connections, with Elixir/Erlang handling the concurrency. The architecture is the same as Slack's: a gateway layer in front of the core message service, with sticky sessions so reconnects land on the server that already has the conversation cached.

Decision 2 — Message ordering

Two users both send at roughly the same time. Whose message appears first in the conversation?

Wrong answer: client timestamps. Every phone's clock is slightly different. Some are off by hours. Ordering by Date.now() can place Bob's message from 2 minutes ago after Alice's current message, simply because the clock on Bob's phone runs a few minutes fast.

Right answer: the server assigns a monotonically-increasing sequence number to every message in a conversation. The client shows messages in sequence-number order. The server's wall-clock is irrelevant to ordering — only the order of commit to the log matters.

Slack's Gareth (from their engineering blog) frames this as: "The server is the only component with a consistent view of ordering. Everything else is a cache."

Mechanically:

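// Server pseudocode: db and fanout stand in for your storage and delivery layers.
async function onMessage(conversationId, msg) {
  // Atomically allocate the next sequence number for this conversation.
  const seq = await db.incrementAndGet(`seq:${conversationId}`);
  // Copy rather than mutate the caller's object.
  const stored = { ...msg, seq };
  // Persist before fanout so a reconnecting client can always replay it.
  await db.append(`log:${conversationId}`, stored);
  await fanout(conversationId, stored);
  return stored;
}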

On the client, store messages indexed by seq. Render in seq order. A message that arrives out of order (because of network hiccups) slots into the right place, no flicker.
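
A minimal sketch of that client-side store, assuming each message carries the server-assigned `seq`. The class name `OrderedMessages` is hypothetical; the technique is an ordinary binary-search insert into a sorted array.

```javascript
// Keeps messages sorted by server-assigned seq, regardless of arrival order.
class OrderedMessages {
  constructor() {
    this.items = []; // always sorted ascending by seq
  }

  // Binary search for the first index whose seq is >= the target.
  indexFor(seq) {
    let lo = 0;
    let hi = this.items.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.items[mid].seq < seq) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  // Inserts in seq order; a seq that is already present is ignored.
  insert(msg) {
    const i = this.indexFor(msg.seq);
    if (i < this.items.length && this.items[i].seq === msg.seq) return false;
    this.items.splice(i, 0, msg);
    return true;
  }
}
```

Rendering then iterates `items` directly, so a late arrival with `seq` 2 lands between 1 and 3 without re-sorting the whole list.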

Decision 3 — Delivery contract

Three possibilities:

  • At-most-once: server sends, forgets. If the client was briefly offline, the message is lost. Acceptable for some notification systems; unacceptable for chat.
  • At-least-once: server sends; if it doesn't get an ack, it re-sends. Client deduplicates on a message ID. This is what every real chat product ships.
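  • Exactly-once: server guarantees the message arrives once and only once. Over an unreliable network this cannot be guaranteed: a sender that gets no ack cannot tell a lost message from a lost ack, so it must either retry (risking a duplicate) or give up (risking a loss). Don't try; what products call exactly-once is at-least-once delivery plus dedupe.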

Implementing at-least-once with client dedupe is easy:

  1. Every message carries a globally-unique id (UUIDv7 or client-id + counter).
  2. Server persists the message, then fans out. Fanout is at-least-once — the server may retry until it gets a client ack.
  3. Client keeps a set of seen IDs. If the same ID arrives twice, drop the duplicate.

The dedupe set can be bounded (keep the last 1000 IDs). Older messages would have been persisted and re-rendered from a server replay, not a fanout retry, so they're not subject to this dedupe.
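
The bounded dedupe set can be sketched as below. `RecentIds` and `onDelivery` are hypothetical names, and the `render` and `ack` callbacks stand in for whatever the client actually does; the 1000-entry default comes from the paragraph above.

```javascript
// Remembers the most recent `limit` message IDs. A Set iterates in insertion
// order, so its first value is always the oldest and can be evicted cheaply.
class RecentIds {
  constructor(limit = 1000) {
    this.limit = limit;
    this.ids = new Set();
  }

  // Returns true the first time an id is seen, false for a duplicate.
  markSeen(id) {
    if (this.ids.has(id)) return false;
    this.ids.add(id);
    if (this.ids.size > this.limit) {
      const oldest = this.ids.values().next().value;
      this.ids.delete(oldest);
    }
    return true;
  }
}

// Hypothetical receive handler: ack every delivery, render only the first.
function onDelivery(seen, msg, render, ack) {
  ack(msg.id); // ack duplicates too, so the server stops retrying
  if (seen.markSeen(msg.id)) render(msg);
}
```

Acknowledging duplicates matters: a duplicate usually means the earlier ack was lost, so staying silent would only trigger another retry.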

Decision 4 — Presence

Presence is the feature everyone under-appreciates. Users love seeing "Alice is typing…". Engineering teams love it less — it doubles the chattiness of the whole system.

Three tiers, cheapest to most expensive:

  1. Online / offline only (Discord's green dot). Updated when the user connects or disconnects. One write per session.
  2. Last seen (Telegram, when the user has not hidden it, and older chat products from before typing indicators became table stakes). Updated whenever the user is actively interacting. Many writes per session, but cheap to cache: Redis with a 5-minute TTL works fine.
  3. Typing + online + last-seen (Slack, WhatsApp, Messenger, iMessage). Typing events fire on every keystroke (or debounced to every 500ms), plus the online state and last-seen on top. This is the expensive one — and what most modern chat products ship.
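
For the third tier, the cost of typing events depends on how aggressively they are rate-limited. A minimal sketch of limiting them to one per 500ms follows; `makeTypingNotifier` is a hypothetical helper, the event shape is an assumption, and the clock is injectable so the behavior can be checked without waiting.

```javascript
// Wraps `send` so a typing event goes out at most once per intervalMs,
// no matter how many keystrokes occur in between.
function makeTypingNotifier(send, intervalMs = 500, now = Date.now) {
  let lastSentAt = -Infinity;
  return function onKeystroke(conversationId) {
    const t = now();
    if (t - lastSentAt < intervalMs) return false; // still inside the window
    lastSentAt = t;
    send({ type: "typing", conversationId });
    return true;
  };
}
```

With this in place, a burst of ten keystrokes inside half a second produces one outbound event instead of ten.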

Key design principle: presence and messages are different-shaped problems. Messages must be durable, persisted, replayable. Presence is ephemeral — if a presence update is dropped, nobody cares; the next one arrives in seconds. Most chat products split presence into a separate service or channel:

  • Messages: WebSocket → gateway → message service → durable log → fanout.
  • Presence: WebSocket (same connection) → presence service → Redis → fanout on change.

Don't try to ride presence through your durable message pipeline. You'll burn 10x the disk for data nobody needs kept.
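
To make the "ephemeral" half concrete, here is a sketch of presence as an expiring key. An in-memory `Map` stands in for Redis purely for illustration, `PresenceCache` is a hypothetical name, and the 5-minute default mirrors the TTL mentioned in the tiers above.

```javascript
// In-memory stand-in for a cache key with a TTL (an assumption for this sketch;
// a real deployment would rely on the cache's own expiry).
class PresenceCache {
  constructor(ttlMs = 5 * 60 * 1000, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.expiresAt = new Map(); // userId -> expiry timestamp
  }

  // Called on connect and on any activity. Returns true only when the user
  // flips from offline to online, which is the moment to fan out.
  touch(userId) {
    const wasOnline = this.isOnline(userId);
    this.expiresAt.set(userId, this.now() + this.ttlMs);
    return !wasOnline;
  }

  // A user is online until the entry expires; no explicit "offline" write is needed.
  isOnline(userId) {
    const expiry = this.expiresAt.get(userId);
    return expiry !== undefined && expiry > this.now();
  }
}
```

Nothing here is persisted or replayable, and that is the point: a lost update heals itself on the next `touch`.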

Decision 5 — Offline & reconnect

Users close the laptop, put the phone in a pocket, walk into an elevator. The WebSocket drops. Messages keep arriving. When they come back, the chat has to look correct.

The pattern every chat product uses:

  1. Persist every message to a durable log, indexed by conversation + sequence number.
  2. On reconnect, the client tells the server the last seq it saw.
  3. Server queries the log for everything after that seq and flushes it over the restored socket.

That's it. No "hold messages for offline users in RAM". No special queue. The log is the queue.
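
The three steps can be sketched with an array standing in for the durable log. `ConversationLog` and `onReconnect` are hypothetical names, and the sketch assumes sequence numbers are dense and start at 1 within a conversation.

```javascript
// Append-only per-conversation log. An in-memory array stands in for the
// durable store; seq starts at 1 and increases by one per message.
class ConversationLog {
  constructor() {
    this.entries = [];
  }

  append(body) {
    const entry = { seq: this.entries.length + 1, body };
    this.entries.push(entry);
    return entry;
  }

  // Everything strictly after lastSeq. Because seq is dense and 1-based,
  // the entry with seq n lives at index n - 1, so this is a plain slice.
  since(lastSeq) {
    return this.entries.slice(Math.max(0, lastSeq));
  }
}

// Hypothetical reconnect handler: the client reports the last seq it rendered.
function onReconnect(log, lastSeenSeq, send) {
  for (const entry of log.since(lastSeenSeq)) send(entry);
}
```

A brand-new client simply reports `0` and receives the full history (in practice, paginated).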

For mobile specifically, there's an extra layer: the app might not even be running when the message arrives. That's what push notifications are for — and on mobile, you don't deliver them yourself. You hand the payload to one of two gatekeepers and they wake the device up for you:

  • APNs (Apple Push Notification service) on iOS. Every iPhone keeps a single always-on socket to Apple's servers. When your backend sends a push to APNs, Apple routes it over that socket and wakes your app.
  • FCM (Firebase Cloud Messaging) on Android, and a web-push variant for Chrome/Edge. Same shape: one socket from device to Google, and your server pushes through it.

The push is a wake-up, not the message content. Both services cap payloads at a few KB and are operated by Apple and Google — you don't want the full message body sitting there. Standard pattern: send a short notification ("Alice sent you a message"), let the app wake up, and the app pulls the real message from your API. That keeps the content, encryption keys, and revocation logic on your servers.
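
A sketch of what such a content-free wake-up might look like for APNs. The `aps.alert.body` key follows Apple's documented payload format; `conversationId` is a hypothetical custom key, the function names are illustrative, and the 4096-byte cap is an assumed value for the "few KB" limit mentioned above.

```javascript
// Builds a wake-up notification that names the sender but carries no message text.
function buildWakeupPayload(senderName, conversationId) {
  return {
    aps: {
      alert: { body: `${senderName} sent you a message` },
    },
    conversationId, // lets the app know which conversation to fetch
  };
}

// Serialized size in UTF-8 bytes, to compare against the provider's payload cap.
function payloadBytes(payload) {
  return new TextEncoder().encode(JSON.stringify(payload)).length;
}

// Assumed cap for this sketch; check the provider's current documentation.
const ASSUMED_MAX_PAYLOAD_BYTES = 4096;

function fitsInPush(payload) {
  return payloadBytes(payload) <= ASSUMED_MAX_PAYLOAD_BYTES;
}
```

On wake-up, the app would use `conversationId` to pull the real message from your API over an authenticated connection.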

The hard parts nobody lists

A few senior-level gotchas that don't fit in the five decisions but will break your system in year two:

  • Message editing / deletion — editing a message changes the log entry (or appends a new entry that supersedes it). Clients need to reconcile both cases. WhatsApp edits the message in place with a small "edited" marker; Slack keeps the history accessible. Pick a model on day one.
  • Threaded replies — Slack's threads are a separate sequence number space, but they share the main conversation's timeline for push notifications. This is complex; don't try to retrofit threads onto a flat-chat model.
  • End-to-end encryption: WhatsApp and Signal encrypt each message with the Signal Protocol. Your server can only see ciphertext and metadata (who sent what to whom, when). Great for privacy; trades off against search, moderation, and backup, which all normally assume the server can read content. E2EE from day one is easier than retrofitting.
  • Spam / moderation — server-side classification needs to read the message, which E2EE rules out. The realistic compromises are client-side filters (the device scans its own messages), reputation signals from metadata (unusual fan-out patterns, new accounts), and user-driven reporting.
  • Message search: two tiers, and the choice follows your E2EE decision. On-device search: the client decrypts messages as they arrive and stores the cleartext in a local database (SQLite on mobile). Search runs against that local copy, which is fast and private but limited to what's on this phone. That's how WhatsApp, Signal, and iMessage ship full-text search despite being E2EE. New phone or fresh install? They rely on an encrypted cloud backup (iCloud / Google Drive) to rehydrate history. Server-side search: full-text indexing on the server; only possible for non-E2EE products like Slack and Discord. For E2EE products, the server can still index metadata (search by sender, date, chat) but not message body.
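
For the editing and deletion item in the list above, the "append a new entry that supersedes it" model can be sketched as a small reducer on the client. The entry shapes (`message`, `edit`, `delete`) and field names are hypothetical, chosen only to show the reconciliation; this is one possible model, not how any named product stores edits.

```javascript
// Applies one log entry to a client-side store (a Map keyed by message id).
function applyEntry(store, entry) {
  switch (entry.type) {
    case "message":
      store.set(entry.id, { id: entry.id, text: entry.text, edited: false, deleted: false });
      break;
    case "edit": {
      const target = store.get(entry.targetId);
      // Ignore edits to unknown or already-deleted messages.
      if (target && !target.deleted) {
        target.text = entry.text;
        target.edited = true;
      }
      break;
    }
    case "delete": {
      const target = store.get(entry.targetId);
      if (target) {
        target.text = "";
        target.deleted = true;
      }
      break;
    }
  }
  return store;
}
```

Because edits and deletes are ordinary log entries with their own sequence numbers, a reconnecting client picks them up through the same replay path as new messages.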

Sizing — how big is the problem

Rough numbers to hold in your head before sizing servers:

  • Each message on the wire: 200–500 bytes after compression. Short text plus sender ID and timestamps — not much.
  • One user sitting idle: one open WebSocket, about 2 KB per minute of traffic, plus the occasional heartbeat. In steady state, tens of bytes per second: the open socket is the cost, the traffic is noise.
  • One user actively chatting: around 10 messages per hour on average. Bursts of 50+ during meetings or incidents.
  • Sockets per server: a normal cloud VM running Go, Elixir, or well-tuned Node.js holds 50,000–200,000 open WebSockets. Slack runs about a million per host, but that's years of stack-level tuning — ceiling, not starting point.

A product with 10 million monthly active users typically has around a million sockets open at any given moment. You spread users across a fleet of these servers using a stable key (conversation ID works well) — so each server owns a known slice of traffic, and if a user reconnects they land back on the same box that already has their state.
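
The arithmetic and the routing can be sketched together. The figures passed in are the article's rough numbers; the function names are hypothetical, the message-rate figure is an upper bound that assumes every connected user is actively chatting, and hash-modulo routing is a simplifying assumption chosen for brevity.

```javascript
// Gateway hosts needed for a given number of concurrent sockets.
function serversNeeded(concurrentSockets, socketsPerServer) {
  return Math.ceil(concurrentSockets / socketsPerServer);
}

// Upper bound on message rate, assuming every connected user sends at the
// "actively chatting" average of 10 messages per hour.
function messagesPerSecond(connectedUsers, messagesPerUserPerHour = 10) {
  return (connectedUsers * messagesPerUserPerHour) / 3600;
}

// FNV-1a, a simple non-cryptographic 32-bit hash, applied here to the
// string's UTF-16 code units.
function fnv1a(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Maps a stable key (such as a conversation ID) to one of serverCount gateways.
function pickServer(key, serverCount) {
  return fnv1a(key) % serverCount;
}
```

With a million open sockets, `serversNeeded(1_000_000, 50_000)` gives 20 hosts and `serversNeeded(1_000_000, 200_000)` gives 5, so the fleet is small. One caveat with plain modulo: changing `serverCount` remaps most keys, which is why production systems often prefer consistent hashing or a lookup table.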

Primary sources