System Design · Apr 22, 2026 · 15 min read

Designing a collaborative editor — four decisions that shape the rest

Every collaborative product (Figma, Notion, Google Docs, Linear) makes the same four decisions before writing a line of code. Get them right and the rest is implementation; get them wrong and no amount of engineering recovers. Here's the framework, with working examples from the shops that do this well.

The two dominant architectures differ in where truth lives. Here is the server-authoritative one:

Server-authoritative

participants
  • Client A · replica
  • Client B · replica
  • Server · authority
  • Postgres · durable store

data flow
  • Client A → Server (ops, WebSocket)
  • Client B → Server (ops, WebSocket)
  • Server → Clients (broadcast)
  • Server → Postgres (persist)

truth lives at: The server holds the canonical document. Clients are replicas.
latency: Every edit round-trips to the server. Optimistic UI hides the RTT but reconciles on reject.
offline: Client can queue ops while offline; rejoin replays them (with possible conflicts).
failure mode: Server outage = no writes. Clients diverge, then reconcile on reconnect.
examples: Figma, Notion, Google Docs, Linear.

Someone on your team says "we need real-time collaboration". Your instinct is to open a codebase and start coding. Don't.

Every collaborative product — Figma, Notion, Google Docs, Linear, Miro, Excalidraw — resolved the same four decisions before writing a line of code. Each decision has 2–3 reasonable answers and one catastrophic one. The decisions compound: pick authority wrong and your merge strategy is limited; pick granularity wrong and your transport choice is forced; pick transport wrong and your offline story is dead on arrival.

This post is the framework. Read it once before you design the next collaborative feature.

tl;dr

The four decisions, in the order they compound:

  1. Authority: server-authoritative (Figma, Notion, Docs, Linear) vs local-first / P2P (Automerge, Yjs over WebRTC).
  2. Merge strategy: last-writer-wins (property-level, like Figma) vs operational transform (character-level, like Google Docs) vs CRDT (ID-tagged, like Yjs/Automerge).
  3. Data granularity: what's the atomic unit? Block, property, paragraph, character? Coarser = simpler merge but less precise; finer = better UX but more metadata.
  4. Transport: WebSocket (every real product) vs long-polling (corporate-network fallback) vs WebRTC DataChannel (P2P only).

Pick these in order: authority drives what merges you can do, merge drives what granularity is viable, granularity drives how chatty the transport needs to be.

The system end-to-end

Before picking apart the decisions, here's the full system with the data flowing through it:

  • Client A · browser, React
  • Client B · browser, React
  • Load Balancer · sticky session
  • WS Server · ops + presence
  • API Server · auth, REST
  • Ops Log · Postgres, append
  • Snapshot Store · S3, CRDT blob

Step 1 of 4: User A types. The client applies the op locally first (optimistic), then emits it over the open WebSocket.

The diagram is deliberately simplified. No CDN, no Redis cache for presence, no search service. The critical path is what's shown.

The four decisions below directly reshape this diagram. Change the authority model and the server disappears. Change the merge strategy and the payload on every edge changes. Change the transport and one box gets replaced. Here's what the first one does visually:

  • Client A · full replica
  • Client B · full replica
  • Relay (optional) · dumb fanout

Step 1 of 3: A produces a CRDT op locally and emits it. No server in the trust path; the relay only forwards bytes.

Notice what's gone: no WebSocket server, no ops log, no snapshot store. That entire server-side column is optional in local-first because every client holds the complete document. The tradeoff is everything else — access control, search, analytics, billing enforcement — becomes harder without a server that can see the document.

The rest of this post is how to make each of these choices well.

Decision 1 — Authority: who holds the truth?

Every collaborative editor makes a core architectural choice: does a central server hold the canonical document, or does every client hold a complete replica?

Server-authoritative (Figma, Notion, Google Docs, Linear) is the mainstream answer. A server owns the "real" document; clients are optimistic replicas that send operations and receive broadcasts.

Figma's multiplayer servers keep track of the latest value that any client has sent for a given property on a given object.

The server arbitrates conflicts, enforces constraints, writes to durable storage, and manages access control. Clients never need to agree with each other directly — they only need to agree with the server.
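To make "clients only need to agree with the server" concrete, here is a minimal in-memory sketch. The names `AuthoritativeServer`, `Client`, and `Op` are hypothetical, and the network is replaced by direct method calls; the one idea it tries to show is that the server alone assigns the order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Op:
    client_id: str
    payload: dict
    seq: Optional[int] = None  # assigned by the server, never by a client


@dataclass
class AuthoritativeServer:
    """Hypothetical stand-in for the collaboration server."""
    log: List[Op] = field(default_factory=list)  # append-only, canonical order

    def submit(self, op: Op) -> Op:
        # Arrival order at the server becomes the canonical sequence number.
        op.seq = len(self.log) + 1
        self.log.append(op)
        return op  # a real server would broadcast this to every client


@dataclass
class Client:
    client_id: str
    applied: List[int] = field(default_factory=list)

    def receive(self, op: Op) -> None:
        # Clients follow the server's order; they never negotiate with peers.
        self.applied.append(op.seq)


server = AuthoritativeServer()
a, b = Client("A"), Client("B")

for op in (Op("A", {"set": "fill", "value": "red"}),
           Op("B", {"set": "fill", "value": "blue"})):
    committed = server.submit(op)
    for client in (a, b):
        client.receive(committed)
```

Both clients end up with the same sequence of ops because there is exactly one place where ordering is decided. Persistence, access checks, and rejection of invalid ops would all hang off `submit` in a fuller version.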

Local-first / P2P (Automerge, Yjs over WebRTC, Actual Budget, some Zed collab modes) is the contrarian answer. Every client holds the complete document. Ops propagate peer-to-peer (or via a dumb relay). No central authority means no single point of failure and genuine offline capability.

The tradeoff the diagram shows isn't academic: offline support is an order of magnitude better in a local-first design, because the client already has everything. But every other concern (access control, schema migrations, server-side search, analytics) is an order of magnitude harder without a server that can see the full document.

Which to pick? A one-line rule that gets you 90% of the way: if your product charges per seat, you need server-authoritative (access control and billing enforcement live server-side). If your product is a personal data tool or offline-first software, local-first starts to earn the tradeoff.

Decision 2 — Merge strategy: how do you resolve concurrent edits?

Once authority is decided, merge is the next fork.

Take a concrete case:

  • starting state: "Hello world"
  • User A types "planet"
  • User B types "mars"

Under Last-Writer-Wins, each concurrent edit is applied in the order the server receives it. Whoever sends last wins. Figma uses this at the property-value level: "simpler system, easier to reason about", per their multiplayer engineering blog. Here User B wins → "Hello mars".

Primary sources: Figma multiplayer, Automerge.

Three real options. Each picks a point on the convergence/complexity curve.

Last-Writer-Wins (LWW)

The simplest. Every property has a version. When the server gets two concurrent writes, the one it receives last wins (systems without a central server typically compare timestamps instead). Figma does this at the property-value level: if you and a colleague both change the same rectangle's fill colour, one of you wins.

A conflict happens when two clients change the same property on the same object, in which case the document will just end up with the last value that was sent to the server.

The trick is that Figma picked the right granularity to make LWW acceptable: properties, not characters. You never get half your colour and half theirs; you get one of two valid colours. Acceptable for a design tool.
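A property-level LWW store is small enough to sketch in a few lines. This is an illustration under assumptions, not Figma's code: `LwwDocument` and its methods are made-up names, and "last" means last to arrive at the server.

```python
class LwwDocument:
    """Server-side map of (object_id, property) -> latest value received."""

    def __init__(self) -> None:
        self.values = {}

    def write(self, object_id: str, prop: str, value: object) -> None:
        # No merge logic at all: a later arrival overwrites the earlier one.
        self.values[(object_id, prop)] = value

    def read(self, object_id: str, prop: str) -> object:
        return self.values[(object_id, prop)]


doc = LwwDocument()
doc.write("rect1", "fill", "#ff0000")  # user A, arrives first
doc.write("rect1", "fill", "#0000ff")  # user B, arrives second, so B wins
doc.write("rect1", "width", 120)       # different property: no conflict
```

The key is the pair `(object_id, prop)`, which is what keeps the conflict surface small: two users only collide when they touch the same property of the same object.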

When LWW wins: design tools, form builders, structured editors (Notion blocks), CRUD apps with "last edit wins" semantics. Cheap, obvious, fast.

When LWW loses: text editors. If two users concurrently type into the same paragraph, "last writer wins" = one user's typing is lost. Unacceptable.

Operational Transform (OT)

The answer Google Docs uses. Every edit is an operation (insert("x", pos) / delete(pos, len)). The server receives ops from multiple clients concurrently, then transforms each incoming op against the ops it has already applied. The transform preserves each author's intent even after concurrent positions shift.

OT is more powerful than LWW (nothing is lost), but the transform function is notoriously hard to get right. A comprehensive OT needs a transform function for every pair of op types (insert-vs-insert, insert-vs-delete, delete-vs-delete, and so on), which quickly runs to tens of functions. Google Docs famously reported that getting this right took multiple attempts.
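Here is the smallest useful case, insert-vs-insert, as a sketch. It assumes plain-string documents and breaks ties at equal positions by comparing client IDs; `Insert`, `apply_op`, and `transform` are illustrative names, and a real OT system would need the delete cases too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    client_id: str


def apply_op(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has been applied."""
    shifts = against.pos < op.pos or (
        # Same position: the lower client ID goes first (arbitrary but stable).
        against.pos == op.pos and against.client_id < op.client_id
    )
    if shifts:
        return Insert(op.pos + len(against.text), op.text, op.client_id)
    return op


base = "Hello world"
a = Insert(5, ",", "A")   # A adds a comma after "Hello"
b = Insert(11, "!", "B")  # B adds "!" at the end, concurrently

# Each site applies its own op first, then the other op transformed.
site_a = apply_op(apply_op(base, a), transform(b, a))
site_b = apply_op(apply_op(base, b), transform(a, b))
```

Both sites arrive at "Hello, world!" even though they applied the ops in opposite orders. B's insert position had to shift from 11 to 12 on site A because A's comma moved everything after it.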

When OT wins: any text editor at scale where the server can do work (Google Docs).

When OT loses: peer-to-peer, because the "server doing the transform" story doesn't exist without a server.

CRDTs (Conflict-free Replicated Data Types)

The approach Yjs and Automerge use. Each character (or block, or property) gets a globally unique ID, typically a (clientId, counter) pair. Position is not part of the ID; an insert names the ID of the element it sits next to. Merging two ops means looking up IDs and inserting relative to them. No position arithmetic, no transform functions. Order-independent: any client can apply any set of ops and converge.

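The cost: every character carries an ID (metadata overhead), and naive CRDTs can grow unboundedly (old tombstones + every edit's ID lives forever). Modern implementations such as Yjs and Automerge compress this aggressively, for example by storing a run of consecutive insertions from one client as a single item and by using compact binary encodings.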
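A toy sequence CRDT in the RGA style shows the ID-relative insert and the tombstones. This is a sketch under assumptions: IDs are (counter, clientId) pairs with the counter acting as a Lamport clock, every op is delivered exactly once, and an op arrives after the element it anchors to. `Replica` and `Elem` are made-up names, not the Yjs or Automerge API.

```python
from dataclasses import dataclass


@dataclass
class Elem:
    id: tuple              # (counter, client_id)
    char: str
    deleted: bool = False  # tombstone: kept so later ops can still anchor here


class Replica:
    def __init__(self, client_id: str) -> None:
        self.client_id = client_id
        self.clock = 0
        self.elems = []

    def _index(self, id_) -> int:
        return next(i for i, e in enumerate(self.elems) if e.id == id_)

    def local_insert(self, after, char: str):
        self.clock += 1
        op = ((self.clock, self.client_id), after, char)
        self.apply(op)
        return op  # would be sent to the other replicas

    def apply(self, op) -> None:
        id_, after, char = op
        self.clock = max(self.clock, id_[0])  # Lamport clock catch-up
        i = 0 if after is None else self._index(after) + 1
        # Skip concurrent inserts with a larger ID so that every replica
        # picks the same slot, whatever order the ops arrived in.
        while i < len(self.elems) and self.elems[i].id > id_:
            i += 1
        self.elems.insert(i, Elem(id_, char))

    def delete(self, id_) -> None:
        self.elems[self._index(id_)].deleted = True

    def text(self) -> str:
        return "".join(e.char for e in self.elems if not e.deleted)


a, b = Replica("A"), Replica("B")
h = a.local_insert(None, "H")
i = a.local_insert(h[0], "i")
for op in (h, i):
    b.apply(op)

# Concurrent edits: both insert after "i" without seeing each other.
op_a = a.local_insert(i[0], "!")
op_b = b.local_insert(i[0], "?")
a.apply(op_b)
b.apply(op_a)
```

Both replicas read "Hi?!" afterwards. The tie between the two concurrent inserts is settled by comparing IDs, not by a server. Deleting a character only flips its `deleted` flag, which is exactly the tombstone growth the paragraph above warns about.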

When CRDT wins: local-first / P2P (you need order-independence because there's no server to enforce order) and any scenario where the server is untrusted or absent.

When CRDT loses: pure simplicity — a naive OT is actually easier for a simple text-only editor than a naive CRDT.

If you want to see the actual ops that drive these algorithms — the transforms, the ID-tagged characters, the exact sequence of events that makes both clients converge — I walk through the full step-by-step in the companion post: OT vs CRDT — how two clients converge without a referee.

Decision 3 — Data granularity: what's the atomic unit?

This is the decision most teams get wrong, because it feels like an implementation detail. It isn't; it shapes every other choice.

Character-level (Google Docs, Yjs text types). Every character is addressable. Required for real prose editing. Merges text perfectly. Cost: very chatty transport (each keystroke is an op), large metadata overhead in CRDTs.

Block-level (Notion). A block is the atomic unit — a paragraph, list item, image, code block. Within a block, LWW on the text content. Between blocks, ordering is tracked. Good fit for page-shaped content where block-boundary edits are rare and within-block concurrency is tolerable.

Everything you see in Notion is a block. Text, images, lists, a row in a database, even pages themselves — these are all blocks.

Property-level (Figma). Every property on every object gets LWW. Width, fill, rotation, text content of a text layer. Coarse enough for acceptable LWW semantics; fine enough for design-tool UX.

Object-level / "whole document" level (Google Sheets for some operations; CAD tools historically). The entire document is locked to one editor at a time, or merged wholesale. Use only if concurrency is rare.

The right question to ask: what's the smallest unit where "two users edit it at the same time and only one of their edits survives" is acceptable? That's your granularity. If you're building a text editor, it's per-character. If you're building a design tool, per-property. If you're building a database record editor, maybe per-field.
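The effect of granularity is easiest to see by merging the same two concurrent edits at two different units. A small sketch, assuming a document shaped as nested dicts and edits applied in arrival order; the function names are illustrative.

```python
base = {"rect1": {"fill": "red", "width": 100}}

# Two concurrent edits to the same object, made against the same base.
edits = [
    ("rect1", {"fill": "blue"}),  # user A recolours
    ("rect1", {"width": 200}),    # user B resizes
]


def merge_object_level(base: dict, edits: list) -> dict:
    doc = {obj: dict(props) for obj, props in base.items()}
    for obj, changes in edits:
        # Atomic unit = object: each writer sends its full copy of the
        # object, so the last copy replaces everything before it.
        doc[obj] = {**base[obj], **changes}
    return doc


def merge_property_level(base: dict, edits: list) -> dict:
    doc = {obj: dict(props) for obj, props in base.items()}
    for obj, changes in edits:
        # Atomic unit = property: only the touched properties are overwritten.
        for prop, value in changes.items():
            doc[obj][prop] = value
    return doc


coarse = merge_object_level(base, edits)
fine = merge_property_level(base, edits)
```

With the object as the unit, A's recolour is silently lost because B's copy still carried the old fill. With the property as the unit, both edits survive because they never touched the same key.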

Decision 4 — Transport: how do ops flow?

This one matters less for correctness but more for operational complexity.

  • WebSocket — the default. Persistent connection, low latency, broadcast-friendly. Every real collaborative product uses this. Uses HTTP Upgrade to establish; once established, ops flow bidirectionally.
  • WebRTC DataChannel — mandatory for P2P. Harder to operate (STUN/TURN servers, NAT traversal); worth it only if you're actually building peer-to-peer.
  • Server-Sent Events (SSE) — one-way only, server → client. Can be paired with POSTs for client → server. Works when WebSocket is blocked by corporate proxies. Not ideal for editor use; latency is fine but the pattern is awkward.
  • Long-polling — legacy fallback. Don't pick it unless you have no other option.

See the connections-on-the-web post for the full transport comparison — this post's focus is the app-layer protocol, not the wire format.

The operational reality: WebSocket + a small handshake protocol is what every modern collaborative product uses. The interesting work is application-level (presence, document state, user metadata) — not transport.
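One plausible shape for that small handshake, sketched as plain JSON messages: the client says which sequence number it last saw, and the server replays what it missed. The message fields (`type`, `doc`, `lastSeq`) and both function names are assumptions for illustration, not any product's actual protocol, and the WebSocket itself is left out.

```python
import json


def hello(doc_id: str, last_seq: int) -> str:
    """First message a (re)connecting client sends."""
    return json.dumps({"type": "hello", "doc": doc_id, "lastSeq": last_seq})


def handle_hello(raw: str, log: list) -> list:
    """Server side: reply with every op the client has not seen yet."""
    msg = json.loads(raw)
    if msg.get("type") != "hello":
        raise ValueError("expected a hello message first")
    missed = [op for op in log if op["seq"] > msg["lastSeq"]]
    return [json.dumps({"type": "op", **op}) for op in missed]


# Server-side ops log for one document (in memory here).
log = [
    {"seq": 1, "data": "insert H"},
    {"seq": 2, "data": "insert i"},
    {"seq": 3, "data": "insert !"},
]

# A client that saw seq 1 before dropping off reconnects.
replay = handle_hello(hello("doc-1", 1), log)
```

After the replay the connection switches to live ops in both directions. Everything interesting from here on is about what the ops mean, which is why the transport choice rarely decides the design.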

Presence is a second system

One more framing error teams make: lumping presence (cursors, selections, user avatars, typing indicators) into the same system as document state. These are different-shaped problems and lumping them together causes pain.

  • Document state: the ops covered above. Must be durable, persisted, conflict-resolved, replayable.
  • Presence: ephemeral. Someone's cursor position is meaningless once they disconnect. Doesn't need to be persisted. Can be lost on reconnect without data loss.

Most collaborative products split presence into a separate channel / service. Figma describes a comparable split for its non-document data, which bypasses the multiplayer system entirely:

We also sync changes to a lot of other data (comments, users, teams, projects, etc.) but that is stored in Postgres, not our multiplayer system.

Practical split: document ops go through the authoritative collaboration server; presence piggy-backs on the same WebSocket but is fire-and-forget (dropped on the floor if any hop fails). Two pipelines, one transport.
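"Two pipelines, one transport" can be as simple as a dispatch on message kind. A sketch with hypothetical names (`CollabRoom`, the `kind` field); the durable store is just a list here.

```python
from dataclasses import dataclass, field


@dataclass
class CollabRoom:
    ops_log: list = field(default_factory=list)   # durable pipeline
    presence: dict = field(default_factory=dict)  # ephemeral, memory only

    def on_message(self, msg: dict) -> None:
        if msg["kind"] == "op":
            # Document state: must be persisted and replayable.
            self.ops_log.append(msg)
        elif msg["kind"] == "presence":
            # Presence: overwrite in place, never persisted, safe to lose.
            self.presence[msg["user"]] = msg["cursor"]

    def on_disconnect(self, user: str) -> None:
        # A cursor is meaningless once its owner is gone.
        self.presence.pop(user, None)


room = CollabRoom()
room.on_message({"kind": "presence", "user": "ana", "cursor": 14})
room.on_message({"kind": "op", "user": "ana", "insert": "x", "pos": 14})
room.on_message({"kind": "presence", "user": "ana", "cursor": 15})
room.on_disconnect("ana")
```

After the disconnect the op is still in the log and the cursor is gone, which is the whole point of the split: one side is a history, the other is a snapshot of who is here right now.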

Undo — the boss-level problem

A genuine senior-level gotcha that rarely makes the short list: undo in a multi-user editor is a different problem than undo in a single-user editor.

In a single-user editor, Cmd+Z walks a stack of past states. Straightforward. In a multi-user editor:

  • If you undo "everyone's last edit", you undo your colleague's work too. Users hate this.
  • If you undo "only my last edit", you need a per-user ops log. Each user's undo walks only their own ops, skipping everyone else's.

Every serious collaborative editor ships per-user undo. The implementation is an ops log tagged with authorId, and the client walks backward through that user's ops only. It's the kind of feature that looks trivial in the spec doc and takes three sprints to get right.
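A stripped-down version of that author-tagged log, to show the shape of the problem. The names (`Doc`, `Op`, `set_value`) are hypothetical, and one policy here is my assumption: an undo only restores the old value if nobody has overwritten the key since.

```python
from dataclasses import dataclass


@dataclass
class Op:
    author_id: str
    key: str
    old: object
    new: object
    undone: bool = False


class Doc:
    def __init__(self, state: dict) -> None:
        self.state = dict(state)
        self.log = []  # every op, from every author, in applied order

    def set_value(self, author_id: str, key: str, value: object) -> None:
        self.log.append(Op(author_id, key, self.state.get(key), value))
        self.state[key] = value

    def undo(self, author_id: str):
        # Walk backward, skipping everyone else's ops.
        for op in reversed(self.log):
            if op.author_id == author_id and not op.undone:
                op.undone = True
                # Assumed policy: don't clobber a colleague's later write.
                if self.state.get(op.key) == op.new:
                    self.state[op.key] = op.old
                return op
        return None


doc = Doc({"title": "Draft", "fill": "red"})
doc.set_value("ana", "title", "Spec")
doc.set_value("ben", "fill", "blue")  # ben's edit is the most recent overall
undone = doc.undo("ana")              # but ana's undo skips it
```

Ana's Cmd+Z reverts the title to "Draft" and leaves Ben's fill untouched. A production version would emit the undo as a new op through the normal pipeline, and would need redo plus a real answer for overlapping edits, which is where the three sprints go.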

The one-screen summary

If you take one thing from this post, it's the dependency ordering:

  1. Authority first — server-authoritative or local-first. Driven by product requirements (billing, offline, privacy).
  2. Merge strategy second — driven by authority + content type. Text → OT or CRDT; structured objects → LWW is fine.
  3. Granularity third — what's the atomic unit? Smallest where LWW semantics would be acceptable.
  4. Transport fourth — WebSocket unless you have a specific reason otherwise.

Then the second-order concerns: presence as a separate channel, per-user undo from day one, schema migrations (hard problem, not covered here), access control (server-side only, enforced on every op).

Get these right in the whiteboard session. They compound for the next three years of engineering.

Primary sources