How to Use This Document
This document is structured as a 45-60 minute technical interview script or lecture. It flows from foundational concepts to advanced distributed systems patterns. Each section includes:
- What to say (the teaching content)
- What to draw (diagrams to sketch)
- Key numbers (statistics to cite)
- Interview tips (how to present in an interview setting)
Part 1: Setting the Stage (5 minutes)
1.1 Opening Statement
“Today we’re designing a large-scale chat system for companies, similar to Slack. Before diving into the architecture, I want to establish the scope and understand what makes this problem interesting at scale.”
Key insight to share: “The fundamental challenge in chat systems is balancing real-time delivery with reliability. We need messages to arrive instantly, but we also can’t lose them. At FAANG scale, this means handling billions of messages daily while maintaining sub-second latency.”
1.2 Clarifying Requirements (Always Start Here)
In an interview, always ask clarifying questions first. This shows structured thinking and helps scope the problem appropriately.
Functional Requirements (What the system does)
Ask the interviewer to confirm:
- Channel creation — Can users create channels with multiple participants? What about direct messages (1:1)?
- Messaging — Send and receive messages in real-time within channels
- Offline delivery — Users receive messages sent while they were offline
- Media files — Support for images, documents, and other attachments
- Multi-tenancy — Multiple companies (workspaces) use the same infrastructure with complete data isolation
Below the line (mention but deprioritize):
- Message editing and deletion
- Reactions and threading
- Search functionality
- Private channels with access control
Non-Functional Requirements (How the system performs)
| Requirement | Target | Rationale |
|---|---|---|
| Latency | < 500ms end-to-end | Users expect “instant” messaging |
| Throughput | 1 billion+ messages/day | Enterprise scale |
| Availability | 99.9% - 99.99% | Chat is mission-critical for businesses |
| Delivery guarantee | At-least-once with deduplication | Messages must not be lost |
| Consistency | Eventual for display, ordered within channels | Perfect global ordering is unnecessary |
Below the line:
- End-to-end encryption
- Compliance and audit logging
- Geographic data residency
Scale Estimation (Back-of-envelope)
“Let me estimate the scale we’re designing for:”
Users: 100 million registered users
DAU: 20 million daily active users
Concurrent: 5 million simultaneous connections (peak)
Messages/day: 1 billion messages
Messages/second: ~12,000 average, ~50,000 peak

Message size: ~1 KB average (text + metadata)
Storage/day: 1 billion × 1 KB = 1 TB/day
Storage/year: 365 TB/year (before replication)

Connections: 5 million WebSocket connections
  At ~30 KB RAM per connection = 150 GB RAM for connections alone
  Need ~5 servers at 1M connections each (aggressive)
  Or ~50 servers at 100K connections each (conservative)
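The arithmetic above can be reproduced in a few lines of Python (all inputs are the estimates stated above):

```python
# Back-of-envelope numbers from the estimates above.
messages_per_day = 1_000_000_000
avg_msgs_per_sec = messages_per_day / 86_400          # seconds in a day

storage_per_day_tb = messages_per_day * 1_000 / 1e12  # ~1 KB per message
storage_per_year_tb = storage_per_day_tb * 365

concurrent_connections = 5_000_000
ram_for_connections_gb = concurrent_connections * 30_000 / 1e9  # ~30 KB each

print(f"~{avg_msgs_per_sec:,.0f} msg/s average")  # ~11,574 msg/s
print(f"{storage_per_day_tb:.1f} TB/day, {storage_per_year_tb:.0f} TB/year")
print(f"{ram_for_connections_gb:.0f} GB RAM just to hold connections")
```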
Part 2: Core Design Foundations (10 minutes)
2.1 Core Entities
“Let me identify the fundamental data entities we’ll be working with:”
┌─────────────────────────────────────────────────────────────────┐
│ CORE ENTITIES │
├─────────────────────────────────────────────────────────────────┤
│ Workspace │ The company/organization container │
│ Channel │ A conversation space (public or private) │
│ User │ A person who can send/receive messages │
│ Membership │ Relationship between users and channels │
│ Message │ The content sent within a channel │
│ Device │ A user's connected client (phone, laptop, etc.) │
└─────────────────────────────────────────────────────────────────┘
2.2 API Design
“I’ll define the core APIs. I prefer to think about these in terms of the user actions they enable:”
Interview tip: When listing APIs, focus on the most important ones first. Don’t list every possible API. It wastes time and can make you seem unprepared if you can’t explain all of them.
Essential APIs
# Channel Management
POST /workspaces/{workspace_id}/channels
 → create_channel(workspace_id, name, is_private, initial_members[])
 → Returns: channel_id

POST /channels/{channel_id}/members
 → add_member(channel_id, user_id, role)
 → Returns: membership_id

# Messaging
POST /channels/{channel_id}/messages
 → send_message(channel_id, sender_id, content, idempotency_key)
 → Returns: message_id, sequence_number

GET /channels/{channel_id}/messages?after={seq}&limit={n}
 → get_messages(channel_id, after_sequence, limit)
 → Returns: messages[], has_more

# Real-time (WebSocket)
WS /connect?token={auth_token}
 → Establishes bidirectional connection
 → Server pushes: new_message, presence_update, typing_indicator
 → Client sends: heartbeat, subscribe_channel, mark_read

# File Upload (separate flow)
POST /files/upload-url
 → get_presigned_url(file_name, file_size, content_type)
 → Returns: upload_url, file_id

POST /channels/{channel_id}/messages
 → send_message(channel_id, sender_id, content, file_ids[])
API Design Principles
- Include workspace_id in paths — Essential for multi-tenancy and sharding
- Use idempotency keys — Prevents duplicate messages on retry
- Cursor-based pagination — Use sequence numbers, not page offsets
- Separate file upload — Don’t mix large binary data with message APIs
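To make the cursor-based pagination principle concrete, here is a sketch of a client paging through history with the `after={seq}` cursor. `MESSAGES` and `fetch_page` are in-memory stand-ins for the server's store and the GET endpoint, not real API code:

```python
# Sketch: cursor-based pagination over per-channel sequence numbers.
# `fetch_page` mimics GET /channels/{id}/messages?after={seq}&limit={n}.
MESSAGES = [{"message_seq": i, "body": f"msg {i}"} for i in range(1, 11)]

def fetch_page(after_seq: int, limit: int):
    page = [m for m in MESSAGES if m["message_seq"] > after_seq][:limit]
    has_more = bool(page) and page[-1]["message_seq"] < MESSAGES[-1]["message_seq"]
    return page, has_more

def fetch_all(limit: int = 3):
    cursor, out = 0, []
    while True:
        page, has_more = fetch_page(cursor, limit)
        out.extend(page)
        if not page or not has_more:
            return out
        cursor = page[-1]["message_seq"]  # advance by sequence, never by offset

history = fetch_all()
```

Because the cursor is a sequence number rather than a page offset, concurrent inserts can't cause skipped or duplicated rows between pages.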
2.3 High-Level Architecture Overview
“Let me draw the high-level architecture, then we’ll dive into each component:”
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLIENTS │
│ (Web, Mobile, Desktop Apps) │
└─────────────────────────────────────────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOAD BALANCER │
│ (SSL termination, routing, rate limiting) │
└─────────────────────────────────────────────────────────────────────────────┘
 │ │
 ┌───────────┘ └───────────┐
 ▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ API SERVERS │ │ GATEWAY SERVERS │
│ (HTTP REST requests) │ │ (WebSocket connections) │
│ - Auth, file upload │ │ - Real-time delivery │
│ - Message send │ │ - Presence, typing │
└──────────────────────────┘ └──────────────────────────┘
 │ │
 └──────────────────┬─────────────────────────┘
 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CORE SERVICES │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Channel │ │ Message │ │ Presence │ │ Notification│ │
│ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
 │
 ┌──────────────────┼──────────────────┐
 ▼ ▼ ▼
┌──────────────────────┐ ┌─────────────┐ ┌──────────────────────┐
│ MESSAGE QUEUE │ │ CACHE │ │ DATABASES │
│ (Kafka/Redis Pub/Sub)│ │ (Redis) │ │ (MySQL/Cassandra) │
└──────────────────────┘ └─────────────┘ └──────────────────────┘
Part 3: Building the Basic System (10 minutes)
3.1 Start Simple: Single Server MVP
“Let me start with the simplest working system, then evolve it to handle scale.”
┌──────────┐ ┌──────────────────┐ ┌────────────┐
│ Client │ ◄─────► │ Chat Server │ ◄─────► │ Database │
│ (User) │ WS │ (Single Node) │ SQL │ (MySQL) │
└──────────┘ └──────────────────┘ └────────────┘
Single Server Workflow: Sending a Message
- User A sends message via WebSocket to Chat Server
- Chat Server validates the request (auth, permissions)
- Chat Server stores message in database
- Chat Server looks up channel members from database
- Chat Server pushes message to all online members via their WebSocket connections
What’s wrong with this?
- Single point of failure
- Can’t scale beyond one server’s connection limit (~100K-500K connections)
- All members must be connected to the same server
- Database becomes bottleneck
3.2 Database Schema (Foundation)
“Before scaling, let me establish the data model. This schema works for both SQL and NoSQL with modifications.”
Core Tables
-- Workspace: The company/organization
CREATE TABLE workspace (
 workspace_id BIGINT PRIMARY KEY,
 name VARCHAR(255),
 plan ENUM('free', 'pro', 'enterprise'),
 created_at TIMESTAMP
);

-- Channel: A conversation space
CREATE TABLE channel (
 workspace_id BIGINT,
 channel_id BIGINT,
 name VARCHAR(255),
 is_private BOOLEAN DEFAULT FALSE,
 created_at TIMESTAMP,
 PRIMARY KEY (workspace_id, channel_id)
);

-- Channel Membership: Who is in which channel
CREATE TABLE channel_member (
 workspace_id BIGINT,
 channel_id BIGINT,
 user_id BIGINT,
 role ENUM('member', 'admin', 'owner'),
 joined_at TIMESTAMP,
 PRIMARY KEY (workspace_id, channel_id, user_id)
);

-- User's view of channels (for quick lookup)
CREATE TABLE user_channel (
 workspace_id BIGINT,
 user_id BIGINT,
 channel_id BIGINT,
 last_read_seq BIGINT DEFAULT 0, -- For unread counts
 last_delivered_seq BIGINT DEFAULT 0, -- For offline catchup
 muted BOOLEAN DEFAULT FALSE,
 PRIMARY KEY (workspace_id, user_id, channel_id)
);

-- Messages
CREATE TABLE message (
 workspace_id BIGINT,
 channel_id BIGINT,
 message_seq BIGINT, -- Per-channel sequence number
 message_id BIGINT, -- Global unique ID (Snowflake)
 sender_id BIGINT,
 body TEXT,
 created_at TIMESTAMP,
 PRIMARY KEY (workspace_id, channel_id, message_seq)
);
Why This Schema Design?
- workspace_id in every table — Enables multi-tenancy and sharding
- Composite primary keys — (workspace_id, channel_id, message_seq) colocates related data
- Dual membership tables — channel_member for “who’s in this channel”, user_channel for “what channels is this user in”
- Sequence numbers — message_seq enables ordered pagination and gap detection
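For instance, the two cursors in `user_channel` make unread counts a simple range count past `last_read_seq`. A sketch against an in-memory list of sequence numbers (real code would run the SQL shown in the comment):

```python
# Sketch: unread count = messages in the channel past the user's last_read_seq.
def unread_count(channel_seqs, last_read_seq):
    # SQL equivalent: SELECT COUNT(*) FROM message
    #   WHERE workspace_id = ? AND channel_id = ? AND message_seq > ?
    return sum(1 for seq in channel_seqs if seq > last_read_seq)

count = unread_count(channel_seqs=[1, 2, 3, 4, 5], last_read_seq=3)
```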
3.3 Communication Protocols
“Let me explain why we use different protocols for different operations:”
Protocol Comparison
| Protocol | Use Case | Characteristics |
|---|---|---|
| HTTP/REST | Sending messages, file upload, auth | Request-response, stateless, easy to load balance |
| WebSocket | Receiving messages, presence, typing | Bidirectional, persistent connection, server push |
| Long Polling | Fallback when WebSocket unavailable | Higher latency, more overhead |
| HTTP/2 + SSE | Alternative to WebSocket | Multiplexed, unidirectional server push |
Why WebSocket for Receiving?
“The receiver side needs server push capability. Let me compare the options:”
Polling (Bad)
Client: "Any new messages?" → Server: "No"
Client: "Any new messages?" → Server: "No"
Client: "Any new messages?" → Server: "Yes, here's one"
- Wastes resources checking constantly
- High latency (message waits for next poll)
Long Polling (Better)
Client: "Any new messages? I'll wait..."
Server: [holds connection for 30 seconds]
Server: "Here's a message!" [or timeout]
- Reduces requests but still overhead of reconnecting
- Each response requires new connection
WebSocket (Best)
Client: "Let's open a persistent connection"
Server: [keeps connection open]
Server: "New message!" [instant push]
Server: "Another message!" [instant push]
- Single persistent connection
- Sub-100ms delivery latency
- Efficient for high-frequency updates
Connection Lifecycle
┌─────────────────────────────────────────────────────────────────────────┐
│ WebSocket Connection Lifecycle │
└─────────────────────────────────────────────────────────────────────────┘

1. ESTABLISH CONNECTION
 Client ──── HTTP Upgrade Request ────► Server
 Client ◄─── 101 Switching Protocols ── Server
 [WebSocket connection established]

2. AUTHENTICATE
 Client ──── {type: "auth", token: "..."} ────► Server
 Client ◄─── {type: "auth_ok", user_id: 123} ── Server

3. SUBSCRIBE TO CHANNELS
 Client ──── {type: "subscribe", channels: [...]} ────► Server
 Client ◄─── {type: "subscribed", channels: [...]} ──── Server

4. STEADY STATE
 Client ◄─── {type: "message", channel_id: 1, ...} ──── Server (push)
 Client ◄─── {type: "presence", user_id: 5, status: "online"} ── Server
 Client ──── {type: "heartbeat"} ────► Server (every 30 seconds)

5. DISCONNECTION
 Client ──── Close frame ────► Server
 [or Server ──── Close frame ────► Client]
 [or Connection times out after missed heartbeats]
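The client side of the steady state above boils down to routing incoming JSON frames by their `type` field. A minimal sketch of that dispatcher (frame names follow the lifecycle; the handler wiring itself is illustrative, not a real client library):

```python
# Sketch: routing WebSocket frames from the lifecycle above by "type".
import json

class ChatClient:
    def __init__(self):
        self.user_id = None
        self.channels = set()
        self.messages = []

    def on_frame(self, raw: str):
        # Dispatch to _on_<type> if such a handler exists.
        frame = json.loads(raw)
        handler = getattr(self, f"_on_{frame['type']}", None)
        if handler:
            handler(frame)

    def _on_auth_ok(self, frame):
        self.user_id = frame["user_id"]

    def _on_subscribed(self, frame):
        self.channels.update(frame["channels"])

    def _on_message(self, frame):
        self.messages.append(frame)

client = ChatClient()
client.on_frame('{"type": "auth_ok", "user_id": 123}')
client.on_frame('{"type": "subscribed", "channels": [1, 2]}')
client.on_frame('{"type": "message", "channel_id": 1, "body": "hi"}')
```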
Part 4: Scaling to Distributed System (15 minutes)
4.1 The Multi-Server Challenge
“When we have multiple servers, a fundamental problem emerges: the sender and receiver may be connected to different servers.”
┌──────────┐ ┌──────────┐
│ User A │ │ User B │
│ (Sender) │ │(Receiver)│
└────┬─────┘ └────┬─────┘
 │ │
 ▼ ▼
┌──────────┐ ┌──────────┐
│ Server 1 │ HOW DOES MESSAGE GET HERE? ────► │ Server 2 │
└──────────┘ └──────────┘
Two solutions:
- Consistent Hashing — Deterministically route users to servers
- Pub/Sub Message Queue — Broadcast messages to all servers
4.2 Solution 1: Consistent Hashing
“We can use consistent hashing to ensure all members of a channel connect to the same server.”
How It Works
Hash Ring:
 Server A
 │
 ┌──────────┼──────────┐
 ╱ ╲
 Server D Server B
 ╲ ╱
 └──────────┬──────────┘
 │
 Server C

Channel 123 → hash(123) → position on ring → Server B
All users in Channel 123 connect to Server B
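The ring mapping above can be sketched in a few lines — a minimal consistent-hash ring with virtual nodes. The hash function (MD5) and vnode count are arbitrary choices for illustration:

```python
# Sketch: consistent hashing with virtual nodes, as in the ring above.
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=64):
        # Each server appears `vnodes` times on the ring to smooth distribution.
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, channel_id) -> str:
        # Walk clockwise to the first vnode at or after the key's position.
        idx = bisect.bisect(self.keys, self._hash(str(channel_id))) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c", "server-d"])
owner = ring.server_for(123)  # every lookup of channel 123 returns the same server
```

Removing one server only remaps the channels that server owned; everything else keeps its assignment, which is the property that plain `hash % N` lacks.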
What are the drawbacks of the consistent hashing approach?
- If a server goes down, all channels mapped to it lose connectivity until reassignment
- Adding/removing servers causes many channels to remap (connection churn)
- Hot channels (10K+ members) still create hotspots
- Complex to manage and monitor
Slack’s CHARM System
Slack uses Consistent Hash Ring Managers (CHARMs):
- Each channel_id maps to exactly one Channel Server
- 64-256 virtual nodes per physical server (smooths distribution)
- When server fails, only ~1/N of channels need reassignment
- Slack reports: Unhealthy server replacement completes in under 20 seconds
Limitations
- Doesn’t work when user is in multiple channels on different servers
- Server addition/removal causes connection migrations
4.3 Solution 2: Pub/Sub Architecture (Preferred)
“The better solution decouples message routing from connection management using pub/sub.”
┌──────────────────────────────────────────────────────────────────────────┐
│ PUB/SUB ARCHITECTURE │
└──────────────────────────────────────────────────────────────────────────┘

┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ User A │ │ User B │ │ User C │ │ User D │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
 │ WS │ WS │ WS │ WS
 ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ GATEWAY SERVERS (WebSocket Connections) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Gateway 1 │ │ Gateway 2 │ │ Gateway 3 │ │
│ │ Users: A, B │ │ Users: C │ │ Users: D │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼─────────────────┼─────────────────┼───────────────────────────┘
 │ │ │
 │ Subscribe │ Subscribe │ Subscribe
 │ to topics │ to topics │ to topics
 ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ PUB/SUB SYSTEM │
│ (Redis Pub/Sub or Kafka) │
│ │
│ Topics: channel:123, channel:456, user:789:devices, ... │
└─────────────────────────────────────────────────────────────────────────┘
 ▲
 │ Publish message to channel:123
 │
┌─────────┴───────┐
│ Message │
│ Service │
└─────────────────┘
Message Flow with Pub/Sub
- User A sends message to Channel 123 via HTTP POST
- API Server validates and stores message in database
- API Server publishes to pub/sub topic channel:123
- All Gateway Servers subscribed to channel:123 receive the message
- Each Gateway pushes message to connected users who are members of Channel 123
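The fanout steps above can be sketched with an in-memory stand-in for Redis Pub/Sub (the topic name follows the `channel:{id}` convention from the diagram; `delivered` stands in for each user's WebSocket):

```python
# Sketch: pub/sub fanout. A dict of topic -> subscriber callbacks stands in
# for Redis Pub/Sub; each "gateway" pushes to its locally connected users.
from collections import defaultdict

class PubSub:
    def __init__(self):
        self.topics = defaultdict(list)

    def subscribe(self, topic, callback):
        self.topics[topic].append(callback)

    def publish(self, topic, message):
        for cb in self.topics[topic]:
            cb(message)

bus = PubSub()
delivered = defaultdict(list)  # user_id -> messages pushed over "their socket"

def make_gateway(local_users):
    def on_message(msg):
        for user in local_users:
            delivered[user].append(msg)  # would be a websocket.send() in reality
    return on_message

# Gateways 1 and 2 each have members of channel 123 connected.
bus.subscribe("channel:123", make_gateway(["A", "B"]))
bus.subscribe("channel:123", make_gateway(["C"]))

# The Message Service publishes once; every subscribed gateway fans out locally.
bus.publish("channel:123", {"channel_id": 123, "body": "hello"})
```

The key property: the publisher never needs to know which gateway holds which connection.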
Pub/Sub Technology Choices
| Technology | Latency | Durability | Throughput | Best For |
|---|---|---|---|---|
| Redis Pub/Sub | Sub-ms | None (fire-and-forget) | High | Typing indicators, presence |
| Redis Streams | Sub-ms | Yes (persisted) | High | Messages with replay capability |
| Kafka | 10-100ms | Excellent | Very high | Audit logs, analytics, durability |
| NATS | 0.1-0.4ms | Optional | 3M+ msg/sec | Lightweight real-time delivery |
Discord uses: Redis Pub/Sub for real-time fanout
Slack uses: Custom internal pub/sub + Kafka for durability
Hybrid Approach (Production Pattern)
┌─────────────────────────────────────────────────────────────────────────┐
│ HYBRID MESSAGE FLOW │
└─────────────────────────────────────────────────────────────────────────┘

 Message from User A
 │
 ▼
 ┌───────────────┐
 │ API Server │
 └───────┬───────┘
 │
 ┌────────────────┼────────────────┐
 ▼ ▼ ▼
 ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
 │ Database │ │ Redis Pub/ │ │ Kafka │
 │ (persist) │ │ Sub (fast) │ │ (durable) │
 └─────────────┘ └──────┬──────┘ └─────────────┘
 │
 ▼
 Gateway Servers
 (push to users)
Why both Redis and Kafka?
- Redis for instant delivery (sub-millisecond)
- Kafka for durability, replay, and downstream consumers (search indexing, analytics)
4.4 Stateless Gateway Design
“Gateway servers must be stateless for horizontal scaling and fault tolerance.”
What “Stateless” Means
Stateful (Bad):
# Gateway server stores user session in local memory
class GatewayServer:
    def __init__(self):
        self.sessions = {}     # user_id → session_data
        self.connections = {}  # user_id → websocket

    def on_connect(self, user_id, websocket):
        self.sessions[user_id] = load_user_data(user_id)
        self.connections[user_id] = websocket
If this server crashes, all session data is lost.
Stateless (Good):
# Gateway server stores session in Redis
class GatewayServer:
    def __init__(self):
        self.redis = Redis()
        self.connections = {}  # Local connection map only

    def on_connect(self, user_id, websocket):
        session = self.redis.get(f"session:{user_id}")  # External storage
        self.connections[user_id] = websocket
        # Register this gateway as handling this user
        self.redis.hset(f"user:{user_id}:gateways", self.server_id, time.time())
If server crashes, user reconnects to any gateway and resumes from Redis.
Session Data in Redis
┌─────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED SESSION STORAGE │
└─────────────────────────────────────────────────────────────────────────┘

Redis Keys:
 session:{user_id}
 → {workspace_id, subscribed_channels[], last_seen_seq:{channel_id → seq}}

 user:{user_id}:gateways
 → {gateway_1: timestamp, gateway_2: timestamp} (multi-device)

 presence:{user_id}
 → {status: "online", last_heartbeat: timestamp} TTL: 60 seconds

 channel:{channel_id}:subscribers
 → Set of user_ids currently subscribed
Reconnection Flow
1. User's connection drops (network issue, server crash)

2. Client detects via:
 - WebSocket onclose event
 - Missed heartbeat response (no pong for 60s)

3. Client initiates reconnection:
 - Uses exponential backoff with jitter
 - Connects to any available gateway (via load balancer)

4. New gateway:
 - Authenticates user
 - Retrieves session from Redis
 - Subscribes to user's channels
 - Fetches missed messages since last_delivered_seq

5. Client is back online with no message loss
4.5 Database Scaling
“The database is often the hardest component to scale. Let me explain the strategies.”
Sharding Strategies
Option 1: Shard by workspace_id

Shard 1: workspaces 1-1000
Shard 2: workspaces 1001-2000
...

- Pros: Complete tenant isolation, simple routing
- Cons: Large workspaces become hotspots (Slack's #general channel problem)

Option 2: Shard by channel_id (Slack's current approach)

Shard = hash(workspace_id, channel_id) % num_shards

- Pros: Spreads hot channels across shards
- Cons: Cross-channel queries require scatter-gather
Option 3: Shard by (channel_id, time_bucket) (Discord’s approach)
Partition Key: (channel_id, bucket)
Bucket: 10-day time window
Clustering Key: message_id (Snowflake)
- Pros: Bounded partition size, efficient time-range queries
- Cons: Complex composite queries
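Options 2 and 3 each reduce to a small routing function. A sketch (the 10-day bucket width follows the Discord scheme above; the shard count and hash function are arbitrary illustration choices):

```python
# Sketch: shard routing for Option 2 (hash) and Option 3 (time buckets).
import hashlib

NUM_SHARDS = 16
BUCKET_SECONDS = 10 * 24 * 3600  # Discord-style 10-day window

def shard_for(workspace_id: int, channel_id: int) -> int:
    # Option 2: stable hash of (workspace_id, channel_id) picks the shard.
    digest = hashlib.md5(f"{workspace_id}:{channel_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def partition_key(channel_id: int, created_at_unix: int):
    # Option 3: (channel_id, bucket) keeps each partition bounded in size.
    return (channel_id, created_at_unix // BUCKET_SECONDS)

s = shard_for(42, 123)
```

Note that `shard_for` must never use Python's built-in `hash()`, which is randomized per process; routing hashes have to be stable across servers and restarts.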
Discord’s Scaling Journey
| Metric | Cassandra (Before) | ScyllaDB (After) |
|---|---|---|
| Nodes | 177 | 72 |
| P99 Read Latency | 40-125ms | 15ms |
| P99 Write Latency | 5-70ms (variable) | 5ms (stable) |
| Messages Stored | Trillions | Trillions |
Why ScyllaDB outperformed Cassandra:
- C++ implementation eliminates JVM garbage collection pauses
- Shard-per-core architecture for predictable performance
- Same data model and query language (CQL compatible)
Slack’s Vitess Migration
Slack migrated from monolithic MySQL to Vitess (horizontal MySQL sharding):
- 2.3 million queries per second at peak
- 2ms median latency
- Preserves ACID transactions within a shard
- VTGate layer handles routing transparently
Database Choice Decision Framework
┌─────────────────────────────────────────────────────────────────────────┐
│ DATABASE SELECTION GUIDE │
└─────────────────────────────────────────────────────────────────────────┘

 Need ACID Transactions?
 │
 ┌────────────┴────────────┐
 │ YES │ NO
 ▼ ▼
 ┌─────────────────┐ ┌─────────────────┐
 │ MySQL + Vitess │ │ Need Low Latency│
 │ PostgreSQL │ │ at Scale? │
 └─────────────────┘ └────────┬────────┘
 │
 ┌────────────┴────────────┐
 │ YES │ NO
 ▼ ▼
 ┌─────────────────┐ ┌─────────────────┐
 │ ScyllaDB │ │ Cassandra │
 │ (C++, no GC) │ │ (JVM, mature) │
 └─────────────────┘ └─────────────────┘
Part 5: Message Delivery & Ordering (10 minutes)
5.1 Message Ordering Guarantees
“A critical question for chat systems: how do we ensure messages appear in the correct order?”
What Level of Ordering Do We Need?
| Ordering Level | Description | Difficulty | Needed? |
|---|---|---|---|
| Global ordering | All messages across all channels ordered | Very Hard | No |
| Per-channel ordering | Messages within a channel are ordered | Medium | Yes |
| Per-sender ordering | Messages from same sender are ordered | Easy | Yes |
| Causal ordering | Replies appear after original message | Hard | Nice to have |
Key insight: “Per-channel ordering is sufficient for chat. Users don’t care if Channel A’s message 5 came before or after Channel B’s message 10.”
Implementing Per-Channel Ordering
Approach 1: Database Atomic Counter
-- For each message insert:
BEGIN TRANSACTION;
 SELECT COALESCE(MAX(message_seq), 0) + 1 INTO @next_seq
 FROM message WHERE channel_id = ?
 FOR UPDATE; -- lock so two concurrent writers can't read the same MAX

 INSERT INTO message (channel_id, message_seq, ...)
 VALUES (?, @next_seq, ...);
COMMIT;
- Pros: Simple, strongly consistent
- Cons: Single point of contention for high-traffic channels
Approach 2: Snowflake IDs (Twitter/Discord)
Snowflake ID (64 bits):
┌─────────────────────────────────────────────────────────────────┐
│ 41 bits: timestamp │ 10 bits: machine ID │ 12 bits: seq │
│ (milliseconds) │ (1024 machines) │ (4096/ms) │
└─────────────────────────────────────────────────────────────────┘

Example: 1234567890123456789
 └─── Encodes: time, which machine, sequence within that ms
- Pros: Decentralized generation, time-sortable, globally unique
- Cons: Clock skew can cause ordering issues, not strictly monotonic per channel
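The 41/10/12 bit layout above is just shifting and masking. A sketch of the packing (the epoch constant is an arbitrary illustration choice; real generators also read the clock themselves and handle clock regression):

```python
# Sketch: Snowflake-style ID packing — 41 bits of milliseconds since a custom
# epoch, 10 bits of machine ID, 12 bits of per-millisecond sequence.
EPOCH_MS = 1_288_834_974_657  # Twitter's epoch, used here only for illustration

def make_snowflake(timestamp_ms: int, machine_id: int, seq: int) -> int:
    assert machine_id < 1024 and seq < 4096
    return ((timestamp_ms - EPOCH_MS) << 22) | (machine_id << 12) | seq

def unpack_snowflake(snowflake: int):
    return (
        (snowflake >> 22) + EPOCH_MS,  # timestamp (ms)
        (snowflake >> 12) & 0x3FF,     # machine ID
        snowflake & 0xFFF,             # sequence
    )

sid = make_snowflake(EPOCH_MS + 1_000, machine_id=7, seq=42)
```

Because the timestamp occupies the high bits, sorting IDs numerically sorts them by creation time (to millisecond precision), which is what makes them useful as clustering keys.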
Approach 3: Hybrid (Recommended)
# Message has both:
message = {
    "message_id": generate_snowflake(),     # Global unique, time-sortable
    "message_seq": get_channel_sequence(),  # Per-channel strict ordering
    "channel_id": 123,
    "body": "Hello!"
}
- Use Snowflake for global uniqueness and rough time ordering
- Use per-channel sequence for strict ordering within channel
- Sequence can be generated by channel service (single leader per channel)
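Strict per-channel sequences also give clients gap detection for free: if a sequence number is missing, the client knows to re-fetch that range via the paginated GET endpoint. A sketch:

```python
# Sketch: a client spotting missing per-channel sequence numbers so it can
# re-fetch the gap with GET /channels/{id}/messages?after={seq}.
def find_gaps(received_seqs):
    seqs = sorted(set(received_seqs))
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur > prev + 1:
            gaps.append((prev + 1, cur - 1))  # inclusive missing range
    return gaps

gaps = find_gaps([1, 2, 3, 6, 7, 10])
```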
Handling Simultaneous Messages
“What if two users send messages at the exact same millisecond?”
User A sends "Hello" ─────┐
 ├───► Channel Server ───► Database
User B sends "Hi" ────────┘

Both arrive at same millisecond. Which is first?
Solution: Single leader per channel
The Channel Server (or database primary) serializes all writes to a channel. It assigns sequence numbers atomically:
import threading

class ChannelServer:
    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.next_seq = self.load_from_db()
        self.lock = threading.Lock()

    def assign_sequence(self):
        with self.lock:
            seq = self.next_seq
            self.next_seq += 1
            return seq
Trade-off: This creates a bottleneck for very active channels. Solutions:
- Shard large channels by time bucket
- Accept eventual consistency with client-side sorting
5.2 Delivery Guarantees
The Impossible Problem: Exactly-Once Delivery
“True exactly-once delivery is impossible in distributed systems. This is proven by the Two Generals Problem.”
Two Generals Problem:
 General A ───message───► General B
 (may be lost)

 General A ◄───ack────── General B
 (may be lost)

Neither general can ever be certain the other received the message.
Practical Solution: At-Least-Once Semantics + Idempotency
┌─────────────────────────────────────────────────────────────────────────┐
│ AT-LEAST-ONCE WITH DEDUPLICATION │
└─────────────────────────────────────────────────────────────────────────┘

1. Client sends message with idempotency_key
 {idempotency_key: "abc-123", text: "Hello!"}

2. Server checks if key exists in Redis:
 - If exists → return cached response (duplicate)
 - If not → process and cache response

3. Client doesn't receive response (network issue)

4. Client retries with SAME idempotency_key

5. Server recognizes duplicate, returns cached response

6. Message appears exactly once to users
Idempotency Key Implementation
import json

# Assumes a module-level `redis` client and helpers `store_message` /
# `wait_for_result` defined elsewhere.
def send_message(request):
    key = f"idempotency:{request.idempotency_key}"

    # Check for duplicate
    existing = redis.get(key)
    if existing:
        return json.loads(existing)  # Return cached response

    # Try to acquire lock (prevent concurrent duplicates)
    lock_acquired = redis.set(key, "processing", nx=True, ex=30)
    if not lock_acquired:
        # Another request is processing, wait for it
        return wait_for_result(key)

    try:
        # Process the message
        message = store_message(request)
        response = {"message_id": message.id, "status": "sent"}

        # Cache response for 24 hours
        redis.setex(key, 86400, json.dumps(response))
        return response
    except Exception:
        redis.delete(key)  # Release lock on failure
        raise
Client-Side Deduplication
Even with server-side idempotency, clients should deduplicate:
class MessageHandler {
  constructor() {
    this.seenMessageIds = new Set();
  }

  onMessageReceived(message) {
    if (this.seenMessageIds.has(message.message_id)) {
      return; // Already displayed this message
    }
    this.seenMessageIds.add(message.message_id);
    this.displayMessage(message);
  }
}
5.3 Offline Message Delivery
“When a user is offline, we need to ensure they receive messages when they come back online.”
Approach 1: Inbox Table (Not Recommended)
-- Store undelivered messages per user
CREATE TABLE inbox (
 user_id BIGINT,
 message_id BIGINT,
 channel_id BIGINT,
 created_at TIMESTAMP,
 PRIMARY KEY (user_id, message_id)
);
Problems:
- Storage explodes for users in large channels
- Need to delete after delivery (write amplification)
- Fanout cost: 1 message to 10K members = 10K inbox rows
Approach 2: Cursor-Based Catchup (Recommended)
1-- Track delivery progress per user per channel
2CREATE TABLE user_channel_cursor (
3 workspace_id BIGINT,
4 user_id BIGINT,
5 channel_id BIGINT,
6 last_delivered_seq BIGINT, -- Last message user received
7 last_read_seq BIGINT, -- Last message user saw
8 PRIMARY KEY (workspace_id, user_id, channel_id)
9);
On reconnect:
def handle_reconnect(user_id, channel_id):
    cursor = db.get_cursor(user_id, channel_id)

    # Fetch messages since last delivery
    missed_messages = db.query("""
        SELECT * FROM message
        WHERE channel_id = ? AND message_seq > ?
        ORDER BY message_seq ASC
        LIMIT 1000
    """, channel_id, cursor.last_delivered_seq)

    # Send to client
    for msg in missed_messages:
        websocket.send(msg)

    # Advance cursor (only if anything was missed)
    if missed_messages:
        db.update_cursor(user_id, channel_id,
                         last_delivered_seq=missed_messages[-1].message_seq)
Advantages:
- No per-message-per-user storage
- Messages stored once, cursors are tiny
- Works for any channel size
Part 6: Presence & Multi-Device Sync (5 minutes)
6.1 Presence System (Online/Offline Status)
“The green dot showing someone is online is surprisingly complex at scale.”
Requirements
- Show real-time online/offline status
- Handle brief disconnections gracefully (don’t flicker)
- Scale to millions of concurrent users
- Tolerate false positives (better to show online when offline than vice versa)
Heartbeat-Based Presence
┌─────────────────────────────────────────────────────────────────────────┐
│ PRESENCE SYSTEM │
└─────────────────────────────────────────────────────────────────────────┘

Client ──── heartbeat ────► Gateway ──── update ────► Presence Service
 │
 ▼
 Redis (TTL keys)
 │
 ▼
 presence:{user_id}
 TTL: 60 seconds
Flow:
- Client sends heartbeat every 30 seconds
- Gateway forwards to Presence Service
- Presence Service sets Redis key with 60 second TTL
- If no heartbeat received, key expires → user is offline
- Status changes published via pub/sub
Why the Asymmetry? (30s heartbeat, 60s TTL)
- Network jitter: Heartbeats may be delayed by 5-10 seconds
- Retry opportunity: Client has one retry window before marked offline
- Graceful degradation: Brief network glitches don’t cause status flicker
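The 30s-heartbeat/60s-TTL scheme can be sketched without Redis by tracking an expiry per user. An injected `now` clock keeps the timing deterministic; in production a Redis key with a 60-second TTL plays this role:

```python
# Sketch: heartbeat-refreshed TTL presence. Each heartbeat re-arms the TTL,
# like SETEX presence:{user_id} 60; silence past the TTL means offline.
TTL_SECONDS = 60

class Presence:
    def __init__(self):
        self.expires_at = {}  # user_id -> time the "key" expires

    def heartbeat(self, user_id, now):
        self.expires_at[user_id] = now + TTL_SECONDS

    def is_online(self, user_id, now):
        return self.expires_at.get(user_id, 0) > now

p = Presence()
p.heartbeat("alice", now=0)
# One missed heartbeat at t=30 is fine: the TTL outlives it...
online_at_45 = p.is_online("alice", now=45)
# ...but silence past the TTL flips the user offline.
online_at_61 = p.is_online("alice", now=61)
```

This is exactly the asymmetry described above: a client can miss one 30-second heartbeat and still be shown online, so brief network glitches don't make the green dot flicker.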
Actor Model for Presence
The actor model is particularly elegant for presence:
# Each user gets their own actor (Erlang process)
defmodule UserPresenceActor do
  def init(user_id) do
    # Set timer for offline transition
    timer = schedule_offline_check(60_000) # 60 seconds
    {:ok, %{user_id: user_id, timer: timer, status: :online}}
  end

  def handle_cast(:heartbeat, state) do
    # Cancel old timer, schedule new one
    Process.cancel_timer(state.timer)
    new_timer = schedule_offline_check(60_000)
    {:noreply, %{state | timer: new_timer}}
  end

  def handle_info(:offline_check, state) do
    # No heartbeat received in time
    broadcast_status_change(state.user_id, :offline)
    {:stop, :normal, state}
  end
end
Why actors work well:
- Each user’s presence is independent
- Millions of lightweight processes (2KB each in Erlang)
- One actor crash doesn’t affect others
- Timers are natural with self-messages
Presence Fanout Challenge
“When User A comes online, who needs to know?”
Naive approach: Notify all users in all channels User A is in.
Problem: User in 100 channels with 100 members each = 10,000 notifications.
Optimized approach:
- Only notify users who are currently viewing a channel with User A
- Client subscribes to presence updates for visible channel members only
- Lazy loading: fetch presence on-demand when opening a channel
6.2 Multi-Device Synchronization
“A user might have Slack open on their laptop, phone, and tablet simultaneously.”
Challenges
- Message delivery to all devices
- Read status sync — Read on phone, laptop shows as read too
- Typing indicators — Don’t show “typing” from your own other device
- Notification suppression — If active on laptop, don’t buzz phone
Device Registration
Redis structure:
 user:{user_id}:devices
 → {
 device_1: {gateway_id: "gw-1", last_active: timestamp, platform: "ios"},
 device_2: {gateway_id: "gw-2", last_active: timestamp, platform: "web"}
 }
Read Status Sync
def mark_as_read(user_id, channel_id, message_seq):
    # Update cursor in database
    db.update("""
        UPDATE user_channel_cursor
        SET last_read_seq = ?
        WHERE user_id = ? AND channel_id = ?
    """, message_seq, user_id, channel_id)

    # Publish to user's other devices
    pubsub.publish(f"user:{user_id}:sync", {
        "type": "read_status",
        "channel_id": channel_id,
        "last_read_seq": message_seq
    })
Consistency model: Eventual consistency is acceptable for read status. A 1-2 second delay in syncing across devices is fine.
Part 7: Failure Handling & Reliability (5 minutes)
7.1 What Can Fail?
“In a distributed system, everything will fail eventually. Let me address each failure mode:”
| Component | Failure Mode | Impact | Mitigation |
|---|---|---|---|
| Client network | Disconnection | User can’t send/receive | Reconnection with catchup |
| Gateway server | Crash | Users disconnected | Stateless design, auto-reconnect |
| Message service | Crash | Can’t process messages | Multiple instances, queue buffering |
| Database | Primary failure | Writes fail | Automatic failover to replica |
| Pub/sub | Redis crash | No real-time delivery | Fallback to polling, Redis Cluster |
| Entire datacenter | Power outage | Full outage | Multi-region deployment |
7.2 Client Reconnection with Exponential Backoff
“When connection drops, clients must reconnect intelligently to avoid overwhelming servers.”
The Thundering Herd Problem
Scenario: AWS outage takes down all connections
  10 million clients try to reconnect simultaneously

Without backoff:
  Second 1: 10M connection attempts → servers crash
  Second 2: 10M more attempts → still crashed
  ...

With jittered exponential backoff:
  Second 1: ~100K attempts (some clients start)
  Second 2: ~200K attempts (more clients start)
  Second 10: Spread across time, servers can handle it
Implementation
import random
import time

def reconnect_with_backoff(connect):
    base_delay = 1.0   # 1 second
    max_delay = 30.0   # 30 seconds max
    attempt = 0
    connected = False

    while not connected:
        # Calculate delay with exponential backoff, capped at max_delay
        delay = min(base_delay * (2 ** attempt), max_delay)

        # Add jitter (randomness) to spread out retries ("full jitter")
        jittered_delay = random.uniform(0, delay)

        time.sleep(jittered_delay)

        try:
            connect()
            connected = True
            attempt = 0  # Reset on success
        except ConnectionError:
            attempt += 1
AWS research shows: Full jitter reduces total retry attempts by over 50% compared to exponential backoff without jitter.
Network Change Detection
// When network changes (WiFi → cellular), reset backoff
window.addEventListener('online', () => {
  reconnectAttempt = 0;  // Reset backoff
  reconnect();           // Try immediately
});
7.3 Gateway Server Failure
┌─────────────────────────────────────────────────────────────────────────┐
│                       GATEWAY FAILURE RECOVERY                          │
└─────────────────────────────────────────────────────────────────────────┘

BEFORE:
┌──────────┐     ┌──────────────┐
│  Client  │◄───►│  Gateway 1   │  ← Server crashes
└──────────┘     └──────────────┘
                        │
                        ▼
                 ┌─────────────┐
                 │    Redis    │  Session data safe here
                 │   (state)   │
                 └─────────────┘

AFTER:
┌──────────┐     ┌──────────────┐
│  Client  │◄───►│  Gateway 2   │  ← Reconnects to different server
└──────────┘     └──────────────┘
                        │
                        ▼
                 ┌─────────────┐
                 │    Redis    │  ← Retrieves same session
                 │   (state)   │
                 └─────────────┘

Client experience: Brief disconnection, then seamless resume
7.4 Database Failure
Leader-Follower Replication
              ┌─────────────┐
 Writes ────► │   Primary   │
              │  (Leader)   │
              └──────┬──────┘
                     │ Replication
        ┌────────────┼────────────┐
        ▼            ▼            ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ Replica  │ │ Replica  │ │ Replica  │
  │    1     │ │    2     │ │    3     │
  └──────────┘ └──────────┘ └──────────┘
        ▲            ▲            ▲
        └────────────┴────────────┘
                  Reads
When primary fails:
- Automated failover promotes Replica 1 to Primary
- Other replicas reconfigure to follow new primary
- Application reconnects to new primary
- Potential data loss: Last few writes may not have replicated
Slack’s AZ Drain Button
Slack built manual tooling for datacenter failures:
- AZ drain button: Redirects all traffic away from failing availability zone
- Execution time: Under 5 minutes
- Mechanism: Envoy weighted clusters shift traffic 100% to healthy AZs
“Sometimes human judgment is better than automated failover for complex scenarios.”
7.5 Split-Brain Prevention
“Split-brain occurs when network partition makes two nodes think they’re both the primary.”
Network Partition:
                     ┌───────────────────────────────┐
 ┌──────────┐        │           PARTITION           │        ┌──────────┐
 │ Primary? │◄───────┼──────────────X────────────────┼───────►│ Primary? │
 │  Node A  │        │                               │        │  Node B  │
 └──────────┘        └───────────────────────────────┘        └──────────┘

Both nodes think: "I can't reach the other one, I must take over!"
Result: Two primaries accepting writes → data divergence
Prevention: Quorum-based consensus
Cluster of 3 nodes:
- Majority = 2 nodes
- If a partition splits 2-1, the side with 2 can operate
- The side with 1 knows it doesn't have majority, refuses writes

┌──────────┐       ┌──────────┐           ┌────────────┐
│  Node A  │───────│  Node B  │     X     │   Node C   │
│ (active) │       │ (active) │           │ (isolated, │
└──────────┘       └──────────┘           │ read-only) │
   Majority (2/3): keeps serving          └────────────┘
                                           No majority: refuses writes
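The majority rule reduces to a one-line check. A minimal sketch, counting the node itself among the reachable set:

```python
def has_quorum(reachable, cluster_size):
    # A node may accept writes only when it can reach a strict majority
    # of the cluster, itself included; otherwise it must go read-only.
    return reachable >= cluster_size // 2 + 1
```

With a 3-node cluster, `has_quorum(2, 3)` holds for the 2-node side and `has_quorum(1, 3)` fails for the isolated node, which is exactly the behavior in the diagram.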
Part 8: Multi-Tenancy (3 minutes)
8.1 Enterprise Isolation Requirements
“Slack is a B2B product. Each company (workspace) expects complete isolation.”
Isolation Levels
| Level | Description | Complexity | Use Case |
|---|---|---|---|
| Logical | Same database, workspace_id in every query | Low | Small/free workspaces |
| Database | Separate database per workspace | Medium | Pro workspaces |
| Physical | Separate servers per workspace | High | Enterprise with compliance |
Implementation: workspace_id Everywhere
-- EVERY query must include workspace_id
SELECT * FROM message
WHERE workspace_id = ? AND channel_id = ?;

-- Primary keys include workspace_id
PRIMARY KEY (workspace_id, channel_id, message_seq)
Why?
- Sharding: workspace_id is natural partition key
- Security: Much harder to accidentally query the wrong workspace
- Performance: Database can route to correct shard immediately
Row-Level Security
-- PostgreSQL row-level security (must be enabled on the table first)
ALTER TABLE message ENABLE ROW LEVEL SECURITY;
CREATE POLICY workspace_isolation ON message
  USING (workspace_id = current_setting('app.workspace_id')::BIGINT);

-- Application sets context before queries
SET app.workspace_id = 123;
SELECT * FROM message WHERE channel_id = 456;
-- Automatically filters to workspace 123
Rate Limiting per Workspace
Prevent one workspace from monopolizing resources:
def rate_limit(workspace_id, operation):
    key = f"rate_limit:{workspace_id}:{operation}"
    current = redis.incr(key)
    if current == 1:
        redis.expire(key, 60)  # 1 minute window

    limits = {
        "free": 1000,
        "pro": 10000,
        "enterprise": 100000
    }
    plan = get_workspace_plan(workspace_id)

    if current > limits[plan]:
        raise RateLimitExceeded()
Part 9: Deep Dive Topics (Interview Follow-ups)
9.1 What if the Pub/Sub Queue Goes Down?
Outbox Pattern:
def send_message(message):
    with db.transaction():
        # Write message AND outbox event in same transaction
        db.insert("message", message)
        db.insert("outbox", {
            "event_type": "new_message",
            "payload": message,
            "processed": False
        })

    # If pub/sub is up, process immediately
    try:
        publish_to_pubsub(message)
        db.update("outbox", {"processed": True}, ...)
    except PubSubDown:
        pass  # Background worker will retry

# Background worker
def process_outbox():
    while True:
        events = db.query("SELECT * FROM outbox WHERE processed = FALSE")
        for event in events:
            try:
                publish_to_pubsub(event.payload)
                db.update("outbox", {"processed": True}, ...)
            except PubSubDown:
                time.sleep(1)  # Retry later
        time.sleep(1)  # Poll interval; avoids a busy loop when the table is empty
9.2 How to Handle Large Channels (10K+ members)?
Problem: Sending a message to 10K members = 10K fanout operations
Solutions:
- Lazy fanout: Only notify online members, others catchup on reconnect
- Tiered delivery: Push to first 1000 members, others poll
- Read-only for very large channels: Disable @channel mentions
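Lazy fanout in particular fits in a few lines; the helper signatures (`is_online`, `push`) are assumptions standing in for the presence service and the gateway push path:

```python
def fan_out(message, member_ids, is_online, push):
    # Push only to members who are online right now; offline members cost
    # nothing here, because on reconnect they fetch the channel log with
    # "seq > last_read_seq" and catch up.
    pushed = 0
    for member_id in member_ids:
        if is_online(member_id):
            push(member_id, message)  # real-time delivery path
            pushed += 1
    return pushed
```

In a 10K-member channel where only a few hundred members are online, this turns 10K fanout operations into a few hundred.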
9.3 Message Search Implementation
┌─────────────────────────────────────────────────────────────────────────┐
│                          SEARCH ARCHITECTURE                            │
└─────────────────────────────────────────────────────────────────────────┘

┌──────────┐      ┌─────────────┐      ┌─────────────────┐
│ Message  │─────►│    Kafka    │─────►│  Elasticsearch  │
│ Service  │      │  (stream)   │      │ (search index)  │
└──────────┘      └─────────────┘      └─────────────────┘
                                               │
                 Search queries ───────────────┘
- Messages are primarily stored in Cassandra/MySQL
- Kafka streams changes to Elasticsearch (near-realtime)
- Search queries hit Elasticsearch
- Indexing delay: 1-5 seconds (acceptable for search)
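The Kafka-to-Elasticsearch leg is a plain consumer loop. A hedged sketch, assuming message events carry `workspace_id`, `message_id`, `text`, and `channel_id` fields; the two client objects are stand-ins for a Kafka consumer and an Elasticsearch client:

```python
def index_loop(consumer, es):
    # Consume message events and index each into a per-workspace index,
    # so search queries stay inside one tenant's data.
    for event in consumer:  # e.g. an iterable Kafka consumer
        es.index(
            index=f"messages-{event['workspace_id']}",
            id=event["message_id"],
            document={
                "text": event["text"],
                "channel_id": event["channel_id"],
            },
        )
```

Indexing per workspace also keeps the multi-tenancy guarantee from Part 8 intact on the search path.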
9.4 File Upload Flow
┌─────────────────────────────────────────────────────────────────────────┐
│                       PRESIGNED URL UPLOAD FLOW                         │
└─────────────────────────────────────────────────────────────────────────┘

1. Client ──── POST /files/upload-url ────► API Server
                                                │
2.                                    Generate presigned URL
                                                │
3. Client ◄─── {upload_url, file_id} ──────────┘

4. Client ──── PUT [binary data] ────► S3 (direct upload)

5. Client ──── POST /messages {file_id} ────► API Server

6. API Server validates file_id exists in S3, creates message
Why presigned URLs?
- Large files don’t go through API servers
- Direct S3 upload is faster and cheaper
- API servers stay lightweight
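An end-to-end simulation of the six steps. A dict stands in for S3 so the sketch runs self-contained; in production step 2 would come from the object store's SDK (e.g. a presigned `put_object` URL) and step 4 would be a real HTTP PUT:

```python
import uuid

s3 = {}  # stand-in for the object store: upload_url -> bytes

def request_upload_url():
    # Steps 1-3: API server mints a file_id and a short-lived upload URL
    file_id = uuid.uuid4().hex
    return {"file_id": file_id, "upload_url": f"/bucket/{file_id}"}

def upload(upload_url, data):
    # Step 4: client PUTs bytes directly to storage, bypassing the API tier
    s3[upload_url] = data

def create_message(channel_id, file_id):
    # Steps 5-6: API server verifies the object exists, then creates the message
    if f"/bucket/{file_id}" not in s3:
        raise ValueError("file_id has no uploaded object")
    return {"channel_id": channel_id, "file_id": file_id}
```

The validation in step 6 matters: without it, a client could attach a `file_id` that was never uploaded and create a message pointing at nothing.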
Part 10: Interview Strategy
10.1 Time Management (45-minute interview)
| Phase | Time | Focus |
|---|---|---|
| Clarify requirements | 5 min | Ask questions, establish scope |
| High-level design | 10 min | Draw architecture, explain components |
| Core deep dive | 20 min | Database, WebSocket, message delivery |
| Failure handling | 5 min | Proactively discuss failures |
| Wrap-up / Questions | 5 min | Extensions, tradeoffs |
10.2 What Interviewers Look For
Based on real OpenAI interview feedback:
- Database design (100% asked): You must draw schemas and explain sharding
- Fan-out problem (50% asked): How to send message to 10K members
- Multi-tenancy (50% asked): How to isolate company data
- Failure handling (50% asked): What if servers crash?
10.3 Common Mistakes to Avoid
| Mistake | Why It’s Bad | What to Do Instead |
|---|---|---|
| Being reactive | Waiting for interviewer to ask about scale/failures | Proactively say “Let me discuss what happens when this fails…” |
| Forgetting multi-tenancy | Designing for single company | Always include workspace_id from the start |
| Ignoring fan-out | Saying “just send to everyone” | Explain pub/sub, batching, lazy delivery |
| Vague database design | Just saying “use a database” | Draw tables, specify keys, explain indexes |
| No numbers | “A lot of messages” | “1 billion messages/day, 12K messages/second” |
10.4 Key Phrases to Use
✅ “Let me start with requirements to ensure I understand the scope…”
✅ “For a billion-user system, we need to consider…”
✅ “The tradeoff here is between consistency and availability…”
✅ “What happens when this component fails? Let me address that…”
✅ “Discord handles this by…, Slack’s approach is…, let me explain why I’d choose…”
✅ “At scale, this becomes a bottleneck, so we need to…”
Appendix: Quick Reference
Production Numbers to Cite
| Platform | Concurrent Users | Messages/Day | P99 Latency | Database |
|---|---|---|---|---|
| Discord | 15M+ | Trillions stored | 15ms read | ScyllaDB |
| Slack | 5M+ WebSocket | Millions | 500ms global | MySQL/Vitess |
| WhatsApp | 147M peak | 100B+ | Sub-second | Mnesia |
Technology Choices Cheat Sheet
| Component | Small Scale | Large Scale |
|---|---|---|
| Database | PostgreSQL | Vitess (MySQL) or ScyllaDB |
| Cache | Redis single | Redis Cluster |
| Pub/Sub | Redis Pub/Sub | Kafka + Redis |
| WebSocket | Node.js | Elixir/Erlang (BEAM) |
| Message Queue | RabbitMQ | Kafka |
Key Formulas
Connections per server: 100K (conservative) to 2M (optimized Elixir)
RAM per connection: ~30 KB
Messages per second: DAU × messages_per_user / 86400
Storage per day: messages × avg_size × replication_factor
Fanout cost: messages × avg_channel_size
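Plugging the document's own targets into these formulas (1 billion+ messages/day; the per-user rate, message size, and replication factor below are illustrative assumptions):

```python
dau = 100_000_000        # assumption: 100M daily active users
msgs_per_user = 10       # assumption: 10 messages per user per day

messages_per_day = dau * msgs_per_user        # 1 billion/day
messages_per_sec = messages_per_day / 86400   # ~11.6K, i.e. the "12K msg/s" cited earlier

avg_size = 1_000         # assumption: ~1 KB per message row
replication_factor = 3
storage_per_day = messages_per_day * avg_size * replication_factor  # ~3 TB/day
```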
This document consolidates material from ByteByteGo, Discord Engineering, Slack Engineering, WhatsApp architecture papers, and real interview experiences.