How to Use This Document
This document is structured as a 45-60 minute technical interview script or lecture. It flows from foundational concepts to advanced distributed systems patterns. Each section includes:
- What to say (the teaching content)
- What to draw (diagrams to sketch)
- Key numbers (statistics to cite)
- Interview tips (how to present in an interview setting)
Part 1: Setting the Stage (5 minutes)
1.1 Opening Statement
“Today we’re designing a large-scale chat system for companies, similar to Slack. Before diving into the architecture, I want to establish the scope and understand what makes this problem interesting at scale.”
Key insight to share: “The fundamental challenge in chat systems is balancing real-time delivery with reliability. We need messages to arrive instantly, but we also can’t lose them. At FAANG scale, this means handling billions of messages daily while maintaining sub-second latency.”
1.2 Clarifying Requirements (Always Start Here)
In an interview, always ask clarifying questions first. This shows structured thinking and helps scope the problem appropriately.
Functional Requirements (What the system does)
Ask the interviewer to confirm:
- Channel creation — Can users create channels with multiple participants? What about direct messages (1:1)?
- Messaging — Send and receive messages in real-time within channels
- Offline delivery — Users receive messages sent while they were offline
- Media files — Support for images, documents, and other attachments
- Multi-tenancy — Multiple companies (workspaces) use the same infrastructure with complete data isolation
Below the line (mention but deprioritize):
- Message editing and deletion
- Reactions and threading
- Search functionality
- Private channels with access control
Non-Functional Requirements (How the system performs)
| Requirement | Target | Rationale |
|---|---|---|
| Latency | < 500ms end-to-end | Users expect “instant” messaging |
| Throughput | 1 billion+ messages/day | Enterprise scale |
| Availability | 99.9% - 99.99% | Chat is mission-critical for businesses |
| Delivery guarantee | At-least-once with deduplication | Messages must not be lost |
| Consistency | Eventual for display, ordered within channels | Perfect global ordering is unnecessary |
Below the line:
- End-to-end encryption
- Compliance and audit logging
- Geographic data residency
Scale Estimation (Back-of-envelope)
“Let me estimate the scale we’re designing for:”
Users: 100 million registered users
DAU: 20 million daily active users
Concurrent: 5 million simultaneous connections (peak)
Messages/day: 1 billion messages
Messages/second: ~12,000 average, ~50,000 peak

Message size: ~1 KB average (text + metadata)
Storage/day: 1 billion × 1 KB = 1 TB/day
Storage/year: 365 TB/year (before replication)

Connections: 5 million WebSocket connections
  At ~30 KB RAM per connection = 150 GB RAM for connections alone
  Need ~5 servers at 1M connections each (aggressive)
  Or ~50 servers at 100K connections each (conservative)
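The arithmetic above can be reproduced in a few lines of Python (all inputs are the estimates stated above):

```python
# Back-of-envelope numbers from the estimates above.
messages_per_day = 1_000_000_000
avg_msgs_per_sec = messages_per_day / 86_400          # seconds in a day

storage_per_day_tb = messages_per_day * 1_000 / 1e12  # ~1 KB per message
storage_per_year_tb = storage_per_day_tb * 365

concurrent_connections = 5_000_000
ram_for_connections_gb = concurrent_connections * 30_000 / 1e9  # ~30 KB each

print(f"~{avg_msgs_per_sec:,.0f} msg/s average")  # ~11,574 msg/s
print(f"{storage_per_day_tb:.1f} TB/day, {storage_per_year_tb:.0f} TB/year")
print(f"{ram_for_connections_gb:.0f} GB RAM just to hold connections")
```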
Part 2: Core Design Foundations (10 minutes)
2.1 Core Entities
“Let me identify the fundamental data entities we’ll be working with:”
┌─────────────────────────────────────────────────────────────────┐
│ CORE ENTITIES │
├─────────────────────────────────────────────────────────────────┤
│ Workspace │ The company/organization container │
│ Channel │ A conversation space (public or private) │
│ User │ A person who can send/receive messages │
│ Membership │ Relationship between users and channels │
│ Message │ The content sent within a channel │
│ Device │ A user's connected client (phone, laptop, etc.) │
└─────────────────────────────────────────────────────────────────┘
2.2 API Design
“I’ll define the core APIs. I prefer to think about these in terms of the user actions they enable:”
Interview tip: When listing APIs, focus on the most important ones first. Don’t list every possible API. It wastes time and can make you seem unprepared if you can’t explain all of them.
Essential APIs
# Channel Management
POST /workspaces/{workspace_id}/channels
 → create_channel(workspace_id, name, is_private, initial_members[])
 → Returns: channel_id

POST /channels/{channel_id}/members
 → add_member(channel_id, user_id, role)
 → Returns: membership_id

# Messaging
POST /channels/{channel_id}/messages
 → send_message(channel_id, sender_id, content, idempotency_key)
 → Returns: message_id, sequence_number

GET /channels/{channel_id}/messages?after={seq}&limit={n}
 → get_messages(channel_id, after_sequence, limit)
 → Returns: messages[], has_more

# Real-time (WebSocket)
WS /connect?token={auth_token}
 → Establishes bidirectional connection
 → Server pushes: new_message, presence_update, typing_indicator
 → Client sends: heartbeat, subscribe_channel, mark_read

# File Upload (separate flow)
POST /files/upload-url
 → get_presigned_url(file_name, file_size, content_type)
 → Returns: upload_url, file_id

POST /channels/{channel_id}/messages
 → send_message(channel_id, sender_id, content, file_ids[])
API Design Principles
- Include workspace_id in paths — Essential for multi-tenancy and sharding
- Use idempotency keys — Prevents duplicate messages on retry
- Cursor-based pagination — Use sequence numbers, not page offsets
- Separate file upload — Don’t mix large binary data with message APIs
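To make the cursor-based pagination principle concrete, here is a sketch of a client paging through history with the `after={seq}` cursor. `MESSAGES` and `fetch_page` are in-memory stand-ins for the server's store and the GET endpoint, not real API code:

```python
# Sketch: cursor-based pagination over per-channel sequence numbers.
# `fetch_page` mimics GET /channels/{id}/messages?after={seq}&limit={n}.
MESSAGES = [{"message_seq": i, "body": f"msg {i}"} for i in range(1, 11)]

def fetch_page(after_seq: int, limit: int):
    page = [m for m in MESSAGES if m["message_seq"] > after_seq][:limit]
    has_more = bool(page) and page[-1]["message_seq"] < MESSAGES[-1]["message_seq"]
    return page, has_more

def fetch_all(limit: int = 3):
    cursor, out = 0, []
    while True:
        page, has_more = fetch_page(cursor, limit)
        out.extend(page)
        if not page or not has_more:
            return out
        cursor = page[-1]["message_seq"]  # advance by sequence, never by offset

history = fetch_all()
```

Because the cursor is a sequence number rather than a page offset, concurrent inserts can't cause skipped or duplicated rows between pages.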
2.3 High-Level Architecture Overview
“Let me draw the high-level architecture, then we’ll dive into each component:”
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLIENTS │
│ (Web, Mobile, Desktop Apps) │
└─────────────────────────────────────────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOAD BALANCER │
│ (SSL termination, routing, rate limiting) │
└─────────────────────────────────────────────────────────────────────────────┘
 │ │
 ┌───────────┘ └───────────┐
 ▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ API SERVERS │ │ GATEWAY SERVERS │
│ (HTTP REST requests) │ │ (WebSocket connections) │
│ - Auth, file upload │ │ - Real-time delivery │
│ - Message send │ │ - Presence, typing │
└──────────────────────────┘ └──────────────────────────┘
 │ │
 └──────────────────┬─────────────────────────┘
 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CORE SERVICES │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Channel │ │ Message │ │ Presence │ │ Notification│ │
│ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
 │
 ┌──────────────────┼──────────────────┐
 ▼ ▼ ▼
┌──────────────────────┐ ┌─────────────┐ ┌──────────────────────┐
│ MESSAGE QUEUE │ │ CACHE │ │ DATABASES │
│ (Kafka/Redis Pub/Sub)│ │ (Redis) │ │ (MySQL/Cassandra) │
└──────────────────────┘ └─────────────┘ └──────────────────────┘
Part 3: Building the Basic System (10 minutes)
3.1 Start Simple: Single Server MVP
“Let me start with the simplest working system, then evolve it to handle scale.”
┌──────────┐ ┌──────────────────┐ ┌────────────┐
│ Client │ ◄─────► │ Chat Server │ ◄─────► │ Database │
│ (User) │ WS │ (Single Node) │ SQL │ (MySQL) │
└──────────┘ └──────────────────┘ └────────────┘
Single Server Workflow: Sending a Message
- User A sends message via WebSocket to Chat Server
- Chat Server validates the request (auth, permissions)
- Chat Server stores message in database
- Chat Server looks up channel members from database
- Chat Server pushes message to all online members via their WebSocket connections
What’s wrong with this?
- Single point of failure
- Can’t scale beyond one server’s connection limit (~100K-500K connections)
- All members must be connected to the same server
- Database becomes bottleneck
3.2 Database Schema (Foundation)
“Before scaling, let me establish the data model. This schema works for both SQL and NoSQL with modifications.”
Core Tables
-- Workspace: The company/organization
CREATE TABLE workspace (
 workspace_id BIGINT PRIMARY KEY,
 name VARCHAR(255),
 plan ENUM('free', 'pro', 'enterprise'),
 created_at TIMESTAMP
);

-- Channel: A conversation space
CREATE TABLE channel (
 workspace_id BIGINT,
 channel_id BIGINT,
 name VARCHAR(255),
 is_private BOOLEAN DEFAULT FALSE,
 created_at TIMESTAMP,
 PRIMARY KEY (workspace_id, channel_id)
);

-- Channel Membership: Who is in which channel
CREATE TABLE channel_member (
 workspace_id BIGINT,
 channel_id BIGINT,
 user_id BIGINT,
 role ENUM('member', 'admin', 'owner'),
 joined_at TIMESTAMP,
 PRIMARY KEY (workspace_id, channel_id, user_id)
);

-- User's view of channels (for quick lookup)
CREATE TABLE user_channel (
 workspace_id BIGINT,
 user_id BIGINT,
 channel_id BIGINT,
 last_read_seq BIGINT DEFAULT 0, -- For unread counts
 last_delivered_seq BIGINT DEFAULT 0, -- For offline catchup
 muted BOOLEAN DEFAULT FALSE,
 PRIMARY KEY (workspace_id, user_id, channel_id)
);

-- Messages
CREATE TABLE message (
 workspace_id BIGINT,
 channel_id BIGINT,
 message_seq BIGINT, -- Per-channel sequence number
 message_id BIGINT, -- Global unique ID (Snowflake)
 sender_id BIGINT,
 body TEXT,
 created_at TIMESTAMP,
 PRIMARY KEY (workspace_id, channel_id, message_seq)
);
Why This Schema Design?
- workspace_id in every table — Enables multi-tenancy and sharding
- Composite primary keys — (workspace_id, channel_id, message_seq) colocates related data
- Dual membership tables — channel_member for “who’s in this channel”, user_channel for “what channels is this user in”
- Sequence numbers — message_seq enables ordered pagination and gap detection
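For instance, the two cursors in `user_channel` make unread counts a simple range count past `last_read_seq`. A sketch against an in-memory list of sequence numbers (real code would run the SQL shown in the comment):

```python
# Sketch: unread count = messages in the channel past the user's last_read_seq.
def unread_count(channel_seqs, last_read_seq):
    # SQL equivalent: SELECT COUNT(*) FROM message
    #   WHERE workspace_id = ? AND channel_id = ? AND message_seq > ?
    return sum(1 for seq in channel_seqs if seq > last_read_seq)

count = unread_count(channel_seqs=[1, 2, 3, 4, 5], last_read_seq=3)
```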
3.3 Communication Protocols
“Let me explain why we use different protocols for different operations:”
Protocol Comparison
| Protocol | Use Case | Characteristics |
|---|---|---|
| HTTP/REST | Sending messages, file upload, auth | Request-response, stateless, easy to load balance |
| WebSocket | Receiving messages, presence, typing | Bidirectional, persistent connection, server push |
| Long Polling | Fallback when WebSocket unavailable | Higher latency, more overhead |
| HTTP/2 + SSE | Alternative to WebSocket | Multiplexed, unidirectional server push |
Why WebSocket for Receiving?
“The receiver side needs server push capability. Let me compare the options:”
Polling (Bad)
Client: "Any new messages?" → Server: "No"
Client: "Any new messages?" → Server: "No"
Client: "Any new messages?" → Server: "Yes, here's one"
- Wastes resources checking constantly
- High latency (message waits for next poll)
Long Polling (Better)
Client: "Any new messages? I'll wait..."
Server: [holds connection for 30 seconds]
Server: "Here's a message!" [or timeout]
- Reduces requests but still overhead of reconnecting
- Each response requires new connection
WebSocket (Best)
Client: "Let's open a persistent connection"
Server: [keeps connection open]
Server: "New message!" [instant push]
Server: "Another message!" [instant push]
- Single persistent connection
- Sub-100ms delivery latency
- Efficient for high-frequency updates
Connection Lifecycle
┌─────────────────────────────────────────────────────────────────────────┐
│ WebSocket Connection Lifecycle │
└─────────────────────────────────────────────────────────────────────────┘

1. ESTABLISH CONNECTION
 Client ──── HTTP Upgrade Request ────► Server
 Client ◄─── 101 Switching Protocols ── Server
 [WebSocket connection established]

2. AUTHENTICATE
 Client ──── {type: "auth", token: "..."} ────► Server
 Client ◄─── {type: "auth_ok", user_id: 123} ── Server

3. SUBSCRIBE TO CHANNELS
 Client ──── {type: "subscribe", channels: [...]} ────► Server
 Client ◄─── {type: "subscribed", channels: [...]} ──── Server

4. STEADY STATE
 Client ◄─── {type: "message", channel_id: 1, ...} ──── Server (push)
 Client ◄─── {type: "presence", user_id: 5, status: "online"} ── Server
 Client ──── {type: "heartbeat"} ────► Server (every 30 seconds)

5. DISCONNECTION
 Client ──── Close frame ────► Server
 [or Server ──── Close frame ────► Client]
 [or Connection times out after missed heartbeats]
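The client side of the steady state above boils down to routing incoming JSON frames by their `type` field. A minimal sketch of that dispatcher (frame names follow the lifecycle; the handler wiring itself is illustrative, not a real client library):

```python
# Sketch: routing WebSocket frames from the lifecycle above by "type".
import json

class ChatClient:
    def __init__(self):
        self.user_id = None
        self.channels = set()
        self.messages = []

    def on_frame(self, raw: str):
        # Dispatch to _on_<type> if such a handler exists.
        frame = json.loads(raw)
        handler = getattr(self, f"_on_{frame['type']}", None)
        if handler:
            handler(frame)

    def _on_auth_ok(self, frame):
        self.user_id = frame["user_id"]

    def _on_subscribed(self, frame):
        self.channels.update(frame["channels"])

    def _on_message(self, frame):
        self.messages.append(frame)

client = ChatClient()
client.on_frame('{"type": "auth_ok", "user_id": 123}')
client.on_frame('{"type": "subscribed", "channels": [1, 2]}')
client.on_frame('{"type": "message", "channel_id": 1, "body": "hi"}')
```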
Part 4: Scaling to Distributed System (15 minutes)
4.1 The Multi-Server Challenge
“When we have multiple servers, a fundamental problem emerges: the sender and receiver may be connected to different servers.”
┌──────────┐ ┌──────────┐
│ User A │ │ User B │
│ (Sender) │ │(Receiver)│
└────┬─────┘ └────┬─────┘
 │ │
 ▼ ▼
┌──────────┐ ┌──────────┐
│ Server 1 │ HOW DOES MESSAGE GET HERE? ────► │ Server 2 │
└──────────┘ └──────────┘
Two solutions:
- Consistent Hashing — Deterministically route users to servers
- Pub/Sub Message Queue — Broadcast messages to all servers
4.2 Solution 1: Consistent Hashing
“We can use consistent hashing to ensure all members of a channel connect to the same server.”
How It Works
Hash Ring:
 Server A
 │
 ┌──────────┼──────────┐
 ╱ ╲
 Server D Server B
 ╲ ╱
 └──────────┬──────────┘
 │
 Server C

Channel 123 → hash(123) → position on ring → Server B
All users in Channel 123 connect to Server B
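The ring mapping above can be sketched in a few lines — a minimal consistent-hash ring with virtual nodes. The hash function (MD5) and vnode count are arbitrary choices for illustration:

```python
# Sketch: consistent hashing with virtual nodes, as in the ring above.
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=64):
        # Each server appears `vnodes` times on the ring to smooth distribution.
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, channel_id) -> str:
        # Walk clockwise to the first vnode at or after the key's position.
        idx = bisect.bisect(self.keys, self._hash(str(channel_id))) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c", "server-d"])
owner = ring.server_for(123)  # every lookup of channel 123 returns the same server
```

Removing one server only remaps the channels that server owned; everything else keeps its assignment, which is the property that plain `hash % N` lacks.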
What are the drawbacks of the consistent hashing approach?
- If a server goes down, all channels mapped to it lose connectivity until reassignment
- Adding/removing servers causes many channels to remap (connection churn)
- Hot channels (10K+ members) still create hotspots
- Complex to manage and monitor
Slack’s CHARM System
Slack uses Consistent Hash Ring Managers (CHARMs):
- Each channel_id maps to exactly one Channel Server
- 64-256 virtual nodes per physical server (smooths distribution)
- When server fails, only ~1/N of channels need reassignment
- Slack reports: Unhealthy server replacement completes in under 20 seconds
Limitations
- Doesn’t work when user is in multiple channels on different servers
- Server addition/removal causes connection migrations
4.3 Solution 2: Pub/Sub Architecture (Preferred)
“The better solution decouples message routing from connection management using pub/sub.”
┌──────────────────────────────────────────────────────────────────────────┐
│ PUB/SUB ARCHITECTURE │
└──────────────────────────────────────────────────────────────────────────┘

┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ User A │ │ User B │ │ User C │ │ User D │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
 │ WS │ WS │ WS │ WS
 ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ GATEWAY SERVERS (WebSocket Connections) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Gateway 1 │ │ Gateway 2 │ │ Gateway 3 │ │
│ │ Users: A, B │ │ Users: C │ │ Users: D │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼─────────────────┼─────────────────┼───────────────────────────┘
 │ │ │
 │ Subscribe │ Subscribe │ Subscribe
 │ to topics │ to topics │ to topics
 ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ PUB/SUB SYSTEM │
│ (Redis Pub/Sub or Kafka) │
│ │
│ Topics: channel:123, channel:456, user:789:devices, ... │
└─────────────────────────────────────────────────────────────────────────┘
 ▲
 │ Publish message to channel:123
 │
┌─────────┴───────┐
│ Message │
│ Service │
└─────────────────┘
Message Flow with Pub/Sub
- User A sends message to Channel 123 via HTTP POST
- API Server validates and stores message in database
- API Server publishes to pub/sub topic channel:123
- All Gateway Servers subscribed to channel:123 receive the message
- Each Gateway pushes message to connected users who are members of Channel 123
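The fanout steps above can be sketched with an in-memory stand-in for Redis Pub/Sub (the topic name follows the `channel:{id}` convention from the diagram; `delivered` stands in for each user's WebSocket):

```python
# Sketch: pub/sub fanout. A dict of topic -> subscriber callbacks stands in
# for Redis Pub/Sub; each "gateway" pushes to its locally connected users.
from collections import defaultdict

class PubSub:
    def __init__(self):
        self.topics = defaultdict(list)

    def subscribe(self, topic, callback):
        self.topics[topic].append(callback)

    def publish(self, topic, message):
        for cb in self.topics[topic]:
            cb(message)

bus = PubSub()
delivered = defaultdict(list)  # user_id -> messages pushed over "their socket"

def make_gateway(local_users):
    def on_message(msg):
        for user in local_users:
            delivered[user].append(msg)  # would be a websocket.send() in reality
    return on_message

# Gateways 1 and 2 each have members of channel 123 connected.
bus.subscribe("channel:123", make_gateway(["A", "B"]))
bus.subscribe("channel:123", make_gateway(["C"]))

# The Message Service publishes once; every subscribed gateway fans out locally.
bus.publish("channel:123", {"channel_id": 123, "body": "hello"})
```

The key property: the publisher never needs to know which gateway holds which connection.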
Pub/Sub Technology Choices
| Technology | Latency | Durability | Throughput | Best For |
|---|---|---|---|---|
| Redis Pub/Sub | Sub-ms | None (fire-and-forget) | High | Typing indicators, presence |
| Redis Streams | Sub-ms | Yes (persisted) | High | Messages with replay capability |
| Kafka | 10-100ms | Excellent | Very high | Audit logs, analytics, durability |
| NATS | 0.1-0.4ms | Optional | 3M+ msg/sec | Lightweight real-time delivery |
Discord uses: Redis Pub/Sub for real-time fanout
Slack uses: Custom internal pub/sub + Kafka for durability
Hybrid Approach (Production Pattern)
┌─────────────────────────────────────────────────────────────────────────┐
│ HYBRID MESSAGE FLOW │
└─────────────────────────────────────────────────────────────────────────┘

 Message from User A
 │
 ▼
 ┌───────────────┐
 │ API Server │
 └───────┬───────┘
 │
 ┌────────────────┼────────────────┐
 ▼ ▼ ▼
 ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
 │ Database │ │ Redis Pub/ │ │ Kafka │
 │ (persist) │ │ Sub (fast) │ │ (durable) │
 └─────────────┘ └──────┬──────┘ └─────────────┘
 │
 ▼
 Gateway Servers
 (push to users)
Why both Redis and Kafka?
- Redis for instant delivery (sub-millisecond)
- Kafka for durability, replay, and downstream consumers (search indexing, analytics)
4.4 Stateless Gateway Design
“Gateway servers must be stateless for horizontal scaling and fault tolerance.”
What “Stateless” Means
Stateful (Bad):
# Gateway server stores user session in local memory
class GatewayServer:
    def __init__(self):
        self.sessions = {}     # user_id → session_data
        self.connections = {}  # user_id → websocket

    def on_connect(self, user_id, websocket):
        self.sessions[user_id] = load_user_data(user_id)
        self.connections[user_id] = websocket
If this server crashes, all session data is lost.
Stateless (Good):
# Gateway server stores session in Redis
class GatewayServer:
    def __init__(self):
        self.redis = Redis()
        self.connections = {}  # Local connection map only

    def on_connect(self, user_id, websocket):
        session = self.redis.get(f"session:{user_id}")  # External storage
        self.connections[user_id] = websocket
        # Register this gateway as handling this user
        self.redis.hset(f"user:{user_id}:gateways", self.server_id, time.time())
If server crashes, user reconnects to any gateway and resumes from Redis.
Session Data in Redis
┌─────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED SESSION STORAGE │
└─────────────────────────────────────────────────────────────────────────┘

Redis Keys:
 session:{user_id}
 → {workspace_id, subscribed_channels[], last_seen_seq:{channel_id → seq}}

 user:{user_id}:gateways
 → {gateway_1: timestamp, gateway_2: timestamp} (multi-device)

 presence:{user_id}
 → {status: "online", last_heartbeat: timestamp} TTL: 60 seconds

 channel:{channel_id}:subscribers
 → Set of user_ids currently subscribed
Reconnection Flow
1. User's connection drops (network issue, server crash)

2. Client detects via:
 - WebSocket onclose event
 - Missed heartbeat response (no pong for 60s)

3. Client initiates reconnection:
 - Uses exponential backoff with jitter
 - Connects to any available gateway (via load balancer)

4. New gateway:
 - Authenticates user
 - Retrieves session from Redis
 - Subscribes to user's channels
 - Fetches missed messages since last_delivered_seq

5. Client is back online with no message loss
4.5 Database Scaling
“The database is often the hardest component to scale. Let me explain the strategies.”
Sharding Strategies
Option 1: Shard by workspace_id

Shard 1: workspaces 1-1000
Shard 2: workspaces 1001-2000
...

- Pros: Complete tenant isolation, simple routing
- Cons: Large workspaces become hotspots (Slack's #general channel problem)

Option 2: Shard by channel_id (Slack's current approach)

Shard = hash(workspace_id, channel_id) % num_shards

- Pros: Spreads hot channels across shards
- Cons: Cross-channel queries require scatter-gather
Option 3: Shard by (channel_id, time_bucket) (Discord’s approach)
Partition Key: (channel_id, bucket)
Bucket: 10-day time window
Clustering Key: message_id (Snowflake)
- Pros: Bounded partition size, efficient time-range queries
- Cons: Complex composite queries
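Options 2 and 3 each reduce to a small routing function. A sketch (the 10-day bucket width follows the Discord scheme above; the shard count and hash function are arbitrary illustration choices):

```python
# Sketch: shard routing for Option 2 (hash) and Option 3 (time buckets).
import hashlib

NUM_SHARDS = 16
BUCKET_SECONDS = 10 * 24 * 3600  # Discord-style 10-day window

def shard_for(workspace_id: int, channel_id: int) -> int:
    # Option 2: stable hash of (workspace_id, channel_id) picks the shard.
    digest = hashlib.md5(f"{workspace_id}:{channel_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def partition_key(channel_id: int, created_at_unix: int):
    # Option 3: (channel_id, bucket) keeps each partition bounded in size.
    return (channel_id, created_at_unix // BUCKET_SECONDS)

s = shard_for(42, 123)
```

Note that `shard_for` must never use Python's built-in `hash()`, which is randomized per process; routing hashes have to be stable across servers and restarts.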
Discord’s Scaling Journey
| Metric | Cassandra (Before) | ScyllaDB (After) |
|---|---|---|
| Nodes | 177 | 72 |
| P99 Read Latency | 40-125ms | 15ms |
| P99 Write Latency | 5-70ms (variable) | 5ms (stable) |
| Messages Stored | Trillions | Trillions |
Why ScyllaDB outperformed Cassandra:
- C++ implementation eliminates JVM garbage collection pauses
- Shard-per-core architecture for predictable performance
- Same data model and query language (CQL compatible)
Slack’s Vitess Migration
Slack migrated from monolithic MySQL to Vitess (horizontal MySQL sharding):
- 2.3 million queries per second at peak
- 2ms median latency
- Preserves ACID transactions within a shard
- VTGate layer handles routing transparently
Database Choice Decision Framework
┌─────────────────────────────────────────────────────────────────────────┐
│ DATABASE SELECTION GUIDE │
└─────────────────────────────────────────────────────────────────────────┘

 Need ACID Transactions?
 │
 ┌────────────┴────────────┐
 │ YES │ NO
 ▼ ▼
 ┌─────────────────┐ ┌─────────────────┐
 │ MySQL + Vitess │ │ Need Low Latency│
 │ PostgreSQL │ │ at Scale? │
 └─────────────────┘ └────────┬────────┘
 │
 ┌────────────┴────────────┐
 │ YES │ NO
 ▼ ▼
 ┌─────────────────┐ ┌─────────────────┐
 │ ScyllaDB │ │ Cassandra │
 │ (C++, no GC) │ │ (JVM, mature) │
 └─────────────────┘ └─────────────────┘
Part 5: Message Delivery & Ordering (10 minutes)
5.1 Message Ordering Guarantees
“A critical question for chat systems: how do we ensure messages appear in the correct order?”
What Level of Ordering Do We Need?
| Ordering Level | Description | Difficulty | Needed? |
|---|---|---|---|
| Global ordering | All messages across all channels ordered | Very Hard | No |
| Per-channel ordering | Messages within a channel are ordered | Medium | Yes |
| Per-sender ordering | Messages from same sender are ordered | Easy | Yes |
| Causal ordering | Replies appear after original message | Hard | Nice to have |
Key insight: “Per-channel ordering is sufficient for chat. Users don’t care if Channel A’s message 5 came before or after Channel B’s message 10.”
Implementing Per-Channel Ordering
Approach 1: Database Atomic Counter
-- For each message insert:
BEGIN TRANSACTION;
 SELECT COALESCE(MAX(message_seq), 0) + 1 INTO @next_seq
 FROM message WHERE channel_id = ?
 FOR UPDATE; -- lock so two concurrent writers can't read the same MAX

 INSERT INTO message (channel_id, message_seq, ...)
 VALUES (?, @next_seq, ...);
COMMIT;
- Pros: Simple, strongly consistent
- Cons: Single point of contention for high-traffic channels
Approach 2: Snowflake IDs (Twitter/Discord)
Snowflake ID (64 bits):
┌─────────────────────────────────────────────────────────────────┐
│ 41 bits: timestamp │ 10 bits: machine ID │ 12 bits: seq │
│ (milliseconds) │ (1024 machines) │ (4096/ms) │
└─────────────────────────────────────────────────────────────────┘

Example: 1234567890123456789
 └─── Encodes: time, which machine, sequence within that ms
- Pros: Decentralized generation, time-sortable, globally unique
- Cons: Clock skew can cause ordering issues, not strictly monotonic per channel
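The 41/10/12 bit layout above is just shifting and masking. A sketch of the packing (the epoch constant is an arbitrary illustration choice; real generators also read the clock themselves and handle clock regression):

```python
# Sketch: Snowflake-style ID packing — 41 bits of milliseconds since a custom
# epoch, 10 bits of machine ID, 12 bits of per-millisecond sequence.
EPOCH_MS = 1_288_834_974_657  # Twitter's epoch, used here only for illustration

def make_snowflake(timestamp_ms: int, machine_id: int, seq: int) -> int:
    assert machine_id < 1024 and seq < 4096
    return ((timestamp_ms - EPOCH_MS) << 22) | (machine_id << 12) | seq

def unpack_snowflake(snowflake: int):
    return (
        (snowflake >> 22) + EPOCH_MS,  # timestamp (ms)
        (snowflake >> 12) & 0x3FF,     # machine ID
        snowflake & 0xFFF,             # sequence
    )

sid = make_snowflake(EPOCH_MS + 1_000, machine_id=7, seq=42)
```

Because the timestamp occupies the high bits, sorting IDs numerically sorts them by creation time (to millisecond precision), which is what makes them useful as clustering keys.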
Approach 3: Hybrid (Recommended)
# Message has both:
message = {
    "message_id": generate_snowflake(),     # Global unique, time-sortable
    "message_seq": get_channel_sequence(),  # Per-channel strict ordering
    "channel_id": 123,
    "body": "Hello!"
}
- Use Snowflake for global uniqueness and rough time ordering
- Use per-channel sequence for strict ordering within channel
- Sequence can be generated by channel service (single leader per channel)
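Strict per-channel sequences also give clients gap detection for free: if a sequence number is missing, the client knows to re-fetch that range via the paginated GET endpoint. A sketch:

```python
# Sketch: a client spotting missing per-channel sequence numbers so it can
# re-fetch the gap with GET /channels/{id}/messages?after={seq}.
def find_gaps(received_seqs):
    seqs = sorted(set(received_seqs))
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur > prev + 1:
            gaps.append((prev + 1, cur - 1))  # inclusive missing range
    return gaps

gaps = find_gaps([1, 2, 3, 6, 7, 10])
```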
Handling Simultaneous Messages
“What if two users send messages at the exact same millisecond?”
User A sends "Hello" ─────┐
 ├───► Channel Server ───► Database
User B sends "Hi" ────────┘

Both arrive at same millisecond. Which is first?
Solution: Single leader per channel
The Channel Server (or database primary) serializes all writes to a channel. It assigns sequence numbers atomically:
import threading

class ChannelServer:
    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.next_seq = self.load_from_db()
        self.lock = threading.Lock()

    def assign_sequence(self):
        with self.lock:
            seq = self.next_seq
            self.next_seq += 1
            return seq
Trade-off: This creates a bottleneck for very active channels. Solutions:
- Shard large channels by time bucket
- Accept eventual consistency with client-side sorting
5.2 Delivery Guarantees
The Impossible Problem: Exactly-Once Delivery
“True exactly-once delivery is impossible in distributed systems. This is proven by the Two Generals Problem.”
Two Generals Problem:
 General A ───message───► General B
 (may be lost)

 General A ◄───ack────── General B
 (may be lost)

Neither general can ever be certain the other received the message.
Practical Solution: At-Least-Once Semantics + Idempotency
┌─────────────────────────────────────────────────────────────────────────┐
│ AT-LEAST-ONCE WITH DEDUPLICATION │
└─────────────────────────────────────────────────────────────────────────┘

1. Client sends message with idempotency_key
 {idempotency_key: "abc-123", text: "Hello!"}

2. Server checks if key exists in Redis:
 - If exists → return cached response (duplicate)
 - If not → process and cache response

3. Client doesn't receive response (network issue)

4. Client retries with SAME idempotency_key

5. Server recognizes duplicate, returns cached response

6. Message appears exactly once to users
Idempotency Key Implementation
import json

# Assumes a module-level `redis` client and helpers `store_message` /
# `wait_for_result` defined elsewhere.
def send_message(request):
    key = f"idempotency:{request.idempotency_key}"

    # Check for duplicate
    existing = redis.get(key)
    if existing:
        return json.loads(existing)  # Return cached response

    # Try to acquire lock (prevent concurrent duplicates)
    lock_acquired = redis.set(key, "processing", nx=True, ex=30)
    if not lock_acquired:
        # Another request is processing, wait for it
        return wait_for_result(key)

    try:
        # Process the message
        message = store_message(request)
        response = {"message_id": message.id, "status": "sent"}

        # Cache response for 24 hours
        redis.setex(key, 86400, json.dumps(response))
        return response
    except Exception:
        redis.delete(key)  # Release lock on failure
        raise
Client-Side Deduplication
Even with server-side idempotency, clients should deduplicate:
class MessageHandler {
  constructor() {
    this.seenMessageIds = new Set();
  }

  onMessageReceived(message) {
    if (this.seenMessageIds.has(message.message_id)) {
      return; // Already displayed this message
    }
    this.seenMessageIds.add(message.message_id);
    this.displayMessage(message);
  }
}
5.3 Offline Message Delivery
“When a user is offline, we need to ensure they receive messages when they come back online.”
Approach 1: Inbox Table (Not Recommended)
-- Store undelivered messages per user
CREATE TABLE inbox (
 user_id BIGINT,
 message_id BIGINT,
 channel_id BIGINT,
 created_at TIMESTAMP,
 PRIMARY KEY (user_id, message_id)
);
Problems:
- Storage explodes for users in large channels
- Need to delete after delivery (write amplification)
- Fanout cost: 1 message to 10K members = 10K inbox rows
Approach 2: Cursor-Based Catchup (Recommended)
1-- Track delivery progress per user per channel
2CREATE TABLE user_channel_cursor (
3 workspace_id BIGINT,
4 user_id BIGINT,
5 channel_id BIGINT,
6 last_delivered_seq BIGINT, -- Last message user received
7 last_read_seq BIGINT, -- Last message user saw
8 PRIMARY KEY (workspace_id, user_id, channel_id)
9);
On reconnect:
def handle_reconnect(user_id, channel_id):
    cursor = db.get_cursor(user_id, channel_id)

    # Fetch messages since last delivery
    missed_messages = db.query("""
        SELECT * FROM message
        WHERE channel_id = ? AND message_seq > ?
        ORDER BY message_seq ASC
        LIMIT 1000
    """, channel_id, cursor.last_delivered_seq)

    # Send to client
    for msg in missed_messages:
        websocket.send(msg)

    # Advance cursor (only if anything was missed)
    if missed_messages:
        db.update_cursor(user_id, channel_id,
                         last_delivered_seq=missed_messages[-1].message_seq)
Advantages:
- No per-message-per-user storage
- Messages stored once, cursors are tiny
- Works for any channel size
Part 6: Presence & Multi-Device Sync (5 minutes)
6.1 Presence System (Online/Offline Status)
“The green dot showing someone is online is surprisingly complex at scale.”
Requirements
- Show real-time online/offline status
- Handle brief disconnections gracefully (don’t flicker)
- Scale to millions of concurrent users
- Tolerate false positives (better to show online when offline than vice versa)
Heartbeat-Based Presence
┌─────────────────────────────────────────────────────────────────────────┐
│ PRESENCE SYSTEM │
└─────────────────────────────────────────────────────────────────────────┘

Client ──── heartbeat ────► Gateway ──── update ────► Presence Service
 │
 ▼
 Redis (TTL keys)
 │
 ▼
 presence:{user_id}
 TTL: 60 seconds
Flow:
- Client sends heartbeat every 30 seconds
- Gateway forwards to Presence Service
- Presence Service sets Redis key with 60 second TTL
- If no heartbeat received, key expires → user is offline
- Status changes published via pub/sub
Why the Asymmetry? (30s heartbeat, 60s TTL)
- Network jitter: Heartbeats may be delayed by 5-10 seconds
- Retry opportunity: Client has one retry window before marked offline
- Graceful degradation: Brief network glitches don’t cause status flicker
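The 30s-heartbeat/60s-TTL scheme can be sketched without Redis by tracking an expiry per user. An injected `now` clock keeps the timing deterministic; in production a Redis key with a 60-second TTL plays this role:

```python
# Sketch: heartbeat-refreshed TTL presence. Each heartbeat re-arms the TTL,
# like SETEX presence:{user_id} 60; silence past the TTL means offline.
TTL_SECONDS = 60

class Presence:
    def __init__(self):
        self.expires_at = {}  # user_id -> time the "key" expires

    def heartbeat(self, user_id, now):
        self.expires_at[user_id] = now + TTL_SECONDS

    def is_online(self, user_id, now):
        return self.expires_at.get(user_id, 0) > now

p = Presence()
p.heartbeat("alice", now=0)
# One missed heartbeat at t=30 is fine: the TTL outlives it...
online_at_45 = p.is_online("alice", now=45)
# ...but silence past the TTL flips the user offline.
online_at_61 = p.is_online("alice", now=61)
```

This is exactly the asymmetry described above: a client can miss one 30-second heartbeat and still be shown online, so brief network glitches don't make the green dot flicker.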
Actor Model for Presence
The actor model is particularly elegant for presence:
# Each user gets their own actor (Erlang process)
defmodule UserPresenceActor do
  def init(user_id) do
    # Set timer for offline transition
    timer = schedule_offline_check(60_000) # 60 seconds
    {:ok, %{user_id: user_id, timer: timer, status: :online}}
  end

  def handle_cast(:heartbeat, state) do
    # Cancel old timer, schedule new one
    Process.cancel_timer(state.timer)
    new_timer = schedule_offline_check(60_000)
    {:noreply, %{state | timer: new_timer}}
  end

  def handle_info(:offline_check, state) do
    # No heartbeat received in time
    broadcast_status_change(state.user_id, :offline)
    {:stop, :normal, state}
  end
end
Why actors work well:
- Each user’s presence is independent
- Millions of lightweight processes (2KB each in Erlang)
- One actor crash doesn’t affect others
- Timers are natural with self-messages
Presence Fanout Challenge
“When User A comes online, who needs to know?”
Naive approach: Notify all users in all channels User A is in.
Problem: User in 100 channels with 100 members each = 10,000 notifications.
Optimized approach:
- Only notify users who are currently viewing a channel with User A
- Client subscribes to presence updates for visible channel members only
- Lazy loading: fetch presence on-demand when opening a channel
6.2 Multi-Device Synchronization
“A user might have Slack open on their laptop, phone, and tablet simultaneously.”
Challenges
- Message delivery to all devices
- Read status sync — Read on phone, laptop shows as read too
- Typing indicators — Don’t show “typing” from your own other device
- Notification suppression — If active on laptop, don’t buzz phone
Device Registration
Redis structure:
 user:{user_id}:devices
 → {
 device_1: {gateway_id: "gw-1", last_active: timestamp, platform: "ios"},
 device_2: {gateway_id: "gw-2", last_active: timestamp, platform: "web"}
 }
Read Status Sync
def mark_as_read(user_id, channel_id, message_seq):
    # Update cursor in database
    db.update("""
        UPDATE user_channel_cursor
        SET last_read_seq = ?
        WHERE user_id = ? AND channel_id = ?
    """, message_seq, user_id, channel_id)

    # Publish to user's other devices
    pubsub.publish(f"user:{user_id}:sync", {
        "type": "read_status",
        "channel_id": channel_id,
        "last_read_seq": message_seq
    })
Consistency model: Eventual consistency is acceptable for read status. A 1-2 second delay in syncing across devices is fine.
Part 7: Failure Handling & Reliability (5 minutes)
7.1 What Can Fail?
“In a distributed system, everything will fail eventually. Let me address each failure mode:”
| Component | Failure Mode | Impact | Mitigation |
|---|---|---|---|
| Client network | Disconnection | User can’t send/receive | Reconnection with catchup |
| Gateway server | Crash | Users disconnected | Stateless design, auto-reconnect |
| Message service | Crash | Can’t process messages | Multiple instances, queue buffering |
| Database | Primary failure | Writes fail | Automatic failover to replica |
| Pub/sub | Redis crash | No real-time delivery | Fallback to polling, Redis Cluster |
| Entire datacenter | Power outage | Full outage | Multi-region deployment |
7.2 Client Reconnection with Exponential Backoff
“When connection drops, clients must reconnect intelligently to avoid overwhelming servers.”
The Thundering Herd Problem
Scenario: AWS outage takes down all connections
  10 million clients try to reconnect simultaneously

Without backoff:
  Second 1: 10M connection attempts → servers crash
  Second 2: 10M more attempts → still crashed
  ...

With jittered exponential backoff:
  Second 1: ~100K attempts (some clients start)
  Second 2: ~200K attempts (more clients start)
  Second 10: Spread across time, servers can handle it
Implementation
import random
import time

def reconnect_with_backoff(connect):
    base_delay = 1.0   # 1 second
    max_delay = 30.0   # 30 seconds max
    attempt = 0
    connected = False

    while not connected:
        # Calculate delay with exponential backoff, capped at max_delay
        delay = min(base_delay * (2 ** attempt), max_delay)

        # Add jitter (randomness) to spread out retries ("full jitter")
        jittered_delay = random.uniform(0, delay)

        time.sleep(jittered_delay)

        try:
            connect()
            connected = True
            attempt = 0  # Reset on success
        except ConnectionError:
            attempt += 1
AWS research shows: Full jitter reduces total retry attempts by over 50% compared to exponential backoff without jitter.
Network Change Detection
// When network changes (WiFi → cellular), reset backoff
window.addEventListener('online', () => {
  reconnectAttempt = 0;  // Reset backoff
  reconnect();           // Try immediately
});
7.3 Gateway Server Failure
┌─────────────────────────────────────────────────────────────────────────┐
│                       GATEWAY FAILURE RECOVERY                          │
└─────────────────────────────────────────────────────────────────────────┘

BEFORE:
┌──────────┐     ┌──────────────┐
│  Client  │◄───►│  Gateway 1   │  ← Server crashes
└──────────┘     └──────────────┘
                        │
                        ▼
                 ┌─────────────┐
                 │    Redis    │  Session data safe here
                 │   (state)   │
                 └─────────────┘

AFTER:
┌──────────┐     ┌──────────────┐
│  Client  │◄───►│  Gateway 2   │  ← Reconnects to different server
└──────────┘     └──────────────┘
                        │
                        ▼
                 ┌─────────────┐
                 │    Redis    │  ← Retrieves same session
                 │   (state)   │
                 └─────────────┘

Client experience: Brief disconnection, then seamless resume
7.4 Database Failure
Leader-Follower Replication
              ┌─────────────┐
 Writes ────► │   Primary   │
              │  (Leader)   │
              └──────┬──────┘
                     │ Replication
        ┌────────────┼────────────┐
        ▼            ▼            ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ Replica  │ │ Replica  │ │ Replica  │
  │    1     │ │    2     │ │    3     │
  └──────────┘ └──────────┘ └──────────┘
        ▲            ▲            ▲
        └────────────┴────────────┘
                  Reads
When primary fails:
- Automated failover promotes Replica 1 to Primary
- Other replicas reconfigure to follow new primary
- Application reconnects to new primary
- Potential data loss: Last few writes may not have replicated
Slack’s AZ Drain Button
Slack built manual tooling for datacenter failures:
- AZ drain button: Redirects all traffic away from failing availability zone
- Execution time: Under 5 minutes
- Mechanism: Envoy weighted clusters shift traffic 100% to healthy AZs
“Sometimes human judgment is better than automated failover for complex scenarios.”
7.5 Split-Brain Prevention
“Split-brain occurs when network partition makes two nodes think they’re both the primary.”
Network Partition:
                     ┌───────────────────────────────┐
 ┌──────────┐        │           PARTITION           │        ┌──────────┐
 │ Primary? │◄───────┼──────────────X────────────────┼───────►│ Primary? │
 │  Node A  │        │                               │        │  Node B  │
 └──────────┘        └───────────────────────────────┘        └──────────┘

Both nodes think: "I can't reach the other one, I must take over!"
Result: Two primaries accepting writes → data divergence
Prevention: Quorum-based consensus
Cluster of 3 nodes:
- Majority = 2 nodes
- If a partition splits 2-1, the side with 2 can operate
- The side with 1 knows it doesn't have majority, refuses writes

┌──────────┐       ┌──────────┐           ┌────────────┐
│  Node A  │───────│  Node B  │     X     │   Node C   │
│ (active) │       │ (active) │           │ (isolated, │
└──────────┘       └──────────┘           │ read-only) │
   Majority (2/3): keeps serving          └────────────┘
                                           No majority: refuses writes
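The majority rule reduces to a one-line check. A minimal sketch, counting the node itself among the reachable set:

```python
def has_quorum(reachable, cluster_size):
    # A node may accept writes only when it can reach a strict majority
    # of the cluster, itself included; otherwise it must go read-only.
    return reachable >= cluster_size // 2 + 1
```

With a 3-node cluster, `has_quorum(2, 3)` holds for the 2-node side and `has_quorum(1, 3)` fails for the isolated node, which is exactly the behavior in the diagram.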
Part 8: Multi-Tenancy (3 minutes)
8.1 Enterprise Isolation Requirements
“Slack is a B2B product. Each company (workspace) expects complete isolation.”
Isolation Levels
| Level | Description | Complexity | Use Case |
|---|---|---|---|
| Logical | Same database, workspace_id in every query | Low | Small/free workspaces |
| Database | Separate database per workspace | Medium | Pro workspaces |
| Physical | Separate servers per workspace | High | Enterprise with compliance |
Implementation: workspace_id Everywhere
-- EVERY query must include workspace_id
SELECT * FROM message
WHERE workspace_id = ? AND channel_id = ?;

-- Primary keys include workspace_id
PRIMARY KEY (workspace_id, channel_id, message_seq)
Why?
- Sharding: workspace_id is natural partition key
- Security: Much harder to accidentally query the wrong workspace
- Performance: Database can route to correct shard immediately
Row-Level Security
-- PostgreSQL row-level security (must be enabled on the table first)
ALTER TABLE message ENABLE ROW LEVEL SECURITY;
CREATE POLICY workspace_isolation ON message
  USING (workspace_id = current_setting('app.workspace_id')::BIGINT);

-- Application sets context before queries
SET app.workspace_id = 123;
SELECT * FROM message WHERE channel_id = 456;
-- Automatically filters to workspace 123
Rate Limiting per Workspace
Prevent one workspace from monopolizing resources:
def rate_limit(workspace_id, operation):
    key = f"rate_limit:{workspace_id}:{operation}"
    current = redis.incr(key)
    if current == 1:
        redis.expire(key, 60)  # 1 minute window

    limits = {
        "free": 1000,
        "pro": 10000,
        "enterprise": 100000
    }
    plan = get_workspace_plan(workspace_id)

    if current > limits[plan]:
        raise RateLimitExceeded()
Part 9: Deep Dive Topics (Interview Follow-ups)
9.1 What if the Pub/Sub Queue Goes Down?
Outbox Pattern:
def send_message(message):
    with db.transaction():
        # Write message AND outbox event in same transaction
        db.insert("message", message)
        db.insert("outbox", {
            "event_type": "new_message",
            "payload": message,
            "processed": False
        })

    # If pub/sub is up, process immediately
    try:
        publish_to_pubsub(message)
        db.update("outbox", {"processed": True}, ...)
    except PubSubDown:
        pass  # Background worker will retry

# Background worker
def process_outbox():
    while True:
        events = db.query("SELECT * FROM outbox WHERE processed = FALSE")
        for event in events:
            try:
                publish_to_pubsub(event.payload)
                db.update("outbox", {"processed": True}, ...)
            except PubSubDown:
                time.sleep(1)  # Retry later
        time.sleep(1)  # Poll interval; avoids a busy loop when the table is empty
9.2 How to Handle Large Channels (10K+ members)?
Problem: Sending a message to 10K members = 10K fanout operations
Solutions:
- Lazy fanout: Only notify online members, others catchup on reconnect
- Tiered delivery: Push to first 1000 members, others poll
- Read-only for very large channels: Disable @channel mentions
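Lazy fanout in particular fits in a few lines; the helper signatures (`is_online`, `push`) are assumptions standing in for the presence service and the gateway push path:

```python
def fan_out(message, member_ids, is_online, push):
    # Push only to members who are online right now; offline members cost
    # nothing here, because on reconnect they fetch the channel log with
    # "seq > last_read_seq" and catch up.
    pushed = 0
    for member_id in member_ids:
        if is_online(member_id):
            push(member_id, message)  # real-time delivery path
            pushed += 1
    return pushed
```

In a 10K-member channel where only a few hundred members are online, this turns 10K fanout operations into a few hundred.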
9.3 Message Search Implementation
┌─────────────────────────────────────────────────────────────────────────┐
│                          SEARCH ARCHITECTURE                            │
└─────────────────────────────────────────────────────────────────────────┘

┌──────────┐      ┌─────────────┐      ┌─────────────────┐
│ Message  │─────►│    Kafka    │─────►│  Elasticsearch  │
│ Service  │      │  (stream)   │      │ (search index)  │
└──────────┘      └─────────────┘      └─────────────────┘
                                               │
                 Search queries ───────────────┘
- Messages are primarily stored in Cassandra/MySQL
- Kafka streams changes to Elasticsearch (near-realtime)
- Search queries hit Elasticsearch
- Indexing delay: 1-5 seconds (acceptable for search)
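The Kafka-to-Elasticsearch leg is a plain consumer loop. A hedged sketch, assuming message events carry `workspace_id`, `message_id`, `text`, and `channel_id` fields; the two client objects are stand-ins for a Kafka consumer and an Elasticsearch client:

```python
def index_loop(consumer, es):
    # Consume message events and index each into a per-workspace index,
    # so search queries stay inside one tenant's data.
    for event in consumer:  # e.g. an iterable Kafka consumer
        es.index(
            index=f"messages-{event['workspace_id']}",
            id=event["message_id"],
            document={
                "text": event["text"],
                "channel_id": event["channel_id"],
            },
        )
```

Indexing per workspace also keeps the multi-tenancy guarantee from Part 8 intact on the search path.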
9.4 File Upload Flow
┌─────────────────────────────────────────────────────────────────────────┐
│                       PRESIGNED URL UPLOAD FLOW                         │
└─────────────────────────────────────────────────────────────────────────┘

1. Client ──── POST /files/upload-url ────► API Server
                                                │
2.                                    Generate presigned URL
                                                │
3. Client ◄─── {upload_url, file_id} ──────────┘

4. Client ──── PUT [binary data] ────► S3 (direct upload)

5. Client ──── POST /messages {file_id} ────► API Server

6. API Server validates file_id exists in S3, creates message
Why presigned URLs?
- Large files don’t go through API servers
- Direct S3 upload is faster and cheaper
- API servers stay lightweight
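An end-to-end simulation of the six steps. A dict stands in for S3 so the sketch runs self-contained; in production step 2 would come from the object store's SDK (e.g. a presigned `put_object` URL) and step 4 would be a real HTTP PUT:

```python
import uuid

s3 = {}  # stand-in for the object store: upload_url -> bytes

def request_upload_url():
    # Steps 1-3: API server mints a file_id and a short-lived upload URL
    file_id = uuid.uuid4().hex
    return {"file_id": file_id, "upload_url": f"/bucket/{file_id}"}

def upload(upload_url, data):
    # Step 4: client PUTs bytes directly to storage, bypassing the API tier
    s3[upload_url] = data

def create_message(channel_id, file_id):
    # Steps 5-6: API server verifies the object exists, then creates the message
    if f"/bucket/{file_id}" not in s3:
        raise ValueError("file_id has no uploaded object")
    return {"channel_id": channel_id, "file_id": file_id}
```

The validation in step 6 matters: without it, a client could attach a `file_id` that was never uploaded and create a message pointing at nothing.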
Part 10: Interview Strategy
10.1 Time Management (45-minute interview)
| Phase | Time | Focus |
|---|---|---|
| Clarify requirements | 5 min | Ask questions, establish scope |
| High-level design | 10 min | Draw architecture, explain components |
| Core deep dive | 20 min | Database, WebSocket, message delivery |
| Failure handling | 5 min | Proactively discuss failures |
| Wrap-up / Questions | 5 min | Extensions, tradeoffs |
10.2 What Interviewers Look For
Based on real OpenAI interview feedback:
- Database design (100% asked): You must draw schemas and explain sharding
- Fan-out problem (50% asked): How to send message to 10K members
- Multi-tenancy (50% asked): How to isolate company data
- Failure handling (50% asked): What if servers crash?
10.3 Common Mistakes to Avoid
| Mistake | Why It’s Bad | What to Do Instead |
|---|---|---|
| Being reactive | Waiting for interviewer to ask about scale/failures | Proactively say “Let me discuss what happens when this fails…” |
| Forgetting multi-tenancy | Designing for single company | Always include workspace_id from the start |
| Ignoring fan-out | Saying “just send to everyone” | Explain pub/sub, batching, lazy delivery |
| Vague database design | Just saying “use a database” | Draw tables, specify keys, explain indexes |
| No numbers | “A lot of messages” | “1 billion messages/day, 12K messages/second” |
10.4 Key Phrases to Use
✅ “Let me start with requirements to ensure I understand the scope…”
✅ “For a billion-user system, we need to consider…”
✅ “The tradeoff here is between consistency and availability…”
✅ “What happens when this component fails? Let me address that…”
✅ “Discord handles this by…, Slack’s approach is…, let me explain why I’d choose…”
✅ “At scale, this becomes a bottleneck, so we need to…”
Appendix: Quick Reference
Production Numbers to Cite
| Platform | Concurrent Users | Messages/Day | P99 Latency | Database |
|---|---|---|---|---|
| Discord | 15M+ | Trillions stored | 15ms read | ScyllaDB |
| Slack | 5M+ WebSocket | Millions | 500ms global | MySQL/Vitess |
| WhatsApp | 147M peak | 100B+ | Sub-second | Mnesia |
Technology Choices Cheat Sheet
| Component | Small Scale | Large Scale |
|---|---|---|
| Database | PostgreSQL | Vitess (MySQL) or ScyllaDB |
| Cache | Redis single | Redis Cluster |
| Pub/Sub | Redis Pub/Sub | Kafka + Redis |
| WebSocket | Node.js | Elixir/Erlang (BEAM) |
| Message Queue | RabbitMQ | Kafka |
Key Formulas
Connections per server: 100K (conservative) to 2M (optimized Elixir)
RAM per connection: ~30 KB
Messages per second: DAU × messages_per_user / 86400
Storage per day: messages × avg_size × replication_factor
Fanout cost: messages × avg_channel_size
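Plugging the document's own targets into these formulas (1 billion+ messages/day; the per-user rate, message size, and replication factor below are illustrative assumptions):

```python
dau = 100_000_000        # assumption: 100M daily active users
msgs_per_user = 10       # assumption: 10 messages per user per day

messages_per_day = dau * msgs_per_user        # 1 billion/day
messages_per_sec = messages_per_day / 86400   # ~11.6K, i.e. the "12K msg/s" cited earlier

avg_size = 1_000         # assumption: ~1 KB per message row
replication_factor = 3
storage_per_day = messages_per_day * avg_size * replication_factor  # ~3 TB/day
```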
This document consolidates material from ByteByteGo, Discord Engineering, Slack Engineering, WhatsApp architecture papers, and real interview experiences.