Design a Slack-Like Chat System

How to Use This Document

This document is structured as a 45-60 minute technical interview script or lecture. It flows from foundational concepts to advanced distributed systems patterns. Each section includes:

  • What to say (the teaching content)
  • What to draw (diagrams to sketch)
  • Key numbers (statistics to cite)
  • Interview tips (how to present in an interview setting)

Part 1: Setting the Stage (5 minutes)

1.1 Opening Statement

“Today we’re designing a large-scale chat system for companies, similar to Slack. Before diving into the architecture, I want to establish the scope and understand what makes this problem interesting at scale.”

Key insight to share: “The fundamental challenge in chat systems is balancing real-time delivery with reliability. We need messages to arrive instantly, but we also can’t lose them. At FAANG scale, this means handling billions of messages daily while maintaining sub-second latency.”

1.2 Clarifying Requirements (Always Start Here)

In an interview, always ask clarifying questions first. This shows structured thinking and helps scope the problem appropriately.

Functional Requirements (What the system does)

Ask the interviewer to confirm:

  1. Channel creation — Can users create channels with multiple participants? What about direct messages (1:1)?
  2. Messaging — Send and receive messages in real-time within channels
  3. Offline delivery — Users receive messages sent while they were offline
  4. Media files — Support for images, documents, and other attachments
  5. Multi-tenancy — Multiple companies (workspaces) use the same infrastructure with complete data isolation

Below the line (mention but deprioritize):

  • Message editing and deletion
  • Reactions and threading
  • Search functionality
  • Private channels with access control

Non-Functional Requirements (How the system performs)

| Requirement | Target | Rationale |
|---|---|---|
| Latency | < 500ms end-to-end | Users expect “instant” messaging |
| Throughput | 1 billion+ messages/day | Enterprise scale |
| Availability | 99.9% - 99.99% | Chat is mission-critical for businesses |
| Delivery guarantee | At-least-once with deduplication | Messages must not be lost |
| Consistency | Eventual for display, ordered within channels | Perfect global ordering is unnecessary |

Below the line:

  • End-to-end encryption
  • Compliance and audit logging
  • Geographic data residency

Scale Estimation (Back-of-envelope)

“Let me estimate the scale we’re designing for:”

Users:           100 million registered users
DAU:             20 million daily active users
Concurrent:      5 million simultaneous connections (peak)
Messages/day:    1 billion messages
Messages/second: ~12,000 average, ~50,000 peak

Message size:    ~1 KB average (text + metadata)
Storage/day:     1 billion × 1 KB = 1 TB/day
Storage/year:    365 TB/year (before replication)

Connections:     5 million WebSocket connections
                 At ~30 KB RAM per connection = 150 GB RAM for connections alone
                 Need ~5 servers at 1M connections each (aggressive)
                 Or ~50 servers at 100K connections each (conservative)
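As a sanity check on the estimates above, here is the arithmetic as a quick Python sketch (the constants mirror the stated assumptions, not measured values):

```python
# Back-of-envelope inputs from the estimates above
messages_per_day = 1_000_000_000
avg_msg_size_bytes = 1_000          # ~1 KB text + metadata
concurrent_conns = 5_000_000
ram_per_conn_kb = 30

storage_per_day_tb = messages_per_day * avg_msg_size_bytes / 1e12   # 1.0 TB/day
avg_msgs_per_sec = messages_per_day / 86_400                        # ~11,600/s average
total_conn_ram_gb = concurrent_conns * ram_per_conn_kb / 1e6        # 150 GB
```

Note the peak of ~50,000 msg/s is roughly 4x the daily average, a typical diurnal-traffic multiplier.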

Part 2: Core Design Foundations (10 minutes)

2.1 Core Entities

“Let me identify the fundamental data entities we’ll be working with:”

┌─────────────────────────────────────────────────────────────────┐
│                        CORE ENTITIES                            │
├─────────────────────────────────────────────────────────────────┤
│  Workspace    │ The company/organization container              │
│  Channel      │ A conversation space (public or private)        │
│  User         │ A person who can send/receive messages          │
│  Membership   │ Relationship between users and channels         │
│  Message      │ The content sent within a channel               │
│  Device       │ A user's connected client (phone, laptop, etc.) │
└─────────────────────────────────────────────────────────────────┘

2.2 API Design

“I’ll define the core APIs. I prefer to think about these in terms of the user actions they enable:”

Interview tip: When listing APIs, focus on the most important ones first. Don’t list every possible API. It wastes time and can make you seem unprepared if you can’t explain all of them.

Essential APIs

# Channel Management
POST   /workspaces/{workspace_id}/channels
        create_channel(workspace_id, name, is_private, initial_members[])
        Returns: channel_id

POST   /channels/{channel_id}/members
        add_member(channel_id, user_id, role)
        Returns: membership_id

# Messaging
POST   /channels/{channel_id}/messages
        send_message(channel_id, sender_id, content, idempotency_key)
        Returns: message_id, sequence_number

GET    /channels/{channel_id}/messages?after={seq}&limit={n}
        get_messages(channel_id, after_sequence, limit)
        Returns: messages[], has_more

# Real-time (WebSocket)
WS     /connect?token={auth_token}
        Establishes bidirectional connection
        Server pushes: new_message, presence_update, typing_indicator
        Client sends: heartbeat, subscribe_channel, mark_read

# File Upload (separate flow)
POST   /files/upload-url
        get_presigned_url(file_name, file_size, content_type)
        Returns: upload_url, file_id

POST   /channels/{channel_id}/messages
        send_message(channel_id, sender_id, content, file_ids[])

API Design Principles

  1. Include workspace_id in paths — Essential for multi-tenancy and sharding
  2. Use idempotency keys — Prevents duplicate messages on retry
  3. Cursor-based pagination — Use sequence numbers, not page offsets
  4. Separate file upload — Don’t mix large binary data with message APIs
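Principle 3 in action: a minimal sketch of cursor-based pagination over message_seq. The helper name and in-memory message list are illustrative, not part of the API above.

```python
def get_messages(messages, after_seq, limit):
    """Return (page, has_more) for messages with seq > after_seq.

    `messages` is assumed sorted ascending by message_seq, as the
    database index would deliver them.
    """
    # Fetch one extra row to learn whether another page exists
    page = [m for m in messages if m["message_seq"] > after_seq][:limit + 1]
    has_more = len(page) > limit
    return page[:limit], has_more
```

Unlike page offsets, the cursor stays stable when new messages arrive: the client always asks for "everything after the last seq I have."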

2.3 High-Level Architecture Overview

“Let me draw the high-level architecture, then we’ll dive into each component:”

┌─────────────────────────────────────────────────────────────────────────────┐
│                              CLIENTS                                        │
│                    (Web, Mobile, Desktop Apps)                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           LOAD BALANCER                                     │
│                 (SSL termination, routing, rate limiting)                   │
└─────────────────────────────────────────────────────────────────────────────┘
                    │                               │
        ┌───────────┘                               └───────────┐
        ▼                                                       ▼
┌──────────────────────────┐                ┌──────────────────────────┐
│      API SERVERS         │                │     GATEWAY SERVERS      │
│   (HTTP REST requests)   │                │  (WebSocket connections) │
│   - Auth, file upload    │                │  - Real-time delivery    │
│   - Message send         │                │  - Presence, typing      │
└──────────────────────────┘                └──────────────────────────┘
        │                                                       │
        └──────────────────────────┬────────────────────────────┘
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          CORE SERVICES                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │   Channel   │  │   Message   │  │  Presence   │  │Notification │        │
│  │   Service   │  │   Service   │  │   Service   │  │   Service   │        │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
┌──────────────────────┐    ┌─────────────┐    ┌──────────────────────┐
│    MESSAGE QUEUE     │    │    CACHE    │    │      DATABASES       │
│ (Kafka/Redis Pub/Sub)│    │   (Redis)   │    │  (MySQL/Cassandra)   │
└──────────────────────┘    └─────────────┘    └──────────────────────┘

Part 3: Building the Basic System (10 minutes)

3.1 Start Simple: Single Server MVP

“Let me start with the simplest working system, then evolve it to handle scale.”

┌──────────┐         ┌──────────────────┐         ┌────────────┐
│  Client  │ ◄─────► │   Chat Server    │ ◄─────► │  Database  │
│  (User)  │   WS    │  (Single Node)   │   SQL   │  (MySQL)   │
└──────────┘         └──────────────────┘         └────────────┘

Single Server Workflow: Sending a Message

  1. User A sends message via WebSocket to Chat Server
  2. Chat Server validates the request (auth, permissions)
  3. Chat Server stores message in database
  4. Chat Server looks up channel members from database
  5. Chat Server pushes message to all online members via their WebSocket connections
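The five steps above can be sketched as a toy single-server handler. In-memory dicts stand in for the database and the WebSocket connections; all names are illustrative:

```python
class SingleChatServer:
    def __init__(self):
        self.messages = []       # stand-in for the message table
        self.members = {}        # channel_id -> set of member user_ids
        self.connections = {}    # user_id -> outbox list (stands in for a WebSocket)

    def send_message(self, sender_id, channel_id, body):
        # Step 2: validate — sender must be a member of the channel
        if sender_id not in self.members.get(channel_id, set()):
            raise PermissionError("sender is not a channel member")
        # Step 3: store the message
        msg = {"channel_id": channel_id, "sender_id": sender_id, "body": body}
        self.messages.append(msg)
        # Steps 4-5: look up members and push to everyone currently online
        for user_id in self.members[channel_id]:
            if user_id in self.connections:
                self.connections[user_id].append(msg)
        return msg
```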

What’s wrong with this?

  • Single point of failure
  • Can’t scale beyond one server’s connection limit (~100K-500K connections)
  • All members must be connected to the same server
  • Database becomes bottleneck

3.2 Database Schema (Foundation)

“Before scaling, let me establish the data model. This schema works for both SQL and NoSQL with modifications.”

Core Tables

-- Workspace: The company/organization
CREATE TABLE workspace (
    workspace_id    BIGINT PRIMARY KEY,
    name            VARCHAR(255),
    plan            ENUM('free', 'pro', 'enterprise'),
    created_at      TIMESTAMP
);

-- Channel: A conversation space
CREATE TABLE channel (
    workspace_id    BIGINT,
    channel_id      BIGINT,
    name            VARCHAR(255),
    is_private      BOOLEAN DEFAULT FALSE,
    created_at      TIMESTAMP,
    PRIMARY KEY (workspace_id, channel_id)
);

-- Channel Membership: Who is in which channel
CREATE TABLE channel_member (
    workspace_id    BIGINT,
    channel_id      BIGINT,
    user_id         BIGINT,
    role            ENUM('member', 'admin', 'owner'),
    joined_at       TIMESTAMP,
    PRIMARY KEY (workspace_id, channel_id, user_id)
);

-- User's view of channels (for quick lookup)
CREATE TABLE user_channel (
    workspace_id        BIGINT,
    user_id             BIGINT,
    channel_id          BIGINT,
    last_read_seq       BIGINT DEFAULT 0,      -- For unread counts
    last_delivered_seq  BIGINT DEFAULT 0,      -- For offline catchup
    muted               BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (workspace_id, user_id, channel_id)
);

-- Messages
CREATE TABLE message (
    workspace_id    BIGINT,
    channel_id      BIGINT,
    message_seq     BIGINT,                    -- Per-channel sequence number
    message_id      BIGINT,                    -- Global unique ID (Snowflake)
    sender_id       BIGINT,
    body            TEXT,
    created_at      TIMESTAMP,
    PRIMARY KEY (workspace_id, channel_id, message_seq)
);

Why This Schema Design?

  1. workspace_id in every table — Enables multi-tenancy and sharding
  2. Composite primary keys — (workspace_id, channel_id, message_seq) colocates related data
  3. Dual membership tables — channel_member for “who’s in this channel”, user_channel for “what channels is this user in”
  4. Sequence numbers — message_seq enables ordered pagination and gap detection
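One payoff of storing last_read_seq: the unread badge becomes a subtraction rather than a per-message scan. A hedged sketch (the helper name is made up):

```python
def unread_count(channel_latest_seq, last_read_seq):
    """Unread badge for one channel: messages after the user's read cursor.

    channel_latest_seq is the highest message_seq in the channel;
    last_read_seq comes from the user_channel row.
    """
    return max(0, channel_latest_seq - last_read_seq)
```

Because sequence numbers are dense per channel, no counting query is needed; the client can even compute this locally from cached cursors.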

3.3 Communication Protocols

“Let me explain why we use different protocols for different operations:”

Protocol Comparison

| Protocol | Use Case | Characteristics |
|---|---|---|
| HTTP/REST | Sending messages, file upload, auth | Request-response, stateless, easy to load balance |
| WebSocket | Receiving messages, presence, typing | Bidirectional, persistent connection, server push |
| Long Polling | Fallback when WebSocket unavailable | Higher latency, more overhead |
| HTTP/2 + SSE | Alternative to WebSocket | Multiplexed, unidirectional server push |

Why WebSocket for Receiving?

“The receiver side needs server push capability. Let me compare the options:”

Polling (Bad)

Client: "Any new messages?" → Server: "No"
Client: "Any new messages?" → Server: "No"
Client: "Any new messages?" → Server: "Yes, here's one"
  • Wastes resources checking constantly
  • High latency (message waits for next poll)

Long Polling (Better)

Client: "Any new messages? I'll wait..."
Server: [holds connection for 30 seconds]
Server: "Here's a message!" [or timeout]
  • Reduces requests but still overhead of reconnecting
  • Each response requires new connection

WebSocket (Best)

Client: "Let's open a persistent connection"
Server: [keeps connection open]
Server: "New message!" [instant push]
Server: "Another message!" [instant push]
  • Single persistent connection
  • Sub-100ms delivery latency
  • Efficient for high-frequency updates

Connection Lifecycle

┌─────────────────────────────────────────────────────────────────────────┐
│                    WebSocket Connection Lifecycle                       │
└─────────────────────────────────────────────────────────────────────────┘

1. ESTABLISH CONNECTION
   Client ──── HTTP Upgrade Request ────► Server
   Client ◄─── 101 Switching Protocols ── Server
   [WebSocket connection established]

2. AUTHENTICATE
   Client ──── {type: "auth", token: "..."} ────► Server
   Client ◄─── {type: "auth_ok", user_id: 123} ── Server

3. SUBSCRIBE TO CHANNELS
   Client ──── {type: "subscribe", channels: [...]} ────► Server
   Client ◄─── {type: "subscribed", channels: [...]} ──── Server

4. STEADY STATE
   Client ◄─── {type: "message", channel_id: 1, ...} ──── Server (push)
   Client ◄─── {type: "presence", user_id: 5, status: "online"} ── Server
   Client ──── {type: "heartbeat"} ────► Server (every 30 seconds)

5. DISCONNECTION
   Client ──── Close frame ────► Server
   [or Server ──── Close frame ────► Client]
   [or Connection times out after missed heartbeats]

Part 4: Scaling to Distributed System (15 minutes)

4.1 The Multi-Server Challenge

“When we have multiple servers, a fundamental problem emerges: the sender and receiver may be connected to different servers.”

┌──────────┐                                           ┌──────────┐
│  User A  │                                           │  User B  │
│ (Sender) │                                           │(Receiver)│
└────┬─────┘                                           └────┬─────┘
     │                                                      │
     ▼                                                      ▼
┌──────────┐                                           ┌──────────┐
│ Server 1 │         HOW DOES MESSAGE GET HERE? ─────► │ Server 2 │
└──────────┘                                           └──────────┘

Two solutions:

  1. Consistent Hashing — Deterministically route users to servers
  2. Pub/Sub Message Queue — Broadcast messages to all servers

4.2 Solution 1: Consistent Hashing

“We can use consistent hashing to ensure all members of a channel connect to the same server.”

How It Works

Hash Ring:
                    Server A
            ┌──────────┼──────────┐
           ╱                       ╲
     Server D                    Server B
           ╲                       ╱
            └──────────┬──────────┘
                    Server C

Channel 123 → hash(123) → position on ring → Server B
All users in Channel 123 connect to Server B
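To make the routing concrete, here is a minimal hash ring with virtual nodes. This is an illustrative toy, not Slack's CHARM implementation:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=64):
        # Each server contributes `vnodes` points on the ring to smooth distribution
        self.ring = sorted(
            (self._hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

    def server_for(self, channel_id):
        # Walk clockwise to the first virtual node at or after hash(channel_id)
        idx = bisect.bisect(self.keys, self._hash(channel_id)) % len(self.keys)
        return self.ring[idx][1]
```

The key property: removing a server only remaps channels that lived on its virtual nodes; everything else stays put.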

What are the drawbacks of the consistent hashing approach?

  • If a server goes down, all channels mapped to it lose connectivity until reassignment
  • Adding/removing servers causes many channels to remap (connection churn)
  • Hot channels (10K+ members) still create hotspots
  • Complex to manage and monitor

Slack’s CHARM System

Slack uses Consistent Hash Ring Managers (CHARMs):

  • Each channel_id maps to exactly one Channel Server
  • 64-256 virtual nodes per physical server (smooths distribution)
  • When server fails, only ~1/N of channels need reassignment
  • Slack reports: Unhealthy server replacement completes in under 20 seconds

Limitations

  • A user in many channels may need to reach multiple servers, since each channel hashes independently
  • Server addition/removal forces connection migrations
  • Hot channels (10K+ members) still concentrate load on a single server

4.3 Solution 2: Pub/Sub Architecture (Preferred)

“The better solution decouples message routing from connection management using pub/sub.”

┌──────────────────────────────────────────────────────────────────────────┐
│                         PUB/SUB ARCHITECTURE                             │
└──────────────────────────────────────────────────────────────────────────┘

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  User A  │     │  User B  │     │  User C  │     │  User D  │
└────┬─────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘
     │ WS             │ WS             │ WS             │ WS
     ▼                ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│              GATEWAY SERVERS (WebSocket Connections)                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                   │
│  │  Gateway 1   │  │  Gateway 2   │  │  Gateway 3   │                   │
│  │ Users: A, B  │  │ Users: C     │  │ Users: D     │                   │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘                   │
└─────────┼─────────────────┼─────────────────┼───────────────────────────┘
          │                 │                 │
          │    Subscribe    │    Subscribe    │    Subscribe
          │    to topics    │    to topics    │    to topics
          ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        PUB/SUB SYSTEM                                   │
│                    (Redis Pub/Sub or Kafka)                             │
│                                                                         │
│   Topics: channel:123, channel:456, user:789:devices, ...               │
└─────────────────────────────────────────────────────────────────────────┘
          ▲
          │ Publish message to channel:123
          │
┌─────────┴───────┐
│  Message        │
│  Service        │
└─────────────────┘

Message Flow with Pub/Sub

  1. User A sends message to Channel 123 via HTTP POST
  2. API Server validates and stores message in database
  3. API Server publishes to pub/sub topic channel:123
  4. All Gateway Servers subscribed to channel:123 receive the message
  5. Each Gateway pushes message to connected users who are members of Channel 123
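The fanout above can be simulated in memory. The Broker class stands in for Redis Pub/Sub or Kafka; the gateway and message shapes are illustrative:

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for the pub/sub system."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of gateways

    def subscribe(self, topic, gateway):
        self.subscribers[topic].append(gateway)

    def publish(self, topic, message):
        # Every subscribed gateway receives the message (step 4)
        for gw in self.subscribers[topic]:
            gw.on_message(topic, message)

class Gateway:
    def __init__(self):
        self.local_users = {}  # user_id -> outbox (stands in for a WebSocket)

    def on_message(self, topic, message):
        # Step 5: push to locally connected users who are channel members
        for user_id, outbox in self.local_users.items():
            if user_id in message["members"]:
                outbox.append(message)
```

Note the membership filter runs at the gateway; in production each gateway would cache channel membership rather than carry it in every message.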

Pub/Sub Technology Choices

| Technology | Latency | Durability | Throughput | Best For |
|---|---|---|---|---|
| Redis Pub/Sub | Sub-ms | None (fire-and-forget) | High | Typing indicators, presence |
| Redis Streams | Sub-ms | Yes (persisted) | High | Messages with replay capability |
| Kafka | 10-100ms | Excellent | Very high | Audit logs, analytics, durability |
| NATS | 0.1-0.4ms | Optional | 3M+ msg/sec | Lightweight real-time delivery |

  • Discord uses: Redis Pub/Sub for real-time fanout
  • Slack uses: Custom internal pub/sub + Kafka for durability

Hybrid Approach (Production Pattern)

┌─────────────────────────────────────────────────────────────────────────┐
│                      HYBRID MESSAGE FLOW                                │
└─────────────────────────────────────────────────────────────────────────┘

                    Message from User A
                            │
                            ▼
                   ┌───────────────┐
                   │  API Server   │
                   └───────┬───────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
   │  Database   │  │ Redis Pub/  │  │   Kafka     │
   │  (persist)  │  │ Sub (fast)  │  │  (durable)  │
   └─────────────┘  └──────┬──────┘  └─────────────┘
                           │
                           ▼
                   Gateway Servers
                   (push to users)

Why both Redis and Kafka?

  • Redis for instant delivery (sub-millisecond)
  • Kafka for durability, replay, and downstream consumers (search indexing, analytics)

4.4 Stateless Gateway Design

“Gateway servers must be stateless for horizontal scaling and fault tolerance.”

What “Stateless” Means

Stateful (Bad):

# Gateway server stores user session in local memory
class GatewayServer:
    def __init__(self):
        self.sessions = {}  # user_id → session_data
        self.connections = {}  # user_id → websocket

    def on_connect(self, user_id, websocket):
        self.sessions[user_id] = load_user_data(user_id)
        self.connections[user_id] = websocket

If this server crashes, all session data is lost.

Stateless (Good):

# Gateway server stores session in Redis
class GatewayServer:
    def __init__(self):
        self.redis = Redis()
        self.connections = {}  # Local connection map only

    def on_connect(self, user_id, websocket):
        session = self.redis.get(f"session:{user_id}")  # External storage
        self.connections[user_id] = websocket
        # Use the session to resubscribe channels and resume cursors, then
        # register this gateway as handling this user
        self.redis.hset(f"user:{user_id}:gateways", self.server_id, time.time())

If server crashes, user reconnects to any gateway and resumes from Redis.

Session Data in Redis

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED SESSION STORAGE                          │
└─────────────────────────────────────────────────────────────────────────┘

Redis Keys:
  session:{user_id}
    → {workspace_id, subscribed_channels[], last_seen_seq:{channel_id → seq}}

  user:{user_id}:gateways
    → {gateway_1: timestamp, gateway_2: timestamp}  (multi-device)

  presence:{user_id}
    → {status: "online", last_heartbeat: timestamp}  TTL: 60 seconds

  channel:{channel_id}:subscribers
    → Set of user_ids currently subscribed

Reconnection Flow

1. User's connection drops (network issue, server crash)

2. Client detects via:
   - WebSocket onclose event
   - Missed heartbeat response (no pong for 60s)

3. Client initiates reconnection:
   - Uses exponential backoff with jitter
   - Connects to any available gateway (via load balancer)

4. New gateway:
   - Authenticates user
   - Retrieves session from Redis
   - Subscribes to user's channels
   - Fetches missed messages since last_delivered_seq

5. Client is back online with no message loss
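Step 3's "exponential backoff with jitter" can be sketched as follows. This is the full-jitter variant; the base and cap values are illustrative:

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=6, rng=random.random):
    """Delays (seconds) before each reconnect attempt.

    Full jitter: each delay is uniform in [0, min(cap, base * 2**attempt)),
    which spreads clients out so they don't reconnect in synchronized waves
    (the "thundering herd" after a gateway crash).
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays
```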

4.5 Database Scaling

“The database is often the hardest component to scale. Let me explain the strategies.”

Sharding Strategies

Option 1: Shard by workspace_id

Shard 1: workspaces 1-1000
Shard 2: workspaces 1001-2000
...

  • Pros: Complete tenant isolation, simple routing
  • Cons: Large workspaces become hotspots (Slack's #general channel problem)

Option 2: Shard by channel_id (Slack's current approach)

Shard = hash(workspace_id, channel_id) % num_shards

  • Pros: Spreads hot channels across shards
  • Cons: Cross-channel queries require scatter-gather

Option 3: Shard by (channel_id, time_bucket) (Discord's approach)

Partition Key: (channel_id, bucket)
Bucket: 10-day time window
Clustering Key: message_id (Snowflake)

  • Pros: Bounded partition size, efficient time-range queries
  • Cons: Complex composite queries
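Deriving the 10-day bucket from a Snowflake message_id can be sketched like this. The >> 22 shift and epoch constant follow Discord's published Snowflake layout; treat the helper itself as illustrative:

```python
DISCORD_EPOCH_MS = 1_420_070_400_000      # 2015-01-01 UTC, Discord's Snowflake epoch
BUCKET_MS = 10 * 24 * 60 * 60 * 1000      # 10-day window

def bucket_for(message_id):
    """Compute the time bucket half of the (channel_id, bucket) partition key.

    The top bits of a Snowflake encode milliseconds since the epoch,
    so the bucket is recoverable from the message_id alone.
    """
    timestamp_ms = (message_id >> 22) + DISCORD_EPOCH_MS
    return timestamp_ms // BUCKET_MS
```

Because the bucket is derivable from the ID, a time-range query knows exactly which partitions to read without a lookup table.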

Discord’s Scaling Journey

| Metric | Cassandra (Before) | ScyllaDB (After) |
|---|---|---|
| Nodes | 177 | 72 |
| P99 Read Latency | 40-125ms | 15ms |
| P99 Write Latency | 5-70ms (variable) | 5ms (stable) |
| Messages Stored | Trillions | Trillions |

Why ScyllaDB outperformed Cassandra:

  • C++ implementation eliminates JVM garbage collection pauses
  • Shard-per-core architecture for predictable performance
  • Same data model and query language (CQL compatible)

Slack’s Vitess Migration

Slack migrated from monolithic MySQL to Vitess (horizontal MySQL sharding):

  • 2.3 million queries per second at peak
  • 2ms median latency
  • Preserves ACID transactions within a shard
  • VTGate layer handles routing transparently

Database Choice Decision Framework

┌─────────────────────────────────────────────────────────────────────────┐
│                    DATABASE SELECTION GUIDE                             │
└─────────────────────────────────────────────────────────────────────────┘

                    Need ACID Transactions?
                            │
              ┌─────────────┴───────────┐
              │ YES                     │ NO
              ▼                         ▼
    ┌─────────────────┐       ┌─────────────────┐
    │ MySQL + Vitess  │       │ Need Low Latency│
    │ PostgreSQL      │       │ at Scale?       │
    └─────────────────┘       └────────┬────────┘
                                       │
                          ┌────────────┴────────────┐
                          │ YES                     │ NO
                          ▼                         ▼
                 ┌─────────────────┐      ┌─────────────────┐
                 │ ScyllaDB        │      │ Cassandra       │
                 │ (C++, no GC)    │      │ (JVM, mature)   │
                 └─────────────────┘      └─────────────────┘

Part 5: Message Delivery & Ordering (10 minutes)

5.1 Message Ordering Guarantees

“A critical question for chat systems: how do we ensure messages appear in the correct order?”

What Level of Ordering Do We Need?

| Ordering Level | Description | Difficulty | Needed? |
|---|---|---|---|
| Global ordering | All messages across all channels ordered | Very Hard | No |
| Per-channel ordering | Messages within a channel are ordered | Medium | Yes |
| Per-sender ordering | Messages from same sender are ordered | Easy | Yes |
| Causal ordering | Replies appear after original message | Hard | Nice to have |

Key insight: “Per-channel ordering is sufficient for chat. Users don’t care if Channel A’s message 5 came before or after Channel B’s message 10.”

Implementing Per-Channel Ordering

Approach 1: Database Atomic Counter

-- For each message insert:
BEGIN TRANSACTION;
  SELECT COALESCE(MAX(message_seq), 0) + 1 INTO @next_seq
  FROM message WHERE channel_id = ?
  FOR UPDATE;   -- lock the row range so concurrent writers serialize

  INSERT INTO message (channel_id, message_seq, ...)
  VALUES (?, @next_seq, ...);
COMMIT;
  • Pros: Simple, strongly consistent
  • Cons: Single point of contention for high-traffic channels

Approach 2: Snowflake IDs (Twitter/Discord)

Snowflake ID (64 bits):
┌─────────────────────────────────────────────────────────────────┐
│  41 bits: timestamp  │  10 bits: machine ID  │  12 bits: seq   │
│  (milliseconds)      │  (1024 machines)      │  (4096/ms)      │
└─────────────────────────────────────────────────────────────────┘

Example: 1234567890123456789
         └─── Encodes: time, which machine, sequence within that ms
  • Pros: Decentralized generation, time-sortable, globally unique
  • Cons: Clock skew can cause ordering issues, not strictly monotonic per channel
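A minimal generator matching the 41/10/12 layout above. The custom epoch here is an arbitrary assumption, and production generators also handle backwards clock drift, which this sketch ignores:

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC (illustrative custom epoch)

class Snowflake:
    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024   # fits in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence
                if self.seq == 0:                   # 4096/ms exhausted: wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            # 41 bits time | 10 bits machine | 12 bits sequence
            return (now << 22) | (self.machine_id << 12) | self.seq
```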

Approach 3: Hybrid (Recommended)

# Message has both:
message = {
    "message_id": generate_snowflake(),      # Global unique, time-sortable
    "message_seq": get_channel_sequence(),   # Per-channel strict ordering
    "channel_id": 123,
    "body": "Hello!"
}
  • Use Snowflake for global uniqueness and rough time ordering
  • Use per-channel sequence for strict ordering within channel
  • Sequence can be generated by channel service (single leader per channel)

Handling Simultaneous Messages

“What if two users send messages at the exact same millisecond?”

User A sends "Hello" ─────┐
                          ├───► Channel Server ───► Database
User B sends "Hi" ────────┘

Both arrive at same millisecond. Which is first?

Solution: Single leader per channel

The Channel Server (or database primary) serializes all writes to a channel. It assigns sequence numbers atomically:

import threading

class ChannelServer:
    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.next_seq = self.load_from_db()
        self.lock = threading.Lock()

    def assign_sequence(self):
        with self.lock:
            seq = self.next_seq
            self.next_seq += 1
            return seq

Trade-off: This creates a bottleneck for very active channels. Solutions:

  • Shard large channels by time bucket
  • Accept eventual consistency with client-side sorting

5.2 Delivery Guarantees

The Impossible Problem: Exactly-Once Delivery

“True exactly-once delivery is impossible in distributed systems. This is proven by the Two Generals Problem.”

Two Generals Problem:
  General A ───message───► General B
            (may be lost)

  General A ◄───ack────── General B
            (may be lost)

Neither general can ever be certain the other received the message.

Practical Solution: At-Least-Once Semantics + Idempotency

┌─────────────────────────────────────────────────────────────────────────┐
│              AT-LEAST-ONCE WITH DEDUPLICATION                           │
└─────────────────────────────────────────────────────────────────────────┘

1. Client sends message with idempotency_key
   {idempotency_key: "abc-123", text: "Hello!"}

2. Server checks if key exists in Redis:
   - If exists → return cached response (duplicate)
   - If not → process and cache response

3. Client doesn't receive response (network issue)

4. Client retries with SAME idempotency_key

5. Server recognizes duplicate, returns cached response

6. Message appears exactly once to users

Idempotency Key Implementation

import json

def send_message(request):
    key = f"idempotency:{request.idempotency_key}"

    # Check for duplicate
    existing = redis.get(key)
    if existing:
        return json.loads(existing)  # Return cached response

    # Try to acquire lock (prevent concurrent duplicates)
    lock_acquired = redis.set(key, "processing", nx=True, ex=30)
    if not lock_acquired:
        # Another request holds the lock; wait_for_result (helper, not shown)
        # polls the key until the cached response appears
        return wait_for_result(key)

    try:
        # Process the message
        message = store_message(request)
        response = {"message_id": message.id, "status": "sent"}

        # Cache response for 24 hours
        redis.setex(key, 86400, json.dumps(response))
        return response
    except Exception:
        redis.delete(key)  # Release lock on failure
        raise

Client-Side Deduplication

Even with server-side idempotency, clients should deduplicate:

class MessageHandler {
    constructor() {
        this.seenMessageIds = new Set();
    }

    onMessageReceived(message) {
        if (this.seenMessageIds.has(message.message_id)) {
            return;  // Already displayed this message
        }
        this.seenMessageIds.add(message.message_id);
        this.displayMessage(message);
    }
}

5.3 Offline Message Delivery

“When a user is offline, we need to ensure they receive messages when they come back online.”

Naive Approach: Per-User Inbox

-- Store undelivered messages per user
CREATE TABLE inbox (
    user_id BIGINT,
    message_id BIGINT,
    channel_id BIGINT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, message_id)
);

Problems:

  • Storage explodes for users in large channels
  • Need to delete after delivery (write amplification)
  • Fanout cost: 1 message to 10K members = 10K inbox rows
Better Approach: Per-Channel Delivery Cursors

-- Track delivery progress per user per channel
CREATE TABLE user_channel_cursor (
    workspace_id        BIGINT,
    user_id             BIGINT,
    channel_id          BIGINT,
    last_delivered_seq  BIGINT,  -- Last message user received
    last_read_seq       BIGINT,  -- Last message user saw
    PRIMARY KEY (workspace_id, user_id, channel_id)
);

On reconnect:

def handle_reconnect(user_id, channel_id):
    cursor = db.get_cursor(user_id, channel_id)

    # Fetch messages since last delivery
    missed_messages = db.query("""
        SELECT * FROM message
        WHERE channel_id = ? AND message_seq > ?
        ORDER BY message_seq ASC
        LIMIT 1000
    """, channel_id, cursor.last_delivered_seq)

    if not missed_messages:
        return  # Nothing was missed; cursor is already current

    # Send to client
    for msg in missed_messages:
        websocket.send(msg)

    # Update cursor
    db.update_cursor(user_id, channel_id,
                     last_delivered_seq=missed_messages[-1].message_seq)

Advantages:

  • No per-message-per-user storage
  • Messages stored once, cursors are tiny
  • Works for any channel size

Part 6: Presence & Multi-Device Sync (5 minutes)

6.1 Presence System (Online/Offline Status)

“The green dot showing someone is online is surprisingly complex at scale.”

Requirements

  • Show real-time online/offline status
  • Handle brief disconnections gracefully (don’t flicker)
  • Scale to millions of concurrent users
  • Tolerate false positives (better to show online when offline than vice versa)

Heartbeat-Based Presence

┌─────────────────────────────────────────────────────────────────────────┐
│                      PRESENCE SYSTEM                                    │
└─────────────────────────────────────────────────────────────────────────┘

Client ──── heartbeat ────► Gateway ──── update ────► Presence Service
                                                            │
                                                            ▼
                                                    Redis (TTL keys)
                                                    presence:{user_id}
                                                    TTL: 60 seconds

Flow:

  1. Client sends heartbeat every 30 seconds
  2. Gateway forwards to Presence Service
  3. Presence Service sets Redis key with 60 second TTL
  4. If no heartbeat received, key expires → user is offline
  5. Status changes published via pub/sub

Why the Asymmetry? (30s heartbeat, 60s TTL)

  • Network jitter: Heartbeats may be delayed by 5-10 seconds
  • Retry opportunity: Client has one retry window before marked offline
  • Graceful degradation: Brief network glitches don’t cause status flicker
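The heartbeat/TTL flow above can be sketched in a few lines. This is a minimal illustration, not production code: `FakeRedis` is an in-memory stand-in for a real Redis client, and the key format mirrors the diagram.

```python
import time

HEARTBEAT_INTERVAL = 30   # client sends a heartbeat every 30 s
PRESENCE_TTL = 60         # server-side key TTL (2x the heartbeat interval)

class FakeRedis:
    """In-memory stand-in for a Redis client (illustration only)."""
    def __init__(self):
        self._data = {}

    def set(self, key, value, ex):
        # Store the value with an absolute expiry time
        self._data[key] = (value, time.time() + ex)

    def exists(self, key):
        entry = self._data.get(key)
        return entry is not None and entry[1] > time.time()

def on_heartbeat(redis, user_id):
    # Refresh the TTL key; if no heartbeat arrives in time, the key
    # expires and the user transitions to offline
    redis.set(f"presence:{user_id}", "online", ex=PRESENCE_TTL)

def is_online(redis, user_id):
    return redis.exists(f"presence:{user_id}")

r = FakeRedis()
on_heartbeat(r, 42)
print(is_online(r, 42))   # True
print(is_online(r, 99))   # False -- never sent a heartbeat
```

With a real Redis client the same shape holds: `SET presence:{user_id} online EX 60` on every heartbeat, and key expiry does the offline transition for free.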

Actor Model for Presence

The actor model is particularly elegant for presence:

# Each user gets their own actor (Erlang/Elixir process)
defmodule UserPresenceActor do
  use GenServer

  def init(user_id) do
    # Set a timer for the offline transition
    timer = schedule_offline_check(60_000)  # 60 seconds
    {:ok, %{user_id: user_id, timer: timer, status: :online}}
  end

  def handle_cast(:heartbeat, state) do
    # Cancel the old timer, schedule a new one
    Process.cancel_timer(state.timer)
    new_timer = schedule_offline_check(60_000)
    {:noreply, %{state | timer: new_timer}}
  end

  def handle_info(:offline_check, state) do
    # No heartbeat received in time
    broadcast_status_change(state.user_id, :offline)
    {:stop, :normal, state}
  end

  defp schedule_offline_check(ms), do: Process.send_after(self(), :offline_check, ms)
end

Why actors work well:

  • Each user’s presence is independent
  • Millions of lightweight processes (2KB each in Erlang)
  • One actor crash doesn’t affect others
  • Timers are natural with self-messages

Presence Fanout Challenge

“When User A comes online, who needs to know?”

Naive approach: Notify all users in every channel User A belongs to.
Problem: A user in 100 channels with 100 members each = 10,000 notifications per status change.

Optimized approach:

  1. Only notify users who are currently viewing a channel with User A
  2. Client subscribes to presence updates for visible channel members only
  3. Lazy loading: fetch presence on-demand when opening a channel
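Step 3 above (lazy, on-demand presence) reduces to one batched read per channel open. A sketch, with `mget` stubbed out and the key names assumed:

```python
class StubRedis:
    """Stand-in exposing only the mget used below (illustration only)."""
    def __init__(self, data):
        self._data = data

    def mget(self, keys):
        return [self._data.get(k) for k in keys]

def fetch_visible_presence(redis, member_ids):
    # One batched read for the members currently on screen,
    # instead of subscribing to every member of every channel
    keys = [f"presence:{uid}" for uid in member_ids]
    values = redis.mget(keys)
    return {uid: (v is not None) for uid, v in zip(member_ids, values)}

r = StubRedis({"presence:1": "online", "presence:3": "online"})
print(fetch_visible_presence(r, [1, 2, 3]))  # {1: True, 2: False, 3: True}
```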

6.2 Multi-Device Synchronization

“A user might have Slack open on their laptop, phone, and tablet simultaneously.”

Challenges

  1. Message delivery to all devices
  2. Read status sync — Read on phone, laptop shows as read too
  3. Typing indicators — Don’t show “typing” from your own other device
  4. Notification suppression — If active on laptop, don’t buzz phone

Device Registration

Redis structure:
  user:{user_id}:devices
    → {
        device_1: {gateway_id: "gw-1", last_active: timestamp, platform: "ios"},
        device_2: {gateway_id: "gw-2", last_active: timestamp, platform: "web"}
      }
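Challenge 4 (notification suppression) can be decided straight from this registry. A sketch; the 2-minute activity window and the decision rule are assumptions for illustration, not Slack's actual logic:

```python
import time

ACTIVE_WINDOW = 120  # seconds; "recently active" threshold (assumption)

def should_push_to_mobile(devices, now=None):
    """Suppress mobile push when the user is active on a desktop/web device.
    `devices` mirrors the Redis structure above: {device_id: {...}}."""
    now = now if now is not None else time.time()
    for dev in devices.values():
        if dev["platform"] in ("web", "desktop") and now - dev["last_active"] < ACTIVE_WINDOW:
            return False  # user is at a computer; don't buzz the phone
    return True

devices = {
    "device_1": {"gateway_id": "gw-1", "last_active": time.time() - 10,   "platform": "web"},
    "device_2": {"gateway_id": "gw-2", "last_active": time.time() - 3600, "platform": "ios"},
}
print(should_push_to_mobile(devices))  # False -- active on web 10 s ago
```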

Read Status Sync

def mark_as_read(user_id, channel_id, message_seq):
    # Update cursor in database
    db.update("""
        UPDATE user_channel_cursor
        SET last_read_seq = ?
        WHERE user_id = ? AND channel_id = ?
    """, message_seq, user_id, channel_id)

    # Publish to user's other devices
    pubsub.publish(f"user:{user_id}:sync", {
        "type": "read_status",
        "channel_id": channel_id,
        "last_read_seq": message_seq
    })

Consistency model: Eventual consistency is acceptable for read status. A 1-2 second delay in syncing across devices is fine.


Part 7: Failure Handling & Reliability (5 minutes)

7.1 What Can Fail?

“In a distributed system, everything will fail eventually. Let me address each failure mode:”

| Component | Failure Mode | Impact | Mitigation |
|---|---|---|---|
| Client network | Disconnection | User can’t send/receive | Reconnection with catchup |
| Gateway server | Crash | Users disconnected | Stateless design, auto-reconnect |
| Message service | Crash | Can’t process messages | Multiple instances, queue buffering |
| Database | Primary failure | Writes fail | Automatic failover to replica |
| Pub/sub | Redis crash | No real-time delivery | Fallback to polling, Redis Cluster |
| Entire datacenter | Power outage | Full outage | Multi-region deployment |

7.2 Client Reconnection with Exponential Backoff

“When connection drops, clients must reconnect intelligently to avoid overwhelming servers.”

The Thundering Herd Problem

Scenario: AWS outage takes down all connections
          10 million clients try to reconnect simultaneously

Without backoff:
  Second 1: 10M connection attempts → servers crash
  Second 2: 10M more attempts → still crashed
  ...

With jittered exponential backoff:
  Second 1: ~100K attempts (some clients start)
  Second 2: ~200K attempts (more clients start)
  Second 10: Spread across time, servers can handle it

Implementation

import random
import time

def reconnect_with_backoff():
    base_delay = 1.0       # 1 second
    max_delay = 30.0       # 30 seconds max
    attempt = 0

    while True:
        # Exponential backoff, capped at max_delay
        delay = min(base_delay * (2 ** attempt), max_delay)

        # Add jitter (randomness) to spread out retries -- "full jitter"
        time.sleep(random.uniform(0, delay))

        try:
            connect()
            return  # Connected; a future disconnect starts over at attempt 0
        except ConnectionError:
            attempt += 1

AWS research shows: Full jitter reduces total retry attempts by over 50% compared to exponential backoff without jitter.

Network Change Detection

// When the network changes (WiFi → cellular), reset backoff
window.addEventListener('online', () => {
    reconnectAttempt = 0;  // Reset backoff
    reconnect();           // Try immediately
});

7.3 Gateway Server Failure

┌─────────────────────────────────────────────────────────────────────────┐
│                    GATEWAY FAILURE RECOVERY                             │
└─────────────────────────────────────────────────────────────────────────┘

BEFORE:
┌──────────┐     ┌──────────────┐
│  Client  │◄───►│  Gateway 1   │  ← Server crashes
└──────────┘     └──────┬───────┘
                        │
                 ┌──────▼──────┐
                 │    Redis    │  ← Session data safe here
                 │   (state)   │
                 └─────────────┘

AFTER:
┌──────────┐     ┌──────────────┐
│  Client  │◄───►│  Gateway 2   │  ← Reconnects to a different server
└──────────┘     └──────┬───────┘
                        │
                 ┌──────▼──────┐
                 │    Redis    │  ← Retrieves the same session
                 │   (state)   │
                 └─────────────┘

Client experience: brief disconnection, then seamless resume

7.4 Database Failure

Leader-Follower Replication

                 ┌─────────────┐
    Writes ────► │   Primary   │
                 │  (Leader)   │
                 └──────┬──────┘
                        │ Replication
           ┌────────────┼────────────┐
           ▼            ▼            ▼
    ┌──────────┐ ┌──────────┐ ┌──────────┐
    │ Replica  │ │ Replica  │ │ Replica  │
    │    1     │ │    2     │ │    3     │
    └──────────┘ └──────────┘ └──────────┘
           ▲            ▲            ▲
           └────────────┴────────────┘
                   Reads

When primary fails:

  1. Automated failover promotes Replica 1 to Primary
  2. Other replicas reconfigure to follow new primary
  3. Application reconnects to new primary
  4. Potential data loss: Last few writes may not have replicated

Slack’s AZ Drain Button

Slack built manual tooling for datacenter failures:

  • AZ drain button: Redirects all traffic away from failing availability zone
  • Execution time: Under 5 minutes
  • Mechanism: Envoy weighted clusters shift traffic 100% to healthy AZs

“Sometimes human judgment is better than automated failover for complex scenarios.”

7.5 Split-Brain Prevention

“Split-brain occurs when network partition makes two nodes think they’re both the primary.”

Network Partition:
                    ┌───────────────────────────────┐
    ┌──────────┐    │          PARTITION            │    ┌──────────┐
    │ Primary? │◄───┼──────────────X────────────────┼───►│ Primary? │
    │ Node A   │    │                               │    │ Node B   │
    └──────────┘    └───────────────────────────────┘    └──────────┘

Both nodes think: "I can't reach the other one, I must take over!"
Result: Two primaries accepting writes → data divergence

Prevention: Quorum-based consensus

Cluster of 3 nodes:
- Majority = 2 nodes
- If a partition splits 2-1, the side with 2 nodes can operate
- The side with 1 node knows it lacks a majority and refuses writes

┌──────────┐       ┌──────────┐          ┌────────────┐
│  Node A  │───────│  Node B  │    X     │   Node C   │
│ (active) │       │ (active) │          │ (isolated, │
│          │       │          │          │ read-only) │
└──────────┘       └──────────┘          └────────────┘
      Majority (2/3)                       No majority
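The majority rule itself is tiny; a sketch:

```python
def has_quorum(reachable_nodes, cluster_size):
    """A partition may accept writes only if it sees a strict majority."""
    return reachable_nodes >= cluster_size // 2 + 1

print(has_quorum(2, 3))  # True  -- the 2-node side keeps serving writes
print(has_quorum(1, 3))  # False -- the isolated node goes read-only
```

Note that a 4-node cluster still needs 3 reachable nodes for a majority, which is one reason odd cluster sizes are preferred.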

Part 8: Multi-Tenancy (3 minutes)

8.1 Enterprise Isolation Requirements

“Slack is a B2B product. Each company (workspace) expects complete isolation.”

Isolation Levels

| Level | Description | Complexity | Use Case |
|---|---|---|---|
| Logical | Same database, workspace_id in every query | Low | Small/free workspaces |
| Database | Separate database per workspace | Medium | Pro workspaces |
| Physical | Separate servers per workspace | High | Enterprise with compliance |

Implementation: workspace_id Everywhere

-- EVERY query must include workspace_id
SELECT * FROM message
WHERE workspace_id = ? AND channel_id = ?;

-- Primary keys include workspace_id
PRIMARY KEY (workspace_id, channel_id, message_seq)

Why?

  1. Sharding: workspace_id is natural partition key
  2. Security: Keys that require workspace_id make it far harder to accidentally query the wrong workspace
  3. Performance: Database can route to correct shard immediately
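To illustrate point 1, a workspace-to-shard router might simply hash the workspace_id; the shard count and hash function here are illustrative choices, not a specific production scheme:

```python
import hashlib

NUM_SHARDS = 64  # illustrative

def shard_for_workspace(workspace_id):
    # Non-cryptographic use of md5: we only need a stable, uniform hash.
    # All rows for a workspace land on one shard, so queries never fan out.
    digest = hashlib.md5(str(workspace_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shard = shard_for_workspace(12345)
print(0 <= shard < NUM_SHARDS)  # True -- and stable across calls
```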

Row-Level Security

-- PostgreSQL row-level security
CREATE POLICY workspace_isolation ON message
    USING (workspace_id = current_setting('app.workspace_id')::BIGINT);

-- Application sets context before queries
SET app.workspace_id = 123;
SELECT * FROM message WHERE channel_id = 456;
-- Automatically filters to workspace 123

Rate Limiting per Workspace

Prevent one workspace from monopolizing resources:

def rate_limit(workspace_id, operation):
    key = f"rate_limit:{workspace_id}:{operation}"
    current = redis.incr(key)
    if current == 1:
        redis.expire(key, 60)  # Start a 1-minute window on the first request

    limits = {
        "free": 1000,
        "pro": 10000,
        "enterprise": 100000
    }
    plan = get_workspace_plan(workspace_id)

    if current > limits[plan]:
        raise RateLimitExceeded()

Part 9: Deep Dive Topics (Interview Follow-ups)

9.1 What if the Pub/Sub Queue Goes Down?

Outbox Pattern:

def send_message(message):
    with db.transaction():
        # Write message AND outbox event in the same transaction
        db.insert("message", message)
        db.insert("outbox", {
            "event_type": "new_message",
            "payload": message,
            "processed": False
        })

    # If pub/sub is up, process immediately
    try:
        publish_to_pubsub(message)
        db.update("outbox", {"processed": True}, ...)
    except PubSubDown:
        pass  # Background worker will retry

# Background worker
def process_outbox():
    while True:
        events = db.query("SELECT * FROM outbox WHERE processed = FALSE")
        for event in events:
            try:
                publish_to_pubsub(event.payload)
                db.update("outbox", {"processed": True}, ...)
            except PubSubDown:
                time.sleep(1)  # Retry later

9.2 How to Handle Large Channels (10K+ members)?

Problem: Sending a message to 10K members = 10K fanout operations

Solutions:

  1. Lazy fanout: Only notify online members; others catch up on reconnect
  2. Tiered delivery: Push to first 1000 members, others poll
  3. Read-only for very large channels: Disable @channel mentions

9.3 Message Search Implementation

┌─────────────────────────────────────────────────────────────────────────┐
│                      SEARCH ARCHITECTURE                                │
└─────────────────────────────────────────────────────────────────────────┘

┌──────────┐      ┌─────────────┐      ┌─────────────────┐
│ Message  │─────►│    Kafka    │─────►│  Elasticsearch  │
│ Service  │      │   (stream)  │      │  (search index) │
└──────────┘      └─────────────┘      └────────▲────────┘
                                                │
                           Search queries ──────┘

  • Messages are primarily stored in Cassandra/MySQL
  • Kafka streams changes to Elasticsearch (near-realtime)
  • Search queries hit Elasticsearch
  • Indexing delay: 1-5 seconds (acceptable for search)

9.4 File Upload Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                   PRESIGNED URL UPLOAD FLOW                             │
└─────────────────────────────────────────────────────────────────────────┘

1. Client ──── POST /files/upload-url ────► API Server
                                                │
2.                                  Generate presigned URL
                                                │
3. Client ◄─── {upload_url, file_id} ──────────┘

4. Client ──── PUT [binary data] ────► S3 (direct upload)

5. Client ──── POST /messages {file_id} ────► API Server

6. API Server validates file_id exists in S3, creates message

Why presigned URLs?

  • Large files don’t go through API servers
  • Direct S3 upload is faster and cheaper
  • API servers stay lightweight
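The mechanics can be illustrated with a simplified HMAC scheme. This is NOT real AWS SigV4; the storage domain, key, and query format are made up to show the idea that the server signs what the client is allowed to do, and the storage service verifies the same signature:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SIGNING_KEY = b"server-side-secret"  # illustrative

def make_presigned_put_url(bucket, key, expires_in=900):
    # Sign method + bucket + key + expiry; the storage service recomputes
    # the same HMAC and rejects mismatched or expired URLs.
    expires = int(time.time()) + expires_in
    to_sign = f"PUT\n{bucket}\n{key}\n{expires}".encode()
    signature = hmac.new(SIGNING_KEY, to_sign, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"https://{bucket}.example-storage.com/{key}?{query}"

url = make_presigned_put_url("uploads", "avatar.png")
print("signature=" in url and "expires=" in url)  # True
```

In practice you would call your cloud SDK's presigned-URL helper rather than rolling your own signing.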

Part 10: Interview Strategy

10.1 Time Management (45-minute interview)

| Phase | Time | Focus |
|---|---|---|
| Clarify requirements | 5 min | Ask questions, establish scope |
| High-level design | 10 min | Draw architecture, explain components |
| Core deep dive | 20 min | Database, WebSocket, message delivery |
| Failure handling | 5 min | Proactively discuss failures |
| Wrap-up / Questions | 5 min | Extensions, tradeoffs |

10.2 What Interviewers Look For

Based on real OpenAI interview feedback:

  1. Database design (100% asked): You must draw schemas and explain sharding
  2. Fan-out problem (50% asked): How to send message to 10K members
  3. Multi-tenancy (50% asked): How to isolate company data
  4. Failure handling (50% asked): What if servers crash?

10.3 Common Mistakes to Avoid

| Mistake | Why It’s Bad | What to Do Instead |
|---|---|---|
| Being reactive | Waiting for the interviewer to ask about scale/failures | Proactively say “Let me discuss what happens when this fails…” |
| Forgetting multi-tenancy | Designing for a single company | Always include workspace_id from the start |
| Ignoring fan-out | Saying “just send to everyone” | Explain pub/sub, batching, lazy delivery |
| Vague database design | Just saying “use a database” | Draw tables, specify keys, explain indexes |
| No numbers | “A lot of messages” | “1 billion messages/day, 12K messages/second” |

10.4 Key Phrases to Use

✅ “Let me start with requirements to ensure I understand the scope…”
✅ “For a billion-user system, we need to consider…”
✅ “The tradeoff here is between consistency and availability…”
✅ “What happens when this component fails? Let me address that…”
✅ “Discord handles this by…, Slack’s approach is…, let me explain why I’d choose…”
✅ “At scale, this becomes a bottleneck, so we need to…”


Appendix: Quick Reference

Production Numbers to Cite

| Platform | Concurrent Users | Messages/Day | P99 Latency | Database |
|---|---|---|---|---|
| Discord | 15M+ | Trillions stored | 15ms read | ScyllaDB |
| Slack | 5M+ WebSocket | Millions | 500ms global | MySQL/Vitess |
| WhatsApp | 147M peak | 100B+ | Sub-second | Mnesia |

Technology Choices Cheat Sheet

| Component | Small Scale | Large Scale |
|---|---|---|
| Database | PostgreSQL | Vitess (MySQL) or ScyllaDB |
| Cache | Redis single | Redis Cluster |
| Pub/Sub | Redis Pub/Sub | Kafka + Redis |
| WebSocket | Node.js | Elixir/Erlang (BEAM) |
| Message Queue | RabbitMQ | Kafka |

Key Formulas

Connections per server:     100K (conservative) to 2M (optimized Elixir)
RAM per connection:         ~30 KB
Messages per second:        DAU × messages_per_user / 86400
Storage per day:            messages × avg_size × replication_factor
Fanout cost:                messages × avg_channel_size
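Plugging illustrative numbers into these formulas (all inputs below are assumptions for a back-of-envelope estimate):

```python
# Assumed inputs for a back-of-envelope estimate
dau = 10_000_000          # daily active users
msgs_per_user = 100       # messages per user per day
avg_size_bytes = 1_000    # average message size
replication = 3           # replication factor
avg_channel_size = 50     # average members per channel

msgs_per_day = dau * msgs_per_user                  # 1B messages/day
msgs_per_sec = msgs_per_day / 86_400                # average rate (peaks run higher)
storage_gb_per_day = msgs_per_day * avg_size_bytes * replication / 1e9
fanout_per_sec = msgs_per_sec * avg_channel_size

print(f"{msgs_per_sec:,.0f} msg/s, {storage_gb_per_day:,.0f} GB/day, "
      f"{fanout_per_sec:,.0f} fanout ops/s")
# 11,574 msg/s, 3,000 GB/day, 578,704 fanout ops/s
```

Remember to multiply the average rate by 2-3x for peak-hour sizing.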

This document consolidates material from ByteByteGo, Discord Engineering, Slack Engineering, WhatsApp architecture papers, and real interview experiences.