Design Slack

Problem statement

Design a large chat system for companies, similar to Slack. Notice to limit the scope and choose the most unique and critical components to discuss. Should avoid mention not critical part or the parts that you are not familiar with. Choose a coherent structure to describe. In each section, whenever you stuck, try to move forward without wasting a lot of time.

Requirements

Functional requirements

  1. create a channel (multiple or single recipient)
  2. send and receive messages in a channel
  3. receive message for offline user
  4. send multi-media files

Below the line

  1. private channel

Non-functional requirements

  1. low latency (<500ms)
  2. high throughput (1billion) scalability
  3. guarantee receive once delivery
  4. availability (99.9% or 99.99%)
  5. fault tolerance (?????)

Below the line

  1. security model (encryption, store the message in the server)

Core entities

  • message
  • channel (chat)
  • User(participants, sender, recipient)

API overview

  1. create_channel/[channel_id]
  2. add_participant/[participant_id]
  3. send_message/[message_id]
  4. receive_message/[channel_id]
  5. creat_file/ # create a blob data
  6. send_file/ # send the created blob data to the channel
  7. updat_participant/ # handle join or delete of a participant

Q: what’s the order of importance of the API? or what the structure we should keep in mind when list the API, because it could be a lot of functionalities, if you list too much, you’ll have no time to talk about in later design. If you miss the important one, you’ll look bad and interviewer might thought you don’t know. so, comeup a framework that can guide you to make the API and the design consistent.

Basic design

MVP

System Architecture

During interview, it’s better to split the service based on the functionality. For example, we can have separate server for chat server, channel server, message server, inbox server, etc. This is good for discussion about scalability and fault tolerance in the next step. To scale the system, you just need to duplicate these server for each service. When you need to scale the system, you can use the server box to represent the server cluster for each service. So you don’t have to draw too many boxes in the diagram.

Work flow

Send message (single server)

  1. User send message to load balancer, which routes the request a chat server.
  2. Chat server process the message and store the message to DB. The chat server then need to get the recipient list from channel server, chat server should have a map of participant id to participant connection. Here we can use WebSocket or gRPC stream to maintain the connection.
  3. Chat server push the message to each recipient user through the maintained connection.

Send message (multiple server)

To scale the service to billions of users, we need to have multiple chat server to handle the message sending and receiving. The challenge here is that the chat server which process the message may not maintain the connection for each recipient user. So how to deliver the message to each recipient user?

  1. User send message to load balancer, which routes the request a chat server.
  2. Chat server process the message and store the message to DB. The chat server then need find the server which maintain the connection for each recipient user. Here we can use consistent hashing to find the server for each recipient user.
  3. Chat server push the message to each recipient user through the maintained connection.

Better approach is to use a message queue to decouple the chat server and the message delivery. The chat server just need to push the message to the message queue, and each server which maintains the connection to recipients can subscribe to the message queue to forward the message to the users.

Offline message

  1. User send message to load balancer, which routes the request a chat server.
  2. Chat server process the message and store the message to DB. The chat server then need find the server which maintain the connection for each recipient user. Here we can use consistent hashing to find the server for each recipient user.
  3. If the recipient user is offline, the chat server store the message to inbox server for later delivery.
  4. When the user comes online, the chat server fetch the message from inbox server and push the message to the user.

Data model

channel_membership

  • channel_id: int64
  • participant_id: int64
  • created_at: timestamp

channel

  • channel_id
  • name
  • metadata
  • created_at

Inbox (only for not delivered messages)

  • user_id
  • message_id
  • timestamp # for msg ordering

message

  • message_id
  • message_text
  • created_at

Deep dive topics

  • handle multiple clients
  • ensure message order (DB system design)
  • handle simultaneous messages.
  • realtime status feature
  • sync read status across devices (consistent model)
  • multi-tenancy for multiple companies?
  • what if (websocket server, gateway, DB instance) die?
  • reconnect in poor network (or switch network (e.g. cellular to wifi))
  • other failure topics:
    • How clients reconnect (using exponential backoff).
    • How to get missed messages after a reconnect.
    • Database replication (keeping copies of data).
    • “Thundering herd” problem (everyone reconnecting at once).
    • Keeping WebSocket servers “stateless” so they are easy to add or remove.
  1. reconnect without thunder herd(exponential back-off)
    1. How does this exactly work in a real situation?
  2. stateless websocket connection (distributed session storage, can resume session on connection failure) gateway should tateless, only keep the connection_id to the user_id/device_id. Store session info in the distributed cache (Redis). If gateway dies, user reconnect to another box, can resume the session by reading the session info from cache, then e-subscribing/catch up the undelivered messages.
  3. presence/connection directory (backed by distributed key-value store such as Redis or similar) that store user_id → {gateway_id, connections, last_heartbeat}
  4. fanout messages
    1. consistent hashing, deterministic mapping from user_id → gateway_shard. requires rebalancing on auto scale or failure.
    2. pub/sub message queue (best solution)
  5. inbox/delivery tracking
    1. don’t store the undelivered message, use a cursor instead.
      1. user_channel_cursor(workspace_id, user_id, channel_id, last_delivered_seq, last_read_seq)
      2. on reconnect, client ask: give me messages from seq > last_delivered_seq
  6. messaging ID and ordering (per channel order is good enough)
    1. use UUID, ordering comes from message_seq, which is a per channel monotonic
      1. generate message_seq from DB atomic counter per channel (can be hot for huge channel)
      2. shard large channels: partition by “time bucket” an use (bucket, seq) ordering.
    2. use SnowflakeID (timestamp (41 bits) + machineID (10 bits) + sequence (12 bits))
  7. guarantee semantics (at-least-once delivery)
    1. idempotency key
  8. sync read status across devices
    1. store last_read_seq in user_channel_cursor
    2. device mark as read, publish `ReadUpdate(user_id, channel_id, last_read_seq)
    3. fanout to other device of same user to sync.
    4. eventual consistency is find, it’s tolerable
  9. presence/realtime status update
    1. gateways heartbeat to the presence service
    2. best-effort
      1. on connect, send a notification to the presence service, then send heartbeat.
      2. TLL-based status, if no heartbeat for x ms, mark offline
  10. multi-tenancy
    1. workspace_id is included in every db primary key
    2. shared cluster + workspace_id partitioning (logical)
    3. physical host isolation (encryption keys, stricter isolation)
    4. auth process need redirect to the right workspace
    5. encryption in transit always, encryption at rest with per-tenant keys for enterprise.
  11. What if the pub/sub queue is down?
    1. use outbox pattern
      1. write message and outbox event in same DB transaction
      2. background publisher drains outbox to queue

References