
System Design — AWS SNS

Amit Singh Rathore · The Startup · Dec 5, 2020

In this blog, we will try to build a miniature version of AWS Simple Notification Service.

  1. Problem Statement
  2. Requirement Gathering
  3. Building MVP
  4. Building for Resiliency and Availability
Figure: Generic design for a notification service

Problem Statement

A simple notification service allows users to publish messages to topics. A user subscribes to one or more topics. Whenever a publisher publishes a message to a topic, all subscribers of that topic receive it. Publishers and consumers are unaware of each other and do not communicate directly.

We are required to design an SNS service that clients all over the world can use to read and write messages.

Requirement Gathering

Users should be able to publish messages

Message order — the order of published messages must be maintained
Grouping by topic — messages must be grouped by topic

Users should be able to receive messages

Message order — subscribers should receive messages in the order they were published to the topic
Grouping by topic — subscribers should receive only messages from the topics they subscribe to

Message retention

For retry purposes, a message should be retained for 21 days if it has not been consumed. (AWS SNS keeps retrying for roughly 20 days if the consumer [Lambda or SQS] is not available.)

Scale & Performance

The application should be able to sustain uneven traffic distribution.
Peak traffic is 50M messages per day.
The average message size is 4 kB.
It should be able to process a considerably high volume of messages.
The solution should be available in multiple locations (Mumbai and Singapore), and the application should continue processing messages even if one DC fails.

Building An MVP

The above diagram shows the basic flow of the system. A publisher publishes message(s) to topic(s), and consumer(s) read and process them.

Actors:

  1. Publisher — one who publishes messages.
  2. Consumer — one who reads messages from a topic (identified by CONSUMER_ID).

APIs:

  1. createTopic(topic_name) → TOPIC_ID
  2. createSubscription(topic_id, consumer_id) → SUBSCRIPTION_ID
  3. publish(TOPIC_ID, message) → returns an ack on a successful write
  4. generateMessageId(TOPIC_ID) → MESSAGE_ID
  5. distribute(TOPIC_ID, CONSUMER_ID)
  6. remove(TOPIC_ID, MESSAGE_ID) → returns an ack on successful removal

Most of the APIs are self-explanatory. The Message Position Marker Service helps in deleting the correct message-id from the topic: instead of deleting the last message in the topic, it ensures that only the message-id that was last read is deleted. This preserves message ordering.

Note: For brevity, I have removed delete & unsubscribe APIs.
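
To make these APIs concrete, here is a minimal in-memory sketch of the MVP in Python. The names (MiniSNS, create_topic, distribute, the per-consumer position marker) are illustrative assumptions, not AWS's implementation; it only shows topic creation, ordered publishing, subscription-scoped delivery, and the position marker idea.

```python
import uuid
from collections import defaultdict


class MiniSNS:
    """Toy in-memory pub/sub broker mirroring the APIs above (illustrative only)."""

    def __init__(self):
        self.topics = {}                        # TOPIC_ID -> ordered list of (MESSAGE_ID, message)
        self.subscriptions = defaultdict(set)   # TOPIC_ID -> set of CONSUMER_IDs
        self.positions = defaultdict(int)       # (TOPIC_ID, CONSUMER_ID) -> next index to read

    def create_topic(self, topic_name):
        topic_id = f"topic-{topic_name}-{uuid.uuid4().hex[:8]}"
        self.topics[topic_id] = []
        return topic_id

    def create_subscription(self, topic_id, consumer_id):
        self.subscriptions[topic_id].add(consumer_id)
        return f"sub-{topic_id}-{consumer_id}"

    def generate_message_id(self, topic_id):
        # Monotonic per-topic sequence keeps ordering explicit.
        return len(self.topics[topic_id])

    def publish(self, topic_id, message):
        message_id = self.generate_message_id(topic_id)
        self.topics[topic_id].append((message_id, message))
        return {"ack": True, "message_id": message_id}

    def distribute(self, topic_id, consumer_id):
        # Only subscribed consumers receive messages (grouping by topic).
        if consumer_id not in self.subscriptions[topic_id]:
            return []
        # Deliver everything after the consumer's position marker, in publish order.
        key = (topic_id, consumer_id)
        start = self.positions[key]
        batch = self.topics[topic_id][start:]
        self.positions[key] = start + len(batch)    # message position marker update
        return batch


# Example usage
sns = MiniSNS()
topic = sns.create_topic("orders")
sns.create_subscription(topic, "consumer-1")
sns.publish(topic, "order #1 created")
print(sns.distribute(topic, "consumer-1"))          # [(0, 'order #1 created')]
```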

Storage Abstraction:

This layer can be a NoSQL database for fresh messages, with an object/disk storage-based solution for archival. Note that the metadata inside a message is different from the metadata service (coming later). This message-level metadata can be used to filter and route messages (the event filter pattern).
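
To illustrate the event filter pattern, here is a small sketch that matches a message's metadata against a subscription's filter policy. The policy format (attribute → list of allowed values) is an assumption for this example, loosely modeled on attribute-based filtering.

```python
def matches_filter(message_metadata: dict, filter_policy: dict) -> bool:
    """Return True if every attribute in the filter policy is satisfied by the
    message metadata (a simplified event filter)."""
    for attribute, allowed_values in filter_policy.items():
        if message_metadata.get(attribute) not in allowed_values:
            return False
    return True


# Example: this subscriber only wants high-priority "order_created" events.
policy = {"event_type": ["order_created"], "priority": ["high"]}
metadata = {"event_type": "order_created", "priority": "high", "region": "ap-south-1"}
assert matches_filter(metadata, policy)
```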

Adding Resiliency and Availability

Not all APIs/services will have the same usage pattern, so we need to be smart about the number of instances of each service. The below diagram tries to present an indicative service replication. Since one message is generally consumed by multiple applications, we should run more instances of the distribution service. Similarly, the message-id service is used both while writing and reading a message, so it should have higher replication than the publish and distribution services.

In the above diagram, we created multiple instances of the services in a single location; we will now replicate the whole setup in different locations. This adds resiliency and reduces latency. A publisher in Mumbai will hit the DC nearest to it and publish the message, while users in Singapore will hit their nearest DC to read it. This requires synchronization between the two DCs. We can run the distributed services with a consensus framework (Raft, Paxos) to maintain a common state across DCs.

Data replication between DCs needs to be monitored closely. Inter-region write latency is high, and that impacts the overall time between write initiation and acknowledgment of a successful write. Based on the latency requirement, we can choose asynchronous, synchronous, or hybrid replication.

In the case of asynchronous replication, the ack is returned as soon as the write completes in the nearest DC. This gives the lowest publish latency, but also lower data durability. Before data can be read from a remote location, it has to be replicated to that DC, so distribution has higher latency.

In the case of synchronous replication, we will have high latency while publishing and low latency while reading. It gives us high data durability, since data must be replicated to the remote DC before a successful publish is acknowledged.

The third approach is a middle path: replicate synchronously to, say, one remote location and let the remaining locations replicate lazily. Combined with the access pattern, this can work very well: we pick the DC that is used most heavily for consumption as the synchronous participant, and the remaining, less frequently used DCs are replicated lazily.
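
Here is a rough sketch of the hybrid approach. The write_local and replicate functions are placeholders for real storage and cross-DC replication calls; the point is only that the ack waits for the local write plus one synchronous remote copy, while the other DCs replicate lazily in the background.

```python
import threading


def write_local(message):
    # Placeholder for a durable write to the local DC's storage layer.
    print(f"local write: {message}")


def replicate(dc, message):
    # Placeholder for a cross-DC replication call (in reality an RPC or a stream).
    print(f"replicated to {dc}: {message}")


def publish_hybrid(message, sync_dc, async_dcs):
    """Hybrid replication: ack only after the local write plus one synchronous
    remote copy; the remaining DCs are replicated lazily in the background."""
    write_local(message)
    replicate(sync_dc, message)             # blocks: the heavily-consumed DC gets a synchronous copy
    for dc in async_dcs:                    # lazy replication to the less-used DCs
        threading.Thread(target=replicate, args=(dc, message)).start()
    return {"ack": True}


# Example: Mumbai is the local DC, Singapore takes the synchronous copy.
publish_hybrid({"id": 1, "body": "hello"}, sync_dc="singapore", async_dcs=["some-other-dc"])
```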

Storage calculation:

per-message size ≈ topic id (64 bits) + message id (64 bits) + metadata (256 bytes) + message body (4 kB) ≈ 4.3 kB
storage ≈ peak messages per day × per-message size × retention days
50M × ~4.3 kB × 21 ≈ 4.6 TB, i.e. on the order of 5 TB
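
The same arithmetic as a quick back-of-the-envelope script, using the numbers from the requirements above:

```python
# Back-of-the-envelope retention storage, using the requirement numbers above.
peak_messages_per_day = 50_000_000
topic_id_bytes = 8           # 64 bits
message_id_bytes = 8         # 64 bits
metadata_bytes = 256
body_bytes = 4 * 1024        # 4 kB average message
retention_days = 21

per_message_bytes = topic_id_bytes + message_id_bytes + metadata_bytes + body_bytes
total_bytes = peak_messages_per_day * per_message_bytes * retention_days
print(f"{total_bytes / 1e12:.1f} TB")        # ~4.6 TB, i.e. on the order of 5 TB
```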

Based on available hardware, we can come up with the number of machines needed to implement this architecture. If latency is a concern, we can go for SSD- or RAM-intensive instances; if we can tolerate some latency, we can go with HDD.

Network bandwidth calculation:

At peak we write roughly 250 GB per day (50M messages × ~4-5 kB per message including IDs and metadata). We need to find the peak write rate at any instant during the day. To simplify the calculation, I assume the instantaneous peak is 1.5 times the daily average. This gives us about 35 Mbps.

READ (consumption) bandwidth can be derived similarly. For simplicity, I have assumed READ is about five times the WRITE; actual numbers should be gathered in the requirements phase.

WRITE → 250 GB/day × 8 bits/byte ÷ 86,400 s × 1.5 ≈ 35 Mbps
READ → WRITE × 5 ≈ 175 Mbps
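
And the bandwidth figures, with the same 1.5× peak factor and the assumed read-to-write ratio:

```python
# Peak write/read bandwidth estimate.
bytes_per_day = 250e9          # ~250 GB written per day at peak (rounded up)
peak_factor = 1.5              # assume the instantaneous peak is 1.5x the daily average
read_to_write_ratio = 5        # assumption from the discussion above

write_mbps = bytes_per_day * 8 / 86_400 / 1e6 * peak_factor
read_mbps = write_mbps * read_to_write_ratio
print(f"write ≈ {write_mbps:.0f} Mbps, read ≈ {read_mbps:.0f} Mbps")   # ~35 Mbps and ~175 Mbps
```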

Latency calculation:

ID generation + write to the highest-latency DC + write to disk → publish latency
ID selection + max(remote DC read) + message position marker update → read latency
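
As a worked example, plugging in some assumed component latencies (all figures are illustrative, not measured):

```python
# Illustrative publish/read latency estimates; every component figure is an assumption.
id_step_ms = 2                               # ID generation (publish) or ID selection (read)
disk_write_ms = 5
position_marker_update_ms = 3
remote_dc_latency_ms = {"singapore": 60}     # highest-latency DC involved in the request

publish_latency = id_step_ms + max(remote_dc_latency_ms.values()) + disk_write_ms
read_latency = id_step_ms + max(remote_dc_latency_ms.values()) + position_marker_update_ms
print(publish_latency, read_latency)         # 67 ms publish, 65 ms read with these numbers
```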

Based on the above bandwidth requirement we need to select an instance that has the required network capability.

All the above APIs/services can be fronted by a fleet of FrontEnd services. This service will receive all requests and route them accordingly. Its functionality can include, but is not limited to, the following:

  • Request Validation, Deduplication (Caching recent Message-ID), Routing
  • Rate limiting, Usage metrics collection, Audit logging

The above API services will also talk to a metadata service to persist information such as topic attributes, subscription tuples, and user attributes.
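
Here is a minimal sketch of that front-end layer. The in-memory dedup cache and the fixed-window rate limiter are illustrative stand-ins; in practice both would be distributed components (e.g., a shared cache and a distributed rate limiter), and the backend here could be the MiniSNS sketch from earlier.

```python
import time
from collections import OrderedDict


class FrontEnd:
    """Illustrative front door: validate, deduplicate, rate-limit, then route."""

    def __init__(self, backend, max_rps=100, dedup_size=10_000):
        self.backend = backend                  # e.g. the MiniSNS sketch above
        self.recent_ids = OrderedDict()         # cache of recently seen client message IDs
        self.dedup_size = dedup_size
        self.max_rps = max_rps
        self.window_start = time.time()
        self.count = 0

    def _rate_limited(self):
        # Fixed one-second window for simplicity; real systems use token buckets.
        now = time.time()
        if now - self.window_start >= 1.0:
            self.window_start, self.count = now, 0
        self.count += 1
        return self.count > self.max_rps

    def handle_publish(self, topic_id, client_message_id, message):
        if not topic_id or message is None:                 # request validation
            return {"error": "invalid request"}
        if self._rate_limited():                            # rate limiting
            return {"error": "throttled"}
        if client_message_id in self.recent_ids:            # deduplication
            return {"ack": True, "duplicate": True}
        self.recent_ids[client_message_id] = True
        if len(self.recent_ids) > self.dedup_size:
            self.recent_ids.popitem(last=False)             # evict the oldest entry
        return self.backend.publish(topic_id, message)      # routing to the publish API
```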

With this exercise, we have designed a notification system.

Reference: https://sre.google/resources/practices-and-processes/distributed-pubsub/
