In almost every discussion of queues or message brokers, the same technologies come up: Apache Kafka, RabbitMQ, and AWS SQS. Interestingly, they solve very different problems and cannot be used interchangeably.
This blog walks through some of the most common scenarios these tools address. It’s broken down into five acts, each dealing with a specific problem.
The First Act…
Suppose you have a checkout page which needs to contact an inventory service
downstream before actually proceeding with the payment. Now, if there is some
latency in the inventory service, then the checkout process slows down. More requests
keep coming in and are waiting to be processed.
After a point, the checkout page goes unresponsive: requests start to time out, retries kick in, all while the inventory service takes its own sweet time to process things.
Therefore, one slow dependency can lead to cascading failures across upstream
services. It primarily arises from synchronous coupling of services.
Solution: Decoupling
However, if we add a queue to which all these requests are published, the requests wait there without slowing down the checkout process. A slow inventory service no longer affects the health of checkout.
So essentially, the producer is decoupled from the consumer of the events within the system (see the sketch after this list). This leads to three interesting observations:
- The producer and the consumer do not run at the same time
- If one service is down, it doesn’t take the other one down with it
- The slowest service does not choke the faster service
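To make this concrete, here is a minimal sketch of the idea in Python, using an in-memory queue.Queue as a stand-in for a real broker; the order shape, the 2-second delay, and the single worker are illustrative assumptions, not a production setup.

```python
import queue
import threading
import time

# In-memory queue standing in for a real broker
orders = queue.Queue()

def checkout(order_id):
    # The producer only enqueues and returns immediately;
    # it never waits on the slow inventory service.
    orders.put({"order_id": order_id})
    print(f"checkout: accepted order {order_id}")

def inventory_worker():
    # The consumer drains the queue at its own pace.
    while True:
        order = orders.get()
        time.sleep(2)  # simulate a slow inventory service
        print(f"inventory: processed order {order['order_id']}")
        orders.task_done()

threading.Thread(target=inventory_worker, daemon=True).start()

for i in range(5):
    checkout(i)   # all five return instantly, regardless of inventory speed
orders.join()     # wait for the backlog to drain
```

Checkout stays responsive no matter how slow the worker is; the backlog accumulates in the queue instead of in open HTTP connections.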
The Second Act…
Suppose you built a system to send marketing emails to 10,000 users. The idea is that all the targeted users of the product get the marketing email. As a data engineer, you drop in the mailing list and the content to be sent to each one and call it a Friday.
It’s Monday, and your boss is fuming! Each user received 50 copies of the same email. Your domain has been blacklisted and spam filters are blocking you. What exactly happened?
Turns out, your system had 50 workers which were supposed to send emails to the users in parallel. The problem is that instead of distributing the mailing list across them round-robin, your system fanned out each email to every single worker.
Solution: Work-Queue vs Pub/Sub
The critical question that could have saved your job is:
Should this event be handled by one? Or should this event be handled by many?
A work-queue is ideal when you have multiple workers and each task should be handled by exactly one of them. A pub/sub system makes sense when a single event needs to go to multiple systems for processing.
For example:
- Transcoding a video to a different resolution involves a payload like
{
"video_url": "mycoolcdn.com/pipeline/video-1",
"result": "720p",
"save_at": "mycoolcdn.com/destination/*"
}
This can be picked up by any worker and should be handled only once.
- An order placed on Swiggy/Doordash means
1. Notify the restaurant
2. Find a driver
3. Notify the customer
When a single event triggers multiple actions from different sub-systems, you need a fan-out system. Each system gets a copy of the event to handle independently. A toy version of both patterns is sketched below.
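To see the two shapes side by side, here is a toy sketch in Python; the names (work_queue, subscribers, publish) are made up for the example, and a real system would use a broker instead of in-process queues.

```python
import queue

# Work-queue: one shared queue with competing consumers.
# Each message is popped by exactly one worker.
work_queue = queue.Queue()
work_queue.put({"video_url": "mycoolcdn.com/pipeline/video-1"})
job = work_queue.get()  # whichever worker calls get() first owns this job

# Pub/sub: every subscriber gets its own queue, and
# each event is copied to all of them (fan-out).
subscribers = {name: queue.Queue() for name in ("restaurant", "driver", "customer")}

def publish(event):
    for q in subscribers.values():
        q.put(dict(event))  # each subsystem receives its own copy

publish({"order_id": 42, "type": "order_placed"})
```

The email disaster above was a work-queue problem wired up as pub/sub: every worker was a subscriber, so every worker got a copy.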
Delivery Guarantees (where subtle bugs arise)
- At-most-once is great for use-cases like readings and metrics, where a dropped data point is not very consequential.
- At-least-once is what most systems use by default. The sender keeps retrying until it gets an acknowledgement. While you don’t lose events, you do end up delivering duplicates. It is the job of the consumer on the other end to process each one only once, aka idempotency.
- Exactly-once delivery across a network is not achievable in general, whether you use Kafka, RabbitMQ, SQS, Pulsar, Redis Streams or anything else!
The reason why it is impossible in general is because of the Two Generals Problem.
In simple terms, let’s say:
- A producer sends a message, and the consumer sends an acknowledgement (ACK)
- The ACK gets lost over the network, so now the producer is in a dilemma; it can either:
  - Retry and cause a duplicate
  - Drop the message and potentially lose it

The network doesn’t tell you which situation you are in!
So, if a system claims to have exactly-once delivery, what it really means is:
at-least-once delivery + idempotent processing = effectively exactly-once processing
Within Kafka, there is an exactly-once guarantee, but only within the cluster, where deduplication mechanisms exist. Once the event has stepped out of the Kafka cluster, you are on your own: roll your own idempotency solution.
Hence, assume at-least-once delivery and build idempotent consumers; a minimal sketch follows the list below. You avoid a whole class of bugs like:
- Duplicate customer emails
- Inventory drift due to double addition/subtraction of items
- Double charges on payments
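Here is what an idempotent consumer could look like, assuming each message carries a unique message_id; charge_customer is a hypothetical stand-in for your real side effect, and in production the processed set would live in a database or cache, not in memory.

```python
processed = set()  # in production: a DB table or Redis set, ideally with a TTL

def charge_customer(order_id, amount):
    # Hypothetical stand-in for the real payment call
    print(f"charging {amount} for order {order_id}")

def handle(message):
    # At-least-once delivery means this may run twice for the
    # same message; the ID check turns the second delivery
    # into a harmless no-op instead of a double charge.
    if message["message_id"] in processed:
        return
    charge_customer(message["order_id"], message["amount"])
    processed.add(message["message_id"])

handle({"message_id": "m-1", "order_id": 42, "amount": 19.99})
handle({"message_id": "m-1", "order_id": 42, "amount": 19.99})  # duplicate, ignored
```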
The Third Act…
Let’s say a malformed message was dropped into the queue; call it a poison message. When it comes up to the consumer for processing, the consumer throws an error and retries. The problem is that retrying is useless here: newer messages pile up behind it while the consumer stays stuck.
One poison message can take down a pipeline handling 1M messages/min.
Solution: Dead-Letter Queue (DLQ)
After a fixed number of retries, the message is moved to a different queue while the main pipeline resumes healthy operations. The poisoned message is available for auditing and troubleshooting but it is not a pipeline choker any longer.
However, your DLQ quickly becomes a graveyard of messages. Once those messages are fixed, there should be some tooling to push them back to the main queue. If you forgot to write this, your options are:
- Write a one-off script to move to the main queue
- Replay each message’s payload by hand
- Accept your fate and forget the messages :(
Most underprepared teams don’t design the replay path through which messages can move from the dead-letter queue back to the primary queue. That defeats the whole point of the DLQ.
That being said, things to consider when building the replay path are (sketched below):
- Cap the number of retries with max_retries
- Alert on the dead_letter_queue_size
- Attach metadata: {error, timestamp, retries_count}
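Here is one way those three pieces could fit together, sketched in Python; MAX_RETRIES, the queue names, and the metadata shape are illustrative choices, not a prescription.

```python
import queue
import time

main_queue = queue.Queue()
dead_letter_queue = queue.Queue()
MAX_RETRIES = 3  # cap retries so a poison message cannot loop forever

def consume(message, handler):
    try:
        handler(message)
    except Exception as exc:
        message["retries_count"] = message.get("retries_count", 0) + 1
        if message["retries_count"] >= MAX_RETRIES:
            # Park it with enough metadata to debug and replay later
            message["error"] = str(exc)
            message["timestamp"] = time.time()
            dead_letter_queue.put(message)
        else:
            main_queue.put(message)  # retry through the main pipeline

def replay_dlq():
    # The replay path everyone forgets: drain the DLQ back into the main queue
    while not dead_letter_queue.empty():
        message = dead_letter_queue.get()
        message["retries_count"] = 0  # give fixed messages a fresh start
        main_queue.put(message)
```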
The Fourth Act…
You get alerted at 2 AM, and your queue is out of memory. What happened? Everything was designed perfectly so far. Turns out your producers outpaced your consumers and the queue filled up!
Let’s do some napkin math here to gain perspective:
- Producers drop 10,000 messages/s during peak load
- Consumers process 2,000 messages/s at max capacity
- That leaves a delta of 8,000 messages/s at peak load
- Within an hour, 8,000 messages/s × 3600 s leads to 28.8 million messages on disk
An alternate version of the same problem is when processing delays become enormous due to massive queue sizes. Take, for example, an Aadhaar verification taking hours the first time around.
Solution: Backpressure
When the producer is outpacing the consumer, the consumer needs to inform the producer to slow down. There are a few ways to achieve this:
- Bounded Queues: Cap the queue size, after which the producer fails fast. Very simple to implement; errors surface early and loud alarms go off everywhere. (See the sketch after this list.)
- Auto-Scaling: If the queue_size crosses a certain threshold, spin up more consumers and have them join the pool. This works when the consumer is stateless and traffic is unpredictable.
- Credit-Based Flow Control: The consumer tells the producer that it is ready to handle only X messages. The producer sends that many messages and waits for the next credit message. This is the idea behind Reactive Streams: nothing moves without downstream’s permission.
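The bounded-queue variant is the easiest to show in code. A sketch in Python, where the cap of 10,000 and the shed-load behavior are illustrative choices:

```python
import queue

# The size cap itself is the backpressure mechanism
jobs = queue.Queue(maxsize=10_000)

def produce(message):
    try:
        jobs.put_nowait(message)  # fail fast instead of buffering forever
    except queue.Full:
        # Surface the overload immediately: shed the request,
        # return an error to the caller, fire an alert.
        raise RuntimeError("queue full, shedding load")
```

Failing loudly at 10,000 messages is almost always better than paging someone at 2 AM with 28.8 million.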
The Final Act…
So the question to answer is: Kafka, RabbitMQ, or SQS?
RabbitMQ: Routing Control
In RabbitMQ, messages are published to an exchange. This is a router which
connects to multiple different queues and is the deciding authority on which
queues a message must go to. The routing rules are configurable.
routing_key = orders.eu.high_volume

# Exact match
queue_1: orders.eu.high_volume
# Pattern match: * is a single-word wildcard
queue_2: orders.eu.*
# Pattern match: # matches zero or more words
queue_3: orders.#
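For reference, bindings like those above could be declared with the pika client roughly like this; the broker address, exchange, and queue names are assumptions for the example.

```python
import pika

# Assumes a RabbitMQ broker running locally
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# A topic exchange routes on routing-key patterns
ch.exchange_declare(exchange="orders", exchange_type="topic")
ch.queue_declare(queue="eu_orders")
ch.queue_bind(queue="eu_orders", exchange="orders",
              routing_key="orders.eu.*")  # single-word wildcard

ch.basic_publish(exchange="orders",
                 routing_key="orders.eu.high_volume",
                 body=b'{"order_id": 42}')
conn.close()
```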
Once the consumer ACKs the message, it is gone. It does not stay
in the system anymore. You would typically reach for RabbitMQ when:
- There are different worker pools for background jobs
- Message shape decides destination
- Per-message control matters more than raw throughput
Apache Kafka: Append-Only Log
In Kafka, messages are written to an append-only log. They stay there based on your retention policy. Consumers attach to this log and track their own position:
consumer-1: offset=0 # beginning
consumer-2: offset=9 # end, caught up with all messages
consumer-3: offset=5 # midway
This means that any consumer can rewind at any time. You get to replay old events because Kafka persists them; the log is the history. Teams usually reach for Kafka when they want:
- Analytics over every event produced
- A data warehouse for history and auditing
- Search indexing on new items added
A new consumer can join at any time and jump to an offset based on a timestamp, as the sketch below shows. Additionally, Kafka is built for throughput: it can handle 1 million msg/s.
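A sketch of that rewind with the kafka-python client; the topic, group, partition, and timestamp are assumptions for the example.

```python
from kafka import KafkaConsumer, TopicPartition

# Assumes a local broker with an "orders" topic
consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="analytics",
                         enable_auto_commit=False)
tp = TopicPartition("orders", 0)
consumer.assign([tp])

# Find the first offset at or after a point in time (epoch millis)...
ts_ms = 1_700_000_000_000  # illustrative timestamp
offsets = consumer.offsets_for_times({tp: ts_ms})

# ...and rewind to it: the log is still there, so old events replay
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)

for record in consumer:
    print(record.offset, record.value)
```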
The caveat is that there is serious operational weight to running Kafka clusters,
and a Kafka Admin needs a very different resume.
Recently, Kafka has made some updates (“Queues for Kafka”, KIP-932) so it can act as a simple queue as well.
AWS - Simple Queue Service (SQS): Zero-Ops Overhead
SQS is a fully managed service that comes down to three API calls (shown with boto3 after this list):
- create-queue
- send-message
- receive-message
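Those three calls map almost one-to-one onto boto3; the queue name and region below are assumptions.

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# create-queue
queue_url = sqs.create_queue(QueueName="checkout-events")["QueueUrl"]

# send-message
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# receive-message (long polling for up to 10 seconds)
resp = sqs.receive_message(QueueUrl=queue_url,
                           MaxNumberOfMessages=1,
                           WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # Deleting the message is SQS's version of an ACK
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```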
There is no operational overhead; AWS manages it all for you. It comes in two flavors based on your workload:
- Throughput-first (Standard queues): At-least-once delivery and effectively unlimited throughput, but messages can go out of order.
- Order-first (FIFO queues): Exactly-once processing with strict FIFO ordering of messages. Throughput can scale from 300 TPS up to 70K TPS.
But as we know, exactly-once processing is a myth! What actually happens is that FIFO queues deduplicate messages sent within a 5-minute deduplication interval, and your consumer gets a visibility-timeout window within which it is supposed to finish processing and delete the message, SQS’s version of an ACK. Otherwise, SQS makes the message visible again and another consumer picks it up, so you still need idempotent processing on your side.
Now, by itself, SQS is a queuing service and does not do fan-out. To achieve that, you usually pair it with other services like AWS SNS, EventBridge, or Kinesis.
Teams use SQS when they want zero-ops burden.
Conclusion
Picking the right queue for the job is essential, but answering the fundamental questions about your data flow is even more important. Some teams shoot themselves in the foot by reaching for Kafka from day one because it powers the biggest data-flow diagrams at Netflix and the rest of big tech.
Don’t make every engineer on the team go through a Kafka tutorial. The tool gets more complicated than the problem it is solving. Pick the simplest tool, and if you have a real reason, migrate later.
Don’t be afraid of migrations!
Organic growth over architecture madness.