Saturday, July 22, 2017

Queues vs Distributed Logs (tech pill)

TL;DR: A queue is a good choice when you have one kind of job that you can divide into smaller independent jobs that can execute in any order. A distributed log is a good choice when you have several kinds of jobs or functionalities over the same stream of data (logs, events, etc.).

Queues vs Distributed Logs

This blog post explains the typical uses of queues and of distributed logs. Of course, a real system often uses these solutions in combination or in other ways, but I will summarize the usual usage because I sometimes see confusion about how these interesting solutions can be used.

Queue

Use a Queue when:

  • You want to distribute workload between workers for the same feature. 
  • Each message represents an independent job and doesn't require knowledge about the other messages in the queue (a minimal worker sketch follows this list).
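
A minimal worker sketch, assuming an SQS queue and boto3 (the queue URL is hypothetical): each message is an independent job, and a message that is not deleted after processing reappears after the visibility timeout, which is exactly the re-queue fault tolerance mentioned in the pros below.

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical

    def process(job):
        print("processing", job)  # placeholder for the real work

    while True:
        # Long polling: wait up to 20 seconds for new messages.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            process(json.loads(msg["Body"]))
            # Delete only after success: if the worker crashes first, the
            # message becomes visible again and another worker retries it.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])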

Pros:

  • Easy to implement.
  • Easy, almost unlimited horizontal scalability.
  • Fault tolerance is very easy to implement (timeout and re-queue of the message).
  • Allows an easy trade-off between latency (time in the queue + processing time) and the cost of the concurrent workers.

Cons:

  • The order is not guaranteed.
  • Each message can be consumed by only one worker for a single feature (several features need several queues).
  • Usually, we can have duplicate deliveries, so processing should be idempotent (see the sketch below).
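
Because deliveries are at-least-once, the usual mitigation is idempotent processing. A minimal sketch (the job_id field is an assumed deduplication key; in production the seen-ids set would live in a database or cache, not in memory):

    processed_ids = set()  # in production: a database table or Redis set

    def do_work(message):
        print("working on", message)  # placeholder for the real job

    def handle(message):
        job_id = message["job_id"]  # assumed unique id inside the payload
        if job_id in processed_ids:
            return  # duplicate delivery: already done, safe to skip
        do_work(message)
        processed_ids.add(job_id)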

Examples

  • Asynchronous processing of email requests.
  • Processing of independent batch jobs.
  • A supermarket queue :)

Alternatives and related variations

  • Queues with guaranteed order (e.g., SQS FIFO; see the sketch after this list).
  • TTL for messages in the queue.
  • Queues with a maximum number of queued messages.
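
For example, sending to an SQS FIFO queue looks like this sketch (the queue URL is hypothetical; FIFO queue names must end in ".fifo"): messages sharing a MessageGroupId are delivered in order, and MessageDeduplicationId suppresses exact duplicates.

    import boto3

    sqs = boto3.client("sqs")

    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/jobs.fifo",
        MessageBody='{"job": 42}',
        MessageGroupId="customer-1",      # order is guaranteed within a group
        MessageDeduplicationId="job-42",  # duplicates with the same id are dropped
    )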

Implementations:

  • AWS SQS, RabbitMQ

Distributed Log

Use a Distributed Log when:

  • You want to execute several functionalities over the same stream of ordered data (log aggregation, statistics calculation, event sourcing, storing streams of information, etc.).
  • You want the data corresponding to the same partition key to be processed in order by the same worker instance (see the producer sketch after this list).
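
For instance, with Kafka and the kafka-python client (topic and broker names are illustrative), producing with a key guarantees that all messages with the same key land in the same partition, so one consumer instance sees them in order:

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Every message with key b"pk1" is hashed to the same partition,
    # so a single consumer instance processes them in produced order.
    producer.send("events", key=b"pk1", value=b'{"event": "user_signed_up"}')
    producer.flush()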

Pros:

  • Allows several functionalities over the same data. Each functionality keeps track of its own offset.
  • If two or more functionalities are at the same offset, they can be sure they have seen the same messages in the same order.
  • A guaranteed order, at least at the partition-key level, allows designing simple solutions for statistics calculation, event sourcing, near real-time computation, etc.
  • You can add new functionalities that use/process the same data without any change to the existing system (Open/Closed principle: open for extension but closed for modification); the consumer sketch below shows the idea.
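
A sketch of that idea with kafka-python (names are illustrative): two functionalities read the whole stream independently just by using different consumer group ids, each group committing its own offset.

    from kafka import KafkaConsumer

    # Each group_id has its own committed offset, so both functionalities
    # independently receive every message of the "events" topic.
    stats = KafkaConsumer("events",
                          bootstrap_servers="localhost:9092",
                          group_id="statistics",
                          auto_offset_reset="earliest")
    audit = KafkaConsumer("events",
                          bootstrap_servers="localhost:9092",
                          group_id="audit-log",
                          auto_offset_reset="earliest")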

Cons:

  • Usually, the order of the messages is preserved only within each partition key.
  • Scalability requires sharding/partitioning.
  • More difficult to use and operate than queues.
  • It is difficult to keep shards/partitions balanced (hot-partition problem).

Examples

  • Log aggregation / distribution / statistics calculation from the logs.
  • Event streaming to allow event sourcing.

Implementations:

  • AWS Kinesis, Kafka

Practical scalability example:

In this example we have the following:

  • One logical stream
  • Divided into two shards
  • Each partition key (pk) will always be associated with a shard or partition
  • We have one function that runs in two concurrent processes, so the processing speed is x2. Each process always reads events/messages for the same partition keys.

In this example, the first process reads events/messages for partition key 1 and for partition key 2.

If shard s1 has too many events/messages per second, we should scale out by splitting shard s1 into two sub-shards.

Now our speed is x3, and we have one process for each partition key.
As we can see, the scalability is not trivial: we have to detect when a shard/partition should be split, and then rebalance the shards/partitions between the running processes. The toy sketch below illustrates how partition keys map to shards and why splitting forces a rebalance.
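
A toy sketch of the key-to-shard mapping (plain Python, no client library; the modulo hashing is an assumption for illustration, e.g. Kinesis actually splits hash-key ranges): the mapping is deterministic per key, but changing the number of shards moves keys around, which is why the running processes must be rebalanced.

    import hashlib

    def shard_for(partition_key: str, num_shards: int) -> int:
        # Deterministic: the same key always maps to the same shard.
        digest = hashlib.sha1(partition_key.encode()).hexdigest()
        return int(digest, 16) % num_shards

    keys = ["pk1", "pk2", "pk3"]
    print([(k, shard_for(k, 2)) for k in keys])  # mapping with 2 shards
    # After splitting we have 3 shards: some keys now map to a different
    # shard, so ownership must be rebalanced between the processes.
    print([(k, shard_for(k, 3)) for k in keys])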

