eferro's random stuff: August 2017

Continuing with my process of recovering my previous reading habit (books-i-have-read-in-last-12-months http://www.eferro.net/search/label/books) these are the books that I read lately:

"Nonviolent Communication: A Language of Life" Marshall B. Rosenberg Review
"The Power of Habit: Why We Do What We Do in Life and Business" Charles Duhigg
"Los 7 Habitos de la Gente Altamente Efectiva" Stephen R. Covey
"The Toyota Way" Jeffrey Liker
"Algorithms to Live By: The Computer Science of Human Decisions" Brian Christian, Tom Griffiths
"El Arte de Vivir: Meditación Vipassana tal y como la enseña S.N. Goenka [The Art of Living]" William Hart
"The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life" Mark Manson
"Tribes: We Need You to Lead Us" Seth Godin
"The Five Dysfunctions of a Team: A Leadership Fable" Patrick M. Lencioni Review
"The Happiness Hypothesis: Finding Modern Truth in Ancient Wisdom" Jonathan Haidt
"Team of Teams: New Rules of Engagement for a Complex World" General Stanley McChrystal, Tantum Collins, David Silverman, Chris Fussell Review
"Focus: A Simplicity Manifesto in the Age of Distraction" Leo Babauta
"The Sales Bible: The Ultimate Sales Resource" Jeffrey Gitomer
"The Ideal Team Player: How to Recognize and Cultivate the Three Essential Virtues: A Leadership Fable" Patrick M. Lencioni
"Ego Is the Enemy" Ryan Holiday
"Barking up the Wrong Tree: The Surprising Science Behind Why Everything You Know About Success Is (Mostly) Wrong" Eric Barker
"Scrum" Jeff Sutherland, JJ Sutherland

Read in progress (RIP) :)

"The Lean Product Playbook: How to Innovate with Minimum Viable Products and Rapid Customer Feedback" Dan Olsen
"This is Lean: Resolving the Efficiency Paradox" Niklas Modig, Par Ahlstrom
"Implementation Patterns" Kent Beck (2º reading)
"The real startup Book" v0.3

Friday, August 11, 2017

Scalable design pattern sample (tech pill)

As a complement to the previous blog post Queues vs Distributed logs (tech pill) this blog post describes how to solve a concrete business problem using a distributed log and a queue.... I used this design several times in production with good results.

Any feedback or improvements of the design will be more than welcomed :)

The problem:

We have to implement a business process with the following characteristics:

The business process (Job1) can be divided into three different sequential phases (S1, S2, S3). For example, we can think about the generation and mailing of all the invoices for all the customer of one account manager.
The second phase (S2) can be divided in several (hundreds or thousands) individual and independent sub-jobs (S2.1, S2.2, ...). This sub-jobs can be executed/processed in any order. For example, generate and email one invoice. Each sub-job require one minute of process time.
The complete process should complete in less than 90 minutes without being affected by the number of sub-jobs of the second step (up to 10k sub-jobs).

We also need to:

Notify when each step starts and when each step finished.
Generate some statistics of each the process.
Access to the detailed status of the process at any moment.

The solution should have the following characteristics:

We can have several numbers of concurrent business processes of this type for each customer.
For the sub-jobs at step 2:

We have retries for the sub-job execution.
We can balance the time of the process and the cost.
We have horizontal scalability.

The Design:

After analyzing the requirements and make a fast web storming session we consider the following events:

If we need to include more information about the detailed state of the process we can also signal the starting point of each step using a StepXStarted event. But these Starting Events are not required because we already know when a step started (just when the previous step finished).

Implementation

As we want to implement several functionalities for the same events and we want to have each one completely separated, we can use a distributed log / stream that allows us to design a simple solution to manage the workflow, calculate statistics, compute a detailed status.
The distributed log / stream and the corresponding services can scale using as partition/sharding key the job id or the customer id (assuming that each customer can generate several jobs at the same time).

The Step2 require additional design.
It can be subdivided in several individual jobs (S2.2, S2.2, ...), this jobs can be processed in any order and in parallel so we can have a queue for dispatching this subjobs to a pool workers that can be scaled out if needed.
We can balance the number of workers (and the corresponding cost) with the duration of the Step2 that we want. Each worker get a job description from the queue, execute the job, and include the result in an event (Step2SubjobCompleted) published in the distributed log / stream.

The Workflow Manager should implement a response for each event and generate corresponding events to make the job progress.

The responses for each event are:

Job1Requested:

Execute Step1
Publish Step1Completed

Step1Completed:

Split the Step2 in several subjobs (S2.1, S2.2, ...)
Store the identifiers of the jobs created (S2.1, S2.2, ...)
Send the subjobs to the queue of S2 subjobs

Step2SubjobCompleted:

Mark the id of the job as no longer pending
Validate if we have jobs pending
if there is no more jobs pending, publish Step2Completed

Step2Completed:

Execute Step3
Publish Step3Completed

Step3Completed:

Do any garbage recollection needed
Publish Job1Completed

Job1Completed:

Nothing, Everything is already done :)

For the same events, other functionalities as Statistics, reports or notifications, will define other reactions/implementation to process each event... Compute and store statistics, generate emails or push notification, create a view with detailed information on the status of the job, etc...

Notes and complementary designs:

The queue allows duplication so the workflow should be prepared to receive several events Step2SubjobCompleted for the same subjob.
We can have retries for the queue jobs, so we should include a mechanism to detect when we should stop waiting for Step2SubjobCompleted. For example, we can use a periodic tick event and use this event to decide if we should continue with the next step (for example marking a subjob as an error).
Is also possible to continue receiving Step2SubjobCompleted even if we are at Step3, the easy solution is to ignore this messages.
If we select the Job id as the sharding/partition key for the distributed log / stream we can easily scale out the number of workflow processors. We only need to add more stream shards and the corresponding new processes.
For the fault tolerance of the workflow we can store the events that we already processed and recover in case of failure from this events.

Related references (very recommended):

Distributed Sagas: A Protocol for Coordinating Microservices Caitie McCaffrey
Confusion about Saga pattern Roman Liutikov
Applying the Saga Pattern Caitie McCaffrey

Monday, August 07, 2017

Some good talks from the GOTO 2017 conference

From the GOTO 2017 conference, these are the talks that I liked:

Debugging Under Fire: Keep your Head when Systems have Lost their Mind Bryan Cantrill
Idée Fixe David Nolen
Managing Manager‐less Processes Fred George
Patterns of Effective Teams Dan North
The Many Meanings of Event-Driven Architecture Martin Fowler

Páginas

Monday, August 21, 2017

Good talks/podcasts (Purging the queue III)

Friday, August 18, 2017

Good talks/podcasts (Purging the queue II)

Tuesday, August 15, 2017

Good talks/podcasts (Purging the queue I)

Sunday, August 13, 2017

Books I have read lately and Reads In Progress (RIP)