Any feedback or improvements of the design will be more than welcomed :)
The problem:We have to implement a business process with the following characteristics:
- The business process (Job1) can be divided into three different sequential phases (S1, S2, S3). For example, we can think about the generation and mailing of all the invoices for all the customer of one account manager.
- The second phase (S2) can be divided in several (hundreds or thousands) individual and independent sub-jobs (S2.1, S2.2, ...). This sub-jobs can be executed/processed in any order. For example, generate and email one invoice. Each sub-job require one minute of process time.
- The complete process should complete in less than 90 minutes without being affected by the number of sub-jobs of the second step (up to 10k sub-jobs).
We also need to:
- Notify when each step starts and when each step finished.
- Generate some statistics of each the process.
- Access to the detailed status of the process at any moment.
- We can have several numbers of concurrent business processes of this type for each customer.
- For the sub-jobs at step 2:
- We have retries for the sub-job execution.
- We can balance the time of the process and the cost.
- We have horizontal scalability.
The Design:After analyzing the requirements and make a fast web storming session we consider the following events:
If we need to include more information about the detailed state of the process we can also signal the starting point of each step using a StepXStarted event. But these Starting Events are not required because we already know when a step started (just when the previous step finished).
As we want to implement several functionalities for the same events and we want to have each one completely separated, we can use a distributed log / stream that allows us to design a simple solution to manage the workflow, calculate statistics, compute a detailed status.
The distributed log / stream and the corresponding services can scale using as partition/sharding key the job id or the customer id (assuming that each customer can generate several jobs at the same time).
The Step2 require additional design.
It can be subdivided in several individual jobs (S2.2, S2.2, ...), this jobs can be processed in any order and in parallel so we can have a queue for dispatching this subjobs to a pool workers that can be scaled out if needed.
We can balance the number of workers (and the corresponding cost) with the duration of the Step2 that we want. Each worker get a job description from the queue, execute the job, and include the result in an event (Step2SubjobCompleted) published in the distributed log / stream.
The Workflow Manager should implement a response for each event and generate corresponding events to make the job progress.
The responses for each event are:
- Execute Step1
- Publish Step1Completed
- Split the Step2 in several subjobs (S2.1, S2.2, ...)
- Store the identifiers of the jobs created (S2.1, S2.2, ...)
- Send the subjobs to the queue of S2 subjobs
- Mark the id of the job as no longer pending
- Validate if we have jobs pending
- if there is no more jobs pending, publish Step2Completed
- Execute Step3
- Publish Step3Completed
- Do any garbage recollection needed
- Publish Job1Completed
- Nothing, Everything is already done :)
Notes and complementary designs:
- The queue allows duplication so the workflow should be prepared to receive several events Step2SubjobCompleted for the same subjob.
- We can have retries for the queue jobs, so we should include a mechanism to detect when we should stop waiting for Step2SubjobCompleted. For example, we can use a periodic tick event and use this event to decide if we should continue with the next step (for example marking a subjob as an error).
- Is also possible to continue receiving Step2SubjobCompleted even if we are at Step3, the easy solution is to ignore this messages.
- If we select the Job id as the sharding/partition key for the distributed log / stream we can easily scale out the number of workflow processors. We only need to add more stream shards and the corresponding new processes.
- For the fault tolerance of the workflow we can store the events that we already processed and recover in case of failure from this events.
Related references (very recommended):
- Distributed Sagas: A Protocol for Coordinating Microservices Caitie McCaffrey
- Confusion about Saga pattern Roman Liutikov￼
- Applying the Saga Pattern Caitie McCaffrey