This is the second part of the series on building with quality in Lean Software Development. If you haven't read the first part, where we explain the fundamentals of Lean Software Development, you can find it here: https://www.eferro.net/2025/04/lean-software-development-building-with.html. Having seen why quality is not just the final result, in this installment we focus on how to detect errors as early as possible, stop the flow when they appear, and learn from them to improve the system.
To avoid confusion, we will use "error" to refer to any deviation from the expected result, and "defect" for errors that impact the customer in production.
Detecting errors as early as possible
In more traditional approaches, defects are often prioritized based on their criticality, which sometimes determines whether they are fixed and within what timeframe. However, in Lean Software Development, which considers quality a fundamental part of the product and focuses on the continuous improvement of processes and systems, it is more common to classify errors (potential defects) based on where and when they were identified within the process.
Having this information allows us to identify the stages where errors are most common, helping us focus our improvement efforts on detecting them as early as possible (shift left), reducing the likelihood of them becoming defects.
In my experience, it is very useful to classify error detection depending on the stage where they are identified. I usually use the following classification:
- Local machine (pair or ensemble work)
- Development cycle (including the TDD cycle, continuous style checks (linting), type checks, etc.)
- Pre-commit
- Pre-push
- CI Pipeline:
  - Checks
  - Tests (ordered from fastest to slowest)
- Deployment (including smoke tests and validations during rollout)
- Production environment:
  - Pre-release (deployed but not activated for the client)
  - Client activation
When a feature is already activated for the client, it is also useful to classify errors or defects based on who or what detected them:
- Automatic system, before it impacts the client (Monitoring)
- Internal user, such as a team member running tests or someone from another department
- End user, reporting the defect to support
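As an illustration, the two classifications above can be captured in a small script that tallies where errors were caught, making it easy to spot the stages that need shift-left attention. The stage names and the error log below are hypothetical, not taken from any real project:

```python
from collections import Counter

# Hypothetical detection stages, ordered from earliest (cheapest) to latest (most costly).
STAGES = [
    "local-machine", "dev-cycle", "pre-commit", "pre-push",
    "ci-checks", "ci-tests", "deployment", "pre-release",
    "client-activation",
]

def tally_by_stage(errors):
    """Count errors per detection stage, preserving the shift-left order."""
    counts = Counter(e["stage"] for e in errors)
    return [(stage, counts.get(stage, 0)) for stage in STAGES]

# Illustrative error log; in practice this would come from your tracker.
errors = [
    {"id": 1, "stage": "dev-cycle", "detected_by": "automatic"},
    {"id": 2, "stage": "ci-tests", "detected_by": "automatic"},
    {"id": 3, "stage": "ci-tests", "detected_by": "automatic"},
    {"id": 4, "stage": "client-activation", "detected_by": "end-user"},
]

for stage, count in tally_by_stage(errors):
    if count:
        print(f"{stage}: {count}")
```

A tally like this, reviewed periodically, shows at a glance whether most errors are caught early (cheap) or late (expensive), which is exactly the signal needed to decide where to invest in earlier detection.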
Regardless of the stage or who detected it, the main goal is always the same: detect (and fix) the error as early as possible, ideally before it is identified by an end user in production (when it is already considered a defect).
Lean Software Development accepts that we make mistakes continuously and understands that the cost (waste generated) increases the later the error is detected and fixed.
To illustrate how this progressive error detection is structured and visualized, I will show two real examples of pipelines we use. In both cases, the various steps (checks, tests, publishing, deployment, production validations, rollback, etc.) are organized to easily detect any error as soon as possible, stop the process, and fix it. This visualization not only helps structure the workflow better but also ensures that the entire team clearly understands at what stage each type of error can appear.
In this first pipeline, each component (webapp, API, event processor…) has its own checks, unit, integration, and acceptance tests, as well as differentiated publishing and deployment processes for different environments (dev and mgmt). Additionally, end-to-end tests are automated in production before activating changes, and a rollback logic is included if something fails. This structure reinforces the principle of automatically stopping the flow when errors occur and facilitates traceability at each stage.
In this second example, more focused on structural validations and testing specific technologies (Argo Workflows, in this case), additional phases stand out: static checks, cleanup tasks before publishing the image to ECR, and integration tests with different configurations. This type of pipeline shows how even auxiliary tasks, like configuration validation or environment cleanup, are an integral part of an approach that seeks to detect errors before they hurt.
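The stop-the-line behavior both pipelines share can be sketched as follows: steps run in order from fastest to slowest feedback, and the first failure halts everything downstream. The step names and commands here are placeholders, not our actual pipeline definition:

```python
import subprocess

# Hypothetical pipeline steps, ordered from fastest to slowest feedback.
# The "true" commands are placeholders; a real pipeline would run linters,
# test suites, image publishing, deployments, etc.
PIPELINE = [
    ("lint",        ["true"]),
    ("unit-tests",  ["true"]),
    ("integration", ["true"]),
    ("deploy-dev",  ["true"]),
]

def run_pipeline(steps):
    """Run steps in order; stop at the first failure (stop-the-line)."""
    for name, cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"STOP: step '{name}' failed; fix before continuing")
            return False
    return True
```

Ordering steps by feedback speed means the cheapest checks get the first chance to stop the line, so most errors never reach the slower, more expensive stages.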
Stop and fix policy
Jidoka, also known as "autonomation" or "automation with a human touch," is a key principle of Lean Software Development. It’s not just about automating processes but doing so in a way that they automatically stop when a problem is detected, allowing teams to investigate and fix the root cause before continuing. Applying the Jidoka concept, teams working with Lean Software Development design development processes that make it very easy to detect errors—either automatically in most cases, or manually thanks to a development process that facilitates identifying those errors.
For continuous quality improvement to work, we not only need to detect these errors; it is also crucial to stop immediately (see Andon: https://en.wikipedia.org/wiki/Andon_(manufacturing)) and to have a policy that forces us to prioritize their immediate resolution. This way of working may seem radical at first and might give the impression that it slows the team down. My experience is quite the opposite. If you adopt a working approach where, upon detecting an error, you analyze it, learn from it, and fix it at the root (for example, by adding a test to prevent it from happening again), you soon achieve a process that resolves errors as early as possible. This eliminates a lot of rework and potential problems for the end customer, while generating great confidence within the team to move quickly, take risks, and experiment.
In fact, I believe it's the best way to move fast sustainably, and the DORA State of DevOps reports and the book Accelerate confirm that the best way to be fast is to build with quality, at least when it comes to product development.
In my case, this application of the Jidoka approach is reflected in:
- Automatic tests that, upon detecting a failure, temporarily interrupt the development flow to prevent the error from propagating.
- Git hooks (pre-commit, pre-push, etc.) that interrupt the flow if an attempt is made to push code with errors.
- Working with trunk-based development, a strategy where all developers integrate their changes into a single main branch. In this setup, all validations run on the main branch, and when a test fails in continuous integration we stop and fix it immediately. This discipline is crucial: any failure blocks the integration of new changes, so the main branch stays stable and always ready for deployment, which lets us move fast with confidence.
- Automatic prioritization in the workflow for resolving bugs detected in production, following an incident management process with postmortems for production incidents. This automatic prioritization is based on the severity of the error and its impact on the customer, determining which bugs are addressed first. (https://www.eferro.net/2024/12/using-blameless-incident-management-to.html).
- Pair programming or ensemble programming, where multiple people work together on the same task, allowing misunderstandings or potential errors to be detected from the start. This intense collaboration acts as a continuous review that prevents many errors, both in understanding the problem and in implementing the solution.
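The git-hook bullet above can be sketched as a minimal pre-push hook, here written in Python; the script would live at `.git/hooks/pre-push` and be executable. The checks are placeholders, not our actual commands:

```python
#!/usr/bin/env python3
"""Minimal pre-push hook sketch: block the push if fast checks fail."""
import subprocess

# Hypothetical fast checks; the "true" commands are placeholders for
# e.g. a lint command and a fast unit-test command.
CHECKS = [
    ["true"],
    ["true"],
]

def main() -> int:
    """Return 0 to allow the push; non-zero makes git abort it."""
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            print(f"pre-push: check failed ({' '.join(cmd)}); push aborted")
            return 1
    return 0

# A real hook file would end with:
#   import sys; sys.exit(main())
print("exit code:", main())
```

The key design point is that git honors the hook's exit code: any non-zero result aborts the push, so broken code never leaves the developer's machine, which is the Jidoka idea applied at the earliest possible gate.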
Dealing with defects and learning from them
Accepting that we are going to make mistakes—and that some will become defects—is a fundamental part of the Lean approach. Instead of denying or hiding it, we embrace it as something natural. We work in complex environments, with multiple dependencies, constant uncertainty, and, moreover, we are human. We are fallible by definition.
That does not mean we don't try to avoid errors or defects. On the contrary, we put a lot of effort into preventing them with techniques like poka-yoke, automated testing, pair programming, evolutionary design, and many other practices that are part of our daily work. Even so, we know they will happen. And since we know it, we prepare to minimize their impact and recover as quickly as possible.
This shift in mindset is key: we move from an obsession with avoiding mistakes at all costs to a more robust and sustainable strategy based on fast recovery (resilience) and learning capability. Because when a defect reaches production, the first objective is to restore service as quickly as possible. And immediately after, to learn.
Over the past years, in several teams I've worked with, we've refined and applied a blameless incident management approach. The idea is simple: when an incident occurs, we don't look for someone to blame. We focus on understanding what happened, how the system contributed to the error, and what we can do to prevent it from happening again or reduce its impact next time.
This type of approach, simple as it may seem, has had a huge impact on team culture. It brings psychological safety, builds trust, promotes transparency, and encourages people to make problems visible without fear. At TheMotion, Nextail, and ClarityAI, we used it not only to manage incidents but also as a major lever to evolve the culture toward one that is more collaborative, learning-oriented, and focused on continuous improvement.
For example, in a recent incident where a service failed, we applied the 5 Whys technique and discovered that the initial problem (a configuration error) had triggered a cascade of events due to the lack of error handling in another service. This led us to add more robust integration tests and improve the resilience of the second service.
Our blameless incident management process relies on several principles:
- Stay calm. Don’t panic. We even value this ability during interviews as a sign of professional maturity.
- Assign an Incident Commander to coordinate the response and ensure no one is left alone firefighting.
- Restore the service as soon as possible. Sometimes this means disabling a feature, communicating with customers, or applying a temporary mitigation. The important thing is to stabilize the system.
- Analyze what happened in depth without seeking a single “root cause.” We understand that incidents usually stem from a combination of causes and circumstances. We use techniques like the 5 Whys, asking “why?” starting from the first visible symptom. Doing this in a group allows us to uncover the various factors that contributed to the incident. Often, we find flaws in the process, assumptions, communication, or even data interpretation.
- Define corrective and preventive actions that not only avoid the problem but also reduce future recovery time and increase system resilience.
- Integrate these actions into the normal workflow so they don't just remain on paper.
- Use a blameless incident report, public within the company and collaborative, as the basis for collective learning. This continuous analysis and learning process is an example of Kaizen—continuous improvement applied to incident management, where we constantly seek ways to improve our processes and prevent future errors.
These reports include summaries, timelines, causes, actions, and learnings. Sharing them openly reinforces the message: errors are not hidden, they are learned from. And when the whole team internalizes this, the organization improves faster.
In the end, the message is clear: incidents are inevitable, but how we respond to them truly defines our culture. We can hide them, blame, and move on... or we can use them as catalysts for improvement and continuous learning. In Lean Software Development, we choose the latter.
In the next article...
Once we have established how to detect and respond quickly to errors, it is crucial to build a solid foundation. In the third part, we will delve into internal quality as the basis for sustainable development. We will see why less is more, how simplicity and well-thought-out design accelerate development, and how a good technical foundation allows us to move fast without breaking things.