Thursday, January 13, 2022

Fighting complexity: let's celebrate removals & simplifications

If you don’t actively fight for simplicity in software, complexity will win. …and it will suck. - @HenrikJoreteg

It is ubiquitous in our profession to celebrate adding new features or capabilities, but it is far less common to celebrate removing components or simplifying the system. The problem is that we are using the wrong metaphor. We usually talk about “building” or “making” new features, but this “building” metaphor makes us neglect all the ongoing work attached to every new element we create (feature, capability, component, etc.) (see Basal Cost of software).

With my teams, I prefer to talk about the new capabilities we enable, the changes in our users' behavior, and the amount of complexity we manage and maintain. I try to convey that controlling and reducing complexity is as essential as developing new capabilities and features.

When I joined the Clarity AI Platform team, there was a problem of too much toil. This toil was generated by the lack of self-service capabilities for the stream-aligned teams, by the considerable number of components the team managed (k8s clusters, MongoDB clusters, an in-house monitoring platform, etc.), and by the amount of accidental complexity in the infrastructure.

It came as no surprise to me. Clarity AI was a fast-growing startup that, during its first years, was in a hurry to achieve product-market fit. This meant that controlling complexity was not the priority in the early days (striking that balance would have been much more difficult).

My first step was to quantify and classify the work/tasks and the source of each task. With this information and our context (a startup funded by VCs), we, the Platform team, determined that it did not make sense (in our case) to manage all that infrastructure ourselves. Therefore, we decided to use managed services whenever possible and simplify the infrastructure.
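
As a minimal sketch of that first step, assuming each task is logged with a source label and an effort estimate, the tally could look like this. The field names, the function name, and the sample data are hypothetical and only illustrate the idea; they are not our actual tooling or figures.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    source: str   # hypothetical label for the component that generated the work
    hours: float  # hypothetical effort estimate


def hours_by_source(tasks):
    """Sum effort per source and return (source, hours) pairs, largest first."""
    totals = Counter()
    for task in tasks:
        totals[task.source] += task.hours
    return totals.most_common()


# Invented sample data, for illustration only.
sample = [
    Task("self-managed k8s", 6.0),
    Task("mongodb cluster", 3.5),
    Task("self-managed k8s", 4.0),
    Task("in-house monitoring", 2.0),
]

ranking = hours_by_source(sample)
for source, hours in ranking:
    print(f"{source}: {hours:.1f}h")
```

Sorting by accumulated effort makes the biggest sources of toil visible first, which is the kind of information that supports a decision such as moving a component to a managed service.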

Let's celebrate simplification and removal

During this last year and a half, we developed several new capabilities and reduced the system's complexity by simplifying solutions, removing non-essential components, and migrating solutions to managed services.

Migrate to managed services

  • We removed all code and toil related to self-managed Kubernetes clusters. We migrated all of our kops-managed k8s clusters to EKS. This change allowed us to upgrade our clusters easily and remove tons of obsolete code and tooling we used to maintain them. It also enabled other simplifications, such as using EKS managed node groups. Overall, it drastically reduced the team's toil (less security patching and upgrading, less code to maintain, etc.) and allowed us to make faster changes to our Kubernetes infrastructure.
  • Use of EKS managed node groups. We migrated (practically) all k8s node groups to managed node groups, which removed all the trouble of patching and maintaining the nodes' OS. This change improved our security posture and allowed us to remove these machines' configuration code.
  • We replaced Prometheus, Alertmanager, and the metrics server. To provide a more complete monitoring solution, we integrated our systems with Datadog. With this managed service, we got a more effective and easier-to-use monitoring solution and could remove our ad-hoc internal one, along with the corresponding services and code. The most significant benefit is that we no longer have to maintain, update, patch, and manage all those components, since Datadog already provides that functionality.
  • We migrated the self-managed MongoDB cluster to MongoDB Atlas. This change allowed us to remove several EC2 instances and the code associated with managing the clusters, and to minimize the toil of database operations. We also saved a lot of development cost, since we were in the middle of improving security (enabling encryption at rest) and building the tools needed to scale vertically and horizontally without losing service. With the managed service, all of these features are available without any development.

Remove anything unused

  • We removed several S3 buckets and EC2 machines. After two months of talking with many people across the company, we identified several S3 buckets and EC2 machines without clear usage. Once we had removed the residual use, we could delete the buckets and some of the machines. This change reduced our monthly AWS costs by $400-$500 and improved our security, as the EC2 machines did not follow the same security rules as the rest of our infrastructure.
  • We deleted an abandoned tool. I consider this change a personal victory, and it only cost us one year of talking and convincing a lot of people :) It was a small application, mainly used in demos, that did not have a clear owner and was not properly maintained. Removing this application also allowed us to remove the repository, the database, the deployment artifacts, and some ad-hoc AWS resources. This change reduced our monthly AWS bill by $150 and removed a security attack vector. Furthermore, we saved a lot of development cost: the framework and the database used by the application had become obsolete, so if we hadn't removed it, we would have had to update the framework, the database, and the DB driver. As its original developers no longer worked for the company, this would not have been easy.
  • We removed commands from our platform Slack bot. Last year, we created a Slack bot that allows our users to self-service some platform-related operations. We try to follow a modern product development flow, including product discovery. Still, some commands seem interesting during the discovery phase but end up being rarely used. In these cases, we removed the commands from the code base, knowing that we can recover the code as a starting point to recreate the command or a similar one in the future. This simplification reduced the application's maintenance and evolution costs, accelerating future development.
  • We removed the platform CLI PoC. Right now, our platform bot is used via Slack commands. Some months ago, we built a proof of concept of a local command-line interface to interact with the bot. We saw that it wasn’t useful enough yet (due to the type of commands available), so we removed the corresponding code to avoid the complexity until we detect a clearer opportunity to release a command-line interface for our users. Once the PoC had taught us what we needed, eliminating the code, as in other cases, reduced the cost of maintenance and future evolution.
  • We removed Atlantis for Terraform changes. Atlantis allowed us to automate the process of generating and approving Pull Requests for the Terraform code. In the past, we used this approach to let the stream-aligned teams create ECR repositories in a self-service manner. In reality, this process was not working well: there were conflicts between different PRs, issues with the changes due to a lack of knowledge about Terraform and infrastructure, etc. In the end, we developed a Slack command to create ECR repositories and deleted the support for Atlantis. Eliminating this component has reduced our team's toil by removing maintenance tasks that were necessary but added little value.

Simplify existing services

  • We removed some k8s node groups. We reduced the complexity of our k8s clusters by reducing the number and types of node groups. That complexity came from premature optimization mixed with some “potential” requirements that never came true. We recognized the error and cut down the number of different node groups we required. Besides reducing the infrastructure code to maintain, this simplification improves machine utilization, with the consequent cost savings. Together with some other changes, it has saved us about $10K/month.
  • We removed layers of complexity in our Terraform code. Our Terraform code was structured to allow maximum flexibility through several layers of abstraction. In the day-to-day, this meant that any change required several changes in different repositories. At the same time, we were not using the flexibility this structure was supposed to provide, and it generated conflicts with an AWS account migration we were doing. We recognized this problem, and over several months we simplified our code to remove tons of unneeded complexity. This change has improved the team's development speed. It has also made it easier for us to onboard new members.
  • We simplified our monolith pipeline and release system. As part of the Platform/DevEx mission, we started an initiative to optimize the monolith release system. The first step was to take ownership, understand the release system, and mercilessly simplify the related pipelines and release mechanism. This initiative has reduced some of the friction in the development process, making our developers more efficient. Although this improvement can be quantified in terms of money, for me the essential factor is that it has improved the teams' confidence in the release system.

Always fighting against complexity


In addition to all these simplifications, we have also identified other opportunities to reduce the complexity that we will address in the coming year.

For example:

  • Replace one of our mongo database backup systems.
  • Move RabbitMQ and Grafana to managed services.
  • Eliminate the current VPN and replace it with a zero-trust network solution.

Of course, if we find other opportunities to simplify the system, have no doubt that we will take advantage of them :)

Among other things, the Agile Manifesto says:

  • Continuous attention to technical excellence and good design enhances agility.
  • Simplicity--the art of maximizing the amount of work not done--is essential.

I would like to add to the manifesto:

  • Please, eliminate and simplify mercilessly.


Our profession is about managing and controlling complexity, so let's celebrate and prioritize simplification.

"Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it." - Alan Perlis


Tuesday, January 04, 2022

Good talks/podcasts (January 2022 I)



These are the best podcasts/talks I've seen/listened to recently:

  • Avoid These Common Mistakes Junior Developers Make (Dave Farley) [Engineering Career, Inspirational, Software Design] [Duration: 0:18:00] (⭐⭐⭐⭐⭐) A must-see talk. Dave Farley describes 8 common mistakes that junior developers often make and offers his advice on how to avoid them. Whatever your approach to software engineering and software development, whether you are practicing Continuous Delivery, DevOps, or something else, we think that you may find some helpful ideas in this video.
  • Martin Fowler On The Fundamentals Of Software Development | The Engineering Room Ep. 1 (Dave Farley, Martin Fowler) [Agile, Architecture, Architecture patterns] [Duration: 1:13:00] Dave and Martin discuss a wide range of ideas, from new work in patterns in distributed systems and Data Mesh, to the fundamental principles of software development that matter, whatever the technology or problem that you are solving.
  • Engineering Productivity @Google (Michael Bachman) [Devex, Engineering productivity] [Duration: 0:32:00] Interesting talk on how engineering productivity is organized at Google.
  • Gojko Adzic On How Agile Failed at the BBC and the FBI | The Engineering Room Ep. 3 (Gojko Adzic, Dave Farley) [Engineering Career, Engineering Culture, Product, Product Discovery] [Duration: 1:15:00] Dave and Gojko chat about a wide-ranging series of topics on product development, steering development organisations to success, Palchinsky principles and how agile development failed for the FBI and the BBC.
  • The Principles of Product Development Flow / Small batches podcast (Adam Hawkins) [Flow, Lean Product Management, Lean Software Development, Product Team] [Duration: 0:07:00] (⭐⭐⭐⭐⭐) Super dense and interesting summary of the book "The Principles of Product Development Flow".
  • Time Thieves / Small batches podcast (Adam Hawkins) [Agile, Lean, Lean Product Management, Lean Software Development] [Duration: 0:08:00] A summary of the time thieves as described in Dominica DeGrandis's book "Making Work Visible". Adam explains too much WIP, unknown dependencies, conflicting priorities, neglected work, and interruptions.

Reminder: all these talks are interesting even if you just listen to them.

Saturday, December 25, 2021

Good talks/podcasts (December 2021 I)


 


These are the best podcasts/talks I've seen/listened to recently:

  • What It Takes To Be A Software Engineer (Dave Farley) [Engineering Culture, Inspirational, Software Design] [Duration: 0:18:00] (⭐⭐⭐⭐⭐) Great and concise description of what software engineering is and the forces that apply to our profession.
  • MMMSS – The Intrinsic Benefit of Steps (GeePaw Hill) [Agile, Flow, XP] [Duration: 0:12:00] (⭐⭐⭐⭐⭐) Why we should work in small safe steps (3s).
  • Wardley Maps for concious strategy definition (Ismael Castillo, Enrique Caballero) [Architecture, Product Strategy, Wardley maps] [Duration: 0:30:00] The speakers share ideas on how to align strategy and create situational awareness using Wardley Maps, DDD, and Team Topologies, among other tools and frameworks.
  • The New Faces of Continuous Improvement | Dev Interrupted Engineering Panel (Charity Majors, Kathryn Koehler, Dana Lawson) [Continuous Delivery, Engineering Culture, Teams] [Duration: 0:55:00] A very interesting panel on high performance teams, engineering culture, continuous delivery, etc.
  • Keynote: Systems Thinking (Jessica Kerr, Kent Beck) [Inspirational, Systems Thinking] [Duration: 0:59:00] A great introduction to systems thinking.

Reminder: all these talks are interesting even if you just listen to them.

Sunday, November 14, 2021

Improve or Die

A software development/product development organization should always be learning and improving.

When the organization is not learning or improving, it is going backward. Software development is a complex socio-technical system formed by several interrelated reinforcing loops. Some of the loops are positive (virtuous cycles) and some negative (vicious cycles), but the problem is that in such a complex system it is difficult to find any balance, so in general, we are always moving.

So the question is, in which direction? Are we learning and improving as a team, or are we dying or falling behind?

Even if we managed to maintain a continuous flow of development with stable quality and speed (which is impossible), the whole ecosystem around us continues to improve and advance, so even in that case, we would be losing ground.

In general, the reinforcing loops are generated by things that compound with time or volume. 

For example, these are some things with negative compound effects: 

  • Complexity and basal cost of the product (cumulative features)
  • Quality problems.
  • Technical debt (if not managed).

Examples of virtuous cycles:

  • Continuous delivery requires quality and small batches, which removes silos, which improves ownership, etc.
  • Product team ownership requires autonomy and product instrumentation, which requires learning from customers, which generates more value, etc.
  • etc.

Examples of vicious cycles:

  • Unmanaged technical debt removes capacity from the team, which generates more pressure, which generates more technical debt, etc.
  • Accidental complexity makes the code difficult to understand, which generates more bugs, which puts more pressure on the team, which leads to poor solutions with even more accidental complexity, etc.
  • A bad deployment process generates frustration, so we tend toward larger batches, which are riskier and cause more problems, which generates even more frustration, etc.
  • etc.

When we have several of these vicious cycles, it is easier than it seems to fall into a downward spiral from which we cannot get out.
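
To make the compounding effect concrete, here is a toy sketch of the first vicious cycle above. It is not a validated model: the function name, the starting debt of 10%, and the weekly growth and payback rates are all invented assumptions, chosen only to show how a reinforcing loop behaves with and without investment in breaking it.

```python
def simulate_capacity(weeks, debt_growth, payback):
    """Return the team's remaining capacity per week (1.0 = full capacity).

    debt_growth: assumed fraction by which unmanaged debt compounds each week.
    payback: assumed fraction of the debt the team removes each week.
    """
    debt = 0.1  # assumed starting point: debt consumes 10% of capacity
    capacity = []
    for _ in range(weeks):
        # Debt compounds, then the team pays part of it back.
        debt = debt * (1 + debt_growth) * (1 - payback)
        debt = min(debt, 1.0)  # debt cannot consume more than all capacity
        capacity.append(1.0 - debt)
    return capacity


# Illustrative rates only: no payback versus paying back 15% per week.
unmanaged = simulate_capacity(weeks=20, debt_growth=0.10, payback=0.0)
managed = simulate_capacity(weeks=20, debt_growth=0.10, payback=0.15)

print(f"unmanaged capacity after 20 weeks: {unmanaged[-1]:.2f}")
print(f"managed capacity after 20 weeks:   {managed[-1]:.2f}")
```

With these assumed rates, the unmanaged case loses capacity every week, while a steady payback larger than the growth turns the same loop around. The real numbers in any team will differ; the point is only the direction of the spiral.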

So, are you investing in breaking the vicious cycles of poor quality, high resource usage, and unmanaged technical debt? Or are you investing in improving your virtuous cycles of working in small batches, with ownership and high quality?

Are you improving, or are you dying and falling behind?


And if the problem is that you don't know what a high-performance technology organization should look like, you are in luck; we now have information on what it should look like (see the book Accelerate).


Sunday, November 07, 2021

Good talks/podcasts (November 2021 I)



These are the best podcasts/talks I've seen/listened to recently:

  • Debt Metaphor (Ward Cunningham) [Inspirational, Technical Practices, Technology Strategy, XP] [Duration: 0:05:00] (⭐⭐⭐⭐⭐) Ward Cunningham reflects on the history, motivation and common misunderstanding of the "debt metaphor" as motivation for refactoring.
  • EP 47: How to scale engineering processes w/ Twitter's VP of Engineering (Maria Gutierrez) [Engineering Career, Engineering Culture, leadership] A very interesting interview with Maria Gutierrez. Great lessons about team management, building a company culture, hiring, and mentorship.
  • Getting Started With Microservices (Dave Farley) [Architecture, Architecture patterns, Continuous Delivery] In this episode, a microservices tutorial, Dave Farley describes the microservices basics that help you to do a better job. He describes three different levels that we need to think about when designing a service and offers his advice on how to focus on the right parts of the problem to allow you to create better, more independent, services, based on Dave’s software engineering approach.
  • Industry Keynote: The DevOps Transformation (Jez Humble) [Agile, Continuous Delivery, Devops, Engineering Culture, leadership] (⭐⭐⭐⭐⭐) In this talk Jez will describe how to implement devops principles and practices, how to overcome typical obstacles, and the outcomes DevOps enables. A must-see talk.
  • Lunch & Learn How to Misuse DORA DevOps Metrics (Bryan Finster) [Devops, Engineering Culture, leadership] Interesting presentation in which Bryan describes an agile/devops transformation, telling us about mistakes and successes. Interesting learnings, tips, and ideas.

Reminder: all these talks are interesting even if you just listen to them.

    Wednesday, October 27, 2021

    "It depends" / The development Mix (Product, Engineering, Hygiene)

    We already know that everything is about context. I read a lot of blog posts about how much time a team should spend decoupling components, introducing a new cache system, or improving the scalability of their systems. When reading this type of content, I always think that they are completely right and completely wrong. Everything in our profession depends a lot on the context (the moment of the company, the business strategy, the market traction, etc.).

    I use a mental model that helps me classify the work we do, which allows me to communicate and make decisions. I call this mental model "The Mix".

    In "The Mix", I classify the work we do as product engineers into:
    • Normal product development.
    • Implementing the Engineering Roadmap.
    • Basic hygiene work.

    Normal product development

    Normal product development should be the most common type of work for a Stream Aligned team. It should help to fulfill the mission of the team. It can be composed of new feature development, discovery experiments, feature evolution, etc. I prefer a very lean approach for this work, following agile development methods such as XP or Lean Software Development. It is essential to generate the expected outcomes with the minimal amount of code possible and a good internal quality that minimizes the maintenance cost. Following YAGNI, KISS, and Simple Design is the perfect approach for this kind of work. We don't know the future. The most efficient way to work is to build the simplest solution that covers our customers' needs, without any "speculative" design, which generates tons of accidental complexity in 99% of the cases.

    Summary:
    • Focus on outcomes for the customer within the business constraints.
    • Use evolutionary design based on Simple Design and avoid creating anything for future "expected/invented" needs.
    • Use a Lean approach (working in small safe steps).
    • Avoid solving problems that we don't have.
    • High-speed feedback loop.
    • Aligned with the Product Roadmap.

    Implementing the Engineering Roadmap

    In parallel to the product work, it is very common to identify engineering needs derived from the company's engineering strategy. This strategy should prepare and maintain the current and future engineering capability. Examples of this type of work are:
    • Designing the system for fast expected growth (the number of customers, engineering team size, etc.).
    • A technology stack change.
    • A change in the delivery strategy (from On-Prem to SaaS, from Web to mobile, etc.).
    • Preparing the architecture to enable work in autonomous teams.

    This kind of work usually affects several Stream Aligned teams simultaneously and requires coordination at the engineering organization level. These initiatives require a lot of investment and should be coordinated with the product roadmap and aligned with the company's general strategy.

    Summary:
    • Focus on outcomes for the internal architecture and engineering processes.
    • Require more upfront effort to design the solution.
    • It can be implemented with an agile approach but based on the initial design.
    • Low-speed feedback loop.
    • By definition, try to solve problems that we don't have (yet).
    • It is aligned with the Engineering Roadmap (coordinated with the Product Roadmap).

    Basic hygiene work

    To develop any nontrivial product, we need to have some practices and development infrastructure that I consider basic hygiene. I'm talking about having a reasonable test strategy, zero-downtime releases, good internal code quality, basic security practices, etc.
    In the middle of 2021, not considering these points above seems simply a lack of professionalism. 
    So the Basic hygiene work includes any effort we make to implement or improve these minimal practices.

    Of course, I am a big fan of product discovery with prototypes, and these, for example, do not have to have the same test strategy. But remember, a prototype that ends up in production, staying in front of our customers for months, is not a prototype. It is a trap.
     


    Using The Mix

    Thinking about these three types of work and separating them helps me be more explicit about the context and the appropriate trade-offs in each situation. For example, suppose we are in a Normal Product Development initiative. In that case, we cannot expect big architecture change decisions to emerge, and it is better to focus on small safe steps that add value. At the same time, we take notes to consider some initiatives to introduce in the engineering roadmap.

    A mature product organization will introduce performance, scalability, and availability initiatives into the product roadmap. In a less mature organization, those needs are likely to be missing from the product roadmap, and it is up to engineering to fight to get them into the engineering roadmap.

    We can summarize the different dimensions in this table:
     
    | Dimension | Product development | Engineering Roadmap | Hygiene |
    |---|---|---|---|
    | Source | Product | Engineering | Team (+Engineering) |
    | Development | Small Safe Steps | Upfront Planning + Small Safe Steps | Small Safe Steps |
    | Practices | YAGNI, KISS, TDD... | Evolutionary Architecture, Load Testing, Testing in Production, migration planning... | Clean Code, CI, CD, Zero downtime, Observability... |
    | Type of needs | Current needs | Future needs | Prerequisite |
    | Value Delivery | Very Fast | Slow | Very Fast |
    | Coordination Needs | The team should be autonomous | Coordination with other teams | Coordination with other teams |


    If we analyze the current trends in technology using this mental model, some questions arise:
    • How do technologies like PaaS or Serverless influence this Mix?
    • How does working in Cloud vs. working on-prem affect the engineering roadmap?
    • Does it make sense to consider ourselves good professionals if we don't have strong knowledge about hygiene factors?
    • How does the mix change in the different phases of the company (startup pre-product-market fit, scale-up, big tech)? And with the life cycle of the product?

    The interesting thing about mental models is that they help us think. I hope this model is as valuable for you as it is for me.

    Related / Other mental models