If you don’t actively fight for simplicity in software, complexity will win. …and it will suck. - @HenrikJoreteg
In our profession, it is common to celebrate adding new features or capabilities, but far less common to celebrate removing components or simplifying the system. The problem is that we use the wrong metaphor: we usually talk about “building” or “making” new features. By using this “building” metaphor, we neglect all the ongoing work associated with each new element we create (feature, capability, component, etc.) (see the Basal Cost of software).
With my teams, I prefer to talk about the new capabilities we enable, the changes in our users' behavior, and the amount of complexity we manage and maintain. I try to convey that controlling and reducing complexity is as essential as developing new capabilities and features.
When I joined the Clarity AI Platform team, the team suffered from too much toil. This toil was generated by the lack of self-service capabilities for the stream-aligned teams, the considerable number of components the team managed (k8s clusters, MongoDB clusters, an in-house monitoring platform, etc.), and the accidental complexity of the infrastructure.
It came as no surprise to me. Clarity AI was a fast-growing startup that, in its early years, was in a hurry to reach product-market fit. This meant that controlling complexity was not the priority in those first days (since striking that balance would have been much more difficult).
My first step was to quantify and classify the work/tasks and the source of each task. With this information and our context (a VC-funded startup), we, the Platform team, determined that it did not make sense (in our case) to manage all that infrastructure ourselves. Therefore, we decided to use managed services whenever possible and to simplify the infrastructure.
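A lightweight way to do this quantification is to tag each recorded task with a type and a source, then tally where the toil actually comes from. A minimal sketch of that triage (the task names and categories below are hypothetical, not our real log):

```python
from collections import Counter

# Hypothetical work log: (task, type, source).
tasks = [
    ("patch k8s nodes", "toil", "self-managed infra"),
    ("rotate MongoDB certificates", "toil", "self-managed infra"),
    ("create ECR repo for team A", "toil", "missing self-service"),
    ("create ECR repo for team B", "toil", "missing self-service"),
    ("design Slack bot command", "project", "platform roadmap"),
]

# Count only toil, grouped by its source.
by_source = Counter(source for _, kind, source in tasks if kind == "toil")

for source, count in by_source.most_common():
    print(f"{source}: {count}")
```

Even a simple tally like this makes the decision visible: if most toil traces back to self-managed infrastructure and missing self-service, the data points straight at managed services and self-service tooling.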
Let's celebrate simplification and removal
Over the last year and a half, we have developed several new capabilities and reduced the system's complexity by simplifying solutions, removing non-essential components, and migrating to managed services.
Migrate to managed services
- We removed all code and toil related to self-managed Kubernetes clusters. We migrated all of our kops-managed k8s clusters to EKS. This change allowed us to upgrade our clusters easily and to remove tons of obsolete code and tooling we had used to maintain them. It also enabled other simplifications, such as EKS managed node groups. Overall, it drastically reduced the team's toil (less security patching and upgrading, less code to maintain, etc.) and allowed us to make faster changes to our Kubernetes infrastructure.
- We adopted EKS managed node groups. We migrated (practically) all k8s node groups to managed node groups, which eliminated the burden of patching and maintaining the OS on the nodes. This change improved our security posture and allowed us to remove the configuration code for those machines.
- We replaced Prometheus, Alertmanager, and the metrics server. To provide a more complete monitoring solution, we integrated our systems with Datadog. This managed service gave us a more effective and easier-to-use monitoring solution and allowed us to remove our ad-hoc internal one: Prometheus, Alertmanager, the metrics server, and the corresponding services and code. The most significant benefit is that it saves us from maintaining, updating, patching, and managing all those components, since Datadog already provides that functionality.
- We migrated the self-managed MongoDB cluster to MongoDB Atlas. This change allowed us to remove several EC2 instances and the code associated with managing the clusters, and it minimized the toil of DB operations. We also saved a lot of development costs, since we were in the middle of improving security (enabling encryption at rest) and building the tooling needed to scale vertically and horizontally without losing service. With the managed service, all of these features are available without any development.
Remove anything unused
- We removed several S3 buckets and EC2 machines. After two months of talking with many people across the company, we identified several S3 buckets and EC2 machines with no clear usage. After removing the residual use, we could delete the buckets and some of the machines. This change reduced our monthly AWS costs by $400-$500 and improved our security, as the EC2 machines did not follow the same security rules as the rest of our infrastructure.
- We deleted an abandoned tool. I consider this change a personal victory; it only cost us a year of talking to and convincing a lot of people :) It was a small application, mainly used in demos, that had no clear owner and was not properly maintained. Removing it also allowed us to remove the repository, the database, the deployment artifacts, and some ad-hoc AWS resources. This change reduced our monthly AWS bill by $150 and removed a security attack vector. Furthermore, we saved a lot of development costs: the framework and the database used by the application had become obsolete, so if we hadn't removed it, we would have had to update the framework, the database, and the DB driver. Since its original developers no longer worked for the company, this would not have been easy.
- We removed commands from our platform Slack bot. Last year, we created a Slack bot that lets our users self-service some platform-related operations. We try to follow a modern product development flow, including product discovery. Still, some commands that seem interesting during the discovery phase end up barely used. In those cases, we removed the commands from the code base, knowing that we can recover the code as a starting point to recreate them (or something similar) in the future. This simplification reduced the bot's maintenance and evolution costs, accelerating future development.
- We removed the platform CLI PoC. Right now, our platform bot is used via Slack commands. Some months ago, we built a proof of concept of a local command-line interface for interacting with the bot. We saw that it wasn't useful enough yet (given the types of commands available), so we removed the corresponding code, and with it the complexity, until we detect a clearer opportunity to release a CLI for our users. Once the PoC had taught us what we needed to learn, eliminating the code, as in the other cases, reduced the cost of maintenance and future evolution.
- We removed Atlantis for Terraform changes. Atlantis allowed us to automate generating and approving pull requests for our Terraform code. In the past, we used this approach to let the stream-aligned teams create ECR repositories in a self-service manner. In practice, the process did not work well: there were conflicts between different PRs, issues with the changes due to a lack of Terraform and infrastructure knowledge, etc. In the end, we developed a Slack command to create ECR repositories and removed Atlantis. Eliminating this component reduced the team's toil by removing some necessary but low-value maintenance tasks.
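Part of why a bot command beats Terraform PRs for this kind of self-service is that the bot can validate the request before touching AWS, so users never produce a broken plan. A minimal sketch of such a guard, assuming ECR's published naming rules (the function name is hypothetical; the real bot would call the ECR API after this check):

```python
import re

# ECR repository names are lowercase alphanumerics, with hyphens,
# underscores, and periods as in-segment separators, and slashes
# between segments; total length must be 2-256 characters.
ECR_NAME = re.compile(
    r"^(?:[a-z0-9]+(?:[._-][a-z0-9]+)*/)*[a-z0-9]+(?:[._-][a-z0-9]+)*$"
)

def validate_repo_name(name: str) -> bool:
    """Return True if `name` looks like a valid ECR repository name."""
    return 2 <= len(name) <= 256 and ECR_NAME.match(name) is not None

# The bot replies with a friendly Slack message on bad input
# instead of opening a Terraform PR that will fail review.
for candidate in ("team-a/my-service", "My_Service", "api"):
    status = "ok" if validate_repo_name(candidate) else "rejected"
    print(f"{candidate} -> {status}")
```

Validation failures become an instant Slack reply rather than a PR conflict, which is exactly the feedback loop the Atlantis flow was missing.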
Simplify existing services
- We removed some k8s node groups. We reduced the complexity of our k8s clusters by cutting down the number and types of node groups. This complexity came from premature optimization mixed with some “potential” requirements that never came true. We recognized the error and reduced the number of different node groups we required. In addition to reducing the infrastructure code to maintain, this simplification improved machine utilization, with the corresponding cost savings. Together with some other changes, it has saved us about $10K/month.
- We removed layers of complexity from our Terraform code. Our Terraform code was structured for maximum flexibility, using several layers of abstraction. In day-to-day work, this meant that any change required touching several different repositories. At the same time, we were not using the flexibility this structure was supposed to provide, and it generated conflicts with an AWS account migration we were carrying out. We recognized the problem, and over several months we simplified the code to remove tons of unneeded complexity. This change has improved the team's development speed and made it easier to onboard new members.
- We simplified our monolith's pipeline and release system. As part of the Platform/DevEx mission, we ran a process to optimize the monolith's release system. The first step was to take ownership, understand the release system, and mercilessly simplify the related pipelines and release mechanism. This initiative has reduced some of the friction in the development process, making our developers more efficient. Although this improvement can be quantified in money, for me the essential factor is that it has improved the teams' confidence in the release system.
Always fighting against complexity
The fight never ends, though. Our next steps include:
- Replace one of our mongo database backup systems.
- Move RabbitMQ and Grafana to managed services.
- Eliminate the current VPN and replace it with a zero-trust network solution.
Of course, if we find other opportunities to simplify the system, rest assured that we will take advantage of them :)
Among other things, the Agile Manifesto says:
- Continuous attention to technical excellence and good design enhances agility.
- Simplicity--the art of maximizing the amount of work not done--is essential.
So please: eliminate and simplify mercilessly.
Our profession is about managing and controlling complexity, so let's celebrate and prioritize simplification.
"Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it." - Alan Perlis