Saturday, January 22, 2022

Good talks/podcasts (January 2022 II)

 


These are the best podcasts/talks I've seen/listened to recently:
  • Engineering Your Organization: Services, Platforms, and Communities (Randy Shoup) [Company Culture, Engineering Culture, Inspirational, Management, Platform, Platform as a product, Technology Strategy] [Duration: 0:38:00] (⭐⭐⭐⭐⭐) Great summary about the different ways high-performing engineering organizations gain leverage by specialization and sharing.
  • TDD, where did it all go wrong (Ian Cooper) [Technical Practices, tdd, testing] [Duration: 1:01:00] (⭐⭐⭐⭐⭐) Essential talk about how to do TDD in an efficient way and getting a battery of tests that support continuous refactoring. It fundamentally changed my approach to TDD. I highly recommend it.
  • Driving a Tech-led Reimagination of eBay Through DevOps (US 2021) (Randy Shoup, Mark Weinberg) [Devops, Technical leadership] [Duration: 0:33:00] (⭐⭐⭐⭐⭐) A very interesting session about eBay's strategy to improve delivery performance. A great example of engineering leadership.
  • How Honeycomb Manages Incident Response (Fred Hebert) [Incident respond, Operations] [Duration: 0:30:00] In this talk, Fred covers the full incident lifecycle at Honeycomb: all the way from first detecting issues to resolving them. But most of the effective practices we implement come from work that happens before and after those incidents. You'll also learn about systems we use at Honeycomb that can also help you implement better incident response with your teams.
  • Common Mistakes Data Scientists Make With BIG DATA (Dave Farley) [Big Data, Continuous Delivery, Data Engineering, Data Science] [Duration: 0:14:00] (⭐⭐⭐⭐⭐) In this episode Dave Farley explores how we could do a better job of dealing with data. Ideas like data pipelines and data mesh are becoming more common and more applicable as the scale of the data that we are dealing with grows. Managing the complexity in these activities, as in any other aspect of software engineering, is critical to success with data.
  • Ship It! #31 Is Kubernetes a platform? (Tammer Saleh, Gerhard Lazu) [Devops, Platform, Platform as a product, k8s] [Duration: 1:01:00] Interesting conversation about how to use k8s as a base for a platform.
Reminder: all these talks are interesting even if you just listen to them.


Thursday, January 13, 2022

Fighting complexity: let's celebrate removals & simplifications

If you don’t actively fight for simplicity in software, complexity will win. …and it will suck. - @HenrikJoreteg

It is ubiquitous in our profession to celebrate adding new features or capabilities, but it is less common to celebrate removing components or simplifying the system. The problem is that we are using the wrong metaphor. We usually talk about “building” or “making” new features. But by using this “building” metaphor, we neglect all the work associated with the new element we created (feature, capability, component, etc.) (See Basal Cost of software). 

With my teams, I prefer to talk about the new capabilities we enable, the changes in our users' behavior, and the amount of complexity we manage and maintain. I try to convey that controlling and reducing complexity is as essential as developing new capabilities and features.

When I joined the Clarity AI Platform team, there was a problem of too much toil. This toil was generated by the lack of self-service capabilities for the stream-aligned teams, by the considerable number of components the team managed (k8s clusters, mongodb clusters, in-house monitoring platform, etc.), and by the amount of accidental complexity in the infrastructure.

It came as no surprise to me. Clarity AI was a fast-growing startup that, during its first years, was in a hurry to achieve product-market fit. This meant that controlling complexity was not the priority in the early days (when striking that balance would have been much more difficult).

My first step was to quantify and classify the work/tasks and the source of each task. With this information and our context (a startup funded by VCs), we, the Platform team, determined that it did not make sense (in our case) to manage all that infrastructure ourselves. Therefore, we decided to use managed services whenever possible and simplify the infrastructure.
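As an illustration of that first step, here is a minimal sketch of how a team could tally toil by source. The task log, its field layout, and the hour figures are all hypothetical examples, not the data we actually used.

```python
from collections import defaultdict

# Hypothetical task log: (source, kind of task, hours spent).
# All entries and numbers are illustrative assumptions.
TASKS = [
    ("self-managed k8s clusters", "upgrades", 6),
    ("self-managed k8s clusters", "security patching", 4),
    ("self-managed MongoDB", "backups", 3),
    ("in-house monitoring", "maintenance", 5),
    ("stream-aligned team requests", "manual provisioning", 7),
    ("self-managed MongoDB", "scaling", 2),
]


def hours_by_source(tasks):
    """Sum the hours per source and return them sorted, largest first."""
    totals = defaultdict(float)
    for source, _kind, hours in tasks:
        totals[source] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def share_by_source(tasks):
    """Return each source's share of the total hours, as a fraction of 1."""
    ranked = hours_by_source(tasks)
    total = sum(hours for _source, hours in ranked)
    return {source: hours / total for source, hours in ranked}


if __name__ == "__main__":
    for source, hours in hours_by_source(TASKS):
        print(f"{source}: {hours:.0f}h")
```

Even a rough ranking like this makes it easier to discuss which components are worth managing yourself and which are candidates for a managed service.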

Let's celebrate simplification and removal

During this last year and a half, we developed several new capabilities and reduced the system's complexity by simplifying solutions, removing non-essential components, and migrating solutions to managed services.

Migrate to managed services

  • We removed all code and toil related to self-managed kubernetes clusters. We migrated all of our kops managed k8s clusters to EKS. This change allowed us to upgrade our clusters easily and remove tons of obsolete code and tooling we used to maintain the clusters. This change also enabled other simplifications, such as using EKS managed node groups. It drastically reduced the toil in the team (less security patching and upgrading, less code to maintain, etc.) and allowed us to make faster changes in our kubernetes infrastructure.
  • Use of EKS managed node groups. We migrated (practically) all k8s node groups to managed groups, which has allowed us to remove all the trouble of patching and maintaining the OS for the nodes. This change improved our security posture and allowed us to remove these machines' configuration code.
  • We replaced Prometheus, Alert Manager, and metric server. To provide a more complete monitoring solution, we integrated our systems with Datadog. This managed service gave us a more effective and easier-to-use monitoring solution and allowed us to remove our ad-hoc internal one: Prometheus, the alert manager, the metric server, and the corresponding services and code. In this case, the most significant benefit is that we no longer have to maintain, update, patch, and manage all those components, since Datadog already provides that functionality.
  • We migrated the self-managed MongoDB Cluster to MongoDB Atlas. This change allowed us to remove several EC2 instances and the code associated with managing the clusters, and to minimize the toil associated with DB operations. We also saved a lot of development costs with this change since we were in the process of improving security (enabling encryption at rest) and developing all the necessary tools to scale vertically and horizontally without losing service. With the managed service, all of these features are available without any development.

Remove anything unused

  • We removed several S3 buckets and EC2 machines. After two months of talking with many people from the company, we identified several S3 buckets and EC2 machines without clear usage. By removing the residual use, we could delete the buckets and some machines. This change reduced our monthly AWS costs by $400-$500 and improved our security, as the EC2 machines did not follow the same security rules as the rest of our infrastructure.
  • We deleted an abandoned tool. I consider this change a personal victory, and it only cost us one year of talking to and convincing a lot of people :) It was a small application, mainly used in demos, that did not have a clear owner and was not maintained properly. Removing this application also allowed us to remove the repository code, the database, the deployment artifacts, and some ad-hoc AWS resources. This change reduced our AWS monthly bill by $150 and removed a security attack vector. Furthermore, we saved a lot of development costs because the framework and the database used by the application had become obsolete, so if we hadn't removed the application, we would have had to update the framework, the database, and the DB driver. As its original developers no longer worked for the company, this would not have been easy.
  • We removed commands from our platform Slack bot. Last year, we created a Slack bot that allows our users to self-service some platform-related operations. We try to follow a modern product development flow, including product discovery. Still, some commands seem interesting during the discovery phase but, in the end, are not used much. In these cases, we removed the commands from the code base, knowing that we can recover the code as a starting point to recreate the same command or a similar one in the future. With this simplification we reduced application maintenance and evolution costs, accelerating future developments.
  • We removed the platform CLI PoC. Right now, our platform bot is used via Slack commands. Some months ago we made a proof of concept of a local command-line interface to interact with the bot. We saw that it wasn’t useful enough yet (due to the type of commands available), so we removed the corresponding code to remove the complexity until we detect a more evident opportunity to release a command-line interface for our users. Once the PoC had taught us what we needed to learn, eliminating the code, as in other cases, reduced the cost of maintenance and future evolution.
  • We removed Atlantis for Terraform changes. Atlantis allows us to automate the process of generating and approving Pull Requests for the terraform code. In the past, we used this approach to allow the stream-aligned teams to create ECR repositories in a self-service manner. In reality, this process was not working well: there were conflicts between different PRs, issues with the changes due to the lack of knowledge about terraform and infrastructure, etc. In the end, we developed a Slack command to create ECR repositories and deleted the support for Atlantis. The elimination of this component has reduced the toil of our team, eliminating some very necessary but low value-added maintenance tasks.

Simplify existing services

  • We removed some k8s node groups. We reduced the complexity of our k8s clusters by simplifying the number and type of node groups. This complexity came from premature optimization mixed with some “potential” requirements that never materialized. We recognized the error and reduced the number of different node groups we required. In addition to reducing the infrastructure code to maintain, this simplification improved the utilization of the machines, with the consequent cost savings. This simplification and some other additional changes have saved us about $10K/month.
  • We removed layers of complexity in our Terraform code. Our terraform code was structured to give us maximum flexibility through several layers of abstraction. In the day-to-day, this meant that any change required making several changes in different repositories. At the same time, we were not using the flexibility that this structure was supposed to provide, and the structure generated conflicts with an AWS account migration that we were doing. We recognized this problem, and for several months we simplified our code to remove tons of unneeded complexity. This change has improved the development speed of the team. It has also made it easier for us to onboard new members.
  • We simplified our Monolith pipeline and release system. As part of the Platform/DevEx mission, we started a process to optimize the monolith release system. The first step was to take ownership, understand the release system, and mercilessly simplify the related pipelines and release mechanism. This initiative has reduced some of the friction in the development process, making our developers more efficient. Although this improvement can be quantified in monetary terms, for me, the essential factor is that it has improved the teams' confidence in the release system.

Always fighting against complexity


In addition to all these simplifications, we have also identified other opportunities to reduce the complexity that we will address in the coming year.

For example:

  • Replace one of our mongo database backup systems.
  • Move RabbitMQ and Grafana to managed services.
  • Eliminate the current VPN and replace it with a zero-trust network solution.

Of course, if we find other opportunities to simplify the system, have no doubt that we will take advantage of them :)

Among other things, the Agile Manifesto says:

  • Continuous attention to technical excellence and good design enhances agility.
  • Simplicity--the art of maximizing the amount of work not done--is essential.
I would like to add to the manifesto:
  • Please, eliminate and simplify mercilessly.


Our profession is about managing and controlling complexity, so let's celebrate and prioritize simplification.

"Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it." - Alan Perlis


References and related content

Thanks

The post has been improved based on feedback from:

Tuesday, January 04, 2022

Good talks/podcasts (January 2022 I)



These are the best podcasts/talks I've seen/listened to recently:

  • Avoid These Common Mistakes Junior Developers Make (Dave Farley) [Engineering Career, Inspirational, Software Design] [Duration: 0:18:00] (⭐⭐⭐⭐⭐) A must-see talk. Dave Farley describes 8 common mistakes that junior developers often make and offers his advice on how to avoid them. Whatever your approach to software engineering and software development, whether you are practicing Continuous Delivery, DevOps, or something else, we think that you may find some helpful ideas in this video.
  • Martin Fowler On The Fundamentals Of Software Development | The Engineering Room Ep. 1 (Dave Farley, Martin Fowler) [Agile, Architecture, Architecture patterns] [Duration: 1:13:00] Dave and Martin discuss a wide range of ideas, from new work in patterns in distributed systems and Data Mesh, to the fundamental principles of software development that matter, whatever the technology or problem that you are solving.
  • Engineering Productivity @Google (Michael Bachman) [Devex, Engineering productivity] [Duration: 0:32:00] Interesting talk on how engineering productivity is organized at Google.
  • Gojko Adzic On How Agile Failed at the BBC and the FBI | The Engineering Room Ep. 3 (Gojko Adzic, Dave Farley) [Engineering Career, Engineering Culture, Product, Product Discovery] [Duration: 1:15:00] Dave and Gojko chat about a wide-ranging series of topics on product development, steering development organisations to success, Palchinsky principles and how agile development failed for the FBI and the BBC.
  • The Principles of Product Development Flow / Small batches podcast (Adam Hawkins) [Flow, Lean Product Management, Lean Software Development, Product Team] [Duration: 0:07:00] (⭐⭐⭐⭐⭐) Super dense and interesting summary of the book "The Principles of Product Development Flow".
  • Time Thieves / Small batches podcast (Adam Hawkins) [Agile, Lean, Lean Product Management, Lean Software Development] [Duration: 0:08:00] A summary of the time thieves as described in Dominica DeGrandis's book "Making Work Visible". Adam explains too much WIP, unknown dependencies, conflicting priorities, neglected work, and interruptions.
Reminder: all these talks are interesting even if you just listen to them.


Saturday, December 25, 2021

Good talks/podcasts (December 2021 I)


 


These are the best podcasts/talks I've seen/listened to recently:

  • What It Takes To Be A Software Engineer (Dave Farley) [Engineering Culture, Inspirational, Software Design] [Duration: 0:18:00] (⭐⭐⭐⭐⭐) Great and concise description of what software engineering is and the forces that apply to our profession.
  • MMMSS – The Intrinsic Benefit of Steps (GeePaw Hill) [Agile, Flow, XP] [Duration: 0:12:00] (⭐⭐⭐⭐⭐) Why we should work in small safe steps (3s).
  • Wardley Maps for concious strategy definition (Ismael Castillo, Enrique Caballero) [Architecture, Product Strategy, Wardley maps] [Duration: 0:30:00] Shares ideas on how to align strategy and create situational awareness using Wardley Maps, DDD, and Team Topologies, among other tools and frameworks.
  • The New Faces of Continuous Improvement | Dev Interrupted Engineering Panel (Charity Majors, Kathryn Koehler, Dana Lawson) [Continuous Delivery, Engineering Culture, Teams] [Duration: 0:55:00] A very interesting panel on high performance teams, engineering culture, continuous delivery, etc.
  • Keynote: Systems Thinking (Jessica Kerr, Kent Beck) [Inspirational, Systems Thinking] [Duration: 0:59:00] A great introduction to systems thinking.
Reminder: all these talks are interesting even if you just listen to them.


Sunday, November 14, 2021

Improve or Die

A software development/product development organization should always be learning and improving.

When an organization is not learning or improving, it is going backward. Software development is a complex socio-technical system formed by several interrelated reinforcing loops. Some of the loops are positive (virtuous cycles) and some negative (vicious cycles), but in such a complex system it is difficult to find any balance, so in general, we are always moving.

So the question is, in which direction? Are we learning and improving as a team, or are we dying or falling behind?

Even if we managed to maintain a continuous flow of development with stable quality and speed (which is impossible), the whole ecosystem around us continues to improve and advance, so even in that case, we would be losing ground.

In general, the reinforcing loops are generated by things that compound with time or volume. 

For example, these are some things with negative compound effects: 

  • Complexity and basal cost of the product (cumulative features)
  • Quality problems.
  • Technical debt (if not managed).

Examples of virtuous cycles:

  • Continuous delivery requires quality, requires small batches, removes silos, improves ownership, etc.
  • Product Team ownership requires autonomy, requires product instrumentation, requires learning from customers, generates more value, etc.
  • etc.

Examples of vicious cycles:

  • Unmanaged technical debt removes capacity from the team, which generates more pressure, which generates more technical debt, etc.
  • Accidental complexity makes the code difficult to understand, which generates more bugs, which generates more pressure on the team, which generates poor solutions with even more accidental complexity, etc.
  • A bad deployment process generates frustration, so we tend toward larger batches, which are riskier and cause more problems, which generates even more frustration, etc.
  • etc.

When we have several of these vicious cycles, it is easier than it seems to fall into a downward spiral from which we cannot get out.
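To make the compounding effect concrete, here is a toy simulation of the technical debt loop. It is only a sketch: the model, the function name `simulate`, and every coefficient (capacity, interest, shortcut rate, paydown) are assumptions chosen for illustration, not measurements from any real team.

```python
def simulate(iterations, capacity=10.0, debt=2.0, interest=0.1,
             shortcut_rate=0.5, paydown=0.0):
    """Toy model of a reinforcing technical-debt loop.

    Each iteration the debt consumes part of the team's capacity, and the
    lost capacity creates pressure that adds new debt (shortcuts).
    `paydown` is the capacity the team reserves to reduce debt.
    Returns the value delivered in each iteration.
    """
    delivered_per_iteration = []
    for _ in range(iterations):
        # Capacity lost to working around existing debt.
        drag = min(capacity, interest * debt)
        available = capacity - drag
        # Capacity invested in reducing debt (never more than the debt itself).
        invested = min(available, paydown, debt)
        delivered_per_iteration.append(available - invested)
        # Pressure from lost capacity leads to shortcuts, which add debt.
        debt = max(0.0, debt + shortcut_rate * drag - invested)
    return delivered_per_iteration


if __name__ == "__main__":
    unmanaged = simulate(40)              # vicious cycle: debt is never paid
    managed = simulate(40, paydown=1.0)   # part of the capacity pays debt down
    print(f"unmanaged: first={unmanaged[0]:.2f} last={unmanaged[-1]:.2f}")
    print(f"managed:   first={managed[0]:.2f} last={managed[-1]:.2f}")
```

In this model the unmanaged team starts out delivering more, because it invests nothing in improvement, but its output shrinks in every iteration. The team that reserves some capacity to pay down debt delivers less at first and then overtakes it.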

So, are you investing in breaking the vicious cycles of poor quality, high resource usage, and unmanaged technical debt? Or are you investing in improving your virtuous cycles of working in small batches, with ownership and high quality?

Are you improving, or are you dying and falling behind?


And if the problem is that you don't know what a high-performance technology organization should look like, you are in luck; we now have information on what it should look like (Accelerate book).




Sunday, November 07, 2021

Good talks/podcasts (November 2021 I)



These are the best podcasts/talks I've seen/listened to recently:

  • Debt Metaphor (Ward Cunningham) [Inspirational, Technical Practices, Technology Strategy, XP] [Duration: 0:05:00] (⭐⭐⭐⭐⭐) Ward Cunningham reflects on the history, motivation and common misunderstanding of the "debt metaphor" as motivation for refactoring.
  • EP 47: How to scale engineering processes w/ Twitter's VP of Engineering (Maria Gutierrez) [Engineering Career, Engineering Culture, leadership] A very interesting interview with Maria Gutierrez. Great lessons about team management, building a company culture, hiring, and mentorship.
  • Getting Started With Microservices (Dave Farley) [Architecture, Architecture patterns, Continuous Delivery] In this episode, a microservices tutorial, Dave Farley describes the microservices basics that help you to do a better job. He describes three different levels that we need to think about when designing a service and offers his advice on how to focus on the right parts of the problem to allow you to create better, more independent, services, based on Dave’s software engineering approach.
  • Industry Keynote: The DevOps Transformation (Jez Humble) [Agile, Continuous Delivery, Devops, Engineering Culture, leadership] (⭐⭐⭐⭐⭐) In this talk Jez will describe how to implement devops principles and practices, how to overcome typical obstacles, and the outcomes DevOps enables. A must-see talk.
  • Lunch & Learn How to Misuse DORA DevOps Metrics (Bryan Finster) [Devops, Engineering Culture, leadership] Interesting presentation in which Bryan describes an agile/devops transformation, telling us about mistakes and successes. Interesting learnings, tips, and ideas.
Reminder: all these talks are interesting even if you just listen to them.
