I've always worked at product companies, creating or scaling teams, and in those companies we work remotely, at least partially. In my experience, introducing Agile culture to a technical team means introducing DevOps and Agile software development. However, I see Agile culture as more than just tools and processes; it is a culture of collaboration, continuous improvement, continuous learning, technical excellence, and transparency. Let's explore how we can manage incidents in a way that aligns with this Agile culture.
Production Incidents
We use "production incident" to mean anything that affects our clients or that we suspect might affect them. Examples include machine failures, unexpected metrics, or a client reporting an error.
High-Performing Teams
Let's take a quick look at what makes high-performing teams so effective. Google's research on high-performing teams shows that individual talent is not the most important factor. Instead, the key to high performance is the quality of the interactions within a team. The most important factor for high-performing teams is psychological safety. Team members need to feel safe taking risks without fear of ridicule or failure. This psychological safety is essential for fostering a culture of learning and improvement.
Traditional Incident Management vs. Agile Incident Management and Psychological Safety
Unfortunately, the traditional approach to incident management often lacks this crucial element of psychological safety. Incident management often falls solely on the operations team, creating a siloed and stressful environment. Under pressure, teams may resort to blame and scapegoating, leading to a culture of fear and hiding problems. This approach is not conducive to learning and improvement and can ultimately lead to recurring issues. Instead of resorting to blame, we can adopt an Agile approach to incident management, focusing on collaboration, learning, and continuous improvement. This approach reduces fear, avoids a hero culture, and encourages transparency.
In my experience at TheMotion, Nextail, and ClarityAI, introducing blameless incident management practices has served as a lever to shift the team's culture towards one of continuous learning. It has helped us overcome the fear of making problems visible, fostered collaboration, and empowered us to address issues at their root causes. This resonates with one of the core principles of Agile incident management: "Hard on systems. Soft on people." We prioritize understanding how the system contributed to the error, rather than pointing fingers at individuals. This creates a safer space for open communication and learning.
The impact of this cultural shift at TheMotion was so significant that team members who moved to other companies have begun implementing these ideas in their new teams.
Here's how our process works:
- Stay Calm and Don't Panic: When an incident occurs, the first priority is to stay calm, and our process is designed to make that easier. We also look for this trait when hiring: when we interview developer candidates, we ask them about a time when they made a mistake in production, and we evaluate not only their technical skills but also their ability to remain calm under pressure. This helps ensure that our team can handle incidents effectively without succumbing to fear or stress.
- Assign an Incident Commander: We automatically assign an incident commander to take charge of the situation. The incident commander's responsibilities include:
  - Creating a "War Room" for collaboration.
  - Creating a blameless incident report to document the incident and focus on learning.
  - Notifying the appropriate stakeholders about the incident.
  - Recruiting and coordinating a team to resolve the incident.
- Focus on Service Recovery: The team's primary goal is to recover the service as quickly as possible. This might involve implementing a temporary fix, disabling a functionality, or communicating with clients. The key is to stabilize the system and minimize the impact on users.
- Investigate the Root Cause: Once the service is restored, the team investigates the root cause of the incident. This investigation follows a process of:
  - Hypothesis generation.
  - Validation.
  - Documentation.
  - Repetition.
- Define Corrective and Preventive Actions: Based on the investigation, the team defines corrective and preventive actions that reduce the mean time to recovery (MTTR) and the blast radius of future incidents. These actions aim to improve the system's resilience and prevent similar incidents from happening again. We give high priority to tasks that improve the system and resolve recurring issues, both to maintain team motivation and to demonstrate that system improvement is a top priority for the entire company.
- Integrate Actions into the Workflow: The corrective and preventive actions are then prioritized and integrated into our normal workflow, ensuring they are addressed promptly.
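Since several of these actions aim to reduce MTTR, it helps to measure it. Here is a minimal sketch in Python, assuming each incident is recorded as a pair of detection and recovery timestamps. The function name and the sample data are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time between detection and recovery.

    `incidents` is a list of (detected_at, recovered_at) datetime pairs.
    This record shape is an assumption made for the example.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical sample data: one 30-minute and one 90-minute incident.
incidents = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 30)),
    (datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 15, 30)),
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:00:00
```

Tracking this number over time is one way to check whether the corrective and preventive actions are having the intended effect.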
The Importance of Blameless Incident Reports
Throughout the entire process, we maintain a blameless approach, emphasizing learning and improvement over assigning blame. We use blameless incident reports, which are:
- Collaborative: The incident commander creates a shared Google Doc where everyone involved can contribute in real-time.
- Transparent: We make incident reports visible to the entire company as soon as we detect an issue. This transparency fosters trust and allows anyone to stay informed about the incident's progress.
- Detailed: Our incident report template includes a summary of the incident, a timeline, root causes, corrective and preventive actions, and lessons learned.
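To make the template concrete, here is a rough sketch of the same sections as a Python data structure. In practice we use a shared document, so the class, field, and method names below are hypothetical and only mirror the sections listed above. Note that the report has no "responsible person" field: it records what happened in the system, not who to blame.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime
    note: str


@dataclass
class IncidentReport:
    # Hypothetical structure mirroring the template sections.
    title: str
    summary: str = ""
    timeline: list = field(default_factory=list)
    root_causes: list = field(default_factory=list)
    corrective_actions: list = field(default_factory=list)
    preventive_actions: list = field(default_factory=list)
    lessons_learned: list = field(default_factory=list)

    def log(self, at, note):
        """Append an event to the timeline while the incident is ongoing."""
        self.timeline.append(TimelineEntry(at, note))

    def missing_sections(self):
        """Names of sections that are still empty, useful before closing the report."""
        sections = {
            "summary": self.summary,
            "timeline": self.timeline,
            "root_causes": self.root_causes,
            "corrective_actions": self.corrective_actions,
            "preventive_actions": self.preventive_actions,
            "lessons_learned": self.lessons_learned,
        }
        return [name for name, value in sections.items() if not value]


# Hypothetical usage: open a report as soon as an issue is detected.
report = IncidentReport(title="Checkout errors reported by a client")
report.log(datetime(2024, 3, 5, 10, 15), "Client reports errors at checkout")
report.summary = "A subset of checkout requests failed for about 20 minutes."
print(report.missing_sections())
```

A check like `missing_sections()` is one simple way to make sure lessons learned and follow-up actions are written down before the report is considered finished.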
Facilitating Change
To successfully introduce this approach, it's essential to:
- Focus on Systems and Habits: Instead of blaming individuals, we concentrate on improving our systems, processes, and habits to prevent future incidents.
- Lead by Example: By actively participating in the process and demonstrating a blameless approach, we can encourage others to adopt this mindset.
- Show Vulnerability: Leaders should be willing to admit their mistakes (I have a few 😅) and share their experiences, creating a safe space for others to do the same.
- Prioritize Improvement: It's crucial to ensure corrective and preventive actions are prioritized and not overshadowed by other business priorities.
- Reinforce Learnings: We should highlight key learnings from incident reports and share them with the team to promote continuous learning.
Benefits of Agile Incident Management
Embracing Agile incident management can lead to numerous benefits, including:
- Increased Trust: Transparency and collaboration build trust among team members and between the team and the rest of the company.
- Enhanced Psychological Safety: A blameless approach creates a psychologically safe environment where people feel comfortable taking risks and learning from mistakes.
- Improved Resilience: By systematically addressing incidents, we can continually improve our systems and make them more resilient.
- Focus on Continuous Improvement: Incident management becomes an integral part of our continuous improvement process, leading to a more robust and reliable system.
- Greater Transparency: Open communication about incidents and their resolution fosters a culture of transparency and accountability.
- Enhanced Professionalism: Our commitment to learning and improvement demonstrates professionalism to our clients and stakeholders.
Conclusion
By adopting an Agile approach to incident management, we can transform our team's culture and create a more resilient and reliable system. By focusing on collaboration, learning, and continuous improvement, we can turn incidents into valuable opportunities for growth and development. Remember, incidents are inevitable, but how we respond to them is what truly matters. Let's embrace a culture of learning and create a system that can withstand the inevitable challenges of production.