Many organizations divide their teams into separate functional areas, such as development, operations, and QA. This model denies you the benefits that shared responsibility, ownership, and team cohesion can bring on your DevOps journey. The whole idea of the DevOps methodology is to break down these silos so that teams talk more often, stay aligned on what the end customer needs, and collaborate on achieving that goal. The wall between these teams creates a resistance to change that is detrimental to releasing software reliably and frequently. Developers with a system administration background should actively involve themselves in improving how the software is operated. Operators, on the other hand, may be well versed in application programming languages, and they can use those skills to help developers roll out changes reliably at scale.
However, introducing this change in organizations that have long used siloed models may meet some resistance. A strategy that I’ve seen work quite well is to temporarily embed DevOps practitioners in such teams to act as agents of change. By introducing new practices around development, operations, and communication, backed by automation and tooling, they can motivate team members to think beyond their preset boundaries and focus on driving customer value.
Blameless post-mortems and RCAs
Failure is the only constant. Software fails all the time, and the best we can do is architect for reliability and learn from those failures so that a recurrence can be avoided. As Devin Carraway, Member of Technical Staff at Google, aptly put it:
“The cost of failure is education”
A good mechanism for learning from these failures is the Root Cause Analysis (RCA) session. Its main goal is to document the incident, identify all causes, and record the preventive actions. Most importantly, all of this needs to happen in a blameless way, without singling out any individual or team. It should be seen as an opportunity to learn what could be done better next time. The tone of the documentation should also reflect the assumption that everyone acted in the best interest of keeping the service available.
However, not all incidents require a post-mortem. The criteria that trigger a deep investigation should be defined upfront by the stakeholders, based on what is critical to the business. Examples include a resolution time beyond X hours, data loss, or a monitoring failure.
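One way to make such criteria explicit is to encode them as a small, reviewable policy. The following is only a minimal sketch in Python: the `Incident` and `PostmortemPolicy` types, their field names, and the `postmortem_reasons` function are hypothetical, and the resolution-time threshold is a placeholder that your stakeholders would need to agree on.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    resolution_time: timedelta
    data_lost: bool
    detected_by_monitoring: bool


@dataclass(frozen=True)
class PostmortemPolicy:
    """Criteria agreed upfront by stakeholders (the value is a placeholder)."""
    max_resolution_time: timedelta


def postmortem_reasons(incident: Incident, policy: PostmortemPolicy) -> list[str]:
    """Return the criteria this incident meets; an empty list means no post-mortem."""
    reasons = []
    if incident.resolution_time > policy.max_resolution_time:
        reasons.append("resolution time exceeded threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if not incident.detected_by_monitoring:
        # The incident was noticed by some other means, e.g. a customer report.
        reasons.append("monitoring failure")
    return reasons
```

Returning the list of matched criteria, rather than a bare yes or no, keeps the focus on what happened to the service rather than on who was involved, which fits the blameless tone described above.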
Having discussed some people initiatives that could set up your teams for success, let’s delve into some technical bits.