Making injection of failure scenarios a routine practice – Setting Up Teams for Success

For critical business applications with very high uptime requirements, organizations might implement disaster recovery (DR) processes and architectures to survive regional outages. Because such scenarios occur so rarely, DR procedures are hardly ever tested before a real event happens. AWS services such as AWS Fault Injection Simulator and AWS Resilience Hub allow users to intentionally introduce failure scenarios, such as the unavailability of an Availability Zone (AZ) – a collection of data centers in AWS – or sudden CPU usage spikes on a container. In addition to giving teams confidence in how gracefully their workloads survive these situations, these exercises also help uncover aspects that developers might not have thought of. DR procedures that haven’t been triggered in the last 2 years are as risky as not having any at all.
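As an illustration, the following is a minimal sketch of triggering a fault injection experiment with boto3, assuming an AWS FIS experiment template (for example, one that stops tagged EC2 instances) has already been created; the template ID shown is a placeholder:

```python
import time
import uuid

import boto3

# Assumption: an FIS experiment template already exists; replace this placeholder ID.
EXPERIMENT_TEMPLATE_ID = "EXTxxxxxxxxxxxxxxx"

fis = boto3.client("fis")

# Start the experiment; the client token makes the request idempotent.
experiment = fis.start_experiment(
    experimentTemplateId=EXPERIMENT_TEMPLATE_ID,
    clientToken=str(uuid.uuid4()),
)["experiment"]

# Poll until the experiment finishes, then report its final state.
while True:
    state = fis.get_experiment(id=experiment["id"])["experiment"]["state"]
    if state["status"] in ("completed", "stopped", "failed"):
        print(f"Experiment {experiment['id']} ended with status: {state['status']}")
        break
    time.sleep(15)
```

Running such an experiment on a schedule, rather than once, is what turns fault injection into a routine practice.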

Organizations with highly mature operational practices regularly operate their critical workloads from a standby site multiple times a year.

Aligning technology decisions with business expectations

As they say, working with AWS services is like being a kid in a candy shop. There’s so much around you that you want to have it all. It’s important to balance the consumption of these services with the real needs of your workload. Multi-region availability, for example, might not always be a must-have for applications. Containerizing an existing monolith might not always lead to happy outcomes. Aiming for an additional 0.01% of application uptime can lead to an increase in costs and architectural complexities that does not justify the minuscule increase in availability. Therefore, it’s necessary to weigh all decisions against the actual business needs and keep your architectures as lean as possible.
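To put that in perspective, here is a quick back-of-the-envelope calculation of the downtime allowed by common availability targets over a 30-day month:

```python
# Allowed downtime per 30-day month for common availability targets.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for availability in (0.999, 0.9999, 0.99999):
    allowed_downtime = MINUTES_PER_MONTH * (1 - availability)
    print(f"{availability:.3%} availability -> {allowed_downtime:.1f} minutes of downtime per month")
```

Going from 99.9% to 99.99% buys roughly 39 extra minutes of uptime per month, which is exactly the kind of marginal gain that should be weighed against the added cost and complexity.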

If you recall the metrics highlighted in the Accelerate State of DevOps report that we discussed previously, application reliability was called out as an important indicator of DevOps maturity, alongside software delivery performance.

Therefore, defining realistic thresholds and availability targets for your workloads is a very important input for an efficient cloud architecture that meets your needs. But how do you go about measuring the reliability of your systems and communicating it to your customers? Two important metrics here are service-level objectives (SLOs) and service-level indicators (SLIs). An SLO is the promise that a company makes to its users regarding the availability of a system. As a service provider, you could promise an SLO of 99.99% monthly availability. An SLI, on the other hand, is a key metric that indicates whether the SLO is being met. It is the actual measured value for the metric described in the SLO.

Based on the SLO committed by a service, the provider then creates a service-level agreement (SLA). The SLA states the action the company will take if it cannot meet the committed SLO. Typically, these actions translate into service credits or billing discounts.
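For example, an availability SLI is often computed as the ratio of successful requests to total requests and compared against the SLO. The following minimal sketch assumes the request counts are already available from your monitoring system; the numbers shown are made up:

```python
# Hypothetical request counts pulled from a monitoring system for the month.
total_requests = 1_250_000
failed_requests = 100  # e.g., HTTP 5xx responses

SLO = 0.9999  # 99.99% monthly availability promised to users

# The SLI is the actual measured value for the metric described in the SLO.
sli = (total_requests - failed_requests) / total_requests

# The error budget is the fraction of failures the SLO allows.
error_budget = 1 - SLO
budget_consumed = (failed_requests / total_requests) / error_budget

print(f"SLI: {sli:.5%}  (SLO target: {SLO:.2%})")
print(f"Error budget consumed this month: {budget_consumed:.0%}")
print("SLO met" if sli >= SLO else "SLO missed; SLA penalties may apply")
```

Tracking how much of the error budget has been consumed is a simple way to decide when to slow down releases and focus on reliability work.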

Did you know?

Google implements periodic planned downtime for some services to prevent a service from becoming overly available. They initially tried this with the frontend servers of an internal system. The downtime uncovered other services that were using these servers inappropriately. With that information, they were able to move those workloads somewhere more suitable while keeping the servers at the right availability level.

The adoption of new ways of working through DevOps methodologies, combined with innovative use of AWS services, often leads to new roles and roadmaps being identified within organizations. Out of personal interest, some existing employees may well feel motivated to transition their careers and develop breadth and depth in cloud skills. In the next section, I’ll introduce you to some learning resources that can set up your teams for success.
