Balancing Speed and Reliability With Error Budgets

Jun 29, 2021 10:09:58 AM

Fearlessly deploy code using error budgets. This framework will allow you to deploy and test code confidently. Learn more about it here.

- Wasn’t this code tested?

- Yes, it was, but… 

- Well, going forward, we need to focus on quality

<crickets> 

I think every engineering team has had a conversation similar to this.

It inevitably happens on the journey of continuous delivery. It's the definitive end of the DevOps honeymoon, where speed is put head-to-head with quality. Once this happens, you risk resetting progress and slowing down development. 

But in the spirit of fearless deployments, we want to introduce a healthy and popular framework for dealing with the tradeoff of speed and reliability - namely, error budgets. 

Before we get into error budgets, let's look at two pitfalls developers tend to fall into when they either slow down or speed up deployments. 

The Drawbacks of Pushing Deployments Without a Framework

Slowing Down Deployments To Focus On Quality

We know that most outages and customer-impacting failures come from change. 

It’s natural to see change as the root cause of failure. So, instinctively, you might attempt to minimize change as a way to manage this. 

And to some degree, that works. 

But if you slow down deployments every time you hit a bump, which is inevitable in the world of software, you'll end up in a vicious loop of accumulating change. At some point, you'll need to change your software to keep it running optimally, but by then the changes you have to make will be larger than before and more likely to cause failure. 

A better strategy: make smaller changes more frequently. Unfortunately, this will only take you so far if you don't have a framework for deciding when to make these changes.

Speeding up Deployments To Resolve Errors

Instead of slow deployments, you might end up having the opposite problem. In an effort to get rid of all of your application's known issues, you might try to push new code and improvements too quickly.  

Unfortunately, this approach doesn't work. Just as you resolve one error, another is bound to pop up, and probably more than one if you rush through the debugging process. 

As you can see, blindly moving forward and increasing deployments is no guarantee of eventual stability. In fact, speeding through software issues is a recipe for continuous failures, a rise in technical debt, and a loss of customer trust. 

A Better Way: Use Error Budgets To Determine When to Push Deployments 

What Are Error Budgets?

According to Atlassian

“An error budget is the maximum amount of time that a technical system can fail without contractual consequences. ... If your SLA promises 99.95% uptime, your error budget is four hours, 22 minutes, and 48 seconds. And with an SLA promise of 99.9% uptime, your error budget is eight hours, 46 minutes, and 12 seconds.”

In other words, with a 99.9% uptime target, your service should be available 99.9% of the time. What about the remaining 0.1%? That is your error budget and, like any budget, it's meant to be used. Use that 0.1% to take calculated risks that increase velocity and ensure future reliability.

Essentially, if you have budget left, you have no reason to fear deployment.

How Error Budgets Work

Let's say an application has a 99.9% availability requirement. The remaining 0.1% translates into the time during which the service can be unavailable:

  • Daily: 1m 26s
  • Weekly: 10m 4s
  • Monthly: 43m 49s
  • Quarterly: 2h 11m 29s
  • Yearly: 8h 45m 56s
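
The figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not a definitive calculator: it assumes a year of 365.2425 days, takes months and quarters as 1/12 and 1/4 of that year, and drops fractional seconds. The function names are hypothetical.

```python
# Assumption: a year is 365.2425 days (the average Gregorian year); months and
# quarters are taken as 1/12 and 1/4 of that year.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.2425

PERIOD_DAYS = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": DAYS_PER_YEAR / 12,
    "Quarterly": DAYS_PER_YEAR / 4,
    "Yearly": DAYS_PER_YEAR,
}


def allowed_downtime_seconds(availability_percent: float, period_days: float) -> float:
    """Seconds a service may be unavailable in a period without missing its target."""
    budget_fraction = 1 - availability_percent / 100
    return period_days * SECONDS_PER_DAY * budget_fraction


def format_duration(seconds: float) -> str:
    """Render seconds as e.g. '2h 11m 29s', dropping fractional seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{secs}s")
    return " ".join(parts)


if __name__ == "__main__":
    for period, days in PERIOD_DAYS.items():
        print(f"{period}: {format_duration(allowed_downtime_seconds(99.9, days))}")
```

Changing the first argument (for example to 99.95) shows how quickly the budget shrinks as the availability target rises. A different assumption about the length of a year or month will shift the results by a few seconds.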

If the team has budget left, they can: 

  • Push code on demand that relies only on automated tests, not manual verification, thereby going fast and testing the automation pipeline
  • Deploy an intentional failure to determine how the service fails and test how your team responds to this failure
  • Use what you learn from that failure to decide which precautions will safeguard the software against the same issue in the future. 

On the other hand, if you find that you’re consistently running out of your error budget, then:

  • Slow down and focus on hotfixes
  • Work on the reliability of your product
  • Do post mortems and work with other teams to address root causes of failures, which may be organizational, infrastructural, etc. 

Once the budget is back, it's important to get back on the horse again and resume fearless deployments.

In the next section, we’ll go over how to set up error budget metrics that are right for your application.  

How To Set Error Budgets

To set error budgets, you first need to define your service level agreements (SLAs) and then translate them into service level objectives (SLOs) and service level indicators (SLIs). 

Starting From an SLA

A service level agreement (SLA) is the promise your company makes to customers regarding the quality and availability of a service. SLAs also serve as guidelines on how to respond to failures. 

The drawback: SLAs are usually not very actionable on a technical level.  

That's where service level objectives (SLOs) come in. 

Defining SLOs

SLOs are technical and measurable metrics that a team has selected to ensure they reach their SLA.

The first step in creating an SLO is to determine what constitutes "availability" for the service. The "availability" of a service is not just about when a service is available, but what it takes to complete its tasks satisfactorily. 

For example, a service that receives and stores messages in a database might have SLOs that look something like this:

  • 99.9% of messages are received and stored with status code 200
  • 95% of messages are processed within 1000 ms. 

Together, these two SLOs ensure that most messages are received and stored in a timely manner. If the service conforms to these SLOs, the company is in good shape and can continue to chip away at delivering improvements and value.  
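
To make the two example SLOs concrete, here is a rough sketch of how they could be evaluated from a batch of handled messages. The `MessageResult` record and the function names are hypothetical, and the sketch assumes that latency is only measured for messages that were actually stored and that an empty batch counts as meeting the objective.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MessageResult:
    """One handled message (hypothetical record shape)."""
    status_code: int
    processing_ms: float


def stored_ratio(results: Sequence[MessageResult]) -> float:
    """Share of messages received and stored with status code 200."""
    if not results:
        return 1.0  # assumption: no traffic counts as meeting the objective
    stored = sum(1 for r in results if r.status_code == 200)
    return stored / len(results)


def fast_ratio(results: Sequence[MessageResult], threshold_ms: float = 1000.0) -> float:
    """Share of stored messages processed within the threshold."""
    # Assumption: only successfully stored messages count toward latency.
    stored = [r for r in results if r.status_code == 200]
    if not stored:
        return 1.0
    fast = sum(1 for r in stored if r.processing_ms <= threshold_ms)
    return fast / len(stored)


def meets_slos(results: Sequence[MessageResult]) -> bool:
    """True when both example objectives (99.9% stored, 95% within 1000 ms) hold."""
    return stored_ratio(results) >= 0.999 and fast_ratio(results) >= 0.95


sample = [
    MessageResult(200, 120.0),
    MessageResult(200, 1500.0),  # stored, but too slow
    MessageResult(500, 80.0),    # not stored
    MessageResult(200, 300.0),
]
print(stored_ratio(sample), fast_ratio(sample), meets_slos(sample))
```

In practice these ratios would come from your monitoring system rather than an in-memory list, but the calculation is the same.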

When creating SLOs, you need to develop metrics known as Service Level Indicators (SLIs). 

Measuring SLIs

Measure and monitor the current value of your SLOs using SLIs.

For example, if your SLO is that "99.9% of messages are received and stored with status 200" in a month, and you receive 1,000,000 messages, that means you can drop 1,000 messages. However, dropping more than 1,000 messages means that you are out of budget. 
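
The same arithmetic can be written down as a small sketch. The function names are hypothetical, and rounding the allowed number of failures to the nearest whole event is an assumption made to sidestep floating-point noise.

```python
def allowed_failures(total_events: int, slo_target: float) -> int:
    """Number of events that may fail before the SLO is missed."""
    # round() is an assumption: it avoids results such as 999.9999999 from
    # floating-point arithmetic being cut down to 999.
    return round(total_events * (1 - slo_target))


def budget_consumed(total_events: int, failed_events: int, slo_target: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed = allowed_failures(total_events, slo_target)
    if allowed == 0:
        return 0.0 if failed_events == 0 else float("inf")
    return failed_events / allowed


# 1,000,000 messages at a 99.9% target leave room for 1,000 dropped messages.
print(allowed_failures(1_000_000, 0.999))
# Dropping 500 messages uses half of the budget; dropping 1,500 overspends it.
print(budget_consumed(1_000_000, 500, 0.999))
print(budget_consumed(1_000_000, 1_500, 0.999))
```

Any value above 1.0 means the service is out of budget for the period.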

Here is an example of an SLO/SLI report: 

Microservice 1
January 3 - January 10

| SLO | SLI for Jan 3 - Jan 10 | Status |
| --- | --- | --- |
| 99.9% of messages are received and stored with HTTP status 200 | 98.1% (981,000 / 1,000,000 messages successfully received with HTTP status 200) | 1,900% of budget consumed |
| 95% of messages are processed within 1000 ms | 100% (981,000 / 981,000 messages processed within 1000 ms) | 0% of budget consumed |

Current status: Out of budget

Example of an SLO/SLI Report

By using performance monitoring solutions, it's possible to graph these metrics and set up alerts. 

Now that you know how to create metrics for an error budget, let's dive into how to use these budgets. 

How To Act Based on the Budget

What To Do if There's an Excess Of Budget at the End of the Month

You might think a large unused error budget at the end of the month is a good thing, but it isn't. Leftover budget means your team is not going fast enough. In cases like these, your team should increase deployments.    

What To Do if There's No Budget at the End of the Month

If you consume your entire error budget for the month (for example, you drop 5% of a month's messages), you need to discuss this with your team. Violations such as these are serious and could have future consequences. Next steps might include writing post mortems, informing your customers, and freezing features until your fixes are deployed and verified. 

If you find that you're consistently using all of your error budget, you should look into making systematic improvements to how you and your team work. You might even consider an SRE team. They have the know-how to make systematic improvements to build and deployment pipelines.  

What To Do if You’re Close to Your Budget at the End of the Month

Congratulations! You are balancing speed and availability! Keep up the good work. 
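
The three end-of-month cases can be summarized as a simple decision rule. This is only an illustrative sketch: the article gives no numeric cut-off for an "excess" of budget, so the 50% threshold below is an assumption you would tune for your own team, and the names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    SPEED_UP = "increase deployments and take more calculated risks"
    KEEP_GOING = "keep the current pace"
    SLOW_DOWN = "slow down, focus on hotfixes and post mortems"


def recommend_action(budget_consumed: float, underuse_threshold: float = 0.5) -> Action:
    """Map the fraction of error budget consumed to a next step.

    budget_consumed: 0.0 means nothing spent, 1.0 or more means no budget is left.
    underuse_threshold: assumed cut-off below which the budget counts as underused.
    """
    if budget_consumed >= 1.0:
        return Action.SLOW_DOWN
    if budget_consumed < underuse_threshold:
        return Action.SPEED_UP
    return Action.KEEP_GOING


for consumed in (0.1, 0.9, 1.9):
    print(f"{consumed:.0%} consumed -> {recommend_action(consumed).value}")
```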

Summary: 

Error budgets are an effective way to help teams understand when they should speed up deployments and take risks and when they should slow down. 

And they're great for customers too: customers get exactly what they are paying for, namely high availability and continuous improvements.  

If you want to learn more about error budgets and practices for balancing speed and reliability, check out these SRE books

Another great tool you can use along with error budgets: Airbrake Error Monitoring and Performance Monitoring. Our product gives you all the tools you need to find and fix bugs in your code quickly before they have a chance to impact your customers. Discover the power of Airbrake today with a free 14-day trial

Written By: Alexandra Lindenmuth