Error Budgets in Practice: A Data-Driven Approach to Risk and Release Management


Jan 20, 2025 - 12:52

Why Error Budgets?

CoinGecko offers API services to our customers. We provide two types of API: the Public API and the Pro API. For the Pro API, we are bound by tight service-level agreements (SLAs) with our customers. These SLAs are important for maintaining customer satisfaction and trust in the platform.

We visualize risk metrics so that we can rank categories of risk by how severely they threaten our SLAs. Instead of simply picking an availability goal such as 99.9% or 99.95%, we look for tangible data that shows whether the goal is realistic.

In this article, we will discuss the process behind measuring and managing a reliable uptime SLA. How do we track, analyse and understand risks before settling on an SLA?

For ease of understanding, let’s first talk about SLOs and SLAs:

SLA
Service Level Agreements – agreements with our customers about reliability of our services.

SLO
Service Level Objectives – thresholds that catch an issue before it breaches our SLAs.

SLA and SLO, where it stands - Courtesy of Google

In other words, the SLO is stricter than the SLA, because we need to catch any issue before it reaches the customer. In terms of uptime, we internally allow a shorter duration of downtime, x, than the duration y that we commit to externally. As a formula: x < y.

The SLA for our Pro API is 99.9%. The SLO therefore needs a higher threshold, e.g., 99.95% or 99.99%.

How do we know how much headroom we have before we breach our SLO?

An uptime SLA of 99.9% is equivalent to 43.2 minutes of downtime in a 30-day month. A corresponding SLO of 99.95% is equivalent to 21.6 minutes of downtime.

The downtime that the SLO permits is known as the Error Budget. With these numbers it happens to equal the 21.6-minute gap between the SLO and the SLA. The Error Budget is what allows us to do maintenance, deployments and improvements to our application. Engineers only have 21.6 minutes a month to maneuver when they face problems that cause downtime.

The Error Budget is the complement of the SLO. If our SLO is 99.9% availability, our Error Budget is the remaining amount of time (0.1% unavailability).
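
The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration that assumes a 30-day month (43,200 minutes), which is what the 43.2 and 21.6 minute figures imply. The function name `allowed_downtime_minutes` is hypothetical and not part of any tool mentioned in this article.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Downtime (in minutes) permitted by an availability target over a window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = window_days * MINUTES_PER_DAY
    return total_minutes * (100 - availability_pct) / 100


sla_minutes = allowed_downtime_minutes(99.9)    # external promise (SLA)
slo_minutes = allowed_downtime_minutes(99.95)   # internal target (SLO), i.e. the error budget
print(f"SLA 99.9%  -> {sla_minutes:.1f} min per 30 days")
print(f"SLO 99.95% -> {slo_minutes:.1f} min per 30 days")
```

Changing `window_days` shows how sensitive the budget is to the measurement window: a 28-day window yields a slightly smaller budget than a 30-day one for the same target.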

Availability Table - Courtesy of Google

From the table above, we now understand the unavailability, i.e., the Error Budget, in terms of the time we can afford.

Let’s take a look at the two diagrams below to understand the burn rate of our Error Budget from Day 1 to Day 28. Both diagrams start the month with our full Error Budget of 21.6 minutes (expressed as 100%).

The first diagram shows a positive Error Budget by the 28th day of the month.

Monthly error budget nearing the budget - Courtesy of Google

Meanwhile, the following diagram shows a breached Error Budget with negative percentage remaining.

Monthly error budget breaching the budget - Courtesy of Google

The diagrams above express the burn rate as a percentage, so they read the same way regardless of how many minutes of Error Budget we have chosen.

Error budget burn rate can be monitored throughout the month to revise the frequency, priority and type of deployments scheduled.
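
As a rough sketch of how such monitoring could be computed, the snippet below derives the remaining budget percentage and a simple burn rate. It assumes the 28-day window shown in the diagrams and defines burn rate as consumption relative to an even spend across the window; both function names are hypothetical, and the 5.4 bad minutes are an illustrative input, not a measured value.

```python
def budget_remaining_pct(budget_minutes: float, bad_minutes_so_far: float) -> float:
    """Percentage of the error budget left; a negative value means a breach."""
    return (budget_minutes - bad_minutes_so_far) / budget_minutes * 100


def burn_rate(budget_minutes: float, bad_minutes_so_far: float,
              day: int, window_days: int = 28) -> float:
    """Budget consumption relative to an even spend over the window.

    1.0 means the budget would run out exactly at the end of the window;
    anything above 1.0 means it would run out early.
    """
    if day <= 0:
        raise ValueError("day must be at least 1")
    consumed_fraction = bad_minutes_so_far / budget_minutes
    elapsed_fraction = day / window_days
    return consumed_fraction / elapsed_fraction


budget = 21.6  # minutes, the example SLO budget used in this article
print(f"remaining: {budget_remaining_pct(budget, 5.4):.1f}%")
print(f"burn rate on day 14: {burn_rate(budget, 5.4, day=14):.2f}")
```

A burn rate that stays below 1.0 suggests there is room for riskier deployments; a value well above 1.0 is a signal to slow down before the budget is gone.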

Analyzing past incidents to categorize our failure points

No application or system is perfect, especially in its early stages. The key is to learn from failures by recording, documenting, and categorizing each incident for future reference. By investigating these incidents, we gain a deeper understanding of how to prioritize and address them, which helps us craft realistic SLOs. Analyzing past incidents and anticipating future ones allows us to take proactive measures to prevent SLA breaches and ensure system reliability.

First, we have to understand our failure points. We categorize each incident that has occurred or may occur in the future; each category is what we call a risk. This gives us a high-level view of which categories cause us the most trouble.

Where do we obtain information about risks? From historical data, industry best practices, brainstorming, and so on.

As an example, these are some of the categories we identified that can cause downtime to our application.

  • Disaster recovery drill
  • Updating major code version
  • Code deployment misconfiguration
  • Unoptimized database queries
  • Software defects in the code
  • Breakdown of caching service
  • Outage in an Availability Zone
  • Unintended data loss or corruption
  • Malicious security breach/attack
  • High volume of traffic
  • Breakdown in the message queue system
  • Disk failure
  • Third-party dependency failure

Next, for each of the incidents, we estimate:

  • ETTD - Estimated Time To Detection – how long it would take to detect and notify a human (or robot) that the incident has occurred; aka MTTD (Mean Time to Detect).
  • ETTR - Estimated Time To Resolution – how long it would take to fix the incident once the human (or robot) has been notified; aka MTTR (Mean Time to Repair).
  • ETTF - Estimated Time To Failure – estimated time between instances of this incident; aka MTBF (Mean Time Between Failures).
  • % of Users Affected – percentage of users affected by the failure.

Above terms visualized - Courtesy of Google

These estimates tell us how often an incident occurs, how long it lasts, and how quickly we respond to it.

We wanted to understand how much downtime (bad minutes) per year is caused by a single category. To find out, we open a spreadsheet, fill in all of our data, and calculate the risk level for each category. This is what we call the Risk Catalog.

Example of a Risk Catalog

We enter our list of risks in the blue cells together with the ETTD, ETTR, percentage of users affected and ETTF. Based on these inputs, the spreadsheet formulas in the grey cells compute the number of incidents per year and the bad minutes per year.
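
The spreadsheet formulas themselves are not reproduced in this article, so the sketch below is an assumption about how the grey cells could be derived: incidents per year as 365.25 divided by the ETTF in days, and bad minutes per year as (ETTD + ETTR) multiplied by the share of users affected and by incidents per year. The `Risk` class and the sample figures are hypothetical and purely illustrative.

```python
from dataclasses import dataclass

DAYS_PER_YEAR = 365.25


@dataclass
class Risk:
    name: str
    ettd_minutes: float        # estimated time to detection
    ettr_minutes: float        # estimated time to resolution
    users_affected_pct: float  # share of users hit by the incident
    ettf_days: float           # estimated time between occurrences

    @property
    def incidents_per_year(self) -> float:
        return DAYS_PER_YEAR / self.ettf_days

    @property
    def bad_minutes_per_year(self) -> float:
        # Assumed formula: outage length x share of users hit x yearly frequency.
        outage_minutes = self.ettd_minutes + self.ettr_minutes
        return outage_minutes * (self.users_affected_pct / 100) * self.incidents_per_year


# Illustrative inputs only; these are not figures from the article's catalog.
catalog = [
    Risk("Unoptimized database queries", 15, 60, 20, 60),
    Risk("Code deployment misconfiguration", 5, 30, 100, 90),
]

# Sort by bad minutes per year, largest first, as a stack rank would.
for risk in sorted(catalog, key=lambda r: r.bad_minutes_per_year, reverse=True):
    print(f"{risk.name}: {risk.incidents_per_year:.2f} incidents/yr, "
          f"{risk.bad_minutes_per_year:.1f} bad min/yr")
```

Keeping the inputs as estimates per risk makes it easy to see which lever (faster detection, faster resolution, smaller blast radius, or lower frequency) would reduce the bad minutes most.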

Computed Stack Rank of Risks

We took the information above and rearranged the risks by severity in a new spreadsheet called the Risk Stack Rank. It gives us data-driven context on where we stand today relative to our currently defined SLO.

Let’s have a look at the computed stack rank of risks below:
Computed stack rank of risks

In the sheet above, we can see that our risks are populated and arranged by bad minutes per year. The risk with the most bad minutes per year is considered the highest risk.

The Risk Stack Rank has several components:

Target Availability
The desired availability in percentage.

Budget (m/yr)
The total error budget available, measured in minutes per year (m/yr), which represents the maximum allowable downtime while still meeting the target availability.

Accepted (m/yr)
The amount of downtime already allocated for various known risks in minutes per year.

Unallocated Budget (m/yr)
The portion of the error budget that remains uncommitted after accounting for known and accepted risks.

Threshold of unacceptability for an individual risk (% of error budget)
A limit that defines how much of the total error budget a single risk can consume.

Too Big Threshold (m/yr) – for a single risk
The absolute upper limit for the amount of downtime a single risk can be responsible for. If the expected impact of a risk exceeds this threshold, the risk is deemed "too big" and must be mitigated, as it could jeopardize the ability to meet the SLO.
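
The relationship between these components can be sketched as follows. This is a minimal illustration that assumes a year of 365.25 days (525,960 minutes); the function name `stack_rank_summary`, the dictionary keys, and the sample inputs are hypothetical.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def stack_rank_summary(target_availability_pct: float,
                       accepted_bad_minutes: list[float],
                       single_risk_threshold_pct: float) -> dict[str, float]:
    """Derive the budget figures from the target, accepted risks and threshold."""
    budget = MINUTES_PER_YEAR * (100 - target_availability_pct) / 100
    accepted = sum(accepted_bad_minutes)
    return {
        "budget_m_per_yr": budget,
        "accepted_m_per_yr": accepted,
        "unallocated_m_per_yr": budget - accepted,  # negative means over budget
        "too_big_threshold_m_per_yr": budget * single_risk_threshold_pct / 100,
    }


# Illustrative inputs: a 99.9% target, two accepted risks, a 25% single-risk limit.
summary = stack_rank_summary(99.9, [120.0, 80.5], 25)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Raising the target availability shrinks every derived figure at once, which is why a risk that is acceptable at 99.9% can become "too big" at 99.95%.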

The cell colors have the following meanings:
Cell colors definition

Red – this risk is unacceptable, as it falls above the acceptable error budget for a single risk.
Amber – this risk should not be acceptable, as it’s a major consumer of our error budget and therefore, needs to be addressed.
Green – this is an acceptable risk. It's not a major consumer of our error budget, and in aggregate, does not cause our application to exceed the error budget.
Blue – this risk has been accepted to fit within our error budget. Accepting a risk means planning not to fix it and taking the outage and corresponding hit on the error budget.
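
To make the color logic concrete, here is a hedged sketch of a classifier. The red and blue rules follow the definitions above. The exact spreadsheet rule for amber is not given in this article, so the sketch assumes that a "major consumer" is a risk that exceeds the single-risk threshold percentage applied to the remaining unallocated budget. The function name `classify_risk` and the sample numbers are hypothetical.

```python
def classify_risk(bad_minutes: float, accepted: bool, budget: float,
                  unallocated: float, threshold_pct: float) -> str:
    """Return the cell color for one risk (all durations in minutes per year)."""
    if accepted:
        return "blue"
    if bad_minutes > budget * threshold_pct / 100:
        return "red"  # above the too-big threshold for a single risk
    # Assumption: amber applies the same percentage to the *remaining* budget.
    if bad_minutes > unallocated * threshold_pct / 100:
        return "amber"
    return "green"


budget = 525.96  # yearly budget for a 99.9% target
# With the full budget still unallocated, a 60-minute risk is green.
print(classify_risk(60, False, budget, unallocated=budget, threshold_pct=25))
# After 400 minutes of other risks are accepted, the same risk turns amber.
print(classify_risk(60, False, budget, unallocated=budget - 400, threshold_pct=25))
# A 200-minute risk exceeds 25% of the budget on its own, so it is red.
print(classify_risk(200, False, budget, unallocated=budget, threshold_pct=25))
```

Under this assumed rule, the color of an unaccepted risk depends on what has already been accepted, so accepting one risk can change the color of the others.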

Understanding Risk Stack Rank in Practice

Recall the risks that we entered in the Risk Catalog together with their metrics. The Risk Stack Rank takes those risks and ranks them by bad minutes per year.

In this subsection, assume that we want a three-nines availability target (99.9%). With that target, two risks are red-shaded (unacceptable) and the others are green-shaded (acceptable).

Image description

Let's see some scenarios below to see it in action.

Accepting a Risk that is in Red or Amber-shaded

Say that our threshold of unacceptability for an individual risk is 25% of the error budget.

Image description

We can see from above that accepting “Third-party dependencies failure” causes green-shaded risks to turn into amber-shaded risks (should not be accepted). This happens because the accepted risks already consume a large part of the error budget, so the remaining risks now pose a danger to what is left.

Say we accept more of the risks that will consume our error budget.

Image description

The diagram above shows that more risks are amber-shaded. This means that we have to act on those risks to bring down their bad minutes per year. We’ll discuss this in the Improving our Risk Stack Rank section.

Ideal Situation

We can start by accepting the green risks and see how they consume our error budget in this sheet.

Image description

From the figure above we can see that we have accepted 519.62 out of 525.96 minutes from our error budget.

Unaccepted Risks

When we accept a risk (marked y), we agree to take it on without any mitigation actions. Accepted risks are the ones we expect to burn our error budget.

But what about unaccepted risks that are red-shaded or amber-shaded? What do we do with them?

If we do accept them, the sheet will show that we have breached our error budget.

Image description

We can see that our Unallocated Budget section has reached a negative value.

These red-shaded and amber-shaded risks are the ones that require mitigation actions. They have to be acted upon so that they do not endanger our Error Budget.

How to implement Error Budgets in practice

Now that we have a Service Level Objective (SLO) and an Error Budget, let's enforce them! It is important that everyone in the organization is aware of this policy, especially the engineering and product teams.

The policy serves as a baseline for deciding whether we can release new features or deploy a hotfix when the Error Budget is nearing its limit or has been breached.

To simplify things in this article, we are going to present 3 tiers of severity (Tier 1, Tier 2 and Tier 3), each with its own Call to Action (CTA).

How do we know which is which? Again, quantifying this is crucial in understanding the criticality of an issue.

Tier 1
Description: There is a depletion of Error Budget within X days (e.g. 14 days) and the Error Budget percentage is still within acceptable status.
CTA: Acknowledgement is required and the SRE team will notify the application team.

Tier 2
Description: The Error Budget has depleted until Y% (e.g. 50%) within 28 days and the Error Budget percentage is in warning status.
CTA: Halt all releases other than P0 issues or security fixes until the SLO recovers; set up a dedicated team to investigate, and have the SRE team highlight the issue to the application team.

Tier 3
Description: There is a major depletion of Error Budget within X days (e.g. 2 days) OR the Error Budget has reached Z% (e.g. 80%) or lower.
CTA: All-hands-on-deck to focus on resolving the service outage. Inform top management. Use a "silver bullet" (see next section on Silver Bullets) carefully upon multiple approvals at this stage. Prepare a PR statement if needed.

By categorizing incidents into these tiers, the team can respond proportionally to the severity of an issue. This ensures resources are allocated effectively while maintaining adherence to SLOs.
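
The tiers above can be turned into a small decision function. The sketch below uses the example values from the tier descriptions (14 days, 50%, 2 days, 80%) and makes two interpretive assumptions that the article does not confirm: "depletion within X days" is read as "at the current burn rate, the budget would run out within X days", and the Y% and Z% figures are read as the percentage of the budget already consumed. The function name `error_budget_tier` is hypothetical.

```python
from typing import Optional


def error_budget_tier(consumed_pct: float,
                      days_to_depletion: Optional[float],
                      tier1_days: float = 14, tier2_pct: float = 50,
                      tier3_days: float = 2, tier3_pct: float = 80) -> Optional[int]:
    """Return 3, 2 or 1 for the matching tier, or None when no action is needed.

    days_to_depletion is None when the budget is not projected to run out.
    """
    def depletes_within(days: float) -> bool:
        return days_to_depletion is not None and days_to_depletion <= days

    # Check the most severe tier first so that it takes precedence.
    if consumed_pct >= tier3_pct or depletes_within(tier3_days):
        return 3
    if consumed_pct >= tier2_pct:
        return 2
    if depletes_within(tier1_days):
        return 1
    return None


print(error_budget_tier(consumed_pct=10, days_to_depletion=None))  # no tier
print(error_budget_tier(consumed_pct=10, days_to_depletion=10))    # Tier 1
print(error_budget_tier(consumed_pct=55, days_to_depletion=20))    # Tier 2
print(error_budget_tier(consumed_pct=30, days_to_depletion=1.5))   # Tier 3
```

Evaluating the tiers from most to least severe keeps the result unambiguous when more than one condition holds at the same time.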

SLO Recovery

When SLOs reach warning levels, two immediate actions are crucial – first, a dedicated task force comprising application developers and SRE must be assembled to address the situation. Second, all ongoing releases must be temporarily suspended during the investigation phase.

An improved release planning strategy is fundamental to SLO improvement, particularly in environments practicing Continuous Deployment. The foundation of this strategy involves categorizing deployments based on their risk levels.

The classification of deployments into risk categories requires careful consideration of various factors. While the specific criteria for risk assessment may vary by organization, they typically consider factors such as:

  • The scope of changes
  • Potential impact on critical user paths
  • Architectural modifications
  • Integration points with external systems

For deployments classified as high-risk, several key practices should be implemented. The implementation of a daily stagger system ensures that high-risk deployments are spread across different days, allowing for precise identification and swift rollback of problematic changes if necessary. All high-risk deployments must be:

  • Documented in a shared engineering calendar
  • Clearly communicated to all team members
  • Monitored by relevant stakeholders during and after deployment
  • Scheduled with consideration for key personnel availability

An often overlooked but crucial aspect of release management is ensuring the availability of critical stakeholders during deployment windows. This includes considering team members' leave schedules when planning significant releases.

By implementing these controls, teams can effectively shield their remaining error budget from new incidents, allowing their SLOs to gradually recover as the measurement window advances and maintaining system stability during the recovery period.

Conclusion

We assess risks by understanding how much downtime or error the system can tolerate while still meeting SLOs. We use error budgets to track acceptable failure, balance reliability with deployments, and prioritize risks.

Rather than stating without evidence that we aim for 99.9%, 99.95%, or any other availability goal, we now have concrete data that shows whether or not our goal is feasible.

By analyzing historical data, potential failures, and user impact, we can determine if an Error Budget is realistic and adjust accordingly to ensure system stability without holding back progress. This approach ensures that every decision—whether to deploy a new feature or address a critical issue—is backed by measurable insights and aligned with the organization’s goals.
