In every engineering organization, there’s a constant, fundamental tension: the push to ship new features versus the need to maintain a stable, reliable service. Move too fast, and you risk outages that erode user trust. Move too slowly, and you risk being outpaced by the competition. For years, this balancing act was managed by intuition, late-night heroics, and tense priority meetings. Site Reliability Engineering (SRE) offers a better way: the error budget.
An error budget is not about “budgeting downtime.” It’s a data-driven framework that reframes reliability as a product feature. It turns the vague goal of “making the system stable” into a quantifiable resource that both product and engineering teams can manage. By designing a robust error budget policy, you create a pact that allows for maximum innovation velocity while guaranteeing a specific reliability target for your users.
The Language of Reliability: SLIs, SLOs, and Error Budgets
Before you can budget for errors, you need a common language to talk about reliability. This language is built on three core concepts.
- Service Level Indicator (SLI): An SLI is a quantitative measure of some aspect of your service. It must be something you can actually count. The best SLIs are structured as a ratio of good events to total valid events.
  - Availability SLI: (Number of successful requests) / (Total valid requests)
  - Latency SLI: (Number of requests faster than a threshold) / (Total valid requests)
- Service Level Objective (SLO): An SLO is the target value for your SLI over a specific period. This is the goal you are committing to. It is a precise statement about the desired level of reliability.
  - Example SLO: “99.9% of homepage requests over a rolling 28-day window will be successful.”
- Error Budget: The error budget is simply 100% - SLO. It represents the acceptable level of unreliability. It is not a goal to be spent, but a limit on how much unreliability your users will tolerate before you must take corrective action.
  - Example Budget: For a 99.9% SLO, your error budget is 0.1%. If you receive 10 million requests in 28 days, your budget is 10,000 errors (see the sketch below).
This framework moves the conversation away from an impossible goal of 100% uptime and toward a realistic, user-centric availability target.
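To make the arithmetic concrete, here is a minimal Python sketch, assuming you can already query raw event counts from your monitoring system. The traffic numbers and variable names are illustrative, not tied to any particular tool.

```python
# Minimal sketch of the SLI / SLO / error budget arithmetic described above.
# All inputs are illustrative; in practice they come from your monitoring system.

total_requests = 10_000_000        # total valid requests in the 28-day window
successful_requests = 9_994_000    # requests that returned a success response
fast_requests = 9_950_000          # requests faster than the latency threshold

# SLIs: ratio of good events to total valid events
availability_sli = successful_requests / total_requests   # 0.9994 -> 99.94%
latency_sli = fast_requests / total_requests              # 0.9950 -> 99.50%

# SLO: the target you commit to for the window
availability_slo = 0.999           # 99.9% over a rolling 28 days

# Error budget: 100% - SLO, expressed as a fraction and as raw events
error_budget_fraction = 1.0 - availability_slo                  # 0.001 (0.1%)
error_budget_events = error_budget_fraction * total_requests    # 10,000 allowed errors

budget_spent = total_requests - successful_requests             # 6,000 errors so far
budget_remaining = error_budget_events - budget_spent           # 4,000 errors left

print(f"Availability SLI: {availability_sli:.4%}, "
      f"budget remaining: {budget_remaining:,.0f} errors")
```

If budget_remaining goes negative, the SLO for the window has been missed, which is exactly the situation the error budget policy described later is designed to handle.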
Calculating Your First Error Budget: A Step-by-Step Guide
Your first error budget doesn’t have to be perfect; it just has to be a starting point. The process is iterative.
Step 1: Choose Your SLIs (What to Measure)
Start with what your users care about most. Don’t measure everything; pick a handful of key indicators that represent the core user experience.
- For user-facing systems: Availability and latency are the universal starting points. Can users access the service? Is it fast enough?
- For data pipelines: Freshness and correctness are key. Is the data up-to-date? Is it accurate?
- For storage systems: Durability and availability are paramount. Is my data safe? Can I access it when I need to?
When defining these, think in terms of distributions, not averages. An average latency can hide a terrible user experience for a small but important subset of your users. Use percentiles (e.g., 90th, 95th, 99th) to understand the experience of the typical user and the long tail.
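To see why averages mislead, here is a small sketch comparing the mean against high percentiles on a made-up latency distribution; the sample data and the nearest-rank percentile helper are illustrative only.

```python
# Sketch: why averages hide tail latency. The sample data is illustrative only:
# 98% of requests complete in ~50 ms, 2% are stuck behind a slow dependency.
from statistics import mean

latencies_ms = [50.0] * 9_800 + [2_000.0] * 200

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value below which roughly pct% of samples fall."""
    ordered = sorted(values)
    rank = min(len(ordered) - 1, max(0, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

print(f"mean: {mean(latencies_ms):.0f} ms")            # ~89 ms -- looks healthy
print(f"p50:  {percentile(latencies_ms, 50):.0f} ms")  # 50 ms
print(f"p90:  {percentile(latencies_ms, 90):.0f} ms")  # 50 ms
print(f"p99:  {percentile(latencies_ms, 99):.0f} ms")  # 2000 ms -- 2% of users wait 2 seconds
```

The mean and even the p90 look comfortably healthy here, while the p99 exposes the small but real group of users waiting two seconds.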
Step 2: Set Your SLOs (What to Target)
Choosing the right target is a negotiation between product and engineering, constrained by what’s technically feasible. However, there are some guiding principles.
- 100% is the Wrong Target: The single biggest source of outages is change. A 100% SLO would mean you can never deploy new features, apply security patches, or scale your infrastructure. Furthermore, even if your service is 100% available, dependencies in the user’s path (their ISP, local network, etc.) mean they will never experience 100% reliability.
- Don’t Just Use Current Performance: While it’s tempting to look at your historical performance (e.g., “we were 99.95% available last month”) and set that as the SLO, this can be a trap. You might lock yourself into a target that requires heroic effort to maintain. Use historical data as a starting point for discussion, not the final answer (see the sketch after this list).
- Keep it Simple: An SLO should be easy to explain and understand. “99.9% of successful logins in under 500ms” is clear. A complex, multi-variable SLO is hard to track and even harder to act upon.
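One way to use history as a starting point rather than a target is to check which standard reliability tiers you have actually sustained in every recent window, as in the sketch below; the per-window counts and candidate tiers are illustrative.

```python
# Sketch: using historical performance as a starting point for the SLO discussion.
# The per-window counts below are illustrative; pull real numbers from your monitoring.

history = [
    # (good_requests, total_requests) for each recent 28-day window
    (9_995_100, 10_000_000),
    (9_993_800, 10_000_000),
    (9_996_400, 10_000_000),
]

achieved = [good / total for good, total in history]
print([f"{a:.4%}" for a in achieved])   # ['99.9510%', '99.9380%', '99.9640%']

# Common reliability tiers to anchor the negotiation. The final SLO is a product
# decision, not simply the best (or worst) window you happened to have.
candidate_targets = [0.999, 0.9995, 0.9999]
sustainable = [t for t in candidate_targets if all(a >= t for a in achieved)]
print(f"Targets met in every recent window: {sustainable}")   # [0.999]
```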
Step 3: Do the Math
Once you have your SLO, the budget calculation is straightforward. If your availability SLO is 99.9% over a 28-day period, your error budget is 0.1%. Every failed request during that period consumes a portion of that budget. This simple calculation transforms an abstract percentage into a concrete number of “unhappy user events” you can afford.
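As a worked example, here is a minimal sketch of that math, assuming roughly 10 million requests per window and, for the time-based view, a total outage that consumes budget at the full request rate.

```python
# Sketch: turning a 99.9% / 28-day SLO into concrete numbers.
# The traffic volume is illustrative.

slo = 0.999
window_days = 28
expected_requests = 10_000_000     # expected traffic over the window

error_budget_fraction = 1.0 - slo                                  # 0.001
allowed_bad_requests = error_budget_fraction * expected_requests   # 10,000 requests

# The same budget expressed as time, assuming a total outage burns budget
# at 100% of the request rate: 0.1% of 28 days.
window_minutes = window_days * 24 * 60                              # 40,320 minutes
allowed_downtime_minutes = error_budget_fraction * window_minutes   # ~40.3 minutes

print(f"Budget: {allowed_bad_requests:,.0f} failed requests "
      f"or ~{allowed_downtime_minutes:.1f} minutes of full outage per {window_days} days")
```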
The Heart of Governance: The Error Budget Policy
Calculating the budget is easy. Agreeing on what to do with it is hard. An error budget policy is a formal document, a pre-negotiated contract between stakeholders (SRE, development, product) that defines the rules of engagement. Without a policy, your SLO is just a number on a dashboard, not a tool for decision-making.
A strong error budget policy must contain:
- Clear Triggers: When is the policy invoked? This isn’t just when the budget hits zero. A key concept is the burn rate alert. A fast burn alert might trigger if you consume 10% of your monthly budget in under 24 hours. This is an early warning system that allows for course correction (a sketch of such a trigger check follows this list).
- Defined Consequences: What happens when a trigger is fired? This must be explicit and agreed upon in advance.
- “Yellow Alert” (Fast Burn): A team meeting is called to analyze the cause. Low-risk feature deployments may continue, but high-risk changes are paused.
- “Red Alert” (Budget Exhausted): All feature development is halted. The development team’s top priority shifts to reliability work—fixing bugs, paying down technical debt, or improving test coverage. New releases are frozen until enough budget is accrued for the next release.
- Ownership and Accountability: The policy must state who is responsible for declaring a freeze (e.g., the SRE team lead, the product owner) and who is responsible for executing the reliability work (the development team).
- Escalation Path: What happens if there’s a disagreement? There must be a clear path for escalating the decision, typically to a director or VP of Engineering who can make the final call based on broader business context.
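To illustrate how such triggers and consequences might be encoded, here is a rough sketch; the 10% and exhaustion thresholds mirror the examples above, while the function name and return strings are purely illustrative and should follow whatever your own policy document says.

```python
# Sketch: encoding the policy triggers described above. The thresholds mirror the
# examples in this article (10% of the budget burned in 24h => fast burn);
# adjust them to match your own policy document.

def policy_state(budget_consumed_total: float, budget_consumed_last_24h: float) -> str:
    """Both arguments are fractions of the full-window error budget (0.0 and up)."""
    if budget_consumed_total >= 1.0:
        # "Red alert": budget exhausted -- freeze releases, shift to reliability work.
        return "red: freeze releases, prioritize reliability work"
    if budget_consumed_last_24h >= 0.10:
        # "Yellow alert": fast burn -- investigate, pause high-risk changes.
        return "yellow: pause high-risk changes, investigate the burn"
    return "green: ship normally"

print(policy_state(budget_consumed_total=0.35, budget_consumed_last_24h=0.12))
# -> yellow: pause high-risk changes, investigate the burn
```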
Getting buy-in for this policy is the ultimate test of your SLOs. If developers feel the consequences are too punitive for the given target, the SLO may be too strict. If the product owner feels the service will be unacceptably bad before a freeze is triggered, the SLO is too loose. This negotiation is what aligns the entire organization around a shared definition of “good enough.”
Tracking and Enforcement at Scale
With a policy in place, you need tooling to make it visible and actionable.
- SLO Dashboards: You cannot manage what you do not measure. A dedicated SLO dashboard is essential. It should visualize, at a minimum:
- The current SLI value.
- The SLO target.
- The remaining error budget for the period.
- A burndown chart showing the rate of budget consumption over time. This makes the burn rate intuitive and visible to everyone.
- Burn Rate Alerting: Your SLO monitoring system must be configured with multi-window, multi-burn-rate alerts. These alerts are the automated triggers for your policy. For example (the underlying burn-rate arithmetic is sketched after this list):
- High Priority Alert: “Alert if we are on track to consume our 28-day error budget in the next 3 days.” This requires immediate attention.
- Low Priority Ticket: “Create a ticket if we have consumed 2% of our quarterly budget in the last 6 hours.” This flags an issue for the next planning meeting.
- Governance and Review Cadence: SLOs are not static. The business changes, user expectations change, and your systems evolve. Establish a quarterly SRE governance meeting to review SLO performance.
- Did our SLOs accurately reflect customer impact?
- Were our burn rate alerts effective, or were they too noisy/too slow?
- Do we need to adjust our SLO targets for the next quarter?
- Did we spend our error budget on valuable experiments or on preventable failures?
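For reference, here is a sketch of the burn-rate arithmetic behind such alerts, assuming a 99.9% SLO, a 28-day window, and roughly uniform traffic; the thresholds and example error rates are illustrative.

```python
# Sketch: multi-window, multi-burn-rate alerting arithmetic. A 99.9% SLO, a 28-day
# window, and uniform traffic are assumed; the error-rate inputs are examples and
# would come from your monitoring system.

SLO = 0.999
BUDGET_FRACTION = 1 - SLO              # 0.001
WINDOW_HOURS = 28 * 24                 # 672 hours in the SLO window

def burn_rate(error_rate: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning.
    A burn rate of 1.0 exhausts the budget precisely at the end of the window."""
    return error_rate / BUDGET_FRACTION

def hours_to_exhaustion(error_rate: float) -> float:
    """At this error rate, how long until the whole budget is gone."""
    return WINDOW_HOURS / burn_rate(error_rate)

def budget_burned(error_rate: float, over_hours: float) -> float:
    """Fraction of the total budget consumed during over_hours at this error rate,
    assuming roughly uniform traffic."""
    return burn_rate(error_rate) * over_hours / WINDOW_HOURS

# High priority: page if, at the rate seen over the last hour, the budget
# would be exhausted within the next 3 days.
error_rate_1h = 0.012                  # 1.2% of requests failing (example)
if hours_to_exhaustion(error_rate_1h) <= 3 * 24:
    print(f"PAGE: {burn_rate(error_rate_1h):.0f}x burn, "
          f"budget gone in ~{hours_to_exhaustion(error_rate_1h) / 24:.1f} days")

# Low priority: open a ticket if more than 2% of the budget burned in the last 6 hours.
error_rate_6h = 0.003                  # 0.3% of requests failing (example)
if budget_burned(error_rate_6h, over_hours=6) >= 0.02:
    print(f"TICKET: {budget_burned(error_rate_6h, 6):.1%} of the budget burned in 6h")
```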
An error budget transforms reliability from an emotional, often contentious topic into a rational, data-driven engineering discipline. It creates a virtuous cycle: it provides developers with the autonomy to innovate quickly when the service is healthy and provides SREs with the clear mandate to put the brakes on when reliability dips below the user-agreed target. It is the most powerful tool for building services that are not just reliable, but reliably innovative.
To effectively manage an error budget, you need high-resolution, real-time data for your SLIs. Latency or gaps in your monitoring data can lead to inaccurate budget calculations and missed alerts. Netdata’s per-second monitoring provides the granularity needed to build precise SLOs and catch fast-burning budget consumption before it threatens your reliability targets.