A big part of ensuring the availability of your applications is establishing and monitoring service-level metrics—something that our Site Reliability Engineering (SRE) team does every day here at Google Cloud. The end goal of our SRE principles is to improve services and in turn the user experience.
The concept of SRE starts with the idea that metrics should be closely tied to business objectives. In addition to business-level SLAs, we also use SLOs and SLIs in SRE planning and practice.
Defining the terms of site reliability engineering
These tools aren’t just useful abstractions. Without them, you won’t know if your system is reliable, available, or even useful. If the tools don’t tie back to your business objectives, then you’ll be missing data on whether your choices are helping or hurting your business.
As a refresher, here’s a look at SLOs, SLAs, and SLIS, as discussed by our Customer Reliability Engineering team in their blog post, SLOs, SLIs, SLAs, oh my – CRE life lessons.
1. Service-Level Objective (SLO)
SRE begins with the idea that availability is a prerequisite for success. An unavailable system can’t perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to its use as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.
When we set out to define the terms of SRE, we wanted to set a precise numerical target for system availability. We term this target the availability Service-Level Objective (SLO) of our system. Any future discussion about whether the system is running reliably and if any design or architectural changes to it are needed must be framed in terms of our system continuing to meet this SLO.
Keep in mind that the more reliable the service, the more it costs to operate. Define the lowest level of reliability that is acceptable for users of each service, then state that as your SLO. Every service should have an availability SLO—without it, your team and your stakeholders can’t make principled judgments about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development). Excessive availability has become the expectation, which can lead to problems. Don’t make your system overly reliable if the user experience doesn’t necessitate it, and especially if you don’t intend to commit to always reaching that level. You can learn more about this by participating in The Art of SLOs training.
Within Google Cloud, we implement periodic downtime in some services to prevent a service from being overly available. You could also try experimenting with occasional planned-downtime exercises with front-end servers, as we did with one of our internal systems. We found that these exercises can uncover services that are using those servers inappropriately. With that information, you can then move workloads to a more suitable place and keep servers at the right availability level.
2. Service-Level Agreement (SLA)
At Google Cloud, we distinguish between an SLO and a Service-Level Agreement (SLA). An SLA normally involves a promise to a service user that the service availability SLO should meet a certain level over a certain period. Failing to do so then results in some kind of penalty. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. Going out of SLO will hurt the service team, so they will push hard to stay within SLO. If you’re charging your customers money, you’ll probably need an SLA.
Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the availability SLO in the SLA is normally a looser objective than the internal availability SLO. This might be expressed in availability numbers: for instance, an availability SLO of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might only specify a subset of the metrics that make up the internal SLO.
If you have an SLO in your SLA that is different from your internal SLO (as it almost always is), it’s important for your monitoring to explicitly measure SLO compliance. You want to be able to view your system’s availability over the SLA calendar period, and quickly see if it appears to be in danger of going out of SLO.
You’ll also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (described in the SLA) to paying customers, we need to measure queries received from them separately from other queries. This is another benefit of establishing an SLA—it’s an unambiguous way to prioritize traffic.
When you define your SLA’s availability SLO, be careful about which queries you count as legitimate. For example, if a customer goes over quota because they released a buggy version of their mobile client, you may consider excluding all “out of quota” response codes from your SLA accounting.
3. Service-Level Indicator (SLI)
Our Service-Level Indicator (SLI) is a direct measurement of a service’s behavior, defined as the frequency of successful probes of our system. When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as by running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries as your SLIs.
If you’re building a system from scratch, make sure that SLIs and SLOs are part of your system requirements. If you already have a production system but don’t have them clearly defined, then that’s your highest priority work.
Source: Google Cloud Blog