SRE with Service Level Objectives

NOTE: this section is conceptual information about SRE with SLOs. We need to remove it as its own section and instead weave it in through the Get started section.

To start using the SRE Monitoring-as-Code framework, you should read the following Adopting SRE with Service Level Objectives documentation to understand the fundamental concepts.

Adopting SRE with Service Level Objectives

In Home Office DDaT, we follow the GDS Service Manual guidance on how to monitor the status of services and set performance metrics.

We frame the monitoring of our services around the Site Reliability Engineering (SRE) pillar of service-level objectives (SLOs). SLOs provide a "framework for defining clear targets around application performance, which ultimately help teams provide a consistent customer experience, balance feature development with platform stability, and improve communication with internal and external users" [1].

Concepts in monitoring using SLOs

DDaT's service monitoring is built around the following activities:

  • Selecting metrics that act as service-level indicators (SLIs).
  • Using the SLIs to set service-level objectives (SLOs), which are target values for the SLIs.
  • Using the error budget implied by the SLO to mitigate risk in your service.

Terminology

Service monitoring has a set of core concepts, which are introduced here:

  • Service-level indicator (SLI): a measurement of performance.
  • Service-level objective (SLO): a statement of desired performance.
  • Error budget: a measurement of the difference between actual and desired performance.

Service-level indicators

Home Office DDaT platforms expose and capture metrics that measure the performance of the applications and underpinning service infrastructure. Examples of performance metrics include the following:

  • Request count: for example, the number of HTTP requests per minute that result in 2xx or 5xx responses.
  • Response latencies: for example, the latency for HTTP 2xx responses.

The performance metrics are automatically identified based on a set of known sources such as Java Spring Boot Actuator (whitebox), Kubernetes Ingress Controllers, and AWS CloudWatch (blackbox).

The performance metrics are the basis of the SLIs for your service. An SLI describes the performance of some aspect of your service. For services using Frontier, AWS Simple Queue Service or Application Load Balancers, useful SLIs are already known. For example, if your service has request-count or response-latencies metrics, standard service-level indicators (SLIs) can be derived from those metrics by creating ratios as follows:

  • An availability SLI is the ratio of the number of successful responses to the number of all responses.
  • A latency SLI is the ratio of the number of calls below a latency threshold to the number of all calls.
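
As a purely illustrative sketch (the function names and request counts below are assumptions, not framework output), these two ratios could be computed like this:

```python
# Illustrative only: deriving the two standard SLI ratios from
# hypothetical request counts.

def availability_sli(successful_responses: int, total_responses: int) -> float:
    """Ratio of successful responses to all responses."""
    return successful_responses / total_responses

def latency_sli(fast_responses: int, total_responses: int) -> float:
    """Ratio of responses under the latency threshold to all responses."""
    return fast_responses / total_responses

# Example: 59,900 of 60,000 requests returned 2xx, and 58,200 of them
# completed within a 300 ms threshold.
print(f"availability SLI: {availability_sli(59_900, 60_000):.4f}")  # 0.9983
print(f"latency SLI:      {latency_sli(58_200, 60_000):.4f}")       # 0.9700
```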

Service-level objectives

An SLO is a target value for an SLI, measured over a period of time. The service determines the available SLIs, and you specify SLOs based on the SLIs. The SLO defines what qualifies as good service.

An SLO is built on the following information:

  • An SLI, which measures the performance of the service.
  • A performance goal, which specifies the desired level of performance.
  • A time period, called the compliance period, for measuring how the SLI compares to the performance goal.

For example, you might have requirements like these:

  • Latency can exceed 300 ms in only 5 percent of the requests over a rolling 30-day period.
  • The system must have 99% availability measured over a rolling 30-day period.
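
For illustration only, the two requirements above could be modelled as SLO objects and checked against a measured SLI. The `Slo` class and the measured values are hypothetical, not part of the Monitoring-as-Code framework:

```python
# Hypothetical sketch: an SLO pairs an SLI with a performance goal and a
# compliance period.

from dataclasses import dataclass

@dataclass
class Slo:
    description: str
    goal: float        # desired level of performance, e.g. 0.99
    period_days: int   # rolling compliance period

    def is_met(self, measured_sli: float) -> bool:
        """Compare the SLI, measured over the compliance period, to the goal."""
        return measured_sli >= self.goal

latency_slo = Slo("95% of requests complete within 300 ms", goal=0.95, period_days=30)
availability_slo = Slo("99% availability", goal=0.99, period_days=30)

print(latency_slo.is_met(0.97))        # True: 97% of requests were fast enough
print(availability_slo.is_met(0.985))  # False: availability fell short of 99%
```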

Error budgets

An SLO specifies the degree to which a service must perform during a compliance period. The error budget quantifies the degree to which a service can fail to perform during the compliance period and still meet the SLO.

Error budgets let you track how many bad individual events (like requests) are allowed to occur during the remainder of your compliance period before you violate the SLO. You can use the error budget to help you manage maintenance tasks like deployment of new versions. If the error budget is close to depleted, then taking risky actions like pushing new updates might result in your violating an SLO.

Your error budget for a compliance period is (1 − SLO goal) × (eligible events in compliance period). For example, if your SLO is for 85% of requests to be good in a 7-day rolling period, then your error budget allows 15% of these requests to be bad. If you received, say, 60,480 requests in the past week, your error budget is 15% of that total, or 9,072 requests that are permitted to be bad. If you served more errors than this, your service was out of SLO for the 7-day compliance period.
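
A minimal sketch of that arithmetic, reproducing the worked example above (the `error_budget` helper is hypothetical):

```python
# Error budget = (1 - SLO goal) x eligible events in the compliance period.

def error_budget(slo_goal: float, eligible_events: int) -> int:
    """Number of events permitted to be bad within the compliance period."""
    return round((1 - slo_goal) * eligible_events)

budget = error_budget(slo_goal=0.85, eligible_events=60_480)
print(budget)  # 9072 requests may be bad over the 7-day rolling period

bad_requests_so_far = 7_500  # hypothetical count for the current period
print(f"remaining budget: {budget - bad_requests_so_far} requests")  # 1572
```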

Windows-based vs request-based SLIs

You can also set up service-specific SLIs for some other measure of what “good performance” means. These SLIs generally fall into two categories:

  • Request-based SLIs, where good service is measured by counting atomic units of service, like the number of successful HTTP requests.
  • Windows-based SLIs, where good service is measured by counting the number of time periods, or windows, during which performance meets a goodness criterion, like response latency below a given threshold.

These SLIs are described in more detail in Compliance in request- and windows-based SLOs.
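
As a hedged illustration (the sample data below is assumed, not framework output), the same five minutes of traffic can be scored both ways:

```python
# Comparing the two SLI styles over one five-minute sample.

# Request-based: count atomic units of service (individual requests).
good_requests, total_requests = 4_950, 5_000
request_based_sli = good_requests / total_requests             # 0.99

# Windows-based: count time windows meeting a goodness criterion, e.g.
# one-minute windows in which p95 latency stayed below 300 ms.
window_p95_latency_ms = [120, 180, 450, 160, 140]              # one value per window
good_windows = sum(1 for p95 in window_p95_latency_ms if p95 < 300)
windows_based_sli = good_windows / len(window_p95_latency_ms)  # 0.8

print(request_based_sli, windows_based_sli)
```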

How we measure reliability

Placeholder for symptoms vs causes.

Setting Service Level Indicators for your service

Continuous improvement of SLOs

Further Reading

[1] https://docs.datadoghq.com/monitors/service_level_objectives/