
Interpret Monitoring-as-Code outputs

This section contains detailed information about how the monitoring dashboard results are calculated.

This is useful for:

  • SRE teams who are implementing MaC in their environment - to test their implementation and to understand dashboard results
  • operational support teams - to understand dashboard results
  • service managers - to report on service availability at an organisational level

Error budget burn-rate rules

Burn-rate windows

The Monitoring-as-Code framework uses a standard set of burn-rate windows for its alerts:

Severity | Long window | Short window | Factor
1        | 1 hour      | 5 minutes    | 14.4
2        | 6 hours     | 30 minutes   | 6
4        | 3 days      | 6 hours      | 1

The purpose of having a short window and a long window is to check that the error budget is being consumed recently (short window) and continuously (long window).

The factor is the proportion of error budget consumed over a 30 day rolling period. For example, if your SLI had an evaluation window of 5 minutes then there would be 8640 total evaluation windows in the 30 days. If you had an SLO target of 90% then you would have a 10% error budget, which equates to 864 bad windows allowed without breaching the SLO. If you consumed all 864 windows of your error budget over the 30 day period, you would have a factor of 1. If you consumed double your error budget, 1728 windows, you would have a factor of 2. If you consumed half your error budget, 432 windows, you would have a factor of 0.5.
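A minimal sketch of that arithmetic (the names are illustrative, not part of the framework):

```python
# Factor = bad windows consumed / bad windows allowed over the 30 day period.
EVALUATION_WINDOW_MINUTES = 5
ERROR_BUDGET = 0.10  # an SLO target of 90% leaves a 10% error budget

total_windows = 30 * 24 * 60 // EVALUATION_WINDOW_MINUTES  # 8640 windows in 30 days
allowed_bad_windows = total_windows * ERROR_BUDGET         # 864 bad windows before the SLO is breached

def factor(bad_windows_consumed: int) -> float:
    """Proportion of the 30 day error budget consumed."""
    return bad_windows_consumed / allowed_bad_windows

print(factor(864))   # 1.0 - all of the error budget
print(factor(1728))  # 2.0 - double the error budget
print(factor(432))   # 0.5 - half the error budget
```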

Each burn-rate window has a factor that must be exceeded by the burn-rate of both the short and long windows in order for an alert to be fired.

Burn-rate rules

The following calculation is used to determine the burn rate over a long or short window:

burn-rate = (1 - number of good evaluation windows / total number of evaluation windows) / error budget (as a decimal)

For example, if you had an evaluation window of 5 minutes (12 windows in an hour), an SLO target of 90% (a 10% error budget) and 3 bad evaluation windows over the 1 hour burn-rate window, the calculation would be:

(1 - 9 / 12) / 0.1 = 2.5

This would not trigger the severity 1 alert because the burn rate of the 1 hour burn-rate window is not greater than the factor of 14.4.
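A minimal sketch of that rule, assuming the good and total window counts are available (the function names are illustrative, not the framework's recording rule names):

```python
def burn_rate(good_windows: int, total_windows: int, error_budget: float) -> float:
    """(1 - good windows / total windows) / error budget, as described above."""
    return (1 - good_windows / total_windows) / error_budget

def alert_fires(long_window_rate: float, short_window_rate: float, factor: float) -> bool:
    """Both the long and the short window burn rates must exceed the factor."""
    return long_window_rate > factor and short_window_rate > factor

# Worked example from the text: 3 bad windows out of 12 in the 1 hour long window.
long_rate = burn_rate(good_windows=9, total_windows=12, error_budget=0.1)
print(long_rate)  # 2.5

# Using the long window rate for both arguments, purely to illustrate the comparison.
print(alert_fires(long_rate, long_rate, 14.4))  # False - 2.5 does not exceed 14.4
```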

Alerting

Note: This section isn't complete and will need to be reviewed.

We base alerts on the error budget burn-rate, which is measured against the SLO target set by the product team. If monitoring indicates that an SLO is in danger of being breached, an alert is triggered. How quickly the error budget is forecast to be exhausted determines the level of alert that is triggered.

Burn rate is how fast, relative to the SLO, the service consumes the error budget.

Our alerting methodology uses multiple burn rates and time windows. "We generate alerts when burn rates surpass a specified threshold. This option retains the benefits of alerting on burn rates and ensures that you don’t overlook lower (but still significant) error rates."

As a baseline we've implemented 2% budget consumption in one hour, 5% budget consumption in six hours, 7.5% budget consumption in one day and 10% budget consumption in 3 days for ticket alerts. More detail is provided in the table below:

SLO budget consumption | Time window | Burn rate | Notification
2%                     | 1 hour      | 14.4      | tbc
5%                     | 6 hours     | 6         | tbc
7.5%                   | 1 day       | 3         | tbc
10%                    | 3 days      | 1         | tbc
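A burn rate of 1 spends exactly the whole 30 day budget over the 30 day period, so the burn rate implied by consuming a given share of the budget within a window can be sketched as below (illustrative only; the names are not from the framework):

```python
# Burn rate implied by consuming a share of the 30 day error budget within a
# given time window (a burn rate of 1 spends the whole budget in exactly 30 days).
SLO_PERIOD_HOURS = 30 * 24  # 720 hours

def implied_burn_rate(budget_consumed: float, window_hours: float) -> float:
    return budget_consumed * SLO_PERIOD_HOURS / window_hours

print(implied_burn_rate(0.02, 1))       # 14.4 - 2% of the budget in 1 hour
print(implied_burn_rate(0.05, 6))       # 6.0  - 5% of the budget in 6 hours
print(implied_burn_rate(0.10, 3 * 24))  # 1.0  - 10% of the budget in 3 days
```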

Standardised alerts

One of the features provided by the monitoring framework is standardised alerts. These alerts are triggered in response to a significant rise in the error budget burn-rate of an SLO (for example, if a service starts responding slowly, the burn rate of the latency SLO would increase and trigger an alert). When an alert is triggered, a message is sent to a specified Slack channel containing details of the alert as a list of labels. Most of the labels are outlined in Designing Service Alerting (NOTE need a link here) and Planning for Alerts (NOTE need a link here), however there are some additional labels:

  • alertname
  • assignment_group
  • ci_type
  • configuration_item
  • description
  • environment
  • event_class
  • factor
  • instance
  • journey
  • metric_name
  • node
  • replica
  • resource
  • service
  • severity
  • short_description
  • slo
  • title
  • type
  • wait_for
  • time_of_event
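As a purely illustrative sketch, an alert's labels could be represented as a simple mapping. The label names below come from the list above, but every value is an invented placeholder rather than real framework output:

```python
# Hypothetical example only: label names are taken from the list above,
# all values are invented placeholders, not real alert output.
example_alert_labels = {
    "alertname": "example-error-budget-burn",
    "severity": "1",
    "factor": "14.4",
    "environment": "example-environment",
    "service": "example-service",
    "journey": "example-journey",
    "slo": "example-slo",
    "short_description": "Error budget burn rate threshold exceeded",
    "time_of_event": "2000-01-01T00:00:00Z",
}
```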

SLI recording rules

SLI Value

The SLI value recording rule is used to calculate whether a time window was good by comparing the output of the rule to the metric target. The expression used for SLI value varies depending on the type of SLI. By default, a window is good when the SLI value is less than the metric target, but a different comparison operator can be defined for each SLI in the mixin file.

Examples of types of SLI value rules are:

  • HTTP availability
  • HTTP latency

HTTP availability

Let's say you had an HTTP availability SLI with 5 minute time windows, a metric target of 0.1 and bad requests defined as any that return 400 or 500 status codes. If over a 5 minute window you had 9 good requests and 1 bad request (a total of 10 requests), the SLI value would be 1 (number of bad requests) / 10 (total number of requests), which equals 0.1. Since the default comparison is less than, and the SLI value of 0.1 is not less than the metric target of 0.1, it would be a bad window.

If over the next 5 minutes you had 19 good requests and 1 bad request (a total of 20 requests), the SLI value would be 1 (number of bad requests) / 20 (total number of requests), which equals 0.05. Since 0.05 is less than the metric target of 0.1, it would be a good window.
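A small sketch of both window checks, assuming requests are already classified as good or bad (names are illustrative, not the framework's recording rule expressions):

```python
# HTTP availability SLI value: proportion of bad requests in the window.
METRIC_TARGET = 0.1

def sli_value(bad_requests: int, total_requests: int) -> float:
    return bad_requests / total_requests

def is_good_window(value: float, target: float = METRIC_TARGET) -> bool:
    # Default comparison: the window is good when SLI value < metric target.
    return value < target

print(is_good_window(sli_value(1, 10)))  # False - 0.1 is not less than 0.1
print(is_good_window(sli_value(1, 20)))  # True  - 0.05 is less than 0.1
```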

HTTP latency

Let's say you had an HTTP latency SLI with 5 minute time windows, a metric target of 0.5 and a latency percentile of 0.9. Below are the ordered results for latency of 10 requests over a 5 minute period:

Latency
0.1
0.13
0.2
0.21
0.22
0.25
0.27
0.31
0.4
0.52

Because the latency percentile is 0.9, or 90%, and there are 10 values in total, we take the 9th value from the ordered list. This value is 0.4, which is less than the metric target of 0.5, so it is a good window. Let's look at another 5 minute time period with 20 requests:

Latency
0.1
0.13
0.15
0.2
0.21
0.21
0.22
0.22
0.25
0.26
0.28
0.28
0.29
0.3
0.35
0.41
0.47
0.52
0.61
0.64

Because the latency percentile is 0.9, or 90%, and there are 20 values in total, we take the 18th value from the ordered list. This value is 0.52, which is greater than the metric target of 0.5, so it is a bad window.
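A small sketch of both latency checks; the nearest-rank percentile selection here is an assumption that matches the worked examples rather than the framework's exact expression:

```python
import math

METRIC_TARGET = 0.5
LATENCY_PERCENTILE = 0.9

def percentile_value(latencies: list[float], percentile: float) -> float:
    """Nearest-rank percentile: the 9th of 10 values, the 18th of 20 values."""
    ordered = sorted(latencies)
    rank = math.ceil(percentile * len(ordered))
    return ordered[rank - 1]

first_window = [0.1, 0.13, 0.2, 0.21, 0.22, 0.25, 0.27, 0.31, 0.4, 0.52]
second_window = [0.1, 0.13, 0.15, 0.2, 0.21, 0.21, 0.22, 0.22, 0.25, 0.26,
                 0.28, 0.28, 0.29, 0.3, 0.35, 0.41, 0.47, 0.52, 0.61, 0.64]

print(percentile_value(first_window, LATENCY_PERCENTILE) < METRIC_TARGET)   # True  - 0.4 is below 0.5, good window
print(percentile_value(second_window, LATENCY_PERCENTILE) < METRIC_TARGET)  # False - 0.52 is above 0.5, bad window
```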

SLI percentage

The SLI percentage recording rule is used for comparing the overall percentage of good time windows for different services in the summary view dashboard. It takes the number of good windows in the last 30 days and divides it by the total number of windows where data was received in the last 30 days.
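As a minimal sketch (the counts below are made up purely for illustration):

```python
# SLI percentage: good windows divided by windows where data was received,
# over the last 30 days. The counts are invented for illustration only.
def sli_percentage(good_windows: int, windows_with_data: int) -> float:
    return 100 * good_windows / windows_with_data

print(sli_percentage(8500, 8640))  # roughly 98.4
```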

Time window based SLOs

A time window based SLO works by taking the number of good windows over a time period and dividing them by the total number of time windows in that period where the metric being used for the SLI was updated. This is easiest to explain with an example:

Let's say you have an SLI for HTTP requests. The time windows for this SLI are 5 minutes long, and your SLO performance is calculated over the past hour, which means 12 time windows are used to calculate your SLO performance (there are twelve 5 minute time windows in 1 hour). The target for your SLO is 80%, and the criterion for a time window being good is that at least 90% of the requests performed within that 5 minute window responded with good status codes. The monitoring starts at 14:00; below is the first hour:

Time  | Window number | Number of good requests | Total number of requests | % of good requests | Window status
14:05 | 1             | 4                       | 5                        | 80%                | Bad
14:10 | 2             | 9                       | 10                       | 90%                | Good
14:15 | 3             | 3                       | 3                        | 100%               | Good
14:20 | 4             | 0                       | 0                        | N/A                | No data
14:25 | 5             | 6                       | 8                        | 75%                | Bad
14:30 | 6             | 0                       | 0                        | N/A                | No data
14:35 | 7             | 12                      | 12                       | 100%               | Good
14:40 | 8             | 5                       | 5                        | 100%               | Good
14:45 | 9             | 19                      | 20                       | 95%                | Good
14:50 | 10            | 2                       | 3                        | 66%                | Bad
14:55 | 11            | 7                       | 7                        | 100%               | Good
15:00 | 12            | 10                      | 11                       | 91%                | Good

As you can see, within the first hour there were 7 good windows, 3 bad windows and 2 windows where no requests were made. That means there were 10 windows in total where the metric for the SLI was updated (since new requests were made). To calculate the SLO performance for the hour, we do 7 (number of good windows) / 10 (total number of windows), which equals 0.7 or 70%. This is lower than the SLO target of 80%, so over the last hour the SLO was not met. Let's look again after the next window has been calculated:

Time  | Window number | Number of good requests | Total number of requests | % of good requests | Window status
14:05 | 1             | 4                       | 5                        | 80%                | Bad
14:10 | 2             | 9                       | 10                       | 90%                | Good
14:15 | 3             | 3                       | 3                        | 100%               | Good
14:20 | 4             | 0                       | 0                        | N/A                | No data
14:25 | 5             | 6                       | 8                        | 75%                | Bad
14:30 | 6             | 0                       | 0                        | N/A                | No data
14:35 | 7             | 12                      | 12                       | 100%               | Good
14:40 | 8             | 5                       | 5                        | 100%               | Good
14:45 | 9             | 19                      | 20                       | 95%                | Good
14:50 | 10            | 2                       | 3                        | 66%                | Bad
14:55 | 11            | 7                       | 7                        | 100%               | Good
15:00 | 12            | 10                      | 11                       | 91%                | Good
15:05 | 13            | 15                      | 15                       | 100%               | Good

Since the SLO performance is being calculated over an hour, window 1 is not included in the calculation. That means we now have 8 good windows, 2 bad windows and 2 windows where no requests were made. The calculation for SLO performance is 8 (number of good windows) / 10 (total number of windows) which equals 0.8 or 80%. This is the same as our SLO target of 80%, so over the past hour the SLO was met.
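A minimal sketch of that rolling calculation, using the window statuses from the tables above (names are illustrative):

```python
# Time window based SLO: good windows divided by windows where data was
# received, over the rolling period. Statuses are taken from the tables above.
SLO_TARGET = 0.80

def slo_performance(window_statuses: list[str]) -> float:
    # Windows with no data are excluded from both numerator and denominator.
    counted = [s for s in window_statuses if s != "No data"]
    return sum(1 for s in counted if s == "Good") / len(counted)

first_hour = ["Bad", "Good", "Good", "No data", "Bad", "No data",
              "Good", "Good", "Good", "Bad", "Good", "Good"]
print(slo_performance(first_hour) >= SLO_TARGET)  # False - 7 / 10 = 0.7, below the 80% target

# After window 13 arrives, window 1 drops out of the rolling hour.
next_hour = first_hour[1:] + ["Good"]
print(slo_performance(next_hour) >= SLO_TARGET)   # True - 8 / 10 = 0.8, meeting the 80% target
```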

Service-level indicator menu

NOTE: This section will be fully reviewed and written.

Overview

We have built on Google's concept of an SLI menu by developing a standard suite of extensible and reusable SLIs that can be selected as part of a product's monitoring configuration.

SLI Menu GUID | Self-service SLIs | Data source | Google SLI Type | Google SLI Category
1 | HTTP errors | Java Spring Boot Actuator | Availability | Request-driven
2 | AWS Application Load Balancer http errors | AWS Cloudwatch ApplicationELB Namespace | Availability | Request-driven
3 | HTTP latency | Java Spring Boot Actuator | Latency | Request-driven
4 | AWS Application Load Balancer http latency | AWS Cloudwatch ApplicationELB Namespace | Latency | Request-driven
5 | - | - | Quality | Request-driven
6 | AWS Simple Queue Service high latency in standard queue | AWS Cloudwatch SQS Namespace | Freshness | Pipeline
7 | AWS Simple Queue Service message received in dead letter queue | AWS Cloudwatch SQS Namespace | Pipeline | Pipeline
- | - | - | Coverage | Pipeline
- | - | - | Durability | Storage

SLI worksheet - Spring Boot Actuator HTTP errors

Google SLI Category: Request-driven

Google SLI Type: Availability

SLI Specification: The proportion of journey requests that return a successful HTTP status code

SLI Implementation:

Rationale:

Error Budget:

Clarifications and Caveats: