
Defining your own SLIs

You start to implement the SRE Monitoring-as-Code framework in your environment by defining and implementing the SLIs and SLOs for your service.

Define and implement your SLIs and SLOs

Before you can implement the SRE Monitoring-as-Code framework in your environment, you must:

  1. Define user journeys and service-level indicators for your service.
  2. Agree baseline SLOs for each SLI.
  3. Implement SLI Definitions.
  4. Observe and iterate.

Define user journeys and service-level indicators for your service

Setting SLIs helps you set realistic objectives for your service and avoid over-committing resources on Site Reliability Engineering (SRE). SLIs benefit your service by:

  • defining Service Level Objectives (SLO) for your service’s user journeys
  • helping prioritise your work and improve your infrastructure
  • creating metrics to help classify incidents
  • measuring how your system performs in the medium to long term

You should run an SLI workshop to define the specific SLIs for your service. Follow the steps in The GDS Way to run an SLI workshop.

Agree baseline SLOs for each SLI

Current performance based on SLIs is usually a good place to start, especially if you do not have any other information. It also helps to set a baseline that you can improve to reflect service objectives.

Once you have everything in place, you can [implement your SLIs and SLOs](#implement-slis-task-heading).

Implement SLI Definitions (sre-monitoring-as-code)

Low level diagram showing workflow

Teams must create a definition file (mixin) for each product they wish to monitor. A boilerplate mixin is provided in the sre-monitoring-as-code repository.

Within the definition file you need to pass in the following global variables for your service (a rough sketch of how these might be declared follows the table):

| Global variable | Description | Formatting best practice | Example |
| --- | --- | --- | --- |
| product | Short product name | Lower case or hyphenated | grapi |
| applicationServiceName | ServiceNow Primary Impacted Service name | Must match the ServiceNow Primary Impacted Service | Great Respect API |
| servicenowAssignmentGroup | ServiceNow Assignment Group | The ServiceNow Assignment Group that is the accountable owner of the Technical Service in question | Great Respect API |
| configurationItem | ServiceNow Technical Service subcomponent name | Must match the ServiceNow Technical Service subcomponent name | Great Respect API (App Svc) |
| max_alert_severity | Severity of the event | ServiceNow severity values range from 1 (Critical) to 5 (OK), with 0 meaning Clear | 3 |
| alertingSlackChannel | Slack channel that will receive alerts | Prefixed with a hash and should match an existing Slack channel | #prd-alerts |
| runbookUrl | Link to a runbook detailing actionable operational instructions | https link to a docs-as-code or Confluence runbook | https://ho-cto.github.io/sre-monitoring-as-code/runbook |
| grafanaUrl | Link to Grafana dashboards providing further insight into SLIs | https link to the platform-hosted Grafana, without any paths | https://grafana.ho-platform-x.gov.uk |
| alertmanagerUrl | Link to the Alertmanager console providing further insight into alerts and silencing options | https link to the platform-hosted Alertmanager, without any paths | https://alertmanager.ho-platform-x.gov.uk |
| generic | Indicates that the mixin file is intended to be generic across multiple services and adds additional product selectors to the dashboards | Boolean value | false |
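
As a rough sketch only, the global variables might be declared at the top of the definition file as shown below. The field names are taken from the table above, but the object name "config" and how it is wired into the rest of the mixin are assumptions here; the boilerplate mixin in the sre-monitoring-as-code repository defines the real structure.

# Illustrative sketch only: the surrounding structure comes from the boilerplate mixin.
local config = {
  product: 'grapi',
  applicationServiceName: 'Great Respect API',
  servicenowAssignmentGroup: 'Great Respect API',
  configurationItem: 'Great Respect API (App Svc)',
  max_alert_severity: 3,
  alertingSlackChannel: '#prd-alerts',
  runbookUrl: 'https://ho-cto.github.io/sre-monitoring-as-code/runbook',
  grafanaUrl: 'https://grafana.ho-platform-x.gov.uk',
  alertmanagerUrl: 'https://alertmanager.ho-platform-x.gov.uk',
  generic: false,
};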

Once you have set the global variables, you need to break each user journey down into a separate stanza in the definition file. See the "journey01" example below.

local sliSpecList = {
  # user journey name
  journey01: {
    # SLI per critical user journey step
    SLI01: {
      title: 'grapi search results requests',
      sliDescription: 'grapi search results requests',
      configurationItem: 'GRAPI API Search (App Svc)',
      period: '30d',
      metricType: 'http_server_requests_seconds',
      evalInterval: '5m',
      selectors: {
        product: '.+/grapi-search-api-helm',
        resource: '/search',
        errorStatus: '4..|5..',
      },
      sloTarget: 90.0,
      sliTypes: {
        availability: {
          intervalTarget: 90,
        },
        latency: {
          histogramSecondsTarget: 15,
          percentile: 90,
        },
      },
    },
  },
};

Local variables should be supplied for each SLI as follows:

| Local variable | Description | Formatting best practice | Example |
| --- | --- | --- | --- |
| title | Meaningful SLI summary | This is propagated into dashboards and alerts and should describe the element of the user journey | Landing page requests |
| sliDescription | Meaningful metric description | This is propagated into dashboards and alerts and should describe the metric used for calculations | HTTP actuator requests |
| period | The rolling period that the SLI covers | Default should be 30 days | 30d |
| metricType | The metric used to calculate the SLI | This should match the metric exposed and captured by Prometheus | http_server_requests_seconds_count |
| evalInterval | How frequently Prometheus evaluates the rules | The evaluation interval should be greater than or equal to the Prometheus scrape interval | 1m |
| selectors | The label selectors used in the PromQL expressions to filter data samples | Must include at least the job to filter on | 'job=~"grapi", uri=~"/grapi/v1/case"' |
| sloTarget | Your statement of desired performance over the compliance period defined in the "period" property | Provided as a percentage target | 90 = 90% |
| sliTypes | A map of SLI types based on standard Google SLI categories; the configuration for each SLI type is listed below | Must match SLI types aligned to the metric type in metric-types.libsonnet | { availability{...}, latency{...} } |

Possible SLI types

The monitoring-as-code SLI types are closely aligned with Google's SLI types; see the Google SRE documentation for more details. The SLI types currently supported by the Monitoring as Code framework, and their respective configuration fields, are listed below. Not all SLI types are applicable to every metric type. See the metric-types.libsonnet file for further details about which SLI types are supported by each metric type.

availability

| Field name | Description | Validation | Example |
| --- | --- | --- | --- |
| intervalTarget | Statement of desired performance within an interval, the length of which is determined by "evalInterval". We use windows-based SLIs, so we measure the ratio of measurement intervals that meet a goodness criterion to the total number of intervals | Provided as a percentage target | 99 = 99% |
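
For example, with evalInterval: 5m and intervalTarget: 99, each 5-minute window counts as good when the availability measured within that window is at least 99%. Over a 30-day period there are 8,640 such windows, and it is the proportion of good windows that is compared against sloTarget.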

latency

| Field name | Description | Validation | Example |
| --- | --- | --- | --- |
| histogramSecondsTarget | For "latency" SLI types with a histogram metric, a seconds target is provided to determine the goal that the specified latency percentile must meet for an interval to count as good | Provided in seconds | 0.25 = 250 milliseconds |
| percentile | For "latency" SLI types with a histogram metric, we measure percentiles in order to meaningfully describe the distribution of latencies. The 99th percentile is defined as the value that 99 out of 100 samples fall below: 99 users out of 100 observe a latency less than this value, and 1 in every 100 observes a latency equal to or greater. We choose the 99th percentile because it represents the tail of the latency distribution (that is, the worst cases) | Provided as a percentile | 99 = 99th percentile |
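
For example, in the "journey01" configuration above, percentile: 90 with histogramSecondsTarget: 15 means an interval counts as good when the 90th percentile of request latency in that interval is 15 seconds or less.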

freshness

| Field name | Description | Validation | Example |
| --- | --- | --- | --- |
| counterSecondsTarget | For "latency" SLI types with a counter metric, a seconds target is provided to determine the goal that the average latency must meet for an interval to count as good | Provided in seconds | 0.25 = 250 milliseconds |
| intervalTarget | Statement of desired performance within an interval, the length of which is determined by "evalInterval". We use windows-based SLIs, so we measure the ratio of measurement intervals that meet a goodness criterion to the total number of intervals | Provided as a percentage target | 99 = 99% |

correctness

| Field name | Description | Validation | Example |
| --- | --- | --- | --- |
| intervalTarget | Statement of desired performance within an interval, the length of which is determined by "evalInterval". We use windows-based SLIs, so we measure the ratio of measurement intervals that meet a goodness criterion to the total number of intervals | Provided as a percentage target | 99 = 99% |
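
As an illustration only, following the same pattern as the availability and latency configuration in the "journey01" example above (and assuming the underlying metric type supports these SLI types), a freshness or correctness configuration might look like this; the values are illustrative and not taken from the repository:

      sliTypes: {
        # illustrative values only
        freshness: {
          counterSecondsTarget: 0.25,
          intervalTarget: 99,
        },
        correctness: {
          intervalTarget: 99,
        },
      },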

Metric types (metric-types.libsonnet)

When creating your own metric type, you can use the existing definitions in the metric-types.libsonnet file as a baseline.

Observe and iterate (sre-monitoring-as-code)

After implementing your SLI configuration, observe the journey dashboards over a period of time (for example, one sprint). After this time, iterate on your SLIs to better understand your service's performance and how the SLIs help your team make decisions.