Defining your own SLIs
You start to implement the SRE Monitoring-as-Code framework in your environment by defining and implementing the SLIs and SLOs for your service.
Define and implement your SLIs and SLOs
To implement the SRE Monitoring-as-Code framework in your environment, you must:
- Define user journeys and service-level indicators for your service.
- Agree baseline SLOs for each SLI.
- Implement SLI Definitions.
- Observe and iterate.
Define user journeys and service-level indicators for your service
Setting SLIs helps you set realistic objectives for your service and avoid over-committing resources on Site Reliability Engineering (SRE). SLIs benefit your service by:
- defining Service Level Objectives (SLO) for your service’s user journeys
- helping prioritise your work and improve your infrastructure
- creating metrics to help classify incidents
- measuring how your system performs in the medium to long term
You should run an SLI workshop to define the specific SLIs for your service. Follow the steps in The GDS Way to run an SLI workshop.
Agree baseline SLOs for each SLI
Current performance based on SLIs is usually a good place to start, especially if you do not have any other information. It also helps to set a baseline that you can improve to reflect service objectives.
Once you have everything in place, you can [implement your SLIs and SLOs](#implement-slis-task-heading).
Implement SLI Definitions (sre-monitoring-as-code)
Teams must create a definition file (mixin) for each product they wish to monitor. A boilerplate mixin is provided in the sre-monitoring-as-code repository.
Within the definition file, you need to pass in the following global variables for your service (see the example sketch after the table):
Global variables | Description | Formatting Best Practice | Example |
---|---|---|---|
product | Short Product Name | Lower case or hyphenated | grapi |
applicationServiceName | ServiceNow Primary Impacted Service Name | Must match the ServiceNow Primary Impacted Service | Great Respect API |
servicenowAssignmentGroup | ServiceNow Assignment Group | The ServiceNow Assignment Group that is the accountable owner of the Technical Service in question | Great Respect API |
configurationItem | ServiceNow Technical Service Subcomponent name | Must match the ServiceNow Technical Service subcomponent name | Great Respect API (App Svc) |
max_alert_severity | Severity of the event | ServiceNow values for severity range from 1 – Critical to 5 – OK, plus 0 – Clear | 3 |
alertingSlackChannel | Slack channel that will receive alerts | Prefixed with a hash and must match an existing Slack channel | #prd-alerts |
runbookUrl | Link to a runbook detailing actionable operational instructions | https link to a docs-as-code or Confluence runbook | https://ho-cto.github.io/sre-monitoring-as-code/runbook |
grafanaUrl | Link to Grafana dashboards providing further insight on SLIs | https link to platform-hosted Grafana without any paths | https://grafana.ho-platform-x.gov.uk |
alertmanagerUrl | Link to the Alertmanager console providing further insight on alerts and silencing options | https link to platform-hosted Alertmanager without any paths | https://alertmanager.ho-platform-x.gov.uk |
generic | Indicates that the mixin file is intended to be generic across multiple services and adds additional product selectors to the dashboards | Boolean value | false |
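As a rough sketch, the global variables might be set at the top of a mixin definition file as shown below. The values are taken from the example column of the table; the exact layout and any additional required fields should be checked against the boilerplate mixin in the sre-monitoring-as-code repository.

```
# Illustrative sketch only: check the boilerplate mixin in the
# sre-monitoring-as-code repository for the authoritative layout.
{
  product: 'grapi',
  applicationServiceName: 'Great Respect API',
  servicenowAssignmentGroup: 'Great Respect API',
  configurationItem: 'Great Respect API (App Svc)',
  max_alert_severity: 3,
  alertingSlackChannel: '#prd-alerts',
  runbookUrl: 'https://ho-cto.github.io/sre-monitoring-as-code/runbook',
  grafanaUrl: 'https://grafana.ho-platform-x.gov.uk',
  alertmanagerUrl: 'https://alertmanager.ho-platform-x.gov.uk',
  generic: false,
}
```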
Once you have set the global variables, you need to break down each user journey into a separate stanza in the SLI spec list. See the "journey01" example below.
local sliSpecList = {
# user journey name
journey01: {
# SLI per critical user journey step
SLI01: {
title: 'grapi search results requests',
sliDescription: 'grapi search results requests',
configurationItem: 'GRAPI API Search (App Svc)',
period: '30d',
metricType: 'http_server_requests_seconds',
evalInterval: '5m',
selectors: {
product: '.+/grapi-search-api-helm',
resource: '/search',
errorStatus: '4..|5..',
},
sloTarget: 90.0,
sliTypes: {
availability: {
intervalTarget: 90,
},
latency: {
histogramSecondsTarget: 15,
percentile: 90,
},
},
},
},
};
Local variables should be supplied for each SLI as follows:
Local variables | Description | Formatting Best Practice | Example |
---|---|---|---|
title | Meaningful SLI summary | This is propagated into Dashboards and Alerts and should describe the element of the user journey | Landing page requests |
sliDescription | Meaningful metric description | This is propagated into Dashboards and Alerts and should describe the metric used for calculations | HTTP actuator requests |
period | The rolling period that the SLI covers | Default should be 30 days | 30d |
metricType | The metric used to calculate the SLI | This should match the metric exposed and captured by Prometheus | http_server_requests_seconds_count |
evalInterval | How frequently Prometheus will evaluate rules | Evaluation interval should be greater than or equal to the Prometheus scrape interval | 1m |
selectors | The label selectors used in the PromQL expressions to filter data samples | At a minimum, must include the job selector to filter on | 'job=~"grapi", uri=~"/grapi/v1/case"' |
sloTarget | Your statement of desired performance over the compliance period defined in the "period" property | Provided as a percentage target | 90 = 90% |
sliTypes | A map of SLI types based on standard Google SLI categories | Must match SLI types aligned to the metric type in metric-types.libsonnet | { availability{...}, latency{...} } configuration for each SLI type is listed below |
Possible SLI types
The Monitoring-as-Code SLI types are closely aligned with Google's SLI types; see Google's SRE documentation for more details. The SLI types currently supported by the Monitoring-as-Code framework, and their respective configuration fields, are listed below. Not all SLI types are applicable to every metric type. See the metric-types.libsonnet file for further details about which SLI types are supported by each metric type.
availability
Field name | Description | Validation | Example |
---|---|---|---|
intervalTarget | Statement of desired performance within an interval, the length of which is determined by "evalInterval". We use windows-based SLIs, so we study the ratio of the number of measurement intervals that meet some goodness criterion to the total number of intervals | Provided as a percentage target | 99 = 99% |
latency
Field name | Description | Validation | Example |
---|---|---|---|
histogramSecondsTarget | For "latency" SLI types with a histogram metric, a seconds target is provided to determine the goal the latency percentile provided must meet to be counted as a good interval | Provided in seconds | 0.25 = 250 milliseconds |
percentile | For "latency" SLI types with a histogram metric, we measure percentiles in order to meaningfully describe the distribution of latencies | The 99th percentile is defined as the value that 99 out of 100 samples fall below. Thus 99 users out of 100 observe a latency less than this value, and 1 in every 100 observes a latency equal to or greater. We choose the 99th percentile because it represents the tail of the latency distribution (that is, the worst cases) | 99 = 99th percentile |
freshness
Field name | Description | Validation | Example |
---|---|---|---|
counterSecondsTarget | For "freshness" SLI types with a counter metric, a seconds target is provided to determine the goal the average latency must meet to be counted as a good interval | Provided in seconds | 0.25 = 250 milliseconds |
intervalTarget | Statement of desired performance within an interval, the length of which is determined by "evalInterval". We use windows-based SLIs, so we study the ratio of the number of measurement intervals that meet some goodness criterion to the total number of intervals | Provided as a percentage target | 99 = 99% |
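The "journey01" example above only shows availability and latency. As a rough sketch, a freshness SLI could be declared as below inside an SLI stanza; the field names come from the table above, but the surrounding structure and values are illustrative placeholders.

```
# Illustrative only: a freshness entry inside an SLI's sliTypes map,
# following the same shape as the journey01 example above.
sliTypes: {
  freshness: {
    counterSecondsTarget: 0.25,  # average latency goal in seconds (250 ms)
    intervalTarget: 99,          # % of evaluation intervals that must meet the goal
  },
},
```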
correctness
Field name | Description | Validation | Example |
---|---|---|---|
intervalTarget | Statement of desired performance within an interval, the length of which is determined by "evalInterval". We use windows-based SLIs, so we study the ratio of the number of measurement intervals that meet some goodness criterion to the total number of intervals | Provided as a percentage target | 99 = 99% |
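Similarly, a correctness SLI only needs an interval target. The fragment below is an illustrative placeholder following the same shape as the journey01 example:

```
# Illustrative only: a correctness entry inside an SLI's sliTypes map.
sliTypes: {
  correctness: {
    intervalTarget: 99,  # % of evaluation intervals that must meet the goodness criterion
  },
},
```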
Metric types (metric-types.libsonnet)
When creating your own metric type, you can use the sketch below as a baseline.
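The sketch below is purely illustrative and is not the confirmed schema from metric-types.libsonnet; field names such as selectorLabels, metrics and sliTypes are assumptions. Copy an existing entry in the repository (for example http_server_requests_seconds) for the authoritative structure and adapt it.

```
# Illustrative baseline only: the field names below (selectorLabels, metrics,
# sliTypes) are assumptions, not the confirmed metric-types.libsonnet schema.
{
  my_new_metric_type: {
    # maps the generic selector names used in SLI definitions to the
    # label names exposed by the underlying Prometheus metric
    selectorLabels: {
      product: 'job',
      resource: 'uri',
      errorStatus: 'status',
    },
    # the raw Prometheus metric names this metric type is built from
    metrics: {
      count: 'my_new_metric_seconds_count',
      sum: 'my_new_metric_seconds_sum',
      bucket: 'my_new_metric_seconds_bucket',
    },
    # SLI types this metric type supports
    sliTypes: ['availability', 'latency'],
  },
}
```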
Observe and iterate (sre-monitoring-as-code)
After implementing your SLI configuration, observe the dashboard journeys over a period of time (for example 1 sprint). After this time, iterate your SLIs to better understand your service’s performance and how the SLIs help your team make decisions.