Architecture

Monitoring-as-Code (MaC) is a framework for Prometheus and Grafana only. We are not responsible for the Prometheus and Grafana monitoring and alerting infrastructure that supports the artefacts generated by MaC, although we do provide a local docker-compose setup that simulates a highly available Prometheus/Grafana environment.

Architecture Diagrams

Logical Architecture

The MaC framework is invoked via a continuous integration pipeline and generates three types of artefact: recording rules, alerting rules and dashboards.
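
To make the three artefact types concrete, the sketch below builds a minimal example of each as plain Python data and prints it as JSON. It is an illustration only and is not how MaC itself generates artefacts. The rule fields (`record`, `alert`, `expr`, `for`, `labels`, `annotations`) follow the standard Prometheus rule file layout, and the dashboard is a heavily simplified Grafana dashboard model. Every metric name, rule name, threshold and title is a hypothetical placeholder.

```python
import json

# Hypothetical PromQL expression: ratio of 5xx responses over five minutes.
ERROR_RATIO_EXPR = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def recording_rule(record, expr):
    """A Prometheus recording rule: precomputes expr under a new series name."""
    return {"record": record, "expr": expr}


def alerting_rule(alert, expr, for_duration, labels, annotations):
    """A Prometheus alerting rule: fires when expr holds for for_duration."""
    return {
        "alert": alert,
        "expr": expr,
        "for": for_duration,
        "labels": labels,
        "annotations": annotations,
    }


def rule_file(group_name, rules):
    """Wrap rules in the top-level 'groups' structure of a rule file."""
    return {"groups": [{"name": group_name, "rules": rules}]}


# All names and thresholds below are assumptions made for illustration.
artefacts = {
    "recording_rules": rule_file(
        "example_sli_recording",
        [recording_rule("sli:http_error_ratio:rate5m", ERROR_RATIO_EXPR)],
    ),
    "alerting_rules": rule_file(
        "example_sli_alerting",
        [
            alerting_rule(
                "ExampleHighErrorRatio",
                "sli:http_error_ratio:rate5m > 0.01",
                "5m",
                {"severity": "page", "service": "example-service"},
                {"summary": "Error ratio above 1% for example-service"},
            )
        ],
    ),
    # Simplified dashboard: one panel charting the recorded series.
    "dashboard": {
        "title": "Example service overview",
        "panels": [
            {
                "title": "HTTP error ratio",
                "type": "timeseries",
                "targets": [{"expr": "sli:http_error_ratio:rate5m"}],
            }
        ],
    },
}

print(json.dumps(artefacts, indent=2))
```

The alerting rule and the dashboard panel both read the recorded series rather than repeating the raw expression, which is the usual reason for generating recording rules alongside the other two artefacts.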

Dashboard Hierarchy

Physical Architecture

MaC artefacts are injected into Prometheus and Grafana instances at runtime using a number of distribution options.


Platform setup

The following platform setup is required to maximise the use of MaC:

Telemetry capture

MaC depends on Prometheus scraping the appropriate metrics from targets such as client-instrumented apps, Kubernetes controller/worker nodes and AWS CloudWatch namespaces.

Alertmanager and/or Incident Response Tool Configuration, Templating and Egress

The appropriate receivers and templates should be configured in Alertmanager, or in alert management tooling such as PagerDuty or Opsgenie, to ensure that alert labels and annotations are propagated to ServiceNow for incident response.
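
As a rough illustration of what propagating labels and annotations means, the sketch below takes a payload shaped like an Alertmanager webhook notification (a top-level `alerts` list whose entries carry `status`, `labels` and `annotations`) and flattens each firing alert into an incident record. The incident field names, the `service` and `severity` labels and the sample values are all assumptions made for this example. They do not describe the ServiceNow API or any integration shipped with MaC.

```python
def alert_to_incident(alert):
    """Flatten one alert into a hypothetical incident record."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return {
        # Prefer the human-readable summary, fall back to the alert name.
        "short_description": annotations.get(
            "summary", labels.get("alertname", "unknown alert")
        ),
        "description": annotations.get("description", ""),
        "severity": labels.get("severity", "unknown"),
        "service": labels.get("service", "unknown"),
    }


def payload_to_incidents(payload):
    """Create incident records for firing alerts only; skip resolved ones."""
    return [
        alert_to_incident(alert)
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
    ]


# Sample payload with hypothetical label and annotation values.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "ExampleHighErrorRatio",
                "severity": "page",
                "service": "example-service",
            },
            "annotations": {
                "summary": "Error ratio above 1% for example-service",
                "description": "More than 1% of requests have failed.",
            },
        },
        {
            "status": "resolved",
            "labels": {"alertname": "ExampleLatency", "severity": "ticket"},
            "annotations": {},
        },
    ],
}

incidents = payload_to_incidents(sample_payload)
for incident in incidents:
    print(incident)
```

If a label or annotation is missing from the alert, the record falls back to a placeholder value, which is why the alerting rules need to set them consistently.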

Monitoring-as-Code adoption

MaC can be adopted by pulling the container from our GitHub Registry, setting up a mixin definition file and invoking the container either manually through a shell script or from a pipeline using a docker run command.
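
The invocation step can be sketched as a small helper that assembles a docker run command line. This is a hedged illustration: the image reference, the host directories and the `/input` and `/output` mount points inside the container are hypothetical placeholders, so check the project's own instructions for the real values. The only Docker flags used are `--rm` and `-v`, and the command is printed rather than executed.

```python
import shlex


def build_docker_run(image, mixin_dir, output_dir):
    """Assemble a docker run argv; container paths are assumed, not MaC's."""
    return [
        "docker",
        "run",
        "--rm",  # remove the container once the run finishes
        "-v",
        f"{mixin_dir}:/input:ro",  # mixin definitions, mounted read-only
        "-v",
        f"{output_dir}:/output",  # where generated artefacts would be written
        image,
    ]


# Placeholder values; substitute the real image reference and directories.
command = build_docker_run(
    image="registry.example/monitoring-as-code:latest",
    mixin_dir="/workspace/mixin-defs",
    output_dir="/workspace/output",
)

# Print the command for review instead of running it.
print(shlex.join(command))
```

The same argument list can be passed to `subprocess.run(command, check=True)` from a pipeline step once the placeholders are replaced.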

Monitoring & Alerting Specification

Prometheus/Grafana is used for metrics-based monitoring and is a good candidate for generating SLIs from a combination of different telemetry. It should be complemented by other tools providing synthetics and distributed tracing in a composite monitoring architecture.

Category	Use	Tooling	Observability Pillar	MaC Coverage
Application performance monitoring	Investigate the behaviour of your application at the service level. Determine where calls are going and how they perform.	Prometheus/Grafana	Metrics	✅
Infrastructure Monitoring	Determine the health and performance of the containers, environment and managed services your applications run on. In AWS CloudWatch namespace provide	Prometheus/Grafana (via AWS CloudWatch scrape)	Metrics	✅
Real user monitoring	Understand the experience of real users by collecting data from browsers about how your site performs and looks.	Dynatrace RUM	Traces	❌
Synthetic monitoring	Allows you to test and measure the experience of your web application by simulating traffic with set test variables.	Dynatrace Synthetics / Pingdom	Metrics	❌
Alerting	Handles alerts sent by client applications, deduplicating, grouping, and routing them to the correct receiver integration.	Prometheus Alertmanager	Metrics	✅
Log Capture, Aggregation, Viewer	Aggregate, manage and analyse logs generated from your application and infrastructure. Troubleshoot the why behind what.	Elasticsearch (ELK) / Splunk	Logs	❌
Incident Response / Ticketing	IT Service Management tooling for Incident, Problem and Change Management	ServiceNow	Metrics	✅

Non-functional Requirements

Release Management Process

This monitoring framework does not get deployed into an environment but is instead executed from within a pipeline. MaC automation pipelines are detailed in the contribution guide.

Accessibility

N/A at the moment, as we are using Grafana's out-of-the-box dashboards. We will review this periodically as Grafana releases updates with enhanced accessibility features.

Recovery process

N/A. This is a metrics-based monitoring tool that uses existing Prometheus and Grafana products, so no additional policies or processes are required and existing practices apply.