I've come to the point where I am starting to adopt principles from Site Reliability Engineering to operate Cloud Native applications in a modern way.

From my reading to date, I have established that one of the core practices is monitoring. SRE approaches monitoring by defining Service Level Indicators (SLIs), each of which measures an aspect of the end-user experience, and then setting a target threshold for each SLI, which is called a Service Level Objective (SLO).

Is there a standard or common approach to defining Service Level Indicators?

Richard Slater

4 Answers

Whilst not strictly a standard approach, Google has published an SLI Menu and a process for developing SLIs for user journeys:

  1. For each user journey or data flow, identify suitable types of SLI from the SLI Menu:

    [Image: The SLI Menu]

  2. Decide how to measure good events and valid events.

  3. Decide where to measure the SLI, choosing from the following: End User, Client-side Instrumentation, Synthetic Clients, Front-end Metrics, Application Metrics or Server-side Logging.

The document then describes how to collate the SLIs for all of your user journeys and walk each journey to look for coverage gaps. Finally, you set SLOs on top of the SLIs, based on either business need or past performance.
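To make the "good events" and "valid events" idea concrete, here is a minimal Python sketch of an availability SLI computed as the ratio of good events to valid events and then compared against an SLO. The rule that 4xx responses are not valid events, the function names and the 99.5% target are all assumptions chosen for illustration; they are not taken from Google's document.

```python
from typing import Callable, Iterable, Optional


def ratio_sli(
    events: Iterable[int],
    is_valid: Callable[[int], bool],
    is_good: Callable[[int], bool],
) -> Optional[float]:
    """Return good/valid as a fraction, or None when there are no valid events."""
    valid = [e for e in events if is_valid(e)]
    if not valid:
        return None
    good = [e for e in valid if is_good(e)]
    return len(good) / len(valid)


# Assumption for this sketch: an event is an HTTP status code, client
# errors (4xx) are excluded as "not valid", and any valid response that
# is not a 5xx counts as "good".
def is_valid_request(status: int) -> bool:
    return not (400 <= status < 500)


def is_good_request(status: int) -> bool:
    return status < 500


def meets_slo(sli: Optional[float], slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli is not None and sli >= slo_target


if __name__ == "__main__":
    # Hypothetical sample: 997 successes, 3 server errors, 10 client errors.
    statuses = [200] * 997 + [500] * 3 + [404] * 10
    sli = ratio_sli(statuses, is_valid_request, is_good_request)
    print(f"availability SLI = {sli:.4f}, meets 99.5% SLO: {meets_slo(sli, 0.995)}")
```

The same `ratio_sli` helper can be reused for a latency SLI by passing a different `is_good` predicate, for example one that checks whether a request finished under a chosen latency threshold.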

Richard Slater

You have to define what matters most to the customers (or users) in a cross-functional manner, e.g. with dev, PM, Support, execs and SRE all involved.

For example, memory usage alone does NOT usually directly matter to the customers and most of the roles above. It does matter for capacity planning though -- so while it is not an application SLI/SLO, it may be important for devs/SRE and eventually execs (funding). There may be an internal SLI/SLO around keeping efficiency high.

A mobile application that takes too long to perform an operation, or that fails too often, is likely to negatively impact many customers, or a subset of customers that is very relevant to the business. These problems often turn out to be customer-facing, cross-functional issues, i.e. support tickets are filed, execs may get called, and SRE may be on call trying to resolve the issue and will need to loop in feature [dev]elopers.

Given all that, there is a need for cross-functional metrics (SLIs) and boundaries (SLOs) that represent customer pain or unhappiness. The absence of such common metrics tends to produce the following effect: "memory usage is low" (devs/SREs), "features have been shipped" (PM), "I didn't get a call" (execs), "users are not happy" (Support).

Google has also published their workshop (under CC-BY 4.0) on how to define SLIs and SLOs: https://cloud.google.com/blog/products/management-tools/learn-how-to-set-slos-for-an-sre-or-cre-practice

There is also a blog post on how to tune up the SLIs (and SLOs) over time: https://cloud.google.com/blog/products/management-tools/tune-up-your-sli-metrics-cre-life-lessons

Disclaimer: I work for Google.

I thought I'd add one point: SLIs/SLOs are not a monitoring tool in and of themselves.

They are the basis for proactive and continuous site reliability engineering. With only a set of SLIs/SLOs you will be able to assess service reliability over a prolonged period of time and plan around that (e.g. prioritize reliability-enhancing activities over new functionality in order to improve your SLI).

In order to add the monitoring aspect (i.e. early detection of a degradation of the service) you will need to add Burn Rate Alerts on top of your SLIs. Only with those will you have acceptable detection and reset times for monitoring potential issues. Burn Rate Alert implementation is covered in Chapter 5 of the SRE Workbook: https://sre.google/workbook/alerting-on-slos/
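As a rough illustration of what a burn rate alert evaluates, the Python sketch below computes the burn rate as the observed error ratio divided by the error ratio the SLO allows, and only pages when both a long and a short window exceed a threshold. The function names are hypothetical, and the default threshold of 14.4 is one example value (it corresponds to spending 2% of a 30-day error budget in one hour); treat both as assumptions to tune for your own service rather than as a reference implementation.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 means the budget would be exactly used up by the
    end of the SLO period; higher values exhaust it proportionally sooner.
    """
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed example threshold, tune per service
) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening,
    so the alert resets quickly once the service recovers.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical readings for a 99.9% SLO: 2% errors over both windows.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # Same long-window reading, but the short window has already recovered.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```

In practice the two error ratios would come from whatever system stores your SLI measurements; the sketch only shows the arithmetic that sits on top of them.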

Pierre.Vriens
Sophie

Keptn has a specification for SLIs and SLOs. OpenSLO also provides a slightly different take.

A. Gardner