9

As there is no dedicated Site Reliability Engineering Stack Exchange, I found this to be the closest one.

There are multiple great resources to use as inspiration for a slide deck about SRE principles [SRE slides].

Still, I can't find:

  • short
  • concise
  • examples
  • motivating spending resources to implement SRE in an organisation.

Most of what I have experienced in my professional life involved highly confidential cases and numbers. I am concerned that most of the numbers SREs know must remain internal, to be presented only within their corporations.

However, maybe you know of a study, or (preferably a set of) good examples of postmortems (even one at a time is fine), from which we could make strong arguments like "after introducing the SRE model into the organisation, the velocity of changes grew from n to m release pushes per x, with an increase in availability of y and a decrease in costs of z" (brainstorming), or other hard data points?

[SRE slides] - some examples:

P.S. If this question could be rephrased to better fit this site's guidelines, please suggest how in a comment and give me a chance to improve it. Otherwise, I would appreciate pointers to other, better platforms (though, e.g., reddit.com/r/sre did not make a great impression on me).

  • 1
    Chef.io has a bunch of resources, including 4 webinars about DevOps and speed which may interest you: https://www.chef.io/resources/. Some customer stories, like Rakuten's, could also give you some insights. That said, I don't know of a definitive hard-rule guide. – Tensibai Sep 21 '18 at 10:11
  • 3
    SRE Handbook is a great read for a team trying to implement SRE practices. – user9921 Sep 21 '18 at 01:40
  • The book Accelerate (Forsgren, Humble and Kim) does the same for DevOps, but some data points might be transferable, like a service's MTTR (mean time to recover). – Ta Mu Sep 26 '18 at 01:49

2 Answers

3

The types of numbers you're looking for might be hard to come across, because they're highly variable (even within one organization, they vary service-to-service and team-to-team, in my experience). The SRE Workbook is now available for free, and includes two case studies (chapter 3) that might be helpful. Also, New Relic's SRE eBook does a really good job of summarizing SRE in a concise way.

Another way to approach this would be to use what you know about your service today to create a risk assessment, and to estimate the downtime you could prevent if you had SRE and dev support to eliminate those risks.
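As a rough illustration of that kind of risk assessment, here is a minimal Python sketch. It assumes each risk can be described by how often it happens per year, how long it takes to detect and recover from, and what fraction of users it affects. The `Risk` class, the function names and all of the example numbers are hypothetical placeholders, not data from any real service; substitute your own estimates.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    """One hypothetical risk to a service, described by rough yearly estimates."""
    name: str
    incidents_per_year: float
    minutes_to_detect: float
    minutes_to_recover: float
    fraction_of_users_affected: float  # between 0.0 and 1.0

    def expected_bad_minutes(self) -> float:
        # Expected user-weighted downtime per year caused by this risk.
        outage_minutes = self.minutes_to_detect + self.minutes_to_recover
        return (self.incidents_per_year
                * outage_minutes
                * self.fraction_of_users_affected)


def total_expected_bad_minutes(risks):
    """Sum the expected yearly downtime over all listed risks."""
    return sum(r.expected_bad_minutes() for r in risks)


def estimated_availability(risks):
    """Availability implied by the listed risks alone (unlisted risks are ignored)."""
    return 1 - total_expected_bad_minutes(risks) / MINUTES_PER_YEAR


def rank_risks(risks):
    """Order risks by expected downtime, largest first, to show where effort pays off."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes(), reverse=True)


if __name__ == "__main__":
    # Made-up example estimates; replace them with numbers from your own incident history.
    risks = [
        Risk("bad config push", 6, 15, 45, 0.5),
        Risk("database failover", 2, 5, 55, 1.0),
        Risk("certificate expiry", 1, 30, 90, 1.0),
    ]
    for r in rank_risks(risks):
        print(f"{r.name}: {r.expected_bad_minutes():.0f} bad minutes/year")
    print(f"estimated availability: {estimated_availability(risks):.5f}")
```

The ranked list gives you a number per risk that you can put on a slide: eliminating the top item prevents roughly that many minutes of downtime per year, under the estimates you fed in.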

1

I have operated in both DevOps and Site Reliability Engineering organisations across several companies. I would say that SRE has the advantage of being far more concrete than DevOps.

  • DevOps emphasises principles and mindsets, for example, the three ways of DevOps: Systems Thinking, Amplification of Feedback Loops and a Culture of Continuous Experimentation and Learning. DevOps is more of an extension to Agile than a different operating model.

  • Site Reliability Engineering emphasises the specific approaches, metrics and measures that Google (and others) apply to achieve a high level of service availability and customer confidence. For example: the ratio of toil to improvements, quantitative risk analysis and mathematical approaches to SLIs and SLOs.
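To make "mathematical approaches to SLIs and SLOs" a bit more tangible, here is a small Python sketch of a request-based availability SLI, the error budget implied by an SLO target, and a toil ratio. The function names and the example figures are my own illustrative assumptions, not an official formula set from any tool.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of events that were good (e.g. successful requests)."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / total_events


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative once the budget is blown."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0
    allowed_bad = (1 - slo_target) * total_events  # bad events the SLO tolerates
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def toil_ratio(toil_hours: float, total_hours: float) -> float:
    """Share of a team's time spent on toil rather than engineering improvements."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 failures, 99.9% SLO target.
    print(availability_sli(999_500, 1_000_000))
    print(error_budget_remaining(0.999, 999_500, 1_000_000))
    # Hypothetical team week: 60 of 200 hours spent on manual, repetitive work.
    print(toil_ratio(60, 200))
```

Numbers like these are what make SRE easier to argue for than DevOps in general: a half-spent error budget or a measured toil share is something a team can track before and after a change.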

Because SRE implements DevOps, it's a little unfair to compare organisations that do one but not the other. I would therefore suggest that all of the content in Accelerate can just as easily be applied to Site Reliability Engineering; if you need peer-reviewed, data-driven analytics, start there.

Richard Slater