For questions about Site Reliability Engineering (SRE).
Questions tagged [sre]
25 questions
68
votes
4 answers
What is the difference between SRE and DevOps?
Site Reliability Engineering and Development Operations seem to overlap a lot in detail. How do I know which group is responsible for what, and how do I know what jobs would be appropriate for my…
jcolebrand
- 1,275
- 10
- 14
20
votes
4 answers
What processes or tools enable Segregation of Duties when engineers both deploy and run code?
In highly regulated environments, such as the Financial Services sector, Segregation of Duties is an essential mechanism for avoiding collusion amongst individuals with development responsibilities and production privileges.
Traditionally this has…
Richard Slater
- 11,747
- 7
- 43
- 82
16
votes
5 answers
Is scrum or kanban really useful for SRE teams?
Agile practices like scrum and kanban were primarily designed for software development.
Interrupt and unplanned work is a significant component of what most SRE (Site Reliability Engineering) or DevOps teams do. While it is always useful to use a…
codeforester
- 391
- 1
- 5
- 29
15
votes
1 answer
What is the difference between the traditional Development and Operations Model and Site Reliability Engineering?
"SRE is what happens when you ask a software engineer to design an operations team." – Site Reliability Engineering
Since Google's Site Reliability Engineering Book was released, on more than one occasion I have been told that SRE is an extension…
Richard Slater
- 11,747
- 7
- 43
- 82
11
votes
3 answers
What is a reasonable level of coding expertise expected of an SRE?
SREs and coding is a contentious topic. In most situations, SREs end up spending a lot of time on operational tasks, even while being focused on efficiency and automation. This means that SREs don't spend as much time coding as a full-time developer,…
codeforester
- 391
- 1
- 5
- 29
10
votes
3 answers
Prometheus Alertmanager - how to silence all alerts for a given period during maintenance?
A work scenario that I can't currently cover is a maintenance mode: I want all alerts received from Prometheus to be ignored, and I want to be able to set this through the UI for a given period until the maintenance finishes. One way to do…
anVzdGFub3RoZXJodW1hbg
- 482
- 2
- 7
- 15
9
votes
1 answer
How to calculate burn rate for SLOs?
I've read the Google SRE book a few times but I need some clarifications on exactly how to set up the burn rate and understanding how long it'll take to trigger an alert.
Most of my questions are specifically from this section on the book:…
BlueChips23
- 193
- 1
- 4
9
votes
2 answers
Datapoints motivating introduction of SRE in organisation
As there is no dedicated Site Reliability Engineering Stack Exchange, I found this to be the closest one.
There are multiple great resources to use as inspiration for a slide deck about SRE principles [SRE slides].
Still, I can't find:…
Grzegorz Wierzowiecki
- 191
- 1
- 3
9
votes
3 answers
DevOps vs SRE vs Production Support Engineers
DevOps primarily focuses on Delivery Speed and SRE focuses on Reliability in production, but where do Production Support Engineers fit, who also focus on production monitoring, alerting, performance, user experience, incident management, RCA and…
Sam
- 91
- 1
- 1
- 3
7
votes
5 answers
What is the future of DevOps in the age of serverless?
My senior recently argued that DevOps-related roles such as DevOps Engineer, SRE, etc., will be less relevant once serverless matures. He pointed me to "Containers won the battle, but will lose the war to serverless".
Can you see less and less DevOps…
Flowingover
- 81
- 2
6
votes
1 answer
Implications of introducing Docker to the development team
There always tend to be moral implications of certain critical decisions taken for development, whether or not we realise that the decision is critical. Say, for example, switching an entire stack for a team of developers or enforcing them…
Abhay Pai
- 303
- 2
- 5
4
votes
4 answers
How can I convince someone that 100% reliability is not the right target for anything?
It's a fundamental principle of DevOps and SRE that failure is normal and setting a goal of perfect reliability is misguided. But sometimes IT execs and business leaders push back on the idea. They believe the business needs 100% reliability. What…
Vivek Rau
- 91
- 5
3
votes
1 answer
How is 'self-healing' to be reconciled with Infrastructure as Code?
As a relative newcomer to the developments happening in Operations - DevOps, SRE, etc. - I'm struggling with a big-picture problem. The following is a specific example, using a network load-balancer.
From an IaC perspective, I take it that we should…
Yellowfog
- 139
- 2
3
votes
0 answers
Grafana K8s microservices availability dashboard based on actuator uptime
I need to create a dashboard that uses the Micrometer Prometheus Spring Actuator endpoint to show all services for a given namespace in green or red boxes depending on status. The boxes should contain the service names.
Now I am trying to…
anVzdGFub3RoZXJodW1hbg
- 482
- 2
- 7
- 15
2
votes
1 answer
What are best practices for setting SLOs for a new service?
I am guiding teams setting SLOs for their services and a question came up about setting SLOs for services that are still in design.
Are there standards for setting SLOs for services that don't exist yet?
Or is the best practice to start the service…
Jay Athalye
- 23
- 3