Questions tagged [system-reliability]
25 questions
74
votes
19 answers
How come compilers are so reliable?
We use compilers on a daily basis as if their correctness is a given, but compilers are programs too, and can potentially contain bugs. I always wondered about this infallible robustness. Have you ever encountered a bug in the compiler itself? What…
EpsilonVector
65
votes
31 answers
Why isn't software as reliable as a car?
I had a user ask me this question. We know that cars break down, but that's because of something physical (unless software is involved!).
I tried to answer that software is a much younger industry, but the user countered with "didn't the automobile…
Alex Angas
24
votes
7 answers
Are the terms stable and reliable interchangeable?
Is there a difference between stability and reliability (at least in a software engineering context), or can they be used interchangeably? If not, what would be some examples of reliable but not necessarily stable systems, and vice versa?
gsakkis
15
votes
7 answers
What special considerations are needed when designing databases to hold financial records?
I hope this question isn't too broad. In the future I may need to add some accounting and financial-tracking systems to some applications (mostly web-based applications, but my question pertains to desktop apps as well).
Now, creating a simple…
Joshua Carmody
7
votes
3 answers
Best practices for Heartbeat in distributed systems
In the past, our system had an external data provider (call it the source) sending regular heartbeats to a Java application (call it the client). If the heartbeat failed, the system shut itself down (to avoid serving stale data in a critical application).…
senseiwu
6
votes
1 answer
Need to re-build an application - how?
For our main system, we have a small monitor application that sits outside our network and periodically tries to log in to verify the system still works. We have a problem with the monitor though in that the communications component set (Asta 3…
Tom A
5
votes
3 answers
Reliability vs Fault Tolerance
I am confused by the following terms: Reliability and Fault Tolerance
According to the book Designing Data-Intensive Applications, the definition of Reliability is:
The system should continue to work correctly (performing the correct
function at the…
Raghav Goyal
3
votes
1 answer
Defining SLI / SLO for ETL and Reporting Application
All,
We've just started on our SRE journey and are trying to define SLI / SLO for our application.
It is an ETL application where 1. feeds (e.g. start-of-day and end-of-day data feeds) come from various upstream sources and get loaded with some transformation. 2.…
Ravi Parekh
3
votes
3 answers
How to prevent bugs in business-level configurations with similar discipline as in source code?
We have a system that allows our clients to coordinate people (shoppers) so that they can deliver groceries within 45 minutes of order creation.
Each client has a set of stores where the orders are processed by the shoppers
and each shopper…
3
votes
2 answers
From a software development lifecycle perspective, is duck-typing a benefit or a problem?
Statically-typed languages such as Java afford the benefit of compile-time checking of types - you are guaranteed that an object is of a given type, so:
there is no need to spend time and resources investigating the TYPE of a variable or parameter,…
Ahmed Tawfik
2
votes
2 answers
When should I be worried about time-of-check to time-of-use vulnerabilities during database queries?
I am having difficulty understanding when I should be worried about TOCTOU vulnerabilities and how to avoid them. Yes, we can use database transactions, but there are different levels of transactions, and using the one which is the safest…
Alessandro
2
votes
0 answers
Running a high availability PostgreSQL cluster on native AWS services only
Backstory: I am unable to use RDS, as I need to install cartridges in my PostgreSQL instances.
I have been trying to pin down an architecture for PostgreSQL running on EC2 instances for a few days. Most information I could find online uses separate…
tjwoon
1
vote
3 answers
Windows Hibernate API
Is it possible to programmatically trigger Windows Hibernate without actually hibernating, just to take a snapshot of the OS at regular intervals, so that the system can return to the previously saved state in case of any failure?
This is much…
1
vote
2 answers
Grafana - detecting abnormal behavior of applications
We have Grafana on premises with Prometheus.
Some anomalies can be detected by viewing a set of charts (slow requests, retries, pending transactions, etc.). SRE operators need to have the opportunity to see all information in one Grafana widget instead…
Vasin Yuriy
1
vote
1 answer
What is the crux of the difference between N-version programming and self-monitoring architecture?
Source: https://cs.ccsu.edu/~stan/classes/CS410/Notes16/11-ReliabilityEngineering.html
This is the self-monitoring architecture. Here, computations are carried out across 2 channels; if they both provide the same result, then the system is operating correctly, else…
cuajiu