When does bugfixing become overkill, if ever?

Question

Imagine you are creating a video player in JavaScript. This video player loops the user's video repeatedly. Each time a new loop begins, the player runs a recursive function that calls itself N times, N being the number of times the video has looped. Because of that, the browser will eventually throw a stack overflow error (reported as "too much recursion" or as a RangeError, depending on the engine).

Probably no one will use the loop feature that much. In practice your application will never throw this error, not even if the user leaves it looping for a week, but the bug still exists. Solving the problem will require you to redesign the way looping works in your application, which will take a considerable amount of time. What do you do? Why?

Fix the bug
Leave the bug

Shouldn't you only fix bugs people will stumble upon? When does bugfixing become overkill, if it ever does?

score 165 · Accepted Answer · answered Oct 17 '16 at 05:36

You have to be pragmatic.

If the error is unlikely to be triggered in the real world and the cost to fix is high, I doubt many people would consider it a good use of resources to fix. On that basis I'd say leave it but ensure the hack is documented for you or your successor in a few months (see last paragraph).

That said, you should use this issue as a "learning experience" and the next time you do looping do not use a recursive loop unnecessarily.
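As a rough illustration of that lesson, here is a minimal sketch contrasting the two shapes. The function names and the per-loop work are hypothetical, since the question does not show the real player code. The only point is that recursion depth grows with the loop count, while a plain loop keeps the stack flat.

```javascript
// Hypothetical stand-in for whatever the player does once per pass.
function doPerLoopWork(state) {
  state.calls += 1;
}

// Recursive shape from the question: stack depth grows with n,
// so a large enough n eventually exhausts the call stack.
function runRecursive(n, state) {
  if (n === 0) return;
  doPerLoopWork(state);
  runRecursive(n - 1, state);
}

// Iterative shape: same amount of work, constant stack depth.
function runIterative(n, state) {
  for (let i = 0; i < n; i += 1) {
    doPerLoopWork(state);
  }
}

const recursiveState = { calls: 0 };
runRecursive(100, recursiveState); // fine for small n

const iterativeState = { calls: 0 };
runIterative(1000000, iterativeState); // fine for any n
```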

Also, be prepared for that bug report. You'd be amazed how good end users are at pushing against the boundaries and uncovering defects. If it does become an issue for end users, you're going to have to fix it - then you'll be glad you documented the hack.

score 80 · Answer 2 · answered Oct 17 '16 at 15:40

There was a similar bug in Windows 95 that caused computers to crash after 49.7 days. It was only noticed some years after release, since very few Win95 systems stayed up that long anyway. So there's one point: bugs may be rendered irrelevant by other, more important bugs.
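The 49.7-day figure is what you get when an unsigned 32-bit counter of milliseconds wraps around, which is the commonly cited cause of that bug. A quick check of the arithmetic (the variable names are mine):

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000; // 86,400,000 ms in a day
const wrapMs = 2 ** 32;                 // 4,294,967,296 values in a 32-bit counter
const daysUntilWrap = wrapMs / MS_PER_DAY;

console.log(daysUntilWrap.toFixed(1)); // prints "49.7"
```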

What you have to do is a risk assessment for the program as a whole and an impact assessment for individual bugs.

Is this software on a security boundary?
If so, can this bug result in an exploit?
Is this software "mission critical" to its intended users? (See the list of things the Java EULA bans you from using it for)
Can the bug result in data loss? Financial loss? Reputational loss?
How likely is this bug to occur? (You've included this in your scenario)

And so on. This affects bug triage, the process of deciding which bugs to fix. Pretty much all shipping software has very long lists of minor bugs which have not yet been deemed important enough to fix.

score 33 · Answer 3 · answered Oct 17 '16 at 15:32

The other answers are already very good, and I know your example is just an example, but I want to point out a big part of this process that hasn't been discussed yet:

You need to identify your assumptions, and then test those assumptions against corner cases.

Looking at your example, I see a couple assumptions:

The recursive approach will eventually cause an error.
Nobody will see this error because videos take too long to play to reach the stack limit.

Other people have discussed the first assumption, but look at the second assumption: what if my video is only a fraction of a second long?

And sure, maybe that's not a very common use case. But are you really sure that nobody will upload a very short video? You're assuming that videos are a minimum duration, and you probably didn't even realize you were assuming anything! Could this assumption cause any other bugs in other places in your application?
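To make that concrete, here is a back-of-the-envelope sketch. It assumes the recursion depth equals the loop count, as in the question, and it assumes a stack limit of 10,000 frames. That limit is a made-up round number; real limits vary by browser and by how much each frame holds.

```javascript
// ASSUMPTION: 10,000 frames is an illustrative limit, not a measured one.
const ASSUMED_MAX_DEPTH = 10000;

// Hours of continuous looping before the depth reaches the assumed limit.
function hoursUntilOverflow(videoSeconds, maxDepth = ASSUMED_MAX_DEPTH) {
  return (videoSeconds * maxDepth) / 3600;
}

console.log(hoursUntilOverflow(180)); // three-minute clip: 500 hours, about three weeks
console.log(hoursUntilOverflow(0.5)); // half-second clip: under an hour and a half
```

Under those assumptions the same bug goes from "never seen" to "seen before lunch" just by changing the video length.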

Unidentified assumptions are a huge source of bugs.

Like I said, I know that your example is just an example, but this process of identifying your assumptions (which is often harder than it sounds) and then thinking of exceptions to those assumptions is a huge factor in deciding where to spend your time.

So if you find yourself thinking "I shouldn't have to program around this, since it will never happen" then you should take some time to really examine that assumption. You'll often think of corner cases that might be more common than you originally thought.

That being said, there is a point where this becomes an exercise in futility. You probably don't care if your JavaScript application works perfectly on a TI-89 calculator, so spending any amount of time on that is just wasted.

The other answers have already covered this, but coming up with that line between "this is important" and "this is a waste of time" is not an exact science, and it depends on a lot of factors that can be completely different from one person or company to another.

But a huge part of that process is first identifying your assumptions and then trying to recognize exceptions to those assumptions.

Vladimir Stokic · Answer 4 · answered Oct 17 '16 at 12:22

I would recommend that you read the following paper:

Dependability and Its Threats: A Taxonomy

Among other things, it describes various types of faults that can occur in your program. What you described is called a dormant fault, and in this paper it is described like this:

A fault is active when it produces an error, otherwise it is dormant. An active fault is either a) an internal fault that was previously dormant and that has been activated by the computation process or environmental conditions, or b) an external fault. Fault activation is the application of an input (the activation pattern) to a component that causes a dormant fault to become active. Most internal faults cycle between their dormant and active states

Having described this, it all boils down to a cost-benefit ratio. The cost would consist of three parameters:

How often would the issue present itself?
What would the consequences be?
How much does it bother you personally?

The first two are crucial. If it is some bug that would manifest itself once in a blue moon and/or nobody cares about it, or it has a perfectly good and practical workaround, then you can safely document it as a known issue and move on to more challenging and more important tasks. However, if the bug would cause a money transaction to fail, or interrupt a long registration process, thus frustrating the end user, then you have to act on it. The third parameter is something I strongly advise against. In the words of Vito Corleone:

It's not personal. It's business.

If you are a professional, leave the emotions aside and act optimally. However, if the application you are writing is a hobby of yours, then you are emotionally involved, and the third parameter is as valid as any in terms of deciding whether to fix a bug or not.

score 11 · Answer 5 · answered Oct 17 '16 at 14:34

That bug only stays undiscovered until the day someone puts your player on a lobby screen running a company presentation 24/7. So it's still a bug.

The answer to "What do you do?" is really a business decision, not an engineering one:

If the bug only impacts 1% of your users, and your player lacks support for a feature required by another 20%, the choice is obvious. Document the bug, then carry on.
If the bugfix is on your todo list, it's often better to fix it before you start adding new features. You'll get the benefits of a zero-defect software development process, and you won't lose much time since it's on your list anyway.

score 5 · Answer 6 · edited Oct 17 '16 at 19:41

There are actually three errors in the situation you describe:

(1) The lack of a process to evaluate all logged errors (you did log the error in your ticket/backlog/whatever system you have in place, right?) to determine whether it should be fixed or not. This is a management decision.
(2) The lack of skills in your team that leads to the use of faulty solutions like this. It is urgent to address this to avoid future problems. (Start learning from your mistakes.)
(3) The problem that the video may stop displaying after a very long time.

Of the three errors only (3) might not need to be fixed.

score 5 · Answer 7 · answered Oct 17 '16 at 11:58

Especially in big companies (or big projects) there's a very pragmatic way to establish what to do.

If the cost of the fix is greater than the return that the fix will bring, then keep the bug. Conversely, if the fix will return more than it costs, then fix the bug.

In your sample scenario it depends on how many users you expect to lose vs how many users you will gain if you develop new features instead of correcting that expensive bug.

score 5 · Answer 8 · answered Oct 17 '16 at 16:26

tl;dr This is why RESOLVED/WONTFIX is a thing. Just don't overuse it - technical debt can pile up if you're not careful. Is this a fundamental problem with your design, likely to cause other problems in the future? Then fix it. Otherwise? Leave it be until it becomes a priority (if it ever does).

score 4 · Answer 9 · answered Oct 17 '16 at 20:14

There are lots of answers here discussing how to weigh the cost of fixing the bug against the cost of leaving it. They all contain good advice, but I'd like to add that the cost of a bug is often underestimated, possibly hugely underestimated. The reason is that existing bugs muddy the waters for continued development and maintenance. Making your testers keep track of several "won't fix" bugs while navigating your software trying to find new bugs makes their work slower and more prone to error. A few "won't fix" bugs that are unlikely to affect end users will still make continued development slower, and the result will be buggier.

score 2 · Answer 10 · answered Oct 18 '16 at 07:39

One thing I've learned in my years of coding is that a bug will come back. The end user will always discover it and report back. Whether you will fix the bug or not is "merely" a priority and deadline matter.

We've had major bugs (major in my opinion) that were decided against fixing in one release, only to become a show stopper for the next release because the end user stumbled upon them over and over again. The reverse happens too: we were pushed to fix a bug in a feature that nobody uses, but it was handy for management to see.

score 2 · Answer 11 · edited Oct 20 '16 at 10:58

There are three things here:

Principles

This is one side of the coin. To some extent, I feel it is good to insist on fixing bugs (or bad implementations, even if they "work"), even if nobody is noticing it.

Look at it this way: the real problem is not necessarily the bug, in your example, but the fact that a programmer thought it was a good idea to implement the loop in this fashion, in the first place. It was obvious from the first moment, that this was not a good solution. There are now two possibilities:

The programmer just did not notice. Well... a programmer should develop an intuition of how his code runs. It is not like recursion is a very difficult concept. By fixing the bug (and sweating through all the additional work), he maybe learns something and remembers it, if only to avoid the additional work in the future. If the reason was that he just did not have enough time, management might learn that programmers do need more time to create higher quality code.
The programmer did notice, but deemed it "not a problem". If this is left to stand, then a culture of laissez-faire is developed that will, ultimately, lead to bugs where it really hurts. In this particular case, who cares. But what if that programmer is developing a banking application next time, and decides that a certain constellation will never happen. Then it does. Bad times.

Pragmatism

This is the other side. Of course you would likely, in this particular case, not fix the bug. But watch out - there is pragmatism, and then there is pragmatism. Good pragmatism is when you find a quick yet solid, well-founded solution for a problem. I.e., you avoid overdesigning stuff, but the things you actually implement are still well thought out. Bad pragmatism is when you just hack something together which works "just so" and will break at the first opportunity.

Fail fast, fail hard

If in doubt, fail fast and fail hard.

This means, among other things, that your code notices the error condition, not the environment.

In this example, the least you can do is to make it so the hard runtime error ("stack depth exceeded" or something like that) does not occur, by replacing it with a hard exception of your own. You could, for example, have a global counter and arbitrarily decide that you bail out after 1000 videos (or whatever number is high enough never to occur in normal use, and low enough to still work in most browsers). Then give that exception (which can be a generic exception, e.g. a RuntimeException in Java, or a simple string in JavaScript or Ruby) a meaningful message. You do not have to go as far as creating a new exception type or whatever the equivalent is in your particular programming language.

This way, you have

...documented the problem inside the code.
...made it a deterministic problem. You know that your exception will happen. You are not at the whim of changes in the underlying browser technology (think about not only PC browser, but also smartphones, tablets or future tech).
...made it easy to fix it when you eventually do need to fix it. The source of the problem is pointed out by your message, you will get a meaningful backtrace and all that.
...still wasted no time doing "real" error handling (remember, you never expect the error to occur).

My convention is to prefix such error messages with the word "Paranoia:". This is a clear sign to me and everybody else that I never expect that error to pop up. I can clearly separate them from "real" exceptions. If I see one like that in a GUI or a logfile, I know for sure that I have a serious problem - I never expected them to occur, after all. At this point I go into crunch mode (with a good chance of solving it quickly and rather easily, as I know exactly where the problem occurred, saving me a lot of spurious debugging).
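A minimal sketch of such a guard in JavaScript, under a few assumptions: the names MAX_LOOPS, loopCount and onLoopStart are hypothetical, 1000 is the arbitrary ceiling mentioned above, and I throw an Error object rather than a plain string so that a stack trace comes along with the message.

```javascript
// ASSUMPTION: 1000 is an arbitrary ceiling, high enough for normal use.
const MAX_LOOPS = 1000;
let loopCount = 0; // global counter of completed loops

// Hypothetical hook the player would call each time a new loop begins.
function onLoopStart() {
  loopCount += 1;
  if (loopCount > MAX_LOOPS) {
    throw new Error(
      `Paranoia: video looped more than ${MAX_LOOPS} times; ` +
      'the recursive loop handler would eventually overflow the stack.'
    );
  }
  // ... the existing (recursive) per-loop logic would go here ...
}
```

The failure now happens at a known loop count with a known message, instead of wherever a particular browser happens to run out of stack.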

score 1 · Answer 12 · answered Oct 17 '16 at 22:46

A post-it on a senior developer's desk at my workplace says

Does it help anyone?

I think that's often a good starting point for the thought process. There are always lots of things to fix and improve - but how much value are you actually adding? ...whether that's in usability, reliability, maintainability, readability, performance... or any other aspect.

score 0 · Answer 13 · answered Oct 19 '16 at 13:04

Three things come to mind:

First, the impact of an identified bug needs to be thoroughly investigated before the decision to leave the bug in the code can be made in a responsible manner. (In your example I at once thought about the memory leak the ever-growing stack represents and which might make your browser slower and slower with each recursion.) This thorough investigation often takes longer than fixing the bug would, so I'd prefer fixing the bug in most cases.

Second, bugs have a tendency to have more impact than one thinks at first. We are all very familiar with working code because this is the "normal" case. Bugs on the other hand are an "exception". Of course, we all have seen lots of bugs, but we have seen way more working code overall. We therefore have more experience with how working code behaves than with how buggy code behaves. There are gazillions of books about working code and what it will do in which situations. There are close to none about the behavior of buggy code.

The reason for this is simple: Bugs are not order but chaos. They often have a trace of order left in them (or put it the other way round: They don't destroy the order completely), but their buggy nature is a destruction of the order the programmer wanted. Chaos itself tends to defy being estimated correctly. It is way harder to say what a program with a bug will do than what a correct program will do just because it does not fit into our mental patterns anymore.

Third, your example contained the aspect that fixing the bug would mean having to redesign the program. (If you strip this aspect, the answer is simple: Fix the bug, it should not take too long because no redesign is necessary. Otherwise:) In such a case I'd lose trust in the program the way it currently is designed. The redesign would be a way to restore that trust.

All that said, programs are things which people use, and a missing feature or a second, really cumbersome bug elsewhere can have priority over fixing your bug. Of course then take the pragmatic way and do other things first. But never forget that a first quick estimation of the impact of a bug can be utterly wrong.

score 0 · Answer 14 · edited Jun 16 '20 at 10:01

Low probability / Mild consequences = Low priority fix

If the probability of occurrence is very low
If the consequences of the occurrence are mild
Then the bug does not pose a threat, and so it is not a priority fix.

But this cannot become a crutch for lazy developers...

What does "very low occurrence" even mean?
What does "mild consequences" even mean?

To state that the probability of occurrence is very low and the consequences are mild, the development team must understand the code, the usage patterns and security.

Most developers are surprised that things they originally thought would never happen actually happen a lot

Our educational system doesn't teach probability and logic very well. Most people, including most software engineers, have broken logic and a broken intuition for probability. Experience with real-world problems and experience with extensive simulations are the only ways I know to fix this.

Confront your intuition with real world data

It is important to log extensively so that you can follow the usage patterns. Fill the code with assertions about things you think should not happen. You will be surprised to find that they do. That way you will be able to confront your intuition with hard data and refine it.
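One possible shape for such assertions is a "soft" check that counts violations instead of crashing, so production traffic can tell you how often the "impossible" happens. The helper name expectNever, the label text and the sample durations below are all hypothetical.

```javascript
// Hypothetical helper: records violated expectations instead of throwing.
const violationCounts = new Map();

function expectNever(condition, label) {
  if (condition) {
    violationCounts.set(label, (violationCounts.get(label) || 0) + 1);
    console.warn(`Unexpected: ${label}`);
  }
  return !condition;
}

// Example expectation: no video is shorter than one second.
function registerVideo(durationSeconds) {
  expectNever(durationSeconds < 1, 'video shorter than 1 second');
}

// Made-up sample durations, in seconds.
[120, 0.4, 35, 0.2].forEach(registerVideo);
console.log(violationCounts.get('video shorter than 1 second')); // prints 2
```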

My example of a mild problem and a measure of control

On an e-commerce site I worked on a long time ago, another programmer made a mistake: under some obscure condition the system debited the client one cent less than was registered in the logs. I discovered the bug because I had written reports to identify differences between the logs and the account balances, to make the accounting system more resilient. I never fixed this bug because the difference was very small. It was calculated daily and came to less than US$ 2.00 per month. It so happened that we were developing an entirely new system that was due to replace the current one within a year. It makes no sense to divert resources from a potentially profitable project to fix something that costs US$ 2.00 per month and is subject to an appropriate control measure.
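A control of that kind can be very small. The sketch below is not the original report; the record shape, the sample records and the 200-cent threshold are illustrative assumptions. It sums the difference between logged and debited amounts and flags the period if the total drifts past the limit.

```javascript
// All amounts are integer cents to avoid floating-point rounding.
// ASSUMPTION: the record shape and the threshold are illustrative only.
function discrepancyCents(records) {
  return records.reduce((sum, r) => sum + (r.loggedCents - r.debitedCents), 0);
}

function checkControl(records, thresholdCents) {
  const diff = discrepancyCents(records);
  return { diff, withinLimit: Math.abs(diff) <= thresholdCents };
}

// Made-up sample data for one day.
const sampleDay = [
  { loggedCents: 1999, debitedCents: 1999 },
  { loggedCents: 2500, debitedCents: 2499 }, // debited one cent short
  { loggedCents: 450, debitedCents: 450 },
];

console.log(checkControl(sampleDay, 200)); // { diff: 1, withinLimit: true }
```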

Conclusion

Yes, there are bugs that do not need to be fixed right away, that are not important enough to delay new feature development. However, the system must monitor the occurrence of such a bug to make sure its impact stays small, because we cannot trust our own intuition.

score -1 · Answer 15 · answered Oct 20 '16 at 21:50

I think this is asking the wrong question from the start.

The question isn't "should I fix this bug or should I not fix this bug". Any developer has a limited amount of time. So the question is "what is the most useful thing that I can do in one hour, or four hours, or a week".

If fixing that bug is the most useful thing to do, because it improves the software by the largest amount for most people, then fix the bug. If you could make bigger improvements elsewhere, by adding features that people are missing, or by fixing a more important bug, then do these other things. And extra bonus points for anything that makes your future development more efficient.