55

Let me explain what I mean.

I have made a complex PHP framework/library for my own use, highly polished over the years. I very aggressively log even the smallest notice and deal with it immediately as soon as it pops up, always trying to predict potential errors in my code so that they never occur even in rare situations, but rather get handled automatically before they are ever logged.

However, in spite of all my efforts, I inevitably wake up (such as today) to find that some third-party service has fiddled around with the format of one of the CSV data files that they provide on their website and which my system fetches and imports every day.

Then I get a flood of ugly PHP errors. Ouch.

Even though it looks scary at first, it's typically just a pretty simple fix, and it's typically really just ONE error, which cascades into tons of apparent errors because the chain of function calls "falls apart" as each one expects something that it no longer gets.

I fix the issue, clear the errors, re-run the logic, verify that it no longer causes any errors, and then it's fixed. For now. Until the same thing happens again, with some other part of the system.

I can personally "deal with" this, but it really bothers me when it comes to giving my system away for somebody else to run on their machines. If/when the same kind of thing happens for them, they will doubtlessly blame me and think I'm incompetent (which may be true).

But even for myself, this is quite annoying and makes me feel as if my system is very fragile, a house of cards waiting to fall apart, in spite of there normally being not a single little notice or warning logged during "normal operation".

Short of predicting every possible change and writing enormous amounts of extra "checking" code to verify that all data is always exactly what is expected, is there anything I can do to fix this general problem? Or is this like asking for a pill that cures any disease instantly?

Please don't get hung up on the fact that I mentioned PHP. I'd say this question applies regardless of the programming language or environment. It's really more of a philosophical question than a technical one, IMO.

I fear that the answer will be: "There is no way. You have to bite the bullet and verify, verify and verify everything all the time!"

11 Answers

100

An improvement would be to design your system to fail gracefully. If the first step of parsing a file fails, then stop with an error. Don't carry on passing bad data from one step to the next.

The other thing to check is that you are implementing the file handling correctly and robustly. CSV is quite complicated once you encounter quoted strings with embedded commas. If the supplier has actually changed the file format, then you should stop processing. If they have merely used a feature of CSV that you haven't implemented correctly, you need to fix that robustly.
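For instance, here is a minimal fail-fast sketch in PHP, assuming a hypothetical importDailyCsv() entry point and invented column names; fgetcsv() is PHP's built-in CSV parser and already copes with quoted fields containing embedded commas:

    <?php
    // Minimal fail-fast sketch: stop at the first sign of trouble instead
    // of passing bad data downstream. Column names and messages are
    // illustrative, not from the original post.
    function importDailyCsv(string $path): array
    {
        $handle = fopen($path, 'r');
        if ($handle === false) {
            throw new RuntimeException("Cannot open $path");
        }

        $header = fgetcsv($handle);   // handles quoted, embedded commas
        $expected = ['date', 'symbol', 'price'];   // hypothetical columns
        if ($header !== $expected) {
            fclose($handle);
            throw new RuntimeException(
                'CSV header changed: got [' . implode(',', (array)$header) . ']'
            );
        }

        $rows = [];
        while (($row = fgetcsv($handle)) !== false) {
            if (count($row) !== count($expected)) {
                fclose($handle);
                throw new RuntimeException('Malformed row: ' . implode(',', $row));
            }
            $rows[] = array_combine($header, $row);
        }
        fclose($handle);
        return $rows;
    }

One loud exception at the first bad line replaces the flood of cascaded errors the question describes.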

Simon B
  • 9,772
77

There was a popular blog post on this topic last year called Parse, don't validate. It's an excellent read that's difficult to paraphrase, but the essence is that you should get your input data into a format where illegal states are unrepresentable as soon as possible.

For reading from an external CSV file, following this advice would mean:

  • Use a proper CSV parsing library, not a regex or a split or something.
  • Use the header names, not a column number to get a specific field.
  • Put it into an object with only the fields you use, already validated so that ints are ints, dates are dates, etc. (see the sketch after this list).
  • Pass only that object down to the lower layers of the program. You know all the fields in there are valid.
  • Use your type system to your advantage as much as possible. I haven't written any PHP in decades, so I'm not familiar with its current capabilities, but I know it has improved in that area.
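As a rough illustration of the parsing step, here is a hedged PHP sketch (assuming PHP 8.1 for readonly promoted properties); the PriceRecord class and its fields are invented for the example:

    <?php
    // Hypothetical record type illustrating "parse, don't validate":
    // once constructed, a PriceRecord is known-good, so downstream code
    // never needs to re-check these fields.
    final class PriceRecord
    {
        public function __construct(
            public readonly DateTimeImmutable $date,
            public readonly string $symbol,
            public readonly float $price,
        ) {}

        /** @param array<string,string> $row a header-keyed CSV row */
        public static function fromCsvRow(array $row): self
        {
            $date = DateTimeImmutable::createFromFormat('Y-m-d', $row['date'] ?? '');
            if ($date === false) {
                throw new InvalidArgumentException('Bad date: ' . ($row['date'] ?? ''));
            }
            if (!is_numeric($row['price'] ?? null)) {
                throw new InvalidArgumentException('Bad price: ' . ($row['price'] ?? ''));
            }
            return new self($date, (string)($row['symbol'] ?? ''), (float)$row['price']);
        }
    }

Lower layers receive only PriceRecord objects, so a format change fails once, loudly, at the boundary.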

I generally expect the following from reputable data providers:

  • Make only backward-compatible changes if possible.
  • If not possible, provide some sort of version to indicate backward-incompatible changes.
  • Announce schema changes in advance, so I can test before they are needed.
  • If possible, provide the schema in a standard format I can use to automatically adapt my parsing in most cases.
  • If practical, allow me to customize what fields I am retrieving.

I don't know what sort of relationship you have with your data provider, but if they are not doing these things, I would try to influence them to start. If they are doing those things, make sure you are taking advantage of it.

Karl Bielefeldt
  • 148,830
10

There is no general solution that fixes this. When integrating with outside systems, you have very little control. From what you describe, you are already doing a lot of defensive programming, which is good. As others have mentioned, you also need to fail more gracefully: if a chain of operations requires data from an outside source, you need additional defensive programming to ensure downstream operations are not triggered when a failure occurs. End users should also be presented with a reasonable error message.

Beyond that, setting up automated integration tests between your application and the outside provider can help you find issues before they hit production. Many outside services have a "test" or "beta" environment where they deploy new releases. This allows you to identify breaking changes in their upcoming releases before they hit the provider's production environment (and therefore take down yours). Furthermore, any time a breaking change occurs, add it to your automated integration test suite to guard against that change going forward.
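A sketch of what such a test might look like with PHPUnit; the beta-feed URL and expected header are placeholders, and fetching a URL with fopen() assumes allow_url_fopen is enabled:

    <?php
    use PHPUnit\Framework\TestCase;

    // Hypothetical contract test, run on a schedule rather than per commit:
    // it fails as soon as the provider's beta feed changes shape, before
    // the change reaches their production feed.
    final class ProviderCsvContractTest extends TestCase
    {
        private const BETA_FEED = 'https://beta.example-provider.com/daily.csv'; // placeholder

        public function testHeaderHasNotChanged(): void
        {
            $handle = fopen(self::BETA_FEED, 'r');
            $this->assertNotFalse($handle, 'Beta feed unreachable');

            $header = fgetcsv($handle);
            fclose($handle);

            $this->assertSame(['date', 'symbol', 'price'], $header);
        }
    }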

When integrating with outside services, you absolutely must keep up to date on their changes. Consider subscribing to mailing lists or periodically checking their developer sites for upcoming releases. Integrating with external services is never something you can build and forget. You'll have continuing maintenance work to stay on top of this, which will include regular maintenance releases for your application and/or code.

8
  1. Validate your data early.

As soon as you can, check that your input falls within your required range.

  2. Fuzz test within the domain of your data.

Your system should seek to handle gracefully all data that passes validation. Fuzzing refers to generating random data within the range you are testing.

The fuzz data is on the border of nonsense, but matches the minimal structure required by your validator. If you find it hard to generate random data that passes validation, you might need to clean up your validation logic: make it more strict, or less strict. (A sketch of such a fuzz loop follows at the end of this answer.)

  3. Fuzz test your validators

Your system should sharply and reliably distinguish valid from invalid data.

  4. Fail early on invalid data

If your data doesn't pass validation, do not hobble along. Fail fast and fail gracefully.

Once you have invalid data, your assumption that your processing is meaningful has failed. Barging on regardless will both generate a flood of errors and can produce output that is not just missing, but wrong.

Garbage In, Garbage Out can only be prevented by detecting garbage and stopping before you generate garbage.
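Here is a minimal sketch of the fuzz loop from point 2, in PHP to match the question; validateRow() and handleRow() are hypothetical stand-ins for your own validator and processing code:

    <?php
    // Hypothetical fuzz loop: rows are random, borderline-nonsense data
    // that still satisfies the validator, so the handler must cope.
    function fuzzRow(): array
    {
        $symbols = ['AAA', 'ZZZ', '', '0', str_repeat('X', 255)];
        return [
            'date'   => date('Y-m-d', random_int(0, 2_000_000_000)),
            'symbol' => $symbols[array_rand($symbols)],
            'price'  => (string)(random_int(-100000, 100000) / 100),
        ];
    }

    for ($i = 0; $i < 10_000; $i++) {
        $row = fuzzRow();
        if (!validateRow($row)) {   // hypothetical validator (point 3: fuzz this too)
            continue;               // out of domain for this test
        }
        handleRow($row);            // hypothetical handler; must not throw or warn
    }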

Yakk
  • 2,209
5

When reading data from an external source (and that includes data written by your own application in a previous run), it is a given that sooner or later the data you read will not match exactly the data you expect.

If the format is specified externally, then the specification can change at any time. Besides that, the program generating the data could have a bug, or some glitch in storage or communication could corrupt the data.

This is an interoperability problem that has existed for as long as machines have communicated with each other, and it has given rise to the adage "Be strict in what you send, but lenient in what you receive": when producing data, adhere to the specified format as closely as you can, but when receiving data, try to make sense of it (without reporting an error) even if it does not exactly match the prescribed format.
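One hedged reading of "lenient in what you receive" for a CSV feed: absorb cosmetic variations before the strict parsing stage, so only genuine format changes cause failures. The function name and the particular tolerances are illustrative (and str_starts_with() requires PHP 8):

    <?php
    // Illustrative leniency: tolerate harmless formatting drift up front,
    // leaving the strict checks to catch real structural changes.
    function normalizeCsvLine(string $line): ?string
    {
        // Strip a UTF-8 byte-order mark that some exporters prepend.
        if (str_starts_with($line, "\xEF\xBB\xBF")) {
            $line = substr($line, 3);
        }
        $line = rtrim($line, "\r\n");               // tolerate CRLF vs LF endings
        return trim($line) === '' ? null : $line;   // skip blank lines entirely
    }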

2

The ultimate in "general solutions" is to treat your error-cascade problem not as a program design problem but as a specification problem: specifically, a missing or inadequate specification. Michael Jackson did this in 1975 in his book, "Principles of Program Design", which treats this subject thoroughly. Although the examples are written in COBOL, the principles are the same for processing linear sequences of inputs, whether it is tokens in a programming language, commands in a shell, a .csv file of billing entries for a job, or keystrokes in a word processor:

  1. Define the grammar of a valid input stream (valid input)
  2. Define the grammar for each kind of erroneous input stream (error input)
  3. All other input structures are by default "invalid"
  4. Define the program's response to valid input, creating test cases for each equivalence class of valid input
  5. Define the program's response to error input, creating test cases as before
  6. Define the program's response to invalid input, creating test cases as before
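As a rough illustration of steps 1 through 3, here is a hedged PHP sketch (PHP 8 for match and throw expressions); the line grammars are invented, and processLine()/logAndSkip() are hypothetical responses from steps 4 through 6:

    <?php
    // Hypothetical classifier: the specification enumerates valid and
    // known-error line shapes; everything else is invalid by default.
    function classifyLine(string $line): string
    {
        if (preg_match('/^\d{4}-\d{2}-\d{2},[A-Z]+,\d+(\.\d+)?$/', $line)) {
            return 'valid';    // matches the specified grammar
        }
        if (preg_match('/^\d{4}-\d{2}-\d{2},[A-Z]+,$/', $line)) {
            return 'error';    // a defined error form: missing price field
        }
        return 'invalid';      // everything else, by default
    }

    // Each equivalence class gets a specified response and its own test cases.
    foreach (file('feed.csv', FILE_IGNORE_NEW_LINES) as $line) {
        match (classifyLine($line)) {
            'valid'   => processLine($line),    // hypothetical
            'error'   => logAndSkip($line),     // hypothetical
            'invalid' => throw new RuntimeException("Unspecified input: $line"),
        };
    }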

What most of us often do (myself included) is let external actors teach us by example about error inputs (step 2 above) after we have deployed the system, and then react with a patch while mollifying unhappy users in the meantime. By treating this as a specification problem, you avoid this situation entirely.

Jackson shows program structures for responding to valid, error, and invalid data sequences, using COBOL. Of course, now we have all kinds of different programming constructs for handling errors, but defining the errors and your program's expected response to them helps you create a design which meets your needs rather than trying to play catch-up with an inadequate design.

In summary, there is a general solution, but it is at the specification level: define all the kinds of meaningful input you will provide meaningful responses to, and engineer for each of them. The rest are simply rejected with some sort of error indication.

1

Basically, I would argue you should write the checking code (maybe offer a "performance" mode that doesn't run the checks). I would recommend using assert statements to ensure that the input is in the expected format, with a comment next to each assert stating its semantic meaning. That way, when your code fails, it is obvious to an outside developer that your code has not failed due to an internal fault, but because its assumptions have been violated.
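In PHP that might look like the sketch below; the row shape is hypothetical. Conveniently, PHP's zend.assertions setting can disable asserts entirely in production, which gives you the "performance" mode for free:

    <?php
    // Hypothetical import step. The assert message documents the assumption,
    // so a failure clearly signals violated input rather than an internal bug.
    // With zend.assertions = -1 in production php.ini, asserts cost nothing.
    $row = fgetcsv($handle);
    assert(is_array($row) && count($row) === 3,
           'Expected a 3-column CSV row; the provider format may have changed');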

1

When you have a mature codebase, you have seen a lot of different error scenarios and implemented all the code needed to handle them appropriately.

This means that if your code encounters an unexpected error now, your world is broken (because it is something you have never seen before, or you would already have handled it), and the only sensible approach from here on is for your code to stop what it is doing and ask for emergency help.

Your cascading errors come from the fact that you are not prepared for this. If you aren't prepared, your code cannot be either.

I would suggest you read "Release it!" as it contains a lot of useful advice for writing more robust code. https://pragprog.com/titles/mnee2/release-it-second-edition/

0

Writing enormous amounts of extra "checking" code is unfortunately pretty much needed. The checking code is usually enough to help, because you can surface whatever change broke your code by printing what made it fail, where it failed. This is also useful to the user if they gave the program bad input. Failing with decent error messages during checks is the easiest way to debug bad input.

One way to validate data is to use a builder. You give the builder the pieces of data you have and then have it build an object from that data. The builder can generate fuzz data (Yakk's idea), or it can throw an error if any data is missing when you tell it to build. You can also add methods to the builder to check whether the data was fuzz-generated or is valid, and each piece of data fed into the builder can be validated on input, throwing a helpful error.
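A sketch of such a builder, in PHP to match the question (the class and field names are invented):

    <?php
    // Hypothetical builder: each setter validates its piece on the way in,
    // and build() refuses to construct an object from incomplete data.
    final class CsvRecordBuilder
    {
        private ?string $symbol = null;
        private ?float $price = null;

        public function symbol(string $symbol): self
        {
            if ($symbol === '') {
                throw new InvalidArgumentException('Empty symbol');
            }
            $this->symbol = $symbol;
            return $this;
        }

        public function price(string $raw): self
        {
            if (!is_numeric($raw)) {
                throw new InvalidArgumentException("Not a price: $raw");
            }
            $this->price = (float)$raw;
            return $this;
        }

        public function build(): CsvRecord   // hypothetical record class
        {
            if ($this->symbol === null || $this->price === null) {
                throw new LogicException('Missing fields: cannot build record');
            }
            return new CsvRecord($this->symbol, $this->price);
        }
    }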

Anticipating bad input is one way to deal with it, like you say in your question. You can write code that checks whether data should be treated as equal (a simple example being Hello and hello being the same word despite capitalization). Beyond simple cases, though, this is really something to wait for an actual error for. If a user really needs you to support a format, good checking code will give you an error message with the details, and then you can add support for the format they want. This can be easier said than done.

If you do need to add support, using a base interface can help if you would otherwise need to change a lot of code. Say one customer has a different CSV format: you can create code on top of the original interface that is labelled for that customer. With your CSV example, say one customer uses semicolons instead of commas; the base interface would deal with the common parts, and you can label the code on top as semicolonSeperatedValues or something like that (see the sketch below). This takes some thought as to what is needed in the base interface, and it comes with the disadvantage of a lot of refactoring if there are poor design choices early on, but it can help prevent duplicate code and bloated program files.
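A sketch of that idea in PHP; the names are invented, and note that fgetcsv() already accepts a delimiter argument, which may be all the variation you need here:

    <?php
    // Hypothetical base interface: customer-specific formats become small
    // subclasses instead of special cases scattered through the code.
    interface RecordSource
    {
        /** @return array<int, array<int,string>> raw rows */
        public function readAll(string $path): array;
    }

    abstract class DelimitedSource implements RecordSource
    {
        abstract protected function delimiter(): string;

        public function readAll(string $path): array
        {
            $handle = fopen($path, 'r');
            if ($handle === false) {
                throw new RuntimeException("Cannot open $path");
            }
            $rows = [];
            while (($row = fgetcsv($handle, 0, $this->delimiter())) !== false) {
                $rows[] = $row;
            }
            fclose($handle);
            return $rows;
        }
    }

    final class CommaSeparatedSource extends DelimitedSource
    {
        protected function delimiter(): string { return ','; }
    }

    // The customer-specific variant only swaps the delimiter.
    final class SemicolonSeparatedSource extends DelimitedSource
    {
        protected function delimiter(): string { return ';'; }
    }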

You can also ask the users of your software what their format is. If there is an error, make sure to print it together with the input that produced it, so they can fix the input; they can also send you a decent error message that helps you write more robust code.

As far as decent error messages go, as long as the error carries helpful information and doesn't give the user an ugly crash or exception, you are good. Going with the CSV example, if the user has a bad file, you should display an error that says which file, which line, and why that line is bad. Also, make sure not to change the state of the data you are reporting the error on; otherwise you will be left with a potentially very obscure bug and could confuse the user.

Try to avoid triggering exceptions. An example in Java is NullPointerException: you can pass null around, but unless you are checking for null everywhere it gets passed, eventually a NullPointerException will be thrown. Java offers a way to avoid this with empty containers (Optional). If a method you are using throws an exception, write code that will never trigger that exception.

Also, very important, do not ignore exceptions as a way of error handling. You will cover up what could potentially cause errors far away from their source.

Minimizing variable scope also helps with errors. Having a global variable that multiple parts of the program depend on is a good way to introduce a bug; giving each part its own variable is much safer, and method-local variables are safer still.

In multithreaded environments, using immutable classes avoids a lot of potential headaches. In fact, multithreading is best avoided unless the performance is needed, because debugging errors is much harder in a multithreaded environment.

Give the user only as much control as they need, and no more. This will prevent a user from messing up and getting frustrated at you for what they perceive as being your fault.

Using constructs designed to do the work for you is also a good way to avoid errors. A simple example is a for-each loop in Java versus an index-based loop, where it is much easier to get an IndexOutOfBoundsException.

Getting familiar with the programming language you are using is also a great way to avoid errors. Find some reading material and exercises and do them.

Also, in multithreaded environments, make sure to synchronize shared data. This is a complex subject in its own right, with entire books written about it. Once again, limit the control of the user: they absolutely should not be able to mutate synchronized data while it is being synchronized.

I have tried to make this list as general as possible, but the ideas are from reading Effective Java.

John Glen
  • 101
0

Q: I get a flood of ugly PHP errors...

Short of predicting every possible change ... is there anything I can do to fix this general problem?

A: Yes, definitely. As Simon B suggested above, you want to fail fast, and fail loud. This is excellent advice :)

I'm surprised nobody mentioned using exceptions (thrown where the problem is first detected) and try/catch/finally blocks (at a higher level, where you can intelligently handle, and/or recover).
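A hedged sketch of the shape this takes in PHP: throw where the problem is first detected, deep in the parser, and catch once at the top of the daily import, where an intelligent response is possible. The helper names are placeholders:

    <?php
    // Throw at the point of detection...
    function requireColumns(array $row, int $n): void
    {
        if (count($row) !== $n) {
            throw new UnexpectedValueException(
                "Expected $n columns, got " . count($row)
            );
        }
    }

    // ...and catch once, at the level that can respond intelligently.
    try {
        runDailyImport();                       // hypothetical entry point
    } catch (UnexpectedValueException $e) {
        alertMaintainer($e->getMessage());      // hypothetical: one clear alert,
        abortImportKeepingOldData();            // not a flood of cascaded errors
    } finally {
        releaseImportLock();                    // hypothetical cleanup
    }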


paulsm4
  • 125
-2

The general counter-technique to cascading errors is resilience and compartmentalization.

Resilience - if part of a data stream is broken, ignore the broken part and work with the part that is okay if possible; otherwise abort the affected process then and there, but only the affected process. Have fallback options available.

Compartmentalization - separate your resources. Assign threads, memory, and access rights to different parts of the system and keep them separate. If one part fails, make sure the others are not affected. For instance, if you call another component (internal or external) and it repeatedly fails or produces errors, stop calling it (the circuit breaker concept), at least for a while; use fallback values and don't spread faulty data through your system.
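A minimal circuit-breaker sketch in PHP (8.0+ for promoted constructor properties); fetchFeed() and loadCachedFeed() are hypothetical, and a production version would persist the breaker state and add half-open retry logic:

    <?php
    // Minimal circuit breaker: after $threshold consecutive failures, stop
    // calling the flaky component for $cooldown seconds and use a fallback.
    final class CircuitBreaker
    {
        private int $failures = 0;
        private int $openedAt = 0;

        public function __construct(
            private int $threshold = 3,
            private int $cooldown = 300,
        ) {}

        public function call(callable $operation, callable $fallback): mixed
        {
            if ($this->failures >= $this->threshold
                && time() - $this->openedAt < $this->cooldown) {
                return $fallback();              // circuit open: don't even try
            }
            try {
                $result = $operation();
                $this->failures = 0;             // success closes the circuit
                return $result;
            } catch (Throwable $e) {
                if (++$this->failures >= $this->threshold) {
                    $this->openedAt = time();    // trip the breaker
                }
                return $fallback();
            }
        }
    }

    // Usage: the feed fetch falls back to yesterday's cached copy.
    $breaker = new CircuitBreaker();
    $data = $breaker->call(
        fn() => fetchFeed('https://provider.example/daily.csv'),   // hypothetical
        fn() => loadCachedFeed()                                   // hypothetical
    );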