53

Lets say you are coding a function that takes input from an external API MyAPI.

That external API MyAPI has a contract that states it will return a string or a number.

Is it recommended to guard against things like null, undefined, boolean, etc. even though it's not part of the API of MyAPI? In particular, since you have no control over that API you cannot make the guarantee through something like static type analysis so it's better to be safe than sorry?

I'm thinking in relation to the Robustness Principle.

jpmc26
  • 5,489

9 Answers9

103

You should never trust the inputs to your software, regardless of source. Not only validating the types is important, but also ranges of input and the business logic as well. Per a comment, this is well described by OWASP

Failing to do so will at best leave you with garbage data that you have to later clean up, but at worst you'll leave an opportunity for malicious exploits if that upstream service gets compromised in some fashion (q.v. the Target hack). The range of problems in between includes getting your application in an unrecoverable state.


From the comments I can see that perhaps my answer could use a bit of expansion.

By "never trust the inputs", I simply mean that you can't assume that you'll always receive valid and trustworthy information from upstream or downstream systems, and therefore you should always sanitize that input to the best of your ability, or reject it.

One argument surfaced in the comments I'll address by way of example. While yes, you have to trust your OS to some degree, it's not unreasonable to, for example, reject the results of a random number generator if you ask it for a number between 1 and 10 and it responds with "bob".

Similarly, in the case of the OP, you should definitely ensure your application is only accepting valid input from the upstream service. What you do when it's not OK is up to you, and depends a great deal on the actual business function that you're trying to accomplish, but minimally you'd log it for later debugging and otherwise ensure that your application doesn't go into an unrecoverable or insecure state.

While you can never know every possible input someone/something might give you, you certainly can limit what's allowable based on the business requirements and do some form of input whitelisting based on that.

Paul
  • 3,347
35

Yes, of course. But what makes you think the answer could be different?

You surely don't want to let your program behave in some unpredictable manner in case the API does not return what the contract says, don't you? So at least you have to deal with such a behaviour somehow. A minimal form of error handling is always worth the (very minimal!) effort, and there is absolutely no excuse for not implementing something like this.

However, how much effort you should invest to deal with such a case is heavily case dependent and can only be answered in context of your system. Often, a short log entry and letting the application end gracefully can be enough. Sometimes, you will be better off to implement some detailed exception handling, dealing with different forms of "wrong" return values, and maybe have to implement some fallback strategy.

But it makes a hell of a difference if you are writing just some inhouse spreadsheet formatting application, to be used by less than 10 people and where the financial impact of an application crash is quite low, or if you are creating a new autonomous car driving system, where an application crash may cost lives.

So there is no shortcut against reflecting about what you are doing, using your common sense is always mandatory.

Doc Brown
  • 218,378
26

The Robustness Principle--specifically, the "be liberal in what you accept" half of it--is a very bad idea in software. It was originally developed in the context of hardware, where physical constraints make engineering tolerances very important, but in software, when someone sends you malformed or otherwise improper input, you have two choices. You can either reject it, (preferably with an explanation as to what went wrong,) or you can try to figure out what it was supposed to mean.

EDIT: Turns out I was mistaken in the above statement. The Robustness Principle doesn't come from the world of hardware, but from Internet architecture, specifically RFC 1958. It states:

3.9 Be strict when sending and tolerant when receiving. Implementations must follow specifications precisely when sending to the network, and tolerate faulty input from the network. When in doubt, discard faulty input silently, without returning an error message unless this is required by the specification.

This is, plainly speaking, simply wrong from start to finish. It is difficult to conceive of a more wrongheaded notion of error handling than "discard faulty input silently without returning an error message," for the reasons given in this post.

See also the IETF paper The Harmful Consequences of the Robustness Principle for further elaboration on this point.

Never, never, never choose that second option unless you have resources equivalent to Google's Search team to throw at your project, because that's what it takes to come up with a computer program that does anything close to a decent job at that particular problem domain. (And even then, Google's suggestions feel like they're coming straight out of left field about half the time.) If you try to do so, what you'll end up with is a massive headache where your program will frequently try to interpret bad input as X, when what the sender really meant was Y.

This is bad for two reasons. The obvious one is because then you have bad data in your system. The less obvious one is that in many cases, neither you nor the sender will realize that anything went wrong until much later down the road when something blows up in your face, and then suddenly you have a big, expensive mess to fix and no idea what went wrong because the noticeable effect is so far removed from the root cause.

This is why the Fail Fast principle exists; save everyone involved the headache by applying it to your APIs.

Mason Wheeler
  • 83,213
14

In general, code should be constructed to uphold the at least the following constraints whenever practical:

  1. When given correct input, produce correct output.

  2. When given valid input (that may or may not be correct), produce valid output (likewise).

  3. When given invalid input, process it without any side-effects beyond those caused by normal input or those which are defined as signalling an error.

In many situations, programs will essentially pass through various chunks of data without particularly caring about whether they are valid. If such chunks happen to contain invalid data, the program's output would likely contain invalid data as a consequence. Unless a program is specifically designed to validate all data, and guarantee that it will not produce invalid output even when given invalid input, programs that process its output should allow for the possibility of invalid data within it.

While validating data early on is often desirable, it's not always particularly practical. Among other things, if the validity of one chunk of data depends upon the contents of other chunks, and if the majority of of the data fed into some sequence of steps will get filtered out along the way, limiting validation to data which makes it through all stages may yield much better performance than trying to validate everything.

Further, even if a program is only expected to be given pre-validated data, it's often good to have it uphold the above constraints anyway whenever practical. Repeating full validation at every processing step would often be a major performance drain, but the limited amount of validation needed to uphold the above constraints may be much cheaper.

supercat
  • 8,629
3

Let's compare the two scenarios and try to come to a conclusion.

Scenario 1 Our application assumes the external API will behave as per the agreement.

Scenario 2 Our application assumes the external API can misbehave, hence add precautions.

In general, there is a chance for any API or software to violate the agreements; may be due to a bug or unexpected conditions. Even an API might be having issues in the internal systems resulting in unexpected results.

If our program is written assuming the external API will adhere to the agreements and avoid adding any precautions; who will be the party facing the issues? It will be us, the ones who has written integration code.

For example, the null values that you have picked. Say, as per the API agreement the response should have not-null values; but if it is suddenly violated our program will result in NPEs.

So, I believe it will be better to make sure your application has some additional code to address unexpected scenarios.

lkamal
  • 199
2

You should always validate incoming data -- user-entered or otherwise -- so you should have a process in place to handle when the data retrieved from this external API is invalid.

Generally speaking, any seam where extra-orgranizational systems meet should require authentication, authorization (if not defined simply by authentication), and validation.

StarTrekRedneck
  • 198
  • 1
  • 4
2

To give a slightly differing opinion: I think it can be acceptable to just work with the data you are given, even if it violates it's contract. This depends on the usage: It's something that MUST be a string for you, or is it something you are just displaying / does not use etc. In the latter case, simply accept it. I have an API which just needs 1% of the data delivered by another api. I could not care less what kind of data are in the 99%, so I will never check it.

There has to be balance between "having errors because I do not check my inputs enough" and "I reject valid data because I am too strict".

1

In general, yes, you must always guard against flawed inputs, but depending on the kind of API, "guard" means different things.

For an external API to a server, you do not want to accidentally create a command that crashes or compromises the state of the server, so you must guard against that.

For an API like e.g. a container class (list, vector, etc), throwing exceptions is a perfectly fine outcome, compromising the state of the class instance may be acceptable to some extent (e.g. a sorted container provided with a faulty comparison operator will not be sorted), even crashing the application may be acceptable, but compromising the state of the application - e.g. writing to random memory locations unrelated to the class instance - is most likely not.

Peter
  • 3,778
0

My take on this is to always, always check each and every input to my system. That means every parameter returned from an API should be checked, even if my program does not use it. I tend to as well check every parameter I send to an API for correctness. There are only two exceptions to this rule, see below.

The reason for testing is that if for some reason the API / input is incorrect my program cannot rely on anything. Maybe my program was linked to an old version of the API that does something different from what I believe? Maybe my program stumbled on a bug in the external program that has never before happened. Or even worse, happens all the time but no one cares! Maybe the external program is beeing fooled by a hacker to return stuff that can hurt my program or the system?

The two exceptions to testing everything in my world are:

  1. Performance after careful measurement of performance:

    • never optimize before you have measured. Testing all input / returned data most often takes a very small time compared to the actual call so removing it often saves little or nothing. I would still keep the error detection code, but comment it out, perhaps by a macro or simply commenting it away.
  2. When you have no clue what to do with an error

    • there are times, not often, when your design simply does not allow handling of the kind of error you would find. Maybe what you ought to do is log an error, but there is no error logging in the system. It is almost always possible to find some way to "remember" the error allowing at least you as a developer to later check for it. Error counters is one good thing to have in a system, even if you elect to not have logging.

Exactly how carefully to check inputs / return values is an important question. As example, if the API is said to return a string, I would check that:

  • the data type actully is a string

  • and that length is between min and max values. Always check strings for max size that my program can expect to handle (returning too large strings is a classical security problem in networked systems).

  • Some strings should be checked for "illegal" characters or content when that is relevant. If your program might send the string to say a database later, it is a good idea to be check for database attacks (search for SQL injection). These tests are best done at the borders of my system, where I can pinpoint where the attack came from and I can fail early. Doing a full SQL injection test might be difficult when strings are later combined, so that test should be done before calling the database, but if you can find some problems early it can be useful.

The reason for testing parameters I send to the API is to be sure that I get a correct result back. Again, doing these tests before calling an API might seem unnecessary but it takes very little performance and may catch errors in my program. Hence the tests are most valuable when developing a system (but nowadays every system seems to be in continous development). Depending on the parameters the tests can be more or less thorough but I tend to find that you can often set allowable min and max values on most parameters that my program could create. Perhaps a string should always have at least 2 characters and be a maximum of 2000 characters long? The min and maximum should be inside what the API allows as I know that my program will never use the full range of some parameters.