6

I often hear people say they "sanitize input," which would mean making it clean. I understand this to mean "clean of potentially damaging contents," where the function that does the sanitizing would do something like character escaping.

But then I hear things like "sane input," which, to me, means the input isn't a string where a double was expected, or "January Third of Nineteen Ninety Five" where "1995/01/03" would have been correct. This is a matter of formatting.

Then we have "sanity functions," which handle user input to make it usable by the backend of the software. Can this refer to both types of input validation? Does it only deal with the formatting (like "sane input"), or with cleanliness of the input (like "sanitized input")? Are they two different classes of operations, or does sanity in this case just refer to both? I always thought it referred to sanitizing it (if that actually means something different than making it sane) since I thought "sanity" was a root for "sanitize." But I just looked it up and can't find any definition of "sanity" that has anything to do with cleanliness or sanitization.

Are there idioms for each of these operations that I don't know about, or is it always just "sanity functions" which do both of these things? Would it be confusing to see "sanity" and "safety" functions?

Carson Myers

3 Answers

6

Sane input means input that is acceptable for further processing. Input that fails this test doesn't have to be dangerous, just wrong. For example:

  • A fractional amount of items that are sold only in whole units.
  • A person's name containing newline characters.
  • A PO Box address for a paid-upon-receiving parcel.
  • A value that is against official regulations.
  • A textual description where only a number is accepted.
  • An invalid date, like 31st February.
  • A value out of reasonable bounds, say, a birthdate two centuries ago.
  • An email address without the @ character.
  • A first name with trailing spaces. And so on.
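
A few of the checks from the list above could be sketched in Python roughly like this. It is only an illustration, not a complete validator: the function name `find_problems` and its parameters are made up for the example, and the email test is deliberately crude.

```python
from datetime import date


def find_problems(quantity, first_name, email, year, month, day):
    """Return a list of problems; an empty list means the input looks sane.

    Hypothetical helper, for illustration only.
    """
    problems = []

    # Items sold only in whole units cannot have a fractional quantity.
    if quantity != int(quantity):
        problems.append("quantity must be a whole number")

    # A person's name should not contain newline characters.
    if "\n" in first_name or "\r" in first_name:
        problems.append("name contains a newline")

    # Crude on purpose: only checks that an @ is present at all.
    if "@" not in email:
        problems.append("email has no @ character")

    # date() raises ValueError for impossible dates such as 31 February.
    try:
        date(year, month, day)
    except ValueError:
        problems.append("date does not exist")

    return problems
```

None of these inputs is an attack; each one is simply unusable further down the line.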

Sometimes sanitizing only means fitting input into the desired standard: changing date ordering and separators so that 1/1/2011 turns into 2011-01-01, stripping whitespace at the ends, or capitalizing the country code. Sometimes it means limiting a value to a sane range: you are entitled to a 100% refund, not 18000%. Sometimes it means discarding gibberish or useless data: a nonexistent zip code renders the whole address invalid, and the wrong number of digits in an account number makes a money transfer impossible.

Sanitizing against SQL injection or other attacks is only a small part of the operation.
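
The "fit it into the desired standard" kind of sanitizing might look like the following minimal Python sketch. The function names are hypothetical, and the date helper assumes day-first input; month-first input would need a different format string.

```python
from datetime import datetime


def normalize_date(text):
    """Turn a day/month/year string such as '1/1/2011' into ISO '2011-01-01'.

    Assumption: the input is day-first. Raises ValueError if it does not parse.
    """
    return datetime.strptime(text.strip(), "%d/%m/%Y").date().isoformat()


def normalize_country_code(text):
    """Strip surrounding whitespace and upper-case the code, e.g. ' us ' -> 'US'."""
    return text.strip().upper()


def clamp_refund_percent(value):
    """Limit a refund percentage to the sane range 0..100."""
    return max(0, min(100, value))
```

Whether an out-of-range refund should be clamped like this or rejected outright is a design decision; clamping is shown here only because the text talks about limiting values.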

Edit: Yes, pretty much both are a subset of the same problem - making data fit for further processing.

If your database is dropped because someone wrote '; DROP DATABASE;-- as their username, it's the same class of problem as when someone entered 0 as a divisor and your backend blew up on division by zero, or when someone stole money from someone else's account by entering a negative value in the amount field of a bank transfer, or when your parcel was returned to sender because you accepted a phone number in place of a ZIP code.
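
To make the comparison concrete, here is a rough Python sketch of guarding against three of those cases. It uses the standard `sqlite3` module as a stand-in for whatever database is really involved; the `users` table and all function names are assumptions made up for the example.

```python
import sqlite3


def add_user(conn, username):
    """Store a username using a bound parameter instead of string concatenation.

    Assumes a table users(name TEXT) already exists (hypothetical schema).
    """
    # The ? placeholder makes sqlite3 treat username purely as data,
    # so quotes and semicolons inside it are never executed as SQL.
    conn.execute("INSERT INTO users (name) VALUES (?)", (username,))


def check_transfer_amount(amount):
    """Reject zero or negative amounts before any money moves."""
    if amount <= 0:
        raise ValueError("transfer amount must be positive")
    return amount


def safe_ratio(numerator, divisor):
    """Reject a zero divisor up front instead of crashing deep in the backend."""
    if divisor == 0:
        raise ValueError("divisor must not be zero")
    return numerator / divisor
```

The first function deals with a malicious input and the other two with merely wrong ones, but all three do the same job: making sure the data is fit for the step that follows.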

SF.
2

Something everyone seems to overlook: "to sanitize" means "to make sanitary", i.e. to clean up, not "to make sane".

Thus, "sanitizing input" means cleaning up input by normalizing it or removing bad or unnecessary parts, but with the basic assumption that the input is generally sane but possibly flawed in some aspects. This most often applies to input provided by users.

"Sanity checking", on the other hand, means identifying and rejecting input that is fundamentally broken. This is generally used for data provided by other systems or automatically, which is expected to be flawless. Sanity checking is basically a form of defensive programming (fail-fast).
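
The difference can be sketched in Python as follows. Both functions are hypothetical, and the temperature bounds are an assumption picked purely for illustration.

```python
def sanitize_phone(raw):
    """Sanitizing: clean up user-typed input that is basically fine.

    Keeps ASCII digits and drops spaces, dashes, brackets and the like.
    """
    return "".join(ch for ch in raw if ch in "0123456789")


def sanity_check_temperature(celsius):
    """Sanity checking: fail fast on machine-supplied data that is broken.

    The -90..60 range is an assumed plausibility window, not a standard.
    """
    if not -90.0 <= celsius <= 60.0:
        raise ValueError(f"implausible temperature: {celsius}")
    return celsius
```

The first function repairs and carries on; the second refuses to continue, which is the fail-fast behaviour described above.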

1

There's a whole chain of things which can be considered input sanitation:

  • Client-side error avoidance means constructing the user interface so that certain human errors (as distinct from malicious use) are impossible. This includes setting input field lengths, using date pickers, using selection lists which update to contain only valid selections, and discarding any non-numeric characters typed into a number field. Any logic here should not be duplicated in client-side validation.
  • Client-side validation checks the input for user errors which can get past error avoidance. Obvious examples include trailing whitespace (you don't know until submission time that the user isn't planning to add another word) or two dots in a number. This should not check for things which are signs of bots (such as strings longer than the maximum input field length), because a bot will simply ignore the test results.
  • Server-side input filtering using a blacklist or whitelist is probably the most common definition of sanitation: Remove data from input that you don't think the system is capable of handling properly, and return an error if the result is deemed not usable.
  • Server-side input sanity check could be necessary if the user interface is somehow unable to verify some parts of the input. For example, a calculation or third-party communication might take too long to be done interactively by the client-side error avoidance code.
  • Server-side escaping is the good twin sister of filtering. By escaping once and only once on input and output you ensure that your entire stack is able to handle any input thrown at it.
  • Database restrictions are the final checkpoint, and there should be plenty of validation there (foreign keys, sane column lengths and data types, triggers if necessary) to ensure that a data insert is atomic, and that the result of a successful commit is usable. Any errors caught at this level should be a sure indication of willful attempts at sabotage.
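
The last two links in that chain could be sketched in Python like this, using the standard `html` and `sqlite3` modules. The `orders` table and both function names are assumptions invented for the example, and SQLite stands in for whatever database is actually used.

```python
import html
import sqlite3


def render_comment(text):
    """Escape user text at the point of output so markup is displayed, not run."""
    return "<p>" + html.escape(text) + "</p>"


def make_orders_table(conn):
    """Create a hypothetical table whose constraints act as the final checkpoint."""
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " quantity INTEGER NOT NULL CHECK (quantity > 0)"
        ")"
    )
```

With this table in place, inserting a quantity of 0 raises `sqlite3.IntegrityError`, so a bad row that slipped past every earlier layer still cannot be committed.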

Sanity can thus refer to formatting (date, int), validity (password, IBAN), computability (regular expression, mathematical formula), and just plain neatness (trailing spaces).

l0b0