60

The way I see it, SQL injection attacks can be prevented by:

  1. Carefully screening, filtering, encoding input (before insertion into SQL)
  2. Using prepared statements / parameterized queries

I suppose that there are pros and cons for each, but why did #2 take off and become considered to be more or less the de facto way to prevent injection attacks? Is it just safer and less prone to error or were there other factors?

As I understand, if #1 is used properly and all caveats are taken care of, it can be just as effective as #2.

Sanitizing, Filtering, and Encoding

There was some confusion on my part between what sanitizing, filtering, and encoding meant. I'll say that for my purposes, all of the above can be considered for option 1. In this case I understand that sanitizing and filtering have the potential to modify or discard input data, while encoding preserves data as-is, but encodes it properly to avoid injection attacks. I believe that escaping data can be considered as a way of encoding it.

Parameterized Queries vs Encoding Library

There are answers where concepts of parameterized queries and encoding libraries that are treated interchangeably. Correct me if I'm wrong, but I am under impression that they are different.

My understanding is that encoding libraries, no matter how good they are always have the potential to modify SQL "Program", because they are making changes to the SQL itself, before it is sent off to the RDBMS.

Parameterized queries on the other hand, send the SQL program to the RDBMS, which then optimizes the query, defines the query execution plan, selects indexes that are to be used, etc., and then plug in the data, as the last step inside the RDBMS itself.

Encoding Library

  data -> (encoding library)
                  |
                  v
SQL -> (SQL + encoded data) -> RDBMS (execution plan defined) -> execute statement

Parameterized Query

                                               data
                                                 |
                                                 v
SQL -> RDBMS (query execution plan defined) -> data -> execute statement

Historal Significance

Some answers mention that historically, parameterized queries (PQ) were created for performance reasons, and before injection attacks that targeted encoding issues became popular. At some point it became apparent that PQ were also pretty effective against injection attacks. To keep with the spirit of my question, why did PQ remain the method of choice and why did it flourish above most other methods when it comes to preventing SQL injection attacks?

Dennis
  • 8,267
  • 6
  • 38
  • 70

16 Answers16

148

The problem is that #1 requires you effectively parse and interpret the entirety of the SQL variant you're working against so you know if it is doing something it shouldn't. And keep that code up to date as you update your database. Everywhere you accept input for your queries. And not screw it up.

So yes, that sort of thing would stop SQL injection attacks, but it is absurdly more costly to implement.

Telastyn
  • 110,259
79

Because option 1 is not a solution. Screening and filtering means rejecting or removing invalid input. But any input might be valid. For example apostrophe is a valid character in the name "O'Malley". It just have to be encoded correctly before being used in SQL, which is what prepared statements does.


After you added the note, it seems you are basically asking why use a standard library function rather than writing your own functionally similar code from scratch? You should always prefer standard library solutions to writing your own code. It is less work and more maintainable. This is the case for any functionality, but especially for something which is security sensitive it makes absolutely no sense to reinvent the wheel on your own.

JacquesB
  • 61,955
  • 21
  • 135
  • 189
60

If you're trying to do string processing, then you're not really generating an SQL query. You're generating a string that can produce an SQL query. There's a level of indirection that opens up a lot of room for errors and bugs. It's somewhat surprising really, given that in most contexts we're happy to interact with something programmatically. For instance, if we have some list structure and want to add an item, we usually don't do:

List<Integer> list = /* a list of 1, 2, 3 */
String strList = list.toString();   /* to get "[1, 2, 3]" */
strList = /* manipulate strList to become "[1, 2, 5, 3]" */
list = parseList(strList);

If someone suggests doing that, you'd rightly respond that it's rather ridiculous, and that one should just do:

List<Integer> list = /* ... */;
list.add(5, position=2);

That interacts with the data structure at its conceptual level. It doesn't introduce any dependency on how that structure might be printed or parsed. Those are completely orthogonal decisions.

Your first approach is like the first sample (only a little bit worse): you're assuming that can programatically construct the string that will be correctly parsed as the query that you want. That depends on the parser, and a whole bunch of string processing logic.

The second approach of using prepared queries is much more like the second sample. When you use a prepared query, you essentially parsing a pseudo-query that's legal but has some placeholders in it, and then using an API to correctly substitute some values in there. You no longer involve the parsing process, and you don't have to worry about any string processing.

In general, it's much easier, and much less error prone, to interact with things at their conceptual level. A query isn't a string, a query is what you get when you parse a string, or construct one programatically (or whatever other method lets you create one).

There's a good analogy here between C-style macros that do simple text replacement and Lisp-style macros that do arbitrary code generation. With C-style macros, you can replace text in the source code, and that means that you have the ability to introduce syntactic errors or misleading behavior. With Lisp macros, you're generating code in the form that the compiler processes it (that is, you're returning the actual data structures that the compiler processes, not text that the reader has to process before the compiler can get to it). With a Lisp macro, you can't generate something that would be a parse error, though. E.g., you can't generate (let ((a b) a.

Even with Lisp macros, you can still generate bad code though, because you don't necessarily get to be aware of the structure that's supposed to be there. E.g., in Lisp, (let ((a b)) a) means "establish a new lexical binding of the variable a to the value of variable b, and then return the value of a", and (let (a b) a) means "establish new lexical bindings of the variables a and b and initialize them both to nil, and then return the value of a." Those are both syntactically correct, but they mean different things. To avoid this issue, you could use more semantically-aware functions and do something like:

Variable a = new Variable("a");
Variable b = new Variable("b");
Let let = new Let();
let.getBindings().add(new LetBinding(a,b));
let.setBody(a);
return let;

With something like that, it's impossible to return something that's syntactically invalid, and it's much harder to return something that's accidentally not what you wanted.

21

It helps that option #2 is generally considered a best practice because the database can cache the unparameterized version of the query. Parameterized queries predates the issue of SQL injection by several years (I believe), it just so happens that you can kill two birds with one stone.

JasonB
  • 339
20

Simply said: They did not. Your statement:

Why did SQL Injection prevention mechanism evolve into the direction of using Parameterized Queries?

is fundamentally flawed. Parameterized queries have existed way longer than SQL Injection is at least widely known. They were generally developped as a way to avoid string concentation in the usual "form for search" functionality LOB (Line of Business) applications have. Many - MANY - years later, someone found a security issue with said string manipulation.

I remember doing SQL 25 years ago (when the internet was NOT widely used - it was just starting) and I remember doing SQL vs. IBM DB5 IIRC version 5 - and that had parameterized queries already.

TomTom
  • 545
13

In addition of all the other good answers:

The reason why #2 is better is because it separates your data from you code. In the #1 your data is part of your code and that's where all the bad things come from. With #1 you get your query and need to perform additional steps to make sure your query understands your data as data whereas in #2 you get your code and it's code and your data is data.

Pieter B
  • 13,310
11

Parameterized queries, apart from providing SQL injection defence, often have an additional benefit of being compiled only once, then executed multiple times with different parameters.

From the SQL database point of view select * from employees where last_name = 'Smith' and select * from employees where last_name = 'Fisher' are distinctly different and therefore require separate parsing, compilation, and optimization. They will also occupy separate slots in the memory area dedicated to storing compiled statements. In a heavily loaded system with a large number of similar queries that have different parameters computation and memory overhead can be substantial.

Subsequently, using parameterized queries often provides major performance advantages.

mustaccio
  • 227
  • 1
  • 4
5

Wait but why?

Option 1 means you have to write sanitizing routines for ever type of input whereas option 2 is less error-prone and less code for you to write/test/maintain.

Almost certainly "taking care of all caveats" can be more complex that you think it is, and your language (for example Java PreparedStatement) has more under the hood than you think.

Prepared statements or parametrized queries are pre-compiled in the database server so, when parameters are set, no SQL concatenation is done because the query is no longer a SQL string. An aditional advantage is that the RDBMS caches the query and subsequent calls are considered to be the same SQL even when parameter values vary, whereas with concatenated SQL every time the query is run with different values the query is different and the RDBMS has to parse it, create the execution plan again, etc.

Tulains Córdova
  • 39,570
  • 13
  • 100
  • 156
1

Let's imagine what an ideal "sanitize, filter and encode" approach would look like.

Sanitizing and filtering might make sense in the context of a particular application, but ultimately they both boil down to saying "you can't put this data in the database". For your application, that might be a good idea, but it's not something you can recommend as a general solution, since there will be applications that need to be able to store arbitrary characters in the database.

So that leaves encoding. You could start by having a function that encodes strings by adding escape characters, so that you can substitute them in yourself. Since different databases need different characters escaping (in some databases, both \' and '' are valid escape sequences for ', but not in others), this function needs to be provided by the database vendor.

But not all variables are strings. Sometimes you need to substitute in an integer, or a date. These are represented differently to strings, so you need different encoding methods (again, these would need to be specific to the database vendor), and you need to substitute them into the query in different ways.

So maybe things would be easier if the database handled substitution for you too - it already knows what types the query expects, and how to encode data safely, and how to substitute them into your query safely, so you don't need to worry about it in your code.

At this point, we've just reinvented parameterised queries.

And once queries are parameterised, it opens up new opportunities, such as performance optimizations, and simplified monitoring.

Encoding is hard to do right, and encoding-done-right is indistinguishable from parameterisation.

If you really like string interpolation as a way of building queries, there are a couple of languages (Scala and ES2015 come to mind) that have pluggable string interpolation, so there are libraries that let you write parameterised queries that look like string interpolation, but are safe from SQL injection - so in ES2015 syntax:

import {sql} from 'cool-sql-library'

let result = sql`select *
    from users
    where user_id = ${user_id}
      and password_hash = ${password_hash}`.execute()

console.log(result)
James_pic
  • 341
1

Offering an alternative perspective on this one - I see SQL injection (and other injection attacks) as being a serialisation problem... you're taking a complex structure (in this case, a database query), serialising it to string, and sending it across the wire to the database which has to deserialise it into a usable form.

That last part is the key. The database doesn't just magically run an SQL string directly... it has to parse the SQL string into some kind of syntax tree, turning embedded values into tokens.

So in that model, parameterised queries make perfect sense... the caller already has those values separated out, and the database needs them separated out, so you might as well keep them in the form that's easiest for both sides, instead of worrying about how to safely encode them in such a way that the parser can safely decode them.

0

In option 1, you are working with an input set of size=infinity that you are trying to map to a very large output size. In option 2, you have bounded your input to whatever you choose. In other words:

  1. Carefully screening and filtering [infinity] for [all safe SQL queries]
  2. Using [preconsidered scenarios limited to your scope]

According to other answers, there also appears to be some performance benefits from limiting your scope away from infinity and towards something manageable.

0

One useful mental model of SQL (especially modern dialects) is that each SQL statement or query is a program. In a native binary executable program, the most dangerous kinds of security vulnerabilities are overflows where an attacker can overwrite or modify the program code with different instructions.

A SQL injection vulnerability is isomorphic to a buffer overflow in a language like C. History has shown that buffer overflows are extremely difficult to prevent -- even extremely critical code subject to open review has often contained such vulnerabilities.

One important aspect of the modern approach to solving overflow vulnerabilities is the use of hardware and OS mechanisms to mark particular parts of memory as non-executable, and to mark other parts of memory as read-only. (See the Wikipedia article on Executable space protection, for example.) That way, even if an attacker could modify data, the attacker cannot cause their injected data to be treated as code.

So if a SQL injection vulnerability is equivalent to a buffer overflow, then what's the SQL equivalent to an NX bit, or to read-only memory pages? The answer is: prepared statements, which include parameterized queries plus similar mechanisms for non-query requests. The prepared statement is compiled with certain parts marked read-only, so an attacker cannot change those parts of the program, and other parts marked as non-executable data (the parameters of the prepared statement), which the attacker could inject data into but which will never be treated as program code, thus eliminating most of the potential for abuse.

Certainly, sanitizing user input is good, but to really be secure you need to be paranoid (or, equivalently, to think like an attacker). A control surface outside of the program text is the way to do that, and prepared statements provide that control surface for SQL. So it should come as no surprise that prepared statements, and thus parameterized queries, are the approach that the vast majority of security professionals recommend.

0

I alredy write about this here: https://stackoverflow.com/questions/6786034/can-parameterized-statement-stop-all-sql-injection/33033576#33033576

But, just to keep it simple:

The way parameterized queries work, is that the sqlQuery is sent as a query, and the database knows exactly what this query will do, and only then will it insert the username and passwords merely as values. This means they cannot effect the query, because the database already knows what the query will do. So in this case it would look for a username of "Nobody OR 1=1'--" and a blank password, which should come up false.

This isn't a complete solution though, and input validation will still need to be done, since this won't effect other problems, such as XSS attacks, as you could still put javascript into the database. Then if this is read out onto a page, it would display it as normal javascript, depending on any output validation. So really the best thing to do is still use input validation, but using parameterized queries or stored procedures to stop any SQL attacks

Josip Ivic
  • 1,657
0

I've never used SQL. But obviously you hear about what problems people have, and SQL developers had problems with this "SQL injection" thing. For a long time I couldn't figure it out. And then I realized that people where creating SQL statements, real textual SQL source statements, by concatenating strings, of which some where entered by a user. And my first thought on that realisation was shock. Total shock. I thought: How can anyone be so ridiculously stupid and create statements in any programming language like that? To a C, or C++, or Java, or Swift developer, this is utter madness.

That said, it is not very difficult to write a C function that takes a C string as its argument, and produces a different string that looks exactly like a string literal in C source code that represents the same string. For example, that function would translate abc to "abc", and "abc" to "\"abc\"" and "\"abc\"" to "\"\\"abc\\"\"". (Well, if this looks wrong to you, that's html. It was right when I typed it in, but not when it gets displayed) And once that C function is written, it isn't difficult at all to generate C source code where the text from an input field supplied by the user is turned into a C string literal. That isn't hard to make safe. Why SQL developers wouldn't use that approach as a way to avoid SQL injections is beyond me.

"Sanitizing" is a totally flawed approach. The fatal flaw is that it makes certain user inputs illegal. You end up with a database where a generic text field cannot contain text like ; Drop Table or whatever you would use in an SQL injection to cause damage. I find that quite unacceptable. If a database stores text, it should be able to store any text. And the practical flaw is that sanitizer can't seem to get it right :-(

Of course, parameterized queries are what any programmer using a compiled language would be expecting. It makes life so much easier: You have some string input, and you never even bother to translate it into an SQL string, but just pass it as a parameter, with no chance of any characters in that string causing any damage.

So from the point of a developer using compiled languages, sanitizing is something that would never occur to me. The need for sanitizing is insane. Parameterised queries are the obvious solution to the problem.

(I found Josip's answer interesting. He basically says that with parameterised queries you can stop any attack against SQL, but then you can have text in your database that is used to create a JavaScript injection :-( Well, we have the same problem again, and I don't know if Javascript has a solution to that.

gnasher729
  • 49,096
0

When I first heard of SQL injections I had the impression that something magically clever had to go on. Then I found it was the absolute opposite: textual substitution in a query is just utter stupidity on an incredible scale.

If parameterised queries didn’t exist, there would be one possible solution: Escaping. As an example, I can write a C function that takes any char* containing a C string as it’s argument and outputs a C string containing a C-language string literal that is compiled to this C string. For example a char* containing "abc" including the apostrophes would return ""abc"". You must do something like that to produce JSON. And the same could be done for SQL (hoping that SQL doesn’t make it too hard).

gnasher729
  • 49,096
-2

The main problem is that hackers found ways to surround the sanitation while the parametrized queries was an existing procedure that worked perfectly with the extra benefits of performance and memory.

Some people simplify the problem as "it is just the single quote and double quote" but hackers found smart ways to avoid detection like using different encodings or making use of database funtions.

Anyway, you only needed to forget one single string to create a catastrophic data breach. Hackers where able to automatize scripts to download the complete database with a series or queries. If the software is well known like an open source suite or a famous business suite you could simply attact the users and passwords table.

On the other hand just using concatenated queries was just a matter of learning to use and getting used to it.

Borjab
  • 1,339