Is the use of NoSQL Databases impractical for large datasets where you need to search by content?

Question

I've been learning about NoSQL Databases for a week now.

I really understand the advantages of NoSQL Databases and the many use cases they are great for.

But often people write their articles as if NoSQL could replace Relational Databases. And there is the point I can't get my head around:

NoSQL Databases are (often) key-value stores.

Of course it is possible to store everything into a key-value store (by encoding the data in JSON, XML, whatever), but the problem I see is that you need to get some amount of data that matches a specific criterion, in many use cases. In a NoSQL database you have only one criterion you can search for effectively - the key. Relational Databases are optimized to search for any value in the data row effectively.

So NoSQL Databases are not really a choice for persisting data that need to be searched by their content. Or have I misunderstood something?

An example:

You need to store user data for a webshop.

In a relational database you store every user as an row in the users table, with an ID, the name, his country, etc.

In a NoSQL Database you would store each user with his ID as key and all his data (encoded in JSON, etc.) as value.

So if you need to get all users from a specific country (for some reason the marketing guys need to know something about them), it's easy to do so in the Relational Database, but not very effective in the NoSQL Database, because you have to get every user, parse all the data and filter.

I don't say it's impossible, but it gets a lot more tricky and I guess not that effective if you want to search in the data of NoSQL entries.

You could create a key for each country that stores the keys of every user who lives in this country, and get the users of a specific country by getting all the keys which are deposited in the key for this country. But I think this techique makes a complex dataset even more complex - it's harder to implement and not as effective as querying an SQL Database. So I think it's not a way you would use in production. Or is it?

I'm not really sure if I misunderstood something or overlooked some concepts or best practices to handle such use cases. Maybe you could correct my statements and answer my questions.

score 42 · Accepted Answer · answered Jan 12 '16 at 01:02

While I agree with your premise that NoSQL is not a panacea for all database woes, I think you misunderstand one key point.

In NoSQL database you have only one criterion you can search for effectively - the key.

This is clearly not true.

For example MongoDB supports indices. (from https://docs.mongodb.org/v3.0/core/indexes-introduction/)

Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.

Indexes are special data structures [1] that store a small portion of the collection’s data set in an easy to traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field. The ordering of the index entries supports efficient equality matches and range-based query operations. In addition, MongoDB can return sorted results by using the ordering in the index.

As does couchbase (from http://docs.couchbase.com/admin/admin/Views/views-intro.html)

Couchbase views enable indexing and querying of data.

A view creates an index on the data according to the defined format and structure. The view consists of specific fields and information extracted from the objects in Couchbase.

In fact anything that calls itself a NoSQL database rather than a key-value store should realy support some kind of indexing schemes.

In fact, it is often the flexibility of these index schemes that makes NoSQL shine. In my opinion, the language used to define the NoSQL indices are often more expressive or natural than SQL, and since they usually live outside the table, you don't need to change your table schemas to support them. (Not to say you cant do similar things in SQL but to me it feels like there is a lot more hoop-jumping involved).

score 41 · Answer 2 · edited Jan 11 '16 at 19:30

Generally speaking, if your workflow is a perfect match for relational database queries, you'll find relational databases to be the most efficient approach. Its kind of tautological, but its true.

The claim that many NoSQL advocates would make is that many workflows were actually massaged into a relational form, and would have been more effective before such massaging. The validity of this claim is complicated to ascertain. Clearly there are jobs that are very well described by SQL queries. I can say from my experience that my particular relational programming tasks could have been done using NoSQL with nearly the same level of efficiency, if not more. However, that's a very subjective statement based on narrow experience.

I have a feeling much of the sale of the NoSQL approach comes from the assumption of large databases. The larger the database is, the more you must groom your workflow to support the larger datasets. NoSQL seems to be better at supporting that grooming effort. Thus, the larger the database, the more important NoSQL's features can potentially be.

To use the example, in SQL querying by country is just as slow as the NoSQL scan of all users, unless you explicitly told SQL to index the users table by country. NoSQL can do the same, where you create an ordered key-value collection that is the index (just like SQL does under the hood) and maintain it.

The difference? SQL engines had the concept of indexing the table built in. This means you got to do less work (all you had to do was add an index to the table). However, it also means you had less control. For most cases, that loss of control is acceptable, in exchange for the SQL engine doing the work for you. However, in massive datasets, you may want a different consistency model than the typical SQL ACID model. You may want to use the BASE model which supports eventual consistency. That could be very difficult in SQL, because the SQL engine is doing the work for you so it has to be done by the SQL engine's rules. In NoSQL, those layers are typically exposed, letting you hack at them.

score 16 · Answer 3 · edited Mar 20 '17 at 10:29

NoSQL is a rather vague term, since it basically covers all database systems which are not relational.

What you describe is a key-value store, which is a kind of database where a blob of data is stored under a key, and can be quickly looked up if you know the key. These databases are blazingly fast if you know the exact key, but as you say yourself, if you need to search or filter on multiple properties on the data, it will be slow and cumbersome.

Nobody in their right mind would claim that key-value stores can replace relational databases in general. However there may be particular use cases where are key-value store is a good fit. Key-value stores are often used for caching, since you typically cache items by id, but you don't need to perform ad-hoc queries over caches. For example the Stackoverflow site itself uses Redis (a key-value db) extensively, but only for output caching. The underlying canonical data is still persisted in a relational database.

So the answer is pretty obvious: Use a key-value store if you only need to store and lookup using a single key. Otherwise use a different kind of database. And if you are in doubt, use a relational database, since this is the most versatile kind of database, while the NoSQL databases are often optimized towards very particular use cases.

score 11 · Answer 4 · answered Jan 11 '16 at 23:23

Your assertions about relational databases are all true, up until the point where you have so much data you can't fit a copy of it on a single server anymore. Then you start running into all sorts of interesting problems. How do you split up your tables so most of your queries can run on a single server? How many copies of the data do you make? How do you deal with inconsistencies between those copies? How do you keep a user's data in a data center that's relatively close to him or her geographically?

These goals often conflict with each other. A lot of twitter users follow people from all over the world. Should twitter's database be geographically optimized for reading tweets or writing tweets?

It turns out when you deal with that kind of scale, you start inventing solutions, adding redundancies, and imposing restrictions that very much resemble a NoSQL database. If you can fit all your data on one box, you're only getting the restrictions and have no need for the benefits.

score 5 · Answer 5 · edited Jan 12 '16 at 20:08

NoSQL databases have very little to do with “No SQL”.

They are about admitting that you can’t have a database at scale that is always consistent and supports complex transactions and has durability.

In a normal relational database all indexes are automatically kept updated within the scope of a transaction, so can be used for any query.

In a NoSQL database the programmer is responsible for maintaining a lot of the indexes and it is assumed that indexes will always be out of date.

For example:

An index of people by tax number may contain some people that never complete the process of registration for tax.
Therefore code that uses the index has to be able to cope with incomplete registration for tax
Another option is to have times when a person that is registered for tax is not in the index. (So your design has to cope with not having consistent data and decide how the data will not be consistent.)

As a real would example, Amazon would rather show me the out of date description of a book than delay the display of the web page by waiting for 106 computers to confirm that the correct lock has been taken out.

Therefore.....

If a single normal relational database can hold all your data and process each transaction quickly enough that locking does not stop your system from doing useful work, a relational database is the best option.

But as soon as you have to start thinking about using more than one relational database, or splitting up transactions to avoid locking errors, you are going down the road of having to cope with the sort of issues you get when using “NoSQL” databases.

As “NoSQL” databases do not hide these issues, they may become the best option when you scale up a system. But remember that Stackoverflow still uses an relational database for storing all its data, with limited use of NoSQL in the caching layer – so you have to be VERY big before you are forced to use NoSQL to store your data.

score 3 · Answer 6 · answered Jan 11 '16 at 18:31

Relational Databases are optimized to search for any value in the datarow effectively.

Don't confuse the ability to search on "any" value in a row with "every" value in a row. The most effective way to do this requires one or more indexes. You could have indexes include all the fields, but then you just hindered you're ability to make changes that require altering the index (inserts, updates, deletes). You (or your DBA) have to understand the data, the usage, bottlenecks etc.

score -1 · Answer 7 · answered Jan 12 '16 at 19:48

There are many answers already, but I just wanted to add my summary.

Clearly NoSQL concept covers a variety of different approaches in organizing data on-disk, in-memory and exposing it via a query language (some are even SQL-like!). In my view the strength comes from this variety of systems so you can choose the best tool for the job. But still hopefully you can cover a dozen different needs with just a few different solutions, you wouldn't want to manage a dozen different systems.

Relational databases can get you very far and are a proven technology, but just like the database you might want to choose the programming language based on each project's needs (but taking team's experience into account as well).

score -2 · Answer 8 · answered Jan 12 '16 at 00:12

I've been using couchdb for two years now. It's mostly used for content management and configuration.

For hierarchical relationships are much easier to manage when you can visualize them. For read-mostly data, it's easier to edit JSON than it is to write an UPDATE statement in many cases. Doesn't take a programmer, actually, to edit JSON. And SQL gives you rows and columns, which you then have to map into some sort of object structure.

You also get a performance boost because you're not joining 10-20 tables on complex queries. Couchdb views are very fast because the javascript they are based on are not executed at query time.

Most programmers understand Javascript, and most programmers struggle with SQL occasionally.

In Couchdb, a view can be thought of as an abstract of a JSON document. How the view data is structured is up to you (you are not constrained by the original hierarchy).

I wouldn't use Couchdb for highly transactional data, but for semi-static data with a parts-explosion type structure, it is MUCH easier to work with than SQL.

Note though, that there is no clear 'normalization' that can be applied (though avoiding duplication of data is a worthy goal), and there is essentially and 'optimistic' update strategy akin to optimistic locking.

Is the use of NoSQL Databases impractical for large datasets where you need to search by content?

8 Answers8