17

It is often repeated that the Big Data problem is that relational databases cannot scale to process the massive volumes of data that are now being created.

But what are these scalability limitations that Big Data solutions like Hadoop are not bound to? Why can't Oracle RAC or MySQL sharding or MPP RDBMS like Teradata (etc.) achieve these feats?

I am interested in the technical limitations - I am aware that the financial costs of clustering RDBMS can be prohibitive.

Jeremy Beard

3 Answers

15

MS just had a tech talk in the Netherlands where they discussed some of this stuff. It starts off slowly, but gets into the meat of Hadoop around the 20 minute mark.

The gist of it is that "it depends". If you have a sensibly arranged, (at least somewhat) easy to partition set of data that is (at least somewhat) homogeneous, it should be fairly easy to scale to those high data volumes with an RDBMS, depending upon what you're doing.

Hadoop and MR seem to be more geared to situations where you are forced to do large distributed scans of data, especially when those data aren't necessarily as homogeneous or as structured as what we find in the RDBMS world.
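To make the "large distributed scan" idea concrete, here is a minimal single-process sketch of the map/reduce pattern. It is not Hadoop's API: the partition contents and the function names `map_partition` and `reduce_counts` are made up for illustration, and a real cluster would run each map call on the machine that holds that partition.

```python
from collections import Counter
from functools import reduce

def map_partition(lines):
    """Scan one partition and count records by their first field.

    Lines that are empty are skipped rather than rejected, since the
    input is assumed to be loosely structured.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts

def reduce_counts(left, right):
    """Merge two partial results; the order of merging does not matter."""
    return left + right

# Hypothetical partitions, standing in for blocks stored on different nodes.
partitions = [
    ["GET /a", "POST /b", ""],
    ["GET /c", "malformed"],
]

totals = reduce(reduce_counts, map(map_partition, partitions), Counter())
```

Because each partition is scanned independently and the merge step is associative, adding machines mostly means adding partitions, which is the property that makes this style of scan easy to scale out.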

What limitations are Big Data solutions not bound to? To me, the biggest limitation they're not bound to is having to make a rigid schema ahead of time. With Big Data solutions, you shove massive amounts of data into the "box" now, and add logic to your queries later to deal with the lack of homogeneity of the data. From a developer's perspective the tradeoff is ease of implementation and flexibility on the front end of the project, versus complexity in querying and less immediate data consistency.
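As a rough illustration of "shove the data in now, add logic to your queries later", the sketch below stores raw lines untouched and only interprets them at query time. The sample records, the field names (`user`, `customer`, `amount`, `total`) and the helper `read_amount` are all assumptions invented for this example.

```python
import json

# Raw records as they were ingested: no schema was enforced on write.
RAW_LINES = [
    '{"user": "alice", "amount": 12.5}',
    '{"user": "bob", "amount": "7"}',        # amount arrived as a string
    '{"customer": "carol", "total": 3}',     # different field names
    'not json at all',                       # garbage that still got stored
]

def read_amount(line):
    """Interpret one raw line at query time; return None if it is unusable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    user = record.get("user") or record.get("customer")
    raw = record.get("amount", record.get("total"))
    try:
        return (user, float(raw))
    except (TypeError, ValueError):
        return None

parsed = [r for r in map(read_amount, RAW_LINES) if r is not None]
total = sum(amount for _, amount in parsed)
```

The ingestion side is trivial, but every query now carries the cleanup logic that a rigid schema would have enforced once, up front. That is the tradeoff described above.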

Dave Markle
6

Database pioneer and researcher Michael Stonebraker co-wrote a paper that discusses the limitations of traditional database architectures. Generally, they scale up with more expensive hardware, but have difficulty scaling out with more commodity hardware in parallel, and are limited by a legacy software architecture that was designed for an older era. He contends that the Big Data era requires multiple new database architectures that take advantage of modern infrastructure and optimize for a particular workload. Examples of this are the C-Store project, which led to the commercial database Vertica Systems, and the H-Store project, which led to VoltDB, an in-memory OLTP SQL database designed for high-velocity Big Data workloads. (Full disclosure: I work for VoltDB.)

You might find this webinar interesting on this topic. It responds to some of the myths that have arisen with the success of NoSQL databases. Basically, he contends that SQL was not the problem, and that it shouldn't be necessary to give up traditional database features such as consistency in order to get performance.

5

It is not entirely true that RDBMS cannot scale. However, how much truth there is in the statement depends on the architecture. In the list that you gave, Oracle RAC is different from the rest (sharded MySQL and Teradata). The major difference is shared-disk vs shared-nothing architectures.

Shared-disk architectures like Oracle RAC have trouble scaling because at some point or other all the running machines must synchronize on some part of the data. For example, a global lock manager is a killer. You can keep fine-tuning it to some extent, but you will ultimately hit a wall. If you cannot easily add machines, you need fewer but very powerful machines, which may burn a hole in your pocket. In shared-nothing architectures (or sharded data), each machine takes ownership of some data. It need not synchronize with other machines when it wants to update that data.
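A minimal sketch of the shared-nothing idea, assuming simple hash-based ownership: each key maps to exactly one shard, so a write touches only that shard and needs no global lock. The class `ShardedStore` is hypothetical, and the in-memory dicts stand in for separate machines.

```python
import hashlib

class ShardedStore:
    """Toy shared-nothing store: every key is owned by exactly one shard."""

    def __init__(self, num_shards):
        # Each dict stands in for an independent machine with its own storage.
        self.shards = [dict() for _ in range(num_shards)]

    def _owner(self, key):
        # A stable hash, so the same key always routes to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        # Only the owning shard is touched; no cross-shard coordination.
        self.shards[self._owner(key)][key] = value

    def get(self, key):
        return self.shards[self._owner(key)].get(key)

store = ShardedStore(num_shards=4)
store.put("order:1001", {"status": "paid"})
store.put("order:1002", {"status": "open"})
```

Single-key operations scale by adding shards. The sketch leaves out what makes sharding hard in practice: queries or transactions that span several keys on different shards, and moving data when the number of shards changes.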

Then comes the breed of NoSQL databases. I would treat them as a subset of traditional RDBMS databases. Not all applications in this world need all the functionality offered by an RDBMS. If I want to use a database as a cache, I would not care about durability. Maybe in some cases I would also not care about consistency. If all my data lookups are based on a key, I don't need support for range queries. I may not need secondary indexes. I don't need the whole query processing/query optimization layer that all the traditional databases have.

sunil