9

Is the main reason for the difficulty in horizontally scaling / distributing an RDBMS because of an adherence to ACID transactions? Is it the fact that multiple tables are so interconnected, or something else?

My impression is that it’s mostly the ACID requirements, given that different nodes in a cluster may have different values at any given time. But heck, I’m fuzzy on how ACID even works.

Conversely, why are some NoSQL databases so much easier to distribute? I don’t know enough about distributed databases to understand why one db can be easily distributed but the other cannot.

Can anyone shed some light?

Greg Thomas
  • 201
  • 2
  • 3

3 Answers3

16

Why is it more difficult to horizontally scale a relational database than a NoSQL database

I'm afraid your assumption is a bit flawed.

It is not difficult to scale a relational (or any other) DBMS horizontally (or along any other axis). What's difficult is to maintain all the promises the DBMS offers while doing so.

A relational DBMS mainly promises consistency, and when it's spread across multiple instances, it makes you wait until all (or at least the majority) of instances agree on the final state of the transaction. Thus, you buy the consisent state of the database for the time it takes for all the instances to come to an agreement about the transaction state.

A so called1 "NoSQL" DBMS generally promises very short local response time, at the price of not guaranteeing the same transaction state in its other instances, at any point in time (which includes never, in some cases).

Some applications (airline seat reservation comes to mind) value consistency more, while others might prefer faster response time even if the result isn't necessarlily true (like some social networks do).

Fundamentally, your question reduces to "why is it [...] difficult to [...] scale a [...] database", and the answer is, "because". It's difficult to scale anything without sacrificing its original qualities, be it raising children, delivering pizza, sending humans into space, breeding chicken in your backyard...


1 - There's nothing to prevent it from using the SQL syntax, anyway.

mustaccio
  • 28,207
  • 24
  • 60
  • 76
2

There are a number of reasons why it is harder to horizontally distribute an RDBMS than a NoSQL system. Let me just start with tables and JOINs.

In a NoSQL database you may have very large denormalized tables, but they exist without JOINs between them. So you can partition and shard them fairly evenly across the nodes in a cluster. Sure, data will be duplicated, but each row can be dealt with atomically and locally. I look it up on node X, Y and Z where the replicas exist.

Now imagine that with a SQL JOIN. You have to look up the customer on X, Y and Z, but then their orders are elsewhere. So I have to look them up on A, B, and C.

Now consider more complex JOINs. The more JOINs you need to do means you are going to be hitting more and more of a cluster. Each JOIN is a performance hit and a data marshaling challenge before you return results back to the client.

Then there is the issue of automatic vs. manual sharding. While some SQL systems support automatic sharding, many require manual sharding to this day — which is often perilous. Many databases were never designed to be distributed, and such partitioning is an unnatural act imposed on them.

There are distributed SQL or "NewSQL" databases such as PostgreSQL, Spanner or CockroachDB that handle sharding and distribution better, but they don't obviate the issues of strong consistency/ACID transactions, which, when put up against speed-of-light propagation delays across datacenters, means you are going to be limited in terms of throughputs and latencies.

Which is why many NoSQL databases instead implement eventual consistency or tunable consistency, so you can have QUORUM or even LOCAL QUORUM consistency. This allows maintaining those higher throughputs and lower latencies. Which means you won't have as many timeouts or locking and blocking on operations.

There's also the issue of active-active vs. primary/replica. And this applies across SQL and NoSQL alike. Some systems just aren't designed for all read-write nodes. Instead they have a primary, and other nodes are read-only replicas. This can be problematic when it comes to mixed read-write or write-heavy workloads. Bottlenecking can occur. Plus it's a single point of failure, and you have to design for quick failover.

There are more systems in the NoSQL space that are designed for active-active leaderless topologies. Which means no single point of failure. And less bottlenecking on write-heavy workloads.

Active-active leaderless design also means being more homogenous in node operations. You're trying to avoid that one "more equal" peer, and especially trying to avoid that one "hot node" that everything has to be passing through.

Is it possible to do a good horizontal design with an RDBMS? Yes. There are good examples with PostgreSQL and CockroachDB. But because of strong consistency requirements, they will never be as good in terms of performance as a NoSQL system designed for fast eventually-consistent operations.

More details and considerations here:

Disclosure: I work for ScyllaDB; we make a NoSQL database.

-2

No I do not believe that horizonally scaling traditional RDBMS problems is due to ACID.

The problem is history. When traditional RDBMS systems were built, they were minicomputer systems, with limited capacity disks - so they use disk efficiently. They also had all the data in one place on one machine for efficient queries. Internet scale didn't exist at that point in time. The problems that Michael Stonebaker were trying to solve seem to be problems of querying and retrieving and joining billions of rows efficiently. Database technology. That's how we get such efficient btrees, indexing and efficient querying technology.

I developed a toy SQL database from scratch (it is very small and is written in Python) and I implemented distributed SQL joins. Data can be larger than the storage space of any replica and every node is queried to retrieved its subset of the joined data. This gives some degree of scalability at the cost of latency as every shard has to be queried to retrieve the full subset of the data. In theory the subset of data of a join shall be less than the total amount on all the servers, so there is not a problem with overwhelming the network with traffic.

CockroachDB, Tidb, RethinkDB all prove that you can create scalable databases that span multiple nodes. The problem is the engineering effort in doing so.

PolarDB and proves you can have transactions with a distributed RDBMS database. I also implemented multiversion concurrency control.

Nowadays we have Citusdata which shows that postgres can be scaled out.