3

I'm currently designing a conceptual web app, where users can submit posts, comments, and "like"/"dislike" both of those. However, I am not sure of how to store the (dis)likes, because of how many there could be and how many queries would be run on their corresponding table. (For example, getting (dis)likes per post, getting which posts/comments the user has liked, calculating which posts are trending based on how many likes they have received recently.)

In the concept, the web app is fairly popular. Let's compare it to Facebook and say it's 1/1000th as popular as Facebook. In 2012, Facebook dealt with 2.7 billion likes per day (probably more now, but we'll go with the 2012 stats). That means that the conceptual web app is going to deal with 2.7 million likes a day, which is almost 30 billion per year. 30 billion inserts a year and many more queries on the table.

I have two main choices when it comes to database system; SQL or NoSQL. I have already chosen MySQL for the other parts of the web app. To my knowledge, NoSQL databases like Cassandra are faster with inserts, but would there be a noticeable general performance difference?

gnat
  • 20,543
  • 29
  • 115
  • 306

3 Answers3

8

No startup has ever written the first version of their software with this kind of scalability in mind.

Facebook started out in PHP, and wrote a cross-compiler to convert their PHP code to C++ to reduce the number of servers they need by 50%. Twitter made major architectural changes, and got a 3X improvement in speed.

In both cases, they started out with a small but nimble system, typically in a rapid development tool, and switched to more robust, scalable systems later. The ability to write a functioning system and get it to market quickly is all that matters when you're small.

If you ever get as big as Facebook or Twitter, you're scalability woes will be a good problem to have, and you'll have the time and money to fix them properly.

Robert Harvey
  • 200,592
2

RDBMS vs. NoSQL

The question RDBMS vs. NoSQL, despite the claims of some vendors, is not a simple question of performance and scalability. It's a question of structure of your data and what you intend to do with it.

If your data is highly structured, you can certainly benefit from an RDBMS, and scale it as needed, by using bigger servers, adding multiple processors, distribute your data across several databases using some clever partitioning scheme, and even use some hadoop based SQL implementations having big data in mind... all this if the data structure allows it.

However if your data is not so structured, or has a structure that might evolve quickly, then a NoSQL database such as MongoDB, Aerospike, Cassandra or others could certainly be a more flexible alternative. These databases, due to their flexible structure are also easier to distribute. Some are even container-enabled thus allowing the best possible scalability. But to select the most appropriate one, you also have to take care of the kind of NoSQL you'll need, the read-write patterns in your application, and even some low level system architecture aspects (example: use of in-memory technology, or SSD storage vs. traditional hard disks).

How to choose ?

So the first thing you need is to have a clear view of your application, and the data it has to manage. and deal with the overall architecture of your product.

One of the aspect is the distribution of your webapp accross web-servers for ensuring scalability, and the use of eventual middleware such as kafka as message broker, and eventually use of microservices, with several loosely connected databases.

Another aspect will be the interface between your application and the database. The best approach here, is to design the software architecture so to isolate the interactions with the database, alowing you to change if needed, with a minimum of impact on the remaining of the application. This approach will enable you to start with a database and change if needed without too much worries, gaining from the experience as the dataflows grow.

Christophe
  • 81,699
0

The driving factor in this choice is not speed of operation but the structure of the data.

In your case you have a post with assocIated likes. If you always display the post and is likes together without reference to other data then this fits the no-sql pattern.

You need only select the post by id or perhaps an indexed userid and the db returns the required likes automatically.

If however you wanted to display the average likes per post with todays date, or the post with the most likes. a sql db would be better suited as a single select can calculate the aggregate over the tables.

When you talk about billions of rows you also have to consider the size of the data. Any query which has to look at all the data will be slow regardless of the database structure.

Ewan
  • 83,178