What database should I use for storing all Reddit comments/submissions for a local hobby project?

I'm playing around with machine learning projects that use Reddit comments. I would like to download all Reddit comments and submissions (available as a torrent) and store them in a database on an external hard drive for easier and faster access. I believe there are at least tens of billions of comments, each with about a dozen columns of metadata.

What I need it to do:

  • Storage should be compressed. My external hard drive is 4 TB. The whole Reddit dataset is about 2 TB zstd-compressed, and much too large to fit uncompressed.
  • I'll occasionally request all comments or submissions that meet some criteria, in order to create a training dataset. For example, all comments from the subreddit r/funny. Or possibly more advanced queries, like semantic text search. This doesn't need to be super fast.
  • I'll occasionally request all comments in order for fasttext to make predictions. Then I'll need to store fasttext's predictions for each comment somewhere. Then I'll request all comments that fasttext scored greater than some value, in order to predict with a more sophisticated neural network. These predictions also need to be stored somewhere.
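That multi-stage scoring workflow maps naturally onto a predictions table keyed by comment ID, kept separate from the raw data. A minimal sketch, using SQLite purely as a stand-in (all table and column names here are made up); the same SQL carries over to MariaDB, PostgreSQL, or ClickHouse with minor changes:

```python
import sqlite3

# In-memory stand-in; the real database would be a file on the external drive.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE comments (
        id        TEXT PRIMARY KEY,
        subreddit TEXT,
        body      TEXT
    );
    -- Predictions live in their own table keyed by comment id, so each
    -- model's scores can be added later without touching the raw data.
    CREATE TABLE fasttext_scores (
        comment_id TEXT PRIMARY KEY REFERENCES comments(id),
        score      REAL
    );
""")

con.executemany("INSERT INTO comments VALUES (?, ?, ?)", [
    ("c1", "funny", "lol"),
    ("c2", "funny", "haha"),
    ("c3", "askscience", "why is the sky blue?"),
])

# Step 1: pull a training set by subreddit.
funny = con.execute(
    "SELECT id, body FROM comments WHERE subreddit = ?", ("funny",)
).fetchall()

# Step 2: store fasttext's score for each comment (dummy scores here).
con.executemany("INSERT INTO fasttext_scores VALUES (?, ?)",
                [("c1", 0.91), ("c2", 0.42), ("c3", 0.10)])

# Step 3: fetch only the high-scoring comments for the heavier model.
high = con.execute("""
    SELECT c.id, c.body
    FROM comments c JOIN fasttext_scores s ON s.comment_id = c.id
    WHERE s.score > 0.5
""").fetchall()

print(funny)  # [('c1', 'lol'), ('c2', 'haha')]
print(high)   # [('c1', 'lol')]
```

The later neural-network scores would go in a second, parallel table of the same shape, so each model's predictions stay independent.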

I'm not well versed in database design. I've only used MariaDB/MySQL for smaller databases. So I'm looking for something well documented and beginner friendly.

The three I've considered are MariaDB, ClickHouse, and PostgreSQL. Any insights on best way forward?

Dima

2 Answers


Of the options you suggested, MySQL/MariaDB is best for your project.

It is easy to use and effective at storing your data, but at this scale the queries you describe will be prohibitively slow. You will have to set up indexes to speed them up, and even then performance will suffer.

Having said that, if storage were not a constraint and access speed mattered more, you could work with something like Elasticsearch. You just dump in JSON records and can then query them blazingly fast, with no need for up-front database design.
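To make the "just dump JSON records" idea concrete: Elasticsearch's bulk API takes newline-delimited JSON, an action line followed by the document itself for each record. A sketch of building such a payload with only the standard library (the index name and fields are hypothetical); you would then POST it to the cluster's _bulk endpoint:

```python
import json

comments = [
    {"id": "c1", "subreddit": "funny", "body": "lol"},
    {"id": "c2", "subreddit": "askscience", "body": "why is the sky blue?"},
]

# Each record becomes two lines: an "index" action, then the document.
lines = []
for doc in comments:
    lines.append(json.dumps({"index": {"_index": "reddit-comments", "_id": doc["id"]}}))
    lines.append(json.dumps(doc))
payload = "\n".join(lines) + "\n"   # the bulk API requires a trailing newline

print(payload)
```

Note that Elasticsearch trades disk space for speed, so with a 4 TB drive the storage cost of its indexes is worth checking first.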

By the way, database design can be rather complex. As the author of a book on relational database design that runs to 8 chapters, I can say there are many considerations that can improve storage and access speed. It takes planning, and it means visualizing the SQL queries you will want to run against the database. A shortcut is to use generative AI (e.g., ChatGPT), but it makes a ton of errors. What it generates is a C-grade design (possibly B), so keep that in mind.

MichaelT


Based on what you described, especially since the text-search-based queries don't need to be particularly fast, any mainstream relational database management system (RDBMS) will work for you. The size of the data at rest doesn't really matter, nor does it affect the performance of the data in motion, so all of the systems you've mentioned are equally viable options.

Where database systems do differ at this scale is in managing the data and the schema. But since this isn't live data, it's unlikely to change much, and you're even less likely to modify the schema often, so this isn't really a concern either.

The Unsung Hero

One other database system you left out is Microsoft SQL Server. Since this is a hobby project (as opposed to production software), you can use the Enterprise Edition features of SQL Server completely free via the Developer Edition. SQL Server has a multitude of features that make managing data at this scale easier and that even help query performance. Alternatively, should you later grow the project into a production system, SQL Server Express Edition is also completely free for production use, though it is more limited than Developer Edition.

For example, SQL Server offers several data compression options, which will make storing a dataset of this size on your disk easier. Compression can also make loading the data potentially faster, since less has to be read off the disk.

There's also a feature called Columnstore Indexes, which makes OLAP / aggregation-type queries significantly faster through columnar compression and batch-mode execution, without the need to build and maintain a separate data warehouse. This would be useful if you care about metrics-style queries, such as the count of comments added to a given subreddit over a specific date range.
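The kind of aggregation a columnstore index accelerates has a characteristic shape: group, count, filter by range. A sketch with made-up data, using SQLite only as a stand-in to show the query shape; in SQL Server the SQL is essentially the same, it just runs in batch mode against the columnstore:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE comments (subreddit TEXT, created_date TEXT)")
con.executemany("INSERT INTO comments VALUES (?, ?)", [
    ("funny", "2023-01-05"),
    ("funny", "2023-01-20"),
    ("funny", "2023-03-01"),   # outside the date range below
    ("askscience", "2023-01-10"),
])

# Count comments per subreddit over a date range -- a classic OLAP shape
# that columnar storage answers by scanning only the columns it touches.
rows = con.execute("""
    SELECT subreddit, COUNT(*) AS n
    FROM comments
    WHERE created_date BETWEEN '2023-01-01' AND '2023-01-31'
    GROUP BY subreddit
    ORDER BY n DESC
""").fetchall()

print(rows)  # [('funny', 2), ('askscience', 1)]
```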

It also offers other helpful performance and data-management features such as Indexed Views, Online Indexing, and Partitioning, all free for you to use.

Conclusion

In my opinion, SQL Server is hard to beat when it's free to use. It also has one of the largest user bases, it's well documented, and there's a lot of information out there on how to solve problems with it. After it, I like PostgreSQL, which has a similarly rich feature set and a large user base as well. MariaDB, MySQL, and Oracle Database are all capable options too.

J.D.