I'm playing around with machine learning projects that use Reddit comments. I would like to download all Reddit comments and submissions (available as a torrent) and store them in a database on an external hard drive for easier and faster access. I believe there are at least tens of billions of comments, each with about a dozen columns of metadata.
What I need it to do:
- The data must be stored compressed. My external hard drive is 4 TB; the full Reddit dataset is about 2 TB compressed with zstd, and far larger than that uncompressed.
- I'll occasionally request all comments or submissions that meet some criteria, in order to create a training dataset. For example, all comments from the subreddit r/funny. Or possibly more advanced queries, like semantic text search. This doesn't need to be super fast.
- I'll occasionally run fasttext over all comments and need to store its prediction for each comment somewhere. I'll then request all comments that fasttext scored above some threshold, so a more sophisticated neural network can make its own predictions on them; those predictions also need to be stored.
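To make the workflow concrete, here's a rough sketch of the access pattern I have in mind. It uses Python's sqlite3 in memory purely as a placeholder (I haven't picked a database yet), and the table, column names, and scores are all made up:

```python
import sqlite3

# Toy schema -- the real dataset has roughly a dozen metadata columns per comment.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comments (
        id TEXT PRIMARY KEY,
        subreddit TEXT,
        body TEXT,
        fasttext_score REAL  -- left NULL until the fasttext pass fills it in
    )
""")
conn.executemany(
    "INSERT INTO comments (id, subreddit, body) VALUES (?, ?, ?)",
    [("c1", "funny", "lol"), ("c2", "askscience", "why is the sky blue?")],
)

# Step 1: pull a training set by some criteria, e.g. everything from r/funny.
funny = conn.execute(
    "SELECT id, body FROM comments WHERE subreddit = ?", ("funny",)
).fetchall()

# Step 2: store fasttext's prediction for each comment (scores are invented here).
conn.execute("UPDATE comments SET fasttext_score = 0.9 WHERE id = 'c1'")
conn.execute("UPDATE comments SET fasttext_score = 0.2 WHERE id = 'c2'")

# Step 3: fetch only high-scoring comments for the slower neural network.
high = conn.execute(
    "SELECT id FROM comments WHERE fasttext_score > 0.5"
).fetchall()
```

That's the whole loop: filter by metadata, write scores back, then filter by those scores. The question is really which engine handles this at tens-of-billions-of-rows scale on a compressed 4 TB drive.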
I'm not well versed in database design; I've only used MariaDB/MySQL for smaller databases, so I'm looking for something well documented and beginner-friendly.
The three I've considered are MariaDB, ClickHouse, and PostgreSQL. Any insights on the best way forward?