
I need to store a fairly large amount of data for analytical purposes.

  • The data source produces 2 TB of data per month.
  • Data is collected on a monthly basis (not real-time).
  • Data is fully structured.
  • There are 100+ different columns of data.
  • Availability of SQL is important.
  • Engineer/developer resources are limited.

I planned to use Postgres (probably with a column-oriented extension); however, it does not seem feasible for such data volumes (more than 20 TB per year). I also did some research on Hadoop/Spark, but it looks like a rather heavyweight solution, considering that the data is fully structured. I am not considering cloud-based solutions, and I would like to avoid expensive ones (preferably a free licence).

Would you be so kind as to suggest which data store to use for large amounts of structured data for analytical purposes?

Leeloo

3 Answers


Aside from the recommendations made by Danielle and J.D., I suggest you examine how the data is represented and how frequently it is sampled.

For example, if the data is transmitted in JSON or XML format, which is handy for transmission and processing, the size will be large (e.g. a numeric string instead of a float or an int). Also, string identifiers (e.g. a serial number or credit card number) can be turned into PK/FK relationships, which could save space.
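As a rough sketch of that second point (PostgreSQL-flavoured SQL; the table and column names are purely illustrative, not from your schema): instead of repeating a long serial-number string in every row, store it once and reference it by a small integer key.

    -- Lookup table: each serial number is stored exactly once.
    CREATE TABLE device (
        device_id     integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        serial_number text NOT NULL UNIQUE
    );

    -- Fact table: rows carry a 4-byte FK instead of the full string.
    CREATE TABLE reading (
        reading_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        device_id   integer NOT NULL REFERENCES device (device_id),
        recorded_at timestamptz NOT NULL,
        value       double precision NOT NULL
    );

A 20-character identifier repeated across billions of rows costs a great deal more than a 4-byte key, and the lookup table doubles as a natural dimension to join against.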

Your data sampling frequency may also be higher than what is realistically usable. For example, if you receive weather data at 10 samples/sec but your analysis only needs 1-hour resolution, you have far more weather data than you can use. Resampling the data to a lower frequency (e.g. average and standard deviation, or a low-pass filter) and storing that might be a better solution. You can always archive the raw data; archival storage for a year's worth of data is not very expensive.
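A minimal sketch of that kind of resampling (again PostgreSQL-flavoured SQL; raw_weather and its columns are hypothetical): collapse the raw samples into one row per station per hour, keep the aggregates, and archive the raw table.

    -- Hourly summary: average, standard deviation and sample count per station.
    CREATE TABLE weather_hourly AS
    SELECT
        station_id,
        date_trunc('hour', recorded_at)  AS hour,
        avg(temperature)                 AS temperature_avg,
        stddev_samp(temperature)         AS temperature_stddev,
        count(*)                         AS sample_count
    FROM raw_weather
    GROUP BY station_id, date_trunc('hour', recorded_at);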

Jim D

I would go for an availability group on a cluster with SQL Server Enterprise Edition. You could use partitioned tables. Since the data is not real-time, you could have maintenance windows during which you load the data from the staging table (maybe one day of data at a time), and if you ever need to purge some data, you could easily get rid of one old partition.
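A minimal T-SQL sketch of that kind of monthly partitioning (SQL Server 2016 or later; all object names and boundary dates below are placeholders): each month lands in its own partition, and purging an old month becomes a cheap metadata operation rather than a huge DELETE.

    -- One partition per month; the boundary dates are examples only.
    CREATE PARTITION FUNCTION pf_monthly (date)
        AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

    CREATE PARTITION SCHEME ps_monthly
        AS PARTITION pf_monthly ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Measurements (
        MeasurementDate date  NOT NULL,
        SensorId        int   NOT NULL,
        Value           float NOT NULL
    ) ON ps_monthly (MeasurementDate);

    -- Purge the oldest month by truncating just that partition.
    TRUNCATE TABLE dbo.Measurements WITH (PARTITIONS (1));

The monthly load from the staging table can likewise be brought in during the maintenance window, for example with ALTER TABLE ... SWITCH into an empty partition.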

If you have complex reporting, it can be done on the readable secondary nodes. It is quite complex to set up, though.


Danielle Paquette-Harvey

How many rows of data are generated in a month's 2 TB of data? A year from now, how many months back will still be queried? In 5 years, how many years of data will still need to be queried? In 10 years, in 20 years? What is the largest number of months (or rows of data) that will need to be queried at one time? What type of analytics will be done: quantitative (aggregation and arithmetic based), or more qualitative?
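As a rough illustration of why the row count matters (the figures below are assumptions, not taken from your question): if a 100-plus-column row averages about 1 KB, then 2 TB per month is roughly 2 billion rows per month, on the order of 24 billion rows per year; if rows average closer to 4 KB, it is closer to 500 million rows per month. Those two scenarios lead to quite different indexing and partitioning strategies.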

These are just a subset of the important questions to weigh when deciding on a database solution, especially at the scale of data you're trying to plan for. Unfortunately, there is no direct answer to "which database system should you choose", because many factors are involved; and to be honest, most systems can handle even the scale of data you're talking about.

PostgreSQL should be able to handle that much data (I haven't personally worked with it at that scale, but that is my understanding from research). I know Microsoft SQL Server can definitely handle a data warehouse at that scale (again, depending on your specific use cases and the questions above), as I have worked with one not far off from it, and the analytical queries we ran took seconds to process.
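If you do stay with PostgreSQL, its native declarative partitioning is the usual way to keep monthly loads and retention manageable at that size. A minimal sketch (PostgreSQL 10 or later; table and column names are illustrative):

    -- Parent table partitioned by month.
    CREATE TABLE measurements (
        measured_on date             NOT NULL,
        sensor_id   integer          NOT NULL,
        value       double precision NOT NULL
    ) PARTITION BY RANGE (measured_on);

    -- One child table per monthly load.
    CREATE TABLE measurements_2023_01 PARTITION OF measurements
        FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');

    -- Retiring an old month is a metadata operation, not a massive DELETE.
    ALTER TABLE measurements DETACH PARTITION measurements_2023_01;
    DROP TABLE measurements_2023_01;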

Although throwing money at hardware isn't the primary recommended fix for a database's performance problems (proper schema design, index design, and process design go the longest way), it does help, and it is worth making sure your non-cloud server has decent hardware behind it. NVMe drives go a long way these days toward relieving what is generally the biggest hardware bottleneck, the disk. A decent amount of RAM and CPU will help too. (E.g. the Microsoft SQL Server data warehouse I mentioned above ran on 32 GB of RAM and an 8-core CPU; it probably could have benefitted from doubling both, but it worked quite well as is.)

Another benefit of Microsoft SQL Server (though other database systems probably have some equivalent) is that you can set up an AlwaysOn Availability Group to ensure constant uptime through redundancy. It is easy to set up, and you can also query your readable replicas as a way to spread the read load horizontally.
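For example, a reporting client can ask the availability group listener to route it to a readable secondary simply by adding ApplicationIntent=ReadOnly to its connection string (the listener and database names below are placeholders, and read-only routing also has to be configured on the availability group itself):

    Server=tcp:ag-listener,1433;Database=AnalyticsDW;Integrated Security=SSPI;ApplicationIntent=ReadOnly;MultiSubnetFailover=True;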

J.D.