
I need to upload daily CSV dumps to a black box of my choice, and it should be possible to run queries or create filters over the uploaded data. Daily dumps weigh around 1TB and basically consist of id, datetime, latitude and longitude columns.

It's desirable to be able to store 1 year of info, so we're talking about 365TB here.
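To give a sense of scale, here is a back-of-the-envelope row-count estimate for the numbers above; the ~50-byte average CSV row size is my own assumption for illustration, not a measured figure:

```python
# Rough sizing for 1 TB/day of (id, datetime, latitude, longitude) CSV rows.
DAILY_BYTES = 1 * 1000**4   # 1 TB per daily dump
AVG_ROW_BYTES = 50          # assumed: id + ISO datetime + two decimal coords

rows_per_day = DAILY_BYTES // AVG_ROW_BYTES
yearly_tb = DAILY_BYTES * 365 // 1000**4

print(f"rows/day  ~ {rows_per_day:,}")   # ~20 billion rows per day
print(f"year size ~ {yearly_tb} TB")     # 365 TB
```

So whatever engine is picked has to handle on the order of tens of billions of rows per day.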

I believe that, provided enough bandwidth, I could upload the CSV to cloud storage and import it into a big-data engine. For example: upload to Google Cloud Storage, then use it as the source for an import into Google BigQuery; or upload to Amazon S3 and use it as the source for an import into Amazon Redshift.
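For the GCS-to-BigQuery path, the two steps would look roughly like this. This is a sketch only: the bucket, dataset, table and file names are placeholders, and the schema just mirrors the columns described above:

```shell
# 1. Upload the daily dump to Google Cloud Storage (placeholder names).
gsutil cp dump_2015-06-01.csv gs://my-dumps-bucket/

# 2. Load it into a BigQuery table, giving an inline schema.
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  mydataset.positions \
  gs://my-dumps-bucket/dump_2015-06-01.csv \
  id:STRING,ts:TIMESTAMP,latitude:FLOAT,longitude:FLOAT
```

The Redshift route is analogous: upload to S3, then issue a `COPY` statement pointing at the S3 object.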

At the beginning I thought Hadoop was a third solution, but then I saw that Hadoop can use Google BigQuery as its persistence layer, and that in front of Hadoop there is yet another layer to pick (Apache Spark, Cloudera Impala).

So, at this point, I'm lost. In what use case should I go for Hadoop? Is it safe to assume that for the most basic queries/filters I can rely on BigQuery alone, and leave open a further connection to Hadoop only in case I need more elaborate MapReduce jobs?

