
I have the feeling that most database systems originate in the '80s and stereotypically implement tables, ACID transactions and constraints. They were built with the scarcity of memory, disk and processing power in mind.

I am wondering if there is any storage system (not necessarily accessible through SQL) that is able to handle the following:

  • graphs (querying à la SPARQL).
  • matrices (n-dimensional), including sparse ones, with support for basic matrix algorithms such as SVD and clustering (a minimal sketch follows this list).
  • efficient management of large (terabyte-scale) data sets that do not change continuously; changes arrive through daily batches.
  • make use of large disk systems (14 TB of RAID 5 costs less than $1,500 today), which means more space for indexes, precalculated results, etc.
  • make use of GPUs, multiple cores, processors and nodes for large queries and indexing.
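To make the matrix point concrete, here is a minimal sketch of the kind of operation I have in mind, written with SciPy purely for illustration (SciPy is a library, not a storage system, and the sizes and density are made up):

    # Sketch of the "sparse matrix + SVD" requirement above.
    # SciPy is used only for illustration; it is a library, not a storage engine.
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # A large, very sparse 2-D matrix (think document x term or node x node).
    A = sparse_random(50000, 20000, density=1e-4, format="csr", random_state=0)

    # Truncated SVD: compute only the top-k singular triplets without densifying A.
    k = 20
    U, s, Vt = svds(A, k=k)

    print(A.shape, A.nnz)              # (50000, 20000) ~100000 nonzeros
    print(U.shape, s.shape, Vt.shape)  # (50000, 20) (20,) (20, 20000)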

I know most of these items are implemented somewhere (Apache Cassandra, SPARQL, Netezza, Exadata), but I am not aware of any single product that implements them all.

1 Answer


I think a lot of these items are very much on the horizon (or beyond).

SPARQL, for instance, is something that I don't see databases incorporating any time soon. The closest I've seen is SDB (part of the Apache Jena project), a layer that accepts SPARQL and translates it for a standard relational database.
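To show what that looks like in practice, here is a minimal sketch of running SPARQL through a library sitting on top of a store. I'm using Python's rdflib purely for illustration (it isn't something the question mentions); SDB plays the analogous role for SQL back ends in the Jena world:

    # Minimal sketch: a SPARQL query evaluated by rdflib over a tiny in-memory graph.
    # SDB does the analogous job, but rewrites the query for a relational back end.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")

    g = Graph()
    g.add((EX.alice, EX.knows, EX.bob))
    g.add((EX.bob, EX.knows, EX.carol))
    g.add((EX.alice, EX.name, Literal("Alice")))

    # A two-hop "friend of a friend" pattern, the kind of graph traversal
    # that is awkward to express over plain relational tables.
    query = """
    PREFIX ex: <http://example.org/>
    SELECT ?person ?foaf WHERE {
        ?person ex:knows ?friend .
        ?friend ex:knows ?foaf .
    }
    """

    for row in g.query(query):
        print(row.person, row.foaf)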

Also, using GPUs as general-purpose processors is still a fairly new idea, and it hasn't quite caught on in the database world. At this stage, it's still largely the domain of academia and theory.

There's only one project (that I could find) developing a database designed to take advantage of the GPU: Alenka, an open-source project that is still very much in development.

There's also a newer sorting algorithm, GPUTeraSort, on the horizon, but since it is an algorithm rather than a product, I don't know of any specific databases that use it at this point.
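To give a flavour of the primitive GPUTeraSort is built around, here is a minimal sketch of sorting keys on the GPU. CuPy is just my choice for illustration (neither Alenka nor GPUTeraSort uses it, and it assumes an NVIDIA GPU with CUDA available):

    # Minimal sketch of a GPU-side sort, the primitive GPUTeraSort is built around.
    # CuPy is used purely for illustration and assumes an NVIDIA GPU with CUDA.
    import numpy as np
    import cupy as cp

    # Generate keys on the host, then copy them into GPU memory.
    rng = np.random.default_rng(0)
    keys_host = rng.integers(0, 2**32, size=10_000_000, dtype=np.uint64)
    keys_gpu = cp.asarray(keys_host)

    # The sort itself runs entirely on the device.
    sorted_gpu = cp.sort(keys_gpu)

    # Copy back only when the host actually needs the result.
    sorted_host = cp.asnumpy(sorted_gpu)
    print(sorted_host[:3], sorted_host[-3:])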

Finally, there's a site, GPGPU, devoted to general-purpose computing on GPUs that you might want to keep an eye on. As databases that use the GPU appear, that's where they will be reported.

Having said all of that, using multiple cores or multiple processors is practically the status quo: SQL Server, MySQL, Oracle and the other major databases all use multi-threading.

Ultimately, the full combination of items you are asking for is currently far beyond what mainstream database products offer.

You might also try cross-posting this on Stack Overflow, as they might have some ideas about handling terabyte-scale data with graphs, SPARQL, GPU enhancements, etc. However, their answer is probably going to be something like, "Yes, you can do it, but it would be a huge custom-built system."

Richard