
I am trying to find out what the state of the art is for databases, Python, and big data.

My starting point was SQL Server, pandas with multiprocessing, and Dask. Imagine I need to maintain a database with more than 1 billion rows, keep inserting into it, and perform complex, larger-than-memory analysis on it in parallel.

One drawback is that SQL Server is very slow at inserting and extracting data.

Inserting 100k rows takes about 1 second, and reading the first 1M rows takes 5s or more. That speed is disappointing compared with Dask on Parquet. However, with Dask and Parquet I cannot keep inserting into this "more than 1 billion rows" database, and there is no equivalent of a multi-index/non-clustered index, which even makes some joins that were fast in SQL slower....
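
Roughly, the comparison I'm making looks like the sketch below (the connection string, table name, and Parquet path are placeholders, not my real setup):

```python
# Rough sketch of the timing comparison described above.
# Connection string, table name, and Parquet path are placeholders.
import time

import dask.dataframe as dd
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:pass@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,  # batch the pyodbc inserts
)

# Insert 100k rows into SQL Server.
df = pd.DataFrame({"id": range(100_000), "value": range(100_000)})
t0 = time.time()
df.to_sql("big_table", engine, if_exists="append", index=False)
print(f"insert 100k rows: {time.time() - t0:.1f}s")

# Read the first 1M rows back from SQL Server.
t0 = time.time()
head_sql = pd.read_sql("SELECT TOP (1000000) * FROM big_table", engine)
print(f"read 1M rows from SQL Server: {time.time() - t0:.1f}s")

# Same read against a Parquet copy of the table with Dask.
t0 = time.time()
head_parquet = dd.read_parquet("big_table_parquet/").head(1_000_000, npartitions=-1)
print(f"read 1M rows from Parquet via Dask: {time.time() - t0:.1f}s")
```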

I looked around and found Spark SQL and PySpark, but I'm a bit unsure whether that is the correct step forward. Any suggestions? Thanks!
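
If it helps, the PySpark route I'm considering would look something like this (paths and column names are made up for the example):

```python
# Illustrative-only sketch of the PySpark alternative; the paths and the
# column name are placeholders, not a real pipeline.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("billion-row-table").getOrCreate()

# Keep appending new batches to a Parquet dataset instead of a SQL table.
new_batch = spark.read.parquet("staging/new_batch/")
new_batch.write.mode("append").parquet("warehouse/big_table/")

# Larger-than-memory analysis is distributed across the executors.
big_table = spark.read.parquet("warehouse/big_table/")
big_table.groupBy("some_key").count().show()
```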

thinker

1 Answer


I've worked with tables in the tens of billions of records that were pretty heavily transactional, and SQL Server handled it just fine. I'm not sure what your expectations are, but reading 1 million rows (by returning them to a client) in about 5 seconds is pretty reasonable, depending on total data size; even Parquet-formatted data (which I've also worked with) is only going to shave off less than a second for the same amount of data used in exactly the same way. I could say the same about inserting 100,000 rows in about 1 second.

Note that measuring performance purely in terms of row counts leaves out the row width, which you need in order to judge whether the performance is reasonable. So yes, I'd agree that returning 1 million rows of a single TINYINT column (1 byte wide) in 5 seconds sounds slow, but 1 million rows with a total row width of 1,000 bytes is reasonable.
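
As a rough back-of-the-envelope check, using that same hypothetical 1,000 byte row width:

```python
# Throughput implied by the hypothetical numbers above.
rows = 1_000_000
row_width_bytes = 1_000   # hypothetical average row width
seconds = 5

total_gb = rows * row_width_bytes / 1e9                    # ~1 GB returned to the client
throughput_mb_s = rows * row_width_bytes / seconds / 1e6   # ~200 MB/s
print(f"{total_gb:.1f} GB in {seconds}s -> {throughput_mb_s:.0f} MB/s")
```

Moving roughly 1 GB to a client in 5 seconds works out to about 200 MB/s, which is already in the range of what a single network link or storage device can sustain, so the format of the data at rest isn't the main bottleneck.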

At this scale, measured performance becomes bound by hardware and architecture design, which is mostly independent of the database platform. If you've architected your schema well, then the performance you're seeing is not a limitation of SQL Server; it's the reality of the amount of data you're trying to process on the hardware you've provisioned.

Long story short, the data-management tools out there today aren't inherently very different from one another performance-wise; what matters is how you use those tools and the hardware you provision to back them.

J.D.