I am currently designing an MSSQL 2016 based platform to handle an OLTP-style dataset that will grow beyond the petabyte level. It will be used for specific types of analysis that require trends to be discovered using various methods and tools (incl. R). Various sources will feed the database(s) on a 'live' basis, and further data will be ingested in scheduled batches. Due to the high transaction volume, the projected number of concurrent users (>250) and the way the data will be consumed (more on that below), the solution needs to be highly performant and scalable. It is obvious the data needs to be partitioned on several levels to support the data consumers.
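To make the partitioning side concrete, this is roughly the date-based layout I have been sketching so far; the function, scheme, filegroup and column names below are placeholders rather than the real objects, and the filegroups are assumed to exist already:

```sql
-- Monthly RANGE RIGHT partition function on the main date column
-- (boundary values are illustrative only).
CREATE PARTITION FUNCTION pf_TxnMonth (date)
AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01');
GO

-- Four boundaries give five partitions, each mapped to its own filegroup
-- so older months can sit on cheaper storage and be backed up separately.
CREATE PARTITION SCHEME ps_TxnMonth
AS PARTITION pf_TxnMonth
TO (fg_hist, fg_2016_01, fg_2016_02, fg_2016_03, fg_2016_04);
GO
```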
The users will be running trend-analysis workloads over daily, weekly, monthly and multi-year ranges. Most data will be supplied with date fields, but customer name, account number and transaction type are also in scope for trend analysis.
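For those scan-heavy date-range queries I am assuming the large fact tables will sit on that scheme with a partition-aligned clustered columnstore index, along these lines (table and columns again made up for illustration):

```sql
-- Hypothetical fact table created directly on the partition scheme,
-- so any index built on it is automatically partition-aligned.
CREATE TABLE dbo.FactTransaction
(
    TxnDate       date           NOT NULL,
    CustomerName  nvarchar(200)  NOT NULL,
    AccountNumber varchar(34)    NOT NULL,
    TxnType       tinyint        NOT NULL,
    Amount        decimal(19, 4) NOT NULL
)
ON ps_TxnMonth (TxnDate);
GO

-- Clustered columnstore (updatable in SQL Server 2016) for compression and
-- batch-mode scans over the daily/weekly/monthly/multi-year ranges.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactTransaction
ON dbo.FactTransaction
ON ps_TxnMonth (TxnDate);
GO
```

The idea is to rely on partition elimination on the date column for the range queries, with additional nonclustered indexes covering the customer/account/transaction-type slicing where needed.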
My question to you all is this: what would your strategy be for designing a proper partitioning solution? What questions would you ask, and what would you look for in the answers? How would you handle maintenance on indexes and the like? What else would you factor into the design?
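To show where my head is at on the maintenance and loading side: I am currently leaning towards doing everything per partition rather than per table, roughly like this (same placeholder names as above; partition numbers are illustrative):

```sql
-- Reorganize only the partition that is still receiving data instead of
-- touching the whole multi-terabyte index; REORGANIZE is online, and
-- COMPRESS_ALL_ROW_GROUPS closes open delta rowgroups. (Partition-level
-- REBUILD of a columnstore index is an offline operation on SQL Server 2016.)
ALTER INDEX cci_FactTransaction ON dbo.FactTransaction
REORGANIZE PARTITION = 5
WITH (COMPRESS_ALL_ROW_GROUPS = ON);
GO

-- Sliding window: add next month's boundary ahead of time so new data never
-- lands in a catch-all partition (fg_2016_05 is assumed to exist already).
ALTER PARTITION SCHEME ps_TxnMonth NEXT USED fg_2016_05;
ALTER PARTITION FUNCTION pf_TxnMonth() SPLIT RANGE ('2016-05-01');
GO

-- Batch ingestion: bulk load into a staging table that matches the target
-- exactly (same columns, same filegroup, same columnstore index, plus a
-- CHECK constraint on TxnDate covering the partition range), then switch it
-- in as a metadata-only operation while the target partition is still empty.
ALTER TABLE dbo.FactTransaction_Stage
SWITCH TO dbo.FactTransaction PARTITION 6;
GO
```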
Oh, and dropping everything into a data lake (read: swamp) or moving to a different platform is not an option. Also, I am not at liberty to discuss the particulars of the project or the data involved, so please don't ask. Just know that it is highly confidential financial and personal data, and that we will be doing forensic analysis (using R, Power BI and/or other BI tooling) in compliance with legal requirements that have been imposed on us. I will not share any other details beyond this, sorry.