I am currently designing a SQL Server 2016 based platform to handle an OLTP dataset that will grow beyond the petabyte level. It will be used for specific types of analysis that require trends to be discovered using various methods and tools (including R). Various sources will feed the database(s) on a 'live' basis, and batches of data will also be ingested. Due to the high transaction volumes, the number of concurrent users projected (>250) and the way data will be consumed by the users (more on that later), we need this solution to be highly performant and scalable. It is obvious the data needs to be partitioned on a few levels to support the data consumers.
The users will be running trend-analysis workloads over daily, weekly, monthly and multi-year ranges. Most data will be supplied with date fields, but customer names, account numbers and transaction types are also in scope for trend analysis.
My question to you all is as follows: what would your strategy be for designing a proper partitioning solution? What questions would you ask, and what would you look for in the answers? How would you handle maintenance on indexes and such? What would you factor into the design?
Oowww, and dropping everything into a data lake (read: swamp) or moving to a different platform is not an option. Also, I am not at liberty to discuss the particulars of the project or the data involved, so please don't ask. Just know it is highly confidential financial and personal data, and we will be doing forensic analysis (using R, Power BI and/or other BI tooling) in compliance with lawful requirements that have been imposed on us. I will not share any other details beyond this, sorry.
I would suggest you go through an article that describes the important prerequisites and suggestions for OLTP databases.
For the loading process, use BULK INSERT for the batch loads and regular INSERT statements for normal inserts. What you need to know:
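As a rough sketch of the batch-loading side, a BULK INSERT into a staging table could look like this (the table name, file path and delimiters are illustrative; batch size and TABLOCK are tuning knobs you would validate for your own environment):

```sql
-- Hypothetical staging table and feed file; adjust names, path and delimiters.
BULK INSERT dbo.StagingTransactions
FROM 'D:\feeds\transactions_20230101.dat'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 100000,  -- commit in batches to keep the transaction log manageable
    TABLOCK                    -- table lock enables minimal logging under bulk-logged/simple recovery
);
```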
(I have experience with a 2 TB table growing 50 GB/day, with one month of data on production and the rest in the warehouse, so I am suggesting accordingly.)
If 70-80% of usage is daily analytical reports, I would suggest daily partitioning, as there will be a huge amount of data. It will perform faster, but to generate weekly, monthly and yearly reports you'll have lengthy queries.
If there is a 50-50 ratio between daily, weekly and monthly analysis, go for monthly partitioning. In this case daily and weekly reports will perform slower than with daily partitioning, because there will be many records to filter within each month, but you'll have quite simple queries.
Partitioning with the retention period of the online data in mind also makes the archiving policy easier.
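To make the monthly option concrete, here is a minimal sketch of a RANGE RIGHT partition function and scheme with one filegroup per month (function, scheme, filegroup, table and column names are all illustrative assumptions):

```sql
-- Monthly RANGE RIGHT partitioning: each boundary starts a new month's partition.
CREATE PARTITION FUNCTION pf_MonthlyByDate (date)
AS RANGE RIGHT FOR VALUES
    ('2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01');

-- Four boundaries produce five partitions, mapped here to five filegroups.
CREATE PARTITION SCHEME ps_MonthlyByDate
AS PARTITION pf_MonthlyByDate
TO (fg_2022, fg_202301, fg_202302, fg_202303, fg_202304);

-- The table is created directly on the partition scheme.
CREATE TABLE dbo.Transactions
(
    TransactionDate date          NOT NULL,
    AccountNumber   varchar(20)   NOT NULL,
    CustomerName    nvarchar(100) NOT NULL,
    TransactionType varchar(10)   NOT NULL,
    Amount          decimal(18,2) NOT NULL
) ON ps_MonthlyByDate (TransactionDate);
```

With this layout, aged partitions can later be switched out to an archive table for the retention policy mentioned above.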
As the table will be partitioned, you should create partitioned (aligned) indexes on it. To create a partitioned index you need to include the partitioning column in the index. Until you create partitioned indexes on the partitioned table, you won't get the performance benefits of partition elimination.
Creating the indexes on separate filegroups will give good performance for reports. So create a separate partition scheme, on separate filegroups, for the indexes, based on the same partition function as the table's.
Better to go for a columnstore index on (partitioning column, customer name, account number, transaction type, financial columns), placed on the index partition scheme.
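A sketch of both suggestions, assuming a hypothetical partition function `pf_MonthlyByDate` and table `dbo.Transactions` (all filegroup, index and column names here are illustrative):

```sql
-- Index partition scheme on its own filegroups, reusing the table's partition function.
CREATE PARTITION SCHEME ps_MonthlyByDate_IX
AS PARTITION pf_MonthlyByDate
TO (fgix_2022, fgix_202301, fgix_202302, fgix_202303, fgix_202304);

-- Aligned rowstore index: the partitioning column is included in the key.
CREATE NONCLUSTERED INDEX IX_Transactions_Account
ON dbo.Transactions (AccountNumber, TransactionDate)
ON ps_MonthlyByDate_IX (TransactionDate);

-- Partitioned nonclustered columnstore index for the analytical scans
-- (updatable on SQL Server 2016).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Transactions
ON dbo.Transactions (TransactionDate, CustomerName, AccountNumber, TransactionType, Amount)
ON ps_MonthlyByDate_IX (TransactionDate);
```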
Creating partitioned indexes also makes index maintenance easier: instead of rebuilding or reorganizing the complete index, you can perform the maintenance task on a particular partition, which minimizes the maintenance duration for big tables.
To do this, track the index fragmentation and row count of each partition. That will help you find which partitions' indexes need to be rebuilt.
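The per-partition check and rebuild described above could be sketched like this (table name, index name, threshold and partition number are illustrative assumptions):

```sql
-- Per-partition fragmentation and row counts for a hypothetical dbo.Transactions table.
SELECT  ips.index_id,
        ips.partition_number,
        ips.avg_fragmentation_in_percent,
        p.rows
FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Transactions'),
                                       NULL, NULL, 'LIMITED') AS ips
JOIN    sys.partitions AS p
        ON  p.object_id = ips.object_id
        AND p.index_id = ips.index_id
        AND p.partition_number = ips.partition_number
WHERE   ips.avg_fragmentation_in_percent > 30;  -- example threshold

-- Rebuild only the fragmented partition instead of the whole index.
ALTER INDEX IX_Transactions_Account ON dbo.Transactions
REBUILD PARTITION = 14;
```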
The maintenance schedule depends on the data size, how long your off-hours window is, and how long SQL Server takes to finish the task. It's better to test your maintenance plan first in a test environment with the same amount of data, and move it to production only if it finishes within the off-hours window you have.