
We've been using SQL Server for quite a while now in our company to host all application data. We use a typical set-up where data is loaded from various data sources into a Staging Area, which then feeds different datamarts for their specific purposes (with tailored data stored there).

However, from my point of view, this creates data "silos", which duplicate a lot of data and make it hard to keep everything reasonably in sync. Furthermore, as the number of solutions grows constantly, the number of "silo" datamarts grows proportionally, since the requirements are always slightly different and we have a clear instruction to use a separate datamart for each solution.

The apps then use direct connections to the datamarts to consume and manipulate the data (as needed).

I've been thinking a lot about this recently, and there are some other initiatives ongoing at the moment to modernize our backend, so I did some research and my idea is to propose the following:

  1. keep the Staging Area as-is (plain import of data from different sources into one datamart). There is one schema for each source and corresponding SSIS packages to load this data. This all should stay.
  2. create a schema for each solution within the Staging Area. Access rights can then be managed at this level to ensure app consumers can't access the "import schemas" or other solutions' schemas, but only the ones that belong to the apps they have access to. Once point 5 (the API) is implemented, authorization should take place there before connecting to the datamart.
  3. replace "silo" datamarts and the regarding ETL jobs (SSIS packages) that currently load the datamarts with materialized (aka "indexed") views in the newly defined schemas
  4. move any tables that are used for CRUD operations from the former app datamarts into the corresponding new schema in the "Staging Area" datamart (which needs to be renamed, as it now covers the entire data warehouse in one datamart)
  5. implement a simple API that exposes GET requests for the defined views and POSTs for the required CRUD operations on the tables from point 4
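
A rough sketch of what points 2 and 3 could look like in T-SQL. All names here (SolutionA, SolutionA_Readers, ImportSales.Sales and its columns) are made up for illustration:

    -- Point 2: one schema per solution, with access limited to that schema.
    -- "SolutionA", the role, and the ImportSales source schema are hypothetical names.
    CREATE SCHEMA SolutionA;
    GO
    CREATE ROLE SolutionA_Readers;
    GRANT SELECT ON SCHEMA::SolutionA TO SolutionA_Readers;
    -- No grants on the import schemas, so app consumers can't reach them directly.
    GO

    -- Point 3: an indexed view in the solution schema instead of a silo datamart + SSIS load.
    CREATE VIEW SolutionA.DailySales
    WITH SCHEMABINDING                                 -- required for indexed views
    AS
    SELECT s.SaleDate,
           s.ProductId,
           COUNT_BIG(*)             AS SalesCount,     -- COUNT_BIG(*) is mandatory with GROUP BY
           SUM(ISNULL(s.Amount, 0)) AS TotalAmount     -- SUM over a nullable column isn't allowed
    FROM   ImportSales.Sales AS s                      -- two-part names are required
    GROUP BY s.SaleDate, s.ProductId;
    GO
    CREATE UNIQUE CLUSTERED INDEX UX_SolutionA_DailySales
        ON SolutionA.DailySales (SaleDate, ProductId); -- this index is what materializes the view

Whether an indexed view is even possible for a given load depends on the query, so this is only a sketch, not a drop-in replacement for every SSIS package.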

Of course, each of these steps has a big impact on the solutions themselves and on where they consume data from, and many hours of development would be needed. But that's not the point for the time being; it's more a general question of whether this makes sense, or how you would set this up on SQL Server.

I just have the feeling that we build so much around the data, which also makes it hard to improve or speed up backend processes in other areas.

Many thanks in advance and all the best, Benny


2 Answers


I am still a little confused overall (and some of that is just due to not having your ETL stack in front of me). But hopefully you find something meaningful in this answer.

Managing schema changes from an external source can be tedious, depending on how you structure your stack. This is how I typically structure things when I'm dealing with an external data source:

  1. I bring the data over to its own database on my server*, named in accordance with the source database it came from (if known), otherwise a reasonable name based on the source. For example, let's call it SimpleSales. I preserve the structure exactly as the source provides it. This makes it easier to debug data issues later on (to trace if it's an issue from the data source vs my normalizations or transformations).

  2. I create a schema with a matching name to the database that was created in step 1, in my primary database, e.g. CREATE SCHEMA SimpleSales;

  3. In that new schema in my primary database, I create (regular) views that reference the native objects from the source database. Usually these views are one-to-one with the source objects. I only create views for the objects that I'm ultimately going to consume. This is where I normalize the name of the object itself and its column names and data types.

  4. Finally, if I have special business logic I need to incorporate on top of those columns, e.g. to transform them, I'll do that in another view in a more specific schema (in my primary database) relating to the application / domain that's going to consume that transformed object. E.g. if SimpleSales.SalesOrders is my normalized view in my primary database, and I need to apply dedicated logic for my Automated Invoicing application, I'll have a schema such as AutoInvoice that I create my dedicated views in, such as AutoInvoice.Customers (a rough sketch of steps 2-4 follows after this list). If it's global business logic that applies in most cases / many different applications, then I actually use a schema called Common instead. I also use the Common schema to unify related data from different external systems. E.g. if I had two different external systems that maintained Sales data, such as Customer information, then I'd have a Common.Customers view that pulls in the distinct list of Customers across both sources.

  5. (Optional) If I need to improve the specific performance of something and the solution is to use an Indexed View, then that'll likely have to happen in the source database. I normally create a single schema, named after the primary database, in the source database, to put these customizations in. E.g. in the SimpleSales database, I may have a PrimaryDatabase schema, and that's where my Indexed Views for the SimpleSales objects would go.
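
To make the layering concrete, here's a minimal T-SQL sketch of steps 2-4. The primary database name, the SimpleSales objects, and the source table and column names are all hypothetical:

    -- Hypothetical names throughout: "PrimaryDatabase" is my primary database,
    -- SimpleSales is the source database, and dbo.SALES_ORD_HDR is an imagined
    -- source table with awkward native names.
    USE PrimaryDatabase;
    GO

    -- Step 2: a schema named after the source database.
    CREATE SCHEMA SimpleSales;
    GO

    -- Step 3: a one-to-one view that normalizes the object name, column names, and data types.
    CREATE VIEW SimpleSales.SalesOrders
    AS
    SELECT ORD_ID                          AS SalesOrderId,
           CAST(ORD_DT AS date)            AS OrderDate,
           CUST_NO                         AS CustomerNumber,
           CAST(ORD_AMT AS decimal(19, 4)) AS OrderAmount
    FROM   SimpleSales.dbo.SALES_ORD_HDR;  -- three-part name back to the source database
    GO

    -- Step 4: application-specific logic layered on top of the normalized view.
    CREATE SCHEMA AutoInvoice;
    GO
    CREATE VIEW AutoInvoice.Customers
    AS
    SELECT CustomerNumber,
           MAX(OrderDate) AS LastOrderDate
    FROM   SimpleSales.SalesOrders
    GROUP BY CustomerNumber;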

Views (when not abused) are a great tool for adding a layer of abstraction. Aside from a column with dependencies being renamed or removed, other schema changes to the root tables generally won't break your consuming stack when you use views as your objects for consumption. And when there is a breaking schema change, it's easy to update just your root view in the primary database to account for it.

Also, keep in mind the limitations of Indexed Views. Not being able to use outer joins or self-joins, among many other restrictions, makes them a bit more niche in terms of when they're applicable to solve a problem. This is why I don't default to them when initially structuring the data.
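
For completeness, here's what the step-5 placement might look like, with the constraints just mentioned baked in (SCHEMABINDING, two-part names, inner joins only, COUNT_BIG(*) when grouping). The line table and its columns are hypothetical:

    -- A hypothetical indexed view created inside the SimpleSales source database,
    -- under a PrimaryDatabase schema, as described in step 5.
    USE SimpleSales;
    GO
    CREATE SCHEMA PrimaryDatabase;
    GO
    CREATE VIEW PrimaryDatabase.SalesOrderLineTotals
    WITH SCHEMABINDING
    AS
    SELECT h.ORD_ID,
           COUNT_BIG(*)               AS LineCount,
           SUM(ISNULL(l.LINE_AMT, 0)) AS LineTotal
    FROM   dbo.SALES_ORD_HDR AS h
           INNER JOIN dbo.SALES_ORD_LINE AS l  -- outer joins and self-joins aren't allowed here
                   ON l.ORD_ID = h.ORD_ID
    GROUP BY h.ORD_ID;
    GO
    CREATE UNIQUE CLUSTERED INDEX UX_SalesOrderLineTotals
        ON PrimaryDatabase.SalesOrderLineTotals (ORD_ID);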

For more complex cases, table-valued functions and stored procedures can also be utilized and even consumed similarly to views. But that's another chapter for another day.
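
Just as a teaser, an inline table-valued function can be queried much like a view but takes parameters. This tiny example builds on the hypothetical SimpleSales.SalesOrders view from the sketch above:

    -- A minimal inline table-valued function (names are hypothetical).
    CREATE FUNCTION SimpleSales.SalesOrdersSince (@Cutoff date)
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT SalesOrderId, OrderDate, CustomerNumber, OrderAmount
        FROM   SimpleSales.SalesOrders
        WHERE  OrderDate >= @Cutoff
    );
    GO

    -- Consumed similarly to a view:
    SELECT * FROM SimpleSales.SalesOrdersSince('2024-01-01');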

*As far as how to actually ingest the data into your source database, it really just depends on what technology you're most comfortable with, at what rate the data needs to be updated, and the size of the data. SSIS is a bit dated and complex, IMO. In my ventures, I've been lucky enough to be able to utilize Replication most of the time, for real-time data synchronization. Alternatively, if the data isn't terribly huge and doesn't need to be updated in real time, the process can be as simple as pure T-SQL that inserts the data into a new table, renames (or drops, if you don't need the history) the existing table, and finally renames the new table to the old table name. Something that can be done in under 10 lines of code.
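
A rough sketch of that "load new, then swap by rename" approach, assuming a linked server to the source exists. The server, database, and table names are made up, and transactions and error handling are left out for brevity:

    -- Load a fresh copy, then swap it in by renaming (hypothetical names).
    USE SimpleSales;
    GO
    SELECT *
    INTO   dbo.Customers_new
    FROM   [SourceServer].[SourceDb].dbo.Customers;   -- pull a fresh copy from the source

    EXEC sp_rename 'dbo.Customers', 'Customers_old';  -- keep the old copy as history
    EXEC sp_rename 'dbo.Customers_new', 'Customers';  -- swap the new copy in
    -- DROP TABLE dbo.Customers_old;                  -- or drop it if history isn't needed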

J.D.

I think what I'm getting from the question is that the primary problem is the apparent duplication of data in different "data marts".

There can be a tension in this area between producing a fully-integrated whole using common tables as much as possible, and maintaining a reasonable degree of modularity for different reporting developments.

The consequence of full integration of everything is that every proposed change to common tables eventually becomes titanic in the scope of possible implications.

To avoid that kind of unmanageable effort, ad-hoc bolt-ons and exceptional arrangements creep in, but unfortunately they do so in a way that is poorly organised and sometimes poorly localised (with single-use features or data creeping into "common" areas), and often naming schemes also become strained beyond comprehension.

It also becomes difficult to maintain individual reporting modules in future, since small adjustments which need to deviate from an original common source may require a number of things to be split off from the common pool so that the processing can be altered independently.

But this splitting off ends up getting done remote in time from the original analysis and development of the module, and perhaps not even by the same person - inevitably causing more risk, more upset to the accrued testing-in-use, and demanding more skill and heavy lifting from the developer for what might have been a much smaller task if all inputs and processing had already been independent.

The notion that maintenance effort increases with modest duplication is also largely imaginary, in my experience.

It may seem that if you discover a fault in code that has been duplicated in many places, it must be harder to fix than if it had been a single piece of code shared by many outputs.

In reality, such fixes in complicated reporting pipelines often require all downstream code to be reviewed for whether consequential adjustments are required. It's far easier to do this fixing incrementally, module-by-module, in order of perceived importance of the fix to each module, than to have to review (and perhaps adjust) absolutely everything in a reporting system before a fix can be committed.

My experience is that, in reporting systems, it's very much more feasible and resilient to accept some duplication of data for the sake of clear modularity and independence.

Steve