
I am a SQL Developer (not DBA or Architect) for a small (~50 employees) SaaS company. I am tasked with figuring out how to:

  1. Offload operational reporting from our 100+ OLTP databases
  2. Allow those reports to run against data from multiple client databases
  3. Position our company to provide more analytics-based solutions in the future

I have read a number of articles on various technologies like transactional replication (specifically the many-to-one/central subscriber model), SQL Server Service Broker, log shipping, Change Tracking (CT), and Change Data Capture (CDC; my understanding is this is Enterprise-only), and I am not sure which path is best to pursue.

I am hoping some of you with integration expertise may have encountered a setup similar to ours and be able to point me down a successful path or direct me to some resources that would be helpful.

Due to cost constraints, our solution must work within SQL Server Standard Edition. Also, the solution must be reasonable to support/maintain within our small organization.

Basic configuration:

We currently have 100+ individual client databases, most deployed on SQL servers at our data center, but some deployed on client servers within their data center that we can remote into. These are all SQL Server 2008 R2 databases, but we are planning to upgrade to SQL 2016 soon.

We use database projects and dacpacs to ensure the schema is the same across all client databases that would be integrated. However, since we do not force all clients to upgrade to new versions at the same time, some schema differences are possible between upgrades. The solution must be flexible enough not to break if client A is on software version 1.0 and client B is on version 1.1.

Operational reports are currently run directly from each client's OLTP database. We are concerned about the impact this will have on the application's performance if we do not offload it.

High-Level Requirements:

Our clients are hospital sterile processing departments (SPDs) who want up-to-the-moment reports on what they’ve processed so far, where inventory is, etc. SPDs process inventory around the clock, including weekends and holidays. Since one of the main purposes of this effort is to better support operational reporting, we would like the data to be as close to real-time as possible to continue meeting clients’ needs.

Currently we have some SPDs in separate databases that are actually part of the same hospital system. These clients want the ability to report against all the SPDs in their system.

Strategically speaking, we would like the ability to easily aggregate data across all our clients to support our internal analytics initiatives. Our expectation is that we would be able to use the collected operational data as a source for data marts/warehouse.

Thoughts so far:

Transactional replication seems like it would provide the most "real-time" solution. I found this response especially helpful, but I am concerned that, given the potential for schema differences, it will not work for us: SQL Server Many-to-One replication

Log shipping doesn't sound ideal given that the log cannot be restored while queries are active: I would either have to kick everyone out so the log can restore, or the data would become stale. I am also unclear as to whether this method could be used to centralize data from multiple databases, since each shipped log would only be for the individual database it came from.

With SQL Server Service Broker, latency may be unpredictable if a queue were unable to keep up with the number of messages to process.

CT only identifies a version for each table row; the changed data itself would still have to be retrieved from the source tables. Latency would be dependent on how quickly we could process something like an SSIS package against each database to retrieve the data and insert it into a central repository.
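
For reference, the setup side of CT is straightforward; a rough sketch (database and table names below are invented for illustration):

    -- Enable change tracking at the database level; retention governs how long
    -- change information is kept before automatic cleanup.
    ALTER DATABASE ClientDb_001
    SET CHANGE_TRACKING = ON
        (CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

    -- Enable it per table. CT records only the primary key and a version number,
    -- not the changed column values, so an extract still has to join back to the
    -- source table to pick up the actual data.
    ALTER TABLE dbo.InventoryItem
    ENABLE CHANGE_TRACKING
    WITH (TRACK_COLUMNS_UPDATED = OFF);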

Do we need to consider replicating each database individually and then perhaps use some sort of data virtualization technique to combine data from the various replicated sources?

Any advice or direction you are willing to provide would be greatly appreciated.

bperry

3 Answers


Do we need to consider replicating each database individually and then perhaps use some sort of data virtualization technique to combine data from the various replicated sources?

Yes. You can host multiple subscriber databases on a single instance, and then query across them with views, or load them into a consolidated database.
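
A minimal sketch of the view approach, assuming each client replicates into its own subscriber database on the reporting instance (database, schema, and column names here are placeholders):

    -- One reporting view spanning the per-client subscriber databases.
    -- Add one SELECT per client, or generate the view from a list of databases.
    CREATE VIEW rpt.AllClientsInventory
    AS
    SELECT  'ClientA' AS ClientCode, i.ItemId, i.Status, i.ProcessedAt
    FROM    ClientA_Sub.dbo.InventoryItem AS i
    UNION ALL
    SELECT  'ClientB', i.ItemId, i.Status, i.ProcessedAt
    FROM    ClientB_Sub.dbo.InventoryItem AS i;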

David Browne - Microsoft

Based on your description, the link below should help; I am also working on the same scenario: multiple publishers with a single subscriber.

  1. Add one more column, such as server_id, with a different default value per publisher (1, 2, 3, and so on), and make it part of a composite primary key.

  2. When creating the publications and adding articles, the article property "Action if name is in use" needs to be set to "Delete data. If article has a row filter, delete only data that matches the filter." This can be set in the Article Properties dialog of the New Publication Wizard, or by using the replication stored procedure sp_addarticle and specifying a value of delete for the @pre_creation_cmd argument. This way, when the central subscriber is initialized or reinitialized from multiple publication snapshots, previously applied snapshot data is preserved, since only the data matching the filter clause is deleted (see the sketch after this list).
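
A rough sketch of how that could look for one publisher (publication, article, and filter names below are just placeholders):

    -- On each publisher, add the article with @pre_creation_cmd = 'delete' so that
    -- (re)initializing this publisher's snapshot only deletes the subscriber rows
    -- matching this publisher's row filter.
    EXEC sp_addarticle
        @publication      = N'Pub_ClientA',
        @article          = N'InventoryItem',
        @source_owner     = N'dbo',
        @source_object    = N'InventoryItem',
        @pre_creation_cmd = N'delete';

    -- Row filter restricting the article to this publisher's server_id value.
    -- (sp_articleview would typically follow to regenerate the article's synchronization view.)
    EXEC sp_articlefilter
        @publication   = N'Pub_ClientA',
        @article       = N'InventoryItem',
        @filter_name   = N'FLT_InventoryItem_ClientA',
        @filter_clause = N'server_id = 1';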

http://www.sqlrepl.com/sql-server/central-subscriber-model-explained/

Gulrez Khan

One possible architecture:

Consider building the reporting as a data warehouse-based solution.

Typically a data warehouse is a database with a schema that represents the required subset of the source systems. AdventureWorks and AdventureWorksDW demonstrate that modelling.

Next, the ETL: Moving data from sources to the data warehouse.

A possible implementation here is to use change tracking.

Firstly, one can implement views that are version specific in what they consume but uniform in what they return. For example, if Person.Gender exists in version 2 but not in version 1, the person view for version 1 can simply return NULL for that column.
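
A minimal sketch of the idea, using a hypothetical Person table and an etl schema for the views (names and types are placeholders):

    -- Deployed to client databases on schema version 1 (no Gender column yet).
    CREATE VIEW etl.PersonSource
    AS
    SELECT  p.PersonId,
            p.FirstName,
            p.LastName,
            CAST(NULL AS nvarchar(10)) AS Gender   -- column does not exist in v1
    FROM    dbo.Person AS p;
    GO

    -- Deployed to client databases on schema version 2 (Gender exists).
    CREATE VIEW etl.PersonSource
    AS
    SELECT  p.PersonId,
            p.FirstName,
            p.LastName,
            p.Gender
    FROM    dbo.Person AS p;
    GO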

For the warehouse consumer, which reads only the views, the data is the same shape (with varying completeness).

Change tracking provides a relatively lightweight way of determining what data needs changing at each refresh.
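
A minimal sketch of an incremental pull using change tracking (table and key names are placeholders; the last synchronized version would be persisted per table per source database):

    DECLARE @last_sync_version bigint = 0;   -- read from your ETL metadata in practice

    -- Capture the current version first so the next refresh starts from here.
    SELECT CHANGE_TRACKING_CURRENT_VERSION() AS next_sync_version;

    -- Rows inserted/updated/deleted since the last refresh; deletes come back
    -- with NULL source columns because of the LEFT JOIN.
    SELECT  ct.InventoryItemId,
            ct.SYS_CHANGE_OPERATION,          -- I / U / D
            src.Status,
            src.ProcessedAt
    FROM    CHANGETABLE(CHANGES dbo.InventoryItem, @last_sync_version) AS ct
    LEFT JOIN dbo.InventoryItem AS src
            ON src.InventoryItemId = ct.InventoryItemId;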

Implementing the above relies on hand-tooling it all, so you'll need to be confident in SQL coding, and you should test performance scenarios to see how long the incremental loads take to run. In many cases they can be sub-second, but some high-transaction tables may generate significant load when processing changes. (Change tracking is 'relatively' lightweight; only testing proves it out.)

The good thing here is that you have a high degree of control over how schema differences manifest, and with change tracking there is no chance of integrity issues when it is implemented correctly, since the tracking is done at the engine level.

Whether this is definitely right for you is difficult to say.

Paul Holmes