5

I am currently in a process of maintaining a data warehouse for a quickly growing start up company. There is a lot of reporting demands from the clients, and this is usually handled by a data warehouse we set up. However, unlike bigger more established companies I worked for in the past, this company has a rapidly growing schema that would see almost 1 new table and 2 to 3 table changes every two weeks. These rapid schema changes have been picked up by the ETL team, but it has been challenging to keep up so I was wondering if there is a better way to handle it.

To explain the process in more detail, we have been using a traditional star schema data warehouse model. The tables would be transferred to ODS DB first without any changes, and will be transferred to Data warehouse layer as dimension and fact tables. If there is one change in the production DB schema, we will need to change DB structure for ODS and Data warehouse, and change the ETL steps as well. In essence, we have a meeting every sprint to check if there is any change in the production DB, and apply these changes to the ODS DB, Data Warehouse, and ETL. At this point I am wondering whether there is a better systematic way of doing this maintenance.

2 Answers2

2

Your overall strategy is fine. The point where you could try to improve is the way how the transformation rules for filling the ODS and the DW are defined. You may be able to find a declarative approach, where the mapping from the source columns and tables to the destination columns/tables is defined in a very compact, concise DSL, and these rules are used to drive the transformation processes (at least, partly). That would not take the burden from you to adopt the ETL process to schema changes, but it could actually make it simpler and quicker.

In general, this subject is still under research, you will find some papers and even some commercial products by googling for "declarative etl". But maybe you will find a way to design something simple which fits to your specific environment, and the expected kind of schema changes for your use cases.

Doc Brown
  • 218,378
2

Consider changing the warehouse schema to "data vault" (https://en.wikipedia.org/wiki/Data_vault_modeling) which is very well suited to quickly adapt to changes in data structures.

As a two-liner summary: basically, you keep the core identifier for an object in a so-called "hub" with all attributes in joined "satellite" tables. If you have new attributes for a given object, you simply add another satellite instead of modifying the existing structure. Relations between hubs are expressed as "links", which also have their attributes in satellites.

mtj
  • 2,360