
[Diagram: vertical-slice planning (green arrows) versus horizontal, layer-by-layer ETL/ELT implementation (blue arrows)]

What are the pros and cons of using an agile/iterative approach when developing ETL/ELT (Extract Transform Load or Extract Load Transform) systems for data warehouses, data lakes, and lakehouses?

I often find that business analysts and project managers plan to ingest all the data first, then build the semantic layer and the other horizontal layers, and only then build reports and go to the business. I have drawn the diagram to show the difference between planning in vertical slices and implementing ETL/ELT horizontally, layer by layer.

Is ETL development an area where agile approaches bring value by reducing the risk of rework? Should we ingest all sources first (blue arrows), or should we prioritise implementing one vertical slice (green arrows)?

1 Answer


Setting up an ETL is no different from any other development task; therefore, proceed in an Agile way unless there are serious reasons not to.

Note that agility doesn't mean you can skip thinking and talking about what will be added to the ETL in the future. If you know that you'll need to import data from three sources, it may be useful to check first what those sources are, how the information inside them is organized, what that information is, and so on. Sometimes this gives you an opportunity to save time.

Here's an example. Imagine that you have two sources. You start with the first one, which has a list of employees with some basic information. For every employee, you need to show their country, but the source doesn't give the country, so you would need to deduce it from the email address, the city, or some other information. The second source, however, contains much more information about the employees, including the country. If you knew that in advance, you would simply wait until you integrate the second source to populate the country field, instead of writing deduction logic that will be thrown away. Time saved.
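To make this concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the sample rows, the field names (`email`, `city`, `country`), and the helper names `load_employees` and `enrich_country` are hypothetical, and a real pipeline would read from actual sources rather than in-memory lists. The point is only that the first slice can ship with `country` left empty, and a later slice fills it in from the second source.

```python
# Hypothetical sample data; field names are assumptions for illustration.
first_source = [
    {"email": "ana@example.com", "name": "Ana", "city": "Lisbon"},
    {"email": "bo@example.com", "name": "Bo", "city": "Oslo"},
]
second_source = [
    {"email": "ana@example.com", "country": "Portugal"},
]


def load_employees(rows):
    """First slice: ingest the first source, leaving country unknown (None)
    instead of deducing it from the email address or the city."""
    return {
        row["email"]: {"name": row["name"], "city": row["city"], "country": None}
        for row in rows
    }


def enrich_country(employees, rows):
    """Later slice: populate country from the second source where a match exists."""
    for row in rows:
        employee = employees.get(row["email"])
        if employee is not None:
            employee["country"] = row["country"]
    return employees


employees = load_employees(first_source)
employees = enrich_country(employees, second_source)
```

Employees that the second source doesn't cover keep `country` as `None`, which makes the remaining gap visible rather than hiding it behind a guess.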

Be careful, however, not to take for granted that all the sources will actually be used, or that none of them will change. Business requirements are often volatile, and more often than not, stakeholders imagine that an ETL will handle exabytes of data, use one hundred sources, and do magic. Keep YAGNI in mind, and try to stick to the sources that you are very likely to use in the next few sprints. This way, you avoid the rework that comes from building for a source that the business decides to drop a few sprints later.