6

I am working on a project which is a rewrite of an existing legacy software. The legacy software primarily consists of CRUD operations (create, read, update, delete) on an SQL database.

Despite the CRUD-based style of coding, the legacy software is extremely complex. This complexity is not only the result of the complexity of the problem domain itself, but also of poor (and regularly bordering on insane) design decisions. This poor coding has led to the data in the database lacking integrity. The integrity issues are not solely about relationships (foreign keys), but also about consistency within a single row. E.g., the meaning of column "x" outright contradicts the meaning of column "y". (Before you ask, the answer is "yes": I have analysed the problem domain and correctly understand the meaning and purpose of these columns, apparently better than the original developers did.)

When writing the replacement software, I have used principles from Domain Driven Design and Command Query Responsibility Segregation, primarily due to the complexity of the domain. E.g., I've designed aggregate roots to enforce invariants in the write model, command handlers to perform "cross-aggregate" consistency checks, query handlers to query intentionally denormalised data in a manner appropriate for various screens, and so on.

The replacement software works very well when entering new data, in terms of both accuracy and ease of use. In that respect, it is successful. However, because the existing data is full of integrity issues, operations that involve existing data regularly fail by throwing an exception. This typically happens when an aggregate can't be read from a repository because the data passed to its constructor violates the aggregate's invariants.
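
To make the failure concrete, here is a heavily simplified sketch of the kind of invariant enforcement involved (the aggregate, columns, and rule are illustrative, not my actual domain):

    from dataclasses import dataclass
    from datetime import date

    class InvariantViolation(Exception):
        pass

    @dataclass(frozen=True)
    class Policy:
        policy_id: int
        start_date: date
        end_date: date

        def __post_init__(self):
            # Invariant: a policy cannot end before it starts.
            if self.end_date < self.start_date:
                raise InvariantViolation(
                    f"Policy {self.policy_id}: end_date precedes start_date")

    # Rehydrating a legacy row with contradictory columns fails right here:
    # Policy(policy_id=42, start_date=date(2010, 5, 1), end_date=date(2009, 1, 1))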

How should I deal with this legacy data that "breaks the rules"? The old software worked fine in this respect, because it performed next to no validation. Because of this lack of validation, it was easy for inexperienced users to enter nonsensical data (and experienced users became very valuable because they had years of understanding its "idiosyncrasies").

The data itself is very important, so it cannot be discarded. I've tried sorting out the integrity issues as I go, and this has worked in some cases, but in others it is nearly impossible (e.g., data is outright missing from the database because the original developers decided not to save it). The sheer number of data integrity issues is overwhelming.

What can I do?

(Note that this question has been moved from Stack Overflow.)

magnus

3 Answers

8

Perhaps it is too naive, but have you considered creating a fresh database for your new application and writing a converter from the old (badly designed) database to the new one?

That converter would be hard to code, but you would get much cleaner data out of it.
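
As a very rough sketch (the tables, columns, and repair rule are invented purely for illustration), the converter could look something like this:

    import sqlite3  # stand-in for whichever database driver you actually use

    def migrate_customers(old_conn, new_conn, report):
        """Copy legacy rows into the new schema, repairing what can be
        repaired and logging what cannot."""
        for cust_id, name, status, closed_on in old_conn.execute(
                "SELECT id, name, status, closed_on FROM customer"):
            # Example repair rule: rows marked 'closed' but missing the date.
            if status == "closed" and closed_on is None:
                report.append((cust_id, "closed without a closed_on date"))
                continue  # or substitute a sentinel date, per the business rule
            new_conn.execute(
                "INSERT INTO customers (id, name, status, closed_on)"
                " VALUES (?, ?, ?, ?)",
                (cust_id, name, status, closed_on))
        new_conn.commit()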

5

How should I deal with this legacy data that "breaks the rules"?

So the starting point is to talk to your domain experts. What do they do with the "invalid" data in the legacy system today?

Also, is the legacy system the book of record, or does it describe entities that are actually controlled somewhere else? A mix of both? You may need a different strategy for each case.

One possible answer is to replace validation with an exception report. Basically, instead of "that breaks the rules, I won't let you do that", the model takes the attitude of "that breaks the rules, I'll do it, but I'm going to tell on you."

The model surfaces the problems, but lets the owners of the processes in question decide what to do about it.

My understanding is that this is a common technique in systems where the persistence model is not, in fact, the system of record. Think of an airport: a baggage handler scans a baggage tag, telling the domain that the suitcase is in Heathrow. Instead of saying "you can't scan that, that bag is supposed to be in Auckland", the system says "you scanned that in the wrong place, I better notify a supervisor".
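
A sketch of that attitude, with purely illustrative names, might look like:

    class BaggageItem:
        """Accepts every scan, but records rule violations for an exception report."""

        def __init__(self, tag_id, expected_airport):
            self.tag_id = tag_id
            self.expected_airport = expected_airport
            self.location = None
            self.violations = []

        def record_scan(self, airport):
            if airport != self.expected_airport:
                # Don't refuse the scan; remember that something is off.
                self.violations.append(
                    f"bag {self.tag_id} scanned at {airport}, "
                    f"expected {self.expected_airport}")
            self.location = airport  # the state change is applied regardless

    bag = BaggageItem("BA123456", "AKL")
    bag.record_scan("LHR")   # accepted; bag.violations now holds one entry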

In this approach, I think you can still be aggressive in validating command messages. For example, when you are translating the command for the domain, you could create your values via a factory that enforces the data consistency rules. That by itself should mitigate a lot of the data entry problems.
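
For example (a sketch only, assuming a simple value type), the factory used while translating commands can stay strict even though the persistence layer is lenient:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EmailAddress:
        value: str

        @staticmethod
        def from_input(raw: str) -> "EmailAddress":
            # Strict factory, applied only to *new* command input.
            if "@" not in raw:
                raise ValueError(f"not a valid email address: {raw!r}")
            return EmailAddress(raw.strip().lower())

Constructing the value directly stays possible for data already in the database; only the command path goes through the strict factory.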

But you would need to let the repository be permissive about saving and loading aggregates. On the read side, you need to be sure that an invalid aggregate doesn't destroy the entire view; whether you skip those aggregates or just flag them as suspect will depend on the use case. Horses for courses.

But you absolutely want to be able to support an exception report -- here are all the aggregates that don't currently conform to the validation rules. And that sort of requires you to be able to load a representation of the aggregate that illustrates the problem without the validation rules being applied.
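
A minimal sketch of that pair, with invented names and a made-up rule, might be:

    def load_unvalidated(rows):
        """Permissive read: plain dictionaries, no invariants applied, never throws."""
        return [{"id": r[0], "start_date": r[1], "end_date": r[2]} for r in rows]

    def exception_report(records):
        """Every record that does not currently conform to the validation rules."""
        return [rec for rec in records
                if rec["start_date"] is not None
                and rec["end_date"] is not None
                and rec["end_date"] < rec["start_date"]]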

VoiceOfUnreason
1

You need a data-cleansing exercise.

You don't have a chance in hell of writing decent code that will work if the existing data breaks the rules you're trying to establish. Your code is not at fault if it enforces rules on new data, but can't handle old data that - from the sound of it - wouldn't know a rule if one walked up to it and said hello. Your code simply shouldn't have to do this.

If you try to bend your new, well-thought-out design to deal with the terrible old data, you'll defeat the purpose of the rewrite you've just done. It'll end up as confusing and inconsistent as the old data itself.

This problem is not solvable by writing code that addresses the database (and which must expect it to behave consistently). If the problem is in the data, that's where the problem needs to be solved.

Whoever is running this project will hate you for saying this, because data-cleansing costs time and effort; but it's true.

SebTHU