
I'm unsure how to implement and maintain synchronization between two data sources in a distributed system. In my case, I have a service that checks a repository for expired jobs. If a job has expired, it is removed from the repository and enqueued in a distributed queue (the example is in Python, but it should be easy to follow):

def check_expired_jobs(self):
    jobs = self._job_repository.all()

    # Enqueue each expired job's task, then remove the job from the repository.
    for job in [job for job in jobs if job.has_expired()]:
        self._job_queue.enqueue(job.crawl_task)
        self._job_repository.delete(job)

My concern is that a lot can go wrong here, since both the queue and the repository are remote data sources. If the queue operation succeeds but the repository deletion fails for whatever reason, I end up with an inconsistency. It's not the first time I've encountered this kind of problem, and I want to tackle it in the best way possible.

What would be the best practice to keep several data sources/repositories in sync?

Robert Harvey

2 Answers


If the queue operation is successful but for whatever reason the repository deletion fails I could run into an inconsistency issue.

The simplest solution in this case is just to make it a non-issue. You're designing this thing, right?

You just need to write every program that reads the repository so that it treats expired records as if they don't exist. This is key to fault tolerance in general: any reasonably fault-tolerant design has to allow the queue to grow and shrink (even a little bit), so you ought to make this a rule anyway, if you haven't already. You can do it with a database view if you don't want to pollute your application code.
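
A minimal sketch of that rule in the question's Python, assuming the repository interface shown above (ActiveJobRepository is a hypothetical wrapper, not an existing class):

class ActiveJobRepository:
    """Hypothetical wrapper that hides expired records from every reader."""

    def __init__(self, inner):
        self._inner = inner

    def all(self):
        # Readers never see expired jobs, so a failed deletion elsewhere
        # can't surface as an inconsistency from their point of view.
        return [job for job in self._inner.all() if not job.has_expired()]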

Then, in a separate process, occasionally copy (not move) any expired records into the distributed queue. If the copy fails, heck, just try copying them again a few seconds or a few minutes later. I say "copy, not move," because the repository shouldn't care whether an expired record exists or not.
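
A sketch of that copier, reusing the (assumed) repository and queue interfaces from the question; QueueError stands in for whatever exception the real queue client raises:

def copy_expired_jobs(self):
    # Copy, don't move: the expired record stays in the repository
    # until the cleanup job described below removes it.
    for job in [j for j in self._job_repository.all() if j.has_expired()]:
        try:
            self._job_queue.enqueue(job.crawl_task)
        except QueueError:
            pass  # harmless: the next scheduled run will retry this job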

Of course you will eventually run out of disk space with expired records. So you also need a simple job, running maybe once every 24 hours, to physically delete the expired records, if and only if they exist somewhere in the distributed queue. You can shorten that if you need high performance. You can even do it immediately every time you add to the distributed queue.
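
The cleanup job might look like the sketch below; contains() is a hypothetical lookup, since a real distributed queue may offer an acknowledgement or receipt mechanism rather than direct membership checks:

def purge_expired_jobs(self):
    # Runs on a slow schedule (e.g. once every 24 hours). Deleting is
    # safe if and only if the job is known to be in the distributed queue.
    for job in [j for j in self._job_repository.all() if j.has_expired()]:
        if self._job_queue.contains(job.crawl_task):
            self._job_repository.delete(job)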

The only real difficulty is ensuring that copying the expired records doesn't result in duplicates in the distributed queue. You can accomplish this very simply by tagging each job with a GUID and enforcing a uniqueness constraint; that's straightforward for a database working in isolation.
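
The tagging is one line with Python's standard uuid module. The wrapper below is only a sketch: the in-memory set stands in for the real uniqueness constraint, which would live in the queue's backing store (e.g. a UNIQUE index):

import uuid

def new_job_id():
    # Assigned once, when the job is created, and carried on the task.
    return uuid.uuid4()

class DedupingQueue:
    def __init__(self, inner):
        self._inner = inner
        self._seen = set()

    def enqueue(self, task):
        if task.job_id in self._seen:
            return  # duplicate from a retried copy; drop it silently
        self._seen.add(task.job_id)
        self._inner.enqueue(task)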

Don't monkey with 2PC (two-phase commit) unless you are doing this for self-education. Given your requirements, it is way overkill.

Best practice to keep different data sources in sync?

Best practice is KISS.

John Wu

In this case, using a transaction would solve your problem. By this I don't mean a database transaction, but a transaction pattern, if you will.

I've never written any code in Python, so I don't know whether Python has a transaction construct of some sort. However, in this case it would be easy enough to come up with something yourself.

Basically, what you would do is enqueue the job and then try to delete it. If the deletion fails, you'd need to roll back the "transaction", which in your case means removing the job from the queue if that step did complete.

In order to do this, you'd need some local storage for the transaction's state until it is either committed (because the deletion succeeded) or rolled back (because the deletion failed and removing the job from the queue worked). The state would include either the complete task or a way to retrieve it.
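
A minimal sketch of that pattern, assuming the interfaces from the question; _pending_moves, RepositoryError, and queue.remove() are hypothetical names for the local state store, the repository's failure mode, and a queue rollback operation:

def move_expired_job(self, job):
    # Record the pending "transaction" locally first, so a crash midway
    # can be detected and resolved on restart.
    self._pending_moves.save(job.id, job.crawl_task)
    self._job_queue.enqueue(job.crawl_task)
    try:
        self._job_repository.delete(job)
    except RepositoryError:
        # Roll back: take the job out of the queue again.
        self._job_queue.remove(job.crawl_task)
        self._pending_moves.clear(job.id)  # rolled back
        raise
    else:
        self._pending_moves.clear(job.id)  # committed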