
I have a collection of documents, each holding a subject id, a timestamp and a value. For example:

{ sid: 1, t: 3, v: "A" }    
{ sid: 1, t: 5, v: "B" }

This means subject #1 was measured to have value "A" at t=3; later, at t=5, the value changed to "B". A new document is inserted only when the value changes.

My goal is to keep the history of changes and to allow queries (both query shapes are sketched right after this list):

  • for multiple subject IDs (or other fields not mentioned here) to get full histories or

  • for a subject and a timestamp for "value at that time".
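
For concreteness, a minimal sketch of those two query shapes as Solr request parameters, assuming sid, t and v are plain indexed fields (this is just how the examples above would translate, not a prescribed schema):

# 1. Full histories for several subjects, grouped and ordered:
full_histories = {
    "q": "sid:(1 OR 2)",
    "sort": "sid asc, t asc",
}

# 2. "Value at that time": the latest change at or before t=4 for subject 1.
value_at_time = {
    "q": "sid:1 AND t:[* TO 4]",
    "sort": "t desc",
    "rows": 1,
}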


Periodically I receive another set of documents containing various subjects and values at different times (t), and I need to merge this new information into the existing data. Let's say I receive:

{ sid: 1, t: 2, v: "A" }
{ sid: 1, t: 6, v: "B" }    
{ sid: 2, t: 5, v: "C" }

Not every document in this new set gives me useful information. To explain:

  1. The first one is useful because it tells me the change happened earlier than I thought, so I should use it; I'm trying to detect changes as early as possible (a code sketch of this rule follows the merge result below).

  2. The second one is not useful because I already knew that the value changed from "A" to "B" at t=5, earlier than the new info.

  3. The third one is useful because I had no info about subject #2 before.

So after the merge, the result should be:

{ sid: 1, t: 2, v: "A" }
{ sid: 1, t: 5, v: "B" }    
{ sid: 2, t: 5, v: "C" }
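
To make that rule concrete, here is a minimal in-memory sketch in Python (my own illustration, independent of the datastore): an incoming document is useful only if the reconstructed value at its timestamp differs from what we already believed, and inserting it can make its immediate successor redundant.

from bisect import bisect_right

def merge_event(history, t, v):
    """Merge one incoming (t, v) into the sorted history of a single sid.

    history: list of (t, v) tuples sorted by t.
    Returns True if the event carried new information (mutating history).
    """
    ts = [ht for ht, _ in history]
    i = bisect_right(ts, t)              # events at or before t are history[:i]
    prev = history[i - 1] if i > 0 else None
    if prev is not None and prev[1] == v:
        return False                     # value at t was already known to be v
    history.insert(i, (t, v))
    # The following event may now be redundant: same value, later timestamp.
    if i + 1 < len(history) and history[i + 1][1] == v:
        del history[i + 1]
    return True

Replaying the example: starting from [(3, "A"), (5, "B")], merging (2, "A") returns True and drops the now-redundant (3, "A"); merging (6, "B") returns False; and (5, "C") for subject #2 simply starts a fresh history.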

The new information arrives as a stream at about 2-3k docs/second. I can process it as a stream or in 15-minute batches of roughly 2.5M docs each. The existing changes collection holds hundreds of millions of docs.

The newly arriving data is not ordered in time; it can contain timestamps earlier or later than the existing data.

The easiest way that comes to mind is to query the existing collection for every incoming document and check whether a useful change has happened, but that would mean thousands of queries per second against the datastore (Solr).

Another way is to somehow detect which set of existing documents might be affected by the new data, load it in one shot, revise it and write it back in one go. I couldn't figure out how I can determine those documents, though.**
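
As a hedged sketch of that second approach: one plausible reading is that the only documents that can be affected are those sharing a sid with the batch, so one bulk read and one bulk write per 15-minute batch could replace the per-document queries. fetch_histories and write_back below are hypothetical adapter functions around the datastore (e.g. one Solr query filtered to the batch's sids, and a bulk reindex); merge_event is the sketch from above.

from collections import defaultdict

def merge_batch(batch, fetch_histories, write_back):
    # Group the incoming docs by subject.
    by_sid = defaultdict(list)
    for doc in batch:
        by_sid[doc["sid"]].append((doc["t"], doc["v"]))

    # One-shot load: only subjects mentioned in the batch can be affected.
    histories = fetch_histories(set(by_sid))   # {sid: sorted [(t, v), ...]}

    for sid, events in by_sid.items():
        h = histories.setdefault(sid, [])
        for t, v in sorted(events):            # apply batch events in order
            merge_event(h, t, v)

    write_back(histories)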

I tried to ask the question independent of any particular technology because I think this design problem is independent of the tech stack too, but the changes data will be stored in Solr and I can run aggregations in Spark. Other tech stack suggestions are welcome if they can solve this problem.

I can also change the schema and the way data is represented, in both the existing and the incoming data, if that helps solve this.

Do you have a design suggestion for this problem, or an answer to my starred ** question?


1 Answer


I think there's some potentially faulty reasoning here that I'd like to point out. You say that if you have:

{ sid: 1, t: 3, v: "A" }    
{ sid: 1, t: 5, v: "B" }

then { sid: 1, t: 6, v: "B" } is redundant. That might be true.

But let's work through an example of when treating a non-change as redundant would be a mistake:

Let's say that we get that change not at t:6, but at t:7 instead: { sid: 1, t: 7, v: "B" }. And then later on we get { sid: 1, t: 6, v: "C" }.

Adding those two example changes to our two original changes (ordered by timestamp):

{ sid: 1, t: 3, v: "A" }    
{ sid: 1, t: 5, v: "B" }
{ sid: 1, t: 6, v: "C" }
{ sid: 1, t: 7, v: "B" }
// at t:8 sid:1 is B. This is correct

Adding those two changes again, but this time let's remove non-changes as they come in:

{ sid: 1, t: 3, v: "A" }    
{ sid: 1, t: 5, v: "B" }
{ sid: 1, t: 6, v: "C" }
// { sid: 1, t: 7, v: "B" } was deleted as a non-change from t:5, before the event at t:6 was known
// at t:8, sid:1 is C. Whoops. That's wrong.

The t:7 event may have seemed like a redundant non-change at the time, but it becomes important once the t:6 event is known. If we had thrown t:7 away, t:6 would have fooled us into thinking the final value of sid:1 was C.

If there is a gap into which a new change can fit, we can't make a good decision on what is actually redundant with incomplete data.
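
As a minimal demonstration, replaying this counterexample through the merge_event compaction sketch in the question above reproduces exactly that failure:

h = [(3, "A"), (5, "B")]
merge_event(h, 7, "B")   # False: rejected as a non-change, never stored
merge_event(h, 6, "C")   # True: inserted
print(h)                 # [(3, 'A'), (5, 'B'), (6, 'C')] -- final value C, not B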

In fact, the problem isn't just with gaps. If you're receiving data from a source you don't 100% trust to be consistent, you also have a problem. Say one first gets { sid: 1, t: 6, v: "C" } and at some later time gets { sid: 1, t: 6, v: "D" }. If the first had been thrown away as "redundant", it becomes impossible to detect that something weird is going on when the second comes in.

You may have noticed this is something I've given some thought. I've written some code using PostgreSQL "temporal tables" similarly configured not to persist non-updates. An essential part of that code making sense was making sure events were processed strictly in order.

So it's probably only going to be possible to remove the "redundant" data either in your queries or in a read model that you maintain and update from your store of changes. But that's a different problem, and I don't know your tech or use case well enough to say how to do that effectively or scalably.
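
For instance, a minimal sketch of the read-model variant, assuming the store keeps every raw event and duplicates are collapsed only when serving a query:

def compact_for_read(history):
    # Collapse consecutive duplicate values at query time; the raw
    # event store stays untouched. history: list of (t, v) sorted by t.
    out = []
    for t, v in history:
        if not out or out[-1][1] != v:
            out.append((t, v))
    return out

Run on the full four-event history above it drops nothing, because the t:7 "B" follows a "C"; run on [(2, "A"), (3, "A"), (5, "B")] it returns [(2, "A"), (5, "B")].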

Nathan