9

I'm trying to develop a small reporting tool (with a SQLite backend). I can best describe this tool as a "transaction" ledger. What I'm trying to do is keep track of "transactions" from a weekly data extract:

  • "new" (or add) - resource is new to my app since my app may not have tracked this resource before as it hasn't been seen via extracts.
  • "update" (or hit) - there is a recent use of that resource, update retention period by another week.
  • "delete" (or drop) - this item saw no use since last report (optional, but would be nice to have for graphing week-to-week changes in demand for resources).

All I've got is a weekly data extract (pipe-delimited flat file) coming from a legacy archiving/record-management system that I have no control over.

Each line can be distilled to basically this:
resource_id | resource info | customer_id | customer_info

Sample data:

10| Title X       | 1 | Bob
11| Another title | 1 | Bob
10| Title X       | 2 | Alice

The goal is to make it easy to report on resources that haven't seen use for X months (based on the last hit). There is a retention period during which resources are kept around for ease of access if they're popular. A resource that hasn't seen use for 18 months is marked for long-term archival elsewhere.

This must be a common problem. I'm wondering: is there a general-purpose algorithm to determine what's new/same/removed between data sets (DB vs. latest extract)?

Swartz
  • 141

7 Answers

1

Well, your answer is... yes. There is a simple algorithm you can implement that doesn't require any of that other stuff: a net-present-value algorithm. It's easy to implement, and all it requires on the DB end is that you date-stamp the weekly data and write one simple query plus one small recursive function or for loop (or you could go with one of the other solutions here).

NPV = PV - PV(CP/T), or: the Net Present Value equals the Present Value minus the Present Value times the Current Period (months since the last entry) divided by the Term (e.g. 18 months). When the resource value falls to 0, its net present value is expended.

If you tell me which language you want it in, I'll post the code here in an edit.
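
In the meantime, here's a rough sketch of the idea in Python with SQLite (the transactions table, its seen_on column, and a starting value of 1.0 per resource are all placeholder assumptions):

    import sqlite3

    TERM = 18.0  # T: retention term, in months

    def npv(pv, months_since_last_entry):
        """NPV = PV - PV * (CP / T), floored at 0 once the value is expended."""
        return max(0.0, pv - pv * (months_since_last_entry / TERM))

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE transactions (resource_id INTEGER, seen_on TEXT);
        INSERT INTO transactions VALUES (10, '2015-03-01'), (11, '2016-11-01');
    """)
    # Months since each resource's last date-stamped entry (30-day months).
    query = """
        SELECT resource_id,
               (julianday('now') - julianday(MAX(seen_on))) / 30.0
        FROM transactions
        GROUP BY resource_id
    """
    for rid, months in conn.execute(query):
        if npv(1.0, months) == 0.0:
            print(f"resource {rid}: value expended, mark for archival")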

J-Boss
  • 175
0

If you're keeping the updates in a SQLite backend anyway, you could load the weekly extract into a new table and compare it to the archived data with queries before merging it.

Example of using SQL to find new additions to a table: https://stackoverflow.com/questions/2077807/sql-query-to-return-differences-between-two-tables

If a field in your DB stores the date of the transaction, you could just query all users who have had transactions in the past 18 months. Then the archive is just the full DB. Alternatively, you could query all users who haven’t, extract their data, then drop them. Updates are just any rows timestamped this week.
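
For instance, a minimal sketch of that comparison in Python/SQLite, assuming the merged history lives in an archive table and the weekly extract has been loaded into a weekly table (both names are placeholders):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Assumed layout: archive holds the merged history, weekly holds
    # this week's extract loaded into its own staging table.
    conn.executescript("""
        CREATE TABLE archive (resource_id INTEGER, customer_id INTEGER);
        CREATE TABLE weekly  (resource_id INTEGER, customer_id INTEGER);
        INSERT INTO archive VALUES (10, 1), (11, 1);
        INSERT INTO weekly  VALUES (10, 1), (10, 2);
    """)
    # Rows in this week's extract with no match in the archive = "new".
    new_rows = conn.execute("""
        SELECT w.resource_id, w.customer_id
        FROM weekly w
        LEFT JOIN archive a
          ON a.resource_id = w.resource_id
         AND a.customer_id = w.customer_id
        WHERE a.resource_id IS NULL
    """).fetchall()
    print(new_rows)  # [(10, 2)]

A NOT EXISTS subquery would work just as well as the LEFT JOIN here.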

Davislor
  • 1,563
0

Alternative idea:

  1. Parse your list of transactions into some kind of data structure, such as an array. (In C++, think std::vector; in Java, ArrayList.)

  2. Perform a query on your SQL backend such as SELECT DISTINCT customer_id FROM Transactions ORDER BY customer_id and pack the sorted distinct customer IDs into a set, old. If you do the exact same thing with a WHERE clause separating the old and new transactions, you can skip step 3.

  3. Get the unique customer IDs from the new updates into a separate data structure, new, in sorted order. There are a couple of data structures you could use for this: insertion sort into a doubly-linked list is very simple, but an intermediate hashtable would run in close to linear time, or, if you're sorting the original array anyway, getting a set out of that is easy.

  4. Take the set difference new - old using your favorite language's standard library. Your favorite language does have this algorithm in its standard library, right? (There's a sketch of steps 2–4 after the note below.)

The other things you want to do are definitely SQL queries after you’ve updated your transaction database.

Note on step 3: Consider the nature of your data. Suppose that your text file lists orders chronologically, and in a typical week, there are a lot of first-time customers who are given a new customer_id in ascending order. Suppose that most other orders are from a small number of loyal repeat customers, with lower customer_id. Then your inputs are already mostly sorted. An insertion sort where you try to insert low customer_id at the front of a doubly-linked list and high customer_id at the back would, in that situation, perform well in practice.
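
A minimal sketch of steps 2–4 in Python (the Transactions table and the sample rows are stand-ins), leaning on the built-in set type instead of a hand-rolled sort:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Transactions (customer_id INTEGER)")
    conn.executemany("INSERT INTO Transactions VALUES (?)", [(1,), (2,)])

    # Step 2: distinct customer IDs already in the database.
    old = {row[0] for row in conn.execute(
        "SELECT DISTINCT customer_id FROM Transactions")}

    # Step 3: unique customer IDs from this week's parsed extract (step 1).
    # Each row is (resource_id, resource_info, customer_id, customer_info).
    extract = [(10, "Title X", 1, "Bob"), (12, "New title", 3, "Carol")]
    new = {row[2] for row in extract}

    # Step 4: the set difference, straight from the standard library.
    print(new - old)  # {3} -- first-time customers

With a hash-based set like this, the sorting in steps 2–3 becomes unnecessary; it only matters if you implement the difference as a merge of sorted lists.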

Davislor
  • 1,563
0

As I understand from your question, you actually have a resource_id (+ info) and a "list" of customers (id + info).

So you can easily keep a list of customers per resource and check the last node in each resource's list to know the last operation time (you just need to add a date field to your customer in the code).

I'm not familiar with SQL, so I'll give my example with a HashMap and a List, but I'm sure it's the same idea: HashMap<Resource, List<Customer>>, where Resource contains the resourceID as the key and Customer contains the customer ID, info, and operation date.

With this idea you can easily find the last operation time and can modify any resource (add/remove a resource/customer).
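
A rough Python equivalent of that structure (class and field names are illustrative, with a plain dict standing in for the HashMap):

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Customer:
        customer_id: int
        info: str
        seen_on: date  # operation date, added in code as suggested

    # resource_id -> list of customers; the last node in each list
    # tells you the resource's most recent operation time.
    ledger: dict[int, list[Customer]] = {}

    def record(resource_id: int, customer: Customer) -> None:
        ledger.setdefault(resource_id, []).append(customer)

    record(10, Customer(1, "Bob", date(2015, 3, 1)))
    record(10, Customer(2, "Alice", date(2015, 3, 5)))
    print(ledger[10][-1].seen_on)  # 2015-03-05, the last operation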

AsfK
  • 101
0

If you are using a SQLite database and you add the date of the batch as another column of the table,

10| Title X       | 1 | Bob    | 2015-03-01
11| Another title | 1 | Bob    | 2015-03-01
...............................
10| Title X       | 2 | Alice  | 2015-03-05

it would be pretty easy to use SQL to get the resources not used in the last X days:

Select r.ResourceID from Resources r
group by r.ResourceID
having julianday('now') - max(julianday(r.DateUpdated)) > X

I have not tested the SQL, but it should give you an idea.

0

From the original post, it sounds like the data being ingested does not have a field to indicate the date/time of the transaction, and I presume the file is ingested on a regular schedule (daily, hourly, etc.).

I would handle this by adding a SQL timestamp column which is either autogenerated at the database level, or by the code that extracts the data and inserts into the DB. Then you put an index on that timestamp column and be done with it. Let the DB engine do the job of making it efficient to answer the question "how many transactions haven't happened since this time", or "how many between this time and that time".

Then you schedule a job to query and calculate the differentials that you want to report on. Transactions that are "new" are transactions that don't have any records in the DB prior to the date you are asking "new since". Old records are those that have no transactions since a cut-off date.
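
A sketch of that setup, assuming SQLite and made-up table/column names; the timestamp is autogenerated at the database level via a DEFAULT:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE transactions (
            resource_id INTEGER,
            customer_id INTEGER,
            ingested_at TEXT DEFAULT CURRENT_TIMESTAMP  -- autogenerated
        );
        CREATE INDEX idx_tx_time ON transactions (ingested_at);
        INSERT INTO transactions (resource_id, customer_id) VALUES (10, 1);
    """)
    # Old records: resources with no transactions since the cut-off date.
    stale = conn.execute("""
        SELECT resource_id
        FROM transactions
        GROUP BY resource_id
        HAVING MAX(ingested_at) < datetime('now', '-18 months')
    """).fetchall()
    print(stale)  # [] -- the sample row was ingested just now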

-2

Isn't this what HashTables are for? If all you want to do is keep records of which resources have been used in the past months and delete resources that haven't been accessed in the last 18 months, then you can use a HashTable where the key is the resource_id and the value is the last access date.

For archiving the >18-month records, you can go through all of the records in the hash table and just remove (or move) those specific records. (You can do this weekly when the report comes in.)
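
A quick sketch of that in Python, with a plain dict standing in for the HashTable and months approximated as 30 days:

    from datetime import date, timedelta

    last_access: dict[int, date] = {}  # resource_id -> last access date

    def record_hit(resource_id: int, seen_on: date) -> None:
        last_access[resource_id] = seen_on

    def archive_stale(today: date, months: int = 18) -> list[int]:
        cutoff = today - timedelta(days=months * 30)
        stale = [rid for rid, seen in last_access.items() if seen < cutoff]
        for rid in stale:
            del last_access[rid]  # or move the record to long-term archival
        return stale

    record_hit(10, date(2015, 3, 1))
    print(archive_stale(date(2017, 3, 1)))  # [10]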