0

I am trying to retrieve a list of transaction IDS that only contain at least 5 product IDS, the baskets could have many more product IDS but it must contain these 5 also.

My table looks something like this:

| transaction_id | trans_date | product_id | store_id |

The table is all at product_id line level detail, nothing is rolled up.

select * from Transactions
where trans_date in ('2019-03-07','2019-03-06','2019-03-08','2019-03-09','2019-03-10')
and product_id in ('000000000021003503','000000000060280873','000000000060281480','000000000060281642','000000000020059281')

This is recovering all the transactions that contain these ids but I need it to be rolled up to a transaction level, so if the transaction_id contained these ids (at least) then its included.

Hopefully this makes sense

RustyRyan
  • 89
  • 7

1 Answers1

1

If Impala supports GROUP BY, COUNT and HAVING, then you can use the following.

Here we select the records for matching dates and products, then we count how many product_ids we actually found for each transaction on each day, and we take only those results for which we found all 5 product_ids.

SELECT transaction_id, trans_date, COUNT(product_id) as counter
FROM   Transactions
WHERE  trans_date IN ('2019-03-07','2019-03-06','2019-03-08','2019-03-09','2019-03-10')
AND    product_id IN ('000000000021003503', '000000000060280873', '000000000060281480', '000000000060281642', '000000000020059281')
GROUP BY transaction_id, trans_date
HAVING count(product_id) = 5

If a single transaction is allowed to span multiple dates, then you can just leave out the trans_date from the SELECT and GROUP BY parts.

Peter B
  • 314
  • 1
  • 11