2

I am looking at the overlap and non-overlap (unique values) of users-ids from two different select statements using a full join. The main differentiation being that one table will have a deal_id = 0 and the other will have any deal_id greater than or equal to one. I am joining the select statements on exchange_id, pub_id, and user_id but not on deal_id.

The field incremental value is trying to calculate users who are present in the pool deal_id >= 1 and not present in the pool of deal_id = 0 (a main reason for the full join).

Here is a simplification of the query I've typed up, it's in two select statements :

SET
hive.auto.convert.join = TRUE
;

SELECT
    First.deal_id
    ,COALESCE( First.exchange_id, Second.exchange_id ) as exchange_id
    ,COALESCE( First.pub_id, Second.pub_id ) as pub_id
    ,COUNT (DISTINCT(case when Second.user_id is null then First.user_id else null END)) AS Incremental
    ,SUM (First.imps) AS First_imps
    ,SUM (Second.imps) AS Second_imps
    FROM
        (
            SELECT
                a.deal_id
                ,a.exchange_id
                ,a.pub_id
                ,a.user_id
                ,1 AS imps
            FROM
                logs a 
            WHERE
                a.deal_id >= 1
            AND a.event_type = 'TRUE'
        ) First 
        FULL JOIN (
            SELECT
                a.exchange_id
                ,a.pub_id
                ,a.user_id
                ,1 AS imps
            FROM
                logs a
            WHERE
            a.deal_id = 0
            AND a.event_type = 'TRUE'
        ) Second
        ON (
            First.exchange_id = Second.exchange_id
            AND First.pub_id = Second.pub_id
            AND First.user_id = Second.user_id
        )
        GROUP BY
        First.deal_id
        ,COALESCE( First.exchange_id, Second.exchange_id )
        ,COALESCE( First.pub_id, Second.pub_id )
;

Here are the results I am seeing:

DEAL_ID    EXCHANGE_ID    PUB_ID    INCREMENTAL    FIRST_IMPS    SECOND_IMPS
/N         4              1780      0              0             15
/N         4              1560      0              0             32
3389       4              1780      2              7             6
1534       4              1560      4              9             8

And here is what I would like to see:

DEAL_ID    EXCHANGE_ID    PUB_ID    INCREMENTAL    FIRST_IMPS    SECOND_IMPS
3389       4              1780      2              7             21
1534       4              1560      4              9             40

Where the results with a null deal id match up to the results with a non-null deal id based on exchange_id and pub_id.

What can I do?

Similar to this problem but this solution isn't working for this problem.

Note: I've posted this question on stackoverflow here but thought I might try dba instead

Edit: Here is a sqlfiddle that replicates the problem, note that it's using PostgreSQL while I'm using hql

userLP
  • 31
  • 1
  • 7

2 Answers2

2

All this approach does in to make your original query a derived table then group by pub_id.

SET hive.auto.convert.join = TRUE;

SELECT max(DEAL_ID) as DEAL_ID
     , EXCHANGE_ID
     , PUB_ID
     , sum(INCREMENTAL) as INCREMENTAL
     , sum(FIRST_IMPS) as FIRST_IMPS
     , sum(SECOND_IMPS) as SECOND_IMPS
  FROM (

    SELECT First.deal_id
         , COALESCE( First.exchange_id, Second.exchange_id ) as exchange_id
         , COALESCE( First.pub_id, Second.pub_id ) as pub_id
         , COUNT(DISTINCT(case when Second.user_id is null then First.user_id else null END)) AS Incremental
        , SUM(First.imps) AS First_imps
        , SUM(Second.imps) AS Second_imps
    FROM (SELECT a.deal_id
               , a.exchange_id
               , a.pub_id
               , a.user_id
               , 1 AS imps
            FROM logs a 
           WHERE a.deal_id >= 1
            AND a.event_type = 'TRUE'
         ) First 
FULL JOIN (SELECT a.exchange_id
                , a.pub_id
                , a.user_id
               , 1 AS imps
            FROM logs a
           WHERE a.deal_id = 0
             AND a.event_type = 'TRUE'
          ) Second
      ON (   First.exchange_id = Second.exchange_id
         AND First.pub_id = Second.pub_id
         AND First.user_id = Second.user_id
         )
   GROUP BY First.deal_id
          , COALESCE( First.exchange_id, Second.exchange_id )
          , COALESCE( First.pub_id, Second.pub_id )

  ) group by pub_id, exchange_id

;

RMathis
  • 482
  • 3
  • 10
1

I've found that this solution works. It's not very elegant and I'm worried over scale(does it run the subquery twice or once) but it works. Here is the fiddle

WITH subquery as
       ( 
SELECT
    First.deal_id
    ,COALESCE( First.exchange, Second.exchange ) as exchange_id
    ,COALESCE( First.publisher, Second.publisher ) as pub_id
    ,COUNT (DISTINCT(case when Second.user_id is null then First.user_id else null END)) AS Incremental
    ,SUM (First.imps) AS First_imps
    ,SUM (Second.imps) AS Second_imps
    FROM
        (
            SELECT
                a.deal_id
                ,a.exchange
                ,a.publisher
                ,a.user_id
                ,1 AS imps
            FROM
                T1 a 
            WHERE
                a.deal_id >= 1
        ) First 
        FULL OUTER JOIN (
            SELECT
                a.exchange
                ,a.publisher
                ,a.user_id
                ,1 AS imps
            FROM
                T1 a
            WHERE
            a.deal_id = 0
        ) Second
        ON (
            First.exchange = Second.exchange
            AND First.publisher = Second.publisher
            AND First.user_id = Second.user_id
        )
        GROUP BY
        First.deal_id
        ,COALESCE( First.exchange, Second.exchange )
        ,COALESCE( First.publisher, Second.publisher )
        )

SELECT
deal.deal_id,
deal.exchange_id,
deal.pub_id,
sum(deal.incremental),
sum(deal.first_imps),
sum(coalesce(deal.second_imps, 0) + coalesce(oa.second_imps,0))
FROM 
subquery deal
LEFT JOIN 
subquery oa 
ON (deal.exchange_id = oa.exchange_id
AND deal.pub_id = oa.pub_id
AND oa.deal_id is null)
WHERE deal.deal_id is not null
GROUP BY
deal.deal_id,
deal.exchange_id,
deal.pub_id
;
userLP
  • 31
  • 1
  • 7