Cumulative sum using hiveql

Question

I have a table in Hive which looks like:

col1       col2
b           1
b           2
a           3
b           2
c           4
c           5

How do I, with hiveql, group up col1 elements together, sum them up, sort by the sum, as well as create a cumulative sum (csum) based on the sum?

id       sum_all     csum
a         3           3
b         5           8
c         9           17

I've only managed to come up with the grouping and sum, but am out of ideas regarding the cumulative sum. Correlated sub-queries are not supported in Hive

select col1 as id
      sum(col2) as sum_all
from t
group by col1
order by sum_all

With the following result:

id       sum_all
a         3
b         5
c         9

ypercubeᵀᴹ · Accepted Answer · 2016-10-26T10:19:19.307

Since correlated subqueries are not allowed, try using derived tables and then joining them.

select 
    a.id,
    a.sum_all,
    sum(b.sum_all) as csum
from
        ( select col1 as id,
                 sum(col2) as sum_all
          from t
          group by col1
        )  a
    join
        ( select col1 as id,
                 sum(col2) as sum_all
          from t
          group by col1
        )  b
     on
        ( b.sum_all < a.sum_all )
     or ( b.sum_all = a.sum_all and b.id <= a.id )
group by
    a.sum_all, a.id
order by 
    a.sum_all, a.id ;

This is essentially a self-join on the derived group-by table. It might be more efficient to first save the group-by results into a temporary table and then do the self-join.

According to the manual, Hive also has windowing aggregates, so you could use them as well:

select 
    a.id,
    a.sum_all,
    sum(a.sum_all) over (order by a.sum_all, a.id
                         rows between unbounded preceding
                                  and current row)
        as csum
from
        ( select col1 as id,
                 sum(col2) as sum_all
          from t
          group by col1
        )  a
order by 
    sum_all, id ;

or with:

select 
    col1 as id,
    sum(col2) as sum_all,
    sum(sum(col2)) over (order by sum(col2), col1
                         rows between unbounded preceding
                                  and current row)
        as csum
from
    t
group by 
    col1
order by 
    sum_all, id ;

Cumulative sum using hiveql

1 Answers1