For each tuple, get the name of the first column which is non-zero

Question

I have a table in Hive which looks like:

| Name | 1990 | 1991 | 1992 | 1993 | 1994 |
| Rex  | 0    | 0    | 1    | 1    | 1    |
| Max  | 0    | 0    | 0    | 0    | 1    |
| Phil | 1    | 1    | 1    | 1    | 1    |

I would like to get, for each row, the name of the first column which is non-zero, so something like:

| Name | Column |
| Rex  | 1992   |
| Max  | 1994   |
| Phil | 1990   |

For each row, it is guaranteed that:

There is at least one column with "1"; and
If column X has is "1", for each column Y > X, column Y will also have a "1".

Vérace · Accepted Answer · 2021-12-01T17:21:12.293

There is a "trivial" solution (below) which assumes that the guarantees in the question hold - which they may well do today, but for different queries, who knows - see discussion below?

All of the code below is available on the fiddle here.

Your original table:

CREATE TABLE test
(
  name TEXT,
  y1990 SMALLINT,
  y1991 SMALLINT,
  y1992 SMALLINT,
  y1993 SMALLINT,
  y1994 SMALLINT
);

Your data:

INSERT INTO test VALUES
('Rex'   ,0    , 0    , 1    , 1     , 1),
('Max'  , 0    , 0    , 0    , 0     , 1),
('Phil' , 1    , 1    , 1    , 1     , 1);

It would appear that HiveQL doesn't have a PIVOT functionality, so you'll just have to do it the good old-fashioned manual way. There is a "short-cut" as pointed out by Akina:

SELECT 
  name,
  CASE 
    WHEN sc_1990 = 1 THEN 1990 -- the CASE will drop out after first "success"!
    WHEN sc_1991 = 1 THEN 1991 
    WHEN sc_1992 = 1 THEN 1992 
    WHEN sc_1993 = 1 THEN 1993 
    WHEN sc_1994 = 1 THEN 1994
  END AS score
FROM test
ORDER BY score;

Result:

name    m_yr
Phil    1990
Rex     1992
Max     1994

However, for a longer-term solution, and examples where:

you may not be certain that there is a 1 value in a given (or any) field, or
you might wish to query over a long period of years - say 1960 to 2010 - your SQL will be fairly horrendous if you are forced, by the table design, to query year-by-year.

So, you'd be better off using the following approach:

CREATE TABLE toradh  -- both "result" and "match" are SQL keywords! - see https://www.drupal.org/docs/develop/coding-standards/list-of-sql-reserved-words, so I used an Irish word!
(
  name TEXT,
  yr   SMALLINT,
  score SMALLINT
);

and then:

INSERT INTO toradh
VALUES
('Rex', 1990, 0),
('Rex', 1991, 0),
...
... snipped for brevity
...

and, to check: SELECT * FROM toradh;:

Result:

name      yr    score
Rex     1990        0
Rex     1991        0
Rex     1992        1
Rex     1993        1
Rex     1994        1
Max     1990        0
...
... snipped for brevity
...

So, your query could be written something like this:

SELECT
  name, MIN(yr) -- , MAX(scor) -- see what happens when you uncomment
FROM toradh
WHERE yr BETWEEN 1990 AND 1994  -- or 1960 AND 2010
AND score = 1
GROUP BY name
ORDER BY MIN(yr);

Result (same):

name     min
Phil    1990
Rex     1992
Max     1994

You will find that queries are much simpler to write into the future and that you will be able to ask much more complex questions of your data in a much simpler way.

Imagine, for example, that instead of between 1990 and 1994, your query was between 1960 and 2010 - it would be huge - this way, it would be exactly the same size - only the year from and to parameters would have to vary!

Tables are better when tall and thin, not short and fat! Also, in future, when asking questions such as this, could you please set up your own fiddle with your tables and data - it prevents duplication of effort on behalf of those trying to answer and it provides a single source of truth - help us to help you! p.s. welcome to dba.se!

Almost as importantly, the simplified query above assumes that the data is known in advance - i.e. you have stated that:

There is at least one column with "1"; and
If column X has is "1", for each column Y > X, column Y will also have a "1".

However, except in the most trivial cases, normally one cannot assume known values (even if one might be fairly sure), so you'd just have to resort to a query like this:

SELECT name, 1990 AS yr, sc_1990 AS score FROM test t1
UNION ALL
SELECT name, 1991 AS yr, sc_1991 FROM test t1
UNION ALL
SELECT name, 1992 AS yr, sc_1992 FROM test t1
UNION ALL
SELECT name, 1993 AS yr, sc_1993 FROM test t1
UNION ALL
SELECT name, 1994 AS yr, sc_1994 FROM test t1;

Result:

name    yr     score
Rex     1990       0
Max     1990       0
Phil    1990       1
Rex     1991       0
...
... snipped for brevity
...

And your query (given no prior knowledge of the data) would look like:

WITH cte AS
(
  SELECT name, 1990 AS yr, sc_1990 AS score FROM test t1
  UNION ALL
  SELECT name, 1991 AS yr, sc_1991 FROM test t1
  UNION ALL
  SELECT name, 1992 AS yr, sc_1992 FROM test t1
  UNION ALL
  SELECT name, 1993 AS yr, sc_1993 FROM test t1
  UNION ALL
  SELECT name, 1994 AS yr, sc_1994 FROM test t1
)
SELECT 
  c.name, MIN(c.yr) AS m_yr
FROM cte c 
WHERE c.score != 0
GROUP BY c.name
ORDER BY m_yr;

Same result - see fiddle.

All in all, you're better of refactoring your schema! Having the column names as part of the data, you're mixing data and metadata which is never good practice!

For each tuple, get the name of the first column which is non-zero

1 Answers1