How can I write a proper query to JOIN all of these tables without duplicates?

Question

Let's say I have four tables:

TABLE:    players
COLUMNS:  id, first_name, last_name

TABLE:    passing_stats
COLUMNS:  id, year, passing_yards (several other passing columns)

TABLE:    rushing_stats
COLUMNS:  id, year, rushing_yards (several other rushing columns)

TABLE:    receiving_stats
COLUMNS:  id, year, receiving_yards (several other receiving columns)

Let's say Michael Vick has an id of 100. I want to get his full name and all of his stats for each year (passing, rushing, and receiving).

I don't want any duplicates, meaning that rushing stats for 2011 should appear in the same row as passing stats for 2011.

What is the most elegant way to write this query? Thanks.

ypercubeᵀᴹ · Answer 1 · 2012-04-12T07:44:02.203

The following will work in Postgres. Test here: SQL-Fiddle, postgres-test. SQL-Server does not have NATURAL JOIN and MySQL has NATURAL but doesn't have FULL joins:

SELECT
    id,
    first_name,
    last_name,
    year,
    passing_yards,
    rushing_yards,
    receiving_yards
  FROM 
      players p 
    NATURAL LEFT JOIN 
      ( passing_stats pas
      NATURAL FULL JOIN 
        rushing_stats rus
      NATURAL FULL JOIN 
        receiving_stats rec 
      )

score 4 · Answer 2 · edited Jun 15 '20 at 09:05

SELECT p.first_name
     , p.last_name
     , year
     , pas.passing_yards
     , rus.rushing_yards
     , rec.receiving_yards
FROM   players p 
LEFT   JOIN (
            (SELECT * FROM passing_stats   WHERE id = 100) pas
FULL   JOIN (SELECT * FROM rushing_stats   WHERE id = 100) rus USING (id, year)
FULL   JOIN (SELECT * FROM receiving_stats WHERE id = 100) rec USING (id, year)
   ) USING (id)
WHERE  p.id = 100
ORDER  BY year;

The subqueries are the core feature - to reduce to relevant rows early which should yield the best performance.

@ypercube's idea with NATURAL JOIN is even a bit shorter but breaks more easily when the query or the underlying tables are changed. Some even frown at the USING clause for similar reasons, to a lesser extent. But it keeps the query short and easy to read in this case.

The LEFT JOIN to return players with no statistics at all.

Alternate query

My first idea was basically a variant of @kgrittn's query. My fixes are somewhat invasive, so I posted another answer. Requires PostgreSQL 8.4 or newer.

WITH x AS (SELECT 100 AS id)
   , y AS (
  SELECT year FROM x JOIN passing_stats USING (id)
  UNION
  SELECT year FROM x JOIN rushing_stats USING (id)
  UNION
  SELECT year FROM x JOIN receiving_stats USING (id)
  )
SELECT p.first_name
     , p.last_name
     , y.year
     , s.passing_yards
     , r.rushing_yards
     , c.receiving_yards
FROM  (x CROSS JOIN y)
JOIN   players p USING (id)
LEFT   JOIN passing_stats s USING (id, year)
LEFT   JOIN rushing_stats r USING (id, year)
LEFT   JOIN receiving_stats c USING (id, year)
ORDER  BY y.year;

db<>fiddle here

kgrittn · Answer 3 · 2012-04-12T12:21:22.983

You didn't provide the DDL and data to set this up, so this is untested, but something along these lines should work under PostgreSQL. I think this is mostly standard-conforming, so I would expect it to work on other products with little or no change.

WITH x(id) AS (SELECT 100),
     y(year) AS (
  SELECT year FROM passing_stats t1 JOIN x ON t1.id = x.id
  UNION
  SELECT year FROM rushing_stats r1 JOIN x ON r1.id = x.id
  UNION
  SELECT year FROM receiving_stats c1 JOIN x ON c1.id = x.id)
SELECT
    p.first_name,
    p.last_name,
    y.year,
    t.passing_yards,
    r.rushing_yards,
    c.receiving_yards
  FROM x
  JOIN y ON TRUE
  JOIN players p ON p.id = x.id
  LEFT JOIN passing_stats t ON t.id = x.id AND t.year = y.year
  LEFT JOIN rushing_stats r ON r.id = x.id AND r.year = y.year
  LEFT JOIN receiving_stats c ON c.id = x.id AND c.year = y.year;

How can I write a proper query to JOIN all of these tables without duplicates?

3 Answers3

Alternate query

Linked