1

Firstly apologies if this is a duplicate, I am fairly new to SQL and so Im not sure what the correct terminology to use in my searches

So I have a database which records motor races, with the following simplified schema

race_table
==========
race_id PK
race_date timestamp
average_speed Decimal
max_speed Decimal

drivers
=======
driver_id PK
driver_name text
date_started timestamp

driver_races
============
driver_id FK
race_id FK

If each driver has ~1000 races spread over 2/3 years

How would I go about querying the overall % change (positive of negative) in their average race speed for a given date range, for example

% Change in first 6 months

Joe Smith - 5% increase
Andy James - 4% increase

% Change in first 12 months

Joe Smith - 8% increase
Lewis May - 6% increase

UPDATE: More detail on % Change

The % Change could be calculated using linear regression, (least-squares-fit would be suitable), the average change is effectivly the y-difference on a line-of-best-fit, where each point is a race, x is the race_date and y is the average_speed for that race.

Postgres's regr_slope will give the gradient of the line which is effectivly the same as the %change

SELECT regr_slope(r.average_speed, EXTRACT(EPOCH FROM r.race_date)) as slope
    FROM race_table as r, driver_races as dr
    WHERE dr.race_id = r.race_id
    AND d.driver_id = 1

This gives the figure I want, but I now need to apply is against all users, sorted by 'slope'

DaveB
  • 319
  • 2
  • 6
  • 16

1 Answers1

4

Data model

You are saving avg. and max. speed per race, not per driver? Normally this would be something like:

CREATE TABLE race (
  race_id   serial PRIMARY KEY
 ,race_date timestamp
);

CREATE TABLE driver ( driver_id serial PRIMARY KEY ,driver_name text ,date_started timestamp );

CREATE TABLE driver_race ( driver_id int REFERENCES driver(driver_id) ,race_id int REFERENCES race(race_id) ,average_speed numeric ,max_speed numeric );

This is also how you should provide your data model: as CREATE scripts ready to use for testing.

numeric and decimal are the same in Postgres.

Query

Your query extended to all drivers:

SELECT driver_id, d.driver_name, x.slope
FROM   driver d
JOIN  (
   SELECT dr.driver_id
         ,regr_slope(dr.average_speed, EXTRACT(EPOCH FROM r.race_date)) AS slope
   FROM   race r
   JOIN   driver_race dr USING (race_id)
   WHERE  r.race_date BETWEEN '2013-3-1'::date AND '2013-3-31'::date 
   GROUP  BY dr.driver_id
   ) x USING (driver_id)
ORDER  BY slope DESC;

Simple alternative

A simple alternative would be to match the average speed of the first half of a time period to the average of the second half:

SELECT driver_id, d.driver_name
      ,round(x.avg1::numeric, 2) AS avg1
      ,round(x.avg2::numeric, 2) AS avg2
      ,round((x.avg2 / x.avg1 - 1) * 100, 2) AS pct_change
FROM   driver d
JOIN  (
   SELECT dr.driver_id
         ,avg(CASE WHEN r.race_date BETWEEN '2013-3-01'::date AND '2013-3-15'::date
              THEN dr.average_speed END) AS avg1
         ,avg(CASE WHEN r.race_date BETWEEN '2013-3-16'::date AND '2013-3-31'::date
              THEN dr.average_speed END) AS avg2
   FROM    race r
   JOIN   driver_race dr USING (race_id)
   WHERE  r.race_date BETWEEN '2013-3-1'::date AND '2013-3-31'::date 
   GROUP  BY dr.driver_id
   ) x USING (driver_id)
ORDER  BY pct_change DESC;

The CASE expression has no ELSE branch, because the default NULL serves just fine: avg() ignores NULL values.

-> SQLfiddle demo.

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633