MS SQL, fastest way to delete 99% of rows from 100+ million rows table to deduplicate data

Question

Due to an error in our application, one of the tables in our MS SQL has been flooded with duplicate rows differing only in PK identity column and timestamp

+--------+---------------------+--------------------+
| id     | time                |       value        |
+--------+---------------------+--------------------+
| #10003 | 2014-02-26 12:00:00 |   uniq_pattern_1   |
| #20005 | 2014-02-26 13:00:00 |   uniq_pattern_1   |
| #30018 | 2014-02-26 14:00:00 |   uniq_pattern_1   |
| #40022 | 2014-02-26 15:00:00 |   uniq_pattern_1   |
 ....................................................
| #10005 | 2014-02-26 12:00:00 |   uniq_pattern_2   |
| #20008 | 2014-02-26 13:00:00 |   uniq_pattern_2   |
| #30020 | 2014-02-26 14:00:00 |   uniq_pattern_2   |
| #40040 | 2014-02-26 15:00:00 |   uniq_pattern_2   |
+--------+---------------------+--------------------+

There is 100+ million rows with only about 50k unique patterns spread across entire table (every existing pattern was re-inserted every hour, with a few new patterns added to the mix). Table only has clustered PK index. I need to delete all duplicates leaving only earliest occurrence of each pattern.

I plan to do the following:

Grab earliest row for each uniq pattern into temp table.

INSERT INTO #MyTempTable
SELECT [value], min([time])
FROM [MyBigTable]
GROUP BY [value]

Truncate [MyBigTable] then populate it from #MyTempTable.

Is this a good way to do this?

MS SQL, fastest way to delete 99% of rows from 100+ million rows table to deduplicate data

0 Answers0