1

Due to an error in our application, one of the tables in our MS SQL has been flooded with duplicate rows differing only in PK identity column and timestamp

+--------+---------------------+--------------------+
| id     | time                |       value        |
+--------+---------------------+--------------------+
| #10003 | 2014-02-26 12:00:00 |   uniq_pattern_1   |
| #20005 | 2014-02-26 13:00:00 |   uniq_pattern_1   |
| #30018 | 2014-02-26 14:00:00 |   uniq_pattern_1   |
| #40022 | 2014-02-26 15:00:00 |   uniq_pattern_1   |
 ....................................................
| #10005 | 2014-02-26 12:00:00 |   uniq_pattern_2   |
| #20008 | 2014-02-26 13:00:00 |   uniq_pattern_2   |
| #30020 | 2014-02-26 14:00:00 |   uniq_pattern_2   |
| #40040 | 2014-02-26 15:00:00 |   uniq_pattern_2   |
+--------+---------------------+--------------------+

There is 100+ million rows with only about 50k unique patterns spread across entire table (every existing pattern was re-inserted every hour, with a few new patterns added to the mix). Table only has clustered PK index. I need to delete all duplicates leaving only earliest occurrence of each pattern.

I plan to do the following:

Grab earliest row for each uniq pattern into temp table.

INSERT INTO #MyTempTable
SELECT [value], min([time])
FROM [MyBigTable]
GROUP BY [value]

Truncate [MyBigTable] then populate it from #MyTempTable.

Is this a good way to do this?

Misha Brukman
  • 208
  • 2
  • 12
Ivan Koshelev
  • 245
  • 1
  • 2
  • 6

0 Answers0