Questions tagged [duplication]

The Duplication tag covers the topic of duplicated data and the strategies involved in removing the duplicates, often referred to as deduplication or simply dedupe.

Duplication refers to the inadvertent creation of duplicate entries in tables and the process for resolving the repeated entries.

In the simple case, this involves identifying a survivor row and eliminating the non-survivors.

A slightly more complex situation, survivorship logic must be employed. In the following example, the rule was to simply collapse non-empty values.

Before

CustomerName | Address | Phone
Bob          | 100 Elm | 
Bob          |         | 555.1234

After

CustomerName | Address | Phone
Bob          | 100 Elm | 555.1234

The truly complex scenarios stem from conflicting values. Consider the following example, there is a conflicting value for the phone value in our deduplication candidate. Which value is the "right" is likely to be dependent on the supplied business rules and/or data collection logic.

Before

CustomerName | Address | Phone
Bob          | 100 Elm | 555.4234
Bob          |         | 555.1234

After 1

CustomerName | Address | Phone
Bob          | 100 Elm | 555.1234

After 2

CustomerName | Address | Phone
Bob          | 100 Elm | 555.4234

After 3

CustomerName | Address | Phone
Bob          |         | 555.1234

Which After is correct is an "it depends" answer.

The first After merged the changes together. Perhaps the data collection system only records changed data (and we further assume said system required address for an entry to be valid)

The second After declared the first row completely correct. Perhaps there was an additional field showing it was the most recently updated. Or because it was most completely filled out, the assumption is that it was the survivor.

The third After declared the second row completely correct. It could be that there was another field showing it to be the most recently updated (Bob is now homeless, pity him).

265 questions
45
votes
6 answers

Checking whether two tables have identical content in PostgreSQL

This has already been asked on Stack Overflow, but only for MySQL. I'm using PostgreSQL. Unfortunately (and surprisingly) PostgreSQL does not seem to have something like CHECKSUM table. A PostgreSQL solution would be fine, but a generic one would be…
Faheem Mitha
  • 1,049
  • 2
  • 12
  • 18
45
votes
3 answers

select rows where column contains same data in more than one record

I have a table that has a column called article_title. Let's say the table name is articles. I need to find out the records where the article_title data is the same on more than one record. Here's what I've got: select a.* from articles a where…
somejkuser
  • 797
  • 3
  • 8
  • 15
36
votes
1 answer

Optimal way to ignore duplicate inserts?

Background This problem relates to ignoring duplicate inserts using PostgreSQL 9.2 or greater. The reason I ask is because of this code: -- Ignores duplicates. INSERT INTO db_table (tbl_column_1, tbl_column_2) VALUES ( SELECT …
Dave Jarvis
  • 833
  • 1
  • 10
  • 25
28
votes
1 answer

Multiple on conflict targets

I have two unique indexes on columns a and b. I need something like this: insert into my_table (a, b) values (1, 2), (1, 2) on conflict (a) do update set c = 'a_violation' on conflict (b) do update set c = 'b_violation' So generally I want to make…
user606521
  • 1,415
  • 6
  • 23
  • 28
21
votes
2 answers

On duplicate key do nothing

I am inserting into the following table using LuaSQL with PtokaX API. CREATE TABLE `requests` ( `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, `ctg` VARCHAR(15) NOT NULL, `msg` VARCHAR(250) NOT NULL, `nick` VARCHAR(32) NOT NULL, …
hjpotter92
  • 516
  • 2
  • 10
  • 24
12
votes
5 answers

How do I copy my SQL Azure database to a local SQL Server instance?

I have an OLTP database hosted on a SQL Azure instance. I want to pull a copy of the database down from the cloud so I can run some heavy extracts and OLAP-style queries against it without impacting the source database. How do I pull a copy of the…
Nick Chammas
  • 14,810
  • 17
  • 76
  • 124
10
votes
4 answers

How to check if a subquery has exactly one distinct result and a specified value concisely?

I found myself writing the following: select 'yes' where exists(select * from foo where val=1) and not exists(select * from foo where val<>1); and wondering if there is a more concise way without sacrificing too much readability. I found one way…
Jack Douglas
  • 40,517
  • 16
  • 106
  • 178
9
votes
2 answers

Partial db dump/restore

A part of my database (PostgreSQL 9.3) relies on extra tables (ex. County, City, Town, ... ). I don't manage these tables, they are updated regularly by a third party. Each time I get a new full dump, but I have a hard time to push the changes back…
9
votes
2 answers

How do I remove duplicate records in a join table in PostgreSQL?

I have a table that has a schema like this: create_table "questions_tags", :id => false, :force => true do |t| t.integer "question_id" t.integer "tag_id" end add_index "questions_tags", ["question_id"], :name =>…
marcamillion
  • 205
  • 1
  • 2
  • 6
9
votes
3 answers

How to delete duplicate records in MySQL, in a table without IDs?

I need to delete the duplicate records in this table. However, there is no id for each row. Example Data product amount quantity table 2000 5 chair 300 25 TV 30000 4 bike 300 25 table 2000 5 chair 300 25 chair 300 25 Expected…
Edwin Babu
  • 193
  • 1
  • 1
  • 5
8
votes
3 answers

Deleting all duplicates

I am trying to delete all the duplicates but keeping single record only (shorter id). Following query deletes duplicates but take lot of iterations to delete all copies and keeping original ones. DELETE FROM emailTable WHERE id IN ( SELECT * FROM…
Gary Lindahl
  • 443
  • 3
  • 12
8
votes
1 answer

How do I insert record only if the record doesn't exist?

I have a PaymentInformation table ID NAME Start End 1 Tina 1/7/2014 1/17/2014 2 Alice 3/7/2014 3/17/2014 3 John 5/7/2014 5/17/2014 4 Michelle 7/7/2014 7/17/2014 I'm building my SQL query…
Andy
  • 89
  • 1
  • 1
  • 3
8
votes
1 answer

Duplicate records returned from table with no duplicates

I have a stored procedure that queries a busy queue table that is used to distribute work in our system. The table in question has a primary key on WorkID and no duplicates. A simplified version of the query is: INSERT INTO #TempWorkIDs…
JNK
  • 18,064
  • 6
  • 63
  • 98
8
votes
6 answers

How can I display twice each stored row without using UNION ALL or a temporary table?

The relevant table, named emp, holds the following data: CREATE TEMPORARY TABLE emp AS SELECT * FROM ( VALUES (1,'A'), (2,'B'), (3,'C') ); ID Name -- ---- 1 A 2 B 3 C And the output or result-set of the data manipulation operation…
Anup
  • 89
  • 1
  • 1
  • 2
8
votes
3 answers

INSERT... ON DUPLICATE KEY UPDATE not working as I expect

I have a table called "Example" CREATE TABLE IF NOT EXISTS `example` ( `id` int(11) NOT NULL AUTO_INCREMENT, `a` int(11) NOT NULL, `b` int(11) NOT NULL, `c` int(11) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1; I…
Relax
  • 189
  • 1
  • 1
  • 5
1
2 3
17 18