How can this unique index allow duplicate rows?

Question

Is there a way that this unique index has allowed duplicate rows? I thought maybe there are some extra space characters, but I can't find them.

=> select *, length(keyword), length(country), length(language) from keyword where id in (4588076, 4951423);
   id    |       keyword       | seed_id | source | search_count | country | language | volume | cpc  | competition | modified_on | violation | revenue | length | length | length
---------+---------------------+---------+--------+--------------+---------+----------+--------+------+-------------+-------------+-----------+---------+--------+--------+--------
 4588076 | power wallet review |         | SPYFU  |            0 |         |          |     70 | 0.11 |        0.31 |             |           |         |     19 |        |
 4951423 | power wallet review |         | SPYFU  |            2 |         |          |     70 | 0.11 |        0.31 |             |           |         |     19 |        |
(2 rows)

The index is

"keyword_keyword_country_language" UNIQUE, btree (keyword, country, language)

PostgreSQL 9.5.3

OK I was planning to remove the other two columns, but I thought I would test the keyword column and found this:

=> select k1.id, k1.keyword, k2.id, k2.keyword, k1.keyword=k2.keyword from keyword k1, keyword k2 where k1.id=4588076 and k2.id=4951423;
   id    |       keyword       |   id    |       keyword       | ?column?
---------+---------------------+---------+---------------------+----------
 4588076 | power wallet review | 4951423 | power wallet review | f

Joishi Bodio · Answer 1 · 2016-08-23T19:47:38.870

The answer can be found in the amazing documentation .. it LOOKS like you have NULL values in your tables.. When the DB checks for uniqueness, it says "does NULL equal NULL? NOPE!" and allows it.

The important bit below (emphasis mine):

are not allowed. Null values are not considered equal. A multicolumn unique

If you want to keep the uniqueness across ALL THREE COLUMNS and, at the same time, treat nulls as equal, then you have to get creative with your UNIQUE indexes by making them partial indexes..

CREATE UNIQUE INDEX ix1 ON table (col1, col2wNull, col3wNull) WHERE col2wNull is not NULL and col3wNull is not Null;
CREATE UNIQUE INDEX ix2 ON table (col1, col2wNull) WHERE col2wNull is not Null and col3wNull is Null;
CREATE UNIQUE INDEX ix3 ON table (col1, col3wNull) WHERE col2wNull is Null and col3wNull is not Null;
CREATE UNIQUE INDEX ix4 ON table (col1) WHERE col2wNull is NULL and col3wNull is NULL;

As you can see .. it can easily get a bit crazy.

A different alternative would be to make col2wNull and col3wNull be defined as NOT NULL and provide some default value for when nothing is supplied. THIS MAY OR MAY NOT BE A GOOD IDEA depending on what you're doing. "Magic values" have a tendency to give you lots of problems later.

With regard to your edit and the two strings appearing to be equal, but the database reporting that they are not - I can only imagine that there are some "invisible" characters (UTF-8?) that are in the string. Or it could be something as simple as one string has an additional space on the end. It partially depends on how you are saving it into the database. (Are you performing a trim() on them, lower(), etc..)

You could try comparing the strings in various other ways (such as looking at md5 hash). I believe you can ask postgres to convert the column value into hex to view, as well, but how to go about doing that is escaping me at the moment (my apologies).

score 0 · Accepted Answer · answered Aug 23 '16 at 20:11

OK I found that there was a non-breaking space in the text, which used 2 bytes, even though Postgres said they were the same length.

=> select k1.id, k1.keyword, k2.id, k2.keyword, k1.keyword=k2.keyword, encode(convert_to(k1.keyword, 'UTF8'), 'hex'), encode(convert_to(k2.keyword, 'UTF8'), 'hex') from keyword k1, keyword k2 where k1.id=4588076 and k2.id=4951423;
   id    |       keyword       |   id    |       keyword       | ?column? |                  encode                  |                 encode
---------+---------------------+---------+---------------------+----------+------------------------------------------+----------------------------------------
 4588076 | power wallet review | 4951423 | power wallet review | f        | 706f7765722077616c6c6574c2a0726576696577 | 706f7765722077616c6c657420726576696577
(1 row)

The two hex values were

706f7765722077616c6c6574c2a0726576696577
706f7765722077616c6c657420726576696577

The difference was c2a0. PHP:

>>> hex2bin('c2a0')
=> " "

I searched c2a0 and found it was a non-breaking space. http://www.fileformat.info/info/unicode/char/00a0/index.htm

I don't know of anyway to prevent that in a unique index. Do you?

The other answer deserves some credit too.

How can this unique index allow duplicate rows?

2 Answers2

Linked