What is the difference between different utf8mb4 binary collations?

Question

What is the difference between utf8mb4_0900_bin vs utf8mb4_bin binary collations?

Solomon Rutzky · Accepted Answer · 2020-10-14T06:49:30.327

There are three differences as far as I can tell (according to their documentation):

Case-mappings (for LOWER() / UPPER() functions):

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-uca

The LOWER() and UPPER() functions perform case folding according to the collation of their argument.

The difference between the two collations in this context is that the _0900_ version, being based on a newer version of Unicode, quite likely has more mapping definitions (and possibly even some corrections).
Padding vs No Padding (of trailing spaces):

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-pad-attributes

The pad attribute for utf8mb4_bin is PAD SPACE, whereas for utf8mb4_0900_bin it is NO PAD. Consequently, operations involving utf8mb4_0900_bin do not add trailing spaces, and comparisons involving strings with trailing spaces may differ for the two collations

Essentially, utf8mb4_bin ignores trailing spaces while utf8mb4_0900_bin does not ignore them. See the documentation (linked above) for an example.
Sorting (performance only, not the ordering):

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-collating-weights
- For _bin collations except utf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.
- For utf8mb4_0900_bin, the weight is the utf8mb4 encoding bytes. The sort order is the same as for utf8mb4_bin, but much faster.
Translating that into human, they are saying that for a code point such as U+FF9D, utf8mb4_bin will see the UTF-8 encoded byte sequence of EF BE 9D and convert that into 00 FF 9D. But, utf8mb4_0900_bin will not convert it into the code point value. This is due to the UTF-8 byte sequence already being sequential, hence the ordering is the same as it for the code point values. So why bother with that extra conversion step?

What is the difference between different utf8mb4 binary collations?

1 Answers1