What is the difference between utf8mb4_0900_bin vs utf8mb4_bin binary collations?
- 70,048
- 8
- 160
- 306
- 428
- 1
- 5
- 20
1 Answers
There are three differences as far as I can tell (according to their documentation):
Case-mappings (for
LOWER()/UPPER()functions):https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-uca
The LOWER() and UPPER() functions perform case folding according to the collation of their argument.
The difference between the two collations in this context is that the
_0900_version, being based on a newer version of Unicode, quite likely has more mapping definitions (and possibly even some corrections).Padding vs No Padding (of trailing spaces):
The pad attribute for
utf8mb4_binisPAD SPACE, whereas forutf8mb4_0900_binit isNO PAD. Consequently, operations involvingutf8mb4_0900_bindo not add trailing spaces, and comparisons involving strings with trailing spaces may differ for the two collationsEssentially,
utf8mb4_binignores trailing spaces whileutf8mb4_0900_bindoes not ignore them. See the documentation (linked above) for an example.Sorting (performance only, not the ordering):
For
_bincollations exceptutf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.For
utf8mb4_0900_bin, the weight is theutf8mb4encoding bytes. The sort order is the same as forutf8mb4_bin, but much faster.
Translating that into human, they are saying that for a code point such as U+FF9D,
utf8mb4_binwill see the UTF-8 encoded byte sequence of EF BE 9D and convert that into 00 FF 9D. But,utf8mb4_0900_binwill not convert it into the code point value. This is due to the UTF-8 byte sequence already being sequential, hence the ordering is the same as it for the code point values. So why bother with that extra conversion step?
- 70,048
- 8
- 160
- 306