11

I have a MySQL database containing data that is input by our customers.

Customers are told their data should only be encoded as UTF-8, however this is currently not being enforced and some people are using characters not encoded in UTF-8.

We intend to add some checking, however how should we detect data that isn't encoded using UTF-8? Is there a SQL or MySQL query (or admin command) that will show me the data that is NOT encoded as UTF-8?

Hannah Vernon
  • 70,928
  • 22
  • 177
  • 323
Bret
  • 113
  • 1
  • 1
  • 4

2 Answers2

15

You can check for the existence of (non-)UTF-8 data by comparing byte length to character length on a column, e.g.:

SELECT * FROM MyTable
WHERE LENGTH(MyColumn) <> CHAR_LENGTH(MyColumn)

Multibyte characters will have a greater LENGTH (bytes), so you'll need to look for where that condition isn't met.

Note that MySQL's utf8 character set isn't true Unicode UTF-8 as it only supports a maximum of 3 bytes per character. If your MySQL is later than 5.5.3 you can use utf8mb4 to get 4 bytes per character.

dartonw
  • 1,357
  • 9
  • 11
0

This will only select fields which are not valid UTF-8:

SELECT * FROM table
WHERE CONVERT(field USING binary) RLIKE '([\\xC0-\\xC1]|[\\xF5-\\xFF]|\\xE0[\\x80-\\x9F]|\\xF0[\\x80-\\x8F]|[\\xC2-\\xDF](?![\\x80-\\xBF])|[\\xE0-\\xEF](?![\\x80-\\xBF]{2})|[\\xF0-\\xF4](?![\\x80-\\xBF]{3})|(?<=[\\x00-\\x7F\\xF5-\\xFF])[\\x80-\\xBF]|(?<![\\xC2-\\xDF]|[\\xE0-\\xEF]|[\\xE0-\\xEF][\\x80-\\xBF]|[\\xF0-\\xF4]|[\\xF0-\\xF4][\\x80-\\xBF]|[\\xF0-\\xF4][\\x80-\\xBF]{2})[\\x80-\\xBF]|(?<=[\\xE0-\\xEF])[\\x80-\\xBF](?![\\x80-\\xBF])|(?<=[\\xF0-\\xF4])[\\x80-\\xBF](?![\\x80-\\xBF]{2})|(?<=[\\xF0-\\xF4][\\x80-\\xBF])[\\x80-\\xBF](?![\\x80-\\xBF]))';

(The regex is from https://stackoverflow.com/a/11709412/323407. Requires PCRE regex support. MySQL does not have that out of the box but MariaDB 10.0.5+ does. For MySQL maybe you can use lib_mysqludf_preg.)

Not sure if this is what the question is asking for - if someone stored their information in an encoding that is not UTF-8 but happens to output byte sequences that are valid UTF-8, this will not help with that (and I don't think a 100% reliable solution exists - if you know the specific encoding and language, you can maybe look for common corrupted character sequences). But it's useful for finding input that's definitely not UTF-8.

(This assumes the text is stored in a format where invalid UTF-8 is even possible, e.g. binary.)

Tgr
  • 101
  • 2