
At work I come across a lot of Japanese text files in Shift-JIS and other encodings. They cause many mojibake (garbled, unreadable character) problems for all computer users. Unicode was intended to solve this sort of problem by defining a single character set for all languages, and the UTF-8 serialization is recommended for use on the Internet. So why doesn't everybody switch from Japanese-specific encodings to UTF-8? What issues with, or disadvantages of, UTF-8 are holding people back?
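To illustrate what I mean by mojibake, here is a minimal sketch, assuming Python's bundled codecs (the sample text is arbitrary): the same Shift-JIS bytes read back with a wrong encoding guess come out as gibberish.

```python
# Mojibake in a nutshell: Japanese text encoded as Shift-JIS, then decoded
# with the wrong charset by a program that never checked the encoding.
text = "文字化け"                        # the Japanese word for mojibake
sjis = text.encode("shift_jis")          # b'\x95\xb6\x8e\x9a\x89\xbb\x82\xaf'
print(sjis.decode("cp1252"))             # wrong guess -> something like '•¶Žš‰»‚¯'
print(sjis.decode("shift_jis"))          # correct codec -> '文字化け'
# Decoding the same bytes as UTF-8 would raise UnicodeDecodeError instead.
```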

EDIT: The W3C lists some known problems with Unicode; could this be a reason too?

5 Answers


In one word: legacy.

Shift-JIS and other encodings were used before Unicode became available/popular, since they were the only way to encode Japanese at all. Companies have invested in infrastructure that only supported Shift-JIS. Even if that infrastructure now supports Unicode, they are still stuck with Shift-JIS for various reasons, ranging from it-works-so-don't-touch-it through encoding-what? to migrating-all-existing-documents-is-too-costly.

There are many Western companies that are still using ASCII or Latin-1 for the same reasons; it's just that nobody notices, since it rarely causes a problem there.

deceze

These are the reasons that I remember were given for not making UTF-8 or another Unicode representation the default character encoding for the scripting language Ruby, which is mainly developed in Japan:

  • Reason 1: Han unification. The character sets (not sure if "alphabets" would be correct here) used in China, Korea, and Japan are all related and have evolved from a common history, though I'm not sure about the details. The Unicode Consortium decided to spend only a single code point to encode all variants (Chinese, Japanese, and Korean) of what is historically the same character, even if their appearance differs in all three languages. Their reasoning is that appearance should be determined by the font used to display the text. (See the first sketch just after this list.)

Apparently, this reasoning is perceived to be as ridiculous by Japanese users as it would be to argue to English readers that, because the Latin alphabet developed from the Greek alphabet, it is sufficient to have only a single code point for Greek alpha "α" and Latin "a", and to let the appearance be decided by the font in use. (Same for "β" = "b", "γ" = "g", etc.)

(Note that I would not be able to include Greek characters here on Stack Exchange if that were the case.)

  • Reason 2: Inefficient character conversions. Converting characters from Unicode to legacy Japanese encodings and back requires lookup tables, i.e. there is no simple computation from a Unicode code point value to the legacy code point value and vice versa. There is also some loss of information when converting, because not all code points in one encoding have a unique representation in the other encoding. (See the second sketch below.)
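To make Reason 1 concrete, here is a minimal Python sketch. U+76F4 is one of the commonly cited Han-unified ideographs whose preferred glyph shape differs between Japanese and Chinese typography, yet it gets a single code point, so the distinction has to come from the font or locale metadata rather than from the text itself.

```python
import unicodedata

# One code point serves Chinese, Japanese, and Korean text alike; which glyph
# variant the reader sees depends on the font/locale, not on the data.
ch = "\u76f4"
print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
# -> 直 U+76F4 CJK UNIFIED IDEOGRAPH-76F4
```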

More reasons may have been given that I do not remember anymore.
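And to make Reason 2 concrete, a small sketch assuming Python's bundled shift_jis/cp932 codecs: the mapping between Shift-JIS byte values and Unicode code points follows no arithmetic pattern, so converters carry full lookup tables, and disagreements between codec variants are one place where round-trips can change a character's identity.

```python
# Table-driven conversion: the gap between the Unicode code point and the
# Shift-JIS byte value differs from character to character, so there is no
# formula -- only lookup tables.
for ch in "あい漢字":
    print(f"U+{ord(ch):04X} -> Shift-JIS 0x{ch.encode('shift_jis').hex().upper()}")
# U+3042 -> 0x82A0, U+3044 -> 0x82A2, U+6F22 -> 0x8ABF, U+5B57 -> 0x8E9A

# Round-trip hazard: the shift_jis and cp932 codecs disagree on how the byte
# pair 0x81 0x60 maps (U+301C WAVE DASH vs U+FF5E FULLWIDTH TILDE), a classic
# source of characters silently changing after a Unicode round-trip.
print(b"\x81\x60".decode("shift_jis"), b"\x81\x60".decode("cp932"))
```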


deceze's answer has a very strong element of truth to it, but there is another reason why Shift-JIS and others are still in use: UTF-8 is horrifically inefficient for some languages, mostly in the CJK set. Shift-JIS encodes most Japanese characters in two bytes (and ASCII and half-width katakana in one), whereas UTF-8 typically needs three, and occasionally even four, bytes per character for CJK text.
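To put rough numbers on that, a quick size comparison, assuming Python's bundled codecs (the sample sentence is arbitrary):

```python
# The same short Japanese sentence, encoded three ways.
s = "日本語のテキストです"   # 10 characters, all full-width
for enc in ("shift_jis", "utf-16-le", "utf-8"):
    print(f"{enc:10} {len(s.encode(enc)):2d} bytes")
# shift_jis  20 bytes   (2 bytes per character)
# utf-16-le  20 bytes
# utf-8      30 bytes   (3 bytes per character)
```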


Count string size/memory usage amongst the primary reasons.

In UTF-8, East Asian languages frequently need 3 or more bytes per character. On average they need 50% more memory than with UTF-16 -- and UTF-16 is already less efficient than the native encodings.

The other main reason would be legacy, as pointed out by deceze.


Legacy and storage size, as others said, but there is one more thing: Katakana characters.

It takes only one byte to represent half-width katakana characters in Shift-JIS, so Japanese text including katakana takes less than 2 bytes per character (1.5 for a 50/50 mix), making Shift-JIS somewhat more efficient than UTF-16 (2 bytes/char), and much more efficient than UTF-8 (3 bytes/char).
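A minimal sketch of that arithmetic, assuming Python's bundled codecs and an arbitrary mixed string:

```python
# 2 kanji (2 bytes each in Shift-JIS) + 4 half-width katakana (1 byte each).
s = "漢字ｶﾀｶﾅ"   # 6 characters
print("shift_jis:", len(s.encode("shift_jis")), "bytes")  # 2*2 + 4*1 = 8
print("utf-16-le:", len(s.encode("utf-16-le")), "bytes")  # 6*2     = 12
print("utf-8:    ", len(s.encode("utf-8")), "bytes")      # 6*3     = 18
```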

Cheap storage should have made this a much smaller problem, but apparently not.

azheglov