41

I thought Unicode was designed to get around the whole issue of having lots of different encodings due to a small address space (8 bits) in most of the prior attempts (ASCII, etc.).

Why then are there so many Unicode encodings? Even multiple versions of the (essentially) same one, like UTF-8, UTF-16, etc.

8 Answers

37

Unicode is a 21-bit character set that uniquely describes "code points", each code point being represented by a glyph (a graphical representation).

  • 16 bits used to identify a code point in a plane (most code points are on plane 0).
  • 5 bits to identify the plane.
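
For example, splitting a code point into its plane and its offset within the plane is just bit arithmetic. Here is a minimal Python sketch (U+1F600 is an arbitrary non-BMP example):

    # Split a code point into its plane number and its 16-bit offset within that plane.
    def plane_and_offset(code_point):
        plane = code_point >> 16       # the upper 5 bits select one of the 17 planes
        offset = code_point & 0xFFFF   # the lower 16 bits locate it within that plane
        return plane, offset

    print(plane_and_offset(ord("A")))   # (0, 65)     -> plane 0, the BMP
    print(plane_and_offset(0x1F600))    # (1, 62976)  -> plane 1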

The encodings supported are:

  • UTF-8 (encodes each code point as one to four 8-bit units)
  • UTF-16 (encodes each code point as one or two 16-bit units)
  • UTF-32 (encodes each code point as a single 32-bit unit)

But no matter which encoding you use, decoding always maps back to the same specific code point with the same meaning (which is why it is cool).

UTF-8

This is a variable-sized format, where each code point is represented by 1 to 4 bytes.

UTF-16

This is a variable-sized format. The code points on the "Basic Multilingual Plane" (BMP, or plane 0) can be represented by a single 16-bit value. Code points on other planes are represented by a surrogate pair (two 16-bit values).

UTF-32

This is a fixed-size format. All code points are represented by a single 32-bit value.
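
To see the three formats side by side, here is a minimal Python sketch (the three sample characters are arbitrary: one ASCII, one BMP, one from plane 1; the -be variants are used only to avoid a byte-order mark so the lengths are exact):

    # The same code points in all three encodings; only the byte layout differs.
    for ch in ("A", "€", "😀"):             # U+0041, U+20AC (BMP), U+1F600 (plane 1)
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")
        utf32 = ch.encode("utf-32-be")
        print("U+%04X: utf-8=%d utf-16=%d utf-32=%d bytes"
              % (ord(ch), len(utf8), len(utf16), len(utf32)))
        # Decoding any of them gets the same code point back.
        assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == utf32.decode("utf-32-be")

    # U+0041: utf-8=1 utf-16=2 utf-32=4 bytes
    # U+20AC: utf-8=3 utf-16=2 utf-32=4 bytes
    # U+1F600: utf-8=4 utf-16=4 utf-32=4 bytes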

Loki Astari
  • 11,190
29

Because people don't want to spend 21 bits on each character. On all modern systems, this would essentially mean using three bytes per character, which is three times more than what people were used to, so they were unwilling to adopt Unicode at all. Compromises had to be found: e.g. UTF-8 is great for English text because legacy ASCII files need not be converted at all, but it is less useful for European languages, and of little use for Asian languages.

So basically, yes, we could have defined a single universal encoding as well as a single universal character chart, but the market wouldn't have accepted it.
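
To make the size trade-off concrete, here is a rough Python sketch (the two sample sentences are arbitrary):

    # Byte counts for the same short texts in two encodings.
    english = "The quick brown fox jumps over the lazy dog"
    japanese = "いろはにほへとちりぬるを"

    for name, text in (("English", english), ("Japanese", japanese)):
        print(name, len(text), "chars:",
              "utf-8 =", len(text.encode("utf-8")), "bytes,",
              "utf-16 =", len(text.encode("utf-16-be")), "bytes")

    # English 43 chars: utf-8 = 43 bytes, utf-16 = 86 bytes
    # Japanese 12 chars: utf-8 = 36 bytes, utf-16 = 24 bytes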

Kilian Foth
  • 110,899
26

I think it's useful to separate the 2 ideas:

  1. Unicode - mapping of characters from all over the world to code points.
  2. Encoding - mapping of code points to bit patterns (UTF-8, UTF-16, etc).

UTF-8, UTF-16, and the other encodings each have their own advantages and disadvantages; Wikipedia covers them in detail.
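
A minimal Python sketch of that separation (the euro sign is just an arbitrary example character):

    # Idea 1: Unicode maps the character to a code point (just a number).
    ch = "€"
    print(hex(ord(ch)))                    # 0x20ac -> code point U+20AC

    # Idea 2: an encoding maps that code point to a concrete bit pattern.
    print(ch.encode("utf-8").hex())        # e282ac     (3 bytes)
    print(ch.encode("utf-16-be").hex())    # 20ac       (2 bytes)
    print(ch.encode("utf-32-be").hex())    # 000020ac   (4 bytes)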

jfs
  • 377
9

UTF-7, UTF-8, UTF-16, and UTF-32 are simply algorithmic transformation formats of the same coding of characters (the code points). They are different encodings of one system for codifying characters.

They are also algorithmically easier to navigate forward and backward than most previous schemes for dealing with character sets larger than 256 characters.

This is very different from the generally country-specific and sometimes vendor-specific codification of glyphs. In Japanese alone there were a ton of variations of JIS, not to mention EUC-JP and the codepage-oriented transformation of JIS that DOS/Windows machines used, called Shift-JIS. (To some extent there were algorithmic transformations of these, but they were not particularly simple, and there were vendor-specific differences in which characters were available.) Multiply this by a couple hundred countries and the gradual evolution of more sophisticated font systems (post-greenscreen era), and you had a real nightmare.

Why would you need these transformation forms of Unicode? Because a lot of legacy systems assumed sequences of 7-bit, ASCII-range characters, so you needed a 7-bit-clean solution that could pass data uncorrupted through those systems; hence UTF-7. Then there were more modern systems that could deal with 8-bit character sets, but nulls generally had special meanings to them, so the 16-bit forms (whose byte streams are full of null bytes for ordinary ASCII text) didn't work for them; UTF-8, which never introduces a null byte into non-null text, covered that case. Two bytes could encode the entire Basic Multilingual Plane of Unicode in its first incarnation, so UCS-2 seemed like a reasonable approach for systems that were going to be "Unicode aware from the ground up" (like Windows NT and the Java VM). Then the extensions beyond that required additional characters, which resulted in an algorithmic transformation of the 21 bits' worth of code points reserved by the Unicode standard, and surrogate pairs were born; that necessitated UTF-16. If you had an application where consistency of character width was more important than efficiency of storage, UTF-32 (once called UCS-4) was an option.

UTF-16 is the only thing that's remotely complex to deal with, and that's easily mitigated by the small range of characters affected by this transformation and by the fact that the lead 16-bit values are neatly in a totally distinct range from the trailing 16-bit values. It's also worlds easier than trying to move forward and backward in many early East Asian encodings, where you either needed a state machine (JIS and EUC) to deal with the escape sequences, or potentially had to move back several characters until you found something guaranteed to be only a lead byte (Shift-JIS). UTF-16 also had some advantages on systems that could chug through 16-bit sequences efficiently.
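
To make the surrogate mechanics concrete, here is a small Python sketch (U+1F600 is an arbitrary non-BMP example):

    # A code point outside the BMP becomes two 16-bit units (a surrogate pair) in UTF-16.
    cp = 0x1F600                                # an emoji on plane 1
    data = chr(cp).encode("utf-16-be")          # 4 bytes: two 16-bit units
    lead = int.from_bytes(data[0:2], "big")
    trail = int.from_bytes(data[2:4], "big")

    print(hex(lead), hex(trail))                # 0xd83d 0xde00
    assert 0xD800 <= lead <= 0xDBFF             # lead surrogates live in one range...
    assert 0xDC00 <= trail <= 0xDFFF            # ...trail surrogates in a distinct one,
                                                # so you always know where a character starts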

Unless you had to live through the dozens (hundreds, really) of different encodings out there, or had to build systems that supported multiple languages in different encodings, sometimes even in the same document (like WorldScript in older Mac OS versions), you might think of the Unicode transformation formats as unnecessary complexity. But they are a dramatic reduction in complexity over the earlier alternatives, and each format solves a real technical constraint. They are also really efficiently convertible between each other, requiring no complex lookup tables.

JasonTrue
  • 9,041
6

Unicode was not designed to get around the whole issue of having lots of different encodings.

Unicode was designed to get around the whole issue of one number representing many different things depending on the code page in use. Numbers 0 to 127 represent the same characters in any ANSI code page; this range is also known as the ASCII chart or character set. In ANSI code pages, which allow for 256 characters, numbers 128 to 255 represent different characters in different code pages.

For example

  • Number $57 represents a capital W in all code pages, but
  • Number $EC represents the infinity symbol in code page 437 (US), but a "LATIN SMALL LETTER N WITH CEDILLA" in code page 775 (Baltic)
  • The cent sign is number $9B in code page 437, but number $96 in code page 775

What Unicode did was turn this all upside down. In Unicode there is no "reuse": each number represents a single, unique character. Number $00A2 in Unicode is the cent sign, and the cent sign appears nowhere else in the Unicode definition.
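
A small Python sketch of that reuse versus uniqueness, assuming the interpreter ships the cp437 and cp775 codecs (CPython does):

    # One byte value, two meanings, depending on the code page in use.
    print(bytes([0xEC]).decode("cp437"))   # ∞  (infinity sign in the US code page)
    print(bytes([0xEC]).decode("cp775"))   # ņ  (n with cedilla in the Baltic code page)

    # In Unicode the number itself is unambiguous: U+00A2 is always the cent sign.
    print(chr(0x00A2))                     # ¢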

Why then are there so many Unicode encodings? Even multiple versions of the (essentially) same one, like UTF-8, UTF-16, etc.

There are no multiple versions of the same encoding. There are multiple encodings of the same Unicode character definition map, and these have been "invented" to cater to the storage requirements of different usages of the various lingual planes that exist in Unicode.

Unicode defines (or has the space to define) 1,114,112 code points (17 planes of 65,536 each). If you want to map these to disk/memory storage without doing any algorithmic conversion, you need 4 bytes per character. If you need to store texts with characters from all lingual planes, then UTF-32 (which is basically a straight one character to four bytes storage encoding of the Unicode definition) is probably what you need.

But hardly any text uses characters from all lingual planes, and using 4 bytes per character then seems a big waste, especially when you take into account that most languages on earth are defined within what is known as the Basic Multilingual Plane (BMP): the first 65,536 numbers of the Unicode definition.

And that's where UTF-16 comes in. If you only use characters from the BMP, UTF-16 will store your text very efficiently, using only two bytes per character. It will only use more bytes for characters outside of the BMP. The distinction between UTF-16LE (Little Endian) and UTF-16BE (Big Endian) really only has to do with how the two bytes of each 16-bit value are ordered in memory (low byte first or high byte first).
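
The byte-order difference is easy to see in a quick Python sketch (reusing the cent sign from earlier as the example):

    # The same 16-bit value ($00A2, the cent sign), two byte orders.
    print("¢".encode("utf-16-le").hex())   # a200 -> low-order byte first
    print("¢".encode("utf-16-be").hex())   # 00a2 -> high-order byte first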

If your texts use even fewer different characters, like most texts in Western European languages, you will want to restrict the storage requirements even further. Hence UTF-8, which uses a single byte for the characters in the ASCII chart (the first 128 numbers) and only two bytes for most of the accented characters that the various ANSI code pages put in their second 128 numbers. It only uses more bytes for characters outside this "most used characters" set.
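
A quick Python sketch of how that plays out for a short Western European word (an arbitrary example):

    # The ASCII letters stay at one byte each; the accented letter costs two.
    word = "café"
    print(len(word))                    # 4 characters
    print(len(word.encode("utf-8")))    # 5 bytes
    print(word.encode("utf-8").hex())   # 636166c3a9 (the trailing c3 a9 is the 'é')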

So to recap:

  • Unicode is a mapping of the characters in all of the languages on earth (and some Klingon to boot), and then some (mathematical, musical, etc.), to a unique number.
  • Encodings are algorithms defined to store texts using the numbers of this unique character map as space efficiently as possible given the "average usage" of characters within texts.
2

Unicode defines the map between numbers and characters. However, when you send a number to a receiver, you still need to define how to represent that number. That's what UTF is for. It defines how to represent a number in a byte stream.

Codism
  • 1,213
2

The rationale behind UTF-32 is simple: It's the most straightforward representation of Unicode code points. So why isn't everything in UTF-32? Two main reasons:

One is size. UTF-32 requires 4 bytes for every character. For text that uses only characters in the Basic Multilingual Plane, this is twice as much space as UTF-16. For English text, it's 4 times as much space as US-ASCII.

The bigger reason is backwards compatibility. Each Unicode encoding other than the "unencoded" UTF-32 was designed for backwards compatibility with a prior standard.

  • UTF-8: Backwards compatibility with US-ASCII.
  • UTF-16: Backwards compatibility with UCS-2 (16-bit Unicode before it was expanded beyond the BMP).
  • UTF-7: Backwards compatibility with non-8-bit-clean mail servers.
  • GB18030: Backwards compatibility with the GB2312 and GBK encodings for Chinese.
  • UTF-EBCDIC: Backwards compatibility with the Basic Latin subset of EBCDIC.

I thought Unicode was designed to get around the whole issue of having lots of different encodings

It was, and it did. It's much easier to convert between UTF-8, -16, and -32 than deal with the old system of hundreds of different character encodings for different languages and different OSes.
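
A minimal Python sketch of such a conversion (the sample string is arbitrary):

    # Converting between the UTF forms is a pure, lossless recomputation;
    # no language-specific or vendor-specific lookup tables are involved.
    original = "Ünïcødé 😀"
    as_utf8 = original.encode("utf-8")
    as_utf16 = as_utf8.decode("utf-8").encode("utf-16")
    as_utf32 = as_utf16.decode("utf-16").encode("utf-32")

    assert as_utf32.decode("utf-32") == original   # the round trip is exact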

dan04
  • 3,957
1

You know how a zip file can compress a file to be much smaller (especially text) and then uncompress it to an identical copy of the original file.

The zip format actually has several different compression algorithms with different characteristics to choose from: Stored (no compression), Shrunk, Reduced (methods 1-4), Imploded, Tokenizing, Deflated, Deflate64, BZIP2, LZMA (EFS), WavPack, PPMd. In theory it could try all of them and choose the best result, but it usually just goes with Deflated.

UTF works much the same way. There are several encoding algorithms, each with different characteristics, but you usually just pick UTF-8 because it is widely supported, as opposed to the other UTF variants. That, in turn, is because it is bitwise compatible with 7-bit ASCII, which makes it easy to use on most modern computer platforms, which usually use an 8-bit extension of ASCII.
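
That last claim is easy to check with a minimal Python sketch (the sample string is arbitrary):

    # For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes,
    # so an old ASCII file is already a valid UTF-8 file.
    text = "plain old ASCII"
    assert text.encode("ascii") == text.encode("utf-8")
    print(text.encode("utf-8"))   # b'plain old ASCII'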