
In the event an alien invasion occurred and we were forced to support their languages in all of our existing computer systems, is UTF-8 designed in a way that allows for their possibly vast number of characters?

(Of course, we do not know if aliens actually have languages, if or how they communicate, but for the sake of the argument, please just imagine they do.)

For instance, if their language consisted of millions of newfound glyphs, symbols, and/or combining characters, could UTF-8 theoretically be expanded in a non-breaking way to include these new glyphs and still support all existing software?

I'm more interested in whether the glyphs far outgrew the current size limitations and required more bytes to represent a single glyph. In the event UTF-8 could not be expanded, would that prove that its single advantage over UTF-32 is simply its smaller size for low code points?

5 Answers


The Unicode standard has lots of space to spare. Unicode code points are organized in “planes” and “blocks”. Of 17 total planes, 11 are currently completely unassigned. Each plane holds 65,536 code points, so there are realistically more than 700,000 code points to spare for an alien language (unless we fill all of that up with more emoji before first contact). As of Unicode 8.0, only 120,737 code points have been assigned in total (roughly 10% of the total capacity), with roughly the same number being unassigned but reserved for private, application-specific use. In total, 974,530 code points are unassigned.
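A quick sanity check on those plane figures, as a minimal Python sketch (the constants are just the numbers quoted above):

PLANE_SIZE = 0x10000                  # 65,536 code points per plane
print(17 * PLANE_SIZE)                # 1114112: the entire Unicode codespace
print(11 * PLANE_SIZE)                # 720896 code points in the unassigned planes
print(120_737 / (17 * PLANE_SIZE))    # ~0.108: roughly 10% assigned as of Unicode 8.0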

UTF-8 is a specific encoding of Unicode, and is currently limited to four octets (bytes) per code point, which matches the limitations of UTF-16. In particular, UTF-16 can only represent 17 planes, which is why the Unicode codespace ends where it does. Previously, UTF-8 allowed up to six octets per code point and was designed to support 32,768 planes. In principle this four-byte limit could be lifted, but that would break the current organizational structure of Unicode and would require UTF-16 to be phased out; that is unlikely to happen in the near future, considering how entrenched it is in certain operating systems and programming languages.
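The four-octet ceiling is easy to observe in practice; here is a minimal Python sketch (CPython's str type and codecs follow RFC 3629, so the codespace ends at U+10FFFF):

print(chr(0x10FFFF).encode('utf-8'))  # b'\xf4\x8f\xbf\xbf': the last code point takes 4 bytes

try:
    chr(0x110000)                     # one past the end of the codespace
except ValueError as e:
    print(e)                          # chr() arg not in range(0x110000)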

The only reason UTF-16 is still in common use is that it's an extension to the flawed UCS-2 encoding, which only supported a single Unicode plane. It otherwise inherits undesirable properties from both UTF-8 (not fixed-width) and UTF-32 (not ASCII-compatible, wasteful for common data), and additionally requires byte order marks to declare endianness. Given that UTF-16 is still popular despite these problems, I'm not too optimistic that this is going to change by itself very soon. Hopefully, our new Alien Overlords will see this impediment to Their rule, and in Their wisdom banish UTF-16 from the face of the earth.

amon

If UTF-8 is actually to be extended, we should look at the absolute maximum it could represent. UTF-8 is structured like this:

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(shamelessly copied from the RFC.) We see that the first byte always controls how many follow-up bytes make up the current character.
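That property can be expressed directly in code. A minimal Python sketch (the function name is mine, not from the RFC):

def utf8_seq_length(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, read off the lead byte alone."""
    if lead >> 7 == 0b0:          # 0xxxxxxx: a single ASCII byte
        return 1
    if lead >> 5 == 0b110:        # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead >> 4 == 0b1110:       # 1110xxxx: start of a 3-byte sequence
        return 3
    if lead >> 3 == 0b11110:      # 11110xxx: start of a 4-byte sequence
        return 4
    raise ValueError('continuation byte or invalid lead byte')

print(utf8_seq_length('€'.encode('utf-8')[0]))   # 3: the euro sign needs three bytes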

If we extend it to allow up to 8 bytes, we get these additional non-Unicode representations:

111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Calculating the maximum number of representations this technique allows, we come to

  10000000₂
+ 00100000₂ * 01000000₂
+ 00010000₂ * 01000000₂^2
+ 00001000₂ * 01000000₂^3
+ 00000100₂ * 01000000₂^4
+ 00000010₂ * 01000000₂^5
+ 00000001₂ * 01000000₂^6
+ 00000001₂ * 01000000₂^7

or in base 10:

  128
+  32 * 64
+  16 * 64^2
+   8 * 64^3
+   4 * 64^4
+   2 * 64^5
+   1 * 64^6
+   1 * 64^7

which gives us a maximum of 4,468,982,745,216 representations.
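For the skeptical, the arithmetic is easy to check in Python (one term per lead-byte pattern, 64 choices per continuation byte):

terms = [128, 32*64, 16*64**2, 8*64**3, 4*64**4, 2*64**5, 1*64**6, 1*64**7]
print(sum(terms))   # 4468982745216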

So, if these 4 billion (or trillion, as you please) characters are enough to represent the alien languages I am quite positive that we can, with minimal effort, extend the current UTF-8 to please our new alien overlords ;-)

Boldewyn

RFC 3629 restricts UTF-8 to a maximum of four bytes per character, with a top value of 0x10FFFF, allowing 1,112,064 code points (the 1,114,112-point codespace minus the 2,048 surrogate values). Obviously this restriction could be removed and the standard extended, but this would prove a breaking change for existing code that works to that limit.

From a data-file point of view this wouldn't be a breaking change, as the encoding is self-describing: the lead byte of each sequence states how many continuation bytes follow, and every continuation byte is unambiguously marked by having its most significant bit (MSB) set. Even before RFC 3629, the standard was limited to 31 bits, leaving the most significant bit of a 32-bit code point value unset.
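You can see both halves of this in a modern decoder. A minimal Python sketch: the byte sequence below was a legal six-byte encoding of U+7FFFFFFF under the original 31-bit design, but an RFC 3629 decoder refuses its lead byte outright:

data = b'\xfd\xbf\xbf\xbf\xbf\xbf'   # lead byte 1111110x plus five continuation bytes
try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)   # 'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte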

Extending the standard beyond 0x10FFFF would break UTF-8's partial data compatibility with UTF-16 though.

David Arno

Really, only two Unicode code points could stand for infinitely many glyphs, if they were combining characters.

Compare, for example, the two ways that Unicode encodes the Korean Hangul alphabet: Hangul Syllables and Hangul Jamo. The character 웃 in Hangul Syllables is the single code point U+C6C3, whereas in Hangul Jamo it is the three code points U+110B (ㅇ), U+116E (ㅜ), and U+11BA (ㅅ). Obviously, using combining characters takes up vastly fewer code points, but is less efficient for writing because more bytes are needed to write each character.
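The Hangul case is easy to verify with Python's standard unicodedata module, since syllables decompose into conjoining jamo algorithmically under NFD:

import unicodedata

syllable = '\uC6C3'                                    # 웃 as one precomposed code point
jamo = unicodedata.normalize('NFD', syllable)          # decompose into conjoining jamo
print([f'U+{ord(c):04X}' for c in jamo])               # ['U+110B', 'U+116E', 'U+11BA']
print(unicodedata.normalize('NFC', jamo) == syllable)  # True: recomposition round-trips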

With this trick, there is no need to go beyond the number of code points that can currently be encoded in UTF-8 or UTF-16.

I guess it comes down to how offended the aliens would be if their language happened to require many more bytes per message than earthly languages. If they don't mind representing each of their millions of characters using a jumble of, say, 100k combining characters, then there's no problem; on the other hand, if being forced to use more bytes than earthlings makes them feel like second-class citizens, we could be in for some conflict (not unlike what we already observe with UTF-8).

Owen

Edit: The question now says "millions of new characters". This makes it easy to answer:

No. UTF-8 is a Unicode encoding. Unicode has a codespace which allows 1,114,112 distinct code points, and just under a million of those are currently unassigned. So it is not possible to support millions of new characters in Unicode, and by definition no Unicode encoding can support more characters than Unicode itself defines. (Of course you can cheat by encoding a level further: any kind of data can be represented with just two characters, after all.)
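As an illustration of that parenthetical cheat, a minimal Python sketch that spells out arbitrary bytes using only the two characters '0' and '1', at an eightfold size cost (the helper names are mine):

def to_binary_text(data: bytes) -> str:
    # Each byte becomes eight '0'/'1' characters.
    return ''.join(format(b, '08b') for b in data)

def from_binary_text(text: str) -> bytes:
    return bytes(int(text[i:i + 8], 2) for i in range(0, len(text), 8))

msg = 'alien glyph payload'.encode('utf-8')
assert from_binary_text(to_binary_text(msg)) == msg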


To answer the original question:

Unicode does not support languages as such; it supports characters: symbols used to represent language in written form.

Not all human languages have a written representation, so not all human languages can be supported by Unicode. Furthermore, many animals communicate but don't have a written language. Whales, for example, have a form of communication complex enough to call a language, but without any written form (and it cannot be captured by existing phonetic notation either). So not even all languages on Earth can be supported by Unicode.

Even worse is something like the language of bees. Not only does it not have a written form, it cannot meaningfully be represented in written form at all. The language is a kind of dance which basically points in a direction, but relies on the current position of the sun. The dance therefore only has informational value at the particular place and time where it is performed. A symbolic or textual representation would have to include information (location, position of the sun) which the language of bees itself cannot express.

Even a written or symbolic form of communication might not be representable in Unicode. For example, illustrations or wordless comics cannot be supported by Unicode, since their set of glyphs is not finite. You will notice a lot of pictorial communication in international settings like airports, so it is not inconceivable that a race of space-travelling aliens will have evolved to use a pictorial language.

Even if an alien race had a language with a writing system based on a finite set of symbols, that system might not be possible to support in Unicode. Unicode expects writing to be a linear sequence of symbols. Music notation is an example of a writing system which cannot be fully represented in Unicode, because meaning is encoded in both the choice of symbols and their vertical and horizontal placement. (Unicode does support individual musical symbols, but cannot encode a score.) An alien race which communicated using polyphonic music (not uncommon) or a channel of communication of similar complexity might very well have a writing system looking like an orchestral score, and Unicode cannot support this.

But let's, for the sake of argument, assume that all languages, even alien languages, can be expressed as a linear sequence of symbols selected from a finite set. Is Unicode big enough for an alien invasion? Unicode currently has fewer than a million unassigned code points. The Chinese language contains around a hundred thousand characters according to the most comprehensive Chinese dictionaries (not all of them are currently supported by Unicode as distinct characters). So just ten languages with the complexity of Chinese would use up all of Unicode.

On Earth we have hundreds of distinct writing systems, but luckily most are alphabetic rather than ideographic and therefore contain a small number of characters. If all written languages used ideograms like Chinese, Unicode would not even be big enough for Earth. The use of alphabets derives from speech, which only uses a limited number of phonemes, but that is particular to human physiology. So even a single alien planet with only a dozen ideographic writing systems might exceed what Unicode can support. Now consider that these aliens might already have invaded other planets before Earth and included those planets' writing systems in the set of characters which have to be supported.

Expansion or modification of the current encodings, or introduction of new encodings, will not solve this, since the limitation lies in the number of code points supported by Unicode.

So the answer is most likely no.

JacquesB