121

All characters in ASCII can be encoded using UTF-8 without an increase in storage (both require one byte per character).

UTF-8 has the added benefit of supporting characters beyond the ASCII range. If that's the case, why would we ever choose ASCII encoding over UTF-8?

Is there a use case where we would choose ASCII instead of UTF-8?

Pacerier
  • 5,053

5 Answers

107

In some cases it can speed up access to individual characters. Imagine a string str = 'ABC' encoded in UTF-8 and in ASCII (and assume that the language/compiler/database knows about the encoding).

To access the third character ('C') of this string using the array-access operator featured in many programming languages, you would do something like c = str[2].

Now, if the string is ASCII encoded, all we need to do is fetch the third byte of the string.

If, however, the string is UTF-8 encoded, we must first check whether the first character is a one-byte or multi-byte character, then perform the same check on the second character, and only then can we access the third character. The longer the string, the bigger the performance difference.
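
A minimal sketch of this difference in Python (the sample strings and helper names are just for illustration; the lead-byte ranges follow the standard UTF-8 rules):

    ASCII_TEXT = b"ABC"                 # ASCII: one byte per character
    UTF8_TEXT = "ÄBC".encode("utf-8")   # UTF-8: 'Ä' occupies two bytes

    def nth_char_ascii(data: bytes, n: int) -> str:
        return chr(data[n])             # constant time: character n is byte n

    def nth_char_utf8(data: bytes, n: int) -> str:
        # Linear time: walk the lead bytes from the start, since each
        # character may occupy 1 to 4 bytes.
        i = 0
        for _ in range(n):
            lead = data[i]
            if lead < 0x80:
                i += 1                  # 1-byte sequence (ASCII range)
            elif lead < 0xE0:
                i += 2                  # 2-byte sequence
            elif lead < 0xF0:
                i += 3                  # 3-byte sequence
            else:
                i += 4                  # 4-byte sequence
        end = i + 1
        while end < len(data) and data[end] & 0xC0 == 0x80:
            end += 1                    # include continuation bytes
        return data[i:end].decode("utf-8")

    print(nth_char_ascii(ASCII_TEXT, 2))  # 'C'
    print(nth_char_utf8(UTF8_TEXT, 2))    # 'C', found only after scanning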

This is an issue, for example, in some database engines: to find the beginning of a column placed 'after' a UTF-8 encoded VARCHAR, the database needs to check not only how many characters are in the VARCHAR field, but also how many bytes each of them uses.

Mchl
  • 4,113
12

If you're going to use only the US-ASCII (or ISO 646) subset of UTF-8, then there's no real advantage to one or the other; in fact, everything is encoded identically.

If you're going to go beyond the US-ASCII character set, and use (for example) characters with accents, umlauts, etc., that are used in typical western European languages, then there's a difference -- most of these can still be encoded with a single byte in ISO 8859, but will require two or more bytes when encoded in UTF-8. There are also, of course, disadvantages: ISO 8859 requires that you use some out-of-band means to specify the encoding being used, and it only supports one of these languages at a time. For example, you can encode all the characters of the Cyrillic (Russian, Belarusian, etc.) alphabet using only one byte apiece, but if you need/want to mix those with French or Spanish characters (other than those in the US-ASCII/ISO 646 subset) you're pretty much out of luck -- you have to completely change character sets to do that.
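
As a small illustration in Python (assuming the text fits in ISO 8859-1), the same accented word costs one byte per character in ISO 8859-1 but more in UTF-8:

    text = "café"
    latin1 = text.encode("iso-8859-1")   # legacy single-byte encoding
    utf8 = text.encode("utf-8")          # 'é' needs two bytes in UTF-8

    print(len(latin1))  # 4 bytes
    print(len(utf8))    # 5 bytes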

ISO 8859 is really only useful for European alphabets. To support the writing systems used for Chinese, Japanese, Korean, Arabic, etc., you have to use a completely different encoding. Some of these (e.g., Shift JIS for Japanese) are an absolute pain to deal with. If there's any chance you'll ever want to support them, I'd consider it worthwhile to use Unicode just in case.

Jerry Coffin
  • 44,795
9

Yes, there are still some use cases where ASCII makes sense: file formats and network protocols. In particular, for uses where:

  • You have data that's generated and consumed by computer programs, never presented to end users;
  • But which it's useful for programmers to be able to read, for ease of development and debugging.

By using ASCII as your encoding you avoid the complexity of multi-byte encoding while retaining at least some human-readability.

A couple of examples:

  • HTTP is a network protocol defined in terms of sequences of octets, but it's very useful (at least for English-speaking programmers) that these correspond to the ASCII encoding of words like "GET", "POST", "Accept-Language" and so on.
  • The chunk types in the PNG image format consist of four octets, but if you're programming a PNG encoder or decoder it's handy that IDAT means "image data" and PLTE means "palette".

Of course you need to be careful that the data really isn't going to be presented to end users, because if it ends up being visible (as happened in the case of URLs), then users are rightly going to expect that data to be in a language they can read.
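
As a sketch of the PNG case in Python ("image.png" is a placeholder path), the 4-byte chunk type read straight from the file is already readable by a programmer precisely because it is ASCII:

    import struct

    with open("image.png", "rb") as f:
        f.seek(8)                                 # skip the 8-byte PNG signature
        length, = struct.unpack(">I", f.read(4))  # chunk length, big-endian
        chunk_type = f.read(4)                    # e.g. b'IHDR', b'PLTE', b'IDAT'
        print(chunk_type.decode("ascii"), length)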

Gareth Rees
  • 1,489
5

ANSI can be many things, most being 8-bit character sets in this regard (like code page 1252 under Windows).

Perhaps you were thinking of ASCII which is 7-bit and a proper subset of UTF-8. I.e. any valid ASCII stream is also a valid UTF-8 stream.
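
A quick Python check of that subset relationship (the sample text is arbitrary):

    ascii_bytes = "Hello, world!".encode("ascii")
    assert ascii_bytes == "Hello, world!".encode("utf-8")   # identical bytes
    assert ascii_bytes.decode("utf-8") == "Hello, world!"   # valid UTF-8 too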

If you were thinking of 8-bit character sets, one very important advantage would be that all representable characters are exactly 8 bits, whereas in UTF-8 a single character can take up to four bytes (32 bits).

2

First of all: your title used ANSI, while in the text you refer to ASCII. Please note that ANSI is not the same as ASCII. ANSI incorporates the ASCII set, but the ASCII set is limited to the first 128 numeric values (0 - 127).

If all your data is restricted to ASCII (7-bit), it doesn't matter whether you use UTF-8, ANSI or ASCII, as both ANSI and UTF-8 incorporate the full ASCII set. In other words: the numeric values 0 up to and including 127 represent exactly the same characters in ASCII, ANSI and UTF-8.

If you need characters outside of the ASCII set, you'll need to choose an encoding. You could use ANSI, but then you run into the problem of all the different code pages. Creating a file on machine A and reading it on machine B may (or will) produce funny-looking text if the machines are set up to use different code pages, simply because numeric value nnn represents different characters in those code pages.
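
A minimal sketch of that problem in Python (the byte value 0xE9 is just an example):

    data = bytes([0xE9])

    print(data.decode("cp1252"))  # 'é'  (Western European)
    print(data.decode("cp1251"))  # 'й'  (Cyrillic)
    print(data.decode("cp1253"))  # 'ι'  (Greek)

    # UTF-8 pins the mapping down: 'é' is always the bytes C3 A9.
    print("é".encode("utf-8").hex())  # 'c3a9'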

This "code page hell" is the reason why the Unicode standard was defined. UTF-8 is but a single encoding of that standard, there are many more. UTF-16 being the most widely used as it is the native encoding for Windows.

So, if you need to support anything beyond the 128 characters of the ASCII set, my advice is to go with UTF-8. That way you don't have to worry about which code page your users have set up on their systems.