16

I personally find code full of Unicode identifiers confusing to read. In my opinion, it also makes the code harder to maintain, not to mention the effort required from the authors of various translators to implement such support. I also constantly notice support for Unicode identifiers turning up in lists of (dis)advantages of various language implementations, as if it really mattered. I don't get it: why so much attention?

avpaderno
  • 4,004
  • 8
  • 44
  • 53

8 Answers

18

When you think Unicode, you think of Chinese or Russian characters, which makes you think of some source code written in Russian you've seen on the internet, which was unusable (unless you know Russian).

But just because Unicode can be used in the wrong way doesn't mean it's bad in source code by itself.

When writing code for a specific field, Unicode lets you shorten your code and make it more readable. Instead of:

const numeric Pi = 3.1415926535897932384626433832795;
numeric firstAlpha = deltaY / deltaX + Pi;
numeric secondAlpha = this.Compute(firstAlpha);
Assert.Equals(math.Infinity, secondAlpha);

you can write:

const numeric π = 3.1415926535897932384626433832795;
numeric α₁ = Δy / Δx + π;
numeric α₂ = this.Compute(α₁);
Assert.Equals(math.∞, α₂);

which may not be easy to read for an average developer, but is still easy to read for a person who uses mathematical symbols daily.
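In fact, some languages already accept Greek spellings like these. A minimal Python 3 sketch (the variable names mirror the example above; note that subscript digits such as ₁ are not valid identifier characters in Python, so plain digits stand in for them, and `math.inf` stands in for the hypothetical `math.∞`):

```python
# Greek letters are legal Python 3 identifiers (PEP 3131);
# subscript digits are not, so α1 replaces α₁ from the example above.
π = 3.141592653589793
Δy, Δx = 4.0, 2.0
α1 = Δy / Δx + π
print(α1)
```
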

Or, when writing an application related to SLR photography, instead of:

int aperture = currentLens.GetMaximumAperture();
Assert.AreEqual(this.Aperture1_8, aperture);

you can replace the aperture by its symbol ƒ, with a spelling closer to ƒ/1.8:

int ƒ = currentLens.GetMaximumƒ();
Assert.AreEqual(this.ƒ1¸8, ƒ);

It can be inconvenient, too: when writing general C# code, I would prefer typing:

var productPrices = this.Products.Select(c => c.Price);
double average = productPrices.Average();
double sum = productPrices.Sum();

rather than:

var productPrices = this.Products.Select(c => c.Price);
double average = productPrices.x̅();
double sum = productPrices.Σ();

because in the first case, IntelliSense helps me write the whole thing almost without typing, and especially without using my mouse, while in the second case I have no idea how to type those symbols and would be forced to reach for the mouse and search for them in the auto-completion list.

This being said, it's still useful in some cases. The currentLens.GetMaximumƒ(); of my previous example can rely on IntelliSense and is as easy to type as GetMaximumAperture, while being shorter and more readable. Also, for specific domains with lots of symbols, keyboard shortcuts may make typing the symbols quicker than spelling out their literal equivalents in source code.

The same, by the way, applies to comments. No one wants to read code full of comments in Chinese (unless you know Chinese well yourself). But in some programming languages, Unicode symbols can still be useful. One example is footnotes¹.


¹ I certainly wouldn't enjoy footnotes in C# code, where there is a strict set of style rules for how to write comments. In PHP, on the other hand, if there are lots of things to explain, but those things are not very important, why not put them at the bottom of the file and create a footnote in the PHPDoc of the method?

10

I would say:

  1. To ease non-professionals and novices who are learning programming (e.g. at school) and don't know English. They don't write production code anyway. I've seen code like this many times:

    double upsos, baros;
    cin >> upsos >> baros;
    

    Just let the poor guy write it in his language:

    double ύψος, βάρος;
    cin >> ύψος >> βάρος;
    
  2. Don't you like it?

    class ☎ {
    public:
        ☎(const char*);
        void ();
        void ();
    };
    
    ☎ ☏("031415926");
    ☏.(("Bob"));
    ofstream f;
    f.();
    
7

As far as I am concerned, this is purely for marketing reasons, and it may additionally make our lives harder.

The marketing arguments

You know those crazy lists of features that most languages boast of? They're pretty much useless in general, because they're so far removed from the language itself that they don't convey anything specific, but they do allow one to quickly draw up tables with ticks and crosses and rightfully conclude that since X has more ticks than Y, it must be better.

Well, Unicode support for identifiers is one of those lines. It does not matter that, compared to lambda support, generic programming support, etc., it might not amount to much; the people drawing up the tables don't care about the quality of each line, only about the number of them.

And thus they can boast: "Ah, with Y you do not have Unicode support for your identifiers! In X we do, so for students it's much easier!"

The fallacy of accessibility

Unfortunately, the argument of accessibility is fallacious.

Oh, I do understand that being able to write "résultatDuJetDeDé" instead of "diceThrowResult" (yes, I am French) might seem like a win in the short term... but there are drawbacks!

Programming is about communicating

Your program is not only meant for the compiler (which couldn't care less about the identifiers you use); it is also meant for your fellow programmers. They need to be able to read it and understand it.

  • reading it implies being able to see the characters you used, and Unicode is not well supported by all fonts
  • understanding it does mean relying on the identifiers -- unless you supplement them with lengthy comments, but that violates the DRY rule

Of course, your classmates may speak the same language you do (not a given; I had programming classes with Germans, Spaniards, Lebanese, and Chinese), and so may your teacher... but suppose you are working on an assignment at home and suddenly need help: the Internet is great, and you can reach thousands upon thousands of people who know the solution, but they will only answer if they understand your question. And you need to understand their answer as well.

Programming requires understanding

Accessibility and initiation require relying on libraries to do the heavy lifting for you: you don't want to reinvent an I/O layer to read from and write to the console on your first assignment.

  • In which language are those libraries written?
  • In which language are those libraries documented?

If you answer Moroccan Arabic, I will be surprised.

Unless you rely only on the lectures you attend, and those provide comprehensive documentation on every library feature you will need (and perhaps even translated libraries), you will have to learn a modicum of English. But then, you probably did already, long before you started this programming course.

English is...

... the lingua franca of programmers (and most scientists).

The sooner one admits it, and goes along with it rather than fighting against it, the sooner one can truly learn and progress.

Some will inevitably rise up against this and rightly defend their right to speak the language of their choice (usually their native language); however, as Babel demonstrated, the more languages are in use, the harder communication gets.

Still...

Yes, as has been argued over and over, some Unicode support (mainly symbols) can greatly ease comprehension for people who have to translate mathematical or physics formulas, for example, into code. There is the drawback that some symbols are overloaded, but it can still help.

So why ?

Well, as said, it's not really about user convenience so much as about marketing claims. It's also dead easy to implement, since the parser is already Unicode-aware for strings and comments anyway, so most languages take the jump.

And there might be a benefit for certain users.

But I personally will only deal with code written with English identifiers. I don't care if you need my help with your piece of code or if your library is just awesome and I could gain much by using it: if I cannot understand it, I'll just have to ignore it.

Matthieu M.
  • 15,214
5

Of course, every modern compiler must deal with Unicode source code today; string constants, for example, may need to contain Unicode characters. But once this is achieved, why not allow Unicode identifiers as well? It's no big deal unless your compiler code depends on characters being 7-bit codes.

But the OP has a point: it is now possible that a Hindi-speaking Indian must maintain code with Russian identifiers and Arabic comments. What a nightmare for the poor Chinese developer who is supposed to do the quality check and can't read any of those three scripts!

Hence, it is now an organizational task to make sure a program's identifiers and comments are written in a common language. I can't help but think this is going to be English for some time to come.

Ingo
  • 3,941
4

I think it makes a lot of sense to allow Unicode characters in strings and comments. And if the lexer and parser have to support Unicode for that anyway, the compiler writer probably gets Unicode support in identifiers for free, so allowing only ASCII characters in identifiers would seem like an arbitrary limitation.
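To see how little extra work this is, here is a rough sketch of the kind of category-based rule a Unicode-aware lexer applies when classifying identifiers. It is a simplification of the UAX #31 / PEP 3131 scheme, not a faithful copy of any particular compiler; real implementations also handle Other_ID_Start, normalization, and so on:

```python
import unicodedata

# Simplified identifier rule: which Unicode general categories may
# start an identifier, and which may continue one.
ID_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE = ID_START | {"Mn", "Mc", "Nd", "Pc"}

def is_identifier(s):
    if not s:
        return False
    cats = [unicodedata.category(c) for c in s]
    first_ok = cats[0] in ID_START or s[0] == "_"
    return first_ok and all(c in ID_CONTINUE for c in cats[1:])

print(is_identifier("π"))   # True: Greek letters are category Ll
print(is_identifier("α₁"))  # False: subscript digits are category No
```

Once the lexer reads code points instead of bytes, the only change from the ASCII rule is the membership test, which is why most implementations "take the jump".
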

nikie
  • 6,333
3

How are you going to type ASCII identifiers on a Chinese keyboard? Typing a few language keywords is one thing; having to write all your code that way is another.

Programmers should have the right and ability to call their variables whatever they want. It's none of your business what language that's in.

If you feel confused reading code whose identifiers contain symbols from other people's languages, then I'm sure you understand exactly how confused they feel when they have to use identifiers with symbols from your language.

DeadMG
  • 36,914
3

According to PEP 3131 -- Supporting Non-ASCII Identifiers, dated 2007, the first part of the Rationale states:

Python code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. By using identifiers in their native language, code clarity and maintainability of the code among speakers of that language improves.

I haven't investigated other languages, but this is probably among the reasons they added such support as well.

1

It would really make life easier (for some of us, anyway) if compilers did not support Unicode identifiers. Right-to-left identifiers are awful; identifiers mixing the Roman alphabet with right-to-left Unicode are even worse.

The bad thing is that certain GUI wizards take the text you enter for an item and automatically use that text as the item's identifier. So what exactly should they do with Unicode text on those items? No easy answer, I'm afraid.

Unicode right-to-left comments can be funny, too. In VS 2010, for example, XML comments display (correctly) as RTL in the code... but when you use IntelliSense to pull up the identifier elsewhere in code, the tooltip displays (incorrectly) LTR. Would it be better, perhaps, if there were no support in the first place? Again, not an easy call.

sq33G
  • 278