Best practices for internationalization: composed sentences?

Question

I am working on a project where clients are able to create objects in a database. Each of these objects has a description string that describes the object. Let's assume we are looking at an object that represents a car:

A red car manufactured by BMW with 62000 miles
A pickup-truck manufactured by Dodge from 2010
A car with 5 seats

The "car" class has different attributes, and not all of them are mandatory. For example:

type of car: car, sedan, pickup truck, SUV
mileage
brand
seats
year
number of previous owners

The description sentence should contain this information. E.g. if we know the number of seats, this information should be part of the sentence; otherwise it shouldn't. If we did this in one language only, it would not be too complicated. We would just need to analyze the structure of the sentence and compose it as follows:

A [{color}] {type of car} [manufactured by {brand name}] [with {miles} miles] [from {year}] [with {seats} seats]

Parts in [...] are included in the final sentence only if the attribute (in {...}) is set.
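
As a minimal sketch of that single-language composition (the `describe_car` helper and the attribute keys are hypothetical names chosen for illustration, and only the type is assumed to be mandatory):

```ruby
# Hypothetical helper: builds the English-only description from a hash of
# optional attributes. Only :type is assumed to be always present.
def describe_car(attrs)
  parts = ["A"]
  parts << attrs[:color] if attrs[:color]
  parts << attrs.fetch(:type)
  parts << "manufactured by #{attrs[:brand]}" if attrs[:brand]
  parts << "with #{attrs[:miles]} miles" if attrs[:miles]
  parts << "from #{attrs[:year]}" if attrs[:year]
  parts << "with #{attrs[:seats]} seats" if attrs[:seats]
  parts.join(" ")
end

puts describe_car(color: "red", type: "car", brand: "BMW", miles: 62000)
# A red car manufactured by BMW with 62000 miles
```

The fixed order of the fragments is exactly what stops this from carrying over to other languages.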

However, this project needs to support several languages, and we need a fast way of translating this. This means that we can't just translate "manufactured by" and all the other elements of the sentence into each language and then compose the sentence with the same structure, because different languages may have very different sentence structures. Obviously, we could translate each combination of elements separately, but the effort quickly becomes too high, as the number of combinations can be huge (we have objects with 10 or more attributes).

What is the recommended way of dealing with this kind of scenario?

The project is implemented in Ruby on Rails, so I am ideally looking for an approach that supports this.

score 12 · Answer 1 · answered May 24 '19 at 15:03

If you want internationalization that isn't infuriating/weird/offensive/confusing to users, you do one of two things: you take the entire block out and have the copy translated in one go per item, or you remove the narrative sentence structure you propose that includes variable attributes so that you completely avoid differences in language structure.

The most common way this is done is you have a general product blurb, written in natural language, and this is translated to any language you support. It contains no information that will vary by product class, so if you must include some options it is said like "available in 2-door or 4-door models..." (but generally you avoid this when possible).

Then you use the attribute section as basically key-value pairs, so you have something like this:

Attribute: Value
Color: Red
Doors: 4
Year: 3000
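
A rough sketch of how such a key-value section could be rendered per language. The `LABELS` table and the `attribute_lines` helper are made up for illustration; in a Rails application the labels would more likely live in the locale files and be looked up through the I18n API.

```ruby
# Hypothetical label table; in a Rails app these strings would normally
# come from the locale files rather than a constant.
LABELS = {
  en: { color: "Color", doors: "Doors", year: "Year" },
  fr: { color: "Couleur", doors: "Portes", year: "Année" }
}.freeze

# Returns one "Label: value" line per attribute that is set (nil is skipped).
# Values such as the color name are assumed to be translated already.
def attribute_lines(attrs, locale)
  labels = LABELS.fetch(locale)
  attrs.compact.map { |key, value| "#{labels.fetch(key)}: #{value}" }
end

puts attribute_lines({ color: "Red", doors: 4, year: nil }, :en)
# Color: Red
# Doors: 4
```

Each label is translated on its own, so no sentence structure has to survive the translation.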

The reason this is so popular is precisely because your proposed place-holder solution simply does not work across languages. You can beat your head against it and be surprised when native speakers of other languages don't appreciate your broken translation, or just hope they still are willing to buy it anyway, but natural language is hard and it is better to either do it right (fluent translator working with a single set block of text) or avoid the problem (key-value is not natural language anywhere, it is easier to scan and compare products with, and it allows you to perform word-level translation far more easily into the majority of languages you are likely to encounter).

score 6 · Answer 2 · edited Jun 16 '20 at 10:01

Basic principles

The sentence structure and grammatical subtleties can indeed make the internationalization exercise difficult.

The core practices that you'll need in every case are:

separate the text data from the code. Put it in a separate resource file that you can give to translators.
for numeric values, have some code that can make unit conversions (e.g. miles to kilometers)
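
For the unit conversion, a minimal sketch could look like this. The `DISTANCE_UNIT` table and the `localized_distance` helper are assumptions for illustration only; a real application would take the preferred unit and the unit names from its locale resources.

```ruby
KM_PER_MILE = 1.609344 # exact by definition of the international mile

# Hypothetical per-locale unit preference.
DISTANCE_UNIT = { en: :mi, fr: :km, de: :km }.freeze

# Distances are assumed to be stored in miles and converted for display.
def localized_distance(miles, locale)
  if DISTANCE_UNIT.fetch(locale) == :km
    "#{(miles * KM_PER_MILE).round} km"
  else
    "#{miles} miles"
  end
end

puts localized_distance(62_000, :en) # 62000 miles
puts localized_distance(62_000, :fr) # 99779 km
```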

List of attributes

The easiest way is certainly to list the attributes and their values, as proposed in Brian's answer.

The drawback is that presenting the data this way takes up a lot of space on the screen.

Classified style

Another alternative could be to create a classified-like string that simply concatenates the different values, without even naming the attributes. This works well when the values are unambiguous on their own.

Here you need another good practice: for every error message or text assembly, do not hard-code the order of parameters, but use a multilingual string that contains named placeholders. The order of the placeholders will of course depend on the language and local usage.

English example               French example                  German example
---------------               --------------                  --------------
BMW car, red, 62000 miles     Berline BMW, rouge, 99000 km    BMW Pkw, rot, 99000 km
Dodge pickup-truck, 2010      Pick-up Dodge, 2010             Dodge Pick-up, 2010
Car, 5 seats                  Berline, 5 sièges               Pkw, 5 Sitze
{brand}{type}{miles}{seats}   {type}{brand}{miles}{seats}     {brand}{type}{miles}{seats}
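
To illustrate named placeholders, here is a small sketch using Ruby's built-in `format` with `%{name}` references, which is also the interpolation syntax Rails locale files use. The `TEMPLATES` table and the `classified` helper are hypothetical, and for brevity the sketch assumes every placeholder has a value; optional attributes would need extra handling.

```ruby
# Hypothetical per-locale templates a translator would maintain.
# Only the order of the placeholders differs between languages.
TEMPLATES = {
  en: "%{brand} %{type}, %{color}, %{distance}",
  fr: "%{type} %{brand}, %{color}, %{distance}",
  de: "%{brand} %{type}, %{color}, %{distance}"
}.freeze

# Fills the locale's template; the values are assumed to be translated already.
def classified(locale, values)
  format(TEMPLATES.fetch(locale), values)
end

puts classified(:en, brand: "BMW", type: "car", color: "red", distance: "62000 miles")
# BMW car, red, 62000 miles
puts classified(:fr, brand: "BMW", type: "Berline", color: "rouge", distance: "99000 km")
# Berline BMW, rouge, 99000 km
```

Because the code never fixes the order itself, a translator can move `%{type}` before `%{brand}` without any code change.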

It's more complex but there are a couple of advantages:

It's not difficult to generate.
It's much more compact, which makes it easy for a user to browse through a list.
It's much more accessible for visually impaired users who need to use a screen reader.
It's more voice-assistant friendly than a complex screen layout.

Sentences

Generating full sentences is one level further in complexity. But depending on your requirements, this could be a must-have feature (e.g. generation of contracts, voice assistant applications, ...).

So here, in addition to the multilingual string with the placeholders in the right order, and unless you have some sophisticated grammar-aware text generator, you'll need to provide more vocabulary in order to cope with:

singular (car) vs. plural (cars)
grammatical gender: many languages have grammatical gender. The car type in French could be masculine ("le pick-up") or feminine ("la berline"). In German you even have three genders (masculine, feminine and neuter). With such languages, the right order of words is not sufficient, and using the English word in a localisation API to find the localized equivalent is no longer a solution either. Here you must have text generation code that copes with these grammatical constraints (e.g. in French: "le pick-up bleu" vs. "la berline bleue")

For simple sentences, or self-description of objects, it is then sufficient to manage the multilingual resources with additional attributes:

type:  car      ->  French: gender=F;  (berline, S), (berlines, P)
type:  pick-up  ->  French: gender=M;  (pick-up, S), (pick-ups, P)
color: red      ->  French: (rouge, M, S), (rouge, F, S), (rouges, M, P), (rouges, F, P)
color: blue     ->  French: (bleu, M, S), (bleue, F, S), (bleus, M, P), (bleues, F, P)

So depending on the context, you may find out if it's singular or plural. Then you need to deduce the gender to be used (the type will tell you). Then for the remaining values, you'll pick the word with the known attributes combination.
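
The lookup described above could be sketched as follows. The constant names, the `french_phrase` helper and the article table are assumptions added for illustration, and the singular/plural choice is deliberately simplified.

```ruby
# Hypothetical French lexicon mirroring the resource table above.
NOUNS = {
  car:    { gender: :f, s: "berline", p: "berlines" },
  pickup: { gender: :m, s: "pick-up", p: "pick-ups" }
}.freeze

ADJECTIVES = {
  red:  { [:m, :s] => "rouge", [:f, :s] => "rouge", [:m, :p] => "rouges", [:f, :p] => "rouges" },
  blue: { [:m, :s] => "bleu",  [:f, :s] => "bleue", [:m, :p] => "bleus",  [:f, :p] => "bleues" }
}.freeze

# Definite articles, added here only to make the agreement visible.
ARTICLES = { [:m, :s] => "le", [:f, :s] => "la", [:m, :p] => "les", [:f, :p] => "les" }.freeze

def french_phrase(type, color, count = 1)
  number = count == 1 ? :s : :p     # simplified plural rule
  noun = NOUNS.fetch(type)
  key = [noun[:gender], number]     # the noun's gender drives the agreement
  # French color adjectives follow the noun.
  "#{ARTICLES.fetch(key)} #{noun.fetch(number)} #{ADJECTIVES.fetch(color).fetch(key)}"
end

puts french_phrase(:pickup, :blue) # le pick-up bleu
puts french_phrase(:car, :blue)    # la berline bleue
```

Every additional language needs its own lexicon and its own agreement rules, which is where the effort grows.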

Caution: If you want to construct more sophisticated sentences, you'll quickly face huge complexity (e.g. using the car's self-description in a sentence can be extremely difficult in languages like German, which use declensions that may require every word to be adjusted depending on the grammatical function of the group). And every language may have different rules. At that point, it could be worth considering an NLP translation service.
