How to generate "language-safe" UUIDs?

Question

I always wanted to use randomly generated strings for my resources' IDs, so I could have shorter URLs like this: /user/4jz0k1

But I never did, because I was worried about the random string generation creating actual words, eg: /user/f*cker. This brings two problems: it might be confusing or even offensive for users, and it could mess with the SEO too.

Then I thought all I had to do was to set up a fixed pattern like adding a number every 2 letters. I was very happy with my 'generate_safe_uuid' method, but then I realized it was only better for SEO, and worse for users, because it increased the ratio of actual words being generated, eg: /user/g4yd1ck5

Now I'm thinking I could create a method 'replace_numbers_with_letters', and check the result against a dictionary or something to make sure it hasn't formed any words.

Any other ideas?

PS: As I write this, I also realize that checking for words in more than one language (eg: English, French, Spanish, etc) would be a mess, and I'm starting to love numbers-only IDs again.

UPDATE

Some links everyone should read:

http://thedailywtf.com/Articles/The-Automated-Curse-Generator.aspx

CesarGon · Answer 1 · 2012-06-25T21:45:41.080

A couple of tips that will lower the chances of inadvertently creating meaningful words:

Add some non-alpha, non-numerical characters to the mix, such as "-", "!" or "_".
Compose your UUIDs by accumulating sequences of characters (rather than single characters) that are unlikely to occur in real words, such as "zx" or "aa".

This is some C# sample code (using .NET 4):

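// Requires: using System; using System.Collections.Generic; using System.Text;

// Shared instance: creating a new Random on every call can repeat values
// when calls happen in quick succession.
private static readonly Random Rng = new Random();

private string MakeRandomString()
{
    var bits = new List<string>()
    {
        "a",
        "b",
        "c",
        "d",
        "e",
        //keep going with letters.
        "0",
        "1",
        "2",
        "3",
        //keep going with numbers.
        "-",
        "!",
        "_",
        //add some more non-alpha, non-numeric characters.
        "zx",
        "aa",
        "kq",
        "jr",
        "yq",
        //add some more odd combinations to the mix.
    };

    var sb = new StringBuilder();
    for (int i = 0; i < 8; i++)
    {
        // Append one random fragment; two-character fragments make the
        // result longer than 8 characters.
        sb.Append(bits[Rng.Next(bits.Count)]);
    }

    return sb.ToString();
}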

This doesn't guarantee that you won't offend anyone, but I agree with @DeadMG that you cannot aim so high.

score 6 · Answer 2 · edited May 23 '17 at 12:40

Consider using a numeric or hexadecimal key instead. It will save you a lot of trouble compared to writing an i18n-aware profanity filter, and the worst you'll have to worry about is dead beef.
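
As a rough sketch of what that could look like in Python, assuming the standard-library `secrets` module is acceptable as the random source. The function names `make_hex_id` and `make_numeric_id` and the default lengths are made up for illustration and are not part of this answer:

```python
import secrets

def make_hex_id(num_bytes: int = 4) -> str:
    """Return a random lowercase hex string of 2 * num_bytes characters."""
    return secrets.token_hex(num_bytes)

def make_numeric_id(digits: int = 8) -> str:
    """Return a random decimal string of exactly `digits` characters."""
    return "".join(secrets.choice("0123456789") for _ in range(digits))
```

Hex IDs only ever contain 0-9 and a-f, which is why the set of readable words they can spell is so small.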

Community

answered Apr 08 '12 at 17:39

score 5 · Answer 3 · answered Apr 07 '12 at 17:04

Just create a naughty word list, a letter substitution list, and then if any ID generated is a naughty word, redo it.

For instance (pseudo code)

naughty_words = ["ass", "shit", "boobs"]
substitutions = {
    "4" : "a",
    "1" : "i",
    "3" : "e",
    "7" : "t",
    "5" : "s",
    "0" : "o"
    // etc.
}

function reducestring (str) {
    newstr = ""
    for (character in str) {
        if (substitutions[character]) newstr += substitutions[character]
        else newstr += character
    }
    return tolower(newstr)
}

do {
    new_id_numeric = random_number()
    short_id = compress_to_alphanumeric(new_id_numeric) // 0-9, a-z, A-Z
    // that function should create a base 62 number
    // keep retrying while the reduced id is a naughty word
} while (contains(naughty_words, reducestring(short_id)))

(You can refer to other short url recommendations like this one for info on base 62 hashing/conversion)

Now you no longer get IDs like "a55", "sh1t", or "b00bs". The letter substitution list only needs to cover letters that appear in your naughty words.

Since no one is going to read "455" as "ass", you might also want reducestring to return str unchanged when it doesn't contain any letters.
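
For anyone who wants something runnable, here is one possible Python rendering of the pseudocode. Treat it as a sketch: `to_base62`, `is_naughty`, and `new_short_id` are names invented here, the word list is the same toy list as above, and the check matches substrings, which is an assumption that goes further than the exact-match `contains` in the pseudocode.

```python
import secrets
import string

# Illustrative lists only; a real deployment would need far longer ones.
NAUGHTY_WORDS = ["ass", "shit", "boobs"]
SUBSTITUTIONS = str.maketrans(
    {"4": "a", "1": "i", "3": "e", "7": "t", "5": "s", "0": "o"}
)
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def to_base62(n: int) -> str:
    """Encode a non-negative integer using 0-9, a-z, A-Z."""
    if n == 0:
        return BASE62[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(BASE62[rem])
    return "".join(reversed(out))

def reduce_string(s: str) -> str:
    """Undo digit-for-letter substitutions, then lowercase the result."""
    return s.translate(SUBSTITUTIONS).lower()

def is_naughty(short_id: str) -> bool:
    # Assumption: substring match, stricter than the exact match above.
    reduced = reduce_string(short_id)
    return any(word in reduced for word in NAUGHTY_WORDS)

def new_short_id(max_value: int = 62**6) -> str:
    """Draw random numbers until the base-62 form passes the filter."""
    while True:
        short_id = to_base62(secrets.randbelow(max_value))
        if not is_naughty(short_id):
            return short_id
```

With a short blacklist the retry loop almost never runs more than once, so the cost of the filter is negligible.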

Examples

The graphic-design site Dribbble has its own short string ids for posts. These use 0-9, a-z and A-Z like http://drbl.in/dCWi.

I did some experimenting and there are short ids for at least a few naughty words. I guess we'll see when they get to f, but they aren't there yet.

Granted -- giving a user their own personally-identifying url (/user/whatever) instead of just a post is much worse with naughty words.

score 4 · Answer 4 · answered Apr 08 '12 at 07:08

There are essentially two strategies that you can employ:

Create a system that won't generate any offensive strings. For example, you can compose your id's only from consonant letters. By leaving out all vowels, you can be sure that your system will never generate any English words, naughty or otherwise.
After generating a completely random id, check to make sure that the new id doesn't include any offensive substrings.
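
The first strategy might look something like this in Python. This is only a sketch: the alphabet (lowercase consonants, with "y" left out because it can act as a vowel) and the name `make_consonant_id` are choices made here for illustration.

```python
import secrets

# Assumption: 20 lowercase consonants, with "y" excluded.
CONSONANTS = "bcdfghjklmnpqrstvwxz"

def make_consonant_id(length: int = 8) -> str:
    """Return a random ID built only from consonants."""
    return "".join(secrets.choice(CONSONANTS) for _ in range(length))
```

The trade-off is a smaller alphabet: 8 characters drawn from 20 letters give 20^8 (about 25.6 billion) possible IDs, so IDs need to be a bit longer than with a full alphanumeric set.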

DeadMG · Answer 5 · 2012-06-26T17:03:58.337

You can never prevent an automated system from generating some string that's offensive to a user. For example, in China some numbers are considered unlucky.

All you can really do is tell the user that their ID is random and its contents are irrelevant, and that if they get /user/fucker they should just ignore it. These things happen, and it's just not technically feasible to avoid them, just like you can never filter profanity.

score 1 · Answer 6 · edited Oct 07 '21 at 06:47

In many situations (email spam, IP blocking, etc), a blacklist is a losing game -- you'll never be able to make a "complete" blacklist of every possible bad thing that could ever occur.

Many people use a whitelist of acceptable words and string them together in some random order. (Perhaps with a dash or dot or space between each word).

Some popular dictionaries that are used for converting arbitrary numbers to a pronounceable series of words include:

a list of fruits
the PGP word list, also called a biometric word list
the DialDice list
the Diceware list
the S/KEY dictionary from RFC 1760
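
To illustrate the whitelist idea, here is a small Python sketch. The eight-word `WORDS` list is a hypothetical stand-in, not one of the published lists named above, and `number_to_words` and `random_word_id` are names invented for this example.

```python
import secrets

# Hypothetical stand-in; in practice load one of the published word lists.
WORDS = ["apple", "banana", "cherry", "grape", "lemon", "mango", "peach", "plum"]

def number_to_words(n: int, words=WORDS, sep: str = "-") -> str:
    """Encode a non-negative integer as whitelisted words (base len(words))."""
    base = len(words)
    parts = []
    while True:
        n, rem = divmod(n, base)
        parts.append(words[rem])
        if n == 0:
            break
    return sep.join(reversed(parts))

def random_word_id(count: int = 3, words=WORDS, sep: str = "-") -> str:
    """Pick `count` random whitelisted words and join them."""
    return sep.join(secrets.choice(words) for _ in range(count))
```

Because every output is built from words that were vetted up front, nothing has to be filtered afterwards; a longer word list gives shorter IDs for the same number range.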

score 0 · Answer 7 · answered Apr 08 '12 at 15:54

You can either make it just randomly generated numbers, or have a regex to cancel out the ones that are offensive:

/ass/ =~ userid
/boobs/ =~ userid
/morenaughtywordshere/ =~ userid
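
The same kind of check can be done in Python with a single compiled pattern. This is just a sketch: the two words are the placeholders from the lines above, and `is_offensive` is a name made up here.

```python
import re

# Placeholder patterns taken from this answer; extend the alternation as needed.
OFFENSIVE = re.compile(r"ass|boobs", re.IGNORECASE)

def is_offensive(user_id: str) -> bool:
    """Return True if any listed word appears anywhere in the ID."""
    return OFFENSIVE.search(user_id) is not None
```

A generator would then simply discard any ID for which `is_offensive` returns True and draw another.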

Billjk
