2

I want to build a search with basic typo tolerance. There are quite a few string similarity algorithms (and implementations for almost all languages I guess).

However, humans tend to make some typos more frequently than others, e.g.:

  • mixing up the order of letters when typing, e.g. tpying instead of typing
  • hitting the key next to the intended one, e.g. slright instead of alright, because s and a are next to each other on most keyboard layouts
  • mixing up spellings that sound similar, e.g. there instead of their

I think a good typo tolerance algorithm should take that into account. E.g. the pair alright and slright should get a higher similarity than alright and mlright. As far as I know, no string similarity algorithm does something like this.

Are there free algorithms (and implementations in TypeScript) which do?

(Unfortunately, just gathering masses of data of what my users actually type is not an option for me.)

cis
  • 255

2 Answers

3

I don't think there is a single algorithm that's going to do what you want.

For point 1 (transpositions) you can use Damerau-Levenshtein Distance, or the simpler Optimal String Alignment distance (discussed in the same link).
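To make the transposition case concrete, here is a minimal TypeScript sketch of the Optimal String Alignment distance. The function name `osaDistance` is my own and not taken from any library; treat it as an illustration of the recurrence rather than a tuned implementation.

```typescript
// Optimal String Alignment distance: plain Levenshtein plus one extra case
// that counts swapping two adjacent characters as a single edit.
function osaDistance(a: string, b: string): number {
  // d[i][j] = distance between the first i chars of a and the first j chars of b
  const d: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    d.push([i]); // turning a prefix of a into "" takes i deletions
  }
  for (let j = 1; j <= b.length; j++) {
    d[0][j] = j; // turning "" into a prefix of b takes j insertions
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution (or match)
      );
      // adjacent transposition, e.g. "pt" <-> "tp"
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}
```

With this, tpying and typing are one edit apart, whereas plain Levenshtein would count two substitutions.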

For point 2 (adjacent keys on the keyboard) you will need some kind of parameterised distance function. I have seen Levenshtein variants (can't remember where) that allow different penalties for substitution, insertion and deletion, but I haven't seen one that gives different penalties for different character substitutions. It should be possible to modify those algorithms to do that, though.
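As a rough sketch of what such a modification could look like, the TypeScript below plugs a per-pair substitution cost into the standard Levenshtein recurrence. Everything specific here is an assumption for illustration: a US QWERTY layout restricted to lowercase letters, approximate row offsets, a cost of 0.5 for neighbouring keys, and the names `substitutionCost` and `weightedDistance`.

```typescript
// Assumed layout: lowercase letters of a US QWERTY keyboard.
const ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"];
// Approximate horizontal stagger of each row, in key widths (assumption).
const ROW_OFFSETS = [0, 0.25, 0.75];

// Map each key to an (x, y) position on the keyboard.
const keyPos: { [key: string]: [number, number] } = {};
for (let y = 0; y < ROWS.length; y++) {
  for (let col = 0; col < ROWS[y].length; col++) {
    keyPos[ROWS[y][col]] = [col + ROW_OFFSETS[y], y];
  }
}

// Cheaper substitution when the two keys are physical neighbours.
// The 0.5 weight is an arbitrary choice for this sketch.
function substitutionCost(x: string, y: string): number {
  if (x === y) return 0;
  const p = keyPos[x];
  const q = keyPos[y];
  if (!p || !q) return 1; // unknown characters get the full penalty
  const adjacent = Math.abs(p[0] - q[0]) <= 1 && Math.abs(p[1] - q[1]) <= 1;
  return adjacent ? 0.5 : 1;
}

// Levenshtein distance with the substitution cost taken from the function above.
function weightedDistance(a: string, b: string): number {
  const d: number[][] = [];
  for (let i = 0; i <= a.length; i++) d.push([i]);
  for (let j = 1; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1, // deletion
        d[i][j - 1] + 1, // insertion
        d[i - 1][j - 1] + substitutionCost(a[i - 1], b[j - 1])
      );
    }
  }
  return d[a.length][b.length];
}
```

Under these assumptions alright/slright scores 0.5 and alright/mlright scores 1, so the neighbouring-key typo ranks as the closer match. The same idea could be combined with a transposition case if both kinds of typo matter.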

For point 3 (homophones) you would have to understand the grammatical context to know which of "their" or "there" was correct, so I don't see how that could be done with a basic spell checker.

rghome
  • 688
1

There are a couple of string similarity algorithms that could be useful in an editor for detecting typos. The trick is to take the current cursor position, find the start of the current token by scanning backwards to the previous separator, and apply the algorithm to the substring that runs up to the next separator.

These algorithms are good for matching full identifiers in all the situations that you have described, e.g. tpying instead of typing.
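The token extraction step described above might look roughly like this in TypeScript. The helper name `tokenAt` and the separator rule (anything that is not a letter, digit or underscore) are assumptions for the sketch; a real editor would use its own tokenizer.

```typescript
// Return the token surrounding the cursor position in `text`.
// Assumption: a separator is any character that is not a letter, digit or underscore.
function tokenAt(text: string, cursor: number): string {
  const isSeparator = (ch: string): boolean => !/[A-Za-z0-9_]/.test(ch);

  // Walk backwards until the character before `start` is a separator.
  let start = cursor;
  while (start > 0 && !isSeparator(text[start - 1])) start--;

  // Walk forwards until `end` points at a separator or the end of the text.
  let end = cursor;
  while (end < text.length && !isSeparator(text[end])) end++;

  return text.slice(start, end);
}
```

For example, with the text `let tpying = 1` and the cursor anywhere inside the second word, this yields `tpying`, which can then be handed to whichever similarity function you chose.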

However, these algorithms are less suitable for detecting typos as they occur, e.g. tpyi… instead of typi… If your intention is to detect typos as they happen, you'll need to be more creative and maintain a trie of the valid tokens. As long as the currently typed prefix is still a path in the trie, you don't do anything, because it may still grow into a valid token (e.g. if you had a variable called tpyih). As soon as you reach a character that is not in the trie, you could run a string similarity check to see whether there is a typo. You should however limit the check to close matches (i.e. a distance no larger than 2).
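A minimal TypeScript sketch of that idea follows. All names (`TrieNode`, `insert`, `hasPrefix`, `typoCandidates`) are hypothetical, plain Levenshtein stands in for whichever similarity measure you prefer, and comparing the typed text against the equally long prefix of each known identifier is my own assumption about how to apply the check to a partial token.

```typescript
interface TrieNode {
  children: { [ch: string]: TrieNode };
  isWord: boolean;
}

// Object.create(null) avoids accidental hits on inherited object properties.
function makeNode(): TrieNode {
  return { children: Object.create(null), isWord: false };
}

function insert(root: TrieNode, word: string): void {
  let node = root;
  for (let i = 0; i < word.length; i++) {
    const ch = word[i];
    if (!node.children[ch]) node.children[ch] = makeNode();
    node = node.children[ch];
  }
  node.isWord = true;
}

// True while `prefix` can still grow into some known identifier.
function hasPrefix(root: TrieNode, prefix: string): boolean {
  let node = root;
  for (let i = 0; i < prefix.length; i++) {
    const next = node.children[prefix[i]];
    if (!next) return false;
    node = next;
  }
  return true;
}

// Plain Levenshtein distance using two rows; stands in for any similarity measure.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Stay silent while the typed text is a valid prefix; otherwise return the
// identifiers whose equally long prefix is within maxDistance of it.
function typoCandidates(
  root: TrieNode,
  known: string[],
  typed: string,
  maxDistance = 2
): string[] {
  if (hasPrefix(root, typed)) return [];
  const out: string[] = [];
  for (const word of known) {
    if (levenshtein(typed, word.slice(0, typed.length)) <= maxDistance) {
      out.push(word);
    }
  }
  return out;
}
```

With only typing and total known, typing tpyi would suggest typing. Once an identifier tpyih is also inserted, the same input stays silent because it is still a valid prefix.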

Christophe
  • 81,699