I'm interested in finding a text distance (or string similarity) algorithm which computes a greater distance (or lower similarity) when characters are further apart.

For example, I want the distance between abc and abz to be greater than the distance between abc and abd.

It would be easy to compute a text distance like this for strings of the same length, but I'd like to find one that also works for strings of different lengths.

Common algorithms like Levenshtein, Jaro-Winkler, and Ratcliff-Obershelp compute the same values for these two examples.

Edit: People are asking for a specific distance metric, so let's say it's the absolute difference between character values divided by the length of the longer string. And to keep this simple, only ASCII characters are considered.
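
For strings of the same length, that metric might be sketched as below. The name `positional_distance` is hypothetical, and summing the per-position differences before dividing is just one reading of the description above; strings of different lengths are deliberately rejected, since handling them is the open part of the question.

```python
def positional_distance(a: str, b: str) -> float:
    """Sum of absolute character-code differences, divided by the longer length.

    Sketch only: defined here for equal-length strings. Different lengths
    would need some alignment step, which this function does not attempt.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length strings")
    if not a:
        return 0.0
    # Compare characters position by position using their code points.
    total = sum(abs(ord(x) - ord(y)) for x, y in zip(a, b))
    return total / max(len(a), len(b))
```

With this sketch, abc versus abd gives 1/3 and abc versus abz gives 23/3, so the second pair is further apart as desired.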

3 Answers

1

First, define what the distance between two characters is. For example, p and b, g and k, and d and t are similar, and a and e are reasonably similar. If I had software converting speech to text and the recognised word is dable, then I suspect it’s really table.

Or you could use the keyboard distance. If you see lrunpstf, then it’s someone typing “keyboard” without looking, with their fingers one position to the right.
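
A keyboard-based character distance could be sketched roughly as follows. The QWERTY layout, the row stagger values in `OFFSETS`, and the name `key_distance` are all assumptions made for illustration, and only lowercase or uppercase letters are covered.

```python
# Letter rows of a QWERTY keyboard (assumed layout).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Horizontal stagger of each row in key widths (assumed, approximate values).
OFFSETS = [0.0, 0.25, 0.75]

# Map each letter to an (x, y) position on the assumed layout.
KEY_POS = {
    ch: (col + OFFSETS[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two letter keys, in key widths."""
    (x1, y1), (x2, y2) = KEY_POS[a.lower()], KEY_POS[b.lower()]
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
```

Neighbouring keys in the same row, such as e and r, come out at distance 1.0, while q and p come out at 9.0. A function like this can serve as the per-character cost inside whatever string distance is chosen.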

gnasher729
1

As others have said, you must first define "distance". Once you have done so, however, standard approaches can be used. I have implemented Levenshtein this way: most changes were counted as two points, but different trailing digits were counted as only one point. In practice it proved quite good at finding what the users generally wanted, which was other lots in the same project. The basic algorithm works the same no matter how complex your distance logic is. I have never worked with the other algorithms you mention, so I can't address them, but I would be surprised if the same approach didn't work.
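
To illustrate the general idea (this is a minimal sketch, not the implementation described above), Levenshtein can take a pluggable substitution cost. The names `weighted_levenshtein`, `ascii_cost`, and `distance`, the fixed insert/delete cost of 1.0, and the scaling by 127 are all assumptions chosen to match the ASCII-only metric in the question.

```python
from typing import Callable

def weighted_levenshtein(
    a: str,
    b: str,
    sub_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Levenshtein distance with a caller-supplied substitution cost.

    sub_cost should return 0 for identical characters.
    """
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute ca with cb
            ))
        prev = curr
    return prev[-1]

def ascii_cost(x: str, y: str) -> float:
    """Absolute ASCII difference scaled to [0, 1]; 127 is the largest gap."""
    return abs(ord(x) - ord(y)) / 127

def distance(a: str, b: str) -> float:
    """Weighted edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return weighted_levenshtein(a, b, ascii_cost) / longest if longest else 0.0
```

Under these assumptions, `distance("abc", "abz")` is larger than `distance("abc", "abd")`, and strings of different lengths are handled by the insert and delete steps. How expensive an insertion or deletion should be relative to a substitution is a design choice that depends on the definition of distance.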

0

I agree with @gnasher729 that you need to define the meaning of distance precisely. For example:

  • Should Distance("ab", "a") be greater than Distance("ab", "z")?
  • Should Distance("ab", "a") equal Distance("ab", "A")?

I'm not sure whether the code linked below will help, but it may be useful as a starting point for making the question more specific:

Code for thought (not an actual algorithm).

NoChance