How does Pearson hashing compare with other non-cryptographic hashing algorithms?

Question

FNV-1, Murmur2, and DJB2 are examples of non-cryptographic hashing functions used in actual applications (see "Which hashing algorithm is best for uniqueness and speed?"). These are all similar in that they have an inner loop that computes the result by using simple operations such as XOR or bit shifting.

Perhaps these algorithms are truly the best available; I don't know. But Pearson hashing, in particular, seems to be rarely considered. Yet it would seem to be the best choice when nothing is known about the domain of the keys, because it was designed to spread out (randomize) the range well for any domain.

This means (for character-string keys) that whether the keys are single letters (a small domain) or strings of length 128 (a much larger domain), the hashed value is guaranteed to be spread out well, that is, distributed randomly over the range of the function. Such a distribution is desirable because it would be expected to reduce the number of collisions for random selections of keys (again, assuming the key distribution has no special characteristics).

Pearson hashing accomplishes this by using a 256-entry array that contains a random permutation having a single cycle. To clarify what I mean, here is a four-entry array specifying such a permutation of the value list [0, 1, 2, 3]:
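As an illustration, one such four-entry array is [2, 3, 1, 0]; the values are chosen here for the example and are not taken from any reference table. Following i -> array[i] from 0 gives 0 -> 2 -> 1 -> 3 -> 0, so every entry lies on the same cycle. A minimal Python sketch of that check, where `is_single_cycle` is a hypothetical helper name:

```python
def is_single_cycle(perm):
    """Return True if perm is a permutation of 0..n-1 forming one cycle of length n."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(n)):
        return False  # empty, or not a permutation at all
    steps, i = 0, 0
    while True:
        i = perm[i]      # follow the mapping i -> perm[i]
        steps += 1
        if i == 0:       # back at the start: the cycle through 0 is closed
            break
    return steps == n    # single cycle only if it visited every index

single = [2, 3, 1, 0]    # 0 -> 2 -> 1 -> 3 -> 0
split = [1, 0, 3, 2]     # 0 -> 1 -> 0 and 2 -> 3 -> 2 (two separate cycles)

print(is_single_cycle(single))  # True
print(is_single_cycle(split))   # False
```

The second array is also a valid permutation, but it splits into two cycles of length two, which is the case the single-cycle condition rules out.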

Pearson hashing scans the key string. For each byte (it's okay for a character to span multiple bytes), it XORs the byte into a running value, hash, then looks hash up in its 256-byte permutation array. The lookup result becomes the next running value. Since the array contains 256 entries, it can handle any byte value. To scale up to a larger range, such as 16 or 32 bits, the inner loop is performed more than once, once per byte of output.

Since the algorithm uses one XOR and one simple array lookup per key byte for each byte of the range, it is probably the fastest algorithm possible, particularly if implemented in assembly language. The Pearson hashing function should also be small, not much more than the 256 bytes used by its array. Except in very tiny embedded applications, I can't imagine 256 bytes of memory being an objection to the use of Pearson hashing.
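The scan described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the table is built with Sattolo's algorithm from an arbitrary seed (the helper `make_table` and the seed are hypothetical), and the wider variant starts each pass from a different initial value, which is one possible way to make the repeated passes differ rather than something the description above specifies.

```python
import random

def make_table(seed=12345):
    """Build a 256-entry single-cycle permutation (Sattolo's algorithm).
    The helper name and the seed are illustrative assumptions."""
    rng = random.Random(seed)
    table = list(range(256))
    for i in range(255, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, so an entry never swaps with itself
        table[i], table[j] = table[j], table[i]
    return table

TABLE = make_table()

def pearson8(data: bytes, table=TABLE) -> int:
    """8-bit Pearson hash: XOR each byte into the running value, then look it up."""
    h = 0
    for byte in data:
        h = table[h ^ byte]
    return h

def pearson(data: bytes, width_bytes: int = 4, table=TABLE) -> int:
    """Wider hash: one full pass over the key per output byte.
    Assumption: pass j starts from the initial value j so the passes differ."""
    result = 0
    for j in range(width_bytes):
        h = j
        for byte in data:
            h = table[h ^ byte]
        result = (result << 8) | h  # pass 0 ends up in the most significant byte
    return result

print(f"{pearson8(b'hello'):02x}")
print(f"{pearson(b'hello', 4):08x}")
```

With a 4-byte result, every key byte goes through the XOR-and-lookup step four times, once per pass.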

Some examples of Pearson function implementations are given at https://gist.github.com/imdario/4758192 .

I would be very interested to see an analysis of Pearson hashing as compared to other more commonly-used algorithms, such as the three listed above, with input sets such as an English dictionary. I would expect it to do very well.

score 1 · Answer 1 · answered Jun 12 '16 at 06:59

I don't have a practical comparison between Pearson hashing and the other common suggestions, but I can highlight some assumptions you're making that aren't necessarily true and which might explain why it isn't as popular as you seem to expect:

You state that having good distribution of small keys throughout the entire range is just as important as good distribution of larger keys, but this is not necessarily true. In practical applications, small keys are rare, and cannot occur with any non-trivial frequency in large data sets simply because there are only a small number of possible small keys. We only care about optimal performance for large data sets, as small data sets can be processed quickly enough in any case.
You assert that performance will be good due to the simplicity of the algorithm, but it doesn't really seem that simple to me. For a 32-bit hash (which is the smallest that's really useful) it requires 8 operations per byte. Compare this to Murmur's 6 operations per 4-byte word, and it is clearly not going to be competitive. Even a single byte output at 2 operations per byte is unlikely to be as fast as Murmur.

score 0 · Answer 2 · answered Nov 22 '21 at 04:41

A comprehensive test suite and review of modern hash functions can be found in Reini Urban's SMHasher repository. According to the quality values, the 64-bit Pearson hash is somewhat worse than the 64-bit Murmur2 hash, and it is almost 17 times slower!

Hash function  |  Speed  | A.Bias | Sparse Bias |  Cycl.Bias | 2-byte Bias |
============================================================================
Spooky64       | 9747.47 |   0.8% |      0.362% |     0.167% |      0.128% |
Murmur3_32     | 2413.88 |   0.8% |      0.548% |     0.106% |      0.201% |
Murmur2_64     | 4882.95 |  12.6% |      1.121% |     0.150% |      0.364% |
Pearson64      |  287.95 |  12.8% |      1.541% |     6.881% |      6.794% |
FNV64          |  791.82 | 100.0% |     99.988% |    94.450% |     99.837% |
DJB2           |  791.82 | 100.0% |  Collisions | Collisions |  Collisions |
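The speed ratios can be recomputed directly from the Speed column above. A small Python sketch, with the values copied from the table and `times_faster` as a hypothetical helper name:

```python
# Speed column copied from the table above (units as reported there).
speed = {
    "Spooky64": 9747.47,
    "Murmur3_32": 2413.88,
    "Murmur2_64": 4882.95,
    "Pearson64": 287.95,
}

def times_faster(a: str, b: str) -> float:
    """How many times faster hash a is than hash b, by the Speed column."""
    return speed[a] / speed[b]

for name in ("Murmur2_64", "Murmur3_32", "Spooky64"):
    print(f"{name} vs Pearson64: {times_faster(name, 'Pearson64'):.1f}x")
```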

So, currently something like SpookyHash would be the best option, due to both its high speed and the high quality of its hash values. If not Spooky, then Murmur3 beats Pearson hash in every meaningful way.
