3

I was implementing a hashing function for a class and I took a minute to look at the first-party collection package for Dart to see how they implemented their hashing function for collections. They use the Jenkins one-at-a-time algorithm to calculate the hashing, but I noticed something peculiar: the function was applying a bit mask to every operation that modified the running hash value:

(I'm using the ListEquality implementation here for ease of comprehension, but every collection hash function uses this bit mask. Also, the _elementEquality.hash just calculates a hash code for an element if a custom hashing function is specified, and defaults to using the element's own hashing function otherwise.)

const _hashMask = 0x7fffffff;

...

int hash(List<E>? list) {
  if (list == null) return null.hashCode;
  // Jenkins's one-at-a-time hash function.
  // This code is almost identical to the one in IterableEquality, except
  // that it uses indexing instead of iterating to get the elements.
  var hash = 0;
  for (var i = 0; i < list.length; i++) {
    var c = _elementEquality.hash(list[i]);
    hash = (hash + c) & _hashMask;
    hash = (hash + (hash << 10)) & _hashMask;
    hash ^= hash >> 6;
  }
  hash = (hash + (hash << 3)) & _hashMask;
  hash ^= hash >> 11;
  hash = (hash + (hash << 15)) & _hashMask;
  return hash;
}

I'm not sure what function this bit mask serves, and more than that, I worry that it runs the risk of generating collisions. Consider this (albeit contrived) example:

import 'package:collection/collection.dart';

class MyInteger {
  final int x;

  const MyInteger(this.x);

  @override
  int get hashCode => x;
}

void main() {
  final a = [MyInteger(0x0000000000000000)];
  final b = [MyInteger(0xffffffff80000000)];

  print(a[0].hashCode); // Prints: 0
  print(b[0].hashCode); // Prints: -2147483648

  final equality = const ListEquality<MyInteger>();

  print(equality.hash(a)); // Prints: 0
  print(equality.hash(b)); // Prints: 0
}

As I suspected, masking off the upper bits leads to collisions between hash codes that differ only above the 31st bit. However, if I reimplement the hashing function with the bit mask removed, it works as expected:

// Reimplementation of the package hashing function with the bit mask removed
int hashAll<E>(List<E>? list) {
  if (list == null) return null.hashCode;

  var hash = 0;
  for (var i = 0; i < list.length; i++) {
    var c = list[i].hashCode;
    hash += c;
    hash += hash << 10;
    hash ^= hash >> 6;
  }
  hash += hash << 3;
  hash ^= hash >> 11;
  hash += hash << 15;
  return hash;
}

void main() {
  final a = [MyInteger(0x0000000000000000)];
  final b = [MyInteger(0xffffffff80000000)];

  print(hashAll(a)); // Prints: 0
  print(hashAll(b)); // Prints: 659537742195965952
}

This strikes me as a bizarre design choice. In Dart, ints are 64-bit by definition, so this bit mask would have the effect of repeatedly chopping off the 33 most significant bits every time the hash was updated. By reducing the effective hash resolution by more than half, I'd imagine that this would significantly increase the chance of collisions, especially in collections with more elements.
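To make the effect of the mask concrete, here is a minimal sketch, assuming native 64-bit Dart ints (the Dart VM or AOT, not dart2js). The `applyMask` helper is hypothetical; only the mask value comes from the package code quoted in the question.

```dart
// Same value as _hashMask in the package code quoted in the question.
const hashMask = 0x7fffffff;

// Hypothetical helper: keeps only the low 31 bits of a hash value.
int applyMask(int hash) => hash & hashMask;

void main() {
  // Assumption: native 64-bit ints. On the web, ints are JavaScript
  // numbers and these results may differ.
  print(hashMask.bitLength); // 31

  // Only bits 31..63 are set, so nothing survives the mask.
  print(applyMask(0xffffffff80000000)); // 0

  // All 64 bits set: only the low 31 survive.
  print(applyMask(-1)); // 2147483647
}
```

Any two values that agree in their low 31 bits become indistinguishable after masking, which is the collision in the `ListEquality` example.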

The only thing I can think of is that, since this utility was created back in 2016, it has its origins in the days when Dart was designed to be a language transpiled to JavaScript. What seems to support this assumption is the fact that running that third code example gives different results in a compiled setting vs. a dart2js setting. (For example, in DartPad, the generated unmasked hash is 2987540480.)

But even if that was the case, why limit the int to 31 bits when integers in JavaScript are safe up to 53 bits, and why not make this an implementation detail for just dart2js? And if not, what purpose does the bit mask serve that justifies this increased chance of collision?

Abion47
  • 141

3 Answers

3

It's hard to know the exact reason for this. That it truncates the hash to fit into 31 bits suggests that it's related to compatibility with 32-bit signed integers. It's also true that the algorithm was designed for 32 bits. It's not clear (to me at least) that the algorithm will have comparable effectiveness with more bits.

After looking more at the article, I noticed that the code this was based on uses an undeclared mask at the end. Skimming around the rest of the article, I noticed this:

The best hash table sizes are powers of 2. There is no need to do mod a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask.

I thought it was worth noting but it doesn't really explain why a 31-bit mask was used.

One thing to realize, though, is that, in practice, you are unlikely to ever have anywhere near 2 billion 'buckets' in a hashmap. It just doesn't make sense to have an underlying list that large when you are only using a small fraction of the locations. The typical way these are implemented is with some small number of buckets (by default) and a load factor which determines when to grow the underlying array and rehash the items.

For example, you could start with 16 'buckets'. When an item is added, its hash is calculated and that is then modded by 16 to determine which bucket it will end up in. Each 'bucket' is a list which can hold multiple items. At some point, when the number of items nears or exceeds 16, the number of buckets is increased, and the hashes are re-modded to find their new location.

Due to this, in effect, the hash will always be truncated to something smaller than 32 bits anyway. I'm struggling to imagine a case where these extra bits are going to have a significant impact on the distribution of items in the map.
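As a rough illustration of that point, here is a minimal sketch assuming native 64-bit Dart ints and a table whose bucket count is a power of two. The `bucketIndex` helper is hypothetical and is not how any particular Dart hash map is implemented; the hash value is the unmasked one printed in the question.

```dart
// Hypothetical helper: maps a hash code to a bucket by taking it modulo
// the bucket count. Dart's % never returns a negative result for a
// positive divisor, so the result is always a valid index.
int bucketIndex(int hash, int bucketCount) => hash % bucketCount;

void main() {
  // Unmasked 64-bit hash from the question (assumes native ints).
  const full = 659537742195965952;
  // The same hash truncated to 31 bits, as the package's mask would do.
  const masked = full & 0x7fffffff;

  for (final buckets in [16, 1024, 1 << 20]) {
    final a = bucketIndex(full, buckets);
    final b = bucketIndex(masked, buckets);
    // For power-of-two bucket counts up to 2^31, only the low bits of
    // the hash matter, so both hashes land in the same bucket.
    print('buckets=$buckets full=$a masked=$b same=${a == b}');
  }
}
```

The extra high bits would only change the bucket once the table had more than 2^31 buckets.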

JimmyJames supports Canada
  • 30,578
  • 3
  • 59
  • 108
2

It seems to be designed so that the resulting hash always fits in a 32-bit signed integer. You get collisions, but any two hashes clash with a probability of only about one in 2^31. That is enough that you need code to handle it, but so rare that it doesn't affect performance.
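For a sense of scale, here is a back-of-the-envelope sketch. It assumes the hash values are uniformly distributed, which a real hash function only approximates, and the `expectedCollidingPairs` helper is hypothetical. Among n items there are n(n-1)/2 pairs, and each pair collides with probability 1/2^bits.

```dart
import 'dart:math';

// Hypothetical helper: expected number of colliding pairs among n items
// whose hashes are uniformly distributed over 2^bits values.
double expectedCollidingPairs(int n, int bits) =>
    n * (n - 1) / 2 / pow(2.0, bits);

void main() {
  for (final n in [1000, 100000, 1000000]) {
    final pairs31 = expectedCollidingPairs(n, 31);
    final pairs64 = expectedCollidingPairs(n, 64);
    print('n=$n 31-bit=$pairs31 64-bit=$pairs64');
  }
}
```

Under this assumption, full-hash collisions with 31 bits stay rare for small collections and become noticeable only as n approaches tens of thousands of items.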

And you may want a hash function that works in different languages.

gnasher729
  • 49,096
1

I can only guess, but here's what I think.

First of all you would limit hashes to 32 bits. That's the size of hash that Java (and many other languages) expects. Now hash values are typically used in hash tables. Which under the hood are arrays, and we would typically perform some kind of modulo operation in order to retrieve a concrete index.

Now taking 31 bits means we are taking a positive 32-bit signed integer. This is relevant because in many languages the % operator returns a negative result for a negative input, which of course cannot be used as an index into an array. This is the case in Java, for example. So interacting with code that came from Java requires extra care. Perhaps. And Dart on Android does interact with code that came from Java, although it isn't clear to what degree.
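The sign issue can be sketched in Dart itself. Dart's own `%` never returns a negative result for a positive divisor, but `remainder` takes the sign of the dividend, like Java's `%`, so it stands in for a Java-style table here. The `truncatingIndex` helper is hypothetical and the example assumes native 64-bit ints.

```dart
// Hypothetical helper: computes a bucket index the way Java's % would,
// using a truncating remainder whose sign follows the hash.
int truncatingIndex(int hash, int bucketCount) =>
    hash.remainder(bucketCount).toInt();

void main() {
  const hash = -17; // some negative hash code
  const buckets = 16;

  // Truncating remainder: negative, so unusable as an array index.
  print(truncatingIndex(hash, buckets)); // -1

  // Masking to 31 bits first makes the hash non-negative, so even a
  // truncating remainder yields a valid index.
  print(truncatingIndex(hash & 0x7fffffff, buckets)); // 15

  // Dart's % is already non-negative for a positive divisor.
  print(hash % buckets); // 15
}
```

This only shows that the mask would protect a table that uses a truncating remainder; whether that was the package authors' motivation remains a guess.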

Either way, regardless of the language, one can say that this is defensive programming against broken implementations of hash tables (and potentially other structures), which I assume is or was a real problem.

In principle the same problem applies to most languages, but affects Java or C# more because these languages require hash codes to be signed integers for some reason.

But as I said: this is just a guess.

There's of course a second question: how this masking in the middle (not the last one) affects the overall properties of the function. But this requires some deeper and harder analysis.

freakish
  • 2,803