11

The idea of hashes is that they produce drastically different results for even the smallest change in the data.

What I am asking for is the opposite of that: an algorithm that produces proximity hash values, so that data that is approximately the same gets approximately the same hash. Regardless of where the difference is, it should only measure the extent of the difference, and the resulting hash value should be closer to or further from the original hash value depending on whether the second data set is more or less similar to the original.

I am not targeting any particular data type; raw byte arrays would be fine.

dtech
  • 763

6 Answers

14

The idea of hashes is that they produce drastically different results for even the smallest change in the data.

No, it is not.

The idea of hashes is that they map a larger, potentially infinite input space to a smaller, finite, usually fixed-size output space. For example, SHA-3 maps infinitely many octet strings into 2^512 bit strings.

What you are talking about is one of the properties of a cryptographically secure message digest, which is a (very small) special case of a hash function.

For example, the hash functions (aka fingerprints) that are used to detect copyright violations on online video platforms would be useless if they had this property.

One of the most well-known hash functions that has the opposite property, namely that similar inputs generate similar outputs, is soundex: an algorithm that produces similar hash values for words that sound similar.
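For illustration, here is a minimal Soundex sketch in Python. It is a hand-rolled, simplified variant (the official h/w separation rule is omitted), and the example words are only there to show the intent:

```python
# Simplified American Soundex: similar-sounding words map to the same code.
def soundex(word: str) -> str:
    codes = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }
    word = word.lower()
    digits = [codes.get(ch, "") for ch in word]   # vowels and h/w/y map to ""
    result = []
    prev = digits[0]          # the first letter is kept literally, not coded
    for d in digits[1:]:
        if d and d != prev:   # collapse runs of the same code
            result.append(d)
        prev = d
    return (word[0].upper() + "".join(result) + "000")[:4]

if __name__ == "__main__":
    # "Robert" and "Rupert" sound alike and both map to R163.
    print(soundex("Robert"), soundex("Rupert"))
```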

What I am asking for is the opposite of that: an algorithm that produces proximity hash values, so that data that is approximately the same gets approximately the same hash. Regardless of where the difference is, it should only measure the extent of the difference, and the resulting hash value should be closer to or further from the original hash value depending on whether the second data set is more or less similar to the original.

This sounds more like a similarity measure than a hash function. In particular, there is nothing in your description that implies a small, fixed-size, finite output space.

Jörg W Mittag
  • 104,619
3

What you're talking about is Cluster Analysis.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

There are numerous approaches to this, such as k-means clustering.
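As a rough sketch of how that could look for the raw byte arrays from the question, assuming the arrays are first reduced to fixed-length feature vectors (byte histograms here, which is my choice, not part of the answer) and then clustered with scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

def byte_histogram(data: bytes) -> np.ndarray:
    """Reduce a byte string to a fixed-length (256-dimensional) frequency vector."""
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / max(len(data), 1)

if __name__ == "__main__":
    samples = [b"aaaabbbb", b"aaabbbbb", b"xyzxyzxy", b"zyxzyxzz"]
    X = np.stack([byte_histogram(s) for s in samples])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # samples with similar byte distributions share a cluster label
```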

Erik Eidt
  • 34,819
3

It's called Perceptual Hashing.

https://en.wikipedia.org/wiki/Perceptual_hashing

pHash - an open source perceptual hash library http://www.phash.org/

Blockhash.io - an open standard for perceptual hashes http://blockhash.io/

Insight - a perceptual hash tutorial http://bertolami.com/index.php?engine=blog&content=posts&detail=perceptual-hashing
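To illustrate the general idea (this is the much simpler "average hash", not the specific algorithms behind pHash or Blockhash), here is a minimal sketch using Pillow; the file names are hypothetical:

```python
from PIL import Image  # assumes Pillow is installed

def average_hash(path: str, size: int = 8) -> int:
    """Downscale to size x size grayscale; set one bit per pixel that is
    brighter than the mean. Similar images give similar bit patterns."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distance = perceptually similar."""
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    h1 = average_hash("original.png")          # hypothetical files
    h2 = average_hash("slightly_edited.png")
    print(hamming_distance(h1, h2))
```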

1

This is probably not what you want, but since it's interesting, I thought I would bring it up. There is actually something produced about a decade ago called the Similarity Metric (see also Clustering by Compression). The idea is that, given x and y, we want to calculate the length of the shortest program that produces x given y and y given x (modulo some fudge factors). If we normalize this appropriately, we get a good notion of relative similarity. However, this notion is defined in terms of Kolmogorov complexity, which is not computable, so it doesn't actually produce a usable algorithm; we have to approximate it instead. This leads to the Normalized Compression Distance, which is simply:

NCD(x,y) = (C(x++y) - min(C(x), C(y))) / max(C(x), C(y))

where x++y represents the concatenation of x and y viewed as bit sequences. And, importantly, C(x) represents the bit length of the compressed representation of x for some (appropriately behaving) compression algorithm C. Basically, if you take a compression algorithm, such as gzip, you can simply compress each input and their concatenation and use the above formula to get a rational number between 0 and 1 indicating how "similar" they are. (You can then round that rational number to the nearest fixed point number for a given number of bits to get a fixed size output.) This assumes every bit is significant. It may well make sense to "normalize" the input (e.g. removing extraneous whitespace or low-pass filtering) to avoid spurious differences. Conceptually, this can be folded into the compression algorithm. This points out that this notion of similarity varies with the compression algorithm, though general-purpose compression algorithms are often adequate. Some tools to do this are here, though it would be easy to roll your own.
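A minimal sketch of the NCD, using zlib from Python's standard library (which, like gzip, is DEFLATE-based) as the compressor C:

```python
import zlib

def C(data: bytes) -> int:
    """Compressed length in bytes, standing in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 = similar, near 1 = dissimilar."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

if __name__ == "__main__":
    a = bytes(range(256)) * 16
    b = bytes(range(256)) * 16 + b"noise"        # almost a copy of a
    c = bytes(255 - i for i in range(256)) * 16  # different structure
    print(ncd(a, b))  # small: b compresses well given a
    print(ncd(a, c))  # larger: c shares little structure with a
```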

I suspect a real compressor will be "too good" at finding similarities for your purposes. That is, it will find similarity between things that you don't want to consider similar. That might be technically resolvable by a suitable choice of compressor, but using some other metric may make more sense than doing this.

0

As others have said, a similarity metric is what you probably want. You say you have byte arrays. I will assume the arrays can be of unequal lengths, but even if they are always of equal length it is not much different. I would start with a simple metric like cosine similarity. You say that the order is not important, only the number of times a value appears in the array.

So let's assume byte values from -128 to 127. I would transform the two arrays into vectors in a Euclidean space of 256 dimensions, where vector[i] is the number of times the value -128+i appears in the original array. For simplicity we will assume sparse vector representations.

Now both arrays live in the same dimensional space, and you can calculate the cosine similarity of the new representations. Since cosine similarity is naturally bounded to [-1, 1], two arrays with a similarity of 1 are identical (as histograms), and the lower the value, the more dissimilar they are; with non-negative count vectors the value will in practice lie in [0, 1], with 0 meaning no byte value in common.

Doing this would give you a similarity of 1 between any two of the three arrays [1,2,3], [2,3,1], and [3,1,2], for example.
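A minimal sketch of this histogram-plus-cosine approach, assuming the signed byte values described above (the function names are illustrative):

```python
import math
from collections import Counter

def byte_counts(arr: list[int]) -> Counter:
    """Sparse 256-dimensional count vector: index = byte value + 128."""
    return Counter(b + 128 for b in arr)

def cosine_similarity(a: list[int], b: list[int]) -> float:
    va, vb = byte_counts(a), byte_counts(b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    # Order does not matter, only how often each value occurs.
    print(cosine_similarity([1, 2, 3], [3, 1, 2]))  # 1.0
    print(cosine_similarity([1, 2, 3], [1, 2, 4]))  # ~0.667
    print(cosine_similarity([1, 1, 1], [2, 2, 2]))  # 0.0
```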

0

You may want to consider a correlation function; a simple example is the cross-correlation.

These methods yield a similarity measure and are generally translation invariant (meaning that data offsets, as you described in your comment, would be irrelevant).
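As a rough sketch, assuming NumPy and one particular normalization choice of mine (the peak of the normalized full cross-correlation serves as the similarity score):

```python
import numpy as np

def peak_cross_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Peak of the normalized full cross-correlation; insensitive to offsets."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return float(np.max(np.correlate(a, b, mode="full")))

if __name__ == "__main__":
    t = np.arange(64, dtype=float)
    x = np.exp(-((t - 20) ** 2) / 20)                 # a bump centered at t = 20
    y = np.exp(-((t - 35) ** 2) / 20)                 # the same bump, shifted
    z = np.random.default_rng(0).normal(size=t.size)  # unrelated noise
    print(peak_cross_correlation(x, y))  # high: same shape, only the offset differs
    print(peak_cross_correlation(x, z))  # much lower: unrelated data
```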

LorToso
  • 261