
Unfortunately, I am not even sure of the keywords to use for this question, so if this has already been asked, please point me in the right direction.

Given a set of vectors:

[3, 5, 3]
[10, 23, 5]
[123, 53, 97]

(here, 'vector' means a one-dimensional sequence of numeric values).

I'd like to know a way of finding the closest match for any input vector. For example,

[2, 4, 5]

would return the first vector from my list as the closest match. I'd also like to know the distance of the match.

One way of visualising this is as a line graph.

I don't want to do a linear search: I could have millions of vectors. I don't mind paying a pre-processing cost; it's search time I want to optimise.

Defining 'closest match'

In this case, let's assume the data is numeric and the 'closest match' is a simple numeric comparison giving an absolute distance per element. For example, comparing [2, 4, 5] against the test data could give per-element distances of:

[1, 1, 2]
[8, 19, 0]
[121, 49, 92]
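
For concreteness, a minimal sketch of that comparison in Python, assuming the overall score is the L1 distance (the sum of the per-element absolute differences); any other norm could be substituted:

```python
def element_distances(a, b):
    """Per-element absolute differences between two equal-length vectors."""
    return [abs(x - y) for x, y in zip(a, b)]

def l1_distance(a, b):
    """Overall score: the sum of the per-element differences (the L1 norm)."""
    return sum(element_distances(a, b))

data = [[3, 5, 3], [10, 23, 5], [123, 53, 97]]
query = [2, 4, 5]

# Linear scan for the stored vector with the smallest total distance.
best = min(data, key=lambda v: l1_distance(query, v))
print(best, l1_distance(query, best))  # [3, 5, 3] 4
```

(This is the linear scan I want to avoid at query time; it's here only to pin down what 'closest' means.)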

Varying size vectors

How would vectors of different sizes be handled? I'd like to be able to handle cases such as the following input:

[120, 99]

matching the third example from the test data, in some "lower-scoring" sense. That's because the two values are close to the first and last values in that vector, ignoring the middle value, and appear in the same order.

Another way of describing it might be - the ordering of the values in the vector is important, but not the position.
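
One hedged way to make that precise: match the shorter vector to an order-preserving subsequence of the longer one and take the assignment with the smallest total difference. A small dynamic-programming sketch of that idea (illustrative only, not an established algorithm I've settled on):

```python
from functools import lru_cache

def ordered_match_cost(short, long):
    """Minimum total |difference| when every element of `short` is matched
    to a strictly increasing subsequence of `long` (order kept, positions free)."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(short):
            return 0                        # all of `short` is matched
        if len(long) - j < len(short) - i:
            return float('inf')             # not enough elements left to match
        # Either match short[i] with long[j], or skip long[j].
        return min(abs(short[i] - long[j]) + go(i + 1, j + 1),
                   go(i, j + 1))
    return go(0, 0)

print(ordered_match_cost([120, 99], [123, 53, 97]))  # 3 + 2 = 5 (skips the 53)
```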

Clustering solutions

I considered some form of clustering - hashing the vectors into buckets, then hashing the input and returning all vectors in the matching bucket. But what troubles me is the boundary case: if a given input lies close to the "edge" of a cluster, close matches in adjacent clusters would not be returned, would they?
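
One common mitigation is to probe the query's bucket and all immediately adjacent buckets as well. A minimal sketch, assuming a simple fixed-width quantisation hash (the cell width of 10 is arbitrary):

```python
from collections import defaultdict
from itertools import product

CELL = 10  # bucket width per dimension (arbitrary for this sketch)

def bucket(vec):
    """Quantise each coordinate; the resulting tuple is the bucket key."""
    return tuple(v // CELL for v in vec)

index = defaultdict(list)
for vec in [[3, 5, 3], [10, 23, 5], [123, 53, 97]]:
    index[bucket(vec)].append(vec)

def candidates(query):
    """Vectors in the query's bucket and every immediately adjacent bucket,
    so near-boundary queries still see neighbours in the next cell over."""
    key = bucket(query)
    found = []
    for offset in product((-1, 0, 1), repeat=len(key)):
        cell = tuple(k + o for k, o in zip(key, offset))
        found.extend(index.get(cell, []))
    return found

print(candidates([2, 4, 5]))  # [[3, 5, 3]]
```

The catch is that there are 3^d neighbouring cells for d dimensions, which is the curse-of-dimensionality problem raised in the answers below.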

Application domain

I want to use this algorithm to find similar musical releases by their constituent track lengths. The lengths can be stored as integers in some time unit (seconds or milliseconds, say), so the data ends up looking like the above.

So it's similar to the CDDB (FreeDB) algorithm, but with more leniency, the ability to define and explore distance, and support for vectors of different lengths.

Other questions I have looked at

"How to find the closest vector to a given vector?" seems to discuss 2D points; while my problem could be considered in a 2D space, the values in each vector must be considered together.

Dan Gravell

4 Answers


This is a concept map. Since nothing is known about the application domain, no concrete suggestions can be given.

In general, the steps to tackle this problem are:

  • First stage Find out what makes a good similarity or distance measure between two feature vectors of the same length.
    • There are many choices. Examples:
    • Norm, e.g. L1, L2, L-infinity
    • Cosine
    • Earth mover's distance (EMD)
    • Locality sensitive hashing
    • Random projection
    • While some choices may work better for specific applications, in general a toolbox that implements all of the well-known ones will be able to cover almost all applications. The chance of having to invent a new similarity measure for same-length vector comparison is low.
    • However, there are still some application-specific distance measures:
      • Transformation into frequency domain. This will be explained next.
      • In other words, there are certain types of feature vectors (those that could be called time-series or signals) that will benefit from a unified approach for same-length and different-length vector comparison.
  • Second stage Find out how to measure similarity or distance between two feature vectors of different lengths.
    • This is highly application-specific.
    • In general, it either falls into signal processing, or pattern recognition.
    • One kind of data that is common to both signal processing and pattern recognition is called a time series. The axis doesn't have to correspond to time, though; it could be, for example, the x- or y-coordinate of a 2D image.
    • Time series can be transformed into frequency domain, after which comparison can be performed.
    • Time series can also be decomposed into scale space.
    • Time series that "warp" or don't align properly can be compared using dynamic time warping (DTW), which is based on dynamic programming. However, DTW can only be applied to a pair of vectors at a time; unlike the transformations discussed earlier, it cannot be used to extract a representation from a single vector. (A minimal sketch appears after this list.)
    • A 2D image should be treated as a time series of time series (i.e. nested). Don't try to rasterize two images of different sizes into vectors of different lengths. Instead, resize the two images to the same 2D size, then convert each into a 1D vector.
  • Third stage Find out how to speed up large scale searches, such as
    • Pre-process a large number of stored vectors so that queries for the nearest vector do not need to visit every single vector in the store. (This goal is analogous to creating an index for a database table, but different techniques are needed; a k-d tree sketch follows this list.)
    • Finding clusters from a large number of stored vectors
    • This knowledge typically belongs to information retrieval, and specifically multimedia information retrieval.
    • Extremely simple cases such as 2D/3D points or bounding rectangles can be searched using spatial queries. However, spatial indexing is ineffective for multimedia information retrieval, which may need to work with vectors of hundreds to thousands of dimensions.
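
Since DTW is the one technique above that directly addresses the question's different-length vectors, here is a minimal textbook-style sketch, using |x - y| as the local cost (nothing application-specific):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences of
    possibly different lengths, with |x - y| as the local cost."""
    n, m = len(a), len(b)
    INF = float('inf')
    # dp[i][j] = cost of the best alignment of a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],       # a[i-1] matched again
                                  dp[i][j - 1],       # b[j-1] matched again
                                  dp[i - 1][j - 1])   # one-to-one step
    return dp[n][m]

print(dtw_distance([120, 99], [123, 53, 97]))  # 51
```

Note that DTW must consume every element of both sequences, so unlike the "ignore the middle value" matching the question asks for, it still pays for the 53 here; whether that is desirable is application-specific.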
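For the third stage, a minimal sketch of sub-linear nearest-neighbour search using SciPy's k-d tree; this assumes low-dimensional vectors of equal length, which fits per-track lengths after grouping releases by track count:

```python
import numpy as np
from scipy.spatial import cKDTree

data = np.array([[3, 5, 3], [10, 23, 5], [123, 53, 97]])
tree = cKDTree(data)                 # pre-processing: build the index once

dist, idx = tree.query([2, 4, 5])    # average-case logarithmic search
print(data[idx], dist)               # [3 5 3] and its Euclidean distance
# tree.query([2, 4, 5], p=1) would rank by L1 distance instead.
```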
rwong

There was a similar question on CodeReview. The answer should also work for your problem: https://codereview.stackexchange.com/a/139094/104803

In short:

  • calculate the whole space (min/max value for each dimension)
  • create a cluster of cells for that space
  • map each vector to one cell of the cluster

The cell mapping gives you a kind of spatial sorting that means you only have to search a small subset. For example, in 2 dimensions you start with the query's cell and its 8 neighbours (9 cells); if there is no vector within those cells, continue with the 16 cells of the next surrounding ring, and so on.

However, the performance of that solution decreases as the number of dimensions grows, because the number of cells increases exponentially with the dimensionality...

If the vectors are not evenly distributed, you could also improve the clustering by using dynamic clustering: start with 2 cells in each dimension and split any cell that contains too many vectors.
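
A sketch of that expanding search, assuming a fixed cell width and searching outwards ring by ring (illustrative names, not the code from the linked answer):

```python
from collections import defaultdict
from itertools import product

CELL = 25  # assumed fixed cell width

grid = defaultdict(list)
for vec in [(3, 5, 3), (10, 23, 5), (123, 53, 97)]:
    grid[tuple(v // CELL for v in vec)].append(vec)

def nearest_via_grid(query, max_radius=100):
    """Search the query's cell first, then ever-wider rings of cells,
    stopping at the first ring that contains any stored vector."""
    key = tuple(v // CELL for v in query)
    for radius in range(max_radius + 1):
        ring = []
        for offset in product(range(-radius, radius + 1), repeat=len(key)):
            if max(abs(o) for o in offset) != radius:
                continue  # keep only cells on the shell of this ring
            ring.extend(grid.get(tuple(k + o for k, o in zip(key, offset)), []))
        if ring:
            return min(ring, key=lambda v: sum(abs(a - b)
                                               for a, b in zip(query, v)))
    return None

print(nearest_via_grid((2, 4, 5)))  # (3, 5, 3)
```

Note that stopping at the first non-empty ring is a heuristic: a vector just beyond the ring boundary can occasionally be closer than one inside it, so an exact search would keep expanding until the ring's minimum possible distance exceeds the best match found so far.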

JanDotNet

I really wish I had enough points to comment, because this is more of a clarification request than an answer, but oh well.

First:

+1 for "need definition of 'closest match'" -- you could use, say, cosine similarity (if you consider the "vectors" in a mathematical sense), which would be very different from treating a "vector" as an N-dimensional point and just taking the distance between two points. ("Vector" in programming can be a bit of an overloaded term...)
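
To make that difference concrete, a small sketch: two vectors pointing in the same direction have maximal cosine similarity even when they are far apart as points:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1, 2, 3], [10, 20, 30]
print(cosine_similarity(a, b))    # 1.0   -- same direction, "identical"
print(euclidean_distance(a, b))   # ~33.7 -- but far apart as points
```

For the track-length domain in the question, cosine would treat a 1-2-3 minute EP as identical to a 10-20-30 minute album, which is probably not what's wanted.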

Second:

I don't want to linear search. I could have millions of vectors.

...

I considered some form of clustering - hashing the vectors into buckets, then hashing the input and returning all vectors in the matching bucket.

Unless the list of vectors is already sorted in some way that is meaningful to your definition of "closest match", I don't think you can beat linear time. The fact that you considered "hashing the vectors" also implies that you're willing to do some linear-time "pre-processing"?

A simple linear scan + keeping track of "best match so far" should be fast, even for millions of items, assuming your comparison operation is essentially constant (which it should be).

If you're just being coy and actually need to process more vectors than you're letting on... then maybe it's time to fire up some extra cores and multithread that sucker.
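
A sketch of that parallel scan, assuming an L1 distance and a process pool (the chunking and helper names are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def best_in_chunk(args):
    """Local best (distance, vector) within one chunk of the data."""
    query, chunk = args
    return min(((l1(query, v), v) for v in chunk),
               default=(float('inf'), None))

def parallel_nearest(query, vectors, workers=4):
    """Each worker scans a slice; the overall best is the min of local bests."""
    size = max(1, len(vectors) // workers)
    chunks = [vectors[i:i + size] for i in range(0, len(vectors), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return min(pool.map(best_in_chunk, [(query, c) for c in chunks]))

if __name__ == '__main__':  # required for process pools on some platforms
    data = [[3, 5, 3], [10, 23, 5], [123, 53, 97]]
    print(parallel_nearest([2, 4, 5], data))  # (4, [3, 5, 3])
```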

yoshi

Why not a simple binary search? You can sort the vectors during pre-processing and simply perform a logarithmic search later. If there are too many vectors to fit in memory, you can use a B+ tree, which works well with HDDs. You would then construct the tree during pre-processing and search it later. Using a tree would also allow you to modify the set of vectors easily.
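
Binary search needs a one-dimensional sort key, which raw vectors don't have. One hedged way to read this answer for the release-matching domain is to sort by total duration and probe the insertion point, as a coarse candidate filter rather than an exact nearest-neighbour search:

```python
import bisect

releases = [[3, 5, 3], [10, 23, 5], [123, 53, 97]]
keyed = sorted((sum(r), r) for r in releases)   # pre-processing: sort once
totals = [t for t, _ in keyed]

def closest_by_total(query):
    """Release whose total length is nearest the query's total (O(log n))."""
    t = sum(query)
    i = bisect.bisect_left(totals, t)
    # The best candidate is one of the two neighbours of the insertion point.
    best = min((c for c in (i - 1, i) if 0 <= c < len(keyed)),
               key=lambda c: abs(totals[c] - t))
    return keyed[best][1]

print(closest_by_total([2, 4, 5]))  # [3, 5, 3]
```

Two releases can have the same total but very different track lists, so in practice this would only narrow the field before an exact per-track comparison.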