-1

I have millions of documents(close to 100 million), each document has fields such as skills, hobbies, certification and education. I want to find similarity between each document along with a score.

Below is an example of data.

skills  hobbies        certification    education
Java    fishing        PMP              MS
Python  reading novel  SCM              BS
C#      video game     PMP              B.Tech.
C++     fishing        PMP              MS

so what i want is similarity between first row and all other rows, similarity between second row and all other rows and so on. So, every document should be compared against every other document. to get the similarity scores.

Purpose is that i query my database to get people based on skills. In addition to that, i now want people who even though do not have the skills, but are somewhat matching with the people with the specific skills. For example, if i wanted to get data for people who have JAVA skills, first row will appear and again, last row will appear as it is same with first row based on similarity score.

Challenge: My primary challenge is to compute some similarity score for each document against every other document as you can see from below pseudo code. How can i do this faster? Is there any different way to do this with this pseudo code or is there any other computational(hardware/algorithm) approach to do this faster?

document = all_document_in_db
For i in document:
   for j in document:
      if i != j :
        compute_similarity(i,j)

PS: I use python

1 Answers1

5

I think you are taking the wrong approach by comparing every document with every other document. You should define an N-dimensional space to contain your documents then use a nearest neighbour algorithm to find the best matches. This will be many times faster than doing 100,000,000 x 100,000,000 document comparisons.

There are a number of algorithms you can consider. If you are not familiar with them I suggest you start by trying k-d tree and see how well it works for your specific problem.