Algorithm. Find the group of documents with the least amount of words

Question

I need help with a problem that I have been working on for the last month.

I have a group of documents, each containing a set of unique words (if a word appears more than once in a document, I count it only once). For a given number of documents, I want to find the group whose documents together contain the fewest distinct words.

For example, if I have a set of five documents, each of them containing a set of words:

d1 = [a, b, c, d, e]
d2 = [b, c, f]
d3 = [c, e, g]
d4 = [a, c, d]
d5 = [c, d, e]

The set of three documents with the fewest distinct words would be (d1, d4, d5). This group of three documents contains only a, b, c, d and e.
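For an example this small, the claim can be checked by trying every group of three. The following is only a sketch of that exhaustive check; the names `docs`, `word_union` and `best_group` are made up for illustration, and the approach does not scale beyond toy inputs.

```python
from itertools import combinations

# The five example documents from above, as sets of unique words.
docs = {
    "d1": {"a", "b", "c", "d", "e"},
    "d2": {"b", "c", "f"},
    "d3": {"c", "e", "g"},
    "d4": {"a", "c", "d"},
    "d5": {"c", "d", "e"},
}

def word_union(docs, group):
    """All distinct words that appear in the documents of `group`."""
    return set().union(*(docs[name] for name in group))

def best_group(docs, k):
    """Exhaustive search: the first k-document group with the smallest word union."""
    return min(combinations(docs, k), key=lambda g: len(word_union(docs, g)))

group = best_group(docs, 3)
print(group, sorted(word_union(docs, group)))
```

On this data the search returns (d1, d4, d5) with five words. Note that (d3, d4, d5) also covers only five words (a, c, d, e, g), so the optimum is not unique here.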

So far I have tried a "nearest neighbor" approach: repeatedly take the document that adds the fewest new words. I extended it with a limited recursive brute force: consider the next n documents with the fewest new words.

Is there a better algorithm for finding a good set? I know the optimum can only be found by brute force, but that is obviously not feasible here.

EDIT: Why I have the impression that "nearest neighbor" is a poor solution: when I extend the set of available documents, I sometimes get a solution that is much worse than with fewer documents. Theoretically, the same set of documents could always be chosen, regardless of how many new documents I add.
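A minimal sketch of the plain greedy step (without the recursive extension), assuming the documents are stored as Python sets. The function name `greedy_group` and the tie-breaking rule (first document in dictionary order wins) are my own choices for illustration, not necessarily how my real code behaves.

```python
# The five example documents from above, as sets of unique words.
docs = {
    "d1": {"a", "b", "c", "d", "e"},
    "d2": {"b", "c", "f"},
    "d3": {"c", "e", "g"},
    "d4": {"a", "c", "d"},
    "d5": {"c", "d", "e"},
}

def greedy_group(docs, k):
    """Pick k documents, each time adding the one that brings the fewest new words."""
    chosen, words = [], set()
    remaining = dict(docs)  # copy, so the caller's dict is left untouched
    for _ in range(k):
        # Fewest words not yet covered; ties go to the first document in dict order.
        name = min(remaining, key=lambda n: len(remaining[n] - words))
        words |= remaining.pop(name)
        chosen.append(name)
    return chosen, words

chosen, words = greedy_group(docs, 3)
print(chosen, sorted(words))
```

With this tie-breaking, the sketch picks d2, d3 and d5, which together contain six words (b, c, d, e, f, g), one more than the group (d1, d4, d5) from the example.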

score 1 · Answer 1 · answered Feb 08 '19 at 09:59

"Must be"? Hardly. This sounds like one of the many, many problems in which the optimal solution depends on the exact characteristics of every single element. Essentially, you'll probably not be able to prove that some kind of locally optimal partial solution is, in fact, part of the globally optimal solution. If that is the case, the problem is almost certainly NP-complete and hence not solvable efficiently and correctly.

score 1 · Answer 2 · answered Apr 14 '19 at 20:24

Depending on the size of the problem, you might want to model it as a Mixed Integer Programming (MIP) problem. There are a variety of open-source (see glpk, cbc) and proprietary (cplex, gurobi, xpress-mp) solvers for such problems.

In your case you would associate with each document a binary variable indicating whether it is part of the optimal set or not. You would add a constraint stating that the sum of the variables associated with the documents has to equal the number of documents you want in your optimum group. With each word you would associate a linear (continuous) variable. For every combination of document and word where the word is part of the document, you would add a constraint stating that the variable associated with the word has to be greater than or equal to the variable associated with the document. Finally, you would define your objective function as the sum of all the variables associated with words, to be minimized.
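Written out, the model described above could look like the following. The symbols are my own notation, chosen for illustration: D is the set of documents, W the set of all words, k the number of documents wanted, x_d the binary variable for document d, and y_w the continuous variable for word w.

```latex
\begin{align*}
\min \quad & \sum_{w \in W} y_w \\
\text{s.t.} \quad & \sum_{d \in D} x_d = k, \\
& y_w \ge x_d && \text{for every } d \in D \text{ and every word } w \text{ in } d, \\
& x_d \in \{0, 1\} && \text{for every } d \in D, \\
& 0 \le y_w \le 1 && \text{for every } w \in W.
\end{align*}
```

Because the objective is minimized, each y_w settles at the largest x_d among the documents containing w, so it ends up at 0 or 1 without having to be declared integer.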
