
I have the following problem: I have a database containing more than 2 million records. Each record has a string field X and I want to display a list of records for which field X contains a certain string. Each record is about 500 bytes in size.

To make it more concrete: in the GUI of my application I have a text field where I can enter a string. Above the text field I have a table displaying the (first N, e.g. 100) records that match the string in the text field. When I type or delete one character in the text field, the table content must be updated on the fly.

I wonder if there is an efficient way of doing this using appropriate index structures and/or caching. As explained above, I only want to display the first N items that match the query. Therefore, for small enough N, loading the matching items from the database should not be a big issue. Besides, caching items in main memory can make retrieval faster.

I think the main problem is how to find the matching items quickly, given the pattern string. Can I rely on some DBMS facilities, or do I have to build some in-memory index myself? Any ideas?

EDIT

I have run a first experiment. I have split the records into different text files (at most 200 records per file) and put the files in different directories (I used the content of one data field to determine the directory tree). I end up with about 50000 files in about 40000 directories. I have then run Lucene to index the files. Searching for a string with the Lucene demo program is pretty fast. Splitting and indexing took a few minutes: this is totally acceptable for me because it is a static data set that I want to query.

The next step is to integrate Lucene in the main program and use the hits returned by Lucene to load the relevant records into main memory.
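A rough sketch of what I have in mind for that step (this assumes Lucene 5+ style APIs; the index path "index" and the field names "contents" and "path" follow the Lucene demo's conventions and are assumptions until I have verified them in my setup):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class RecordSearch {
    public static void main(String[] args) throws Exception {
        // Open the index created by the indexing run described above.
        try (DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", new StandardAnalyzer())
                    .parse(args[0]);
            // Only the first N (here 100) hits are needed for the table.
            TopDocs top = searcher.search(query, 100);
            for (ScoreDoc sd : top.scoreDocs) {
                Document hit = searcher.doc(sd.doc);
                // Use a stored field (here the file path) to load the
                // corresponding record into main memory.
                System.out.println(hit.get("path"));
            }
        }
    }
}
```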

Giorgio

7 Answers


The technology you are looking for is full-text indexing. Most RDBMSs have some sort of built-in full-text capability that could work here, or you could use something like Lucene if you want to get fancier and/or just run it in memory.
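For example, MySQL's built-in full-text support can be queried directly over JDBC. The sketch below assumes a table `records` with a FULLTEXT index on column `x`; the connection string, table, and column names are illustrative, not taken from the question:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FullTextQuery {
    public static void main(String[] args) throws Exception {
        // Assumes the column has a full-text index, e.g.:
        //   CREATE FULLTEXT INDEX idx_x ON records (x);
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT id, x FROM records " +
                "WHERE MATCH(x) AGAINST (?) LIMIT 100")) {
            ps.setString(1, args[0]);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + ": "
                            + rs.getString("x"));
                }
            }
        }
    }
}
```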

Wyatt Barnett

Instead of putting your data inside the DB, you can keep the records as a set of documents (text files) stored separately and keep only the link (path/URL, etc.) in the DB.

This matters because a SQL query will, by design, be very slow at both sub-string search and retrieval.

Your problem can now be reformulated as having to search the text files that contain the given string. There are two possibilities here.

  1. Sub-string match. If your text blob is a single string or word (without any whitespace) and you need to search for an arbitrary sub-string within it, then you have to scan every file to find the files that best match. One uses algorithms like the Boyer-Moore algorithm; see this and this for details. This is equivalent to grep, because grep uses similar machinery internally. But you may still make 100+ grep passes (worst case 2 million) before returning; see the sketch after this list.

  2. Indexed search. Here you assume that the text contains a set of words and that the search is limited to complete words. In this case, the document is indexed over all possible occurrences of words. This is often called "full-text search". There are a number of algorithms to do this and a number of open-source projects that can be used directly. Many of them also support wildcard search, approximate search, etc., such as:
    a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
    b. OpenFTS : http://openfts.sourceforge.net/
    c. Sphinx: http://sphinxsearch.com/

Most likely, if your queries consist of complete words, approach two will be very fast and effective.
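For the sub-string case in point 1, here is a compact sketch of the Boyer-Moore-Horspool algorithm, a simplified variant of Boyer-Moore that keeps only the bad-character shift rule (written from scratch for illustration, not taken from the linked references):

```java
public class Horspool {
    // Returns the index of the first occurrence of `needle` in
    // `haystack`, or -1 if it does not occur.
    static int indexOf(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        if (m == 0) return 0;
        // Bad-character table: how far the search window may shift when
        // the character aligned with the needle's last position mismatches.
        int[] shift = new int[Character.MAX_VALUE + 1];
        java.util.Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[needle.charAt(i)] = m - 1 - i;
        }
        for (int pos = 0; pos <= n - m; ) {
            // Compare right-to-left, as Boyer-Moore does.
            int j = m - 1;
            while (j >= 0 && haystack.charAt(pos + j) == needle.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return pos; // full match found at pos
            }
            pos += shift[haystack.charAt(pos + m - 1)];
        }
        return -1;
    }
}
```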

Dipan Mehta

Have you considered a trie? Basically you build a tree using common prefixes, so all words that start with the same letters are children of the same node. If you're going to support matching on any substring, then you'll have to generate some sort of permuted index and build your trie from that. That may wind up blowing your storage requirements way out, though.
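As a hedged illustration, here is a minimal prefix trie in Java (the record ids and the limit parameter are assumptions made for the example). It supports prefix matching only; matching on any sub-string would require inserting the suffixes or rotations of each key, i.e. the permuted index mentioned above:

```java
import java.util.*;

// A minimal prefix trie: each node maps a character to a child, and
// every node stores the ids of records whose key ends there.
class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private final List<Long> recordIds = new ArrayList<>();

    void insert(String key, long recordId) {
        Trie node = this;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.recordIds.add(recordId);
    }

    // Collect up to `limit` record ids whose keys start with `prefix`.
    List<Long> findByPrefix(String prefix, int limit) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        List<Long> out = new ArrayList<>();
        collect(node, out, limit);
        return out;
    }

    private void collect(Trie node, List<Long> out, int limit) {
        for (long id : node.recordIds) {
            if (out.size() >= limit) return;
            out.add(id);
        }
        for (Trie child : node.children.values()) {
            if (out.size() >= limit) return;
            collect(child, out, limit);
        }
    }
}
```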

TMN

I would like to add, on top of Wyatt Barnett's answer, that an RDBMS solution with full-text indexing on the appropriate column will work, but if you want to utilize a local cache of previously fetched records then you need a plan for putting those cached records to good use.

One option is to collect the unique identifiers of the records that you explicitly do not want to retrieve (because they are already cached) and exclude them from the query, possibly with a NOT IN or a NOT EXISTS clause.

A word of caution though: using NOT IN or NOT EXISTS tends not to be cheap and may negatively influence your query performance or query plan, depending on which database engine you are using. Run an explain plan on your final query to ensure that all of the indexes on the affected columns are being used.
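A minimal sketch of the NOT IN variant over JDBC (the table `records`, column `x`, and MySQL full-text syntax are assumptions carried over from the earlier answers, not requirements):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CachedQuery {
    // Fetch up to 100 matches, excluding records already held in the
    // local cache. The caller is responsible for closing the ResultSet.
    static ResultSet fetchUncached(Connection conn, String term,
                                   List<Long> cachedIds) throws SQLException {
        String sql = "SELECT id, x FROM records WHERE MATCH(x) AGAINST (?)";
        if (!cachedIds.isEmpty()) {
            // One placeholder per cached id; "NOT IN ()" would be invalid SQL.
            String placeholders = Stream.generate(() -> "?")
                    .limit(cachedIds.size())
                    .collect(Collectors.joining(", "));
            sql += " AND id NOT IN (" + placeholders + ")";
        }
        sql += " LIMIT 100";
        PreparedStatement ps = conn.prepareStatement(sql);
        ps.setString(1, term);
        for (int i = 0; i < cachedIds.size(); i++) {
            ps.setLong(i + 2, cachedIds.get(i));
        }
        return ps.executeQuery();
    }
}
```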

It also doesn't hurt to do a performance comparison between the two approaches to see which is faster. You may be surprised to find that maintaining a local cache and explicitly filtering those records out of your query can perform worse than a finely tuned query that fetches all records.

maple_shaft

Just in case you missed it: if you use Lucene alongside your database instead of the DB's built-in text search, you will have to be extremely careful when modifying your DB. How do you ensure atomicity when you have to make changes in both the DB and the external resources (Lucene)? Yes, it can be done, but it will be a lot of work.

In short, you lose the DB's transactional support if you put Lucene into your data schema.

InformedA

Have you considered Sphinx (http://sphinxsearch.com)? If you can use a third-party tool, it would be ideal for what you're trying to achieve; it is much more efficient at full-text search than any RDBMS I have personally used.

twigg

It is somewhat strange that none of the other answers mentioned the term "inverted index", the technology underlying Apache Lucene and all similar solutions.

The inverted index is a mapping from words to documents ("record-level inverted index") or even precise word locations within the document ("word-level inverted index").

AND and OR logical operations are trivial to implement. If you have precise word locations, it is possible to look for adjacent words, thus making phrase searches possible.

So, think of an index containing (word, file, location) tuples. When you have, e.g., ("inverted", "foo.txt", 123), you just check whether ("index", "foo.txt", 124) is part of the index in order to find the full phrase "inverted index".
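A minimal sketch of such a word-level inverted index, including the adjacency check for phrase queries (all class and method names are made up for the example):

```java
import java.util.*;

// A word-level inverted index: maps each word to the positions where it
// occurs in each document. Phrase queries check that consecutive words
// occur at consecutive positions, as described above.
class InvertedIndex {
    // word -> (document -> positions of that word in the document)
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    void addDocument(String doc, String text) {
        String[] words = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], w -> new HashMap<>())
                 .computeIfAbsent(doc, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Return the documents containing the words of `phrase` adjacently.
    Set<String> phraseSearch(String phrase) {
        String[] words = phrase.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<>();
        Map<String, List<Integer>> first = index.get(words[0]);
        if (first == null) return result;
        for (Map.Entry<String, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                boolean match = true;
                for (int i = 1; i < words.length; i++) {
                    Map<String, List<Integer>> postings = index.get(words[i]);
                    // Word i must appear at position start + i in this doc.
                    if (postings == null || !postings
                            .getOrDefault(e.getKey(), Collections.emptyList())
                            .contains(start + i)) {
                        match = false;
                        break;
                    }
                }
                if (match) { result.add(e.getKey()); break; }
            }
        }
        return result;
    }
}
```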

While I'm not recommending that you reimplement a full-text search engine from scratch, it is useful to know how technologies such as Apache Lucene work.

So, my recommendation is to learn how inverted indexes work and choose a technology using them such as Apache Lucene. Then you at least have a solid understanding of what can be done and what can't be done.

juhist