5

Suppose one has a column of words on which one builds a BTREE index:

CREATE TABLE myTable (
  words VARCHAR(25),
  INDEX USING BTREE (words)
);

LOAD DATA LOCAL INFILE '/usr/share/dict/words' INTO TABLE myTable (words);

And now one wants to find the records which share the longest common prefix with some search query, e.g. 'foobar'. I thought to do the following:

SELECT DISTINCT words
FROM   myTable
WHERE  words LIKE CASE
  WHEN NOT EXISTS (SELECT * FROM myTable WHERE words LIKE 'f%') THEN '%'
  WHEN NOT EXISTS (SELECT * FROM myTable WHERE words LIKE 'fo%') THEN 'f%'
  WHEN NOT EXISTS (SELECT * FROM myTable WHERE words LIKE 'foo%') THEN 'fo%'
  WHEN NOT EXISTS (SELECT * FROM myTable WHERE words LIKE 'foob%') THEN 'foo%'
  WHEN NOT EXISTS (SELECT * FROM myTable WHERE words LIKE 'fooba%') THEN 'foob%'
  WHEN NOT EXISTS (SELECT * FROM myTable WHERE words LIKE 'foobar%') THEN 'fooba%'
  ELSE 'foobar%'
END

Which is fine: it's very readable and performant; and it can easily be generated in application code.

However, this search should be even simpler to resolve: just walk through the index tree according to the search term until a branch does not exist, then return all the results that branch from the current node.

Granted that walking a path through the index only once instead of multiple times is probably a needless micro-optimisation, but it feels as though it should be possible: is it?

eggyal
  • 331
  • 2
  • 8

1 Answers1

1

If you really want to walk just the b-tree index, using the innodb_ruby project can help http://blog.jcole.us/2013/01/14/efficiently-traversing-innodb-btrees-with-the-page-directory/


I think your logic is reversed in your query. It'll start with the shortest word.

How I would handle it is this

DROP PROCEDURE IF EXISTS `find_longest_prefix`;

DELIMITER $$

CREATE PROCEDURE `find_longest_prefix`(IN `word` varchar(255), OUT `word_prefix` varchar(255))
BEGIN
    SET max_sp_recursion_depth = 255;
    SET @nextWord = LEFT(`word`, LENGTH(`word`)-1);

    SELECT COUNT(DISTINCT `words`) FROM `myTABLE` WHERE `words` LIKE CONCAT(`word`, '%') INTO @word_count;

    IF (@word_count > 0)
    THEN
        SET `word_prefix` = `word`;
    ELSE
        IF (LENGTH(@nextWord) > 0)
        THEN
            Call `find_longest_prefix`(@nextWord, `word_prefix`);
        ELSE
            SET `word_prefix` = '';
        END IF;
    END IF;
END$$

DELIMITER ;

This uses the fact that finding a miss on a btree is fast so we just loop around calling recursively until we have a hit.

Examples

No results

mysql> CALL find_longest_prefix(';autobon', @prefix);
Query OK, 1 row affected (0.01 sec)

mysql> SELECT @prefix;
+---------+
| @prefix |
+---------+
|         |
+---------+
1 row in set (0.00 sec)

A few results

mysql> CALL find_longest_prefix('autobon', @prefix);
Query OK, 1 row affected (0.00 sec)

mysql> SELECT @prefix;
+---------+
| @prefix |
+---------+
| auto    |
+---------+
1 row in set (0.00 sec)

And as you can see, the data is correct:

mysql> SELECT * FROM myTable WHERE words LIKE 'auto%' OR words LIKE ';auto%';
+----+-----------------+
| id | words           |
+----+-----------------+
| 19 | AUTOCOMMIT      |
| 20 | AUTOEXTEND_SIZE |
+----+-----------------+
2 rows in set (0.00 sec)

This method should be much faster then actually checking every step before the longest and getting a lot more returned data.

It should be easy for you to have the stored proc SELECT the rows once it finds the correct prefix and return them if you'd prefer.