59

I am working in a MySql database, with a table like this:

+--------------+
|  table_name  |
+--------------+
|    myField   |
+--------------+

...and I need to make a lot of queries like this (with 5-10 strings in the list):

SELECT myField FROM table_name
WHERE myField IN ('something', 'other stuff', 'some other a bit longer'...)

There will be around 24.000.000 unique rows

1) Should I use a FULLTEXT or and INDEX key for my VARCHAR(150)?
2) If I increase the chars from 150 to 220 or 250... would it make a great difference? (Is there any way to calculate it?)
3) As I said, they are going to be unique, so myField should be a PRIMARY KEY. Isn't it rare to add a PRIMARY KEY to a field which is already a VARCHAR INDEX/FULLTEXT?

RolandoMySQLDBA
  • 185,223
  • 33
  • 326
  • 536
Mark Tower
  • 693
  • 1
  • 6
  • 6

3 Answers3

91

SUGGESTION #1 : Standard Indexing

CREATE TABLE mytable
(
    id int not null auto_increment,
    myfield varchar(255) not null,
    primary key (id),
    key (myfield)
);

If you index like this, you can either look for the whole string or do left-oriented LIKE searches

SUGGESTION #2 : FULLTEXT Indexing

CREATE TABLE mytable
(
    id int not null auto_increment,
    myfield varchar(255) not null,
    primary key (id),
    fulltext (myfield)
);

You can effectively use searches for individual keywords as well as whole phrases. You will need to define a custom stopword list because MySQL will not index 543 words.

Here are my other posts from the past two years on FULLTEXT indexes

SUGGESTION #3 : Hash Indexing

CREATE TABLE mytable
(
    id int not null auto_increment,
    myfield varchar(255) not null,
    hashmyfield char(32) not null,
    primary key (id),
    key (hashmyfield)
);

If you are looking for one specific value and those values could be lengths well beyond 32 characters, you could store the hash value:

INSERT INTO mytable (myfield,hashmyfield)
VALUES ('whatever',MD5('whatever'));

That way, you just search for hash values to retrieve results

SELECT * FROM mytable WHERE hashmyfield = MD5('whatever');

Give it a Try !!!

RolandoMySQLDBA
  • 185,223
  • 33
  • 326
  • 536
24

MySQL enables you to define prefixed index which means you define first N characters from original string to be indexed, and the trick is to choose a number N that’s long enough to give good selectivity, but short enough to save space. The prefix should be long enough to make the index nearly as useful as it would be if you’d indexed the whole column.

Before we go further let us define some important terms. Index selectivity is ratio of the total distinct indexed values and total number of rows. Here is one example for test table:

+-----+-----------+
| id  | value     |
+-----+-----------+
| 1   | abc       |
| 2   | abd       |
| 3   | adg       |
+-----+-----------+

If we index only the first character (N=1), then index table will look like the following table:

+---------------+-----------+
| indexedValue  | rows      |
+---------------+-----------+
| a             | 1,2,3     |
+---------------+-----------+

In this case, index selectivity is equal to IS=1/3 = 0.33.

Let us now see what will happen if we increase number of indexed characters to two (N=2).

+---------------+-----------+
| indexedValue  | rows      |
+---------------+-----------+
| ab             | 1,2      |
| ad             | 3        |
+---------------+-----------+

In this scenario IS=2/3=0.66 which means we increased index selectivity, but we have also increased the size of index. Trick is to find the minimal number N which will result to maximal index selectivity.

There are two approaches you can do calculations for your database table. I will make demonstration on the this database dump.

Let's say we want to add column last_name in table employees to the index, and we want to define the smallest number N which will produce the best index selectivity.

First let us identify the most frequent last names:

select count(*) as cnt, last_name from employees group by employees.last_name order by cnt

+-----+-------------+
| cnt | last_name   |
+-----+-------------+
| 226 | Baba        |
| 223 | Coorg       |
| 223 | Gelosh      |
| 222 | Farris      |
| 222 | Sudbeck     |
| 221 | Adachi      |
| 220 | Osgood      |
| 218 | Neiman      |
| 218 | Mandell     |
| 218 | Masada      |
| 217 | Boudaillier |
| 217 | Wendorf     |
| 216 | Pettis      |
| 216 | Solares     |
| 216 | Mahnke      |
+-----+-------------+
15 rows in set (0.64 sec)

As you can see, the last name Baba is the most frequent one. Now we are going to find the most frequently occurring last_name prefixes, beginning with five-letter prefixes.

+-----+--------+
| cnt | prefix |
+-----+--------+
| 794 | Schaa  |
| 758 | Mande  |
| 711 | Schwa  |
| 562 | Angel  |
| 561 | Gecse  |
| 555 | Delgr  |
| 550 | Berna  |
| 547 | Peter  |
| 543 | Cappe  |
| 539 | Stran  |
| 534 | Canna  |
| 485 | Georg  |
| 417 | Neima  |
| 398 | Petti  |
| 398 | Duclo  |
+-----+--------+
15 rows in set (0.55 sec)

There are much more occurences of every prefix, which means we have to increase number N until the values are almost the same as in the previous example.

Here are results for N=9

select count(*) as cnt, left(last_name,9) as prefix from employees group by prefix order by cnt desc limit 0,15;

+-----+-----------+
| cnt | prefix    |
+-----+-----------+
| 336 | Schwartzb |
| 226 | Baba      |
| 223 | Coorg     |
| 223 | Gelosh    |
| 222 | Sudbeck   |
| 222 | Farris    |
| 221 | Adachi    |
| 220 | Osgood    |
| 218 | Mandell   |
| 218 | Neiman    |
| 218 | Masada    |
| 217 | Wendorf   |
| 217 | Boudailli |
| 216 | Cummings  |
| 216 | Pettis    |
+-----+-----------+

Here are results for N=10.

+-----+------------+
| cnt | prefix     |
+-----+------------+
| 226 | Baba       |
| 223 | Coorg      |
| 223 | Gelosh     |
| 222 | Sudbeck    |
| 222 | Farris     |
| 221 | Adachi     |
| 220 | Osgood     |
| 218 | Mandell    |
| 218 | Neiman     |
| 218 | Masada     |
| 217 | Wendorf    |
| 217 | Boudaillie |
| 216 | Cummings   |
| 216 | Pettis     |
| 216 | Solares    |
+-----+------------+
15 rows in set (0.56 sec)

This are very good results. This means that we can make index on column last_name with indexing only first 10 characters. In table definition column last_name is defined as VARCHAR(16), and this means we have saved 6 bytes (or more if there are UTF8 characters in the last name) per entry. In this table there are 1637 distinct values multiplied by 6 bytes is about 9KB, and imagine how this number would grow if our table contains million of rows.

You can read other ways of calculating number of N in my post Prefixed indexes in MySQL.

Using MD5 and SHA1 functions in order to generate values which should be indexed is also not good approach. Why? Read it in post How to choose right data type for a primary key in MySQL database

marc_s
  • 9,052
  • 6
  • 46
  • 52
Mr.M
  • 349
  • 2
  • 3
0

The mariab sha2() function lets you create a (virtual) column of 256 characters (or bytes, documentation is quite unclear here). Using a virtual stored column would let you index and find. But as collisions might occur, some additional selecting might be needed.

theking2
  • 111
  • 5