15

Query 1:

select distinct email from mybigtable where account_id=345

takes 0.1s

Query 2:

Select count(*) as total from mybigtable where account_id=123 and email IN (<include all from above result>)

takes 0.2s

Query 3:

Select count(*) as total from mybigtable where account_id=123 and email IN (select distinct email from mybigtable where account_id=345)

takes 22 minutes and 90% its in the "preparing" state. Why does this take so much time.

Table is innodb with 3.2mil rows on MySQL 5.0

Stewie
  • 415
  • 1
  • 4
  • 10

4 Answers4

12

In Query 3, you are basically executing a subquery for every row of mybigtable against itself.

To avoid this, you need to make two major changes:

MAJOR CHANGE #1 : Refactor the Query

Here is your original query

Select count(*) as total from mybigtable
where account_id=123 and email IN
(select distinct email from mybigtable where account_id=345)

You could try

select count(*) EmailCount from
(
    select tbl123.email from
    (select email from mybigtable where account_id=123) tbl123
    INNER JOIN
    (select distinct email from mybigtable where account_id=345) tbl345
    using (email)
) A;

or maybe the count per email

select email,count(*) EmailCount from
(
    select tbl123.email from
    (select email from mybigtable where account_id=123) tbl123
    INNER JOIN
    (select distinct email from mybigtable where account_id=345) tbl345
    using (email)
) A group by email;

MAJOR CHANGE #2 : Proper Indexing

I think you have this already since Query 1 and Query 2 run fast. Make sure you have a compound index on (account_id,email). Do SHOW CREATE TABLE mybigtable\G and make sure you have one. If you don't have it or if you are not sure, then create the index anyway:

ALTER TABLE mybigtable ADD INDEX account_id_email_ndx (account_id,email);

UPDATE 2012-03-07 13:26 EST

If you want to do a NOT IN(), change the INNER JOIN to a LEFT JOIN and check for the right side being NULL, like this:

select count(*) EmailCount from
(
    select tbl123.email from
    (select email from mybigtable where account_id=123) tbl123
    LEFT JOIN
    (select distinct email from mybigtable where account_id=345) tbl345
    using (email)
    WHERE tbl345.email IS NULL
) A;

UPDATE 2012-03-07 14:13 EST

Please read these two links on doing JOINs

Here is a great YouTube Video where I learned to refactor queries and the book it was based on

RolandoMySQLDBA
  • 185,223
  • 33
  • 326
  • 536
11

In MySQL, subselects within the IN clause are re-executed for every row in the outer query, thus creating O(n^2). The short story is, don't use IN (SELECT).

Aaron Brown
  • 5,140
  • 25
  • 25
1
  1. Do you have an index on account_id?

  2. The second problem may be with the nested sub-queries which have terrible performance in 5.0.

  3. GROUP BY with a having clause is faster than DISTINCT.

  4. What are you trying to do which may be better done through joins in addition to Item #3?

1

There is a lot of processing involved when handling an IN() subquery such as yours. You can read more about it here.

My first suggestion would be to attempt to re-write the subquery into a JOIN instead. Something like (not tested):

SELECT COUNT(*) AS total FROM mybigtable AS t1
 INNER JOIN 
   (SELECT DISTINCT email FROM mybigtable WHERE account_id=345) AS t2 
   ON t2.email=t1.email
WHERE account_id=123
Derek Downey
  • 23,568
  • 11
  • 79
  • 104