In a previous post, I analyzed how a query of the famous DBT3 benchmark was optimized by MySQL. It was this query, named "Q16" in the DBT3 jargon:

select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt
from partsupp, part
where p_partkey = ps_partkey
  and p_brand <> 'Brand#23'
  and p_type not like 'LARGE PLATED%'
  and p_size in (43, 1, 25, 5, 35, 12, 42, 40)
  and ps_suppkey not in (
    select s_suppkey from supplier
    where s_comment like '%Customer%Complaints%')
group by p_brand, p_type, p_size
order by supplier_cnt desc, p_brand, p_type, p_size;

Here is a brief recap of the conclusions I had drawn: for this query, MySQL transforms the IN condition to EXISTS and then evaluates it with the "unique_subquery" technique, which does an index lookup into the subquery's table. IN is evaluated 120,000 times (once per combined row of the outer tables). The total execution time of query Q16...
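To make that recap concrete, here is a sketch of what the transformation amounts to (not the optimizer's literal internal rewrite): the condition

ps_suppkey not in (select s_suppkey from supplier
                   where s_comment like '%Customer%Complaints%')

is evaluated as if it had been written as a correlated NOT EXISTS,

not exists (select 1 from supplier
            where s_suppkey = ps_suppkey
              and s_comment like '%Customer%Complaints%')

probed once per combined outer row; since s_suppkey is supplier's unique key, each probe is a single index lookup.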
Hello, Guilhem.
I wrote an article, How to select N rows for each group. One of the solutions uses LATERAL.
However, the performance was lower than expected: MySQL does not use the unique index; instead, it reads all rows of the group and does a filesort. Example:
CREATE TABLE `posts` (
`post_id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`date_added` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`post_text` text NOT NULL,
PRIMARY KEY (`post_id`),
UNIQUE KEY `user_id` (`user_id`,`date_added`)
) ENGINE=InnoDB AUTO_INCREMENT=32754 DEFAULT CHARSET=utf8;
/*
15962 rows in table
20 distinct user_id
~800 rows per group
*/
mysql> explain select t2.* from (select user_id from posts group by user_id) as t1,
-> lateral (select * from posts where t1.user_id=posts.user_id order by date_added desc limit 3) as t2;
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
| id | select_type       | table      | partitions | type  | possible_keys | key     | key_len | ref        | rows | filtered | Extra                      |
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
|  1 | PRIMARY           | <derived2> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL       |   21 |   100.00 | Rematerialize (<derived3>) |
|  1 | PRIMARY           | <derived3> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL       |    3 |   100.00 | NULL                       |
|  3 | DEPENDENT DERIVED | posts      | NULL       | ref   | user_id       | user_id | 4       | t1.user_id |  801 |   100.00 | Using filesort             |
|  2 | DERIVED           | posts      | NULL       | range | user_id       | user_id | 4       | NULL       |   21 |   100.00 | Using index for group-by   |
+----+-------------------+------------+------------+-------+---------------+---------+---------+------------+------+----------+----------------------------+
mysql> show status like 'handler_%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 4 |
| Handler_mrr_init | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 102 |
| Handler_read_last | 1 |
| Handler_read_next | 15962 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 60 |
| Handler_read_rnd_next | 101 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 0 |
| Handler_write | 80 |
+----------------------------+-------+
mysql> show status like 'sort%';
+-------------------+-------+
| Variable_name | Value |
+-------------------+-------+
| Sort_merge_passes | 0 |
| Sort_range | 0 |
| Sort_rows | 60 |
| Sort_scan | 20 |
+-------------------+-------+
In pseudo-code, this does:
select 20 distinct user_id, store into t1
for each row in t1:
    select all rows from posts where user_id = t1.user_id
    do filesort
    take the last 3 rows
(this matches the counters above: Handler_read_next = 15962 reads every row, and Sort_scan = 20 filesorts produce Sort_rows = 60).
We have a unique index on (user_id, date_added), so the optimal way would be like this pseudo-code:
select 20 distinct user_id, store into t1
for each row in t1:
    take the last 3 rows using the unique index
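(For a single, constant user_id the standalone probe should already be able to use the index, e.g. something like

select * from posts where user_id = 1 order by date_added desc limit 3;

where 1 stands for any existing user_id; I would expect ref access on the user_id index, read backward, with no filesort. Only the correlated version falls back to filesort.)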
Is it a bug?
P.S. I also translated some of your articles (LATERAL and emulating CHECK constraints).
Hello Vasiliy. First, thanks a lot for the translations!!
About your problem: that is indeed surprising. To decide whether it's correct, we should look at the cost-based decision; for that, I suggest you look at the optimizer trace https://dev.mysql.com/doc/internals/en/optimizer-tracing.html ; it will tell you which strategies were considered and why each was chosen or rejected.
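Capturing the trace takes only a few statements in the same session, along these lines:

set optimizer_trace="enabled=on";
# run the slow query here, then:
select trace from information_schema.OPTIMIZER_TRACE;
set optimizer_trace="enabled=off";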
Thanks for the reply.
You can download the optimizer trace here. The important part:
{
  "reconsidering_access_paths_for_index_ordering": {
    "clause": "ORDER BY",
    "steps": [
    ],
    "index_order_summary": {
      "table": "`posts`",
      "index_provides_order": false,
      "order_direction": "undefined",
      "index": "user_id",
      "plan_changed": false
    }
  }
}
I don't understand why that is.
Hi. Thanks. This is not specific to LATERAL; it is also visible with:
explain select (select post_text from posts where t1.user_id=posts.user_id order by date_added limit 3) as s from posts t1;
I'll contact a colleague who knows more, and will let you know.
Vasiliy: my colleague has looked a bit, and it could theoretically be improved (no idea when, though). Could you please file a bug report? Please note there that you're doing it on my suggestion. A workaround is extending the ORDER BY, like this:
select (select post_text from posts where t1.user_id =
posts.user_id order by **posts.user_id**, date_added limit 3) as s from posts t1;
then MySQL thinks it can use the index.
For your original query, which orders DESC, it would be:
order by posts.user_id desc, date_added desc
and then I get:
| 3 | DEPENDENT DERIVED | posts | NULL | ref | user_id | user_id | 4 | t1.user_id | 1 | 100.00 | Backward index scan |
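Applied to your original LATERAL query, that gives (same query, only the ORDER BY extended):

select t2.* from (select user_id from posts group by user_id) as t1,
lateral (select * from posts where t1.user_id=posts.user_id
         order by posts.user_id desc, date_added desc limit 3) as t2;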
Thank you so much. It's black magic :) but it really works.
I have already submitted a bug report: http://bugs.mysql.com/94903