When crafting a complex SQL query (i.e., the query has at least one subquery), two things have to be kept in mind:
1. Reduce the search space
2. Do query on index
As a real example, last week I tuned a query that previously took about 45 minutes to complete.
Now it takes only 12 minutes, which is an improvement of 73%. Here goes the story.
The old query has the following general form:
select ...
from (select ...
from ...
where ...
)
where ...
The subquery in the FROM clause returns data that will be filtered by criteria set forth in the WHERE clause, which belongs to the outermost query. Unfortunately, many of the data returned by the subquery get filtered by the WHERE clause. As we know, the FROM clause of a query sets forth the query search space. Hence, by returning a lot of data from the subquery, the search space is broad. This consumes a lot of time when the criteria set forth in the WHERE clause is not simple. So, to reduce the search space, I moved some of the criteria inside the subquery. In other words, when using a subquery, strive to only return the needed data, not more not less.
Another example of reducing the search space is when the same result can be obtained by either querying on entries of type A or entries of type B. In my case, it is either querying an item on its previous or current transaction to find out the movement of the item. Previously, the query always started from the previous transaction to perform a forward lookup. Being curious I counted the number of previous and current transactions of a set of items. It turned out that the number of previous transactions having the desired properties was much greater than the number of the current transactions having the desired properties, specifically, about 4.7 million and 1.5 million records, respectively. The problem was that not all of the 4.7 millions were the previous transactions of the 1.5 millions, and vice versa. Their intersection that the query was interested in only resulted in two hundreds of records. Switching strategy to perform a backward lookup by finding all of the current transactions within a subquery in a FROM clause helps a lot. So, that is for the first point to be kept in mind: reduce the search space.
Now for the second point, the old query has a subquery in the following form:
select ...
from (...) as item
where state in (select state
from group_of_states
where group_name = 'group A'
)
Previously, I didn't observe it as one of the bottle neck since it only had two instances in the old query. However, the new query needs five instances, and some of them are within another critical subqueries (i.e., their search spaces are broad). The performance suffered because the number of states under group A was about twenty. And, unfortunately, using `in' operator has the drawback of introducing another search space. Worse than that, out of the twenty states, the criterion in the WHERE clause is only interested in one of them. In other words, do not use `in' operator if it involves a subquery that returns many entries and only some of the entries matter.
At this point, I realized that the query could be transformed into another more efficient query as follows:
select ...
from (...) as item
where 'group A' = (select group_name
from group_of_states
where state = item.state
)
The query was made even faster after I made a B-tree index on group_of_states' {state} because the primary key of group_of_states was not {state} but {group_name, state}. In other words, doing a query on a part of a composite primary key didn't help; strive to query on an indexed column.
I also think that if `UNION ALL' can be used instead of `UNION' because the result sets are known to be disjoint, it is better to use `UNION ALL' so as to reduce the processing overhead of checking whether or not an entry has already existed in the other result set.
To conclude, a complex query, which involves a lot of subqueries, will be efficient if the search spaces of the subqueries are narrow and most subqueries use index. Besides that, care must be taken when using an operator returning more than one result so as to not introducing an additional search space (e.g., `in' operator).
Keywords: complex sql query optimization
EXISTS and NOT EXISTS are Often Used
I also found out that I often uses `EXISTS' or `NOT EXISTS' in a WHERE clause.
For example, the following query to retrieve all transactions recorded on table movement_snapshot at `2008-05-01 00:00:00' that do not have further transactions after the specified time until `2008-06-01 00:00:00' is not good:
select * from movement_snapshot where snapshot_time = '2008-05-01 00:00:00' and item_id in ((select item_id from movement_snapshot where snapshot_time = '2008-05-01 00:00:00' ) except (select item_id from item_transaction where move_at > '2008-05-01 00:00:00' and move_at <= '2008-06-01 00:00:00' ) )The subquery in the outermost query's WHERE clause is needed because the set of attributes of table movement_snapshot is the superset of that of table item_transaction. Although the subquery can be eliminated by specifying all attributes in the SELECT clause of the table whose set of attributes is the superset, as in the following form:
it is truly not possible in this particular case since all attributes from table movement_snapshot is to be retrieved.
However, it is more efficient to utilize `NOT EXISTS' as in the following query:
select * from movement_snapshot as t where snapshot_time = '2008-05-01 00:00:00' and not exists (select * from item_transaction where move_at > '2008-05-01 00:00:00' and move_at <= '2008-06-01 00:00:00' and item_id = t.item_id )Specifically, the new query eliminates the need of retrieving all item_id from table movement_snapshot whose snapshot_time is `2008-05-01 00:00:00' besides taking an advantage of performing query on index by doing `item_id = t.item_id'.
Best regards,
Eus (FSF member #4445)
In this digital era, where computing technology is pervasive, your freedom depends on the software controlling those computing devices.
Join free software movement today! It is free as in freedom, not as in free beer!
Join: http://www.fsf.org/jf?referrer=4445
Harnessing `join' to Reduce the Search Space
Hi Ho!
This time I looked at another query that uses `join' (e.g., from t1 left join t2) to generate additional attributes not found in the original record. The query has the following form:
select ... from (select t2.id, t2.item_attr1, t2.total_sales, sum(item_purchase.bought_at) as total_purchase from (select t1.id, t1.item_attr1, sum(item_sales.sold_at) as total_sales from (select id, item_attr1 from item where ... ) as t1 left join item_sales on item.id = item_sales.item_id group by t1.id, t1.item_attr1 ) as t2 left join item_purchase on t2.id = item_purchases.item_id ) as tN left join ...The use of the `join' effectively reduces the search space compared to the following form:
select id , item_attr1 , (select sum(sold_at) from item_sales where item_id = i.id ) as total_sales , (select sum(bought_at) from item_purchase where item_id = i.id ) as total_purchase , ... from item as i where ...because in this form, each subquery in the SELECT clause introduces a new unreduced search space for each record in table item. Beside being more efficient, the use of the `join' provides the most flexible way to add additional attributes not found in the original record.
Best regards,
Eus (FSF member #4445)
In this digital era, where computing technology is pervasive,
your freedom depends on the software controlling those computing devices.
Join free software movement today!
It is free as in freedom, not as in free beer!
Join: http://www.fsf.org/jf?referrer=4445
Iterating over a Large Data Space vs. Using a Complicated Query
Hi Ho!
It turns out that iterating over a large data space using a query employing few subqueries results in a much faster execution time than iterating over a small data space using a query employing many subqueries. Almost a month ago, I had another chance to optimize the query that previously I had optimized to be faster by 73%, because of a new user requirement. Since the new requirement required the query to employ a backward lookup from a very large data space instead of a forward lookup from a small data space, I tried to measure how worse the performance of the query would be when a backward lookup was employed. To my surprise, it was much faster to perform a backward lookup from a very large data space instead of a forward lookup from a small data space. Below are the experiment result:
The forward lookup comprises of two search space subqueries that are unioned while the backward lookup comprises of only one search space subquery. The number listed under column "N (record)" is the number of records returned from running the subquery that produces the search space for use by the parent query.
It turned out that the forward lookup method required the parent query to perform more subqueries than the backward lookup as roughly illustrated below:
The query employing a forward lookup:
SELECT attr1, attr2, (subquery to perform backward lookup) AS attr3, attr4, ..., attrN FROM transaction_table WHERE primary_key in ( SELECT subquery to perform forward lookup FROM subquery to form search space 1 UNION subquery to form search space 2 WHERE attr1 != subquery to perform forward lookup )versus
The query employing a backward lookup:
To conclude, an indexed large data space can be iterated more quickly with a query having a few subqueries than an indexed small data space with a query having a lot of subqueries.
Best regards,
Eus (FSF member #4445)
In this digital era, where computing technology is pervasive,
your freedom depends on the software controlling those computing devices.
Join free software movement today!
It is free as in freedom, not as in free beer!
Join: http://www.fsf.org/jf?referrer=4445