How to Write Complex SQL Query Optimally

Submitted by Eus
on February 25, 2009 - 7:14pm

When crafting a complex SQL query (i.e., the query has at least one subquery), two things have to be kept in mind:
1. Reduce the search space
2. Do query on index

As a real example, last week I tuned a query that previously took about 45 minutes to complete.
Now it takes only 12 minutes, which is an improvement of 73%. Here goes the story.

The old query has the following general form:

select ...
from (select ...
      from ...
      where ...
     )
where ...

The subquery in the FROM clause returns data that will be filtered by criteria set forth in the WHERE clause, which belongs to the outermost query. Unfortunately, many of the data returned by the subquery get filtered by the WHERE clause. As we know, the FROM clause of a query sets forth the query search space. Hence, by returning a lot of data from the subquery, the search space is broad. This consumes a lot of time when the criteria set forth in the WHERE clause is not simple. So, to reduce the search space, I moved some of the criteria inside the subquery. In other words, when using a subquery, strive to only return the needed data, not more not less.

Another example of reducing the search space is when the same result can be obtained by either querying on entries of type A or entries of type B. In my case, it is either querying an item on its previous or current transaction to find out the movement of the item. Previously, the query always started from the previous transaction to perform a forward lookup. Being curious I counted the number of previous and current transactions of a set of items. It turned out that the number of previous transactions having the desired properties was much greater than the number of the current transactions having the desired properties, specifically, about 4.7 million and 1.5 million records, respectively. The problem was that not all of the 4.7 millions were the previous transactions of the 1.5 millions, and vice versa. Their intersection that the query was interested in only resulted in two hundreds of records. Switching strategy to perform a backward lookup by finding all of the current transactions within a subquery in a FROM clause helps a lot. So, that is for the first point to be kept in mind: reduce the search space.

Now for the second point, the old query has a subquery in the following form:

select ...
from (...) as item
where state in (select state
                from group_of_states
                where group_name = 'group A'
               )

Previously, I didn't observe it as one of the bottle neck since it only had two instances in the old query. However, the new query needs five instances, and some of them are within another critical subqueries (i.e., their search spaces are broad). The performance suffered because the number of states under group A was about twenty. And, unfortunately, using `in' operator has the drawback of introducing another search space. Worse than that, out of the twenty states, the criterion in the WHERE clause is only interested in one of them. In other words, do not use `in' operator if it involves a subquery that returns many entries and only some of the entries matter.

At this point, I realized that the query could be transformed into another more efficient query as follows:

select ...
from (...) as item
where 'group A' = (select group_name
                   from group_of_states
                   where state = item.state
                  )

The query was made even faster after I made a B-tree index on group_of_states' {state} because the primary key of group_of_states was not {state} but {group_name, state}. In other words, doing a query on a part of a composite primary key didn't help; strive to query on an indexed column.

I also think that if `UNION ALL' can be used instead of `UNION' because the result sets are known to be disjoint, it is better to use `UNION ALL' so as to reduce the processing overhead of checking whether or not an entry has already existed in the other result set.

To conclude, a complex query, which involves a lot of subqueries, will be efficient if the search spaces of the subqueries are narrow and most subqueries use index. Besides that, care must be taken when using an operator returning more than one result so as to not introducing an additional search space (e.g., `in' operator).

Keywords: complex sql query optimization

EXISTS and NOT EXISTS are Often Used

Eus
on
March 2, 2009 - 1:32am

I also found out that I often uses `EXISTS' or `NOT EXISTS' in a WHERE clause.
For example, the following query to retrieve all transactions recorded on table movement_snapshot at `2008-05-01 00:00:00' that do not have further transactions after the specified time until `2008-06-01 00:00:00' is not good:

select *
from movement_snapshot
where snapshot_time = '2008-05-01 00:00:00'
      and item_id in ((select item_id
                       from movement_snapshot
                       where snapshot_time = '2008-05-01 00:00:00'
                      )
                      except
                      (select item_id
                       from item_transaction
                       where move_at > '2008-05-01 00:00:00'
                             and move_at <= '2008-06-01 00:00:00'
                      )
                     )

The subquery in the outermost query's WHERE clause is needed because the set of attributes of table movement_snapshot is the superset of that of table item_transaction. Although the subquery can be eliminated by specifying all attributes in the SELECT clause of the table whose set of attributes is the superset, as in the following form:

(select a1, a2, a3
 from movement_snapshot
)
except all
(select *
 from item_transaction
)

it is truly not possible in this particular case since all attributes from table movement_snapshot is to be retrieved.

However, it is more efficient to utilize `NOT EXISTS' as in the following query:

select *
from movement_snapshot as t
where snapshot_time = '2008-05-01 00:00:00'
      and not exists (select *
                      from item_transaction
                      where move_at > '2008-05-01 00:00:00'
                            and move_at <= '2008-06-01 00:00:00'
                            and item_id = t.item_id
                     )

Specifically, the new query eliminates the need of retrieving all item_id from table movement_snapshot whose snapshot_time is `2008-05-01 00:00:00' besides taking an advantage of performing query on index by doing `item_id = t.item_id'.

Best regards,
Eus (FSF member #4445)

In this digital era, where computing technology is pervasive, your freedom depends on the software controlling those computing devices.

Join free software movement today! It is free as in freedom, not as in free beer!

Join: http://www.fsf.org/jf?referrer=4445

Harnessing `join' to Reduce the Search Space

Eus
on
May 5, 2009 - 7:02pm

Hi Ho!

This time I looked at another query that uses `join' (e.g., from t1 left join t2) to generate additional attributes not found in the original record. The query has the following form:

select ...
from  (select t2.id, t2.item_attr1, t2.total_sales, sum(item_purchase.bought_at) as total_purchase
       from  (select t1.id, t1.item_attr1, sum(item_sales.sold_at) as total_sales
              from  (select id, item_attr1
                     from item
                     where ...
                    ) as t1
                   left join
                    item_sales
                   on item.id = item_sales.item_id
              group by t1.id, t1.item_attr1
             ) as t2
            left join
             item_purchase
            on t2.id = item_purchases.item_id
      ) as tN
     left join
      ...

The use of the `join' effectively reduces the search space compared to the following form:

select id
       , item_attr1
       , (select sum(sold_at)
          from item_sales
          where item_id = i.id
         ) as total_sales
       , (select sum(bought_at)
          from item_purchase
          where item_id = i.id
         ) as total_purchase
       , ...
from item as i
where ...

because in this form, each subquery in the SELECT clause introduces a new unreduced search space for each record in table item. Beside being more efficient, the use of the `join' provides the most flexible way to add additional attributes not found in the original record.

Best regards,
Eus (FSF member #4445)

In this digital era, where computing technology is pervasive,
your freedom depends on the software controlling those computing devices.

Join free software movement today!
It is free as in freedom, not as in free beer!

Join: http://www.fsf.org/jf?referrer=4445

Iterating over a Large Data Space vs. Using a Complicated Query

Eus
on
July 4, 2009 - 11:07am

Hi Ho!

It turns out that iterating over a large data space using a query employing few subqueries results in a much faster execution time than iterating over a small data space using a query employing many subqueries. Almost a month ago, I had another chance to optimize the query that previously I had optimized to be faster by 73%, because of a new user requirement. Since the new requirement required the query to employ a backward lookup from a very large data space instead of a forward lookup from a small data space, I tried to measure how worse the performance of the query would be when a backward lookup was employed. To my surprise, it was much faster to perform a backward lookup from a very large data space instead of a forward lookup from a small data space. Below are the experiment result:

+---+-------------+------------------------------+-----------------------+--------+
| # | Query over  | Forward Lookup               | Backward Lookup       | Gain   |
|   | x mon/year  +-------------------+----------+------------+----------+        |
|   | of trans-   | N (record)        | time     | N (record) | time     |        |
|   | action data |                   | (minute/ |            | (minute/ |        |
|   |             |                   | hour)    |            | hour)    |        |
+---+-------------+-------------------+----------+------------+----------+--------+
| For area A                                                                      |
+---+-------------+-------------------+----------+------------+----------+--------+
| 1 | 1 month     | 705,616 + 14,680  | 9.76 m   | 793,925    | 4.77 m   | 51.13% |
| 2 | 2 months    | 1,314,098 + 9,033 | 21.42 m  | 1,439,555  | 8.50 m   | 60.32% |
| 3 | 1 year      | 13,077,642 + 63   | 6.749 h  | 6,067,423  | 45.02 m  | 88.88% |
+---+-------------+-------------------+----------+------------+----------+--------+
| For area B                                                                      |
+---+-------------+-------------------+----------+------------+----------+--------+
| 4 | 1 month     | 257,810 + 18,471  | 5.82 m   | 1,241,731  | 4.61 m   | 20.79% |
| 5 | 2 months    | 505,596 + 22,582  | 15.43 m  | 2,248,057  | 8.49 m   | 44.98% |
| 6 | 1 year      | 1,622,728 + 0     | 5.426 h  | 17,522,337 | 44.80 m  | 86.24% |
+---+-------------+-------------------+----------+------------+----------+--------+

The forward lookup comprises of two search space subqueries that are unioned while the backward lookup comprises of only one search space subquery. The number listed under column "N (record)" is the number of records returned from running the subquery that produces the search space for use by the parent query.

It turned out that the forward lookup method required the parent query to perform more subqueries than the backward lookup as roughly illustrated below:

The query employing a forward lookup:

SELECT attr1, attr2, (subquery to perform backward lookup) AS attr3, attr4, ..., attrN
FROM transaction_table
WHERE primary_key in (
  SELECT subquery to perform forward lookup
  FROM subquery to form search space 1
       UNION
       subquery to form search space 2
  WHERE attr1 != subquery to perform forward lookup
)

versus

The query employing a backward lookup:

SELECT attr1, attr2, (subquery to perform backward lookup) AS attr3, attr4, ..., attrN
FROM transaction_table
WHERE EXISTS (subquery to perform backward lookup)

To conclude, an indexed large data space can be iterated more quickly with a query having a few subqueries than an indexed small data space with a query having a lot of subqueries.

Best regards,
Eus (FSF member #4445)

In this digital era, where computing technology is pervasive,
your freedom depends on the software controlling those computing devices.

Join free software movement today!
It is free as in freedom, not as in free beer!

Join: http://www.fsf.org/jf?referrer=4445

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.