Why query optimization is needed
For some statements, the query transformer determines whether it is advantageous to rewrite the original SQL statement into a semantically equivalent SQL statement with a lower cost.
When a viable alternative exists, the database calculates the cost of the alternatives separately and chooses the lowest-cost alternative. The estimator is the component of the optimizer that determines the overall cost of a given execution plan. Selectivity is the percentage of rows in the row set that the query selects, with 0 meaning no rows and 1 meaning all rows.
A predicate becomes more selective as the selectivity value approaches 0 and less selective as the value approaches 1. The cardinality is the number of rows returned by each operation in an execution plan. This input, which is crucial to obtaining an optimal plan, is common to all cost functions. The Rows column in an execution plan shows the estimated cardinality.
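As a rough sketch of the two measures just defined (illustrative only; a real optimizer estimates them from stored statistics rather than by scanning rows, and the salary values below are invented):

```python
# Toy model of selectivity and cardinality.
def selectivity(rows, predicate):
    """Fraction of rows that pass the predicate: 0.0 (no rows) to 1.0 (all rows)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if predicate(r)) / len(rows)

def cardinality(total_rows, sel):
    """Estimated number of rows an operation returns."""
    return round(total_rows * sel)

salaries = [2400, 3000, 9000, 9000, 12000, 17000, 17000, 24000]
sel = selectivity(salaries, lambda s: s > 10000)   # 4 of 8 rows pass
print(sel)                               # 0.5
print(cardinality(len(salaries), sel))   # 4
```

The predicate `s > 10000` grows more selective as its selectivity value shrinks toward 0, matching the definition above.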
Cost is a measure that represents units of work or resources used. As shown in the accompanying graphic, if statistics are available, then the estimator uses them to compute the measures; the statistics improve the accuracy of the measures. For an example query, the estimator combines selectivity, estimated cardinality (a total return of 10 rows), and cost measures to produce its total cost estimate. The row set can be a base table, a view, or the result of a join.
A predicate filters a specific number of rows from a row set. Thus, the selectivity of a predicate indicates how many rows pass the predicate test. Selectivity ranges from 0 to 1.
A selectivity of 0 means that no rows are selected from the row set, and a selectivity of 1 means that all rows are selected. A predicate becomes more selective as the value approaches 0. When statistics are not available, the database uses different internal defaults depending on the predicate type; when statistics are available, the estimator uses them to estimate selectivity.
Assume a query filters on employee last names, which take some number of distinct values. A histogram captures the distribution of different values in a column, so it yields better selectivity estimates than a uniform-distribution assumption, especially for columns that have data skew. The cardinality estimate for a full table scan is simply the optimizer's estimate of the number of rows the scan returns. The cardinality estimate appears in the Rows column of the execution plan.
The optimizer determines the cardinality for each operation based on a complex set of formulas that use both table and column level statistics, or dynamic statistics, as input. The optimizer uses one of the simplest formulas when a single equality predicate appears in a single-table query, with no histogram.
In this case, the optimizer assumes a uniform distribution and calculates the cardinality for the query by dividing the total number of rows in the table by the number of distinct values in the column used in the WHERE clause predicate. For example, suppose user hr queries the employees table with an equality predicate on the salary column; the optimizer divides the row count of employees by the number of distinct salary values recorded in the current database statistics. Cardinality estimates must be as accurate as possible because they influence all aspects of the execution plan.
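The uniform-distribution formula just described is simple enough to state in a few lines (the statistics below are made up for illustration):

```python
# Cardinality estimate for a single equality predicate with no histogram:
# the selectivity of "col = :x" is assumed to be 1 / num_distinct, so the
# estimate is the table's row count divided by the distinct-value count.
def equality_cardinality(num_rows, num_distinct):
    return num_rows / num_distinct

# Hypothetical statistics: 1000 rows, 50 distinct salary values.
print(equality_cardinality(num_rows=1000, num_distinct=50))  # 20.0
```

A skewed column breaks this assumption, which is exactly the case the histogram mentioned above is meant to handle.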
Cardinality is important when the optimizer determines the cost of a join. For example, in a nested loops join of the employees and departments tables, the number of rows in employees determines how often the database must probe the departments table. Cardinality is also important for determining the cost of sorts. The optimizer cost model accounts for the machine resources that a query is predicted to use.
The cost is an internal numeric measure that represents the estimated resource usage for a plan; it is specific to a query in an optimizer environment. To estimate cost, the optimizer considers factors such as system resources (estimated I/O, CPU, and memory) and the estimated number of rows returned. The cost is an internal measure that the optimizer uses to compare different plans for the same query. You cannot tune or change cost. The execution time is a function of the cost, but cost does not equate directly to time.
For example, if the plan for query A has a lower cost than the plan for query B, then any of the following outcomes is possible: A executes faster than B, A executes in the same amount of time as B, or A executes more slowly than B. Therefore, you cannot compare the costs of different queries with one another. Also, you cannot compare the costs of semantically equivalent queries that use different optimizer modes. The execution plan displays the cost of the entire plan, which is indicated on line 0, and of each individual operation.
The access path determines the number of units of work required to get data from a base table.
To determine the overall plan cost, the optimizer assigns a cost to each access path. The cost of a full table scan depends on the number of blocks to be scanned and the multiblock read count value.
The cost of an index scan depends on the levels in the B-tree, the number of index leaf blocks to be scanned, and the number of rows to be fetched using the rowid in the index keys. The cost of fetching rows using rowids depends on the index clustering factor. The join cost represents the combination of the individual access costs of the two row sets being joined, plus the cost of the join operation. The plan generator explores various plans for a query block by trying out different access paths, join methods, and join orders.
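The access-path arithmetic described above can be sketched as toy formulas; the real optimizer's formulas are far more detailed, and every number below is invented for illustration:

```python
import math

# Toy access-path costs in the spirit of the description above.
def full_scan_cost(num_blocks, multiblock_read_count):
    # Each multiblock read fetches several adjacent blocks at once.
    return math.ceil(num_blocks / multiblock_read_count)

def index_scan_cost(btree_levels, leaf_blocks, clustering_factor, sel):
    # Descend the B-tree, scan the matching fraction of leaf blocks, then
    # fetch table rows by rowid; the clustering factor models how many
    # table-block visits those rowid fetches cause.
    return (btree_levels
            + math.ceil(leaf_blocks * sel)
            + math.ceil(clustering_factor * sel))

print(full_scan_cost(num_blocks=10_000, multiblock_read_count=16))   # 625
print(index_scan_cost(btree_levels=2, leaf_blocks=500,
                      clustering_factor=8_000, sel=0.01))            # 2 + 5 + 80 = 87
```

Note how a poor clustering factor dominates the index-scan cost, which is why a selective predicate can still lose to a full scan.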
Many plans are possible because of the various combinations that the database can use to produce the same result. The optimizer picks the plan with the lowest cost. A snippet from an optimizer trace file can show some of the computations that the optimizer performs. In one such trace, the optimizer first tries the departments table as the outer table in the join.
The optimizer calculates the cost for three different join methods, nested loops join (NL), sort merge (SM), and hash join (HA), and picks the hash join as the most efficient method. The optimizer then tries a different join order, using employees as the outer table. This join order costs more than the previous one, so it is abandoned.
The optimizer uses an internal cutoff to reduce the number of plans it tries when finding the lowest-cost plan. The cutoff is based on the cost of the current best plan. If the current best cost is large, then the optimizer explores alternative plans to find a lower cost plan. If the current best cost is small, then the optimizer ends the search swiftly because further cost improvement is not significant.
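The search-with-cutoff idea can be sketched as follows. This is a deliberately simplified model, not any real optimizer's algorithm: the cost function is a toy left-deep nested-loop formula, and the table row counts are invented:

```python
from itertools import permutations

def plan_cost(order, rows, cutoff=float("inf")):
    """Toy left-deep nested-loop cost: each table is probed once per row
    produced by the prefix joined so far. Stops as soon as the running
    cost reaches the cutoff, mirroring the optimizer's pruning."""
    cost, produced = 0, 1
    for table in order:
        cost += produced * rows[table]   # probe this table once per row so far
        produced *= rows[table]          # ignore filtering for simplicity
        if cost >= cutoff:
            break                        # prune: cannot beat the current best
    return cost

def best_join_order(rows):
    best_cost, best_order = float("inf"), None
    for order in permutations(rows):     # n! candidate join orders
        cost = plan_cost(order, rows, cutoff=best_cost)
        if cost < best_cost:
            best_cost, best_order = cost, order
    return best_order, best_cost

# Hypothetical row counts (made up for illustration).
tables = {"departments": 27, "employees": 107, "jobs": 19}
print(best_join_order(tables))  # (('jobs', 'departments', 'employees'), 55423)
```

Starting with the smallest table wins here because each prefix row multiplies the work done against the remaining tables.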
The optimizer compiles the SQL and generates an execution plan. The normal mode generates a reasonable plan for most SQL statements. The goal is to limit the number of rows and columns used to satisfy the query. A simple example suffices to illustrate how a small modification to an SQL query can have a huge impact on its execution speed. Consider a query that pulls in all rows and columns of a table in an attempt to locate the requested data. In almost every case, additional processing time is required to search through data that is essentially irrelevant to the success of the query.
Tuning a query by selecting a subset of columns and rows is a much more efficient method of retrieving information from a database. Searching for opportunities to improve the performance of SQL queries is what query optimization is all about. Simulate production environments and use load testing to check on the performance of existing SQL code and to try alternatives. Run queries multiple times in parallel to test improvements without impacting your production environment. DB PowerStudio is a collection of four tools that enable teams to administer systems, manage changes, tune queries, and develop relational databases.
Encyclopedia of Database Systems. Editors: Ling Liu, M. Tamer Özsu.
Query Optimization in Relational Databases. Reference work entry by Thomas Neumann, first online 07 December.
MySQL keeps backtracking until it finds another row in some table, at which point it looks for a matching row in the next table, and so on. This query execution plan applies as easily to a single-table query as it does to a many-table query, which is why even a single-table query can be considered a join: the single-table join is the basic operation from which more complex joins are composed.
Read it from left to right and top to bottom. [Figure: swim-lane diagram illustrating retrieving rows using a join.] MySQL executes every kind of query in essentially the same way.
In short, MySQL coerces every kind of query into this execution plan. Still other queries can be executed with nested loops, but perform very badly as a result.
We look at some of those later. Instead, the query execution plan is actually a tree of instructions that the query execution engine follows to produce the query results. The final plan contains enough information to reconstruct the original query.
Any multitable query can conceptually be represented as a tree. For example, it might be possible to execute a four-table join as two two-table joins whose results are then joined, which is what computer scientists call a balanced tree. This is not how MySQL executes the query, though. As we described in the previous section, MySQL always begins with one table and finds matching rows in the next table. The most important part of the MySQL query optimizer is the join optimizer, which decides the best order of execution for multitable queries.
It is often possible to join the tables in several different orders and get the same results. The join optimizer estimates the cost of the various plans and tries to choose the least expensive one that gives the same result. You can probably think of a few different query plans for any given multitable query, and the plan MySQL chooses is often quite different from the one that looks most natural to a human. This shows why MySQL sometimes wants to reverse the join order: doing so can enable it to examine fewer rows in the first table.
The difference is how many of these indexed lookups it will have to do. If the server scans the actor table first, it will have to do far fewer index lookups into later tables.
In other words, the reversed join order will require less backtracking and rereading. In our benchmark, the reordered query had a much lower estimated cost than the query that forced the original join order. Reordering joins is usually a very effective optimization. In most cases, the join optimizer will outperform a human. The join optimizer tries to produce a query execution plan tree with the lowest achievable cost. When possible, it examines all potential combinations of subtrees, beginning with all one-table plans.
Unfortunately, a join over n tables will have n-factorial combinations of join orders to examine. This is called the search space of all possible query plans, and it grows very quickly: a ten-table join can be executed up to 3,628,800 different ways!
When the search space grows too large, it can take far too long to optimize the query, so the server stops doing a full analysis. MySQL has many heuristics, accumulated through years of research and experimentation, that it uses to speed up the optimization stage. Some tables in a join cannot be reordered freely, because the results for one table depend on data retrieved from another table.
These dependencies help the join optimizer reduce the search space by eliminating choices. Sorting results can be a costly operation, so you can often improve performance by avoiding sorts or by performing them on fewer rows.
We showed you how to use indexes for sorting in Chapter 3. If the values to be sorted fit into the sort buffer, MySQL can perform the sort entirely in memory with a quicksort. If they do not fit, MySQL sorts the data in chunks: it uses a quicksort to sort each chunk and then merges the sorted chunks into the result. MySQL has two filesort algorithms. The older two-pass algorithm reads row pointers and ORDER BY columns, sorts them, and then rereads the rows in sorted order to produce the output. It must read the table twice, but it stores a minimal amount of data during the sort, so if the rows to be sorted are completely in memory, it can be cheaper to store less data and reread the rows to generate the final result.
The newer single-pass algorithm reads all the columns needed for the query, sorts them by the ORDER BY columns, and then scans the sorted list and outputs the specified columns. This algorithm is available only in MySQL 4.1 and newer.
However, it has the potential to use a lot more space, because it holds all desired columns from each row, not just the columns needed to sort the rows. This means fewer tuples will fit into the sort buffer, and the filesort will have to perform more sort merge passes. When sorting a join, MySQL may perform the filesort at two stages during the query execution.
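The chunked filesort described above is essentially an external merge sort. A minimal sketch, with the sort buffer modeled as a fixed chunk size:

```python
import heapq

# Minimal external-merge-sort sketch: sort fixed-size chunks (as if each
# chunk were one sort-buffer load), then merge the sorted runs. A real
# filesort writes the runs to disk; here they stay in memory.
def filesort(rows, sort_buffer_size):
    runs = [sorted(rows[i:i + sort_buffer_size])
            for i in range(0, len(rows), sort_buffer_size)]
    return list(heapq.merge(*runs))      # k-way merge of the sorted runs

data = [42, 7, 19, 3, 28, 15, 8, 1]
print(filesort(data, sort_buffer_size=3))  # [1, 3, 7, 8, 15, 19, 28, 42]
```

The single-pass algorithm's larger per-row payload shrinks the effective `sort_buffer_size`, producing more runs and therefore more merge work, exactly the trade-off described above.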
The plan is a data structure; it is not executable byte-code, which is how many other databases execute queries. In contrast to the optimization stage, the execution stage is usually not all that complex: MySQL simply follows the instructions given in the query execution plan.
Many of the operations in the plan invoke methods implemented by the storage engine interface, also known as the handler API. Each table in the query is represented by an instance of a handler. If a table appears three times in the query, for example, the server creates three handler instances. Though we glossed over this before, MySQL actually creates the handler instances early in the optimization stage.
The optimizer uses them to get information about the tables, such as their column names and index statistics. A short sequence of handler calls is enough to execute a query that does an index scan. Not everything is a handler operation; for example, the server manages table locks. As explained in Chapter 1, anything that all storage engines share is implemented in the server, such as date and time functions, views, and triggers. To execute the query, the server just repeats the instructions until there are no more rows to examine.
The final step in executing a query is to reply to the client. If the query is cacheable, MySQL will also place the results into the query cache at this stage. The server generates and sends results incrementally. Think back to the single-sweep multijoin method we mentioned earlier.
As soon as MySQL processes the last table and generates one row successfully, it can and should send that row to the client. This has two benefits: it lets the server avoid holding the row in memory, and it means the client starts getting the results as soon as possible. Some of these limitations will probably be eased or removed entirely in future versions, and some have already been fixed in versions not yet released as GA generally available.
In particular, there are a number of subquery optimizations in the MySQL 6 source code, and more are in progress. MySQL sometimes optimizes subqueries very badly.
Consider finding all rows in one table whose IDs match rows in another table; this feels natural to write with an IN subquery. We said an IN list is generally very fast, so you might expect MySQL to execute the inner query first and convert the outer query's condition into an IN list of constant values. Unfortunately, exactly the opposite happens: MySQL pushes the outer table's column into the subquery, rewriting it as a correlated EXISTS subquery that must run once for every row of the outer table. Sometimes this can be faster than a JOIN. MySQL has been criticized thoroughly for this particular type of subquery execution plan. Although it definitely needs to be fixed, the criticism often confuses two different issues: execution order and caching.
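The two forms can be demonstrated concretely. The demo below runs in SQLite purely to check that the natural IN form and the correlated EXISTS form return the same rows; the table names echo the well-known sakila sample schema, and the data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE film_actor (film_id INTEGER, actor_id INTEGER);
    INSERT INTO film VALUES (1, 'ACADEMY DINOSAUR'), (2, 'ACE GOLDFINGER'),
                            (3, 'ADAPTATION HOLES');
    INSERT INTO film_actor VALUES (1, 1), (3, 1), (3, 2);
""")

# The natural way to write the query: a noncorrelated IN subquery.
in_form = conn.execute("""
    SELECT film_id FROM film
    WHERE film_id IN (SELECT film_id FROM film_actor WHERE actor_id = 1)
    ORDER BY film_id
""").fetchall()

# The correlated EXISTS form that older MySQL versions rewrite it into,
# executed once per row of the outer film table.
exists_form = conn.execute("""
    SELECT film_id FROM film
    WHERE EXISTS (SELECT 1 FROM film_actor
                  WHERE actor_id = 1 AND film_actor.film_id = film.film_id)
    ORDER BY film_id
""").fetchall()

print(in_form, exists_form)  # [(1,), (3,)] [(1,), (3,)]
```

The results match; the complaint in the text is about how the EXISTS form is executed, not about what it returns.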
Rewriting the query yourself lets you take control over both aspects. Future versions of MySQL should be able to optimize this type of query much better, although this is no easy task. There are very bad worst cases for any execution plan, including the inside-out execution plan that some people think would be simple to optimize. Instead, benchmark and make your own decision.
Sometimes a correlated subquery is a perfectly reasonable, or even optimal, way to get a result. An EXISTS subquery, for example, can stop as soon as it finds the first matching row; this is an example of the early-termination algorithm we mentioned earlier in this chapter.
So, in theory, MySQL will execute the queries almost identically. In reality, benchmarking is the only way to tell which approach is really faster. We benchmarked both queries on our standard setup; the results are shown in the accompanying table. Sometimes a subquery can be faster. For example, it can work well when you just want to see rows from one table that match rows in another table.
The following join, which is designed to find every film that has an actor, will return duplicates because some films have multiple actors:. But what are we really trying to express with this query, and is it obvious from the SQL?
Again, we benchmarked to see which strategy was faster. In this example, the subquery performs much faster than the join. We showed this lengthy example to illustrate two points: you should not heed categorical advice about subqueries, and you should use benchmarks to prove your assumptions about query plans and execution speed.
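The duplicate problem and the subquery alternative are easy to see in miniature. This SQLite demo (invented data, sakila-style table names) contrasts the join, which repeats a film once per actor, with the EXISTS form that expresses "films that have at least one actor":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE film_actor (film_id INTEGER, actor_id INTEGER);
    INSERT INTO film VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO film_actor VALUES (1, 10), (1, 11), (2, 10);  -- film 3 has no actors
""")

# The join duplicates film 1, which has two actors.
join_rows = conn.execute("""
    SELECT film.film_id FROM film
    JOIN film_actor ON film_actor.film_id = film.film_id
    ORDER BY film.film_id
""").fetchall()

# The EXISTS subquery returns each matching film exactly once.
exists_rows = conn.execute("""
    SELECT film_id FROM film
    WHERE EXISTS (SELECT 1 FROM film_actor
                  WHERE film_actor.film_id = film.film_id)
    ORDER BY film_id
""").fetchall()

print(join_rows)    # [(1,), (1,), (2,)]
print(exists_rows)  # [(1,), (2,)]
```

The join needs a DISTINCT to match the subquery's output, which is extra work that the EXISTS form avoids entirely.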
Index merge algorithms, introduced in MySQL 5.0, let the server use more than one index per table in a query. There are three variations on the algorithm: union for OR conditions, intersection for AND conditions, and unions of intersections for combinations of the two.
A query whose WHERE clause combines conditions on two indexed columns with OR can use a union of two index scans, as you can see by examining the Extra column of its EXPLAIN output. The buffering, sorting, and merging that the algorithm performs can consume significant resources. This is especially true if not all of the indexes are very selective, so the parallel scans return lots of rows to the merge operation.
This is another reason to design realistic benchmarks. Equality propagation can sometimes have unexpected costs. For example, when a column with an IN list of values is joined to columns in other tables, the optimizer propagates the list to the corresponding columns in all related tables. This is normally helpful, because it gives the query optimizer and execution engine more options for where to actually execute the IN check. But when the list is very large, it can result in slower optimization and execution. Hash joins are a feature offered by some other database servers, but not MySQL; however, you can emulate hash joins using hash indexes. MySQL has also historically been unable to do loose index scans, which scan noncontiguous ranges of an index.
When the leading column of an index is not restricted, MySQL will scan the entire range of rows within the end points it can determine. An example will help clarify this. Suppose we have a table with an index on columns (a, b), and we want to run a query that filters only on b. Because the index begins with column a, a normal range scan is not possible, so MySQL scans the whole range. The accompanying figure shows what a better strategy would look like if MySQL were able to do it: a loose index scan, which MySQL cannot currently do, would scan only the matching b range within each distinct value of a and would be more efficient.
Beginning in MySQL 5.0, loose index scans are possible in certain limited circumstances, such as queries that find minimum and maximum values within a grouped query. This is a good optimization for this special purpose, but it is not a general-purpose loose index scan. Until MySQL supports general-purpose loose index scans, the workaround is to supply a constant or list of constants for the leading columns of the index. We showed several examples of how to get good performance with these types of queries in our indexing case study in the previous chapter.
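The workaround of supplying constants for the leading index column can be sketched as follows. The table, index, and data here are hypothetical, and the demo runs in SQLite only to check that the rewritten query returns the same rows as the original:

```python
import sqlite3

# With an index on (a, b), a filter on b alone cannot use a range scan.
# Enumerating the distinct values of the leading column a turns the query
# into several index range probes, one per value of a.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.execute("CREATE INDEX idx_ab ON t (a, b)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(a, b) for a in (1, 2, 3) for b in range(5)])

plain = conn.execute(
    "SELECT a, b FROM t WHERE b = 2 ORDER BY a").fetchall()

workaround = conn.execute("""
    SELECT a, b FROM t
    WHERE a IN (1, 2, 3)   -- constants for the leading index column
      AND b = 2
    ORDER BY a
""").fetchall()

print(plain == workaround, plain)  # True [(1, 2), (2, 2), (3, 2)]
```

The rewrite only pays off when the leading column has few distinct values; with many values, the IN list itself becomes the problem, as the equality-propagation discussion above warns.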
However, in some cases, such as when the relevant column is not indexed, MySQL will scan the whole table, which you can verify by profiling the query. This general strategy often works well when MySQL would otherwise choose to scan more rows than necessary. True, but sometimes you have to compromise your principles to get high performance. For example, MySQL does not let you SELECT from a table while simultaneously running an UPDATE against that same table. Suppose a query updates each row with the number of similar rows in the table. To work around this limitation, you can use a derived table, because MySQL materializes it as a temporary table.
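The derived-table workaround can be sketched like this. The MySQL form appears as a comment; because SQLite has no UPDATE ... JOIN syntax, the executable demo expresses the same derived table through a correlated subquery, purely to check the resulting counts (the table and data are invented):

```python
import sqlite3

# MySQL workaround, joining the table to a derived (materialized) copy:
#   UPDATE tbl
#   INNER JOIN (SELECT type, COUNT(*) AS cnt FROM tbl GROUP BY type) AS der
#     USING (type)
#   SET tbl.cnt = der.cnt;
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, type TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO tbl (type, cnt) VALUES (?, 0)",
                 [("a",), ("a",), ("b",)])

# Same derived table, applied via a correlated subquery in SQLite.
conn.execute("""
    UPDATE tbl SET cnt = (
        SELECT der.cnt
        FROM (SELECT type, COUNT(*) AS cnt FROM tbl GROUP BY type) AS der
        WHERE der.type = tbl.type)
""")
print(conn.execute("SELECT type, cnt FROM tbl ORDER BY id").fetchall())
# [('a', 2), ('a', 2), ('b', 1)]
```

Each row ends up holding the count of rows sharing its type, which is exactly what the self-referencing UPDATE was trying to compute.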
In this section, we give advice on how to optimize certain kinds of queries. Most of the advice in this section is version-dependent, and it may not hold for future versions of MySQL. You can do a web search and find more misinformation on this topic than we care to think about. COUNT is a special function that works in two very different ways: it counts values and it counts rows. If you specify a column name or other expression inside the parentheses, COUNT counts how many times that expression has a value, that is, how many times it is not NULL.
This is confusing for many people, in part because values and NULL are confusing. The Internet is not necessarily a good source of accurate information on this topic, either. One of the most common mistakes we see is specifying column names inside the parentheses when you want to count rows.
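The distinction is easy to see with a tiny table containing NULLs (an invented example, run in SQLite):

```python
import sqlite3

# COUNT(expr) counts non-NULL values; COUNT(*) counts rows regardless.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,), (None,)])

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 4: all rows
print(conn.execute("SELECT COUNT(x) FROM t").fetchone()[0])  # 2: non-NULL x only
```

If you want the number of rows, write COUNT(*); writing COUNT(column) silently changes the question being asked whenever the column is nullable.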