Additional transformations discussed in Chapters 4, 5, and 6 are not repeated here. We discuss next how transformations can be used in heuristic optimization.

Outline of a Heuristic Algebraic Optimization Algorithm. We can now outline the steps of an algorithm that utilizes some of the above rules to transform an initial query tree into a final tree that is more efficient to execute (in most cases). The algorithm will lead to transformations similar to those discussed in our example in Figure 19.2. The steps of the algorithm are as follows:

1. Using Rule 1, break up any SELECT operations with conjunctive conditions into a cascade of SELECT operations. This permits a greater degree of freedom in moving SELECT operations down different branches of the tree.
2. Using Rules 2, 4, 6, 10, 13, and 14 concerning the commutativity of SELECT with other operations, move each SELECT operation as far down the query tree as is permitted by the attributes involved in the select condition. If the condition involves attributes from only one table, which means that it represents a selection condition, the operation is moved all the way to the leaf node that represents this table. If the condition involves attributes from two tables, which means that it represents a join condition, the condition is moved to a location down the tree after the two tables are combined.
3. Using Rules 5 and 9 concerning commutativity and associativity of binary operations, rearrange the leaf nodes of the tree using the following criteria. First, position the leaf node relations with the most restrictive SELECT operations so they are executed first in the query tree representation. The definition of most restrictive SELECT can mean either the ones that produce a relation with the fewest tuples or with the smallest absolute size.[4] Another possibility is to define the most restrictive SELECT as the one with the smallest selectivity; this is more practical because estimates of selectivities are often available in the DBMS catalog. Second, make sure that the ordering of leaf nodes does not cause CARTESIAN PRODUCT operations; for example, if the two relations with the most restrictive SELECT do not have a direct join condition between them, it may be desirable to change the order of leaf nodes to avoid Cartesian products.[5]
4. Using Rule 12, combine a CARTESIAN PRODUCT operation with a subsequent SELECT operation in the tree into a JOIN operation, if the condition represents a join condition.
5. Using Rules 3, 4, 7, and 11 concerning the cascading of PROJECT and the commuting of PROJECT with other operations, break down and move lists of projection attributes down the tree as far as possible by creating new PROJECT operations as needed. Only those attributes needed in the query result and in subsequent operations in the query tree should be kept after each PROJECT operation.
6. Identify subtrees that represent groups of operations that can be executed by a single algorithm.

[4] Either definition can be used, since these rules are heuristic.
[5] Note that a CARTESIAN PRODUCT is acceptable in some cases—for example, if each relation has only a single tuple because each had a previous select condition on a key field.
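The first two steps lend themselves to a simple mechanical treatment. The following Python sketch is only illustrative: it assumes a toy tuple-based encoding of query trees and of single-relation conditions (neither of which appears in the text) and shows how a conjunctive selection can be broken into a cascade and each piece pushed to the leaf of the relation it mentions; join conditions and the remaining steps are not handled here.

```python
# Illustrative only: a toy query-tree encoding, not a DBMS data structure.
# A node is ('REL', name), ('SELECT', condition, child), or
# ('JOIN', join_condition, left, right). A selection condition is a tuple
# (relation, attribute, comparison, value) that mentions a single relation.

def split_conjunction(conditions):
    # Step 1: a conjunctive SELECT becomes a cascade (here, a list) of
    # single-condition SELECTs.
    return list(conditions)

def push_down(tree, cond):
    # Step 2: move a single-relation SELECT down to the leaf node of the
    # relation that the condition refers to.
    kind = tree[0]
    if kind == 'REL':
        return ('SELECT', cond, tree) if tree[1] == cond[0] else tree
    if kind == 'SELECT':
        return ('SELECT', tree[1], push_down(tree[2], cond))
    if kind == 'JOIN':
        return ('JOIN', tree[1], push_down(tree[2], cond), push_down(tree[3], cond))
    return tree

# SELECT ... FROM EMPLOYEE, DEPARTMENT
# WHERE Dno = Dnumber AND Dname = 'Research' AND Salary > 30000
tree = ('JOIN', 'EMPLOYEE.Dno = DEPARTMENT.Dnumber',
        ('REL', 'EMPLOYEE'), ('REL', 'DEPARTMENT'))
conjunction = [('DEPARTMENT', 'Dname', '=', 'Research'),
               ('EMPLOYEE', 'Salary', '>', 30000)]

for c in split_conjunction(conjunction):
    tree = push_down(tree, c)

print(tree)
# ('JOIN', 'EMPLOYEE.Dno = DEPARTMENT.Dnumber',
#  ('SELECT', ('EMPLOYEE', 'Salary', '>', 30000), ('REL', 'EMPLOYEE')),
#  ('SELECT', ('DEPARTMENT', 'Dname', '=', 'Research'), ('REL', 'DEPARTMENT')))
```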
In our example, Figure 19.2(b) shows the tree in Figure 19.2(a) after applying steps 1 and 2 of the algorithm; Figure 19.2(c) shows the tree after step 3; Figure 19.2(d) after step 4; and Figure 19.2(e) after step 5. In step 6, we may group together the operations in the subtree whose root is the operation πEssn into a single algorithm. We may also group the remaining operations into another subtree, where the tuples resulting from the first algorithm replace the subtree whose root is the operation πEssn, because the first grouping means that this subtree is executed first.

Summary of Heuristics for Algebraic Optimization. The main heuristic is to apply first the operations that reduce the size of intermediate results. This includes performing as early as possible SELECT operations to reduce the number of tuples and PROJECT operations to reduce the number of attributes—by moving SELECT and PROJECT operations as far down the tree as possible. Additionally, the SELECT and JOIN operations that are most restrictive—that is, result in relations with the fewest tuples or with the smallest absolute size—should be executed before other similar operations. The latter rule is accomplished by reordering the leaf nodes of the tree among themselves while avoiding Cartesian products, and adjusting the rest of the tree appropriately.

19.2 Choice of Query Execution Plans

19.2.1 Alternatives for Query Evaluation

An execution plan for a relational algebra expression represented as a query tree includes information about the access methods available for each relation as well as the algorithms to be used in computing the relational operators represented in the tree. As a simple example, consider query Q1 from Chapter 7, whose corresponding relational algebra expression is

πFname, Lname, Address(σDname=‘Research’(DEPARTMENT) ⋈Dnumber=Dno EMPLOYEE)

The query tree is shown in Figure 19.3. To convert this into an execution plan, the optimizer might choose an index search for the SELECT operation on DEPARTMENT (assuming one exists), an index-based nested-loop join algorithm that loops over the records in the result of the SELECT operation on DEPARTMENT for the join operation (assuming an index exists on the Dno attribute of EMPLOYEE), and a scan of the JOIN result for input to the PROJECT operator. Additionally, the approach taken for executing the query may specify a materialized or a pipelined evaluation, although in general a pipelined evaluation is preferred whenever feasible.

[Figure 19.3 A query tree for query Q1: πFname, Lname, Address applied to σDname=‘Research’(DEPARTMENT) ⋈Dnumber=Dno EMPLOYEE.]
With materialized evaluation, the result of an operation is stored as a temporary relation (that is, the result is physically materialized). For instance, the JOIN operation can be computed and the entire result stored as a temporary relation, which is then read as input by the algorithm that computes the PROJECT operation, which would produce the query result table. On the other hand, with pipelined evaluation, as the resulting tuples of an operation are produced, they are forwarded directly to the next operation in the query sequence. We discussed pipelining as a strategy for query processing in Section 18.7. For example, as the selected tuples from DEPARTMENT are produced by the SELECT operation, they are placed in a buffer; the JOIN operation algorithm then consumes the tuples from the buffer, and those tuples that result from the JOIN operation are pipelined to the projection operation algorithm. The advantage of pipelining is the cost savings in not having to write the intermediate results to disk and not having to read them back for the next operation.
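To make the contrast concrete, here is a small Python sketch (not how a DBMS is implemented; the toy tables and column values are made up) in which materialized evaluation builds each intermediate result as a full list, while pipelined evaluation chains generators so that tuples flow one at a time from SELECT through JOIN to PROJECT, mirroring the plan for Q1.

```python
# Toy stand-ins for the base tables; column names follow the schema of Q1.
DEPARTMENT = [{"Dname": "Research", "Dnumber": 5},
              {"Dname": "Administration", "Dnumber": 4}]
EMPLOYEE = [{"Lname": "Smith", "Dno": 5}, {"Lname": "Wallace", "Dno": 4}]

# Materialized evaluation: each operator writes out a full temporary result.
selected = [d for d in DEPARTMENT if d["Dname"] == "Research"]        # temp 1
joined = [{**d, **e} for d in selected for e in EMPLOYEE
          if e["Dno"] == d["Dnumber"]]                                 # temp 2
result_materialized = [{"Lname": t["Lname"]} for t in joined]          # temp 3

# Pipelined evaluation: each operator is a generator; nothing is stored.
def select_research(rows):
    for d in rows:
        if d["Dname"] == "Research":
            yield d

def join_employee(depts):
    for d in depts:
        for e in EMPLOYEE:        # an index-based lookup in a real plan
            if e["Dno"] == d["Dnumber"]:
                yield {**d, **e}

def project_lname(rows):
    for t in rows:
        yield {"Lname": t["Lname"]}

result_pipelined = list(project_lname(join_employee(select_research(DEPARTMENT))))
assert result_pipelined == result_materialized
```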
We discussed in Section 19.1 the possibility of converting query trees into equivalent trees so that the evaluation of the query is more efficient in terms of its execution time and overall resources consumed. More elaborate transformations are also possible to optimize, or rather to “improve,” queries. Transformations can be applied in either a heuristic-based or a cost-based manner. As we discussed in Sections 7.1.2 and 7.1.3, nested subqueries may occur in the WHERE clause as well as in the FROM clause of SQL queries. In the WHERE clause, if an inner block makes a reference to the relation used in the outer block, it is called a correlated nested query. When a query is used within the FROM clause to define a resulting or derived relation, which participates as a relation in the outer query, it is equivalent to a view. Both these types of nested subqueries are handled by the optimizer, which transforms them and rewrites the entire query. In the next two subsections, we consider these two variations of query transformation and rewriting with examples. We will call them nested subquery optimization and subquery (view) merging transformation. In Section 19.8, we revisit this topic in the context of data warehouses and illustrate star transformation optimizations.

19.2.2 Nested Subquery Optimization

We discussed nested queries in Section 7.1.2. Consider the query:

SELECT E1.Fname, E1.Lname
FROM EMPLOYEE E1
WHERE E1.Salary = ( SELECT MAX (Salary)
                    FROM EMPLOYEE E2)

In the above nested query, there is a query block inside an outer query block. Evaluation of this query involves executing the nested query first, which yields a single value: the maximum salary M in the EMPLOYEE relation; then the outer block is simply executed with the selection condition Salary = M. The maximum salary could be obtained just from the highest value in the index on Salary (if one exists) or from the catalog if it is up-to-date. The outer query is evaluated based on the same index. If no index exists, then linear search would be needed for both.

We discussed correlated nested SQL queries in Section 7.1.3. In a correlated subquery, the inner query contains a reference to the outer query via one or more variables. The subquery acts as a function that returns a set of values for each value of this variable or combination of variables. Suppose that in the database of Figure 5.5 we modify the DEPARTMENT relation as:

DEPARTMENT (Dnumber, Dname, Mgr_ssn, Mgr_start_date, Zipcode)

Consider the query:

SELECT Fname, Lname, Salary
FROM EMPLOYEE E
WHERE EXISTS ( SELECT *
               FROM DEPARTMENT D
               WHERE D.Dnumber = E.Dno AND D.Zipcode = 30332);

In the above, the nested subquery takes E.Dno, the department where the employee works, as a parameter and returns a true or false value depending on whether the department is located in zip code 30332. The naïve strategy for evaluating the query is to evaluate the inner nested subquery for every tuple of the outer relation, which is inefficient. Wherever possible, the SQL optimizer tries to convert queries with nested subqueries into a join operation. The join can then be evaluated with one of the options we considered in Section 18.4. The above query would be converted to

SELECT Fname, Lname, Salary
FROM EMPLOYEE E, DEPARTMENT D
WHERE D.Dnumber = E.Dno AND D.Zipcode = 30332

The process of removing the nested query and converting the outer and inner query into one block is called unnesting. Here inner join is used, since D.Dnumber is unique and the join is an equi-join; this guarantees that a tuple from relation EMPLOYEE will match with at most one tuple from relation DEPARTMENT. We showed in Chapter 7 that the query Q16, which has a subquery connected with the IN connector, was also unnested into a single block query involving a join. In general, queries involving a nested subquery connected by the IN or ANY connector in SQL can always be converted into a single block query. Other techniques used include creation of temporary result tables from subqueries and using them in joins.
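The benefit of unnesting can be seen even with in-memory stand-ins for the two tables. In this rough Python sketch (the data values are made up), the naive plan re-evaluates the EXISTS subquery once per EMPLOYEE tuple, whereas the unnested form computes the set of qualifying departments once and probes it, much as a hash-based join would.

```python
# Toy tables; only the columns used by the query are included.
EMPLOYEE = [{"Fname": "John", "Lname": "Smith", "Salary": 30000, "Dno": 5},
            {"Fname": "Jennifer", "Lname": "Wallace", "Salary": 43000, "Dno": 4}]
DEPARTMENT = [{"Dnumber": 5, "Zipcode": 30332}, {"Dnumber": 4, "Zipcode": 30022}]

# Naive plan: evaluate the correlated EXISTS subquery for every outer tuple.
naive = [e for e in EMPLOYEE
         if any(d["Dnumber"] == e["Dno"] and d["Zipcode"] == 30332
                for d in DEPARTMENT)]

# Unnested plan: one pass over DEPARTMENT, then one hash probe per EMPLOYEE tuple.
depts_in_30332 = {d["Dnumber"] for d in DEPARTMENT if d["Zipcode"] == 30332}
unnested = [e for e in EMPLOYEE if e["Dno"] in depts_in_30332]

assert naive == unnested   # same answer, far fewer comparisons on large tables
```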
We repeat the example query shown in Section 18.1. (Note that the IN operator is equivalent to the =ANY operator.)

Q(SJ): SELECT COUNT(*)
       FROM DEPARTMENT D
       WHERE D.Dnumber IN ( SELECT E.Dno
                            FROM EMPLOYEE E
                            WHERE E.Salary > 200000)

In this case again, there are two options for the optimizer:

1. Evaluate the nested subquery for each outer tuple; it is inefficient to do so.
2. Unnest the subquery using a semi-join, which is much more efficient than option 1. In Section 18.1, we used this alternative to introduce and define the semi-join operator.

Note that for unnesting this subquery, that is, for expressing it as a single block, inner join cannot be used, since in an inner join a tuple of DEPARTMENT may match more than one tuple of EMPLOYEE and thus produce wrong results. It is easy to see that a nested subquery acts as a filter and thus, unlike an inner join, it cannot produce more rows than there are in the DEPARTMENT table. Semi-join simulates this behavior. The process we described as unnesting is sometimes called decorrelation. We showed another example in Section 18.1 using the connector NOT IN, which was converted into a single block query using the anti-join operation. Optimization of complex nested subqueries is difficult and requires techniques that can be quite involved. We illustrate two such techniques in Section 19.2.3 below. Unnesting is a powerful optimization technique and is used widely by SQL optimizers.

19.2.3 Subquery (View) Merging Transformation

There are instances where a subquery appears in the FROM clause of a query and amounts to including a derived relation, which is similar to a predefined view that is involved in the query. This FROM clause subquery is often referred to as an inline view. Sometimes, an actual view defined earlier as a separate query is used as one of the argument relations in a new query. In such cases, the transformation of the query can be referred to as a view-merging or subquery-merging transformation. The techniques of view merging discussed here apply equally to both inline and predefined views. Consider the following three relations:

EMP (Ssn, Fn, Ln, Dno)
DEPT (Dno, Dname, Dmgrname, Bldg_id)
BLDG (Bldg_id, No_storeys, Addr, Phone)

The meaning of the relations is self-explanatory; the last one represents buildings where departments are located; the phone refers to a phone number for the building lobby.
The following query uses an inline view in the FROM clause; for employees named “John”, it retrieves the last name together with the address and phone number of the building where they work:

SELECT E.Ln, V.Addr, V.Phone
FROM EMP E, ( SELECT D.Dno, D.Dname, B.Addr, B.Phone
              FROM DEPT D, BLDG B
              WHERE D.Bldg_id = B.Bldg_id ) V
WHERE V.Dno = E.Dno AND E.Fn = “John”;

The above query joins the EMP table with a view called V that provides the address and phone of the building where the employee works. In turn, the view joins the two tables DEPT and BLDG. This query may be executed by first temporarily materializing the view and then joining it with the EMP table. The optimizer is then constrained to consider the join order E, V or V, E; and for computing the view, the join orders possible are D, B and B, D. Thus the total number of join order candidates is limited to 4. Also, an index-based join on E, V is precluded since there is no index on V on the join column Dno. The view-merging operation merges the tables in the view with the tables from the outer query block and produces the following query:

SELECT E.Ln, B.Addr, B.Phone
FROM EMP E, DEPT D, BLDG B
WHERE D.Bldg_id = B.Bldg_id AND D.Dno = E.Dno AND E.Fn = “John”;

With the merged query block above, three tables appear in the FROM clause, thus affording six possible join orders, and the indexes on Dno in DEPT and Bldg_id in BLDG can be used for index-based nested-loop joins that were previously excluded. We leave it to the reader to develop execution plans with and without merging to see the comparison. In general, views containing select-project-join operations are considered simple views and they can always be subjected to this type of view merging. Typically, view merging enables additional options to be considered and results in an execution plan that is better than one without view merging. Sometimes other optimizations are enabled, such as dropping a table in the outer query if it is used within the view. View merging may be invalid under certain conditions where the view is more complex and involves DISTINCT, outer join, aggregation, GROUP BY, set operations, and so forth. We next consider a possible situation of GROUP-BY view merging.

GROUP-BY View-Merging: When the view has additional constructs besides select-project-join, as mentioned above, merging the view as shown above may or may not be desirable. Delaying the GROUP BY operation until after the joins have been performed may afford the advantage of reducing the data subjected to grouping in case the joins have low join selectivity. Alternatively, performing the GROUP BY early may be advantageous by reducing the amount of data subjected to subsequent joins. The optimizer would typically consider execution plans with and without merging and
compare their cost to determine the viability of doing the merging. We illustrate with an example. Consider the following relations:

SALES (Custid, Productid, Date, Qty_sold)
CUST (Custid, Custname, Country, Cemail)
PRODUCT (Productid, Pname, Qty_onhand)

The query “List customers from France who have bought more than 50 units of the product ‘Ring_234’” may be set up as follows. A view is created to count the total quantity of any item bought for each <Custid, Productid> pair:

CREATE VIEW CP_BOUGHT_VIEW AS
SELECT SUM (S.Qty_sold) AS Bought, S.Custid, S.Productid
FROM SALES S
GROUP BY S.Custid, S.Productid;

Then the query using this view becomes:

QG: SELECT C.Custid, C.Custname, C.Cemail
    FROM CUST C, PRODUCT P, CP_BOUGHT_VIEW V1
    WHERE P.Productid = V1.Productid AND C.Custid = V1.Custid
          AND V1.Bought > 50 AND Pname = “Ring_234” AND C.Country = “France”;

The view V1 may be evaluated first and its results temporarily materialized; then the query QG may be evaluated using the materialized view as one of the tables in the join. By using the merging transformation, this query becomes:

QT: SELECT C.Custid, C.Custname, C.Cemail
    FROM CUST C, PRODUCT P, SALES S
    WHERE P.Productid = S.Productid AND C.Custid = S.Custid
          AND Pname = “Ring_234” AND C.Country = “France”
    GROUP BY P.Productid, P.rowid, C.rowid, C.Custid, C.Custname, C.Cemail
    HAVING SUM (S.Qty_sold) > 50;

After merging, the resulting query QT is much more efficient and cheaper to execute. The reasoning is as follows. Before merging, the view V1 does grouping on the entire SALES table and materializes the result, and it is expensive to do so. In the transformed query, the grouping is applied to the join of the three tables; in this operation, a single product tuple is involved from the PRODUCT table, thus filtering the data from SALES considerably. The join in QT after transformation may be slightly more expensive in that the whole SALES relation is involved rather than the aggregated view table CP_BOUGHT_VIEW in QG. Note, however, that the GROUP-BY operation in V1 produces a table whose cardinality is not considerably smaller than the cardinality of SALES, because the grouping is on <Custid, Productid>, which may not have high repetition in SALES. Also note the use of P.rowid and C.rowid, which refer to the unique row identifiers that are added to maintain equivalence with the original query. We reiterate that the decision to merge GROUP-BY views must be made by the optimizer based on estimated costs.
19.2 Choice of Query Execution Plans 707 19.2.4 Materialized Views We discussed the concept of views in Section 7.3 and also introduced the concept of materialization of views. A view is defined in the database as a query, and a materialized view stores the results of that query. Using materialized views to avoid some of the computation involved in a query is another query optimiza- tion technique. A materialized view may be stored temporarily to allow more queries to be processed against it or permanently, as is common in data ware- houses (see Chapter 29). A materialized view constitutes derived data because its content can be computed as a result of processing the defining query of the materialized view. The main idea behind materialization is that it is much cheaper to read it when needed and query against it than to recompute it from scratch. The savings can be significant when the view involves costly operations like join, aggregation, and so forth. Consider, for example, view V2 in Section 7.3, which defines the view as a relation by joining the DEPARTMENT and EMPLOYEE relations. For every department, it computes the total number of employees and the total salary paid to employees in that department. If this information is frequently required in reports or queries, this view may be permanently stored. The materialized view may contain data related only to a fragment or sub-expression of the user query. Therefore, an involved algorithm is needed to replace only the relevant fragments of the query with one or more materialized views and compute the rest of the query in a conven- tional way. We also mentioned in Section 7.3 three update (also known as refresh) strategies for updating the view: ■ Immediate update, which updates the view as soon as any of the relations participating in the view are updated ■ Lazy update, which recomputes the view only upon demand ■ Periodic update (or deferred update), which updates the view later, possibly with some regular frequency When immediate update is in force, it constitutes a large amount of overhead to keep the view updated when any of the underlying base relations have a change in the form of insert, delete, and modify. For example, deleting an employee from the database, or changing the salary of an employee, or hiring a new employee affects the tuple corre- sponding to that department in the view and hence would require the view V2 in Section 7.3 to be immediately updated. These updates are handled sometimes manu- ally by programs that update all views defined on top of a base relation whenever the base relation is updated. But there is obviously no guarantee that all views may be accounted for. Triggers (see Section 7.2) that are activated upon an update to the base relation may be used to take action and make appropriate changes to the materialized views. The straightforward and naive approach is to recompute the entire view for every update to any base table and is prohibitively costly. Hence incremental view maintenance is done in most RDBMSs today. We discuss that next. Incremental View Maintenance. The basic idea behind incremental view mainte- nance is that instead of creating the view from scratch, it can be updated incrementally
by accounting for only the changes that occurred since the last time it was created or updated. The trick is in figuring out exactly what the net change to the materialized view is, based on a set of inserted or deleted tuples in the base relation. We describe below the general approaches to incremental view maintenance for views involving join, selection, projection, and a few types of aggregation. To deal with modification, we can treat it as a delete of the old tuple followed by an insert of the new tuple. Assume a view V defined over relations R and S. The respective instances are v, r, and s.

Join: If a view contains the inner join of relations r and s, vold = r ⋈ s, and a new set of tuples ri is inserted in r, then the new value of the view contains (r ∪ ri) ⋈ s. The incremental change to the view can be computed as vnew = (r ⋈ s) ∪ (ri ⋈ s). Similarly, deleting a set of tuples rd from r results in the new view vnew = (r ⋈ s) − (rd ⋈ s). We have similar expressions symmetrically when s undergoes addition or deletion.

Selection: If a view is defined as V = σC(R) with condition C for selection, when a set of tuples ri is inserted into r, the view can be modified as vnew = vold ∪ σC(ri). On the other hand, upon deletion of tuples rd from r, we get vnew = vold − σC(rd).

Projection: Compared to the above strategy, projection requires additional work. Consider the view defined as V = πSex, Salary(R), where R is the EMPLOYEE relation, and suppose the following <Sex, Salary> pairs exist for a Salary of 50,000 in r in three distinct tuples: t5 contains <M, 50000>, t17 contains <M, 50000>, and t23 contains <F, 50000>. The view v therefore contains <M, 50000> and <F, 50000> as two tuples derived from the three tuples of r. If tuple t5 were to be deleted from r, it would have no effect on the view. However, if t23 were to be deleted from r, the tuple <F, 50000> would have to be removed from the view. Similarly, if another new tuple t77 containing <M, 50000> were to be inserted in the relation r, it also would have no effect on the view. Thus, view maintenance of projection views requires a count to be maintained in addition to the actual columns in the view. In the above example, the original count values are 2 for <M, 50000> and 1 for <F, 50000>. Each time an insert to the base relation contributes to the view, the count is incremented; if a deleted tuple from the base relation has been represented in the view, its count is decremented. When the count of a tuple in the view reaches zero, the tuple is actually dropped from the view. When a newly inserted tuple contributes to the view, its count is set to 1. Note that the above discussion assumes that SELECT DISTINCT is being used in defining the view, to correspond to the project (π) operation. If the multiset version of projection is used with no DISTINCT, the counts would still be used. There is an option to display a view tuple as many times as its count in case the view must be displayed as a multiset.

Intersection: If the view is defined as V = R ∩ S, when a new tuple ri is inserted, it is compared against the s relation to see if it is present there. If present, it is inserted in v; otherwise it is not. If tuple rd is deleted, it is matched against the view v and, if present there, it is removed from the view.
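The count-based bookkeeping for projection views can be captured in a few lines. The following Python sketch is a simplified illustration (the dict-of-counts representation is our assumption, not a particular system's): the view keeps one counter per projected tuple, and the deletions of t5 and t23 from the example above behave exactly as described.

```python
# View V = project[Sex, Salary](EMPLOYEE), kept as {(sex, salary): count}.
view = {}

def insert_base(sex, salary):
    key = (sex, salary)
    view[key] = view.get(key, 0) + 1      # new contributor: count starts at 1

def delete_base(sex, salary):
    key = (sex, salary)
    view[key] -= 1
    if view[key] == 0:                    # last contributor gone: drop the row
        del view[key]

# The three base tuples from the text: t5, t17, t23.
for t in [("M", 50000), ("M", 50000), ("F", 50000)]:
    insert_base(*t)

delete_base("M", 50000)   # deleting t5 leaves <M, 50000> in the view (count 1)
delete_base("F", 50000)   # deleting t23 removes <F, 50000> from the view
print(view)               # {('M', 50000): 1}
```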
Aggregation (Group By): For aggregation, let us consider that GROUP BY is used on column G in relation R and the view contains (SELECT G, aggregate-function(A)). The view is the result of some aggregation function applied to attribute A, which corresponds to (see Section 8.4.2):

G ℑ Aggregate-function(A)

We consider a few aggregate functions below:

■ Count: For keeping the count of tuples for each group, if a new tuple is inserted in r, and if it has a value G = g1, and if g1 is present in the view, then its count is incremented by 1. If there is no tuple with the value g1 in the view, then a new tuple <g1, 1> is inserted in the view. When the tuple being deleted has the value G = g1, its count is decremented by 1. If the count of g1 reaches zero after the deletion, that tuple is removed from the view.
■ Sum: Suppose the view contains (G, sum(A)). There is a count maintained for each group in the view. If a tuple is inserted in the relation r and has (g1, x1) under the columns R.G and R.A, and if the view does not have an entry for g1, a new tuple <g1, x1> is inserted in the view and its count is set to 1. If there is already an entry <g1, s1> in the old view, it is modified to <g1, s1 + x1> and its count is incremented by 1. For the deletion from the base relation of a tuple with R.G, R.A being <g1, x1>, if the count of the corresponding group g1 is 1, the tuple for group g1 is removed from the view. If it is present and has a count higher than 1, the count is decremented by 1 and the sum s1 is decremented to s1 − x1.
■ Average: This aggregate function cannot be maintained by itself without maintaining the sum and the count functions and then computing the average as sum divided by count. So both the sum and count functions need to be maintained and incrementally updated as discussed above to compute the new average.
■ Max and Min: We can just consider Max; Min would be handled symmetrically. Again, for each group, the combination (g, max(a), count) is maintained, where max(a) represents the maximum value of R.A in the base relation. If the inserted tuple has an R.A value lower than the current max(a), or if it has a value equal to max(a) in the view, only the count for the group is incremented. If it has a value greater than max(a), the max value in the view is set to the new value and the count is incremented. Upon deletion of a tuple, if its R.A value is less than max(a), only the count is decremented. If the R.A value matches max(a), the count is decremented by 1, but the tuple that represented the max value of A has been deleted, so a new max must be computed for A for that group, which requires a substantial amount of work. If the count becomes 0, that group is removed from the view because the deleted tuple was the last tuple in the group.
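Along the same lines, here is a small Python sketch of incremental maintenance for a GROUP BY view that keeps SUM and COUNT per group and derives AVG from them; the group keys and values are illustrative. MAX and MIN are not shown because, as noted above, deleting the current maximum forces a recomputation over the base data for that group.

```python
# View rows: one entry per group, holding the running SUM(A) and COUNT(*).
groups = {}   # {g: {"sum": ..., "count": ...}}

def insert_base(g, a):
    entry = groups.setdefault(g, {"sum": 0, "count": 0})
    entry["sum"] += a
    entry["count"] += 1

def delete_base(g, a):
    entry = groups[g]
    if entry["count"] == 1:        # last tuple of the group: remove the group
        del groups[g]
    else:
        entry["sum"] -= a
        entry["count"] -= 1

def view_rows():
    # Returns (g, sum, count, avg) for every group currently in the view.
    return [(g, e["sum"], e["count"], e["sum"] / e["count"])
            for g, e in groups.items()]

insert_base("d5", 30000); insert_base("d5", 40000); insert_base("d4", 25000)
delete_base("d5", 30000)
print(view_rows())   # [('d5', 40000, 1, 40000.0), ('d4', 25000, 1, 25000.0)]
```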
We discussed incremental materialization as an optimization technique for maintaining views. However, we can also look upon materialized views as a way to reduce the effort in certain queries. For example, if a query has a component, say, R ⋈ S or πL(R), that is available as a view, then the query may be modified to use the view and avoid doing unnecessary computation. Sometimes an opposite situation happens. A view V is used in the query Q, and that view has been materialized as v; let us say the view includes R ⋈ S; however, no access structures like indexes are available on v. Suppose that indexes are available on certain attributes, say, A of the component relation R, and that the query Q involves a selection condition on A. In such cases, the query against the view can benefit by using the index on the component relation, and the view is replaced by its defining query; the relation representing the materialized view is not used at all.

19.3 Use of Selectivities in Cost-Based Optimization

A query optimizer does not depend solely on heuristic rules or query transformations; it also estimates and compares the costs of executing a query using different execution strategies and algorithms, and it then chooses the strategy with the lowest cost estimate. For this approach to work, accurate cost estimates are required so that different strategies can be compared fairly and realistically. In addition, the optimizer must limit the number of execution strategies to be considered; otherwise, too much time will be spent making cost estimates for the many possible execution strategies. Hence, this approach is more suitable for compiled queries, where the optimization is done at compile time and the resulting execution strategy code is stored and executed directly at runtime, than for ad hoc queries. For interpreted queries, where the entire process shown in Figure 18.1 occurs at runtime, a full-scale optimization may slow down the response time. A more elaborate optimization is indicated for compiled queries, whereas a partial, less time-consuming optimization works best for interpreted queries. This approach is generally referred to as cost-based query optimization.[6] It uses traditional optimization techniques that search the solution space to a problem for a solution that minimizes an objective (cost) function. The cost functions used in query optimization are estimates and not exact cost functions, so the optimization may select a query execution strategy that is not the optimal (absolute best) one.

[6] This approach was first used in the optimizer for SYSTEM R, an experimental DBMS developed at IBM (Selinger et al., 1979).

In Section 19.3.1, we discuss the components of query execution cost. In Section 19.3.2, we discuss the type of information needed in cost functions. This information is kept in the DBMS catalog. In Section 19.3.3, we describe histograms that are used to keep details on the value distributions of important attributes.

The decision-making process during query optimization is nontrivial and has multiple challenges. We can abstract the overall cost-based query optimization approach in the following way:

■ For a given subexpression in the query, there may be multiple equivalence rules that apply. The process of applying equivalences is a cascaded one; it
does not have any limit and there is no definitive convergence. It is difficult to conduct this process in a space-efficient manner.
■ It is necessary to resort to some quantitative measure for evaluation of alternatives. By using the space and time requirements and reducing them to some common metric called cost, it is possible to devise some methodology for optimization.
■ Appropriate search strategies can be designed by keeping the cheapest alternatives and pruning the costlier alternatives.
■ The scope of query optimization is generally a query block. Various table and index access paths, join permutations (orders), join methods, group-by methods, and so on provide the alternatives from which the query optimizer must choose.
■ In a global query optimization, the scope of optimization is multiple query blocks.[7]

[7] We do not discuss global optimization in this sense in the present chapter. Details may be found in Ahmed et al. (2006).

19.3.1 Cost Components for Query Execution

The cost of executing a query includes the following components:

1. Access cost to secondary storage. This is the cost of transferring (reading and writing) data blocks between secondary disk storage and main memory buffers. This is also known as disk I/O (input/output) cost. The cost of searching for records in a disk file depends on the type of access structures on that file, such as ordering, hashing, and primary or secondary indexes. In addition, factors such as whether the file blocks are allocated contiguously on the same disk cylinder or scattered on the disk affect the access cost.
2. Disk storage cost. This is the cost of storing on disk any intermediate files that are generated by an execution strategy for the query.
3. Computation cost. This is the cost of performing in-memory operations on the records within the data buffers during query execution. Such operations include searching for and sorting records, merging records for a join or a sort operation, and performing computations on field values. This is also known as CPU (central processing unit) cost.
4. Memory usage cost. This is the cost pertaining to the number of main memory buffers needed during query execution.
5. Communication cost. This is the cost of shipping the query and its results from the database site to the site or terminal where the query originated. In distributed databases (see Chapter 23), it would also include the cost of transferring tables and results among various computers during query evaluation.
For large databases, the main emphasis is often on minimizing the access cost to secondary storage. Simple cost functions ignore other factors and compare different query execution strategies in terms of the number of block transfers between disk and main memory buffers. For smaller databases, where most of the data in the files involved in the query can be completely stored in memory, the emphasis is on minimizing computation cost. In distributed databases, where many sites are involved (see Chapter 23), communication cost must be minimized. It is difficult to include all the cost components in a (weighted) cost function because of the difficulty of assigning suitable weights to the cost components. This is why some cost functions consider a single factor only: disk access. In the next section, we discuss some of the information that is needed for formulating cost functions.

19.3.2 Catalog Information Used in Cost Functions

To estimate the costs of various execution strategies, we must keep track of any information that is needed for the cost functions. This information may be stored in the DBMS catalog, where it is accessed by the query optimizer. First, we must know the size of each file. For a file whose records are all of the same type, the number of records (tuples) (r), the (average) record size (R), and the number of file blocks (b) (or close estimates of them) are needed. The blocking factor (bfr) for the file may also be needed. These were mentioned in Section 18.3.4, and we utilized them while illustrating the various implementation algorithms for relational operations. We must also keep track of the primary file organization for each file. Under the primary file organization, the records may be unordered, ordered by an attribute with or without a primary or clustering index, or hashed (static hashing or one of the dynamic hashing methods) on a key attribute. Information is also kept on all primary, secondary, or clustering indexes and their indexing attributes. The number of levels (x) of each multilevel index (primary, secondary, or clustering) is needed for cost functions that estimate the number of block accesses that occur during query execution. In some cost functions the number of first-level index blocks (bI1) is needed. Another important parameter is the number of distinct values NDV(A, R) of an attribute A in relation R and the attribute selectivity (sl), which is the fraction of records satisfying an equality condition on the attribute. This allows estimation of the selection cardinality (s = sl * r) of an attribute, which is the average number of records that will satisfy an equality selection condition on that attribute.

Information such as the number of index levels is easy to maintain because it does not change very often. However, other information may change frequently; for example, the number of records r in a file changes every time a record is inserted or deleted. The query optimizer will need reasonably close but not necessarily completely up-to-the-minute values of these parameters for use in estimating the cost of various execution strategies. To help with estimating the size of the results of queries, it is important to have as good an estimate of the distribution of values as possible. To that end, most systems store a histogram.
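Before turning to histograms, here is a tiny Python illustration (the helper names are ours) of how the catalog statistics above feed the simplest estimates: under the uniformity assumption, an equality condition on attribute A has selectivity sl = 1/NDV(A, R) and selection cardinality s = sl * r. The numbers mirror the EMPLOYEE statistics used in the examples of Sections 19.4.1 and 19.5.2.

```python
def equality_selectivity(ndv):           # sl(A) under the uniformity assumption
    return 1 / ndv

def selection_cardinality(r, ndv):       # s = sl * r
    return r * equality_selectivity(ndv)

r_employee = 10000
print(selection_cardinality(r_employee, ndv=125))   # Dno = value  -> 80.0
print(selection_cardinality(r_employee, ndv=2))     # Sex = value  -> 5000.0
```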
19.3.3 Histograms

Histograms are tables or data structures maintained by the DBMS to record information about the distribution of data. It is customary for most RDBMSs to store histograms for most of the important attributes. Without a histogram, the best assumption is that the values of an attribute are uniformly distributed over its range from high to low. Histograms divide the attribute over important ranges (called buckets) and store the total number of records in the relation that belong to each bucket. Sometimes they may also store the number of distinct values in each bucket. An implicit assumption is sometimes made that among the distinct values within a bucket there is a uniform distribution. All these assumptions are oversimplifications that rarely hold, so keeping a histogram with a finer granularity (i.e., a larger number of buckets) is always useful. A couple of variations of histograms are common: in equi-width histograms, the range of values is divided into subranges of equal width; in equi-height histograms, the buckets are formed so that each one contains roughly the same number of records. Equi-height histograms are considered better because they place fewer of the more frequently occurring values in one bucket and more of the less frequently occurring values in another, so the uniform distribution assumption within a bucket holds better. We show an example of a histogram for salary information in a company in Figure 19.4. This histogram divides the salary range into five buckets that may correspond to important subranges that queries are likely to target because they belong to certain types of employees. It is neither an equi-width nor an equi-height histogram.

[Figure 19.4 Histogram of salary in the relation EMPLOYEE. The y-axis shows the number of employees; the x-axis shows five salary buckets: 30k–40k, 40k–70k, 70k–120k, 120k–200k, and 200k–500k.]
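Where the uniform assumption over the whole range is too coarse, a histogram refines the estimate. The Python sketch below builds an equi-height histogram over a made-up salary column and uses it for a range-selectivity estimate, assuming values are spread uniformly inside each bucket; the data, bucket count, and function names are all illustrative.

```python
def equi_height_histogram(values, num_buckets):
    # Sort the column and cut it into buckets of (roughly) equal row counts.
    values = sorted(values)
    per_bucket = len(values) // num_buckets
    boundaries, counts = [], []
    for i in range(num_buckets):
        hi_idx = (i + 1) * per_bucket if i < num_buckets - 1 else len(values)
        chunk = values[i * per_bucket:hi_idx]
        boundaries.append((chunk[0], chunk[-1]))
        counts.append(len(chunk))
    return boundaries, counts

def estimate_le(boundaries, counts, v):
    # Estimated fraction of rows with value <= v (a range selectivity).
    total = sum(counts)
    rows = 0.0
    for (lo, hi), c in zip(boundaries, counts):
        if v >= hi:
            rows += c
        elif v > lo:
            rows += c * (v - lo) / (hi - lo)   # uniform spread within the bucket
    return rows / total

salaries = [30000] * 300 + [50000] * 500 + [90000] * 150 + [250000] * 50
bounds, counts = equi_height_histogram(salaries, num_buckets=4)
print(bounds, counts)
print(estimate_le(bounds, counts, 60000))   # about 0.76; the true fraction is 0.80
```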
19.4 Cost Functions for SELECT Operation

We now provide cost functions for the selection algorithms S1 to S8 discussed in Section 18.3.1 in terms of the number of block transfers between memory and disk. Algorithm S9 involves an intersection of record pointers after they have been retrieved by some other means, such as algorithm S6, and so the cost function will be based on the cost for S6. These cost functions are estimates that ignore computation time, storage cost, and other factors. To reiterate, the following notation is used in the formulas hereafter:

CSi: Cost for method Si in block accesses
rX: Number of records (tuples) in a relation X
bX: Number of blocks occupied by relation X (also referred to as b)
bfrX: Blocking factor (i.e., number of records per block) in relation X
slA: Selectivity of an attribute A for a given condition
sA: Selection cardinality of the attribute being selected (= slA * r)
xA: Number of levels of the index for attribute A
bI1A: Number of first-level blocks of the index on attribute A
NDV (A, X): Number of distinct values of attribute A in relation X

Note: In using the above notation in formulas, we have omitted the relation name or attribute name when it is obvious.

■ S1—Linear search (brute force) approach. We search all the file blocks to retrieve all records satisfying the selection condition; hence, CS1a = b. For an equality condition on a key attribute, only half the file blocks are searched on the average before finding the record, so a rough estimate is CS1b = (b/2) if the record is found; if no record satisfies the condition, CS1b = b.
■ S2—Binary search. This search accesses approximately CS2 = log2b + ⎡(s/bfr)⎤ − 1 file blocks. This reduces to log2b if the equality condition is on a unique (key) attribute, because s = 1 in this case.
■ S3a—Using a primary index to retrieve a single record. For a primary index, retrieve one disk block at each index level, plus one disk block from the data file. Hence, the cost is one more disk block than the number of index levels: CS3a = x + 1.
■ S3b—Using a hash key to retrieve a single record. For hashing, only one disk block needs to be accessed in most cases. The cost function is approximately CS3b = 1 for static hashing or linear hashing, and it is 2 disk block accesses for extendible hashing (see Section 16.8).
■ S4—Using an ordering index to retrieve multiple records. If the comparison condition is >, >=, <, or <= on a key field with an ordering index, roughly half the file records will satisfy the condition. This gives a cost function of CS4 = x + (b/2). This is a very rough estimate, and although it may be correct on the average, it may be inaccurate in individual cases. A more accurate estimate is possible if the distribution of records is stored in a histogram.
■ S5—Using a clustering index to retrieve multiple records. One disk block is accessed at each index level, which gives the address of the first file disk
block in the cluster. Given an equality condition on the indexing attribute, s records will satisfy the condition, where s is the selection cardinality of the indexing attribute. This means that ⎡(s/bfr)⎤ file blocks will be in the cluster of file blocks that hold all the selected records, giving CS5 = x + ⎡(s/bfr)⎤.
■ S6—Using a secondary (B+-tree) index. For a secondary index on a key (unique) attribute, with an equality (i.e., <attribute = value>) selection condition, the cost is x + 1 disk block accesses. For a secondary index on a nonkey (nonunique) attribute, s records will satisfy an equality condition, where s is the selection cardinality of the indexing attribute. However, because the index is nonclustering, each of the records may reside on a different disk block, so the (worst case) cost estimate is CS6a = x + 1 + s. The additional 1 is to account for the disk block that contains the record pointers after the index is searched (see Figure 17.5). For range queries, if the comparison condition is >, >=, <, or <= and half the file records are assumed to satisfy the condition, then (very roughly) half the first-level index blocks are accessed, plus half the file records via the index. The cost estimate for this case is approximately CS6b = x + (bI1/2) + (r/2). The r/2 factor can be refined if better selectivity estimates are available through a histogram. The latter method CS6b can be very costly. For a range condition such as v1 < A < v2, the selection cardinality s must be computed from the histogram or, as a default, under the uniform distribution assumption; then the cost would be computed based on whether A is a key or a nonkey attribute with a B+-tree index on A. (We leave this as an exercise for the reader to compute under the different conditions.)
■ S7—Conjunctive selection. We can use either S1 or one of the methods S2 to S6 discussed above. In the latter case, we use one condition to retrieve the records and then check in the main memory buffers whether each retrieved record satisfies the remaining conditions in the conjunction. If multiple indexes exist, the search of each index can produce a set of record pointers (record ids) in the main memory buffers. The intersection of the sets of record pointers (referred to in S9) can be computed in main memory, and then the resulting records are retrieved based on their record ids.
■ S8—Conjunctive selection using a composite index. Same as S3a, S5, or S6a, depending on the type of index.
■ S9—Selection using a bitmap index. (See Section 17.5.2.) Depending on the nature of the selection, if we can reduce the selection to a set of equality conditions, each equating the attribute with a value (e.g., A = {7, 13, 17, 55}), then a bit vector for each value is accessed, which is r bits or r/8 bytes long. A number of bit vectors may fit in one block. Then, if s records qualify, s blocks are accessed for the data records.
■ S10—Selection using a functional index. (See Section 17.5.3.) This works similarly to S6 except that the index is based on a function of multiple attributes; if that function appears in the query's condition, the corresponding index may be utilized.
Cost-Based Optimization Approach. In a query optimizer, it is common to enumerate the various possible strategies for executing a query and to estimate the costs for different strategies. An optimization technique, such as dynamic programming, may be used to find the optimal (least) cost estimate efficiently without having to consider all possible execution strategies. Dynamic programming is an optimization technique[8] in which subproblems are solved only once. This technique is applicable when a problem may be broken down into subproblems that themselves have subproblems. We will visit the dynamic programming approach when we discuss join ordering in Section 19.5.5. We do not discuss optimization algorithms here; rather, we use a simple example to illustrate how cost estimates may be used.

[8] For a detailed discussion of dynamic programming as a technique of optimization, the reader may consult an algorithms textbook such as Cormen et al. (2003).

19.4.1 Example of Optimization of Selection Based on Cost Formulas

Suppose that the EMPLOYEE file in Figure 5.5 has rE = 10,000 records stored in bE = 2,000 disk blocks with blocking factor bfrE = 5 records/block and the following access paths:

1. A clustering index on Salary, with levels xSalary = 3 and average selection cardinality sSalary = 20. (This corresponds to a selectivity of slSalary = 20/10,000 = 0.002.)
2. A secondary index on the key attribute Ssn, with xSsn = 4 (sSsn = 1, slSsn = 0.0001).
3. A secondary index on the nonkey attribute Dno, with xDno = 2 and first-level index blocks bI1Dno = 4. There are NDV(Dno, EMPLOYEE) = 125 distinct values for Dno, so the selectivity of Dno is slDno = (1/NDV(Dno, EMPLOYEE)) = 0.008, and the selection cardinality is sDno = (rE * slDno) = (rE/NDV(Dno, EMPLOYEE)) = 80.
4. A secondary index on Sex, with xSex = 1. There are NDV(Sex, EMPLOYEE) = 2 values for the Sex attribute, so the average selection cardinality is sSex = (rE/NDV(Sex, EMPLOYEE)) = 5,000. (Note that in this case, a histogram giving the percentage of male and female employees may be useful, unless the percentages are approximately equal.)

We illustrate the use of cost functions with the following examples:

OP1: σSsn=‘123456789’ (EMPLOYEE)
OP2: σDno>5 (EMPLOYEE)
OP3: σDno=5 (EMPLOYEE)
OP4: σDno=5 AND Salary>30000 AND Sex=‘F’ (EMPLOYEE)

The cost of the brute force (linear search or file scan) option S1 will be estimated as CS1a = bE = 2,000 (for a selection on a nonkey attribute) or CS1b = (bE/2) = 1,000
(average cost for a selection on a key attribute). For OP1 we can use either method S1 or method S6a; the cost estimate for S6a is CS6a = xSsn + 1 = 4 + 1 = 5, and it is chosen over method S1, whose average cost is CS1b = 1,000. For OP2 we can use either method S1 (with estimated cost CS1a = 2,000) or method S6b (with estimated cost CS6b = xDno + (bI1Dno/2) + (rE/2) = 2 + (4/2) + (10,000/2) = 5,004), so we choose the linear search approach for OP2. For OP3 we can use either method S1 (with estimated cost CS1a = 2,000) or method S6a (with estimated cost CS6a = xDno + sDno = 2 + 80 = 82), so we choose method S6a.

Finally, consider OP4, which has a conjunctive selection condition. We need to estimate the cost of using any one of the three components of the selection condition to retrieve the records, plus the linear search approach. The latter gives the cost estimate CS1a = 2,000. Using the condition (Dno = 5) first gives the cost estimate CS6a = 82. Using the condition (Salary > 30000) first gives a cost estimate CS4 = xSalary + (bE/2) = 3 + (2,000/2) = 1,003. Using the condition (Sex = ‘F’) first gives a cost estimate CS6a = xSex + sSex = 1 + 5,000 = 5,001. The optimizer would then choose method S6a on the secondary index on Dno because it has the lowest cost estimate. The condition (Dno = 5) is used to retrieve the records, and the remaining part of the conjunctive condition (Salary > 30,000 AND Sex = ‘F’) is checked for each selected record after it is retrieved into memory. Only the records that satisfy these additional conditions are included in the result of the operation.

Consider the Dno = 5 condition in OP3 above; Dno has 125 values and hence a B+-tree index would be appropriate. Instead, if we had an attribute Zipcode in EMPLOYEE, if the condition were Zipcode = 30332, and if we had only five zip codes, bitmap indexing could be used to determine which records qualify. Assuming uniform distribution, sZipcode = 2,000. This would result in a cost of 2,000 for bitmap indexing.
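The comparisons above are easy to reproduce programmatically. The Python sketch below (helper names are ours, not a DBMS API) encodes the S1, S4, and S6 cost formulas and applies them to the EMPLOYEE statistics of this example; note that the S6a formula as stated (x + 1 + s) yields 83 for OP3, whereas the worked example quotes 82 because it does not charge the extra block of record pointers.

```python
b_E, r_E = 2000, 10000             # blocks and records in EMPLOYEE

def cost_linear_search(b, key_attribute=False):            # S1
    return b / 2 if key_attribute else b

def cost_secondary_index_equality(x, s, key=False):        # S6a
    return x + 1 if key else x + 1 + s

def cost_secondary_index_range(x, b_i1, r):                 # S6b (half the file)
    return x + b_i1 / 2 + r / 2

def cost_clustering_index_range(x, b):                      # S4 (half the file)
    return x + b / 2

# OP1: Ssn = '123456789' (secondary index on key Ssn, x = 4)
print(min(cost_linear_search(b_E, key_attribute=True),          # 1,000
          cost_secondary_index_equality(x=4, s=1, key=True)))   # 5 -> S6a wins
# OP2: Dno > 5 (secondary index on Dno, x = 2, bI1 = 4)
print(min(cost_linear_search(b_E),                              # 2,000 -> S1 wins
          cost_secondary_index_range(x=2, b_i1=4, r=r_E)))      # 5,004
# OP3: Dno = 5 (secondary index on Dno, s = 80)
print(min(cost_linear_search(b_E),                              # 2,000
          cost_secondary_index_equality(x=2, s=80)))             # 83 -> S6a wins
# OP4's Salary > 30000 component with the clustering index (x = 3): 1,003
print(cost_clustering_index_range(x=3, b=b_E))
```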
19.5 Cost Functions for the JOIN Operation

To develop reasonably accurate cost functions for JOIN operations, we must have an estimate for the size (number of tuples) of the file that results after the JOIN operation. This is usually kept as the ratio of the size (number of tuples) of the resulting join file to the size of the CARTESIAN PRODUCT file, if both are applied to the same input files, and it is called the join selectivity (js). If we denote the number of tuples of a relation R by |R|, we have:

js = |(R ⋈c S)| / |(R × S)| = |(R ⋈c S)| / (|R| * |S|)

If there is no join condition c, then js = 1 and the join is the same as the CARTESIAN PRODUCT. If no tuples from the relations satisfy the join condition, then js = 0. In general, 0 ≤ js ≤ 1. For a join where the condition c is an equality comparison R.A = S.B, we get the following two special cases:

1. If A is a key of R, then |(R ⋈c S)| ≤ |S|, so js ≤ (1/|R|). This is because each record in file S will be joined with at most one record in file R, since A is a key of R. A special case of this condition is when attribute B is a foreign key of S that references the primary key A of R. In addition, if the foreign key B has the NOT NULL constraint, then js = (1/|R|), and the result file of the join will contain |S| records.
2. If B is a key of S, then |(R ⋈c S)| ≤ |R|, so js ≤ (1/|S|).

Hence a simple formula to use for join selectivity is:

js = 1 / max(NDV(A, R), NDV(B, S))

Having an estimate of the join selectivity for commonly occurring join conditions enables the query optimizer to estimate the size of the resulting file after the join operation, which we call the join cardinality (jc):

jc = |(R ⋈c S)| = js * |R| * |S|

We can now give some sample approximate cost functions for estimating the cost of some of the join algorithms given in Section 18.4. The join operations are of the form:

R ⋈A=B S

where A and B are domain-compatible attributes of R and S, respectively. Assume that R has bR blocks and that S has bS blocks:

■ J1—Nested-loop join. Suppose that we use R for the outer loop; then we get the following cost function to estimate the number of block accesses for this method, assuming three memory buffers. We assume that the blocking factor for the resulting file is bfrRS and that the join selectivity is known:

CJ1 = bR + (bR * bS) + ((js * |R| * |S|)/bfrRS)

The last part of the formula is the cost of writing the resulting file to disk. This cost formula can be modified to take into account different numbers of memory buffers, as presented in Section 19.4. If nB main memory buffer blocks are available to perform the join, the cost formula becomes:

CJ1 = bR + (⎡bR/(nB − 2)⎤ * bS) + ((js * |R| * |S|)/bfrRS)

■ J2—Index-based nested-loop join (using an access structure to retrieve the matching record(s)). If an index exists for the join attribute B of S with index levels xB, we can retrieve each record s in R and then use the index to retrieve all the matching records t from S that satisfy t[B] = s[A]. The cost depends on the type of index. For a secondary index where sB is the selection cardinality for the join attribute B of S,[9] we get:

CJ2a = bR + (|R| * (xB + 1 + sB)) + ((js * |R| * |S|)/bfrRS)

For a clustering index where sB is the selection cardinality of B, we get:

CJ2b = bR + (|R| * (xB + (sB/bfrB))) + ((js * |R| * |S|)/bfrRS)
For a primary index, we get:

CJ2c = bR + (|R| * (xB + 1)) + ((js * |R| * |S|)/bfrRS)

If a hash key exists for one of the two join attributes—say, B of S—we get:

CJ2d = bR + (|R| * h) + ((js * |R| * |S|)/bfrRS)

where h ≥ 1 is the average number of block accesses to retrieve a record, given its hash key value. Usually, h is estimated to be 1 for static and linear hashing and 2 for extendible hashing. This is an optimistic estimate, and typically h ranges from 1.2 to 1.5 in practical situations.

[9] Selection cardinality was defined as the average number of records that satisfy an equality condition on an attribute, which is the average number of records that have the same value for the attribute and hence will be joined to a single record in the other file.

■ J3—Sort-merge join. If the files are already sorted on the join attributes, the cost function for this method is:

CJ3a = bR + bS + ((js * |R| * |S|)/bfrRS)

If we must sort the files, the cost of sorting must be added. We can use the formulas from Section 18.2 to estimate the sorting cost.

■ J4—Partition–hash join (or just hash join). The records of files R and S are partitioned into smaller files. The partitioning of each file is done using the same hashing function h on the join attribute A of R (for partitioning file R) and B of S (for partitioning file S). As we showed in Section 18.4, the cost of this join can be approximated as:

CJ4 = 3 * (bR + bS) + ((js * |R| * |S|)/bfrRS)

19.5.1 Join Selectivity and Cardinality for Semi-Join and Anti-Join

We now consider these two important operations, which are used when unnesting certain queries. In Section 18.1 we showed examples of subqueries that are transformed into these operations. The goal of these operations is to avoid the unnecessary effort of doing exhaustive pairwise matching of two tables based on the join condition. Let us consider the join selectivity and cardinality of these two types of joins.

Semi-Join

SELECT COUNT(*)
FROM T1
WHERE T1.X IN ( SELECT T2.Y
                FROM T2);

Unnesting of the query above leads to a semi-join. (In the following query, the notation “S=” for semi-join is nonstandard.)

SELECT COUNT(*)
FROM T1, T2
WHERE T1.X S= T2.Y;
The join selectivity of the semi-join above is given by:

js = MIN(1, NDV(Y, T2) / NDV(X, T1))

The join cardinality of the semi-join is given by:

jc = |T1| * js

Anti-Join

Consider the following query:

SELECT COUNT(*)
FROM T1
WHERE T1.X NOT IN ( SELECT T2.Y
                    FROM T2);

Unnesting of the query above leads to an anti-join.[10] (In the following query, the notation “A=” for anti-join is nonstandard.)

SELECT COUNT(*)
FROM T1, T2
WHERE T1.X A= T2.Y;

The join selectivity of the anti-join above is given by:

js = 1 − MIN(1, NDV(Y, T2) / NDV(X, T1))

The join cardinality of the anti-join is given by:

jc = |T1| * js

[10] Note that in order for anti-join to be used for the NOT IN subquery, both the join attributes, T1.X and T2.Y, must have non-null values. For a detailed discussion, consult Bellamkonda et al. (2009).

19.5.2 Example of Join Optimization Based on Cost Formulas

Suppose that we have the EMPLOYEE file described in the example in the previous section, and assume that the DEPARTMENT file in Figure 5.5 consists of rD = 125 records stored in bD = 13 disk blocks. Consider the following two join operations:

OP6: EMPLOYEE ⋈Dno=Dnumber DEPARTMENT
OP7: DEPARTMENT ⋈Mgr_ssn=Ssn EMPLOYEE

Suppose that we have a primary index on Dnumber of DEPARTMENT with xDnumber = 1 level and a secondary index on Mgr_ssn of DEPARTMENT with selection cardinality sMgr_ssn = 1 and levels xMgr_ssn = 2. Assume that the join selectivity for OP6 is jsOP6 = (1/|DEPARTMENT|) = 1/125, because Dnumber is a key of DEPARTMENT. (Note that this coincides with our other formula: js = 1/max(NDV(Dno, EMPLOYEE), NDV(Dnumber, DEPARTMENT)) = 1/max(125, 125) = 1/125.) Also assume that the blocking factor for the resulting join file is bfrED = 4 records
19.5 Cost Functions for the JOIN Operation 721
per block. We can estimate the worst-case costs for the JOIN operation OP6 using the applicable methods J1 and J2 as follows:
1. Using method J1 with EMPLOYEE as outer loop:
CJ1 = bE + (bE * bD) + ((jsOP6 * rE * rD)/bfrED)
= 2,000 + (2,000 * 13) + (((1/125) * 10,000 * 125)/4) = 30,500
2. Using method J1 with DEPARTMENT as outer loop:
CJ1 = bD + (bE * bD) + ((jsOP6 * rE * rD)/bfrED)
= 13 + (13 * 2,000) + (((1/125) * 10,000 * 125)/4) = 28,513
3. Using method J2 with EMPLOYEE as outer loop:
CJ2c = bE + (rE * (xDnumber + 1)) + ((jsOP6 * rE * rD)/bfrED)
= 2,000 + (10,000 * 2) + (((1/125) * 10,000 * 125)/4) = 24,500
4. Using method J2 with DEPARTMENT as outer loop:
CJ2a = bD + (rD * (xDno + sDno)) + ((jsOP6 * rE * rD)/bfrED)
= 13 + (125 * (2 + 80)) + (((1/125) * 10,000 * 125)/4) = 12,763
5. Using method J4 gives:
CJ4 = 3 * (bD + bE) + ((jsOP6 * rE * rD)/bfrED)
= 3 * (13 + 2,000) + 2,500 = 8,539
Case 5 has the lowest cost estimate and will be chosen. Notice that in case 2 above, if 15 memory buffer blocks (or more) were available for executing the join instead of just 3, 13 of them could be used to hold the entire DEPARTMENT relation (outer loop relation) in memory, one could be used as buffer for the result, and one would be used to hold one block at a time of the EMPLOYEE file (inner loop file), and the cost for case 2 could be drastically reduced to just bE + bD + ((jsOP6 * rE * rD)/bfrED) or 4,513, as discussed in Section 18.4. If some other number of main memory buffers was available, say nB = 10, then the cost for case 2 would be calculated as follows, which would also give better performance than case 4:
CJ1 = bD + (⎡bD/(nB − 2)⎤ * bE) + ((js * |R| * |S|)/bfrRS)
= 13 + (⎡13/8⎤ * 2,000) + (((1/125) * 10,000 * 125)/4)
= 13 + (2 * 2,000) + 2,500 = 6,513
As an exercise, the reader should perform a similar analysis for OP7.
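These worst-case formulas lend themselves to a quick calculation. The following is a minimal sketch (not part of the text) that recomputes the five cases above, using the same statistics; the variable names mirror the book's notation.

```python
# Worst-case JOIN cost estimates for OP6 (EMPLOYEE joined with DEPARTMENT on Dno = Dnumber),
# recomputing the five cases above from the same statistics.
from math import ceil

bE, rE = 2000, 10000        # EMPLOYEE: blocks, records
bD, rD = 13, 125            # DEPARTMENT: blocks, records
x_Dnumber = 1               # primary index levels on DEPARTMENT.Dnumber
x_Dno, s_Dno = 2, 80        # secondary index on EMPLOYEE.Dno: levels, selection cardinality
bfr_ED = 4                  # blocking factor of the join result file
js = 1 / max(125, 125)      # join selectivity: 1/max(NDV(Dno), NDV(Dnumber)) = 1/125

result_blocks = (js * rE * rD) / bfr_ED            # cost of writing the join result = 2,500

CJ1_E_outer = bE + (bE * bD) + result_blocks       # 30,500
CJ1_D_outer = bD + (bE * bD) + result_blocks       # 28,513
CJ2c = bE + rE * (x_Dnumber + 1) + result_blocks   # 24,500  (primary index on Dnumber)
CJ2a = bD + rD * (x_Dno + s_Dno) + result_blocks   # 12,763  (secondary index on Dno)
CJ4 = 3 * (bD + bE) + result_blocks                # 8,539   (partition-hash join)

# Nested-loop cost with nB = 10 buffers and DEPARTMENT as the outer relation
nB = 10
CJ1_buffered = bD + ceil(bD / (nB - 2)) * bE + result_blocks   # 6,513

for name, cost in [("J1, EMPLOYEE outer", CJ1_E_outer), ("J1, DEPARTMENT outer", CJ1_D_outer),
                   ("J2c, EMPLOYEE outer", CJ2c), ("J2a, DEPARTMENT outer", CJ2a),
                   ("J4, hash join", CJ4), ("J1, 10 buffers", CJ1_buffered)]:
    print(f"{name}: {cost:,.0f} block accesses")
```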
722 Chapter 19 Query Optimization
19.5.3 Multirelation Queries and JOIN Ordering Choices
The algebraic transformation rules in Section 19.1.2 include a commutative rule and an associative rule for the join operation. With these rules, many equivalent join expressions can be produced. As a result, the number of alternative query trees grows very rapidly as the number of joins in a query increases. A query block that joins n relations will often have n − 1 join operations, and hence can have a large number of different join orders. In general, for a query block that has n relations, there are n! join orders; Cartesian products are included in this total number. Estimating the cost of every possible join tree for a query with a large number of joins would require a substantial amount of time from the query optimizer. Hence, some pruning of the possible query trees is needed. Query optimizers typically limit the structure of a (join) query tree to that of left-deep (or right-deep) trees. A left-deep join tree is a binary tree in which the right child of each non-leaf node is always a base relation. The optimizer would choose the particular left-deep join tree with the lowest estimated cost. Two examples of left-deep trees are shown in Figure 19.5(a). (Note that the trees in Figure 19.2 are also left-deep trees.)
A right-deep join tree is a binary tree where the left child of every non-leaf node is a base relation (Figure 19.5(b)). A bushy join tree is a binary tree where the left or right child of an internal node may itself be an internal node. Figure 19.5(b) shows a right-deep join tree whereas Figure 19.5(c) shows a bushy one using four base relations. Most query optimizers consider left-deep join trees as the preferred join tree and then choose one among the n! possible join orderings, where n is the number of relations. We discuss the join ordering issue in more detail in Sections 19.5.4 and 19.5.5.
The left-deep tree has exactly one shape, and the join orders for n tables in a left-deep tree are given by n!. In contrast, the number of shapes of a bushy tree with n leaves, S(n), is given by the following recurrence relation (i.e., recursive function):
S(1) = 1
S(n) = Σ (i = 1 to n − 1) S(i) * S(n − i)
The recursive equation for S(n) can be explained as follows: if the left subtree contains i leaves, for i between 1 and n − 1, it can take S(i) shapes; similarly, the remaining n − i leaves in the right subtree allow S(n − i) shapes. The number of permutations of the bushy trees is given by:
P(n) = n! * S(n) = (2n − 2)!/(n − 1)!
Table 19.1 shows the number of possible left-deep (or right-deep) join trees and bushy join trees for joins of up to seven relations.
Table 19.1 Number of Permutations of Left-Deep and Bushy Join Trees of n Relations
No. of Relations N | No. of Left-Deep Trees N! | No. of Bushy Shapes S(N) | No. of Bushy Trees (2N − 2)!/(N − 1)!
2 | 2 | 1 | 2
3 | 6 | 2 | 12
4 | 24 | 5 | 120
5 | 120 | 14 | 1,680
6 | 720 | 42 | 30,240
7 | 5,040 | 132 | 665,280
It is clear from Table 19.1 that the possible space of alternatives becomes rapidly unmanageable if all possible bushy tree alternatives were to be considered.
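For concreteness, the counts in Table 19.1 can be regenerated directly from the recurrence and the closed form above; the short sketch below is illustrative and not part of the text.

```python
# Counts the join-tree alternatives in Table 19.1: left-deep orders (n!),
# bushy shapes S(n), and total bushy trees n! * S(n) = (2n-2)!/(n-1)!.
from math import factorial

def bushy_shapes(n):
    """S(n): number of binary-tree shapes with n leaves, from the recurrence in the text."""
    S = [0] * (n + 1)
    S[1] = 1
    for m in range(2, n + 1):
        S[m] = sum(S[i] * S[m - i] for i in range(1, m))
    return S[n]

print("N  left-deep  shapes  bushy")
for n in range(2, 8):
    left_deep = factorial(n)
    shapes = bushy_shapes(n)
    bushy = left_deep * shapes
    assert bushy == factorial(2 * n - 2) // factorial(n - 1)   # closed form from the text
    print(f"{n}  {left_deep:9,}  {shapes:6,}  {bushy:,}")
```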
19.5 Cost Functions for the JOIN Operation 723
Figure 19.5 (a) Two left-deep join query trees. (b) A right-deep join query tree. (c) A bushy query tree.
In certain cases like complex versions of snowflake schemas (see Section 29.3), approaches to considering bushy tree alternatives have been proposed.12
With left-deep trees, the right child is considered to be the inner relation when executing a nested-loop join, or the relation whose index is probed when executing an index-based nested-loop join. One advantage of left-deep (or right-deep) trees is that they are amenable to pipelining, as discussed in Section 18.7. For instance, consider the first left-deep tree in Figure 19.5(a) and assume that the join algorithm is the index-based nested-loop method; in this case, a disk page of tuples of the outer relation is used to probe the inner relation for matching tuples. As resulting tuples (records) are produced from the join of R1 and R2, they can be used to probe R3 to locate their matching records for joining. Likewise, as resulting tuples are produced from this join, they could be used to probe R4. Another advantage of left-deep (or right-deep) trees is that having a base relation as one of the inputs of each join allows the optimizer to utilize any access paths on that relation that may be useful in executing the join.
If materialization is used instead of pipelining (see Sections 18.7 and 19.2), the join results could be materialized and stored as temporary relations. The key idea from
12As a representative case for bushy trees, refer to Ahmed et al. (2014).
724 Chapter 19 Query Optimization
the optimizer's standpoint with respect to join ordering is to find an ordering that will reduce the size of the temporary results, since the temporary results (pipelined or materialized) are used by subsequent operators and hence affect the execution cost of those operators.
19.5.4 Physical Optimization
For a given logical query plan based on the heuristics we have been discussing so far, each operation needs a further decision in terms of executing the operation by a specific algorithm at the physical level. This is referred to as physical optimization. If this optimization is based on the relative cost of each possible implementation, we call it cost-based physical optimization. The two sets of approaches to this decision making may be broadly classified as top-down and bottom-up approaches. In the top-down approach, we consider the options for implementing each operation working our way down the tree and choosing the best alternative at each stage. In the bottom-up approach, we consider the operations working up the tree, evaluating options for physical execution, and choosing the best at each stage. Theoretically, both approaches amount to evaluation of the entire space of possible implementation solutions to minimize the cost of evaluation; however, the bottom-up strategy lends itself naturally to pipelining and hence is used in commercial RDBMSs. Among the most important physical decisions is the ordering of join operations, which we will briefly discuss in Section 19.5.5. There are certain heuristics applied at the physical optimization stage that make elaborate cost computations unnecessary. These heuristics include:
■ For selections, use index scans wherever possible.
■ If the selection condition is conjunctive, use the selection that results in the smallest cardinality first.
■ If the relations are already sorted on the attributes being matched in a join, then prefer sort-merge join to other join methods.
■ For union and intersection of more than two relations, use the associative rule; consider the relations in the ascending order of their estimated cardinalities.
■ If one of the arguments in a join has an index on the join attribute, use that as the inner relation.
■ If the left relation is small and the right relation is large and has an index on the joining column, then try index-based nested-loop join.
■ Consider only those join orders where there are no Cartesian products or where all joins appear before Cartesian products.
These are only some of the types of physical-level heuristics used by the optimizer. If the number of relations is small (typically less than 6) and, therefore, the possible implementation options are limited, then most optimizers would elect to apply a cost-based optimization approach directly rather than to explore heuristics.
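A few of the join-related heuristics listed above can be expressed as a simple decision routine. The sketch below is purely illustrative: the Relation fields and the "small relation" threshold are assumptions made for the example, not values from the text.

```python
# Illustrative encoding of a few of the join-method heuristics listed above.
# A real optimizer would consult its catalog statistics instead of these flags.
from dataclasses import dataclass

@dataclass
class Relation:
    name: str
    blocks: int
    sorted_on_join_attr: bool = False
    index_on_join_attr: bool = False

def choose_join_method(left: Relation, right: Relation, small_threshold: int = 100) -> str:
    # Prefer sort-merge join when both inputs are already sorted on the join attribute.
    if left.sorted_on_join_attr and right.sorted_on_join_attr:
        return "sort-merge join"
    # A small left input probing an indexed right input favors index-based nested-loop join.
    if left.blocks <= small_threshold and right.index_on_join_attr:
        return f"index-based nested-loop join (probe index on {right.name})"
    # If one argument has an index on the join attribute, use it as the inner relation.
    if right.index_on_join_attr or left.index_on_join_attr:
        inner = right if right.index_on_join_attr else left
        return f"nested-loop join with {inner.name} as the inner (indexed) relation"
    # Otherwise fall back to a hash join.
    return "hash join"

dept = Relation("DEPARTMENT", blocks=13)
emp = Relation("EMPLOYEE", blocks=2000, index_on_join_attr=True)
print(choose_join_method(dept, emp))   # index-based nested-loop join (probe index on EMPLOYEE)
```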
19.5 Cost Functions for the JOIN Operation 725
19.5.5 Dynamic Programming Approach to Join Ordering
We saw in Section 19.5.3 that there are many possible ways to order n relations in an n-way join. Even for n = 5, which is not uncommon in practical applications, the possible permutations are 120 with left-deep trees and 1,680 with bushy trees. Since bushy trees expand the solution space tremendously, left-deep trees are generally preferred (over both bushy and right-deep trees). They have multiple advantages: First, they work well with the common algorithms for join, including nested-loop, index-based nested-loop, and other one-pass algorithms. Second, they can generate fully pipelined plans (i.e., plans where all joins can be evaluated using pipelining). Note that inner tables must always be materialized because, in the join implementation algorithms, the entire inner table is needed to perform the matching on the join attribute; for this reason, fully pipelined plans are not possible with right-deep trees.
The common approach to evaluating possible permutations of joining relations is a greedy heuristic approach called dynamic programming. Dynamic programming is an optimization technique13 where subproblems are solved only once, and it is applicable when a problem may be broken down into subproblems that themselves have subproblems. A typical dynamic programming algorithm has the following characteristics14:
1. The structure of an optimal solution is developed.
2. The value of the optimal solution is recursively defined.
3. The optimal solution is computed and its value developed in a bottom-up fashion.
Note that the solution developed by this procedure is optimal with respect to the alternatives it considers, not necessarily the absolute optimal solution. To consider how dynamic programming may be applied to the join order selection, consider the problem of ordering a 5-way join of relations r1, r2, r3, r4, r5. This problem has 120 (= 5!) possible left-deep tree solutions. Ideally, the cost of each of them can be estimated and compared and the best one selected. Dynamic programming takes an approach that breaks down this problem to make it more manageable. We know that for three relations, there are only six possible left-deep tree solutions. Note that if all possible bushy tree join solutions were to be evaluated, there would be 12 of them. We can therefore consider the join to be broken down as:
r1 ⋈ r2 ⋈ r3 ⋈ r4 ⋈ r5 = (r1 ⋈ r2 ⋈ r3) ⋈ r4 ⋈ r5
The 6 (= 3!) possible options of (r1 ⋈ r2 ⋈ r3) may then be combined with the 6 possible options of taking the result of the first join, say, temp1, and then considering the next join:
(temp1 ⋈ r4 ⋈ r5)
If we were to consider the 6 options for evaluating temp1 and, for each of them, consider the 6 options of evaluating the second join (temp1 ⋈ r4 ⋈ r5), the possible
13For a detailed discussion of dynamic programming as a technique of optimization, the reader may consult an algorithm textbook such as Corman et al. (2003).
14Based on Chapter 16 in Corman et al. (2003).
726 Chapter 19 Query Optimization
solution space has 6 * 6 = 36 alternatives. This is where dynamic programming can be used to do a sort of greedy optimization. It takes the "optimal" plan for evaluating temp1 and does not revisit that plan. So the solution space now reduces to only 6 options to be considered for the second join. Thus the total number of options considered becomes 6 + 6 instead of 120 (= 5!) in the nonheuristic exhaustive approach.
The order in which the result of a join is generated is also important for finding the best overall order of joins, since it determines whether a sort-merge join can be used for the next join without additional sorting. An ordering that is beneficial for a later join is considered an interesting join order. This approach was first proposed in System R at IBM Research.15 Besides the join attributes of the later join, System R also included grouping attributes of a later GROUP BY or a sort order at the root of the tree among interesting sort orders. For example, in the case we discussed above, the interesting join orders for the temp1 relation will include those that match the join attribute(s) required to join with either r4 or with r5. The dynamic programming algorithm can be extended to consider best join orders for each interesting sort order. The number of subsets of n relations is 2^n (for n = 5 it is 32; n = 10 gives 1,024, which is still manageable), and the number of interesting join orders is small. The complexity of the extended dynamic programming algorithm to determine the optimal left-deep join tree permutation has been shown to be O(3^n).
15See the classic reference in this area by Selinger et al. (1979).
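To make the subset-at-a-time idea concrete, the following is a minimal dynamic-programming sketch for choosing a left-deep join order. It is not from the text: the cardinalities and the uniform selectivity are invented for illustration, and the cost of a join step is simply taken to be the estimated size of its result; a real optimizer would substitute the J1 through J4 cost formulas and track interesting sort orders as described above.

```python
# Dynamic-programming search for a left-deep join order under a toy cost model.
from itertools import combinations

cardinality = {"r1": 1000, "r2": 500, "r3": 2000, "r4": 100, "r5": 50}   # assumed table sizes
selectivity = 0.01   # assumed selectivity of every join predicate (illustrative only)

def result_size(rels):
    """Estimated cardinality of joining all relations in `rels` under the toy model."""
    size = 1.0
    for r in rels:
        size *= cardinality[r]
    return size * (selectivity ** (len(rels) - 1))

# best[subset] = (estimated cost, left-deep join order) for exactly that subset of relations
best = {frozenset([r]): (0.0, (r,)) for r in cardinality}

for k in range(2, len(cardinality) + 1):
    for subset in map(frozenset, combinations(cardinality, k)):
        candidates = []
        for inner in subset:                      # the base relation joined last (right child)
            rest = subset - {inner}
            rest_cost, rest_order = best[rest]
            step_cost = result_size(subset)       # cost of this join step = size of its result
            candidates.append((rest_cost + step_cost, rest_order + (inner,)))
        best[subset] = min(candidates)            # keep only the cheapest plan per subset

cost, order = best[frozenset(cardinality)]
print(" JOIN ".join(order), f"-- estimated cost {cost:,.0f}")
```

Because only the cheapest alternative is retained for each subset of relations, at most 2^n partial plans are kept, instead of enumerating all n! left-deep orders.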
19.6 Example to Illustrate Cost-Based Query Optimization
We will consider query Q2 and its query tree shown in Figure 19.1(a) to illustrate cost-based query optimization:
Q2: SELECT Pnumber, Dnum, Lname, Address, Bdate
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE Dnum=Dnumber AND Mgr_ssn=Ssn AND Plocation=‘Stafford’;
Suppose we have the information about the relations shown in Figure 19.6. The LOW_VALUE and HIGH_VALUE statistics have been normalized for clarity.
Figure 19.6 Sample statistical information for relations in Q2. (a) Column information. (b) Table information. (c) Index information.
(a)
Table_name | Column_name | Num_distinct | Low_value | High_value
PROJECT | Plocation | 200 | 1 | 200
PROJECT | Pnumber | 2000 | 1 | 2000
PROJECT | Dnum | 50 | 1 | 50
DEPARTMENT | Dnumber | 50 | 1 | 50
DEPARTMENT | Mgr_ssn | 50 | 1 | 50
EMPLOYEE | Ssn | 10000 | 1 | 10000
EMPLOYEE | Dno | 50 | 1 | 50
EMPLOYEE | Salary | 500 | 1 | 500
(b)
Table_name | Num_rows | Blocks
PROJECT | 2000 | 100
DEPARTMENT | 50 | 5
EMPLOYEE | 10000 | 2000
(c)
Index_name | Uniqueness | Blevel* | Leaf_blocks | Distinct_keys
PROJ_PLOC | NONUNIQUE | 1 | 4 | 200
EMP_SSN | UNIQUE | 1 | 50 | 10000
EMP_SAL | NONUNIQUE | 1 | 50 | 500
*Blevel is the number of levels without the leaf level.
The tree in Figure 19.1(a) is assumed to represent the result of the algebraic heuristic optimization process and the start of cost-based optimization (in this example, we assume that the heuristic optimizer does not push the projection operations down the tree).
The first cost-based optimization to consider is join ordering. As previously mentioned, we assume the optimizer considers only left-deep trees, so the potential join orders—without CARTESIAN PRODUCT—are:
1. PROJECT ⋈ DEPARTMENT ⋈ EMPLOYEE
2. DEPARTMENT ⋈ PROJECT ⋈ EMPLOYEE
3. DEPARTMENT ⋈ EMPLOYEE ⋈ PROJECT
4. EMPLOYEE ⋈ DEPARTMENT ⋈ PROJECT
Assume that the selection operation has already been applied to the PROJECT relation. If we assume a materialized approach, then a new temporary relation is created after each join operation. To examine the cost of join order (1), the first join is between PROJECT and DEPARTMENT. Both the join method and the access methods for the input relations must be determined. Since DEPARTMENT has no index according to Figure 19.6, the only available access method is a table scan (that is, a linear search). The PROJECT relation will have the selection operation performed before the join, so two options exist—table scan (linear search) or use of the PROJ_PLOC index—so the optimizer must compare the estimated costs of these two options. The statistical information on the PROJ_PLOC index (see Figure 19.6) shows the number of index levels x = 2 (root plus leaf levels). The index is nonunique
728 Chapter 19 Query Optimization
(because Plocation is not a key of PROJECT), so the optimizer assumes a uniform data distribution and estimates the number of record pointers for each Plocation value to be 10. This is computed from the tables in Figure 19.6 by multiplying Selectivity * Num_rows, where Selectivity is estimated by 1/Num_distinct. So the cost of using the index and accessing the records is estimated to be 12 block accesses (2 for the index and 10 for the data blocks). The cost of a table scan is estimated to be 100 block accesses, so the index access is more efficient as expected.
In the materialized approach, a temporary file TEMP1 of size 1 block is created to hold the result of the selection operation. The file size is calculated by determining the blocking factor using the formula Num_rows/Blocks, which gives 2,000/100 or 20 rows per block. Hence, the 10 records selected from the PROJECT relation will fit into a single block. Now we can compute the estimated cost of the first join. We will consider only the nested-loop join method, where the outer relation is the temporary file, TEMP1, and the inner relation is DEPARTMENT. Since the entire TEMP1 file fits in the available buffer space, we need to read each of the DEPARTMENT table's five blocks only once, so the join cost is six block accesses plus the cost of writing the temporary result file, TEMP2. The optimizer would have to determine the size of TEMP2. Since the join attribute Dnumber is the key for DEPARTMENT, any Dnum value from TEMP1 will join with at most one record from DEPARTMENT, so the number of rows in TEMP2 will be equal to the number of rows in TEMP1, which is 10. The optimizer would determine the record size for TEMP2 and the number of blocks needed to store these 10 rows. For brevity, assume that the blocking factor for TEMP2 is five rows per block, so a total of two blocks are needed to store TEMP2.
Finally, the cost of the last join must be estimated. We can use a single-loop join on TEMP2 since in this case the index EMP_SSN (see Figure 19.6) can be used to probe and locate matching records from EMPLOYEE. Hence, the join method would involve reading in each block of TEMP2 and looking up each of the five Mgr_ssn values using the EMP_SSN index. Each index lookup would require a root access, a leaf access, and a data block access (x + 1, where the number of levels x is 2). So, 10 lookups require 30 block accesses. Adding the two block accesses for TEMP2 gives a total of 32 block accesses for this join.
For the final projection, assume pipelining is used to produce the final result, which does not require additional block accesses, so the total cost for join order (1) is estimated as the sum of the previous costs. The optimizer would then estimate costs in a similar manner for the other three join orders and choose the one with the lowest estimate. We leave this as an exercise for the reader.
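The walk-through for join order (1) can be scripted directly from the Figure 19.6 statistics. In the sketch below, the TEMP2 blocking factor of five rows per block is the same working assumption the text makes, and the final total simply adds the three step costs (with no separate charge for writing TEMP1), mirroring the narration.

```python
# Block-access estimates for join order (1), following the steps described above.
from math import ceil

# Step 1: SELECT on PROJECT (Plocation = 'Stafford'): index versus full scan
project_rows, project_blocks = 2000, 100
plocation_ndv, proj_ploc_levels = 200, 2
selected_rows = project_rows // plocation_ndv        # 10 rows, assuming uniform distribution
index_cost = proj_ploc_levels + selected_rows        # 2 + 10 = 12 block accesses
scan_cost = project_blocks                           # 100 block accesses
select_cost = min(index_cost, scan_cost)             # 12 -> the index access wins

# TEMP1 holds the selected PROJECT rows
project_bfr = project_rows // project_blocks         # 20 rows per block
temp1_blocks = ceil(selected_rows / project_bfr)     # 1 block

# Step 2: nested-loop join of TEMP1 (outer) with DEPARTMENT (inner)
department_blocks = 5
temp2_rows = selected_rows                           # Dnumber is a key, at most one match per row
temp2_bfr = 5                                        # assumed blocking factor, as in the text
temp2_blocks = ceil(temp2_rows / temp2_bfr)          # 2 blocks
join1_cost = temp1_blocks + department_blocks + temp2_blocks   # 1 + 5 + 2 = 8

# Step 3: index-based join of TEMP2 with EMPLOYEE via the EMP_SSN index (x = 2 levels)
emp_ssn_levels = 2
join2_cost = temp2_blocks + temp2_rows * (emp_ssn_levels + 1)  # 2 + 10 * 3 = 32

total_cost = select_cost + join1_cost + join2_cost
print(select_cost, join1_cost, join2_cost, total_cost)          # 12 8 32 52
```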
19.7 Additional Issues Related to Query Optimization
In this section, we will discuss a few issues of interest that we have not been able to discuss earlier.
19.7.1 Displaying the System's Query Execution Plan
Most commercial RDBMSs have a provision to display the execution plan produced by the query optimizer so that DBA-level personnel can view such execution plans and try to understand the decisions made by the optimizer.16 The common syntax is some variation of EXPLAIN <query>.
■ Oracle uses EXPLAIN PLAN FOR <SQL Query> The query may involve INSERT, DELETE, and UPDATE statements; the output goes into a table called PLAN_TABLE. An appropriate SQL query is written to read the PLAN_TABLE. Alternatively, Oracle provides two scripts, UTLXPLS.SQL and UTLXPLP.SQL, to display the plan table output for serial and parallel execution, respectively.
■ IBM DB2 uses EXPLAIN PLAN SELECTION [additional options] FOR <SQL-query> There is no plan table. PLAN SELECTION is a command to indicate that the explain tables should be loaded with the explanations during the plan selection phase. The same statement is also used to explain XQUERY statements.
■ SQL SERVER uses SET SHOWPLAN_TEXT ON or SET SHOWPLAN_XML ON or SET SHOWPLAN_ALL ON These statements are issued before the Transact-SQL statement, so the plan output is presented as text, as XML, or in a verbose text form corresponding to the three options.
■ PostgreSQL uses EXPLAIN [set of options] <query>, where the options include ANALYZE, VERBOSE, COSTS, BUFFERS, TIMING, etc.
19.7.2 Size Estimation of Other Operations
In Sections 19.4 and 19.5, we discussed the SELECTION and JOIN operations and size estimation of the query result when the query involves those operations. Here we consider the size estimation of some other operations.
Projection: For projection of the form πList(R), expressed as SELECT <attribute-list> FROM R, since SQL treats the result as a multiset, the estimated number of tuples in the result is |R|. If the DISTINCT option is used, then the size of πA(R) is NDV(A, R).
16We have just illustrated this facility without describing the syntactic details of each system.
730 Chapter 19 Query Optimization
Set Operations: If the arguments for an intersection, union, or set difference are made of selections on the same relation, they can be rewritten as conjunction, disjunction, or negation, respectively. For example, σc1(R) ∩ σc2(R) can be rewritten as σc1 AND c2(R); and σc1(R) ∪ σc2(R) can be rewritten as σc1 OR c2(R). The size estimation can be made based on the selectivity of conditions c1 and c2. Otherwise, the estimated upper bound on the size of R ∩ S is the minimum of the sizes of R and S; the estimated upper bound on the size of R ∪ S is the sum of their sizes.
Aggregation: The size of Gℑ Aggregate-function(A)(R) is NDV(G, R), since there is one group for each unique value of G.
Outer Join: The size of R LEFT OUTER JOIN S would be |R ⋈ S| plus |R anti-join S|. Similarly, the size of R FULL OUTER JOIN S would be |R ⋈ S| plus |R anti-join S| plus |S anti-join R|. We discussed anti-join selectivity estimation in Section 19.5.1.
19.7.3 Plan Caching
In Chapter 2, we referred to parametric users who run the same queries or transactions repeatedly, but each time with a different set of parameters. For example, a bank teller uses an account number and some function code to check the balance in that account. To run such queries or transactions repeatedly, the query optimizer computes the best plan when the query is submitted for the first time and caches the plan for future use. This storing of the plan and reusing it is referred to as plan caching. When the query is resubmitted with different constants as parameters, the same plan is reused with the new parameters. It is conceivable that the plan may need to be modified under certain situations; for example, if the query involves report generation over a range of dates or range of accounts, then, depending on the amount of data involved, different strategies may apply. Under a variation called parametric query optimization, a query is optimized without fixing specific values for its parameters, and the optimizer outputs a number of plans for different possible value sets, all of which are cached. As a query is submitted, the parameters are compared to the ones used for the various plans and the cheapest among the applicable plans is used.
19.7.4 Top-k Results Optimization
When the output of a query is expected to be large, sometimes the user is satisfied with only the top-k results based on some sort order. Some RDBMSs have a LIMIT k clause to restrict the result to that size. Similarly, hints may be specified to inform the optimizer to limit the generation of the result. Trying to generate the entire result and then presenting only the top-k results by sorting is a naive and inefficient strategy. Among the suggested strategies, one uses generation of results in a sorted order so that it can be stopped after k tuples. Other strategies, such as introducing additional selection conditions based on the estimated highest value, have been proposed. Details are beyond our scope here. The reader may consult the bibliographic notes for details.
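To see why avoiding the full sort matters, the toy sketch below contrasts the naive "materialize everything, sort, keep k" approach with a bounded-heap approach that tracks only the current best k rows as results are produced; the data here is randomly generated purely for illustration.

```python
# Naive full sort versus a bounded heap for top-k retrieval.
import heapq
import random

random.seed(7)
rows = [random.random() for _ in range(100_000)]   # stand-in for a query's result stream
k = 10

naive = sorted(rows)[:k]                   # sort the entire result, then keep k
streamed = heapq.nsmallest(k, iter(rows))  # keep only k candidates while scanning the stream

print(naive == streamed)                   # True: same answer, far less sorting work and memory
```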
19.8 An Example of Query Optimization in Data Warehouses 731 19.8 An Example of Query Optimization in Data Warehouses In this section, we introduce another example of query transformation and rewrit- ing as a technique for query optimization. In Section 19.2, we saw examples of query transformation and rewriting. Those examples dealt with nested subqueries and used heuristics rather than cost-based optimization. The subquery (view) merging example we showed can be considered a heuristic transformation; but the group-by view merging uses cost-based optimization as well. In this section, we consider a transformation of star-schema queries in data warehouses based on cost considerations. These queries are commonly used in data warehouse applications that follow the star schema. (See Section 29.3 for a discussion of star schemas.) We will refer to this procedure as star-transformation optimization. The star schema contains a collection of tables; it gets its name because of the schema’s resemblance to a star-like shape whose center contains one or more fact tables (relations) that reference multiple dimension tables (relations). The fact table con- tains information about the relationships (e.g., sales) among the various dimension tables (e.g., customer, part, supplier, channel, year, etc.) and measure columns (e.g., amount_sold, etc.). Consider the representative query called QSTAR given below. Assume that D1, D2, D3 are aliases for the dimension tables DIM1, DIM2, DIM3, whose primary keys are, respectively, D1.Pk, D2.Pk, and D3.Pk. These dimensions have corresponding foreign key attributes in the fact table FACT with alias F— namely, F.Fk1, F.Fk2, F.Fk3—on which joins can be defined. The query creates a grouping on attributes D1.X, D2.Y and produces a sum of the so-called “measure” attribute (see Section 29.3) F.M from the fact table F. There are conditions on attri- butes A, B, C in DIM1, DIM2, DIM3, respectively: Query QSTAR: SELECT D1.X, D2.Y, SUM (F.M) FROM FACT F, DIM1 D1, DIM2 D2, DIM3 D3 WHERE F.Fk1 = D1.Pk and F.Fk2 = D2.Pk and F.Fk3 = D3.Pk and D1.A > 5 and D2.B < 77 and D3.C = 11 GROUP BY D1.X, D2.Y The fact table is generally very large in comparison with the dimension tables. QSTAR is a typical star query, and its fact table tends to be generally very large and joined with several tables of small dimension tables. The query may also contain single-table filter predicates on other columns of the dimension tables, which are generally restrictive. The combination of these filters helps to significantly reduce the data set processed from the fact table (such as D1.A > 5 in the above query). This type of query generally does grouping on columns coming from dimension tables and aggregation on measure columns coming from the fact table. The goal of star-transformation optimization is to access only this reduced set of data from the fact table and avoid using a full table scan on it. Two types of star- transformation optimizations are possible: (A) classic star transformation, and
732 Chapter 19 Query Optimization (B) bitmap index star transformation. Both these optimizations are performed on the basis of comparative costs of the original and the transformed queries. A. Classic Star Transformation In this optimization, a Cartesian product of the dimension tables is per- formed first after applying the filters (such as D1.A > 5 ) to each dimension table. Note that generally there are no join predicates between dimension tables. The result of this Cartesian product is then joined with the fact table using B-tree indexes (if any) on the joining keys of the fact table. B. Bitmap Index Star Transformation The requirement with this optimization is that there must be bitmap17 indexes on the fact-table joining keys referenced in the query. For example, in QSTAR, there must be bitmap indexes (see Section 17.5.2) on FACT.Fk1, FACT.Fk2, and FACT.Fk3 attributes; each bit in the bitmap corresponds to a row in the fact table. The bit is set if the key value of the attribute appears in a row of the fact table. The given query QSTAR is transformed into Q2STAR as shown below. Q2STAR: SELECT D1.X, D2.Y, SUM (F.M) FROM FACT F, DIM1 D1, DIM2 D2 WHERE F.Fk1 = D1.Pk and F.Fk2 = D2.Pk and D1.A > 5 and D2.B < 77 and F.Fk1 IN (SELECT D1.Pk FROM DIM1 D1 WHERE D1.A > 5) AND F.Fk2 IN (SELECT D2.Pk FROM DIM2 D2 WHERE D2.B < 77) AND F.Fk3 IN (SELECT D3.pk FROM DIM3 D3 WHERE D3.C = 11) GROUP BY D1.X, D2.Y; The bitmap star transformation adds subquery predicates corresponding to the dimension tables. Note that the subqueries introduced in Q2STAR may be looked upon as a set membership operation; for example, F.Fk1 IN (5, 9, 12, 13, 29 …). When driven by bitmap AND and OR operations of the key values supplied by the dimension subqueries, only the relevant rows from the fact table need to be retrieved. If the filter predicates on the dimension tables and the intersection of the fact table joining each dimension table filtered out a significant subset of the fact table rows, then this optimization would prove to be much more efficient than a brute force full-table scan of the fact table. 17In some cases, the B-tree index keys can be converted into bitmaps, but we will not discuss this technique here.
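The bitmap AND/OR processing that drives this transformation can be illustrated with a toy sketch. Everything below (the tiny fact-table columns and the dimension key sets) is invented for the example; the point is only the shape of the computation: one bitmap per key value, an OR over the keys returned by each dimension subquery, and an AND across dimensions to isolate the qualifying fact rows.

```python
# Toy illustration of bitmap index star transformation processing.
fact_fk1 = [1, 2, 1, 3, 2, 1]        # FACT.Fk1 value of each fact row (row ids 0..5)
fact_fk2 = [7, 7, 8, 9, 8, 9]        # FACT.Fk2 value of each fact row

def build_bitmaps(column):
    """One bitmap (stored as an int) per distinct key; bit i is set if row i has that key."""
    bitmaps = {}
    for row_id, key in enumerate(column):
        bitmaps[key] = bitmaps.get(key, 0) | (1 << row_id)
    return bitmaps

bitmaps_fk1 = build_bitmaps(fact_fk1)
bitmaps_fk2 = build_bitmaps(fact_fk2)

# Keys returned by the dimension subqueries (e.g., SELECT D1.Pk FROM DIM1 D1 WHERE D1.A > 5)
dim1_keys = {1, 3}
dim2_keys = {7, 9}

# OR the bitmaps within each dimension, then AND across dimensions
dim1_bitmap = 0
for key in dim1_keys:
    dim1_bitmap |= bitmaps_fk1.get(key, 0)
dim2_bitmap = 0
for key in dim2_keys:
    dim2_bitmap |= bitmaps_fk2.get(key, 0)

qualifying = dim1_bitmap & dim2_bitmap
row_ids = [i for i in range(len(fact_fk1)) if (qualifying >> i) & 1]
print(row_ids)   # only these fact rows need to be fetched and joined back
```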
19.9 Overview of Query Optimization in Oracle 733 The following operations are performed in Q2STAR in order to access and join the FACT table. 1. By iterating over the key values coming from a dimension subquery, the bitmaps are retrieved for a given key value from a bitmap index on the FACT table. 2. For a subquery, the bitmaps retrieved for various key values are merged (OR-ed). 3. The merged bitmaps for each dimension subqueries are AND-ed; that is, a conjunction of the joins is performed. 4. From the final bitmap, the corresponding tuple-ids for the FACT table are generated. 5. The FACT table rows are directly retrieved using the tuple-ids. Joining Back: The subquery bitmap trees filter the fact table based on the filter predicates on the dimension tables; therefore, it may still be necessary to join the dimension tables back to the relevant rows in the fact table using the original join predicates. The join back of a dimension table can be avoided if the column(s) selected from the subquery are unique and the columns of the dimension table are not referenced in the SELECT and GROUP-BY clauses. Note that in Q2STAR, the table DIM3 is not joined back to the FACT table, since it is not referenced in the SELECT and GROUP-BY clauses, and DIM3.Pk is unique. 19.9 Overview of Query Optimization in Oracle18 This section provides a broad overview of various features in Oracle query process- ing, including query optimization, execution, and analytics.19 19.9.1 Physical Optimizer The Oracle physical optimizer is cost based and was introduced in Oracle 7.1. The scope of the physical optimizer is a single query block. The physical optimizer examines alternative table and index access paths, operator algorithms, join order- ings, join methods, parallel execution distribution methods, and so on. It chooses the execution plan with the lowest estimated cost. The estimated query cost is a relative number proportional to the expected elapsed time needed to execute the query with the given execution plan. The physical optimizer calculates this cost based on object statistics (such as table cardinalities, number of distinct values in a column, column high and low values, data distribution of column values), the estimated usage of resources (such as I/O and CPU time), and memory needed. Its estimated cost is an internal metric that 18This section is contributed by Rafi Ahmed of Oracle Corporation. 19Support for analytics was introduced in Oracle 10.2.
734 Chapter 19 Query Optimization
Figure 19.7 Cost-based query transformation framework (based on Ahmed et al., 2006). The blocks shown are the parser front-end, heuristic-based transformation, cost-based transformation, physical optimization, the CBQT framework, and execution.
roughly corresponds to the run time and the required resources. The goal of cost-based optimization in Oracle is to find the best trade-off between the lowest run time and the least resource utilization.
19.9.2 Global Query Optimizer
In traditional RDBMSs, query optimization consists of two distinct logical and physical optimization phases. In contrast, Oracle has a global query optimizer, where logical transformation and physical optimization phases have been integrated to generate an optimal execution plan for the entire query tree. The architecture of Oracle query processing is illustrated in Figure 19.7.
Oracle performs a multitude of query transformations, which change and transform the user queries into equivalent but potentially more optimal forms. Transformations can be either heuristic-based or cost-based. The cost-based query transformation (CBQT) framework20 introduced in Oracle 10g provides efficient mechanisms for exploring the state space generated by applying one or more transformations. During cost-based transformation, an SQL statement, which may comprise multiple query blocks, is copied and transformed and its cost is computed using the physical optimizer. This process is repeated multiple times, each time applying a new set of possibly interdependent transformations; and, at the end, one or more transformations are selected and applied to the original SQL statement, if those transformations result in an optimal execution plan. To deal with the combinatorial explosion, the CBQT framework provides efficient strategies for searching the state space of various transformations.
The availability of the general framework for cost-based transformation has made it possible for other innovative transformations to be added to the vast repertoire of
20As presented in Ahmed et al. (2006).
19.9 Overview of Query Optimization in Oracle 735 Oracle’s query transformation techniques. Major among these transformations are group-by and distinct subquery merging (in the FROM clause of the query), sub- query unnesting, predicate move-around, common subexpression elimination, join predicate push down, OR expansion, subquery coalescing, join factorization, subquery removal through window function, star transformation, group-by placement, and bushy join trees.21 The cost-based transformation framework of Oracle 10g is a good example of the sophisticated approach taken to optimize SQL queries. 19.9.3 Adaptive Optimization Oracle’s physical optimizer is adaptive and uses a feedback loop from the execu- tion level to improve on its previous decisions. The optimizer selects the most optimal execution plan for a given SQL statement using the cost model, which relies on object statistics (e.g., number of rows, distribution of column values, etc.) and system statistics (e.g., I/O bandwidth of the storage subsystem). The optimality of the final execution plan depends primarily on the accuracy of the statistics fed into the cost model as well as on the sophistication of the cost model itself. In Oracle, the feedback loop shown in Figure 19.7 establishes a bridge between the execution engine and the physical optimizer. The bridge brings valuable statistical information to enable the physical optimizer to assess the impact of its decisions and make better decisions for the current and future exe- cutions. For example, based on the estimated value of table cardinality, the opti- mizer may choose the index-based nested-loop join method. However, during the execution phase, the actual table cardinality may be detected to diverge sig- nificantly from the estimated value. This information may trigger the physical optimizer to revise its decision and dynamically change the index access join method to the hash join method. 19.9.4 Array Processing One of the critical deficiencies of SQL implementations is its lack of support for N-dimensional array-based computation. Oracle has made extensions for analyt- ics and OLAP features; these extensions have been integrated into the Oracle RDBMS engine.22 We will illustrate the need for OLAP queries when we discuss data warehousing in Chapter 29. These SQL extensions involving array-based computations for complex modeling and optimizations include access structures and execution strategies for processing these computations efficiently. The com- putation clause (details are beyond our scope here) allows the Oracle RDBMS to treat a table as a multidimensional array and specify a set of formulas over it. The formulas replace multiple joins and UNION operations that must be performed for equivalent computation with current ANSI SQL (where ANSI stands for 21More details can be found in Ahmed et al. (2006, 2014). 22See Witkowski et al. (2003) for more details.
736 Chapter 19 Query Optimization American National Standards Institute). The computation clause not only allows for ease of application development but also offers the Oracle RDBMS an opportu- nity to perform better optimization. 19.9.5 Hints An interesting addition to the Oracle query optimizer is the capability for an applica- tion developer to specify hints (also called query annotations or directives in other systems) to the optimizer. Hints are embedded in the text of an SQL statement. Hints are commonly used to address the infrequent cases where the optimizer chooses a suboptimal plan. The idea is that an application developer occasionally might need to override the optimizer decisions based on cost or cardinality mis-estimations. For example, consider the EMPLOYEE table shown in Figure 5.6. The Sex column of that table has only two distinct values. If there are 10,000 employees, then the optimizer, in the absence of a histogram on the Sex column, would estimate that half are male and half are female, assuming a uniform data distribution. If a secondary index exists, it would more than likely not be used. However, if the application developer knows that there are only 100 male employees, a hint could be specified in an SQL query whose WHERE-clause condition is Sex = ‘M’ so that the associated index would be used in processing the query. Various types of hints can be specified for different operations; these hints include but are not limited to the following: ■ The access path for a given table ■ The join order for a query block ■ A particular join method for a join between tables ■ The enabling or disabling of a transformation 19.9.6 Outlines In Oracle RDBMSs, outlines are used to preserve execution plans of SQL state- ments or queries. Outlines are implemented and expressed as a collection of hints, because hints are easily portable and comprehensible. Oracle provides an extensive set of hints that are powerful enough to specify any execution plan, no matter how complex. When an outline is used during the optimization of an SQL statement, these hints are applied at appropriate stages by the optimizer (and other components). Every SQL statement processed by the Oracle optimizer automatically generates an outline that can be displayed with the execution plan. Outlines are used for purposes such as plan stability, what-if analysis, and perfor- mance experiments. 19.9.7 SQL Plan Management Execution plans for SQL statements have a significant impact on the overall perfor- mance of a database system. New optimizer statistics, configuration parameter changes, software updates, introduction of new query optimization and processing techniques, and hardware resource utilizations are among a multitude of factors
19.10 Semantic Query Optimization 737
that may cause the Oracle query optimizer to generate a new execution plan for the same SQL queries or statements. Although most of the changes in the execution plans are beneficial or benign, a few execution plans may turn out to be suboptimal, which can have a negative impact on system performance.
In Oracle 11g, a novel feature called SQL plan management (SPM) was introduced23 for managing execution plans for a set of queries or workloads. SPM provides stable and optimal performance for a set of SQL statements by preventing new suboptimal plans from being executed while allowing other new plans to be executed if they are verifiably better than the previous plans. SPM encapsulates an elaborate mechanism for managing the execution plans of a set of SQL statements, for which the user has enabled SPM. SPM maintains the previous execution plans in the form of stored outlines associated with texts of SQL statements and compares the performances of the old and new execution plans for a given SQL statement before permitting them to be used by the user. SPM can be configured to work automatically, or it can be manually controlled for one or more SQL statements.
19.10 Semantic Query Optimization
A different approach to query optimization, called semantic query optimization, has been suggested. This technique, which may be used in combination with the techniques discussed previously, uses constraints specified on the database schema—such as unique attributes and other more complex constraints—to modify one query into another query that is more efficient to execute. We will not discuss this approach in detail but we will illustrate it with a simple example. Consider the SQL query:
SELECT E.Lname, M.Lname
FROM EMPLOYEE AS E, EMPLOYEE AS M
WHERE E.Super_ssn=M.Ssn AND E.Salary > M.Salary
This query retrieves the names of employees who earn more than their supervisors. Suppose that we had a constraint on the database schema that stated that no employee can earn more than his or her direct supervisor. If the semantic query optimizer checks for the existence of this constraint, it does not need to execute the query because it knows that the result of the query will be empty. This may save considerable time if the constraint checking can be done efficiently. However, searching through many constraints to find those that are applicable to a given query and that may semantically optimize it can also be time-consuming.
Consider another example:
SELECT Lname, Salary
FROM EMPLOYEE, DEPARTMENT
WHERE EMPLOYEE.Dno = DEPARTMENT.Dnumber and EMPLOYEE.Salary>100000
23See Ziauddin et al. (2008).
738 Chapter 19 Query Optimization In this example, the attributes retrieved are only from one relation: EMPLOYEE; the selection condition is also on that one relation. However, there is a referential integ- rity constraint that Employee.Dno is a foreign key that refers to the primary key Department.Dnumber. Therefore, this query can be transformed by removing the DEPARTMENT relation from the query and thus avoiding the inner join as follows: SELECT Lname, Salary FROM EMPLOYEE WHERE EMPLOYEE.Dno IS NOT NULL and EMPLOYEE.Salary>100000 This type of transformation is based on the primary-key/foreign-key relationship semantics, which are a constraint between the two relations. With the inclusion of active rules and additional metadata in database systems (see Chapter 26), semantic query optimization techniques are being gradually incorpo- rated into DBMSs. 19.11 Summary In the previous chapter, we presented the strategies for query processing used by relational DBMSs. We considered algorithms for various standard relational opera- tors, including selection, projection, and join. We also discussed other types of joins, including outer join, semi-join, and anti-join, and we discussed aggregation as well as external sorting. In this chapter, our goal was to focus on query optimiza- tion techniques used by relational DBMSs. In Section 19.1 we introduced the nota- tion for query trees and graphs and described heuristic approaches to query optimization; these approaches use heuristic rules and algebraic techniques to improve the efficiency of query execution. We showed how a query tree that repre- sents a relational algebra expression can be heuristically optimized by reorganizing the tree nodes and transforming the tree into another equivalent query tree that is more efficient to execute. We also gave equivalence-preserving transformation rules and a systematic procedure for applying them to a query tree. In Section 19.2 we described alternative query evaluation plans, including pipelining and material- ized evaluation. Then we introduced the notion of query transformation of SQL queries; this transformation optimizes nested subqueries. We also illustrated with examples of merging subqueries occurring in the FROM clause, which act as derived relations or views. We also discussed the technique of materializing views. We discussed in some detail the cost-based approach to query optimization in Section 19.3. We discussed information maintained in catalogs that the query optimizer consults. We also discussed histograms to maintain distribution of important attributes. We showed how cost functions are developed for some database access algorithms for selection and join in Sections 19.4 and 19.5, respec- tively. We illustrated with an example in Section 19.6 how these cost functions are used to estimate the costs of different execution strategies. A number of addi- tional issues such as display of query plans, size estimation of results, plan cach- ing and top-k results optimization were discussed in Section 19.7. Section 19.8
Review Questions 739 was devoted to a discussion of how typical queries in data warehouses are opti- mized. We gave an example of cost-based query transformation in data ware- house queries on the so-called star schema. In Section 19.9 we presented a detailed overview of the Oracle query optimizer, which uses a number of additional tech- niques, details of which were beyond our scope. Finally, in Section 19.10 we men- tioned the technique of semantic query optimization, which uses the semantics or integrity constraints to simplify the query or completely avoid accessing the data or the actual execution of the query. Review Questions 19.1. What is a query execution plan? 19.2. What is meant by the term heuristic optimization? Discuss the main heuris- tics that are applied during query optimization. 19.3. How does a query tree represent a relational algebra expression? What is meant by an execution of a query tree? Discuss the rules for transformation of query trees, and identify when each rule should be applied during optimi- zation. 19.4. How many different join orders are there for a query that joins 10 relations? How many left-deep trees are possible? 19.5. What is meant by cost-based query optimization? 19.6. What is the optimization approach based on dynamic programming? How is it used during query optimization? 19.7. What are the problems associated with keeping views materialized? 19.8. What is the difference between pipelining and materialization? 19.9. Discuss the cost components for a cost function that is used to estimate query execution cost. Which cost components are used most often as the basis for cost functions? 19.10. Discuss the different types of parameters that are used in cost functions. Where is this information kept? 19.11. What are semi-join and anti-join? What are the join selectivity and join car- dinality parameters associated with them? Provide appropriate formulas. 19.12. List the cost functions for the SELECT and JOIN methods discussed in Sections19.4 and 19.5. 19.13. What are the special features of query optimization in Oracle that we did not discuss in the chapter? 19.14. What is meant by semantic query optimization? How does it differ from other query optimization techniques?
740 Chapter 19 Query Optimization
Exercises
19.15. Develop cost functions for the PROJECT, UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT algorithms discussed in Section 19.4.
19.16. Develop cost functions for an algorithm that consists of two SELECTs, a JOIN, and a final PROJECT, in terms of the cost functions for the individual operations.
19.17. Develop a pseudo-language-style algorithm for describing the dynamic programming procedure for join-order selection.
19.18. Calculate the cost functions for different options of executing the JOIN operation OP7 discussed in Section 19.4.
19.19. Develop formulas for the hybrid hash-join algorithm for calculating the size of the buffer for the first bucket. Develop more accurate cost estimation formulas for the algorithm.
19.20. Estimate the cost of operations OP6 and OP7 using the formulas developed in Exercise 19.19.
19.21. Compare the cost of two different query plans for the following query:
σSalary<40000(EMPLOYEE ⋈Dno=Dnumber DEPARTMENT)
Use the database statistics shown in Figure 19.6.
Selected Bibliography
This bibliography provides literature references for the topics of query processing and optimization. We discussed query processing algorithms and strategies in the previous chapter, but it is difficult to separate the literature that addresses optimization from the literature that addresses query processing strategies and algorithms. Hence, the bibliography is consolidated.
A detailed algorithm for relational algebra optimization is given by Smith and Chang (1975). The Ph.D. thesis of Kooi (1980) provides a foundation for query processing techniques. A survey paper by Jarke and Koch (1984) gives a taxonomy of query optimization and includes a bibliography of work in this area. A survey by Graefe (1993) discusses query execution in database systems and includes an extensive bibliography.
Whang (1985) discusses query optimization in OBE (Office-By-Example), which is a system based on the language QBE. Cost-based optimization was introduced in the SYSTEM R experimental DBMS and is discussed in Astrahan et al. (1976). Selinger et al. (1979) is a classic paper that discussed cost-based optimization of multiway joins in SYSTEM R. Join algorithms are discussed in Gotlieb (1975), Blasgen and Eswaran (1976), and Whang et al. (1982). Hashing algorithms for implementing joins are described and analyzed in DeWitt et al. (1984), Bratbergsengen
Selected Bibliography 741 (1984), Shapiro (1986), Kitsuregawa et al. (1989), and Blakeley and Martin (1990), among others. Blakely et al. (1986) discuss maintenance of materialized views. Chaudhari et al. (1995) discuss optimization of queries with materialized views. Approaches to finding a good join order are presented in Ioannidis and Kang (1990) and in Swami and Gupta (1989). A discussion of the implications of left- deep and bushy join trees is presented in Ioannidis and Kang (1991). Kim (1982) discusses transformations of nested SQL queries into canonical representations. Optimization of aggregate functions is discussed in Klug (1982) and Muralikrishna (1992). Query optimization with Group By is presented in Chaudhari and Shim (1994). Yan and Larson (1995) discuss eager and lazy aggregation. Salzberg et al. (1990) describe a fast external sorting algorithm. Estimating the size of temporary relations is crucial for query optimization. Sampling-based estimation schemes are presented in Haas et al. (1995), Haas and Swami (1995), and Lipton et al. (1990). Having the database system store and use more detailed statistics in the form of histograms is the topic of Muralikrishna and DeWitt (1988) and Poosala et al. (1996). Galindo-Legaria and Joshi (2001) discuss nested subquery and aggregation optimization. O’Neil and Graefe (1995) discuss multi-table joins using bitmap indexes. Kim et al. (1985) discuss advanced topics in query optimization. Semantic query optimization is discussed in King (1981) and Malley and Zdonick (1986). Work on semantic query optimization is reported in Chakravarthy et al. (1990), Shenoy and Ozsoyo- glu (1989), and Siegel et al. (1992). Volcano, a query optimizer based on query equivalence rules, was developed by Graefe and Mckenna (1993). Volcano and the follow-on Cascades approach by Graefe (1995) are the basis for Microsoft’s SQL Server query optimization. Carey and Kossman (1998) and Bruno et al. (2002) pres- ent approaches to query optimization for top-k results. Galindo Legaria et al. (2004) discuss processing and optimizing database updates. Ahmed et al. (2006) discuss cost-based query transformation in Oracle and give a good overview of the global query optimization architecture in Oracle 10g. Ziaud- din et al. (2008) discuss the idea of making the optimizer change the execution plan for a query. They discuss Oracle’s SQL plan management (SPM) feature, which lends stability to performance. Bellamkonda et al. (2009) provide additional tech- niques for query optimization. Ahmed et al. (2014) consider the advantages of bushy trees over alternatives for execution. Witkowski et al. (2003) discuss support for N-dimensional array-based computation for analytics that has been integrated into the Oracle RDBMS engine.
part 9
Transaction Processing, Concurrency Control, and Recovery
chapter 20
Introduction to Transaction Processing Concepts and Theory
The concept of transaction provides a mechanism for describing logical units of database processing. Transaction processing systems are systems with large databases and hundreds of concurrent users executing database transactions. Examples of such systems include airline reservations, banking, credit card processing, online retail purchasing, stock markets, supermarket checkouts, and many other applications. These systems require high availability and fast response time for hundreds of concurrent users. In this chapter, we present the concepts that are needed in transaction processing systems. We define the concept of a transaction, which is used to represent a logical unit of database processing that must be completed in its entirety to ensure correctness. A transaction is typically implemented by a computer program that includes database commands such as retrievals, insertions, deletions, and updates. We introduced some of the basic techniques for database programming in Chapters 10 and 11. In this chapter, we focus on the basic concepts and theory that are needed to ensure the correct executions of transactions. We discuss the concurrency control problem, which occurs when multiple transactions submitted by various users interfere with one another in a way that produces incorrect results. We also discuss the problems that can occur when transactions fail, and how the database system can recover from various types of failures.
This chapter is organized as follows. Section 20.1 informally discusses why concurrency control and recovery are necessary in a database system. Section 20.2 defines the term transaction and discusses additional concepts related to transaction
746 Chapter 20 Introduction to Transaction Processing Concepts and Theory processing in database systems. Section 20.3 presents the important properties of atomicity, consistency preservation, isolation, and durability or permanency— called the ACID properties—that are considered desirable in transaction process- ing systems. Section 20.4 introduces the concept of schedules (or histories) of executing transactions and characterizes the recoverability of schedules. Sec- tion 20.5 discusses the notion of serializability of concurrent transaction execution, which can be used to define correct execution sequences (or schedules) of concur- rent transactions. In Section 20.6, we present some of the commands that support the transaction concept in SQL, and we introduce the concepts of isolation levels. Section 20.7 summarizes the chapter. The two following chapters continue with more details on the actual methods and techniques used to support transaction processing. Chapter 21 gives an overview of the basic concurrency control protocols and Chapter 22 introduces recovery techniques. 20.1 Introduction to Transaction Processing In this section, we discuss the concepts of concurrent execution of transactions and recovery from transaction failures. Section 20.1.1 compares single-user and multi- user database systems and demonstrates how concurrent execution of transactions can take place in multiuser systems. Section 20.1.2 defines the concept of transac- tion and presents a simple model of transaction execution based on read and write database operations. This model is used as the basis for defining and formalizing concurrency control and recovery concepts. Section 20.1.3 uses informal examples to show why concurrency control techniques are needed in multiuser systems. Finally, Section 20.1.4 discusses why techniques are needed to handle recovery from system and transaction failures by discussing the different ways in which transactions can fail while executing. 20.1.1 Single-User versus Multiuser Systems One criterion for classifying a database system is according to the number of users who can use the system concurrently. A DBMS is single-user if at most one user at a time can use the system, and it is multiuser if many users can use the system— and hence access the database—concurrently. Single-user DBMSs are mostly restricted to personal computer systems; most other DBMSs are multiuser. For example, an airline reservations system is used by hundreds of users and travel agents concurrently. Database systems used in banks, insurance agencies, stock exchanges, supermarkets, and many other applications are multiuser systems. In these systems, hundreds or thousands of users are typically operating on the data- base by submitting transactions concurrently to the system. Multiple users can access databases—and use computer systems—simultaneously because of the concept of multiprogramming, which allows the operating system of the computer to execute multiple programs—or processes—at the same time. A single
A single central processing unit (CPU) can execute at most one process at a time. However, multiprogramming operating systems execute some commands from one process, then suspend that process and execute some commands from the next process, and so on. A process is resumed at the point where it was suspended whenever it gets its turn to use the CPU again. Hence, concurrent execution of processes is actually interleaved, as illustrated in Figure 20.1, which shows two processes, A and B, executing concurrently in an interleaved fashion. Interleaving keeps the CPU busy when a process requires an input or output (I/O) operation, such as reading a block from disk; the CPU is switched to execute another process rather than remaining idle during I/O time. Interleaving also prevents a long process from delaying other processes. If the computer system has multiple hardware processors (CPUs), parallel processing of multiple processes is possible, as illustrated by processes C and D in Figure 20.1. Most of the theory concerning concurrency control in databases is developed in terms of interleaved concurrency, so for the remainder of this chapter we assume this model.

Figure 20.1 Interleaved processing versus parallel processing of concurrent transactions.

In a multiuser DBMS, the stored data items are the primary resources that may be accessed concurrently by interactive users or application programs, which are constantly retrieving information from and modifying the database.

20.1.2 Transactions, Database Items, Read and Write Operations, and DBMS Buffers

A transaction is an executing program that forms a logical unit of database processing. A transaction includes one or more database access operations—these can include insertion, deletion, modification (update), or retrieval operations. The database operations that form a transaction can either be embedded within an application program or be specified interactively via a high-level query language such as SQL. One way of specifying the transaction boundaries is by including explicit begin transaction and end transaction statements in an application program; in this case, all database access operations between the two are considered as forming one transaction. A single application program may contain more than one transaction if it contains several transaction boundaries. If the database operations in a transaction do not update the database but only retrieve data, the transaction is called a read-only transaction; otherwise it is known as a read-write transaction.
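To make the notion of transaction boundaries concrete, the following is a minimal, hypothetical sketch of an application program that contains two transactions; the FakeDB class and its begin_transaction, end_transaction, read, and write methods are illustrative assumptions, not the API of any particular DBMS or driver.

    class FakeDB:
        """Minimal in-memory stand-in so the sketch can be run; a real program
        would issue SQL commands against a DBMS rather than update a dict."""
        def __init__(self, items):
            self.items = dict(items)
        def begin_transaction(self):
            pass  # in this sketch the calls only mark transaction boundaries
        def end_transaction(self):
            pass
        def read(self, name):
            return self.items[name]
        def write(self, name, value):
            self.items[name] = value

    def run_program(db, acct_a, acct_b, amount):
        db.begin_transaction()                  # transaction 1 starts
        balance = db.read(acct_a)
        db.write(acct_a, balance - amount)
        balance = db.read(acct_b)
        db.write(acct_b, balance + amount)
        db.end_transaction()                    # transaction 1 ends (read-write)

        db.begin_transaction()                  # transaction 2 starts
        total = db.read(acct_a) + db.read(acct_b)
        db.end_transaction()                    # transaction 2 ends (read-only)
        return total

    print(run_program(FakeDB({"A": 100, "B": 50}), "A", "B", 25))  # prints 150

The single program above thus contains two transactions: a read-write transaction that transfers an amount between two items, followed by a read-only transaction that merely retrieves them.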
The database model that is used to present transaction processing concepts is simple when compared to the data models that we discussed earlier in the book, such as the relational model or the object model. A database is basically represented as a collection of named data items. The size of a data item is called its granularity. A data item can be a database record, but it can also be a larger unit such as a whole disk block, or even a smaller unit such as an individual field (attribute) value of some record in the database. The transaction processing concepts we discuss are independent of the data item granularity (size) and apply to data items in general. Each data item has a unique name, but this name is not typically used by the programmer; rather, it is just a means to uniquely identify each data item. For example, if the data item granularity is one disk block, then the disk block address can be used as the data item name. If the item granularity is a single record, then the record id can be the item name.

Using this simplified database model, the basic database access operations that a transaction can include are as follows:

■ read_item(X). Reads a database item named X into a program variable. To simplify our notation, we assume that the program variable is also named X.
■ write_item(X). Writes the value of program variable X into the database item named X.

As we discussed in Chapter 16, the basic unit of data transfer from disk to main memory is one disk page (disk block). Executing a read_item(X) command includes the following steps:

1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer). The size of the buffer is the same as the disk block size.
3. Copy item X from the buffer to the program variable named X.

Executing a write_item(X) command includes the following steps:

1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer).
3. Copy item X from the program variable named X into its correct location in the buffer.
4. Store the updated disk block from the buffer back to disk (either immediately or at some later point in time).

It is step 4 that actually updates the database on disk. Sometimes the buffer is not immediately stored to disk, in case additional changes are to be made to the buffer. Usually, the decision about when to store a modified disk block whose contents are in a main memory buffer is handled by the recovery manager of the DBMS in cooperation with the underlying operating system.
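As a concrete illustration of these steps, here is a minimal sketch in Python; the names DISK, CATALOG, and BufferPool are illustrative assumptions rather than the book's notation, and each disk block is modeled as a small dictionary of item names and values.

    # Fake "disk": block address -> block contents; the catalog maps each item
    # name to the address of the block that holds it (step 1 of both operations).
    DISK = {0: {"X": 80, "Y": 15}}
    CATALOG = {"X": 0, "Y": 0}

    class BufferPool:
        def __init__(self):
            self.buffers = {}                  # block address -> in-memory copy

        def fetch_block(self, addr):
            # Step 2: copy the disk block into a buffer only if not already cached.
            if addr not in self.buffers:
                self.buffers[addr] = dict(DISK[addr])
            return self.buffers[addr]

        def flush_block(self, addr):
            # Step 4 of write_item: store the updated block back to "disk".
            DISK[addr] = dict(self.buffers[addr])

    def read_item(pool, name):
        addr = CATALOG[name]                   # step 1
        block = pool.fetch_block(addr)         # step 2
        return block[name]                     # step 3: copy X into the program variable

    def write_item(pool, name, value, flush_now=False):
        addr = CATALOG[name]                   # step 1
        block = pool.fetch_block(addr)         # step 2
        block[name] = value                    # step 3: update X in the buffer
        if flush_now:
            pool.flush_block(addr)             # step 4 (may instead be deferred)

In this sketch, calling write_item with flush_now left as False mirrors the case where step 4 is deferred: the modified block stays in the buffer until it is eventually written back.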
The DBMS will maintain in the database cache a number of data buffers in main memory. Each buffer typically holds the contents of one database disk block, which contains some of the database items being processed. When these buffers are all occupied and additional database disk blocks must be copied into memory, some buffer replacement policy is used to choose which of the currently occupied buffers will be replaced. One commonly used buffer replacement policy is LRU (least recently used). If the chosen buffer has been modified, it must be written back to disk before it is reused.1 There are also buffer replacement policies that are specific to DBMS characteristics; we briefly discuss a few of these in Section 20.2.4.

A transaction includes read_item and write_item operations to access and update the database. Figure 20.2 shows examples of two very simple transactions. The read-set of a transaction is the set of all items that the transaction reads, and the write-set is the set of all items that the transaction writes. For example, the read-set of T1 in Figure 20.2 is {X, Y} and its write-set is also {X, Y}.

Concurrency control and recovery mechanisms are mainly concerned with the database commands in a transaction. Transactions submitted by the various users may execute concurrently and may access and update the same database items. If this concurrent execution is uncontrolled, it may lead to problems, such as an inconsistent database. In the next section, we informally introduce some of the problems that may occur.

20.1.3 Why Concurrency Control Is Needed

Several problems can occur when concurrent transactions execute in an uncontrolled manner. We illustrate some of these problems by referring to a much simplified airline reservations database in which a record is stored for each airline flight. Each record includes the number of reserved seats on that flight as a named (uniquely identifiable) data item, among other information. Figure 20.2(a) shows a transaction T1 that transfers N reservations from one flight whose number of reserved seats is stored in the database item named X to another flight whose number of reserved seats is stored in the database item named Y. Figure 20.2(b) shows a simpler transaction T2 that just reserves M seats on the first flight (X) referenced in transaction T1.2 To simplify our example, we do not show additional portions of the transactions, such as checking whether a flight has enough seats available before reserving additional seats.

When a database access program is written, it has the flight number, the flight date, and the number of seats to be booked as parameters; hence, the same program can be used to execute many different transactions, each with a different flight number, date, and number of seats to be booked. For concurrency control purposes, a transaction is a particular execution of a program on a specific date, flight, and number of seats.

1We will not discuss general-purpose buffer replacement policies here because they are typically discussed in operating systems texts.
2A similar, more commonly used example assumes a bank database, with one transaction doing a transfer of funds from account X to account Y and the other transaction doing a deposit to account X.
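Although Figure 20.2 itself is not reproduced here, the transactions it depicts can be expressed with the read_item and write_item operations; the sketch below reuses the helpers from the earlier BufferPool example, and the operation sequences are inferred from the description of T1 and T2 above rather than copied from the figure.

    # Hypothetical renderings of T1 and T2 as described in the text.
    # N and M are the parameters supplied when the reservation program is run.

    def T1(pool, N):
        # Transfer N reservations from the flight whose seat count is item X
        # to the flight whose seat count is item Y.
        X = read_item(pool, "X")
        write_item(pool, "X", X - N)
        Y = read_item(pool, "Y")
        write_item(pool, "Y", Y + N)

    def T2(pool, M):
        # Reserve M seats on the flight whose seat count is item X.
        X = read_item(pool, "X")
        write_item(pool, "X", X + M)

    pool = BufferPool()
    T1(pool, N=5)     # executed one after the other here, i.e., serially
    T2(pool, M=4)

The read-set of T1 in this sketch is {X, Y} and its write-set is also {X, Y}, as noted above; if T1 and T2 were instead interleaved in an uncontrolled way, one of the updates to item X could be lost, which is the kind of problem introduced in the next section.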