Database Management Systems
Fall 2001
CMPUT 391: Query Processing & Optimization
Dr. Osmar R. Zaïane
University of Alberta
Chapters 12, 13 & 16 of Textbook

• Introduction
• Database Design Theory
• Query Processing and Optimisation
• Concurrency Control
• Data Base Recovery and Security
• Object-Oriented Databases
• Inverted Index for IR
• XML
• Data Warehousing
• Data Mining
• Parallel and Distributed Databases
• Other Advanced Database Topics
Dr. Osmar R. Zaïane, 2001 Database Management Systems University of Alberta 1 Dr. Osmar R. Zaïane, 2001 Database Management Systems University of Alberta 2
Overview of Query Processing
• The aim is to transform a query in a high-level declarative language (SQL) into a correct and efficient execution strategy
• Query Decomposition
– Analysis
– Conjunctive and disjunctive normalization
– Semantic analysis
• Query Optimization
• Query Evaluation (Execution)

The Need for Optimization
Consider:
SELECT name, address
FROM Customer, Account
WHERE Customer.name = Account.name
AND Balance > 2000
There are different possibilities for execution:
πC.name,C.address(σC.name=A.name ∧ A.balance>2000(C × A))
πC.name,C.address(σC.name=A.name(C × σA.balance>2000(A)))
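A minimal Python sketch of why the choice between the two execution strategies matters. It counts the intermediate tuples each strategy builds for the Customer/Account query. The toy data (100 customers, balances of 1000 + 50·i) and the function names are invented for illustration; only the relation and attribute names come from the slide.

```python
from itertools import product

# Hypothetical toy data: Customer(name, address) and Account(name, balance).
customers = [(f"c{i}", f"addr{i}") for i in range(100)]
accounts = [(f"c{i}", 1000 + 50 * i) for i in range(100)]

def plan_select_after_product(C, A):
    """Strategy 1: build C x A, then apply both conditions."""
    cross = list(product(C, A))  # full Cartesian product
    result = [(c[0], c[1]) for c, a in cross
              if c[0] == a[0] and a[1] > 2000]
    return result, len(cross)

def plan_push_selection(C, A):
    """Strategy 2: filter Account on balance first, then combine."""
    rich = [a for a in A if a[1] > 2000]  # selection applied before the product
    cross = list(product(C, rich))
    result = [(c[0], c[1]) for c, a in cross if c[0] == a[0]]
    return result, len(cross)

r1, n1 = plan_select_after_product(customers, accounts)
r2, n2 = plan_push_selection(customers, accounts)
print("intermediate tuples:", n1, "vs", n2)
print("same answer:", sorted(r1) == sorted(r2))
```

Both strategies return the same answer, but the second builds a smaller intermediate result. How much smaller depends on how selective the balance condition is.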
System Catalog Information
• For each relation
– Relation name, file name, file structure
– Attribute name and type for all attributes
– Index name for all indexes on the relation
– Integrity constraints on the relation
• For each index
– Index name and structure
– Search key attributes
• For each view
– View name and definition

Statistics Stored
• Cardinality (NTuples(R)): number of tuples in each relation
• Size (NPages(R)): number of pages for each relation
• Index Cardinality (NKeys(I)): number of distinct key values
• Index Size (INPages(I)): number of pages for each index
• Index Height (IHeight(I)): number of nonleaf levels for each tree index
• Index Range: minimum (ILow(I)) and maximum (IHigh(I)) key values present for each index
• Catalogs updated periodically.
– Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok.
• More detailed information (e.g., histograms of the values in some field, or attribute weight, etc.) is sometimes stored.
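One way to picture the stored statistics is as plain records that an optimizer reads. This is only an illustrative sketch: the class and field names are hypothetical and do not describe any real system catalog layout. The Sailors figures (80 tuples per page, 500 pages) are the ones used in the join examples of these slides; the index values are invented.

```python
from dataclasses import dataclass

@dataclass
class RelationStats:
    name: str
    ntuples: int  # NTuples(R): number of tuples
    npages: int   # NPages(R): number of pages

@dataclass
class IndexStats:
    name: str
    nkeys: int    # NKeys(I): number of distinct key values
    inpages: int  # INPages(I): number of index pages
    iheight: int  # IHeight(I): nonleaf levels of a tree index
    ilow: float   # ILow(I): smallest key value present
    ihigh: float  # IHigh(I): largest key value present

# Sailors as in the example schema: 80 tuples per page, 500 pages.
sailors = RelationStats(name="Sailors", ntuples=80 * 500, npages=500)

# Hypothetical tree index on Sailors.rating (all values invented).
rating_idx = IndexStats(name="sailors_rating", nkeys=10, inpages=50,
                        iheight=2, ilow=1, ihigh=10)

print(sailors.ntuples // sailors.npages, "tuples per page")
```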
Evaluating the Join
• Simple Nested Loop Join
• Block Nested Loop Join
• Index Nested Loop Join
• Sort-Merge Join
• Hash Join
R ⋈ S is very common ⇒ must be carefully optimized.
R × S is large; so, R × S followed by a selection is inefficient.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
...
... each page of S, and write out matching pairs of tuples <r, s>, where r is in R-page and s is in S-page.
– Cost: M + M*N = 1000 + 1000*500 I/Os (≈ 501 × 10^3).
... add <r, s> to result. Then read next R-block, scan S, etc.
... Sailors can be sorted in 2 passes; total join cost: 7500 I/Os. However, a BNL join could need fewer I/Os with 100 buffers.
... Phase 2: read partitions once, M+N. Total: 3(M+N).
[Figures omitted: buffer diagram (input buffer for S, output buffer); sample Sailors/Reserves tuples; hash-join partitioning of the relations with a hash function, from disk through B main memory buffers back to disk.]
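The I/O figures quoted above can be checked with a short Python sketch, using M = 1000 pages (Reserves) and N = 500 pages (Sailors). Two things are assumptions rather than statements from the slides. The block nested loops formula (outer relation read in blocks of B − 2 pages) is the usual textbook form. The sort-merge calculation assumes Reserves, like Sailors, sorts in 2 passes, which is consistent with the 7500 total.

```python
import math

M, N = 1000, 500  # pages of Reserves (outer, R) and Sailors (inner, S)

def page_nested_loops(m, n):
    # Read every R page once; for each R page, scan all of S.
    return m + m * n

def block_nested_loops(m, n, buffers):
    # Assumed formula: outer read in blocks of (buffers - 2) pages,
    # with one buffer kept for S input and one for output.
    blocks = math.ceil(m / (buffers - 2))
    return m + blocks * n

def sort_merge(m, n, passes_m=2, passes_n=2):
    # Each sort pass reads and writes every page; the merge reads both once.
    return 2 * passes_m * m + 2 * passes_n * n + (m + n)

def hash_join(m, n):
    # Partitioning reads and writes both relations; probing reads them once.
    return 3 * (m + n)

print("page nested loops :", page_nested_loops(M, N))
print("block nested loops:", block_nested_loops(M, N, buffers=100))
print("sort-merge        :", sort_merge(M, N))
print("hash join         :", hash_join(M, N))
```

Under these assumptions, block nested loops with 100 buffers comes out below the sort-merge total, in line with the remark on the slide.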
Schema for Examples
...
Motivating Example
SELECT S.sname
... bid=100 ... rating > 5
Cost Estimation
• For each plan considered, must estimate cost:
– Must estimate cost of each operation in plan tree.
• Depends on input cardinalities.
• We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.)
– Must estimate size of result for each operation in tree!
• Use information about the input relations.
• For selections and joins, assume independence of predicates.
• The System R cost estimation approach.
– Very inexact, but works OK in practice.
– More sophisticated techniques known now.
• Query plans are either estimated at run time, or estimated once, with the selected plan stored and revisited for re-evaluation.

Size Estimation and Reduction Factors
• Consider a query block:
SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk
• Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause.
• Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Result cardinality = Max # tuples * product of all RF’s.
– Implicit assumption that terms are independent!
– Term col=value has RF 1/NKeys(I), given index I on col
– Term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
– Term col>value has RF (High(I)-value)/(High(I)-Low(I))
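The reduction-factor rules can be sketched directly in Python. The three formulas are the ones listed on the slide. The function names and all statistics in the example are hypothetical: 100 distinct bid values, ratings ranging from 1 to 10, and sid indexes with 40,000 keys on Sailors and 30,000 on Reserves. The relation cardinalities follow the example schema (100 × 1000 Reserves tuples, 80 × 500 Sailors tuples).

```python
import math

def rf_equals_value(nkeys):
    # Term col = value, given an index I on col: 1 / NKeys(I)
    return 1 / nkeys

def rf_equals_column(nkeys1, nkeys2):
    # Term col1 = col2: 1 / MAX(NKeys(I1), NKeys(I2))
    return 1 / max(nkeys1, nkeys2)

def rf_greater_than(value, low, high):
    # Term col > value: (High(I) - value) / (High(I) - Low(I))
    return (high - value) / (high - low)

def estimate_cardinality(cardinalities, reduction_factors):
    # Result cardinality = max # tuples * product of all RFs,
    # assuming the terms are independent.
    return math.prod(cardinalities) * math.prod(reduction_factors)

# Hypothetical query: R.sid = S.sid AND R.bid = 100 AND S.rating > 5
reserves_tuples, sailors_tuples = 100 * 1000, 80 * 500
rfs = [
    rf_equals_column(30_000, 40_000),       # R.sid = S.sid (invented NKeys)
    rf_equals_value(100),                   # R.bid = 100 (invented NKeys)
    rf_greater_than(5, low=1, high=10),     # S.rating > 5 (invented range)
]
estimate = estimate_cardinality([reserves_tuples, sailors_tuples], rfs)
print(round(estimate), "tuples estimated")
```

Because the estimate multiplies the factors, any correlation between the terms (which the independence assumption ignores) translates directly into estimation error.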
Overview
• After ER design, schema refinement, and the definition of views, we have the conceptual and external schemas for our database.
• The next step is to choose indexes, make clustering decisions, and to refine the conceptual and external schemas (if necessary) to meet performance goals.
• We must begin by understanding the workload:
– The most important queries and how often they arise.
– The most important updates and how often they arise.
– The desired performance for these queries and updates.

Understanding the Workload
• For each query in the workload:
– Which relations does it access?
– Which attributes are retrieved?
– Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
• For each update in the workload:
– Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
– The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.
Decisions to Make
• What indexes should we create?
– Which relations should have indexes? What field(s) should be the search key? Should we build several indexes?
• For each index, what kind of an index should it be?
– Clustered? Hash/tree? Dynamic/static? Dense/sparse?
• Should we make changes to the conceptual schema?
– Consider alternative normalized schemas? (Remember, there are many choices in decomposing into BCNF, etc.)
– Should we "undo" some decomposition steps and settle for a lower normal form? (Denormalization.)
– Horizontal partitioning, replication, views ...

Choice of Indexes
• One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
• Before creating an index, must also consider the impact on updates in the workload!
– Trade-off: indexes can make queries go faster, updates slower. Require disk space, too.
Issues to Consider in Index Selection
• Attributes mentioned in a WHERE clause are candidates for index search keys.
– Exact match condition suggests hash index.
– Range query suggests tree index.
• Clustering is especially useful for range queries, although it can help on equality queries as well in the presence of duplicates.
• Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.

Issues in Index Selection (Contd.)
• Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
– If range selections are involved, order of attributes should be carefully chosen to match the range ordering.
– Such indexes can sometimes enable index-only strategies for important queries. (no need to access the relation)
• For index-only strategies, clustering is not important!
• When considering a join condition:
– Hash index on inner is very good for Index Nested Loops.
• Should be clustered if join column is not key for inner, and inner tuples need to be retrieved.
– Clustered B+ tree on join column(s) good for Sort-Merge.
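As an illustration of an index-only strategy enabled by a multi-attribute search key, the sketch below uses SQLite through Python's `sqlite3` module. SQLite is not the system discussed in these slides and is used here only because it is easy to run. The table layout, the index name, and the data are all hypothetical. SQLite calls an index that answers a query without touching the table a "covering index".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emp (eid INTEGER PRIMARY KEY, dno INTEGER, sal REAL, age INTEGER)"
)
# Multi-attribute search key <dno, sal>: it contains every column the query needs.
conn.execute("CREATE INDEX emp_dno_sal ON Emp (dno, sal)")
conn.executemany(
    "INSERT INTO Emp (dno, sal, age) VALUES (?, ?, ?)",
    [(i % 5, 1000.0 + i, 20 + i % 30) for i in range(200)],
)

# Ask SQLite how it would evaluate a query that uses only dno and sal.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT AVG(sal) FROM Emp WHERE dno = 3"
).fetchall()
details = [row[-1] for row in plan]  # last column holds the plan description
for line in details:
    print(line)
```

If the plan description mentions a covering index, the query is answered from the index alone. Adding `age` to the SELECT list would force a lookup in the relation itself.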
Multi-Attribute Index Keys
...
Index-Only Plans
<E.dno>
SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno=E.dno
More Guidelines for Query Tuning
• Minimize the use of DISTINCT: don’t need it if duplicates are acceptable, or if answer contains a key.
• Minimize the use of GROUP BY and HAVING:
SELECT MIN (E.age)
FROM Employee E
GROUP BY E.dno
HAVING E.dno=102
vs.
SELECT MIN (E.age)
FROM Employee E
WHERE E.dno=102
❖ Consider DBMS use of index when writing arithmetic expressions: E.age=2*D.age will benefit from index on E.age, but might not benefit from index on D.age!

Guidelines for Query Tuning (Contd.)
• Avoid using intermediate relations:
SELECT * INTO Temp
FROM Emp E, Dept D
WHERE E.dno=D.dno
AND D.mgrname='Joe'
and
SELECT T.dno, AVG(T.sal)
FROM Temp T
GROUP BY T.dno
vs.
SELECT E.dno, AVG(E.sal)
FROM Emp E, Dept D
WHERE E.dno=D.dno
AND D.mgrname='Joe'
GROUP BY E.dno
❖ Does not materialize the intermediate reln Temp.
❖ If there is a dense B+ tree index on <dno, sal>, an index-only plan can be used to avoid retrieving Emp tuples in the second query!
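The GROUP BY/HAVING guideline can be tried out with a small Python sketch that runs both forms of the MIN(E.age) query in SQLite. SQLite and the sample rows are assumptions for illustration only; the two queries are the ones from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (eid INTEGER PRIMARY KEY, dno INTEGER, age INTEGER)"
)
# Hypothetical sample rows: (dno, age)
rows = [(101, 45), (102, 31), (102, 27), (102, 52), (103, 38)]
conn.executemany("INSERT INTO Employee (dno, age) VALUES (?, ?)", rows)

# Form 1: group every department, then discard all groups but one.
with_having = conn.execute(
    "SELECT MIN(E.age) FROM Employee E GROUP BY E.dno HAVING E.dno = 102"
).fetchall()

# Form 2: filter first, so only department 102 is ever aggregated.
with_where = conn.execute(
    "SELECT MIN(E.age) FROM Employee E WHERE E.dno = 102"
).fetchall()

print(with_having, with_where)
```

On this data both forms give the same answer. One difference to keep in mind when rewriting: if no employee is in department 102, the HAVING form returns no rows, while the WHERE form returns a single row containing NULL.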
Summary of Database Tuning
• The conceptual schema should be refined by considering performance criteria and workload:
– May choose 3NF or lower normal form over BCNF.
– May choose among alternative decompositions into BCNF (or 3NF) based upon the workload.
– May denormalize, or undo some decompositions.
– May decompose a BCNF relation further!
– May choose a horizontal decomposition of a relation.
– Importance of dependency-preservation based upon the dependency to be preserved, and the cost of the IC check.
• Can add a relation to ensure dep-preservation (for 3NF, not BCNF!); or else, can check dependency using a join.

Summary (Contd.)
• Over time, indexes have to be fine-tuned (dropped, created, re-built, ...) for performance.
– Should determine the plan used by the system, and adjust the choice of indexes appropriately.
• System may still not find a good plan:
– Only left-deep plans considered!
– Null values, arithmetic conditions, string expressions, the use of ORs, etc. can confuse an optimizer.
• So, may have to rewrite the query/view:
– Avoid nested queries, temporary relations, complex conditions, and operations like DISTINCT and GROUP BY.