
High Performance Data Warehouse Design and Construction

Join Techniques

Objectives

Illustrate fundamental differences between join techniques in DSS and OLTP environments.
Describe performance characteristics for major join techniques found in a DSS environment.
Describe resource requirements for major join techniques found in a DSS environment.

Topics

DSS vs. OLTP Queries. Nested Loop Join. Sort Merge Join. Merge Join. Hash Join. Pointer-based Join. Query Optimization.

DSS versus OLTP


Traditional OLTP vs. Traditional DSS

Performance: OLTP - critical; seconds or sub-seconds. DSS - important but a secondary issue; minutes to hours.
Flexibility: OLTP - highly tuned for predefined applications and operations. DSS - what-if analysis; changing data relationships.
Availability: OLTP - 24*7; MTTR in minutes. DSS - 16*6; MTTR sometimes hours.
Update frequency: OLTP - constant; real time; often transaction oriented. DSS - periodic; batch.
Query type: OLTP - simple; narrow, predefined set; usually primary key; often repeated. DSS - complex; ad hoc; aggregates and joins; many one-time-only queries; rarely primary key.
Index usage: OLTP - usually the primary key index. DSS - secondary indexes where helpful.
Result size: OLTP - relatively small; usually <100 rows. DSS - varies; can be very large; answer sets often loaded into other tools for analysis.
Data source: OLTP - online transactions (humans). DSS - operational systems.
Referential integrity: OLTP - transaction-oriented; handled in real time. DSS - handled in operational systems and checked by load programs or after loads complete.

Simplified OLTP Join


select customer.first_nm
      ,customer.last_nm
      ,account.balance_amt
      ,account.open_dt
      ,account.account_type_cd
      ,tx.tx_amt
      ,tx.tx_dt
      ,tx.tx_type_cd
      ,channel.channel_desc
from customer
    ,account
    ,tx
    ,channel
where customer.customer_id = '789451965'
  and customer.customer_id = account.customer_id
  and account.account_id = tx.account_id
  and account.account_type_cd = 'DDA'
  and tx.channel_id = channel.channel_id
  and tx.tx_dt > '31-DEC-2011' ;

Simplified DSS Join


create table dss_result as
select customer.customer_id
      ,customer.birth_dt
      ,customer.income_range_cd
      ,sum( case when channel.channel_type_cd = 'ATM' then 1 else 0 end ) atm_tx_qty
      ,sum( case when channel.channel_type_cd = 'ATM' then tx.tx_amt else 0.0 end ) atm_total_tx_amt
      ,sum( case when channel.channel_type_cd = 'BRA' then 1 else 0 end ) branch_tx_qty
      ,sum( case when channel.channel_type_cd = 'BRA' then tx.tx_amt else 0.0 end ) branch_total_tx_amt

Simplified DSS Join (contd)


      ,sum( case when channel.channel_type_cd not in ('ATM','BRA') then 1 else 0 end ) other_tx_qty
      ,sum( case when channel.channel_type_cd not in ('ATM','BRA') then tx.tx_amt else 0.0 end ) other_total_tx_amt
from customer
    ,account
    ,tx
    ,channel
where customer.customer_id = account.customer_id
  and account.account_id = tx.account_id
  and tx.channel_id = channel.channel_id
  and account.account_type_cd = 'DDA'
  and tx.tx_dt > '31-DEC-2011'
group by customer.customer_id
        ,customer.birth_dt
        ,customer.income_range_cd ;

Key Points to Observe


Traditional OLTP Query:

High selectivity based on specification of customer_id. Returns a relatively small number of rows.

Traditional DSS Query:

Join all customers with transactions since start of year. Returns a large set of rows which will be further analyzed in subsequent queries. Aggregation of detailed data.

Most Common Join Techniques


Nested Loop Joins. (Sort) Merge Joins. Hash Joins. Pointer-Based Joins.

Nested Loop Joins


Join table T1 to table T2 by taking each (qualifying) row of T1 and searching table T2 for matches.

T1 is referred to as the Outer table. T2 is referred to as the Inner table. Simplest form of join algorithm. No sorting required.
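
To make the mechanics concrete, here is a minimal sketch of the algorithm in Python (tables are modeled as lists of dicts; all table contents and column names are hypothetical illustrations):

import itertools  # not required; shown only to emphasize this is plain iteration

# Naive nested loop join: for each qualifying outer row, scan every
# inner row looking for matches on the join key.
def nested_loop_join(outer, inner, outer_key, inner_key):
    result = []
    for o in outer:                      # T1: outer table
        for i in inner:                  # T2: inner table, scanned once per outer row
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})
    return result

# Hypothetical example data
stores = [{"store_id": 1, "sq_ft": 126000}, {"store_id": 2, "sq_ft": 50000}]
sales  = [{"store_id": 1, "amt": 9.99}, {"store_id": 2, "amt": 4.50}]
print(nested_loop_join(stores, sales, "store_id", "store_id"))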


Nested Loop Joins


Inner table is scanned once for each qualifying row in the outer table. Indices are (almost always) essential for high performance with nested loop joins because they narrow the scan of the inner table. Alternatively, the RDBMS can use block nested loop joins to reduce the number of scans against the inner table.
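
A sketch of how an index narrows the inner-table work; a sorted (key, position) list stands in for a B-tree index on the inner table's join column (names are hypothetical):

import bisect

# Index nested loop join: probe the "index" once per outer row instead
# of rescanning the entire inner table.
def index_nested_loop_join(outer, inner, outer_key, inner_key):
    idx = sorted((row[inner_key], pos) for pos, row in enumerate(inner))
    keys = [k for k, _ in idx]           # the RDBMS would already have this index
    result = []
    for o in outer:
        i = bisect.bisect_left(keys, o[outer_key])
        while i < len(keys) and keys[i] == o[outer_key]:   # walk all matches
            result.append({**o, **inner[idx[i][1]]})
            i += 1
    return result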


Nested Loop Joins


Sum dollars sold for stores > 100,000 sq. ft.?
[Diagram: three qualifying rows in the Store table, each triggering a separate search of the Sales Detail table.]
Each qualifying row in Store table results in an interrogation of Sales Detail table for matching rows.

Nested Loop Joins


Naive nested loop join with no indices requires (R*S) I/O operations even with (simple) blocking. Nested loop join with index on inner table may require fewer I/Os, depending on index selectivity. Index selectivity, for these purposes, should be measured in terms of blocks, not rows. Access to inner table using index will usually involve random I/O operations (as opposed to sequential I/O).

Note: R and S are the number of blocks in the inner and outer tables, respectively.


Nested Loop Joins


OLTP queries will almost always perform best with a nested loop join using indices:
1. Use the primary key index on customer_id to quickly fetch customer '789451965' from the customer table.
2. Use the index on the customer_id foreign key in the account table to quickly fetch all accounts with customer_id value '789451965' (highly selective access path).
3. For each account fetched, match account_id against the foreign key index on account_id in the tx table to fetch all transactions for the desired accounts (filtering transactions by date after each match).
4. For each transaction fetched, match channel_id from the tx record against the channel_id (primary key) index on the channel table to fetch the (unique) channel for the transaction.


Nested Loop Joins


Nested loop joins may be fastest join technique when outer table has a small number of qualifying rows and the inner table is very large.

Want the number of I/Os against the inner table's index and data blocks to be small compared to the total number of blocks in the inner table. A break-even analysis can be approximated by comparing (number of qualifying outer table rows) times (indexing cost plus average number of inner table rows retrieved per qualifying outer table row) against the total number of blocks in the inner table.


Nested Loop Joins


Nested loop join performs well relative to other techniques when:

    (# qualifying rows in outer table)
        * [ (index depth* on inner table)
            + (inner table rows retrieved per qualifying outer table row) ]
    <  (# blocks in inner table)

Note: The break-even point will be even more optimistic if we assume that substantial portions of the index on the inner table can be cached in memory (the first term in square brackets goes to zero) or if a large sort can be avoided.
*Index depth = # of I/Os required to reach the desired RID list.
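
A back-of-the-envelope version of this break-even test in Python (all numbers are hypothetical illustrations, not measurements):

# Break-even sketch: indexed nested loop wins when its total I/Os are
# fewer than a full scan of the inner table's blocks.
def nested_loop_wins(qualifying_outer_rows, index_depth,
                     inner_rows_per_outer_row, inner_table_blocks):
    indexed_ios = qualifying_outer_rows * (index_depth + inner_rows_per_outer_row)
    return indexed_ios < inner_table_blocks

# Hypothetical: 500 qualifying outer rows, a 3-level index, 20 matching
# inner rows each, inner table of 1,000,000 blocks:
# 500 * (3 + 20) = 11,500 I/Os versus 1,000,000 blocks.
print(nested_loop_wins(500, 3, 20, 1_000_000))   # True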


Nested Loop Joins


For traditional DSS queries, high performance with nested loop joins is often dependent on the ability to access through a clustering index.

Traditional DSS queries are not typically given a selection criterion on a primary or foreign key value with the extreme selectivity found in OLTP queries. Traditional DSS queries often join tables containing (hundreds of) millions of rows to tables with billions of rows in one-to-many relationships.

Access through a clustering index implies that the data file is physically ordered on the index value, so locality of reference is guaranteed to be very high when matching on the join predicate.


Sort Merge Join Concept


Q: Which customers under 30 years of age have brokerage accounts?
[Diagram: Customer (T1) is filtered on age < 30 and Account (T2) is filtered on acct_type = 'brokerage'; each filtered input is sorted, then merged with a match on customer_id to produce the result R.]

Sort Merge Joins


VLDB sorting requires divide-and-conquer algorithms that combine main-memory sorting with management of intermediate runs merged to and from disk.

Create initial runs with in-memory sorts and then merge runs into increasingly longer (and fewer) runs until a single sorted result set is produced. Need substantial disk space to manage execution of the large sort.
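
A compact sketch of the run-then-merge idea in Python, with in-memory lists standing in for disk-resident runs and heapq.merge standing in for the merge passes:

import heapq

# External-sort sketch: cut the input into memory-sized runs, sort each
# run in memory ("initial runs"), then merge the sorted runs into a
# single sorted stream.
def external_sort(rows, key, run_size):
    runs = [sorted(rows[i:i + run_size], key=key)       # in-memory sorts of size K
            for i in range(0, len(rows), run_size)]
    return list(heapq.merge(*runs, key=key))            # merge passes collapsed into one k-way merge

data = [5, 1, 4, 2, 8, 7, 3, 6]
print(external_sort(data, key=lambda x: x, run_size=3))  # [1, 2, 3, 4, 5, 6, 7, 8]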


Sort Merge Joins


External sorting using divide and conquer:
[Diagram: the input is divided into runs of size K by in-memory sorts; successive merge passes combine runs into runs of size 2K, 4K, and so on, until a single sorted run remains.]

Expected number of merge passes for sorting a large file is log2( R/K ) with two-way merging. Each pass requires R block reads and R block writes, as well as CPU time for merging sorted runs.

Note: K is the size of the output set resulting from in-memory sort (constrained by availability of main memory); R is the total number of blocks in the data set to be sorted after unneeded columns are eliminated.


Sort Merge Joins


Estimate of total I/O cost for a sort merge join (sort T1, sort T2, then merge):

    O( R * log(R/K) ) + O( S * log(S/K) ) + O( R+S )

Note that all reads/writes can make use of sequential I/O.


Sort Merge Joins


Make use of in-place sorts whenever possible to avoid additional storage overhead. Amount of main memory available for sorting and merging can have a significant impact on performance.


Merge Joins (sans Sort)


Most expensive operations in a sort merge join (by far) are the sort steps prior to merging.

Eliminate this overhead whenever the sorted result from a previous join step can be re-used. The sort step may also be avoided if the base tables are pre-sorted (or pre-hashed) on the join key.

[Diagram: filtered T1 and T2 feed the merge (matching on IDs) directly to produce R; the sort steps are bypassed.]

Merge Join (sans Sort)


Estimate of total I/O cost for a merge join:

    O( R+S )

The number of I/Os is linearly proportional to the combined size of the two tables: the best performance yet!
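
A minimal Python sketch of the merge step over two inputs that are already sorted on the join key (rows are modeled as (key, payload) tuples; the grouping handles many-to-many matches; all data is hypothetical):

from itertools import groupby

# Merge join sketch: group each pre-sorted input by join key, then
# advance the two streams in lockstep, emitting pairs on key equality.
def merge_join(left, right, key):
    lgroups = groupby(left, key=key)
    rgroups = groupby(right, key=key)
    lk, lrows = next(lgroups, (None, None))
    rk, rrows = next(rgroups, (None, None))
    out = []
    while lrows is not None and rrows is not None:
        if lk == rk:
            lrows, rrows = list(lrows), list(rrows)
            out.extend((l, r) for l in lrows for r in rrows)
            lk, lrows = next(lgroups, (None, None))
            rk, rrows = next(rgroups, (None, None))
        elif lk < rk:
            lk, lrows = next(lgroups, (None, None))   # left key has no partner
        else:
            rk, rrows = next(rgroups, (None, None))   # right key has no partner
    return out

left  = [(1, "a"), (1, "b"), (2, "c")]
right = [(1, "x"), (2, "y"), (3, "z")]
print(merge_join(left, right, key=lambda r: r[0]))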


Hash Joins
Join tables T1 and T2 by building an in-memory hash table from the smaller of the tables (after considering qualifying rows) and then probing the hash table for matches from the other table.

The table used to construct the hash table is referred to as the build input. The table used to probe the hash table is referred to as the probe input.
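
A minimal in-memory sketch of the technique in Python (rows as dicts; table and column names hypothetical):

from collections import defaultdict

# Hash join sketch: build a hash table on the smaller (build) input,
# then stream the larger (probe) input past it in a single pass.
def hash_join(build, probe, build_key, probe_key):
    table = defaultdict(list)
    for b in build:                       # build phase
        table[b[build_key]].append(b)
    return [{**b, **p} for p in probe     # probe phase: no sorting required
            for b in table.get(p[probe_key], [])]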


Hash Joins
When the build input can be hashed entirely into main memory, hash join performance can be quite attractive.

No sorting required. No need to build any temporary files, even with a very large probe input.


Hash Joins
When the build input can be hashed entirely into main memory, the number of I/Os is linearly proportional to the combined size of both tables.

Estimate of total I/O cost for a hash join:

    O( R+S )

Beware: memory availability can be a major performance issue when the build input is of significant size.


Hash Joins
When the build input has a large number of rows that must be placed into the hash table, overflow to disk must be considered.

Recursive partitioning must be used so that build input partitions can fit into memory. Partitioning needs to be performed using the same splitting algorithm for both build and probe inputs. Hybrid hash operations are typically used to perform the partitioning, as sketched below.
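
A sketch of a single partitioning level in Python (a fan-out of 4 is an arbitrary choice; Python's built-in hash stands in for the splitting function, which must be identical for both inputs):

# Partition both inputs with the same hash function so that rows that
# can join always land in the same partition pair.
def partition(rows, key, fanout=4):
    parts = [[] for _ in range(fanout)]
    for r in rows:
        parts[hash(r[key]) % fanout].append(r)   # same split rule for build and probe
    return parts

# Each (build_parts[i], probe_parts[i]) pair is then joined independently;
# a partition still too large for memory is partitioned again (recursion).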


Hash Joins
Hash-based binary matching operations work particularly well when the two input sizes are quite different, because the recursion depth depends on the smaller of the two inputs.

Estimate of total I/O cost for a partitioned hash join (partition T1, partition T2, then hash match; assume R < S):

    O( R * log(R/K) ) + O( S * log(R/K) ) + O( R+S )

It is critical to avoid skew in the hashing function's results for good performance!


Pointer-Based Joins
Join tables T1 and T2 by tracing explicit links between related data items.

Found primarily in object databases in the form of object identifiers (OIDs). The links pre-compute the join path between data items. Cost of storing and maintaining links can be very expensive.
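
A sketch of the idea in Python, with object references standing in for OIDs (class and attribute names are hypothetical):

# Pointer-based join sketch: the Account object holds a direct reference
# to its Customer, so the "join" is just pointer traversal.
class Customer:
    def __init__(self, name):
        self.name = name

class Account:
    def __init__(self, balance, customer):
        self.balance = balance
        self.customer = customer        # pre-computed join path (an OID in an ODBMS)

alice = Customer("Alice")
accounts = [Account(100.0, alice), Account(250.0, alice)]
for a in accounts:
    print(a.customer.name, a.balance)   # no search, sort, or hash: follow the link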


Query Optimization

Quality of query optimizer is critical in VLDB environments targeted to data warehouse applications.

Two broad approaches exist: rule-based optimizers and cost-based optimizers.


Query Optimization
Require cost-based optimizer to make good decisions regarding join plans for sophisticated DSS queries.

Need to collect statistics on tables and on individual columns within tables. Optimizer choices should be independent of table ordering in the from clause, the ordering or structure of where clause predicates, and so on; in reality, many optimizers are very sensitive to these nuances. The optimizer should understand the resources available within the machine (e.g., a CPU-bound versus I/O-bound machine configuration may influence optimization choices).

Note that the quality of optimizer software is a significant differentiator among database products currently on the market.


Bottom Line

Join technique is a dominant factor in determining performance characteristics in a VLDB data warehouse environment. Join techniques required for a decision support environment are significantly more sophisticated than what is required in an OLTP environment. No one join technique is appropriate in all cases; must rely on a cost-based optimizer to determine the optimal plan.

