
Teradata SQL Performance Tuning Case Study

Part I

Eddy Cai 2007/03

Overview

Case 1: Table Join versus Sub Query
Case 2: Avoid Using Teradata Function "CSUM"
Case 3: Ensure Matching Table Column Data Types in Join Conditions
Case 4: Handle Non-Equality Constraints in the Where Clause
Case 5: Logically Splitting SQL to Avoid Using Derived Tables
Case 6: Proper PI Choice in a Table
Case 7: Collect Statistics
Case 8: PPI Filter

eBay Inc. confidential

Case 1: Table Join versus Sub Query

Left outer joins are faster when a relatively small set of rows will be eliminated from the result set.

Sub-queries using NOT IN, IN, or NOT EXISTS should be rewritten as left outer or inner joins. In some situations the Teradata optimizer does not know the result-set size of a sub-query, which can lead to a poor join-path choice.

Note: This usually gives only a small performance improvement, so the benefit should be weighed against the rewrite effort.
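As a sketch of the rewrite pattern, a NOT IN sub-query can be expressed as a left outer join with an IS NULL filter. Table and column names below are hypothetical, and the two forms are equivalent only when the join column contains no NULLs:

```sql
-- Before: the optimizer may not estimate the sub-query result size well
SELECT o.order_id
FROM orders o
WHERE o.cust_id NOT IN (SELECT cust_id FROM excluded_custs);

-- After: left outer join, keep only rows with no match
SELECT o.order_id
FROM orders o
LEFT OUTER JOIN excluded_custs x
  ON x.cust_id = o.cust_id
WHERE x.cust_id IS NULL;
```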


Case 1: Table Join versus Sub Query

Sample SQL: dw_fetr.dw_fetr_sd_w.ins.sql


AND asd.auct_type_code IN (
  SELECT auct_type_code
  FROM ${readDB}.DW_CFG_AUCT_TYPE_MODEL_SA_MAP
  WHERE MODEL_SA_CODE = 'FETR'
    AND INCLD_YN_ID = 1);

[Charts: EffectiveCPU and TotalIOCount, 8/1/06 - 10/15/06, dropping after the rewrite]

Rewrite SQL:
JOIN ${readDB}.DW_CFG_AUCT_TYPE_MODEL_SA_MAP MP
  ON MP.auct_type_code = ASD.auct_type_code
 AND MP.MODEL_SA_CODE = 'FETR'
 AND MP.INCLD_YN_ID = 1;



Case 1: Table Join versus Sub Query

Old Explain
2) We do a single-AMP RETRIEVE step from gdw_tables.dw_cfg_auct_type_model_sa_map by way of the primary index "gdw_tables.dw_cfg_auct_type_model_sa_map.MODEL_SA_CODE = 'FETR '" with a residual condition of ( "gdw_tables.dw_cfg_auct_type_model_sa_map.INCLD_YN_ID = 1") into Spool 4 (group_amps), which is redistributed by hash code to all AMPs. Then we do a SORT to order Spool 4 by the sort key in spool field1 eliminating duplicate rows. The size of Spool 4 is estimated with high confidence to be 14 rows. The estimated time for this step is 0.00 seconds.

New Explain
2) Next, we do a single-AMP RETRIEVE step from gdw_tables.dw_cfg_auct_type_model_sa_map by way of the primary index "gdw_tables.dw_cfg_auct_type_model_sa_map.MODEL_SA_CODE = 'FETR '" with a residual condition of ( "gdw_tables.dw_cfg_auct_type_model_sa_map.INCLD_YN_ID = 1") into Spool 2 (all_amps) (compressed columns allowed), which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 7,168 rows. The estimated time for this step is 0.00 seconds.


Case 2: Avoid using Teradata Function "CSUM"

We usually use the CSUM(1,1) function when we need to generate key values. This function skews the query heavily, because CSUM(1,1) is performed on only one AMP. For optimal performance, we propose using the following expression instead:

SUM(1) OVER (ROWS UNBOUNDED PRECEDING)

It is an all-AMP operation and does not cause skew.


Case 2: Avoid using Teradata Function "CSUM"

Sample SQL: dw_tns.dw_tns_user_rule_actvty_key_w.ins.sql
  cast((CSUM(1,1) + MAXK.MAX_KEY_NUM) as DECIMAL(18,0))

Rewrite SQL:
  cast((SUM(1) OVER (ROWS UNBOUNDED PRECEDING) + MAXK.MAX_KEY_NUM) as DECIMAL(18,0))
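The surrogate-key generation pattern can be sketched end to end as follows. Table and column names here are hypothetical; the cross join carries the current maximum key so new keys continue the sequence:

```sql
-- Minimal sketch, assuming target_tbl(user_key, user_id) and src_tbl(user_id)
INSERT INTO target_tbl (user_key, user_id)
SELECT CAST(SUM(1) OVER (ROWS UNBOUNDED PRECEDING)
            + MAXK.MAX_KEY_NUM AS DECIMAL(18,0)) AS user_key,
       src.user_id
FROM src_tbl src
CROSS JOIN (SELECT COALESCE(MAX(user_key), 0) AS MAX_KEY_NUM
            FROM target_tbl) MAXK;
```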
[Charts: SkewOverhead and ParallelEfficiency, 10/1/2006 - 10/15/2006]


Case 3: Ensure Matching Table Column Data Types in Join Conditions

Matching table column data types in SQL join/where conditions is especially important for performance. When a data type mismatch exists in a join/where condition, Teradata performs an implicit data type conversion that degrades performance greatly. Typical mismatches in join/where conditions include:

Joining columns with data types Decimal(9,2), Decimal(9), and Float
Joining columns with data types Varchar and Char
Joining columns with Latin (ISO) and Unicode (UTF) character sets

If table columns with different data types must be joined, explicit data type casting should be performed in the SQL query to reduce the performance impact.
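A minimal sketch of an explicit cast in a join condition, with hypothetical tables where one account id is numeric and the other is character:

```sql
-- tbl_a.acct_id is DECIMAL(9,0); tbl_b.acct_id is VARCHAR(10)
-- Casting makes the conversion explicit instead of leaving it to
-- implicit per-row translation chosen by Teradata
SELECT a.acct_id, a.bal_amt
FROM tbl_a a
JOIN tbl_b b
  ON a.acct_id = CAST(b.acct_id AS DECIMAL(9,0));
```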


Case 4: Handle Non Equality Constraints in Where Clause

For non-equality constraints in a SQL WHERE clause, such as IN, NOT IN, LIKE, BETWEEN, >, <, >=, <=, specifying the constraints properly will enhance the performance of the SQL.

Example: Explicitly apply a non-equality constraint to all related columns. Before the change, the SQL looked like:

where a = b and b = c and a like 'apple%'

During performance tuning, the where clause was updated to:

where a like 'apple%' and b like 'apple%' and c like 'apple%' and a = b and b = c

Performance improved after explicitly specifying the non-equality constraint on every joined column.
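The same pattern in a fuller query; table names are hypothetical. Repeating the filter on every joined column lets each table be reduced before the join:

```sql
-- Before: the LIKE filter constrains only one of the joined columns
SELECT t1.a
FROM t1, t2, t3
WHERE t1.a = t2.b AND t2.b = t3.c
  AND t1.a LIKE 'apple%';

-- After: the equivalent filter applied to every joined column
SELECT t1.a
FROM t1, t2, t3
WHERE t1.a = t2.b AND t2.b = t3.c
  AND t1.a LIKE 'apple%'
  AND t2.b LIKE 'apple%'
  AND t3.c LIKE 'apple%';
```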


Case 5: Logically Splitting SQL to Avoid Using Derived Tables

Logically splitting a large SQL statement into smaller statements may be necessary to guarantee SQL performance.

The main reason is that the optimizer does not know the size of derived tables in advance.
Therefore, splitting SQL statements to avoid derived tables can improve SQL performance.
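A minimal sketch of the split, with hypothetical tables: materialize the derived table into a working table first, so the optimizer knows its size (and statistics) before the final join:

```sql
-- Step 1: materialize what used to be a derived table
INSERT INTO work_tbl
SELECT item_id, auct_end_dt
FROM src_tbl
WHERE auct_end_dt >= DATE '2006-10-01';

COLLECT STATISTICS ON work_tbl COLUMN (item_id);

-- Step 2: join the working table directly
INSERT INTO target_tbl
SELECT w.item_id, o.slr_cntry_id
FROM work_tbl w
JOIN other_tbl o
  ON o.item_id = w.item_id;
```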


Case 5: Logically Splitting SQL to Avoid Using Derived Tables

Example: Split a SQL statement and use working tables. In dw_attr.dw_attr_sd.ins.sql in the ATTR subject area, two derived tables are formed and combined with a union. Since the optimizer does not know the size of those derived tables, the strategy for distributing data across AMPs does not guarantee performance. It is therefore recommended to join the working tables directly to avoid using the derived tables. Before the update in dw_attr.dw_attr_sd.ins.sql:
(select lstg_w.item_id, lstg_w.auct_end_dt, lstg_w.slr_cntry_id
 from ${readDB}.dw_lstg_item_w lstg_w
 union
 select lstg_tgt.item_id, lstg_tgt.auct_end_dt, lstg_tgt.slr_cntry_id
 from ${readDB}.dw_lstg_item lstg_tgt
 join ${db_name_read}.dw_attr_missing_items_w miss
   on miss.item_id = lstg_tgt.item_id
) w


Rewrite with two statements:

... ${readDB}.dw_lstg_item_w w ;

and

... ${readDB}.dw_lstg_item w
join ${db_name_read}.dw_attr_missing_items_w miss
  on miss.item_id = w.item_id ;

[Charts: SkewOverhead and ParallelEfficiency, 10/2/06 - 10/18/06]


Case 6: Proper PI Choice in a Table

High data skew in a table is basically caused by an improper PI choice for the table. In this case, DBAs generally advise that the PI be changed (i.e., the table rebuilt), and a NUSI may sometimes need to be added for performance improvements. If a table has a Non-Unique Primary Index (NUPI), the Teradata RDBMS is not required to enforce uniqueness on the rows. To avoid high data skew, the row count for any single non-unique PI value should be minimized so that data is evenly distributed across all AMPs. As a rule of thumb, the maximum rows per value should not exceed 100.
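Two queries that can be used to evaluate a candidate PI are sketched below, using Teradata's hashing functions (the table and column are from the example on the next slide; thresholds are the rule of thumb above):

```sql
-- Rows per AMP for a candidate PI column: a heavily uneven
-- distribution indicates skew
SELECT HASHAMP(HASHBUCKET(HASHROW(ITEM_ID))) AS amp_no,
       COUNT(*) AS row_cnt
FROM DW_BID_BYR_TRACKING
GROUP BY 1
ORDER BY 2 DESC;

-- Max rows per PI value: should stay under ~100 for a NUPI
SELECT MAX(row_cnt)
FROM (SELECT ITEM_ID, COUNT(*) AS row_cnt
      FROM DW_BID_BYR_TRACKING
      GROUP BY 1) t;
```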


Case 6: Proper PI Choice in a Table

Example: PI change due to high max rows per value. In the DW_BID_BYR_TRACKING table, the primary index was ITEM_ID, with max rows per value at 2,830, which is very likely to cause data skew. To reduce the max rows per value, the PI was changed to include TRANSACTION_ID. This change reduced the max rows per value to 1, which means the PI of (ITEM_ID, TRANSACTION_ID) is a unique primary index.
[Charts: SkewOverhead and ParallelEfficiency, 10/1/06 - 10/15/06]


Case 7: Collect Statistics

Collecting statistics on join columns is quite important for Teradata to understand how data is distributed. Failing to refresh collected statistics on those columns can mislead the optimizer, which will then base its plan on inaccurate estimates. The following command makes the EXPLAIN output include recommendations on which statistics to collect (V2R5+):

Diagnostic helpstats on for session;

As a rule of thumb, whenever more than 5% of the data in a table has been updated, the collected statistics should be refreshed.
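A minimal sketch of the workflow, with a hypothetical table and column:

```sql
-- Collect (or refresh) statistics on a join column
COLLECT STATISTICS ON gdw_tables.dw_lstg_item COLUMN (ITEM_ID);

-- Review what statistics exist on the table
HELP STATISTICS gdw_tables.dw_lstg_item;

-- Have EXPLAIN report which additional statistics it would like
DIAGNOSTIC HELPSTATS ON FOR SESSION;
EXPLAIN SELECT COUNT(*) FROM gdw_tables.dw_lstg_item;
```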


Case 8: PPI Filter (V12 update)

Interval function:

For DATE '2006-10-23' - INTERVAL '23' MONTH, partition elimination is enabled.
For DATE '2006-10-23' - 23, partition elimination is enabled.
For CURRENT_DATE - INTERVAL '23' MONTH, partition elimination is enabled.
For CURRENT_DATE, partition elimination is enabled.
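A sketch of a predicate that allows the optimizer to eliminate partitions, assuming a hypothetical table partitioned by a date column:

```sql
-- ppi_tbl is partitioned on src_cre_dt; the predicate resolves to a
-- constant date range at plan time, so only matching partitions are read
SELECT COUNT(*)
FROM ppi_tbl
WHERE src_cre_dt >= CURRENT_DATE - INTERVAL '23' MONTH;
```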

Dynamic Partition Elimination (DPE):

We have applied this method on ATTR. First, create a volatile table that contains only the PPI date values, then join it back with the PPI table to enable DPE.


Case 8: PPI Filter (V12 update)

Product Join with Dynamic Partition Elimination


Step 1: Create a volatile table containing the valid PPI date values

create volatile table SRC_CRE_DT_V as (
  sel cast(SRC_CRE_DT as date) SRC_CRE_DT
  from ${readDB}.STG_UE_SOJ_SESS_ASQ_W STG_ASQ
  group by 1
) with data on commit preserve rows;

Step 2: Join back with the PPI table to enable DPE

FROM batch_views.DW_UE_EMAIL_TRACKING DW_EMAIL
JOIN SRC_CRE_DT_V DT_V
  ON DW_EMAIL.SRC_CRE_DT = DT_V.SRC_CRE_DT
JOIN batch_views.STG_UE_SOJ_SESS_ASQ_W STG_ASQ
  ON DW_EMAIL.EMAIL_TRACKING_ID = STG_ASQ.EMAIL_TRACKING_ID
 AND DW_EMAIL.RCPNT_ID = STG_ASQ.RCPNT_ID
 AND DW_EMAIL.SRC_CRE_DT = cast(STG_ASQ.SRC_CRE_DT as date)

