
Deloitte

Teradata Query
Optimization Guidelines
Execute Teradata Queries More Efficiently
Swarnkar, Puneet

2014
Introduction
Optimization is the technique of selecting the least expensive (fastest) plan for a query to fetch its results. The optimizer considers the possible query plans for a given input query and attempts to
determine which of those plans will be the most efficient.

Teradata performance tuning is the practice of improving a process so that queries run faster with
minimal use of CPU resources.

The typical goal of SQL optimization is to produce the result (data set) while consuming fewer computing
resources and/or delivering a shorter response time.

Query Optimization Process
The following stages list the logical sequence of processes undertaken by the Optimizer as it
optimizes a DML request. The stages listed here do not include the influence of
parameterized value peeking in determining whether the Optimizer should generate a specific plan or a
generic plan for a given request.
The input to the Optimizer is the Query Rewrite ResTree. The Optimizer then produces the optimized
white tree, which it passes to an Optimizer subcomponent called the Generator.
The Optimizer engages in the following process stages.
1. Receives the Query Rewrite ResTree as input.
2. Processes correlated subqueries by converting them to unnested SELECTs or simple joins.
3. Processes non-correlated subqueries by materializing the subquery and placing its value in the USING
row for the query regardless of whether the subquery is on the LHS or the RHS of the operator in the
predicate.
4. Searches for a relevant join or hash index.
5. Materializes subqueries to spool files.
6. Analyzes the materialized subqueries for optimization possibilities.
a. Separates conditions from one another.
b. Pushes down predicates.
c. Generates connection information.
d. Locates any complex joins.
e. Discovers aggregations and opportunities for partial GROUP BY optimizations.
7. Generates size and content estimates of spool files required for further processing.
8. Generates an optimal single-table access path.
9. Simplifies and optimizes any complex joins identified in stage 6d.
10. Maps join columns from a join (spool) relation to the list of field IDs from the input base tables to
prepare the relation for join planning.
11. Generates information about local connections. A connecting condition is one that connects an outer
query and a subquery. A direct connection exists between two tables if either of the following
conditions is found.
- An ANDed bind term; a miscellaneous term such as an inequality, AND, or OR; or a cross, outer, or minus join term that satisfies the dependent information between the two tables
- A spool file of an uncorrelated subquery EXIST predicate that connects with any outer table

12. Generates information about indexes that might be used in join planning, including the primary indexes
for the relevant tables and pointers to the table descriptors of any other useful indexes.
13. Performs row and column partition elimination for partitioned tables.
14. Uses a recursive greedy 1-table lookahead algorithm to generate the best join plan.
15. If the join plan identified in step 14 does not meet the heuristics-based criteria for an adequate join
plan, generates another join plan using an n-table lookahead algorithm.
16. Selects the better join plan of the two plans generated in steps 14 and 15.
17. Generates a star join plan.
18. Selects the better plan of the selection in step 16 and the star join plan generated in stage 17.
19. Passes the optimized white tree to the Generator.
The Generator then generates plastic steps for the plan selected in step 18.

Methodologies
Optimization is one of the most discussed topics for Teradata today. Because of the huge
amount of data in a Teradata database, it becomes very important to get optimal performance
out of it; otherwise queries will perform poorly and the benefit of parallelism will be lost.
In order to select the least expensive plan for a query to fetch its results, the following techniques or
practices can be followed:
(1) STATISTICS
Collecting statistics is one of the primary steps in Teradata query optimization.
Statistics collection is essential for the optimal performance of the Teradata query optimizer. The query
optimizer relies on statistics to help it determine the best way to access data. Statistics also help the
optimizer ascertain how many rows exist in the tables being queried and predict how many rows will qualify
for given conditions. Lack of statistics, or outdated statistics, might result in the optimizer choosing a
less-than-optimal method for accessing data tables.
Also, statistics help Teradata determine the spool file size needed to contain the resulting data. Accurate
statistics could make the difference between a successful query and a query that runs out of spool
space.
Syntax:
To check the statistics defined for a table:
HELP STATISTICS table_name;
To collect or refresh statistics:
COLLECT STATS ON table_name COLUMN (col_name, col_name, ...);
or
COLLECT STATS ON table_name INDEX (col_name, col_name, ...);
DIAGNOSTIC STATEMENT
DIAGNOSTIC HELPSTATS ON FOR SESSION
The above statement can be used to determine the statistics that might be required to improve the
performance of the SQL. An EXPLAIN needs to be executed following the above statement to view the
statistics suggestions.
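
A minimal sketch of how this is used in practice, assuming a hypothetical RETAIL.SALES table (the table and column names are illustrative only):

DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT STORE_ID, SUM(SALE_AMT)
FROM RETAIL.SALES
WHERE SALE_DT >= DATE '2014-01-01'
GROUP BY STORE_ID;

The EXPLAIN output then ends with a list of suggested COLLECT STATISTICS statements, which can be reviewed and executed as appropriate.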
In the EXPLAIN output, estimates carry one of the following confidence levels:
1) No Confidence - no statistics are available for the table.
2) Low Confidence - statistics exist but are difficult to apply precisely.
3) High Confidence - the optimizer is sure of its estimates based on the statistics available.

Statistics need to be collected for:
1. All non-unique indexes.
2. UPI of small tables (tables with fewer than x rows per AMP, depending on the
available number of AMPs)
3. All indexes of a join index
4. Any column used in joins
5. Any columns used in a WHERE clause
6. Indexes of global temporary tables.

Stats cannot be collected on:
1. Volatile tables
2. LOB columns
Collected statistics are not automatically updated by the system. Refresh statistics when 5-10%
of the table rows have changed.
Always collect statistics at the column level even when collecting on an index, because indexes
can be dropped and recreated at any time.
When to collect statistics:

After the following:
1. FastLoads.
2. MultiLoads.
3. Non-utility loads (TPump/BTEQ/ODBC/JDBC) - collect statistics after a significant percentage of
data values have changed.
4. Deletes/purges - collect statistics after a significant percentage of data has been removed
from a table.
5. Recovery - if a table is lost and then recovered from an archive, recollect statistics.
6. Reconfiguration - recollect all statistics after a system reconfiguration.

(2) USAGE OF DISTINCT KEYWORD
GROUP BY and DISTINCT return the same result when eliminating duplicates, but they are
processed differently in Teradata, so choose based on the data demographics.

DISTINCT is better for columns with a low number of rows per value:
Number of rows per value < Number of AMPs

GROUP BY is better for columns with a large number of rows per value:
Number of rows per value > Number of AMPs
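
For illustration (a sketch with a hypothetical TRANSACTIONS table), both statements below return the same distinct list of customers, but they are processed differently: DISTINCT traditionally redistributes every row before removing duplicates, while GROUP BY reduces duplicates locally on each AMP before redistribution.

SELECT DISTINCT CUST_ID FROM TRANSACTIONS;

SELECT CUST_ID FROM TRANSACTIONS GROUP BY CUST_ID;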

(3) USAGE OF IN CLAUSE
While there may not be a theoretical limit to the number of items contained within an IN clause, you
may see performance degradation if the list starts to exceed a few hundred items. An
optimizer tweak is in place that tries to build a spool file when a large IN list is encountered.
When the number of values exceeds an acceptable limit, the hard-coded values can be inserted into a
volatile table or a global temporary table (with statistics collected on the global temporary table), and
the IN clause can then be rewritten as an equi-join.
Example:
SELECT MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC, COUNT(*)
FROM EMPLOYEE MD
, DEPARTMENT VN
WHERE MD.DEPT_ID = VN.DEPT_ID
AND VN.LOC = 'USA'
AND MD.DEPT_ID BETWEEN '501' AND '9450'
AND MD.DEPT_ID IN
('2087','309','4009','123','5743','456','0987','2643','7545','8655','9737','8654','5744','67894',
'0444','755','78446','96422','59426','8365','7072','639620','6497','7220','7294','8937','5978',
'7894','2497','2864','0742')
GROUP BY MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC
HAVING COUNT(MD.EMP_ID) >= 100
ORDER BY 2;
CREATE MULTISET VOLATILE TABLE EMP_DEPT AS
(SELECT DEPT_ID FROM DEPARTMENT)
WITH NO DATA
ON COMMIT PRESERVE ROWS;

After inserting all the hard-coded values into the volatile table, the query can be modified
as below:
SELECT MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC, COUNT(*)
FROM EMPLOYEE MD
, DEPARTMENT VN
, EMP_DEPT TEMP
WHERE MD.DEPT_ID = VN.DEPT_ID
AND VN.LOC = 'USA'
AND MD.DEPT_ID = TEMP.DEPT_ID
AND MD.DEPT_ID BETWEEN '501' AND '9450'
GROUP BY MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC
HAVING COUNT(MD.EMP_ID) >= 100
ORDER BY 2;

(4) USAGE OF MANIPULATED COLUMNS IN WHERE CLAUSE
When manipulated (transformed) columns are used in the WHERE clause or in join conditions, the optimizer will not be
able to use the statistics collected on those columns, as the following sketch illustrates.
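
For example (a sketch assuming a hypothetical ORDERS table with a DATE column ORDER_DT), wrapping the column in a function hides its statistics from the optimizer, whereas comparing the bare column does not:

-- Avoid: the function on ORDER_DT prevents the optimizer from using its statistics
SELECT * FROM ORDERS WHERE EXTRACT(YEAR FROM ORDER_DT) = 2013;

-- Prefer: the bare column allows the collected statistics to be used
SELECT * FROM ORDERS WHERE ORDER_DT BETWEEN DATE '2013-01-01' AND DATE '2013-12-31';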
(5) DATATYPE MISMATCH
Try to avoid data type transformation during a join. Columns with the same data type should be
used in the join condition; otherwise one of the compared fields will undergo implicit conversion before the join
happens.
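
As a sketch (hypothetical tables): if CUSTOMER.CUST_ID is defined as INTEGER while ORDERS.CUST_ID is defined as VARCHAR(10), the join below forces one side to be converted row by row before the join, and the conversion can also prevent the use of statistics or index access. Defining both columns with the same data type avoids this.

SELECT C.CUST_ID, O.ORDER_ID
FROM CUSTOMER C
JOIN ORDERS O ON C.CUST_ID = O.CUST_ID;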
(6) CHARACTER SET
While joining two tables, make sure that both columns fall under the same character set.
Otherwise implicit conversion of one to the other takes place, resulting in poor performance.
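
For example (hypothetical DDL), joining a LATIN column to a UNICODE column translates the LATIN side before every comparison; declaring both columns with the same CHARACTER SET avoids the translation.

CREATE TABLE CUST_A (CUST_CD VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC);
CREATE TABLE CUST_B (CUST_CD VARCHAR(20) CHARACTER SET UNICODE NOT CASESPECIFIC);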
(7) DATE COMPARISON
When comparing date values over a particular range, the query may result in a product join.
This can be mitigated by using SYS_CALENDAR.CALENDAR, a view in Teradata's built-in SYS_CALENDAR database.

Example:
Insert into table_a
select
t2.a1,t2.a2,t2.a3,t2.a4
from
table_2 t2
join table_3 t3
on t2.a1=t3.a1
and t2.a5_dt>=t3.a4_dt
and t2.a5_dt<=t3.a5_dt;

The above query can be rewritten with SYS_CALENDAR to reduce (though not completely eliminate) the product join.
Example:
Insert into table_a
select
t2.a1,t2.a2,t2.a3,t2.a4
from table_2 t2
join SYS_CALENDAR.CALENDAR sys_cal
on sys_cal.calendar_date = t2.a5_dt
join table_3 t3
on t2.a1=t3.a1
and sys_cal.calendar_date >=t3.a4_dt
and sys_cal.calendar_date <=t3.a5_dt;

(8) PROPER USAGE OF ALIAS AND TABLE NAMES
Example:
Insert into table_a
select
t2.a1, t2.a2, t2.a3, t2.a4
from
table_2 t2
join table_3 t3
on t2.a1=t3.a1
join table_4 t4
on table_3.a1= t4.a1;


Once a table has been aliased, referring to it again by its original name (table_3 in the example
above) makes the optimizer treat the reference as a second, separate instance of the table. The same
table is then scanned twice and joined without a proper condition, which can drive up CPU usage even
when the tables contain only a few rows. If either of the tables is very big, this case may lead to a
SPOOL space error.
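
The corrected query below uses the alias t3 consistently, so table_3 is read only once:

Insert into table_a
select
t2.a1, t2.a2, t2.a3, t2.a4
from
table_2 t2
join table_3 t3
on t2.a1=t3.a1
join table_4 t4
on t3.a1= t4.a1;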

(9) MISSING JOINS
Example: Consider three tables, tab1, tab2, and tab3, with the following columns:
tab1: empno, ename
tab2: deptno, dname
tab3: ename, dname

Select tab1.empno, tab2.deptno
from tab1, tab2, tab3
where tab1.ename = tab3.ename;

Because tab2 appears in the FROM clause without any join condition, the query results in a product
join (Cartesian product) and may consume high CPU even when the tables are small.
If either of the tables is very big, the case may lead to a SPOOL space error.
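
A corrected version (a sketch; the join columns are inferred from the table layout above, so adjust them to the real keys) supplies a join condition for every table in the FROM clause:

Select tab1.empno, tab2.deptno
from tab1, tab2, tab3
where tab1.ename = tab3.ename
and tab3.dname = tab2.dname;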

(10) UNNECESSARY JOINS
Try to avoid unnecessary joins, especially LEFT OUTER joins to tables from which no data is being
pulled and on which no filter is applied.
Example:
SELECT
E.EMP_ID,
E.EMP_NAME,
D.DEPT_NAME,
D.DEPT_LOC
FROM
EMPLOYEE E
JOIN DEPT D ON E.DEPT_ID = D.DEPT_ID
LEFT JOIN PAY_ROLL P ON E.EMP_ID = P.EMP_ID
WHERE
E.JOIN_DT >= DATE '2009-10-12'


The above query can be modified as
SELECT
E.EMP_ID,
E.EMP_NAME,
D.DEPT_NAME,
D.DEPT_LOC
FROM
EMPLOYEE E
JOIN DEPT D ON E.DEPT_ID = D.DEPT_ID
WHERE E.JOIN_DT >= DATE '2009-10-12'

(11) PROPER SELECTION OF PI FOR A TABLE
The Primary Index (PI) is the sole mechanism by which data is distributed over the AMPs, so the PI for a
table should be chosen on the column with the most unique values.
When a table is created without an explicit index specification, the first column is assumed to be the PI
by default.
The below query can be used to see the distribution of rows to the AMPs in a system.

select
hashamp(hashbucket(hashrow(primary_index_columns))) as "AMP"
,count(*)
from
your_table
group by 1
order by 2 desc;

When a table does not have any column with sufficiently unique values, an identity column may help.

IDENTITY COLUMNS:
Example:
unq_pk INTEGER GENERATED ALWAYS AS IDENTITY
(START WITH 1
INCREMENT BY 1
MINVALUE -2147483647 MAXVALUE 100000000
CYCLE),

The "unq_pk" represents column name and "INTEGER" followed by "unq_pk" represents
data type. These values are generated dynamically in a random way whenever data is
inserted in to the table holding the above Identity column.
Using the above column as a PI, would distribute the data almost equally in to the
available AMP's which reduces skewing data to a single AMP.
RESTRICTIONS ON IDENTITY COLUMNS

Identity columns need not be the PI.
These columns cannot be part of a composite (multi-column) PI/SI.
The PI column:
- Should contain the maximum number of unique values.
- Should be stable (rarely changed).
- Should be defined explicitly in the CREATE TABLE statement (a DDL sketch follows).
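
A minimal DDL sketch (hypothetical EMPLOYEE table) that defines the PI explicitly on a highly unique, rarely changing column:

CREATE MULTISET TABLE EMPLOYEE
(
EMP_ID INTEGER NOT NULL,
EMP_NAME VARCHAR(60),
DEPT_ID INTEGER
)
UNIQUE PRIMARY INDEX (EMP_ID);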

(12) SPLITTING THE QUERY
Reduce the number of joins in a query. If many joins are needed in the query, split the
query into two parts by materializing some of the joins into a volatile table, after
ensuring that the dense tables are filtered first so that unnecessary redistribution is
avoided.
Another technique for splitting the query is to materialize all the INNER JOINs into a volatile
table, which can then be joined with the remaining LEFT OUTER joins (see the sketch below).
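
A sketch of the second technique (table and column names are illustrative): the INNER JOINs are materialized into a volatile table, which is then left-joined to the remaining table.

CREATE MULTISET VOLATILE TABLE EMP_DEPT_VT AS
(
SELECT E.EMP_ID, E.EMP_NAME, D.DEPT_NAME
FROM EMPLOYEE E
JOIN DEPT D ON E.DEPT_ID = D.DEPT_ID
)
WITH DATA
ON COMMIT PRESERVE ROWS;

SELECT V.EMP_ID, V.EMP_NAME, V.DEPT_NAME, P.PAY_GRADE
FROM EMP_DEPT_VT V
LEFT JOIN PAY_ROLL P ON V.EMP_ID = P.EMP_ID;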

The following methods can be used to scope down the size of SQLs.
1) Table denormalization: Duplicating data in another table. This provides faster access to the duplicated
data, but requires more update time.
2) Table summarization: The data from one/many table(s) is summarized into commonly used summary
tables. This provides faster access to the summarized data, but requires more update time.
3) SQL union: A UNION can be used to break up a large SQL process or statement into several
smaller SQL processes or statements, which can run in parallel (see the sketch after this list).
4) Unix split: A large input UNIX file could be split into several smaller UNIX files, which could then be
input in series, or in parallel, to create smaller SQL processing steps.
5) Unix concatenation: A large query could be broken up into smaller independent queries, whose
output is written to several smaller unix files. Then these smaller files are unix concatenated together to
provide a single unix file.
6) Trigger tables: A group of tables, each containing a subset of the keys of the index of an original table.
The tables could be created based on some value in the index of the original table. This provides the
ability to break up a large SQL statement into multiple smaller SQL statements, but creating the trigger
tables requires more update time.
7) Sorts (order by): Although sorts take time, these are always done at the end of the query, and the sort
time is directly dependent on the size of the solution. Unnecessary sorts could be eliminated.
8) Export/Load: Table data could be exported (Bteq, Fastexport) to a unix file, and updated, and then
reloaded into the table (Bteq, fastload, Multiload).
9) C programs/UNIX scripts: Some data manipulation is very difficult and time consuming in SQL. Such logic
could be replaced with C programs or UNIX scripts. See the C/Embedded SQL tip.
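
As a sketch of item 3 (a hypothetical SALES table, assuming each store belongs to exactly one region so the pieces do not overlap), a large aggregation can be broken into smaller pieces with UNION ALL, each piece scanning only part of the data:

SELECT STORE_ID, SUM(SALE_AMT) FROM SALES WHERE REGION_CD = 'EAST' GROUP BY STORE_ID
UNION ALL
SELECT STORE_ID, SUM(SALE_AMT) FROM SALES WHERE REGION_CD = 'WEST' GROUP BY STORE_ID;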
(13) CONCATENATION OF THE SQL
Concatenation allows you to retrieve data correlated to the MIN/MAX function in a single pass.
This is a special application of concatenation that precludes the need for a correlated subquery.

Example:
If you want to find the employee with the highest salary in each department, the query might
be:
SELECT Dept_No, Salary, Last_Name, Fname
FROM Employee_Table
WHERE (Dept_No, Salary) IN (SELECT Dept_No, MAX(Salary) FROM Employee_Table GROUP BY Dept_No)
ORDER BY Dept_No;
The above query could be rewritten as:
SELECT Dept_No, MAX(Salary || ' ' || Last_Name || ', ' || Fname)
FROM Employee_Table
GROUP BY Dept_No
ORDER BY Dept_No;
Note: If two or more employees have the same maximum salary, this query selects only one employee
per department.
(14) USAGE OF NESTED VIEWS:
In Teradata there is no materialized view, so it is better to avoid creating views by
nesting many other views together. If such a view is really required, a table can be constructed instead and data can be
loaded into it incrementally (for example, for a particular date or fiscal period).
(15) USAGE OF DERIVED TABLES:
Before using a derived table in a query, it is better to make sure that it returns a
minimal number of rows. If the same derived table has to be used more than once in a query, it is better to
populate those rows into a volatile table and join to it as appropriate.
(16) USAGE OF PI AND NON-PI JOINS:
If a query joins a PI column to a non-PI column, the table on the non-PI side may contain duplicate
values of the join column; reducing those duplicates before the join, as in the rewrite below, can cut down the rows that have to be redistributed.
Example:
SELECT E.DEPT_ID, D.DEPT_LOC
FROM EMP E, DEPT D
WHERE E.DEPT_ID = D.DEPT_ID;

The above query can be rewritten as
SELECT E.DEPT_ID, D.DEPT_LOC
FROM DEPT D, (SELECT DEPT_ID FROM EMP GROUP BY 1) E
WHERE E.DEPT_ID = D.DEPT_ID;

(17) USAGE OF SET AND MULTISET TABLE:
A MULTISET table accepts duplicate records, whereas a SET table does not.
A SET table with no unique indexes forces Teradata to check for duplicate rows every time a row is
inserted or updated, which can add a lot of overhead to such operations.
It is better to use MULTISET after ensuring that the records being populated into the target are always
unique.
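
A minimal DDL sketch (hypothetical staging table): a MULTISET table with a non-unique PI skips the duplicate-row check on every insert, whereas the equivalent SET table would compare each incoming row with existing rows that share the same row hash.

CREATE MULTISET TABLE STG_SALES
(
SALE_ID INTEGER,
STORE_ID INTEGER,
SALE_AMT DECIMAL(12,2)
)
PRIMARY INDEX (STORE_ID);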

