
Introduction

Optimization is the technique of selecting the least expensive (fastest) plan for a query to fetch its results. The Optimizer considers the possible query plans for a given input query and attempts to determine which of those plans will be the most efficient.
Teradata performance tuning is the practice of improving query execution so that queries run faster while using minimal CPU resources.
The typical goal of SQL optimization is to get the result (data set) with fewer computing resources consumed and/or a shorter response time.

Query Optimization Process


The following stages list the logical sequence of processing undertaken by the Optimizer as it optimizes a DML request. The stages listed here do not include the influence of parameterized value peeking on whether the Optimizer should generate a specific plan or a generic plan for a given request.
The input to the Optimizer is the Query Rewrite ResTree. The Optimizer then produces the optimized white tree, which it passes to an Optimizer subcomponent called the Generator.
The Optimizer engages in the following process stages.
1. Receives the Query Rewrite ResTree as input.
2. Processes correlated subqueries by converting them to unnested SELECTs or simple joins.
3. Processes noncorrelated subqueries by materializing the subquery and placing its value in the USING row for the query, regardless of whether the subquery is on the LHS or the RHS of the operator in the predicate.
4. Searches for a relevant join or hash index.
5. Materializes subqueries to spool files.
6. Analyzes the materialized subqueries for optimization possibilities.
a. Separates conditions from one another.
b. Pushes down predicates.
c. Generates connection information.
d. Locates any complex joins.
e. Discovers aggregations and opportunities for partial GROUP BY optimizations.
7. Generates size and content estimates of the spool files required for further processing.
8. Generates an optimal single-table access path.
9. Simplifies and optimizes any complex joins identified in stage 6d.
10. Maps join columns from a join (spool) relation to the list of field IDs from the input base tables to prepare the relation for join planning.
11. Generates information about local connections. A connecting condition is one that connects an outer query and a subquery. A direct connection exists between two tables if either of the following conditions is found:
- An ANDed bind term; miscellaneous terms such as inequalities, ANDs, and ORs; or a cross, outer, or minus join term that satisfies the dependent information between the two tables.
- A spool file of an uncorrelated subquery EXISTS predicate that connects with any outer table.
12. Generates information about indexes that might be used in join planning, including the primary indexes for the relevant tables and pointers to the table descriptors of any other useful indexes.
13. Performs row and column partition elimination for partitioned tables.
14. Uses a recursive greedy one-table lookahead algorithm to generate the best join plan.
15. If the join plan identified in stage 14 does not meet the heuristics-based criteria for an adequate join plan, generates another best join plan using an n-table lookahead algorithm.
16. Selects the better of the two join plans generated in stages 14 and 15.
17. Generates a star join plan.
18. Selects the better plan of the selection in stage 16 and the star join plan generated in stage 17.
19. Passes the optimized white tree to the Generator.
The Generator then generates plastic steps for the chosen plan.

Methodologies
Optimization is one of the most discussed techniques for Teradata today. Because of the huge amount of data in a Teradata database, it becomes very important to extract optimal performance from it; otherwise queries will perform poorly and the benefit of parallelism will be lost.
In order to select the least expensive plan for a query to fetch results, the following techniques or practices can be followed:
(1) STATISTICS
Collecting statistics is one of the primary steps in Teradata query optimization.
Statistics collection is essential for the optimal performance of the Teradata query optimizer. The query optimizer relies on statistics to help it determine the best way to access data. Statistics also help the optimizer ascertain how many rows exist in the tables being queried and predict how many rows will qualify for given conditions.
Lack of statistics, or outdated statistics, might result in the optimizer choosing a less-than-optimal method for accessing data tables.
Statistics also help Teradata determine the spool file size needed to contain the resulting data. Accurate statistics can make the difference between a successful query and a query that runs out of spool space.
Syntax:

To check which statistics are defined for a table:

HELP STATISTICS table_name;

To collect or refresh statistics:

COLLECT STATISTICS ON table_name [INDEX | COLUMN] (col_name [, col_name, ...]);

DIAGNOSTIC STATEMENT
DIAGNOSTIC HELPSTATS ON FOR SESSION

The above statement can be used to determine the statistics that might be required to improve the performance of an SQL request. An EXPLAIN must be executed after the above statement; the suggested statistics appear as recommended COLLECT STATISTICS statements at the end of the EXPLAIN output.
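For example (a minimal sketch, assuming a hypothetical EMPLOYEE table):

DIAGNOSTIC HELPSTATS ON FOR SESSION;   -- enable statistics recommendations for this session

EXPLAIN
SELECT DEPT_ID, COUNT(*)
FROM EMPLOYEE
WHERE JOIN_DT >= DATE '2009-10-12'
GROUP BY DEPT_ID;
-- the EXPLAIN text ends with the COLLECT STATISTICS statements the Optimizer recommends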
In the EXPLAIN output, estimates are qualified with one of the following confidence levels:
1) No Confidence - no statistics are available for the step.
2) Low Confidence - statistics exist but are difficult to use precisely.
3) High Confidence - the Optimizer is sure of its estimates based on the statistics available.
Statistics need to be collected for:
1. All non-unique indexes.
2. UPI of small tables (tables with fewer than x rows per AMP, depending on the number of available AMPs)
3. All indexes of a join index
4. Any column used in joins
5. Any columns used in a WHERE clause
6. Indexes of global temporary tables.
Stats cannot be collected on:
1. Volatile tables
2. LOB columns
Collected statistics are not automatically updated by the system. Refresh statistics when 5-10% of the table rows have changed.

Always collect statistics at the column level even when collecting on an index, because indexes can be dropped and recreated at any time, and the statistics collected on an index are lost when it is dropped.
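A minimal sketch (assuming a hypothetical EMPLOYEE table with an index on DEPT_ID):

COLLECT STATISTICS ON EMPLOYEE COLUMN (DEPT_ID);   -- column-level: survives an index drop
COLLECT STATISTICS ON EMPLOYEE INDEX (DEPT_ID);    -- index-level: lost if the index is dropped
COLLECT STATISTICS ON EMPLOYEE;                    -- re-collects all statistics already defined on the table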
When to collect Statistics:

After the following:
1. FastLoads
2. MultiLoads
3. Non-utility loads (TPump/BTEQ/ODBC/JDBC): collect statistics after a significant percentage of data values have changed.
4. Deletes/purges: collect statistics after a significant percentage of data has been removed from a table.
5. Recovery: if a table is lost and then recovered from an archive, recollect statistics.
6. Reconfiguration: recollect all statistics after a system reconfiguration.
(2) USAGE OF DISTINCT KEYWORD
It is generally better to use GROUP BY instead of DISTINCT in Teradata, as both keywords return the same result but can be processed differently.
DISTINCT is better for columns with a low number of rows per value (rows per value < number of AMPs), because rows are redistributed before duplicates are removed.
GROUP BY is better for columns with a large number of rows per value (rows per value > number of AMPs), because duplicates are removed locally on each AMP before redistribution.
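For example (a sketch, assuming a hypothetical EMPLOYEE table in which DEPT_ID has many rows per value), the query

SELECT DISTINCT DEPT_ID FROM EMPLOYEE;

can be rewritten as

SELECT DEPT_ID FROM EMPLOYEE GROUP BY 1;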
(3) USAGE OF IN CLAUSE
While there may not be a theoretical limit to the number of items contained within an IN clause, you may see performance degradation if the list exceeds a few hundred items. There is an optimizer enhancement that tries to build a spool file when a large IN-list is encountered.
When the number of values exceeds an acceptable limit, the hard-coded values can be inserted into a volatile table or a global temporary table (with statistics collected on the global temporary table), and the IN clause can be converted into an equi-join.
Example:

SELECT MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC, COUNT(*)
FROM EMPLOYEE MD
, DEPARTMENT VN
WHERE MD.DEPT_ID = VN.DEPT_ID
AND VN.LOC = 'USA'
AND MD.DEPT_ID BETWEEN '501' AND '9450'
AND MD.DEPT_ID IN
('2087','309','4009','123','5743','456','0987','2643','7545','8655','9737','8654',
'5744','67894','0444','755','78446','96422','59426','8365','7072','639620','6497',
'7220','7294','8937','5978','7894','2497','2864','0742')
GROUP BY MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC
HAVING COUNT(MD.EMP_ID) >= 100
ORDER BY 2;
CREATE MULTISET VOLATILE TABLE EMP_DEPT AS
(SELECT DEPT_ID FROM DEPARTMENT)
WITH NO DATA
ON COMMIT PRESERVE ROWS;
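A minimal sketch of populating the volatile table with the hard-coded values (one INSERT per value; only the first few are shown):

INSERT INTO EMP_DEPT VALUES ('2087');
INSERT INTO EMP_DEPT VALUES ('309');
INSERT INTO EMP_DEPT VALUES ('4009');
-- ...repeat for each remaining value in the original IN list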
After inserting all the hard-coded values into the volatile table, the query can be modified as below:
SELECT MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC, COUNT(*)
FROM EMPLOYEE MD
, DEPARTMENT VN
, EMP_DEPT TEMP
WHERE MD.DEPT_ID = VN.DEPT_ID
AND VN.LOC = 'USA'
AND MD.DEPT_ID = TEMP.DEPT_ID
AND MD.DEPT_ID BETWEEN '501' AND '9450'
GROUP BY MD.EMP_ID, MD.EMP_NAME, VN.DEPT_DESC
HAVING COUNT(MD.EMP_ID) >= 100
ORDER BY 2;
(4) USAGE OF MANIPULATED COLUMNS IN WHERE CLAUSE
When manipulated (transformed) columns are used in the WHERE clause or in JOIN conditions, the Optimizer is not able to use the statistics collected on those columns.
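For example (a sketch, assuming a hypothetical SALES table with statistics collected on the DATE column SALE_DT), a function applied to the column prevents those statistics from being used:

SELECT COUNT(*)
FROM SALES
WHERE EXTRACT(YEAR FROM SALE_DT) = 2012;   -- statistics on SALE_DT cannot be applied

The condition can be rewritten against the bare column so the statistics remain usable:

SELECT COUNT(*)
FROM SALES
WHERE SALE_DT BETWEEN DATE '2012-01-01' AND DATE '2012-12-31';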
(5) DATATYPE MISMATCH
Try to avoid data transformation during a join. Columns of the same data type should be used in the join condition; otherwise one of the compared fields will undergo conversion before the join happens.
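For example (a sketch with hypothetical tables, where ORDERS.CUST_ID is defined as INTEGER and CUST_FEED.CUST_ID as VARCHAR(10)), the mismatch forces a row-by-row conversion:

SELECT O.ORDER_ID, C.CUST_NAME
FROM ORDERS O
JOIN CUST_FEED C
ON O.CUST_ID = C.CUST_ID;   -- one side is implicitly converted before the join

Aligning the column data types, or converting the smaller table once into a volatile table with the matching type, avoids the repeated conversion.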
(6) CHARACTER SET

While joining two tables, make sure that both join columns use the same character set. Otherwise an implicit translation of one to the other takes place, resulting in poor performance.
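For example (a sketch with hypothetical tables), a LATIN column joined to a UNICODE column is translated before the comparison:

CREATE TABLE CUST_A (CUST_CD VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC);
CREATE TABLE CUST_B (CUST_CD VARCHAR(20) CHARACTER SET UNICODE NOT CASESPECIFIC);
-- a join on CUST_A.CUST_CD = CUST_B.CUST_CD translates the LATIN side to UNICODE for every row

Defining both columns with the same CHARACTER SET avoids the translation.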
(7) DATE COMPARISON
When comparing date values over a range, the query may result in a product join.
This can often be avoided by using SYS_CALENDAR.CALENDAR, Teradata's built-in calendar view.

Example:
Insert into table_a
select
t2.a1,t2.a2,t2.a3,t2.a4
from
table_2 t2
join table_3 t3
on t2.a1=t3.a1
and t2.a5_dt>=t3.a4_dt
and t2.a5_dt<=t3.a5_dt;
The above query can be rewritten using SYS_CALENDAR to reduce (though not completely eliminate) the product join.
Example:
Insert into table_a
select
t2.a1,t2.a2,t2.a3,t2.a4
from table_2 t2
join SYS_CALENDAR.CALENDAR sys_cal
on sys_cal.calendar_date = t2.a5_dt
join table_3 t3
on t2.a1=t3.a1
and sys_cal.calendar_date >=t3.a4_dt
and sys_cal.calendar_date <=t3.a5_dt;
(8) PROPER USAGE OF ALIAS AND TABLE NAMES
Example:
Insert into table_a
select
t2.a1, t2.a2, t2.a3, t2.a4
from
table_2 t2
join table_3 t3
on t2.a1=t3.a1
join table_4 t4
on table_3.a1= t4.a1;

This may result in high CPU usage even if the tables involved contain only a few rows, because referencing both the table name (table_3) and its alias (t3) causes the Optimizer to treat them as two separate instances of the same table, so table_3 is scanned twice.
If either of the tables is very large, the above case may also lead to a SPOOL space error.
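A corrected sketch that references the alias consistently:

INSERT INTO table_a
SELECT t2.a1, t2.a2, t2.a3, t2.a4
FROM table_2 t2
JOIN table_3 t3
ON t2.a1 = t3.a1
JOIN table_4 t4
ON t3.a1 = t4.a1;   -- the alias t3 is used instead of the table name table_3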

(9) MISSING JOINS


Example: Suppose there are three tables, tab1, tab2, and tab3, with the columns shown below:

tab1: empno, ename
tab2: deptno, ename
tab3: dname, ename

SELECT tab1.empno, tab2.deptno
FROM tab1, tab2, tab3
WHERE tab1.ename = tab2.ename;

Because tab3 appears in the FROM clause but has no join condition, the query produces a product join (Cartesian product) with tab3 and may consume high CPU.
If either of the tables is very big, the case may lead to a SPOOL space error.
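A corrected sketch in which every table has a join condition (column names assumed from the layout above):

SELECT tab1.empno, tab2.deptno
FROM tab1, tab2, tab3
WHERE tab1.ename = tab2.ename
AND tab2.ename = tab3.ename;   -- tab3 is now connected; no Cartesian product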

(10) UNNECESSARY JOINS


Try to avoid unnecessary joins, especially LEFT OUTER joins from which no columns are selected and on which no filter is applied; such a join can usually be removed (provided the joined table cannot contribute more than one row per key) without changing the result.
Example:
SELECT
E.EMP_ID,
E.EMP_NAME,
D.DEPT_NAME,
D.DEPT_LOC
FROM
EMPLOYEE E
JOIN DEPT D ON E.DEPT_ID = D.DEPT_ID
LEFT JOIN PAY_ROLL P ON E.EMP_ID = P.EMP_ID
WHERE
E.JOIN_DT >= DATE '2009-10-12';

The above query can be modified as

SELECT
E.EMP_ID,
E.EMP_NAME,
D.DEPT_NAME,
D.DEPT_LOC
FROM
EMPLOYEE E
JOIN DEPT D ON E.DEPT_ID = D.DEPT_ID
WHERE E.JOIN_DT >= DATE '2009-10-12';
(11) PROPER SELECTION OF PI FOR A TABLE
The Primary Index (PI) is the sole mechanism by which data is distributed over the AMPs, so the PI for a table should be chosen on a column with the most unique values.
When a table is created without an explicitly specified index, the first column is assumed to be the PI by default.
The query below can be used to see the distribution of a table's rows across the AMPs in a system:

SELECT
HASHAMP(HASHBUCKET(HASHROW(primary_index_columns))) AS "AMP"
, COUNT(*)
FROM your_table
GROUP BY 1
ORDER BY 2 DESC;

When a table does not have any column with sufficiently unique values, an identity column may help.

IDENTITY COLUMNS:
If a table does not have any column with sufficiently unique values, an identity column can be used.
Example:
unq_pk INTEGER GENERATED ALWAYS AS IDENTITY
(START WITH 1
INCREMENT BY 1
MINVALUE -2147483647 MAXVALUE 100000000
CYCLE),
Here "unq_pk" is the column name and "INTEGER" is its data type. The values are generated automatically whenever rows are inserted into the table holding the identity column (not necessarily in strict insertion order, because ranges of numbers are pre-allocated).
Using this column as the PI distributes the data almost evenly across the available AMPs, which avoids skewing the data to a single AMP.
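A minimal sketch of a table definition using such an identity column as its PI (hypothetical table and column names):

CREATE MULTISET TABLE SALES_FACT
(
unq_pk  INTEGER GENERATED ALWAYS AS IDENTITY
        (START WITH 1
         INCREMENT BY 1
         MINVALUE -2147483647
         MAXVALUE 100000000
         CYCLE),
sale_dt DATE,
amount  DECIMAL(12,2)
)
PRIMARY INDEX (unq_pk);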

RESTRICTIONS ON IDENTITY COLUMNS
- An identity column need not be the PI of the table.
- Identity columns cannot be part of a composite (multi-column) PI or SI.

The PI column:
- Should contain the maximum number of unique values.
- Should be unchangeable (rarely changed).
- Should be defined explicitly in the CREATE TABLE statement.
(12) SPLITTING THE QUERY

Reduce the number of joins in a query. If many joins have to be used, split the query into two parts by resolving a few of the joins into a volatile table, in such a way that the dense (large) tables get filtered early, avoiding unnecessary redistribution.
The other technique for splitting the query is to resolve all the INNER JOINs into a volatile table and then join that result with the remaining LEFT OUTER joins, as sketched below.
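A minimal sketch of this split (hypothetical EMPLOYEE, DEPT, and PAY_ROLL tables):

CREATE MULTISET VOLATILE TABLE EMP_DEPT_VT AS
(
SELECT E.EMP_ID, E.EMP_NAME, D.DEPT_NAME, D.DEPT_LOC
FROM EMPLOYEE E
JOIN DEPT D ON E.DEPT_ID = D.DEPT_ID            -- inner joins resolved first
WHERE E.JOIN_DT >= DATE '2009-10-12'
)
WITH DATA
PRIMARY INDEX (EMP_ID)
ON COMMIT PRESERVE ROWS;

SELECT V.EMP_ID, V.EMP_NAME, V.DEPT_NAME, P.PAY_GRADE
FROM EMP_DEPT_VT V
LEFT JOIN PAY_ROLL P ON V.EMP_ID = P.EMP_ID;    -- outer join applied to the reduced set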

The following methods can be used to scope down the size of SQL processing:
1) Table denormalization: duplicating data in another table. This provides faster access to the duplicated data, but requires more update time.
2) Table summarization: the data from one or many tables is summarized into commonly used summary tables. This provides faster access to the summarized data, but requires more update time.
3) SQL UNION: the DBC/SQL UNION can be used to break up a large SQL process or statement into several smaller SQL processes or statements, which can run in parallel.
4) Unix split: a large input Unix file can be split into several smaller files, which can then be input in series, or in parallel, to create smaller SQL processing steps.
5) Unix concatenation: a large query can be broken up into smaller independent queries whose output is written to several smaller Unix files. These smaller files are then concatenated together to provide a single file.
6) Trigger tables: a group of tables, each containing a subset of the keys of the index of an original table. The tables can be created based on some value in the index of the original table. This provides the ability to break up a large SQL statement into multiple smaller statements, but creating the trigger tables requires more update time.
7) Sorts (ORDER BY): although sorts take time, they are always done at the end of the query, and the sort time is directly dependent on the size of the result set. Unnecessary sorts should be eliminated.
8) Export/load: table data can be exported (BTEQ, FastExport) to a Unix file, updated there, and then reloaded into the table (BTEQ, FastLoad, MultiLoad).
9) C programs/Unix scripts: some data manipulation is very difficult and time consuming in SQL; such steps can be replaced with C programs or Unix scripts. See the C/Embedded SQL tip.
(13) CONCATENATION OF THE SQL
Concatenation allows you to retrieve data correlated to the MIN/MAX function in a single pass. This is a special application of concatenation that removes the need for a correlated subquery.
Example:
If you want to find the employee with the highest salary in each department, the query might be:

SELECT Dept_No, Salary, Last_Name, Fname
FROM Employee_Table
WHERE (Dept_No, Salary) IN
(SELECT Dept_No, MAX(Salary) FROM Employee_Table GROUP BY Dept_No)
ORDER BY Dept_No;

The above query could be rewritten as:

SELECT Dept_No, MAX(Salary || ' ' || Last_Name || ', ' || Fname)
FROM Employee_Table
GROUP BY Dept_No
ORDER BY Dept_No;
Note: If two or more employees have the same maximum salary, this query selects
only one employee per department.
(14) USAGE OF NESTED VIEWS:
Teradata has no materialized views, so it is better to avoid creating views by nesting many other views together. If it is really required, a table can be constructed instead and data can be loaded into it incrementally (for example, for a particular date or fiscal period).
(15) USAGE OF DERIVED TABLES:
Before using a derived table in a query, make sure that the derived table returns a minimal number of rows. If the same derived table has to be used more than once in a query, it is better to populate those values into a volatile table and join to it as appropriate.
(16) USAGE OF PI AND NON-PI JOINS:
If a query joins a PI column to a non-PI column, the table joined on the non-PI column may contain duplicate values for that column.
Example:
SELECT E.DEPT_ID, D.DEPT_LOC
FROM EMP E, DEPT D
WHERE E.DEPT_ID = D.DEPT_ID;
The above query can be rewritten as
SELECT E.DEPT_ID, D.DEPT_LOC
FROM DEPT D, (SELECT DEPT_ID FROM EMP GROUP BY 1) E
WHERE E.DEPT_ID = D.DEPT_ID;
The derived table removes the duplicate DEPT_ID values from the non-PI side before the rows are redistributed for the join.
(17) USAGE OF SET AND MULTISET TABLE:
A MULTISET table accepts duplicate records, whereas a SET table does not.
A SET table with no unique indexes forces Teradata to check for duplicate rows every time a row is inserted or updated, which can add a lot of overhead to those operations.
It is better to use MULTISET after ensuring that the records being populated into the target are always unique.
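A minimal sketch of the two definitions (hypothetical table and columns):

-- SET table with a non-unique PI: every insert/update triggers a duplicate-row check
CREATE SET TABLE EMP_HIST_SET
(EMP_ID INTEGER, LOAD_DT DATE)
PRIMARY INDEX (EMP_ID);

-- MULTISET table: no duplicate-row check, so inserts and loads are cheaper
CREATE MULTISET TABLE EMP_HIST_MSET
(EMP_ID INTEGER, LOAD_DT DATE)
PRIMARY INDEX (EMP_ID);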

