
Teradata Basics and Guidelines

Created By:
Machchhagandha Shete
Sumit Wadhwani


Table of Contents
Teradata Architecture
Parsing Engine
Message Passing Layer
Access Module Processor
Teradata Architecture Diagram
Teradata Storage Architecture
Teradata Retrieval Architecture
How Does Teradata Store Rows
Primary Index
Types of Primary Index
Purpose of Index
Essential factors while assigning the Primary Index
Symptoms of Poor Indexing
Primary Index Considerations
Queries to check the distribution and Space
Skewing
JOIN
Join Strategy
Join Index
Merge Join
Nested Join
Hash Join
Product Join
Partitioned Primary Index
Collect Statistics
Collect statistics on
Collect Multicolumn Statistics
Thumb-Rule for Collect Statistics
When to do Collect Statistics on table


Teradata Architecture
Basic components of Teradata Architecture:

The Parsing Engine


The Parsing Engine is responsible for:
Parsing and optimizing your SQL requests.
Dispatching the optimized plan to the AMPs.
Sending the answer set response back to the requesting client.

Message Passing Layer


The Message Passing Layer is responsible for:
Carrying messages between the AMPs and PEs.
Point-to-Point, Multi-Cast, and Broadcast communications.
Merging answer sets back to the PE.
Making Teradata parallelism possible.

Access Module Processor (AMP)


The AMPs are responsible for:
Finding the rows requested
Lock management
Sorting rows
Aggregating columns
Join processing
Output conversion and formatting
Creating answer set for client
Disk space management
Recovery processing


Teradata Architecture Diagram

[Figure: Teradata architecture diagram. A SQL Request enters the Parsing Engine (Parser, Optimizer, Dispatcher), which communicates through the Message Passing Layer with the AMPs; the Answer Set Response is returned to the client.]


Teradata Storage Architecture


The Parsing Engine interprets the SQL command and converts the data record from the host into an AMP message. It
then sends the message through the Message Passing Layer to the AMPs.
The Message Passing Layer distributes the row to the appropriate Access Module Processor (AMP).
The AMP formats the row and writes it to its associated disks. It performs the database management functions
such as sorting, aggregating, and formatting the data, and it retrieves the rows requested by the Parsing Engine.
Disks are simply disk drives associated with an AMP. They hold the data rows for subsequent access.

[Figure: Records from the client (in random sequence) pass through the Parsing Engine(s) and the Message Passing Layer and are distributed across AMP 1 to AMP 4.]

Teradata Retrieval Architecture


Retrieving data from the Teradata RDBMS simply reverses the process of the storage model. Below are the steps:
A request is made for data and is passed on to a Parsing Engine (PE). The PE optimizes the request for efficient
processing and creates tasks for the AMPs to perform, which will result in the request being satisfied.
These tasks are then dispatched to the AMPs via the Message Passing Layer.
Often all AMPs must participate in creating the answer set, such as when returning all rows of a table. Other times,
only one or a few AMPs need to participate, depending on the nature of the request. The PE will ensure that only the
AMPs that are needed will be assigned tasks on behalf of this request.
Once the AMPs have been given their assignments, they will retrieve the desired rows from their respective disks. If
sorting, aggregating or formatting of any kind is needed, the AMPs will also take care of that. The rows are then
returned to the requesting PE via the Message Passing Layer. The PE takes the returned answer set and returns it to
the requesting client application.
[Figure: Rows retrieved from the table are read by AMP 1 to AMP 4 and returned through the Message Passing Layer to the Parsing Engine(s).]

How Does Teradata Store Rows:


Teradata uses a hashing mechanism to distribute data randomly and evenly among all AMPs.
The rows of every table are distributed among all AMPs and ideally are evenly distributed.
Each AMP is responsible for a subset of the rows of each table.
Evenly distributed tables result in evenly distributed workloads.
The data is not distributed or placed in any specific order.
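
The hashing idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the 4-AMP system size and the use of SHA-256 as a stand-in hash are hypothetical and are not Teradata's actual row hash or hash map.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # hypothetical system size, chosen only for illustration


def amp_for(pi_value, num_amps=NUM_AMPS):
    """Map a primary index value to an AMP number (stand-in hash, not Teradata's)."""
    digest = hashlib.sha256(str(pi_value).encode("utf-8")).digest()
    row_hash = int.from_bytes(digest[:4], "big")  # pretend this is a 32-bit row hash
    return row_hash % num_amps


def rows_per_amp(pi_values, num_amps=NUM_AMPS):
    """Count how many rows land on each AMP."""
    counts = Counter(amp_for(value, num_amps) for value in pi_values)
    return [counts.get(amp, 0) for amp in range(num_amps)]


unique_ids = list(range(10_000))              # e.g. unique employee numbers
two_values = [i % 2 for i in range(10_000)]   # only two distinct PI values

even_counts = rows_per_amp(unique_ids)
skewed_counts = rows_per_amp(two_values)
print("unique PI values:", even_counts)
print("two PI values:   ", skewed_counts)
```

With unique values every AMP receives a similar share of the rows; with only two distinct values at most two AMPs receive any rows, because equal values always hash to the same AMP.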

Primary Index
The Primary index is the mechanism for assigning a data row to an AMP.
The Primary Index is the Physical Mechanism used to retrieve and distribute data.
The Table must have a Primary Index.
The Primary index cannot be changed (table must be dropped and recreated to change PI).
Primary Index plays three roles:
Data Distribution
Fastest way to retrieve data
Incredibly important for Joins

Types of Primary Index


Unique Primary Index (UPI)
A UPI will always spread the rows of the table evenly amongst the AMPs. UPI access is always a
one-AMP operation. It also requires no duplicate row checking. E.g. Employee Number
Non Unique Primary Index (NUPI)
A NUPI can contain matching keys. It does not necessarily spread the data uniformly amongst the
AMPs. NUPI access is always a one-AMP operation. A NUPI can be more effective for query access and joins, as
the co-location of data is important for efficient joins: rows with the same PI value hash to the same AMP, so tables joined on that value can be joined without moving rows.

Purpose of Index
To define the distribution of the rows to the AMPs.
To provide access to rows more efficiently than with full table scan.
If the values for all the primary index columns are specified in a DML statement, single-AMP
access can be made to the rows using that primary index value.
To provide for efficient joins.
To provide a means for efficient aggregations.

Essential factors while assigning the Primary Index


Uniform distribution of rows. Columns that distribute table rows evenly across the AMPs.
Optimal access to the data.
Choose as few columns as possible.
Index column should be used in join condition

Symptoms of Poor Indexing


Inappropriate or default indexes result in poor query performance. Some of the major indications are:
A select statement takes too long.
A join between two or more tables takes an extremely long time.
The table takes its first column as the default PI because no PI was specified explicitly.
Select operations perform well, but data modification processes perform poorly.


Point queries (for example, "where age = 18") perform well, but range queries (for example,
"where age > 18 and age < 50") perform poorly.
Spool space issues.
Primary Index Considerations
When joining tables, try to use PI columns as the joining columns so that access will always be
faster (since PI access is always a one-AMP operation).
Base the PI on the column(s) most often used for access, provided that the values are unique or nearly
unique.
Don't include VARCHAR columns in PIs; change them to CHAR instead.
PI columns should be defined as NOT NULL. Nullable fields should be avoided, as they can result in
skew.
Choose a column with good distribution and no spikes.
Distinct values distribute evenly across all AMPs.
For large tables, the number of Distinct Primary Index values should be much greater than the
number of AMPs.
Duplicate values hash to the same AMP and are stored in the same data block when possible.
Very non-unique values may skew space usage among AMPs and cause premature Database Full
conditions.
A large number of NUPI duplicate values on a SET table without a USI can cause expensive
duplicate row checks.
Very non-unique PIs also result in hot AMPs, which slow response time down to the amount of
time that the busiest AMP(s) require.
It's desirable to have matching PIs on tables that are frequently joined.
It's important to define a PI as UNIQUE if it truly is unique; this improves performance by eliminating
the duplicate row checking (see the point on SET tables above).

Queries to check the distribution and Space


Distribution of table by AMP Row Count:
SELECT HASHAMP (HASHBUCKET
(HASHROW (Customer_number))) AS "AMP #"
,COUNT(*)
FROM Customer
GROUP BY 1
ORDER BY 1 ;

Distribution of table by AMP Space Usage:


SELECT
Vproc,
TableName,
CurrentPerm
FROM
DBC.TableSize
WHERE DatabaseName = DATABASE
AND TableName = '<Table_Name>'
ORDER BY 1 ;


SKEWING
Data on a Teradata system is distributed across the AMPs solely on the basis of the primary index, using a
hashing algorithm. Therefore, if the primary index is unique, the data is distributed evenly.
Example:
100 MB of data with a unique primary index on a 100-AMP system would assign 1 MB of data to each AMP.
Teradata assigns space on the basis of the maximum space usage on one AMP times the number of AMPs, so
max space usage here is: 1 MB x 100 AMPs = 100 MB of space for 100 MB of data = perfect distribution. Each
AMP is always assigned the same space for each table, regardless of use.
However, suppose the primary index chosen for a table is non-unique. As an extreme example, a user creates
in error a table with a PI of Source System Id, which has only two different IDs. The data is then
distributed across only 2 of the (hypothetical) 100 AMPs, so 50 MB on each of 2 AMPs. Space usage is then
calculated as the maximum on a single AMP (50 MB, assuming a 50/50 split between the two IDs) x 100
AMPs = 5000 MB of space. The data takes up 50 times the space on the system, because each AMP has 50 MB of
space assigned for this data.
When this happens the table is said to be skewed.
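
The space arithmetic above can be checked with a few lines of code. This is a minimal sketch; the function name is hypothetical and the rule it encodes is simply the one stated in the text (busiest AMP times number of AMPs).

```python
def allocated_space_mb(per_amp_usage_mb):
    """Space charged for a table: the busiest AMP's usage times the number of AMPs."""
    return max(per_amp_usage_mb) * len(per_amp_usage_mb)


even = [1] * 100               # 100 MB spread evenly over 100 AMPs
skewed = [50, 50] + [0] * 98   # the same 100 MB held by only 2 AMPs

even_space = allocated_space_mb(even)
skewed_space = allocated_space_mb(skewed)
print(even_space)                  # 100
print(skewed_space)                # 5000
print(skewed_space / sum(skewed))  # 50.0
```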

Skewing in Queries:
In joins, the data is distributed on the joining columns. To join 2 (or more) tables, Teradata must have the
matching rows on the same AMP. If the query joins the tables on identical PI values, the data is already on
the same AMP and no redistribution takes place. This is extremely efficient.
However, when joining tables with different PIs, or joining on non-PI values, the data for at least one table
must be redistributed on the joining columns in order to perform the join. This may be done in spool or in memory,
depending on the size of the tables (or subsets of table data) selected. The uniqueness of the joining column
values therefore becomes a deciding factor in spool distribution.
Where tables or spool are distributed on non-unique values, leaving a handful of AMPs with most of the data,
the data is said to be 'skewed'. Queries on skewed data take proportionally longer to run, because a query is
only as fast as its slowest AMP. By the same reasoning, a query on data distributed evenly over 240 AMPs will be
about 10 times quicker than on data distributed evenly over 24 AMPs. Skewed spool is created when the optimizer
redistributes data on columns that contain skewed values. On 95% of the occasions when a user receives a message
that a query has run out of spool, the cause is skewed data or skewed spool, because Teradata calculates a user's
total spool requirement by taking the AMP using the most spool and multiplying this value by the number of AMPs.
Thus skewing is the major cause of spool consumption and of the consumption of CPU seconds.
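
The effect of the slowest AMP on run time can be sketched as follows. The function name and the per-AMP processing rate are hypothetical values chosen for illustration; the only rule modelled is that AMPs work in parallel, so a step ends when the busiest AMP finishes.

```python
def elapsed_seconds(rows_per_amp, rows_per_second=1000):
    """AMPs work in parallel, so a step finishes when the busiest AMP finishes."""
    return max(rows_per_amp) / rows_per_second


total_rows = 2_400_000
even_24 = [total_rows // 24] * 24      # evenly spread over 24 AMPs
even_240 = [total_rows // 240] * 240   # evenly spread over 240 AMPs
skewed_240 = [total_rows // 2, total_rows // 2] + [0] * 238  # 2 hot AMPs out of 240

print(elapsed_seconds(even_24))     # 100.0
print(elapsed_seconds(even_240))    # 10.0
print(elapsed_seconds(skewed_240))  # 1200.0
```

In this model the skewed 240-AMP layout is slower than the even 24-AMP layout, even though it has ten times as many AMPs available.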


JOINS
JOIN Strategy:
Based on the table columns which are a part of the join condition, Teradata selects one of the below three join
strategies:
Table Redistribution
Rows are moved to the AMP where their related rows are located; for example, Item rows are moved to the AMP where their related Orders are located. When the joining condition does not
include the primary index columns, Teradata redistributes one or both of the tables.
Table Duplication
The smaller table is duplicated to each AMP. When the joining condition does not include the primary index columns
and one of the tables is very small, Teradata duplicates the small table on all the AMPs.
AMP-local join
No movement: both tables' related rows are already on the same AMP. When the join is based on the primary
index, it is a local AMP operation. This is the preferred situation, as it performs best.
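
The choice between the three strategies can be sketched as a comparison of how many rows each option would move. This is an illustrative model only: the function name, its parameters and the rows-moved cost rule are assumptions made for this example, and the real optimizer relies on statistics and a much richer cost model.

```python
def choose_join_preparation(left_rows, right_rows, left_on_pi, right_on_pi, num_amps):
    """Return (strategy, rows_moved) under a simplified rows-moved cost model."""
    if left_on_pi and right_on_pi:
        # Both tables are joined on their primary index: rows are already co-located.
        return "amp-local", 0
    # Redistribution sends each row of every table not joined on its PI once.
    redistribute_cost = (0 if left_on_pi else left_rows) + (0 if right_on_pi else right_rows)
    # Duplication copies the smaller table to every AMP; the larger table stays put.
    duplicate_cost = min(left_rows, right_rows) * num_amps
    if duplicate_cost < redistribute_cost:
        return "duplicate smaller table", duplicate_cost
    return "redistribute", redistribute_cost


print(choose_join_preparation(1_000_000, 500_000, True, True, 100))
print(choose_join_preparation(1_000_000, 100, False, False, 100))
print(choose_join_preparation(1_000_000, 50_000, True, False, 100))
```

Under these assumptions a 100-row lookup table is cheaper to duplicate than to redistribute a million-row table, while a 50,000-row table is cheaper to redistribute than to copy to 100 AMPs.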

JOIN INDEX
Points to know about Join Index:
For two rows in two tables to be joined, they must reside on the same AMP.
If two tables are being joined on columns which include the Primary Index, and the Primary Indexes have
the same data type, they will already reside on the same AMP. In case of different data types, either one
or both the tables will be redistributed.
If not, then Teradata will move the rows of one or both tables so that potentially matching rows will be
on the same AMP, either in spool or in memory. This is accomplished by redistributing one or
both of the tables. If the join is based on the Primary Index of one table, then only one table will be
redistributed, or one table will be duplicated across all AMPs.
Only those columns and rows that are required by the SQL code will be involved in the redistribution or
duplication.
Join Processing never moves or changes the original table rows in any way.
The performance of the Join is largely determined by the number of rows which are selected to be
involved in the join from each table.
The sequence that Tables/Views occur in the SQL code does not influence the sequence that
Tables/Views are joined. This sequence is determined by the Optimizer.
If a Join involves the columns of the Primary Indexes, and in addition other columns are involved then
Teradata will not need to redistribute the tables because the data will already be suitably distributed.
Once the data is on the correct AMP, it is sorted into row-hash sequence.


Types of Join
When a user submits a join query, the optimizer comes up with a join plan to perform the joins. The join strategies
include:
Merge Join
Nested Join
Hash Join
Product Join

Merge Join
A merge join requires the rows to be joined to be present on the same AMP. If the rows to be joined are not
on the same AMP, Teradata will either redistribute the data or duplicate the data in spool to make that happen,
based on the row hash of the columns involved in the join's WHERE clause.
If the two tables to be joined have the same Primary Index, then the records will be present on the same AMP and redistribution of records is not required.
There are three scenarios for redistribution in a merge join:
Case 1: If the join is UPI = UPI, the records to be joined are present on the same AMP and redistribution is
not required. This is the most efficient and fastest join strategy.
Example:
SELECT * FROM svuedw_e.RtlSale rs, svuedw_e.RtlSaleItem rsi
WHERE
rs.Rtl_Trans_Id = rsi.Rtl_Trans_Id AND
rsi.Trans_End_Dt = '2011-05-01' AND
rs.Trans_End_Dt = '2011-05-01'
Explain Plan:
1) First, we lock SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem for access, and we lock SVUEDW_T.RtlSale in
view svuedw_e.RtlSale for access.
2) Next, we do an all-AMPs JOIN step from a single partition of SVUEDW_T.RtlSale in view svuedw_e.RtlSale with
a condition of ("SVUEDW_T.RtlSale in view svuedw_e.RtlSale.Trans_End_Dt = DATE'2011-05-01'") with a residual
condition of ("SVUEDW_T.RtlSale in view svuedw_e.RtlSale.Trans_End_Dt = DATE '2011-05-01'"), which is joined to
a single partition of SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem with a condition of
("SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem.Trans_End_Dt = DATE '2011-05-01'") with a residual
condition of ("SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem.Trans_End_Dt = DATE '2011-05-01'").
SVUEDW_T.RtlSale and SVUEDW_T.RtlSaleItem are joined using a row key-based merge join, with a join condition
of ("(SVUEDW_T.RtlSale.Rtl_Trans_Id = SVUEDW_T.RtlSaleItem.Rtl_Trans_Id) AND
(SVUEDW_T.RtlSale.Trans_End_Dt = SVUEDW_T.RtlSaleItem.Trans_End_Dt)"). The input tables SVUEDW_T.RtlSale
and SVUEDW_T.RtlSaleItem will not be cached in memory. The result goes into Spool 5(group_amps), which is built
locally on the AMPs. The size of Spool 5 is estimated with low confidence to be 914,021 rows (352,812,106 bytes).
The estimated time for this step is 0.37 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 5 are sent back to the user as the result of statement 1. The total estimated time
is 0.37 seconds.


Case 2: If the join is UPI = non-index column, the rows of the second table have to be redistributed across the
AMPs on the join column, so that they land on the same AMPs as the matching rows of the first table.

Example:
DMConsumerGroupModelScore: index columns (Consumer_Group_Id, Model_Type_Cd)
DMModelType: index column Model_type_Id
SELECT
T0.Consumer_Group_Id,
T0.Model_Type_Cd,
T0.Data_AsOf_Dt,
T0.Decile_Num,
T0.Score_Num,
T0.Row_Load_TS,
DMT.Model_Type_Id
FROM
ConsumerGroupModelScore_T0 T0
INNER JOIN
DMModelType DMT
ON T0.Model_Type_Id = DMT.Model_Type_Id;

Explain Plan:
1) First, we lock a distinct SVUEDWDG11_S."pseudo table" for read on a RowHash to prevent global deadlock for
SVUEDWDG11_S.T0.
2) Next, we lock a distinct SVUEDWDG11_S."pseudo table" for read on a RowHash to prevent global deadlock for
SVUEDWDG11_S.DMT.
3) We lock SVUEDWDG11_S.T0 for read, and we lock SVUEDWDG11_S.DMT for read.
4) We do an all-AMPs RETRIEVE step from SVUEDWDG11_S.T0 by way of an all-rows scan with a condition of
("NOT(SVUEDWDG11_S.T0.Model_Type_id IS NULL)") into Spool 2 (all_amps), which is redistributed by the hash
code of (SVUEDWDG11_S.T0.Model_Type_id) to all AMPs. Then we do a SORT to order Spool 2 by row hash. The
size of Spool 2 is estimated with low confidence to be 78 rows (4,446 bytes). The estimated time for this step is
0.01 seconds.
5) We do an all-AMPs JOIN step from SVUEDWDG11_S.DMT by way of a RowHash match scan with no residual
conditions, which is joined to Spool 2 (Last Use) by way of a RowHash match scan. SVUEDWDG11_S.DMT and
Spool 2 are joined using a merge join, with a join condition of ("Model_Type_id =
SVUEDWDG11_S.DMT.Model_Type_Id"). The result goes into Spool 1 (group_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with index join confidence to be 78 rows (5,772 bytes). The estimated time
for this step is 0.06 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.07
seconds.

Case 3: If the join is non-index column = non-index column, both tables have to be
redistributed so that matching rows lie on the same AMP; the join then runs on the redistributed data. This strategy
is time consuming, since both tables are completely redistributed across all the AMPs.
Example:
1) DMConsumerGroupModelScore: index columns (Consumer_Group_Id, Model_Type_Cd)
2) DMModelType: index column Model_type_Id


SELECT
T0.Consumer_Group_Id,
T0.Model_Type_Cd,
DMT.Model_Type_Id
FROM
DMConsumerGroupModelScore_T0 T0
INNER JOIN
DMModelType DMT
ON
T0.Model_Type_Cd = DMT.Model_Type_Cd;

Explain Plan:
1) First, we lock SVUEDWDCAMPAIGN_US.DMConsumerGroupModelScore_T0 in view
SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore_T0 for access, and we lock
SVUEDWDCAMPAIGN_T.DMModelType in view SVUEDWDCAMPAIGN_E.DMModelType for access.
2) Next, we execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from SVUEDWDCAMPAIGN_T.DMModelType in view
SVUEDWDCAMPAIGN_E.DMModelType by way of an all-rows scan with no residual conditions into Spool 5
(all_amps), which is redistributed by the hash code of (SVUEDWDCAMPAIGN_T.DMModelType.Model_Type_Cd)
to all AMPs.
Then we do a SORT to order Spool 5 by row hash. The size of Spool 5 is estimated with low confidence to be
78 rows (2,262 bytes). The estimated time for this step is 0.01 seconds.
2) We do an all-AMPs RETRIEVE step from SVUEDWDCAMPAIGN_US.DMConsumerGroupModelScore_T0 in
view SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore_T0 by way of an all-rows scan with no residual
conditions into Spool 6 (all_amps), which is redistributed by the hash code of
(SVUEDWDCAMPAIGN_US.DMConsumerGroupModelScore_T0.Model_Type_Cd) to all AMPs. Then we do a SORT
to order Spool 6 by row hash. The size of Spool 6 is estimated with low confidence to be 78 rows (2,262 bytes). The
estimated time for this step is 0.01 seconds.
3) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of a RowHash match scan, which is joined to Spool
6 (Last Use) by way of a RowHash match scan. Spool 5 and Spool 6 are joined using a merge join, with a join
condition of ("Model_Type_Cd = Model_Type_Cd"). The result goes into Spool 4 (group_amps), which is built
locally on the AMPs. The size of Spool 4 is estimated with no confidence to be 689 rows (28,249 bytes). The
estimated time for this step is 0.06 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 4 are sent back to the user as the result of statement 1. The total estimated time is 0.08
seconds.
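
The matching step of a merge join, once the rows are on the same AMP and sorted, can be sketched as below. This is a simplified illustration: it sorts on the join value itself rather than on a row hash, the function name is hypothetical, and the sample values (including the Score_Num and Model_Type_Id values) are made up.

```python
def merge_join(left, right, key):
    """Join two lists of dict rows on `key` by sorting both sides and merging."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it with
            # every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            while i < len(left) and left[i][key] == left_key:
                for right_row in right[j:j_end]:
                    result.append({**left[i], **right_row})
                i += 1
            j = j_end
    return result


scores = [
    {"Model_Type_Cd": "B", "Score_Num": 7},
    {"Model_Type_Cd": "A", "Score_Num": 3},
    {"Model_Type_Cd": "B", "Score_Num": 9},
    {"Model_Type_Cd": "C", "Score_Num": 1},
]
model_types = [
    {"Model_Type_Cd": "A", "Model_Type_Id": 1},
    {"Model_Type_Cd": "B", "Model_Type_Id": 2},
    {"Model_Type_Cd": "D", "Model_Type_Id": 4},
]
joined = merge_join(scores, model_types, "Model_Type_Cd")
for row in joined:
    print(row)
```

Because both inputs are sorted, each side is read only once; this is why the plans above show a SORT step before the merge join whenever a spool had to be redistributed.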

Nested Join
Uses Unique indexes to retrieve a single row from one table and join to one or more rows from another table using
an index.
Example:
Both tables have an index on Consumer_id.
SELECT
c.Consumer_id,
l.Loyalty_Member_id
FROM
svuedwdconsumer_s.loyaltymember l,
svuedwdconsumer_s.consumer c
where
l.consumer_id = c.consumer_id
and c.consumer_id=860870
Explain Plan:


1) First, we do a single-AMP JOIN step from svuedwdconsumer_s.l by way of the primary index
"svuedwdconsumer_s.l.Consumer_Id = 860870" with no residual conditions, which is joined to
svuedwdconsumer_s.c by way of the primary index "svuedwdconsumer_s.c.Consumer_Id = 860870" with a
residual condition of ("svuedwdconsumer_s.c.Consumer_Id = 860870"). svuedwdconsumer_s.l and
svuedwdconsumer_s.c are joined using a merge join, with a join condition of
("svuedwdconsumer_s.l.Consumer_Id = svuedwdconsumer_s.c.Consumer_Id").
The result goes into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with
low confidence to be 4 rows (116 bytes). The estimated time for this step is 0.00 seconds.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.00
seconds.
Hash Join
The hash join does not require that both tables be sorted. The smaller table/spool is "hashed" into memory. Then
the larger table is scanned and, for each row, the matching row is looked up in the in-memory hash table built
from the smaller table.
In the hash join, the small table is loaded into memory based on the hashing of the join columns. Each time a
row from the large table is read, its join columns are run through the hashing algorithm,
which points directly to the correct row(s) from the small table. This is direct access to the small
table rather than a scan of it. In a hash join, one or both of the tables, which are on the same AMP, fit completely
inside the AMP's memory. The AMP chooses to hold small tables in its memory for joins happening on row hash.
Advantages of hash joins are:
1. They are faster than merge joins, since the large table doesn't need to be sorted.
2. The join happens between the table in AMP memory and the table in unsorted spool, so it completes quickly.
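
The build-and-probe pattern described above can be sketched as follows. This is a minimal illustration using a Python dictionary as the in-memory hash table; the function name is hypothetical and the sample rows are made up.

```python
from collections import defaultdict


def hash_join(small, large, key):
    """Build an in-memory hash table on the small input, then probe it with the large one."""
    table = defaultdict(list)
    for row in small:                 # build phase: one pass over the small table
        table[row[key]].append(row)
    result = []
    for row in large:                 # probe phase: one pass, no sort required
        for match in table.get(row[key], []):
            result.append({**match, **row})
    return result


price_methods = [
    {"Price_Method_Id": 1, "Price_Method_cd": "REG"},
    {"Price_Method_Id": 2, "Price_Method_cd": "PROMO"},
]
sales = [
    {"Sale_Id": 10, "Price_Method_Id": 2},
    {"Sale_Id": 11, "Price_Method_Id": 1},
    {"Sale_Id": 12, "Price_Method_Id": 2},
    {"Sale_Id": 13, "Price_Method_Id": 9},  # no matching price method
]
hash_joined = hash_join(price_methods, sales, "Price_Method_Id")
for row in hash_joined:
    print(row)
```

The large input is read in whatever order it arrives, which is the reason the text gives for the hash join avoiding the sort that a merge join needs.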

Product Join
Preferred by optimizer if one table is small (typically a lookup table), and the join does not use the Primary Index of
the larger table. In this case Teradata will Redistribute the larger table, or Duplicate the smaller one. There is no
sort but there are many comparisons.
Example:
Query:
SELECT opds.*, pm.Price_Method_cd FROM
svuedw_e.OrgProductDaySales opds,
svuedw_e.PriceMethod pm
WHERE
pm.Price_Method_Id = opds.Price_Method_Id and
opds.Org_Id = 2
Explain Plan:
1) First, we lock SVUEDW_T.PriceMethod in view svuedw_e.PriceMethod for access, and we lock
SVUEDW_A.OrgProductDaySales in view svuedw_e.OrgProductDaySales for access.
2) Next, we do an all-AMPs RETRIEVE step from SVUEDW_T.PriceMethod in view svuedw_e.PriceMethod by way
of an all-rows scan with no residual conditions into Spool 6 (all_amps) (compressed columns allowed), which is
duplicated on all AMPs. The size of Spool 6 is estimated with high confidence to be 6,768 rows (128,592 bytes).
The estimated time for this step is 0.01 seconds.
3) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of an all-rows scan, which is joined to
SVUEDW_A.OrgProductDaySales in view svuedw_e.OrgProductDaySales by way of an all-rows scan with a condition
of ("SVUEDW_A.OrgProductDaySales in view svuedw_e.OrgProductDaySales.Org_Id = 2"). Spool 6 and
SVUEDW_A.OrgProductDaySales are joined using a product join, with a join condition of ("Price_Method_Id =
SVUEDW_A.OrgProductDaySales.Price_Method_Id"). The input table SVUEDW_A.OrgProductDaySales will not be
cached in memory, but it is eligible for synchronized scanning. The result goes into Spool 5 (group_amps), which is
built locally on the AMPs. The size of Spool 5 is estimated with low confidence to be 17,651,256 rows
(5,842,565,736 bytes). The estimated time for this step is 3 minutes and 7 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 5 are sent back to the user as the result of statement 1. The total estimated time is
3 minutes and 7 seconds.
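
The "no sort but many comparisons" behaviour of a product join can be sketched as a nested loop that tests every pair of rows. The function name and the sample rows are hypothetical; the point is only the number of comparisons.

```python
def product_join(left, right, condition):
    """Compare every left row with every right row; return matches and comparison count."""
    result, comparisons = [], 0
    for left_row in left:
        for right_row in right:
            comparisons += 1
            if condition(left_row, right_row):
                result.append({**left_row, **right_row})
    return result, comparisons


lookup = [{"Price_Method_Id": i, "Price_Method_cd": f"PM{i}"} for i in range(5)]
facts = [{"Sale_Id": n, "Method_Ref": n % 5} for n in range(1000)]

matches, comparisons = product_join(
    lookup, facts, lambda l, r: l["Price_Method_Id"] == r["Method_Ref"]
)
print(len(matches))   # 1000
print(comparisons)    # 5000
```

The comparison count is always the product of the two row counts, so the approach is only reasonable when one side is small, as with the lookup table in the example above.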


Partitioned Primary Index


A partitioned primary index (PPI) physically splits the table into a series of sub-tables, one for every partitioning
value.
In the best case, the query conditions allow every partition but one to be eliminated for each partitioning expression.
Significant performance benefits can be achieved, therefore, if the data demographics and queries in the workload
lead to partition elimination.
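
Partition elimination can be illustrated with a small model in which rows are grouped by date and a query with a date condition reads only one group. This is a sketch under stated assumptions: the `scan` function, the 30-day range and the 100 rows per day are hypothetical, and only the column names come from the examples in this document.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical table: 30 daily partitions of 100 rows each.
partitions = defaultdict(list)
first_day = date(2011, 5, 1)
for day_offset in range(30):
    day = first_day + timedelta(days=day_offset)
    for n in range(100):
        partitions[day].append({"Rtl_Trans_Id": day_offset * 100 + n, "Trans_End_Dt": day})


def scan(partitions, trans_end_dt=None):
    """Return (rows, rows_read); a date condition limits the scan to one partition."""
    if trans_end_dt is not None:
        candidates = [partitions.get(trans_end_dt, [])]  # partition elimination
    else:
        candidates = list(partitions.values())           # every partition is read
    rows = [row for partition in candidates for row in partition]
    return rows, len(rows)


_, read_with_condition = scan(partitions, date(2011, 5, 1))
_, read_without_condition = scan(partitions)
print(read_with_condition)     # 100
print(read_without_condition)  # 3000
```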

Guidelines for PPI choices:


Use the DATE data type, if possible, for a date-based partitioning expression. A date-based partitioning
expression is often a good choice for at least one level of partitioning. This will allow the Teradata Database
to better recognize partition elimination opportunities.
Keep the partitioning expression simple. A RANGE_N partitioning expression usually works the best for
partition elimination. With multilevel partitioning, while one level usually does RANGE_N date-based
partitioning, other levels may use CASE_N partitioning.
Add query conditions on the partitioning columns, where possible, to improve partition elimination
opportunities.
If queries join on the PI but the PI doesn't include the partitioning column, consider propagating the
partitioning column value to the other table and modifying the query to also join on the partitioning
column and the column propagated to the other table.
Collect statistics on the system-derived column PARTITION and the usual recommended indexes and
columns. Collecting on the partitioning columns themselves is usually also a good idea, but statistics on
PARTITION may be enough for good plans. Check EXPLAINs and measure performance to make sure.
A partitioning expression is good only if queries take advantage of it; in other words, if partition elimination occurs.
Example:
1) RtlSale and RtlSaleDiscount tables both have PPI defined on the Trans_End_Dt columns:
Query1:
SELECT * FROM svuedw_e.RtlSale rs, svuedw_e.RtlSaleItem rsi
WHERE
rs.Rtl_Trans_Id = rsi.Rtl_Trans_Id AND
rs.Trans_End_Dt = '2011-05-01'
Explain Plan:
1) First, we lock SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem for access, and we lock SVUEDW_T.RtlSale in
view svuedw_e.RtlSale for access.
2) Next, we do an all-AMPs RETRIEVE step from a single partition of SVUEDW_T.RtlSale in view svuedw_e.RtlSale
with a condition of ("SVUEDW_T.RtlSale in view svuedw_e.RtlSale.Trans_End_Dt = DATE '2011-05-01'") with a
residual condition of ("SVUEDW_T.RtlSale in view svuedw_e.RtlSale.Trans_End_Dt = DATE '2011-05-01'") into
Spool 6 (all_amps) (compressed columns allowed), which is built locally on the AMPs. Then we do a SORT to order
Spool 6 by the hash code of (SVUEDW_T.RtlSale.Rtl_Trans_Id). The size of Spool 6 is estimated with high confidence
to be 3,644,099 rows (685,090,612 bytes). The estimated time for this step is 0.15 seconds.
3) We do an all-AMPs JOIN step from SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem by way of a
RowHash match scan with no residual conditions, which is joined to Spool 6 (Last Use) by way of a RowHash match
scan. SVUEDW_T.RtlSaleItem and Spool 6 are joined using a sliding-window merge join, with a join condition of
("Rtl_Trans_Id = SVUEDW_T.RtlSaleItem.Rtl_Trans_Id"). The input table SVUEDW_T.RtlSaleItem will not be cached in
memory, but it is eligible for synchronized scanning. The result goes into Spool 5 (group_amps), which is built locally
on the AMPs. The size of Spool 5 is estimated with low confidence to be 4,916,307 rows (1,897,694,502 bytes). The
estimated time for this step is 52.68 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 5 are sent back to the user as the result of statement 1. The total estimated time is 52.83 seconds.

Query2 (any of the following forms):
SELECT * FROM svuedw_e.RtlSale rs, svuedw_e.RtlSaleItem rsi
WHERE
rs.Rtl_Trans_Id = rsi.Rtl_Trans_Id AND
rs.Trans_End_Dt = rsi.Trans_End_Dt AND
rs.Trans_End_Dt = '2011-05-01'

SELECT * FROM svuedw_e.RtlSale rs, svuedw_e.RtlSaleItem rsi
WHERE
rs.Rtl_Trans_Id = rsi.Rtl_Trans_Id AND
rsi.Trans_End_Dt = '2011-05-01' AND
rs.Trans_End_Dt = '2011-05-01'

SELECT * FROM svuedw_e.RtlSale rs, svuedw_e.RtlSaleItem rsi
WHERE
rs.Trans_End_Dt = rsi.Trans_End_Dt AND
rs.Rtl_Trans_Id = rsi.Rtl_Trans_Id AND
rs.Trans_End_Dt = '2011-05-01'

The order of the conditions in the join doesn't matter.
Explain Plan:
1) First, we lock SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem for access, and we lock SVUEDW_T.RtlSale in
view svuedw_e.RtlSale for access.
2) Next, we do an all-AMPs JOIN step from a single partition of SVUEDW_T.RtlSale in view svuedw_e.RtlSale with a
condition of ("SVUEDW_T.RtlSale in view svuedw_e.RtlSale.Trans_End_Dt = DATE'2011-05-01'") with a residual
condition of ("SVUEDW_T.RtlSale in view svuedw_e.RtlSale.Trans_End_Dt = DATE '2011-05-01'"), which is joined to a
single partition of SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem with a condition of
("SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem.Trans_End_Dt = DATE '2011-05-01'") with a residual condition
of ("SVUEDW_T.RtlSaleItem in view svuedw_e.RtlSaleItem.Trans_End_Dt = DATE '2011-05-01'"). SVUEDW_T.RtlSale
and SVUEDW_T.RtlSaleItem are joined using a row key-based merge join, with a join condition of
("(SVUEDW_T.RtlSale.Rtl_Trans_Id = SVUEDW_T.RtlSaleItem.Rtl_Trans_Id) AND (SVUEDW_T.RtlSale.Trans_End_Dt =
SVUEDW_T.RtlSaleItem.Trans_End_Dt)"). The input tables SVUEDW_T.RtlSale and SVUEDW_T.RtlSaleItem will not be
cached in memory. The result goes into Spool 5(group_amps), which is built locally on the AMPs. The size of Spool 5
is estimated with low confidence to be 914,021 rows (352,812,106 bytes). The estimated time for this step is 0.37
seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 5 are sent back to the user as the result of statement 1. The total estimated time is 0.37
seconds.


Collect Statistics
Collecting statistics on the table columns helps the optimizer obtain the demographics it needs to devise the least
expensive plan to execute a query.
Statistics help the Teradata RDBMS to develop the most optimal plan to execute a query by providing information on
the distribution of the data values found in a given Index or column. The Teradata RDBMS relies most heavily on
accurate statistics when joining tables.
Why collect STATISTICS?
Statistics improve query response times, especially when collected on columns used in JOIN, WHERE clause and
SUBQUERY sql statements.
Collect statistics on
1) Primary index column(s)
2) Non-Unique Secondary Indices (NUSI)
3) PPI index
4) Foreign Keys
5) Any column used in a Join (i.e. where A.columnx = B.columnx)
6) Columns referenced in WHERE clauses if they hold highly skewed, non-unique data.
7) If an index covers more than one column but only one of those columns is used in a WHERE clause or join
condition, collect separate statistics on that column as well.
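
As a minimal sketch of the list above, the statements below collect statistics on the RtlSale table used in the earlier join example. The document does not state which column is the primary index or the partitioning column of that table, so treating Rtl_Trans_Id as the primary index and Trans_End_Dt as the partitioning column is an assumption.

```sql
-- Assumption: Rtl_Trans_Id is the primary index column of RtlSale.
COLLECT STATISTICS ON SVUEDW_T.RtlSale COLUMN (Rtl_Trans_Id);

-- Assumption: Trans_End_Dt is the partitioning column; it is also a join
-- and WHERE-clause column in the earlier example.
COLLECT STATISTICS ON SVUEDW_T.RtlSale COLUMN (Trans_End_Dt);

-- PARTITION is the system-derived column of a PPI table; statistics on it
-- tell the optimizer how many rows each partition holds.
COLLECT STATISTICS ON SVUEDW_T.RtlSale COLUMN PARTITION;
```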
Collect Multicolumn Statistics
Groups of columns that often appear together in conditions with equality predicates.
Groups of columns used for joins or aggregations, where there is either a dependency or some degree of
correlation among them. With no multicolumn statistics collected, the optimizer assumes complete
independence among the column values. The more strongly the actual value combinations are correlated,
the greater the benefit of collecting multicolumn statistics.
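
For illustration, the earlier join example compares Rtl_Trans_Id and Trans_End_Dt together with equality predicates, so that pair is a plausible candidate for a multicolumn statistic. Whether the two columns are actually correlated in this data is not stated in the document, so this is a sketch rather than a recommendation.

```sql
-- One statistic over the column pair, so the optimizer does not have to
-- assume the two columns are independent.
COLLECT STATISTICS ON SVUEDW_T.RtlSaleItem
    COLUMN (Rtl_Trans_Id, Trans_End_Dt);

-- List the statistics now defined on the table.
HELP STATISTICS SVUEDW_T.RtlSaleItem;
```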
Thumb Rule for Collect Statistics
For small tables (number of rows < 5 * number of AMPs), collect statistics on the Primary Index.
For small tables involved in a join, collect statistics on the Primary Index.
For large tables, collect statistics on the Non-Unique Primary Index.
For all tables, collect statistics on columns in any WHERE clause.
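
The small-table rule above can be checked with a query along these lines. It is a sketch that uses the DMConsumerGroupModelScore_T1 table from the example later in this section; the output column names are made up for illustration. HASHAMP() with no argument returns the highest AMP number, so adding 1 gives the number of AMPs.

```sql
-- Compare the row count of a table with 5 * number of AMPs.
SELECT
    COUNT(*)      AS Row_Cnt,
    HASHAMP() + 1 AS Amp_Cnt,
    CASE
        WHEN COUNT(*) < 5 * (HASHAMP() + 1) THEN 'Y'
        ELSE 'N'
    END           AS Is_Small_Table
FROM svuedwdcampaign_s.DMConsumerGroupModelScore_T1;
```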
When to do collect stats on table:
Collect statistics only when absolutely necessary, i.e.
When there is at least a 10% change in the data.
When the size of the table has not changed by 10%, yet the distribution of the data has changed significantly.
If statistics are not collected or are not current, and the wrong plan is used by Teradata, then many
thousands of CPU seconds can be used instead of a few hundred.
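
One way to apply this guideline is to look at when statistics were last collected before deciding to refresh them. The sketch below uses the table from the example that follows; judging whether the data has changed by 10% still requires comparing the current row count with the count at the time of the last collection, which is left to the reader.

```sql
-- Show the collection date, time and unique-value count of each statistic.
HELP STATISTICS svuedwdcampaign_s.DMConsumerGroupModelScore_T1;

-- Current row count, to compare against the count at the last collection.
SELECT COUNT(*) AS Row_Cnt
FROM svuedwdcampaign_s.DMConsumerGroupModelScore_T1;

-- With no COLUMN or INDEX clause, this refreshes every statistic that is
-- already defined on the table.
COLLECT STATISTICS ON svuedwdcampaign_s.DMConsumerGroupModelScore_T1;
```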
Example: Query and Explain plan without stats
SELECT
DMM.Consumer_Group_Id,
DMM.Model_Type_Id,
DMM.Decile_Num ,
DMM.Score_Num ,
DMM.Data_AsOf_Dt,
DMM.Row_Load_Dt ,
'D'
FROM DMConsumerGroupModelScore_T1 TS,
DMConsumerGroupModelScore DMM
WHERE
TS.Model_Type_Id = DMM.Model_Type_Id AND
TS.Data_AsOf_Dt = DMM.Data_AsOf_Dt;

Explain Plan:
1) First, we lock SVUEDWDCAMPAIGN_T.DMConsumerGroupModelScore in view SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore for
access, and we lock svuedwdcampaign_s.DMConsumerGroupModelScore_T1 in view
SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore_T1 for access.
2) Next, we execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from SVUEDWDCAMPAIGN_T.DMConsumerGroupModelScore in view
SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore by way of an all-rows scan with no residual conditions into
Spool 5 (all_amps), which is redistributed by the hash code of
(SVUEDWDCAMPAIGN_T.DMConsumerGroupModelScore.Model_Type_Id,
SVUEDWDCAMPAIGN_T.DMConsumerGroupModelScore.Data_AsOf_Dt) to all AMPs. Then we do a SORT to order Spool 5 by row
hash. The size of Spool 5 is estimated with low confidence to be 78 rows (3,042 bytes). The estimated time for this
step is 0.01 seconds.
2) We do an all-AMPs RETRIEVE step from svuedwdcampaign_s.DMConsumerGroupModelScore_T1 in view
SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore_T1 by way of an all-rows scan with a condition of
("NOT(svuedwdcampaign_s.DMConsumerGroupModelScore_T1 in view
SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore_T1.Model_Type_Id IS NULL)") into Spool 6 (all_amps), which is
redistributed by the hash code of (svuedwdcampaign_s.DMConsumerGroupModelScore_T1.Model_Type_Id,
svuedwdcampaign_s.DMConsumerGroupModelScore_T1.Data_AsOf_Dt) to all AMPs. Then we do a SORT to order Spool 6 by row
hash. The size of Spool 6 is estimated with low confidence to be 78 rows (1,638 bytes). The estimated time for this
step is 0.01 seconds.
3) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of a RowHash match scan, which is joined to Spool 6
(Last Use) by way of a RowHash match scan. Spool 5 and Spool 6 are joined using a merge join, with a join condition
of ("(Model_Type_Id = Model_Type_Id) AND (Data_AsOf_Dt = Data_AsOf_Dt)"). The result goes into Spool 4
(group_amps), which is built locally on the AMPs. The size of Spool 4 is estimated with no confidence to be
689 rows (33,761 bytes). The estimated time for this step is 0.06 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 4 are sent back to the user as the result of statement 1. The total estimated time is 0.08
seconds.
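
When a plan shows "low confidence" or "no confidence" estimates like the ones above, Teradata can be asked which statistics the optimizer would have liked to have. The sketch below turns on the HELPSTATS diagnostic for the session and re-runs EXPLAIN on the same query. The recommendations are appended to the EXPLAIN text and are suggestions only, so they should be weighed against the guidelines in this section rather than collected wholesale.

```sql
-- Ask the optimizer to list recommended statistics at the end of each
-- EXPLAIN produced in this session.
DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT
    DMM.Consumer_Group_Id,
    DMM.Model_Type_Id,
    DMM.Decile_Num,
    DMM.Score_Num,
    DMM.Data_AsOf_Dt,
    DMM.Row_Load_Dt,
    'D'
FROM DMConsumerGroupModelScore_T1 TS,
     DMConsumerGroupModelScore DMM
WHERE TS.Model_Type_Id = DMM.Model_Type_Id
  AND TS.Data_AsOf_Dt = DMM.Data_AsOf_Dt;

-- Switch the diagnostic off again when done.
DIAGNOSTIC HELPSTATS NOT ON FOR SESSION;
```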

Explain Plan after stats:
9 records are present in the DMConsumerGroupModelScore_T1 table.


COLLECT STATISTICS
Index (Consumer_Group_Id, Model_Type_Id),
Column (Model_Type_Id, Data_AsOf_Dt)
ON
svuedwdcampaign_s.DMConsumerGroupModelScore_T1;
Help stats:

    Date       Time     Unique Values  Column Names
1   11/5/2024  6:37:35  9              Consumer_Group_Id,Model_Type_Id
2   11/5/2024  6:37:35  5              Model_Type_Id,Data_AsOf_Dt

SELECT
DMM.Consumer_Group_Id,
DMM.Model_Type_Id,
DMM.Decile_Num ,
DMM.Score_Num ,
DMM.Data_AsOf_Dt,
DMM.Row_Load_Dt ,
'D'
FROM DMConsumerGroupModelScore_T1 TS,
DMConsumerGroupModelScore DMM
WHERE
TS.Model_Type_Id = DMM.Model_Type_Id AND
TS.Data_AsOf_Dt = DMM.Data_AsOf_Dt;

Explain Plan:
1) First, we lock SVUEDWDCAMPAIGN_T.DMConsumerGroupModelScore in view SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore for
access, and we lock svuedwdcampaign_s.DMConsumerGroupModelScore_T1 in view
SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore_T1 for access.
2) Next, we execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from SVUEDWDCAMPAIGN_T.DMConsumerGroupModelScore in view
SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore by way of an all-rows scan with no residual conditions into
Spool 5(all_amps), which is built locally on the AMPs. Then we do a SORT to order Spool 5 by the hash code of
(SVUEDWDCAMPAIGN_T.DMConsumerGroupModelScore.Model_Type_Id,
SVUEDWDCAMPAIGN_T.DMConsumerGroupModelScore.Data_AsOf_Dt). The size of Spool 5 is estimated with low
confidence to be 78 rows (3,042 bytes). The estimated time for this step is 0.01 seconds.
2) We do an all-AMPs RETRIEVE step from svuedwdcampaign_s.DMConsumerGroupModelScore_T1 in view
SVUEDWDCAMPAIGN_E.DMConsumerGroupModelScore_T1 by way of an all-rows scan with no residual conditions
into Spool 6 (all_amps), which is duplicated on all AMPs. Then we do a SORT to order Spool 6 by the hash code of
(svuedwdcampaign_s.DMConsumerGroupModelScore_T1.Model_Type_Id,
svuedwdcampaign_s.DMConsumerGroupModelScore_T1.Data_AsOf_Dt).
The size of Spool 6 is estimated with low confidence to be 780 rows (16,380 bytes). The estimated time for this
step is 0.01 seconds.
3) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of a RowHash match scan, which is joined to Spool 6
(Last Use) by way of a RowHash match scan. Spool 5 and Spool 6 are joined using a merge join, with a join condition
of ("(Model_Type_Id = Model_Type_Id) AND (Data_AsOf_Dt = Data_AsOf_Dt)"). The result goes into Spool 4
(group_amps), which is built locally on the AMPs.
The size of Spool 4 is estimated with no confidence to be 89 rows (4,361 bytes). The estimated time for this step is
0.06 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 4 are sent back to the user as the result of
statement 1. The total estimated time is 0.05 seconds.
