
Interview Questions:

What are the different types of locks in Teradata?


Exclusive Lock:
Exclusive locks are only applied to databases or tables, never to rows.
They are the most restrictive type of lock; all other users are locked out.
Exclusive locks are used rarely, most often when structural changes are being made to the database.

Read Lock:
Read locks are used to ensure consistency during read operations.
Several users may hold concurrent read locks on the same data, during which no modification of the data
is permitted.

Write Lock:
Write locks enable users to modify data while locking out all other users except readers not concerned about data consistency (Access lock readers).
Until a Write lock is released, no new Read or Write locks are allowed.

Access Lock:
Access locks can be specified by users who are not concerned about data consistency.
The use of an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on large tables that are updated only by small single-row changes. Access locks are sometimes called stale read locks, i.e. you may get stale data that has not been updated.

Lock types are automatically applied based on the SQL command:


SELECT --> applies a Read lock
UPDATE --> applies a Write lock
CREATE TABLE --> applies an Exclusive lock
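The default lock can be overridden with a LOCKING modifier. A minimal sketch of requesting an Access lock for a dirty read (database, table, and column names here are hypothetical):

```sql
-- Override the default Read lock with an Access lock so this report
-- neither blocks nor is blocked by concurrent writers.
-- Note: the rows returned may include uncommitted ("stale") data.
LOCKING TABLE sales_db.daily_sales FOR ACCESS
SELECT store_id, SUM(sale_amt) AS total_sales
FROM sales_db.daily_sales
GROUP BY store_id;
```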

=========================================================================

What is multiple statement processing?


Multiple statement processing increases performance when loading into large tables. All statements are sent to the Parser simultaneously and executed in parallel.
In SQL Assistant, place all the queries in the query window and press F5; they will be submitted as one request and executed in parallel. Make sure each SQL statement ends with a semicolon (;).
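A sketch of a multi-statement request (the table name is hypothetical); placing the semicolon at the start of the next statement tells Teradata to bundle them into one request:

```sql
-- All three INSERTs are sent to the Parser as a single request
-- and executed in parallel as one transaction.
INSERT INTO target_table VALUES (1, 'A')
;INSERT INTO target_table VALUES (2, 'B')
;INSERT INTO target_table VALUES (3, 'C');
```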

=========================================================================

What is TENACITY and SLEEP ?


TENACITY specifies the number of hours FastLoad will keep trying to establish a connection. The default is no tenacity. The statement must be placed before LOGON.

SLEEP specifies the number of minutes FastLoad waits before retrying a logon. The default is 6 minutes. The statement must be placed before LOGON.
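A FastLoad script fragment showing where the two statements go (the tdpid, username, and password are placeholders):

```sql
SESSIONS 4;
TENACITY 2;    /* keep trying to log on for up to 2 hours */
SLEEP 10;      /* wait 10 minutes between logon attempts  */
LOGON tdpid/username,password;   /* TENACITY and SLEEP must precede LOGON */
```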

=========================================================================

What is soft Referential Integrity and Batch Referential Integrity?


Soft Referential Integrity:
It provides a mechanism to allow user-specified Referential Integrity (RI) constraints that are not
enforced by the database.
Enables optimization techniques such as Join Elimination.

Batch Referential Integrity:


Tests an entire insert, delete, or update batch operation for referential integrity. If insertion, deletion, or
update of any row in the batch violates referential integrity, then parsing engine software rolls back the
entire batch and returns an abort message.
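The two flavors are chosen with the REFERENCES clause. A sketch, with hypothetical table and column names (the two forms would not normally coexist on the same column):

```sql
-- Soft RI: the constraint is known to the optimizer (enabling
-- join elimination) but is NOT enforced by the database.
ALTER TABLE order_table
  ADD CONSTRAINT fk_cust FOREIGN KEY (cust_id)
  REFERENCES WITH NO CHECK OPTION customer_table (cust_id);

-- Batch RI: the whole insert/update/delete batch is tested;
-- any violating row rolls back the entire batch.
ALTER TABLE order_table
  ADD CONSTRAINT fk_cust_b FOREIGN KEY (cust_id)
  REFERENCES WITH CHECK OPTION customer_table (cust_id);
```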

=========================================================================
What are Teradata statistics?

Statistics collection is essential for the optimal performance of the Teradata query optimizer.
The query optimizer relies on statistics to help it determine the best way to access data.

Statistics also help the optimizer ascertain how many rows exist in tables being queried and predict how
many rows will qualify for given conditions.
Lack of statistics, or outdated statistics, might result in the optimizer choosing a less-than-optimal method for accessing data tables.

Points:
1: Once COLLECT STATISTICS is run on a table (on an index or column), where is this information stored so that the optimizer can refer to it?
Ans: Collected statistics are stored in DBC.TVFields or DBC.Indexes. However, you cannot query these two tables directly.

2: How often should statistics be collected on a table that is frequently updated?
Answer: You need to refresh stats when 5 to 10% of the table's rows have changed. Collecting stats can be quite resource-intensive for large tables, so it is advisable to schedule the job at an off-peak period, normally after approximately 10% of the data has changed.

3: Once stats have been collected on a table, how can I be sure the optimizer is considering them before execution? That is, until the next COLLECT STATISTICS is run, will the optimizer keep referring to them?

Ans: Yes, the optimizer will use the stats for the query execution plan if they are available. That is why stale stats are dangerous: they may mislead the optimizer.
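Typical statements (table and column names here are hypothetical):

```sql
-- Initial collection on a column and on an index
COLLECT STATISTICS ON employee_table COLUMN (dept_no);
COLLECT STATISTICS ON employee_table INDEX (emp_no);

-- Re-collect everything previously defined on the table
COLLECT STATISTICS ON employee_table;

-- Inspect what has been collected
HELP STATISTICS employee_table;
```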

=========================================================================
What is a HOT AMP?
When the workload is not distributed across all the AMPs, only a few AMPs end up overburdened with the
work. This is a hot AMP condition.
This typically occurs when the volume of data you are dealing with is high and

(a). You are trying to retrieve the data in a TERADATA table which is not well distributed across the AMPs
on the system (bad Primary Index)
OR
(b). When you are trying to join on a column with highly non-unique values
OR

(c). When you apply the DISTINCT operator on a column with highly non-unique values

=========================================================================
What are different SPACES available in Teradata?
There are mainly 3 types of spaces available.
Perm Space
Temp Space
Spool space

Perm Space: All databases have a defined upper limit of permanent space. Permanent space is used for storing the data rows of tables. Perm space is not pre-allocated; it represents a maximum limit.

Spool Space :All databases also have an upper limit of spool space. If there is no limit defined for a
particular database or user, limits are inherited from parents.
Theoretically, a user could use all unallocated space in the system for their query. Spool space is
temporary space used to hold intermediate query results or formatted answer sets to queries. Once the
query is complete, the spool space is released.
Example: You have a database with total disk space of 100GB. You have 10GB of user data and an
additional 10GB of overhead. What is the maximum amount of spool space available for queries?

Answer: 80GB. All of the remaining space in the system is available for spool

Temp Space: The third type of space is temporary space. Temp space is used for materialized global temporary tables (volatile tables are held in spool space instead), and their contents remain available to the user until the session is terminated. Tables created in temp space can survive a restart.
=========================================================================
How Does Teradata Store Rows?

Teradata uses hash partitioning and distribution to randomly and evenly distribute data across all AMPs.

The rows of every table are distributed among all AMPs - and ideally will be evenly distributed among all
AMPs.

Each AMP is responsible for a subset of the rows of each table.

Evenly distributed tables result in evenly distributed workloads.


=========================================================================

Which one will take care when an AMP goes down?

The Down-AMP Recovery Journal starts when an AMP goes down, logging changes so the data on the down AMP can be restored when it comes back online.

Fallback provides redundant data: if one AMP in the cluster goes down, your queries are not affected, because they use the fallback copies of the rows. The down AMP cannot be updated directly, so the fallback rows are used instead.

=========================================================================
Which one will take care when a NODE goes down?
In the event of node failure, all virtual processors can migrate to another available node in the clique.

All nodes in the clique must have access to the same disk arrays

=========================================================================
What is the use of EXPLAIN plan?

The EXPLAIN facility allows you to preview how Teradata will execute a requested query. It returns a
summary of the steps the Teradata RDBMS would perform to execute the request.
EXPLAIN also discloses the strategy and access method to be used, how many rows will be involved,
and its cost in minutes and seconds. Use EXPLAIN to evaluate query performance
and to develop an alternative processing strategy that may be more efficient. EXPLAIN works on any
SQL request. The request is fully parsed and optimized, but not run. The complete plan is returned to the
user in readable English statements.

EXPLAIN provides information about locking, sorting, row selection criteria, join strategy and conditions, access method, and parallel step processing. EXPLAIN is useful for performance tuning, debugging, pre-validation of requests, and for technical training.
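Usage is simply a matter of prefixing the request (names here are hypothetical):

```sql
-- Returns the step-by-step plan (locks, access method, row
-- estimates, cost) without actually running the query.
EXPLAIN
SELECT last_name, salary_amount
FROM employee
WHERE dept_no = 100;
```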

=========================================================================

What is the COALESCE function?

The newer ANSI standard COALESCE can also convert a NULL to a zero. However, it can convert a
NULL value to any data value as well.
The COALESCE searches a value list, ranging from one to many values, and returns the first Non-NULL
value it finds.
At the same time, it returns a NULL if all values in the list are NULL.

To use COALESCE, the SQL must pass the name of a column to the function. The data in the column is then compared against NULL. Although one column name is all that is required, normally more than one column is passed to it. Additionally, a literal value, which is never NULL, can be returned to provide a default value if all of the previous column values are NULL.

The syntax for COALESCE follows:

SELECT COALESCE(column_1 [, column_2, ...])
     , COALESCE(column_3 [, literal_default])
FROM table_name
GROUP BY 1;
=========================================================================
When will you create PPI and when will you create secondary indexes?
Partitioned Primary Indexes are created to divide the table into partitions based on ranges or values, as required. This is effective for larger tables partitioned on date or integer columns. There is no extra overhead on the system (no special subtables are created, etc.).

Secondary Indexes are created on the table for an alternate way to access data. This is the second
fastest method to retrieve data from a table next to the primary index. Sub tables are created.

PPIs and secondary indexes avoid full table scans; they access only a defined set of data on the AMPs.
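A sketch of both techniques in DDL (table and column names are hypothetical):

```sql
-- PPI: partition the fact table by month, enabling partition
-- elimination for range queries on sale_date.
CREATE TABLE sales_fact
( sale_id   INTEGER,
  store_id  INTEGER,
  sale_date DATE )
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N (sale_date BETWEEN DATE '2020-01-01'
                                AND     DATE '2020-12-31'
                                EACH INTERVAL '1' MONTH);

-- NUSI: an alternate access path on store_id (creates a subtable)
CREATE INDEX (store_id) ON sales_fact;
```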

=========================================================================
When you chose primary index and when will you choose secondary index?
Primary index will be chosen at the time of table creation. This will help us in data distribution, data
retrieval and join operations.
Secondary indexes can be created and dropped at any time. They are used as an alternate path to
access data other than the primary index.

Diff between database and user?


Both may own objects such as tables, views, macros, procedures, and functions. Both users and
databases may hold privileges.
However, only users may log on, establish a session with the Teradata Database, and submit requests.

A user performs actions where as a database is passive. Users have passwords and startup strings;
databases do not. Users can log on to the Teradata Database, establish sessions, and submit SQL
statements; databases cannot.

Creator privileges are associated only with a user because only a user can log on and submit a CREATE
statement. Implicit privileges are associated with either a database or a user because each can hold an
object and an object is owned by the named space in which it resides

=========================================================================
When a PI is not mentioned on a table, how will Teradata consider the PI for that table?

If you don't specify a PI at table create time, Teradata must choose one. For instance, if the DDL is ported from another database that uses a Primary Key instead of a Primary Index, the CREATE TABLE contains a PRIMARY KEY (PK) constraint. Teradata knows that Primary Keys must be unique and cannot be null, so the first level of default is to use the PRIMARY KEY column(s) as a UPI.
If the DDL defines no PRIMARY KEY, Teradata looks for a column defined as UNIQUE. As a second
level default, Teradata uses the first column defined with a UNIQUE constraint as a UPI.
If none of the above attributes are found, Teradata uses the first column defined in the table as a NON-UNIQUE PRIMARY INDEX (NUPI).

=========================================================================
What is covered query in Teradata?

If a SELECT query references only the columns that are defined in the JOIN INDEX, it is called a covered query, and the optimizer can satisfy it from the index alone.

Multi-column NUSI columns can also be used to cover a query.


=========================================================================
What is the use of WITH CHECK OPTION in Teradata?
In Teradata, the additional key phrase WITH CHECK OPTION indicates that the WHERE clause conditions should be applied during the execution of an UPDATE or DELETE against the view. This is not a concern if views are not used for maintenance activity due to restricted privileges.
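A sketch (database, view, and column names are hypothetical):

```sql
REPLACE VIEW hr.active_emp AS
SELECT emp_no, dept_no, salary_amount
FROM hr.employee
WHERE dept_no = 100
WITH CHECK OPTION;

-- This UPDATE through the view is rejected, because the changed
-- row would no longer satisfy the view's WHERE clause:
-- UPDATE hr.active_emp SET dept_no = 200 WHERE emp_no = 1;
```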

=========================================================================
What is identity column?
Teradata introduced identity columns on its tables around V2R6.x. These columns differ from Oracle's sequence concept in that the numbers assigned are not guaranteed to be sequential. The identity column in Teradata is simply used to guarantee row uniqueness.
Example:
CREATE MULTISET TABLE MyTable
(
  ColA INTEGER GENERATED BY DEFAULT AS IDENTITY
       (START WITH 1
        INCREMENT BY 20),
  ColB VARCHAR(20) NOT NULL
)
UNIQUE PRIMARY INDEX pidx (ColA);
Granted, ColA may not be the best primary index for data access or joins with other tables in the data
model. It just shows that you could use it as the PI on the table.

=========================================================================
How to implement UPSERT logic in Teradata using SQL?

We have the MERGE INTO option available in Teradata, which works as UPSERT logic: if the condition matches then update, else insert.

Example:

MERGE INTO dept_table1 AS Target
USING (SELECT dept_no, dept_name, budget
       FROM dept_table WHERE dept_no = 20) AS Source
ON (Target.dept_no = Source.dept_no)
WHEN MATCHED THEN
  UPDATE SET dept_name = 'Being Renamed'
WHEN NOT MATCHED THEN
  INSERT (dept_no, dept_name, budget)
  VALUES (Source.dept_no, Source.dept_name, Source.budget);

=========================================================================
What is value ordered NUSI?
When we define a value ordered NUSI on a column the rows in the secondary subtable get sorted based
on the secondary index value. The columns should be of integer or date type.
This is used for range queries and to avoid full table scans on large tables.
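A sketch of the DDL (table and column names are hypothetical):

```sql
-- Value-ordered NUSI: the subtable rows are sorted by sale_date
-- (a DATE or 4-byte-or-smaller numeric column), which helps range
-- predicates such as BETWEEN without a full table scan.
CREATE INDEX (sale_date) ORDER BY VALUES (sale_date) ON sales_fact;
```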

=========================================================================
What are the DBQL tables?

Database Query Log tables are the tables present in DBC database which store the history of all the
operations performed on the tables present in the databases.
The history could get very large so these tables should be purged when the data is no longer needed.

=========================================================================
What is the difference between sample and top?

The Sampling function (SAMPLE) permits a SELECT to randomly return rows from a Teradata database
table.

It allows the request to specify either an absolute number of rows or a percentage of rows to return.
Additionally, it provides an ability to return rows from multiple samples.

SELECT * FROM student_course_table SAMPLE 5;

The TOP clause is used to specify the number of records to return. The TOP clause can be very useful on large tables with thousands of records, where returning a large number of records can impact performance.

SELECT TOP 2 * FROM EMP;

=========================================================================
What is difference between ZEROIFNULL and NULLIFZERO ?
The ZEROIFNULL function returns zero when the incoming value is NULL.
The NULLIFZERO function returns NULL when the incoming value is zero.
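A quick sketch of both (table and column names are hypothetical):

```sql
SELECT ZEROIFNULL(bonus_amt) AS bonus_or_zero,     -- NULL -> 0
       -- 0 -> NULL, so a zero divisor yields NULL
       -- instead of a divide-by-zero error:
       salary_amount / NULLIFZERO(hours_worked) AS hourly_rate
FROM employee;
```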

=========================================================================
What is Range_N?
RANGE_N is used in a partitioned primary index to specify the ranges of column values that are assigned to each partition.
The number of partitions = the number of ranges specified + NO RANGE + UNKNOWN
NO RANGE -> if the value does not belong to any range
UNKNOWN -> for values such as NULLs
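A fragment of a CREATE TABLE showing the extra partitions (the column name and dates are hypothetical):

```sql
PARTITION BY RANGE_N (
  order_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31'
             EACH INTERVAL '1' MONTH,  -- 12 ranges
  NO RANGE,                            -- dates outside every range
  UNKNOWN )                            -- NULL partitioning values
-- total partitions = 12 + 1 (NO RANGE) + 1 (UNKNOWN) = 14
```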

=========================================================================
What is Casting in Teradata ?

CAST converts a value from one data type to another. For example:

CAST('02/03/2009-01:25:11' AS TIMESTAMP FORMAT 'MM/DD/YYYY-HH:MI:SS')
=========================================================================
Can we do collect stats on a table when the table is being updated?
No. We never collect stats on a table while the table is being updated; doing so can lead to a deadlock.

=========================================================================
When will you go for hash index?

A hash index organizes the search keys with their associated pointers into a hash file structure.
We apply a hash function on a search key to identify a bucket, and store the key and its associated
pointers in the bucket (or in overflow buckets).
Strictly speaking, hash indices are only secondary index structures, since if a file itself is organized using
hashing, there is no need for a separate hash index structure on it.

=========================================================================
Difference between Global and Volatile Temporary Table? Where they can be used and how they
can be used?

Global Temporary Tables

Materialize in the user's temporary space

Base definitions are permanent in Data Dictionary tables

Definitions can be materialized by any user with the necessary DML privileges

Can be defined for COLLECT STATISTICS

Can survive a Teradata Database restart

Up to 2000 materialized tables per session

Materialized instance is local to a session

If not dropped manually, instance is dropped automatically at session end

An instance can be dropped manually any time during a session with DROP TEMPORARY TABLE

Materialized contents are not shareable with other sessions

A very large number of global tables can prolong logoff, because materialized global tables are dropped
automatically at session end.

Volatile Tables

Occupy space in the user's SPOOL allocation

Definitions are kept in cache and go away at session end or during a Teradata Database restart

Private to the session in which they are created

Can be defined for COLLECT STATISTICS(TD14 feature)

Do not survive a Teradata Database reset

Up to 1000 tables per session

Local to a session

If not dropped manually, dropped automatically at session end

Can be dropped manually any time during a session with DROP TABLE

Contents are not sharable with other sessions

A very large number of volatile tables can prolong logoff, because volatile tables are dropped
automatically at session end.
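The two DDL forms, as a sketch (table and column names are hypothetical):

```sql
-- Global temporary: permanent definition in the Data Dictionary,
-- per-session contents materialized in temp space.
CREATE GLOBAL TEMPORARY TABLE gt_work
( id INTEGER, val VARCHAR(20) )
ON COMMIT PRESERVE ROWS;

-- Volatile: definition and contents both live only for the session,
-- held in spool space.
CREATE VOLATILE TABLE vt_work
( id INTEGER, val VARCHAR(20) )
ON COMMIT PRESERVE ROWS;
```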

What are the advantages of partitioned tables:


1)They provide efficient searches by using partition elimination at the various levels or combination of
levels.
2)They reduce the I/O for range constraint queries
3)They take advantage of dynamic partition elimination
4)They provide multiple access paths to the data, and an MLPPI provides even more partition elimination
and more partitioning expression choices,
(i.e., you can use last name or some other value that is more readily available to query on.)
5)The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities
6)Row hash locks are used for SELECT with equality conditions on the PI columns.
7)Partitioned tables allow for fast deletes of data in a partition.
8)They allow for range queries without having to use a secondary index.
9)Specific partitions may be archived or deleted.
10)May be created on volatile tables, global temp tables, base tables, and non-compressed join indexes.
11)May replace a Value Ordered NUSI for access.

=========================================================================
What are the Partitioning Rules:
1)A table can have up to 65,535 Partitions.

2)Partitioning never determines which AMP gets a row.
3)Partitioning determines how an AMP sorts its rows locally.
4)A table can have up to 15 levels of partitions.
5)A table cannot have a UPI as the Primary Index if the partitioning columns are not all part of the PI.
6)There are 3 forms of partitioning: simple, RANGE_N and CASE_N.

=========================================================================
What are the Disadvantages of partitioned tables
1)Rows in a partitioned table are 2 bytes longer.
2)Access via the Primary Index may take longer.
3)Full table joins to a NPPI table with the same PI may take longer.
=========================================================================
Difference between Set and Multiset Table?
The SET table does not allow duplicate rows, while the MULTISET table allows duplicate rows. For a SET table, to enhance performance and to enforce uniqueness, a UPI or a USI can be created on a particular column. For a MULTISET table, a NUPI or NUSI can be created on a particular column to enhance performance.

Characteristics of SET tables:


Logically correct Implementation
Automatic duplicate row elimination

Characteristics of MULTISET tables:

No duplicate row check overhead


Data integrity becomes a user responsibility
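The two forms side by side, as a sketch (table and column names are hypothetical):

```sql
-- SET table: a second, fully identical row is rejected
CREATE SET TABLE emp_set
( emp_no INTEGER, emp_name VARCHAR(30) )
PRIMARY INDEX (emp_no);

-- MULTISET table: duplicate rows are allowed, no duplicate check
CREATE MULTISET TABLE emp_ms
( emp_no INTEGER, emp_name VARCHAR(30) )
PRIMARY INDEX (emp_no);
```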

=========================================================================
What are the Advantages and Disadvantages of multi-column PI?
As many as 64 columns may be included in an index. Multi-column Primary Indexes may provide better
distribution.

Advantage

*More columns = more uniqueness


*Distinct value increases.
*Rows per value decreases.
*Selectivity increases.

Disadvantage
*More columns = less usability
*The PI can only be used for access when values for all PI columns are provided in the SQL statement.
*Partial values do not generate a useful hash result.
=========================================================================
What is Value List Compression? how would you implement compression ?
Value List Compression cannot compress an existing column, but it can be used to add a compressed column to a table.
Use the ALTER TABLE statement to compress columns and reduce the number of I/O operations.
Consider the following:
*Set the column default value to most frequent value.
*Compress to the default value.
*This is especially useful for sparsely populated columns.
*Overhead is not high.
*The I/O savings correlates to the percentage of data compressed out of a row.

Value List Compression (VLC) provides the Teradata database with the capacity to support multiple-value compression for fixed-width columns. When you specify a value or values, the system suppresses any data matching a compress value from the row. This saves disk space: a smaller physical row size results in fewer data blocks, fewer I/Os, and improved overall performance.

Because VLC allows you to specify a list of compress values for a column, the system suppresses data
when one of the specified values exists in the column. Up to 255 distinct values (plus NULL) may be
compressed per fixed-width column.

VLC improves performance as follows:


*Reduces the I/O required for scanning tables when the tables have compressible values in their
columns.

*Reduces disk space because rows are smaller.


*Permits joins to look-up tables to be eliminated.
*Improves data loading because more rows may fit into one data block after compression is applied.
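A DDL sketch of the COMPRESS clause (table, column names, and value lists are hypothetical):

```sql
CREATE TABLE customer
( cust_id   INTEGER,
  state_cd  CHAR(2) COMPRESS ('CA', 'NY', 'TX'),  -- value list (+ NULL)
  status_cd CHAR(1) COMPRESS 'A' )                -- single value
PRIMARY INDEX (cust_id);
```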

=========================================================================
What have you done to resolve the running out of spool problem?
To resolve a running-out-of-spool problem:
1) Check whether the column in the WHERE clause/joining column is an index column.
2) Check whether statistics have been collected on the index column; if not, collect stats on it.
3) Check MaxSpool and PeakSpool from DBC.DiskSpace on a per-AMP basis and check for data skew.
4) Check for an improper or missing join condition.
5) If the problem persists despite these steps, explicitly increase the spool space.

=========================================================================
What are the limitations of Fast load?

1)No Secondary Indexes are allowed on the Target Table: Usually UPI and NUPI are used in Teradata to distribute the rows evenly across the AMPs. Secondary indexes are stored in a subtable block, often on a different AMP from the data row.

2)No Referential Integrity is allowed: Referential Integrity defined on a table would require extra system checking to enforce the referential constraints.

3)No Triggers are allowed at load time: FastLoad is focused on loading data at high speed, so triggers are not allowed.

4)Duplicate Rows (in MULTISET tables) are not supported: MULTISET tables allow duplicate rows, but FastLoad discards duplicate rows even when loading into a MULTISET table.

5)No AMPs may go down (i.e., go offline) while FastLoad is processing: The down AMP must be repaired
before the load process can be restarted

6)No more than one data type conversion is allowed per column: Data type conversions cause high resource utilization on the system.

=========================================================================
What are the limitations of Multi load?
1)Unique Secondary Indexes are not supported on a Target Table:Like FastLoad, MultiLoad does not
support Unique Secondary Indexes (USIs). But unlike FastLoad, it does support the use of Non-Unique
Secondary Indexes (NUSIs) because the index subtable row is on the same AMP as the data row.

2)Referential Integrity is not supported: Referential Integrity defined on a table would require extra system checking to enforce the referential constraints.

3)Triggers are not supported at load time: Disable all the Triggers prior to using it.

4)No concatenation of input files is allowed: It could impact a restart if the files were concatenated in a different sequence or data was deleted between runs.

5)No Join Indexes: All join indexes must be dropped before running a MultiLoad and then recreated after the load is completed.

6)Will not process aggregates, arithmetic functions or exponentiation: If you need data conversions or math, you might be better off using an INMOD to prepare the data prior to loading it.
=========================================================================
Explain the phases in Multiload?

MultiLoad Has Five IMPORT Phases:


Phase 1: Preliminary Phase: The basic setup phase, used for several preliminary set-up activities needed for a successful data load.
Phase 2: DML Transaction Phase: All the SQL Data Manipulation Language (DML) statements are sent to the Teradata database, as MultiLoad supports multiple DML functions.
Phase 3: Acquisition Phase: Once the setup completes, the PE's plan is stored on each AMP. The table headers are locked, and the input data is stored in the work tables.
Phase 4: Application Phase: In this phase all DML operations are applied to the target tables.
Phase 5: Cleanup Phase: Table locks are released and all the intermediate work tables are dropped.
=========================================================================
What are the limitations of Tpump?

No concatenation of input data files is allowed.

TPump will not process aggregates, arithmetic functions or exponentiation.

The use of the SELECT function is not allowed.

No more than four IMPORT commands may be used in a single load task.

Dates before 1900 or after 1999 must be represented by the yyyy format for the year portion of the date,
not the default format of yy.

On some network attached systems, the maximum file size when using TPump is 2GB.

TPump performance will be diminished if Access Logging is used.

=========================================================================
How many types of error tables in Fast Load?

Each FastLoad requires two error tables. These are automatically created during the run, and are populated only with the errors that occur during the load.

The first error table is for any translation errors or constraint violations. For example, a column is defined as INTEGER but the data coming from the source is in CHAR format, i.e. bad data.

The second error table is for errors caused by duplicate values for Unique Primary Indexes.
=========================================================================
What is Quantiles?

Quantiles are used to divide a number of rows into a number of partitions of roughly equal size. The most
common quantile is the percentile which is based on a value of 100. It is also not uncommon to have
quartiles (based on 4), tertiles (based on 3) or deciles (based on 10). Notice that by default, both the quantile column and the quantile value itself will be output in ascending sequence.

Aggregations may not be combined with OLAP functions, thus a derived or temporary table is sometimes
necessary to produce the aggregation before the function is applied. The standard form of the quantile
function is:

QUANTILE (quantile_constant,sortlist)

SELECT employee_number, salary_amount,
       QUANTILE (100, employee_number, salary_amount)
FROM employee
QUALIFY QUANTILE (100, employee_number) < 25;

=========================================================================
What is WIDTH_BUCKET ?

The WIDTH_BUCKET function assigns partition numbers to values associated with a column or
expression, based on a specified range and based on the number of partitions desired.
Like many OLAP and aggregate functions, WIDTH_BUCKET is used to extract statistical information from
sample populations.
The format of the function is as follows:
WIDTH_BUCKET (column_expr, lower_bound, upper_bound, partition_count)
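A sketch of usage (table, column names, and bounds are hypothetical):

```sql
SELECT salary_amount,
       WIDTH_BUCKET(salary_amount, 20000, 100000, 4) AS bucket
FROM employee;
-- Buckets 1..4 split the 20000-100000 range evenly;
-- bucket 0 catches values below the lower bound, bucket 5 above it.
```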

=========================================================================
What is Random function?

The RANDOM function may be used to generate a random number between a specified range.
RANDOM (Lower limit, Upper limit) returns a random number between the lower and upper limits
inclusive. Both limits must be specified, otherwise a random number between 0 and approximately 4
billion is generated.

SEL DEPARTMENT_NUMBER,RANDOM(1,9) FROM EMPLOYEE;

limitations:
RANDOM is non-ANSI standard
RANDOM may be used in a SELECT list or a WHERE clause, but not both
RANDOM may be used in Updating, Inserting or Deleting rows
RANDOM may not be used with aggregate or OLAP functions
RANDOM cannot be referenced by numeric position in a GROUP BY or ORDER BY clause
=========================================================================

Posted 5th December 2013 by M Khan



http://teradatatuning.blogspot.in/2013/11/curing-slow-inserts-and-create-tables-i.html

Eliminating hash collisions


Case description
We have an INSERT or CREATE TABLE operation that runs an unreasonably long time compared to the affected number of rows, even though the table is not skewed.

What is hash collision?


Hash collision is when two or more records in a table have the same hash value.
SET tables ensure that no two records with exactly the same content exist within a table. How does Teradata do it?
Teradata stores the records in a hash filesystem, where each record has a hash value calculated from the Primary Index (PI) value. If the PI values of two records are the same, their hash values are the same as well.
When INSERTing a record, Teradata only has to compare the new record against the table's records that have the same hash value as the new record, since all records with a different hash value must differ at least in the PI columns.
If we have to INSERT N records with the same hash value into an empty table, Teradata has to do N*(N-1)/2 very CPU-demanding full-record comparisons.

How to identify
Hash collisions can be found easily using the PRISE Tuning Assistant tool, or by following this method:

DBQL filtering for qualifying queries:


The Merge (MRG) phase of the INSERT/CREATE TABLE operation consumes a lot of CPU. Look for high-CPU-consuming 'MRG' steps in dbc.DBQLStepTbl:

SELECT a.CPUTime,
       a.MaxAmpCPUTime * (HASHAMP() + 1) AS CoveringCPUTime,
       a.StepName,
       a.RowCount,
       b.*
FROM dbc.DBQLStepTbl a
JOIN dbc.DBQLogTbl b
  ON a.ProcId = b.ProcId AND a.QueryId = b.QueryId
WHERE a.StepName IN ('MRG' /*, 'MRU' for UPDATEs also */)
  AND a.CPUTime > 100 /* Performance boost: eliminates most of the records
                         (small CPU seconds) at low processing cost.
                         Adapt the number to your site. */
QUALIFY SUM(1) OVER (ORDER BY a.CPUTime DESC ROWS UNBOUNDED PRECEDING) <= 100;
For a specific SQL statement (INSERT or CREATE TABLE), check the PI of the target table for the level of hash collisions (the number of records whose hash values are the same).
How to make sure that hash collision is the reason? Let the target table be TableA, with primary index ColA, ColB, ColC (any number of columns in practice):

SELECT TOP 100 HASHROW(ColA, ColB, ColC), COUNT(*)
FROM TableA
GROUP BY 1
ORDER BY 2 DESC;

The top rows will show the most frequent hash values. Count values >> 1 mean significant hash collisions: each high-frequency hash value forms a hash-collision group causing comparisons in the order of N*N.
If the table does not exist yet, embed the producing SELECT statement into the script above, and count the field values that would become the PI columns.

Explanation
If we use a "SET" table (the default setting), Teradata ensures that there are no perfectly alike records in the table. This is ensured by comparing the inserted/updated record with the existing ones.
Teradata's "hash filesystem" offers a very effective trick: only those records whose RowID (hash) is equal must be compared; all others must differ at least in the PI fields.
If we have chosen a UNIQUE Primary Index, or a non-unique one on field(s) that are almost unique, then the "SET comparison" is restricted to zero or one records in most cases.

Solution
For a good solution, unfortunately, we have to modify the table structure.


Option 1: Change the table type to MULTISET. This eliminates the duplicate checks, but its disadvantage is that duplicates are no longer filtered out. If the process relied on the de-duplication done by the SET table, you have to replace it with programmed de-duplication (GROUP BY, LEFT JOIN, ...).

Option 2: Change the table's PI to a unique or nearly unique column set. Be prudent, and consider the workload as well (joins, WHERE conditions, GROUP BY expressions, etc.).

Summary
Tables with a strongly non-unique PI are at high risk of hash-collision-slowed INSERTs/CREATE TABLEs (or UPDATEs), even if they are not skewed. Use a "more unique" PI, or a MULTISET table.
A typical mistake: a CREATE TABLE ... AS SELECT ... that lacks the PRIMARY INDEX() clause. In this case Teradata chooses the first column as the PI, which often causes terrible performance.

What's next
Next post will discuss Multi Value Compress (MVC) optimization.
