
IBM® DB2® technical white paper

Communication between DB2 data partitions

Understanding table queues

Jamie Nisbet
Software Engineer, DB2 Continuing Engineering Development
IBM Canada Lab

Michael Kwok
Senior Manager, DB2 Warehouse Performance
IBM Canada Lab

David Sky
Senior Technical Writer
IBM Canada Lab

1 Introduction
You can use the DB2 partitioned database environment, also known as the
Database Partitioning Feature (DPF), to divide your data into multiple
database partitions. These partitions can exist on the same physical
server. They can also exist on different servers, each with its own set of
resources such as CPUs, memory, and storage subsystems. When a query is
processed, the work is divided so that each database partition works on the
query in parallel. Furthermore, as the database grows, you can maintain
consistent query performance by deploying additional database partitions with
additional resources. The ability to parallelize query processing and scale out
makes DPF an attractive solution for large data warehouse environments.

In partitioned database environments, query processing requires data to be
exchanged between database partitions. Table queues (often referred to as
TQs) are the vehicle that the DB2 product uses for passing rows between
partitions. The TQs interact with the underlying communication layer, the
fast communication manager (FCM), to transmit the bits and bytes. TQs are a
determining factor of query performance in DPF. This article provides an
in-depth look at table queues, includes tips on how to identify whether a
performance problem is due to TQ behavior, and, more importantly, explains
what you can do to resolve such a problem.

Additional information sources are listed at the end of this paper, including a
link to the IBM DB2 for Linux, UNIX, and Windows Information Centers
as well as useful Database-related best practice papers.

2 Table queues (TQs): An introduction

2.1 Table partitioning in partitioned database environments
In a DPF environment, each table row is distributed to a database partition
according to a distribution key that you specify in the CREATE TABLE
statement. A distribution key is essentially a column (or group of columns) in
the table that is used to determine the partition in which a particular row of
data is stored. Figure 1 depicts a 4-partition environment with two tables,
customer and store_sales. These tables are created by the following
statements:

CREATE TABLE customer (cust_id CHAR(10),
                       gender CHAR(1),
                       address VARCHAR(100))
  DISTRIBUTE BY HASH (cust_id);


CREATE TABLE store_sales (cust_id CHAR(10),
                          qty INT,
                          address VARCHAR(100))
  DISTRIBUTE BY HASH (cust_id);

Both tables use their cust_id columns as the distribution key. The value in
cust_id is hashed to generate a partition number, and then the corresponding
row is stored in the relevant partition.
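As a quick check of how well a distribution key spreads the data, you can count the rows on each partition. The following is a minimal sketch against the customer table from the example above; DBPARTITIONNUM returns the partition number on which a row is stored:

-- Count the customer rows stored on each database partition
SELECT DBPARTITIONNUM(cust_id) AS partition_num,
       COUNT(*)                AS row_count
FROM customer
GROUP BY DBPARTITIONNUM(cust_id);

A roughly even row count across the partitions indicates a good distribution key; section 5.2 shows the kind of problem that a badly skewed distribution can cause.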
[The figure shows the customer table (cust_id, gender, address) and the store_sales table (cust_id, qty, address), the hash table that maps each cust_id value to a partition number, and the resulting placement of rows in partitions part0 through part3 on Server 1 and Server 2.]
Figure 1. Example of a partitioned environment

When a query is compiled, the DB2 optimizer forms an access plan that
facilitates parallel processing among the database partitions. The individual
results from each partition are consolidated and returned to the application.
Table queues are used to send the results back to the coordinator partition
where the query was submitted.

Ideally, all of the operations can be completed within each individual database
partition, and results are sent directly to the coordinator partition for final
processing. However, this is not always the case. When two tables are joined,
data might have to pass from one partition to one or more other partitions via
TQs.

2.2 Co-located and non-co-located joins


In general, there are two kinds of joins in a partitioned environment: co-located joins and non-co-located joins.

Consider again the example in Figure 1. Both the customer and store_sales
tables are partitioned based on the cust_id column.

In the following query, the join is done locally at each database partition:

SELECT *
FROM customer c, store_sales s
WHERE c.cust_id=s.cust_id
This is possible because the join predicate is on the cust_id column and any
rows with matching values of the cust_id column in both tables are always in
the same partition. This kind of join is referred to as a co-located join.

If the query is changed to do a join on the address column instead, a complete
join cannot be done within each database partition. A sample query follows:

SELECT *
FROM customer c, store_sales s
WHERE c.address=s.address

Because neither table is partitioned based on the address column, rows with
the same value of the address column can exist in more than one partition. For
correct results to be obtained, table rows must be passed between partitions.
For example, database partitions can broadcast the content of the customer
table to all the other partitions. This kind of join is called a non-co-located join.
For non-co-located joins, DB2 chooses the type of TQ that yields the best
performance.

2.3 Types of TQs

Some of the most common types of TQs that are used in a partitioned
environment are as follows:

• Directed table queue (DTQ). A TQ in which each row is sent to one of
  many possible partitions, based on the hash value of a key column
  (different from the distribution key).

• Broadcast table queue (BTQ). A TQ in which rows are sent to all the
  partitions. No hashing is done to determine the receiving partitions.

• Merging table queue. A TQ that preserves the sort order, and is
  sometimes called a sorted TQ. This type of queue can be a directed
  merging table queue (MDTQ) or a broadcast merging table queue
  (MBTQ). A TQ whose name is not prefixed with the letter M is not a
  merging TQ.

Figure 2 gives an example of data flow in a BTQ and a DTQ from the partition
labeled part0.


[The figure shows two panels: on the left, a BTQ in which rows from part0 are broadcast to all partitions; on the right, a DTQ in which values 1 and 3 are hashed to part1 while value 2 is hashed to part3.]
Figure 2. Data flow of a BTQ and a DTQ from part0

There are also some specialized TQ types, as follows:

• Listener table queue. A TQ that is used with correlated subqueries.
  Collection values are passed down to the subquery on this TQ, and the
  results are passed back up to the parent query block as well. A listener
  table queue is denoted by TQ* in the access plan from the db2exfmt
  command.

• XML table queue (XTQ). A TQ that constructs an XML sequence from
  XML documents that are stored in database partitions.

• Asynchrony table queue (ATQ). A TQ that is coupled with the SHIP
  operator in a federation environment to parallelize the communication
  from the federated server.

• Scatter table queue (STQ). A TQ that, like a DTQ, is used to send
  rows to the receiving partition. To maximize the speed, however,
  hashing is not used in the case of an STQ.

• Local table queue (LTQ). A TQ that is used only when SMP or
  intra-parallelism is enabled. This queue is responsible for passing data
  between SMP subagents within a database partition. Be careful not to
  mistake the term LTQ for “listener table queue”. There is also a merging
  version of the LTQ (LMTQ).

In DB2 10.1, a new type of TQ called an EPE TQ was introduced for use with
the early probe elimination (EPE) enhancements of hash joins. With EPE, a
bloom filter is created on the database partition where the join is processed.
This filter is sent to the remote partitions of the probe side of the hash join,
and then the EPE TQ is used to filter rows before anything is sent to the join
partition.

2.4 Join methods and their use of TQs

This section of the paper describes how a non-co-located join is executed using
different types of TQs. The following simplified version of the example in Figure
1 is used:

[The figure shows three partitions: part0 on Server 1 is the coordinator partition, and part1 and part2 on Server 2 hold the rows of the customer and store_sales tables.]
Figure 3. Simplified version of Figure 1

In the simplified example, there are three partitions. The part0 partition, called
the coordinator partition, accepts client connections and is responsible for
returning results to the clients that submitted the queries. The other two
partitions, part1 and part2, are data partitions where the customer and
store_sales tables are partitioned.

The following query is used in this example:


SELECT *
FROM customer c, store_sales s
WHERE c.cust_id=s.cust_id

2.4.1 Broadcast joins

In a broadcast join, the rows of one of the tables are sent from each database
partition to all the other partitions.

The store_sales table is partitioned on the cust_id column, but the customer
table is partitioned on a different column, which means that a co-located join is
impossible. One approach is to broadcast the customer table to all database
partitions that have the store_sales table, through the BTQ. This approach is
depicted in Figure 4. With the broadcast join, the customer table is “duplicated”
on each partition.


[The figure pairs the access plan for the broadcast join (RETURN over DTQ q1 and HSJOIN, whose inputs are BTQ q2 over a table scan of the CUSTOMER table and a table scan of the STORE_SALES table) with the data flow: on part1 and part2, the customer table is scanned, predicates are applied, and the rows are broadcast into q2; each of those partitions reads q2, scans store_sales, performs the join, and inserts matching rows into q1; the coordinator on part0 reads q1, processes the rows, and returns the results.]
Figure 4. Broadcast join

2.4.2 Directed joins

In a directed join, each row of one of the tables is sent to only one database
partition. The column that is involved in the join predicate is considered to be a
temporary distribution key and is hashed to generate a hash value
corresponding to a database partition.

Consider the example in Figure 5. This time, the customer table is partitioned
on the cust_id column; however, the store_sales table is partitioned on a
different column. To join the customer and store_sales tables on the cust_id
column, the store_sales table can be hashed on cust_id. The rows are sent
directly to the correct database partition through the DTQ.

A different approach is to use a broadcast join, which involves using the BTQ
and duplicating the store_sales table on all the database partitions. This might
be cheaper only if the store_sales table is relatively small. Otherwise, a
directed join is a more efficient approach.
[The figure pairs the access plan for the directed join (RETURN over DTQ q1 and HSJOIN, whose inputs are DTQ q2 over a table scan of the STORE_SALES table and a table scan of the CUSTOMER table) with the data flow: on part1 and part2, store_sales is scanned, predicates are applied, cust_id is hashed, and the rows are inserted into q2 directed at the matching partition; each partition scans customer, reads q2, performs the join, and inserts matching rows into q1; the coordinator on part0 reads q1, processes the rows, and returns the results.]

Figure 5. Directed join

2.4.3 Repartitioned join

In the final example, shown in Figure 6, neither the customer nor store_sales
table is partitioned on cust_id. In a repartitioned join, both tables in the join
are hashed and the rows are sent to the new database partition using a DTQ.


[The figure pairs the access plan for the repartitioned join (RETURN over DTQ q1 and HSJOIN, whose inputs are DTQ q2 over a table scan of the CUSTOMER table and DTQ q3 over a table scan of the STORE_SALES table) with the data flow: on part1 and part2, both tables are scanned, predicates are applied, cust_id is hashed, and the rows are inserted into q2 and q3 directed at the matching partition; each partition reads q2 and q3, performs the join, and inserts matching rows into q1; the coordinator on part0 reads q1, processes the rows, and returns the results.]

Figure 6. Repartitioned join

3 Table queues in depth


3.1 Subsections

Before learning more about how TQs work, you must understand the concept of
a subsection.

Before a query can run with parallelism, the query must be logically divided
into smaller pieces so that each piece can be run in parallel by a worker thread
(called an agent). These smaller pieces of the overall query are called
subsections. To illustrate this, Figure 7 shows the access plan taken from a
sample query. An access plan depicts how DB2 processes a query from the
bottom to the top. In Figure 7, the processing of the sample query is divided
into three subsections, represented by three colors:

• Subsection 0 (in yellow): Returns rows to the client

• Subsection 1 (in blue): Scans and builds the right leg (the build side) of
the hash join and performs the join

• Subsection 2 (in green): Scans the table on the left leg (the probe side)
of the hash join

The TQ serves as the boundary between subsections (where one subsection
ends and another one begins). Rows are sent via a TQ from the “sending side”
of a subsection (referred to as the “sender”) to the “receiving side” in another
subsection (referred to as the “receiver”).

In this figure, arrows represent the flow of rows between subsections on the
different partitions. The sender is always on the bottom and is trying to send
rows to a subsection that is higher in the diagram. The subsection with the
higher subsection number is usually trying to send to a subsection with a lower
subsection number.

At the beginning of query processing, subsections 1 and 2 are sent to database
partitions P1 and P2, where the tables with data are located. On each partition,
these subsections are executed in parallel. Because subsection 1 requires data
from the left leg of the hash join to perform the join, it waits for rows coming
from subsection 2 via the BTQ. Subsection 0 is responsible for receiving data
from each database partition and returning the final result set to the client that
submitted the query at the coordinator partition (P0). Subsection 0 therefore
runs on P0 only. Subsection 0 waits for rows from subsection 1 from P1 and P2.

Rows
RETURN
( 1)
Cost
I/O
|
35281.2
DTQ
( 2)
26705.7
44692
|
4410.15
HSJOIN
( 3)
26688.3
44692
/---+---\
435574 143515
BTQ TBSCAN
( 4) ( 6)
13282.3 13122
22346 22346
| |
54446.7 1.87687e+06
TBSCAN TABLE: TPCD
( 5) STORE_SALES
13031.6
22346
|
1.87687e+06
TABLE: TPCD
CUSTOMER

Figure 7. Subsections and TQ

3.2 TQ buffer flow control

The subsection that writes records into the TQ (the sender) operates
independently of the subsection that consumes those records from the TQ (the
receiver).

Specifically, the sender does not have to wait for the receiver to be ready
before the sender can send buffers to the receiver. If the sender sends data
faster than the receiver can consume it, there is no danger of a continuous pile
up of buffers because of the TQ flow control mechanism.

The flow control mechanism prevents the receiver from being flooded with
buffers that it is not ready to receive. This flow control concept is an important
one, because in some way, it is almost always at the heart of any TQ
performance analysis.

3.2.1 Sender “TQ waits”

A sender cannot continuously send buffers without getting any response from
the receiver. There is a certain allowance here: the sender does not have to
wait for an acknowledgement from the receiver after every buffer that is sent.
However, at some point, the sender must be blocked if it is not getting any
responses from the receiver.

This type of sending-side blocking is a TQ wait. When this happens, the
sending-side subsection waits until it is given permission to send again. For the
sender to become unblocked from its TQ wait, the receiver must read some of
its buffers to relieve the pressure.

To illustrate what a sender TQ wait looks like, in Figure 8, a small horizontal
line is used instead of an arrowhead. Figure 8 depicts a DTQ where subsection
2 on partition 1 needs to send a buffer to subsection 1 on partition 2, but
subsection 2 is blocked in a sender TQ wait.

Figure 8. TQ sender waits

3.2.2 Receiver “TQ waits”

Although not directly related to the flow control mechanism, the receiver can
also experience TQ waits. The reason why a TQ receiver might wait is more
obvious: it waits because a subsection has not sent it anything yet. For
example, a receiver might experience a TQ wait if the sender has not reached a
point where it can send any rows. Alternatively, perhaps the sender is blocked
in a sender TQ wait against a different receiver partition.

Figure 9 depicts a non-merging TQ, where subsection 1 on partition 1 needs to
receive data from either subsection 2 on partition 1 or subsection 2 on partition
2. Because neither of the subsection 2 senders has sent it any data, the
receiver enters into a receiving-side TQ wait. This wait is represented by an
arrow going down from the top subsection to the bottom subsection, but with a
horizontal line at the end instead of a traditional pointed arrowhead.

Figure 9. TQ receiver waits

Figure 9 also shows an example of a “wait for any” style of receiver TQ wait. In
this kind of wait, the partition from which it next receives data is not important
to the receiver. The two lines in the previous figure show that the receiver is
waiting for any of the partitions to send it data. By contrast, if a merging TQ
were used, the diagram would show only a single line. With a merging TQ, the
receiver must maintain the sorted order, so it picks only one out of n possible
senders to receive from.

3.2.3 Example of TQ buffer flow control

To illustrate how flow control works and why it is needed, consider the
following simple query and access plan on an instance with only two partitions
(the tables span both partitions):

CREATE TABLE tab1 (distkey INT,
                   joinkey INT)
  DISTRIBUTE BY HASH (distkey);

CREATE TABLE tab2 (distkey INT,
                   joinkey INT)
  DISTRIBUTE BY HASH (distkey);

SELECT *
FROM tab1 t1, tab2 t2
WHERE t1.joinkey = t2.joinkey;

Figure 10 illustrates the access plan, with the subsections color coded to match
the TQ diagram beside it.

Rows
RETURN
( 1)
Cost
I/O
|
1616
DTQ
( 2)
317.176
10
|
808
HSJOIN
( 3)
215.653
10
/---+---\
976 488
BTQ TBSCAN
( 4) ( 6)
126.314 75.586
5 5
| |
488 488
TBSCAN TABLE: DB2INST
( 5) TAB1
75.586
5
|
488
TABLE: DB2INST
TAB2

Figure 10. An example of TQ buffer flow control

In this example, assume that the client application waits for a user's input
before fetching the next row.

The query flow is as follows:

1. Subsection 1 performs HSJOIN (#3) starting with the right side which is
TBSCAN (#6)1. In parallel, subsection 2 performs TBSCAN (#5).
2. As each row is read from the tab2 table in TBSCAN (#5), it is packed
into a buffer in the BTQ (#4). After the buffer is full, it is broadcast to
both partition 1 and partition 2.
3. To perform HSJOIN (#3), subsection 1 must first receive some data
from the TQ. In the right side of the diagram, the BTQ (#4) is
represented by the arrows between the green dots (subsection 2
agents) and the blue dots (subsection 1 agents). Subsection 1 receives
a buffer from the TQ and unpacks the rows, sending them to HSJOIN
(#3).
4. HSJOIN (#3) matches rows and packs the qualified rows into DTQ (#2),
which in turn sends rows to the coordinator subsection (subsection 0).
5. The coordinator subsection receives the buffer from DTQ (#2) and gives
this result to the application.
6. The subsections continue to work as described in the previous steps,
with the data flowing via the TQs as shown by the arrows in the
diagram.

Now, suppose that the client application doesn't fetch more rows, but there is
still a lot of data left for the query to process. What happens?

In the DTQ (#2), the flow control mechanism is engaged. Even though
subsection 1 has produced more data from the hash join and needs to send it,
it cannot. Subsection 1 enters into a sending-side TQ wait to prevent flooding
the coordinator with buffers while the coordinator is busy doing something else
(waiting for the user). This situation also has an impact lower down in the
access plan. If subsection 1 is currently in a TQ wait while trying to send data
to the coordinator subsection, subsection 1 is not receiving from the BTQ (#4)
while it waits. Therefore, the sending side of the BTQ (#4) in subsection 2 is
also blocked. The situation is shown in the following diagram.

1
Access plan operators are shown in upper case followed by the operator number in
the exfmt output, e.g., TBSCAN (#6).


Figure 11. An example of TQ buffer flow control

In this example, the flow control mechanism being engaged was a result of the
user slowing things down at the coordinator, which is not always as obvious as
waiting for user input. The nature and layout of the data, complexity of the
access plan, network speed, and other factors can all influence the TQ flow
control and can introduce TQ waits into the query execution.

Some TQ waits are expected and normal given all these factors. However,
some TQ waits can be a performance concern and might be opportunities for
performance improvements. Sections 4 and 5 of this paper describe some
strategies for monitoring and troubleshooting TQ waits and other TQ
performance concerns.

3.3 TQ buffer overflow (TQ spilling)


Consider the following diagram of a many-to-many MDTQ.

Figure 12. Many-to-many MDTQ



In Figure 12, the arrows with horizontal ends represent TQ waits. The flow of
data is from subsection 2 at the bottom (in green) to subsection 1 above it (in
blue). The arrow going up from the bottom subsection has a horizontal end,
meaning this subsection needs to send data on that connection but it is blocked
by the flow control mechanism. Conversely, the arrow from the top that points
down, with a horizontal end, means that the receiver needs to receive data but
there is nothing to receive, so the receiver is in a receiver TQ wait.

In this example, the situation for subsections 1 (SS1) and 2 (SS2) on each
partition is as follows:

• SS2 on partition 1 needs to send to SS1 on partition 1. A sender TQ wait
  is in effect.

• SS1 on partition 1 needs to receive from SS2 on partition 2. A receiver
  TQ wait is in effect.

• SS2 on partition 2 needs to send to SS1 on partition 2. A sender TQ wait
  is in effect.

• SS1 on partition 2 needs to receive from SS2 on partition 1. A receiver
  TQ wait is in effect.

• SS2 on partition 3 needs to send to SS1 on partition 3. A sender TQ wait
  is in effect.

• SS1 on partition 3 needs to receive from SS2 on partition 3. A receiver
  TQ wait is in effect.

All subsections are therefore waiting for each other. A TQ deadlock occurs.

This type of situation is most common with an MDTQ. The merging flavor of the
receiver is such that it is very selective about which connection it needs to
receive from next. The receiver is trying to merge the sorted streams of data,
and so it must choose the correct connection to receive from to maintain the
sorted order. Also, the directed nature of the TQ means that it is hashing the
column values to choose which partition to send to. The MDTQ is the least
flexible type of TQ, because both the sender and the receiver are selective
about whom they need to communicate with.

The DB2 product resolves the deadlock situation through TQ spilling. With
spilling, if the current buffer that the sender is trying to send is blocked in a
TQ wait, the sender spills this buffer into a temporary table on the
sending-side partition instead of sending it. This approach allows the sender to
act as if that buffer was sent, so the sender can produce more records. Because
the sender is no longer blocked, the TQ begins flowing again, thereby breaking
the deadlock.

Rather than viewing TQ spilling as a method to break a TQ deadlock, think of it
as a TQ deadlock avoidance strategy. The TQ flow control mechanism assesses
the throughput of the TQ and its timing. If the flow control mechanism detects
a possible traffic jam, it initiates the spill to keep the TQ flowing before it
reaches a TQ deadlock state.

3.3.1 Handling of spilled buffers

If there are spilled buffers in a TQ, the sender tries to send them. The attempt
to send them might not be successful. For example, when the sender needs to
send the spilled buffer, it might find that the receiver still has not read from the
connection. In this case, the sender tries to resend that spilled buffer later.

Often, as new buffers are produced for the same connection that is already
spilling, the new buffers are spilled behind the existing spilled buffers, creating
a backlog of spilled buffers in the temporary table. Eventually, these buffers
are sent. If the sender is producing and spilling buffers faster than it can
resend the existing spilled buffers, the temporary table grows in size as the
spilled buffers are queued.

If the sender has completed producing rows, it must send any spilled buffers
before it can close the subsection.

3.3.2 Broadcast spilling versus directed spilling

MDTQs are the most prone to TQ spills. However, DTQs, MBTQs, and BTQs can
also spill. The TQ flow mechanism is essentially the same for all types of
spillable TQs, whereby if the mechanism identifies that the TQ is stalled, it
corrects the situation with spilling.

In a DTQ, each TQ connection has its own spill table. If a sender is spilling on
the connection from partition 1 to partition 2, it does not necessarily mean that
the connection from partition 1 to partition 3 is also spilling. Each connection is
treated independently, as is its spill table.

In a BTQ, because each connection is getting the same data, there is only a
single spill table. However, each connection might be at a different point in
sending the spilled buffers. Therefore, each connection maintains a cursor
position within the temporary table to indicate what buffer it needs to send
next.

3.3.3 Performance impact of spilling

Because TQ spilling creates inserts in the temporary table, it causes activity in
the temporary table's buffer pool and might drive up the I/O numbers as well.
TQ spilling might also require disk space to contain the temporary tables,
especially for a spilling TQ that is trying to push through millions of rows.

Having a large enough buffer pool for the temporary table space is one way to
lessen the impact of the I/O for the temporary table. An alternative strategy
might be to identify why the spilling is happening and look for an opportunity
to reduce the spilling or eliminate it.
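For example, the following is a minimal sketch of checking and enlarging the buffer pool that backs the system temporary table space; the buffer pool name and the new size are placeholders that depend on your configuration:

-- Find the buffer pool that backs each system temporary table space
SELECT t.tbspace, b.bpname, b.npages, b.pagesize
FROM syscat.tablespaces t
     JOIN syscat.bufferpools b ON t.bufferpoolid = b.bufferpoolid
WHERE t.datatype = 'T';

-- Enlarge that buffer pool (SIZE is in pages; choose a value that fits your memory)
ALTER BUFFERPOOL <bpname> IMMEDIATE SIZE 50000;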

In the following sections, some sample spilling scenarios and possible solutions
to some of the common spill-related performance issues are described.

4 Monitoring TQs
The methods that are presented here for monitoring the flow of data on the TQ
are not only useful for investigating the TQ itself. You can also use them to
understand the overall execution of a query. This knowledge can help identify
query performance issues that are unrelated to TQ behavior. The cause might
be, for example, an I/O issue, a CPU contention problem, or a buffer pool
issue. By observing the flow of data in the TQ, you can see which parts of the
plan execution are running fast and which parts are running slow. This is a
useful skill to have, because in a partitioned database warehouse, there can be
some rather enormous and complex access plans. It can be difficult to even
read these plans, let alone identify which parts of them are the bottlenecks.

Often, an observed TQ performance issue is a symptom of another
performance issue, so you must identify the root cause of the performance
degradation.

Some TQ issues might manifest themselves in a similar fashion for more than
one query or perhaps exhibit database-wide symptoms. But if you want to
properly zoom in on the TQ problem, you must start by analyzing a single
query's execution. Later, with some experience in TQ performance issues, you
might take a broader look to try to find other problem queries.

There are three steps you can follow to investigate a TQ:

1. Capture monitoring statistics for the query at regular intervals
   throughout the execution.

2. Get the access plan, as formatted by the db2exfmt or db2expln
   command.

3. Reformat the snapshot output.

4.1 Capturing monitoring statistics for the query


Using a snapshot, you can find the key attributes that are useful for
investigating the TQ performance of a query2. To get a snapshot, you can use
the GET SNAPSHOT command.

2 DB2 monitoring continues to be enhanced. Many new SQL monitoring functions and metrics were
introduced over the past couple of releases. As of DB2 10.1, though, the application snapshot remains the
most effective mechanism for accessing subsection information.


To get snapshots:

1. Turn on the relevant monitor switches in one of two ways:
   • Turn them on at the instance level, through the database manager
     dft_mon_* configuration parameters, by using UPDATE DBM CFG.
   • Turn them on by using the UPDATE MONITOR SWITCHES command.
   To capture subsection and TQ information, the STATEMENT monitor
   switch must be on. Turn on at least the following other switches
   (example commands follow the script below):
   o BUFFERPOOL
   o UOW
   o TIMESTAMP

2. Identify a snapshot interval that makes sense for the query. For
   example, if it is a 2-hour query, taking a snapshot every 10 seconds is
   too often. Instead, a snapshot every 5 minutes would be a better
   choice. Conversely, if it is a faster query that runs for only 3 minutes,
   for example, then a snapshot taken every 20 seconds would be sufficient.
3. Set up the snapshot collection so that each snapshot is captured into its
   own file, as demonstrated in the shell script below.

#!/usr/bin/ksh
#
# getsnaps.ksh
#
# usage: getsnaps.ksh <dbname> <snapshot interval time> <num intervals>
#

DBNAME=$1
INTERVAL=$2
NUMSNAPS=$3
COUNT=0

while [[ $COUNT -lt $NUMSNAPS ]]
do
  TIMESTAMP=`date +%m.%d_%H.%M.%S`
  COUNT=`expr $COUNT + 1`
  db2 get snapshot for applications on $DBNAME global > appsnap.$TIMESTAMP
  sleep $INTERVAL
done
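As a complement to step 1, the following is one way to turn on the required switches, either for the current session or as instance-wide defaults; this is a sketch using the standard CLP commands:

db2 update monitor switches using statement on bufferpool on uow on timestamp on

db2 update dbm cfg using dft_mon_stmt on dft_mon_bufpool on dft_mon_uow on dft_mon_timestamp on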

The global parameter in the snapshot command is important because it
aggregates the subsection information from all the partitions in the instance.

For example, to use this script to collect four application snapshots that are
taken every 5 seconds against the database named tester, you could issue the
following command:

/home/db2inst1/$ getsnaps.ksh tester 5 4



The output of this run would produce files such as the following ones:

/home/db2inst1/$ ls -ltr
-rw-r--r-- 1 db2inst1 db2grp 302916 2012-04-25 09:13 appsnap.04.25_09.13.08
-rw-r--r-- 1 db2inst1 db2grp 302942 2012-04-25 09:13 appsnap.04.25_09.13.14
-rw-r--r-- 1 db2inst1 db2grp 317223 2012-04-25 09:13 appsnap.04.25_09.13.19
-rw-r--r-- 1 db2inst1 db2grp 317285 2012-04-25 09:13 appsnap.04.25_09.13.25

Because you can easily end this script by pressing Ctrl-C, you can choose a
very large number for the number of iterations.

If you know which query you are going to monitor ahead of time, start the
script to get a few snapshots before the query begins, start the query, and
then stop the snapshot collection after the query is done. You then have a set
of snapshots from before the query starts, during its execution, and after its
completion.

4.1.1 Contents of a snapshot

There are many different performance and monitoring values in the output. For
a TQ investigation, though, you are interested in only the subsection
information.

Following is an example of the subsection information that is of interest:

Application Snapshot

Application handle = 262206




Dynamic SQL statement text:
select * from tab1 t1, tab2 t2 where t1.joinkey = t2.joinkey

Subsection number = 0
Subsection database member number = 4
Subsection status = Executing
Execution elapsed time (seconds) = 2
Total user CPU time (sec.microsec) = 0.193975
Total system CPU time (sec.microsec) = 0.000000
Current number of tablequeue buffers overflowed = 0
Total number of tablequeue buffers overflowed = 0
Maximum number of tablequeue buffers overflowed = 0
Rows received on tablequeues = 28795
Rows sent on tablequeues = 0
Rows read = 0
Rows written = 0
Number of agents working on subsection = 1

Agent process/thread ID = 143

Subsection number = 1
Subsection database member number = 0

Subsection status = Waiting to send on tablequeue
Node for which waiting on tablequeue = 4
Tablequeue ID on which agent is waiting = 1
Execution elapsed time (seconds) = 2
Total user CPU time (sec.microsec) = 0.984691
Total system CPU time (sec.microsec) = 0.000000
Current number of tablequeue buffers overflowed = 0
Total number of tablequeue buffers overflowed = 0
Maximum number of tablequeue buffers overflowed = 0
Rows received on tablequeues = 368
Rows sent on tablequeues = 9126
Rows read = 250088
Rows written = 0
Number of agents working on subsection = 1

Agent process/thread ID = 304



In the output, you can identify each subsection and what it is doing. You can
use the subsection number and the TQ ID to find out what part of the access
plan each subsection is working on. Here are some examples of the information
provided in the output:

Subsection number
   The subsection number.

Subsection database member number
   The partition number for the subsection. The term member is used so
   that this information is compatible with the DB2 pureScale Feature
   naming conventions.

Subsection status
   The status of the subsection. The possible values are as follows:
   o Waiting to send on tablequeue
   o Waiting to receive on tablequeue
   o Executing
   o Completed

Node for which waiting on tablequeue
   If the status is either “Waiting to send on tablequeue” or “Waiting to
   receive on tablequeue”, this field specifies which partition it is waiting
   for. The value of ANY indicates that it does not matter which partition it
   is waiting for, because it will communicate with any of them. For
   example, the receiver of a non-merging TQ can receive from any
   partition, and in a BTQ, the sender sends to all partitions.

Tablequeue ID on which agent is waiting
   The table queue ID. This matches the TQ ID in an access plan.

Execution elapsed time (seconds)
   The amount of time that this subsection has been executing. When the
   subsection status is Completed, the elapsed time stops incrementing.

Total user CPU time (sec.microsec) and Total system CPU time (sec.microsec)
   Information about CPU cycles that are spent in the subsection.

Current number of tablequeue buffers overflowed
   The number of buffers that are currently spilled to a temporary table for
   this TQ. This number increases if new spills happen and decreases as
   spilled buffers are read from the temporary table and resent on the TQ.

Total number of tablequeue buffers overflowed
   The total number of buffers that spilled in this subsection. Because this
   is a cumulative count that starts at 0 when the query is invoked, the
   count does not decrease as the query executes.

Maximum number of tablequeue buffers overflowed
   A high water mark for the current spills metric. This counter keeps track
   of the largest size (number of buffers) that the spill table reached.

Rows received on tablequeues
   The number of rows that were received from all TQs for this subsection.

Rows sent on tablequeues
   The number of rows that were sent into a TQ for this subsection.

Rows read
   The number of rows that were read for this subsection.

Rows written
   The number of rows that were written for this subsection.

Number of agents working on subsection
   The number of agents working on the subsection. The value of this field
   is always 1 in a partitioned environment without SMP. The value is
   larger than 1 only if you configured the partitioned environment with
   SMP (you set the intra_parallel database manager configuration
   parameter to ON). In that case, there can be multiple subsections, and
   there can also be multiple copies of the same subsection.

For these fields to make sense in the context of a query’s execution, you must
relate this subsection information to the access plan that this query is
executing.

4.2 Getting the access plan


To explain the access plan, issue the db2exfmt or db2expln command. In this
paper, the db2exfmt command is the tool that is used to explain the query, but
the db2expln command works equally well. The DB2 Information Center has
more detailed instructions on how to get the access plan information and all of
the different db2exfmt command options.

Here's a short example of collecting db2exfmt command output for a query
from the CLP. Assume that the query.sql file contains the following statement:

SELECT * FROM testtab;

You could then issue the following commands:

db2 connect to testdb
db2 set current explain mode explain
db2 -tvf query.sql
db2 set current explain mode no
db2exfmt -d testdb -g TIC -w -1 -n % -s % -# 0 -o query.exfmt.out

Here's a simple example of an access plan graph from the db2exfmt command.
Also shown is the detailed information about the DTQ, at access plan operator
number 2. Scroll down in the access plan output to find the details for each
operator.

        Rows
       RETURN
       (   1)
        Cost
         I/O
          |
         720
         DTQ
       (   2)
       47.572
          1
          |
         180
       TBSCAN
       (   3)
       21.256
          1
          |
         180
 TABLE: DB2INST1
      TESTTAB
        Q1

2) TQ : (Table Queue)
   Cumulative Total Cost:       47.572
   Cumulative CPU Cost:         1.02017e+06
   Cumulative I/O Cost:         1
   Cumulative Re-Total Cost:    38.7832
   Cumulative Re-CPU Cost:      969581
   Cumulative Re-I/O Cost:      0
   Cumulative First Row Cost:   11.623
   Cumulative Comm Cost:        11.438
   Cumulative First Comm Cost:  0
   Estimated Bufferpool Buffers: 1

   Arguments:
   ---------
   LISTENER: (Listener Table Queue type)
      FALSE
   TQMERGE : (Merging Table Queue flag)
      FALSE
   TQNUMBER: (Runtime Table Queue number)
      1
   TQREAD  : (Table Queue Read type)
      READ AHEAD
   TQSECNFM: (Runtime Table Queue Receives From Section #)
      1
   TQSECNTO: (Runtime Table Queue Sends to Section #)
      0
   TQSEND  : (Table Queue Write type)
      DIRECTED
   UNIQUE  : (Uniqueness required flag)
      FALSE

It is important to identify the TQ ID (runtime table queue number) and the
subsection numbers that are involved in the TQs of the plan.3 Using these
fields, you can map the snapshot information to the access plan and find out
what each subsection is doing. In the example output, for instance, TQNUMBER 1
together with TQSECNFM 1 and TQSECNTO 0 tells you that TQ ID 1 in the
snapshot is this DTQ (#2), which receives rows from subsection 1 and sends
them to subsection 0 (the coordinator).

3 The ability to see the subsection number and TQ ID in db2exfmt command output is available in DB2
V9.5 as of Fix Pack 9, in DB2 V9.7 as of Fix Pack 6, and in DB2 V10.1.

4.3 Reformatting the snapshot output


After collecting the snapshots and access plan for the query, start analyzing
this information to learn what it means.

The way that the information is presented in the snapshot output sometimes
makes it difficult to see what is going on. A sample Perl script that you can use
(format_subsection_snap.pl) is provided with this paper and is described in the
appendix. It parses the snapshot information and reformats the output into a
table format for ease of reading. You can customize this Perl script to suit
your needs.

The sample formatting script takes a file name and the application handle as
input and formats the snapshot as in the following example:

/home/db2inst1/ $ format_subsection_snap.pl -f appsnap.04.25_09.13.19 -h 262206


Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Exec No No 2 0 0 28795 0 0 0 143
1 0 Wsend 4 1 2 0 0 368 9126 250088 0 304
1 1 Wsend 4 1 2 0 0 362 8892 250450 0 303
1 2 Wsend 4 1 2 0 0 406 10062 249858 0 301
1 3 Exec No No 2 0 0 0 0 484364 0 302
2 0 Wsend Any 2 2 0 0 0 900 901 0 310
2 1 Wsend Any 2 2 0 0 0 900 906 0 308
2 2 Wsend Any 2 2 0 0 0 900 906 0 307
2 3 Wsend Any 2 2 0 0 0 900 906 0 309

With the snapshot output formatted this way, you can see all the subsections
on all the partitions lined up with their associated monitoring metrics. You can
see the status of the subsections at a particular point in the query’s execution.

Assume that you took multiple snapshots over time. By running the formatting
script against the different snapshot files and then comparing the elapsed time
between each snapshot, you can compute some meaningful numbers, such as
these:
• Rows read per second
• Rows written per second
• Rows sent on TQ per second
• Rows received from TQ per second
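As an example, the following awk sketch computes rows read per second for each subsection and partition by comparing two of the formatted snapshot files; it assumes the column layout shown above (Elap in column 6 and RRead in column 11), and the file names are taken from the earlier examples:

awk 'FNR == 1 { next }                                        # skip the heading line
     NR == FNR { elap[$1,$2] = $6; rread[$1,$2] = $11; next } # remember the earlier snapshot
     {
       de = $6  - elap[$1,$2]                                 # elapsed-time delta
       dr = $11 - rread[$1,$2]                                # rows-read delta
       if (de > 0) printf "ss %s part %s: %.0f rows read/sec\n", $1, $2, dr / de
     }' appsnap.04.25_09.13.19.fmt appsnap.04.25_09.13.30.fmt

The same approach works for the other per-second rates; substitute the column for the metric that you want to track.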
It might be helpful to display this information for a single subsection's life over
a span of several formatted outputs. To do this, run the formatting script
against each snapshot file, and redirect the output to a new file, as shown in
the following example:

format_subsection_snap.pl -f appsnap.04.25_09.13.19 -h 262206 > appsnap.04.25_09.13.19.fmt

After reformatting and redirecting all the snapshot output, identify which agent
you need to focus on by using the AgentID column in one of the files. Then,
show only the data for that agent across all of the snapshots that you collected,
as shown in the following example:

/home/db2inst1/ $ head -1 appsnap.05.01_11.40.47.fmt; cat appsnap*.fmt | awk '{if ($13 == 566) {print $0}}'


In the previous example, the head -1 option is used to display the titles again.
The cat and awk commands sift through all the formatted output files,
displaying only the output where the value of the AgentID column is 566. The
output is now as follows:

Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
2 1 Exec No No 3 0 0 0 6146 6146 0 566
2 1 Exec No No 8 0 0 0 17652 17652 0 566
2 1 Exec No No 13 0 0 0 28217 28217 0 566
2 1 Exec No No 18 0 0 0 37570 37570 0 566
2 1 Exec No No 24 0 0 0 50620 50620 0 566
2 1 Exec No No 29 0 405 0 63956 64361 405 566
2 1 Exec No No 35 0 405 0 75849 76254 405 566
2 1 Exec No No 40 196 601 0 88236 88641 601 566
2 1 Exec No No 45 0 735 0 101943 102678 735 566
2 1 Exec No No 51 0 735 0 112982 113717 735 566

Each line of this output is for the same agent and corresponds to the elapsed
time when the snapshot was taken.

This subsection information from the snapshots, coupled with the access plan
output, is the tool set that you use to investigate TQ performance.

5 TQ performance

If there is one idea to remember after reading this paper, it is the importance
of balance as related to TQs.

Balancing the system is generally a key concept in any performance discussion.
It is especially important for TQs because the TQ flow control mechanism is
based on analyzing the throughput on the TQ and then employing a timing
mechanism to decide when TQ waits should happen or when the TQ should
spill. An imbalance somewhere in the query flow can lead to the engagement of
the flow control mechanism.

The following sections in this paper contain a number of TQ scenarios that
demonstrate how balance plays a role in a query. The TQ scenarios might seem
simple or contrived, but the concepts also apply to very large and complex
queries. The goal here is not to list all the common types of issues. Rather, it is
to provide some examples of TQ issues and to demonstrate a methodology.
Using this information can help you to identify problems and symptoms in your
own workloads and queries and improve the performance in your database.

In these sample scenarios, it is assumed that an access plan and a series of
formatted snapshots have been collected as described in section 4 of this
paper.

5.1 TQ performance scenario 1: Do I really have a TQ problem?
5.1.1 Query description

This query contains a hash join, followed by a sort and a TQ to the coordinator.
The query is run in two different ways. Some performance degradation is
artificially injected in the first run to demonstrate different behaviors when the
two runs are compared.

5.1.2 Setup and background

There are five partitions: four data partitions (partitions 0, 1, 2, and 3) and one
dedicated coordinator and catalog partition (partition 4). The following shows
the example query, its access plan and subsection layout.
Query:

SELECT *
FROM tab1 t1,
     tab2 t2
WHERE t1.joinkey = t2.distkey
ORDER BY 1 FETCH FIRST 100 ROWS ONLY

DDL:

CREATE TABLE tab1 (distkey INT,
                   joinkey INT,
                   something INT,
                   info VARCHAR(1500))
  DISTRIBUTE BY HASH (distkey);

CREATE TABLE tab2 (distkey INT,
                   joinkey INT,
                   info CHAR(4))
  DISTRIBUTE BY HASH (distkey);

Access plan:

                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                    |
                   100
                  MDTQ
                 (   2)
                 439035
                 85801.3
                    |
                   100
                 TBSCAN
                 (   3)
                 439008
                 85801.3
                    |
                   100
                  SORT
                 (   4)
                 439007
                 85801.3
                    |
                 147409
                 HSJOIN
                 (   5)
                 430079
                 85801.3
             /------+------\
      2.4996e+06          125513
        TBSCAN             DTQ
        (   6)            (   7)
        189693            85693.8
         15535             41867
           |                 |
      2.4996e+06          125513
    TABLE: DB2INST1       TBSCAN
         TAB2             (   8)
          Q1              55861.8
                           41867
                             |
                           125513
                      TABLE: DB2INST1
                           TAB1
                            Q2

Subsection layout:

Subsection 0 includes:
− The receiver of MDTQ (#2) (also referred to as TQ ID1 in the snapshot)
− The coordinator, which returns rows to the client application

Subsection 1 includes:
− The receiver of DTQ (#7) (also referred to as TQ ID2 in the snapshot)
− The scan of table tab2
− The hash join logic to qualify rows before inserting them into SORT (#4) before it

Subsection 2 includes:
− The scan of rows from the base table tab1
− The sender of DTQ (#7) (also referred to as TQ ID2 in the snapshot)


5.1.3 Monitoring results

The snapshot to focus on is the one that was taken while subsection 2 was
sending rows to DTQ (#7), also referred to as TQ ID2 in the snapshot.

Run 1 snapshot results are as follows:

/home/db2inst1/ $ format_subsection_snap.pl -f appsnap.05.03_15.44.10 -h 262196


Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Wrecv Any 1 24 0 0 0 0 0 0 137
1 0 Exec No No 24 0 0 31982 0 0 8416 177
1 1 Exec No No 24 0 0 39548 0 0 10436 157
1 2 Exec No No 24 0 0 31131 0 0 8168 225
1 3 Exec No No 24 0 0 41309 0 0 10896 224
2 0 Wsend 3 2 24 0 0 0 37197 37198 0 226
2 1 Wsend 3 2 24 0 0 0 35127 35128 0 208
2 2 Wsend 3 2 24 0 0 0 36057 36058 0 158
2 3 Wsend 3 2 24 0 0 0 35813 35814 0 161

Observations:
• The snapshot was taken at approximately the 24-second mark of the query
execution.
• Subsection 2 read approximately 35,000 rows on each partition and sent those
into the TQ.
• The coordinator is waiting to receive.
• Subsection 2 on all partitions is in the Waiting to send on tablequeue state.
• Subsection 1 on all partitions is in the Executing state.

Run 2 snapshot results are as follows:

/home/db2inst1/ $ format_subsection_snap.pl -f appsnap.05.03_13.21.08 -h 263026


Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Wrecv Any 1 20 0 0 0 0 0 0 398
1 0 Wrecv Any 2 20 0 0 25215 0 0 0 888
1 1 Wrecv Any 2 20 0 0 31682 0 0 0 953
1 2 Wrecv Any 2 20 0 0 25121 0 0 0 951
1 3 Wrecv Any 2 20 0 0 33015 0 0 0 954
2 0 Exec No No 20 0 0 0 28299 28299 0 955
2 1 Exec No No 20 0 0 0 28586 28586 0 928
2 2 Exec No No 20 0 0 0 28395 28395 0 952
2 3 Exec No No 20 0 0 0 29738 29738 0 927

Observations:
• This snapshot was taken at approximately the 20-second mark of the query
execution.
• Subsection 2 read approximately 28,000 rows and sent them into the TQ.
• Subsection 1 is in the Waiting to receive on tablequeue state on all partitions.
• Subsection 2 is in the Executing state on all partitions.

5.1.4 Questions, answers, and discussion

Question 1.
In both runs, why was subsection 0 waiting to receive from the TQ?

Answer 1.
This situation occurred because of the SORT step in the access plan below
MDTQ (#2). The SORT step acts as a dam, such that no rows flow up the plan
to the MDTQ until after the last row has been inserted into the SORT step. Only
when the last row has been inserted into the SORT step will DB2 perform the
sorting and begin producing rows up the plan to the MDTQ. It is normal for the
coordinator to wait; this is not a concern.

Question 2.
In run 1, subsection 2 on all the partitions was waiting to send rows into the
TQ, so this must mean that a TQ problem caused these TQ waits, correct? In
run 2, the opposite situation occurred, where the receiver was waiting. Was
this also a TQ problem?

Answer 2.
No, there was no TQ problem causing the TQs to wait. However, in general, if
one side of the TQ is waiting, the question to ask is, “why is it waiting?” To
answer that, look at what the opposite side of the TQ is doing, and relate that
to the access plan.

In run 1, the sender in subsection 2 was waiting because the receiver in
subsection 1 was busy doing something else. As a result, there was a backlog
on the sender. The receiver is in the part of the access plan that performs a
hash join. Perhaps something was happening slowly inside the hash join, such
that subsection 1 spent more time there.

In run 2, the receiver was waiting. This meant that the sender, which was
scanning rows from a table, must have been too slow in reading rows and
inserting them into the TQ.

5.1.5 Remarks

In run 1, an I/O bottleneck was artificially introduced. Specifically, the sortheap
configuration parameter value was set very low, and the buffer pool on the
temporary table space was also made small. Doing these things meant that a
very small amount of memory was available for the hash join (hash joins use
sort memory), causing it to use temporary tables. The small temporary buffer
pool further resulted in disk I/O to support this hash join and a significant
slowdown of the execution.
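The exact settings that were used are not given here; as a rough sketch, the kind of commands involved would look like the following, where the values and the buffer pool name are placeholders only:

db2 update db cfg for <dbname> using SORTHEAP 16
db2 "alter bufferpool <tempbp> immediate size 250"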

This scenario demonstrates that a performance problem that has nothing to do
with the TQ logic can manifest symptoms in the TQ monitoring output: in this
case, TQ send waits. Nonetheless, you can use these symptoms to narrow
down where the bottleneck is in the query. By looking at the snapshot and the
access plan, you can identify the problem areas of the query and then focus
further investigation in those areas.


In run 2, no artificially fabricated performance degradation was applied. The
query was running naturally, with a reasonable configuration. When there is
sufficient sort memory, the hash join very quickly consumes the rows on the
right side of the join. Because these are memory operations in the hash join, it
runs faster than the sender subsection, which must do some I/O to scan the
base table. The receiver is simply doing work that is naturally faster than the
work that the sender is doing, and this situation is reflected in the
receiving-side TQ waits.

5.2 TQ performance scenario 2: Base table skew and the TQ start time problem
5.2.1 Query description

This scenario consists of a simple hash join between tables tab1 and tab2.
Many rows are returned to the coordinator.

5.2.2 Setup and background

There are five partitions: four data partitions (partitions 0, 1, 2, and 3) and a
dedicated coordinator and catalog partition (partition 4).

The following shows the example query, its access plan and subsection layout.

Query:

SELECT *
FROM tab1 t1,
     tab2 t2
WHERE t1.joinkey = t2.joinkey;

DDL:

CREATE TABLE tab1 (distkey INT,
                   joinkey INT,
                   info CHAR(4))
  DISTRIBUTE BY HASH (distkey);

CREATE TABLE tab2 (distkey INT,
                   joinkey INT,
                   info CHAR(4))
  DISTRIBUTE BY HASH (distkey);

Access plan:

                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                    |
               9.02774e+07
                  DTQ
                 (   2)
               4.48707e+06
                  2957
                    |
               2.25693e+07
                 HSJOIN
                 (   3)
                 114313
                  2957
             /------+------\
         901012           250088
          BTQ             TBSCAN
         (   4)           (   6)
         52610            18985.9
          1401             1556
            |                |
         225253           250088
         TBSCAN      TABLE: DB2INST1
         (   5)            TAB1
         17103.1            Q2
          1401
            |
         225253
     TABLE: DB2INST1
          TAB2
           Q1

Subsection layout:

Subsection 0 includes:
− The receiver of DTQ (#2) (also referred to as TQ ID1 in the snapshot)
− The coordinator, which returns results to the application

Subsection 1 includes:
− The “build” (right) side of the hash join, which contains the scan of tab1
− The receiver of BTQ (#4) (also referred to as TQ ID2 in the snapshot), on the “probe” (left) side of the hash join
− The hash join logic itself. Any matching rows are sent into TQ ID1 to the coordinator.

Subsection 2 includes:
− The scan of tab2
− The sender of BTQ (#4) (also referred to as TQ ID2 in the snapshot)

Subsections 1 and 2 exist on the data partitions (0, 1, 2, and 3). Subsection 0,
the coordinator, is only on partition 4.

5.2.3 Monitoring results

The following shows three snapshots taken at different points during the query execution:

Snapshot #1:
/home/db2inst1/ $ format_subsection_snap.pl -f appsnap.04.25_09.13.19 -h 262206
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Exec No No 2 0 0 28795 0 0 0 143
1 0 Wsend 4 1 2 0 0 368 9126 250088 0 304
1 1 Wsend 4 1 2 0 0 362 8892 250450 0 303
1 2 Wsend 4 1 2 0 0 406 10062 249858 0 301
1 3 Exec No No 2 0 0 0 0 484364 0 302
2 0 Wsend Any 2 2 0 0 0 900 901 0 310
2 1 Wsend Any 2 2 0 0 0 900 906 0 308
2 2 Wsend Any 2 2 0 0 0 900 906 0 307
2 3 Wsend Any 2 2 0 0 0 900 906 0 309

Observations:
• This snapshot was taken very early in the execution of the query. There is an
elapsed time of 2 seconds for each subsection.
• Subsection 2 on all data partitions sent 900 rows on the TQ, but now seems to
be in a sender TQ wait.
• Subsection 1 has matched rows from the join. Subsection 1 sent this data to
the coordinator, but it experienced some TQ waits.
• Subsection 1 on partition 3 did not have any matched rows from the join. You
can tell because subsection 1 has not sent any data on its TQ to the
coordinator yet.

Snapshot #2:
/home/db2inst1/ $ format_subsection_snap.pl -f appsnap.04.25_09.13.30 -h 262206
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Wrecv Any 1 13 0 0 271488 0 0 0 143
1 0 Wrecv Any 2 13 0 0 3600 90205 250088 0 304
1 1 Wrecv Any 2 13 0 0 3600 89688 250450 0 303
1 2 Wrecv Any 2 13 0 0 3600 89823 249858 0 301
1 3 Exec No No 13 0 0 0 0 2741801 17800 302
2 0 Wsend Any 2 13 0 0 0 900 901 0 310
2 1 Wsend Any 2 13 0 0 0 900 906 0 308
2 2 Wsend Any 2 13 0 0 0 900 906 0 307
2 3 Wsend Any 2 13 0 0 0 900 906 0 309

Observations:
• Eleven seconds have passed since the last snapshot collection. The query
returned more rows to the coordinator.
• Subsection 1 on partition 3 read more rows, but it still has not sent anything to
the coordinator and is still executing. This subsection read a lot more rows than
did the same subsection on partitions 0, 1, and 2.


• On partitions 0, 1, and 2, subsection 1 is in a TQ wait. This subsection is
  waiting to receive more rows on TQ ID2 coming from subsection 2. So what is
  subsection 2 doing?
• Subsection 2 on all partitions has not sent any more data since the 2-second
mark of the query, and is in a TQ wait after sending only 900 rows each.

Snapshot #3:
/home/db2inst1/ $ format_subsection_snap.pl -f appsnap.04.25_09.13.41 -h 262206
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Exec No No 24 0 0 472229 0 0 0 143
1 0 Wsend 4 1 24 0 0 6684 167544 250088 0 304
1 1 Wsend 4 1 24 0 0 6490 161928 250450 0 303
1 2 Wsend 4 1 24 0 0 5693 142038 249858 0 301
1 3 Exec No No 24 0 0 0 0 3706665 29380 302
2 0 Wsend Any 2 24 1497 1497 0 225253 225268 1497 310
2 1 Wsend Any 2 24 1497 1497 0 225266 225280 1497 308
2 2 Wsend Any 2 24 1496 1496 0 225127 225141 1496 307
2 3 Wsend Any 2 24 1491 1491 0 224354 224368 1491 309

Observations:
• The query has been executing for 24 seconds. More rows have been received
in the coordinator since the last snapshot.
• Subsection 1 on partitions 0, 1, and 2 processed more data and sent more
data to the coordinator. However, on partition 3, subsection 1 still has not sent
any rows on TQs or received anything from the TQs.
• Subsection 2 on all partitions performed some TQ spills. It spilled almost 1500
buffers to temporary tables on each of the four data partitions.

5.2.4 Questions, answers, and discussion


Question 1.
Why was subsection 1, the sender to the coordinator, sometimes showing a
status of “waiting to send”?

Answer 1.
It is quite normal to see some sender TQ waits to the coordinator partition.
This situation occurs because many partitions are trying to send their results to
a single partition (the coordinator subsection 0), but the coordinator must send
those results to the application. The work that is involved with returning results
to the application might take a bit of time, such that the coordinator is slower
to get back to the TQ to receive more rows. In addition, there are multiple
streams moving rows to a single target (a many-to-one situation). As such, a
certain number of bottleneck symptoms are to be expected in this TQ.

Returning to this example, whenever subsection 1 was experiencing a sender TQ
wait, subsection 0 was in an executing state. The coordinator was doing non-
TQ work (processing results for the client application), so it is natural that the
senders might be blocked.

Question 2.
Why was there a TQ spill? What was the performance problem?

Answer 2.
On partition 3, subsection 1 is the real clue to the problem. The fact that it
read millions more records than the same subsection read on the other
partitions is a sign of a balance problem.

Because subsection 1 on partition 3 did not receive any rows from TQ ID2, BTQ
(#4), the query processing was still on the build (right) side of the hash join.

Based on the access plan, the right side of the join was a simple table scan
(TBSCAN). You can check the distribution of the table by using the following
query:
SELECT COUNT(distkey), DBPARTITIONNUM(distkey)
FROM tab1 GROUP BY DBPARTITIONNUM(distkey);
1 2
----------- -----------
250088 0
250450 1
249858 2
10249604 3

4 record(s) selected.

As you can see, this table is not distributed very well. A large portion of the
table exists only on a single partition. This distribution explains why partition 3,
subsection 1 read so many more rows than the other partitions did.

Figure 13 shows the layout of the subsections. The flow of data corresponds
roughly to the information shown in snapshot #1.

Figure 13. Subsections in scenario 2

The TQ senders of subsection 2 sent their data to partitions 0, 1, and 2. This
data was received by subsection 1, which joined the data and sent the results
to coordinator subsection 0 on partition 4. A BTQ was used, so each row had to
be sent to all partitions.

Subsection 2 was blocked in a sender TQ wait when trying to send its data to
partition 3. This situation occurred because the receiving end of the TQ
(subsection 1 on partition 3) was busy doing other things. Subsection 1 was
still performing a table scan because of the base table skew and the need to
process more data than other partitions.

Instead of waiting until subsection 1 on partition 3 completed its work and
started reading from the TQ, the flow control intelligence detected this wait
scenario and triggered a TQ spill in subsection 2 on each partition. After the
spill happened, as shown in snapshot #3, data flowed again, but at the cost of
all of those temporary table writes.

In this scenario, the TQ waits and, eventually, the spill occurred because of a
base table skew that made a subsection spend more time scanning the table
before it could start reading from the TQ.

5.2.5 Remarks

The cause of this problem was a base table skew. Choose the table distribution
key wisely, such that hashing the column data results in rows being evenly
distributed across all data partitions. A unique column, such as the primary
key, is often a good choice.
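
As a minimal sketch of this kind of fix, the table can be rebuilt with a
distribution key that hashes evenly, and the new distribution verified with the
same kind of check that was used above. The column names below are
hypothetical; the real column list must match the existing table:

-- Rebuild the table with a unique column as the distribution key
-- (unique_id is a hypothetical unique column).
CREATE TABLE tab1_new (unique_id BIGINT NOT NULL,
                       distkey   INT,
                       info      VARCHAR(100))
DISTRIBUTE BY HASH (unique_id);

INSERT INTO tab1_new
   SELECT unique_id, distkey, info FROM tab1;

-- Verify that the rows now hash evenly across the data partitions.
SELECT COUNT(*), DBPARTITIONNUM(unique_id)
FROM tab1_new
GROUP BY DBPARTITIONNUM(unique_id);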

5.3 TQ performance scenario 3: Join key skew


5.3.1 Query description

This scenario consists of a simple hash join between two tables. A regular
column (non-distribution-key column) of one table is joined with a distribution-
key column of another table. Therefore, the access plan uses a DTQ to
distribute the rows to the correct partition for the join. The data inside the
tables is evenly distributed across all data partitions.

5.3.2 Setup and background

There are five partitions: four data partitions (partitions 0, 1, 2, and 3) and one
dedicated coordinator and catalog partition (partition 4).

The following shows the example query, its access plan and subsection layout.

Query:

SELECT *
FROM tab1 t1,
     tab2 t2
WHERE t1.joinkey = t2.distkey
ORDER BY 1 FETCH FIRST 100 ROWS ONLY;

DDL:

CREATE TABLE tab1 (distkey INT,
                   joinkey INT,
                   something INT,
                   info VARCHAR(1500))
DISTRIBUTE BY HASH (distkey);

CREATE TABLE tab2 (distkey INT,
                   joinkey INT,
                   info CHAR(4))
DISTRIBUTE BY HASH (distkey);

Access plan:

                        Rows
                       RETURN
                       (   1)
                        Cost
                         I/O
                          |
                         100
                        MDTQ
                       (   2)
                       439035
                       85801.3
                          |
                         100
                       TBSCAN
                       (   3)
                       439008
                       85801.3
                          |
                         100
                        SORT
                       (   4)
                       439007
                       85801.3
                          |
                        147409
                       HSJOIN
                       (   5)
                       430079
                       85801.3
                     /---+----\
            2.4996e+06         125513
              TBSCAN            DTQ
              (   6)           (   7)
              189693          85693.8
               15535           41867
                 |               |
            2.4996e+06         125513
         TABLE: DB2INST1       TBSCAN
               TAB2            (   8)
                Q1            55861.8
                               41867
                                 |
                               125513
                          TABLE: DB2INST1
                               TAB1
                                Q2

Subsection layout:

Subsection 0 includes:
   − The receiver of MDTQ (#2) (also referred to as TQ ID1 in the snapshot).
   − The coordinator which returns rows to the application.

Subsection 1 includes:
   − The receiver of DTQ (#7) (also referred to as TQ ID2 in the snapshot), on
     the “build” (right) side of the hash join
   − The scan of table tab2 on the “probe” (left) side of the hash join
   − The hash join logic to match rows and insert into the sort above the join

Subsection 2 includes:
   − The scan of table tab1
   − The sender of DTQ (#7) (also referred to as TQ ID2 in the snapshot)

Subsections 1 and 2 exist on the data partitions (0, 1, 2, and 3). Subsection 0,
the coordinator, is only on partition 4.

5.3.3 Monitoring results


The following shows two snapshots taken at different intervals:

Snapshot #1:
/home/db2inst1/$ format_subsection_snap.pl -f appsnap.05.01_11.40.58 -h 262260
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Wrecv Any 1 24 0 0 0 0 0 0 398
1 0 Wrecv Any 2 24 0 0 89667 0 0 8624 558
1 1 Wrecv Any 2 24 0 0 44444 0 0 0 583
1 2 Wrecv Any 2 24 0 0 35273 0 0 0 564
1 3 Wrecv Any 2 24 0 0 46533 0 0 0 584
2 0 Exec No No 24 0 0 0 52509 52509 0 613
2 1 Exec No No 24 0 0 0 50620 50620 0 566
2 2 Exec No No 24 0 0 0 58933 58933 0 626
2 3 Exec No No 24 0 0 0 53893 53893 0 565

Observations:
• This snapshot was taken approximately 24 seconds after the query began
executing.
• Subsection 2 read 50,000 - 60,000 rows from each partition and sent them into
the TQ. The rows seem to be well distributed in the table, because partitions
are producing approximately the same number of rows.


• All the partitions except for partition 0 received approximately 35,000 - 45,000
rows from the TQ. Partition 0 received almost twice that number:
approximately 90,000 rows.

Snapshot #2:
/home/db2inst1/$ format_subsection_snap.pl -f appsnap.05.01_11.41.57 -h 262260
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Wrecv Any 1 83 0 0 0 0 0 0 398
1 0 Exec No No 83 0 0 207921 0 2153425 49764 558
1 1 Exec No No 83 0 0 102718 0 2516588 21040 583
1 2 Exec No No 83 0 0 81665 0 2204140 12932 564
1 3 Exec No No 83 0 0 107696 0 2368494 23004 584
2 0 Comp No No 57 0 691 0 125513 126204 691 0
2 1 Comp No No 56 0 735 0 124488 125223 735 0
2 2 Comp No No 56 0 661 0 125689 126350 661 0
2 3 Comp No No 54 0 1344 0 124310 125654 1344 0

Observations:
• This snapshot was taken at the 83-second mark of the query.
• Subsection 2 has long since completed sending its data, so the query is processing
  the hash join or sort, and no rows have been sent to the coordinator yet.
• Subsection 1 on partition 0 received approximately twice as many rows from
the TQ as did subsection 1 on partitions 1, 2, and 3.
• There was a bit of TQ spilling in subsection 2, which suggests some trouble in
the TQ throughput.
• Compute the rate at which rows were scanned from the base table and sent
  into the TQ via subsection 2 on partition 0 as an example:
  Rows sent (RowsSnt) / elapsed time (Elap)
  = 125513 / 57
  =~ 2202 rows sent into the TQ per second

5.3.4 Questions, answers, and discussion


Question 1.
Is there any base table skew in this scenario?

Answer 1.
No. Subsection 2 scanned the rows from the base table and sent them into the
TQ. Based on the number of rows that the subsection sent into the TQ and
read, the table has a good balance and distribution of rows.

Question 2.
Why did the TQ spill?

Answer 2.
An imbalance occurred in the number of rows that subsection 1 received from
subsection 2 through the TQ. On partition 0, subsection 1 received
approximately twice as many rows as subsection 1 on partitions 1, 2, and 3.
This skew caused an imbalance.

Figure 14 shows this case, where the skew caused a TQ wait scenario:

Figure 14. Subsections in scenario 3

There are two possible reasons why the flow of data on this TQ was slow:

• Partition 0 might have been overloaded with more work.


• More importantly for the TQ flow control intelligence, the receivers on
partitions 1, 2, and 3 might have been waiting for work. DB2 identified
this as a problem that it decided to solve by triggering a TQ spill.

Question 3.
If there is no base table skew, what caused the imbalance?

Answer 3.
The base table is evenly distributed. In the access plan output for the DTQ we
see:
7) TQ : (Table Queue)
Arguments:
---------
PARTCOLS: (Table partitioning columns)
1: Q2.JOINKEY
TQMERGE : (Merging Table Queue flag)
FALSE
TQNUMBER: (Runtime Table Queue number)
2
TQREAD : (Table Queue Read type)
READ AHEAD
TQSECNFM: (Runtime Table Queue Receives From Section #)
2
TQSECNTO: (Runtime Table Queue Sends to Section #)
1
TQSEND : (Table Queue Write type)
DIRECTED


The DTQ is hashing the column named joinkey in the table tab1 to identify
which target partition it should send each row to. Too many rows are being
sent to a single partition, so there must be some kind of skew in the data for
this column. You can run the following query to investigate this:
SELECT COUNT(joinkey), joinkey
FROM tab1
GROUP BY joinkey
ORDER BY 1 DESC
FETCH FIRST 5 ROWS ONLY

1 JOINKEY
----------- -----------
125000 0
824 424
818 352
812 342
811 485

5 record(s) selected.
You can see that the joinkey value of 0 occurs very frequently: 125,000 times.
This is why so many rows are being directed to a single partition, causing the
skew.

Question 4.
Is sending 2000 rows into a TQ per second considered a good throughput?

Answer 4.
A throughput of 2000 rows per second is probably not good, but it depends on
the situation. The answer depends on many factors: for example, the CPU power,
the disk subsystem and layout, the concurrency, the memory usage, and the plan
logic that is producing data for the TQ. We can give a better answer to this
question if we have a comparable scenario to examine.

The following snapshot was taken using the same set of data and the same
hardware that were used earlier in this scenario, except that the column
joinkey no longer has the skew and has a more balanced distribution of values.

/home/db2inst1/$ format_subsection_snap.pl -f appsnap.05.01_11.37.08 -h 262260


Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Wrecv Any 1 29 0 0 0 0 0 0 398
1 0 Exec No No 29 0 0 110456 0 558622 16744 558
1 1 Exec No No 29 0 0 137074 0 304700 23520 583
1 2 Exec No No 29 0 0 108990 0 371055 15632 564
1 3 Exec No No 29 0 0 143480 0 351271 25792 584
2 0 Comp No No 26 0 0 0 125513 125513 0 0
2 1 Comp No No 26 0 0 0 124488 124488 0 0
2 2 Comp No No 26 0 0 0 125689 125689 0 0
2 3 Comp No No 26 0 0 0 124310 124310 0 0

In this case, the number of rows that were sent into the TQ is the same as in
the skewed case presented previously. In the latest run, however, the elapsed
time for subsection 2 to complete its work was 26 seconds, a savings of 31
seconds.

Once again, calculate the number of rows sent per second:

Rows sent (RowsSnt) / elapsed time (Elap)
= 125513 / 26
=~ 4827 rows sent into the TQ per second

The throughput is much better: roughly twice as fast.

5.3.5 Remarks

This performance problem was not caused by skew in the base table
distribution. Instead, the skew was in the data of a regular column, not a
distribution key column.

There are a couple of possible ways to address this type of scenario:

• Investigate the business and query logic that resulted in this type of
skew. Why are there so many items that have the same value?
Perhaps you can redesign the logic to avoid this or rewrite the query
to avoid using the column as a join key.

For example, assume that the joinkey value of 0 has a special
meaning in the application logic and is already known not to have
any qualifying rows. In that case, adding a predicate of where
joinkey <> 0 might push that predicate evaluation into the scan of
the table tab1, thereby filtering out any of those rows before they
are inserted into the TQ (see the sketch after this list).

• Promote a different access plan that avoids using a DTQ hashed on
  the column with the skew. In particular, using distribution statistics
  from the RUNSTATS command might give the optimizer some
  knowledge of the skew on the column. The optimizer might then be
  able to choose a different access plan that avoids the DTQ (for
  example, RUNSTATS ON TABLE db2inst1.tab1 WITH DISTRIBUTION
  ON COLUMNS (joinkey)). This might not be very effective for the
  particular query in this scenario; it is a very simple example that
  might not have many alternative plan choices, because a join must
  be performed on the column with the skew. Nevertheless, in
  general, having distribution statistics might allow for better access
  plan choices.
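
The following is a minimal sketch of these two options for this scenario. The
extra predicate assumes that joinkey = 0 is known never to produce qualifying
rows, and the RUNSTATS invocation shows only the distribution-statistics clause
that is relevant here:

-- Option 1: filter out the heavily repeated value before it enters the TQ
-- (assumes the application logic guarantees that joinkey = 0 never qualifies).
SELECT *
FROM tab1 t1,
     tab2 t2
WHERE t1.joinkey = t2.distkey
  AND t1.joinkey <> 0
ORDER BY 1 FETCH FIRST 100 ROWS ONLY;

-- Option 2: give the optimizer distribution statistics on the skewed column.
RUNSTATS ON TABLE db2inst1.tab1 WITH DISTRIBUTION ON COLUMNS (joinkey);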

5.4 TQ performance scenario 4: Correlated column skew

5.4.1 Query description

In this query, two tables are joined through a merge join. The unique thing
about a merge join is that it requires the join legs to be in sorted order. In
this case, an index provides the sorted order, instead of performing a sort. This
plan uses an MDTQ, which maintains the sorted order.

5.4.2 Setup and background

There are five partitions: four data partitions (partitions 0, 1, 2, and 3) and one
dedicated coordinator and catalog partition (partition 4).

Query optimization level 3 was used in this case instead of the default
optimization level 5. Level 3 was used to force a merge join access plan instead
of a hash join access plan so that the problem could be demonstrated.
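
As a sketch, the optimization level can be lowered for a test connection through
the CURRENT QUERY OPTIMIZATION special register (a minimal sketch; the original
test may have set the level differently, for example through the DFT_QUERYOPT
database configuration parameter):

SET CURRENT QUERY OPTIMIZATION = 3;
-- ... prepare or explain the query to obtain the merge join plan ...
SET CURRENT QUERY OPTIMIZATION = 5;   -- return to the default level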

The following shows the example query, its access plan and subsection layout.

Query:

SELECT *
FROM tab1 t1,
     tab2 t2
WHERE t1.joinkey = t2.distkey
ORDER BY 1 FETCH FIRST 100 ROWS ONLY;

DDL:

CREATE TABLE tab1 (distkey INT,
                   joinkey INT,
                   something INT,
                   info VARCHAR(1500))
DISTRIBUTE BY HASH (distkey);

CREATE INDEX ind1 ON tab1 (joinkey);

CREATE TABLE tab2 (distkey INT NOT NULL,
                   joinkey INT,
                   info CHAR(4),
                   PRIMARY KEY (distkey))
DISTRIBUTE BY HASH (distkey);

Access plan:

                          Rows
                         RETURN
                         (   1)
                          Cost
                           I/O
                            |
                           100
                          MDTQ
                         (   2)
                         123945
                         3623.64
                            |
                           100
                         TBSCAN
                         (   3)
                         123931
                         3623.64
                            |
                           100
                          SORT
                         (   4)
                         123930
                         3623.64
                            |
                          250088
                         MSJOIN
                         (   5)
                         108783
                         3623.64
                   /-------+-------\
             2.4996e+06          0.100051
               FETCH              FILTER
               (   6)             (   8)
               280025            71866.2
                14979              2125
             /----+----\             |
      2.4996e+06    2.4996e+06     250088
        IXSCAN   TABLE: DB2INST1    MDTQ
        (   7)        TAB2         (   9)
        172103         Q1         71866.2
           4                        2125
           |                          |
      2.4996e+06                    250088
     INDEX: SYSIBM                  FETCH
  SQL120502094004780               (  10)
          Q1                       41801.5
                                     2125
                                       |
                                  /---+----\
                              250088     250088
                              IXSCAN  TABLE: DB2INST1
                              (  11)      TAB1
                             17246.8       Q2
                                 4
                                 |
                              250088
                          INDEX: DB2INST1
                                IND1
                                 Q2

Subsection layout:

Subsection 0 includes:
   − The receiver of MDTQ (#2) (also referred to as TQ ID1 in the snapshot)
   − The coordinator which returns the results to the application.

Subsection 1 includes:
   − The receiver of MDTQ (#9) (also referred to as TQ ID2 in the snapshot)
   − The scan of table tab2, through the index scan and fetch
   − The merge join

Subsection 2 includes:
   − The scan of table tab1, through the index scan and fetch
   − The sender of MDTQ (#9) (also referred to as TQ ID2 in the snapshot)

Subsections 1 and 2 exist on the data partitions (0, 1, 2, and 3). Subsection 0,
the coordinator, is only on partition 4.

5.4.3 Monitoring results

The following shows two snapshots taken at different intervals:

Snapshot #1:
/home/db2inst1/ $ format_subsection_snap.pl -f appsnap.05.02_09.53.52 -h 262611
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Wrecv Any 1 20 0 0 0 0 0 0 398
1 0 Wrecv Any 2 20 0 0 0 0 512 0 772
1 1 Wrecv Any 2 20 0 0 0 0 512 0 647
1 2 Wrecv Any 2 20 0 0 0 0 512 0 564
1 3 Wrecv Any 2 20 0 0 0 0 512 0 584
2 0 Wsend Any 2 20 2159 2159 0 250088 250088 2159 771
2 1 Exec No No 20 2008 2008 0 232900 232904 2008 644
2 2 Exec No No 20 1972 1972 0 228751 228754 1972 646
2 3 Exec No No 20 1993 1993 0 231230 231233 1993 715

Observations:
• This snapshot was taken at the 20-second mark of the query.
• The sending side of the MDTQ in subsection 2 seems to be spilling, but not a
single row has been registered in the receive column of subsection 1 (the
receiver of the TQ).
• The status of the receiving side in subsection 1 indicates that it is waiting to
receive from the TQ.
• Each sender seems to be processing approximately the same number of rows.

Snapshot #2:
/home/db2inst1/ $ format_subsection_snap.pl -f appsnap.05.02_09.54.02 -h 262611
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
0 4 Wrecv Any 1 30 0 0 0 0 0 0 398
1 0 Exec No No 31 0 0 191489 0 191918 191288 772
1 1 Exec No No 30 0 0 186369 0 186741 186352 647
1 2 Exec No No 30 0 0 189953 0 190313 189437 564
1 3 Wrecv 3 2 30 0 0 188232 0 188416 188203 584
2 0 Wsend Any 2 31 491 2159 0 250088 251756 2159 771
2 1 Wsend Any 2 30 539 2162 0 250450 252076 2162 644
2 2 Wsend Any 2 30 501 2157 0 249858 251517 2157 646
2 3 Wsend Any 2 30 526 2155 0 249604 251236 2155 715

Observations:
• This snapshot was taken at approximately the 30-second mark of the query.
• A few more spills are shown in the total spill column for subsection 2, but the
number of current spills (CSpill) has decreased.
• Subsection 1 has received many rows, and the distribution of received rows
seems fairly balanced.


5.4.4 Questions, answers, and discussion

Question 1.
If the sender is blocked when trying to send and the receiver is blocked when
trying to receive, then why are both sides stuck? Shouldn't the receiver have
received some rows by now?

Answer 1.
In a merging TQ, the receiver has a special initialization process that it must
complete before it can start to merge the incoming streams in an ordered
fashion. Consider this example of three sorted streams of data, for which it
needs to return the correct order (1, 2, 3, 4, 5, 6, 7, 8, and then 9):
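
For illustration, assume that the three streams contain the following values (a
hypothetical arrangement whose first values match the comparison described
below):

    Stream 1:  2, 4, 6, 8
    Stream 2:  1, 3, 5
    Stream 3:  7, 9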

The receiver performs this work by checking the next set of values that is sent to it
and picking the correct value in the sorted order (the lowest value, in this case). In
the first step it compares 2, 1, and 7. It chooses the lowest value, 1, removes the 1
from the middle stream queue, and replaces it with the next value in that stream, 3.

The initialization phase of a merging TQ must receive the first value from all of its
senders before it can return results. Consider again the example above. Suppose the
receiver has only the values 2 and 7; the value 1 has not arrived yet. In that case, it
would be incorrect to return the value 2 as the result, because it doesn't yet have the
full picture of all of the incoming streams.

In snapshot #1, the receiving side of the TQ in subsection 1 is currently trying to
initialize and set up the comparisons by getting a row from each of the sending
connections. The fact that it has not registered any rows received means that it is still
performing the initialization process.

Question 2.
Is there any join key skew or base table skew here?

Answer 2.
No. In snapshot #2, you can see that the counters for rows read and rows sent
are well balanced in subsection 2 and that the counters for rows received in
subsection 1 are also well balanced.

Question 3.
Why did the spill happen?

Answer 3.
In this case, there was another type of skew pattern in the query data.
However, it is impossible to identify this skew by looking at the snapshot
counters.

There are some internal ways to see the skew, by using a db2trc command,
but that goes beyond the scope of this paper. If you notice that an MDTQ is
showing heavy spilling, even at the beginning of the query, you should run
some query tests to gather more information.

The MDTQ (#9) is a many-to-many MDTQ and therefore uses a hashing
algorithm on the column joinkey to determine the target partition for the
directed send. Consider the following:
SELECT COUNT(joinkey), joinkey
FROM tab1
GROUP BY joinkey
ORDER BY 1 DESC
FETCH FIRST 5 ROWS ONLY

1 JOINKEY
----------- -----------
1 1
1 6
1 5
1 10
1 8

5 record(s) selected.
This result shows that the values in the joinkey column do not have any skew.
The values in this column are unique, so there are no repeating values that
would all hash to the same target partition.
SELECT COUNT(joinkey)
FROM tab1
WHERE joinkey = distkey

1
-----------
1000000

1 record(s) selected.
This result shows the nature of the skew. There is a correlation between the
join key and the distribution key of the table. This would not be a problem for a
regular DTQ. However, for an MDTQ to properly merge the data, the receiver
must learn what the values are from all of the TQ connections before it can
begin to return rows (see explanation in Question 1).


The following diagram shows the layout of this particular situation.

Figure 15. Subsections in scenario 4

Figure 15 focuses on subsection 1 on partition 0. In fact, subsection 1 on each
of the partitions has this issue because a many-to-many TQ is used. To keep
the figure easier to follow, only a subset of the arrows is shown.

In this case, the join key is the same as the distribution key. The sender
always sends data to its own partition, because the join key value hashes to
the partition that the row is already on. In a situation such as this, the merging TQ never receives
any rows from other partitions, so it cannot complete its initialization. The
arrows with horizontal ends represent the attempt to read from the TQ. The
only way that it can complete its initialization is if the sender spills until it
reaches the end of its scan. Only then can the receiver learn that it does not
have any data being sent to it from the other partitions, and then it can
proceed to receive all the spilled data.

5.4.5 Remarks

This type of skew scenario is rare. This scenario was a bit contrived, because
optimization level 3 was chosen to get the merge join; a hash join was not an
option. This scenario is presented here as an example of the types of problems
to watch out for, in particular, for an MDTQ. Specifically, any pattern of data
that results in some imbalance within an MDTQ is sensitive to spilling.

In this scenario, a merging TQ was chosen because a merge join was used and
the merge join has a sort requirement. If a different type of join method was
chosen in the plan, perhaps the merging TQ would not exist, and some other
type of TQ could have been chosen that would not have this MDTQ initialization
problem. Thus, a possible solution to this problem would be to pursue an
access plan change.

5.5 TQ performance scenario 5: The “stacked” TQ


5.5.1 Query description

This scenario consists of an insert query from a subselect. The query is very
large and complex, so the entire query is not shown.

5.5.2 Setup and background

There are 32 data partitions in a realistic warehouse environment.

The following shows the example query’s access plan and subsection layout.

Query:

A complex insert query.

Access plan (a portion of it):

              1.14684e+08
                 INSERT
                 (   2)
              1.26275e+06
                /      \
       1.14684e+08      63
           DTQ     TABLE: DB2INST1
          (   3)        tab1
          744577
             |
       1.14684e+08
           HSJN
          (   4)
          740783
         /        \
    1.14684e+08    1.7382e+06
        DTQ           SCAN
       (   5)        (  13)
       739100        1301.29
          |             |
    1.14684e+08    1.7382e+06
       HSJN       TABLE: DB2INST1
       (   6)          tab2
       737719
      /      \
  6.3352e+08  etc.
      SCAN
     (   7)
     142524
        |
   6.3352e+08
  TABLE: DB2INST1
       TAB3

Subsection layout:

There are many subsections, but the focus is on subsections 1, 2, and 3.

Subsection 1 includes:
   - The receiver of DTQ (#3) (also referred to as TQ ID1 in the snapshot)
   - The insert into tab1

Subsection 2 includes:
   - The scan of table tab2 on the “build” (right) side of HSJN (#4)
   - The receiver of DTQ (#5) (also referred to as TQ ID2 in the snapshot)
   - The hash join logic, HSJN (#4), and the sender of DTQ (#3) (also referred
     to as TQ ID1 in the snapshot)

Subsection 3 executes:
   - The “build” (right) side of HSJN (#6). This is marked as “etc.” because
     this diagram does not show the full picture.
   - The scan of table TAB3 on the probe side of the hash join.
   - The sender of DTQ (#5) (also referred to as TQ ID2 in the snapshot).

Subsections 1, 2, and 3 exist on all partitions.

5.5.3 Monitoring results

The following output shows subsections 1, 2, and 3 on partition 15 over a
period of time, with a snapshot interval of 30 seconds. Partition 15 was chosen
at random. All the partitions seem to be well balanced, so any partition is a
good representation of the overall query.


Subsection 3:
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
3 15 Wrecv Any 79 0 0 5164689 0 0 18569 10310
3 15 Wrecv Any 3 109 0 0 6987096 0 0 23938 10310
3 15 Wrecv Any 3 139 0 0 8324297 0 0 29173 10310
3 15 Wrecv Any 3 169 0 0 9154655 0 0 32434 10310
3 15 Wsend 27 2 199 0 0 9220507 98824 1608920 38988 10310
3 15 Exec No No 230 0 0 9220507 374296 5457987 44219 10310
3 15 Wsend 5 2 260 0 0 9220507 579432 9072779 49232 10310
3 15 Wsend 27 2 290 0 0 9220507 797755 12194409 54109 10310
3 15 Wsend 28 2 320 0 0 9220507 960625 16038902 58410 10310
3 15 Wsend 2 2 350 0 0 9220507 1099125 18144421 60528 10310
3 15 Wsend 4 2 380 0 0 9220507 1194642 18144476 60528 10310
3 15 Wsend 0 2 410 0 0 9220507 1292960 18144476 60528 10310
3 15 Wsend 0 2 440 0 0 9220507 1449400 18144476 60528 10310
3 15 Wsend 2 2 470 0 0 9220507 1549712 18144476 60528 10310
3 15 Wsend 0 2 501 0 0 9220507 1596587 18144476 60528 10310
3 15 Wsend 13 2 531 0 0 9220507 1670994 18144476 60528 10310
3 15 Wsend 25 2 561 0 0 9220507 1739708 18144476 60528 10310
3 15 Wsend 28 2 591 0 0 9220507 1815566 18144476 60528 10310
3 15 Wsend 10 2 621 0 0 9220507 1865953 18144476 60528 10310

Subsection 2:
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
2 15 Wrecv Any 2 79 0 0 0 0 55636 0 9797
2 15 Wrecv Any 2 109 0 0 0 0 55636 0 9797
2 15 Wrecv Any 2 139 0 0 0 0 55636 0 9797
2 15 Wrecv Any 2 169 0 0 0 0 55636 0 9797
2 15 Exec No No 199 0 0 95454 95453 55636 0 9797
2 15 Wsend 30 1 230 0 0 359053 359052 55636 0 9797
2 15 Wsend 29 1 260 0 26 557847 557846 55662 26 9797
2 15 Wsend 30 1 290 0 26 779561 779560 55662 26 9797
2 15 Wsend 18 1 320 0 26 940716 940715 55662 26 9797
2 15 Wsend 6 1 350 0 26 1086457 1086456 55662 26 9797
2 15 Wsend 0 1 380 0 26 1192094 1192093 55662 26 9797
2 15 Wsend 26 1 410 0 26 1284564 1284563 55662 26 9797
2 15 Wsend 26 1 440 0 26 1452593 1452592 55662 26 9797
2 15 Wsend 26 1 470 0 26 1553920 1553919 55662 26 9797
2 15 Wsend 17 1 501 0 26 1607434 1607433 55662 26 9797
2 15 Wrecv Any 2 531 0 26 1679747 1679747 55662 26 9797
2 15 Wsend 25 1 561 0 26 1742488 1742487 55662 26 9797
2 15 Wsend 25 1 591 0 26 1811626 1811625 55662 26 9797
2 15 Wsend 4 1 621 0 26 1864007 1864006 55662 26 9797

Subsection 1:
Subs Part Stat WPrt TQId Elap CSpill TSpill RowsRec RowsSnt RRead Rwrit AgentID
1 15 Wrecv Any 1 79 0 0 0 0 0 0 3140
1 15 Wrecv Any 1 109 0 0 0 0 0 0 3140
1 15 Wrecv Any 1 139 0 0 0 0 0 0 3140
1 15 Wrecv Any 1 169 0 0 0 0 0 0 3140
1 15 Wrecv Any 1 199 0 0 87549 0 0 87549 3140
1 15 Exec No No 230 0 0 423785 0 0 423784 3140
1 15 Wrecv Any 1 260 0 0 612194 0 0 612194 3140
1 15 Wrecv Any 1 290 0 0 839796 0 0 839796 3140
1 15 Exec No No 320 0 0 1010892 0 0 1010891 3140
1 15 Wrecv Any 1 350 0 0 1137884 0 0 1137884 3140
1 15 Wrecv Any 1 380 0 0 1319412 0 0 1319412 3140
1 15 Wrecv Any 1 410 0 0 1414306 0 0 1414306 3140
1 15 Wrecv Any 1 440 0 0 1583593 0 0 1583593 3140
1 15 Wrecv Any 1 470 0 0 1729281 0 0 1729281 3140
1 15 Wrecv Any 1 501 0 0 1790969 0 0 1790969 3140
1 15 Wrecv Any 1 531 0 0 1865127 0 0 1865127 3140
1 15 Wrecv Any 1 561 0 0 1914041 0 0 1914041 3140

Observations:
• In the first four snapshots or so, up to approximately the 170-second mark of
the query, subsections 1 and 2 seem to be idle, waiting to receive from the TQ.
Subsection 3 is receiving rows from TQ ID3.
• After the 170-second mark, subsections 1, 2, and 3 are all working together,
and the counters for rows that are sent to and received from the TQs are
steadily increasing over time.
• All three of these subsections are often in a waiting state, as follows:
Subsection 1 is often waiting to receive.
Subsection 2 is often waiting to send.
Subsection 3 is often waiting to send.

5.5.4 Questions, answers, and discussion


Question 1.
Is the throughput of the rows here a performance concern?

Answer 1.
Yes. Even without computing the delta numbers of rows that are sent or
received, something is causing the frequent wait states. It never gets bad
enough to result in TQ spilling. However, many waits over a period of time add
up and can result in slower query execution. Almost every snapshot is showing
the same style of TQ waits; this is not a coincidence.

Question 2.
What is causing all of these TQ waits?

Answer 2.
The following TQ layout illustrates the problem. As noted, there are 32
partitions in this example; for simplicity, only three partitions are shown in the
diagram. This diagram shows what the pattern of TQ waits and TQ receives
might look like, based on the previous observations.


Figure 16. Subsections in scenario 5

All three subsections on all of the partitions are active in the same loop of
execution. Based on the access plan, there are no SORT operators, TEMP
operators, or anything else that provides a stopping point of any kind between
these subsections. This is an important fact because if there are more than two
subsections interacting with each other, the interdependencies between
subsections can become more prevalent.

You can imagine the chain reaction if any of these subsections is a bit slower
than others. In the previous diagram, subsection 1 might have to do some I/O
as part of the insert activity. While subsection 1 is inserting, the following
conditions might apply:
• Senders from subsection 2 might be blocked because the receiver from
subsection 1 is busy inserting.
• Because senders from subsection 2 might be blocked, other receivers in
subsection 1 might get a receiver TQ wait because no subsection is
sending them anything. This situation is shown for subsection 1 on
partition 1 in Figure 16.
• Because senders in subsection 2 are waiting, subsection 3 might also be
blocked when trying to send.

Question 3.
The answer to question 2 suggests that the cause of this issue is insert
performance, correct?

Answer 3.
Yes and no. If insert performance is optimized and running to the best of its
ability, it helps to reduce the TQ waits. As in Scenario 1, a TQ wait might be a
symptom of something else being slow.

However, the point of this example is to show that when a third subsection is
added, the wait scenario is amplified. The more subsections that exist in the
same loop of execution without dams, the higher the probability that at some
point, there will be some interaction of the TQs that results in a wait. This
situation is especially true for merging TQs, DTQs, and MDTQs.

Question 4.
What is a dam in the query flow? How can I tell whether there are
multiple subsections that are all involved in the same loop of query execution?

Answer 4.
A dam in the query flow is anything that holds up or stops the flow of rows
from continuing up the query access plan.

The most common types of dams in the query flow are as follows:

Hash join build
The “build” side of the hash join in an access plan is the right side of the
hash join. The build side of the join is always executed first until the last
row is processed. Rows are not processed on the left (“probe”) side of
the join until the entire stream of rows on the build side has been read.

SORT
While the logic lower down in the access plan is producing rows and
inserting them into a sort, nothing flows up above the sort. After the
last row is inserted into the sort, the sorting can start so that another
operation can consume rows from the sort.

TEMP
A TEMP operator in the access plan is similar to a SORT operator. In
general, nothing can be read from the TEMP operator until the last row
has been inserted into it.

UNION
A union can combine results from many of its input streams. The input
streams are processed one at a time. TQ flow on the input streams that
are not currently being processed is prevented.

The concept of a dam is important, because it limits the interaction of all
subsection executions.

For example, assume that a subsection produces rows, inserts them into a
SORT operator, and then sends them on a TQ to another subsection. The
receiver does not have any work to do until the sorting is complete and the
sorted rows can be read. This temporary blockage in the query flow means that
any of the work that happened below the SORT operator in the plan can be
somewhat ignored as a contributing factor to the TQ problem being
investigated.

Usually, only a few subsections are doing work at the same time. For example,
a query with 80 subsections likely has only two or three subsections that are
actively processing data at the same time. When execution is happening in the
bottom subsections of the access plan, the dams are preventing rows from
flowing up the plan right away, such that all of the subsections higher up in the
access plan are idle, in TQ waits.

Having more subsections active at the same time increases the incidence of TQ
waits, similar to the situation in this 3-subsection scenario.

5.5.5 Remarks

This scenario might not even be a real problem that needs a solution. The
optimizer reviews the statistics and chooses the best plan available. If there are
no other access plan choices given the layout of the objects and the statistics,
the chosen plan is already the best one.

If you identify a situation where many subsections are involved in the same
loop of execution and it seems to be introducing a lot of idle time because of
TQ waits, try experimenting with query optimization, perhaps getting different
access plans. See whether the identified loop can be avoided.

6 Table queue tips and facts


This section of the paper summarizes some tips and facts about TQs.

• The coordinator subsection is always subsection number 0.

• In queries, include only the columns that you really need. Avoid using
  SELECT *.

The wider the rows are that are flowing into the TQ, the more space that they
occupy in the FCM buffers. If you send 1000 rows into a TQ, each row is 10
bytes wide, and each buffer is 4 KB in size, it takes three buffers to send this
data. If you send 1000 rows into a TQ and each row is 1000 bytes wide, it
takes 250 buffers to send this data. (These numbers are approximate and do
not take into account things such as overhead). Removing unneeded columns
from the query can reduce the work that is needed to pass around this data on
TQs.
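
For example, using the tables from the earlier scenarios (the column choice here
is only illustrative), selecting just the needed columns keeps tab1's wide
VARCHAR column out of the TQ:

-- Instead of SELECT *, send only the columns that the application needs.
SELECT t1.distkey, t1.joinkey, t2.info
FROM tab1 t1,
     tab2 t2
WHERE t1.joinkey = t2.distkey;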

• Myth: TQ spilling is bad.

In general, TQ spilling is a sign of some throughput problems in the TQ, and it


is good to investigate more deeply. However, even though TQ spilling requires
temporary tables, temporary buffer pool I/O, and CPU work, TQ spilling is not
necessarily a bad thing. Consider the reasons why the TQ spilling might have
been triggered - the flow control mechanism detected an issue in the
throughput of the TQ and decided to spill the TQ to temporary tables. If the
alternative is to have frequent TQ waits or perhaps a different access plan that
is less efficient, spilling the TQ might be a faster way to run the query.
Sometimes, TQ spilling can be the most efficient way to flow the data through
the TQ, given the nature of the data.

• Don't be fooled by the TQ throughput in the monitoring output if the TQ has
  spilled.

The “rows sent” counter for the TQ in the snapshot output is incremented when
the row is packed into the buffer for sending. If the buffer is spilled instead, the
counter is not decremented. Thus, the “rows sent” counter might reflect rows
that were spilled and have not been sent yet.

More importantly, if the TQ is spilling, it no longer encounters any sender TQ
waits because it cannot send new rows before it sends the spilled buffer. Also,
new rows that are inserted into the TQ are spilled behind the previously spilled
buffers. Because there are no TQ waits when spilling, often the sender can
produce rows and spill again very quickly, resulting in a very quick increase in
the “rows sent” counter. This can be misleading, because these rows are not
being sent, but are being spilled and will be sent later.

• The final TQ that sends data to the coordinator partition is always a BTQ, even
if the db2exfmt command output specifies that the TQ is a DTQ.

This optimization happens internally. Because the coordinator is a single
partition, the TQ that sends data to the coordinator is a many-to-one TQ. There
is only one possible target partition for each of the senders. As such, it would
waste CPU cycles to hash the column of the row, because there is only one
place that this row could go. Hence, a single-partition BTQ is used.

• In general, anything that results in a subsection on one partition taking longer
  to reach the TQ operator, compared to the same subsection on the other
  partitions, can cause the TQ to spill. This behavior is especially true for BTQs.
  For example, consider a complex query with a large amount of work below the
  TQ in the access plan. The subsections on some partitions start processing the
  TQ before others, with a timing difference of several seconds. This can lead to
  TQ waits and, in some cases, TQ spills.

• If one data partition seems to be processing more data than another for the
same subsection within a query, this could be a sign of an imbalance
somewhere. Any imbalance can lead to performance problems in the TQ. A few
thousand rows are likely not an issue, but it might be worthwhile to investigate
if the differences between partitions are in the range of tens of thousands of
rows.
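
A quick way to look for this kind of imbalance in a base table is the
per-partition row count that was used in the earlier scenarios (substitute your
own table and distribution key column):

SELECT DBPARTITIONNUM(distkey) AS dbpartition, COUNT(*) AS row_count
FROM tab1
GROUP BY DBPARTITIONNUM(distkey)
ORDER BY 2 DESC;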

• An MDTQ is typically the most prone to spilling. The sender hashes on a value
and sends its data to one of the n possible target partitions. The receiver must
maintain sorted order, so it receives from only one of the n possible partitions.

As such, both sides of the TQ are selective about which partition they
communicate with, which naturally results in wait scenarios and potential
spilling.

• If TQ spilling is prevalent in queries, ensure that the buffer pool that is
  associated with the temporary table space is sufficient to help reduce the I/O
  impact of the spills.
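
As a sketch, the catalog views show which buffer pool backs each system
temporary table space and how large it is; the buffer pool name in the ALTER
statement is a placeholder:

-- Which buffer pool is associated with each system temporary table space?
SELECT t.tbspace, b.bpname, b.npages, b.pagesize
FROM syscat.tablespaces t
JOIN syscat.bufferpools b ON b.bufferpoolid = t.bufferpoolid
WHERE t.datatype = 'T';

-- If that buffer pool is too small, increase its size (in pages), for example:
-- ALTER BUFFERPOOL <bpname> SIZE 50000;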


• A TQ that features frequent sender TQ waits has a larger requirement for
  concurrent FCM buffer usage. If a TQ at the sending side is waiting, there is a
  small backlog of in-use buffers at the receiver. Having the data in TQs flowing
  well can reduce concurrent FCM buffer usage.
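
FCM buffer usage can be observed while such queries run; the following is a
minimal sketch (the output format and available options vary by DB2 version):

# Show current FCM buffer usage and low-water marks.
db2pd -fcm

# The total number of FCM buffers is controlled by the FCM_NUM_BUFFERS
# database manager configuration parameter.
db2 get dbm cfg | grep -i fcm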

7 Conclusion
DB2’s partitioned database environment is a scale-out solution that is well
suited to deploying large data warehouse systems. TQs are the mechanism that
enables rows to be exchanged between database partitions during parallel query
processing. In this paper, we discussed how TQs work and took an in-depth look
at TQ buffer flow control and spilling. We also presented ways to monitor TQ
performance. Example scenarios were then used to illustrate potential TQ
performance issues and what you can do to resolve them. Finally, we provided
some useful TQ tips and facts that can help improve performance.

APPENDIX

Perl script format_subsection_snap.pl


You can use and modify the format_subsection_snap.pl Perl script to parse and
reformat the output from a global application snapshot. The script produces
tabular-format output, which makes it easier to see the bigger picture of what
is happening between all the subsections in a query across all database
partitions.

The script takes two arguments:


-f filename

The file name of the application snapshot to parse.

-h application_handle

The application handle of the application to analyze in the snapshot file.

Section 4.3, “Reformatting the snapshot output” provides an overview of this
script's usage. Thanks to Albert Grankin, also from IBM, for the original Perl
script that we modified and included with this white paper.

REFERENCES

IBM DB2 for Linux, UNIX, and Windows Information Centers

For the list of current DB2 Information Centers, see “Accessing different
versions of the DB2 Information Center” in the IBM DB2 Version 10.1
Information Center:

http://ibm.biz/BdxPgG

Database-related best practice papers

A growing list of database-related best practice papers is available on the DB2
for Linux, UNIX, and Windows Best Practices developerWorks website:

http://ibm.biz/Bdx2ew

You might find the following papers particularly useful:

Managing data growth ( http://ibm.biz/Bdx2Gq )


Learn about physical database design and planning information that is
applicable to data life cycle management and data maintenance
scenarios that are characterized by rapid data growth.

Tuning and Monitoring Database System Performance ( http://ibm.biz/Bdx2nt )

This paper covers important principles of initial hardware and software
configuration as well as monitoring techniques that help you understand
system performance under both operational and troubleshooting
conditions. This paper provides a step-wise, methodical approach to
troubleshooting performance problems.

Writing and Tuning Queries for Optimal Performance ( http://ibm.biz/Bdx2ng )
Learn best practices for minimizing the impact of SQL statements on
DB2 database performance. This paper focuses on good fundamental
writing and tuning practices that can be widely applied to help improve
DB2 database performance.

© Copyright IBM Corporation 2012
IBM United States of America
Produced in the United States of America
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM
representative for information on the products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used.
Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be
used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program,
or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of
this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing


IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are
inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PAPER “AS IS” WITHOUT WARRANTY OF
ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow
disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes may be made periodically to the
information herein; these changes may be incorporated in subsequent versions of the paper. IBM may make improvements
and/or changes in the product(s) and/or the program(s) described in this paper at any time without notice.

Any references in this document to non-IBM Web sites are provided for convenience only and do not in any manner serve
as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product
and use of those Web sites is at your own risk.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of
this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing


IBM Corporation
4205 South Miami Boulevard
Research Triangle Park, NC 27709 U.S.A.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent
goals and objectives only.

This information is for planning purposes only. The information herein is subject to change before the products described
become available.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in
the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in
this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks
owned by IBM at the time this information was published. Such trademarks may also be registered or common law
trademarks in other countries. A current list of IBM trademarks is available on the web at "Copyright and trademark
information" at http://www.ibm.com/legal/copytrade.shtml.
Other company, product, or service names may be trademarks or service marks of others.
