Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in many
jurisdictions worldwide:
IBM PureData™ PureData™
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.
Netezza® and NPS® are trademarks or registered trademarks of IBM International Group B.V., an
IBM Company.
Other product and service names might be trademarks of IBM or other companies.
Exercises description
Linux commands
cat filename            Display a file
cd pathname             Change directories
clear                   Clear the shell
ctrl+c                  Terminate the current process
find . -name filename   Find a file, starting from the current directory
grep 'text' filename    Find every occurrence of the text in the file
ls                      List a directory
ls -l                   Use a long listing
pwd                     Print working directory
wc -l filename          Count the number of lines in a file
set -o vi               Use vi-style command line editing interface
iv IBM PureData System for Analytics Programming and Usage © Copyright IBM Corp. 2013, 2016
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
Contents
Exercises description
Exercise 6. Loading and unloading data using the nzload utility
6.1. Using the nzload utility with command line options
6.2. Using the nzload utility with a control file
Exercise 1. Access training's IBM PureData System for Analytics
Prerequisites
None.
Exercise instructions
As a developer, you need to log in to the IBM PureData System for Analytics and connect to the
system database.
Note
If you have any issues with Citrix, consult the IBM IRLP Citrix Setup Guide provided in your course
instructions.
Important
If you get a Citrix Receiver - Security Warning pop-up message, click Permit use.
You might get an error message when you click Connect. If you do, follow these substeps
before continuing to the next step:
a. On the Remote Desktop Connection panel, click Show Options.
b. In the Computer field, enter the IP address from your course instructions.
d. Click Connect.
__ 5. At the Windows Security prompt, enter the password password.
__ c. Log into the host with the ID and password credentials you received in your course
package, which are referenced as follows throughout the rest of the exercises:
- Username = <student_id>
- Password = <student_pwd>
Important
Ensuring that these variables are set AND set correctly is crucial to successfully completing all
subsequent exercises. DO NOT move on to the next exercise until you have completed all the steps
here and verified that the variables have the correct values.
__ 4. Add the following three new command lines. Note that these commands and variables are
case-sensitive.
export NZ_USER=<student_id>
export NZ_PASSWORD=<student_pwd>
export NZ_DATABASE=<student_id>_db
Your file should look like this except with your information:
Information
When you are working with your own PureData System for Analytics, you can set these variables
for the currently open shell only instead of setting them permanently in your profile. To do this,
instead of editing your bash profile, export NZ_USER, NZ_PASSWORD, and NZ_DATABASE using the
following commands:
export NZ_USER=<student_id>
export NZ_PASSWORD=<student_pwd>
export NZ_DATABASE=<student_id>_db
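Before moving on, it can help to confirm that all three variables are present in the current shell. The following is a minimal sketch, not part of the course tooling: check_nz_env is a hypothetical helper name, and it only tests that each variable is non-empty, without printing the password or checking that the values are correct.

```shell
# Hypothetical helper (not a Netezza command): report which of the three
# NZ_* variables used in this course are unset or empty in the current shell.
check_nz_env() {
  missing=0
  for name in NZ_USER NZ_PASSWORD NZ_DATABASE; do
    # The variable names are fixed literals, so eval is safe here.
    # Only emptiness is tested; the values themselves are never printed.
    if eval "[ -z \"\${$name:-}\" ]"; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `check_nz_env && echo "NZ environment looks complete"`. The function returns a non-zero status when any variable is missing, so it can also guard a script.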
Important
If you get an error message regarding your credentials, this means that your user name and
password are not the same as what you set in your environment variables in the previous exercise.
Run env | grep -i nz to verify your information.
To clear the environment variables do the following:
export NZ_PASSWORD=
export NZ_USER=
Host User
---------- -----
pok-puredata <student_id>
Note
You can cache the password on both active and passive hosts by using the host switch.
Information
If you do not successfully cache your password or you get an authentication error, add your user ID
and password to the command:
nzsql -d system -u <student_id> -pw <student_pwd>
End of exercise
Exercise 2. Administration tools
Prerequisites
Before proceeding with this lab, make sure that you have:
• Completed the previous lab successfully.
Exercise instructions
You are an IBM PureData System for Analytics developer. You need to be able to navigate
NzAdmin and Netezza Performance Portal to access the IBM PureData System for Analytics,
check on the hardware, and access the databases for which you have permissions.
__ 2. On the Connect to IBM Netezza Server screen, enter the following, and then click OK.
__ a. Host: pok-puredata
__ b. User: <student_id>
__ c. Password: <student_pwd>
__ 3. Verify you see the following main application screen and review the system state and
configuration information:
__ 2. Expand SPA Units, click on SPA ID, and then click on each SPU Slot to see individual
partition statuses and details from the physical appliance:
__ 6. To view the Admin and Object permissions for <student_id>, right-click <student_id>
and select Privileges>Admin and then Privileges>Object.
__ 2. From the Login panel, enter User Name admin and Password password, and then click
Login.
Note
You must create a separate account for the Portal because its security credentials are stored in a
different location from the credentials for both NPS and Linux. There is no connection between the
authorization credentials for the Portal and any other credentials.
__ 6. From the Create Account panel, enter your student ID and password, re-enter your
password, and then click OK.
__ 8. Verify your user name is in the Accounts Administration panel, and then close the panel.
__ 10. Double-click the Performance Portal icon in the IBM Netezza Training Tools desktop folder
and log in with your student ID and password.
__ 12. From the Add Host panel, type in the following and then click OK.
__ a. Host: pok-puredata
__ b. User: <student_id>
__ c. Password: <student_pwd>
__ 3. To view its tables, click the Tables tab in the bottom right pane.
__ 4. To review hardware state and configuration status, click Hardware to expand the navigation,
and then review the SPA, SPU, and data slices information. If the panel does not respond within a
couple of minutes, cancel and expand again.
End of exercise
Exercise 3. Databases, tables and schemas
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• Log in to the host
Exercise instructions
You are an IBM PureData System for Analytics developer and want to create a database and tables
with a random distribution. You also need to know how to manage the default schema.
Important
Whenever you are in nzsql and execute an SQL statement, you MUST finish the statement with a
semicolon (;). The backslash (\) commands do not require a semicolon.
Information
If you did not successfully cache your password, add your user ID and password to the command:
nzsql -d system -u <student_id> -pw <student_pwd>
Notice the identifiers for database, schema, and user ID. During the rest of this course, this
helps you ensure you are using the correct database and schema, and are logged in with
the correct user ID.
Notice that the database and schema identifiers have changed to those of your newly
created database:
__ 12. To prepare for subsequent labs, you must add data to your tables. You do this by executing
the script loadtables.sh in the /home/<student_id> directory.
__ a. Exit NZSQL by typing \quit.
__ b. Type cd .. to change to the /home/<student_id> directory.
__ c. To load data into your tables, type the following command:
./loadtables.sh
__ d. To ensure that all data loads successfully, when the script has completed check the
generated nzlog files using the following command:
more <TABLE_NAME>.<STUDENT_ID>.<STUDENT_ID>_DB.nzlog
__ d. \c system
__ e. \c <student_id>_db
End of exercise
Exercise 4. Data distribution
Prerequisites
Before proceeding with this lab, make sure that you have:
• Completed the previous lab successfully
Exercise instructions
Since IBM PureData System for Analytics is built on a massively parallel architecture that
distributes data and workloads over a large number of processing and data nodes, the single most
important tuning factor is picking the right distribution key. The distribution key governs which data
rows of a table are distributed to which data slice and it is very important to pick an optimal
distribution key to avoid data skew, processing skew and to make joins co-located whenever
possible.
Tables in IBM PureData System for Analytics are distributed across data slices based on the
distribution method and key. A poorly chosen distribution method results in data skew or
processing skew. Data skew occurs when the distribution method puts significantly more
records of a table on one data slice than on other data slices. Apart from degrading performance,
this also means that the IBM PureData System for Analytics can hold significantly less data
than expected. Processing skew occurs when query processing takes place mainly on some
data slices, for example, when queries only apply to data on those data slices. Even in tables that
are distributed evenly across data slices, query processing can be concentrated on, or
skewed to, a limited number of data slices. This can happen because IBM PureData System for
Analytics is able to skip data extents (sets of data pages) that do not match a given WHERE
condition.
Both types of skew result in suboptimal performance since in a parallel system the slowest node
defines the total execution time.
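The effect of a distribution key on data skew can be explored offline with a small simulation. The sketch below is only an illustration under stated assumptions: rows_per_slice is a hypothetical helper, the input is a pipe-delimited flat file, and cksum stands in for the appliance's hash function, which this course does not show. Whatever hash is used, a key with only a few distinct values can reach at most that many slices.

```shell
# rows_per_slice FILE COLUMN [SLICES]
# Hypothetical helper: assign each row of a '|'-delimited file to a slice by
# hashing the value in COLUMN, then print "slice_id row_count" for every
# slice that received at least one row. SLICES defaults to 4.
rows_per_slice() {
  slices=${3:-4}
  cut -d'|' -f"$2" "$1" | while IFS= read -r key; do
    # cksum is a stand-in hash (an assumption), not the real distribution hash.
    sum=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
    echo $(( sum % slices ))
  done | sort -n | uniq -c | awk '{ print $2, $1 }'
}
```

Because it starts one cksum process per row, this sketch is only practical on a small sample of a data file, for example the first few thousand lines taken with head.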
Information
If you are still in the second schema you created in the previous exercise, complete the following
steps:
__ a. alter database <student_id>_db set default schema <student_id>;
__ b. Quit nzsql.
__ c. Start nzsql.
__ d. Ensure you are using the default schema:
select current_schema;
__ 4. To see a description of the LINEITEM table’s columns and distribution key, run the describe
command:
\d lineitem
You can see that the LINEITEM table has 16 columns with different data types. Some of the
columns have a “key” suffix and substrings containing the names of other tables and are
most likely foreign keys of dimension tables. The distribution key is L_LINESTATUS.
__ 5. To return a limited number of rows you can use the limit keyword in your select queries.
Execute the following select command to return 10 rows of the LINEITEM table. For
readability only select a couple of columns including the order key, the ship date and the
linestatus distribution key:
select l_orderkey, l_quantity, l_shipdate, l_linestatus from lineitem limit 10;
From this limited sample, you cannot make any definite judgments but you can make a
couple of assumptions. While the L_ORDERKEY column is not unique it seems to have a
number of distinct values. The L_SHIPDATE column also appears to have a lot of distinct
shipping date values. L_LINESTATUS on the other hand has only one shown value, which
might make it a bad distribution key. It is possible that you get different results since a
database table is an unordered set (for example, only “O” or “F” values in the
L_LINESTATUS column).
__ 6. Verify the number of distinct values in a column with a “SELECT COUNT (DISTINCT
column_name)” call. For example, to return the number of distinct values in the
L_LINESTATUS column, execute the following SQL command:
select count (distinct l_linestatus) from lineitem;
You can see that the L_LINESTATUS column only contains two distinct values. As a
distribution key, this results in a table that is only distributed to two of the available
dataslices. L_ORDERKEY on the other hand has many distinct values.
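The same cardinality check can be made on a flat data file before it is loaded. The following is a minimal sketch that assumes a pipe-delimited file; distinct_count is a hypothetical helper, and the position of a given column within the file is something you would need to confirm against the table definition.

```shell
# distinct_count FILE COLUMN
# Hypothetical helper: print the number of distinct values found in the given
# 1-based column of a '|'-delimited flat file.
distinct_count() {
  cut -d'|' -f"$2" "$1" | sort -u | awk 'END { print NR }'
}
```

A column that reports only a handful of distinct values here is a weak candidate for a distribution key, for the reason described above.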
__ 7. Verify this by executing the following SQL call, which returns a list of all dataslices that
contain rows of the LINEITEM table, and the corresponding number of rows stored in them:
select datasliceid, count(*) from lineitem group by datasliceid;
Information
Every IBM PureData System for Analytics table has a hidden column DATASLICEID that contains
the ID of the dataslice in which the selected row is stored. By executing an SQL query that
does a GROUP BY on this column and counts the number of rows for each dataslice ID, data skew
can be detected. In this case the table has been, as expected, distributed to only two of the
four available dataslices. This means that only half of the available space is used, which likely
results in low performance during most query executions. In general, a good distribution key should
have a large number of distinct values with a good value distribution. Columns with a low number of
distinct values, especially boolean columns, should not be considered as distribution keys.
Information
In the following two sections, 4.2 and 4.3, you create two new tables by editing SQL files using vi.
It is possible to achieve the same results by using CREATE TABLE AS (CTAS). The vi method
was chosen to help students get familiar with the editing process. If you are already familiar
with vi, you can use the CTAS method instead.
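For readers who prefer a non-interactive edit, the one-word change made in vi could also be sketched with sed. This is an illustration under assumptions, not the course's method: change_distribution_key is a hypothetical helper, and it assumes the DDL file spells the clause exactly as DISTRIBUTE ON (column), with that case and spacing.

```shell
# change_distribution_key FILE OLD_KEY NEW_KEY
# Hypothetical helper: print the DDL in FILE with the distribution key
# replaced. The source file itself is left unchanged.
change_distribution_key() {
  # In a basic regular expression, parentheses match literally.
  sed "s/DISTRIBUTE ON ($2)/DISTRIBUTE ON ($3)/" "$1"
}
```

For example, `change_distribution_key lineitem.sql l_linestatus l_shipdate > lineitem_shipdate.sql` would write the edited DDL to a new file. If the clause in the file is written differently, the output is simply identical to the input, so it is worth checking the result with grep before using it.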
You are going to pick a new distribution key. As you have seen, it should have a reasonable number
of distinct values. One of the columns that did fit this description was the L_SHIPDATE column as
shown above.
The column has over 2500 distinct values, which is more than enough for a good data distribution
on 4 dataslices, provided that the values themselves are evenly distributed.
__ 1. Reload the table with the new distribution key.
__ a. In PuTTY, ensure you are in the DDL directory by typing:
cd /home/<student_id>/DDL
__ b. Copy the lineitem.sql to a new file called lineitem_shipdate.sql by typing
cp lineitem.sql lineitem_shipdate.sql
__ c. Edit lineitem_shipdate.sql by typing:
vi lineitem_shipdate.sql
__ d. Use the cursor key to navigate to the DISTRIBUTE ON line, and then navigate to the
beginning of l_linestatus.
__ e. Type cw to enter change word mode and change the distribution key by typing:
l_shipdate
__ 6. After the nzload command has executed successfully, generate statistics for the reloaded
table. To do this, use NZSQL and type in the following command:
generate statistics on lineitem;
__ 7. Verify that the new distribution key results in a good data distribution by repeating the query,
which returns the number of rows for each datasliceid of the LINEITEM table:
select datasliceid, count(*) from lineitem group by datasliceid;
Notice that the data distribution is much better now. All data slices have a roughly equal number
of rows.
Information
When you redistribute a table, the system makes a copy of the table in a temporary location and
then redistributes that copy.
The ORDERS table has a key column O_ORDERKEY that is most likely the primary key
of the table. It contains information on the order value, priority and date and has been
distributed on random. This means that IBM PureData System for Analytics does not
use a hash based algorithm to distribute the data. Instead, rows are distributed randomly
on the available data slices. You can check the data distribution of the table, using the
methods we have used before for the LINEITEM table. The data distribution is perfect.
There also are no processing skews for queries on the single table, since in a random
distribution there can be no correlation between any WHERE condition and the
distribution key.
__ 2. An example query returns the average total price and item quantity of all orders grouped by
the shipping priority. This query has to join together the LINEITEM and ORDERS tables to
get the total order cost from the ORDERS table and the quantity for each shipped item from
the LINEITEM table. The tables are joined with an inner join on the L_ORDERKEY column.
Execute the following command and query and note the approximate execution time:
\time
Remember that the ORDERS table was distributed randomly and the LINEITEM table is still
distributed by the L_SHIPDATE column. The join on the other hand is taking place on the
L_ORDERKEY and O_ORDERKEY columns. What is happening is the system is
redistributing both the ORDERS and LINEITEM tables. This is bad because both tables are
of significant size so there is a considerable overhead. This inefficient redistribution occurs
because the tables are not distributed on a useful column.
__ 3. Reload the tables based on the mutual join key to enhance performance during joins. To do
this, you need to reload the LINEITEM table with the new distribution key.
__ a. Ensure you are in the DDL directory.
__ b. Edit the lineitem_shipdate.sql file by typing vi lineitem_shipdate.sql.
__ c. Use the cursor key to navigate to the DISTRIBUTE ON line, and then navigate to the
beginning of l_shipdate.
__ d. Type cw to enter change word mode and change the distribution key by typing:
l_orderkey
The line should now look like this:
DISTRIBUTE ON (l_orderkey);
__ e. Press “Esc” to switch back into command mode.
__ f. Enter :wq! and press Enter to write the file, and quit the editor without any questions.
__ 4. Change the distribution key of the ORDERS table to o_orderkey in preparation for
reloading the table.
__ a. Copy the orders.sql to a new file called orders_orderkey.sql by typing:
cp orders.sql orders_orderkey.sql
__ b. Edit orders_orderkey.sql by typing:
vi orders_orderkey.sql
__ c. Use the cursor key to navigate to the DISTRIBUTE ON RANDOM line, and then navigate
to the beginning of the word RANDOM.
__ d. Type cw to enter change word mode and change the distribution key to o_orderkey by
typing:
(o_orderkey)
The line should now look like this:
DISTRIBUTE ON (o_orderkey);
__ e. Press “Esc” to switch back into command mode.
__ f. Enter :wq! and press Enter to write the file, and quit the editor without any questions.
__ 5. Using the NZSQL console, drop the LINEITEM and ORDERS tables:
drop table lineitem;
drop table orders;
__ 6. Exit NZSQL and type in the following to recreate the LINEITEM and ORDERS tables:
nzsql -db <student_id>_db -f lineitem_shipdate.sql
nzsql -db <student_id>_db -f orders_orderkey.sql
__ 7. After these statements have executed successfully, reload the new tables by issuing the
following commands:
time nzload -db <student_id>_db -t lineitem -df \
/home/<student_id>/DATA/LINEITEM.unl -delim '|' -maxErrors 10
__ 8. After the nzload commands have executed successfully, generate statistics for the reloaded
tables. To do this, use the NZSQL console and type in the following commands:
generate statistics on lineitem;
generate statistics on orders;
__ 9. Repeat step 2 and note the new execution time as it should have improved significantly.
The query should return the same results as in the previous section but run much faster.
You have now loaded the LINEITEM and ORDERS tables into your IBM PureData System for
Analytics using the optimal distribution key for these tables for most situations.
• Both tables are distributed evenly across dataslices, so there is no data skew.
• The distribution key is highly unlikely to result in processing skew, since most WHERE
conditions restrict a key column evenly.
• Since ORDERS is a parent table of LINEITEM, with a foreign key relationship between them,
most queries joining them together use the join key. These queries are co-located.
End of exercise
Exercise 5. Loading and unloading data using external tables
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All eight tables in your database are loaded with the data supplied.
Exercise instructions
In every data warehouse environment there is a need to load new data into
the database. Loading data is not a one-time operation but a recurring one
that can occur hourly, daily, weekly, or even monthly. It is a vital
operation that the data warehouse system must support. IBM
PureData System for Analytics provides a framework to support not only the
loading of data into its database environment but also the unloading of data
from its database environment. This framework has several
components, including:
• External tables – Tables stored as flat files on the host or client systems
and registered like tables in the IBM PureData System for Analytics catalog.
They can be used to load data into the IBM PureData System for Analytics or
unload data to the file system.
• nzload – A wrapper command line tool around external tables that provides
an easy method for loading data into the IBM PureData System for Analytics.
• Format options – Options that control the format of data loaded to and
unloaded from external tables.
Hint
You can also verify your file was created using the NzAdmin tool. If you want to list just the external
tables, use the \dx command.
__ 3. You can list the properties of the external table using the \d nation_ext command.
This output includes the columns and associated data types in the external table. Notice that
this is similar to the NATION table since the external table was created using the sameas
clause in the create external table command. The output also includes the properties
of the external table. The most notable property is the DataObject property which shows
the location and the name of the external datasource file used for the external table.
__ 4. Still in session 1, unload the data from the base table nation to the external table
nation_ext:
insert into nation_ext select * from nation;
Result:
INSERT 0 257
__ 5. Using session 2, review the external file on the host corresponding to the external table and
count the rows using the wc -l (word count lines) command.
wc -l /tmp/<student_id>_nation_ext.unl
Result:
257 /tmp/<student_id>_nation_ext.unl
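The comparison between the row count reported by the INSERT and the line count of the unloaded file can be sketched as a small shell helper. check_row_count is a hypothetical name, and the expected count is whatever number your own INSERT result reported.

```shell
# check_row_count FILE EXPECTED
# Hypothetical helper: compare the number of lines in an unloaded flat file
# with the row count reported by the INSERT statement.
check_row_count() {
  # Arithmetic expansion strips any padding that wc adds on some systems.
  actual=$(( $(wc -l < "$1") ))
  if [ "$actual" -eq "$2" ]; then
    echo "OK: $actual rows"
  else
    echo "MISMATCH: expected $2, found $actual"
    return 1
  fi
}
```

Like wc -l itself, this counts newline characters, so it assumes one row per line with no embedded newlines in the data.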
__ 2. List just the external tables by typing \dx. Notice there are now two external tables.
Notice there are only two columns for this external table since you only specified two
columns when creating the external table. The rest of the output is very similar to the
properties of the other two external tables that you created, with two main exceptions. The
first difference is obviously the DataObject property, since the filename is different. The other
difference is the string used for the delimiter, since it is now ‘=’ instead of the default, ‘|’.
__ 3. Unload the data from the NATION table, but only the data from columns N_NAME and
N_COMMENT:
insert into nation_ext3 select n_name, n_comment from nation;
Hint
Alternatively, you could create the external table and unload the data in one step using the following
command:
create external table nation_ext3 '/tmp/<student_id>_nation_ext3.unl' using
(delimiter '=') as select n_name, n_comment from nation;
Notice that only two columns are present in the flat file using the ‘=’ string as a delimiter.
SELECT FROM statement. The external table is named nation_ext4 using another ASCII delimited
text file named /tmp/<student_id>_nation_ext4.unl.
__ 1. Using session 1, create the next external table, and unload data from both the REGION and
NATION table joined on the REGIONKEY column to list all of the countries and their
associated regions. Instead of specifying the columns in the create external table statement,
use the AS SELECT option:
create external table nation_ext4 '/tmp/<student_id>_nation_ext4.unl' as
select n_name, r_name from nation, region where n_regionkey=r_regionkey;
Notice that the option for compress has changed from false to true indicating that the
datasource file is compressed and the format has changed from text to internal, which
is required for compressed files.
__ 2. Even though the external table definition no longer exists within the student database, the
flat file named /tmp/<student_id>_nation_ext.unl still exists. To verify this, in
PuTTY session 2 type:
ls /tmp/*.unl
End of exercise
Exercise 6. Loading and unloading data using the
nzload utility
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All eight tables in your database have been created.
Exercise instructions
You are an IBM PureData System for Analytics developer and need to load data into your tables,
verify the data is loaded, and check the distribution.
For this section of the lab you continue to use the STUDENT user to load data into the
<student_id>_db database. The nzload utility is used to load records from an external datasource
file into the NATION table. You also review the nzload log files to examine the nzload
options. Because the NATION table is already populated, you first use the truncate table
command to remove its rows.
We continue to use the two PuTTY sessions from the external table lab.
• Session 1 is connected to the NZSQL console to execute SQL commands, for example to
review tables after load operations
• Session 2 is used for operating system commands, to execute nzload commands, view data
files, etc.
6.1. Using the nzload utility with command line options
The first method for using the nzload utility to load data into a table is to specify options at the
command line. You only need to specify the datasource file and use default options for the rest. In
this case we have created for you a script containing the nzload commands.
__ 1. Using PuTTY session 1, remove the rows in the following tables:
truncate table nation;
truncate table part;
truncate table customer;
truncate table lineitem;
truncate table orders;
truncate table partsupp;
truncate table region;
truncate table supplier;
__ 2. Using NzAdmin on the Desktop, verify that your tables currently have no data.
__ 3. Return to PuTTY session 2, and ensure you are in your home directory.
__ 4. Use the vi editor to see the nzload commands in loadtables.sh.
__ 5. Type :q! to exit the editor without writing to the file.
__ 6. Load data into the tables using the following command:
./loadtables.sh
__ 8. Every load task creates an associated log file named
<TABLE>.<DB_SCHEMA>.<DATABASE>.nzlog. By default this log file is created in
the current working directory. Type:
more NATION.<STUDENT_ID>.<STUDENT_ID>_DB.nzlog
Notice that the log file contains the Load Options and the statistics of the load, along with
environment information to identify the database and table.
Information
The -db, -u, and -pw options specify the database name, the user, and the password,
respectively. You could omit these options if the NZ environment variables were set to
the appropriate database, username, and password values. Since the NZ environment variables
NZ_DATABASE, NZ_USER, and NZ_PASSWORD are set to system, admin, and password, you need to use
these options so the load runs against the <student_id>_db database as the student user. The
other options are:
• -t specifies the target table name in the database
• -df specifies the datasource file to be loaded
• -delimiter specifies the string to use as the delimiter in an ASCII delimited text file
There are other options that you can use with the nzload utility. These options were not specified
here since the default values were sufficient for this load task.
The following command is equivalent to the nzload command you used above, but it spells out
additional options that you can omit when their default values are sufficient. In the next
exercise, you learn about the -lf, -bf, and -maxErrors options. With their default values, the
-compress and -format options indicate that the datasource file is an uncompressed
ASCII-delimited text file. For a compressed binary datasource file, you use
-compress true -format internal.
nzload -db <student_id>_db -u <student_id> -pw <student_pwd> -t nation -df
nation_student.unl -delimiter '|' -outputDir '<current directory>' -lf
<table>.<database>.nzlog -bf <table>.<database>.nzbad
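As an illustration of how the options in this box fit together, the following Python sketch assembles an nzload argument list without running it. The function names and the sample values (student01, secret) are hypothetical; only the option names -db, -u, -pw, -t, -df, and -delimiter and the log file naming pattern come from the text above, and the upper-casing of the log file name is an assumption based on the example shown in step 8.

```python
import shlex

def build_nzload_command(database, user, password, table, datafile, delimiter="|"):
    """Assemble an nzload argument list from the options described in the text."""
    return [
        "nzload",
        "-db", database,          # database name
        "-u", user,               # database user
        "-pw", password,          # password
        "-t", table,              # target table name
        "-df", datafile,          # datasource file to be loaded
        "-delimiter", delimiter,  # delimiter used in the ASCII text file
    ]

def default_log_name(table, schema, database):
    """Build <TABLE>.<DB_SCHEMA>.<DATABASE>.nzlog (upper-casing is an assumption)."""
    return f"{table.upper()}.{schema.upper()}.{database.upper()}.nzlog"

# Hypothetical sample values, for illustration only.
args = build_nzload_command("student01_db", "student01", "secret",
                            "nation", "nation_student.unl")
print(shlex.join(args))  # shell-quoted form, so the '|' delimiter stays literal
print(default_log_name("nation", "student01", "student01_db"))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with the delimiter character.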
__ d. To save your changes, press “Esc” to switch back into command mode.
__ e. Enter :wq! and press Enter to write the file, and quit the editor without any questions.
__ 3. Still in the second session, type:
chmod 755 /export/home/<student_id>
__ 4. Load the data using the nzload utility with the control file you created, and with the following
command line options: -u <user>, -pw <password>, -cf <control file>
nzload -u <student_id> -pw <student_pwd> -cf nation.ctl
__ 5. Check the nzload log which was renamed from the default to nation.log.
__ 6. Using the first PuTTY session, ensure that the rows were added to the NATION table:
select * from nation;
End of exercise
Exercise 7. Generate statistics
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All the eight tables in your database are loaded with the data supplied.
Exercise instructions
As a developer you want to ensure that statistics are maintained and up-to-date so that the
generated Query Plans are optimal. Our first long running customer query returns the average
order price by customer segment for a given year and order priority. It joins the customer table for
the market segment and the ORDERS table for the total price of the order. Due to restrictive join
conditions it should not require too much processing time.
Information
The IBM PureData System for Analytics optimizer uses statistics about the data in the system to
estimate the number of rows that result from WHERE conditions, joins, etc. Inaccurate
estimates can lead to poor execution plans. For example, a huge result set could be broadcast
for a join instead of doing a double redistribution.
__ 7. To see the estimated rows for the WHERE conditions in our query run the following
EXPLAIN command:
explain verbose select count(*) from orders where extract(year from
o_orderdate) = 1996 and o_orderpriority = '1-URGENT';
Scroll up the output from your command and you should see estimated rows = 450.
The execution plan of this query consists of two nodes. First, the table is scanned and the
WHERE conditions are applied, which can be seen in the Restrictions sub node. Since we
use a COUNT(*) the Projections node is empty. Then, an Aggregation node is applied to
count the rows that are returned by node 1. When we look at the estimated number of rows,
the figure looks too low. From its available statistics, the IBM PureData System for
Analytics optimizer estimates that only 450 rows are returned by the WHERE conditions,
which is unlikely to be accurate.
__ 8. One way to help the optimizer in its estimates is the collection of detailed statistics about the
involved tables. Execute the following command to generate detailed statistics about the
ORDERS table. Since generating full statistics involves a table scan this command might
take several minutes to execute.
generate statistics on orders;
__ 9. Check to see if generating statistics improved the estimates. Execute the EXPLAIN
command, then scroll up in the output and you should see the following:
As you can see, the estimated rows of the SELECT query have improved drastically. The
optimizer now assumes this WHERE condition applies, in this case, to 9000 rows of the
ORDERS table. This is a much better result than the original estimate of 450.
Estimations are difficult to make. Obviously the optimizer cannot do the actual computation during
planning; it relies on current statistics about the involved columns. Statistics include min/max
values, distinct values, numbers of null values etc. Some of these statistics are collected
dynamically, but the most detailed statistics can be generated manually with the generate
statistics command. Generating full statistics after loading a table or changing its content
significantly is one of the most important administrative tasks on IBM PureData System for
Analytics. The appliance automatically generates express statistics after many tasks, such as load
operations and just-in-time statistics during planning. Nevertheless, full statistics should be
generated on a regular basis.
An estimate, which is what you get in a plan file or the explain output, is just that: an educated
guess. Sometimes it is right on the money, but just as often it is too high or too low.
If there are no restrictions, then we will probably end up using 100% of the rows in the table. So the
row count estimate should be right on the money in that case and the confidence should be 100%.
With one restriction we are starting to make some educated guesses. Because it is a guess, our
level of confidence in our answer drops to 80%.
With two restrictions the guesses, and the odds of being right, multiply, so our confidence is
now 80% of 80% (or 64% total).
If the optimizer comes up with one plan with a low cost and a confidence of 100% and another
plan with the same cost but a much lower confidence level, then it is probably going to choose the
plan with the higher confidence. For any given plan, however, there are dozens of costs, various
confidence levels associated with each cost, and so on.
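The confidence arithmetic described here can be sketched in a few lines of Python. This is only an illustration of the 80%-per-restriction rule of thumb stated above, not the optimizer's actual implementation; the function name plan_confidence is hypothetical.

```python
def plan_confidence(num_restrictions, per_restriction=0.80):
    """Confidence after multiplying one 80% factor per restriction."""
    return per_restriction ** num_restrictions

# 0, 1 and 2 restrictions, as in the explanation above.
for n in range(3):
    print(f"{n} restriction(s): {plan_confidence(n):.0%}")
```

With no restrictions the result is 100%, with one it is 80%, and with two it is 64%.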
__ 3. Choose Generate full statistics, and then select columns of interest and generate.
__ 4. Right-click on the table you generated statistics for, and then choose View Statistics. The
table lineitem has statistics run in a previous step so this screen shot may not look the same
as the one you see.
End of exercise
Exercise 8. Analyzing query plans
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All the eight tables in your database are loaded with the data supplied.
Exercise instructions
IBM PureData System for Analytics uses a cost-based optimizer to determine the best method for
scan and join operations, join order, and data movement between SPUs (redistribute or broadcast
operations if necessary). For example the planner tries to avoid redistributing large tables because
of the performance impact. The optimizer can also dynamically rewrite queries to improve query
performance. The optimizer takes a SQL query as input and creates a detailed execution or query
plan for the database system. For the optimizer to create the best execution plan that results in the
best performance, it must have the most up-to-date statistics. You can use explain, html (also
known as bubble), and text plans to analyze how the IBM PureData System for Analytics executes
a query.
The explain tool is quite useful to spot and identify performance problems, bad distribution keys,
badly written SQL queries and out-of-date statistics. For example, you query the database and see
that the performance could be improved, but when you look at the distribution of data you see it is
not too skewed. Then, you generate the query plans and analyze them. Based on the results, you
define new distribution criteria and apply those to the existing tables and rerun your queries.
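To make the idea of skew concrete, the following Python sketch assigns rows to data slices by hashing a candidate distribution key and reports how uneven the result is. The hash used here (zlib.crc32) is only a hypothetical stand-in for the appliance's internal distribution hash, and the function names, the sample keys, and the choice of four data slices are assumptions made for illustration.

```python
import zlib
from collections import Counter

def slice_for(key, num_slices):
    # Hypothetical stand-in for the appliance's internal distribution hash.
    return zlib.crc32(str(key).encode("utf-8")) % num_slices

def skew(keys, num_slices):
    """Ratio of the fullest data slice to the average slice (1.0 means perfectly even)."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    average = len(keys) / num_slices
    return max(counts.values()) / average

custkeys = list(range(1000))                 # many distinct values
segments = ["BUILDING", "MACHINERY"] * 500   # only two distinct values

print(f"custkey skew: {skew(custkeys, 4):.2f}")
print(f"segment skew: {skew(segments, 4):.2f}")
```

A key with only two distinct values can occupy at most two of the four slices, so its skew is at least 2.0, while a key with many distinct values typically spreads much more evenly.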
During our proof-of-concept, we have identified a couple of long running customer queries that have
significantly worse performance than the number of rows involved would suggest. In this exercise,
you use explain functionality to identify the concrete bottlenecks and if possible fix them to improve
query performance.
Note
A snippet is a unit of work. It is one distinct C++ program. A snippet could have dozens (or hundreds)
of nodes, steps, or operations. One node might scan the table, another node might do
some aggregations on the data, and another node might sort the results. In that case we are probably
scanning and aggregating at the same time as we go along, whereas we cannot start the sort node until
the first two steps have finished completely. When processing a table you will always see a:
ScanNode
RestrictNode
ProjectNode
So even though they are listed as 3 operations, they are really being done all at the same time. We read
a 128KB page of data, throw away the rows and columns we don't want, and then do some further
processing of the data before we go back to get the next 128KB page of data. In this case, the 3
nodes are basically combined into one operation, most of which occurs in the FPGA itself.
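The difference between streaming nodes and a blocking sort can be sketched with Python generators. This is a conceptual illustration only, not how the appliance is implemented; the function names, the two tiny "pages", and the sample rows are hypothetical.

```python
def scan(pages):
    # ScanNode: hand over the rows of one page at a time.
    for page in pages:
        yield from page

def restrict(rows, predicate):
    # RestrictNode: drop rows that fail the WHERE condition.
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):
    # ProjectNode: keep only the columns that are needed later.
    for row in rows:
        yield tuple(row[c] for c in columns)

# Hypothetical sample data: two small "pages" of ORDERS rows.
pages = [
    [{"o_orderkey": 1, "o_totalprice": 300.0, "o_orderpriority": "1-URGENT"},
     {"o_orderkey": 2, "o_totalprice": 100.0, "o_orderpriority": "5-LOW"}],
    [{"o_orderkey": 3, "o_totalprice": 200.0, "o_orderpriority": "1-URGENT"}],
]

# The three generators run interleaved, row by row, in a single pass.
streamed = project(
    restrict(scan(pages), lambda r: r["o_orderpriority"] == "1-URGENT"),
    ["o_orderkey", "o_totalprice"],
)

# The sort is blocking: it must consume every streamed row before it can emit one.
result = sorted(streamed, key=lambda r: r[1])
print(result)
```

Nothing is read from the pages until sorted() starts pulling rows, which mirrors the way the scan, restrict, and project steps are fused while the sort waits for all of its input.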
__ 2. Try to answer the following questions through reviewing the execution plan.
__ a. Which columns of the ORDERS table are used in further computations?
__ b. Is the ORDERS table redistributed, broadcast or can it be joined locally?
__ c. Is the CUSTOMER table redistributed, broadcast or can it be joined locally?
__ d. In which node are the WHERE conditions applied and how many rows does IBM
PureData System for Analytics expect to fulfill the where condition?
__ e. What kind of join takes place and in which node?
__ f. What is the number of estimated rows for the join?
__ g. What is the most expensive node and why?
Hint
A stream operation in IBM PureData System for Analytics explain is a join whose output is not
persisted on disk but streamed to further computation nodes.
We can see in the “Restrictions” clause that the WHERE conditions of our query are applied
during the first node as well. This should be clear since both of the WHERE conditions are
applied to the ORDERS table and they can be executed during the scan of the ORDERS
table. As we can see in the “Estimated Rows” clause, the optimizer estimates a returned set
of 9000 rows which we know is underestimated since in reality tens of thousands of rows
are returned from this table.
__ e. What kind of join takes place and in which node?
The third node of our execution plan contains the join between the two tables. It is a Nested
Loop Join which means that every row of the first join set is compared to each row of the
second join set. If the join condition holds true the joined row is then added to the result set.
This can be a very efficient join for small tables, but for large tables its complexity is
quadratic and therefore in general slower than, for example, a Hash Join. The Hash Join,
though, cannot be used in cases of inequality join conditions, floating point join keys, etc.
[SPU Nested Loop Stream "Node 2" with Temp "Node 1" {}]
-- Estimated Rows = 405000000, Width = 18, Cost = 65097.6 .. 2080409.8, Conf = 64.0
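The contrast between a nested loop join and a hash join can be sketched in Python as follows. This is a simplified, single-process illustration with hypothetical function names and sample rows; it says nothing about how the appliance distributes the work across SPUs.

```python
from collections import defaultdict

def nested_loop_join(left, right, condition):
    """Compare every left row with every right row: len(left) * len(right) checks."""
    comparisons = 0
    out = []
    for l in left:
        for r in right:
            comparisons += 1
            if condition(l, r):
                out.append((l, r))
    return out, comparisons

def hash_join(left, right, left_key, right_key):
    """Build a hash table on one input, then probe it once per row of the other."""
    buckets = defaultdict(list)
    for r in right:
        buckets[right_key(r)].append(r)
    return [(l, r) for l in left for r in buckets.get(left_key(l), [])]

# Hypothetical sample rows.
orders = [(1, 10), (2, 20), (3, 10)]                # (o_orderkey, o_custkey)
customers = [(10, "BUILDING"), (20, "MACHINERY")]   # (c_custkey, c_mktsegment)

nl_rows, comparisons = nested_loop_join(orders, customers, lambda o, c: o[1] == c[0])
hj_rows = hash_join(orders, customers, lambda o: o[1], lambda c: c[0])

# Without a join condition every pair qualifies: a Cartesian product.
cartesian, _ = nested_loop_join(orders, customers, lambda o, c: True)

print(comparisons, len(nl_rows), len(hj_rows), len(cartesian))
```

The hash join needs an equality condition on a key it can hash, which is why it is not available for inequality joins.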
__ f. What is the number of estimated rows for the join?
We can see in the Estimated Rows clause that the optimizer estimates this join node to
return roughly 405 million rows, which is the number of rows from the first node times the
number of rows from the second node.
__ g. What is the most expensive node and why?
As we can see from the Cost clause, the optimizer estimates that the SPU Sort, SPU Group
and Host Aggregate have costs in the range from 37862669.4 .. 37879544.4. So our
performance problems clearly originate in the join in Node 3, and the problem continues
throughout the rest of the nodes. So what is happening here? If we take a look at the query
we can assume that it is intended to compute the average order cost per market segment.
This means we should join all customers to their corresponding order rows. But for this to
happen we would need a join condition that joins the CUSTOMER table and the ORDERS table
on the customer key. Instead the query performs a Cartesian join, joining each customer
row to each orders row. This is a very work-intensive query that results in the behavior we
have seen. The joined result set becomes huge, and it even contains rows that the query
was never intended to return.
__ 4. Fix this by adding a join condition to the query to make sure that customers are only joined
to their orders. This additional join condition is O.O_CUSTKEY=C.C_CUSTKEY. Execute
the following EXPLAIN command for the modified query.
explain verbose select c.c_mktsegment, avg(o.o_totalprice) from orders as o,
customer as c where extract(year from o.o_orderdate) = 1996 and
o.o_orderpriority = '1-URGENT' and o.o_custkey = c.c_custkey group by
c.c_mktsegment;
You should see something similar to the following results. Scroll up to your query to see the
scan and join nodes.
Note
If you do not get the same query plan, please follow the explanation and bypass the SQL
execution. The cardinality (adjusted) in the plan is the number of expected unique values for the
column and is not the same as the Estimated Rows for the result set. The optimizer uses it to
determine such things as the number of duplicate rows and potential sorts.
As you can see there have been some changes to the execution plan.
• In Node 1:
- The ORDERS table projections are now O_TOTALPRICE and O_CUSTKEY.
- The O_TOTALPRICE and O_CUSTKEY are broadcast and later used in a hash join.
• In Node 2:
- The CUSTOMER table is scanned with projections of C_MKTSEGMENT and
C_CUSTKEY.
• In Node 3:
- A hash join is performed on the result sets from Node 1 and Node 2 using the customer
key.
- The estimated number of rows is now 450,000, which is the same as the number of
customers; since we have a 1:n relationship between customers and orders this is
as we would expect.
- The estimated cost of Node 3 has come down significantly to 57.9 to 80.4.
• In Nodes 4, 5 and 6, there has been a significant reduction in the cost as well.
__ 5. Make sure that the query performance has improved. Switch on the display of elapsed
query time with the following command: \time
__ 6. Execute our modified query:
select c.c_mktsegment, avg(o.o_totalprice) from orders as o, customer as c
where extract(year from o.o_orderdate) = 1996 and o.o_orderpriority =
'1-URGENT' and o.o_custkey = c.c_custkey group by c.c_mktsegment;
__ 7. The results should look similar to this:
Before we made our changes and generated statistics the query took much longer than it
does now. In this relatively simple case we might have been able to pinpoint the problem
through analyzing the SQL on its own. But this can be almost impossible for complicated
multi-join queries that are often used in warehousing. Reporting and BI tools tend to create
very complicated portable SQL as well. In these cases EXPLAIN can be a valuable tool to
pinpoint the problem.
__ b. At the Edit connection name window, type any name for your new connection to Aginity,
and then click OK.
__ c. On the Connect to PureData system for Analytics window, enter the following
information:
i. In the Server field, type pok-puredata.clp.local
ii. In the User ID and Password fields, enter your <student_id> and password.
iii. For the Database field, select your <student_id>_db from the drop-down list.
iv. Ensure the Netezza ODBC driver and NOT the Netezza OLEDB driver is
selected.
v. Click Save.
__ 2. To connect to your <student_id>_db, click OK.
__ 6. For each query, copy the SQL and paste each into a separate Aginity Workbench Query
tab.
Important
Before you execute a query, ensure that your cursor is positioned at the beginning of the statement.
Also, ensure that the Database drop-down box shows the correct database e.g.
<student_id>_db. If it does not show the correct database, all SQL is executed on the wrong
database and errors occur.
__ c. Click Execute. Keep a note of the time and keep the query tabs open.
Important
Note which tables each query uses because you need this information for the following exercise.
__ 8. In Aginity, for each query add explain before the select clause, and execute the queries
so that you can analyze the Query Plans.
__ 9. In PuTTY, use the CREATE TABLE AS (CTAS) command to redistribute the tables based on
good criteria.
__ a. Connect to your database.
__ b. Choose distribution keys based on the query/joins to minimize the data movement.
__ c. After creating the CTAS tables, drop the base tables and rename the CTAS tables to the
original base table names.
__ 10. Remove explain verbose and execute the queries. What do you see now for the query
times?
Query 1 Time: _______________________
Query 2 Time: _______________________
Query 3 Time: _______________________
End of exercise
Exercise 9. Zone maps and clustered base tables
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• The LINEITEM table in your database has been loaded with the data
supplied.
• You have completed exercise 8
Exercise instructions
We have received a set of new customer queries on the ORDERS table that not only restrict the
table by order date but also access only orders in a given price range. These queries make up a
significant part of the system workload, so we look into ways to increase performance for them.
The following query is a template for the queries in question. It returns the aggregated total price of
all orders by order priority for a given year (in this case 1996) and price range (in this case between
150000 and 180000).
select o_orderpriority, sum(o_totalprice) from orders where extract(year from
o_orderdate) = 1996 and o_totalprice > 150000 and o_totalprice <= 180000 group
by o_orderpriority;
In this example, there are significantly restrictive WHERE conditions on two columns,
O_ORDERDATE and O_TOTALPRICE, which can help us to increase performance. The ORDERS
table has around 220,000 rows with an order date in 1996 and 160,000 rows in the given price
range. But it only has 20,000 rows that satisfy both conditions. Materialized views provide their
main performance improvements on one column. Also, INSERTs to the ORDERS table are frequent
and time critical; therefore, you would not want to use materialized views. This exercise investigates
the use of clustered base tables.
Clustered base tables are IBM PureData System for Analytics tables that are created with an
ORGANIZE ON keyword. They use a special space filling algorithm to organize a table by up to 4
columns. Zone maps for a clustered base table provide approximately the same performance
increases for all organization columns. This is useful if your query restricts a table on more than one
column or if your workload consists of multiple queries hitting the same table using different
columns in WHERE conditions. In contrast to materialized views no additional disk space is
needed, since the base table itself is reordered.
shows you the first data slice, this can be changed to a different data slice using a switch in
the command.
__ 7. To see the zone map values of the O_ORDERDATE and O_TOTALPRICE columns, execute the
following command:
nz_zonemap <student_id>_db orders_cbt o_orderdate o_totalprice
Note
If the “Sort” column shows “true”, then the (min) value in this extent is greater than or equal to the
(max) value of the previous extent. In other words, the data is sorted on disk in ascending order,
which indicates optimal zone map usage and results in optimal performance. Because a
descending sort order should perform just as well as an ascending sort order (when it comes to
zone maps), the column also shows “true” if the (max) value in this extent is less than or equal to the
(min) value of the previous extent.
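The rule in this note can be written as a short Python sketch. It is only an illustration of the comparison described above, not the nz_zonemap implementation; the function name sort_flags, the sample (min, max) values, and the choice to report “true” for the first extent are assumptions.

```python
from typing import List, Tuple


def sort_flags(extents: List[Tuple[int, int]]) -> List[bool]:
    """Return one flag per extent, given (min, max) zone map values in extent order."""
    flags = []
    for i, (lo, hi) in enumerate(extents):
        if i == 0:
            # Assumption: the first extent has no predecessor, so report True.
            flags.append(True)
            continue
        prev_lo, prev_hi = extents[i - 1]
        ascending = lo >= prev_hi    # min of this extent >= max of previous extent
        descending = hi <= prev_lo   # max of this extent <= min of previous extent
        flags.append(ascending or descending)
    return flags


# Invented zone map values for four extents.
print(sort_flags([(1, 10), (10, 25), (5, 30), (40, 50)]))
# [True, True, False, True]
```

The third extent overlaps its predecessor in both directions, so it is the only one flagged as unsorted.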
The groom command is covered in detail in a later presentation and exercise, but you use it in
the next section to reorganize the table.
You can see that both columns have some form of order now. The query is restricting rows
in two ranges:
Condition 1: O_ORDERDATE = 1996
AND
Condition 2: 150000 < O_TOTALPRICE <= 180000
There are now 2 extents that have rows from 1996 in them and 1 extent that contains rows
in the price range from 150000 to 180000; this is the only extent that satisfies both conditions
and needs to be scanned during query execution.
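The way the two conditions combine can be sketched in Python. This is a conceptual model only: the Extent class, the helper names, and all min/max values are invented for illustration and do not come from the exercise output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    """Zone map entry for one extent (all values are invented for illustration)."""
    year_min: int
    year_max: int
    price_min: int
    price_max: int


def may_match_year(e: Extent, year: int) -> bool:
    # Condition 1: the order year falls inside the extent's range.
    return e.year_min <= year <= e.year_max


def may_match_price(e: Extent, low: int, high: int) -> bool:
    # Condition 2: low < o_totalprice <= high overlaps the extent's range.
    return e.price_max > low and e.price_min <= high


extents: List[Extent] = [
    Extent(1995, 1996, 100000, 150000),
    Extent(1996, 1996, 150001, 180000),
    Extent(1997, 1998, 160000, 200000),
    Extent(1992, 1994, 0, 90000),
]

by_year = [i for i, e in enumerate(extents) if may_match_year(e, 1996)]
by_price = [i for i, e in enumerate(extents) if may_match_price(e, 150000, 180000)]
both = [i for i in by_year if i in by_price]
print(by_year, by_price, both)  # [0, 1] [1, 2] [1]
```

Only an extent that passes both zone map checks has to be scanned, which is why organizing on both columns reduces the work more than either column alone.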
The new NPS7 architecture has much finer-grained extents. In this case there are 48 in total
and only a limited number need to be read:
• 20 extents that might have O_ORDERDATE in 1996
• 14 extents that might have O_TOTALPRICE between 150000 and 180000
• 2 extents for which both conditions apply
This means that by using CBTs in the new NPS7 architecture we can restrict the amount of data
that needs to be queried by a factor of 16. This is 3-5 times less than what would need to be
read if the table were ordered on only a single column.
In this exercise, you created a clustered base table and used the groom command to organize it.
Throughout the exercise, you used the nz_zonemap tool to see zone maps and get a better idea of
how data is stored in the IBM PureData System for Analytics.
__ 2. Query both tables using restrictions on the l_orderkey column to see the performance
difference between perfectly-ordered data and clustered data:
select l_partkey, l_shipinstruct from ag_ord_li where l_orderkey=100007;
__ 3. Query using restrictions on the l_partkey column to see the performance difference between
nearly-ordered data and clustered data:
select l_orderkey, l_partkey, l_shipinstruct from ag_ord_li where
l_partkey=173466;
End of exercise
Exercise 10. Materialized views
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• You redistributed the tables by making good choices and saw
performance improvements.
Exercise instructions
A materialized view is a view of a database table that projects a subset of the base table’s columns
into a physical manifestation. It can be sorted on a specific set of the projected columns. When a
materialized view is created, the sorted projection of the base table’s data is stored in a physical
table on disk. Materialized views reduce the width of data being scanned in a base table. They are
beneficial for wide tables that contain many columns (for example, 50-500 columns) where typical queries
only reference a small subset of the columns. Materialized views also provide fast lookup operations for
a single record or a few records. The thin materialized view is automatically substituted by the optimizer
for the base table, allowing faster response, particularly for shorter tactical queries that examine
only a small segment of the overall database table.
In the last few exercises, you recreated a customer database in our IBM PureData System for
Analytics, and you picked distribution keys, loaded the data and made some first performance
investigations. In this exercise, you look deeper into some customer queries and try to enhance
their performance by tuning the system. You see acceptable levels of performance from your IBM
PureData System for Analytics; however, you can improve performance still further by using
materialized views for enhancements in SQL selection criteria.
Note
If you are still in the SQL directory from the previous exercise, type cd .. to return to your home
directory.
__ 3. Type in nzsql.
__ 4. Make sure table statistics have been generated so that more accurate estimated query
costs can be reported by explain commands we look at. Generate statistics for the
ORDERS and LINEITEM tables using the following commands:
generate statistics on orders;
generate statistics on lineitem;
__ 5. Execute the following query which computes the total quantity of items shipped and their
average tax rate for a given month, which in this case is the fourth month or April:
\time
Note
Notice the extract(month from l_shipdate) expression. The extract function can be used to retrieve
parts of a date or time column like year, month or day.
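As a rough analogy outside the database, the same filter and aggregation can be sketched in Python, where the month attribute of a date plays the role of extract(month from ...). The sample rows and variable names are invented for illustration; this is not how the system evaluates the query.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (l_shipdate, l_quantity, l_tax) rows.
rows = [
    (date(1995, 4, 3), 10, 0.02),
    (date(1995, 4, 3), 30, 0.04),
    (date(1996, 4, 20), 5, 0.06),
    (date(1995, 6, 15), 7, 0.01),
]

# Keep only rows shipped in April (any year), grouped by ship date.
groups = defaultdict(list)
for shipdate, quantity, tax in rows:
    if shipdate.month == 4:  # analogous to extract(month from l_shipdate) = 4
        groups[shipdate].append((quantity, tax))

# Per ship date: total quantity and average tax.
summary = {
    shipdate: (sum(q for q, _ in items), sum(t for _, t in items) / len(items))
    for shipdate, items in groups.items()
}
for shipdate, (total_quantity, avg_tax) in sorted(summary.items()):
    print(shipdate, total_quantity, round(avg_tax, 4))
```

Note that the month filter matches April of every year, so the June row is dropped while both April dates are kept.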
__ 6. To get the projected cost from the optimizer, use the following explain verbose command:
explain verbose select l_shipdate, sum(l_quantity), avg(l_tax) from lineitem
where extract(month from l_shipdate) = 4 group by 1;
You see a long output on the screen. Scroll up until you reach the command you just
executed:
__ 7. Since this query is run frequently we want to enhance the scanning performance. And since
it only uses 3 of the 16 LINEITEM columns we have decided to create a materialized view
covering these three columns. This should significantly increase scan speed since only a
small subset of the data needs to be scanned. To create the materialized view
THINLINEITEM execute the following command. This command can take several minutes
since we effectively create a copy of the three columns of the table:
create materialized view thinlineitem as select l_shipdate, l_quantity, l_tax
from lineitem order by l_shipdate;
__ 8. Repeat the explain call from step 6. The results should now look like the following:
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview "_MTHINLINEITEM" {(LINEITEM.L_SHIPDATE)}]
-- Estimated Rows = 359932, Width = 16, Cost = 0.0 .. 128.9, Conf = 80.0
Restrictions:
(DATE_PART('MONTH'::"VARCHAR", LINEITEM.L_SHIPDATE) = 4)
Projections:
1:LINEITEM.L_QUANTITY 2:LINEITEM.L_TAX
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 32, Cost = 133.4 .. 133.4, Conf = 0.0
Projections:
1:SUM(LINEITEM.L_QUANTITY)
2:(SUM(LINEITEM.L_TAX) / "NUMERIC"(COUNT(LINEITEM.L_TAX)))
[SPU Return]
[HOST Merge Aggs]
[Host Return]
QUERY PLANTEXT:
Note
Notice that the IBM PureData System for Analytics optimizer has automatically replaced the
LINEITEM table with the view THINLINEITEM. We didn’t need to make any changes to the query.
Also notice that the expected cost has been reduced to 174 which is less than 10% of the original.
It is possible you will not get the same result as above due to the different workloads on the system
and the different SPU activities. If this is the case examine the output above and discuss with your
instructor.
As you have seen, in cases where you have wide database tables and queries that touch only a
subset of their columns, a materialized view of the hot columns can significantly increase performance for
these queries, without any changes to the executed queries.
Note
You can see that on 15 June 1995 only a fraction of the total shipments was returned.
Notice the use of the CASE expression to change the L_RETURNFLAG
column into a Boolean 0-1 value, which is easily countable.
__ 3. Look at the underlying data distribution of the LINEITEM table and its zone map values. To
do this exit the NZSQL console by executing the \q command.
__ 4. The IBM PureData System for Analytics support tools are installed in the /nz directory. One
of these tools is the nz_zonemap tool that returns detailed information about the zone map
values associated with a given database table. To look at the zone mappable columns of
the LINEITEM table, execute the following command:
nz_zonemap <student_id>_db lineitem
This command returns an overview of the zone mappable columns of the LINEITEM table in
the <student_id>_db database. Seven of the sixteen columns have zone maps created for
them. Zone mappable columns include integer and date data types. We see that the
L_SHIPDATE column we have in the WHERE condition of the customer query is zone
mappable.
Note
The support tools are available as an installation package in /nz on your IBM PureData System for
Analytics or you can obtain them from IBM support.
__ 5. To look at the zone map values for the L_SHIPDATE column, execute the following
command. This command returns a list of all extents that make up the LINEITEM table and
the minimum and maximum values of the data in the L_SHIPDATE column for each extent.
nz_zonemap <student_id>_db lineitem l_shipdate
This is the actual output from the LINEITEM table. There are 21 rows, which means the table
consists of 21 extents. We can also see the minimum and maximum values for the
L_SHIPDATE column in each extent. These values are stored in the zone map and
automatically updated when rows are inserted, updated or deleted. If a query has a WHERE
condition on the L_SHIPDATE column that falls outside of the data range of an extent, the
whole extent can be discarded by IBM PureData System for Analytics without scanning it.
In this case the ship dates are spread across all 21 extents. This means that our query,
which has a WHERE condition on 15 June 1995, does not profit from the zone maps
and requires a full table scan.
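The effect can be reproduced with a small Python sketch that builds min/max zone maps over fixed-size groups of rows. The helper names, the extent size of four rows, and the sample dates are assumptions chosen for illustration.

```python
from datetime import date
from typing import List, Tuple


def build_zone_map(values: List[date], rows_per_extent: int) -> List[Tuple[date, date]]:
    """Record (min, max) for each consecutive group of rows, one group per extent."""
    zone_map = []
    for start in range(0, len(values), rows_per_extent):
        chunk = values[start:start + rows_per_extent]
        zone_map.append((min(chunk), max(chunk)))
    return zone_map


def extents_to_scan(zone_map: List[Tuple[date, date]], target: date) -> List[int]:
    """An extent can be skipped when the target lies outside its (min, max) range."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= target <= hi]


# Hypothetical ship dates in arrival order (not sorted).
ship_dates = [
    date(1992, 1, 5), date(1998, 3, 1), date(1995, 6, 15), date(1994, 7, 9),
    date(1993, 2, 2), date(1997, 8, 20), date(1996, 1, 1), date(1992, 5, 5),
    date(1998, 11, 30), date(1993, 9, 9), date(1995, 6, 15), date(1996, 4, 4),
]
target = date(1995, 6, 15)

unsorted_scan = extents_to_scan(build_zone_map(ship_dates, 4), target)
sorted_scan = extents_to_scan(build_zone_map(sorted(ship_dates), 4), target)
print(unsorted_scan)  # [0, 1, 2]  every extent's range contains the date
print(sorted_scan)    # [1]        only one extent has to be read
```

With unsorted data each extent covers almost the whole date range, so no extent can be excluded; once the same values are stored in order, the ranges no longer overlap and a single extent remains.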
__ 6. Return to the NZSQL command interface.
__ 7. To create a materialized view that is ordered on the L_SHIPDATE column, execute the
following command:
create materialized view shiplineitem as select l_shipdate from lineitem
order by l_shipdate;
Note
Note that our customer query has a WHERE condition on the L_SHIPDATE column but aggregates
the L_RETURNFLAG column. Nevertheless we didn’t add the L_RETURNFLAG column to the
materialized view. We could have done it to enhance the performance of our specific query even
more. But in this case we assume that there are lots of customer queries which are restricted on the
ship date and access different columns of the LINEITEM table. A materialized view retains the
information about the location of a parent row in the base table and can be used for lookups even if
columns of the parent table are accessed in the SELECT clause.
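The lookup idea in this note can be modeled in a few lines of Python. How the system actually stores the parent row location is not described here, so the list positions, the thin_view structure, and the lookup function are purely hypothetical stand-ins.

```python
import bisect

# Hypothetical base table rows: (l_shipdate, l_returnflag), stored in arrival order.
base_table = [
    ("1995-06-17", "N"),
    ("1995-06-15", "R"),
    ("1995-06-16", "N"),
    ("1995-06-15", "N"),
    ("1995-06-14", "A"),
]

# The "view" keeps only the ship date plus the position of the parent row,
# sorted on the ship date.
thin_view = sorted((shipdate, pos) for pos, (shipdate, _) in enumerate(base_table))
keys = [shipdate for shipdate, _ in thin_view]


def lookup(shipdate: str):
    """Find matching entries in the sorted view, then fetch the parent rows."""
    lo = bisect.bisect_left(keys, shipdate)
    hi = bisect.bisect_right(keys, shipdate)
    return [base_table[pos] for _, pos in thin_view[lo:hi]]


rows = lookup("1995-06-15")
returned = sum(1 if flag != "N" else 0 for _, flag in rows)
print(returned, len(rows))  # 1 2
```

The return flag is read from the parent rows even though it is not part of the view, which mirrors why L_RETURNFLAG did not have to be added to SHIPLINEITEM.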
You can specify more than one order column. In that case, rows are ordered by the first
column; rows with equal values in the first column are then ordered by the next column,
and so on. In general, only the first order column has a significant
impact on performance.
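A short Python sketch shows this ordering and hints at why the second column helps less. The sample rows are invented, and the second column is a hypothetical quantity value.

```python
# Hypothetical rows: (ship date, quantity).
rows = [
    ("1995-06-15", 30),
    ("1995-06-14", 10),
    ("1995-06-15", 5),
    ("1995-06-14", 40),
]

# Order by the first column, then by the second column within equal first values.
ordered = sorted(rows, key=lambda r: (r[0], r[1]))
second_column = [quantity for _, quantity in ordered]

print(ordered)
# [('1995-06-14', 10), ('1995-06-14', 40), ('1995-06-15', 5), ('1995-06-15', 30)]
print(second_column)  # [10, 40, 5, 30]
```

The first column is fully sorted, but the second column restarts for every new value of the first one, so its overall sequence is not sorted and its value ranges overlap across the table.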
__ 8. Look at the zone map of the newly created view. Leave the NZSQL console again with the
\q command.
__ 9. Display the zone map values of the materialized view SHIPLINEITEM with the following
command:
nz_zonemap <student_id>_db shiplineitem l_shipdate
The results should look something similar to the following:
Notice the materialized view is significantly smaller than the base table. In this case the
number of rows returned, one per extent, is reduced from 21 for the base table to 6. Also notice that the data
values in the extents are ordered on the L_SHIPDATE column. This means that for our query,
which accesses data from 15 June 1995, only extent 3 needs to be accessed at
all, since only this extent has a data range that contains this date value.
__ 10. Return to the NZSQL command interface.
__ 11. Use the explain command again to verify that our materialized view is used by the optimizer:
explain verbose select sum(case when l_returnflag <> 'N' then 1 else 0 end)
as ret, count(*) as total from lineitem where l_shipdate='1995-06-15';
You see a long text output; scroll up until you find the command you just executed. Your
result should look like the following:
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM"
{(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 24, Cost = 62.2 .. 62.2, Conf = 0.0
Projections:
1:SUM(CASE WHEN (LINEITEM.L_RETURNFLAG <> 'N'::BPCHAR) THEN 1 ELSE 0 END)
2:COUNT(*)
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...
Note
Notice that the Optimizer has automatically changed the table scan to a scan of the view
SHIPLINEITEM we just created. This is possible even though the projection is taking place on
column L_RETURNFLAG of the base table.
__ 12. In some cases you might want to disable or suspend an associated materialized view, for
example for troubleshooting or administrative tasks on the base table. For these cases use the following
command to suspend the view:
alter view shiplineitem materialize suspend;
__ 13. To make sure that the view is not used anymore during query execution, execute the
EXPLAIN command for our query again. With the view suspended we can see that the
optimizer again scans the original table LINEITEM:
explain verbose select sum(case when l_returnflag <> 'N' then 1 else 0 end)
as ret, count(*) as total from lineitem where l_shipdate='1995-06-15';
You see a long text output; scroll up until you find the command you just executed. Your
result should look like the following:
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "LINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 60012, Width = 1, Cost = 0.0 .. 2417.5, Conf = 80.0
Restrictions:
...
Note
Note that we have only suspended the view, not dropped it.
__ 14. Reactivate the view with the following refresh command. This command can also be used to
reorder materialized views in case the base table has been changed. While INSERTs,
UPDATEs and DELETEs into the base table are automatically reflected in associated
materialized views, the view is not reordered for every change. Therefore it is advisable to
refresh them periodically especially after major changes to the base table:
alter view shiplineitem materialize refresh;
__ 15. To check that the optimizer again uses the materialized view for its first scan
operation, execute the following command:
explain verbose select sum(case when l_returnflag <> 'N' then 1 else 0 end)
as ret, count(*) as total from lineitem where l_shipdate='1995-06-15';
You see a long text output; scroll up until you find the command you just executed. Your
result should look like the following:
and might require regular maintenance. Therefore materialized views should be used sparingly. In
the next chapter, you learn an alternative approach to speed up scan speeds on a database table.
End of exercise
Exercise 11. Transactions and Truncate table
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All the eight tables in your database are loaded with the data supplied.
Exercise instructions
In this section, you learn how transactions can leave logically deleted rows in a table, which later
need to be removed with the groom command as an administrative task. You also need to truncate a table,
in a transaction, that is in a different schema from the session’s current schema. However, another
transaction is currently inserting from the same table.
__ 4. Insert a new row into the REGION table for the region Australia with the following SQL
command:
insert into region values (5, 'as', 'AUSTRALIA');
__ 5. Do a select on the REGION table, but this time you query the hidden fields CREATEXID,
DELETEXID and ROWID:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;
As you can see, there are now six rows in the REGION table. The new row for Australia has
the id of the last transaction as CREATEXID and “0” as DELETEXID since it has not yet
been deleted. There could be transactions with a lower transaction id than yours running
concurrently; they will not see this new row. Note also that each row has a unique rowid.
Rowids do not need to be consecutive, but they are unique across all dataslices for one
table.
Note
In a real IBM PureData System for Analytics, changing system configuration parameters can be a
very dangerous thing that is normally not advisable without IBM PureData System for Analytics
service support.
__ 1. If you exited nzsql after 11.1 then re-enter the NZSQL console and connect to the
<student_id>_db database.
__ 2. To disable invisibility lists and show all records, run the following command:
set show_deleted_records=true;
__ 3. Update the row you inserted in the last section to the REGION table:
update region set r_comment='Australia' where r_regionkey=5;
__ 4. Do a select on the REGION table again:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;
You should see something similar to the following output:
Normally, you would now see 6 rows with the updated value; since the invisibility lists are
disabled, you now see 7 rows in the REGION table. The id of the transaction that updated the
“AUSTRALIA” row now appears in the DELETEXID column of the old row version. Transactions with a higher
transaction id do not see a row with a DELETEXID, which indicates that it was logically
deleted before the transaction ran. You can also see a newly inserted row with the new
comment value ‘Australia’; it has the same rowid as the deleted row, and its
CREATEXID is the id of the transaction that inserted it.
__ 5. Clean up the table again by deleting the Australia row:
delete from region where r_regionkey=5;
__ 6. Do a select on the REGION table again:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;
You should see something similar to the following output:
You can see the updated row was logically deleted as well; it now has a DELETEXID field
with the value of the new transaction. Normally, the logically deleted rows are filtered out
automatically by the FPGA. If you do a select, the FPGA removes all rows that have a:
• CREATEXID that is bigger than the current transaction id.
• CREATEXID of an uncommitted transaction.
• DELETEXID that is smaller than the current transaction id, but only if the transaction of
the DELETEXID field is committed.
• DELETEXID of 1, which means that the insert has been aborted.
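The four filter rules above can be expressed as a small Python predicate. This is a simplified model of the listed rules only: the function name, the set of committed transaction ids, and the sample ids are assumptions, and the sketch ignores how a transaction sees its own uncommitted changes.

```python
from typing import Set


def is_visible(createxid: int, deletexid: int, current_xid: int, committed: Set[int]) -> bool:
    """Apply the four filter rules from the list above to one row version."""
    if deletexid == 1:
        return False  # the insert was aborted
    if createxid > current_xid:
        return False  # created by a later transaction
    if createxid not in committed:
        return False  # created by an uncommitted transaction
    if deletexid != 0 and deletexid < current_xid and deletexid in committed:
        return False  # logically deleted by an earlier, committed transaction
    return True


committed = {100, 105, 110}  # hypothetical committed transaction ids
current_xid = 120

versions = [
    (100, 0),    # live row
    (100, 110),  # deleted by committed transaction 110
    (105, 115),  # deleter 115 has not committed
    (118, 0),    # creator 118 has not committed
    (125, 0),    # created after the current transaction
    (112, 1),    # aborted insert
]
print([is_visible(c, d, current_xid, committed) for c, d in versions])
# [True, False, True, False, False, False]
```

A DELETEXID of 0 means the row has not been deleted, so only the CREATEXID rules apply to it.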
Note that you have the same results as in the last chapter: the original row for the MIDDLE
EAST region was logically deleted by updating its DELETEXID field, and a new row with the
updated comment and new rowid has been added. Note that its CREATEXID is the same as
the DELETEXID of the old row, since they were updated by the same transaction.
__ 4. Rollback the transaction:
rollback;
We can see that the transaction has been rolled back. The DELETEXID of the old version of
the row has been reset to 0, which means that it is a valid row that can be seen by other
transactions, and the DELETEXID of the new row has been set to 1 which marks it as
aborted.
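The bookkeeping described here can be traced with a minimal Python model of row versions. The dictionary layout, the key field, the function names, and the transaction ids are all hypothetical; the sketch only mirrors the CREATEXID and DELETEXID changes stated in the text.

```python
from typing import Dict, List


def update_row(rows: List[Dict], key: int, new_comment: str, xid: int) -> None:
    """An update marks the current version as deleted and appends a new version."""
    for row in list(rows):
        if row["key"] == key and row["deletexid"] == 0:
            row["deletexid"] = xid
            rows.append({"key": key, "createxid": xid, "deletexid": 0,
                         "comment": new_comment})
            return


def rollback(rows: List[Dict], xid: int) -> None:
    """Undo the effects of transaction xid on the hidden fields."""
    for row in rows:
        if row["createxid"] == xid:
            row["deletexid"] = 1  # row created by the aborted transaction
        elif row["deletexid"] == xid:
            row["deletexid"] = 0  # old version becomes a valid row again


# Hypothetical single row, created by transaction 100.
rows = [{"key": 4, "createxid": 100, "deletexid": 0, "comment": "old comment"}]

update_row(rows, 4, "new comment", xid=200)
after_update = [(r["createxid"], r["deletexid"]) for r in rows]

rollback(rows, xid=200)
after_rollback = [(r["createxid"], r["deletexid"]) for r in rows]

print(after_update)    # [(100, 200), (200, 0)]
print(after_rollback)  # [(100, 0), (200, 1)]
```

After the rollback the aborted version is still physically present, which is why a later groom has something to purge.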
You can see that the groom command purged 3 rows, exactly the number of aborted and
logically deleted rows we have generated in the previous chapter.
__ 3. Do a select on the REGION table again:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;
You can see that the groom command has removed all logically deleted rows from the table.
Remember that we still have the parameter switched on that allows us to see any logically
deleted rows. Especially in tables that are heavily changed with a lot of updates and deletes,
running the groom command frees up hard drive space and increases performance.
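In the same simplified model, grooming amounts to dropping every row version whose DELETEXID is not 0. The function below is a hypothetical stand-in, not the groom implementation, and it ignores any row versions that transactions that are still active might need.

```python
from typing import Dict, List, Tuple


def groom(rows: List[Dict]) -> Tuple[List[Dict], int]:
    """Keep only live versions (deletexid == 0) and report how many were purged."""
    kept = [row for row in rows if row["deletexid"] == 0]
    return kept, len(rows) - len(kept)


# Hypothetical table: five live rows, two logically deleted versions
# and one aborted insert (deletexid == 1).
table = [{"key": k, "createxid": 100, "deletexid": 0} for k in range(5)]
table += [
    {"key": 5, "createxid": 200, "deletexid": 210},
    {"key": 5, "createxid": 210, "deletexid": 220},
    {"key": 4, "createxid": 230, "deletexid": 1},
]

groomed, purged = groom(table)
print(len(groomed), purged)  # 5 3
```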
__ 6. In session 1, execute the following command and make a note of the elapsed time to do the
count:
select count(*) from <student_id>_2.lineitem;
__ 7. In session 2, exit NZSQL, and navigate to the SQL directory:
cd SQL
__ 8. Open the lineitem.sql file in an editor:
vi lineitem.sql
__ 9. Change the following:
• For set schema student2, change student2 to your new schema,
<student_id>_2
• In the three lines containing student.lineitem, change student to <student_id>
__ 12. While lineitem.sh is running in session 1, execute the following SQL select statement in
session 2 several times, both before and after the TRUNCATE has executed in
session 1:
select count(*) from <student_id>_2.lineitem;
__ 13. Once both sessions have completed execution, review the different times for the select
statement in session 2. The last execution should have a longer elapsed time due to the
wait time caused by the concurrent TRUNCATE command.
__ 11. In session 1, after completion of ./lineitem2.sh you should see the following. Note that the
number of rows might vary but that is not relevant.
__ 12. In session 2, after completion of ./truncate.sh you should see the following. Note that the
number of rows might vary but that is not relevant.
End of exercise
Exercise 12. GROOM
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All the eight tables in your database are loaded with the data supplied.
Note
The groom command is also run using the new nzreclaim script (nzreclaim is a wrapper around
groom with many command line options). You might be familiar with nzreclaim in earlier releases.
Exercise instructions
Your database is getting larger and you need to save on some space. You realize that you make
updates and deletes in this database and the IBM PureData System for Analytics does not really
remove the data from the database; it simply hides data by flagging it. Essentially, the system is
using up much-needed space; therefore, you need to use groom to reclaim the unused space.
__ 4. Exit NZSQL.
__ 5. Check the physical table size for ORDERS and see if the size decreased using the same
command as in step 1:
nz_db_size <student_id>_db
The output should be the same as above showing that the ORDERS table did not change in
size and is still 240MB. This is because the deleted rows were logically deleted but are still
on the disk. The rows are still accessible to transactions that started before the DELETE
statement which you just executed. In practical terms, this means that the transactions that
are still active have a lower transaction id than the deleted rows.
__ 6. Start NZSQL.
__ 7. When you run the groom table command, it removes outdated and deleted records from
tables. Use the groom table command, specifying table ORDERS, to physically delete
the rows you just logically deleted:
groom table orders;
Your results might vary from the following:
You can see that 2192233 rows were removed from disk. Notice that this is the same
number of rows that you previously deleted.
__ 8. Exit NZSQL.
__ 9. Use the nz_db_size command to check to see if the ORDERS table size on disk has
shrunk. Execute the following command:
nz_db_size <student_id>_db
Note the reduced size of the ORDERS table. You can see that GROOM did purge the
deleted rows from disk.
All rows are affected by the update resulting in a doubled number of physical rows in the
table. This is because the update operation leaves a copy of the rows before the update
occurred in case a transaction is still operating on the rows. New rows are created and the
results of the UPDATE are put in these rows. The old rows that are left on disk are marked
as logically deleted.
__ 3. To measure the performance of the test query, configure the NZSQL console to show the
elapsed execution time using the \time command.
__ 4. Run the test query and note the performance:
select count(*) from orders;
Your results might vary from the following:
__ 5. Rerun the query a few more times to estimate a consistent query time on the system, and
make note of the times.
__ 6. Run the groom table command on the ORDERS table:
groom table orders;
__ 7. How much disk space did the groom save? (It is the number of extents times 3MB.)
__ 8. Run the test query again and you should see a difference in performance:
End of exercise
Exercise 13. Stored procedures
Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• The ORDERS table in your database has been loaded with the data
supplied.
Exercise instructions
So far, you have created a database, loaded data, and performed some optimization and
administrative tasks. In this exercise, you enhance the database with a couple of stored
procedures. As mentioned before, IBM PureData System for Analytics does not check referential
integrity or unique constraints. This is normally not critical since data loading in a data warehousing
environment is a controlled task. In this IBM PureData System for Analytics implementation, you
have the requirement to allow some non-administrative database users to add new customers to
the customer table. Since this happens rarely, there are no performance requirements, so you
implement this with a stored procedure that is accessible for these users and checks the input
values and referential constraints.
You also implement a business logic function as a stored procedure, based on this data model,
returning a result set.
In the following exercises, you create the stored procedure to insert data into the customer table.
The information that is added for a new customer is the customer key, name, phone number and
nation; the rest of the information is updated through other processes.
Note
This folder already contains the empty file for the stored procedure script you create. It also
contains the solution file; you can confirm that it is present using the ls command:
ls addCustomer_sol.sql
__ 5. Create a stored procedure that adds a new customer entry and sets these 4 fields:
C_CUSTKEY, C_NAME, C_NATIONKEY, and C_PHONE. All other fields are set with an
empty value or 0, since the fields are flagged as NOT NULL.
__ a. Exit the NZSQL console by executing the \q command.
__ b. To create a stored procedure, use the internal vi editor. Open the already existing empty
file addCustomer.sql with the following command:
vi addCustomer.sql
__ c. To edit the file, switch to INSERT mode by pressing “i”.
__ d. Create the interface of the stored procedure to test; use the 4 fields and return an
integer return code. To do this, enter the following text:
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer,
varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
END_PROC;
__ e. Exit the insert mode by pressing ESC and enter :wq! to save the file and quit vi.
This minimal stored procedure does not yet do anything since it has an empty body. You
create the signature with the input and output variables using the command CREATE OR
REPLACE so that you can later rerun the same command to update the stored procedure
with more code. Input parameters cannot be given names in the signature, so you list
only the data types for the key, name, nation and phone parameters and declare an
integer return code.
Note
You have to specify the procedure language even though NZPLSQL is the only available option in
IBM PureData System for Analytics.
You see the following result and the procedure ADDCUSTOMER with the specified
arguments:
__ 9. Execute the stored procedure with the following dummy input parameters:
call addcustomer(1,'test', 2, 'test');
You should see the following:
The result shows that there is a syntax error in the stored procedure. Every stored
procedure needs at least one BEGIN..END block that encapsulates the code to be
executed. Stored procedures are compiled when they are first executed, not when they are
created, so errors in the code surface only during execution.
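For comparison, a minimal sketch of the smallest body that compiles is shown here. The procedure name minimalProc is hypothetical and is not part of the lab; the point is only the BEGIN..END block between BEGIN_PROC and END_PROC.

```sql
-- Hypothetical procedure, used only to illustrate the required block structure.
CREATE OR REPLACE PROCEDURE minimalProc()
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
BEGIN
    -- The BEGIN..END block is the part that was missing in the empty body.
    RETURN 0;
END;
END_PROC;
```

Because compilation happens on first execution, a script like this should be followed by a test call such as call minimalProc(); before you rely on it.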
__ 10. Exit the NZSQL console.
__ 11. Using vi, open the addCustomer.sql file:
vi addCustomer.sql
__ 12. Switch to insert mode by pressing “i”.
__ 13. To create a simple stored procedure that inserts the new entry into the customer table, you
need some variables. Add variables that alias the input variables $1, $2, $3, and $4 after the
BEGIN_PROC statement:
DECLARE
C_KEY ALIAS FOR $1;
C_NAME ALIAS FOR $2;
N_KEY ALIAS FOR $3;
PHONE ALIAS FOR $4;
Information
Each BEGIN..END block in the stored procedure can have its own DECLARE section. Variables
are valid only in the block they belong to. It is good practice to alias the input parameters to
readable variable names to keep the stored procedure code maintainable. Be careful not to use
variable names that are restricted by IBM PureData System for Analytics, for example NAME.
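The block scoping described above can be sketched as follows. The procedure name scopeDemo and its variables are hypothetical; the sketch assumes that NZPLSQL accepts a nested DECLARE..BEGIN..END block inside the outer block, as the text states.

```sql
-- Hypothetical procedure illustrating block-level variable scope.
CREATE OR REPLACE PROCEDURE scopeDemo(integer)
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
    IN_VAL ALIAS FOR $1;      -- readable alias for the first input parameter
    OUTER_TOTAL INT4;         -- visible in the outer and the inner block
BEGIN
    OUTER_TOTAL := IN_VAL;
    DECLARE
        INNER_STEP INT4;      -- visible only inside this inner block
    BEGIN
        INNER_STEP := 10;
        OUTER_TOTAL := OUTER_TOTAL + INNER_STEP;
    END;
    -- INNER_STEP is out of scope here; OUTER_TOTAL is still valid.
    RETURN OUTER_TOTAL;
END;
END_PROC;
```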
Your complete stored procedure should now look like the following:
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer,
varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
C_KEY ALIAS for $1;
C_NAME ALIAS for $2;
N_KEY ALIAS for $3;
PHONE ALIAS for $4;
BEGIN
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
END_PROC;
__ 15. Save the file and exit vi.
__ 16. Log in to NZSQL.
__ 17. Execute the stored procedure script with the following command: \i addCustomer.sql.
__ 18. To test the stored procedure add a new customer John Smith with customer key 999999,
phone number 555-5555 and nation 2 (which is the key for the United States in our nation
table). You can also check to ensure the customer does not yet exist:
call addCustomer(999999,'John Smith', 2, '555-5555');
You should get the following results:
This result shows that the insert was successful. You have built your first IBM PureData
System for Analytics stored procedure.
Information
In this case, you use an IF condition to check if a customer record with the key already exists and
has been selected by the previous select condition. You could do an implicit check on the record or
any of its fields and see if it compares to the null value, but IBM PureData System for Analytics
provides a number of special variables that make this more convenient.
• FOUND specifies if the last SELECT INTO statement has returned any records.
• ROW_COUNT contains the number of found rows in the last SELECT INTO statement.
• LAST_OID is the object id of the last inserted row; this variable is not very useful unless
used for catalog tables.
• The RAISE EXCEPTION statement throws an error and aborts the stored procedure. To add
variable values to the message string, use the % symbol anywhere in the string. This is
similar to the approach used, for example, by the C printf function.
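A minimal sketch that combines a SELECT INTO, a FOUND check and % substitution is shown here. The procedure name checkCustomer is hypothetical; the FOUND REC form follows the usage in this exercise, and RAISE NOTICE is assumed to take the same % placeholders as RAISE EXCEPTION.

```sql
-- Hypothetical helper: reports whether a customer key exists.
CREATE OR REPLACE PROCEDURE checkCustomer(integer)
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
    C_KEY ALIAS FOR $1;
    REC RECORD;
BEGIN
    SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
    IF NOT FOUND REC THEN
        -- The % is replaced by the value of C_KEY; the procedure aborts here.
        RAISE EXCEPTION 'No customer with key %', C_KEY;
    END IF;
    -- A notice prints a message but lets the procedure continue.
    RAISE NOTICE 'Customer % exists', C_KEY;
    RETURN 0;
END;
END_PROC;
```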
__ 7. Check the foreign key relationship to NATION by adding the following lines after the lines
added in step 6:
select * into rec from nation where n_nationkey = n_key;
if not found rec then
raise exception 'No Nation with nation key %', n_key;
end if;
This is very similar to the last check, only this time you check if a record was NOT found.
Notice that you can reuse the REC record since it is not typed to a particular table.
Your stored procedure should now look like the following:
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer,
varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
C_KEY ALIAS for $1;
C_NAME ALIAS for $2;
N_KEY ALIAS for $3;
PHONE ALIAS for $4;
REC RECORD;
BEGIN
SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
IF FOUND REC THEN
RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
END IF;
SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
IF NOT FOUND REC THEN
RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
END IF;
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
END_PROC;
Note
If you have difficulty writing this stored procedure, remember a solution file exists in the same
directory called addCustomer_sol.sql.
__ 8. Save the stored procedure by pressing ESC, and then entering :wq! and pressing Enter.
__ 9. Log in to NZSQL.
nzsql
__ 10. In NZSQL, replace the stored procedure from the script by executing the following
command:
\i addCustomer.sql
__ 11. Test the check for duplicate customer ids by repeating the last CALL statement; remember,
you know that a customer record with the id 999999 already exists:
call addCustomer(999999,'John Smith', 2, '555-5555');
You should get the following result:
This is expected since the key value already exists and the first error condition is thrown.
__ 12. Check the foreign key integrity by executing the following command with a customer id that
does not yet exist and a nation key that does not exist in the NATION table, as well. You can
double check this using select statements if you want:
call addCustomer(999998,'James Brown', 999, '555-5555');
You should get the following result:
This is also as expected. The customer key did not yet exist, so the first IF condition does
not raise an error, but the check against the NATION table does.
__ 13. For the addCustomer call to succeed, execute the following command with a customer
id that does not yet exist and the nation key 2:
call addCustomer(999998,'James Brown', 2, '555-5555');
You should see a successful execution:
You have successfully created a stored procedure that can be used to insert values into the
CUSTOMER table and checks for unique and foreign key constraints.
Information
You should remember that IBM PureData System for Analytics is not optimized for lookup
queries, so this is a slow operation and should not be used for thousands of inserts. For
occasional management, however, it is a perfectly valid solution to the problem of missing
constraints in IBM PureData System for Analytics.
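For larger batches, the same two checks can be expressed as one set-based statement instead of one procedure call per row. This is only a sketch under stated assumptions: the staging table NEW_CUSTOMERS and its columns are hypothetical, and the column order of CUSTOMER is taken from the INSERT statement used in this exercise.

```sql
-- Assumption: NEW_CUSTOMERS is a hypothetical staging table with the columns
-- C_CUSTKEY, C_NAME, C_NATIONKEY and C_PHONE, loaded beforehand.
INSERT INTO CUSTOMER
SELECT s.C_CUSTKEY, s.C_NAME, '', s.C_NATIONKEY, s.C_PHONE, 0, '', ''
FROM NEW_CUSTOMERS s
WHERE NOT EXISTS (            -- uniqueness check: key not yet in CUSTOMER
          SELECT 1 FROM CUSTOMER c WHERE c.C_CUSTKEY = s.C_CUSTKEY)
  AND EXISTS (                -- foreign key check: nation must exist
          SELECT 1 FROM NATION n WHERE n.N_NATIONKEY = s.C_NATIONKEY);
```

Unlike the stored procedure, this statement silently skips rows that fail a check and does not detect duplicate keys within the staging table itself.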
__ 15. In preparation for the next exercise, you need to set the schema in the stored procedure by
editing the SQL using:
vi addCustomer.sql
__ 16. Move the cursor down to the BEGIN statement and press Shift+a. This takes you to the
end of the line and puts you into insert mode. Press the Enter key.
__ 17. Type in the following:
set schema <student_id>;
__ 18. Save the stored procedure by pressing ESC, and then entering :wq! and pressing Enter.
__ 19. Log in to NZSQL.
nzsql
__ 20. Replace the stored procedure from the script by executing the following command:
\i addCustomer.sql
__ 21. Quit nzsql by typing \q and pressing Enter.
Important
You can see that he has the same password as the other users in our labs. This is for simplification,
since it allows omitting the password during user switches; this would, of course, not be done in a
production environment.
__ 8. To grant <student_id>_admin the right to execute a specific stored procedure, you need
to specify the full name including all input parameters. The easiest way to get these in the
correct syntax is to first list them with the SHOW PROCEDURE command:
show procedure all;
You can see that the user has only the rights you gave him. He can select data from the
customer table and execute our stored procedure, but he is not allowed to change the
customer table directly or execute anything except for the stored procedure.
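Grants that would produce this set of rights can be sketched as follows. The argument list is an assumption based on the CREATE statement used earlier; copy it exactly as SHOW PROCEDURE prints it on your system.

```sql
-- Read access to the table that the procedure checks and fills.
grant select on customer to <student_id>_admin;

-- Execute permission needs the full signature, including all argument types.
grant execute on addcustomer(integer, varchar(25), integer, varchar(15))
    to <student_id>_admin;
```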
__ 11. To test this, switch to the <student_id>_admin user with the following command:
\c <student_id>_db <student_id>_admin <student_pwd>
__ 12. Add another customer to the customer table:
call <student_id>.addCustomer(999997,'Jake Jones', 2, '555-5554');
If the insert is successful, you have another row in your table; you can check this with a
SELECT query if you want.
__ 13. To make changes to the stored procedure, switch back to your <student_id>:
\c <student_id>_db <student_id> <student_pwd>
__ 14. Before you modify the stored procedure, look at it in detail:
show procedure addcustomer verbose;
You should see the following:
You can see the input and output arguments, procedure name, owner, if it is executed as
owner or caller and other details. Verbose also shows you the source code of the stored
procedure. You see that the description field is still empty, so you can add a comment to the
stored procedure. This is important to do if you have a large number of stored procedures in
your system.
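A description can be attached with a COMMENT statement. The following is a sketch; the comment text is only an example, and the argument list is assumed to match the signature shown by SHOW PROCEDURE.

```sql
-- The full signature identifies which procedure receives the description.
comment on procedure addcustomer(integer, varchar(25), integer, varchar(15))
    is 'Adds a customer after checking for a duplicate key and a valid nation key';
```

Running show procedure addcustomer verbose; again should then show the text in the description field.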
Information
For a convenient way to manage your stored procedures, use nzadmin since it provides most of the
managing functionality used in this lab in a graphical UI.
Information
Altering a stored procedure so it is executed as the caller instead of the owner means that whoever
executes the stored procedure needs to have access rights to all the objects that are touched in the
stored procedure; otherwise, it fails. This should be the default for stored procedures that
encapsulate business logic and do not do extensive data checking.
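Switching the execution mode can be sketched with ALTER PROCEDURE. The argument list is assumed to match the signature created earlier in this exercise.

```sql
-- Run with the privileges of the calling user: the caller now needs
-- rights on CUSTOMER and NATION, not only on the procedure.
alter procedure addcustomer(integer, varchar(25), integer, varchar(15))
    execute as caller;

-- Switch back so that users with only EXECUTE permission can insert again.
alter procedure addcustomer(integer, varchar(25), integer, varchar(15))
    execute as owner;
```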
In this section, you set up the permissions for the addCustomer stored procedure and the
<student_id>_admin user who is supposed to use it. You also added comments to the stored
procedure.
End of exercise