
IBM Training


Student Exercises

IBM PureData System for Analytics Programming and Usage


Course code DW585 ERC 3.0


Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in many
jurisdictions worldwide:
IBM PureData™ PureData™
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.
Netezza® and NPS® are trademarks or registered trademarks of IBM International Group B.V., an
IBM Company.
Other product and service names might be trademarks of IBM or other companies.

February 2016 edition


The information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without
any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer
responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While
each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will
result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

© Copyright International Business Machines Corporation 2013, 2016.


This document may not be reproduced in whole or in part without the prior written permission of IBM.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Exercises description

Purpose and scope


The following labs are intended to give you hands-on experience with the
IBM PureData System for Analytics.
You work through exercises that enable you to learn advanced features of
the IBM PureData System for Analytics.

Helpful lab references

Linux commands
cat filename Display a file
cd pathname Change directories
clear Clear the shell screen
ctrl+c Terminate the current process
find . -name filename Find a file, searching from the current directory
grep 'text' filename Find every occurrence of the text in the file
ls List a directory
ls -l Use a long listing
pwd Print working directory
wc -l filename Count the number of lines in a file
set -o vi Use vi-style command line editing interface
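
These commands can be combined with a pipe; for example:
grep 'text' filename | wc -l Count the lines in the file that contain the text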

Visual editor commands


vi filename Open file for edit
a Append after current position
cc Change entire line
dd Delete entire line
:e <file> Edit <file>
[Esc] Exit out of insert and replace mode
grep -i Grep ignoring case
h Move the cursor one space to the left
i Insert new text to the left of the cursor
j Move the cursor one space down

k Move the cursor one space up
l Move the cursor one space to the right
n, N Repeat the previous search in the same, opposite direction
p Paste text after the cursor position
:q! Quit without saving changes
:set ic Make searches case-insensitive
[Shift] G Go to last line of the file
[Shift] R Replace existing text (overwrite)
u Undo the previous editing command
w Move forward a word at a time
:w <file> Write buffer to (save as) <file>
:wq Write (save file) and quit (exit vi)
wc -l Word count from a command
. Repeat the previous editing command
/ ? Search forward, backward

IBM PureData System for Analytics quick lookup


/nz/kit/doc
netezza_database_user_guide.pdf
netezza_system_admin_guide.pdf
/nz/kit/bin
nzbackup Create a backup of a database
nzcontents Display the revision and build number for all executables
nzevent Define notification procedures for events
nzhostbackup Backup the host catalogs
nzhostrestore Restore the host catalogs
nzinventory Display information about hardware components
nzload Bulk data loader
nzpassword Create an encrypted password cache on disk
nzreclaim Reclaim disk space on the SPUs
nzrestore Restore a database backup
nzrev Display the current software revision
nzsession Manage sessions/connections to the database
nzsfi Send a reset message to the SFIs in the system

nzspu Control one or more of the SPUs in the system


nzsql Netezza SQL interactive query tool
nzstart Start the database
nzstate Display the current system state
nzstats Display system information and statistics
nzstop Stop the database
nzsystem Stop, restart, pause, or resume database operations
-u <user> Database username [ NZ_USER ]
-pw <password> Database password [ NZ_PASSWORD ]
-db <database> Database name [ NZ_DATABASE ]
-host <name/ip> Netezza hostname/ IP address [ NZ_HOST ]
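
The -u, -pw, -db, and -host options, and their corresponding environment variables, are accepted by most nz commands. For example, using the class host name that appears later in these exercises:

nzstate -host pok-puredata -u <student_id> -pw <student_pwd>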

NZSQL command line options


-A Unaligned table output mode
-t Print rows only. Combine -A and -t to dump a table.
-f <filename> Execute queries from file, then exit
-o <filename> Send query output to filename
-c <query> Run only single query and exit
-E Display queries that internal commands generate
-F <string> Set field separator (default "|")
-x Turn on expanded table output
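
For example, several of these options can be combined to dump a table as delimited text; the output file name here is illustrative:

nzsql -d <student_id>_db -A -t -F ',' -c "select * from nation" -o nation.out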

Contents

Exercises description

Exercise 1. Access training's IBM PureData System for Analytics
1.1. Connecting to the IBM PureData System for Analytics
1.2. Setting environment variables in your profile
1.3. Caching your password
1.4. Connecting to the system database
1.5. Logging out of the host

Exercise 2. Administration tools
2.1. Launching NzAdmin and connecting to host
2.2. Using NzAdmin tool
2.3. Launching IBM Netezza Performance Portal
2.4. Using IBM Netezza Performance Portal

Exercise 3. Databases, tables and schemas
3.1. Creating a database and tables
3.2. Creating second database schema

Exercise 4. Data distribution
4.1. Determining distribution key for data skew
4.2. Correcting data skew
4.3. Implementing co-location

Exercise 5. Loading and unloading data using external tables
5.1. Unloading data using external tables
5.2. Dropping external tables
5.3. Loading data using external tables

Exercise 6. Loading and unloading data using the nzload utility
6.1. Using the nzload utility with command line options
6.2. Using the nzload utility with a control file

Exercise 7. Generate statistics
7.1. Analyzing a query
7.2. Using nzadmin to generate statistics

Exercise 8. Analyzing query plans
8.1. Identifying join problems
8.2. Using Aginity for analyzing SQL queries

Exercise 9. Zone maps and clustered base tables
9.1. Using clustered base tables
9.2. Maintaining clustered base tables
9.3. Comparing a regular table with a CBT
9.4. Comparing a regular table with a CBT restricting non zone-mappable fields

Exercise 10. Materialized views
10.1. Creating a materialized view to reduce data width
10.2. Using a materialized view as an index

Exercise 11. Transactions and Truncate table
11.1. Inserting a transaction
11.2. Updating and deleting transactions
11.3. Aborting transactions
11.4. Cleaning up with GROOM
11.5. Understanding concurrent select and truncate
11.6. Understanding concurrent insert and truncate

Exercise 12. GROOM
12.1. Grooming logically deleted rows
12.2. Understanding performance benefits of GROOM

Exercise 13. Stored procedures
13.1. Creating a stored procedure
13.2. Adding integrity checks
13.3. Managing your stored procedure

Exercise 1. Access training's IBM PureData
System for Analytics

Prerequisites
None.

What this exercise is about


To gain experience connecting to an IBM PureData System for Analytics and
its database.

What you should be able to do


At the end of the exercise, you should be able to:
• Access and log in to Training's IBM PureData System for Analytics
• Perform Linux shell commands to manage your username and password
environment variables
• Connect to the system database using NZSQL
• Get help information on NZSQL 'slash' commands

Exercise instructions
As a developer, you need to log in to the IBM PureData System for Analytics and connect to the
system database.

1.1. Connecting to the IBM PureData System for Analytics


In this exercise, you remotely connect to the training system through a Citrix gateway and log on to
the host using PuTTY.
__ 1. To connect remotely to the IBM PureData System for Analytics, go to the following URL:
https://pokctx.edu.ihost.com/vpn/index.html
__ 2. Enter the user ID and password from your course instructions, and then click Log On.

Note

If you have any issues with Citrix, consult the IBM IRLP Citrix Setup Guide provided in your course
instructions.

__ 3. Click the Remote Desktop icon.

Important

If you get a Citrix Receiver - Security Warning pop-up message, click Permit use.

__ 4. From the Remote Desktop Connection panel, click Connect.

You might get an error message when you click Connect. If you do, follow these substeps
before continuing to the next step:
a. On the Remote Desktop Connection panel, click Show Options.

b. In the Computer field, enter the IP address from your course instructions.

c. Change the User name to IBM.

d. Click Connect.
__ 5. At the Windows Security prompt, enter the password password.

__ 6. Open the Netezza Training Tools Desktop folder:

__ 7. Open a PuTTY session:


__ a. Double-click the PuTTY icon.

__ b. Click Open.

__ c. Log into the host with the ID and password credentials you received in your course
package, which are referenced as follows throughout the rest of the exercises:
- Username = <student_id>
- Password = <student_pwd>

1.2. Setting environment variables in your profile


To avoid having to enter your user name and password when you run commands, set these
variables in your profile.

Important

Ensuring that these variables are set AND set correctly is crucial to successfully completing all
subsequent exercises. DO NOT move on to the next exercise until you have completed all the steps
here and verified that the variables have the correct values.

In PuTTY, use the vi editor to modify your profile:


__ 1. Type vi .bash_profile.
__ 2. Type i to enter Edit mode.
__ 3. Use the cursor key to navigate to the end of the export PATH line, and then press Enter.

__ 4. Add the following three new command lines. Note that these commands and variables are
case-sensitive.
export NZ_USER=<student_id>
export NZ_PASSWORD=<student_pwd>
export NZ_DATABASE=<student_id>_db

Your file should look like this except with your information:

__ 5. To save your changes:


__ a. Press Esc to switch back into command mode. Notice that the -- INSERT -- indicator at the
bottom of the screen vanishes.
__ b. Enter :wq! and press Enter to write the file, and quit the editor without any questions.
__ 6. Verify your changes:
more .bash_profile
Your file should look like this except with your NZ_USER, NZ_PASSWORD, and
NZ_DATABASE:

__ 7. Refresh your profile:


. .bash_profile
__ 8. Verify that your environment variables are enabled by typing:
env | grep -i nz

Information

When you are working with your own PureData System for Analytics, you can set these variables
for the current shell only instead of setting them permanently in your profile. To do this, instead of
editing your bash profile, export NZ_USER, NZ_PASSWORD, and NZ_DATABASE using the following
commands:
export NZ_USER=<student_id>
export NZ_PASSWORD=<student_pwd>
export NZ_DATABASE=<student_id>_db

1.3. Caching your password


The nzpassword command encrypts the password and caches it locally in your home directory. The
cached password is then used by the nz commands that require authentication.
__ 1. To cache the password, type the following command:
nzpassword add -u <student_id> -pw <student_pwd>

Important

If you get an error message regarding your credentials, this means that your user name and
password are not the same as what you set in your environment variables in the previous exercise.
Run env | grep -i nz to verify your information.
To clear the environment variables do the following:
export NZ_PASSWORD=
export NZ_USER=

__ 2. Ensure the password is cached:


nzpassword

Host User
---------- -----
pok-puredata <student_id>

Note

You can cache the password on both active and passive hosts by using the host switch.
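
For example, assuming the passive host name given in your course instructions is <standby_host>, the same password can be cached for it:

nzpassword add -u <student_id> -pw <student_pwd> -host <standby_host>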

1.4. Connecting to the system database


In this exercise, you make your initial connection to the system database.
__ 1. Type nzsql -d system to connect to the system database:

Information

If you do not successfully cache your password or you get an authentication error, add your user ID
and password to the command:
nzsql -d system -u <student_id> -pw <student_pwd>

__ 2. To view help for the slash (\) commands, type the following:


SYSTEM.ADMIN(STUDENT_ID)=> \?
__ 3. Exit NZSQL:
SYSTEM.ADMIN(STUDENT_ID)=> \q

1.5. Logging out of the host


__ 1. To log out of the host, type logout at the prompt.

Note

This automatically closes your PuTTY session.

End of exercise

Exercise 2. Administration tools

Prerequisites
Before proceeding with this lab, make sure that you have:
• Completed the previous lab successfully.

What this exercise is about


To gain experience with the functions and features provided by NzAdmin and
Netezza Performance Portal.

What you should be able to do


At the end of the exercise, you should be able to:
• Launch the NzAdmin Tool
• Use features of the NzAdmin tool
• Log in to the Netezza Performance Portal
• Use features of Netezza Performance Portal

Exercise instructions
You are an IBM PureData System for Analytics developer. You need to be able to navigate
NzAdmin and Netezza Performance Portal to access the IBM PureData System for Analytics,
check on the hardware, and access the databases for which you have permissions.

2.1. Launching NzAdmin and connecting to host


IBM Netezza Administrator can be launched using the icon in the IBM Netezza Training Tools
desktop folder or by navigating to it from the Windows Start menu: Start > All Programs >
IBM Netezza > IBM Netezza Administrator.
__ 1. Launch NzAdmin by double-clicking the NzAdmin icon in the IBM Netezza Training Tools
desktop folder:

__ 2. On the Connect to IBM Netezza Server screen, enter the following, and then click OK.
__ a. Host: pok-puredata
__ b. User: <student_id>
__ c. Password: <student_pwd>.

__ 3. Verify you see the following main application screen and review the system state and
configuration information:

2.2. Using NzAdmin tool


In this exercise, you walk through an overview of the NzAdmin tool.
__ 1. Click on SPU Units to see the status of each.

__ 2. Expand SPA Units, click on SPA ID, and then click on each SPU Slot to see individual
partition statuses and details from the physical appliance:

__ 3. Click Data Slices to review their status.

__ 4. From the Database tab, click Users.

__ 5. To view user properties, right-click <student_id> and select Properties.

__ 6. To view the Admin and Object permissions for <student_id>, right-click <student_id>
and select Privileges > Admin and then Privileges > Object.

__ 7. Click the X in the upper-right corner of the window to exit NzAdmin.

2.3. Launching IBM Netezza Performance Portal


In this exercise, you create a new user for yourself in the Performance Portal.
__ 1. Double-click the Performance Portal icon in the IBM Netezza Training Tools desktop folder.

__ 2. From the Login panel, enter User Name admin and Password password, and then click
Login.

__ 3. Click Portal Administration at the top right of the toolbar.

__ 4. In the Portal Administration panel, click Accounts Administration.

__ 5. On the next panel, click Create Account.

Note

You have to create an account for the Portal because its security credentials are stored separately
from the credentials used for NPS and Linux. There is no connection between the Portal
authorization credentials and any other credentials.

__ 6. From the Create Account panel, enter your student id and password, re-enter your
password, and then click OK.

__ 7. From the confirmation popup, click OK.

__ 8. Verify your user name is in the Accounts Administration panel, and then close the panel.

__ 9. Click Logout at the top right of the Portal Administration toolbar.

__ 10. Double-click the Performance Portal icon in the IBM Netezza Training Tools desktop folder
and log in with your student ID and password.

__ 11. In the left menu, select Add Host.

__ 12. From the Add Host panel, type in the following and then click OK.
__ a. Host: pok-puredata
__ b. User: <student_id>
__ c. Password: <student_pwd>

__ 13. Verify that the Status is Online.

2.4. Using IBM Netezza Performance Portal


In this exercise, you walk through an overview of the Performance Portal.
__ 1. In the left pane, from the System folder click Database Administration.

__ 2. In the top-right pane, click QHIST_STUDENTS database.

__ 3. To view its tables, click the Tables tab in the bottom right pane.

__ 4. To review hardware state and configuration status, click Hardware to expand the navigation,
and then review the SPA, SPU, and data slice information. If this takes more than a couple
of minutes without responding, cancel and expand again.

__ 5. Close Netezza Performance Portal.

End of exercise

Exercise 3. Databases, tables and schemas

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• Logged in to the host

What this exercise is about


To gain experience creating a database and its tables. In addition, the
exercise gives you experience enabling schema support and managing
database schemas.

What you should be able to do


At the end of the exercise, you should be able to:
• Create a database
• Connect to the new database and examine the schemas
• Create a new schema
• Change the default schema to the second schema

Exercise instructions
You are an IBM PureData System for Analytics developer and want to create a database and tables
with a random distribution. You also need to know how to manage the default schema.

3.1. Creating a database and tables


In this exercise, you create your student database, create its tables, and load the tables with data.

Important

Whenever you are in nzsql and execute an SQL statement, you MUST finish the statement with a
semicolon (;). The slash (\) commands do not require a semicolon.
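
For example, both of the following are valid once you are connected; the first requires the semicolon, the second does not:

select current_date;
\dt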

__ 1. Using PuTTY, log into the host.


__ 2. To connect to the system database using your student ID and password, type:
nzsql -d system

Information

If you did not successfully cache your password, add your user ID and password to the command:
nzsql -d system -u <student_id> -pw <student_pwd>

Notice the identifiers for database, schema, and user ID. During the rest of this course, this
helps you ensure you are using the correct database and schema, and are logged in with
the correct user ID.

__ 3. Type the following to create your database:


create database <student_id>_db;
Ensure you use the exact same database name as you specified in your environment
variables, for example:
create database stud5073_1_db;
__ 4. Change from the system database to your new student database:
\c <student_id>_db

Notice that the database and schema identifiers have changed to those of your newly
created database:

__ 5. To exit nzsql, type:


\quit
__ 6. Change to your /home/<student_id>/DDL directory.
cd /home/<student_id>/DDL

__ 7. To view the table DDLs in your directory, type ls.

__ 8. To review the contents of a DDL, type, for example:


more customer.sql
__ 9. To create a table in your new database, you use the following command structure:
nzsql -d <student_id>_db -f <ddl_name.sql>

Create tables using each of the following table DDLs:


• customer.sql
• lineitem.sql
• nation.sql
• orders.sql
• part.sql
• partsupp.sql
• region.sql
• supplier.sql
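
For example, to create the CUSTOMER table:

nzsql -d <student_id>_db -f customer.sql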

Important

Do not create a table using orders_cbt.sql as you do this in a later exercise.

__ 10. Connect to your database.


nzsql

__ 11. Review your tables using the \dt command.

__ 12. To prepare for subsequent labs, you must add data to your tables. You do this by executing
the script loadtables.sh in the /home/<student_id> directory.
__ a. Exit NZSQL by typing \quit.
__ b. Type cd .. to change to the /home/<student_id> directory.
__ c. To load data into your tables, type the following command:
./loadtables.sh
__ d. To ensure that all data loaded successfully, check the generated nzlog files after the
script has completed, using the following command:
more <TABLE_NAME>.<STUDENT_ID>.<STUDENT_ID>_DB.nzlog

Note

Ensure you use uppercase letters in *.nzlog file names.

3.2. Creating second database schema


In this exercise, you create a second database schema and set it as the default schema.
__ 1. Use the same PuTTY window as in the previous exercise.
__ 2. Type nzsql to go back into your database.
__ 3. Type select current_schema;

__ 4. Create the second schema and change the default database schema using the following commands:


__ a. create schema <student_id>_2;

__ b. Verify the new schema is created by typing:


select * from system.definition_schema._v_schema_xdb;

You should see something similar to the following:

__ c. alter database <student_id>_db set default schema <student_id>_2;

__ d. \c system
__ e. \c <student_id>_db

__ 5. Change back to your default schema:


alter database <student_id>_db set default schema <student_id>;
__ 6. Type \q to exit NZSQL.

End of exercise

Exercise 4. Data distribution

Prerequisites
Before proceeding with this lab, make sure that you have:
• Completed the previous lab successfully

What this exercise is about


To understand what data skew means and how to co-locate tables.

What you should be able to do


At the end of the exercise, you should be able to:
• Investigate data skew
• Implement co-location of data

Exercise instructions
Since IBM PureData System for Analytics is built on a massively parallel architecture that
distributes data and workloads over a large number of processing and data nodes, the single most
important tuning factor is picking the right distribution key. The distribution key governs which data
rows of a table are distributed to which data slice and it is very important to pick an optimal
distribution key to avoid data skew, processing skew and to make joins co-located whenever
possible.
Tables in IBM PureData System for Analytics are distributed across data slices based on the
distribution method and key. If a bad data distribution method has been picked, it results in skewed
tables or processing skew. Data skew occurs when the distribution method puts significantly more
records of a table on one data slice than on other data slices. Apart from bad performance this also
results in a situation where the IBM PureData System for Analytics can hold significantly less data
than expected. Processing skew occurs if processing of queries is mainly taking place on some
data slices, for example, when queries only apply to data on those data slices. Even in tables that
are distributed evenly across data slices, data processing for queries can be concentrated or
skewed to a limited number of data slices. This can happen because IBM PureData System for
Analytics is able to ignore data extents (sets of data pages) that do not match a given WHERE
condition.
Both types of skew result in suboptimal performance since in a parallel system the slowest node
defines the total execution time.
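
As a reminder of the DDL syntax that governs this, the distribution key is declared when a table is
created; the table and columns below are illustrative and are not part of the exercise:

create table orders_example
(
o_orderkey integer,
o_orderdate date,
o_totalprice numeric(15,2)
)
distribute on (o_orderkey);

A table can instead be created with DISTRIBUTE ON RANDOM to spread rows evenly without a hash
key, as you see later in this exercise.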

4.1. Determining distribution key for data skew


In this exercise, you determine the best distribution key to prevent data skew on the LINEITEM
table.
__ 1. Ensure you are logged into the host with your <student_id> and <student_pwd>.
__ 2. Type nzsql to log into your database.
__ 3. Check to ensure you are using the default schema by issuing the following command:
select current_schema;

Information

If you are still in the second schema you created in the previous exercise, complete the following
steps:
__ a. alter database <student_id>_db set default schema <student_id>;
__ b. Quit nzsql.
__ c. Start nzsql.
__ d. Ensure you are using the default schema: 
select current_schema;

__ 4. To see a description of the LINEITEM table’s columns and distribution key, run the describe
command:
\d lineitem
You can see that the LINEITEM table has 16 columns with different data types. Some of the
column names have a "key" suffix and contain the names of other tables; they are most likely
foreign keys referencing dimension tables. The distribution key is L_LINESTATUS.


__ 5. To return a limited number of rows, you can use the limit keyword in your select queries.
Execute the following select command to return 10 rows of the LINEITEM table. For
readability, only select a few columns, including the order key, the ship date, and the
linestatus distribution key:
select l_orderkey, l_quantity, l_shipdate, l_linestatus from lineitem limit
10;

From this limited sample, you cannot make any definite judgments, but you can make a
couple of assumptions. While the L_ORDERKEY column is not unique, it seems to have a
large number of distinct values. The L_SHIPDATE column also appears to have many distinct
shipping date values. L_LINESTATUS, on the other hand, has only one value shown, which
might make it a bad distribution key. It is possible that you get different results, since a
database table is an unordered set (for example, only "O" or "F" values in the
L_LINESTATUS column).
__ 6. Verify the number of distinct values in the columns with a "SELECT COUNT(DISTINCT
column_name)" call. For example, to return the number of distinct values in the
L_LINESTATUS column, execute the following SQL command:
select count (distinct l_linestatus) from lineitem;

You can see that the L_LINESTATUS column only contains two distinct values. As a
distribution key, this results in a table that is only distributed to two of the available
dataslices. L_ORDERKEY on the other hand has many distinct values.

__ 7. Verify this by executing the following SQL call, which returns a list of all dataslices that
contain rows of the LINEITEM table, and the corresponding number of rows stored in them:
select datasliceid, count(*) from lineitem group by datasliceid;

Information

Every IBM PureData System for Analytics table has a hidden column DATASLICEID that contains
the id of the dataslice in which the selected row is being stored. By executing a SQL query that
does a GROUP BY on this column and counts the number of rows for each dataslice id, data skew
can be detected. In this case the table has been, as expected, distributed to only two of the
available four dataslices. This means that only half of the available space is used and likely results
in low performance during most query executions. In general, a good distribution key should have a
large number of distinct values with a good value distribution. Columns with a low number of distinct
values, especially Boolean columns, should not be considered as distribution keys.
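
As an optional aside, and not part of the exercise steps, the per-dataslice counts can be condensed
into a single skew ratio, where a value close to 1 indicates an even distribution:

select cast(max(cnt) as numeric) / min(cnt) as skew_ratio
from (select datasliceid, count(*) as cnt from lineitem group by datasliceid) as t;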

__ 8. Exit the NZSQL interface by typing \quit.

4.2. Correcting data skew

Information

In the following two sections, 4.2 and 4.3, you create two new tables by editing SQL files with vi. It
is possible to achieve the same results by using CREATE TABLE AS (CTAS). The vi method was
chosen to help you become familiar with the editing process. If you are already familiar with vi, it is
your choice whether you want to use the CTAS method instead.

You are going to pick a new distribution key. As you have seen, it should have a reasonable number
of distinct values. One of the columns that fits this description is the L_SHIPDATE column, as noted
above. The column has over 2500 distinct values, which is more than enough to guarantee a good
data distribution on 4 data slices. Of course, this is under the assumption that the value distribution
is good as well.
__ 1. Reload the table with the new distribution key.
__ a. In PuTTY, ensure you are in the DDL directory by typing:
cd /home/<student_id>/DDL
__ b. Copy the lineitem.sql to a new file called lineitem_shipdate.sql by typing
cp lineitem.sql lineitem_shipdate.sql
__ c. Edit lineitem_shipdate.sql by typing:
vi lineitem_shipdate.sql
__ d. Use the cursor key to navigate to the DISTRIBUTE ON line, and then navigate to the
beginning of l_linestatus.

__ e. Type cw to enter change word mode and change the distribution key by typing:
l_shipdate

__ 2. To save your changes


__ a. Press “Esc” to switch back into command mode.
__ b. Enter :wq! and press Enter to write the file and exit editing mode.
__ 3. Using the NZSQL interface drop the LINEITEM table: drop table lineitem;
__ 4. Exit NZSQL and type in the following to recreate the LINEITEM table:
nzsql -db <student_id>_db -f lineitem_shipdate.sql
__ 5. After this statement has executed successfully, reload the new table:
nzload -db <student_id>_db -t lineitem -df
/home/<student_id>/DATA/LINEITEM.unl -delim '|' -maxErrors 10

__ 6. After the nzload command has executed successfully, generate statistics for the reloaded
table. To do this, use NZSQL and type in the following command:
generate statistics on lineitem;

__ 7. Verify that the new distribution key results in a good data distribution by repeating the query,
which returns the number of rows for each datasliceid of the LINEITEM table:
select datasliceid, count(*) from lineitem group by datasliceid;

Notice that the data distribution is much better now. All data slices have a roughly equal number
of rows.

4.3. Implementing co-location


The most basic warehouse schema consists of a fact table containing a list of all business
transactions and a set of dimension tables that contain the different actors, objects, locations and
time points that have taken part in these transactions. This means that most queries not only
access one database table but require joins between tables. In IBM PureData System for Analytics,
database tables are distributed over a potentially large number of data slices on different SPUs.
This means that during a join of two tables there are two possibilities:
• Rows of the two tables that belong together are situated on the same data slice, which means
that they are co-located and can be joined locally.
• Rows that belong together are situated on different data slices, which means that the tables
need to be redistributed.

Information

When you redistribute a table, the system makes a copy of the table in a temporary location and
then redistributes that copy.
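
If you want to check ahead of time whether a join will be co-located or will require redistribution, one
option, not required for this exercise, is to prefix the query with EXPLAIN VERBOSE in nzsql and look
for distribute or broadcast steps in the plan output. For example:

explain verbose select count(*) from orders o, lineitem l where l.l_orderkey = o.o_orderkey;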

For this exercise, use the ORDERS table.


__ 1. To view the description of the ORDERS table’s columns and distribution key, use the
NZSQL interface to connect to the student database and then run the describe command \d
orders to get a description of the table.

The ORDERS table has a key column O_ORDERKEY that is most likely the primary key
of the table. It contains information on the order value, priority, and date, and has been
distributed ON RANDOM. This means that IBM PureData System for Analytics does not
use a hash-based algorithm to distribute the data. Instead, rows are distributed randomly
on the available data slices. You can check the data distribution of the table using the
methods used before for the LINEITEM table. The data distribution is perfect. There is
also no processing skew for queries on the single table, since in a random distribution
there can be no correlation between any WHERE condition and the distribution key.
__ 2. An example query returns the average total price and item quantity of all orders grouped by
the order priority. This query has to join the LINEITEM and ORDERS tables to get the total
order cost from the ORDERS table and the quantity for each shipped item from the
LINEITEM table. The tables are joined with an inner join on the L_ORDERKEY and
O_ORDERKEY columns. Execute the following command and query and note the
approximate execution time:
\time

select avg(o.o_totalprice) as price, avg(l.l_quantity) as quantity,


o_orderpriority from orders as o, lineitem as l where
l.l_orderkey=o.o_orderkey group by o_orderpriority;

Remember that the ORDERS table was distributed randomly and the LINEITEM table is still
distributed by the L_SHIPDATE column. The join on the other hand is taking place on the
L_ORDERKEY and O_ORDERKEY columns. What happens is that the system redistributes
both the ORDERS and LINEITEM tables. This is bad because both tables are of significant
size, so there is considerable overhead. This inefficient redistribution occurs because the
tables are not distributed on a useful join column.
__ 3. Reload the tables based on the mutual join key to enhance performance during joins. To do
this, you need to reload the LINEITEM table with the new distribution key.
__ a. Ensure you are in the DDL directory.
__ b. Edit the lineitem_shipdate.sql file by typing vi lineitem_shipdate.sql.
__ c. Use the cursor key to navigate to the DISTRIBUTE ON line, and then navigate to the
beginning of l_shipdate.
__ d. Type cw to enter change word mode and change the distribution key by typing:
l_orderkey
The line should now look like this:
DISTRIBUTE ON (l_orderkey);
__ e. Press “Esc” to switch back into command mode.
__ f. Enter :wq! and press Enter to write the file, and quit the editor without any questions.

__ 4. Change the distribution key of the ORDERS table to o_orderkey by editing its DDL file.
__ a. Copy the orders.sql to a new file called orders_orderkey.sql by typing:
cp orders.sql orders_orderkey.sql
__ b. Edit orders_orderkey.sql by typing:
vi orders_orderkey.sql
__ c. Use the cursor key to navigate to the DISTRIBUTE ON RANDOM line, and then navigate
to the beginning of the word RANDOM.
__ d. Type cw to enter change word mode and change the distribution key to o_orderkey by
typing:
(o_orderkey)
The line should now look like this:
DISTRIBUTE ON (o_orderkey);
__ e. Press “Esc” to switch back into command mode.
__ f. Enter :wq! and press enter to write the file, and quit the editor without any questions.
__ 5. Using the NZSQL console, drop the LINEITEM and ORDERS tables:
drop table lineitem;
drop table orders;
__ 6. Exit NZSQL and type in the following to recreate the LINEITEM and ORDERS tables:
nzsql -db <student_id>_db -f lineitem_shipdate.sql
nzsql -db <student_id>_db -f orders_orderkey.sql

__ 7. After these statements have executed successfully, reload the new tables by issuing the
following commands:
time nzload -db <student_id>_db -t lineitem -df
/home/<student_id>/DATA/LINEITEM.unl -delim '|' -maxErrors 10

time nzload -db <student_id>_db -t orders -df
/home/<student_id>/DATA/ORDERS.unl -delim '|' -maxErrors 10

__ 8. After the nzload commands have executed successfully, generate statistics for the reloaded
tables. To do this, use the NZSQL console and type in the following commands:
generate statistics on lineitem;
generate statistics on orders;
__ 9. Repeat step 2 and note the new execution time as it should have improved significantly.

The query should return the same results as in the previous section but run much faster.
You have now loaded the LINEITEM and ORDERS tables into your IBM PureData System for
Analytics using the distribution key that is optimal for these tables in most situations.
• Both tables are distributed evenly across dataslices, so there is no data skew.

• The distribution key is highly unlikely to result in processing skew, since most WHERE
conditions restrict a key column evenly.
• Since ORDERS is a parent table of LINEITEM, with a foreign key relationship between them,
most queries joining them together use the join key. These queries are co-located.

End of exercise

Exercise 5. Loading and unloading data using
external tables

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• Loaded all eight tables in your database with the supplied data.

What this exercise is about


This lab helps you explore the IBM PureData System for Analytics framework
components for loading and unloading data from the database. You use the
various commands to create external tables to unload and load data. In
Exercise 6, you get a basic understanding of the nzload utility.

What you should be able to do


At the end of the exercise, you should be able to:
• Create an external table in your database
• Unload the data using a delimited file in ASCII format
• Verify that data was created without problems
• Reload the table with data from the external table

Exercise instructions
In every data warehouse environment there is a need to load new data into
the database. Loading data is not a one-time task, but a continuous operation
that can occur hourly, daily, weekly, or even monthly, so it must be well
supported by the data warehouse system. IBM PureData System for Analytics
provides a framework that supports both loading data into and unloading data
from its database environment. This framework contains more than one
component; some of these components are:
• External tables – Table definitions registered in the IBM PureData System for Analytics
catalog whose data is stored as flat files on the host or client systems. They can be used to
load data into the IBM PureData System for Analytics or to unload data to the file system.
• nzload – A wrapper command line tool around external tables that provides
an easy method for loading data into the IBM PureData System for Analytics.
• Format options – Options to format the data load to and from external
tables.

5.1. Unloading data using external tables


An external table allows the IBM PureData System for Analytics to treat an external file as a
database table. An external table has a definition (a table schema) in the IBM PureData System for
Analytics’ system catalog, but the actual data exists outside of the appliance’s database; this
external file is referred to as the datasource file. External tables can be used to access files that are
stored on the file system. After you have created the external table definition, you can use INSERT INTO
statements to load data from the external file into a database table, or SELECT FROM statements
to query the external table.
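Before working through the individual use cases, the two directions can be summarized by the
following minimal sketch (the table names here are placeholders; the exact commands for this lab
appear in the steps below):
insert into my_external_table select * from my_base_table;   -- unload rows to the datasource file
insert into my_base_table select * from my_external_table;   -- load rows back into a database table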
For this exercise, you use external tables to unload rows from the REGION or NATION tables as
records into an external datasource file. You follow five different use cases to help you understand
how to use external tables to unload data from a database using the NZSQL interface.
Since you examine the external datasource files for the external tables, start two PuTTY sessions
to make it easier to view the external files:
__ 1. Ensure you are in the <student_id> home directory.
__ 2. Using PuTTY, log into NZSQL and make sure you are connected to <student_id>_db. This
window is referred to as session 1, and you use it to execute SQL commands, for example
to review tables after load operations.
__ 3. Open a second PuTTY session. This window is referred to as session 2, and you use it for
operating system commands, such as executing nzload commands and viewing data files.

5.1.1. Unloading data with an external table created with the SAMEAS clause
Use the first external table to unload data from the NATION table into an ASCII delimited text file.
Name this external table NATION_EXT and use the same column definition as the NATION table.
After you create the NATION_EXT external table, use it to unload all the rows from the NATION
table. The records for the NATION_EXT external table are in the external datasource file,
/tmp/<student_id>_nation_ext.unl.
__ 1. In session 1, use the following command to create an external table called nation_ext so
that you can offload the data to this table.
create external table nation_ext sameas nation using (DATAOBJECT ('/tmp/<student_id>_nation_ext.unl') delimiter '|');

__ 2. Verify the external table was created using the following command:
\d

Hint

You can also verify the external table was created using the NzAdmin tool. If you want to list just the external
tables, use the \dx command.

__ 3. You can list the properties of the external table using the \d nation_ext command.

This output includes the columns and associated data types in the external table. Notice that
this is similar to the NATION table since the external table was created using the sameas
clause in the create external table command. The output also includes the properties
of the external table. The most notable property is the DataObject property which shows
the location and the name of the external datasource file used for the external table.
__ 4. Still in session 1, unload the data from the base table nation to the external table
nation_ext:
insert into nation_ext select * from nation;
Result:
INSERT 0 257
__ 5. Using session 2, review the external file on the host corresponding to the external table and
count the rows using the wc -l (word count lines) command.
wc -l /tmp/<student_id>_nation_ext.unl
Result:
257 /tmp/<student_id>_nation_ext.unl

5.1.2. Unloading data with an external table using the AS SELECT clause
The second external table is also used to unload data from the NATION table into an ASCII
delimited text file, but using a different method. You create the external table and unload the data in
the same create statement, so a separate step is not required to unload the data. (The first method
used to create an external table required the data to be unloaded in a second step using an
INSERT statement.) Name the external table NATION_EXT2 and the external datasource file
/tmp/<student_id>_nation_ext2.unl.
__ 1. In session 1, create an external table and unload the data in a single step by typing the
following command in NZSQL:
create external table nation_ext2 '/tmp/<student_id>_nation_ext2.unl' as
select * from nation;

__ 2. List just the external tables by typing \dx. Notice there are now two external tables.

__ 3. Using session 2, review the file you created by typing:
more /tmp/<student_id>_nation_ext2.unl

5.1.3. Unloading data with an external table using defined columns
The first two external tables that you created used the exact same columns as the NATION table,
using an implicit table schema. You can also create an external table by explicitly specifying the
columns. This is referred to as an explicit table schema. The third external table that you create is
still used to unload data from the NATION table, but only from the N_NAME and N_COMMENT
columns. You create the nation_ext3 external table in one step and then unload the data into the
/tmp/<student_id>_nation_ext3.unl file using a different delimiter string. The basic syntax is:
create external table table_name ({column_name type} [, ... ]) [USING (external_table_options)]
__ 1. Using session 1, create a new external table that includes only the N_NAME and
N_COMMENT columns and excludes the remaining columns of the NATION table. Along
with this, change the delimiter string from the default '|' to '=':
create external table nation_ext3 (n_name char(25), n_comment varchar(152)) using
(dataobject ('/tmp/<student_id>_nation_ext3.unl') delimiter '=');

__ 2. List the properties of the NATION_EXT3 external table:
\d nation_ext3

Notice there are only two columns for this external table since you only specified two
columns when creating the external table. The rest of the output is very similar to the
properties of the other two external tables that you created, with two main exceptions. The
first difference is obviously the DataObjects field, since the filename is different. The other
difference is the string used for the delimiter, since it is now ‘=’ instead of the default, ‘|’.
__ 3. Unload the data from the NATION table, but only the data from columns N_NAME and
N_COMMENT:
insert into nation_ext3 select n_name, n_comment from nation;

Hint

Alternatively, you could create the external table and unload the data in one step using the following
command:
create external table nation_ext3 '/tmp/<student_id>_nation_ext3.unl' using
(delimiter '=') as select n_name, n_comment from nation;

__ 4. Using session 2, review the file you created:
more /tmp/<student_id>_nation_ext3.unl

Notice that only two columns are present in the flat file using the ‘=’ string as a delimiter.

5.1.4. Unloading data with an external table from two tables


The first three external tables unload data from one table. The next external table you create is
based on using a table join between the REGION and NATION table. The two tables are joined on
the REGIONKEY and only the N_NAME and R_NAME columns are defined for the external table.
This exercise illustrates how data can be unloaded using SQL statements other than a simple

SELECT FROM statement. The external table is named nation_ext4 using another ASCII delimited
text file named /tmp/<student_id>_nation_ext4.unl.
__ 1. Using session 1, create the next external table, and unload data from both the REGION and
NATION table joined on the REGIONKEY column to list all of the countries and their
associated regions. Instead of specifying the columns in the create external table statement,
use the AS SELECT option:
create external table nation_ext4 '/tmp/<student_id>_nation_ext4.unl' as
select n_name, r_name from nation, region where n_regionkey=r_regionkey;

__ 2. List the properties of the NATION_EXT4 external table:
\d nation_ext4

__ 3. View the data of the NATION_EXT4 external table:
select * from nation_ext4;

__ 4. Use the second session to review the file you created:
more /tmp/<student_id>_nation_ext4.unl

5.1.5. Unloading data with an external table using the compressed binary format
The previous external tables that you created used the default ASCII delimited text format. This last
external table is similar to the second external table that you created; however, instead of using an
ASCII delimited text format, use the compressed binary format. The name of the external table is
NATION_EXT5 and the datasource file name is /tmp/<student_id>_nation_ext5.unl. The
external table options compress and format must be specified to use the compressed binary
format. The basic syntax to create this type of external table is:
create external table table_name 'filename' USING (COMPRESS true FORMAT 'internal') AS select_statement;
__ 1. Create one last external table using a method similar to the one you used to create the
second external table, in 5.1.2. Instead of using an ASCII delimited-text format, compress
the datasource using the compress and format external table options. As a reminder, the
external table is created and the data is unloaded in the same operation using the AS
select clause:
create external table nation_ext5 '/tmp/<student_id>_nation_ext5.unl' using
(compress true format 'internal') as select * from nation;

__ 2. List the properties of the NATION_EXT5 external table:
\d nation_ext5

Notice that the option for compress has changed from false to true indicating that the
datasource file is compressed and the format has changed from text to internal, which
is required for compressed files.

5.2. Dropping external tables


Dropping external tables is similar to dropping a regular IBM PureData System for Analytics table.
The column definition for the external table is removed from the IBM PureData System for Analytics
catalog. Keep in mind that dropping the table does not delete the external datasource file, so the
file itself must be maintained separately. The external datasource file can still be used for loading data into a
different table. In this section, you drop the NATION_EXT table but do not delete the associated
external datasource file, /tmp/<student_id>_nation_ext.unl. This datasource file is used
later in this lab to load data into the NATION table.
__ 1. Using NZSQL, drop the first external table that you created, NATION_EXT, using the drop
table command and verify that the external table has been dropped using the internal
slash option, \dx:
drop table nation_ext;
\dx

__ 2. Even though the external table definition no longer exists within the student database, the
flat file named /tmp/<student_id>_nation_ext.unl still exists. To verify this, in
PuTTY session 2 type:
ls /tmp/*.unl

__ 3. Repeat Step 1 to drop all other external tables EXCEPT nation_ext2.


__ 4. Verify the tables are dropped using \dx.

5.3. Loading data using external tables


External tables can also be used to load data into tables in the database. In this section, since you
load data into the NATION table, you first have to remove the existing rows from the NATION table.
The method to load data from external tables into a database table uses the familiar DML INSERT
INTO ... SELECT FROM statement pattern. You use two different methods to load data into the NATION table,
one using an external table and the other using the external datasource file directly. Loading data
into a table from any external table has an associated log file with a default name of
<table_name>.<db_schema>.<database>.nzlog
__ 1. Before loading the data into the NATION table, delete the rows from the data using the
truncate table statement:
truncate table nation;
__ 2. Check that the table is empty with the select * statement:
select * from nation;
__ 3. Use an insert statement to load data into the NATION table from the NATION_EXT2
external table:
insert into nation select * from nation_ext2;
__ 4. Check to ensure that the table contains the 257 rows using the select * statement.
__ 5. Again delete the rows in the NATION table using the truncate statement.
__ 6. Check to ensure that the table is empty using the select * statement.
__ 7. Using the ASCII delimited file that was created for external table NATION_EXT, load data
into the NATION table. Remember that the definition of the external table was removed from
that database, but the external data source file, /tmp/<student_id>_nation_ext.unl
still exists:
insert into nation select * from external
'/tmp/<student_id>_nation_ext.unl';
__ 8. Check to ensure that the table contains rows using the select * statement.
__ 9. Since this is a load operation, an associated log file,
<table_name>.<db_schema>.<database>.nzlog, is created for each load performed. By
default this log file is created in the /tmp directory. Review this file using the second PuTTY
session:
more /tmp/NATION.<STUDENT_ID>.<STUDENT_ID>_DB.nzlog
Notice that the log file contains the load options and the statistics of the load, along
with environment information to identify the table.

End of exercise

Exercise 6. Loading and unloading data using the
nzload utility

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All eight tables in your database have been created.

What this exercise is about


To gain experience loading data into your database tables using the nzload
utility. The nzload command is a SQL CLI client application that allows you to
load data from the local host or a remote client, on all the supported client
platforms. The nzload command processes command-line load options to
send queries to the host to create an external table definition, run the
insert/select query to load data, and when the load completes, drop the
external table. The nzload command is a command-line program that
accepts options from multiple sources, where some of the sources can be
from the:
• Command line
• Control file
• NZ Environment Variables
Without a control file, you can only do one load at a time. Using a control file
allows multiple loads. The nzload command connects to a database with a
user name and password, just like any other PureData System for Analytics
client application. The user name specifies an account with a particular set of
privileges, and the system uses this account to verify access.

What you should be able to do


At the end of the exercise, you should be able to:
• Load the tables with data
• Ensure that the data is loaded and you are able to review it using the
NzAdmin tool
• Review distribution

Exercise instructions
You are an IBM PureData System for Analytics developer and need to load data into your tables,
verify the data is loaded, and check the distribution.
For this section of the lab you continue to use the STUDENT user to load data into the
<student_id>_db database. The nzload utility is used to load records from an external datasource
file into the NATION table. Along with this the nzload log files are reviewed to examine the nzload
options. Since you are loading data into a populated NATION table, you use the truncate table
command to remove the rows from the table.
We continue to use the two PuTTY sessions from the external table lab.
• Session 1 is connected to the NZSQL console to execute SQL commands, for example to
review tables after load operations
• Session 2 is used for operating system commands, to execute nzload commands, view data
files, etc.

6.1. Using the nzload utility with command line options
The first method for using the nzload utility to load data into a table is to specify options at the
command line. You only need to specify the datasource file and use default options for the rest. In
this case, a script containing the nzload commands has been created for you.
__ 1. Using PuTTY session 1, remove the rows in the following tables:
truncate table nation;
truncate table part;
truncate table customer;
truncate table lineitem;
truncate table orders;
truncate table partsupp;
truncate table region;
truncate table supplier;
__ 2. Using NzAdmin on the Desktop, verify that your tables currently have no data.

__ 3. Return to the PuTTY session 2, and ensure you are in your home directory
__ 4. Use the vi editor to see the nzload commands in loadtables.sh.
__ 5. Type :q! to exit the editor without writing to the file.
__ 6. Load data into the tables using the following command:
./loadtables.sh

__ 7. Verify that the data is loaded using NzAdmin.

__ 8. For every load task performed, an associated log file,
<TABLE>.<DB_SCHEMA>.<DATABASE>.nzlog, is created. By default this log file is created in
the current working directory. Type:
more NATION.<STUDENT_ID>.<STUDENT_ID>_DB.nzlog
Notice that the log file contains the Load Options and the statistics of the load, along with
environment information to identify the database and table.

Information

The -db, -u, and -pw options specify the database name, the user, and the password,
respectively. Alternatively, you could omit these options if the NZ environment variables are set to
the appropriate database, username, and password values. Since the NZ environment variables
NZ_DATABASE, NZ_USER, and NZ_PASSWORD are set to system, admin, and password, you need to
use these options so the load runs against the <student_id>_db database as the student user. The
other options are:
• -t specifies the target table name in the database
• -df specifies the datasource file to be loaded
• -delimiter specifies the string to use as the delimiter in an ASCII delimited text file
There are other options that you can use with the nzload utility. These options were not specified
here since the default values were sufficient for this load task.

The following command is equivalent to the nzload command you used above, but it spells out
other options that you can omit when their default values are sufficient. In the next section, you learn
about the -lf, -bf, and -maxErrors options. The -compress and -format options indicate that
the datasource file is an ASCII-delimited text file. For a compressed binary datasource file, you use
-compress true -format internal.
nzload -db <student_id>_db -u <student_id> -pw <student_pwd> -t nation -df
nation_student.unl -delimiter '|' -outputDir '<current directory>' -lf
<table>.<database>.nzlog -bf <table>.<database>.nzbad -compress false -format text
-maxErrors 1
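As an illustration, the following sketch shows how a session could be prepared so that the -db, -u,
and -pw options can be omitted; the variable names follow the NZ_* convention described above, and
the values are placeholders for this lab:
export NZ_DATABASE=<student_id>_db
export NZ_USER=<student_id>
export NZ_PASSWORD=<student_pwd>
nzload -t nation -df nation_student.unl -delim '|'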

6.2. Using the nzload utility with a control file


You learned about running the nzload command with command line options. In this section, you
learn another method for running nzload: specifying the options in a control file. Using a control file
is useful because loading data into a database for a data warehouse environment is a continuous
operation, and the control file can be updated and modified. An nzload control file has the following
basic structure:
DATAFILE <filename> { [<option name> <option value>] ... }
And the -cf option is used at the nzload command line to use a control file:
nzload -u <username> -pw <password> -cf <control file>
The -u and -pw options are optional if the NZ_USER and NZ_PASSWORD environment variables
are set to the appropriate user and password. Using the -u and -pw options overrides the values in
the NZ environment variables. In this section you load rows into an empty NATION table using the
nzload utility with a control file. The control file sets the following options:
• delimiter
• logDir
• logFile
• badFile
• database
• tablename
The nation.del file is the datasource file used in this section.
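For orientation, a control file for this load might look like the following sketch; the file locations and
exact option spellings are illustrative, and the nation.ctl file supplied with the lab may differ in its
details:
DATAFILE /home/<student_id>/nation.del
{
    Database    <student_id>_db
    TableName   nation
    Delimiter   '|'
    LogDir      /home/<student_id>
    LogFile     nation.log
    BadFile     nation.bad
}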
__ 1. In session 1, remove all the rows in the NATION table:
truncate table nation;
__ 2. Using the second PuTTY session, edit the nation.ctl control file. This control file is used
with the nzload utility to load data into the NATION table using the nation.del data file, and
includes the following options:
Database - Database name
Tablename - Table name
Delimiter - Delimiter string
LogFile - Log file name
BadFile - Bad record log file name

__ a. Type the following to open nation.ctl in an editor:
vi nation.ctl
__ b. Use the cursor keys to navigate to the Database line, and position the cursor at the
beginning of the database name.
__ c. Type cw and then type the name of your database, <student_id>_db. The Database line
of your file should now specify <student_id>_db.

__ d. To save your changes, press “Esc” to switch back into command mode.
__ e. Enter :wq! and press Enter to write the file, and quit the editor without any questions.
__ 3. Still in the second session, type:
chmod 755 /export/home/<student_id>
__ 4. Load the data using the nzload utility with the control file you created, and with the following
command line options: -u <user>, -pw <password>, -cf <control file>
nzload -u <student_id> -pw <student_pwd> -cf nation.ctl

__ 5. Check the nzload log, which was renamed from the default to nation.log.
__ 6. Using the first PuTTY session, ensure that the rows were added to the NATION table:
select * from nation;

End of exercise

Exercise 7. Generate statistics

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All eight tables in your database are loaded with the data supplied.

What this exercise is about


To gain experience with Generate Statistics on the database as well as the
tables.

What you should be able to do


At the end of the exercise, you should be able to:
• Generate Statistics on columns
• Generate Statistics on tables

Exercise instructions
As a developer you want to ensure that statistics are maintained and up-to-date so that the
generated Query Plans are optimal. Our first long-running customer query returns the average
order price by customer segment for a given year and order priority. It joins the CUSTOMER table for
the market segment and the ORDERS table for the total price of the order. Due to its restrictive join
conditions, it should not require much processing time.

7.1. Analyzing a query


In this exercise you analyze the following long running customer query:
select c.c_mktsegment, avg(o.o_totalprice) from orders as o, customer as c where
extract(year from o.o_orderdate) = 1996 and o.o_orderpriority = '1-URGENT' group
by c.c_mktsegment;
__ 1. Using a PuTTY session log into the host as <student_id>.
__ 2. Using NZSQL drop the ORDERS and CUSTOMER tables, and then quit NZSQL.
__ 3. Change to the DDL directory and issue the following commands:
nzsql -db <student_id>_db -f customer.sql
nzsql -db <student_id>_db -f orders.sql
__ 4. To load the tables, issue the following commands:
nzload -db <student_id>_db -t orders -df /home/<student_id>/DATA/ORDERS.unl -delim '|' -maxErrors 10

nzload -db <student_id>_db -t customer -df /home/<student_id>/DATA/CUSTOMER.unl -delim '|' -maxErrors 10
__ 5. Log into NZSQL.
__ 6. Look at the two tables and the WHERE conditions to get an idea of the row counts
involved. Our query joins the CUSTOMER table, which has no WHERE condition applied to
it, and the ORDERS table, which has two WHERE conditions restricting it on the date and
order priority. From the data distribution lab we know that the CUSTOMER table has 450,000
rows. To get the number of qualifying rows from the ORDERS table, execute the following
COUNT(*) command (your total may differ from the one shown in the course screen shots):
select count(*) from orders where extract(year from o_orderdate) = 1996 and
o_orderpriority = '1-URGENT';

Information

The IBM PureData System for Analytics optimizer uses statistics about the data in the system to
estimate the number of rows that result from WHERE conditions, joins, etc. Inaccurate estimates
can lead to bad execution plans. For example, a huge result set could be broadcast for a join
instead of redistributing both tables.

__ 7. To see the estimated rows for the WHERE conditions in our query run the following
EXPLAIN command:
explain verbose select count(*) from orders where extract(year from
o_orderdate) = 1996 and o_orderpriority = '1-URGENT';

Scroll up in the output from your command and you should see estimated rows = 450.
The execution plan of this query consists of two nodes. First, the table is scanned and the
WHERE conditions are applied, which can be seen in the Restrictions sub node. Since we
use a COUNT(*) the Projections node is empty. Then, an Aggregation node is applied to
count the rows that are returned by node 1. When we look at the estimated number of rows
we can see that it is not correct. The IBM PureData System for Analytics optimizer
estimates, in this case, from its available statistics, that only 450 rows are returned by the
WHERE conditions; this might not be very accurate.

__ 8. One way to help the optimizer in its estimates is the collection of detailed statistics about the
involved tables. Execute the following command to generate detailed statistics about the
ORDERS table. Since generating full statistics involves a table scan this command might
take several minutes to execute.
generate statistics on orders;
__ 9. Check to see if generating statistics improved the estimates. Execute the EXPLAIN
command, then scroll up in the output and you should see the following:

As you can see, the estimated rows of the SELECT query have improved drastically. The
optimizer now assumes this WHERE condition applies, in this case, to 9000 rows of the
ORDERS table. This is a much better result than the original estimate of 450.
Estimations are difficult to make. Obviously the optimizer cannot do the actual computation during
planning; it relies on current statistics about the involved columns. Statistics include min/max
values, distinct values, numbers of null values etc. Some of these statistics are collected
dynamically, but the most detailed statistics can be generated manually with the generate
statistics command. Generating full statistics after loading a table or changing its content
significantly is one of the most important administrative tasks on IBM PureData System for
Analytics. The appliance automatically generates express statistics after many tasks, such as load
operations, and just-in-time statistics during planning. Nevertheless, full statistics should be
generated on a regular basis.
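As a sketch, statistics can also be generated for specific columns instead of the whole table, which
can be quicker on very wide tables; the column list below is only an illustration for this lab's schema:
generate statistics on orders (o_orderdate, o_orderpriority);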
An estimate, which is what you get in a plan file or the explain output, is just that: an educated
guess. Sometimes it is right on the money, but just as often it is too high or too low.
If there are no restrictions, then we will probably end up using 100% of the rows in the table. In that
case the rowcount estimate should be right on the money and the confidence should be 100%.
With one restriction we are starting to make some educated guesses. Because it is a guess, our
level of confidence in the answer drops to 80%.
With two restrictions the guesses, and the odds of being right, multiply, so our confidence is
now 80% of 80% (or 64% in total).
If the optimizer comes up with one plan with a low cost and a confidence of 100%, versus another
plan with the same cost but a much lower confidence level, it is probably going to choose the plan
with the higher confidence. For any given plan, however, there are dozens of costs, with various
confidence levels associated with each cost.

7.2. Using NzAdmin to generate statistics


In this exercise, you use the NzAdmin tool to generate statistics on one or more tables.
__ 1. Using the NzAdmin tool, open your database and click Tables.
__ 2. Right-click one of the tables and launch Generate Statistics.

__ 3. Choose Generate full statistics, and then select columns of interest and generate.

__ 4. Right-click on the table you generated statistics for, and then choose View Statistics. The
LINEITEM table had statistics generated in a previous step, so its statistics view may differ
from the screen shot shown in the course materials.

End of exercise

Exercise 8. Analyzing query plans

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All eight tables in your database are loaded with the data supplied.

What this exercise is about


This is an opportunity to evaluate your data distribution and work on
improving it. You are able to look at the query performance, generate the
Query Plans for the queries you are performing and analyze them.

What you should be able to do


At the end of the exercise, you should be able to:
• Run predefined queries against the tables you had loaded (RANDOM)
• Make a note of times taken to query
• Edit the queries to generate the Query Plans
• Redistribute the data based on your analysis of the Query Plans
• Rerun the queries to see gains in performance

Exercise instructions
IBM PureData System for Analytics uses a cost-based optimizer to determine the best method for
scan and join operations, join order, and data movement between SPUs (redistribute or broadcast
operations if necessary). For example the planner tries to avoid redistributing large tables because
of the performance impact. The optimizer can also dynamically rewrite queries to improve query
performance. The optimizer takes a SQL query as input and creates a detailed execution or query
plan for the database system. For the optimizer to create the best execution plan that results in the
best performance, it must have the most up-to-date statistics. You can use explain, html (also
known as bubble), and text plans to analyze how the IBM PureData System for Analytics executes
a query.
The explain tool is quite useful to spot and identify performance problems, bad distribution keys,
badly written SQL queries and out-of-date statistics. For example, you query the database and see
that the performance could be improved, but when you look at the distribution of data you see it is
not too skewed. Then, you generate the query plans and analyze them. Based on the results, you
define new distribution criteria and apply those to the existing tables and rerun your queries.
During our proof-of-concept, we have identified a couple of long running customer queries that have
significantly worse performance than the number of rows involved would suggest. In this exercise,
you use explain functionality to identify the concrete bottlenecks and if possible fix them to improve
query performance.

Note

A snippet is a unit of work; it is one distinct C++ program. A snippet could have dozens (or hundreds)
of nodes, steps, or operations. One node might scan the table, another node might do some
aggregations on the data, and another node might sort the results. In that case we are probably
scanning and aggregating at the same time as we go along, whereas we cannot start the sort node
until we have finished the first two steps completely. When processing a table you will always see a:
ScanNode
RestrictNode
ProjectNode
So even though this is listed as three operations, they are really all being done at the same time. We
read a 128 KB page of data, throw away the rows and columns we do not want, and then do some
further processing of the data before we go back to get the next 128 KB page of data. In this case, the
three nodes are basically combined into one operation, most of which occurs in the FPGA itself.

8.1. Identifying join problems


In the last exercise, you looked at the tables involved in a join query and improved optimizer
estimates by generating statistics on the involved tables. Now, you review the complete execution
plan, specifically looking at the distribution and involved join. In this example, there is a query that
does not finish in a reasonable amount of time; it is taking much longer than you would expect from
the data sizes involved.
In this exercise, you analyze why this query takes much longer than expected.
__ 1. Analyze the execution plan for this query using the EXPLAIN VERBOSE command. You
should see something similar to the output below:
explain verbose select c.c_mktsegment, avg(o.o_totalprice) from orders as o,
customer as c where extract(year from o.o_orderdate) = 1996 and
o.o_orderpriority = '1-URGENT' group by c.c_mktsegment;

__ 2. Try to answer the following questions through reviewing the execution plan.
__ a. Which columns of the ORDERS table are used in further computations?
__ b. Is the ORDERS table redistributed, broadcast or can it be joined locally?
__ c. Is the CUSTOMER table redistributed, broadcast or can it be joined locally?

__ d. In which node are the WHERE conditions applied and how many rows does IBM
PureData System for Analytics expect to fulfill the where condition?
__ e. What kind of join takes place and in which node?
__ f. What is the number of estimated rows for the join?
__ g. What is the most expensive node and why?

Hint

A stream operation in IBM PureData System for Analytics explain is a join whose output is not
persisted on disk but streamed to further computation nodes.

__ 3. Review the answers to the questions in step 2.


__ a. Which columns of the ORDERS table are used in further computations?
The first node in the execution plan does a sequential scan of the ORDERS table on the
SPUs. It estimates that 9000 rows are returned, which matches the estimate the optimizer
produced after we generated statistics on the ORDERS table in the previous exercise. The
statement that tells us which columns are used in further
computations is the “Projections:” clause. We can see that only the O_TOTALPRICE
column is carried on from the ORDERS table. All other columns are thrown away. Since
O_TOTALPRICE is a NUMERIC (15,2) column, the returned result set has a width of 8.
Projections:
1:O.O_TOTALPRICE
[SPU Broadcast]
__ b. Is the ORDERS table redistributed, broadcast or can it be joined locally?
During the scan the table is broadcast to the other SPUs. This means that the complete (restricted
and projected) ORDERS result set is assembled on the host and broadcast to each SPU for further
computation of the query. This might seem surprising at first since we have a substantial number of
rows, but since the width of the result set is only 8, we are talking about 9000 rows * 8 bytes = 72 KB.
This is almost nothing for a warehousing system.
-- Estimated Rows = 9000, Width = 8, Cost = 0.0 .. 57.9, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND
(DATE_PART('YEAR'::"VARCHAR",O.O_ORDERDATE) = 1996))
__ c. Is the CUSTOMER table redistributed, broadcast or can it be joined locally?
The second node of the execution plan does a scan of the CUSTOMER table. One column
C_MKTSEGMENT is projected and used in further computations. We cannot see any
distribution or broadcast clauses so this table can be joined locally. This is true because the
ORDERS table is broadcast to all SPUs. If one table of a join is broadcast the other table
does not need any redistribution.
__ d. In which node are the WHERE conditions applied and how many rows does IBM
PureData System for Analytics expect to fulfill the where condition?

We can see in the "Restrictions" clause that the WHERE conditions of our query are applied
during the first node as well. This should be clear since both of the WHERE conditions are
applied to the ORDERS table and they can be executed during the scan of the ORDERS
table. As we can see in the "Estimated Rows" clause, the optimizer estimates a returned set
of 9000 rows, which we know is an underestimate, since in reality tens of thousands of rows
are returned from this table.
__ e. What kind of join takes place and in which node?
The third node of our execution plan contains the join between the two tables. It is a Nested
Loop Join which means that every row of the first join set is compared to each row of the
second join set. If the join condition holds true the joined row is then added to the result set.
This can be a very efficient join for small tables, but for large tables its complexity is
quadratic and therefore generally slower than, for example, a Hash Join. A Hash Join,
though, cannot be used in cases of inequality join conditions, floating point join keys, etc.
[SPU Nested Loop Stream "Node 2" with Temp "Node 1" {}]
-- Estimated Rows = 405000000, Width = 18, Cost = 65097.6 .. 2080409.8, Conf
= 64.0
__ f. What is the number of estimated rows for the join?
We can see in the Estimated Rows clause that the optimizer estimates this join node to
return roughly 4 billion rows, which is the number of rows from the first node times the
number of rows from the second node.
__ g. What is the most expensive node and why?
As we can see from the Cost clause, the optimizer estimates that the SPU Sort, SPU Group
and Host Aggregate nodes have costs in the range 37862669.4 .. 37879544.4. So our
performance problems clearly originate in join Node 3, and the problem continues
throughout the rest of the nodes. So what is happening here? If we take a look at the query
we can assume that it is intended to compute the average order cost per market segment.
This means we should join all customers to their corresponding order rows. But for this to
happen we would need a join condition that joins the CUSTOMER table and the ORDERS
table on the customer key. Instead, the query performs a Cartesian join, joining each
customer row to each orders row. This very work-intensive query results in the behavior we
have seen: the joined result set becomes huge, and the query even returns results that do
not match its apparent intent.
__ 4. Fix this by adding a join condition to the query to make sure that customers are only joined
to their orders. This additional join condition is O.O_CUSTKEY=C.C_CUSTKEY. Execute
the following EXPLAIN command for the modified query.
explain verbose select c.c_mktsegment, avg(o.o_totalprice) from orders as o,
customer as c where extract(year from o.o_orderdate) = 1996 and
o.o_orderpriority = '1-URGENT' and o.o_custkey = c.c_custkey group by
c.c_mktsegment;

You should see something similar to the following results. Scroll up to your query to see the
scan and join nodes.

Note

If you do not get the same query plan, please follow the explanation and bypass the SQL
execution. The cardinality (adjusted) in the plan is the number of expected unique values for the
column and is not the same as the Estimated Rows for the result set. It is used by the optimizer for
determining such things as the number of duplicate rows and potential sorts.

As you can see there have been some changes to the execution plan.
• In Node 1:
- The ORDERS table projections are now O_TOTALPRICE and O_CUSTKEY.

- The O_TOTALPRICE and O_CUSTKEY are broadcast and later used in a hash join.
• In Node 2:
- The CUSTOMER table is scanned with projections of C_MKTSEGMENT and
C_CUSTKEY.
• In Node 3:
- Performs a hash join of the result sets from Node 1 and Node 2 using the customer key.
- The estimated number of rows is now 450,000, which is the same as the number of
customers; since we have a 1:n relationship between customers and orders this is
as we would expect.
- The estimated cost of Node 3 has come down significantly to 57.9 to 80.4.
• In Nodes 4, 5 and 6, there has been a significant reduction in the cost as well.
__ 5. Make sure that the query performance has improved. Switch on the display of elapsed
query time with the following command: \time
__ 6. Execute our modified query:
select c.c_mktsegment, avg(o.o_totalprice) from orders as o,customer as c
where extract(year from o.o_orderdate) = 1996 and o.o_orderpriority =
'1-URGENT' and o.o_custkey = c.c_custkey group by c.c_mktsegment;
__ 7. The results should look similar to this:

Before we made our changes and generated statistics the query took much longer than it
does now. In this relatively simple case we might have been able to pinpoint the problem
through analyzing the SQL on its own. But this can be almost impossible for complicated
multi-join queries that are often used in warehousing. Reporting and BI tools tend to create
very complicated portable SQL as well. In these cases EXPLAIN can be a valuable tool to
pinpoint the problem.

8.2. Using Aginity for analyzing SQL queries


Aginity Workbench is a third-party tool with a user-friendly UI that helps developers manage
IBM PureData System for Analytics more efficiently and with less effort. You can more easily
create and manage objects as well as develop SQL applications.
In this exercise, you analyze three SQL queries using Aginity and the EXPLAIN function.
__ 1. To create and save a connection to your <student_id>_db from Aginity:
__ a. In the Netezza Training Tools folder, double-click the Aginity Workbench for PureData
System for Analytics icon.

__ b. At the Edit connection name window, type any name for your new connection to Aginity,
and then click OK.

__ c. On the Connect to PureData system for Analytics window, enter the following
information:
i. In the Server field, type pok-puredata.clp.local
ii. In the User ID and Password fields, enter your <student_id> and password.
iii. For the Database field, select your <student_id>_db from the drop-down list.
iv. Ensure the Netezza ODBC driver and NOT the Netezza OLEDB driver is
selected.

Your window should look similar to this:

v. Click Save.
__ 2. To connect to your <student_id>_db, click OK.

__ 3. In PuTTY, use the following command to change to the SQL directory:
cd SQL
__ 4. Ensure your SQL directory contains the following files:
• query1.sql
• query2.sql
• query3.sql
__ 5. View each of the queries using the more command, for example:
more query1.sql

__ 6. For each query, copy the SQL and paste it into a separate Aginity Workbench Query
tab.

__ 7. To run each query:


__ a. Ensure you are connected to your <student_id>_db database.
__ b. Position the cursor at the beginning of the statement.

Important

Before you execute a query, ensure that your cursor is positioned at the beginning of the statement.
Also, ensure that the Database drop-down box shows the correct database, for example
<student_id>_db. If it does not show the correct database, all SQL is executed against the wrong
database and errors occur.

__ c. Click Execute. Keep a note of the time and keep the query tabs open.

Query 1 Time: _______________________


Query 2 Time: _______________________
Query 3 Time: _______________________

Important

Note which tables each query uses because you need this information for the following exercise.

__ 8. In Aginity, for each query add explain verbose before the select clause, and execute the
queries so that you can analyze the Query Plans.

__ 9. In PuTTY, use the CREATE TABLE AS (CTAS) command to redistribute the tables on better
distribution keys.
__ a. Connect to your database.
__ b. Choose distribution keys based on the query/joins to minimize the data movement.
__ c. After creating the CTAS tables, drop the base tables and rename the CTAS tables to the
original base table names.

__ d. Repeat these steps for each table referenced in the queries. For example, for the
CUSTOMER table:

create table customer_ctas as select * from customer distribute on (c_custkey);

drop table customer;

alter table customer_ctas rename to customer;

__ 10. Remove explain verbose and execute the queries. What do you see now for the query
times?
Query 1 Time: _______________________
Query 2 Time: _______________________
Query 3 Time: _______________________

End of exercise

Exercise 9. Zone maps and clustered base tables

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• The LINEITEM table in your database has been loaded with the data
supplied.
• You have completed exercise 8

What this exercise is about


To observe how zone maps and clustered base tables (CBTs) impact query
performance.

What you should be able to do


At the end of the exercise, you should be able to:
• Order the data to take advantage of Zone Maps
• Organize the data in a table and convert a base table to a CBT
• Compare query performances between:
- Perfectly-ordered data (Zone Maps vs CBTs)
- Nearly-ordered data (Zone Maps vs CBTs)
- A regular table and CBT (Query on non zone-mappable fields)


Exercise instructions
We have received a set of new customer queries on the ORDERS table that not only restrict the
table by order date but also access only orders in a given price range. These queries make up a
significant part of the system workload, so we look into ways to increase their performance.
The following query is a template for the queries in question. It returns the aggregated total price of
all orders by order priority for a given year (in this case 1996) and price range (in this case between
150000 and 180000).
select o_orderpriority, sum(o_totalprice) from orders where extract(year from
o_orderdate) = 1996 and o_totalprice > 150000 and o_totalprice <= 180000 group
by o_orderpriority;
In this example, there is a significantly restrictive WHERE condition on two columns
O_ORDERDATE and O_TOTALPRICE, which can help us to increase performance. The ORDERS
table has around 220,000 rows with an order date in 1996 and 160,000 rows in the given price
range, but only about 20,000 rows that satisfy both conditions. Materialized views provide their
main performance improvement on a single ordering column. Also, INSERTs to the ORDERS table are frequent
and time critical; therefore, you would not want to use materialized views. This exercise investigates
the use of clustered base tables.
Clustered base tables are IBM PureData System for Analytics tables that are created with an
ORGANIZE ON keyword. They use a special space filling algorithm to organize a table by up to 4
columns. Zone maps for a clustered base table provide approximately the same performance
increases for all organization columns. This is useful if your query restricts a table on more than one
column or if your workload consists of multiple queries hitting the same table using different
columns in WHERE conditions. In contrast to materialized views no additional disk space is
needed, since the base table itself is reordered.
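As a quick illustration of the syntax, the following sketch creates a hypothetical CBT (the table name ORDERS_ORGANIZED is not part of the lab; the DDL you actually use is supplied in the scripts below) and then grooms it, which is the step that physically reorders the data:
create table orders_organized as select * from orders organize on (o_orderdate, o_totalprice);
groom table orders_organized;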


9.1. Using clustered base tables


Clustered base tables are created like normal IBM PureData System for Analytics database tables.
They need to be flagged as a CBT during table creation by specifying up to four organization
columns. An IBM PureData System for Analytics table can be altered at any time to become a
clustered base table as well.
To create a new clustered base table called ORDERS_CBT, use the create table command for
ORDERS, as follows.
__ 1. Exit the NZSQL console by executing the \q command.
__ 2. Switch to the DDL directory by executing the following command:
cd /home/<student_id>/DDL
__ 3. You can use an existing script to create the ORDERS_CBT table. Review the
orders_cbt.sql script:
more orders_cbt.sql

__ 4. Go back up one level in the directories by typing cd ..


__ 5. Create and load the orders_cbt table by executing the following command:
./create_orders_cbt.sh
__ 6. To see how IBM PureData System for Analytics has organized the data in this table, you
use the nz_zonemap utility. Execute the following command:
nz_zonemap <student_id>_db orders_cbt
This command shows you the zone mappable columns of the ORDERS_CBT table. If you
compare it with the output below of the nz_zonemap tool for the ORDERS table, you see
that the ORDERS_CBT table contains the additional column O_TOTALPRICE. Numeric
columns are not zone mapped by default for performance reasons, but zone maps are
created for them if they are part of the organization columns. By default, nz_zonemap only


shows you the first data slice; this can be changed to a different data slice using a switch in
the command.

__ 7. To see the zone map values of the O_ORDERDATE and O_TOTALPRICE columns, execute the
following command:
nz_zonemap <student_id>_db orders_cbt o_orderdate o_totalprice


You should get something similar to the following results:

This is unexpected. Since you used O_ORDERDATE and O_TOTALPRICE as organizing
columns, you would expect some kind of order in the data values. But they are distributed
equally over six extents. The reason for this is that the organization process takes place
during a command called groom.
Instead of creating a new table you could also have altered the existing ORDERS table to
become a clustered base table. Creating or altering a table to become a clustered base
table does not actually change the physical table layout until the groom command has been
used.
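For reference, the ALTER TABLE route for the existing ORDERS table would look something like the following sketch (shown for illustration only; do not run it against your lab tables, because later exercises rely on the tables as created):
alter table orders organize on (o_orderdate, o_totalprice);
groom table orders;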

Note

If the "Sort" column shows "true", the (min) value in this extent is greater than or equal to the
(max) value of the previous extent, which indicates that the data is sorted on disk in ascending
order, giving optimal zone map usage and performance. Because a descending sort order performs
just as well as an ascending sort order for zone maps, the column also shows "true" if the (max)
value in this extent is less than or equal to the (min) value of the previous extent.
The groom command is covered in detail in a following presentation and exercise; but you use it in
the next section to reorganize the table.


9.2. Maintaining clustered base tables


When a table is created as a clustered base table in IBM PureData System for Analytics, the data is
not actually organized during load time. Also similar to ordered materialized views a clustered base
table can become partially unordered due to INSERTS, UPDATES and DELETES. A threshold is
defined for reorganization and the groom command can be used at any time to reorganize a
clustered base table, based on its organization keys.
__ 1. To organize the table you created in the last section, switch to the NZSQL console.
__ 2. To groom your clustered base table, execute the following command:
groom table orders_cbt;
This command does a variety of things and is covered in a further presentation and
exercise. In this case it organizes the clustered base table based on its organization keys.
__ 3. To look at the data organization in the table, quit the NZSQL console with the \q command.
__ 4. Review the zone maps of the two organization columns by executing the following
command:
nz_zonemap <student_id>_db orders_cbt o_orderdate o_totalprice
Your results should look like the following:

You can see that both columns have some form of order now. The query is restricting rows
in two ranges:
Condition 1: O_ORDERDATE = 1996
AND
Condition 2: 150000 < O_TOTALPRICE <= 180000
There are now 2 extents that have rows from 1996 in them and 1 extent that contains rows
in the price range from 150000 to 180000; this is the only extent that satisfies both conditions
and needs to be scanned during query execution.


The new NPS7 architecture has much finer-grained extents. In this case there are 48 in total
and only a limited number need to be read:
• 20 extents that might have O_ORDERDATE in 1996
• 14 extents that might have O_TOTALPRICE between 150000 and 180000
• 2 extents for which both conditions apply
This means that by using CBTs in the new NPS7 architecture we can restrict the amount of data
that needs to be scanned by a factor of 16. This is 3 to 5 times less than what would need to be
read if the table were ordered on a single column only.
In this exercise, you created a clustered base table and used the groom command to organize it.
Throughout the exercise, you used the nz_zonemap tool to see zone maps and get a better idea on
how data is stored in the IBM PureData System for Analytics.


9.3. Comparing a regular table with a CBT


In this exercise you compare the performance of the LINEITEM table with a CBT-version of the
LINEITEM table.
__ 1. Create an ordered table and a CBT from the same base table LINEITEM using Aginity:
create table ag_ord_li as select * from lineitem order by l_orderkey,
l_partkey;

create table ag_cbt_li as select * from lineitem organize on (l_orderkey,
l_partkey);

groom table ag_cbt_li;


__ 2. Query both tables using restrictions on the l_orderkey column to see the performance
difference between perfectly-ordered data and clustered data:
select l_partkey, l_shipinstruct from ag_ord_li where l_orderkey=100007;

select l_partkey, l_shipinstruct from ag_cbt_li where l_orderkey=100007;


__ 3. Query using restrictions on the l_partkey column to see the performance difference between
nearly-ordered data and clustered data:
select l_orderkey, l_partkey, l_shipinstruct from ag_ord_li where
l_partkey=173466;

select l_orderkey, l_partkey, l_shipinstruct from ag_cbt_li where
l_partkey=173466;


9.4. Comparing a regular table with a CBT restricting non zone-mappable fields

In this exercise, you compare the performance between an ordered table and a CBT using
restrictions on non zone-mappable columns.
__ 1. Create a CBT using l_shipinstruct column to organize the data and groom the table:
create table li_cbt_SI as select * from lineitem organize on
(l_shipinstruct);

groom table li_cbt_SI;


__ 2. Query using restrictions on l_shipinstruct column:
select count(*) from lineitem where l_shipinstruct = 'COLLECT COD';

select count(*) from li_cbt_SI where l_shipinstruct = 'COLLECT COD';

End of exercise

Exercise 10. Materialized views

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• You redistributed the tables by making good choices and saw
performance improvements.

What this exercise is about


An IBM PureData System for Analytics is an appliance designed to provide
excellent performance in most cases without any specific tuning or index
creation. One of the key technologies used to achieve this is zone maps:
automatically computed and maintained records of the data that is inside
the extents of a database table. In general, data is loaded into data
warehouses ordered by the time dimension; therefore zone maps have the
biggest performance impact on queries that restrict the time dimension as
well. This approach works well for most situations, but IBM PureData System
for Analytics provides additional functionality to enhance specific workloads,
which we use in this chapter. We use materialized views to enhance the
performance of database queries against wide tables and of queries that
only look up small subsets of columns. In the clustered base table exercise,
we use CBTs to enhance the performance of queries that use multiple lookup
dimensions.

What you should be able to do


At the end of the exercise, you should be able to:
• Create materialized view from existing tables and utilize them to improve
query performance


Exercise instructions
A materialized view is a view of a database table that projects a subset of the base table’s columns
into a physical manifestation. It can be sorted on a specific set of the projected columns. When a
materialized view is created, the sorted projection of the base table’s data is stored in a physical
table on disk. Materialized views reduce the width of data being scanned in a base table. They are
beneficial for wide tables that contain many columns (for example, 50 to 500 columns) where typical queries
only reference a small subset of the columns. Materialized views also provide fast, single or few
record lookup operations. The thin materialized view is automatically substituted by the optimizer
for the base table, allowing faster response, particularly for shorter tactical queries that examine
only a small segment of the overall database table.
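The basic shape of the DDL is sketched below using placeholder names; the concrete views you build in this exercise follow in the numbered steps:
create materialized view <view_name> as
select <col1>, <col2> from <base_table> order by <col1>;
Only the projected columns are stored in the view, and the optional ORDER BY clause controls how that projection is sorted on disk.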
In the last few exercises, you recreated a customer database in our IBM PureData System for
Analytics, picked distribution keys, loaded the data, and made some initial performance
investigations. In this exercise, you look deeper into some customer queries and try to enhance
their performance by tuning the system. You see acceptable levels of performance from your IBM
PureData System for Analytics; however, you can improve performance still further by using
materialized views for enhancements in SQL selection criteria.

Figure 10-1. Student_DB database entity relationship diagram


10.1. Creating a materialized view to reduce data width


In this exercise, you create a materialized view referencing a subset of the table’s columns to
improve performance. You use the LINEITEM table because it has a large number of columns.
__ 1. Using a PuTTY session connect to the host using your student id and password.
__ 2. From your home directory, execute the following script:
./lineitem.sh

Note

If you are still in the SQL directory from the previous exercise, type cd .. to return to your home
directory.

__ 3. Type in nzsql.
__ 4. Make sure table statistics have been generated so that more accurate estimated query
costs can be reported by explain commands we look at. Generate statistics for the
ORDERS and LINEITEM tables using the following commands:
generate statistics on orders;
generate statistics on lineitem;
__ 5. Execute the following query which computes the total quantity of items shipped and their
average tax rate for a given month, which in this case is the fourth month or April:
\time

select l_shipdate, sum(l_quantity), avg(l_tax) from lineitem where
extract(month from l_shipdate) = 4 group by 1;
The result should look similar to the output below which has been reduced here for clarity:

Note

Notice the extract(month from l_shipdate) command. The extract command can be used to retrieve
parts of a date or time column like year, month or day.
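For example, the following hypothetical query (not one of the lab steps) pulls the year, month, and day out of the same column:
select extract(year from l_shipdate), extract(month from l_shipdate),
extract(day from l_shipdate) from lineitem limit 5;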


__ 6. To get the projected cost from the optimizer, use the following explain verbose command:
explain verbose select l_shipdate, sum(l_quantity), avg(l_tax) from lineitem
where extract(month from l_shipdate) = 4 group by 1;
You see a long output on the screen. Scroll up until you reach the command you just
executed:

__ 7. Since this query is run frequently, we want to enhance its scanning performance, and since
it only uses 3 of the 16 LINEITEM columns, we have decided to create a materialized view
covering these three columns. This should significantly increase scan speed since only a
small subset of the data needs to be scanned. To create the materialized view
THINLINEITEM execute the following command. This command can take several minutes
since we effectively create a copy of the three columns of the table:
create materialized view thinlineitem as select l_shipdate, l_quantity, l_tax
from lineitem order by l_shipdate;


__ 8. Repeat the explain call from step 6. The results should now look like the following:
QUERY VERBOSE PLAN:

Node 1.
[SPU Sequential Scan mview "_MTHINLINEITEM" {(LINEITEM.L_SHIPDATE)}]
-- Estimated Rows = 359932, Width = 16, Cost = 0.0 .. 128.9, Conf = 80.0
Restrictions:
(DATE_PART('MONTH'::"VARCHAR", LINEITEM.L_SHIPDATE) = 4)
Projections:
1:LINEITEM.L_QUANTITY 2:LINEITEM.L_TAX
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 32, Cost = 133.4 .. 133.4, Conf = 0.0
Projections:
1:SUM(LINEITEM.L_QUANTITY)
2:(SUM(LINEITEM.L_TAX) / "NUMERIC"(COUNT(LINEITEM.L_TAX)))
[SPU Return]
[HOST Merge Aggs]
[Host Return]

QUERY PLANTEXT:

Aggregate (cost=133.4..133.4 rows=1 width=32 conf=0)


(xpath_none, locus=spu subject=self)
(spu_send, locus=host subject=self)
(host_merge_agg, locus=host subject=self)
(host_return, locus=host subject=self)
l: Sequential Scan mview "_MTHINLINEITEM" (cost=0.0..128.9 rows=359932
width=16 conf=80) {(LINEITEM.L_ORDERKEY)}
(xpath_none, locus=spu subject=self)

Note

Notice that the IBM PureData System for Analytics optimizer has automatically replaced the
LINEITEM table with the view THINLINEITEM. We didn’t need to make any changes to the query.
Also notice that the estimated cost has been reduced to a small fraction (less than 10%) of the original.
It is possible you will not get the same result as above due to the different workloads on the system
and the different SPU activities. If this is the case examine the output above and discuss with your
instructor.

As you have seen in cases where you have wide database tables, with queries only touching a
subset of them, a materialized view of the hot columns can significantly increase performance for
these queries, without any changes to the executed queries.


10.2. Using a materialized view as an index


Materialized views not only reduce the width of tables, but they can also be used in a similar way to
indexes to increase the speed of queries by only accessing a very limited set of rows.
In this exercise, you create a materialized view referencing a subset of the table’s rows to improve
performance. You again use the LINEITEM table.
__ 1. Drop the view we used in the last chapter with the following command:
drop view thinlineitem;
__ 2. The following command returns the number of returned shipments vs. total shipments for a
specific shipping day. Execute the following command. You should have a similar result to
the output shown below; your row counts may differ but that is not relevant:
select sum(case when l_returnflag <> 'N' then 1 else 0 end) as ret, count(*)
as total from lineitem where l_shipdate='1995-06-15';

Note

You can see that on 15 June 1995 only a fraction of the total shipments were
returned. Notice the use of the CASE statement to change the L_RETURNFLAG
column into a 0-1 value, which is easily countable.

__ 3. Look at the underlying data distribution of the LINEITEM table and its zone map values. To
do this exit the NZSQL console by executing the \q command.
__ 4. The IBM PureData System for Analytics support tools are installed in the /nz directory. One
of these tools is the nz_zonemap tool that returns detailed information about the zone map
values associated with a given database table. To look at the zone mappable columns of
the LINEITEM table, execute the following command:
nz_zonemap <student_id>_db lineitem


The results you see should be similar to this:

This command returns an overview of the zone mappable columns of the LINEITEM table in
the <student_id>_db database. Seven of the sixteen columns have zone maps created for
them. Zone mappable columns include integer and date data types. We see that the
L_SHIPDATE column we have in the WHERE condition of the customer query is zone
mappable.

Note

The support tools are available as an installation package in /nz on your IBM PureData System for
Analytics or you can obtain them from IBM support.

__ 5. To look at the zone map values for the L_SHIPDATE column, execute the following
command. This command returns a list of all extents that make up the LINEITEM table and
the minimum and maximum values of the data in the L_SHIPDATE column for each extent.
nz_zonemap <student_id>_db lineitem l_shipdate


You should get a result similar to the following:

This is the actual output from the LINEITEM table. There are 21 rows, which means the table
consists of 21 extents. We can also see the minimum and maximum values for the
L_SHIPDATE column in each extent. These values are stored in the zone map and
automatically updated when rows are inserted, updated or deleted. If a query has a where
condition on the L_SHIPDATE column that falls outside of the data range of an extent, the
whole extent can be discarded by IBM PureData System for Analytics without scanning it.
In this case, the data has been distributed across the 21 extents. This means that our query,
which has a WHERE condition on 15 June 1995, does not benefit from the zone maps
and requires a full table scan.
__ 6. Return to the NZSQL command interface.
__ 7. To create a materialized view that is ordered on the L_SHIPDATE column, execute the
following command:
create materialized view shiplineitem as select l_shipdate from lineitem
order by l_shipdate;

Note

Note that our customer query has a WHERE condition on the L_SHIPDATE column but aggregates
the L_RETURNFLAG column. Nevertheless we didn’t add the L_RETURNFLAG column to the
materialized view. We could have done it to enhance the performance of our specific query even
more. But in this case we assume that there are lots of customer queries which are restricted on the
ship date and access different columns of the LINEITEM table. A materialized view retains the
information about the location of a parent row in the base table and can be used for lookups even if
columns of the parent table are accessed in the SELECT clause.

You can specify more than one order column. In that case, rows are ordered by the first
column; where the first column has equal values, the next column is used to order those
rows, and so on. In general, only the first order column provides a significant
impact on performance.
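For instance, a two-column ordering would be declared as in the following sketch (SHIPLINEITEM2 is a hypothetical name, not one of the views used in this lab):
create materialized view shiplineitem2 as select l_shipdate, l_partkey from lineitem
order by l_shipdate, l_partkey;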
__ 8. Look at the zone map of the newly created view. Leave the NZSQL console again with the
\q command.
__ 9. Display the zone map values of the materialized view SHIPLINEITEM with the following
command:
nz_zonemap <student_id>_db shiplineitem l_shipdate
The results should look something similar to the following:

Notice the materialized view is significantly smaller than the base table. In this case the
number of rows returned reduced from 21 in the original to 6. Also notice that the data
values in the extent are ordered on the L_SHIPDATE column. This means that for our query,
which is accessing data from the 15th June of 1995, only extent 3 needs to be accessed at
all, since only this extent has a data range that contains this date value.
__ 10. Return to the NZSQL command interface.


__ 11. Use the explain command again to verify that our materialized view is used by the optimizer:
explain verbose select sum(case when l_returnflag <> 'N' then 1 else 0 end)
as ret, count(*) as total from lineitem where l_shipdate='1995-06-15';
You see a long text output; scroll up until you find the command you just executed. Your
result should look like the following:
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM"
{(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 24, Cost = 62.2 .. 62.2, Conf = 0.0
Projections:
1:SUM(CASE WHEN (LINEITEM.L_RETURNFLAG <> 'N'::BPCHAR) THEN 1 ELSE 0 END)
2:COUNT(*)
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...

Note

Notice that the Optimizer has automatically changed the table scan to a scan of the view
SHIPLINEITEM we just created. This is possible even though the projection is taking place on
column L_RETURNFLAG of the base table.

__ 12. In some cases, for example for troubleshooting or administrative tasks on the base table, you
might want to disable or suspend an associated materialized view. In these cases, use the following
command to suspend the view:
alter view shiplineitem materialize suspend;
__ 13. To make sure that the view is not used anymore during query execution, execute the
EXPLAIN command for our query again. With the view suspended we can see that the
optimizer again scans the original table LINEITEM:
explain verbose select sum(case when l_returnflag <> 'N' then 1 else 0 end)
as ret, count(*) as total from lineitem where l_shipdate='1995-06-15';


You see a long text output; scroll up until you find the command you just executed. Your
result should look like the following:
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "LINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 60012, Width = 1, Cost = 0.0 .. 2417.5, Conf = 80.0
Restrictions:
...

Note

Note that we have only suspended the view, not dropped it.

__ 14. Reactivate the view with the following refresh command. This command can also be used to
reorder materialized views in case the base table has been changed. While INSERTs,
UPDATEs and DELETEs into the base table are automatically reflected in associated
materialized views, the view is not reordered for every change. Therefore it is advisable to
refresh them periodically especially after major changes to the base table:
alter view shiplineitem materialize refresh;
__ 15. To check that the optimizer again uses the materialized view for query execution, execute
the following command. Make sure that the optimizer again uses the materialized view for its
first scan operation.
explain verbose select sum(case when l_returnflag <> 'N' then 1 else 0 end)
as ret, count(*) as total from lineitem where l_shipdate='1995-06-15';
You see a long text output; scroll up until you find the command you just executed. Your
result should look like the following:

QUERY VERBOSE PLAN:


Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM"
{(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
...
You have just created a materialized view to speed up queries that look up small numbers of rows.
A materialized view can provide a significant performance improvement and is transparent to end
users and applications accessing the database. However, a materialized view also creates
additional overhead during INSERTs, UPDATEs and DELETEs, requires additional hard disk space


and might require regular maintenance. Therefore materialized views should be used sparingly. In
the next chapter, you learn an alternative approach to speed up scan speeds on a database table.

End of exercise

Exercise 11. Transactions and Truncate table

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All the eight tables in your database are loaded with the data supplied.

What this exercise is about


To gain experience with TRUNCATE TABLE and transactions. We go
through the different transaction types and show you what happens under
the covers in an IBM PureData System for Analytics.

What you should be able to do


At the end of the exercise, you should be able to:
• Truncate a table within a transaction while another, concurrent,
transaction is working on the same table.
• Change the current schema within a transaction


Exercise instructions
In this section, you learn how transactions can leave logically deleted rows in a table, which later
need to be removed with the groom command as an administrative task. You need to truncate a table
in a transaction, where the table is in a different schema from the session's current schema, while
another transaction is concurrently inserting into the same table.


11.1. Inserting a transaction


In this chapter, you add a new row to the regions table and review the hidden fields that are saved
in the database. As you remember from the Transactions presentation, IBM PureData System for
Analytics uses a concept called multi-versioning for transactions. Each transaction has its own
image of the table and does not influence other transactions. This is done by adding a number of
hidden fields to the IBM PureData System for Analytics table. The most important ones are the
CREATEXID and the DELETEXID. Each IBM PureData System for Analytics transaction has a
unique transaction id that is increasing with each new transaction. In this subsection, you add a new
row to the REGION table.
__ 1. Using a PuTTY session connect to the host as <student_id>.
__ 2. Open an NZSQL console session and connect to the <student_id>_db database
__ 3. Select all rows from the REGION table:
select * from region;

__ 4. Insert a new row into the REGION table for the region Australia with the following SQL
command:
insert into region values (5, 'as', 'AUSTRALIA');
__ 5. Do a select on the REGION table, but this time you query the hidden fields CREATEXID,
DELETEXID and ROWID:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;


As you can see, there are now six rows in the REGION table. The new row for Australia has
the id of the last transaction as CREATEXID and “0” as DELETEXID since it has not yet
been deleted. There could be transactions with a lower transaction id than yours running
concurrently, they will not see this new row. Note also that each row has a unique rowid.
Rowids do not need to be consecutive, but they are unique across all dataslices for one
table.


11.2. Updating and deleting transactions


Delete transactions in IBM PureData System for Analytics do not physically remove rows but
update the DELETEXID field of a row to mark it as logically deleted. These logically deleted rows
need to be removed regularly with the administrative Groom command. Update transactions in IBM
PureData System for Analytics consist of a logical delete of the old row and an insert of a new row
with the updated fields.
To show this effectively, you change a system parameter in IBM PureData System for Analytics that
allows you to turn off invisibility lists so that you can see all rows; this is already done in the training
system. Note that this parameter is dangerous and should not be used in a real IBM PureData
System for Analytics environment.

Note

In a real IBM PureData System for Analytics changing system configuration parameters can be a
very dangerous thing that is normally not advisable without IBM PureData System for Analytics
service support.

__ 1. If you exited nzsql after 11.1 then re-enter the NZSQL console and connect to the
<student_id>_db database.
__ 2. To disable invisibility lists and show all records, run the following command:
set show_deleted_records=true;
__ 3. Update the row you inserted in the last section to the REGION table:
update region set r_comment='Australia' where r_regionkey=5;
__ 4. Do a select on the REGION table again:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;
You should see something similar to the following output:


Normally, you would now see 6 rows with the updated value; since the invisibility lists are
disabled, you now see 7 rows in the REGION table. The old "AUSTRALIA" row now has the id of
the transaction that updated it in its DELETEXID column. Transactions with a higher
transaction id do not see a row with a deletexid, which indicates that it has been logically
deleted before the transaction is run. You can also see a newly inserted row with the new
comment value ‘Australia’; it has the same rowid as the deleted row and the same
CREATEXID as the transaction that did the insert.
__ 5. Clean up the table again by deleting the Australia row:
delete from region where r_regionkey=5;
__ 6. Do a select on the REGION table again:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;
You should see something similar to the following output:

You can see the updated row was logically deleted as well; it now has a DELETEXID field
with the value of the new transaction. Normally, the logically deleted rows are filtered out
automatically by the FPGA. If you do a select, the FPGA removes all rows that have a:
• CREATEXID which is bigger than the current transaction id.
• CREATEXID of an uncommitted transaction.
• DELETEXID which is smaller than the current transaction id, but only if the transaction of
the DELETEXID field is committed.
• DELETEXID of 1 which means that the insert has been aborted.


11.3. Aborting transactions


IBM PureData System for Analytics never deletes a row during transactions even if transactions are
rolled back. In this section we show what happens if a transaction is rolled back. Since an update
transaction consists of a delete and an insert, we demonstrate the behavior for all three
transaction types with this example.
__ 1. To start a transaction that we can later roll back, we need to use the BEGIN keyword:
begin;
By default, all SQL statements entered into the NZSQL console are auto-committed. To
start a multi-command transaction, the BEGIN keyword needs to be used. All SQL
statements that are executed after it belong to a single transaction. To end the transaction,
two keywords can be used: COMMIT to commit the transaction, or ROLLBACK to roll back
the transaction and all changes made since the BEGIN statement was executed.
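As a minimal illustration of the pattern (not one of the lab steps; the region key 6 and its values are made up), a committed multi-statement transaction looks like this:
begin;
insert into region values (6, 'an', 'ANTARCTICA');
commit;
Replacing commit with rollback would discard the insert along with any other changes made since the begin.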
__ 2. Update the row for the MIDDLE EAST region:
update region set r_comment='AP' where r_regionkey=4;
__ 3. Do a select on the REGION table again:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;
You should see the following output:

Note that you have the same results as in the last section: the original row for the MIDDLE
EAST region was logically deleted by updating its DELETEXID field, and a new row with the
updated comment and a new rowid has been added. Note that its CREATEXID is the same as
the DELETEXID of the old row, since both were set by the same transaction.
__ 4. Rollback the transaction:
rollback;


__ 5. Do a select on the REGION table again:


select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;
You should see the following output:

We can see that the transaction has been rolled back. The DELETEXID of the old version of
the row has been reset to 0, which means that it is a valid row that can be seen by other
transactions, and the DELETEXID of the new row has been set to 1 which marks it as
aborted.


11.4. Cleaning up with GROOM


In this section, you use the Groom command to remove the logically deleted rows you have entered
and remove the system parameter from the configuration file. You use the Groom command in
more detail in the next chapter. It is the main maintenance command in IBM PureData System for
Analytics and we have already used it in the Cluster Based Table labs to reorder a CBT. It also
removes all logically deleted rows from a table and frees up the space on the machine again.
__ 1. Execute the Groom command on the REGION table:
groom table region;
__ 2. You should see the following result:

You can see that the groom command purged 3 rows, exactly the number of aborted and
logically deleted rows we have generated in the previous chapter.
__ 3. Do a select on the REGION table again:
select createxid, deletexid, rowid, r_regionkey, rtrim(r_name) as r_name,
r_comment from region;

You can see that the groom command has removed all logically deleted rows from the table.
Remember that we still have the parameter switched on that allows us to see any logically
deleted rows. Especially in tables that are heavily changed with a lot of updates and deletes,
running the groom command frees up hard drive space and increases performance.


11.5. Understanding concurrent select and truncate


This exercise helps you to understand the implications of concurrently running a SELECT
transaction with a TRUNCATE transaction. To demonstrate this, you select from a table in a
transaction. However, another transaction is currently trying to truncate the same table. In this case,
the select should show longer response times while the truncate is running.
__ 1. Open two PuTTY sessions and log in on both.
__ 2. In session 1, start NZSQL which takes you directly into your <student_id>_db.
__ 3. Type in \time.
__ 4. In session 2, start NZSQL which takes you directly into the <student_id>_db
__ 5. In session 2, change to the <student_id>_2 schema and create a new table called
LINEITEM using the following commands:
set schema <student_id>_2;

create table lineitem as select * from <student_id>.lineitem;

__ 6. In session 1, execute the following command and make a note of the elapsed time to do the
count:
select count(*) from <student_id>_2.lineitem;
__ 7. In session 2, exit NZSQL, and navigate to the SQL directory:
cd SQL
__ 8. Open the lineitem.sql file in an editor:
vi lineitem.sql
__ 9. Change the following:
• For set schema student2, change student2 to your new schema,
<student_id>_2
• In the three lines containing student.lineitem, change student to <student_id>


__ 10. Save your changes and exit the editor.

__ 11. In session 2, return to your home directory and execute the script ./lineitem.sh.


__ 12. While lineitem.sh is running in session 2, execute the following SQL select statement in
session 1 several times, before and after the TRUNCATE executes in session 2.
select count(*) from <student_id>_2.lineitem;


__ 13. Once both sessions have completed execution, review the different times for the select
statement in session 1. The last execution should have a longer elapsed time due to the
wait time caused by the concurrent TRUNCATE command.


11.6. Understanding concurrent insert and truncate


This exercise helps you to understand the implications of concurrently running an INSERT
transaction with a TRUNCATE transaction. To demonstrate this, you insert into a table in a
transaction. However, another transaction is currently trying to truncate the same table. The truncate
should error out due to the concurrent insert activity.
__ 1. Open two PuTTY sessions and log in as <student_id> on both.
__ 2. In session 1, exit NZSQL, and navigate to the SQL directory:
cd SQL
__ 3. Open the lineitem2.sql file in an editor:
vi lineitem2.sql
__ 4. Change the following:
• For set schema student2, change student2 to your new schema,
<student_id>_2
• In the two lines containing student.lineitem, change student to <student_id>
__ 5. Save your changes and exit the editor.
__ 6. Open the truncate.sql file in an editor:
vi truncate.sql
__ 7. In the set schema student2 line, change student2 to your new schema,
<student_id>_2

__ 8. Save your changes and exit the editor.


__ 9. In session 1, return to your home directory and type ./lineitem2.sh to start execution.
__ 10. In session 2, type ./truncate.sh to start execution.


__ 11. In session 1, after completion of ./lineitem2.sh you should see the following. Note that the
number of rows might vary but that is not relevant.


__ 12. In session 2, after completion of ./truncate.sh, you should see the following. Note that the
number of rows might vary but that is not relevant.

End of exercise

Exercise 12. GROOM

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• All the eight tables in your database are loaded with the data supplied.

What this exercise is about


As part of your routine database maintenance activities, you should plan to
recover disk space occupied by outdated or deleted rows. In normal IBM
PureData System for Analytics operation, an UPDATE or DELETE of a table
row does not remove the physical row on the hard disk. Instead, the old row is
marked as deleted, together with the transaction id of the deleting transaction,
and in the case of an update a new row is created. This approach is called
multi-versioning. Rows that could potentially be visible to other transactions
with an older transaction id are still accessible. Over time however, the
outdated or deleted rows are of no interest to any transaction anymore and
need to be removed to free up hard disk space and improve performance.
After the rows have been captured in a backup, you can reclaim the space
they occupy using the SQL GROOM TABLE command. In this lab we explore
the GROOM command. The GROOM TABLE command does not lock a
table while it is running; you can continue to SELECT, UPDATE, and INSERT
into the table while the table is being groomed.
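For orientation, the most common forms of the command are sketched below; the descriptions that follow are a brief summary, so check the documentation for your NPS release before relying on them:
groom table orders;
groom table orders versions;
groom table orders records all;
The first form reclaims logically deleted rows and reorganizes a clustered base table on its organization keys; the versions form additionally merges older table versions left behind by ALTER TABLE changes; and records all reclaims all deleted records, including those not yet captured by a backup.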

What you should be able to do


At the end of the exercise, you should be able to:
• Use the GROOM command

Note

The groom command can also be run using the nzreclaim script (nzreclaim is a wrapper around
groom with many command line options). You might be familiar with nzreclaim from earlier releases.


Exercise instructions
Your database is getting larger and you need to save on some space. You realize that you make
updates and deletes in this database and the IBM PureData System for Analytics does not really
remove the data from the database; it simply hides data by flagging it. Essentially, the system is
using up much-needed space; therefore, you need to use groom to reclaim the unused space.


12.1. Grooming logically deleted rows


In this section, you delete rows and determine that they are not really removed from the disk.
You then use groom to physically delete the rows.
__ 1. Determine the physical size on disk of the table ORDERS using the following command:
nz_db_size <student_id>_db
Your results might vary from the following:

Notice that in this instance the ORDERS table is 240MB in size.


__ 2. Start NZSQL.
__ 3. To delete all rows in the ORDERS table where the order status is marked as ‘F’ for finished,
use the following command:
delete from orders where o_orderstatus='F';
Your results might vary from the following:

__ 4. Exit NZSQL.
__ 5. Check the physical table size for ORDERS and see if the size decreased using the same
command as in step 1:
nz_db_size <student_id>_db


Your results might vary from the following:

The output should be the same as above showing that the ORDERS table did not change in
size and is still 240MB. This is because the deleted rows were logically deleted but are still
on the disk. The rows are still accessible to transactions that started before the DELETE
statement which you just executed. In practical terms, this means that the transactions that
are still active have a lower transaction id than the transaction that deleted the rows.
__ 6. Start NZSQL.
__ 7. When you run the groom table command, it removes outdated and deleted records from
tables. Use the groom table command, specifying table ORDERS, to physically delete
the rows you just logically deleted:
groom table orders;
Your results might vary from the following:

You can see that 2192233 rows were removed from disk. Notice that this is the same
number of rows that you previously deleted.
__ 8. Exit NZSQL.
__ 9. Use the nz_db_size command to check to see if the ORDERS table size on disk has
shrunk. Execute the following command:
nz_db_size <student_id>_db


Your results might vary from the following:

Note the reduced size of the ORDERS table. You can see that GROOM did purge the
deleted rows from disk.


12.2. Understanding performance benefits of GROOM


In this section, you groom a table to see the resulting performance benefit because the amount of
data that needs to be scanned is smaller. Outdated rows are still present on the hard disk. They can
be dismissed by the FPGA chip, but the system still needs to read them from disk. In this example,
for accounting reasons, you must increase the price of all orders. This means that you need
to update every row in the ORDERS table. Also, measure query performance before and after
grooming the table.
__ 1. Start NZSQL.
__ 2. Update the ORDERS table so that the price of everything is increased by $1 using the
following command:
update orders set o_totalprice = o_totalprice+1;
Your results might vary from the following:

All rows are affected by the update, doubling the number of physical rows in the
table. This is because the update operation leaves a copy of the rows before the update
occurred in case a transaction is still operating on the rows. New rows are created and the
results of the UPDATE are put in these rows. The old rows that are left on disk are marked
as logically deleted.
__ 3. To measure the performance of the test query, configure the NZSQL console to show the
elapsed execution time using the \time command.
__ 4. Run the test query and note the performance:
select count(*) from orders;
Your results might vary from the following:

__ 5. Rerun the query a few more times to estimate a consistent query time on the system, and
make note of the times.
__ 6. Run the groom table command on the ORDERS table:
groom table orders;


Your results might vary from the following:

__ 7. How much disk space did the groom save? (It is the number of reclaimed extents multiplied by 3MB.)
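For example (an illustrative figure only, not your expected result): if the GROOM output reported
reclaiming 40 extents, the space saved would be about 40 x 3MB = 120MB.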
__ 8. Run the test query again; you should see a difference in performance:
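As a reminder, the test query is:
select count(*) from orders;
It should now complete faster because the groomed table contains fewer physical rows to scan.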

End of exercise

Exercise 13. Stored procedures

Prerequisites
Before proceeding with this lab, make sure that you have completed the
following:
• The ORDERS table in your database has been loaded with the data
supplied.

What this exercise is about


Stored procedures are subroutines that are saved in IBM PureData System
for Analytics. They are executed inside the database server and are
available only by accessing the NPS system. They combine the capabilities
of SQL to query and manipulate database information with the capabilities
of procedural programming languages, such as branching and iteration. This
makes them an ideal solution for tasks like data validation, writing event
logs, or encrypting data. They are especially suited for repetitive tasks
that can be easily encapsulated in a subroutine.

What you should be able to do


At the end of the exercise, you should be able to:
• Develop a stored procedure
• Verify that the stored procedure worked correctly


Exercise instructions
So far, you have created a database, loaded data, and performed some optimization and
administrative tasks. In this exercise, you enhance the database with a couple of stored
procedures. As mentioned before, IBM PureData System for Analytics does not check referential
integrity or unique constraints. This is normally not critical because data loading in a data
warehousing environment is a controlled task. In this IBM PureData System for Analytics
implementation, the requirement is to allow some non-administrative database users to add new
customers to the customer table. Because this happens rarely, there are no particular performance
requirements, so you implement it with a stored procedure that these users can execute and that
checks the input values and referential constraints.
You also implement a business logic function as a stored procedure, based on this data model,
that returns a result set.

In the following exercises, you create the stored procedure to insert data into the customer table.
The information that is added for a new customer is the customer key, name, phone number and
nation; the rest of the information is updated through other processes.


13.1. Creating a stored procedure


In this exercise, you review the customer table and define the interface of the stored procedure.
__ 1. Using PuTTY, connect to the host as <student_id>/<student_pwd>.
__ 2. Access the lab directory with the following command:
cd /home/<student_id>/storedProcedure/

Note

This folder already contains the empty file for the stored procedure script you create. It also
contains the solution file that you can review using the ls command:
ls addCustomer_sol.sql

__ 3. Enter NZSQL and connect to <student_id>_db as <student_id>.


nzsql
__ 4. View the customer table description with the \d customer command. You see the
following:
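If you cannot run the command right away, the CUSTOMER table follows the standard TPC-H
layout, roughly as shown below (a sketch only; the exact types and modifiers on your system may
differ slightly):
C_CUSTKEY      integer        not null
C_NAME         varchar(25)    not null
C_ADDRESS      varchar(40)    not null
C_NATIONKEY    integer        not null
C_PHONE        char(15)       not null
C_ACCTBAL      numeric(15,2)  not null
C_MKTSEGMENT   char(10)       not null
C_COMMENT      varchar(117)   not null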

__ 5. Create a stored procedure that adds a new customer entry and sets these 4 fields:
C_CUSTKEY, C_NAME, C_NATIONKEY, and C_PHONE. All other fields are set with an
empty value or 0, since the fields are flagged as NOT NULL.
__ a. Exit the NZSQL console by executing the \q command.
__ b. To create a stored procedure, use the internal vi editor. Open the already existing empty
file addCustomer.sql with the following command:
vi addCustomer.sql
__ c. To edit the file, switch to INSERT mode by pressing “i”.


__ d. Create the interface of the stored procedure to test; use the 4 fields and return an
integer return code. To do this, enter the following text:
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer,
varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
END_PROC;
__ e. Exit the insert mode by pressing ESC and enter :wq! to save the file and quit vi.
This minimal stored procedure does not do anything yet because it has an empty body. You
create the signature with the input and output types using CREATE OR REPLACE so that you can
later run the same script multiple times to update the stored procedure with more code. Input
parameters cannot be given names, so you specify only the data types for the parameters key,
name, nation, and phone, plus an integer return code.

Note

You have to specify the procedure language even though NZPLSQL is the only available option in
IBM PureData System for Analytics.

__ 6. Log into NZSQL.


__ 7. Execute the script you just created with \i addCustomer.sql.
__ 8. Display all stored procedures in the <student_id>_db database with the following
command:
show procedure;

You see the following result and the procedure ADDCUSTOMER with the specified
arguments:


__ 9. Execute the stored procedure with the following dummy input parameters:
call addcustomer(1,'test', 2, 'test');
You should see the following:

The result shows that there is a syntax error in the stored procedure. Every stored procedure
needs at least one BEGIN..END block that encapsulates the code to be executed. Stored
procedures are compiled when they are first executed, not when they are created; therefore,
errors in the code only become visible during execution.
__ 10. Exit the NZSQL console.
__ 11. Using vi, open the addCustomer.sql file:
vi addCustomer.sql
__ 12. Switch to insert mode by pressing “i".
__ 13. To create a simple stored procedure that inserts the new entry into the customer table, you
need some variables. Add variables that alias the input variables $1, $2, $3, and $4 after the
BEGIN_PROC statement:
DECLARE
C_KEY ALIAS FOR $1;
C_NAME ALIAS FOR $2;
N_KEY ALIAS FOR $3;
PHONE ALIAS FOR $4;

Information

Each BEGIN..END block in the stored procedure can have its own DECLARE section. Variables are
valid only in the block they belong to (a short sketch follows this note). It is a best practice to
alias the input parameters with readable variable names to keep the stored procedure code
maintainable. Be careful not to use variable names that are reserved by IBM PureData System for
Analytics, for example NAME.
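A minimal sketch of block-scoped declarations follows. The procedure name scopeDemo and its
variables are purely illustrative and are not part of this lab:
CREATE OR REPLACE PROCEDURE scopeDemo() LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
    OUTER_VAR INTEGER;
BEGIN
    OUTER_VAR := 1;
    DECLARE
        INNER_VAR INTEGER;    -- visible only inside this nested block
    BEGIN
        INNER_VAR := OUTER_VAR + 1;
    END;
    RETURN OUTER_VAR;
END;
END_PROC;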

__ 14. Add the BEGIN..END block with the INSERT statement.


BEGIN
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
This statement adds a new row to the customer table using the input variables. The remaining
fields, such as the account balance, are filled with default values that can be updated later. It is
also possible to execute dynamic SQL queries, as sketched below.
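As an aside, a hedged sketch of dynamic SQL: NZPLSQL can build a statement as a string and run
it with EXECUTE IMMEDIATE. The statement below is illustrative only and would sit inside a
BEGIN..END block; it is not part of the lab procedure:
EXECUTE IMMEDIATE 'UPDATE CUSTOMER SET C_ACCTBAL = 0 WHERE C_CUSTKEY = ' || CAST(C_KEY AS VARCHAR(20));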


Your complete stored procedure should now look like the following:
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer,
varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
C_KEY ALIAS for $1;
C_NAME ALIAS for $2;
N_KEY ALIAS for $3;
PHONE ALIAS for $4;
BEGIN
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
END_PROC;
__ 15. Save the file and exit vi.
__ 16. Log in to NZSQL.
__ 17. Execute the stored procedure script with the following command: \i addCustomer.sql.
__ 18. To test the stored procedure, add a new customer, John Smith, with customer key 999999,
phone number 555-5555, and nation 2 (the key for the United States in our nation table). You
can also check first to ensure that the customer does not yet exist:
call addCustomer(999999,'John Smith', 2, '555-5555');
You should get the following results:

__ 19. Check to see if the insert was successful:


select * from customer where c_custkey = 999999;
You should get the following results:

This result shows that the insert was successful. You have built your first IBM PureData
System for Analytics stored procedure.


13.2. Adding integrity checks


In this section, you add integrity checks to the stored procedure you just created. You make sure
that no duplicate customer is entered into the CUSTOMER table by querying it before the insert;
an IF condition checks whether the key already exists and aborts the insert in that case.
You also check the foreign key relationship to the NATION table to make sure that no customer is
inserted for a nation that does not exist. If any of these conditions is not met, the procedure aborts
and displays an error message.
__ 1. Exit the NZSQL console.
__ 2. Open the addCustomer.sql stored procedure in vi with the following command:
vi addCustomer.sql
__ 3. Switch to insert mode by pressing “i".
__ 4. Add a new record variable named REC with the type RECORD in the DECLARE section:
rec record;
A record is a row variable with dynamic fields. It can hold any row that is selected in a
SELECT INTO statement. You can later refer to its fields with, for example,
REC.C_PHONE.
__ 5. To fill the REC variable with the result of the lookup query, add the following
statement before the INSERT statement:
select * into rec from customer where c_custkey = c_key;
If one or more customers with the specified key already exist, REC contains the first one;
otherwise, the variable is null.
__ 6. To add the IF condition to abort the stored procedure in case a record already exists, after
the newly added select statement add the following lines:
if found rec then
raise exception 'Customer with key % already exists', c_key;
end if;


Information

In this case, you use an IF condition to check whether a customer record with the key already exists
and has been selected by the previous SELECT statement. You could check the record or any of its
fields against the null value explicitly, but IBM PureData System for Analytics provides a number of
special variables that make this more convenient (a short sketch using them follows this note):
• FOUND specifies whether the last SELECT INTO statement returned any records.
• ROW_COUNT contains the number of rows found by the last SELECT INTO statement.
• LAST_OID is the object id of the last inserted row; this variable is not very useful except
for catalog tables.
• The RAISE EXCEPTION statement throws an error and aborts the stored procedure. To add
variable values to the message string, use the % symbol anywhere in the string, similar to
the C printf statement.
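A short sketch of how these variables could be used inside a BEGIN..END block (illustrative only;
unlike RAISE EXCEPTION, RAISE NOTICE prints a message without aborting the procedure):
SELECT * INTO REC FROM CUSTOMER WHERE C_NATIONKEY = N_KEY;
IF FOUND REC THEN
    RAISE NOTICE 'Customers already exist for nation %; rows matched: %', N_KEY, ROW_COUNT;
END IF;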

__ 7. Check the foreign key relationship to NATION by adding the following lines after the lines
added in step 6:
select * into rec from nation where n_nationkey = n_key;
if not found rec then
raise exception 'No Nation with nation key %', n_key;
end if;
This is very similar to the previous check, except that this time you test whether a record was
NOT found. Notice that you can reuse the REC variable because it is not typed to a particular table.


Your stored procedure should now look like the following:
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer,
varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
C_KEY ALIAS for $1;
C_NAME ALIAS for $2;
N_KEY ALIAS for $3;
PHONE ALIAS for $4;
REC RECORD;
BEGIN
SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
IF FOUND REC THEN
RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
END IF;
SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
IF NOT FOUND REC THEN
RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
END IF;
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
END_PROC;

Note

If you have difficulty writing this stored procedure, remember a solution file exists in the same
directory called addCustomer_sol.sql.

__ 8. Save the stored procedure by pressing ESC, and then entering :wq! and pressing Enter.
__ 9. Log in to NZSQL.
nzsql
__ 10. In NZSQL, replace the stored procedure from the script by executing the following
command:
\i addCustomer.sql
__ 11. Test the check for duplicate customer keys by repeating the earlier CALL statement;
remember that a customer record with the key 999999 already exists:
call addCustomer(999999,'John Smith', 2, '555-5555');
You should get the following result:


This is expected since the key value already exists and the first error condition is thrown.
__ 12. Check the foreign key integrity by executing the following command with a customer key
that does not yet exist and a nation key that does not exist in the NATION table. You can
double-check both with SELECT statements if you want:
call addCustomer(999998,'James Brown', 999, '555-5555');
You should get the following result:

This is also expected. The customer key does not exist yet, so the first IF condition is not
triggered, but the check against the NATION table throws an error.
__ 13. To make the call succeed, execute the following command with a customer key that does
not yet exist and the valid nation key 2:
call addCustomer(999998,'James Brown', 2, '555-5555');
You should see a successful execution:

__ 14. Check that the value was correctly inserted:


select c_custkey, c_name from customer where c_custkey = 999998;
You should see the following results:

You have successfully created a stored procedure that inserts values into the CUSTOMER table
and checks unique and foreign key constraints.

Information

Remember that IBM PureData System for Analytics is not optimized for lookup queries, so this is
a relatively slow operation and should not be used for thousands of inserts. For occasional,
manual additions, however, it is a perfectly valid way to work around the missing constraint
enforcement in IBM PureData System for Analytics.


__ 15. In preparation for the next exercise, set the schema inside the stored procedure by editing
the SQL file:
vi addCustomer.sql
__ 16. Move the cursor down to the BEGIN statement and press Shift+A; this takes you to the end
of the line and puts you into insert mode. Then press the Enter key to start a new line.
__ 17. Type in the following:
set schema <student_id>;
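After this edit, the beginning of the procedure body should look similar to the following (the
remaining statements are unchanged):
BEGIN
set schema <student_id>;
SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;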
__ 18. Save the stored procedure by pressing ESC, and then entering :wq! and pressing Enter.
__ 19. Log in to NZSQL.
nzsql
__ 20. Replace the stored procedure from the script by executing the following command:
\i addCustomer.sql
__ 21. Quit nzsql by typing \q and pressing Enter.


13.3. Managing your stored procedure


In the last sections, you created a stored procedure that inserts values into the CUSTOMER table
and checks constraints. Now you grant a user the right to execute this procedure, and you use the
management functions to make changes to the stored procedure and verify those changes.
__ 1. Enter nzsql using the following command:
nzsql -db <student_id>_db -u <student_id> -pw <student_pwd>
__ 2. Create a user <student_id>_admin who is responsible for adding customers:
create user <student_id>_admin with password '<student_pwd>';

Important

This user has the same password as the other users in our labs. This is a simplification that
allows omitting the password during user switches; it would, of course, not be done in a
production environment.

__ 3. Grant access to the <student_id>_db database so he is able to log on:


grant list on <student_id>_db to <student_id>_admin;
__ 4. Grant him the right to select from the customer table; he needs this to verify any changes he
has made:
set schema <student_id>;
grant select on customer to <student_id>_admin;
__ 5. Switch to the new user <student_id>_admin:
\c <student_id>_db <student_id>_admin <student_pwd>
__ 6. Select something from the CUSTOMER table:
select c_custkey, c_name from <student_id>.customer where c_custkey =
999998;
__ 7. Switch back to your <student_id>:
\c <student_id>_db <student_id> <student_pwd>


__ 8. To grant <student_id>_admin the right to execute a specific stored procedure, you need
to specify the full name including all input parameters. The easiest way to get these in the
correct syntax is to first list them with the SHOW PROCEDURE command:
show procedure all;

__ 9. Grant the right to execute this stored procedure to <student_id>_admin:


grant execute on addcustomer(integer, character varying(25), integer,
character varying(15)) to <student_id>_admin;
__ 10. Check the rights of the <student_id>_admin user:
\dpu <student_id>_admin
You should get the following results:

You can see that the user has only the rights you granted: the user can select data from the
customer table and execute your stored procedure, but cannot change the customer table directly
or execute any other procedure.
__ 11. To test this, switch to the <student_id>_admin user with the following command:
\c <student_id>_db <student_id>_admin <student_pwd>
__ 12. Add another customer to the customer table:
call <student_id>.addCustomer(999997,'Jake Jones', 2, '555-5554');
If the insert is successful, you have another row in your table; you can check this with a
SELECT query if you want.
__ 13. To make changes to the stored procedure, switch back to your <student_id>:
\c <student_id>_db <student_id> <student_pwd>


__ 14. Before you modify the stored procedure, look at it in detail:
show procedure addcustomer verbose;
You should see the following:

You can see the input and output arguments, the procedure name, the owner, whether it is
executed as owner or caller, and other details. The verbose option also shows the source code of
the stored procedure. The description field is still empty, so you can add a comment to the
stored procedure. This is important if you have a large number of stored procedures in your
system.

Information

For a convenient way to manage your stored procedures, use nzadmin, which provides most of
the management functionality used in this lab in a graphical UI.


__ 15. Add a description to the stored procedure:


comment on procedure addcustomer(integer, character varying(25), integer,
character varying(15)) is 'this procedure adds a new customer entry to the
customer table';
It is necessary to specify the exact stored procedure signature, including the input arguments.
You can copy and paste these from the output of the SHOW PROCEDURE command.
The COMMENT ON command can be used to add descriptions to most database objects that you
own, from procedures and tables to columns, as sketched below.
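For example, similar comments could be attached to a table or a column that you own; the
comment text here is only an example:
comment on table customer is 'TPC-H customer data used in these labs';
comment on column customer.c_phone is 'customer contact phone number';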
__ 16. Verify that your description has been set:
show procedure addcustomer verbose;
The description field now contains your comment:

Information

Altering a stored procedure so that it executes as the caller instead of the owner means that
whoever executes the stored procedure needs access rights to all the objects touched inside it;
otherwise, the call fails. This should be the default for stored procedures that encapsulate
business logic and do not do extensive data checking.
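For reference, a hedged sketch of how the execution context could be switched; do not run this
as part of the lab, and check the exact syntax for your NPS release:
alter procedure addcustomer(integer, character varying(25), integer, character varying(15)) execute as caller;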

In this section, you set up the permissions for the addCustomer stored procedure and the
<student_id>_admin user who is supposed to use it. You also added a comment to the stored
procedure.

End of exercise
