
IBM Software

Information Management

IBM PureData System for Analytics


Hands-On Labs


Table of Contents
Connecting to the Host and Database
Database Administration
Data Distribution
NzAdmin
Loading and Unloading Data
Backup & Restore
Query Optimization
Optimization Objects
Groom
Stored Procedures


Connecting to the Host and Database


Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1 Introduction
  1.1 VMware Basics
  1.2 Architecture of the Virtual Machines
  1.3 Tips and Tricks on Using the PureData System Virtual Machines
2 Connecting to PureData System Host
  2.1 Open the Virtual Machines in VMware
  2.2 Start the Virtual Machines
3 Connecting to System Database Using nzsql
  3.1 Using PuTTY
  3.2 Connect to the System Database Using nzsql
  3.3 Commonly Used Commands and SQL Statements
  3.4 Exit nzsql



1 Introduction
1.1 VMware Basics

VMware Player and VMware Workstation are synonymous with test beds and developer environments across the IT industry. While VMware has many other uses, for this purpose it allows the easy distribution of an up-and-running PureData System environment to anybody's computer, be it a notebook, desktop, or server. The VMware image can be deployed for simple demos and educational purposes, or it can be the base of your own development and experiments on top of the given environment.

What is a VMware image?

VMware provides a virtual computer environment on top of existing operating systems on Intel or AMD processor based systems. The virtual computer has all the usual components, such as CPU, memory, and disks, as well as network, USB devices, and sound. The CPU and memory are simply the existing resources provided by the underlying operating system (visible as processes whose names start with vmware). Virtual machine disks, on the other hand, are a collection of files in the host operating system that can be copied between systems and platforms. The virtual disk files make up most of the image, while the actual description file of the virtual machine is small.

1.2 Architecture of the Virtual Machines

For the hands-on lab portion of the bootcamp, we will be using two virtual machines (VMs) to demonstrate the usability of PureData System. Because of the nature of the virtualized environment and the host hardware, performance will be limited. Please use these exercises only as a guide to familiarize yourself with PureData System.
The virtual images are adaptations of an appliance, chosen for their portability and convenience. One image acts as the host machine and the other as a SPU, as would typically reside in a PureData System appliance. The Host image is the main gateway where the Netezza Performance Server (NPS) code resides and is accessed. The SPU image contains 5 virtual hard drives of 20 GB each as well as a virtual FPGA. The hard disks here are not partitioned into primary, mirror, and temp partitions as you would observe on a PureData System appliance. Instead, 4 of the disks contain only primary data partitions and the fifth disk is used for temporary data.

[Figure: the Host image (NPS code, reached via PuTTY) and the SPU image (disks, temp space, FPGA) running in VMware on top of the host operating system]

1.3 Tips and Tricks on Using the PureData System Virtual Machines

The PureData System appliance is designed and fine-tuned for a specific set of hardware. In order to demonstrate the system in a virtualized environment, some adaptations were made to the virtual machines. To ensure the labs run smoothly, we have listed some pointers for using the VMs:



Always boot up the Host image first before the SPU image
When booting up the VMs, start the Host image first. Once it is fully booted, the SPU image can be started, at which time the Host image will be listening for connections from the SPU machine. The connection should then be made automatically.

After pausing the virtual machines, nz services need to be restarted
If the VMs were paused (for example, the host operating system went into sleep or hibernation mode, or the images were paused in VMware Workstation), the nz services must be restarted. To continue using the VMs, run the following commands at the prompt of the Host image.

[nz@netezza ~]$ nzstop


[nz@netezza ~]$ nzstart

When starting the SPU image for the first time, there will be a prompt asking whether the image was copied or moved
The first time the SPU image is booted, VMware Workstation will ask whether the image was copied or moved; the user should click on "I moved it". This ensures that the SPU image keeps the same MAC address as before, which is crucial for making sure the Host and SPU images can be connected.

2 Connecting to PureData System Host


In most Bootcamp locations this chapter will already have been prepared by your bootcamp instructor. You can review it
to learn how the images would be installed. But if your NPS system is already running, jump straight to chapter 3.

2.1 Open the Virtual Machines in VMware

2.1.1 Unpacking the Images

The virtual machines for the PureData System Bootcamp are delivered as a self-extracting set of RAR files. For easier handling, the archive is split into 700 MB volumes. Download all the volumes to the same directory, then double-click the executable file and select the destination folder to begin the extraction.
2.1.2 Open the HOST Virtual Machine
There are two methods to open the VMware virtual machines:
Option 1: Double-click the file HOST.vmx in your Windows Explorer or Linux file browser.
Option 2: Select File > Open in the VMware console. This brings up a dialog; browse to the folder where the VM image resides, select HOST.vmx, and click the Open button.

Either option should bring up the VMware console.


2.1.3 Open the SPU Virtual Machine

Repeat the steps from the previous section to open the SPU.vmx file. The VMware console should look similar to the following, with a tab for each image:

2.2 Start the Virtual Machines

2.2.1 Start and Log into the Host Virtual Machine

To start using the virtual machines, first boot up the Host machine. Click on the HOST tab, then press the Power On button in the upper left corner. You should see the Red Hat operating system boot screen; allow it to boot for a couple of minutes until it reaches the PureData System login prompt.
At the login prompt, log in with the following credentials:

Username: nz

Password: nz
Once logged in, we can check the state of the machine by issuing the following command:

[nz@netezza ~]$ nzstate


The system should be in the Discovering state, which signifies that the host machine is ready for connections from the SPUs:

2.2.2 Starting the SPU Virtual Machine

Now we can start the SPU image. Similar to how we started the Host image, click on the SPU tab in the VMware console, then click the Power On button.
The first time the SPU image is booted, the following prompt will ask whether the virtual machine was moved or copied:


Choose the "I moved it" radio button and click OK. This ensures that the previously configured MAC address in the SPU image remains the same, which is crucial for the communication between the Host and SPU virtual machines.
After the SPU is fully booted, you should see a screen similar to the following. Note the bottom right corner, which shows 5 virtual hard disks in a healthy state.

We can now go back to the Host image to check the status of the connection. Click on the HOST tab, and enter the following
command in the prompt:

[nz@netezza ~]$ nzstate


The system state should now display Online.

3 Connecting to System Database Using nzsql


Most Bootcamp locations will have a predefined PuTTY entry netezza that already contains the IP address; open it by
double-clicking on the saved connection.

3.1 Using PuTTY

Since we will not be using any graphical interface tools from the Host virtual machine, there is an alternative to using the PureData System prompts directly in VMware: we can connect to the Host via SSH using tools such as PuTTY. We will use the PuTTY console for the rest of the labs, since this better simulates the real-life scenario of connecting to a remote PureData System.
First, locate the PuTTY executable in the folder where the VMs were extracted. Under the folder Tools you should find the file putty.exe; double-click it to run it. In the PuTTY interface, enter the IP of the Host image, 192.168.239.2, and select SSH as the connection type. Finally, click Open to start the session.


Once the prompt window is open, log in with the following credentials:

Username: nz

Password: nz

We are now ready to connect to the system database and execute commands in the PuTTY command prompt.

3.2 Connect to the System Database Using nzsql

Since we have not created any users or databases yet, we will connect to the default database as the default user, with the following credentials:

Database: system

Username: admin

Password: password
When issuing the nzsql command, the user supplies the user account, password, and the database to connect to. Below is an example of the syntax. Do not try to execute this command; it just demonstrates the syntax:

nzsql -d [db_name] -u [user] -pw [password]


Alternatively, these values can be stored in the command shell and passed to the nzsql command when it is issued without any arguments. Let's verify the current database, user, and password values stored in the command shell by issuing the commands:

[nz@netezza ~]$ printenv NZ_DATABASE


[nz@netezza ~]$ printenv NZ_USER
[nz@netezza ~]$ printenv NZ_PASSWORD

The output should look similar to the following:

Since the current values correspond to our desired values, no modification is required.
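If the values did not match what you need, you could point nzsql at a different database or user by exporting new values before starting it. This is only an illustrative sketch reusing the default values; substitute your own as needed:

[nz@netezza ~]$ export NZ_DATABASE=system    # database to connect to
[nz@netezza ~]$ export NZ_USER=admin         # database user
[nz@netezza ~]$ export NZ_PASSWORD=password  # that user's password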


Next, let's take a look at the options available to start nzsql. Type the following command:

[nz@netezza ~]$ nzsql -?


The -? option will list the usage and all options for the nzsql command. In this exercise, we will start nzsql without arguments. In
the command prompt, issue the command:

[nz@netezza ~]$ nzsql


This will bring up the nzsql prompt, showing a connection to the system database as user admin:

3.3 Commonly Used Commands and SQL Statements

There are commonly used commands that start with \ which we will demonstrate in this section. First, we will run the two help commands to familiarize ourselves with these handy commands. The \h command lists the available SQL commands, while the \? command lists the internal slash commands. Examine the output of both commands:

SYSTEM(ADMIN)=> \h
SYSTEM(ADMIN)=> \?
From the output of the \? command, we find the \l internal command, which lists all databases.
Let's list all the databases by entering:

SYSTEM(ADMIN)=> \l
List of databases
 DATABASE  | OWNER
-----------+-------
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(2 rows)
Next, we will use \dSt to list the system tables within the system database.

SYSTEM(ADMIN)=> \dSt
List of relations
 Name           | Type         | Owner
----------------+--------------+-------
 _T_ACCESS_TIME | SYSTEM TABLE | ADMIN
 _T_ACL         | SYSTEM TABLE | ADMIN
 _T_ACTIONFRAG  | SYSTEM TABLE | ADMIN
 _T_AGGREGATE   | SYSTEM TABLE | ADMIN
 _T_ALTBASE     | SYSTEM TABLE | ADMIN
 _T_AM          | SYSTEM TABLE | ADMIN
 _T_AMOP        | SYSTEM TABLE | ADMIN
 _T_AMPROC      | SYSTEM TABLE | ADMIN
 _T_ATTRDEF     | SYSTEM TABLE | ADMIN
 ...


Note: press the space bar to scroll down the result set when you see --More-- on the screen.
From the previous command, we can see that there is a user table called _T_USER. To find out what is stored in that table, we
will use the describe command \d:

SYSTEM(ADMIN)=> \d _T_USER

This will return all the columns of the _T_USER system table. Next, we want to know the existing users stored in the table. In case too many rows are returned at once, we will first calculate the number of rows it contains by entering the following query:

SYSTEM(ADMIN)=> SELECT COUNT(*) FROM (SELECT * FROM _T_USER) AS "Wrapper";


The query above is essentially the same as SELECT COUNT(*) FROM _T_USER; we have demonstrated the sub-select syntax in case a more complex query needs to have its result set evaluated. The result should show that there is currently 1 entry in the user table. We can enter the following query to list the user names:

SYSTEM(ADMIN)=> SELECT USENAME FROM _T_USER;
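Assuming ADMIN is still the only database user at this point, consistent with the count above, the output should look similar to:

 USENAME
---------
 ADMIN
(1 row)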

3.4 Exit nzsql

To exit nzsql, use the \q command to return to the PureData System shell.


Copyright IBM Corporation 2011


All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list
of IBM trademarks is available on the Web at Copyright and
trademark information at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or
service marks of others.
References in this publication to IBM products and services do not
imply that IBM intends to make them available in all countries in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM's future direction and intent are subject to
change or withdrawal without notice, and represent goals and
objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.


Database Administration
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1 Introduction
  1.1 Objectives
2 Creating IBM PureData System Users and Groups
  2.1 Creating New PureData System Users
  2.2 Creating New PureData System Groups
3 Creating a PureData System Database
  3.1 Creating a Database and Transferring Ownership
  3.2 Assigning Authorities and Privileges
  3.3 Creating PureData System Tables
  3.4 Using DML Queries


1 Introduction
A factory-configured and installed IBM PureData System will include some of the following components:

An IBM PureData System warehouse appliance with pre-installed IBM PureData System software
A preconfigured Linux operating system (with PureData System modifications)
Several preconfigured Linux users and groups:
o The nz user is the default PureData System administration account
o The nz group is the default group
An IBM PureData System database user named ADMIN. The ADMIN user is the database super-user, and has full access to all system functions and objects
A preconfigured database group named PUBLIC. All database users are automatically placed in the group PUBLIC and therefore inherit all of its privileges

The IBM PureData System warehouse appliance includes a highly optimized SQL dialect called PureData System Structured Query Language. You can use SQL commands to create and manage your PureData System databases, user access, and permissions for the databases, as well as to query and modify the contents of the databases.
On a new IBM PureData System, there is typically one main database, SYSTEM, and a database template, MASTER_DB.
IBM PureData System uses the MASTER_DB as a template for all other user databases that are created on the system.
Initially, only the ADMIN user can create new databases, but the ADMIN user can grant other users permission to create
databases as well. The ADMIN user can also make another user the owner of a database, which gives that user ADMIN-like
control over that database and its contents. The database creator becomes the default owner of the database. The owner can
remove the database and all its objects, even if other users own objects within the database. Within a database, permitted users can create tables, populate them with data, and query their contents.

1.1 Objectives

This lab will guide you through the typical steps to create and manage new IBM PureData System users and groups after an IBM PureData System has been delivered and configured. This includes creating a new database and assigning the appropriate privileges. The users and the database that you create in this lab will be used as a basis for the remaining labs in this bootcamp. After this lab you will have a basic understanding of how to plan and create an IBM PureData System database environment.

The first part of this lab will examine creating IBM PureData System users and groups.

The second part of this lab will explore creating and using a database and tables. The table schema to be used within this bootcamp will be explained in the Data Distribution lab.

2 Creating IBM PureData System Users and Groups


The initial task after an IBM PureData System has been set up is to create the database environment. This typically begins by creating a new set of database users and user groups before creating the database. You will use the ADMIN user to create the additional database users and user groups, and then assign the appropriate authorities after the database has been created in the next section. The ADMIN user should only be used to perform administrative tasks within the IBM PureData System and is not recommended for regular use. It is also highly advisable to develop a security access model to control user access to the database and the database objects in an IBM PureData System. This will involve creating separate users to perform certain tasks.

The security access model for this bootcamp environment will use three PureData System database users:
o LABADMIN
o LABUSER
o DBUSER

and two PureData System database user groups:
o LAGRP
o LUGRP

1. Connect to the Netezza image using PuTTY. Log in to 192.168.239.2 as user nz with password nz. 192.168.239.2 is the default IP address for a local VM, which is used for most bootcamp environments. In cases where the images are hosted remotely, the instructors will provide the host IPs, which will vary between machines.

2. Connect to the system database as the PureData System database super-user, ADMIN, using the nzsql interface:

[nz@netezza ~]$ nzsql

or,

[nz@netezza ~]$ nzsql -d system -u admin -pw password


There are different options you can use with the nzsql interface. Here we present two, where the first option uses information set in the NZ environment variables, NZ_DATABASE, NZ_USER, and NZ_PASSWORD. By default the environment variables are set to the following values:
NZ_DATABASE=system
NZ_USER=admin
NZ_PASSWORD=password
So you do not need to specify the database name or the user. In the second option the information is explicitly stated using the -d, -u, and -pw options, which specify the database name, the user, and the user's password, respectively. This option is useful when you want to connect to a different database or use a different user than specified in the NZ environment variables.
You will see the following:

Welcome to nzsql, the Netezza SQL interactive terminal.

Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

SYSTEM(ADMIN)=>

2.1 Creating New PureData System Users

The three new PureData System database users will initially be created using the ADMIN user. The LABADMIN user will be the full owner of the bootcamp database. The LABUSER user will be allowed to perform data manipulation language (DML) operations (INSERT, UPDATE, DELETE) against all of the tables in the database, but will not be allowed to create new objects such as tables in the database. Lastly, the DBUSER user will only be allowed to read tables in the database, that is, it will only have LIST and SELECT privileges on tables in the database.
The basic syntax to create a user is:


CREATE USER username WITH PASSWORD 'string';

1. As the PureData System database super-user, ADMIN, you can now create the first user, LABADMIN, which will be the administrator of the database (note that user and group names are not case sensitive):

SYSTEM(ADMIN)=> create user labadmin with password 'password';

Later in this lab you will assign administrative ownership of the lab database to this user.

2. Now you will create two additional PureData System database users that will have restricted access to the database. The first user, LABUSER, will have full DML access to the data in the tables, but will not be able to create or alter tables. For now you will just create the user; we will set the privileges after the database is created:

SYSTEM(ADMIN)=> create user labuser with password 'password';

3. Finally, we create the user DBUSER. This user will have even more limited access to the database since it will only be allowed to select data from the tables within the database. Again, you will set the privileges after the database is created:

SYSTEM(ADMIN)=> create user dbuser with password 'password';

4. To list the existing PureData System database users in the environment use the \du internal slash option:

SYSTEM(ADMIN)=> \du

This will return a list of all database users:
List of Users
 USERNAME | VALIDUNTIL | ROWLIMIT | SESSIONTIMEOUT | QUERYTIMEOUT | DEF_PRIORITY | MAX_PRIORITY | USERESOURCEGRPID | USERESOURCEGRPNAME | CROSS_JOINS_ALLOWED
----------+------------+----------+----------------+--------------+--------------+--------------+------------------+--------------------+---------------------
 ADMIN    |            |          |              0 |            0 | NONE         | NONE         |                  | _ADMIN_            | NULL
 DBUSER   |            |        0 |              0 |            0 | NONE         | NONE         |                  | PUBLIC             | NULL
 LABADMIN |            |        0 |              0 |            0 | NONE         | NONE         |                  | PUBLIC             | NULL
 LABUSER  |            |        0 |              0 |            0 | NONE         | NONE         |                  | PUBLIC             | NULL
(4 rows)

The additional information, such as the user resource group columns, is intended for resource management, which is covered later in the WLM presentation.

2.2 Creating New PureData System Groups

PureData System database user groups are useful for organizing and managing PureData System database users. By default PureData System contains one group with the name PUBLIC. All users are members of the PUBLIC group when they are created, though users can be members of other groups as well. In this section we will create two new PureData System database user groups. They will initially be created by the ADMIN user.
We will create an administrative group LAGRP, which is short for Lab Admin Group. This group will contain the LABADMIN user. The second group we create will be the LUGRP or Lab User Group. This group will contain the users LABUSER and DBUSER.


Two different methods will be used to add the existing users to the newly created groups. Alternatively, the groups could be
created first and then the users. The basic command to create a group is:

CREATE GROUP groupname;

1. As the PureData System database super-user, ADMIN, you will now create the first group, LAGRP, which will be the administrative group for the LABADMIN user:

SYSTEM(ADMIN)=> create group lagrp;

2. After the LAGRP group is created you will now add the LABADMIN user to this group. This is accomplished by using the ALTER statement. You can either ALTER the user or the group; for this task you will ALTER the group to add the LABADMIN user to the LAGRP group:

SYSTEM(ADMIN)=> alter group lagrp with add user labadmin;

To ALTER the user instead, you would use the following command:

alter user labadmin with group lagrp;

3. Now you will create the second group, LUGRP, which will be the user group for both the LABUSER and DBUSER users. You can specify the users to be included in the group when creating the group:

SYSTEM(ADMIN)=> create group lugrp with add user labuser, dbuser;

If you had created the group before creating the users, you could add a user to the group when creating the user. To create the LABUSER user and add it to an existing group LUGRP, you would use the following command:

create user labuser with in group lugrp;
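Conversely, a user can be removed from a group with the DROP USER clause of ALTER GROUP. This is shown for completeness only; do not run it in this lab, since DBUSER must stay in the LUGRP group:

alter group lugrp with drop user dbuser;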

4. To list the existing PureData System groups in the environment use the \dg internal slash option:

SYSTEM(ADMIN)=> \dg

This will return a list of all groups in the system. In our test system this is the default group PUBLIC and the two groups you have just created:
List of Groups
 GROUPNAME | ROWLIMIT | SESSIONTIMEOUT | QUERYTIMEOUT | DEF_PRIORITY | MAX_PRIORITY | GRORSGPERCENT | RSGMAXPERCENT | JOBMAX | CROSS_JOINS_ALLOWED
-----------+----------+----------------+--------------+--------------+--------------+---------------+---------------+--------+---------------------
 LAGRP     |        0 |              0 |            0 | NONE         | NONE         |             0 |           100 |      0 | NULL
 LUGRP     |        0 |              0 |            0 | NONE         | NONE         |             0 |           100 |      0 | NULL
 PUBLIC    |        0 |              0 |            0 | NONE         | NONE         |            20 |           100 |      0 | NULL
(3 rows)

The other columns are explained in the WLM presentation.


5. To list the users in a group you can use one of two internal slash options, \dG or \dU. The internal slash option \dG will list the groups with the associated users:

SYSTEM(ADMIN)=> \dG

This returns a list of all groups and the users they contain:

List of Users in a Group
 GROUPNAME | USERNAME
-----------+----------
 LAGRP     | LABADMIN
 LUGRP     | DBUSER
 LUGRP     | LABUSER
 PUBLIC    | DBUSER
 PUBLIC    | LABADMIN
 PUBLIC    | LABUSER
(6 rows)
The internal slash option \dU will list the users with the associated group:

SYSTEM(ADMIN)=> \dU
In this case the output is ordered by the users:

List of Groups a User is a member
 USERNAME | GROUPNAME
----------+-----------
 DBUSER   | LUGRP
 DBUSER   | PUBLIC
 LABADMIN | LAGRP
 LABADMIN | PUBLIC
 LABUSER  | LUGRP
 LABUSER  | PUBLIC
(6 rows)

3 Creating a PureData System Database

The next step after the PureData System database users and user groups have been created is to create the lab database. You will continue to use the ADMIN user to create the lab database, then assign the appropriate authorities and privileges to the users created in the previous sections. The ADMIN user can also be used to create tables within the new database. However, the ADMIN user should only be used to perform administrative tasks. After the appropriate privileges have been assigned by the ADMIN user, the database can be handed over to the end users to start creating and populating the tables in the database.

3.1 Creating a Database and Transferring Ownership

The lab database will be named LABDB. It will initially be created by the ADMIN user, and then ownership of the database will be transferred to the LABADMIN user. The LABADMIN user will have full administrative privileges on the LABDB database. The basic syntax to create a database is:


CREATE DATABASE database_name;

1. As the PureData System database super-user, ADMIN, create the first database, LABDB, using the CREATE DATABASE command:

SYSTEM(ADMIN)=> create database labdb;

The database LABDB has been created.

2. To view the existing databases use the internal slash option \l:

SYSTEM(ADMIN)=> \l

This will return the following list:

List of databases
 DATABASE  | OWNER
-----------+-------
 LABDB     | ADMIN
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(3 rows)

The owner of the newly created LABDB database is the ADMIN user. The other databases are the default database SYSTEM and the template database MASTER_DB.

3. At this point you could continue by creating new tables as the ADMIN user. However, the ADMIN user should only be used to create users, groups, and databases, and to assign authorities and privileges. Therefore we will transfer ownership of the LABDB database from the ADMIN user to the LABADMIN user we created previously. The ALTER DATABASE command is used to transfer ownership of an existing database:

SYSTEM(ADMIN)=> alter database labdb owner to labadmin;

This is the only method to transfer ownership of a database to an existing user. The CREATE DATABASE command does not include this option.

4. Check that the owner of the LABDB database is now the LABADMIN user:

SYSTEM(ADMIN)=> \l

List of databases
 DATABASE  | OWNER
-----------+----------
 LABDB     | LABADMIN
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(3 rows)

The owner of the LABDB database is now the LABADMIN user.


The LABDB database is now created and the LABADMIN user has full privileges on the LABDB database. The user can create
and alter objects within the database. You could now continue and start creating tables as the LABADMIN user. However, we will
first finish assigning privileges to the two remaining database users that were created in the previous section.

3.2 Assigning Authorities and Privileges

One last task for the ADMIN user is to assign privileges to the two users we created earlier, LABUSER and DBUSER. The LABUSER user will have full DML rights on all tables in the LABDB database, but will not be allowed to create or alter tables within the database. User DBUSER will have more restricted access and will only be allowed to read data from the tables in the database. The privileges will be controlled by a combination of settings at the group and user level.
The LUGRP user group will be granted LIST and SELECT privileges on the database and the tables within it, so any member of the LUGRP will have these privileges. The full data manipulation privileges will be granted individually to the LABUSER user.
The GRANT command that is used to assign object privileges has the following syntax:
GRANT <object_privilege> ON <object> TO { PUBLIC | GROUP <group> | <username> }

1. As the PureData System database super-user, ADMIN, connect to the LABDB database using the internal slash option \c:

SYSTEM(ADMIN)=> \c labdb admin password

You should see that you have successfully connected to the database:

SYSTEM(ADMIN)=> \c labdb admin password
You are now connected to database LABDB as user admin.
LABDB(ADMIN)=>

You will notice that the database name in the command prompt has changed from SYSTEM to LABDB.

2. First you will grant the LIST privilege on the LABDB database to the group LUGRP. This will allow members of the LUGRP to view and connect to the LABDB database:

LABDB(ADMIN)=> grant list on labdb to lugrp;

3. To list the object permissions for a group use the internal slash option \dpg:

LABDB(ADMIN)=> \dpg lugrp

You will see the following output:

Group object permissions for group 'LUGRP'
 Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
---------------+-------------+-------------------------------------+---------------------------------------------
 GLOBAL        | LABDB       | X                                   |
(1 rows)
Object Privileges
(L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
(A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s

The X in the L column denotes that the LUGRP group has the LIST object privilege on the LABDB global object.


4. With the current privileges set, LABUSER and DBUSER can now view and connect to the LABDB database as members of the LUGRP group, but they have no privileges to access any of the objects within the database. So you will grant LIST and SELECT privileges on the tables within the LABDB database to the members of the LUGRP:

LABDB(ADMIN)=> grant list, select on table to lugrp;

5. View the object permissions for the LUGRP group:

LABDB(ADMIN)=> \dpg lugrp

This will produce the following results:

Group object permissions for group 'LUGRP'
 Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
---------------+-------------+-------------------------------------+---------------------------------------------
 GLOBAL        | LABDB       | X                                   |
 LABDB         | TABLE       | X X                                 |
(2 rows)
Object Privileges
(L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
(A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s

The X in the L and S columns denotes that the LUGRP group has both LIST and SELECT privileges on all of the tables in the LABDB database. (The LIST privilege allows users to view the tables using the internal slash option \d.)

6. The current privileges satisfy the DBUSER user requirements, which are to allow access to the LABDB database and SELECT access to all the tables in the database. But they do not satisfy the requirements for the LABUSER user, which is to have full DML access to all the tables in the database. So you will grant SELECT, INSERT, UPDATE, DELETE, LIST, and TRUNCATE privileges on tables in the LABDB database to the LABUSER user:

LABDB(ADMIN)=> grant select, insert, update, delete, list, truncate on table to labuser;

7. To list the object permissions for a user use the internal slash option \dpu <user name>:

LABDB(ADMIN)=> \dpu labuser

This will return the following:

User object permissions for user 'LABUSER'
 Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
---------------+-------------+-------------------------------------+---------------------------------------------
 LABDB         | TABLE       | X X X X X X                         |
(1 rows)
Object Privileges
(L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
(A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s

The X under the L, S, I, U, D, and T columns indicates that the LABUSER user has LIST, SELECT, INSERT, UPDATE, DELETE, and TRUNCATE privileges on all of the tables in the LABDB database.
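For completeness: if a privilege ever needs to be withdrawn, the REVOKE command mirrors the GRANT syntax. The following is a hypothetical example only; do not run it in this lab, since LABUSER needs these privileges later:

LABDB(ADMIN)=> revoke truncate on table from labuser;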


Now that all of the privileges have been set by the ADMIN user, the LABDB database can be handed over to the end users. The end users can use the LABADMIN user to create objects, including tables, in the database and also to maintain the database.

3.3 Creating PureData System Tables

The LABADMIN user will be used to create tables in the LABDB database instead of the ADMIN user. Two tables will be created in this lab; the remaining tables for the LABDB database schema will be created in the Data Distribution lab. Data distribution is an important aspect that should be considered when creating tables, but it is not covered here since it is discussed separately in the Data Distribution presentation. The two tables that will be created are the REGION and NATION tables. They will be populated with data in the next section using the LABUSER user. Two methods will be utilized to create these tables. The basic syntax to create a table is:
CREATE TABLE table_name
(
column_name type [ [ constraint_name ] column_constraint
[ constraint_characteristics ] ] [, ... ]
[ [ constraint_name ] table_constraint [ constraint_characteristics ] ] [, ... ]
)
[ DISTRIBUTE ON ( column [, ...] ) ]

1. Connect to the LABDB database as the LABADMIN user using the internal slash option \c:

LABDB(ADMIN)=> \c labdb labadmin password

You will see the following results:

LABDB(ADMIN)=> \c labdb labadmin password
You are now connected to database LABDB as user labadmin.
LABDB(LABADMIN)=>

You will notice that the user name in the command prompt has changed from ADMIN to LABADMIN. Since you already had an open session you could use the internal slash option \c to connect to the database. However, if you had handed over this environment to the end user, they would need to initiate a new connection using the nzsql interface.

To use the nzsql interface to connect to the LABDB database as the LABADMIN user you could use the following options:
nzsql -d labdb -u labadmin -pw password
or the short form, omitting the option flags:
nzsql labdb labadmin password

or you could set the environment variables to the following values and issue nzsql without options.


NZ_DATABASE=LABDB
NZ_USER=LABADMIN
NZ_PASSWORD=password
In later labs we will often leave out the password parameter, since it has been set to the same value, password, for all users.

2. Now you can create the first table in the LABDB database: the REGION table, with the following columns and data types:

 Column Name | Data Type
-------------+--------------
 R_REGIONKEY | INTEGER
 R_NAME      | CHAR(25)
 R_COMMENT   | VARCHAR(152)

To create the table execute the following command:

LABDB(LABADMIN)=> create table region (r_regionkey integer, r_name char(25), r_comment varchar(152));

3. To list the tables in the LABDB database use the \dt internal slash option:

LABDB(LABADMIN)=> \dt

This will show the table you just created:

List of relations
 Name   | Type  | Owner
--------+-------+----------
 REGION | TABLE | LABADMIN
(1 row)

4. To describe a table you can use the internal slash option \d <table name>:

LABDB(LABADMIN)=> \d region

This shows a description of the created table:

Table "REGION"
 Attribute   | Type                   | Modifier | Default Value
-------------+------------------------+----------+---------------
 R_REGIONKEY | INTEGER                |          |
 R_NAME      | CHARACTER(25)          |          |
 R_COMMENT   | CHARACTER VARYING(152) |          |
Distributed on hash: "R_REGIONKEY"


The "Distributed on hash" clause shows the distribution method used by the table. If you do not explicitly specify a distribution method, a default distribution is used; on our system this is a hash distribution on the first column, R_REGIONKEY. This concept is discussed in the Data Distribution presentation and lab.
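For comparison, if you wanted to state the distribution key explicitly rather than rely on the default, you could use the DISTRIBUTE ON clause from the syntax shown earlier. This is only a sketch of the equivalent statement; the lab keeps the REGION table as created:

create table region
(
r_regionkey integer,
r_name char(25),
r_comment varchar(152)
)
distribute on (r_regionkey);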

5. Instead of typing out the entire create table statement at the nzsql command line, you can read and execute commands from a file. You'll use this method to create the NATION table in the LABDB database with the following columns, data types, and constraints:

 Column Name | Data Type    | Constraint
-------------+--------------+-----------
 N_NATIONKEY | INTEGER      | NOT NULL
 N_NAME      | CHAR(25)     | NOT NULL
 N_REGIONKEY | INTEGER      | NOT NULL
 N_COMMENT   | VARCHAR(152) |

The full create table statement for the NATION table:

create table nation
(
n_nationkey integer not null,
n_name char(25) not null,
n_regionkey integer not null,
n_comment varchar(152)
)
distribute on random;
6. The statement can be found in the nation.ddl file under the /labs/databaseAdministration directory. To read and execute commands from a file use the \i <file> internal slash option:

LABDB(LABADMIN)=> \i /labs/databaseAdministration/nation.ddl

7. List all the tables in the LABDB database:

LABDB(LABADMIN)=> \dt

You will now see a list containing the two tables you created:

List of relations
 Name   | Type  | Owner
--------+-------+----------
 NATION | TABLE | LABADMIN
 REGION | TABLE | LABADMIN
(2 rows)

8. Describe the NATION table:


LABDB(LABADMIN)=> \d nation

This will show the following results:

Table "NATION"
 Attribute   | Type                   | Modifier | Default Value
-------------+------------------------+----------+---------------
 N_NATIONKEY | INTEGER                | NOT NULL |
 N_NAME      | CHARACTER(25)          | NOT NULL |
 N_REGIONKEY | INTEGER                | NOT NULL |
 N_COMMENT   | CHARACTER VARYING(152) |          |
Distributed on random: (round-robin)

"Distributed on random" is the distribution method used; in this case the rows of the NATION table are distributed in round-robin fashion. This concept is discussed separately in the Data Distribution presentation and lab.

It would be possible to continue using the LABADMIN user to perform DML queries, since it is the owner of the database and holds all privileges on all of the objects in the database. However, the LABUSER and DBUSER users will be used to perform DML queries against the tables in the database.

3.4 Using DML Queries

We will now use the LABUSER user to populate both the REGION and NATION tables. This user has full data manipulation language (DML) privileges in the database, but no data definition language (DDL) privileges; only LABADMIN has full DDL privileges in the database. Later in this course more efficient methods to populate tables with data are discussed. The DBUSER user will also be used to read data from the tables, but it cannot insert data into the tables since it has limited DML privileges in the database.

1. Connect to the LABDB database as the LABUSER user using the internal slash option \c:

LABDB(LABADMIN)=> \c labdb labuser password

You will see the following result:

LABDB(LABADMIN)=> \c labdb labuser password
You are now connected to database LABDB as user labuser.
LABDB(LABUSER)=>

You will notice that the user name in the command prompt has changed from LABADMIN to LABUSER.

2. First check which tables exist in the LABDB database using the \dt internal slash option:

LABDB(LABUSER)=> \dt

You should see the following list:


List of relations
 Name   | Type  | Owner
--------+-------+----------
 NATION | TABLE | LABADMIN
 REGION | TABLE | LABADMIN
(2 rows)

Remember that the LABUSER user is a member of the LUGRP group which was granted LIST privileges on the tables in the
LABDB database. This is the reason why it can list and view the tables in the LABDB database. If it did not have this privilege
it would not be able to see any of the tables in the LABDB database.
3. The LABUSER user was created to perform DML operations against the tables in the LABDB database, but it is restricted from performing DDL operations against the database. Let's see what happens when you try to create a new table, T1, with one column, C1, using the INTEGER data type:

LABDB(LABUSER)=> create table t1 (c1 integer);

You will see the following error message:

LABDB(LABUSER)=> create table t1 (c1 integer);
ERROR: CREATE TABLE: permission denied.

As expected, the create table statement is not allowed since the LABUSER user does not have the privilege to create tables in the LABDB database.

4. Let's continue by performing DML operations that the LABUSER user is allowed to perform against the tables in the LABDB database. Insert a new row into the REGION table:

LABDB(LABUSER)=> insert into region values (1, 'NA', 'north america');

You will see the following result:

LABDB(LABUSER)=> insert into region values (1, 'NA', 'north america');
INSERT 0 1

As expected this operation is successful. The output of the INSERT gives feedback about the number of successfully inserted rows.

5. Issue a SELECT statement against the REGION table to check the new row you just added:

LABDB(LABUSER)=> select * from region;

This should return the row you just inserted:

 R_REGIONKEY | R_NAME | R_COMMENT
-------------+--------+---------------
           1 | NA     | north america
(1 rows)


6. Instead of typing DML statements at the nzsql command line, you can read and execute statements from a file. You will use this method to add the following three rows to the REGION table:

 R_REGIONKEY | R_NAME | R_COMMENT
-------------+--------+-----------------------------
           2 | SA     | South America
           3 | EMEA   | Europe, Middle East, Africa
           4 | AP     | Asia Pacific

This is done with a SQL script containing the following commands:

insert into region values (2, 'sa', 'south america');
insert into region values (3, 'emea', 'europe, middle east, africa');
insert into region values (4, 'ap', 'asia pacific');

The script can be found in the region.dml file under the /labs/databaseAdministration directory. To read and execute commands from a file use the \i <file> internal slash option:

LABDB(LABUSER)=> \i /labs/databaseAdministration/region.dml
You will see the following result. You can see from the output that the SQL script contained three INSERT statements.

LABDB(LABUSER)=> \i /labs/databaseAdministration/region.dml
INSERT 0 1
INSERT 0 1
INSERT 0 1

7. You will load data into the NATION table using an external table with the following command:

LABDB(LABUSER)=> insert into nation select * from external '/labs/databaseAdministration/nation.del';

You will see that 14 rows are inserted into the table:

LABDB(LABUSER)=> insert into nation select * from external '/labs/databaseAdministration/nation.del';
INSERT 0 14

Loading data into a table is covered in the Loading and Unloading Data presentation and lab.
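Before moving on, if you would like to verify the load, a simple row count should match the INSERT feedback above:

LABDB(LABUSER)=> select count(*) from nation;

 COUNT
-------
    14
(1 row)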

8. Now you will switch over to the DBUSER user, who only has SELECT privilege on the tables in the LABDB database. This privilege is granted to the user as a member of the LUGRP group. Use the internal slash option \c <database name> <user> <password> to connect to the LABDB database as the DBUSER user:

LABDB(LABUSER)=> \c labdb dbuser password

You will see the following:

LABDB(LABUSER)=> \c labdb dbuser password
You are now connected to database LABDB as user dbuser.
LABDB(DBUSER)=>

You will notice that the user name in the command prompt changes from LABUSER to DBUSER.

9. Before trying to view rows from tables in the LABDB database, try to add a new row to the REGION table:

LABDB(DBUSER)=> insert into region values (5, 'np', 'north pole');

You should see the following error message:

LABDB(DBUSER)=> insert into region values (5, 'np', 'north pole');
ERROR: Permission denied on "REGION".

As expected, the INSERT statement is not allowed since DBUSER does not have the privilege to add rows to any tables in the LABDB database.

10. Now select all rows from the REGION table:

LABDB(DBUSER)=> select * from region;

You should get the following output:

 R_REGIONKEY | R_NAME | R_COMMENT
-------------+--------+-----------------------------
           1 | na     | north america
           2 | sa     | south america
           3 | emea   | europe, middle east, africa
           4 | ap     | asia pacific
(4 rows)

11. Finally, let's run a slightly more complex query. We want to return all nation names in Asia Pacific, together with their region name. To do this you need to execute a simple join of the NATION and REGION tables. The join key will be the region key, and to restrict results to the AP region you need to add a WHERE condition:

LABDB(DBUSER)=> select n_name, r_name from nation, region where n_regionkey = r_regionkey and r_name = 'ap';

This should return the following results, containing all countries from the ap region:

 N_NAME      | R_NAME
-------------+--------
 macau       | ap
 new zealand | ap
 australia   | ap
 japan       | ap
 hong kong   | ap
(5 rows)

Congratulations, you have completed the lab. You have successfully created the lab database, two tables, and database users and user groups with various privileges. You also ran a couple of simple queries. In later labs you will continue to use this database by creating the full schema.



Data Distribution
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1 Introduction
  1.1 Objectives
2 Skew
  2.1 Data Skew
  2.2 Processing Skew
3 Co-Location
  3.1 Investigation
  3.2 Co-Located Joins
4 Schema Creation
  4.1 Investigation
  4.2 Solution



1 Introduction
IBM PureData System is a family of data-warehousing appliances that combine high performance with low administrative effort. Due to the unique data-warehousing-centric architecture of PureData System, most performance tuning tasks are either unnecessary or automated. Unlike typical data warehousing solutions, no tablespaces need to be created or tuned, and there are no indexes, buffer pools, or partitions.
Since PureData System is built on a massively parallel architecture that distributes data and workloads over a large number of processing and data nodes, the single most important tuning factor is picking the right distribution key. The distribution key governs which data rows of a table are distributed to which data slice, and it is very important to pick an optimal distribution key to avoid data skew and processing skew, and to make joins co-located whenever possible.

1.1 Objectives

In this lab we will cover a typical scenario in a POC or customer engagement, involving an existing data warehouse for customer transactions.

Figure 1 LABDB database


Figure 1 shows a visualization of the tables in the data warehouse and the relationships between them. The warehouse contains the customers of the company, their orders, and the line items that are part of each order. The warehouse also has a list of suppliers providing the parts that make up the shipped line items.
For this lab we already have the DDLs for creating the tables and load files containing the warehouse data. Both have already been transformed into a format usable by PureData System. In this lab we will define the distribution keys for these tables.
In addition to the data and the DDLs, we have also received a couple of queries from the customer that are usually run against the warehouse. These are important input as well for picking optimal distribution keys.


2 Skew
Tables in PureData System are distributed across data slices based on the distribution method and key. If a bad data distribution
method has been picked, it will result in skewed tables or processing skew. Data skew occurs when the distribution method puts
significantly more records of a table on one data slice than on other data slices. Apart from bad performance this also results in
a situation where the PureData System can hold significantly less data than expected.
Processing skew occurs if the processing of queries mainly takes place on only some data slices, for example because queries only apply to data on those data slices. Both types of skew result in suboptimal performance, since in a parallel system the slowest node defines the total execution time.

2.1 Data Skew

The first table we will create is LINEITEM, the main fact table of the schema. It contains roughly 6 million rows.
1. Connect to the Netezza image using PuTTY. Log in to 192.168.239.2 as user nz with password nz. 192.168.239.2 is the default IP address for a local VM, which is used for most bootcamp environments. In some cases where the images are hosted remotely, the instructors will provide the host IPs, which will vary between machines.

2. If you are continuing from the previous lab and are already connected to nzsql, quit the nzsql console with the \q command.

3. To create the LINEITEM table, switch to the lab directory /labs/dataDistribution. To do this use the following command: (Notice that you can use bash auto-completion by using the Tab key to complete folder and file names)

[nz@netezza ~]$ cd /labs/dataDistribution/

4. Create the LINEITEM table by using the following script. Since the fact table is quite large this can take a couple of minutes.

[nz@netezza dataDistribution]$ ./create_lineitem_1.sh

You should see a similar result to the following. The error message at the beginning is expected since the script tries to
clean up existing LINEITEM tables:

[nz@netezza dataDistribution]$ ./create_lineitem_1.sh


ERROR: Table 'LINEITEM' does not exist
CREATE TABLE
Load session of table 'LINEITEM' completed successfully

5. Now let's have a look at the created table. Open the nzsql console by entering the command: nzsql

[nz@netezza dataDistribution]$ nzsql


Welcome to nzsql, the Netezza SQL interactive terminal.
Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

SYSTEM(ADMIN)=>


6. Connect to the database LABDB as user LABADMIN by typing the following command:

SYSTEM(ADMIN)=> \c LABDB LABADMIN


You should now be connected to the LABDB database as the LABADMIN user.

SYSTEM(ADMIN)=> \c LABDB LABADMIN


You are now connected to database LABDB as user LABADMIN.
LABDB(LABADMIN) =>

7. Let's have a look at the table we just created. First we want to see a description of its columns and distribution key. Use the nzsql describe command \d LINEITEM to get a description of the table. This should have the following result:

LABDB(LABADMIN)=> \d LINEITEM
                      Table "LINEITEM"
    Attribute    |          Type          | Modifier | Default Value
-----------------+------------------------+----------+---------------
 L_ORDERKEY      | INTEGER                | NOT NULL |
 L_PARTKEY       | INTEGER                | NOT NULL |
 L_SUPPKEY       | INTEGER                | NOT NULL |
 L_LINENUMBER    | INTEGER                | NOT NULL |
 L_QUANTITY      | NUMERIC(15,2)          | NOT NULL |
 L_EXTENDEDPRICE | NUMERIC(15,2)          | NOT NULL |
 L_DISCOUNT      | NUMERIC(15,2)          | NOT NULL |
 L_TAX           | NUMERIC(15,2)          | NOT NULL |
 L_RETURNFLAG    | CHARACTER(1)           | NOT NULL |
 L_LINESTATUS    | CHARACTER(1)           | NOT NULL |
 L_SHIPDATE      | DATE                   | NOT NULL |
 L_COMMITDATE    | DATE                   | NOT NULL |
 L_RECEIPTDATE   | DATE                   | NOT NULL |
 L_SHIPINSTRUCT  | CHARACTER(25)          | NOT NULL |
 L_SHIPMODE      | CHARACTER(10)          | NOT NULL |
 L_COMMENT       | CHARACTER VARYING(44)  | NOT NULL |
Distributed on hash: "L_LINESTATUS"

We can see that the LINEITEM table has 16 columns with different data types. Some of the columns have a KEY suffix and contain the names of other tables; these are most likely foreign keys to dimension tables. The distribution key is L_LINESTATUS, which is of a CHAR(1) data type.
8. Now let's have a look at the data in the table. To return a limited number of rows you can use the LIMIT keyword in your SELECT queries. Execute the following SELECT command to return 10 rows of the LINEITEM table. For readability we only select a couple of columns, including the order key, the ship date, and the L_LINESTATUS distribution key:

LABDB(LABADMIN)=> SELECT L_ORDERKEY, L_QUANTITY, L_SHIPDATE, L_LINESTATUS FROM LINEITEM LIMIT 10;

You will see the following results:


LABDB(LABADMIN)=> SELECT L_ORDERKEY, L_QUANTITY, L_SHIPDATE, L_LINESTATUS FROM LINEITEM LIMIT 10;
 L_ORDERKEY | L_QUANTITY | L_SHIPDATE | L_LINESTATUS
------------+------------+------------+--------------
          2 |      38.00 | 1997-01-28 | O
          6 |      37.00 | 1992-04-27 | F
         34 |      13.00 | 1998-10-23 | O
         34 |      22.00 | 1998-10-09 | O
         34 |       6.00 | 1998-10-30 | O
         38 |      44.00 | 1996-09-29 | O
         66 |      31.00 | 1994-02-19 | F
         66 |      41.00 | 1994-02-21 | F
         70 |       8.00 | 1994-01-12 | F
         70 |      13.00 | 1994-03-03 | F
(10 rows)
From this limited sample we cannot make any definite judgments, but we can make a couple of assumptions. While the L_ORDERKEY column is not unique, it seems to have a lot of distinct values. The L_SHIPDATE column also appears to have a lot of distinct shipping date values. Our current distribution key L_LINESTATUS, on the other hand, shows only two values, which may make it a bad distribution key. Since a database table is an unordered set, it is possible that you get different results, for example only 'O' or 'F' values in the L_LINESTATUS column.
9. We will now verify the number of distinct values in the L_LINESTATUS column with a SELECT DISTINCT call. To return a list of all values that are in the L_LINESTATUS column execute the following SQL command:

LABDB(LABADMIN)=> SELECT DISTINCT L_LINESTATUS FROM LINEITEM;


You should see the following results:

LABDB(LABADMIN)=> SELECT DISTINCT L_LINESTATUS FROM LINEITEM;

 L_LINESTATUS
--------------
 O
 F
(2 rows)
We can see that the L_LINESTATUS column only contains two distinct values. As a distribution key, this will result in a table
that is only distributed to two of the available dataslices.
10. We verify this by executing the following SQL call, which will return a list of all dataslices that contain rows of the LINEITEM table and the corresponding number of rows stored on them:

LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;


This will result in a similar output to the following:

IBM PureData System for Analytics


Copyright IBM Corp. 2012. All rights reserved

Page 6 of 19

IBM Software
Information Management

LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;

 DATASLICEID |  COUNT
-------------+---------
           1 | 3004998
           4 | 2996217
(2 rows)
Every PureData System table has a hidden column DATASLICEID, which contains the id of the data slice the row is stored on. By executing a SQL query that does a GROUP BY on this column and counts the number of rows for each dataslice id, data skew can be detected.
In this case, as we already expected, the table has been distributed to only two of the four available dataslices. This means that we only use half of the available space, and it will also result in low performance during most query executions. In general a good distribution key should have a large number of distinct values with a good value distribution. Columns with a low number of distinct values, especially Boolean columns, should not be considered as distribution keys.
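This check can be condensed into a single skew metric (a sketch; substitute any table name, and note that data slices holding no rows at all will not appear in the inner query):

    -- Ratio of the fullest to the emptiest populated data slice;
    -- values close to 1.0 indicate an even distribution
    SELECT MAX(cnt) * 1.0 / MIN(cnt) AS skew_ratio
    FROM (SELECT DATASLICEID, COUNT(*) AS cnt
          FROM LINEITEM
          GROUP BY DATASLICEID) t;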

2.2 Processing Skew

Even in tables that are distributed evenly across dataslices, data processing for queries can be concentrated or skewed to a limited number of dataslices. This can happen because PureData System is able to ignore data extents (sets of data pages) that do not match a given WHERE condition. We will cover the mechanism behind that in the zone map chapter.
1. First we will pick a new distribution key. As we have seen, it should have a large number of distinct values. One of the columns that fit this description was the L_SHIPDATE column. Check the number of distinct values in the ship date column with a COUNT(DISTINCT ...) statement:

LABDB(LABADMIN)=> SELECT COUNT(DISTINCT L_SHIPDATE) FROM LINEITEM;


You will get a result similar to the following:

LABDB(LABADMIN)=> SELECT COUNT(DISTINCT L_SHIPDATE) FROM LINEITEM;


 COUNT
-------
  2526
(1 row)
The column has over 2500 distinct values and therefore more than enough distinct values to guarantee a good data distribution on 4 dataslices. Of course, this is under the assumption that the value distribution is good as well.
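That assumption can be sanity-checked by looking at how often the most frequent values occur (a sketch; a single heavily dominant value would hint at skew):

    -- The five most frequent ship dates and their row counts
    SELECT L_SHIPDATE, COUNT(*) AS cnt
    FROM LINEITEM
    GROUP BY L_SHIPDATE
    ORDER BY cnt DESC
    LIMIT 5;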
2. Now let's reload the LINEITEM table with the new distribution key. For this we need to change the SQL of the loading script we executed at the beginning of the lab. Exit the nzsql console by entering: \q

3. You should now be in the lab directory /labs/dataDistribution. The table creation statement is situated in the lineitem.sql file. We will need to make changes to the file with a text editor. Open the file with the default Linux text editor vi. To do this enter the following command:
vi lineitem.sql

4. The vi editor has two modes: a command mode used to save files, quit the editor, and so on, and an insert mode. Initially you will be in the command mode. To change the file you need to switch into the insert mode by pressing "i". The editor will show an "INSERT" indicator at the bottom of the screen.

5. You can now use the cursor keys to navigate to the DISTRIBUTE ON clause at the bottom of the create command. Change the distribution key to l_shipdate. The editor should now look like the following:


create table lineitem
(
l_orderkey integer not null ,
l_partkey integer not null ,
l_suppkey integer not null ,
l_linenumber integer not null ,
l_quantity decimal(15,2) not null ,
l_extendedprice decimal(15,2) not null ,
l_discount decimal(15,2) not null ,
l_tax decimal(15,2) not null ,
l_returnflag char(1) not null ,
l_linestatus char(1) not null ,
l_shipdate date not null ,
l_commitdate date not null ,
l_receiptdate date not null ,
l_shipinstruct char(25) not null ,
l_shipmode char(10) not null ,
l_comment varchar(44) not null
)
DISTRIBUTE ON (l_shipdate);
~
~
~
-- INSERT --

6. We will now save our changes. Press "Esc" to switch back into command mode. You should see that the "INSERT" string at the bottom of the screen vanishes. Enter ":wq!" and press Enter to write the file and quit the editor without any questions. If you made a mistake editing and would like to undo it, press "Esc", then enter ":q!", and go back to step 3.

7. Now repeat steps 4-6 of section 2.1 Data Skew:

   a. Recreate and load the LINEITEM table with your new distribution key by executing the ./create_lineitem_1.sh command
   b. Use the nzsql command to enter the command console
   c. Switch to the LABDB database by using the \c LABDB LABADMIN command

8. Now we verify that the new distribution key results in a good data distribution. For this we will repeat the query, which returns the number of rows for each datasliceid of the LINEITEM table. Execute the following command:

LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;


Your results should look similar to the following:


LABDB(LABADMIN)=> SELECT DATASLICEID, COUNT(*) FROM LINEITEM GROUP BY DATASLICEID;

 DATASLICEID |  COUNT
-------------+---------
           2 | 1497649
           3 | 1501760
           4 | 1501816
           1 | 1499990
(4 rows)

We can see that the data distribution is much better now. All four data slices have a roughly equal amount of rows.
9. Now that we have a database table with a good data distribution, let's look at a couple of queries we have received from the customer. The following query is executed regularly by the customer. It returns the average quantity shipped on a given day, grouped by the shipping mode. Execute the following query:

LABDB(LABADMIN)=> SELECT AVG(L_QUANTITY) AS AVG_Q, L_SHIPMODE FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY L_SHIPMODE;

Your results should look like the following:

LABDB(LABADMIN)=> SELECT AVG(L_QUANTITY) AS AVG_Q, L_SHIPMODE FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY L_SHIPMODE;
   AVG_Q   | L_SHIPMODE
-----------+------------
 26.045455 | MAIL
 27.147826 | TRUCK
 26.038567 | FOB
 24.780282 | RAIL
 25.708556 | AIR
 24.494186 | REG AIR
 25.562500 | SHIP
(7 rows)

This query will take all rows from the 29th of March 1996 and compute the average value of the L_QUANTITY column for each L_SHIPMODE value. It is a typical warehousing query insofar as a date column is used to restrict the row set that is taken as input for computation.
In this example most rows of the LINEITEM table will be filtered away; only rows that have the specified date will be used as input for the computation of the AVG aggregation.
10. Execute the following SQL statement to see on which data slice we can find the rows from the 29th of March 1996:

LABDB(LABADMIN)=> SELECT COUNT(*), DATASLICEID FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY DATASLICEID;

You should see the following:


LABDB(LABADMIN)=> SELECT COUNT(*), DATASLICEID FROM LINEITEM WHERE L_SHIPDATE = '1996-03-29' GROUP BY DATASLICEID;
 COUNT | DATASLICEID
-------+-------------
  2501 |           2
(1 row)
Since we used the shipping date column as a distribution key, all rows from a specific date can be found on one data slice and therefore also on one SPU. This means that for our previous query all rows on other data slices are dismissed, and the computation takes place on only one dataslice and SPU. This is known as processing skew. While this one SPU is working, the other SPUs will be idling.
Columns that are often used in WHERE conditions shouldn't be used as distribution keys, since this can easily result in processing skew. In warehousing environments this is especially true for date columns.
Good distribution keys are key columns; they have lots of distinct values and very rarely result in processing skew. In our example we have a couple of candidates to choose from: L_SUPPKEY, L_ORDERKEY, L_PARTKEY. All have a large number of distinct values.

3 Co-Location
The most basic warehouse schema consists of a fact table containing a list of all business transactions and a set of dimension tables that contain the different actors, objects, locations, and time points that have taken part in these transactions. This means that most queries will not only access one database table but will require joins between tables.
In a PureData System database, tables are distributed over a potentially large number of data slices on different SPUs. This means that during a join of two tables there are two possibilities:
• Rows of the two tables that belong together are situated on the same dataslice, which means that they are co-located and can be joined locally.
• Rows that belong together are situated on different dataslices, which means that tables need to be redistributed.

3.1 Investigation

Obviously co-location has big performance advantages. In the following section we will demonstrate this by introducing a second table, ORDERS.

1. Switch to the Linux command line, if you are in the nzsql console. Do this with the \q command.

2. Switch to the data distribution lab directory with the command cd /labs/dataDistribution

3. Create and load the ORDERS table by executing the following command: ./create_orders_1.sh

4. Enter the nzsql console with the nzsql labdb labadmin command

5. Let's take a look at the ORDERS table with the \d orders command. You should see the following results.


LABDB(LABADMIN)=> \d orders
                     Table "ORDERS"
    Attribute    |          Type          | Modifier | Default Value
-----------------+------------------------+----------+---------------
 O_ORDERKEY      | INTEGER                | NOT NULL |
 O_CUSTKEY       | INTEGER                | NOT NULL |
 O_ORDERSTATUS   | CHARACTER(1)           | NOT NULL |
 O_TOTALPRICE    | NUMERIC(15,2)          | NOT NULL |
 O_ORDERDATE     | DATE                   | NOT NULL |
 O_ORDERPRIORITY | CHARACTER(15)          | NOT NULL |
 O_CLERK         | CHARACTER(15)          | NOT NULL |
 O_SHIPPRIORITY  | INTEGER                | NOT NULL |
 O_COMMENT       | CHARACTER VARYING(79)  | NOT NULL |
Distributed on random: (round-robin)
The ORDERS table has a key column O_ORDERKEY that is most likely the primary key of the table. It contains information on the order value, priority, and date, and has been distributed on random. This means that PureData System doesn't use a hash-based algorithm to distribute the data. Instead, rows are distributed randomly on the available data slices.
You can check the data distribution of the table using the methods we have used before for the LINEITEM table. The data distribution will be perfect. There will also not be any processing skew for queries on the single table, since in a random distribution there can be no correlation between any WHERE condition and the distribution key.
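For reference, random distribution is requested in the DDL like this (a sketch with a hypothetical table, not part of the lab schema):

    -- Round-robin distribution: rows are spread evenly, with no distribution key
    create table web_clicks
    (
        click_time timestamp not null,
        url        varchar(200)
    )
    DISTRIBUTE ON RANDOM;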
6. We have received another typical query from our customer. It returns the average total price and item quantity of all orders, grouped by the order priority. This query has to join together the LINEITEM and ORDERS tables to get the total order cost from the ORDERS table and the quantity for each shipped item from the LINEITEM table. The tables are joined with an inner join on the order key columns. Execute the following query and note the approximate execution time:

LABDB(LABADMIN)=> SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

You should see the following results:

LABDB(LABADMIN)=> SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;
     PRICE     | QUANTITY  | O_ORDERPRIORITY
---------------+-----------+-----------------
 189285.029553 | 25.526186 | 2-HIGH
 189219.594349 | 25.532474 | 5-LOW
 189093.608965 | 25.513563 | 1-URGENT
 189026.093657 | 25.494518 | 3-MEDIUM
 188546.457203 | 25.472923 | 4-NOT SPECIFIED
(5 rows)

Notice that the query takes about a minute to complete on our machine. The actual execution times on your machine will be
different.
7. Remember that the ORDERS table was distributed randomly and the LINEITEM table is still distributed by the L_SHIPDATE column. The join on the other hand is taking place on the L_ORDERKEY and O_ORDERKEY columns. We will now have a quick look at what is happening inside PureData System in this scenario. To do this we use the PureData System EXPLAIN function. This will be more thoroughly covered in the Optimization lab.


Execute the following command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;
You will get a long output. Scroll up until you see your command in the text window. The start of the EXPLAIN output should look like the following:

EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

QUERY VERBOSE PLAN:
Node 1.
  [SPU Sequential Scan table "ORDERS" as "O" {}]
      -- Estimated Rows = 1500000, Width = 27, Cost = 0.0 .. 578.6, Conf = 100.0
      Projections:
        1:O.O_TOTALPRICE  2:O.O_ORDERPRIORITY  3:O.O_ORDERKEY
  [SPU Distribute on {(O.O_ORDERKEY)}]
  [HashIt for Join]
Node 2.
  [SPU Sequential Scan table "LINEITEM" as "L" {(L.L_SHIPDATE)}]
      -- Estimated Rows = 6001215, Width = 12, Cost = 0.0 .. 2417.5, Conf = 100.0
      Projections:
        1:L.L_QUANTITY  2:L.L_ORDERKEY
  [SPU Distribute on {(L.L_ORDERKEY)}]
...
The EXPLAIN functionality will be covered in detail in a following chapter, but it is easy to see what is happening here: the system is redistributing both the ORDERS and the LINEITEM tables. This is very bad because both tables are of significant size, so there is considerable overhead. This inefficient redistribution occurs because the tables are not distributed on a useful column. In the next section we will fix this.

3.2 Co-Located Joins

In the last section we have seen that a query using joins can result in costly data redistribution during join execution when the
joined tables are not distributed on the join key. In this section we will reload the tables based on the mutual join key to enhance
performance during joins.
1. Exit the nzsql console with the \q command.

2. Switch to the dataDistribution directory with the cd /labs/dataDistribution command

3. Change the distribution key in the lineitem.sql file to L_ORDERKEY:

   a. Open the file with the vi editor by executing the command: vi lineitem.sql
   b. Switch to INSERT mode by pressing "i"
   c. Navigate with the cursor keys to the DISTRIBUTE ON clause and change it to DISTRIBUTE ON (L_ORDERKEY)
   d. Exit the INSERT mode by pressing "Esc"


   e. Enter ":wq!" in the command line of the vi editor and press Enter. Before pressing Enter your screen should look like the following:

create table lineitem
(
l_orderkey integer not null ,
l_partkey integer not null ,
l_suppkey integer not null ,
l_linenumber integer not null ,
l_quantity decimal(15,2) not null ,
l_extendedprice decimal(15,2) not null ,
l_discount decimal(15,2) not null ,
l_tax decimal(15,2) not null ,
l_returnflag char(1) not null ,
l_linestatus char(1) not null ,
l_shipdate date not null ,
l_commitdate date not null ,
l_receiptdate date not null ,
l_shipinstruct char(25) not null ,
l_shipmode char(10) not null ,
l_comment varchar(44) not null
)
DISTRIBUTE ON (l_orderkey);
~
:wq!
4. Change the distribution key in the orders.sql file to O_ORDERKEY:

   a. Open the file with the vi editor by executing the command: vi orders.sql
   b. Switch to INSERT mode by pressing "i"
   c. Navigate with the cursor keys to the DISTRIBUTE ON clause and change it to DISTRIBUTE ON (O_ORDERKEY)
   d. Exit the INSERT mode by pressing "Esc"
   e. Enter ":wq!" in the command line of the vi editor and press Enter. Before pressing Enter your screen should look like the following:

create table orders
(
o_orderkey integer not null ,
o_custkey integer not null ,
o_orderstatus char(1) not null ,
o_totalprice decimal(15,2) not null ,
o_orderdate date not null ,
o_orderpriority char(15) not null ,
o_clerk char(15) not null ,
o_shippriority integer not null ,
o_comment varchar(79) not null
)
DISTRIBUTE ON (o_orderkey);
~
:wq!
5. Recreate and load the LINEITEM table with the distribution key L_ORDERKEY by executing the command: ./create_lineitem_1.sh


6. Recreate and load the ORDERS table with the distribution key O_ORDERKEY by executing the command: ./create_orders_1.sh

7. Enter the nzsql console by executing the following command: nzsql labdb labadmin

8. Repeat the EXPLAIN of our join query from the previous section by executing the following command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;
The query itself has not been changed. The only changes are in the distribution keys of the involved tables. You will again
see a long output. Scroll up to the start of the output, directly after your query. You should see a similar output to the
following:

EXPLAIN VERBOSE SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

QUERY VERBOSE PLAN:
Node 1.
  [SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
      -- Estimated Rows = 1500000, Width = 27, Cost = 0.0 .. 578.6, Conf = 100.0
      Projections:
        1:O.O_TOTALPRICE  2:O.O_ORDERPRIORITY  3:O.O_ORDERKEY
  [HashIt for Join]
Node 2.
  [SPU Sequential Scan table "LINEITEM" as "L" {(L.L_ORDERKEY)}]
      -- Estimated Rows = 6001215, Width = 12, Cost = 0.0 .. 2417.5, Conf = 100.0
      Projections:
        1:L.L_QUANTITY  2:L.L_ORDERKEY
...
Again we do not want to make a complete analysis of the explain output; we will cover that in more detail in later chapters. But if you compare the output with the output of the last section, you will see that the [SPU Distribute on {(O.O_ORDERKEY)}] nodes have vanished. The reason is that the join is now co-located, because both tables are distributed on the join key.
You may see a distribution node further below during the execution of the GROUP BY clause, but this is forecast to distribute only a hundred rows, which has no negative performance influence.
9. Finally execute the joined query again:

LABDB(LABADMIN)=> SELECT AVG(O.O_TOTALPRICE) AS PRICE, AVG(L.L_QUANTITY) AS QUANTITY, O_ORDERPRIORITY FROM ORDERS AS O, LINEITEM AS L WHERE L.L_ORDERKEY=O.O_ORDERKEY GROUP BY O_ORDERPRIORITY;

The query should return the same results as in the previous section but run much faster, even in the VMware environment. In a real PureData System appliance with 6, 12, or more SPUs the difference would be much more significant.
You have now loaded the LINEITEM and ORDERS tables into your PureData System appliance using distribution keys that are optimal for these tables in most situations.
a. Both tables are distributed evenly across dataslices, so there is no data skew.

b. The distribution key is highly unlikely to result in processing skew, since most WHERE conditions will restrict a key column evenly.
c. Since ORDERS is a parent table of LINEITEM, with a foreign key relationship between them, most queries joining them together will utilize the join key. These queries will be co-located.

Now finally we will pick the distribution keys of the full schema.

4 Schema Creation
Now that we have created the ORDERS and LINEITEM tables we need to pick the distribution keys for the remaining tables as
well.

4.1 Investigation

Figure 2 LABDB database


You will notice that it is much harder to find optimal distribution keys in a more complicated schema like this. In many situations
you will be forced to choose between enabling co-located joins between one set of tables or another one.
The following provides some details on the remaining tables:

Table    | Number of Rows | Primary Key
---------+----------------+------------
REGION   | 5              | R_REGIONKEY
NATION   | 25             | N_NATIONKEY
CUSTOMER | 150000         | C_CUSTKEY
ORDERS   | 1500000        | O_ORDERKEY
SUPPLIER | 10000          | S_SUPPKEY
PART     | 200000         | P_PARTKEY
PARTSUPP | 800000         | --
LINEITEM | 6000000        | --

And on the involved relationships:

Parent Table | Child Table | Parent Table Join Column | Child Table Join Column
-------------+-------------+--------------------------+------------------------
REGION       | NATION      | R_REGIONKEY              | N_REGIONKEY
NATION       | CUSTOMER    | N_NATIONKEY              | C_NATIONKEY
NATION       | SUPPLIER    | N_NATIONKEY              | S_NATIONKEY
CUSTOMER     | ORDERS      | C_CUSTKEY                | O_CUSTKEY
ORDERS       | LINEITEM    | O_ORDERKEY               | L_ORDERKEY
SUPPLIER     | LINEITEM    | S_SUPPKEY                | L_SUPPKEY
SUPPLIER     | PARTSUPP    | S_SUPPKEY                | PS_SUPPKEY
PART         | LINEITEM    | P_PARTKEY                | L_PARTKEY
PART         | PARTSUPP    | P_PARTKEY                | PS_PARTKEY

Given all that you heard in the presentation and lab, try to fill in the distribution keys in the chart below. Let's assume that we will not change the distribution keys for LINEITEM and ORDERS anymore.

Table    | Distribution Key (up to 4 columns) or Random
---------+----------------------------------------------
REGION   |
NATION   |
CUSTOMER |
SUPPLIER |
PART     |
PARTSUPP |
ORDERS   | O_ORDERKEY
LINEITEM | L_ORDERKEY

4.2 Solution

It is important to note that there is no single optimal way to pick distribution keys. It always depends on the queries that run against the database. Without these queries it is only possible to follow some general rules:
• Co-location between big tables (especially if a fact table is involved) is more important than between small tables.

• Very small tables can be broadcast by the system with little performance penalty. If one table of a join is broadcast, the other will not need to be redistributed.
• If you suspect that there will be lots of queries joining two big tables but you cannot distribute both of them on the expected join key, distributing one table on the join key is better than nothing, since it will lead to a single redistribute instead of a double redistribute.

If we break down the problem, we can see that PART and PARTSUPP are the biggest two of the remaining tables, and we have already, based on the available customer queries, distributed the LINEITEM table on the order key, so it seems to make sense to distribute PART and PARTSUPP on their join keys.
CUSTOMER is big as well and has two relationships. The first relationship is with the very small NATION table, which is easily broadcast by the system. The second relationship is with the ORDERS table, which is big as well but already distributed by the order key. But as mentioned above, a single redistribute is better than a double redistribute. Therefore it makes sense to distribute the CUSTOMER table on the customer key, which is also the join key of this relationship.
The situation is very similar for the SUPPLIER table. It has two very large child tables, PARTSUPP and LINEITEM, which are both related to it through the supplier key, so it should be distributed on this key.
NATION and REGION are both very small and will most likely be broadcast by the optimizer. You could distribute those tables randomly, on their primary keys, or on their join keys. In this case we have decided to distribute both on their primary keys, but there is no definitively right or wrong approach. One possible solution for the distribution keys could be the following.

Table    | Distribution Key (up to 4 columns) or Random
---------+----------------------------------------------
REGION   | R_REGIONKEY
NATION   | N_NATIONKEY
CUSTOMER | C_CUSTKEY
SUPPLIER | S_SUPPKEY
PART     | P_PARTKEY
PARTSUPP | PS_PARTKEY
ORDERS   | O_ORDERKEY
LINEITEM | L_ORDERKEY
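As a sketch of what the corresponding DDL looks like (the column list here is abridged and illustrative; the complete definitions are in remaining_tables.sql):

    create table customer
    (
        c_custkey   integer not null,
        c_name      varchar(25) not null,
        c_nationkey integer not null
    )
    DISTRIBUTE ON (c_custkey);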

Finally we will actually load the remaining tables.


1. You should still be connected to the LABDB database. We now need to recreate the NATION and REGION tables with a new distribution key. To drop the old versions execute the following command:

LABDB(LABADMIN)=> DROP TABLE NATION, REGION;


2. Quit the nzsql console with the \q command.

3. Navigate to the lab folder by executing the following command: cd /labs/dataDistribution

4. Verify the SQL script creating the remaining 6 tables with the command: more remaining_tables.sql

You will see the SQL script used for creating the remaining tables with the distribution keys mentioned above. Press the Enter key to scroll down until you reach the end of the file.

5. Create the remaining tables and load the data into them with the following command: ./create_remaining.sh


You should see the following results. The error message at the top is expected since the script tries to clean up any old
tables of the same name in case a reload is necessary.

[nz@netezza dataDistribution]$ ./create_remaining.sh


ERROR: Table 'NATION' does not exist
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE TABLE
Load session of table 'NATION' completed successfully
Load session of table 'REGION' completed successfully
Load session of table 'CUSTOMER' completed successfully
Load session of table 'SUPPLIER' completed successfully
Load session of table 'PART' completed successfully
Load session of table 'PARTSUPP' completed successfully

Congratulations! You have just defined data distribution keys for a customer data schema in PureData System. You can have a look at the created tables and their definitions with the commands you used in the previous chapters. We will continue to use the tables we created in the following labs.


IBM PureData System Administrator
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1 Introduction
2 Installing NzAdmin
3 The System Tab
4 The Database Tab
5 Tools
  5.1 Workload Management
  5.2 Table Storage
  5.3 Table Skew


1 Introduction
In this lab we will explore the features of the IBM PureData System Administrator GUI tool, NzAdmin. NzAdmin is a Windows-based application that allows users to manage the system, obtain hardware information and status, and manage various aspects of user databases, tables, and objects. NzAdmin consists of two distinct environments: the System tab and the Database tab. We will look at both. When you click either tab, the system displays the tree view on the left side and the data view on the right side.
The VMware image we are using in the labs differs significantly from a normal PureData System appliance. There is only one virtualized SPA and SPU, only 4 dataslices, and no dataslice mirroring. In addition to that, some NzAdmin functions do not work with the VM; for example, the SPU and SPA sections are blank and the data distribution of a table cannot be displayed. Nevertheless most functionality works and should provide a good overview.

2 Installing NzAdmin
NzAdmin is part of the PureData System client tools for Windows. It can be installed with a standard Windows installer and doesn't require the JDBC or ODBC drivers to be installed, since it contains its own connection libraries.

1. The installation package is in C:\Bootcamp\Netezza_Bootcamp_VMs\Tools\nzsetup.exe
   (The base directory C:\Bootcamp may differ in your environment; there should be a shortcut on your Desktop as well. If you cannot find it, ask the instructor for help.)

2. Install the NzAdmin client by double-clicking on the installer and accepting all standard settings.

3. You can start NzAdmin from the Windows Start Menu: Programs -> IBM PureData System -> IBM PureData System Administrator

4. Connect to your PureData System host with the IP address taken from the VM where the PureData System host is running (you can use ifconfig eth0 in the Linux terminal window). In our lab the IP address is 192.168.239.2, username admin, and password password.


5. You should see the following:

The Admin client has a navigation menu on the left with two views, System and Database.
The System view contains information about the general status of the PureData System hardware and the PureData System Performance Server software. It displays system information and provides information about possible system problems like a hard disc failure. It also contains statistics like the user space usage.
The Database view contains information about the user databases in the system. It displays information about all database objects like tables, views, sequences, synonyms, user-defined functions, procedures, etc. It also provides the user with the tools necessary to manage groups and access privileges. You can also view the currently active database sessions and their queries, and a recent history of all queries that have been run on the system. Finally you can see the backup history for each database.
The menu bar contains common actions like refresh or connect. In addition to that it provides access to some administration tools, such as Workload Management and a tool for the identification of table skew.

3 The System Tab


In this section we will inspect the hardware components that make up a PureData System appliance using NzAdmin, including the SPUs and the data slices.


1. The default view is the main hardware view, which shows a dashboard representing the general health of the system. Unfortunately the hardware information cannot be gathered for our VM, but we do see the disc usage at the bottom. Note that the most important measure is actually the maximum storage utilization: if one disc runs full, no new data can be added to the system.

2. Unfortunately the SPA and SPU sections are empty for our VM system; normally we could see health information for all snippet processing arrays, snippet processing units, and their data slices and hard discs. The next available section is data slices. When you select it, you can see that our VM has 4 dataslices (1-4) on four hard discs (1002-1005). Normally we would also see which disc contains the mirrors of these data slices, but our VM system doesn't mirror its data slices.

3. Under the data slice section you can see the currently active event rules. Event rules monitor the system for noteworthy occurrences and act accordingly, for example by sending an email or raising an alert, such as a mail to an administrator in case of a hardware failure. Unlike on a real PureData System appliance, only a very small set of event rules is enabled. You could use the New Event Rule wizard to add new events or generate test events.

IBM PureData System for Analytics


Copyright IBM Corp. 2012. All rights reserved

Page 5 of 16

IBM Software
Information Management

4 The Database Tab


In this section we will learn how NzAdmin can be used to view and manage user database objects including tables, users,
groups and sessions.
1. Switch tabs to the Database tab. This is the area where database tables, users, groups, and sessions can be viewed and managed. You may not have some of the database objects displayed in the image; this shouldn't change the lab in any way.

2. In the Database tab, expand Databases and click on LABDB. NzAdmin can view all the objects of the following types: tables, views, sequences, synonyms, functions, aggregates, procedures, and libraries. You can also create objects of the following types: tables, views, sequences, and synonyms. Furthermore, many of these object types can be managed in some way using NzAdmin; for example, we have control over tables in NzAdmin. Finally, we can see the currently running Groom processes in the Groom Status section.


3. Click on Tables in the tree or data view. For each table in the LABDB database you can view information such as the owner, creation date, number of columns, size of table on disk, data skew, row count, and percentage organized if enabled.

4. If you right-click on a table you can select ways in which to manage the table, including changing the name, owner, columns, organizing keys, or default value, generating or viewing statistics, viewing the record distribution, and truncating and/or dropping the table.


Unfortunately one of the most important menu entries, "Record Distribution", which gives you a graphical view of the data distribution of the table, doesn't work in our VMware environment.

5. To look at information about the columns, distribution key, organizing keys, DDL, and users/groups with privileges for a table, double-click on the table entry to bring up the details:

This view shows the columns of the table and their constraints. It also shows whether the columns are Zone Map enabled or not; Zone Maps are an important performance feature in PureData System and will be discussed later in this course. You can set access privileges to the table with the Privileges button. The DDL button returns the command to create the table and is a convenient feature for administrators.


6. Close the Table Properties view again and click on the Users field of the left navigation tree. Here you can create and manage users.

Users can either be created from a context menu on the Users folder or from the Options menu at the top of the screen. To manage users use the context menu that is displayed when you right-click on the user you want to manage. NzAdmin allows you to rename or drop users, change their privileges and workload management settings, etc.

7. You can do the same management for groups in the Groups section of the Database tab.

8. Click on Sessions in the Database tab. Here you can see who is currently logged into the PureData System and the commands they have issued. You can also abort sessions or change their priority in a workload management environment (this has to be set up before you can change the priority).


9. To see and manage active queries you can expand Sessions in the Database tab and click on Active Queries; however, there are no queries running at this time.

10. Comprehensive query history information can be seen by clicking on the Query History section in the Database tab. PureData System keeps a query history of the last 2000 queries. For a full audit log you would need to use the Query History database. Select the View Query Entry menu item from the context menu to get a more structured view of the values of a specific query:

11. Another window is opened showing the fields of the query history table in a more structured way:


PureData System saves a significant amount of information for each recent query, which can help you to identify queries that behave badly. Important values are the estimated and actual elapsed seconds, the result rows, and of course the actual SQL statement.

12. It is also possible to get information about the actual execution plan of the query. We will discuss this in more detail in future modules. To see a graphical representation of an Explain output, right-click on a query and select Explain->Graph:

13. You should see a similar window; the actual graph may differ depending on the query you pick:


This graph shows how the PureData System appliance plans to execute the query in question. It is an important part of troubleshooting the occasional misbehaving query. We will discuss this in more detail in the Query Optimization module. You can also get a textual view by selecting Explain->Verbose.

14. Close the graph and display the plan file with the View Plan File entry in the context menu. You should see a window similar to the following. Please scroll down to the bottom:


Plan files look similar to Verbose Explain information, but there is a significant difference: Explain information tells you how the system plans to execute a query, while plan files add information on how the query was actually executed, including actual execution times. They are an invaluable help for debugging queries that failed or took longer than expected.

15. Finally select the Backup->History field in the navigator.

PureData System logs all backup and restore sessions in the system. You will see an empty list, but if you return to this view after the Backup and Restore lab you will see the backup and restore processes you started. The backup history allows PureData System to provide easy incremental or cumulative backups and to synchronize backups with the groom process; we will discuss more about that in a later section.

5 Tools
In this section we will learn how to set workload management system settings with NzAdmin, as well as search for data skew,
and view disk usage by database or user.

5.1 Workload Management

1. From the menu bar at the top of NzAdmin click on Tools -> Workload Management -> System Settings. Using this tool we can limit the maximum number of rows allowed in a table, enable query timeouts and session idle timeouts, and set a default session priority.

2. From the Workload Management menu option, click into Performance -> Summary.

3. From the Summary pane, we can look at activities that happened in the last hour in an aggregate view. Keep this in mind for the Workload Management module.


5.2 Table Storage

1. From the menu bar at the top of NzAdmin click on Tools -> Table Storage. This is a tool which will tell us the total size in MB for each database, or the total size of all the databases a user owns.

5.3 Table Skew

1. From the menu bar at the top of NzAdmin click on Tools -> Table Skew. This tool displays tables that meet or exceed a specified data skew threshold between data slices.

Once an administrator has seen in the main overview that the maximum storage utilization differs significantly from the average, he can use this tool to find the skewed tables. He can then fix them, for example by redistributing them with a CTAS table, as sketched below. Skewed tables not only limit the available storage but also significantly lower performance.
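A minimal sketch of such a CTAS redistribution (the table and key names are illustrative):

    -- Rebuild the skewed table with a better distribution key, then swap it in
    create table lineitem_fixed as
        select * from lineitem
        distribute on (l_orderkey);

    drop table lineitem;
    alter table lineitem_fixed rename to lineitem;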

Copyright IBM Corporation 2011


All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list
of IBM trademarks is available on the Web at Copyright and
trademark information at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or
service marks of others.
References in this publication to IBM products and services do not
imply that IBM intends to make them available in all countries in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM's future direction and intent are subject to
change or withdrawal without notice, and represent goals and
objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.


Loading and Unloading Data
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1 Introduction
  1.1 Objectives
2 External Tables
  2.1 Unloading Data using External Tables
  2.2 Dropping External Tables
  2.3 Loading Data using External Tables
3 Loading Data using the nzload Utility
  3.1 Using the nzload Utility with Command Line Options
  3.2 Using the nzload Utility with a Control File
  3.3 (Optional) Using nzload with Bad Records


1 Introduction
In every data warehouse environment there is a need to load new data into the database. Loading data into the database is not just a one-time operation but rather a continuous operation that can occur hourly, daily, weekly, or even monthly. Loading data into a database is a vital operation that needs to be supported by the data warehouse system. IBM PureData System provides a framework to support not only the loading of data into the PureData System database environment but also the unloading of data from the database environment. This framework contains more than one component; some of these components are:

• External Tables - These are tables stored as flat files on the host or client systems and registered like tables in the PureData System catalog. They can be used to load data into the PureData System appliance or unload data to the file system.

• nzload - This is a wrapper command line tool around external tables that provides an easy method of loading data into the PureData System appliance (see the sketch after this list).

• Format Options - These are options for formatting the data load to and from external tables.
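As a first taste of the nzload utility covered in section 3, a typical invocation looks roughly like this (the data file name and delimiter are illustrative):

    nzload -db LABDB -u labadmin -pw password -t REGION -df region.del -delim '|'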

1.1 Objectives

This lab will help you explore the IBM PureData System framework components for loading and unloading data from the database. You will use the various commands to create external tables to unload and load data. Then you will get a basic understanding of the nzload utility. In this lab the REGION and NATION tables in the LABDB database are used to illustrate the use of external tables and the nzload utility. After this lab you will have a good understanding of how to load and unload data from a PureData System database environment.

• The first part of this lab will explore using External Tables to unload and load data.

• The second part of this lab will discuss using the nzload utility to load records into tables.

2 External Tables
An external table allows PureData System to treat an external file as a database table. An external table has a definition (a table
schema) in the PureData System system catalog but the actual data exists outside of the PureData System appliance database.

This is referred to as a datasource file. External tables can be used to access files which are stored on the file system. After you have created the external table definition, you can use INSERT INTO statements to load data from the external file into a database table, or SELECT FROM statements to query the external table. Different methods to create and use external tables using the nzsql interface are described below. Along with this, the external datasource files for the external tables are examined, so a second session will be used to help view these files.
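In a nutshell, the flow looks like this (a sketch with a hypothetical table named demo; the lab's real commands follow below):

    -- Register a flat file as an external table with the same schema as demo
    create external table ext_demo sameas demo
      using (dataobject ('/tmp/demo_flat_file') delimiter '|');

    -- Unload: inserting into the external table writes records to the file
    insert into ext_demo select * from demo;

    -- Load: selecting from the external table reads records from the file
    insert into demo select * from ext_demo;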

I. Connect to your PureData System image using PuTTY. Log in to 192.168.239.2 as user nz with password nz. (192.168.239.2 is the default IP address for a local VM; the IP may be different for your bootcamp.)

II. Change to the lab working directory /labs/movingData with the following command:
cd /labs/movingData

III. Connect to the LABDB database as the database owner, LABADMIN, using the nzsql interface:

[nz@netezza ~] nzsql -d LABDB -u labadmin -pw password

You should see the following results:

Welcome to nzsql, the Netezza SQL interactive terminal.
Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

LABDB(LABADMIN)=>

IV. In this lab we will need to alternate between executing SQL commands and operating system commands. To make it easier for you, we will open a second PuTTY session for executing operating system commands (running nzload, viewing generated external files, etc.). It will be referred to as session 2 throughout the lab.

The picture above shows the two PuTTY windows that you will need. Session 1 will be used for SQL commands and session 2 for operating system prompt commands.

V. Open another session using PuTTY. Log in to 192.168.239.2 as user nz with password nz. (192.168.239.2 is the default IP address for a local VM; the IP may be different for your bootcamp.)
Also make sure that you change to the correct directory, /labs/movingData:

[nz@netezza ~] cd /labs/movingData


2.1 Unloading Data using External Tables

External tables will be used to unload rows from the LABDB database as records into an external datasource file. Various methods to create and use external tables will be explored, unloading rows from either the REGION or the NATION table. Five different use cases are presented for you to follow so that you can gain a better understanding of how to use external tables to unload data from a database.

2.1.1 Unloading data with an External Table created with the SAMEAS clause

The first external table will be used to unload data from the REGION table into an ASCII delimited text file. This external table will be named ET1_REGION, using the same column definition as the REGION table. After the ET1_REGION external table is created you will then use it to unload all the rows from the REGION table. The records for the ET1_REGION external table will be in the external datasource file, et1_region_flat_file. The basic syntax to create this type of external table is:

CREATE EXTERNAL TABLE table_name
  SAMEAS table_name
  USING external_table_options

The SAMEAS clause allows the external table to be created with the same column definition as the referenced table. This is referred to as implicit schema definition.
1. As the LABDB database owner, LABADMIN, you will create the first basic external table using the same column definitions as the REGION table:

LABDB(LABADMIN)=> create external table et1_region sameas region
                  using (dataobject ('/labs/movingData/et1_region_flat_file'));

2. To list the external tables in the LABDB database you use the internal slash option, \dx:

LABDB(LABADMIN)=> \dx

Which will list the external table you just created:

          List of relations
    Name    |      Type      |  Owner
------------+----------------+----------
 ET1_REGION | EXTERNAL TABLE | LABADMIN
(1 rows)

3. You can also list the properties of the external table using the following internal slash option to describe the table, \d <external table name>:

LABDB(LABADMIN)=> \d et1_region

Which will list the properties of the ET1_REGION external table:


External Table "ET1_REGION"
  Attribute  |          Type          | Modifier
-------------+------------------------+----------
 R_REGIONKEY | INTEGER                |
 R_NAME      | CHARACTER(25)          |
 R_COMMENT   | CHARACTER VARYING(152) |

 DataObject          - '/labs/movingData/et1_region_flat_file'
 adjustdistzeroint   - f
 bool style          - 1_0
 code set            -
 compress            - FALSE
 cr in string        - f
 ctrl chars          - f
 date delim          - -
 date style          - YMD
 delim               - |
 encoding            - INTERNAL
 escape              -
 fill record         - f
 format              - TEXT
 ignore zero         - f
 log dir             - /tmp
 max errors          - 1
 max rows            - 0
 null value          - NULL
 quoted value        - NO
 remote source       -
 require quotes      - f
 skip rows           - 0
 socket buf size     - 8388608
 timedelim           - :
 time round nanos    - f
 time style          - 24HOUR
 trunc string        - f
 y2base              - 0
 includezeroseconds  - f
 record length       -
 record delimiter    -
 nullindicator bytes -
 layout              -
 decimaldelim        -

This output includes the columns and associated data types in the external table. You will notice that this is similar to the
REGION table since the external table was created using the SAMEAS clause in the CREATE EXTERNAL TABLE command.
The output also includes the properties of the external table. The most notable property is the DataObject property that
shows the location and the name of the external datasource file used for the external table. We will examine some of the
other properties in this lab.

4. Now that the external table is created you can use it to unload data from the REGION table using an INSERT statement:


LABDB(LABADMIN)=> insert into et1_region select * from region;

5. You can use this external table like a regular table by issuing SQL statements. Try issuing a simple SELECT FROM statement against the ET1_REGION external table:

LABDB(LABADMIN)=> select * from et1_region;

Which will return all the rows in the ET1_REGION external table:

 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+-----------------------------
           2 | sa     | south america
           1 | na     | north america
           4 | ap     | asia pacific
           3 | emea   | europe, middle east, africa
(4 rows)

You will notice that this is the same data that is in the REGION table. But the data retrieved for this SELECT statement came from the datasource file of this external table and not from the data within the database.

6. The main reason for creating an external table is to unload data from a table to a file. Using the second PuTTY session, review the file that was created, et1_region_flat_file, in the /labs/movingData directory:

[nz@netezza movingData]$ more et1_region_flat_file


The file should look similar to the following:

2|sa|south america
1|na|north america
4|ap|asia pacific
3|emea|europe, middle east, africa
This is an ASCII delimited flat file containing the data from the REGION table. The column delimiter used in this file is the default character '|'.

2.1.2 Unloading data with an External Table using the AS SELECT clause

The second external table will also be used to unload data from the REGION table into an ASCII delimited text file using a
different method. The external table will be created and the data will be unloaded in the same create statement. So a separate
step is not required to unload the data. The external table will be named ET2_REGION and the external datasource file will be
named et2_region_flat_file. The basic syntax to create this type of external table is:

CREATE EXTERNAL TABLE table_name 'filename'
  AS select_statement;

The AS clause allows the external table to be created with the same columns returned in the SELECT FROM statement, which is
referred to as implicit table schema definition. This also unloads the rows at the same time the external table is created.


1. The first method used to create an external table required the data to be unloaded in a second step using an INSERT statement. Now you will create an external table and unload the data in a single step:

LABDB(LABADMIN)=> create external table et2_region
                  '/labs/movingData/et2_region_flat_file' as select * from region;

This command created the external table ET2_REGION using the same definition as the REGION table and also unloaded the data to the et2_region_flat_file file.

2. List the external tables in the LABDB database:

LABDB(LABADMIN)=> \dx

Which will list all the external tables in the LABDB database:

          List of relations
    Name    |      Type      |  Owner
------------+----------------+----------
 ET1_REGION | EXTERNAL TABLE | LABADMIN
 ET2_REGION | EXTERNAL TABLE | LABADMIN
(2 rows)

You will notice that there are now two external tables. You can also list the properties of the external table, but the output will be similar to the output in the last section, except for the filename.

3. Using the second session, review the file that was created, et2_region_flat_file, in the /labs/movingData directory:

[nz@netezza movingData]$ more et2_region_flat_file

The file should look similar to the following:

2|sa|south america
1|na|north america
4|ap|asia pacific
3|emea|europe, middle east, africa

This file is exactly the same as the file you reviewed in the last chapter. The difference this time is that we didn't need to unload the data explicitly.

2.1.3 Unloading data with an external table using defined columns

The first two external tables that you created used the exact same columns as the REGION table, using an implicit table schema. You can also create an external table by explicitly specifying the columns. This is referred to as an explicit table schema. The third external table that you create will still be used to unload data from the REGION table, but only from the R_NAME and R_COMMENT columns. The ET3_REGION external table will be created in one step and then the data will be unloaded into the et3_region_flat_file ASCII delimited text file using a different delimiter string. The basic syntax to create this type of external table is:

CREATE EXTERNAL TABLE table_name
  ({column_name type} [, ... ])
  [USING external_table_options]


1. You will create a new external table to only include the R_NAME and R_COMMENT columns, and exclude the R_REGIONKEY column from the REGION table. Along with this you will change the delimiter string from the default '|' to '=':

LABDB(LABADMIN)=> create external table et3_region (r_name char(25),
                  r_comment varchar(152)) USING (dataobject
                  ('/labs/movingData/et3_region_flat_file') DELIMITER '=');

2. List the properties of the ET3_REGION external table:

LABDB(LABADMIN)=> \d et3_region

Which will list the properties of the ET3_REGION external table:


External Table "ET3_REGION"
 Attribute |          Type          | Modifier
-----------+------------------------+----------
 R_NAME    | CHARACTER(25)          |
 R_COMMENT | CHARACTER VARYING(152) |

 DataObject          - '/labs/movingData/et3_region_flat_file'
 adjustdistzeroint   - f
 bool style          - 1_0
 code set            -
 compress            - FALSE
 cr in string        - f
 ctrl chars          - f
 date delim          - -
 date style          - YMD
 delim               - =
 encoding            - INTERNAL
 escape              -
 fill record         - f
 format              - TEXT
 ignore zero         - f
 log dir             - /tmp
 max errors          - 1
 max rows            - 0
 null value          - NULL
 quoted value        - NO
 remote source       -
 require quotes      - f
 skip rows           - 0
 socket buf size     - 8388608
 timedelim           - :
 time round nanos    - f
 time style          - 24HOUR
 trunc string        - f
 y2base              - 0
 includezeroseconds  - f
 record length       -
 record delimiter    -
 nullindicator bytes -
 layout              -
 decimaldelim        -

You will notice that there are only two columns for this external table, since you only specified two columns when creating the external table. The rest of the output is very similar to the properties of the other two external tables that you created, with two main exceptions. The first difference is obviously the DataObject field, since the filename is different. The other difference is the string used for the delimiter, since it is now '=' instead of the default '|'.

3. Now you will unload the data from the REGION table, but only the data from columns R_NAME and R_COMMENT:

LABDB(LABADMIN)=> insert into et3_region select r_name, r_comment from region;


(Alternatively, you could create the external table and unload the data in one step using the following command:
IBM PureData System for Analytics
Copyright IBM Corp. 2012. All rights reserved

Page 10 of 32

IBM Software
Information Management

create external table et4_test '/labs/movingData/et4_region_flat_file'


using (delimiter '=') as select r_name, r_comment from region;

4. Using the second session, review the file that was created, et3_region_flat_file, in the /labs/movingData directory:

[nz@netezza movingData]$ more et3_region_flat_file


The file should look similar to the following:

sa=south america
na=north america
ap=asia pacific
emea=europe, middle east, africa
You will notice that only two columns are present in the flat file, using the '=' string as a delimiter.

2.1.4 (Optional) Unloading data with an External Table from two tables

The first three external tables unloaded data from one table. The next external table you will create will be based on a table join between the REGION and NATION tables. The two tables will be joined on the REGIONKEY and only the N_NAME and R_NAME columns will be defined for the external table. This exercise will illustrate how data can be unloaded using SQL statements other than a simple SELECT FROM statement. The external table will be named ET_NATION_REGION, using another ASCII delimited text file named et_nation_region_flat_file.
1. For the next external table you will unload data from both the REGION and NATION tables, joined on the REGIONKEY column, to list all of the countries and their associated regions. Instead of specifying the columns in the create external table statement you will use the AS SELECT option:

LABDB(LABADMIN)=> create external table et_nation_region
                  '/labs/movingData/et_nation_region_flat_file' as select n_name, r_name from
                  nation, region where n_regionkey=r_regionkey;

2. List the properties of the ET_NATION_REGION external table:

LABDB(LABADMIN)=> \d et_nation_region

Which will show the properties of the ET_NATION_REGION table:


External Table "ET_NATION_REGION"
 Attribute |     Type      | Modifier
-----------+---------------+----------
 N_NAME    | CHARACTER(25) | NOT NULL
 R_NAME    | CHARACTER(25) |

 DataObject          - '/labs/movingData/et_nation_region_flat_file'
 adjustdistzeroint   - f
 bool style          - 1_0
 code set            -
 compress            - FALSE
 cr in string        - f
 ctrl chars          - f
 date delim          - -
 date style          - YMD
 delim               - |
 encoding            - INTERNAL
 escape              -
 fill record         - f
 format              - TEXT
 ignore zero         - f
 log dir             - /tmp
 max errors          - 1
 max rows            - 0
 null value          - NULL
 quoted value        - NO
 remote source       -
 require quotes      - f
 skip rows           - 0
 socket buf size     - 8388608
 timedelim           - :
 time round nanos    - f
 time style          - 24HOUR
 trunc string        - f
 y2base              - 0
 includezeroseconds  - f
 record length       -
 record delimiter    -
 nullindicator bytes -
 layout              -
 decimaldelim        -

You will notice that the external table was created using the two columns specified in the SELECT clause: N_NAME and
R_NAME.
3. View the data of the ET_NATION_REGION external table:

LABDB(LABADMIN)=> select * from et_nation_region;

Which will show all the rows from the ET_NATION_REGION table:


        N_NAME         | R_NAME
-----------------------+--------
 brazil                | sa
 guyana                | sa
 venezuela             | sa
 portugal              | emea
 australia             | ap
 united kingdom        | emea
 united arab emirates  | emea
 south africa          | emea
 hong kong             | ap
 new zealand           | ap
 japan                 | ap
 macau                 | ap
 canada                | na
 united states         | na
(14 rows)

This is the result of joining the NATION and REGION tables on the REGIONKEY column to return just the N_NAME and R_NAME columns.

4. And now, using the second session, review the file that was created, et_nation_region_flat_file, in the /labs/movingData directory:

[nz@netezza movingData]$ more et_nation_region_flat_file


Which should look similar to the following:

brazil|sa
guyana|sa
venezuela|sa
portugal|emea
australia|ap
united kingdom|emea
united arab emirates|emea
south africa|emea
hong kong|ap
new zealand|ap
japan|ap
macau|ap
canada|na
united states|na
You can see that we created a delimited flat file from a complex SQL statement. External tables are a very flexible and powerful way to load, unload and transfer data.

2.1.5 (Optional) Unloading data with an External Table using the compressed format

The previous external tables that you created used the default ASCII delimited text format. This last external table will be similar to the second external table that you created, but instead of using an ASCII delimited text format you will use the compressed binary format. The name of the external table will be ET4_REGION and the datasource file name will be et4_region_compress. The basic syntax to create this type of external table is:


CREATE EXTERNAL TABLE table_name 'filename'
  USING (COMPRESS true FORMAT 'internal')
  AS select_statement;

The external table options COMPRESS and FORMAT must be specified to use the compressed binary format.
1. You will now create one last external table using a method similar to the one you used to create the second external table in section 2.1.2. But instead of using an ASCII delimited text format the datasource will be compressed. This is achieved by using the COMPRESS and FORMAT external table options:

LABDB(LABADMIN)=> create external table et4_region
                  '/labs/movingData/et4_region_compress' using (compress true format 'internal') as
                  select * from region;

As a reminder, the external table is created and the data is unloaded in the same operation using the AS SELECT clause.

2. List the properties of the ET4_REGION external table:

LABDB(LABADMIN)=> \d et4_region

Which will list the properties of the ET4_REGION table:


External Table "ET4_REGION"
  Attribute  |          Type          | Modifier
-------------+------------------------+----------
 R_REGIONKEY | INTEGER                |
 R_NAME      | CHARACTER(25)          |
 R_COMMENT   | CHARACTER VARYING(152) |

 DataObject          - '/labs/movingData/et4_region_compress'
 adjustdistzeroint   - f
 bool style          - 1_0
 code set            -
 compress            - TRUE
 cr in string        - f
 ctrl chars          - f
 date delim          - -
 date style          - YMD
 delim               - |
 encoding            - INTERNAL
 escape              -
 fill record         - f
 format              - INTERNAL
 ignore zero         - f
 log dir             - /tmp
 max errors          - 1
 max rows            - 0
 null value          - NULL
 quoted value        - NO
 remote source       -
 require quotes      - f
 skip rows           - 0
 socket buf size     - 8388608
 timedelim           - :
 time round nanos    - f
 time style          - 24HOUR
 trunc string        - f
 y2base              - 0
 includezeroseconds  - f
 record length       -
 record delimiter    -
 nullindicator bytes -
 layout              -
 decimaldelim        -

You will notice that the COMPRESS option has changed from FALSE to TRUE, indicating that the datasource file is compressed, and the FORMAT has changed from TEXT to INTERNAL, which is required for compressed files.
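The compressed file can later be reloaded by supplying the same two options on a transient external table. A minimal sketch, assuming the file created above is still in place (this reload is not one of this lab's steps):

LABDB(LABADMIN)=> insert into region select * from external
                  '/labs/movingData/et4_region_compress' using (compress true format 'internal');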

2.2 Dropping External Tables

Dropping external tables is similar to dropping a regular PureData System table. The column definition for the external table is removed from the PureData System catalog. Keep in mind that dropping the table doesn't delete the external datasource file, so the file has to be maintained separately. The external datasource file can still be used for loading data into a different table. In this chapter you will drop the ET1_REGION table, but you will not delete the associated external datasource file, et1_region_flat_file. This datasource file will be used later in this lab to load data into the REGION table.

1. Drop the first external table that you created, ET1_REGION, using the DROP TABLE command:

LABDB(LABADMIN)=> drop table et1_region;


The same drop command for tables is used for external tables, so there is no separate DROP EXTERNAL TABLE
command.

2. Verify that the external table has been dropped using the internal slash option, \dx:

LABDB(LABADMIN)=> \dx

Which will list all the external tables in the LABDB database:

             List of relations
       Name       |      Type      |  Owner
------------------+----------------+----------
 ET2_REGION       | EXTERNAL TABLE | LABADMIN
 ET3_REGION       | EXTERNAL TABLE | LABADMIN
 ET4_REGION       | EXTERNAL TABLE | LABADMIN
 ET_NATION_REGION | EXTERNAL TABLE | LABADMIN
(4 rows)

In this list the four remaining external tables that you created still exist.

3. Even though the external table definition no longer exists within the LABDB database, the flat file named et1_region_flat_file still exists in the /labs/movingData directory. Verify this by using the second PuTTY session:

[nz@netezza movingData]$ ls

Which will list all of the files in the /labs/movingData directory:

et1_region_flat_file  et2_region_flat_file  et3_region_flat_file  et4_region_compress  et_nation_region_flat_file

You can see that the file et1_region_flat_file still exists. This file can still be used to load data into another similar table.

2.3 Loading Data using External Tables

External tables can also be used to load data into tables in the database. In this chapter data will be loaded into the REGION table, so you will first have to remove the existing rows from the REGION table. The method to load data from external tables into a table is similar to using the DML INSERT INTO and SELECT FROM statements. You will use two different methods to load data into the REGION table, one using an external table and the other using the external datasource file directly. Loading data into a table from any external table will have an associated log file with a default name of <table_name>.<database_name>.nzlog.

1. Before loading the data into the REGION table, delete the rows from the table using the TRUNCATE TABLE command:


LABDB(LABADMIN)=> truncate table region;

2. Check that the table is empty with a SELECT * statement:

LABDB(LABADMIN)=> select * from region;

You should see that the table contains no data.

3. You will load data into the REGION table from the ET2_REGION external table using an INSERT statement:

LABDB(LABADMIN)=> insert into region select * from et2_region;

4. Check to ensure that the table contains the four rows using the SELECT * statement:

LABDB(LABADMIN)=> select * from region;

You should see that the table now contains 4 rows.

5. Again delete the rows in the REGION table:

LABDB(LABADMIN)=> truncate table region;

6. Check to ensure that the table is empty using the SELECT * statement:

LABDB(LABADMIN)=> select * from region;

You should see that the table contains no rows.

7. You will load data into the REGION table using the ASCII delimited file that was created for external table ET1_REGION. Remember that the definition of the external table was removed from the database, but the external datasource file, et1_region_flat_file, still exists:

LABDB(LABADMIN)=> insert into region select * from external
                  '/labs/movingData/et1_region_flat_file';

8. Check to ensure that the table contains the four rows using the SELECT * statement:

LABDB(LABADMIN)=> select * from region;

You should see that the table now contains four rows.

9. Since this is a load operation there is always an associated log file, <table>.<database>.nzlog, created for each load performed. By default this log file is created in the /tmp directory. In the second session review this file:

[nz@netezza movingData]$ more /tmp/REGION.LABDB.nzlog


The log file should look similar to the following:

Load started at: 01-Jan-11 12:34:56 EDT

Database:   LABDB
Tablename:  REGION
Datafile:   /labs/movingData/et1_region_flat_file
Host:       netezza

Load Options
Field delimiter:        '|'        NULL value:               NULL
File Buffer Size (MB):  8          Load Replay Region (MB):  0
Encoding:               INTERNAL   Max errors:               1
Skip records:           0          Max rows:                 0
FillRecord:             No         Truncate String:          No
Escape Char:            None       Accept Control Chars:     No
Allow CR in string:     No         Ignore Zero:              No
Quoted data:            NO         Require Quotes:           No
BoolStyle:              1_0        Decimal Delimiter:        '.'
Date Style:             YMD        Date Delim:               '-'
Time Style:             24HOUR     Time Delim:               ':'
Time extra zeros:       No

Statistics
number of records read:      4
number of bad records:       0
-------------------------------------------------
number of records loaded:    4
Elapsed Time (sec): 3.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:59 EDT
=============================================================================
You will notice that the log file contains the Load Options and the Statistics of the load, along with environment information
to identify the table.
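If you only want the load totals, the Statistics block can be pulled out of the log with standard shell tools. A quick sketch, using the log path from above:

[nz@netezza movingData]$ grep -A 4 "Statistics" /tmp/REGION.LABDB.nzlog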

3 Loading Data using the nzload Utility

The nzload command is a SQL CLI client application that allows you to load data from the local host or a remote client, on all the supported client platforms. The nzload command processes command-line load options to send queries to the host to create an external table definition, run the insert/select query to load data and, when the load completes, drop the external table. The nzload command is a command-line program that accepts options from multiple sources, where some of the sources can be the:

• Command line

• Control file
• NZ environment variables

Without a control file, you can only do one load at a time. Using a control file allows multiple loads. The nzload command connects to a database with a user name and password, just like any other PureData System appliance client application. The user name specifies an account with a particular set of privileges, and the system uses this account to verify access.

For this section of the lab you will continue to use the LABADMIN user to load data into the LABDB database. The nzload utility will be used to load records from an external datasource file into the REGION table. Along with this the nzload log files will be reviewed to examine the nzload options. Since you will be loading data into a populated REGION table, you will use the TRUNCATE TABLE command to remove the rows from the table.

We will continue to use the two PuTTY sessions from the external table lab:

• Session One, which is connected to the nzsql console, to execute SQL commands, for example to review tables after load operations
• Session Two, which will be used for operating system commands, to execute nzload commands, view data files, etc.

3.1 Using the nzload Utility with Command Line Options

The first method for using the nzload utility to load data into the REGION table will specify options at the command line. You will only need to specify the datasource file; we will use default options for the rest. The datasource file will be the et1_region_flat_file that you created in the External Tables section. The basic syntax for this type of command is:

nzload -db <database> -u <username> -pw <password> -df <datasource filename>

1. As the LABDB database owner, LABADMIN, first remove the rows in the REGION table:

LABDB(LABADMIN)=> truncate table region;

2. Check to ensure that the rows have been removed from the table using the SELECT * statement:

LABDB(LABADMIN)=> select * from region;

The REGION table should return no rows.

3. Using the second session at the OS command line, you will use the nzload utility to load data from the etl_region_flat_file file into the REGION table using the following command line options: -db <database name>, -u <user>, -pw <password>, -t <table name>, -df <data file>, and -delimiter <string>:

[nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t
                         region -df etl_region_flat_file -delimiter '|'

Note: The filename in the VM image is etl_region_flat_file (with the letter "l" rather than the digit "1"); this is an inconsistency that will be fixed in the next iteration of the image.

Which will return the following status message:

Load session of table 'REGION' completed successfully


4. Check to ensure that the rows have been loaded into the table using the SELECT * statement:

LABDB(LABADMIN)=> select * from region;

Which will return all of the rows in the REGION table:

 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+-----------------------------
           1 | na     | north america
           4 | ap     | asia pacific
           2 | sa     | south america
           3 | emea   | europe, middle east, africa
(4 rows)

These rows were loaded from the records in the etl_region_flat_file file.

5. For every load task performed there is always an associated log file, <table>.<db>.nzlog, created. By default nzload creates this log file in the current working directory, which here is the /labs/movingData directory. In the second session review this file:

[nz@netezza movingData]$ more REGION.LABDB.nzlog


Load started at: 01-Jan-11 12:34:56 EDT

Database:   LABDB
Tablename:  REGION
Datafile:   /labs/movingData/et1_region_flat_file
Host:       netezza

Load Options
Field delimiter:        '|'        NULL value:               NULL
File Buffer Size (MB):  8          Load Replay Region (MB):  0
Encoding:               INTERNAL   Max errors:               1
Skip records:           0          Max rows:                 0
FillRecord:             No         Truncate String:          No
Escape Char:            None       Accept Control Chars:     No
Allow CR in string:     No         Ignore Zero:              No
Quoted data:            NO         Require Quotes:           No
BoolStyle:              1_0        Decimal Delimiter:        '.'
Date Style:             YMD        Date Delim:               '-'
Time Style:             24HOUR     Time Delim:               ':'
Time extra zeros:       No

Statistics
number of records read:      4
number of bad records:       0
-------------------------------------------------
number of records loaded:    4
Elapsed Time (sec): 3.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:59 EDT
=============================================================================
You will notice that the log file contains the Load Options and the Statistics of the load, along with environment information
to identify the database and table.

The -db, -u, and -pw options specify the database name, the user, and the password, respectively. Alternatively, you could omit these options if the NZ environment variables are set to the appropriate database, username and password values. Since the NZ environment variables NZ_DATABASE, NZ_USER, and NZ_PASSWORD are set to system, admin, and password, you need to use these options so the load will run against the LABDB database as the LABADMIN user.

The other options:
-t specifies the target table name in the database
-df specifies the datasource file to be loaded
-delimiter specifies the string to use as the delimiter in an ASCII delimited text file

There are other options that you can use with the nzload utility. These options were not specified here since the default values were sufficient for this load task.
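For example, a minimal sketch of the environment-variable alternative (assuming a bash shell in the second session; exporting a clear-text password is acceptable only in a lab setting like this one):

[nz@netezza movingData]$ export NZ_DATABASE=labdb
[nz@netezza movingData]$ export NZ_USER=labadmin
[nz@netezza movingData]$ export NZ_PASSWORD=password
[nz@netezza movingData]$ nzload -t region -df et1_region_flat_file -delimiter '|'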

The following command is equivalent to the nzload command we used above. It's intended to demonstrate some of the options that can be used with the nzload command but can be omitted when default values are used; it's for demonstration purposes only:

nzload -db labdb -u labadmin -pw password -t region
    -df et1_region_flat_file -delimiter '|'
    -outputDir <current directory>
    -lf <table>.<database>.nzlog -bf <table>.<database>.nzbad
    -compress false -format text
    -maxErrors 1

The -lf, -bf, and -maxErrors options are explained in the next exercise. The -compress and -format options indicate that the datasource file is an ASCII delimited text file. For a compressed binary datasource file the following options would be used: -compress true -format internal.

3.2 Using the nzload Utility with a Control File

As demonstrated in section 3.1 you can run the nzload command by specifying the command line options, or you can use another method by specifying the options in a file, which is referred to as a control file. This is useful since the file can be updated and modified over time, as loading data into a database in a data warehouse environment is a continuous operation. An nzload control file has the following basic structure:

DATAFILE <filename>
{
    [<option name> <option value>]
}

And the -cf option is used at the nzload command line to use a control file:

nzload -u <username> -pw <password> -cf <control file>
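As noted earlier, a control file is what enables multiple loads in a single nzload invocation: each DATAFILE block describes one load session. A minimal sketch, assuming you wanted to load both of this lab's data files in one run (not one of the lab steps):

DATAFILE /labs/movingData/region.del
{
    Database    labdb
    Tablename   region
    Delimiter   '|'
}
DATAFILE /labs/movingData/nation.del
{
    Database    labdb
    Tablename   nation
    Delimiter   '|'
}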


The u and pw options are optional if the NZ_USER and NZ_PASSWORD environment variables are set to the appropriate user
and password. Using the u and pw options overrides the values in the NZ environment variables.
In this session you will again load rows into an empty REGION table using the nzload utility with a control file. The control file
will set the following options: delimiter, logDir, logFile, and badFile, along with the database, and tablename. The
datasource file to be used in this session is the region.del file.

1. As the LABDB database owner, LABADMIN, first remove the rows in the REGION table:

LABDB(LABADMIN)=> truncate table region;

Check to ensure that the rows have been removed from the table using the SELECT * statement. The table should contain no rows.

2. Using the second session at the OS command line, you will create the control file to be used with the nzload utility to load data into the REGION table using the region.del data file. The control file will include the following options:

Parameter   Value
---------   -----
Database    Database name
Tablename   Table name
Delimiter   Delimiter string
LogDir      Log directory
LogFile     Log file name
BadFile     Bad record log file name

And the data file will be the region.del file instead of the et1_region_flat_file that you used in section 3.1.
We already created the control file in the lab directory. Review it in the second PuTTY session with the following command:

[nz@netezza movingData]$ more control_file

The control file looks like the following:

DATAFILE /labs/movingData/region.del
{
    Database    labdb
    Tablename   region
    Delimiter   '|'
    LogDir      /labs/movingData
    LogFile     region.log
    BadFile     region.bad
}

3. Still in the second session, you will load the data using the nzload utility with the control file, using the following command line options: -u <user>, -pw <password>, -cf <control file>:

[nz@netezza movingData]$ nzload -u labadmin -pw password -cf control_file

Which will return the following status message:

Load session of table 'REGION' completed successfully


4. Check the nzload log, which was renamed from the default to region.log and is located in the /labs/movingData directory:

[nz@netezza movingData]$ more region.log

You should see a successful load.
5. Check to ensure that the rows are in the REGION table in the first PuTTY session with the nzsql console:

LABDB(LABADMIN)=> select * from region;

You should see the added rows.


3.3 (Optional) Using nzload with Bad Records

The first two methods illustrated how to use the nzload utility to load data into an empty table using command line options or a control file. In a data warehousing environment you will most of the time incrementally add data to a table that already contains some rows.

There will be instances where records from a datasource do not match the datatypes in the table. When this occurs the load will abort when the first bad record is encountered. This is the default behaviour and is controlled by the -maxErrors option, which is set to a default value of 1.

For this exercise you will add additional rows to the NATION table. Since you will be adding rows to the NATION table there will be no need to truncate the table. The datasource file you will be using is the nation.del file, which unfortunately contains a bad record.
1. First check the NATION table by listing all of the rows in the table using the SELECT * statement in the first PuTTY session:

LABDB(LABADMIN)=> select * from nation;

Which will list all the rows in the NATION table:
 N_NATIONKEY |        N_NAME        | N_REGIONKEY |            N_COMMENT
-------------+----------------------+-------------+----------------------------------
           1 | canada               |           1 | canada
           2 | united states        |           1 | united states of america
          10 | australia            |           4 | australia
           5 | venezuela            |           2 | venezuela
           8 | united arab emirates |           3 | al imarat al arabiyah multahidah
           9 | south africa         |           3 | south africa
           3 | brazil               |           2 | brasil
          11 | japan                |           4 | nippon
          12 | macau                |           4 | aomen
          14 | new zealand          |           4 | new zealand
           4 | guyana               |           2 | guyana
           6 | united kingdom       |           3 | united kingdom
           7 | portugal             |           3 | portugal
          13 | hong kong            |           4 | xianggang
(14 rows)

2. Using the second session at the OS command line, you will use the nzload utility to load data from the nation.del file into the NATION table using the following command line options: -db <database name>, -u <user>, -pw <password>, -t <table name>, -df <data file>, and -delimiter <string>:

[nz@netezza movingData]$ nzload -db LABDB -u labadmin -pw password -t nation
                         -df nation.del -delimiter '|'

Which will return the following status message:

Error: External Table : count of bad input rows reached maxerrors limit
See /labs/movingData/NATION.LABDB.nzlog file
Error: Load Failed, records not inserted.

This is an indication that the load has failed due to a bad record in the datasource file.


3. Since the load failed, no rows were loaded into the NATION table, which you can confirm by using the SELECT * statement (in the first session):

LABDB(LABADMIN)=> select * from nation;

Which will return the rows in the NATION table:

 N_NATIONKEY |        N_NAME        | N_REGIONKEY |            N_COMMENT
-------------+----------------------+-------------+----------------------------------
           1 | canada               |           1 | canada
           2 | united states        |           1 | united states of america
          10 | australia            |           4 | australia
           5 | venezuela            |           2 | venezuela
           8 | united arab emirates |           3 | al imarat al arabiyah multahidah
           9 | south africa         |           3 | south africa
           3 | brazil               |           2 | brasil
          11 | japan                |           4 | nippon
          12 | macau                |           4 | aomen
          14 | new zealand          |           4 | new zealand
           4 | guyana               |           2 | guyana
           6 | united kingdom       |           3 | united kingdom
           7 | portugal             |           3 | portugal
          13 | hong kong            |           4 | xianggang
(14 rows)

4. In the second session you can check the log file, NATION.LABDB.nzlog, to determine the problem:

[nz@netezza movingData] more NATION.LABDB.nzlog


Load started at: 01-Jan-11 12:34:56 EDT

Database:   LABDB
Tablename:  NATION
Datafile:   /home/nz/movingData/nation.del
Host:       netezza

Load Options
Field delimiter:        '|'        NULL value:               NULL
File Buffer Size (MB):  8          Load Replay Region (MB):  0
Encoding:               INTERNAL   Max errors:               1
Skip records:           0          Max rows:                 0
FillRecord:             No         Truncate String:          No
Escape Char:            None       Accept Control Chars:     No
Allow CR in string:     No         Ignore Zero:              No
Quoted data:            NO         Require Quotes:           No
BoolStyle:              1_0        Decimal Delimiter:        '.'
Date Style:             YMD        Date Delim:               '-'
Time Style:             24HOUR     Time Delim:               ':'
Time extra zeros:       No

Found bad records
bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic, "text consumed"[last char examined]
-----------------------------------------------------------------------------------------------------------------------------
1: 10(1) [1, INT4] expected field delimiter or end of record, "2"[t]

Statistics
number of records read:      10
number of bad records:       1
-------------------------------------------------
number of records loaded:    0
Elapsed Time (sec): 1.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:57 EDT
=============================================================================

The Statistics section indicates that 10 records were read before the bad record was encountered during the load process. As expected no rows were inserted into the table, since the default is to abort the load when one bad record is encountered. The log file also provides information about the bad record:

1: 10(1) [1, INT4] expected field delimiter or end of record, "2"[t]

10(1) indicates the input record number (10) within the file and the offset (1) within the row where a problem was encountered. [1, INT4] indicates the column number (1) within the row and the data type (INT4) of the column. "2"[t] indicates the text consumed ("2") and the last character examined ([t]), which caused the problem. So putting this all together, the problem is that the value 2t was in a field for an INT4 column, which is N_NATIONKEY in the NATION table. 2t is not a valid integer, so this is why the load marked this as a bad record.

5. You can confirm that this observation is correct by examining the nation.del datasource file that was used for the load. In the second session execute the following command:

[nz@netezza movingData] more nation.del

Which will display the nation.del file with the following text:


15|andorra|2|andorra
16|ascension islan|3|ascension
17|austria|3|osterreich
18|bahamas|2|bahamas
19|barbados|2|barbados
20|belgium|3|belqique
21|chile|2|chile
22|cuba|2|cuba
23|cook islands|4|cook islands
2t|denmark|3|denmark
25|ecuador|2|ecuador
26|falkland islands|3|islas malinas
27|fiji|4|fiji
28|finland|3|suomen tasavalta
29|greenland|1|kalaallit nunaat
30|great britain|3|great britian
31|gibraltar|3|gibraltar
32|hungary|3|magyarorszag
33|iceland|3|lyoveldio island
34|ireland|3|eire
35|isle of man|3|isle of man
36|jamaica|2|jamaica
37|korea|4|han-guk
38|luxembourg|3|luxembourg
39|monaco|3|monaco
You will notice on the 10th line the following record:
2t|denmark|3|denmark
So we made the correct assumption that the 2t is causing the problem. From this list you can assume that the correct
value should be 24.

6. Alternatively, you could instead examine the nzload bad log file, NATION.LABDB.nzbad, which contains all bad records that are encountered during a load. In the second session execute the following command:

[nz@netezza movingData] more NATION.LABDB.nzbad

Which will display the NATION.LABDB.nzbad file text:

2t|denmark|3|denmark

This is the same row you identified in the nation.del file by using the log file to locate the record within the file. Since the default is to stop the load after the first bad record is processed, there is only one row. If you were to change the default behaviour to allow more bad records to be processed, this file could potentially contain more records. It provides a convenient overview of all the records that created exceptions during the load.

7. You have the option of changing the nation.del file to change 2t to 24 and then rerunning the same nzload command as in step 2. Instead you will rerun a similar load, but you will allow 10 bad records to be encountered during the load process. To change the default behaviour you need to use the command option -maxErrors. You will also change the name of the nzbad file using the -bf command option and the log filename using the -lf command option:


[nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t nation
                         -df nation.del -delimiter '|' -maxErrors 10 -bf nation.bad -lf nation.log

Which will return the following status message:

Load session of table 'NATION' completed successfully

Now the load is successful.

8. Check to ensure that the newly loaded rows are in the NATION table:

LABDB(LABADMIN)=> select * from nation;

Which will list all of the rows in the NATION table:

 N_NATIONKEY |        N_NAME        | N_REGIONKEY |            N_COMMENT
-------------+----------------------+-------------+----------------------------------
           2 | united states        |           1 | united states of america
          11 | japan                |           4 | nippon
          18 | bahamas              |           2 | bahamas
          19 | barbados             |           2 | barbados
          20 | belgium              |           3 | belqique
          25 | ecuador              |           2 | ecuador
          33 | iceland              |           3 | lyoveldio island
          34 | ireland              |           3 | eire
          39 | monaco               |           3 | monaco
           3 | brazil               |           2 | brasil
           4 | guyana               |           2 | guyana
           5 | venezuela            |           2 | venezuela
           9 | south africa         |           3 | south africa
          13 | hong kong            |           4 | xianggang
          15 | andorra              |           2 | andorra
          27 | fiji                 |           4 | fiji
          28 | finland              |           3 | suomen tasavalta
          30 | great britain        |           3 | great britian
          36 | jamaica              |           2 | jamaica
          37 | korea                |           4 | han-guk
          38 | luxembourg           |           3 | luxembourg
           6 | united kingdom       |           3 | united kingdom
           7 | portugal             |           3 | portugal
          10 | australia            |           4 | australia
          12 | macau                |           4 | aomen
          14 | new zealand          |           4 | new zealand
          26 | falkland islands     |           3 | islas malinas
          29 | greenland            |           1 | kalaallit nunaat
          31 | gibraltar            |           3 | gibraltar
          32 | hungary              |           3 | magyarorszag
           1 | canada               |           1 | canada
           8 | united arab emirates |           3 | al imarat al arabiyah multahidah
          16 | ascension islan      |           3 | ascension
          17 | austria              |           3 | osterreich
          21 | chile                |           2 | chile
          22 | cuba                 |           2 | cuba
          23 | cook islands         |           4 | cook islands
          35 | isle of man          |           3 | isle of man
(38 rows)

So now all of the new records were loaded except for the one bad row with nation key 24.


9. Even though the nzload command returned a success message, it is good practice to review the nzload log file for any problems, for example bad rows that are under the -maxErrors threshold. In the second PuTTY session execute the following command:

[nz@netezza movingData] more nation.log

The log file should be similar to the following:
Load started at: 01-Jan-11 12:34:56 EDT

Database:   LABDB
Tablename:  NATION
Datafile:   /home/nz/movingData/nation.del
Host:       netezza

Load Options
Field delimiter:        '|'        NULL value:               NULL
File Buffer Size (MB):  8          Load Replay Region (MB):  0
Encoding:               INTERNAL   Max errors:               10
Skip records:           0          Max rows:                 0
FillRecord:             No         Truncate String:          No
Escape Char:            None       Accept Control Chars:     No
Allow CR in string:     No         Ignore Zero:              No
Quoted data:            NO         Require Quotes:           No
BoolStyle:              1_0        Decimal Delimiter:        '.'
Date Style:             YMD        Date Delim:               '-'
Time Style:             24HOUR     Time Delim:               ':'
Time extra zeros:       No

Found bad records
bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic, "text consumed"[last char examined]
-----------------------------------------------------------------------------------------------------------------------------
1: 10(1) [1, INT4] expected field delimiter or end of record, "2"[t]

Statistics
number of records read:      25
number of bad records:       1
-------------------------------------------------
number of records loaded:    24
Elapsed Time (sec): 3.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:59 EDT
=============================================================================

The main difference from before is that all 25 of the data records in the data source file were processed; 24 records were loaded because there was one bad record in the data source file.

10. Now you will correct the bad row and load it into the NATION table. There are a couple of options you could use. One option is to extract the bad row from the original data source file and create a new data source file with the corrected record. However, this task could be tedious when dealing with large data source files and potentially many bad records. The other option, which is more appropriate, is to use the bad log file. All bad records that cannot be loaded into the table are placed in the bad log file. So in the second session use vi to open and edit the nation.bad file and change the 2t to 24 in the first field:

[nz@netezza movingData]$ vi nation.bad

11. The vi editor has two modes: a command mode, used to save files, quit the editor etc., and an insert mode. Initially you will be in the command mode. To change the file you need to switch into the insert mode by pressing "i". The editor will show an "-- INSERT --" indicator at the bottom of the screen.
12. You can now use the cursor keys to navigate. Change the first two chars of the bad row from 2t to 24. Your screen should look like the following:

24|denmark|3|denmark
~
~
~
~
~
~
~
~
~
~
-- INSERT --

13. We will now save our changes. Press Esc to switch back into command mode. You should see that the "-- INSERT --" string at the bottom of the screen vanishes. Enter :wq! and press enter to write the file and quit the editor without any questions.
14. After the nation.bad file has been modified to correct the record, issue an nzload command to load the modified nation.bad file:

[nz@netezza movingData]$ nzload -db labdb -u labadmin -pw password -t nation
                         -df nation.bad -delimiter '|'

Which will return the following status message:

Load session of table 'NATION' completed successfully

15. And now check that the new row has been loaded into the table in session one:

LABDB(LABADMIN)=> select * from nation;

Which will return all rows in the NATION table:


 N_NATIONKEY |        N_NAME        | N_REGIONKEY |            N_COMMENT
-------------+----------------------+-------------+----------------------------------
           1 | canada               |           1 | canada
           8 | united arab emirates |           3 | al imarat al arabiyah multahidah
          16 | ascension islan      |           3 | ascension
          17 | austria              |           3 | osterreich
          21 | chile                |           2 | chile
          22 | cuba                 |           2 | cuba
          23 | cook islands         |           4 | cook islands
          35 | isle of man          |           3 | isle of man
          24 | denmark              |           3 | denmark
           2 | united states        |           1 | united states of america
          11 | japan                |           4 | nippon
          18 | bahamas              |           2 | bahamas
          19 | barbados             |           2 | barbados
          20 | belgium              |           3 | belqique
          25 | ecuador              |           2 | ecuador
          33 | iceland              |           3 | lyoveldio island
          34 | ireland              |           3 | eire
          39 | monaco               |           3 | monaco
           3 | brazil               |           2 | brasil
           4 | guyana               |           2 | guyana
           5 | venezuela            |           2 | venezuela
           9 | south africa         |           3 | south africa
          13 | hong kong            |           4 | xianggang
          15 | andorra              |           2 | andorra
          27 | fiji                 |           4 | fiji
          28 | finland              |           3 | suomen tasavalta
          30 | great britain        |           3 | great britian
          36 | jamaica              |           2 | jamaica
          37 | korea                |           4 | han-guk
          38 | luxembourg           |           3 | luxembourg
           6 | united kingdom       |           3 | united kingdom
           7 | portugal             |           3 | portugal
          10 | australia            |           4 | australia
          12 | macau                |           4 | aomen
          14 | new zealand          |           4 | new zealand
          26 | falkland islands     |           3 | islas malinas
          29 | greenland            |           1 | kalaallit nunaat
          31 | gibraltar            |           3 | gibraltar
          32 | hungary              |           3 | magyarorszag
(39 rows)

The row for nation key 24 (denmark) is the new row that was added to the table, which was the bad record you corrected.



Backup & Restore
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1  Introduction ...............................................................3
   1.1  Objectives ............................................................3
2  Creating a QA Database .....................................................4
3  Creating the Test Database .................................................8
4  Backing up and Restoring a Database .......................................10
   4.1  Backing up the Database ..............................................10
   4.2  Verifying the Backups ................................................11
   4.3  Restoring the Database ...............................................15
   4.4  Single Table Restore .................................................18
5  Backing up User Data and Host Data ........................................20
   5.1  User Data Backup .....................................................21
   5.2  Host Data Backup .....................................................21



1 Introduction
PureData System appliances are 99.99% reliable and all internal components are redundant. Nevertheless, regular backups should be part of any data warehouse operation. The first reason for this is disaster recovery, for example in case of a fire in the data center. The second reason is to undo changes like accidental deletes.

For disaster recovery, backups should be stored in a different location than the data center that hosts the data warehouse. For most big companies this will be a backup server running Veritas NetBackup, Tivoli Storage Manager, or similar software; backing up to a file server is also possible.

1.1 Objectives

In the last labs we created our LABDB database and loaded the data into it. In this lab we will first set up a QA database that contains a subset of the tables and data of the full database. To create the tables we will use cross-database access from our QA database to the LABDB production database.

We will then use the schema-only function of nzbackup to create a test database that contains the same tables and data objects as the QA database but no data. Test data will later be added specifically for testing needs. After that we will do a multistep backup of our QA database and test the restore functionality. Testing backups by restoring them is generally a good idea and should be done during the development phase and also at regular intervals. After all, you are not fully sure what a backup contains until you restore it.

Finally we will back up the system user data and the host data. While a database backup saves all users and groups that are involved in that database, a full user backup may be needed to get the full picture, for example to archive users and groups that are not used in any database. Host data should also be backed up regularly. In case of a host failure that leaves the user data on the S-Blades intact, having a recent host backup will make the recovery of the appliance much faster and more straightforward.
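As a rough preview of the tools involved, the command shapes look like the following; a sketch only, where the /backups target directory is an assumed example and the exact invocations are walked through in the later sections:

[nz@netezza ~]$ nzbackup -db LABDBQA -dir /backups               # full database backup to a file system location
[nz@netezza ~]$ nzbackup -db LABDBQA -dir /backups -schema-only  # schema definitions only, no table data
[nz@netezza ~]$ nzrestore -db LABDBQA -dir /backups              # restore the database from that backup set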


Figure 1 LABDB database

2 Creating a QA Database
In this chapter we will create a QA database called LABDBQA, which contains a subset of the tables. It will contain the full NATION and REGION tables and the CUSTOMER table with a subset of the data. We will first create our QA database, then we will connect to it and use CTAS (CREATE TABLE AS SELECT) tables to create the table copies. We will use cross-database access to create our CTAS tables from the foreign LABDB database. This is possible since PureData System allows read-only cross-database access if fully qualified names are used.

In this lab we will switch regularly between the operating system prompt and the nzsql console. The operating system prompt will be used to execute the backup and restore commands and review the created files. The nzsql console will be used to create the tables and to review the changes made to the user data by the restore commands.

To make this easier you should open two PuTTY sessions. The first one will be used to execute the operating system commands and will be referred to as session 1 or the OS session; in the second session we will start the nzsql console. It will be referred to as session 2 or the nzsql session. You can also see which session to use from the command prompt in the screenshots.


Figure 2 The two PuTTY sessions for this lab, OS session 1 on the left, nzsql session 2 on the right
1. Open the first PuTTY session. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is the default
IP address for a local VM; the IP may be different for your bootcamp.)
2. Access the lab directory for this lab with the following command:
[nz@netezza ~]$ cd /labs/backupRestore/
3. Open the second PuTTY session. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is the
default IP address for a local VM; the IP may be different for your bootcamp.)
4. Access the lab directory for this lab with the same command as before:
[nz@netezza ~]$ cd /labs/backupRestore/
5. Start the NZSQL console with the following command: nzsql

This will connect you to the SYSTEM database with the ADMIN user. These are the default settings stored in
the environment variables of the NZ user.
6. Create our empty QA database with the following command:
SYSTEM(ADMIN)=> create database LABDBQA;
7. Connect to the QA database with the following command:
SYSTEM(ADMIN)=> \c LABDBQA

8. Create a full copy of the REGION table from the LABDB database:
LABDBQA(ADMIN)=> create table region as select * from labdb..region;
With this statement we create a local REGION table in the currently connected QA database that has the same
definition and content as the REGION table from the LABDB database. The CREATE TABLE AS statement is one of
the most flexible administrative tools for a PureData System administrator.

IBM PureData System for Analytics


Copyright IBM Corp. 2012. All rights reserved

Page 5 of 23

IBM Software
Information Management

We can easily access tables of databases we are currently not connected to, but only for read operations. We could not
insert data into a database we are not connected to.
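CTAS can also change physical properties while copying. Below is a minimal sketch; DISTRIBUTE ON is the regular CREATE TABLE AS clause for choosing a distribution key, but the table name CUSTOMER_BY_NATION and the key choice are purely illustrative:

-- copy a table cross-database and choose a new distribution key in one step
create table customer_by_nation as
  select * from labdb..customer
  distribute on (c_nationkey);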
9. Let's verify that the content has been copied over correctly. First let's look at the original data in the LABDB
database:
LABDBQA(ADMIN)=> select * from labdb..region;
You should see four rows in the result set:
LABDBQA(ADMIN)=> select * from labdb..region;
 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+------------------------------
           2 | sa     | south america
           3 | emea   | europe, middle east, africa
           1 | na     | north america
           4 | ap     | asia pacific
(4 rows)
To access a table in a foreign database we need to use its fully qualified name. Notice that we leave out the
schema name between the two dots. Schemas are not fully supported in PureData System, and since each table
name needs to be unique in a given database, the schema name can be omitted.
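For illustration, compare the two forms in our current session; the first resolves against the connected LABDBQA database, the second reads from LABDB using the fully qualified name:

select count(*) from region;         -- local REGION table in LABDBQA
select count(*) from labdb..region;  -- REGION table in LABDB, schema name omitted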
10. Now let's compare that to our local REGION table:
LABDBQA(ADMIN)=> select * from region;
You should see the same rows as before although they can be in a different order:
LABDBQA(ADMIN)=> select * from region;
 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+------------------------------
           1 | na     | north america
           3 | emea   | europe, middle east, africa
           4 | ap     | asia pacific
           2 | sa     | south america
(4 rows)
11. Now we copy over the NATION table as well:
LABDBQA(ADMIN)=> create table nation as select * from labdb..nation;
12. And finally we will copy over a subset of our CUSTOMER table; we will only use the rows from the automobile
market segment for the QA database:
LABDBQA(ADMIN)=> create table customer as select * from labdb..customer where
c_mktsegment = 'AUTOMOBILE';
You will see that this inserts almost 30000 rows into the QA customer table; this is roughly a fifth of the original table:

LABDBQA(ADMIN)=> create table customer as select * from labdb..customer where
c_mktsegment = 'AUTOMOBILE';
INSERT 0 29752
13. We will now create a view NATIONSBYREGIONS which returns a list of nation names with their corresponding
region names. This is used in a couple of applications:
LABDBQA(ADMIN)=> create view nationsbyregions as select r_name, n_name from nation,
region where r_regionkey = n_regionkey;
14. Let's have a look at what the view returns:
LABDBQA(ADMIN)=> select * from nationsbyregions;
You should get a list of all nations and their corresponding region name:
LABDBQA(ADMIN)=> select * from nationsbyregions;
 R_NAME |        N_NAME
--------+----------------------
 sa     | guyana
 emea   | united arab emirates
 ap     | macau
 sa     | brazil
 emea   | portugal
 ap     | japan
 na     | canada
 sa     | venezuela
 emea   | south africa
 ap     | hong kong
 na     | united states
 emea   | united kingdom
 ap     | australia
 ap     | new zealand
(14 rows)
Views are a very convenient way to hide SQL complexity. They can also be used to implement column-level security
by creating views of tables that only contain a subset of columns. They are fully supported by PureData System.
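As a sketch of the column-level security idea (the view name CUSTOMER_PUBLIC is hypothetical, the columns are assumed from our TPC-H-style CUSTOMER table, and the grantee is the QAUSER created later in this lab):

-- expose only non-sensitive customer columns
create view customer_public as
  select c_custkey, c_name, c_mktsegment from customer;
-- grant access to the restricted view instead of the base table
grant select on customer_public to qauser;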
15. Verify the created tables with the following command: \d
You will see that our QA database only contains the three tables we just created:

LABDBQA(ADMIN)=> \d
         List of relations
       Name       | Type  | Owner
------------------+-------+-------
 CUSTOMER         | TABLE | ADMIN
 NATION           | TABLE | ADMIN
 NATIONSBYREGIONS | VIEW  | ADMIN
 REGION           | TABLE | ADMIN
(4 rows)
16. Finally we will create a QA user and make him the owner of the database. Create the user with:
LABDBQA(ADMIN)=> create user qauser;
17. Make him the owner of the QA database:
LABDBQA(ADMIN)=> alter database labdbqa owner to qauser;
We have successfully created our QA database using cross-database CTAS statements. Our QA database contains
three tables and a view, and we have a user that is the owner of this database. In the next chapter we will use backup and
restore to create an empty copy of the QA database as the test database.

3 Creating the Test Database
In this chapter we will use schema-only backup and restore to create an empty copy of the QA database as the test database.
This will not contain any data, since the developers will fill it with test-specific data. A schema-only backup is a convenient
way to recreate databases without the contained user data.
1. Switch to the OS session and create the schema only backup of our QA database:
[nz@netezza backupRestore]$ nzbackup -schema-only -db labdbqa -dir /tmp/bkschema
To do this we need to specify three parameters: the database we want to back up, the file system location to
save the backup files to, and the -schema-only parameter to specify that user data shouldn't be backed up.
Normally backups shouldn't be saved on the host hard disks but on a remote network file server. Not only is
this essential for disaster recovery, but the host hard disks are small, optimized for speed, and not intended to
hold large amounts of data. They are strictly intended for PureData System software and operational data.
Later we will have a deeper look at the created files and the logs, but for the moment we will not go into that.
2. Now we will restore the test database from this backup:
[nz@netezza backupRestore]$ nzrestore -dir /tmp/bkschema -db labdbtest -sourcedb
labdbqa -schema-only
We can restore a database to a different database name. We simply need to specify the new name in the
-db parameter and the old name in the -sourcedb parameter.

3. In the nzsql session we will verify that we successfully created an empty copy of our database. See all available
databases with the following command: \l

LABDBTEST(ADMIN)=> \l
    List of databases
 DATABASE  |  OWNER
-----------+----------
 INZA      | ADMIN
 LABDB     | LABADMIN
 LABDBQA   | QAUSER
 LABDBTEST | QAUSER
 MASTER_DB | ADMIN
 NZA       | ADMIN
 NZM       | ADMIN
 NZR       | ADMIN
 SYSTEM    | ADMIN
(9 rows)

You can see that the LABDBTEST database was successfully created and that its privilege information has been
copied as well; the owner is QAUSER, as in the LABDBQA database.
4. We do not want the QA user to be the owner of the test database; change the owner to ADMIN for now:
LABDBTEST(ADMIN)=> alter database labdbtest owner to admin;
5. Now let's check the contents of our test database; connect to it with: \c labdbtest
6. Check if our test database contains all the objects of the QA database: \d
You will see the three tables and the view we created:
LABDBTEST(ADMIN)=> \d
         List of relations
       Name       | Type  | Owner
------------------+-------+-------
 CUSTOMER         | TABLE | ADMIN
 NATION           | TABLE | ADMIN
 NATIONSBYREGIONS | VIEW  | ADMIN
 REGION           | TABLE | ADMIN
(4 rows)

A PureData System backup saves all database objects including views, stored procedures, etc. All users,
groups, and privileges that refer to the backed-up database are saved as well.
7. Since we used the schema-only option we have not copied any data; verify this for the NATION table with the
following command:
LABDBTEST(ADMIN)=> select * from nation;
You will see an empty result set, as expected. The schema-only backup option is a convenient way to save your
database schema and to create empty copies of your database. Apart from the missing user data it creates a full
1:1 copy of the original database. You could also restore the database to a different PureData System appliance. This
would only require that the backup server location is accessible from both PureData System appliances. It could even
be a differently sized appliance, and the target appliance can have a higher version of the NPS software than
the source. It cannot be lower, though.

4 Backing up and Restoring a Database
PureData System's user data backup creates a backup of the complete database, including all database objects and
user data. Even global objects like users and privileges that are used in the database are backed up. Backup and restore
is therefore a very easy and straightforward process.
Since PureData System has no transaction log, point-in-time restore is not possible. Therefore frequent backups are
advisable. NPS supports full, differential, and cumulative backups that allow easy and fast regular data backups. An
example backup strategy would be monthly full backups, weekly cumulative backups, and daily differential backups.
Since PureData System is not intended to be used, nor has been designed, as an OLTP database, this should provide
enough backup flexibility for most situations, for example running differential backups after the daily ETL processes that
feed the warehouse.

Figure 3 A typical backup strategy
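A strategy like the one in Figure 3 maps directly onto nzbackup options. A minimal sketch, assuming a placeholder backup location of /backups/labdb:

# monthly: full backup, starts a new backup set
nzbackup -db labdb -dir /backups/labdb
# weekly: cumulative backup, all changes since the last full backup
nzbackup -db labdb -dir /backups/labdb -cumulative
# daily: differential backup, changes since the most recent backup
nzbackup -db labdb -dir /backups/labdb -differential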


In this chapter we will create a backup of our QA database. We will then do a differential backup and then a restore.
Our VMware environment has some specific restrictions that only allow the restoration of up to 2 increments. The labs
will work correctly, but don't be surprised by errors during restore operations of more than 2 increments.

4.1 Backing up the Database

PureData System's backup is organized in so-called backup sets. Every new full backup creates a new backup set. Differential
and cumulative backups are by default added to the last backup set, but they can be added to a different backup set as well (see
the sketch below). In this section we will switch between the two PuTTY sessions.
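For example, to add an increment to a specific, older backup set instead of the most recent one, the backup set ID can be given explicitly. A sketch with a placeholder ID:

nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2 -differential -backupset <backup_set_id>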

1. In the OS session execute the following command to create a full backup of the QA database:
[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2
You should get the following result:

[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2
Backup of database labdbqa to backupset 20111214173551 completed successfully.

This command will create a full user data backup of the LABDBQA database.
Each backup set has a unique ID that can later be used to access it. By default the last active backup set is used for
restore and differential backups.
In this lab we split up the backup between two file system locations. You can specify up to 16 file system
locations after the -dir parameter. Alternatively you could use a directory list file with the -dirfile
option. Splitting up the backup between different file servers will result in higher backup performance.
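A directory list file is just a text file with one location per line. As a sketch (the file name /tmp/backup_dirs.txt is a placeholder):

$ cat /tmp/backup_dirs.txt
/tmp/bk1
/tmp/bk2
$ nzbackup -db labdbqa -dirfile /tmp/backup_dirs.txt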
2. In the NZSQL session we will now add a new row to the REGION table. First connect to the QA database:
LABDBTEST(ADMIN)=> \c labdbqa
3. Now add a new entry for the north pole to the REGION table:
LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');
4. In the OS session create a differential backup:
[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2
-differential
We now create a differential backup with the -differential option. This adds a new entry to the backup set
we created previously, containing only the differences since the full backup. You can see that the backup set ID hasn't
changed.
5. In the NZSQL session add the south pole to the REGION table:
LABDBQA(ADMIN)=> insert into region values (6, 'sp', 'south pole');
You now have one full backup with the original 4 rows in the REGION table, a differential backup that additionally
contains the north pole entry, and a current state that on top of that contains the south pole region.

4.2 Verifying the Backups

In this subchapter we will have a closer look at the files and logs that are created during the PureData System Backup
process.
1. In the OS session display the backup history of your Appliance:
[nz@netezza backupRestore]$ nzbackup -history
You should get the following result:

[nz@netezza backupRestore]$ nzbackup -history
Database Backupset      Seq # OpType Status    Date                Log File
-------- -------------- ----- ------ --------- ------------------- ------------------------------
LABDBQA  20111213225029 1     SCHEMA COMPLETED 2011-12-13 17:50:29 backupsvr.10724.2011-12-13.log
LABDBQA  20111214173551 1     FULL   COMPLETED 2011-12-14 12:35:51 backupsvr.21406.2011-12-14.log
LABDBQA  20111214173551 2     DIFF   COMPLETED 2011-12-14 12:44:53 backupsvr.21594.2011-12-14.log

PureData System keeps track of all backups and saves them in the system catalog. This is used for differential
backups and it is also integrated with the Groom process. Since PureData System doesn't use transaction logs, it
needs logically deleted rows for differential backups. By default Groom doesn't remove a logically deleted row that
has not been backed up yet; therefore the Groom process is integrated with the backup history. We will explain this in
more detail in the Transaction and Groom modules.
On our machine we have done three backups: one backup set containing the schema-only backup, and two backups in
the second backup set, one full and one differential. Let's have a closer look at the log that has been generated for the
last differential backup.
2. In the OS session, switch to the log directory of the backupsvr process, which is the process responsible for
backing up data:
[nz@netezza backupRestore]$ cd /nz/kit/log/backupsvr/

The /nz/kit/log directory contains the log directories for all PureData System processes.

3. Display the end of the log for the last differential backup process. You will need to replace the XXX values with
the actual values of your log; you can cut and paste the log name from the history output above:
[nz@netezza backupsvr]$ tail backupsvr.xxxxx.xxxx-xx-xx.log

You will see the following result:


[nz@netezza backupsvr]$ tail backupsvr.21594.2011-12-14.log
2011-12-14 12:44:59.445051 EST Info: [21604] Postgres client pid: 21606, session: 19206
2011-12-14 12:45:00.461034 EST Info: Capturing deleted rows
2011-12-14 12:45:03.971731 EST Info: Backing up table REGION
2011-12-14 12:45:04.675441 EST Info: Backing up table NATION
2011-12-14 12:45:06.077822 EST Info: Backing up table CUSTOMER
2011-12-14 12:45:08.673602 EST Info: Operation committed
2011-12-14 12:45:08.673636 EST Info: Wrote 264 bytes in less than one second to location 1
2011-12-14 12:45:08.673643 EST Info: Wrote 385 bytes in less than one second to location 2
2011-12-14 12:45:08.682316 EST Info: Backup of database labdbqa to backupset 20111214173551 completed successfully.
2011-12-14 12:45:08.767215 EST Info: NZ-00023: --- program 'backupsvr' (21594) exiting on host 'netezza' ... ---

You can see that the process backed up the three tables REGION, NATION and CUSTOMER and wrote the result to
two different locations. You also see the amount of data written to these locations. Since we only added a single row
the amount of data is tiny. If you look at the log of the full backup you will see a lot more data being written.

4. Now let's have a look at the files that are created during the backup process; enter the first backup location:
[nz@netezza backupsvr]$ cd /tmp/bk1
5. And display the contents with ls
You will see the following result:
[nz@netezza bk1]$ ls
Netezza

The Netezza folder contains all backup sets for all PureData System appliances that use this backup
location. If you need to move the backup you always have to move the complete folder.
6. Enter the Netezza folder with cd Netezza and display the contents with ls :
You will see the following result:
[nz@netezza Netezza]$ ls
netezza

Under the main Netezza folder you will find subfolders for each Netezza host that is backed up to this location. In our
case we only have one Netezza host, called 'netezza'. But if your company had multiple Netezza hosts you would
find them here.
7. Enter the host folder with cd netezza and display the contents with ls :
[nz@netezza netezza]$ ls
LABDBQA

Below the host you will find all the databases of the host that have been backed up to this location, in our case the QA
database.
8. Enter the LABDBQA folder with cd LABDBQA and display the contents with ls :
[nz@netezza LABDBQA]$ ls
20111214173551

In this folder you can see all the backup sets that have been saved for this database. Each backup set corresponds to
one full backup plus an optional set of differential and cumulative backups. Note that we backed up the schema to a
different location so we only have one backup set in here.
9. Enter the backup set folder with cd <your backupset id> and display the contents with ls :

[nz@netezza 20111214173551]$ ls
1 2

Under the backup set are folders for each backup that has been added to that backup set. '1' is always the full
backup, followed by additional differential or cumulative backups. We will later use these numbers to restore our
database to a specific backup of the backup set.
10. Enter the full backup with cd 1 and display the contents with ls :

[nz@netezza 1]$ ls
FULL

As expected, it is a full backup.


11. Enter the FULL folder with cd FULL and display the contents with ls :

[nz@netezza FULL]$ ls
data  md

The data folder contains the user data; the md folder contains metadata, including the schema definition of the
database.
12. Enter the data folder with cd data and display detailed information with ll :

[nz@netezza data]$ ll
total 1120
-rw------- 1 nz nz     338 Dec 14 12:36 206208.full.2.1
-rw------- 1 nz nz     451 Dec 14 12:36 206222.full.2.1
-rw------- 1 nz nz 1132410 Dec 14 12:36 206238.full.1.1

You can see that there are three data files: two small files for the REGION and NATION tables and a big file for the
CUSTOMER table.
13. Now switch to the md folder with cd ../md and display the contents with ls :

[nz@netezza md]$ ls
contents.txt  loc1  schema.xml  stream.0.1  stream.1.1.1.1
This folder contains information about the files that contribute to the backup, and the schema definition of the database
in the schema.xml file.
14. Let's have a quick look inside the schema.xml file:
[nz@netezza md]$ more schema.xml
You should see the following result:

[nz@netezza md]$ more schema.xml
<ARCHIVE archive_major="4" archive_minor="0" product_ver="Release 6.1, Dev 2 [Build 16340]"
catalog_ver="3.976" hostname="netezza" dataslices="4" createtime="2011-12-14 17:35:57"
lowercase="f" hpfrel="4.10" model="WMware" family="vmware" platform="xs">
<OPERATION backupset="20111214173551" increment="1" predecessor="0" optype="0" dbname="LABDBQA"/>
<DATABASE name="LABDBQA" owner="QAUSER" oid="206144" delimited="f" odelim="f" charset="LATIN9"
collation="BINARY" collecthist="t">
<STATISTICS column_count="15"/>
<TABLE ver="2" name="REGION" owner="ADMIN" oid="206208" delimited="f" odelim="f" rowsecurity="f"
origoid="206208">
<COLUMN name="R_REGIONKEY" owner="" oid="206209" delimited="f" odelim="t" seq="1" type="INTEGER"
typeno="23" typemod="-1" notnull="t"/>
...

As you see, this file contains a full XML description of your database, including table definitions, views, users, etc.
15. Switch back to the lab folder with:
[nz@netezza md]$ cd /labs/backupRestore/
You should now have a pretty good understanding of the PureData System backup process. In the next subchapter
we will demonstrate the restore functionality.

4.3 Restoring the Database

In this subchapter we will restore our database to the first increment, and then we will upgrade the database to the
next increment.
PureData System allows you to return a database to a specific increment in your backup set. If you want to do an
incremental restore, the database must be locked. Tables can be queried but not changed until the database is in the
desired state and unlocked.
1. In the NZSQL session we will now drop the QA database and the QA user. First connect to the SYSTEM database:
LABDBQA(ADMIN)=> \c SYSTEM
2. Now drop the QA database:
SYSTEM(ADMIN)=> DROP DATABASE LABDBQA;
3. Now drop the QA User:
SYSTEM(ADMIN)=> DROP USER QAUSER;
4. Let's verify that the QA database really has been deleted with \l
You will see that the LABDBQA database has been removed:

SYSTEM(ADMIN)=> \l
    List of databases
 DATABASE  |  OWNER
-----------+----------
 INZA      | ADMIN
 LABDB     | LABADMIN
 LABDBTEST | ADMIN
 MASTER_DB | ADMIN
 NZA       | ADMIN
 NZM       | ADMIN
 NZR       | ADMIN
 SYSTEM    | ADMIN
(8 rows)

5. In the OS session we will now restore the database to the first increment:
[nz@netezza md]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment 1 -lockdb
true

Notice that we have specified the increment with the -increment option. In our case this is the first full backup in our
backup set.
We didn't need to specify a backup set; by default the most recent one is used. Since we are not sure to which
increment we want to restore the database, we have to lock the database with the -lockdb option. This allows only
read-only access until the desired increment has been restored.
6. In the NZSQL session verify that the database has been recreated with \l
You will see the LABDBQA database and you can also see that the owner QAUSER has been recreated and is again
the database owner:
SYSTEM(ADMIN)=> \l
    List of databases
 DATABASE  |  OWNER
-----------+----------
 INZA      | ADMIN
 LABDB     | LABADMIN
 LABDBQA   | QAUSER
 LABDBTEST | ADMIN
 MASTER_DB | ADMIN
 NZA       | ADMIN
 NZM       | ADMIN
 NZR       | ADMIN
 SYSTEM    | ADMIN
(9 rows)
7. Connect to the LABDBQA database with
SYSTEM(ADMIN)=> \c labdbqa
You will see that LABDBQA database is currently in read-only mode.

SYSTEM(ADMIN)=> \c labdbqa
NOTICE: Database 'LABDBQA' is available for read-only
You are now connected to database labdbqa.
8. Verify the contents of the REGION table from the LABDBQA database:
LABDBQA(ADMIN)=> select * from region;

You can see that we have returned the database to the state at the time of the first full backup. There is no north or
south pole entry:
LABDBQA(ADMIN)=> select * from region;
 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+------------------------------
           2 | sa     | south america
           1 | na     | north america
           3 | emea   | europe, middle east, africa
           4 | ap     | asia pacific
(4 rows)

9. Try to insert a row to verify the read-only mode:


LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');

As expected this is prohibited until we unlock the database:


LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');
ERROR: Database 'LABDBQA' is available for read-only (command ignored)

10. In the OS session we will now apply the next increment to the database:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment
next -lockdb true
You will see that we now apply the second increment to the database:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment
next -lockdb true
Restore of increment 2 from backupset 20111214173551 to database 'labdbqa'
committed.
11. Since we do not need to load any more increments, we can now unlock the database:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -unlockdb
After the database unlock we cannot apply any further increments to this database. To jump to a different increment
we would need to start from the beginning.

12. In the NZSQL session let's have a look at the REGION table again:
LABDBQA(ADMIN)=> select * from region;
You can see that we have added the north pole region which was created before the first differential backup:
LABDBQA(ADMIN)=> select * from region;
 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+------------------------------
           2 | sa     | south america
           3 | emea   | europe, middle east, africa
           4 | ap     | asia pacific
           1 | na     | north america
           5 | np     | north pole
(5 rows)

13. Verify that the database is unlocked and ready for use again by adding a new set of customers to the
CUSTOMER table. In addition to the automobile customers we want to add the machinery customers from the main
database:
LABDBQA(ADMIN)=> insert into customer select * from labdb..customer where
c_mktsegment = 'MACHINERY';

You will see that we can now use the database in a normal fashion again.
14. We had around 30000 customers before; verify that the new customer set has been added successfully:
LABDBQA(ADMIN)=> select count(*) from customer;
You will see that we now have around 60000 rows in the CUSTOMER table.
You have now done a full restore cycle for the database, applying a full and an incremental backup.
In the next chapter we will demonstrate single table restore and the ability to restore from any backup set.

4.4 Single Table Restore

In this chapter we will demonstrate the targeted restore of a subset of tables from a backup set. We will also demonstrate
how to restore from a specific older backup set.
1. First we will create a second backup set with the new customer data. In the OS session execute the following
command:
[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2

Since we didn't specify anything else, this is a full database backup. In this case PureData System automatically
creates a new backup set.
2. We want to return the CUSTOMER table to its previous condition, but we do not want to change the REGION or the
NATION tables. For this we need to know the backup set ID of the previous backup set, so execute the
history command again:
[nz@netezza backupRestore]$ nzbackup -history
We now see three different backup sets: the schema-only backup, the two-step backup set, and the new full backup set.
Remember the backup set ID of the two-step backup set.
[nz@netezza backupRestore]$ nzbackup -history
Database  Backupset      Seq # OpType Status    Date                Log File
--------- -------------- ----- ------ --------- ------------------- ------------------------------
(LABDBQA) 20111213225029 1     SCHEMA COMPLETED 2011-12-13 17:50:29 backupsvr.10724.2011-12-13.log
(LABDBQA) 20111214173551 1     FULL   COMPLETED 2011-12-14 12:35:51 backupsvr.21406.2011-12-14.log
(LABDBQA) 20111214173551 2     DIFF   COMPLETED 2011-12-14 12:44:53 backupsvr.21594.2011-12-14.log
LABDBQA   20111214205536 1     FULL   COMPLETED 2011-12-14 15:55:36 backupsvr.23621.2011-12-14.log

3. To return only the CUSTOMER table to its condition in the second backup set we can do a table-level restore with
the following command:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset
<your_backup_set_id> -tables CUSTOMER
This command will only restore the tables listed after the -tables option. If you want to restore multiple tables you can
simply write them in a list after the option (see the sketch at the end of this section).
We use the -backupset option to specify a specific backup set. Remember to replace the ID with the value you
retrieved with the history command.

Notice that the table name is case-sensitive. This is in contrast to the database name.

You will get the following error:


[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset
20111214173551 -tables CUSTOMER
Error: Specify -droptables to force drop of tables in the -tables list.

PureData System cannot restore a table that exists in the target database. You can either drop the table before
restoring it, or use the -droptables option.

4. Repeat the previous command with the added -droptables option:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset
<your_backup_set_id> -tables CUSTOMER -droptables
You will get the following result:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset
20111214173551 -tables CUSTOMER -droptables
[Restore Server] : Dropping TABLE 'CUSTOMER'
Restore of increment 1 from backupset 20111214173551 to database 'labdbqa'
committed.
Restore of increment 2 from backupset 20111214173551 to database 'labdbqa'
committed.
You can see the target table was dropped before the restore happened and the specified backup set was used. Since
we didn't stipulate a specific increment, the full backup set has been applied with all increments. Also, the table is
automatically unlocked after the restore process finishes.
5. Finally let's verify that the restore worked as expected; in the NZSQL console count the rows of the customer table
again:
LABDBQA(ADMIN)=> select count(*) from customer;
You will see that we are back to roughly 30000 rows. This means that we have reverted the most recent changes:
LABDBQA(ADMIN)=> select count(*) from customer;
 COUNT
-------
 29752
(1 row)

In this chapter you have executed a single table restore and a restore from a specific backup set.
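As mentioned above, multiple tables can be restored in one call by listing them after the -tables option. A sketch, reusing the backup set ID placeholder from this lab:

[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset
<your_backup_set_id> -tables NATION REGION -droptables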

5 Backing up User Data and Host Data
In the previous chapters you have learned to back up PureData System databases. This backs up all the database objects
that are used in the database and the user data from the S-Blades. These are the most critical components to back up in
a PureData System appliance. They will allow you to recreate your databases even if you need to switch to a
completely new appliance.
But there are two other things that should be backed up:
The global user information.
The host data

In this chapter you will back up these components, so you will be able to revert your appliance to the exact
condition it was in before the backup.

5.1 User Data Backup

Users, groups, and privileges that are not used in any database will not be captured by database backups. To be able to
revert a PureData System appliance completely to its original condition you need a backup of the global user
information as well, to capture for example administrative users that are not part of any database.
This is done with the -users option of the nzbackup command:
1. In the OS session execute the following command:
[nz@netezza backupRestore]$ nzbackup -dir /tmp/bk1 /tmp/bk2 -users
You will see the following results:
[nz@netezza backupRestore]$ nzbackup -dir /tmp/bk1 /tmp/bk2 -users
Backup of users, groups, and global permissions completed successfully.
This will create a backup of all users, groups, and privileges. Restoring it will not delete any users; instead it will only add
missing users, groups, and privileges, so it doesn't need to be fully synchronized with the user data backup. You can
even restore an older user backup without fear of destroying information.
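The counterpart restore uses the -users option of nzrestore. A sketch against our backup locations; as described above, it only adds missing users, groups, and privileges:

[nz@netezza backupRestore]$ nzrestore -dir /tmp/bk1 /tmp/bk2 -users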

5.2 Host Data Backup

Until now we have always backed up database content. This is essentially catalog and user data that can be applied to a
new PureData System appliance. PureData System also provides the functionality to back up and restore host data. This
is essentially the data in the /nz/data and /export/nz directories of the host server.
There are two reasons for regularly backing up host data. The first is a host crash. If the S-Blades of your appliance are
intact but the host file system has been destroyed, you could recreate all databases from the user backup. But in very
large systems this might take a long time. It is much easier to only restore the host information and reconnect to the
undamaged user tables on the S-Blades.
The second reason is that the host data contains configuration information, log and plan files, etc. that are not saved by
the user backup. If you, for example, changed the system configuration, that information would be lost.
Therefore it is advisable to back up host data regularly.
1. To back up the host data execute the following command in the OS session:
[nz@netezza backupRestore]$ nzhostbackup /tmp/hostbackup
This will pause your system and copy the host files into the specified file:

[nz@netezza backupRestore]$ nzhostbackup /tmp/hostbackup


Starting host backup. System state is 'online'.
Pausing the system ...
Checkpointing host catalog ...
Archiving system catalog ...
Resuming the system ...
Host backup completed successfully. System state is 'online'.
As you can see the system has been paused for the duration of the host backup but is automatically resumed after
the backup is successful.
Also notice that the host backup is done with the nzhostbackup command instead of the standard nzbackup
command.
2. Let's have a look at the created file:
[nz@netezza backupRestore]$ ll /tmp
You will see the following results:
[nz@netezza backupRestore]$ ll /tmp
total 66160
drwxrwxrwx 3 nz   nz       4096 Dec 14 12:35 bk1
drwxrwxrwx 3 nz   nz       4096 Dec 14 12:35 bk2
drwxrwxrwx 3 nz   nz       4096 Dec 13 17:50 bkschema
-rw------- 1 nz   nz   67628809 Dec 14 16:37 hostbackup
drwxrwxr-x 2 nz   nz       4096 Dec 12 14:55 inza1.1.2
drwxrwxrwx 2 root root    16384 Jan 20  2011 lost+found
srwxrwxrwx 1 nz   nz          0 Dec 12 15:04 nzaeus__nzmpirun___
-rw-rw-r-- 1 nz   nz         33 Dec 12 15:04 nzaeus__nzmpirun_____Process
-rw-rw-r-- 1 nz   nz          0 Dec 12 15:05 nzcm.lock
drwx------ 2 nz   nz       4096 Dec 12 14:46 nzcm-temp_18uEeq
drwx------ 2 nz   nz       4096 Dec 12 12:55 nzcm-temp_rvAZXR

You can see that a backup file has been created. It is a compressed file containing the system catalog and PureData
System host information. If possible, host backups should be done regularly. If, for example, an old host backup is
restored, there might exist so-called orphaned tables. These are tables that were created after the host backup
and exist on the S-Blades but are no longer registered in the system catalog. During host restore PureData
System will create a script to clean up these orphaned tables, so they do not take up any disk space.
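The corresponding restore is done with the nzhostrestore command. Shown only as a sketch; a host restore takes the system offline while the catalog is restored, so consult the system administration guide before running it on a real appliance:

[nz@netezza backupRestore]$ nzhostrestore /tmp/hostbackup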
Congratulations, you have finished the Backup & Restore lab and have had a chance to see all components of a
successful PureData System backup strategy. The one missing component is that we only used file system backups. In
a real environment you would more likely use a Veritas or TSM backup server; for further information regarding the setup
steps please refer to the system administration guide.

© Copyright IBM Corporation 2011


All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current
list of IBM trademarks is available on the Web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or service marks of others.
References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries
in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM's future direction and intent are subject to
change or withdrawal without notice, and represent goals and objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.


IBM Software
Information Management

Query Optimization
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1 Introduction
1.1 Objectives
2 Generate Statistics
3 Identifying Join Problems
4 HTML Explain

1 Introduction
PureData System uses a cost-based optimizer to determine the best method for scan and join operations, join order, and data
movement between SPUs (redistribute or broadcast operations if necessary). For example, the planner tries to avoid
redistributing large tables because of the performance impact. The optimizer can also dynamically rewrite queries to improve
query performance.
The optimizer takes a SQL query as input and creates a detailed execution or query plan for the database system. For the
optimizer to create the best execution plan that results in the best performance, it must have the most up-to-date statistics.

You can use EXPLAIN, HTML (also known as bubble), and text plans to analyze how the PureData System
executes a query.
Explain is a very useful tool to spot and identify performance problems, bad distribution keys, badly written SQL queries,
and out-of-date statistics.

1.1 Objectives

During our POC we have identified a couple of very long running customer queries that have significantly worse performance
than the number of rows involved would suggest. In this lab we will use the Explain functionality to identify the concrete
bottlenecks and, if possible, fix them to improve query performance.

2 Generate Statistics
Our first long-running customer query returns the average order price by customer segment for a given year and order priority. It
joins the customer table for the market segment and the orders table for the total price of the order. Due to its restrictive
conditions it shouldn't require too much processing time, but on our test systems it runs a very long time. In this chapter we will
use the PureData System Explain functionality to find out why this is the case.
The customer query in question:
SELECT c.c_mktsegment, AVG(o.o_totalprice)
FROM orders AS o, CUSTOMER as c
WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT'
GROUP BY c.c_mktsegment;

1. Connect to your PureData System image using PuTTY. Login to 192.168.239.2 as user nz with password nz.
(192.168.239.2 is the default IP address for a local VM; the IP may be different for your bootcamp.)
2. First we will make sure that the system is not running a different workload that could influence our tests. Use the
following nzsession command to verify that the system is free:
[nz@netezza ~]$ nzsession show
You should get a result similar to the following:

[nz@netezza ~]$ nzsession show
 ID    Type User  Start Time              PID  Database State  Priority Name Client IP Client PID Command
----- ---- ----- ----------------------- ---- -------- ------ ------------- --------- ---------- ------------------------
16023 sql  ADMIN 29-Apr-11, 09:18:13 EDT 4795 SYSTEM   active normal        127.0.0.1 4794       SELECT session_id, clien

This result shows that there is currently only one session connected to the database, which is the nzsession command itself.
By default the database user in your VMware image is ADMIN. Executing this command before doing any performance
measurements ensures that other workloads are not influencing the performance of the system. You can also use the
nzsession command to abort bad or locked sessions.
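For example, using the session ID from the output above, a bad session could be terminated like this (a sketch; aborting sessions on a shared system should obviously be done with care):

[nz@netezza ~]$ nzsession abort -id 16023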

3. After we have verified that the system is free we can start analyzing the query. Connect to the lab database with the
following command:
[nz@netezza ~]$ nzsql labdb labadmin

4. Let's first have a look at the two tables and the WHERE conditions to get an idea of the row numbers involved. Our query
joins the CUSTOMER table, without any WHERE condition applied to it, and the ORDERS table, which has two WHERE
conditions restricting it on the date and order priority. From the data distribution lab we know that the CUSTOMER table
has 150000 rows. To get the number of rows involved from the ORDERS table, execute the following COUNT(*) command:
LABDB(LABADMIN)=> SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) =
1996 AND o_orderpriority = '1-URGENT';
You should get the following results:

LABDB(LABADMIN)=> SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM o_orderdate) =
1996 AND o_orderpriority = '1-URGENT';
 COUNT
-------
 46014
(1 row)
So the ORDERS table has 46014 rows that fit the WHERE condition. We will use the EXPLAIN functionality to check if the
available statistics allow the PureData System optimizer to estimate this correctly for its plan creation.
5. The PureData System optimizer uses statistics about the data in the system to estimate the number of rows that result
from WHERE conditions, joins, etc. Wrong approximations can lead to bad execution plans; for example, a huge
result set could be broadcast for a join instead of doing a double redistribution. To see the estimated rows for the
WHERE conditions in our query, run the following EXPLAIN command:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR
FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
You will see a long output. Scroll up to your command and you should see the following:

explain verbose select count(*) from orders as o where EXTRACT(YEAR FROM o.o_orderdate) =
1996 and o.o_orderpriority = '1-URGENT';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 150, Width = 0, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR",
O.O_ORDERDATE) = 1996))
Projections:
Node 2.
[SPU Aggregate]
...

The execution plan of this query consists of two nodes or snippets. First the table is scanned and the WHERE conditions
are applied, which can be seen in the Restrictions sub-node. Since we use a COUNT(*), the Projections node is empty.
Then an aggregation node is applied to count the rows that are returned by node 1.
When we look at the estimated number of rows we can see that it is way off the mark. The PureData System optimizer
estimates from its available statistics that only 150 rows are returned by the WHERE conditions. We have seen before that
in reality it is 46014, or roughly 300 times as many.
6. One way to help the optimizer in its estimates is the collection of detailed statistics about the involved tables. Execute
the following command to generate detailed statistics on the ORDERS table:
LABDB(LABADMIN)=> generate statistics on orders;

Since generating full statistics involves a table scan, this command may take some time to execute.

7. We will now check if generating statistics has improved the estimates. Execute the EXPLAIN command again:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR
FROM o_orderdate) = 1996 AND o_orderpriority = '1-URGENT';
Scroll up to your command and you should now see the following:
explain verbose select count(*) from orders as o where EXTRACT(YEAR FROM o.o_orderdate) =
1996 and o.o_orderpriority = '1-URGENT';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 0, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR",
O.O_ORDERDATE) = 1996))
Projections:
Node 2.
[SPU Aggregate]
...

As we can see, the estimated rows of the SELECT query have improved drastically. The optimizer now assumes this WHERE
condition will apply to 3000 rows of the ORDERS table. That is still significantly off the true number of 46014, but a factor of 20
better than the original estimate of 150.
Estimates are very difficult to make. Obviously the optimizer cannot do the actual computation during planning; it relies on
current statistics about the involved columns. Statistics include min/max values, distinct values, numbers of null values, etc.
Some of these statistics are collected on the fly, but the most detailed statistics can be generated manually with the GENERATE
STATISTICS command. Generating full statistics after loading a table or changing its content significantly is one of the most
important administration tasks in PureData System. The PureData System appliance will automatically generate express
statistics after many tasks like load operations, and just-in-time statistics during planning. Nevertheless, full statistics should be
generated on a regular basis.
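GENERATE STATISTICS can also be scoped. As a sketch (both forms are regular NPS syntax; the column list shown simply matches the columns used in our WHERE clause):

LABDB(LABADMIN)=> generate express statistics on orders;
LABDB(LABADMIN)=> generate statistics on orders(o_orderdate, o_orderpriority);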

3 Identifying Join Problems
In the last chapter we took a first look at the tables involved in our join query and improved the optimizer estimates by
generating statistics on the involved tables. Now we will look at the complete execution plan, with a specific focus on the
distribution and the involved join.
In our example we have a query that doesn't finish in a reasonable amount of time. It takes much longer than you would
expect from the involved data sizes. We will now analyze why this is the case.
1. Let's analyze the execution plan for this query using the EXPLAIN VERBOSE command:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM
orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND
o.o_orderpriority = '1-URGENT' GROUP BY c.c_mktsegment;
You should see the following results (scroll up to your query):
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "CUSTOMER" as "C" {(C.C_CUSTKEY)}]
-- Estimated Rows = 150000, Width = 10, Cost = 0.0 .. 90.5, Conf = 100.0
Projections:
1:C.C_MKTSEGMENT
[SPU Broadcast]
Node 2.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 8, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) =
1996))
Projections:
1:O.O_TOTALPRICE
Node 3.
[SPU Nested Loop Stream "Node 2" with Temp "Node 1" {(O.O_ORDERKEY)}]
-- Estimated Rows = 450000007, Width = 18, Cost = 1048040.0 .. 7676127.0, Conf = 64.0
Restrictions:
't'::BOOL
Projections:
1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
Node 4.
[SPU Group {(C.C_MKTSEGMENT)}]
-- Estimated Rows = 100, Width = 18, Cost = 1048040.0 .. 7732377.0, Conf = 0.0
Projections:
1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
[SPU Distribute on {(C.C_MKTSEGMENT)}]
[SPU Merge Group]
Node 5.
[SPU Aggregate {(C.C_MKTSEGMENT)}]
-- Estimated Rows = 100, Width = 26, Cost = 1048040.0 .. 7732377.0, Conf = 0.0
Projections:
1:C.C_MKTSEGMENT 2:(SUM(O.O_TOTALPRICE) / "NUMERIC"(COUNT(O.O_TOTALPRICE)))
[SPU Return]
[Host Return]

... Removed Plan Text ...


2. First try to answer the following questions using the execution plan yourself. Take your time; we will walk through the
answers after that.
a. Which columns of table CUSTOMER are used in further computations?
b. Is table CUSTOMER redistributed, broadcast, or can it be joined locally?
c. Is table ORDERS redistributed, broadcast, or can it be joined locally?
d. In which node are the WHERE conditions applied, and how many rows does PureData System expect to fulfill the
WHERE condition?
e. What kind of join takes place, and in which node?
f. What is the number of estimated rows for the join?
g. What is the most expensive node, and why?
Hint: a stream operation in PureData System Explain is a join whose output isn't persisted on disk but streamed to further
computation nodes or snippets.
3.

So lets walk through the questions:


Node 1.
[SPU Sequential Scan table "CUSTOMER" as "C" {}]
-- Estimated Rows = 150000, Width = 10, Cost = 0.0 .. 90.5, Conf = 100.0

Projections:
1:C.C_MKTSEGMENT
[SPU Broadcast]

a. Which columns of table CUSTOMER are used in further computations?
The first node in the execution plan does a sequential scan of the CUSTOMER table on the SPUs. It estimates that 150000
rows are returned, which we know is the number of rows in the CUSTOMER table.
The statement that tells us which columns are used in further computations is the Projections: clause. We can see that
only the C_MKTSEGMENT column is carried on from the CUSTOMER table; all other columns are thrown away. Since
C_MKTSEGMENT is a CHAR(10) column, the returned result set has a width of 10.
b. Is table CUSTOMER redistributed, broadcast, or can it be joined locally?
During the scan the table is broadcast to the other SPUs. This means that a complete CUSTOMER table is assembled on the
host and broadcast to each SPU for further computation of the query. This may seem surprising at first since we have a
substantial number of rows. But since the width of the result set is only 10, we are talking about 150000 rows * 10 bytes =
1.5 MB. This is almost nothing for a warehousing system.
Node 2.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 8, Cost = 0.0 .. 578.6, Conf = 64.0

Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR",
O.O_ORDERDATE) = 1996))
Projections:
1:O.O_TOTALPRICE

c. Is table ORDERS redistributed, broadcast, or can it be joined locally?
The second node of the execution plan does a scan of the ORDERS table. One column, O_TOTALPRICE, is projected and
used in further computations. We cannot see any distribution or broadcast clauses, so this table can be joined locally. This is
possible because the CUSTOMER table is broadcast to all SPUs; if one table of a join is broadcast, the other table doesn't
need any redistribution.
d. In which node are the WHERE conditions applied, and how many rows does PureData System expect to fulfill the WHERE
condition?
We can see in the Restrictions clause that the WHERE conditions of our query are applied during the second node as well.
This should be clear, since both WHERE conditions apply to the ORDERS table and can be evaluated during its scan. As we
can see in the Estimated Rows clause, the optimizer estimates a returned set of 3000 rows, which we know is not perfectly
accurate since in reality 46014 rows are returned from this table.
Node 3.
[SPU Nested Loop Stream "Node 2" with Temp "Node 1" {(O.O_ORDERKEY)}]
-- Estimated Rows = 450000007, Width = 18, Cost = 1048040.0 .. 7676127.0, Conf = 64.0
Restrictions:
't'::BOOL
Projections:
1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE

e. What kind of join takes place, and in which node?

The third node of our execution plan contains the join between the two tables. It is a Nested Loop Join, which means that
every row of the first join set is compared to each row of the second join set. If the join condition holds true, the joined row is
added to the result set. This can be a very efficient join for small tables, but for large tables its complexity is quadratic
and therefore in general slower than, for example, a Hash Join. The Hash Join, though, cannot be used in cases of inequality
join conditions, floating point join keys, etc.
f. What is the number of estimated rows for the join?
We can see in the Estimated Rows clause that the optimizer estimates this join node to return roughly 450 million rows, which
is the number of rows from the first node times the number of rows from the second node.
g. What is the most expensive node, and why?
As we can see from the Cost clause, the optimizer estimates that the join has a cost in the range of 1048040 .. 7676127.0.
This is a roughly 2000 to 14000 times higher cost than what was expected for nodes 1 and 2. Nodes 4 and 5, which
group and aggregate the result set, do not add much cost either. So our performance problems clearly originate in the join
node 3.
So what is happening here? If we take a look at the query, we can assume that it is intended to compute the average order
price per market segment. This means we should join all customers to their corresponding order rows. But for this to happen
we would need a join condition that joins the customer table and the orders table on the customer key. Instead the query
performs a Cartesian join, joining each customer row to each orders row. This is a very work-intensive query that results in
the behavior we have seen: the joined result set becomes huge, and the query even returns results that cannot have been
intended.
4. So how do we fix this? By adding a join condition to the query that makes sure that customers are only joined to their
own orders. This additional join condition is O.O_CUSTKEY = C.C_CUSTKEY. Execute the following EXPLAIN command
for the modified query.

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM
orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND
o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY
c.c_mktsegment;
You should see the following results. Scroll up to your query to see the scan and join nodes.
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 12, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR", O.O_ORDERDATE) =
1996))
Projections:
1:O.O_TOTALPRICE 2:O.O_CUSTKEY
Cardinality:
O.O_CUSTKEY 3.0K (Adjusted)
[SPU Distribute on {(O.O_CUSTKEY)}]
[HashIt for Join]
Node 2.
[SPU Sequential Scan table "CUSTOMER" as "C" {(C.C_CUSTKEY)}]
-- Estimated Rows = 150000, Width = 14, Cost = 0.0 .. 90.5, Conf = 100.0
Projections:
1:C.C_MKTSEGMENT 2:C.C_CUSTKEY
Node 3.
[SPU Hash Join Stream "Node 2" with Temp "Node 1" {(C.C_CUSTKEY,O.O_CUSTKEY)}]
-- Estimated Rows = 150000, Width = 18, Cost = 578.6 .. 746.7, Conf = 51.2
Restrictions:
(C.C_CUSTKEY = O.O_CUSTKEY)
Projections:
1:C.C_MKTSEGMENT 2:O.O_TOTALPRICE
Cardinality:
O.O_CUSTKEY 100 (Adjusted)

As you can see there have been some changes to the execution plan. The ORDERS table is now scanned first and
distributed on the customer key. The CUSTOMER table is already distributed on the customer key, so no redistribution needs
to happen here. Both tables are then joined in node 3 through a Hash Join on the customer key.
The estimated number of rows is now 150000, the same as the number of customers. Since we have a 1:n relationship
between customers and orders this is as we would expect. Also the estimated cost of node 3 has come down significantly to
578.6 ... 746.7.
5. Let's make sure that the query performance has indeed improved. Switch on the display of elapsed query time with the
following command:

LABDB(LABADMIN)=> \time

If you want you can later switch off the elapsed time display by executing the same command again. It is a toggle.
6. Now execute our modified query:

LABDB(LABADMIN)=> SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o,
CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority =
'1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;

You should see the following results:


LABDB(LABADMIN)=> SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c
WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND
o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;
 C_MKTSEGMENT |      AVG
--------------+---------------
 HOUSEHOLD    | 150196.009267
 BUILDING     | 151275.977882
 AUTOMOBILE   | 151488.825830
 MACHINERY    | 151348.971079
 FURNITURE    | 150998.129771
(5 rows)
Elapsed time: 0m1.129s

Before we made our changes the query took so long that we couldn't wait for it to finish. After our changes the execution
time has improved to slightly more than a second. In this relatively simple case we might have been able to pinpoint the
problem by analyzing the SQL on its own. But this can be almost impossible for complicated multi-join queries that are
often used in warehousing. Reporting and BI tools tend to create very complicated portable SQL as well. In these cases
EXPLAIN can be a valuable tool to pinpoint the problem.

4 HTML Explain
In this section we will look at the HTML plangraph for the customer query that we just fixed. Besides the text descriptions of the
execution plan we used in the previous chapter, PureData System provides the ability to generate a graphical query tree as well.
This is done with the help of HTML, so plangraph files can be created and viewed in your web browser. PureData System
can be configured to save an HTML plangraph or plantext file for every executed SQL query. But in this chapter we will use the
basic EXPLAIN PLANGRAPH command and use cut and paste to export the file to your host computer.
1. Enter the query with the keyword EXPLAIN PLANGRAPH to generate the HTML plangraph:

LABDB(LABADMIN)=> EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM
orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND
o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY
c.c_mktsegment;
You will get a long print output of the HTML file content on your screen:


LABDB(LABADMIN)=> EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE
EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY
c.c_mktsegment;
NOTICE: QUERY PLAN:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Generator" content="Netezza Performance Server">
<meta http-equiv="Author" content="Babu Tammisetti <btammisetti@netezza.com>">
<style>
v\:* {behavior:url(#default#VML);}
</style>
</head>
<body lang="en-US">
<pre style="font:normal 68% verdana,arial,helvetica;background:#EEEEEE;margin-top:1em;margin-bottom:1em;marginleft:0px;padding:5pt;">
EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM
o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;
</pre>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:19pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">AGG<br/>r=100 w=26 s=2.5KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:15pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:0pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">snd,ret</p></v:textbox>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:54pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">GROUP<br/>r=100 w=18 s=1.8KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:50pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,27pt" to="270pt,62pt"/>
<v:textbox style="position:absolute;margin-left:233pt;margin-top:42pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">dst,m-grp</p></v:textbox>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:89pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">HASHJOIN<br/>r=150.0K w=18 s=2.6MB<br/>(C_CUSTKEY =
O_CUSTKEY)</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:85pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,62pt" to="270pt,100pt"/>
<v:textbox style="position:absolute;margin-left:190pt;margin-top:124pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=150.0K w=14 s=2.0MB<br/>C</p></v:textbox>
<v:oval style="position:absolute;margin-left:191pt;margin-top:120pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,97pt" to="230pt,135pt"/>
<v:textbox style="position:absolute;margin-left:270pt;margin-top:124pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">HASH<br/>r=3.0K w=12 s=35.2KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:271pt;margin-top:120pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,97pt" to="310pt,132pt"/>
<v:textbox style="position:absolute;margin-left:253pt;margin-top:112pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">dst{(O_CUSTKEY)}</p></v:textbox>
<v:textbox style="position:absolute;margin-left:270pt;margin-top:159pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=3.0K w=12 s=35.2KB<br/>O</p></v:textbox>
<v:oval style="position:absolute;margin-left:271pt;margin-top:155pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="310pt,132pt" to="310pt,170pt"/>
</body>
</html>
EXPLAIN

Next open your host computers text editor. If you workstation is windows open notepad, if you use a linux desktop use the
default text editor like KEDIT, or GEDIT. Copy the output from the explain plangraph from your putty window into notepad.
Make sure that you only copy the HTML file from the <html start tag to the </html> end tag.
2.

Save the file as explain.html on your desktop.


3. Now on your desktop double click on explain.html. On Windows make sure to open it with Internet Explorer since this
will result in the best output.

You can see a graphical representation of the query we analyzed before. The left leg of the tree is the scan node of the
CUSTOMER table C; the right leg contains a scan of the ORDERS table O and a node hashing the result set from orders in
preparation for the HASHJOIN node, which joins the result sets of the two table scans on the customer key. After the join the
result is fed into a GROUP node and an aggregation node that computes the average total price, before being returned to the
caller.

A graphical representation of the execution plan can be valuable for complicated multi-join queries to get an overview of the join structure.

Congratulations, in this lab you have used the PureData System EXPLAIN functionality to analyze a query.


Optimization Objects
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1 Introduction .....................................................................3
1.1 Objectives ......................................................................3
2 Materialized Views ...............................................................3
2.1 Wide Tables .....................................................................4
2.2 Lookup of small set of rows ....................................................6
3 Cluster Based Tables (CBT) ......................................................12
3.1 Cluster Based Table Usage .....................................................12
3.2 Cluster Based Table Maintenance ...............................................15


1 Introduction
A PureData System appliance is designed to provide excellent performance in most cases without any specific tuning or index
creation. One of the key technologies used to achieve this is zone maps: automatically computed and maintained records of
the data that is inside the extents of a database table.
In general data is loaded into data warehouses ordered by the time dimension; therefore zone maps have the biggest
performance impact on queries that restrict the time dimension as well.
This approach works well for most situations, but PureData System provides additional functionality to enhance specific
workloads, which we will use in this chapter.
We will first use materialized views to enhance the performance of database queries against wide tables and for queries that only
look up small subsets of columns.
Then we will use Cluster Based Tables to enhance the performance of queries which use multiple lookup dimensions.

1.1 Objectives

In the last couple of labs we have recreated a customer database in our PureData System system. We have picked distribution
keys, loaded the data and made some first performance investigations. In this lab we will take a deeper look at some customer
queries and try to enhance their performance by tuning the system.

Figure 1 LABDB database

2 Materialized Views
A materialized view is a view of a database table that projects a subset of the base table's columns and can be sorted on a
specific set of the projected columns. When a materialized view is created, the sorted projection of the base table's data is
stored in a materialized table on disk.
Materialized views reduce the width of data being scanned in a base table. They are beneficial for wide tables that contain many
columns (i.e. 50-500 columns) where typical queries only reference a small subset of the columns.
IBM PureData System for Analytics
Copyright IBM Corp. 20112 All rights reserved

Page 3 of 17

IBM Software
Information Management

Materialized views also provide fast, single or few record lookup operations. The thin materialized view is automatically
substituted by the optimizer for the base table, allowing faster response, particularly for shorter tactical queries that examine only
a small segment of the overall database table.
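As a minimal syntax sketch (the names are placeholders, not objects of the lab database), a materialized view is created like this; the optional ORDER BY clause controls the sort order of the stored data:

-- Placeholder names; the ORDER BY clause is optional:
CREATE MATERIALIZED VIEW <view_name> AS
SELECT <column1>, <column2> FROM <base_table>
ORDER BY <column1>;

We will create both an unsorted and a sorted materialized view in the following sections.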

2.1 Wide Tables

In our customer scenario we have a couple of queries that do some basic computations on the LINEITEM table but only touch a
small number of columns of the table.
1. Connect to your Netezza image using PuTTY. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is
the default IP address for a local VM; the IP may be different for your bootcamp.)
2. Enter nzsql and connect to LABDB as user LABADMIN.

[nz@netezza labs]$ nzsql LABDB LABADMIN

3. The first thing we need to do is to make sure table statistics have been generated, so that more accurate estimated
query costs can be reported by the EXPLAIN commands we will be looking at. Please generate statistics for the
ORDERS and LINEITEM tables using the following commands:

LABDB(LABADMIN)=> GENERATE STATISTICS ON ORDERS;
LABDB(LABADMIN)=> GENERATE STATISTICS ON LINEITEM;
4. The following query computes the total quantity of items shipped and their average tax rate for a given month, in this
case the fourth month, April. Execute the following query:

LABDB(LABADMIN)=> SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE
EXTRACT(MONTH FROM L_SHIPDATE) = 4;

Your results should look similar to the following:

LABDB(LABADMIN)=> SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH
FROM L_SHIPDATE) = 4;
     SUM     |   AVG
-------------+----------
 13136228.00 | 0.039974
(1 row)
Notice the EXTRACT(MONTH FROM L_SHIPDATE) command. The EXTRACT command can be used to retrieve parts of a
date or time column like YEAR, MONTH or DAY.
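For illustration, here is a quick sketch you can run against the same table (the output values depend on your data):

-- Returns the year, month and day parts of one L_SHIPDATE value
-- as three separate numeric columns:
LABDB(LABADMIN)=> SELECT EXTRACT(YEAR FROM L_SHIPDATE), EXTRACT(MONTH FROM L_SHIPDATE),
EXTRACT(DAY FROM L_SHIPDATE) FROM LINEITEM LIMIT 1;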
5. Now let's have a look at the cost of this query. To get the projected cost from the optimizer we use the following
EXPLAIN VERBOSE command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE
EXTRACT(MONTH FROM L_SHIPDATE) = 4;
You will see a long output on the screen. Scroll up till you reach the command you just executed. You should see something
similar to the following:


QUERY SQL:
EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH
FROM L_SHIPDATE) = 4;

QUERY VERBOSE PLAN:


Node 1.
[SPU Sequential Scan table "LINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 60012, Width = 16, Cost = 0.0 .. 2417.5, Conf = 80.0
Restrictions:
(DATE_PART('MONTH'::"VARCHAR", LINEITEM.L_SHIPDATE) = 4)
Projections:
1:LINEITEM.L_QUANTITY 2:LINEITEM.L_TAX
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 32, Cost = 2440.0 .. 2440.0, Conf = 0.0
Projections:
1:SUM(LINEITEM.L_QUANTITY)
2:(SUM(LINEITEM.L_TAX) / "NUMERIC"(COUNT(LINEITEM.L_TAX)))
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...
Notice the highlighted cost associated with the table scan. In our example it's a value of over 2400.
6. Since this query is run very frequently we want to enhance the scanning performance. And since it only uses 3 of the 16
LINEITEM columns, we have decided to create a materialized view covering these three columns. This should
significantly increase scan speed since only a small subset of the data needs to be scanned. To create the materialized
view THINLINEITEM execute the following command:

LABDB(LABADMIN)=> CREATE MATERIALIZED VIEW THINLINEITEM AS SELECT L_QUANTITY, L_TAX,
L_SHIPDATE FROM LINEITEM;

This command can take several minutes since we effectively create a copy of the three columns of the table.
7. Repeat the EXPLAIN call from step 5. Execute the following command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE
EXTRACT(MONTH FROM L_SHIPDATE) = 4;

Again scroll up till you reach your command. The results should now look like the following:


QUERY SQL:
EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH
FROM L_SHIPDATE) = 4;

QUERY VERBOSE PLAN:


Node 1.
[SPU Sequential Scan mview "_MTHINLINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 511888, Width = 16, Cost = 0.0 .. 174.1, Conf = 90.0 [MV:
MaxPages=136 TotalPages=544] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(DATE_PART('MONTH'::"VARCHAR", LINEITEM.L_SHIPDATE) = 4)
Projections:
1:LINEITEM.L_QUANTITY 2:LINEITEM.L_TAX
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 32, Cost = 366.0 .. 366.0, Conf = 0.0
Projections:
1:SUM(LINEITEM.L_QUANTITY)
2:(SUM(LINEITEM.L_TAX) / "NUMERIC"(COUNT(LINEITEM.L_TAX)))
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...

Notice that the PureData System optimizer has automatically replaced the LINEITEM table with the view THINLINEITEM.
We didn't need to make any changes to the query. Also notice that the expected cost has been reduced to 174, which is less
than 10% of the original.
As you have seen, in cases where you have wide database tables with queries only touching a subset of their columns, a
materialized view of the hot columns can significantly increase performance for these queries, without any changes to the
executed queries.

2.2 Lookup of small set of rows

Materialized views not only reduce the width of tables, they can also be used in a similar way to indexes to increase the speed of
queries that only access a very limited set of rows.
1. First we drop the view we used in the last chapter with the following command:

LABDB(LABADMIN)=> DROP VIEW THINLINEITEM;


2. The following command returns the number of returned shipments vs. total shipments for a specific shipping day.
Execute the following command:

LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
You should have a similar result to the following:


LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
 RET | TOTAL
-----+-------
 176 |  2550
(1 row)

You can see that on the 15th of June 1995 there were 176 returned shipments out of a total of 2550. Notice the use of
the CASE statement to turn the L_RETURNFLAG column into a 0-1 value, which is easily countable.
3. We will now take a look at the underlying data distribution of the LINEITEM table and its zone map values. To do this
exit the NZSQL console by executing the \q command.

4. In our demo image we have installed the PureData System support tools. You can normally find them as an installation
package in /nz on your PureData System appliances, or you can retrieve them from IBM support. One of these tools is
the nz_zonemap tool, which returns detailed information about the zone map values associated with a given database
table. First let's have a look at the zone mappable columns of the LINEITEM table. Execute the following command:

[nz@netezza ~]$ nz_zonemap LABDB LINEITEM


You should get the following result:

[nz@netezza ~]$ nz_zonemap LABDB LINEITEM

Database:     LABDB
Object Name:  LINEITEM
Object Type:  TABLE
Object ID :   243252

The zonemappable columns are:

 Column # | Column Name   | Data Type
----------+---------------+-----------
        1 | L_ORDERKEY    | INTEGER
        2 | L_PARTKEY     | INTEGER
        3 | L_SUPPKEY     | INTEGER
        4 | L_LINENUMBER  | INTEGER
       11 | L_SHIPDATE    | DATE
       12 | L_COMMITDATE  | DATE
       13 | L_RECEIPTDATE | DATE
(7 rows)
This command returns an overview of the zonemappable columns of the LINEITEM table in the LABDB database. Seven of
the sixteen columns have zone maps created for them. Zonemappable columns include integer and date data types. We
see that the L_SHIPDATE column we have in the WHERE condition of the customer query is zonemappable.
5. Now we will have a look at the zone map values for the L_SHIPDATE column. Execute the following command:

[nz@netezza ~]$ nz_zonemap LABDB LINEITEM L_SHIPDATE

This command returns a list of all extents that make up the LINEITEM table and the minimum and maximum values of the
data in the L_SHIPDATE column for each extent. Your results should look like the following:


[nz@netezza ~]$ nz_zonemap LABDB LINEITEM L_SHIPDATE

Database:     LABDB
Object Name:  LINEITEM
Object Type:  TABLE
Object ID :   243252
Data Slice:   1
Column 1:     L_SHIPDATE (DATE)

 Extent # | L_SHIPDATE (Min) | L_SHIPDATE (Max) | ORDER'ed
----------+------------------+------------------+----------
        1 | 1992-01-04       | 1998-11-29       |
        2 | 1992-01-06       | 1998-11-30       |
        3 | 1992-01-03       | 1998-11-28       |
        4 | 1992-01-02       | 1998-11-29       |
        5 | 1992-01-04       | 1998-11-29       |
        6 | 1992-01-03       | 1998-11-28       |
        7 | 1992-01-04       | 1998-11-29       |
        8 | 1992-01-04       | 1998-11-30       |
        9 | 1992-01-07       | 1998-12-01       |
       10 | 1992-01-03       | 1998-11-28       |
       11 | 1992-01-05       | 1998-11-27       |
       12 | 1992-01-03       | 1998-12-01       |
       13 | 1992-01-03       | 1998-11-30       |
       14 | 1992-01-04       | 1998-11-30       |
       15 | 1992-01-06       | 1998-11-27       |
       16 | 1992-01-03       | 1998-11-30       |
       17 | 1992-01-02       | 1998-11-29       |
       18 | 1992-01-07       | 1998-11-29       |
       19 | 1992-01-04       | 1998-11-30       |
       20 | 1992-01-04       | 1998-11-30       |
       21 | 1992-01-03       | 1998-11-30       |
       22 | 1992-01-04       | 1998-11-29       |
       23 | 1992-01-02       | 1998-11-26       |
(23 rows)
You can see that the LINEITEM table consists of 23 extents of data (3MB chunks on each dataslice). We can also see the
minimum and maximum values for the L_SHIPDATE column in each extent. These values are stored in the zone map and
automatically updated when rows are inserted, updated or deleted. If a query has a WHERE condition on the L_SHIPDATE
column that falls outside of the data range of an extent, the whole extent can be discarded by PureData System without
scanning it.
In this case the data has been distributed equally over all extents. This means that our query, which has a WHERE condition
on the 15th of June 1995, doesn't benefit from the zone maps and requires a full table scan. Not a single extent could be safely
ruled out.
6. Enter the NZSQL console again by entering the nzsql labdb labadmin command.

7. We will now create a materialized view that is ordered on the L_SHIPDATE column. Execute the following command:

LABDB(LABADMIN)=> CREATE MATERIALIZED VIEW SHIPLINEITEM AS SELECT L_SHIPDATE FROM
LINEITEM ORDER BY L_SHIPDATE;
Note that our customer query has a WHERE condition on the L_SHIPDATE column but aggregates the L_RETURNFLAG
column. Nevertheless we didn't add the L_RETURNFLAG column to the materialized view. We could have done so to
enhance the performance of our specific query even more. But in this case we assume that there are lots of customer
queries which are restricted on the ship date and access different columns of the LINEITEM table. A materialized view


retains the information about the location of a parent row in the base table and can be used for lookups even if columns of
the parent table are accessed in the SELECT clause.
You can specify more than one order column. In that case rows are ordered first by the first column; where this column has
equal values, the next column is used to order rows with the same value in column one, and so on. In general only the first order
column provides a significant impact on performance.
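For illustration only (we do not create this view in the lab; the view name and the second column are just an example), a materialized view ordered on two columns would look like this:

-- Hypothetical example of a two-column sort order:
CREATE MATERIALIZED VIEW SHIPRECEIPTLINEITEM AS
SELECT L_SHIPDATE, L_RECEIPTDATE FROM LINEITEM
ORDER BY L_SHIPDATE, L_RECEIPTDATE;

Here the stored rows are sorted by L_SHIPDATE, and L_RECEIPTDATE only breaks ties within equal ship dates, which is why the second column adds little additional pruning benefit.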
8. Let's have a look at the zone map of the newly created view. Leave the NZSQL console again with the \q command.

9. Display the zone map values of the materialized view SHIPLINEITEM with the following command:

[nz@netezza ~]$ nz_zonemap LABDB SHIPLINEITEM L_SHIPDATE
The results should look like the following:

[nz@netezza ~]$ nz_zonemap LABDB SHIPLINEITEM L_SHIPDATE

Database:     LABDB
Object Name:  SHIPLINEITEM
Object Type:  MATERIALIZED VIEW
Object ID :   252077
Data Slice:   1
Column 1:     L_SHIPDATE (DATE)

 Extent # | L_SHIPDATE (Min) | L_SHIPDATE (Max) | ORDER'ed
----------+------------------+------------------+----------
        1 | 1992-01-02       | 1993-04-11       |
        2 | 1993-04-11       | 1994-05-24       | TRUE
        3 | 1994-05-24       | 1995-07-03       | TRUE
        4 | 1995-07-03       | 1996-08-14       | TRUE
        5 | 1996-08-14       | 1997-09-24       | TRUE
        6 | 1997-09-24       | 1998-12-01       | TRUE
(6 rows)
We can make a couple of observations here. First, the materialized view is significantly smaller than the base table, since it
only contains one column. We can also see that the data values in the extents are ordered on the L_SHIPDATE column. This
means that for our query, which is accessing data from the 15th of June 1995, only extent 3 needs to be accessed at all,
since only this extent has a data range that contains this date value.
10. Now let's verify that our materialized view is indeed used for this query. Enter the NZSQL console by entering the
following command: nzsql labdb labadmin

11. Use the EXPLAIN command again to verify that our materialized view is used by the optimizer:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
You will see a long text output, scroll up till you find the command you just executed. Your result should look like the
following:


EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 24, Cost = 62.2 .. 62.2, Conf = 0.0
Projections:
1:SUM(CASE WHEN (LINEITEM.L_RETURNFLAG <> 'N'::BPCHAR) THEN 1 ELSE 0 END)
2:COUNT(*)
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...

Notice that the optimizer has automatically changed the table scan to a scan of the view SHIPLINEITEM we just created.
This is possible even though the projection is taking place on column L_RETURNFLAG of the base table.

12. In some cases you might want to disable or suspend an associated materialized view, for example for troubleshooting or
administrative tasks on the base table. In these cases use the following command to suspend the view:

LABDB(LABADMIN)=> ALTER VIEW SHIPLINEITEM MATERIALIZE SUSPEND;


13. We want to make sure that the view is not used anymore during query execution. Execute the EXPLAIN command for
our query again:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
Scroll up till you see your explain query. With the view suspended we can see that the optimizer again scans the original
table LINEITEM.

EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';

QUERY VERBOSE PLAN:


Node 1.
[SPU Sequential Scan table "LINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 60012, Width = 1, Cost = 0.0 .. 2417.5, Conf = 80.0
Restrictions:
...


14. Note that we have only suspended our view, not dropped it. We will now reactivate it with the following refresh command:

LABDB(LABADMIN)=> ALTER VIEW SHIPLINEITEM MATERIALIZE REFRESH;

This command can also be used to reorder materialized views in case the base table has been changed. While INSERTs,
UPDATEs and DELETEs on the base table are automatically reflected in associated materialized views, the view is not
reordered for every change. Therefore it is advisable to refresh materialized views periodically, especially after major changes to the base table.
15. To check that the optimizer again uses the materialized view for query execution, execute the following command:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
Make sure that the Optimizer again uses the materialized view for its first scan operation. The output should again look like
before you suspended the view.

EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
...
16. If you execute the query again you should get the same results as you got before creating the materialized view.
Execute the query again:

LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';

You should see the following output:

 RET | TOTAL
-----+-------
 176 |  2550
(1 row)

There is a defect in our VMware image which in some cases only returns the rows from one dataslice instead of all
four when a materialized view is used. This means that instead of seeing a TOTAL of 2550 you will see a total of
623 (or a similar number depending on your data distribution and which dataslice is returned). You can solve this
problem by restarting your PureData System database. The problem will also not occur on a real PureData System appliance.

You have just created a materialized view to speed up queries that look up small numbers of rows. A materialized view can
provide a significant performance improvement and is transparent to end users and applications accessing the database.
But it also creates additional overhead during INSERTs, UPDATEs and DELETEs, requires additional hard disc space, and it
may require regular maintenance.
Therefore materialized views should be used sparingly. In the next chapter we will discuss an alternative approach to speed
up scan speeds on a database table.

3 Cluster Based Tables (CBT)

We have received a set of new customer queries on the ORDERS table that do not only restrict the table by order date but also
access only orders in a given price range. These queries make up a significant part of the system workload, and we will look
into ways to increase performance for them. The following query is a template for the queries in question. It returns the
aggregated total price of all orders by order priority for a given year (in this case 1996) and price range (in this case between
150000 and 180000).

SELECT O_ORDERPRIORITY, SUM(O_TOTALPRICE) FROM ORDERS
WHERE EXTRACT(YEAR FROM O_ORDERDATE) = 1996 AND
O_TOTALPRICE > 150000 AND
O_TOTALPRICE <= 180000
GROUP BY O_ORDERPRIORITY;
In this example we have a very restrictive WHERE condition on two columns, O_ORDERDATE and O_TOTALPRICE, which can
help us to increase performance. The ORDERS table has around 220,000 rows with an order date in 1996 and 160,000 rows
in the given price range. But it only has 20,000 rows that satisfy both conditions.
Materialized views provide their main performance improvement on only one order column. Also, INSERTs to the ORDERS table
are frequent and time critical. Therefore we would prefer not to use materialized views and will in this chapter investigate the use of
cluster based tables.
Cluster based tables are PureData System tables that are created with an ORGANIZE ON keyword. They use a special space
filling algorithm to organize a table by up to 4 columns. Zone maps for a cluster based table will provide approximately the same
performance increases for all organization columns. This is useful if your query restricts a table on more than one column or if
your workload consists of multiple queries hitting the same table using different columns in WHERE conditions. In contrast to
materialized views no additional disc space is needed, since the base table itself is reordered.

3.1 Cluster Based Table Usage

Cluster based tables are created like normal PureData System database tables. They need to be flagged as a CBT during table
creation by specifying up to four organization columns. A PureData System table can be altered at any time to become a cluster
based table as well.
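As a sketch of that second option (we will not run this in the lab; the lab instead creates a new table), an existing table could be converted with an ALTER TABLE statement of the following form:

-- Sketch only; converts an existing table into a CBT:
ALTER TABLE orders ORGANIZE ON (o_orderdate, o_totalprice);

As we will see below, neither way physically reorders the data by itself; that only happens once the table is groomed.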
1. We are going to change the CREATE TABLE command for ORDERS to create a cluster based table. We will create a new
cluster based table called ORDERS_CBT. Exit the NZSQL console by executing the \q command.

2. Switch to the optimization lab directory by executing the following command: cd /labs/optimizationObjects

3. We have supplied the script for the creation of the ORDERS_CBT table, but we need to add the ORGANIZE
ON (O_ORDERDATE, O_TOTALPRICE) clause to create the table as a cluster based table organized on the
O_ORDERDATE and O_TOTALPRICE columns. To change the CREATE statement open the orders_cbt.sql script in
the vi editor with the following command:

vi orders_cbt.sql

4. Enter the insert mode by pressing i; the editor should now show an -- INSERT -- indicator in the bottom line.


5. Navigate the cursor onto the semicolon ending the statement. Press Enter to move it into a new line. Enter the line
organize on (o_orderdate, o_totalprice) before it. Your screen should now look like the following:

create table orders_cbt
(
o_orderkey integer not null ,
o_custkey integer not null ,
o_orderstatus char(1) not null ,
o_totalprice decimal(15,2) not null ,
o_orderdate date not null ,
o_orderpriority char(15) not null ,
o_clerk char(15) not null ,
o_shippriority integer not null ,
o_comment varchar(79) not null
)
distribute on (o_orderkey)
organize on (o_orderdate, o_totalprice);
~
-- INSERT --

6. Exit the insert mode by pressing Esc.

7. Enter :wq! in the command line and press Enter to save and exit without questions.

8. Create and load the orders_cbt table by executing the following script: ./create_orders_test.sh

9. This may take a couple of minutes because of our virtualized environment. You may see an error message that the table
orders_cbt does not exist. This is expected since the script first tries to clean up an existing orders_cbt table.

10. We will now have a look at how Netezza has organized the data in this table. For this we use the nz_zonemap utility
again. Execute the following command:

[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt

You will get the following result:


[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt

Database:     LABDB
Object Name:  ORDERS_CBT
Object Type:  TABLE
Object ID :   264428

The zonemappable columns are:

 Column # | Column Name    | Data Type
----------+----------------+---------------
        1 | O_ORDERKEY     | INTEGER
        2 | O_CUSTKEY      | INTEGER
        4 | O_TOTALPRICE   | NUMERIC(15,2)
        5 | O_ORDERDATE    | DATE
        8 | O_SHIPPRIORITY | INTEGER
(5 rows)
This command shows you the zone mappable columns of the ORDERS_CBT table. If you compare it with the output of the
nz_zonemap tool for the ORDERS table, you will see that it contains the additional column O_TOTALPRICE. Numeric
columns are not zone mapped by default for performance reasons, but zone maps are created for them if they are part of
the organization columns.
11. Execute the following command to see the zone map values of the O_ORDERDATE column:

[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate

You will get the following results:

[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate

Database:     LABDB
Object Name:  ORDERS_CBT
Object Type:  TABLE
Object ID :   264428
Data Slice:   1
Column 1:     O_ORDERDATE (DATE)

 Extent # | O_ORDERDATE (Min) | O_ORDERDATE (Max) | ORDER'ed
----------+-------------------+-------------------+----------
        1 | 1992-01-01        | 1998-08-02        |
        2 | 1992-01-01        | 1998-08-02        |
        3 | 1992-01-01        | 1998-08-02        |
        4 | 1992-01-01        | 1998-08-02        |
        5 | 1992-01-01        | 1998-08-02        |
        6 | 1992-01-01        | 1998-08-02        |
        7 | 1992-01-01        | 1998-08-02        |
(7 rows)
This is unexpected. Since we used O_ORDERDATE as an organization column we would have expected some kind of
order in the data values, but they are again distributed equally over all extents.
The reason for this is that the organization process takes place during a command called GROOM. Instead of creating a new
table we could also have altered the existing ORDERS table to become a cluster based table. Creating or altering a table to
become a cluster based table doesn't actually change the physical table layout until the GROOM command has been used.

This command will be covered in detail in the following presentation and lab. But we will use it in the next chapter to
reorganize the table.

3.2 Cluster Based Table Maintenance

When a table is created as a cluster based table in Netezza the data isn't actually organized at load time. Also, similar to
ordered materialized views, a cluster based table can become partially unordered due to INSERTs, UPDATEs and DELETEs. A
threshold is defined for reorganization, and the GROOM command can be used at any time to reorganize a cluster based table
based on its organization keys.

1. To organize the table you created in the last chapter you need to switch to the NZSQL console again. Execute the
following command: nzsql labdb labadmin

2. Execute the following command to groom your cluster based table:

LABDB(LABADMIN)=> groom table orders_cbt;


This command does a variety of things which will be covered in a further presentation and lab. In this case it organizes the
cluster based table based on its organization keys.
This command requires a lot of RAM on the SPUs to operate. Our VMWare systems have been tuned so the
command should be able to finish. Since the whole table is reordered it may take a couple of minutes to finish but
should you get the impression that the system is stuck please inform the lecturer.
3. Let's have a look at the data organization in the table. To do this quit the NZSQL console with the \q command.

4. Review the zone maps of the two organization columns by executing the following command:

[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate o_totalprice

Your results should look like the following (we removed the ORDER'ed columns from the results to make them more readable):

[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate o_totalprice

Database:     LABDB
Object Name:  ORDERS_CBT
Object Type:  TABLE
Object ID :   264428
Data Slice:   1
Column 1:     O_ORDERDATE (DATE)
Column 2:     O_TOTALPRICE (NUMERIC(15,2))

 Extent # | O_ORDERDATE (Min) | O_ORDERDATE (Max) | O_TOTALPRICE (Min) | O_TOTALPRICE (Max)
----------+-------------------+-------------------+--------------------+--------------------
        1 | 1992-01-01        | 1994-06-22        |             912.10 |          144450.63
        2 | 1993-08-27        | 1996-12-08        |             875.52 |          144451.22
        3 | 1996-02-13        | 1998-08-02        |             884.52 |          144446.76
        4 | 1995-04-18        | 1998-08-02        |           78002.23 |          215555.39
        5 | 1993-08-27        | 1998-08-02        |          196595.73 |          530604.44
        6 | 1992-01-01        | 1995-04-18        |          144451.94 |          296228.30
        7 | 1992-01-01        | 1993-08-27        |          196591.22 |          555285.16
(7 rows)


You can see that both columns have some form of order now. Our query restricts rows in two ranges:

Condition 1: EXTRACT(YEAR FROM O_ORDERDATE) = 1996
Condition 2: 150000 < O_TOTALPRICE <= 180000

Below we enter the minimum and maximum values of the extents in a table and add a column to mark (with an X) whether the
values contained in an extent overlap with the above conditions.

 Min(Date)  | Max(Date)  | Min(Price) | Max(Price) | Cond 1 | Cond 2 | Both Cond
------------+------------+------------+------------+--------+--------+-----------
 1992-01-01 | 1994-06-22 |     912.10 |  144450.63 |        |        |
 1993-08-27 | 1996-12-08 |     875.52 |  144451.22 |   X    |        |
 1996-02-13 | 1998-08-02 |     884.52 |  144446.76 |   X    |        |
 1995-04-18 | 1998-08-02 |   78002.23 |  215555.39 |   X    |   X    |     X
 1993-08-27 | 1998-08-02 |  196595.73 |  530604.44 |   X    |        |
 1992-01-01 | 1995-04-18 |  144451.94 |  296228.30 |        |   X    |
 1992-01-01 | 1993-08-27 |  196591.22 |  555285.16 |        |        |

As you can see there are now 4 extents that contain rows from 1996 and 2 extents that contain rows in the price range
from 150000 to 180000. But we have only one extent that contains rows satisfying both conditions and therefore needs to be
scanned during query execution.
In this scenario we probably would have been able to get similar results with one organization column or a materialized view,
but with bigger tables and more extents cluster based tables gain a performance advantage.

Congratulations, you have finished the Optimization Objects lab. In this lab you have created materialized views to speed up
scans of wide tables and queries that only look up small numbers of rows. Finally you created a cluster based table and
used the GROOM command to organize it. Throughout the lab you have used the nz_zonemap tool to inspect zone maps and get
a better idea of how data is stored in the Netezza appliance.


Groom
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1 Introduction .....................................................................3
1.1 Objectives ......................................................................3
2 Transactions ......................................................................3
2.1 Insert Transaction ..............................................................3
2.2 Update and Delete Transactions ..................................................4
2.3 Aborting Transactions ...........................................................7
2.4 Cleaning up .....................................................................8
3 Grooming Logically Deleted Rows ..................................................10
4 Performance Benefits of GROOM ....................................................12
5 Changing the Data Type of a Column ...............................................13


1 Introduction
As part of your routine database maintenance activities, you should plan to recover the disk space occupied by outdated or deleted
rows. In normal PureData System operation, an UPDATE or DELETE of a table row does not remove the physical row on the
hard disc. Instead the old row is marked as deleted together with the transaction id of the deleting transaction and, in the case of an
update, a new row is created. This approach is called multiversioning. Rows that could potentially be visible to other transactions
with an older transaction id are still accessible. Over time, however, the outdated or deleted rows are of no interest to any
transaction anymore and need to be removed to free up hard disc space and improve performance. After the rows have been
captured in a backup, you can reclaim the space they occupy using the SQL GROOM TABLE command. The GROOM TABLE
command does not lock a table while it is running; you can continue to SELECT, UPDATE, and INSERT into the table while the
table is being groomed.
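A minimal sketch of the command in its simplest form (the table name is an example from this lab's LABDB database; GROOM TABLE has further options that are beyond this introduction):

-- Reclaims space occupied by logically deleted rows:
LABDB(LABADMIN)=> GROOM TABLE region;

We will look at grooming logically deleted rows in detail later in this lab, after producing some rows to reclaim.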

1.1 Objectives

In this lab we will use the GROOM command to prepare our tables for the customer. During the course of the POC we have
deleted and updated a number of rows. At the end of a POC it is sensible to clean up the system: use GROOM on the created
tables, generate statistics, and perform other cleanup tasks.

2 Transactions
In this section we will show how transactions can leave logically deleted rows in a table, which later need to be removed with
the GROOM command as an administrative task. We will go through the different transaction types and show you what happens
under the covers in a PureData System appliance.

2.1 Insert Transaction

In this chapter we will add a new row to the REGION table and review the hidden fields that are saved in the database. As you
remember from the Transactions presentation, PureData System uses a concept called multiversioning for transactions. Each
transaction has its own image of the table and doesn't influence other transactions. This is done by adding a number of hidden
fields to each PureData System table. The most important ones are the CREATEXID and the DELETEXID. Each PureData System
transaction has a unique transaction id that increases with each new transaction.
In this subsection we will add a new row to the REGION table.
1. Connect to your Netezza image using PuTTY. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is
the default IP address for a local VM; the IP may be different for your bootcamp.)

2. Start NZSQL with: nzsql

3. Connect to the database LABDB as user LABADMIN by typing the following command:

SYSTEM(ADMIN)=> \c LABDB LABADMIN

4. Select all rows from the REGION table:

LABDB(LABADMIN)=> SELECT * FROM REGION;

You should see the following output with 4 existing regions:


LABDB(LABADMIN)=> SELECT * FROM REGION;

 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+------------------------------
           2 | sa     | south america
           4 | ap     | asia pacific
           3 | emea   | europe, middle east, africa
           1 | na     | north america
(4 rows)
5. Insert a new row into the REGION table for the region Australia with the following SQL command:

LABDB(LABADMIN)=> INSERT INTO REGION VALUES (5, 'as', 'australia');


6. Now we will again do a SELECT on the REGION table. But this time we will also query the hidden fields CREATEXID,
DELETEXID and ROWID:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID, * FROM REGION;

You should see the following results:

 CREATEXID | DELETEXID |   ROWID   | R_REGIONKEY | R_NAME |          R_COMMENT
-----------+-----------+-----------+-------------+--------+------------------------------
    365584 |         0 | 163100000 |           5 | as     | australia
    357480 |         0 | 161271001 |           1 | na     | north america
    357480 |         0 | 161271002 |           2 | sa     | south america
    357480 |         0 | 161271000 |           3 | emea   | europe, ...
    357480 |         0 | 161271003 |           4 | ap     | asia pacific
(5 rows)

As you can see we now have five rows in the REGION table. The new row for Australia has the id of the last transaction as
CREATEXID and 0 as DELETEXID since it has not yet been deleted. Other transactions with a lower transaction id that
might still be running will not be able to see this new row. Note also that each row has a unique rowid. Rowids do not need
to be consecutive but they are unique across all dataslices for one table.

2.2 Update and Delete Transactions

Delete transactions in PureData System do not physically remove rows; they update the DELETEXID field of a row to mark it as
logically deleted. These logically deleted rows need to be removed regularly with the administrative GROOM command.
Update transactions in PureData System consist of a logical delete of the old row and an insert of a new row with the updated
fields. To show this effectively we will need to change a system parameter that allows us to switch off the
invisibility lists in PureData System. Note that the parameter we will be using is dangerous and shouldn't be used in a real
PureData System environment. There is also a safer environment variable, but it has some restrictions.
1. First we will change the system variable that allows us to see deleted rows in the system.
To do this exit the console with \q.

2. Stop the PureData System database with nzstop.


3. Navigate to the system config directory with the following command:

[nz@netezza ~]$ cd /nz/data/config

4. Open the system.cfg file that contains the PureData System system configuration with vi:

[nz@netezza config]$ vi system.cfg

5. Enter the insert mode by pressing i; the editor should now show an -- INSERT -- indicator in the bottom line.

6. Navigate the cursor to the end of the last line. Press Enter to create a new line and enter the line
host.fpgaAllowXIDOverride=yes. Your screen should now look like the following.

system.enableCompressedTables=false
system.realFpga=no
system.useFpgaPrep=yes
system.enableCompressedTables=yes
system.enclosurePollInterval=0
system.envPollInterval=0
system.esmPollInterval=0
system.hbaPollInterval=0
system.diskPollInterval=0
system.enableCTA2=1
system.enableCTAColumns=1
sysmgr.coreCountWarning=1
sysmgr.coreCountFailover=1
system.emulatorMode=64
system.emulatorThreads=4
host.fpgaAllowXIDOverride=yes
~
~
-- INSERT --

7. Exit the insert mode by pressing Esc.

8. Enter :wq! in the command line and press Enter to save and exit without questions.

9. Start the system again with the nzstart command. Note that in a real PureData System system changing system
configuration parameters can be very dangerous and is normally not advisable without PureData System service
support.

10. Enter the NZSQL console again with the following command:

[nz@netezza config]$ nzsql labdb labadmin

11. Now we will update the row we inserted in the last chapter in the REGION table:

LABDB(LABADMIN)=> UPDATE REGION SET R_COMMENT='Australia' WHERE R_REGIONKEY=5;



12. Do a SELECT on the REGION table again:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID, * FROM REGION;

You should see the following output:

 CREATEXID | DELETEXID |   ROWID   | R_REGIONKEY | R_NAME |          R_COMMENT
-----------+-----------+-----------+-------------+--------+------------------------------
    357480 |         0 | 161271003 |           4 | ap     | asia pacific
    357480 |         0 | 161271000 |           3 | emea   | europe, ...
    357480 |         0 | 161271002 |           2 | sa     | south america
    365584 |    369666 | 163100000 |           5 | as     | australia
    369666 |         0 | 163100000 |           5 | as     | Australia
    357480 |         0 | 161271001 |           1 | na     | north america
(6 rows)

Normally you would now see 5 rows with the updated value. But since we disabled the invisibility lists you now see 6 rows in
the REGION table. Our transaction that updated the row had the transaction id 369666. You can see that the original row
with the lowercase australia in the comment column is still there and now has a DELETEXID field that contains the
transaction id of the transaction that deleted it. Transactions with a higher transaction id will not see a row whose DELETEXID
indicates that it was logically deleted before the transaction started.
We also see a newly inserted row with the new comment value Australia. It has the same rowid as the deleted row, and its
CREATEXID is the id of the transaction that did the insert.
13. Finally let's clean up the table again by deleting the Australia row:

LABDB(LABADMIN) => DELETE FROM REGION WHERE R_REGIONKEY=5;

14. Do a SELECT on the REGION table again:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;


You should see the following output:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;

 CREATEXID | DELETEXID |   ROWID   | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+--------+---------------
    357480 |         0 | 161271000 |           3 | emea   | europe, ...
    365584 |    369666 | 163100000 |           5 | as     | australia
    369666 |    369670 | 163100000 |           5 | as     | Australia
    357480 |         0 | 161271001 |           1 | na     | north america
    357480 |         0 | 161271003 |           4 | ap     | asia pacific
    357480 |         0 | 161271002 |           2 | sa     | south america
(6 rows)


We can now see that we have logically deleted our updated row as well. It now has a DELETEXID field with the value of the
new transaction. New transactions will see the original table from the start of this lab again. Normally the logically deleted
rows are filtered out automatically by the FPGA.
If you do a SELECT, the FPGA will remove all rows that (a conceptual sketch follows this list):
- have a CREATEXID which is bigger than the current transaction id;
- have a CREATEXID of an uncommitted transaction;
- have a DELETEXID which is smaller than the current transaction id, but only if the transaction of the DELETEXID
  field is committed;
- have a DELETEXID of 1, which means that the insert has been aborted.
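Conceptually, the effect of these rules can be pictured as a WHERE predicate. This is only an illustrative sketch: the FPGA
applies the rules in hardware, and the commit status of a transaction comes from the transaction manager, which plain SQL
cannot check. Here 369667 stands in for the id of the scanning transaction:

-- Illustrative sketch only; commit status cannot be expressed in plain SQL.
SELECT * FROM REGION
WHERE CREATEXID <= 369667          -- not created by a later transaction
  AND DELETEXID <> 1               -- not an aborted insert
  AND (DELETEXID = 0               -- never deleted, or ...
       OR DELETEXID > 369667);     -- ... deleted by a later, still-invisible transaction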

2.3 Aborting Transactions

PureData System never deletes a row during transactions, even if transactions are rolled back. In this section we will show what
happens when a transaction is rolled back. Since an update consists of a delete and an insert, we will demonstrate the
behavior for all three transaction types with this.
1. To start a transaction that we can later roll back we need to use the BEGIN keyword.

LABDB(LABADMIN)=> BEGIN;
By default all SQL statements entered into the NZSQL console are auto-committed. To start a multi-command transaction
the BEGIN keyword needs to be used. All SQL statements executed after it belong to a single transaction. To end the
transaction two keywords can be used: COMMIT to commit the transaction, or ROLLBACK to roll back the transaction
and all changes since the BEGIN statement was executed.
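For example, a multi-statement transaction that stands or falls as one unit could look like this (a generic sketch, not one of
this lab's steps; it ends with ROLLBACK so it leaves no trace even if run):

BEGIN;
UPDATE REGION SET R_COMMENT='placeholder' WHERE R_REGIONKEY=1;
DELETE FROM REGION WHERE R_REGIONKEY=2;
ROLLBACK;  -- undoes both statements; COMMIT would make them permanent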
2. Update the row for the AP region:

LABDB(LABADMIN)=> UPDATE REGION SET R_COMMENT='AP' WHERE R_REGIONKEY=4;

3. Do a SELECT on the REGION table again:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;


You should see the following output:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;

 CREATEXID | DELETEXID |  ROWID   | R_REGIONKEY | R_NAME |          R_COMMENT
-----------+-----------+----------+-------------+--------+------------------------------
      5160 |         0 | 37801002 |           2 | sa     | south america
      5160 |         0 | 37801001 |           1 | na     | north america
      5172 |      9218 | 38962000 |           5 | as     | australia
      9218 |      9222 | 38962000 |           5 | as     | Australia
      5160 |         0 | 37801000 |           3 | emea   | europe, middle east, africa
      5160 |      9226 | 37801003 |           4 | ap     | asia pacific
      9226 |         0 | 37801003 |           4 | ap     | AP
(7 rows)


Note that we have the same results as in the last chapter: the original row for the AP region was logically deleted by
updating its DELETEXID field, and a new row with the updated comment has been added with the same rowid. Note that its
CREATEXID is the same as the DELETEXID of the old row, since both were set by the same transaction.
4. Now let's roll back the transaction:

LABDB(LABADMIN)=> ROLLBACK;

5. Do a SELECT on the REGION table again:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;


You should see the following output:

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;

 CREATEXID | DELETEXID |  ROWID   | R_REGIONKEY | R_NAME |          R_COMMENT
-----------+-----------+----------+-------------+--------+------------------------------
      5160 |         0 | 37801002 |           2 | sa     | south america
      5160 |         0 | 37801000 |           3 | emea   | europe, ...
      5160 |         0 | 37801003 |           4 | ap     | asia pacific
      9226 |         1 | 37801003 |           4 | ap     | AP
      5160 |         0 | 37801001 |           1 | na     | north america
      5172 |      9218 | 38962000 |           5 | as     | australia
      9218 |      9222 | 38962000 |           5 | as     | Australia
(7 rows)

We can see that the transaction has been rolled back. The DELETEXID of the old version of the row has been reset to 0,
which means that it is a valid row that can be seen by other transactions, and the DELETEXID of the new row has been set
to 1, which marks it as aborted.

2.4 Cleaning up

In this section we will use the GROOM command to remove the logically deleted rows we have entered, and we will remove the
system parameter from the configuration file. The GROOM command will be used in more detail in the next chapter. It is the main
maintenance command in PureData System, and we have already used it in the Cluster Based Table lab to reorder a CBT. It
also removes all logically deleted rows from a table and frees up the space on the machine again.

1. Execute the GROOM command on the REGION table:

LABDB(LABADMIN)=> groom table region;


You should see the following result:

LABDB(LABADMIN)=> groom table region;


NOTICE: Groom processed 4 pages; purged 3 records; scan size unchanged; table size
unchanged.
GROOM RECORDS ALL
You can see that the GROOM command purged 3 records, exactly the number of aborted and logically deleted rows we
generated in the previous chapters.


2. Now select the rows from the REGION table again.

LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;


You should see the following result:
LABDB(LABADMIN)=> SELECT CREATEXID, DELETEXID, ROWID,* FROM REGION;
 CREATEXID | DELETEXID |   ROWID   | R_REGIONKEY | R_NAME | R_COMMENT
-----------+-----------+-----------+-------------+--------+---------------
    357480 |         0 | 161271002 |           2 | sa     | south america
    357480 |         0 | 161271000 |           3 | emea   | europe, ...
    369682 |         0 | 164100000 |           1 | na     | north america
    357480 |         0 | 161271003 |           4 | ap     | asia pacific
(4 rows)

You can see that the GROOM command has removed all logically deleted rows from the table. Remember that we still have
the parameter switched on that allows us to see logically deleted rows. Especially for tables that are heavily changed
with lots of updates and deletes, running the GROOM command will free up disk space and increase performance.
3. Finally we will remove the system parameter again. Quit the nzsql console with the \q command.

4. Stop the PureData System database with nzstop.

5. Navigate to the system config directory with the following command:

[nz@netezza ~]$ cd /nz/data/config


6. Open the system.cfg file that contains the PureData System system configuration with vi:

[nz@netezza config]$ vi system.cfg

7. Navigate the cursor to the last line and delete it by pressing d twice. Your screen should look like the following:


system.enableCompressedTables=false
system.realFpga=no
system.useFpgaPrep=yes
system.enableCompressedTables=yes
system.enclosurePollInterval=0
system.envPollInterval=0
system.esmPollInterval=0
system.hbaPollInterval=0
system.diskPollInterval=0
system.enableCTA2=1
system.enableCTAColumns=1
sysmgr.coreCountWarning=1
sysmgr.coreCountFailover=1
system.emulatorMode=64
system.emulatorThreads=4
~
~
"system.cfg" 16L, 421C

8. Enter :wq! in the command line and press Enter to save and exit without questions.

9. Start the system again with the nzstart command. We have now returned the system to its original status. Logically
deleted rows will again be hidden by the database.

3 Grooming Logically Deleted Rows


In this section we will delete rows and verify that they have not actually been removed from disk. Then, using GROOM, we will
physically delete the rows.
1. First determine the physical size on disk of the table ORDERS using the following command:

[nz@netezza ~]$ nz_db_size LABDB


You should see the following results:
[nz@netezza ~]$ nz_db_size LABDB
  Object   |   Name   |    Bytes    |    KB   |  MB |  GB | TB
-----------+----------+-------------+---------+-----+-----+----
 Appliance | netezza  | 769,785,856 | 751,744 | 734 |  .7 | .0
 Database  | LABDB    | 761,921,536 | 744,064 | 727 |  .7 | .0
 Table     | CUSTOMER |  13,631,488 |  13,312 |  13 |  .0 | .0
 Table     | LINEITEM | 588,644,352 | 574,848 | 561 |  .5 | .0
 Table     | NATION   |     524,288 |     512 |   1 |  .0 | .0
 Table     | ORDERS   |  78,118,912 |  76,288 |  75 |  .1 | .0
 Table     | PART     |  12,058,624 |  11,776 |  12 |  .0 | .0
 Table     | PARTSUPP |  67,502,080 |  65,920 |  64 |  .1 | .0
 Table     | REGION   |     393,216 |     384 |   0 |  .0 | .0
 Table     | SUPPLIER |   1,048,576 |   1,024 |   1 |  .0 | .0

Notice that the ORDERS table is 75 MB in size.


2. Now we are going to delete some rows from the ORDERS table. Delete all rows where the order status is marked as F for
finished, using the following command:


[nz@netezza labs]$ nzsql LABDB LABADMIN


LABDB(LABADMIN)=> DELETE FROM ORDERS WHERE O_ORDERSTATUS='F';
The output should be:

DELETE 729413
3. Now check the physical table size for ORDERS and see if the size decreased, using the same command as before. You
must first exit NZSQL to the shell using \q.

LABDB(LABADMIN)=> \q
[nz@netezza ~]$ nz_db_size LABDB
The output should be the same as above, showing that the ORDERS table did not change in size and is still 75 MB. This is
because the deleted rows were logically deleted but are still left on disk. The rows are still accessible to transactions that
started before the DELETE statement we just executed (i.e. transactions with a lower transaction id).
4. Next let's physically delete what we just logically deleted using the GROOM TABLE command, specifying table
ORDERS. When you run the GROOM TABLE command, it removes outdated and deleted records from tables.

[nz@netezza labs]$ nzsql LABDB LABADMIN


LABDB(LABADMIN)=> GROOM TABLE ORDERS;
The output should be:

LABDB(LABADMIN)=> GROOM TABLE ORDERS;


NOTICE: Groom processed 596 pages; purged 729413 records; scan size shrunk by 288
pages; table size shrunk by 12 extents.
GROOM RECORDS ALL
You can see that 729413 rows were removed from disk, resulting in the table size shrinking by 12 extents. Notice that this is
the same number of rows we deleted in the previous step.
5. Check if the ORDERS table size on disk has shrunk using the nz_db_size command. You must first exit NZSQL to
the shell using \q.

LABDB(LABADMIN)=> \q
[nz@netezza ~]$ nz_db_size LABDB
The output is shown below. Note the reduced size of the ORDERS table:
[nz@netezza ~]$ nz_db_size labdb
  Object   |   Name   |    Bytes    |    KB   |  MB |  GB | TB
-----------+----------+-------------+---------+-----+-----+----
 Appliance | netezza  | 430,833,664 | 420,736 | 411 |  .4 | .0
 Database  | LABDB    | 422,969,344 | 413,056 | 403 |  .4 | .0
 Table     | CUSTOMER |  13,631,488 |  13,312 |  13 |  .0 | .0
 Table     | LINEITEM | 294,256,640 | 287,360 | 281 |  .3 | .0
 Table     | NATION   |     524,288 |     512 |   1 |  .0 | .0
 Table     | ORDERS   |  40,370,176 |  39,424 |  39 |  .0 | .0
 Table     | PART     |   5,242,880 |   5,120 |   5 |  .0 | .0
 Table     | PARTSUPP |  67,502,080 |  65,920 |  64 |  .1 | .0
 Table     | REGION   |     393,216 |     384 |   0 |  .0 | .0
 Table     | SUPPLIER |   1,048,576 |   1,024 |   1 |  .0 | .0


We can see that GROOM did purge the deleted rows from disk. GROOM reported that the table size was reduced by 12 extents, and
we can confirm this because the size of the table was reduced by 36 MB, which is the correct size for 12 extents (one
extent is 3 MB, so 12 × 3 MB = 36 MB).

4 Performance Benefits of GROOM


In this section we will show that grooming a table can also result in a performance benefit, because the amount of data that
needs to be scanned is smaller. Outdated rows are still present on the hard disk; they can be discarded by the FPGA, but
the system still needs to read them from disk. In this example we need, for accounting reasons, to increase the total price of
all orders. This means that we need to update every row in the ORDERS table. We will measure query performance before and
after grooming the table.
1. Update the ORDERS table so that the price of everything is increased by $1. Do this using the following command:

[nz@netezza labs]$ nzsql LABDB LABADMIN


LABDB(LABADMIN)=> UPDATE ORDERS SET O_TOTALPRICE = O_TOTALPRICE+1;
Output:

UPDATE 770587
All rows are affected by the update, resulting in a doubled number of physical rows in the table. This is because the update
operation leaves a copy of the rows as they were before the update, in case a transaction is still operating on them. New rows
are created and the results of the UPDATE are put in these rows. The old rows that are left on disk are marked as logically
deleted.
2. To measure the performance of our test query, we can configure the NZSQL console to show the elapsed execution
time using the following command:

LABDB(LABADMIN)=> \time
Output:

Query time printout on


3. Run our given test query and note the performance:

LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;


Output:

LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;


 COUNT
--------
 770587
(1 row)
Elapsed time: 0m0.502s
4. Please rerun the query once or twice more to see roughly what a consistent query time is on your machine.

LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;


5. Now run the GROOM TABLE command on the ORDERS table again:

LABDB(LABADMIN)=> GROOM TABLE ORDERS;



The output should be:

LABDB(ADMIN)=> GROOM TABLE ORDERS;


NOTICE: Groom processed 616 pages; purged 770587 records; scan size shrunk by 308
pages; table size shrunk by 16 extents.
GROOM RECORDS ALL
Can you tell how much disk space this saved? (It's the number of extents times 3 MB; here 16 extents × 3 MB = 48 MB.)
6. Now run our chosen test query again and you should see a difference in performance:

LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;


Output:

LABDB(LABADMIN)=> SELECT COUNT(*) FROM ORDERS;


 COUNT
--------
 770587
(1 row)
Elapsed time: 0m0.315s
You should see that the query ran faster than before. This is because GROOM reduced the number of rows that must be scanned
to complete the query. The COUNT(*) command on the table returns the same number of rows before and after the GROOM
command was run, since it can only see the current version of the table, which means all rows that have not been deleted by a
lower transaction id. Since our UPDATE command hasn't changed the number of logical rows, this does not change.
Nevertheless the outdated rows, which were logically deleted by our UPDATE command, are still present on disk. The
COUNT(*) query cannot access these rows, but they take up space on disk and need to be scanned. GROOM purges
these logically deleted rows from disk, which otherwise increase disk usage and scan time. You should GROOM tables that receive
frequent updates or deletes more often than tables that are seldom updated. You might want to schedule tasks that routinely
GROOM the frequently updated tables, or run a GROOM command as part of your ETL process; a sketch of such a scheduled task follows.
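As one possible sketch (the schedule, the nzsql path, and the plain-text password here are illustrative assumptions, not part
of this lab), a crontab entry for the nz user could groom the ORDERS table every night using nzsql's -c option:

# Hypothetical crontab entry: groom ORDERS every night at 2am.
0 2 * * * /nz/kit/bin/nzsql -d LABDB -u LABADMIN -pw password -c "GROOM TABLE ORDERS;"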

5 Changing the Data Type of a Column


In some situations you will realize that the initially used data types are not suitable for long-term use, for example because new
entries exceed the range of an initially picked integer type. You cannot directly change the data type using the ALTER
statement, but there are two approaches that allow you to do it without unloading and reloading the data.
The first approach is to:
- Create a CTAS table from the old table, with a CAST to the new datatype for the column you want to change (sketched below)
- Drop the old table
- Rename the new table to the name of the old table
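A minimal sketch of this first approach, assuming we wanted to widen R_NAME from CHAR(25) to CHAR(40):

-- CTAS copy with the widened column, keeping the column order
CREATE TABLE REGION_NEW AS
SELECT R_REGIONKEY, CAST(R_NAME AS CHAR(40)) AS R_NAME, R_COMMENT
FROM REGION;
DROP TABLE REGION;
ALTER TABLE REGION_NEW RENAME TO REGION;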
In general this is a good approach because it lets you keep the order of the columns. But in this example we will use a second
approach, to highlight the GROOM command and its role during ADD and DROP COLUMN commands. Its disadvantage is that the
order of the columns will change, which may result in difficulties for third-party applications that access columns by their order.
In this chapter we will:
- Add a new column to the table with the new datatype
- Copy over all values from the old column to the new one with an UPDATE command
- Drop the old column

- Rename the new column to the name of the old one
- Use the GROOM command to materialize the results of our table changes

For our example, we have a new region to add to the REGION table whose name exceeds the limit of the CHAR(25) field
R_NAME: Australia, New Zealand, and Tasmania. We decide to increase the R_NAME field to CHAR(40).
1. Add a new column to the REGION table with name R_NAME_TEMP and data type CHAR(40):

LABDB(LABADMIN)=> ALTER TABLE REGION ADD COLUMN R_NAME_TEMP CHAR(40);


Notice that the ALTER command is practically instantaneous. This even holds true for huge tables. Under the covers the
system will create a new empty version of the table; it will not lock and change the whole table.
2. Let's insert a row into the table using the new name column:

LABDB(LABADMIN)=> INSERT INTO REGION VALUES (5,'', 'South Pacific Region',
'Australia, New Zealand, and Tasmania');
3. Now do a SELECT on the table:

LABDB(LABADMIN)=> SELECT * FROM REGION;


You should get the following results:
LABDB(LABADMIN)=> SELECT * FROM REGION;
 R_REGIONKEY | R_NAME |      R_COMMENT       |             R_NAME_TEMP
-------------+--------+----------------------+--------------------------------------
           1 | na     | north america        |
           5 |        | South Pacific Region | Australia, New Zealand, and Tasmania
           4 | ap     | asia pacific         |
           2 | sa     | south america        |
           3 | emea   | europe, ...          |
(5 rows)

You can see that the results are exactly as you would expect them to be, but how does the system actually achieve this?
Remember that inside the PureData System appliance we now have two versions of the table: one containing the old columns
and rows, and one containing the new row with the added column.
4. Let's do an EXPLAIN on the SELECT query:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;


You should get the following results:


LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;


NOTICE: QUERY PLAN:
QUERY SQL:
EXPLAIN VERBOSE SELECT * FROM REGION;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "_TV_315893_2" {("_TV_315893_2".R_REGIONKEY)}]
-- Estimated Rows = 1, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
User table: REGION version 2
Projections:
1:"_TV_315893_2".R_REGIONKEY 2:"_TV_315893_2".R_NAME
3:"_TV_315893_2".R_COMMENT 4:"_TV_315893_2".R_NAME_TEMP
Node 2.
[SPU Sub-query Scan table "*SELECT* 1" Node "1" {(0."1")}]
-- Estimated Rows = 1, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
Projections:
1:0."1" 2:0."2" 3:0."3" 4:0."4"
Node 3.
[SPU Sequential Scan table "_TV_315893_1" {("_TV_315893_1".R_REGIONKEY)}]
-- Estimated Rows = 8, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
User table: REGION version 1
Projections:
1:"_TV_315893_1".R_REGIONKEY 2:"_TV_315893_1".R_NAME
3:"_TV_315893_1".R_COMMENT 4:(NULL::BPCHAR)::CHAR(40)
Node 4.
[SPU Sub-query Scan table "*SELECT* 2" Node "3" {(0."1")}]
-- Estimated Rows = 8, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
Projections:
1:0."1" 2:0."2" 3:0."3" 4:0."4"
Node 5.
[SPU Append Nodes: , "2", "4 (stream)" {(0."1")}]
-- Estimated Rows = 9, Width = 221, Cost = 0.0 .. 0.0, Conf = 0.0
Projections:
1:0."1" 2:0."2" 3:0."3" 4:0."4"
Node 6.
[SPU Sub-query Scan table "_BV_315893" Node "5" {("_BV_315893".R_REGIONKEY)}]
-- Estimated Rows = 9, Width = 221, Cost = 0.0 .. 0.0, Conf = 100.0
Projections:
1:"_BV_315893".R_REGIONKEY 2:"_BV_315893".R_NAME 3:"_BV_315893".R_COMMENT
4:"_BV_315893".R_NAME_TEMP
[SPU Return]
[Host Return]

Normally the query would result in a single table scan node, but now we see a more complicated query plan. The optimizer
automatically translates the simple SELECT into a UNION of two tables. The two tables are internal and are called
_TV_315893_1, which is the old version of the table before the ALTER statement, and _TV_315893_2, which is the new
version of the table after the ALTER statement containing the new column R_NAME_TEMP.
Notice that in the old table a 4th column of CHAR(40) with default value NULL is added. This is necessary for the UNION to
succeed. The merging of those tables is done in Node 5, which takes both result sets and appends them.
But let's proceed with our data type change operation.
5. First let's remove the new row again:

LABDB(LABADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY > 4;


6. Now we will move all values of the R_NAME column to the R_NAME_TEMP column by updating them:

LABDB(LABADMIN)=> UPDATE REGION SET R_NAME_TEMP = R_NAME;


7. Let's have a look at the table again:


LABDB(LABADMIN)=> SELECT * FROM REGION;


You should get the following results:
LABDB(LABADMIN)=> SELECT * FROM REGION;
 R_REGIONKEY | R_NAME |          R_COMMENT           | R_NAME_TEMP
-------------+--------+------------------------------+-------------
           3 | emea   | europe, middle east, africa  | emea
           1 | na     | north america                | na
           2 | sa     | south america                | sa
           4 | ap     | asia pacific                 | ap
(4 rows)

8. Now let's remove the old column:

LABDB(LABADMIN)=> ALTER TABLE REGION DROP COLUMN R_NAME RESTRICT;


9. And rename the new column to the old name:

LABDB(LABADMIN)=> ALTER TABLE REGION RENAME COLUMN R_NAME_TEMP TO R_NAME;

10. Let's have a look at the table again:

LABDB(LABADMIN)=> SELECT * FROM REGION;


You should get the following results:
LABDB(LABADMIN)=> SELECT * FROM REGION;
 R_REGIONKEY |          R_COMMENT           | R_NAME
-------------+------------------------------+--------
           4 | asia pacific                 | ap
           3 | europe, middle east, africa  | emea
           2 | south america                | sa
           1 | north america                | na
(4 rows)

We have changed the data type of the R_NAME column. The column order has changed, but our R_NAME
column has the same values as before and now supports longer region names.
But we have one last step to do. Under the covers the system now has three different versions of the table, which are merged
for each query against the REGION table. This not only uses up space, it is also bad for query performance. So we have to
materialize these table changes with the GROOM command.
11. Groom the REGION table with the VERSIONS keyword to merge table versions:

LABDB(LABADMIN)=> GROOM TABLE REGION VERSIONS;


You should get the following results:
LABDB(LABADMIN)=> GROOM TABLE REGION VERSIONS;
NOTICE: Groom processed 8 pages; purged 5 records; scan size shrunk by 4 pages; table size shrunk by 4
extents.
GROOM VERSIONS

12. And finally we will look at the EXPLAIN output again:

LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;


You should get the following results:
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT * FROM REGION;
NOTICE: QUERY PLAN:
QUERY SQL:
EXPLAIN VERBOSE SELECT * FROM REGION;
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "REGION" {(REGION.R_REGIONKEY)}]
-- Estimated Rows = 4, Width = 73, Cost = 0.0 .. 0.0, Conf = 100.0
Projections:
1:REGION.R_REGIONKEY 2:REGION.R_COMMENT 3:REGION.R_NAME
[SPU Return]
[Host Return]

Now this is much nicer. As we would expect, we have only a single table scan snippet in the query plan and a single version
of the REGION table.
13. Finally we will return the REGION table to the old column ordering so as not to interfere with future labs. To do this we will
use a CTAS statement:

LABDB(LABADMIN)=> CREATE TABLE REGION_NEW AS SELECT R.R_REGIONKEY, R.R_NAME,
R.R_COMMENT FROM REGION R;
14. Now drop the REGION table:

LABDB(LABADMIN)=> DROP TABLE REGION;


15. And finally rename the REGION_NEW table to make the transformation complete:

LABDB(LABADMIN)=> ALTER TABLE REGION_NEW RENAME TO REGION;


If a table can be inaccessible for a short period of time, using CTAS tables can be a better solution for changing data types
than an ALTER TABLE statement.
In this lab you have looked behind the scenes of the PureData System appliance. You have seen how transactions are
implemented, and we have shown different reasons for using the GROOM command: it not only removes rows logically deleted
by UPDATE and DELETE operations as well as aborted INSERTs and loads, it also materializes table changes and reorders
cluster based tables.


© Copyright IBM Corporation 2011


All Rights Reserved.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list
of IBM trademarks is available on the Web at "Copyright and
trademark information" at ibm.com/legal/copytrade.shtml
Other company, product and service names may be trademarks or
service marks of others.
References in this publication to IBM products and services do not
imply that IBM intends to make them available in all countries in which
IBM operates.
No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.
Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM's future direction and intent are subject to
change or withdrawal without notice, and represent goals and
objectives only.
THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.


Stored Procedures
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology


Table of Contents

1    Introduction ........................................................3
1.1  Objectives ..........................................................3
2    Implementing the addCustomer stored procedure .......................3
2.1  Create Insert Stored Procedure ......................................4
2.2  Adding integrity checks .............................................8
2.3  Managing your stored procedure .....................................10
3    Implementing the checkRegions stored procedure .....................15


1 Introduction
Stored procedures are subroutines that are saved in PureData System. They are executed inside the database server and are
only available by accessing the NPS system. They combine the capabilities of SQL to query and manipulate database
information with the capabilities of procedural programming languages, like branching and iterations. This makes them an ideal
solution for tasks like data validation, writing event logs or encrypting data. They are especially suited for repetitive tasks that
can be easily encapsulated in a subroutine.

1.1 Objectives

In the last labs we created our database, loaded the data, and did some optimization and administration tasks.
In this lab we will enhance the database with a couple of stored procedures. As we mentioned in a previous chapter, PureData
System doesn't check referential or unique constraints. This is normally not critical, since data loading in a data warehousing
environment is a controlled task. In our PureData System implementation we get the requirement to allow some non-
administrative database users to add new customers to the customer table. This happens rarely, so there are no performance
requirements, and we have decided to implement this with a stored procedure that is accessible to these users and checks the
input values and referential constraints.
In a second part we will implement a business logic function as a stored procedure returning a result set: a procedure that
encapsulates a regularly executed sanity check on the rows of the REGION table.

Figure 1: LABDB database

2 Implementing the addCustomer stored procedure


In this chapter we will create the stored procedure to insert data into the customer table. The information that is added for a new
customer will be the customer key, name, phone number and nation; the rest of the information is updated through other
processes.

2.1 Create Insert Stored Procedure

First we will review the customer table and define the interface of the insert stored procedure.
1. Connect to your Netezza image using PuTTY. Log in to 192.168.239.2 as user nz with password nz. (192.168.239.2 is
the default IP address for a local VM; the IP may be different for your bootcamp.)

2. Access the lab directory for this lab with the following command; this folder already contains empty files for the stored
procedure scripts we will later create. If you want, review them with the ls command:

[nz@netezza ~]$ cd /labs/storedProcedure/

3. Enter NZSQL and connect to LABDB as user LABADMIN.

[nz@netezza labs]$ nzsql LABDB LABADMIN

4. Describe the customer table with the following command: \d customer

You should see the following:

LABDB(ADMIN)=> \d customer
              Table "CUSTOMER"
  Attribute   |          Type          | Modifier | Default Value
--------------+------------------------+----------+---------------
 C_CUSTKEY    | INTEGER                | NOT NULL |
 C_NAME       | CHARACTER VARYING(25)  | NOT NULL |
 C_ADDRESS    | CHARACTER VARYING(40)  | NOT NULL |
 C_NATIONKEY  | INTEGER                | NOT NULL |
 C_PHONE      | CHARACTER(15)          | NOT NULL |
 C_ACCTBAL    | NUMERIC(15,2)          | NOT NULL |
 C_MKTSEGMENT | CHARACTER(10)          | NOT NULL |
 C_COMMENT    | CHARACTER VARYING(117) | NOT NULL |
Distributed on hash: "C_CUSTKEY"
We will now create a stored procedure that adds a new customer entry and sets the four fields C_CUSTKEY, C_NAME,
C_NATIONKEY, and C_PHONE; all other fields will be set to an empty value or 0, since they are flagged as NOT
NULL.
5. To create a stored procedure we will use the internal vi editor of the nzsql console. Open the already existing empty file
addCustomer.sql with the following command (note you can tab out the filename):

LABDB(ADMIN)=> \e addCustomer.sql
6. You are now in the familiar VI interface and you can edit the file. Switch to INSERT mode by pressing i

7. We will now create the interface of the stored procedure so we can test creating it. We need the 4 input fields mentioned
above and will return an integer return code. Enter the text as seen in the following, then exit the insert mode by
pressing ESC, enter :wq! and press Enter to save the file and quit vi.


CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))


LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
END_PROC;
~
~
~
~
~
~
:wq!

The minimal stored procedure we create here doesn't do anything yet, since it has an empty body. We simply create the
signature with the input and output variables. We use the command CREATE OR REPLACE so we can later execute the
same command multiple times to update the stored procedure with more code.
The input variables cannot be given names, so we only add the datatypes for our input parameters: key, name, nation and
phone. We also return an integer return code.
Note that we have to specify the procedure language even though NZPLSQL is the only available option in PureData
System.
8. Back in the nzsql command line, execute the script we just created with \i addCustomer.sql

You should see that the procedure has been created successfully:
LABDB(ADMIN)=> \i addCustomer.sql
CREATE PROCEDURE
LABDB(ADMIN)=>

9. Display all stored procedures in the LABDB database with the following command:

LABDB(LABADMIN)=> SHOW PROCEDURE;

You will see the following result:

LABDB(ADMIN)=> show procedure;
 RESULT  |  PROCEDURE  | BUILTIN |                            ARGUMENTS
---------+-------------+---------+-------------------------------------------------------------------
 INTEGER | ADDCUSTOMER | f       | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15))
(1 row)

You can see the procedure ADDCUSTOMER with the arguments we specified.
10. Execute the stored procedure with the following dummy input parameters:

LABDB(LABADMIN)=> call addcustomer(1,'test', 2, 'test');


You should see the following:


LABDB(LABADMIN)=> call addcustomer(1,'test', 2, 'test');
NOTICE: plpgsql: ERROR during compile of ADDCUSTOMER near line 1
ERROR: syntax error, unexpected <EOF>, expecting BEGIN at or near ""
The result shows that we have a syntax error in our stored procedure. Every stored procedure needs at least one BEGIN ..
END block that encapsulates the code that is to be executed. Stored procedures are compiled when they are first executed,
not when they are created; therefore errors in the code can only be seen during execution.
11. Switch back to the VI view with the following command:

LABDB(LABADMIN)=> \e addCustomer.sql
12. Switch to insert mode by pressing i
13. We will now create a simple stored procedure that inserts the new entry into the customer table. But first we will add
some variables that alias the input variables $1, $2, etc. After the BEGIN_PROC statement enter the following lines:

DECLARE
C_KEY ALIAS FOR $1;
C_NAME ALIAS FOR $2;
N_KEY ALIAS FOR $3;
PHONE ALIAS FOR $4;
Each BEGIN..END block in the stored procedure can have its own DECLARE section. Variables are valid in the block they
belong to. It is good practice to map the input parameters to readable variable names to keep the stored
procedure code maintainable. We will later add some additional parameters to our procedures as well.
Be careful not to use variable names that are restricted by PureData System, like for example NAME.
14. Next we will add the BEGIN..END block with the INSERT statement.

BEGIN
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
This statement will add a new row to the customer table using the input variables. It fills the remaining fields, like
account balance, with default values that can be filled in later. It is also possible to execute dynamic SQL queries, which we will
do in a later chapter; a small sketch follows.
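As a preview (a hypothetical sketch, not part of the procedure we are building here), a statement can be assembled as a
string at runtime and run with EXECUTE IMMEDIATE; tab_name below is an assumed variable holding a table name:

-- Hypothetical sketch of dynamic SQL inside a procedure body
EXECUTE IMMEDIATE 'INSERT INTO ' || tab_name ||
   ' VALUES (' || C_KEY || ', ''unknown'')';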
Your complete stored procedure should now look like the following:


CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))


LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
C_KEY ALIAS for $1;
C_NAME ALIAS for $2;
N_KEY ALIAS for $3;
PHONE ALIAS for $4;
BEGIN
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
END_PROC;
15. Save and exit VI again by pressing ESC to enter the command mode, entering :wq! and pressing Enter. This will
bring you back to the nzsql console.
16. Execute the stored procedure script with the following command: \i addCustomer.sql
17. Now let's try our stored procedure. Let's add a new customer John Smith with customer key 999999, phone number
555-5555 and nation 2 (which is the key for the United States in our nation table). You can also check first that the customer
doesn't exist yet if you want.

LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');


You should get the following results:
LABDB(LABADMIN)=> CALL addcustomer(999999,'John Smith', 2, '555-5555');
 ADDCUSTOMER
-------------

(1 row)
18. Let's check if the insert was successful:

LABDB(LABADMIN)=> SELECT * FROM CUSTOMER WHERE C_CUSTKEY = 999999;


You should get the following results:
LABDB(LABADMIN)=> SELECT * FROM CUSTOMER WHERE C_CUSTKEY = 999999;
 C_CUSTKEY |   C_NAME   | C_ADDRESS | C_NATIONKEY | C_PHONE  | C_ACCTBAL | C_MKTSEGMENT | C_COMMENT
-----------+------------+-----------+-------------+----------+-----------+--------------+-----------
    999999 | John Smith |           |           2 | 555-5555 |      0.00 |              |
(1 row)
We can see that our insert was successful. Congratulations, you have built your first PureData System stored procedure.


2.2 Adding integrity checks

In this chapter we will add integrity checks to the stored procedure we just created. We will make sure that no duplicate
customer is entered into the CUSTOMER table by querying it before the insert. We will then check with an IF condition whether
the key value already exists in the CUSTOMER table and abort the insert in that case. We will also check the foreign key
relationship to the nation table and make sure that no customer is inserted for a nation that doesn't exist. If any of these
conditions isn't met, the procedure will abort and display an error message.
1. Switch back to the VI view of the procedure with the following command. In case of a message warning about duplicate
files, press Enter.

LABDB(LABADMIN)=> \e addCustomer.sql
2. Switch to insert mode by pressing i

3. Add a new variable REC with the type RECORD in the DECLARE section:
REC RECORD;

A RECORD is a row set with dynamic fields. It can refer to any row that is selected in a SELECT INTO statement. You can
later refer to its fields, for example REC.C_PHONE.
4. Add the following statement before the INSERT statement:

SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;

This statement fills the REC variable with the results of the query. If there are already one or more customers
with the specified key, it will contain the first one. Otherwise the variable will be null.
5. Now we add the IF condition to abort the stored procedure in case a record already exists. After the newly added
SELECT statement add the following lines:
IF FOUND REC THEN
RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
END IF;

In this case we use an IF condition to check if a customer record with the key already exists and has been selected by the
previous SELECT statement. We could do an implicit check on the record or any of its fields and see if it compares to the null
value, but PureData System provides a number of special variables that make this more convenient:
- FOUND specifies if the last SELECT INTO statement has returned any records
- ROW_COUNT contains the number of rows found by the last SELECT INTO statement
- LAST_OID is the object id of the last inserted row; this variable is not very useful unless used for catalog tables
Finally we use a RAISE EXCEPTION statement to throw an error and abort the stored procedure. To add variable values to
the returned string, use the % symbol anywhere in the string. This is similar to the approach used, for example, by the C printf
function; a small sketch follows.
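For example (a small sketch, not part of the procedure we are building), several values can be substituted in order, one per
% symbol:

-- Each % is replaced, in order, by the corresponding value
RAISE NOTICE 'Checked key %, found % matching rows', C_KEY, ROW_COUNT;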
6. We will also check the foreign key relationship to NATION. Add the following lines after the last ones:


SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;


IF NOT FOUND REC THEN
RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
END IF;
This is very similar to the last check, only this time we check if a record was NOT found. Notice that we can reuse the
REC record since it is not typed to a particular table.
Your stored procedure should now look like the following:
CREATE OR REPLACE PROCEDURE addCustomer(integer, varchar(25), integer, varchar(15))
LANGUAGE NZPLSQL RETURNS INT4 AS
BEGIN_PROC
DECLARE
C_KEY ALIAS for $1;
C_NAME ALIAS for $2;
N_KEY ALIAS for $3;
PHONE ALIAS for $4;
REC RECORD;
BEGIN
SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
IF FOUND REC THEN
RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
END IF;
SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
IF NOT FOUND REC THEN
RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
END IF;
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
END_PROC;
7. Save the stored procedure by pressing ESC, then entering :wq! and pressing Enter.

8. In NZSQL create the stored procedure from the script by executing the following command (remember that you can
cycle through previous commands by pressing the UP key):

LABDB(LABADMIN)=> \i addCustomer.sql
9. Now let's test the check for duplicate customer ids by repeating our last CALL statement; we already know that a
customer record with the id 999999 already exists:

LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');


You should get the following results:
LABDB(LABADMIN)=> CALL addCustomer(999999,'John Smith', 2, '555-5555');
ERROR: Customer with key 999999 already exists
LABDB(LABADMIN)=>
This is what we expected: the key value already exists and our first error condition is thrown.


10. Now let's check the foreign key integrity by executing the following command with a customer id that does not exist yet
and a nation key that doesn't exist in the NATION table either. You can double-check this using SELECT statements if
you want:

LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 99, '555-5555');


You should see the following output:
LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 99, '555-5555');
ERROR: No Nation with nation key 99
LABDB(LABADMIN)=>
This is also as we expected. The customer key didn't exist yet, so the first error condition is not thrown, but the check against
the NATION table throws an error.
11. Finally let's try a working example. Execute the following command with a customer id that doesn't exist yet and the
nation key 2 for United States.

LABDB(LABADMIN)=> CALL addCustomer(999998,'James Brown', 2, '555-5555');


You should see a successful execution.
12. Let's check that the value was correctly inserted:

LABDB(LABADMIN)=> SELECT C_CUSTKEY, C_NAME FROM CUSTOMER WHERE C_CUSTKEY = 999998;


This will give you the following results:
LABDB(LABADMIN)=> SELECT C_CUSTKEY, C_NAME FROM CUSTOMER WHERE C_CUSTKEY = 999998;
 C_CUSTKEY |   C_NAME
-----------+-------------
    999998 | James Brown
(1 row)
LABDB(LABADMIN)=>
We have successfully created a stored procedure that can be used to insert values into the CUSTOMER table and that checks
unique and foreign key constraints. Remember that PureData System isn't optimized for lookup queries, so
this will be a fairly slow operation and shouldn't be used for thousands of inserts. But for occasional management it is a
perfectly valid solution to the problem of missing constraints in PureData System.

2.3 Managing your stored procedure

In the last chapters we created a stored procedure that inserts values into the CUSTOMER table and checks constraints.
We will now give a user the right to execute this procedure, and we will use the management functions to make changes to the
stored procedure and verify them.
1. First we will create a user custadmin who will be responsible for adding customers. To do this we need to switch to
the admin user, since users are global objects:

LABDB(LABADMIN)=> \c labdb admin



2. Now we create the user:

LABDB(ADMIN)=> create user custadmin with password 'password';


The user has the same password as the other users in our labs. We do this for simplicity, since it allows us to
omit the password during user switches; this would of course not be done in a production environment.
3. Now we will grant him access to the labdb database; otherwise he couldn't log on:

LABDB(ADMIN)=> grant list, select on labdb to custadmin;


4. Finally we will grant him the right to select from the customer table; he will need this to verify any changes he
has made:

LABDB(ADMIN)=> grant select on customer to custadmin;


5. Now let's test this out. First switch to the user custadmin:

LABDB(ADMIN)=> \c labdb custadmin


6. Now try to select something from the NATION table to verify that the user only has access to the customer table:

LABDB(CUSTADMIN)=> select * from nation;


You should get the message that access is refused:
LABDB(CUSTADMIN)=> select * from nation;
ERROR: Permission denied on "NATION".
LABDB(CUSTADMIN)=>
7. Now let's select something from the CUSTOMER table:

LABDB(CUSTADMIN)=> select c_custkey, c_name from customer where c_custkey = 999998;


The user should be able to select the row from the CUSTOMER table:
LABDB(CUSTADMIN)=> select c_custkey, c_name from customer where c_custkey = 999998;
 C_CUSTKEY |   C_NAME
-----------+-------------
    999998 | James Brown
(1 row)
LABDB(CUSTADMIN)=>
8. Finally let's verify that the user doesn't have INSERT rights on the table:

LABDB(CUSTADMIN)=> INSERT INTO CUSTOMER VALUES (1, '','',1,'',1,'','');


You will not be allowed to insert values into the customer table:

LABDB(CUSTADMIN)=> INSERT INTO CUSTOMER VALUES (1, '','',1,'',1,'','');


ERROR: Permission denied on "CUSTOMER".
LABDB(CUSTADMIN)=>
9. We now need to switch back to the admin user to give custadmin the rights to execute the stored procedure:

LABDB(CUSTADMIN)=> \c labdb admin

10. To grant the right to execute a specific stored procedure we need to specify the full name including all input parameters.
The easiest way to get these in the correct syntax is to first list them with the SHOW PROCEDURE command:

LABDB(ADMIN)=> show procedure all;


You should see the following; you can either cut & paste the arguments or copy them manually:
LABDB(ADMIN)=> show procedure all;
 RESULT  |  PROCEDURE  | BUILTIN |                            ARGUMENTS
---------+-------------+---------+-------------------------------------------------------------------
 INTEGER | ADDCUSTOMER | f       | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15))
(1 row)
11. Now grant the right to execute this stored procedure to CUSTADMIN:

LABDB(ADMIN)=> grant execute on addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER,


CHARACTER VARYING(15)) to custadmin;
12. Let's check the rights of the custadmin user now with \dpu custadmin
You should get the following results:
LABDB(ADMIN)=> \dpu custadmin
User object permissions for user 'CUSTADMIN'
 Database Name | Object Name | L S I U D T L A D B L G O E C R X A | D G U T E X Q Y V M I B R C S H F A L P N S
---------------+-------------+-------------------------------------+---------------------------------------------
 LABDB         | ADDCUSTOMER |                           X         |
 LABDB         | CUSTOMER    |   X                                 |
 GLOBAL        | LABDB       | X X                                 |
(3 rows)
Object Privileges
(L)ist (S)elect (I)nsert (U)pdate (D)elete (T)runcate (L)ock
(A)lter (D)rop a(B)ort (L)oad (G)enstats Gr(O)om (E)xecute
Label-A(C)cess Label-(R)estrict Label-E(X)pand Execute-(A)s
Administration Privilege
(D)atabase (G)roup (U)ser (T)able T(E)mp E(X)ternal Se(Q)uence
S(Y)nonym (V)iew (M)aterialized View (I)ndex (B)ackup (R)estore
va(C)uum (S)ystem (H)ardware (F)unction (A)ggregate (L)ibrary
(P)rocedure U(N)fence (S)ecurity


You can see that the user has only the rights we have given him. He can select data from the customer table and execute
our stored procedure, but he is not allowed to change the customer table directly or execute anything but the stored
procedure.
13. Let's test this. Switch to the custadmin user with the following command: \c labdb custadmin
14. Add another customer to the customer table:

LABDB(CUSTADMIN)=> CALL addCustomer(999997,'Jake Jones', 2, '555-5554');


The insert will have been successful and you will have another row in your table; you can check this with a SELECT query if
you want.
15. We will now make some changes to the stored procedure. To do this we need to switch back to the administrative
account:

LABDB(CUSTADMIN)=> \c labdb admin


16. Now we will modify the stored procedure. First let's have a detailed look at it.

LABDB(ADMIN)=> show procedure addcustomer verbose;


You should see the following output (the row is wide and wraps on screen):
LABDB(ADMIN)=> show procedure addcustomer verbose;
 RESULT  |  PROCEDURE  | BUILTIN |                            ARGUMENTS                              | OWNER | EXECUTEDASOWNER | VARARGS | DESCRIPTION | PROCEDURESOURCE
---------+-------------+---------+-------------------------------------------------------------------+-------+-----------------+---------+-------------+-----------------
 INTEGER | ADDCUSTOMER | f       | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) | ADMIN | t               | f       |             | DECLARE
   C_KEY ALIAS for $1;
   C_NAME ALIAS for $2;
   N_KEY ALIAS for $3;
   PHONE ALIAS for $4;
   REC RECORD;
BEGIN
   SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY;
   IF FOUND REC THEN
      RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
   END IF;
   SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
   IF NOT FOUND REC THEN
      RAISE EXCEPTION 'No Nation with nation key %', N_KEY;
   END IF;
   INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
(1 row)

You can see the input and output arguments, procedure name, owner, whether it is executed as owner or caller, and other details.
Verbose also shows you the source code of the stored procedure. We see that the description field is still empty, so let's add
a comment to the stored procedure. This is important to do if you have a big number of stored procedures in your system.
Note: nzadmin is a very convenient way to manage your stored procedures; it provides most of the management functionality
used in this lab in a graphical UI.
17. Add a description to the stored procedure:


LABDB(ADMIN)=> comment on procedure addcustomer(INTEGER, CHARACTER VARYING(25),


INTEGER, CHARACTER VARYING(15)) IS 'This procedure adds a new customer entry to the
CUSTOMER table';
It is necessary to specify the exact stored procedure signature including the input arguments; these can be cut & pasted from
the output of the SHOW PROCEDURE command. The COMMENT ON command can be used to add descriptions to nearly
all database objects you own, from procedures and tables down to columns; a short illustration follows.
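The same command works for other object types as well; for example (illustrative only, not a lab step):

COMMENT ON TABLE CUSTOMER IS 'TPC-H style customer master data';
COMMENT ON COLUMN CUSTOMER.C_PHONE IS 'Customer phone number, 15 characters';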
18. Verify that your description has been set:

LABDB(ADMIN)=> show procedure addcustomer verbose;


The description field will now contain your comment:
LABDB(ADMIN)=> show procedure addcustomer verbose;
 RESULT  |  PROCEDURE  | BUILTIN |                            ARGUMENTS                              | OWNER | EXECUTEDASOWNER | VARARGS |                           DESCRIPTION                           | PROCEDURESOURCE
---------+-------------+---------+-------------------------------------------------------------------+-------+-----------------+---------+-----------------------------------------------------------------+-----------------
 INTEGER | ADDCUSTOMER | f       | (INTEGER, CHARACTER VARYING(25), INTEGER, CHARACTER VARYING(15)) | ADMIN | t               | f       | This procedure adds a new customer entry to the CUSTOMER table |

19. We will now alter the stored procedure to be executed as the caller instead of the owner. This means that whoever
executes the stored procedure needs access rights to all the objects that are touched in the stored procedure;
otherwise it will fail. This should be the default for stored procedures that encapsulate business logic and do not do
extensive data checking:

LABDB(ADMIN)=> alter procedure addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER,


CHARACTER VARYING(15)) execute as caller;
20. Since the admin user has access to the customer table, he will be able to execute the stored procedure:

LABDB(ADMIN)=> call addCustomer(999996,'Karl Schwarz', 2, '555-5553');

21. Now let's switch to the custadmin user: \c labdb custadmin

22. Try to add another customer as custadmin:

LABDB(CUSTADMIN)=> call addCustomer(999995, 'John Schwarz', 2, '555-5553');


You should see the following results:

LABDB(CUSTADMIN)=> CALL addCustomer(999995,'John Schwarz', 2, '555-5553');


NOTICE: Error occurred while executing PL/pgSQL function ADDCUSTOMER
NOTICE: line 12 at select into variables
ERROR: Permission denied on "NATION".

As expected, the stored procedure now fails. The user custadmin has read access to the CUSTOMER table but no read
access to the NATION table, therefore the nation check results in an exception. While EXECUTE AS CALLER is more secure in
some circumstances, it doesn't fit our use case, where we specifically want to expose some data modification ability to a user
who shouldn't be able to modify the table otherwise. Therefore we will change the stored procedure back:


23. First switch back to the admin user: \c labdb admin


24. Change the stored procedure back to being executed as owner:

LABDB(ADMIN)=> alter procedure addcustomer(INTEGER, CHARACTER VARYING(25), INTEGER,


CHARACTER VARYING(15)) execute as owner;

In this chapter you have set up the permissions for the addCustomer stored procedure and the user custadmin who is
supposed to use it. You also added a comment to the stored procedure.

3 Implementing the checkRegions stored procedure


In this chapter we will implement a stored procedure that performs a check on all rows of the REGION table. The call of the stored
procedure will be very simple and will not contain input arguments. The stored procedure encapsulates a sanity check
of the REGION table that is executed regularly in the PureData System system for administrative purposes.
Our stored procedure will check each row of the REGION table for three things:
1. If the region key is smaller than 1
2. If the name string is empty
3. If the description is lowercase only (this is needed for application reasons)

The procedure will return each row of the REGION table together with additional columns that describe whether the above
constraints are broken. It will also return a notice with the number of faulty rows.
This chapter will teach you to use loops in a stored procedure and to return table results. You will also use dynamic query
execution to create queries on the fly.
You should be familiar with the use of VI for the development of stored procedures from the last chapter. Alternatives to using a
standard text editor for the creation of your stored procedures are graphical development environments like
Aginity or the PureData System Eclipse plugins that can be downloaded from the PureData System Developer Network.
1. Open the already existing empty file checkRegion.sql with the following command (note you can tab out the filename):

LABDB(ADMIN)=> \e checkRegion.sql
2. You are now in the familiar VI interface and you can edit the file. Switch to INSERT mode by pressing i

3. First we will define the stored procedure header, similar to the last procedure. It will be very simple since we will not use
any input arguments. Enter the following code into the editor:

CREATE OR REPLACE PROCEDURE checkRegions() LANGUAGE NZPLSQL RETURNS REFTABLE(tb1) AS
BEGIN_PROC
END_PROC;
~
~
~
~
-- INSERT --


Let's have a detailed look at the RETURNS section. We want to return a result set but do not have to describe the column
names or datatypes of the table object that is returned. Instead we reference an existing table, which needs to exist at the
time the stored procedure is created. This means we will need to create the table TB1 before executing the CREATE
PROCEDURE command.
Once the stored procedure is executed, it will, under the covers, create an empty temporary table that has
the same definition as the referenced table. The results are not actually saved in the referenced table, which is only
used for the definition. This means that multiple stored procedures can be executed at the same time without influencing
each other. Since the created table is temporary, it is cleaned up once the connection to the database is closed.
Note: If the referenced table contains rows they will neither be changed nor copied over to the temporary table; the table is
strictly used for reference.
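
To see this mechanism in isolation, here is a minimal sketch (the table and procedure names are our own, not part of this lab):

-- Template table: defines the result shape only and stays empty
CREATE TABLE ref_demo (val INTEGER);

CREATE OR REPLACE PROCEDURE refDemo() LANGUAGE NZPLSQL RETURNS REFTABLE(ref_demo) AS
BEGIN_PROC
BEGIN
    -- REFTABLENAME resolves to the per-call temporary result table
    EXECUTE IMMEDIATE 'INSERT INTO ' || REFTABLENAME || ' VALUES (42)';
END;
END_PROC;

CALL refDemo();                 -- returns one row: 42
SELECT COUNT(*) FROM ref_demo;  -- still 0; the template is never written to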
4. For our stored procedure we need four variables. Add the following lines after the BEGIN_PROC statement:

DECLARE
rec RECORD;
errorRows INTEGER;
fieldEmpty BOOLEAN;
descUpper BOOLEAN;

The four variables are needed for our stored procedure:

- rec is a RECORD structure; while we loop through the rows of the table we will use it to save and access the values of each row and check them against our constraints
- errorRows will contain the total number of rows that violate our constraints
- fieldEmpty will record whether a row violates either the constraint that the name must not be empty or that the region key must not be smaller than 1; this is appropriate since values of -1 or 0 in the region key are used to denote that it is empty
- descUpper will be true if a record violates the constraint that the description needs to be lowercase

5. We will now add the main BEGIN..END clause and initialize the errorRows variable. Add the following rows after the DECLARE section:

BEGIN
RAISE NOTICE 'Start check of Region';
errorRows := 0;
END;
Each stored procedure must contain at least one BEGIN .. END clause, which encapsulates the executed commands. We
also initially set the number of error rows to 0 and display a short status message.
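
As an aside (not needed for this lab), NZPLSQL also allows an EXCEPTION section at the end of a BEGIN..END block for error trapping. Assuming the PL/pgSQL-style WHEN OTHERS handler documented in the stored procedure guide, it looks roughly like this sketch:

BEGIN
    RAISE NOTICE 'Start check of Region';
    errorRows := 0;
EXCEPTION WHEN OTHERS THEN
    -- runs if any statement in the block raises an error
    RAISE NOTICE 'check aborted because of an error';
END;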
6. We will now add the main loop. It will iterate through all rows of the REGION table and store each row in the rec variable. Add the following lines before the END statement:

FOR rec IN SELECT * FROM REGION ORDER BY R_REGIONKEY LOOP
    fieldEmpty := false;
    descUpper := false;
END LOOP;
RAISE NOTICE ' % rows had an error - see result set', errorRows;


The FOR rec IN expression LOOP .. END LOOP command is used to iterate through a result set, in our case a
SELECT * on the REGION table. The loop body is executed once for every row of the expression and the current row is saved
in the rec variable. The loop needs to be ended with the END LOOP keyword.
There are many other types of loops in NZPLSQL; for a complete list refer to the stored procedure guide. A few common forms are sketched below.
For each iteration of the loop we initially set the values of fieldEmpty and descUpper to false. Variables can be
assigned with the := operator. Finally we display a notice that shows the number of rows that either had an empty field
or an uppercase description. This number is saved in the errorRows variable.
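
The following sketch shows some of those other loop forms; they follow the PL/pgSQL-style syntax NZPLSQL is based on, so treat the stored procedure guide as the authority:

-- Fragments for use inside a procedure body (i would be declared as INTEGER):
FOR i IN 1..3 LOOP                  -- integer range loop
    RAISE NOTICE 'iteration %', i;
END LOOP;

WHILE errorRows < 10 LOOP           -- condition is checked before each pass
    errorRows := errorRows + 1;
END LOOP;

LOOP                                -- unconditional loop with an explicit exit
    errorRows := errorRows + 1;
    EXIT WHEN errorRows >= 10;
END LOOP;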
7. Now it's time to check the rows against our constraints and set our variables accordingly. Enter the following rows after the variable initialization and before the END LOOP keyword:

IF rec.R_NAME = '' OR rec.R_REGIONKEY < 1 THEN
    fieldEmpty := true;
END IF;
IF rec.R_COMMENT <> LOWER(rec.R_COMMENT) THEN
    descUpper := true;
END IF;
IF (fieldEmpty = true) OR (descUpper = true) THEN
    errorRows := errorRows + 1;
END IF;
In this section we check our constraints for each row and set our flags accordingly. First we check if the name field
of the row is the empty string or if the region key is smaller than one. In that case the fieldEmpty variable is set to true.
Note how we can access the fields by appending the field name to our loop record.
The second IF statement checks if the comment field of the row is different from the lowercase version of the comment
field. This is the case if it contains uppercase characters.
Note that we can use the available PureData System functions like LOWER in the stored procedure, just as in a SQL
statement.
Finally, if one of these flags has been set to true by the previous checks, we increase the value of the errorRows
variable by one. The final number is displayed at the end by the RAISE NOTICE statement we already added to the
stored procedure.
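
As a side note, the explicit comparison with true is not required. Since the flags are BOOLEAN, the third IF could equally be written as follows (a sketch; the behavior is unchanged):

IF fieldEmpty OR descUpper THEN
    errorRows := errorRows + 1;
END IF;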
8. Finally, add the following lines after the lines you just added and before the END LOOP statement:

EXECUTE IMMEDIATE 'INSERT INTO '|| REFTABLENAME ||' VALUES ('
    || rec.R_REGIONKEY ||','''
    || trim(rec.R_NAME) ||''','''
    || trim(rec.R_COMMENT) ||''','
    || fieldEmpty ||','
    || descUpper ||')';
These lines add the current row of the REGION table to the result set of our stored procedure, adding two columns that contain the
fieldEmpty and descUpper flags for this row. There are a couple of important points here:
For each call of a stored procedure with a result set as return value, a temporary table is created that is later returned to the
caller. Since its name is unique per call, it needs to be referenced through a variable. This is the REFTABLENAME variable. Apart
from that, adding values to the result set is identical to other INSERT operations.


Since the name of the table is dynamic, we need to execute the INSERT operation as a dynamic statement. This means
that the EXECUTE IMMEDIATE statement is used with a string that contains the query to be executed.
To add variable values to the string, the pipe operator || is used. Note that the values for R_NAME and R_COMMENT are
inserted as strings, which means they need to be surrounded by quotes. To add a quote to a string it needs to be escaped
with a second quote character. This is the reason that R_NAME and R_COMMENT are surrounded by triple quotes. Apart from
that, we trim them so the inserted VARCHAR values are not padded with trailing blanks.
It can be tricky to construct a string like that, and you will see errors only once it is executed. For debugging it can be
useful to construct the string in a variable and display it with a RAISE NOTICE statement before executing it.
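
A sketch of that debugging pattern (sqlStr is our own variable name and would be declared as VARCHAR in the DECLARE section; it is not part of the lab procedure):

-- Build the dynamic statement first, then display it before running it
sqlStr := 'INSERT INTO ' || REFTABLENAME || ' VALUES ('
    || rec.R_REGIONKEY  || ','''
    || trim(rec.R_NAME) || ''')';
RAISE NOTICE 'about to execute: %', sqlStr;
EXECUTE IMMEDIATE sqlStr;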
9. Your VI should now look like this, containing the complete stored procedure:
CREATE OR REPLACE PROCEDURE checkRegions() LANGUAGE NZPLSQL RETURNS REFTABLE(tb1) AS
BEGIN_PROC
DECLARE
    rec RECORD;
    errorRows INTEGER;
    fieldEmpty BOOLEAN;
    descUpper BOOLEAN;
BEGIN
    RAISE NOTICE 'Start check of Region';
    errorRows := 0;
    FOR rec IN SELECT * FROM REGION ORDER BY R_REGIONKEY LOOP
        fieldEmpty := false;
        descUpper := false;
        IF rec.R_NAME = '' OR rec.R_REGIONKEY < 1 THEN
            fieldEmpty := true;
        END IF;
        IF rec.R_COMMENT <> LOWER(rec.R_COMMENT) THEN
            descUpper := true;
        END IF;
        IF (fieldEmpty = true) OR (descUpper = true) THEN
            errorRows := errorRows + 1;
        END IF;
        EXECUTE IMMEDIATE 'INSERT INTO '|| REFTABLENAME ||' VALUES ('
            || rec.R_REGIONKEY ||','''
            || trim(rec.R_NAME) ||''','''
            || trim(rec.R_COMMENT) ||''','
            || fieldEmpty ||','
            || descUpper ||')';
    END LOOP;
    RAISE NOTICE ' % rows had an error - see result set', errorRows;
END;
END_PROC;
-- INSERT --

10. Save and exit VI. Press ESC to enter command mode, enter :wq! to save and force quit, and press Enter.
11. To create the stored procedure, the reference table TB1 needs to exist. Create the table with the following statement:

LABDB(ADMIN)=> create table TB1 as select *, false AS FIELDEMPTY, false as DESCUPPER from region limit 0;


This command creates a table TB1 that has all the columns of the REGION table plus two additional BOOLEAN fields
FIELDEMPTY and DESCUPPER. It will also be empty because we used the LIMIT 0 clause.
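
If you prefer explicit DDL over CREATE TABLE AS, an equivalent definition would presumably look like this sketch (it mirrors the \d output shown in the next step):

CREATE TABLE TB1 (
    R_REGIONKEY INTEGER NOT NULL,
    R_NAME      CHARACTER(25) NOT NULL,
    R_COMMENT   CHARACTER VARYING(152),
    FIELDEMPTY  BOOLEAN,
    DESCUPPER   BOOLEAN
) DISTRIBUTE ON (R_REGIONKEY);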
12. Describe the reference table with \d TB1
You should see the following result:

LABDB(ADMIN)=> \d TB1
                      Table "TB1"
  Attribute  |          Type          | Modifier | Default Value
-------------+------------------------+----------+---------------
 R_REGIONKEY | INTEGER                | NOT NULL |
 R_NAME      | CHARACTER(25)          | NOT NULL |
 R_COMMENT   | CHARACTER VARYING(152) |          |
 FIELDEMPTY  | BOOLEAN                |          |
 DESCUPPER   | BOOLEAN                |          |
Distributed on hash: "R_REGIONKEY"

You can see the three columns of the REGION table and the two additional BOOLEAN fields that will indicate, for each
row, whether the row violates the specified constraints.
Note: this table needs to exist before the procedure can be created.
13. Now create the stored procedure. Execute the script you just created with the following command:
LABDB(ADMIN)=> \i checkRegion.sql
You should successfully create your stored procedure.
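To double-check what was stored, you can display the procedure definition; the following sketch relies on the SHOW PROCEDURE command as documented in the stored procedure guide:

LABDB(ADMIN)=> SHOW PROCEDURE checkRegions VERBOSE;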
14. Now let's have a look at our REGION table. Select all rows:
LABDB(ADMIN)=> SELECT * FROM REGION;
You will get the following results:

LABDB(ADMIN)=> SELECT * FROM REGION;
 R_REGIONKEY |          R_NAME           |          R_COMMENT
-------------+---------------------------+------------------------------
           2 | sa                        | south america
           1 | na                        | north america
           4 | ap                        | asia pacific
           3 | emea                      | europe, middle east, africa
(4 rows)

We can see that none of the rows violates the constraints we defined, which would make for a boring test. So let's test
our stored procedure by adding two rows that do violate our constraints.
15. Add the two violating rows with the following commands:
LABDB(ADMIN)=> INSERT INTO REGION VALUES (0, 'as', 'Australia');


This row violates the lowercase constraint for the comment field and the empty field constraint for the region key.
LABDB(ADMIN)=> INSERT INTO REGION VALUES (6, '', 'mongolia');
This row violates the empty field constraint for the region name.
16. Now, finally, let's try our checkRegions stored procedure:
LABDB(ADMIN)=> call checkRegions();
You should see the following output:
LABDB(ADMIN)=> call checkRegions();
NOTICE: Start check of Region
NOTICE:  2 rows had an error - see result set
 R_REGIONKEY |          R_NAME           |          R_COMMENT           | FIELDEMPTY | DESCUPPER
-------------+---------------------------+------------------------------+------------+-----------
           1 | na                        | north america                | f          | f
           3 | emea                      | europe, middle east, africa  | f          | f
           0 | as                        | Australia                    | t          | t
           4 | ap                        | asia pacific                 | f          | f
           2 | sa                        | south america                | f          | f
           6 |                           | mongolia                     | t          | f
(6 rows)

You can see the expected results. Our stored procedure has found two rows that violate the constraints we check for.
In the FIELDEMPTY and DESCUPPER columns we can easily see that the row with the key 0 has both an empty field
and an uppercase comment. We can also see that row 6 only violates the empty field constraint.
Note that the TB1 table we created doesn't contain any rows; it is only used as a template.
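You can verify this yourself; the count should come back as 0 even after the procedure has returned rows:

LABDB(ADMIN)=> SELECT COUNT(*) FROM TB1;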
17. Finally, let's clean up our REGION table again:
LABDB(ADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY = 0 OR R_REGIONKEY = 6;
18. And let's run our checkRegions procedure again:
LABDB(ADMIN)=> call checkRegions();
You will see the following results:
LABDB(ADMIN)=> call checkRegions();
NOTICE: Start check of Region
NOTICE:  0 rows had an error - see result set
 R_REGIONKEY |          R_NAME           |          R_COMMENT           | FIELDEMPTY | DESCUPPER
-------------+---------------------------+------------------------------+------------+-----------
           3 | emea                      | europe, middle east, africa  | f          | f
           4 | ap                        | asia pacific                 | f          | f
           1 | na                        | north america                | f          | f
           2 | sa                        | south america                | f          | f
(4 rows)

You can see that the table is now error-free and all constraint violation fields are false.


Congratulations, you have finished the stored procedure lab and created two stored procedures that help you manage your database.


© Copyright IBM Corporation 2011
All Rights Reserved.

IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada

IBM products are warranted according to the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited Warranty, International Program License Agreement, etc.) under which they are provided.

IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml

Other company, product and service names may be trademarks or service marks of others.

References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries in which IBM operates.

No part of this document may be reproduced or transmitted in any form without written permission from IBM Corporation.

Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT.
