20 August 2009
developerWorks
ibm.com/developerWorks/
using the DataStage DB2 Connector. In the second part of this article series, you'll learn how to
transform flat file data into XML and store the data in a data warehouse with pureXML.
DataStage overview
IBM InfoSphere DataStage enables firms to extract, transform, and load (ETL) data from a
variety of sources into a data warehouse. Built-in support for multi-processor hardware enables
DataStage to provide high levels of scalability and work with large volumes of data efficiently.
Various "connectors" support a wide range of source and target data formats, including popular
IBM and OEM database management systems, ODBC data sources, third-party applications, real-time messages generated by message queuing software and Web services, and popular file formats.
DataStage provides these capabilities (and others) through a number of software components.
The scenarios in this article series use a subset of these. Specifically, the article scenarios use
DataStage Designer to construct ETL jobs. Each job consists of multiple "stages," and each stage
performs a given task. Examples of such tasks include reading information from a data source,
transforming input data using built-in functions, converting data from one type to another, and
so on. The article examples define stages that involve the DB2 Connector, XML operations, and
various processing operations. Two DataStage technologies are critical for the scenarios: the DB2
Connector and the XML Pack 2.0. This article describes each briefly.
After using DataStage Designer to create and compile your job, you'll use DataStage Director to
execute your work.
reference, the downstream (or receiving) stage uses the reference to get the data by invoking
the source connector directly. The downside of this method is that the LOB data can only
be moved from one connector to another and cannot be transformed in the job (as only the
reference is passed through the links). The connector requires that the ArraySize property be set to 1 when reading or writing XML columns.
- Ability to execute standard SQL statements such as SELECT, INSERT, UPDATE, and DELETE, combinations of these statements, and user-defined SQL, as well as bulk load.
- Support for parallel (multi-process) execution.
- Support for the DB2 Database Partitioning Feature (DPF). The connector can work with partitioned databases in parallel or sequential mode. When running in parallel, a separate DataStage process is allocated for each DB2 partition. The connector can read and write in parallel and can load partitioned databases in parallel.
- Support for guaranteed data delivery through XA (two-phase commit) transactions when used together with the Distributed Transaction Stage (DTS). DTS provides a mechanism for multiple DB2 databases to be updated as part of one XA transaction. Note that DTS is available as a separate patch for DataStage 8.1.
- Support for metadata import through the Common Connector Import Wizard available in DataStage Designer.
- XQueries
- XQueries with embedded SQL
- SQL/XML queries
- Update and delete operations with XML columns
If the XML column is represented with the LongVarChar data type in the DataStage job, the connector gives you the option of passing the column to the next stage either by reference or inline.
Listing 5. Using XQuery with embedded SQL to extract full XML documents
xquery db2-fn:sqlquery('select xml_col from table')
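The same pattern applies to this article's own sample table. The following sketch (not a listing from the original article; the table and namespace come from the sample schema described later, and everything else is an assumption) extracts every account title from TPOXADMIN.ACCOUNT:

```sql
-- Sketch: pull each AccountTitle from the sample table's XML column.
-- The default element namespace matches the sample documents.
xquery
declare default element namespace "http://tpox-benchmark.com/custacc";
for $acct in db2-fn:sqlquery('select info from tpoxadmin.account')
return $acct/Account/AccountTitle
```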
In Listing 6, the connector can pass the result either inline or by reference.
The column name col1 indicates the column as specified on the input link. This column can be
either NVarChar or LongNVarChar and can be passed inline or by reference.
Listing 8. Updating full XML document using a simple SQL update statement
update clients set
contactinfo = xmlparse(document '<email>newemail@someplace.com</email>')
where id = 3227
Listing 9 uses two parameters. The value of col2 replaces the existing /customerinfo/phone data in
the XML document; the value of col1 restricts the rows affected by the update.
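Listing 9 itself is not reproduced in this copy. A sketch of such a two-parameter statement, assuming DB2's XQuery transform (copy/modify/return) expression and treating the parameter markers as the DataStage bindings for col2 and col1 (the varchar length is also an assumption):

```sql
-- Sketch only: the first parameter (col2) supplies the new phone value;
-- the second parameter (col1) restricts which rows are updated.
update clients
set contactinfo = xmlquery(
  'copy $new := $CONTACTINFO
   modify do replace value of node $new/customerinfo/phone with $p
   return $new'
  passing cast(? as varchar(20)) as "p")
where id = ?
```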
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse
Listing 10. Deleting an XML record based on internal XML data filter
delete from clients
where xmlexists ('$c/Client/Address[zip="95116"]'
passing clients.contactinfo as "c")
Listing 11 uses the value of col2 as a parameter to restrict DELETE operations based on the value of
an XML element (the zip or zip code element of a client's address).
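Listing 11 is not reproduced in this copy; a parameterized variant of Listing 10 along those lines might look like the following sketch (the varchar length is an assumption, and the ? marker stands in for col2):

```sql
-- Sketch: delete clients whose address zip matches the col2 parameter.
delete from clients
where xmlexists ('$c/Client/Address[zip=$z]'
  passing clients.contactinfo as "c",
          cast(? as varchar(10)) as "z")
```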
Figure 1. The XML Input stage transforms hierarchical XML data into "flat"
tables
Figure 2. The XML Output stage transforms "flat," tabular structures into XML
hierarchies
The scenarios in this article series rely on DB2's pureXML capabilities, which include optimized
storage of XML data in its native hierarchical format and support for querying XML data in SQL or
XQuery languages.
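As a quick illustration, the same filter over the clients table used in the earlier listings can be expressed either way (a sketch; the document structure beyond Client/Address/zip is assumed):

```sql
-- SQL/XML: a relational query with an XML predicate
select id
from clients
where xmlexists('$c/Client/Address[zip="95116"]'
                passing clients.contactinfo as "c");

-- XQuery: navigate the XML column directly
xquery
for $c in db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client
where $c/Address/zip = "95116"
return $c
```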
Figure 4. Operational data stored in DB2 pureXML serves as input to the data
warehouse
This data model represents a common scenario in which portions of XML data are often
"shredded" into a relational structure. Such portions represent data that business users might
frequently analyze and query. Many business intelligence tools are optimized to support relational
structures, so shredding frequently queried XML data into relational columns can be quite effective.
However, business needs can vary over time, so it can be difficult for administrators to determine
which relational columns should be created. Maintaining full XML data in the data warehouse
allows users to immediately access important business data that wasn't previously shredded into a
relational format.
To simplify the sample scenarios in this article, we use a single DB2 database (named "TPOX") to
store both the operational data and the warehouse data. Of course, in a production environment,
operational data and warehouse data would be managed in separate databases, usually on
separate servers.
Since we use DB2 to manage both operational data and warehouse data, we could have elected
to use built-in DB2 technologies to perform much of the ETL work. However, offloading this
work to DataStage minimizes impact on DB2 operations, which is a common goal in production
environments. In addition, DataStage provides a number of transformation and cleansing functions
beyond those found in DB2. Finally, many firms need to populate their data warehouses with data
from heterogeneous data sources, and DataStage provides critical services to help them do so.
Such capabilities are well-documented elsewhere and are beyond the scope of this article series.
Design overview
The sample scenario in this article stores operational XML data in the TPOXADMIN.ACCOUNT
table, which serves as the source table to DataStage for this scenario. The ACCOUNT table
contains one relational column (ID) and one XML column (INFO). Listing 12 shows how easy it is
to create this table:
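The listing did not survive in this copy of the article; a minimal DDL sketch consistent with that description (one relational ID column and one pureXML INFO column; the constraint is an assumption) would be:

```sql
-- Sketch: the INFO column uses DB2's native XML type.
create table tpoxadmin.account (
  id   int not null primary key,
  info xml
)
```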
The INFO column contains details about the account, including its title, opening date, working
balance, portfolio holdings, and other information. Listing 13 shows a portion of one XML
document stored in the ACCOUNT table. The DB2 script accompanying this article contains the
full set of XML account records (see Download).
Listing 13. Portion of one XML document stored in the ACCOUNT table
<Account id="804130877" xmlns="http://tpox-benchmark.com/custacc">
  <Category>6</Category>
  <AccountTitle>Mrs Shailey Lapidot EUR</AccountTitle>
  <ShortTitle>Lapidot EUR</ShortTitle>
  <Mnemonic>LapidotEUR</Mnemonic>
  <Currency>EUR</Currency>
  <CurrencyMarket>3</CurrencyMarket>
  <OpeningDate>1999-02-20</OpeningDate>
  <AccountOfficer>Soraya Lagarias</AccountOfficer>
  <LastUpdate>2004-02-10T22:33:58</LastUpdate>
  <Balance>
    <OnlineActualBal>896882</OnlineActualBal>
    <OnlineClearedBal>337676</OnlineClearedBal>
    <WorkingBalance>430147</WorkingBalance>
  </Balance>
  . . .
  <Holdings>
    <Position>
      <Symbol>ZION</Symbol>
      <Name>Zions Bancorporation</Name>
      <Type>Stock</Type>
      <Quantity>1927.719</Quantity>
    </Position>
    . . .
  </Holdings>
  . . .
</Account>
For testing simplicity, the target data warehousing database is also configured to be TPOX.
Source information from the INFO column of TPOXADMIN.ACCOUNT will be mapped to two
tables: the DWADMIN.ACCT table, which contains information about the overall account, and
the DWADMIN.HOLDINGS table that contains information about various investments (portfolio
holdings) of a given account. Listing 14 shows how these tables are defined:
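Listing 14 itself is not reproduced here; a DDL sketch matching the column names in Table 1 would look like the following (every data type, length, and constraint is an assumption):

```sql
-- Sketch: data warehouse target tables. Column names follow Table 1;
-- the types and lengths below are assumptions, not the article's DDL.
create table dwadmin.acct (
  id             int not null primary key,
  title          varchar(100),
  currency       char(3),
  workingbalance decimal(15,2),
  officer        varchar(50),
  datachanged    date,
  timechanged    time,
  fullrecord     xml
);

create table dwadmin.holdings (
  id       int not null,
  symbol   varchar(10),
  type     varchar(10),
  quantity decimal(12,3)
);
```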
To understand how the XML source data (in the INFO column of TPOXADMIN.ACCOUNT) is
mapped to the various columns of the data warehouse tables, see Table 1. (The ID columns of
both data warehouse tables are populated from the values of the ID column in the operational table.)
Table 1. XML source data for each column in the data warehouse tables

DWADMIN.ACCT column      Data source (XPath expression)
title                    /Account/AccountTitle
currency                 /Account/Currency
workingbalance           /Account/Balance/WorkingBalance
officer                  /Account/AccountOfficer
datachanged              /Account/LastUpdate (date portion)
timechanged              /Account/LastUpdate (time portion)
fullrecord               /Account (entire document)

DWADMIN.HOLDINGS column  Data source (XPath expression)
symbol                   /Account/Holdings/Position/Symbol
type                     /Account/Holdings/Position/Type
quantity                 /Account/Holdings/Position/Quantity
As you might expect, there are several ways to build a DataStage job for this scenario. This article
employs an incremental development approach. In particular, the initial steps guide you through
creating a portion of the overall DataStage job that extracts, transforms, and loads data into the
DWADMIN.ACCT table. Once this portion is complete and tested, the article guides you through
enhancing the job to extract, transform, and load XML data into the DWADMIN.HOLDINGS table.
First, though, you need to create the appropriate DB2 tables to support this scenario.
Listing 15. Invoking the DB2 script accompanying this article series
db2 -td@ -vf DSsetup.db2
Note: This script is designed to support a DB2 9.5 server running on Windows.
5. Add a Transformer stage to the job. (This stage will split a single XML element value into two
values that will populate two different columns in the target table.)
a. Select the Processing tab in the Palette pane.
b. Locate the Transformer stage and drag this icon to the parallel job pane.
c. Place the icon between the XML Input stage and the final DB2 Connector.
6. Link the stages together.
a. To link the first DB2 Connector to the XML Input stage, press and hold the right mouse button on the DB2 Connector and drag the mouse to the XML Input stage. An arrow appears between the two stages.
b. Link the XML Input to the Transformer stage.
c. Link the Transformer stage to the final DB2 Connector.
7. If desired, customize your job with descriptive names for each stage and link using standard
DataStage facilities. (Consult the Resources section for links to DataStage tutorials and
documentation, if needed.)
8. Verify that your parallel job design is similar to Figure 5, which shows the various stages
linked together, as described in Step 6:
Figure 5. DataStage job skeleton for the first part of the integration
scenario
3. Enter the appropriate details for the connection, including the instance type (DB2), database
name (TPOX), and a valid user ID and password.
4. Click on the Test connection option in the upper right corner of this window to verify that you
can connect to the DB2 TPOX database.
5. After a successful connection, click on Next and then on OK.
6. Accept the default values for the data source location. These include a host name of DB2 and
a database name of TPOX (DB2).
7. Click on Next.
8. Select the TPOXADMIN schema from the Filter drop-down list, verify that the include tables
option is checked, and then click on Next.
9. Select the ACCOUNT table from the list of available tables for this schema.
10. Leave all options unchecked, including the "XML Columns as LOBs" option. Your DataStage
job needs to process and transform the XML, so it will treat it as string data (rather than as an
unstructured large object or LOB).
11. Verify that the TPOXADMIN.ACCOUNT table is slated for import, and click on Import.
12. A pop-up window appears, prompting you to select a folder for the metadata import. Select
Table Definitions, and click on OK.
13. Repeat the prior steps to import the two data warehouse target tables.
In Step 8, select the DWADMIN schema, instead of the TPOXADMIN schema.
In Step 9, select the ACCT and HOLDINGS tables. (You'll only use the DWADMIN.ACCT
table definition initially, but it will save time if you import the DWADMIN.HOLDINGS table
definition now as well.)
14. To confirm that you successfully imported metadata for all necessary tables, expand the Table
Definitions folder in the Repository pane in the upper left corner and confirm that there are
entries for TPOXADMIN.ACCOUNT, DWADMIN.ACCT, and DWADMIN.HOLDINGS.
15. Save your work.
You're now ready to edit each stage of the job.
3. Click on Test in the upper right corner of the pane to verify that you can successfully connect
to the database.
4. Scroll down to the Usage section of the Properties tab, and specify the following settings (as
also shown in Figure 6):
Generate SQL: Yes
Table name: TPOXADMIN.ACCOUNT
Array size: 1
5. Click on View Data in the right-hand side of the Usage row to verify that you can successfully
query the table, then click on OK to save your settings.
6. Click on the Columns tab, and select Load at the bottom of the pane.
7. A window appears with table definitions. Select the TPOXADMIN.ACCOUNT table, and click
on OK.
8. A window appears with the columns of the table. Accept the default in which all columns of
the table are selected. (Note that the INFO column, which was created in DB2 as an XML
column, appears here with an SQL type of NVarChar, which represents a Unicode string. This
is fine.)
9. Click on OK.
10. Specify an appropriate length for the INFO column. For this sample data, a length of 5000
bytes is sufficient.
11. Click on OK.
12. Save your work.
The DB2 Connector stage for the source table is ready. In the next step, you'll customize the DB2
Connector representing the data warehouse target table.
5. Click on View Data to verify that you can successfully query the table. (The first time you run
your job, this table will be empty.)
6. Click on OK to save these settings.
7. Click on the Columns tab, and select Load.
8. Select the DWADMIN.ACCT table, and click on OK.
9. Accept the default in which all columns of the table are selected. (Note that the
FULLRECORD column, which was created in DB2 as an XML column, appears here with an
SQL type of NVarChar.)
10. Click on OK.
11. Specify an appropriate length for the FULLRECORD column. For the sample data, a length of
5000 bytes is sufficient.
12. Click on OK.
13. Save your work.
With the DB2 source and target stages defined, it's time to work on the stages that process the
data.
FULLRECORD: /ns:Account
d. Identify the TITLE column as a key. To do so, change the Key value for TITLE to Yes.
This instructs DataStage to use this XML element value as the repetition element
identifier. For every occurrence of ns:AccountTitle, the stage will produce an output
record. In other words, it will produce a record for each Account since every account
contains an AccountTitle element. In this scenario, other columns could also serve this
purpose, including the CURRENCY, WORKINGBALANCE, and OFFICER columns,
since all of them are mandatory elements of the Account element. We selected TITLE as
the key column for convenience.
Figure 8. Column definitions for output results from the XML Input stage
3. Click OK.
4. Modify the Derivation setting for the TIMECHANGED column of the output link to transform
the data as needed. Recall that the input string contains a full timestamp with date and time
information, and you want to populate the TIMECHANGED column in the DB2 target table
with only a time value.
a. Highlight the appropriate derivation setting, right-click, and select Edit Derivation.
b. A blank pane appears. Use the built-in wizards to select appropriate transformation
function calls, or enter the following code:
TimestampToTime( StringToTimestamp(AccountOverview.TIMECHANGED,
"%yyyy-%mm-%ddT%hh:%nn:%ss"))
In case you're curious, the inner function call converts the input string into a timestamp
that complies with a specific format. The outer function takes this timestamp and
converts it to a time value. For details on these functions or the Transformer stage, see
the Resources section.
5. Verify that your transformation appears similar to Figure 9, which illustrates the mapping
between the input and output links (generated in Step 2) as well as the derivation that you
edited in Step 4:
Figure 11. Edited column values of the DB2 Connector for the
DWADMIN.HOLDINGS table
Figure 12. Column definitions for output from XML Input stage to new
Copy stage
5. Use DataStage Director to inspect the log and verify that the job finished successfully.
6. Optionally, inspect the data in the target table by selecting View Data in the stage editor for
the target DB2 stage.
Summary
Increased use of XML as a preferred format for data exchange is prompting data architects
and administrators to evaluate options for integrating business-critical XML data into their data
warehouses. In this first installment of this two-part series, you learned how IBM InfoSphere
DataStage can extract and transform XML data managed by DB2 pureXML. In addition, you
explored how DataStage can load this data into two tables: one with traditional SQL data types,
and one that features both relational and XML columns.
The second part of this article series explores another important scenario: using DataStage to read
information from a flat file, convert the data into an XML format, and load this XML data into a data
warehouse that contains a table with a DB2 pureXML column.
Acknowledgments
Thanks to Stewart Hanna, Susan Malaika, and Ernie Ostic for their review comments on this
article.
Downloads
DSsetup.zip (141KB)
Resources
Learn
IBM InfoSphere DataStage: Get an overview of IBM InfoSphere DataStage.
"Parallel Job Tutorial" (IBM, 2008): Learn more about creating parallel jobs (publication
SC18-9889-01).
IBM InfoSphere DataStage documentation (IBM Information Server Information Center): Get
more details about DataStage capabilities and how to use them.
DB2 pureXML wiki: Find a comprehensive set of links to demos, free downloads, technical
papers, and more.
DB2 pureXML Cookbook (IBM Press, August 2009): Explore this comprehensive guide to
DB2 pureXML technology for all supported platforms.
"Query XML Data with SQL" (developerWorks, March 2006): Learn how to query data stored
in XML columns using SQL and SQL/XML, and explore many of the sample queries included
in this article.
"Query XML Data with XQuery" (developerWorks, April 2006): Learn how to query data
stored in XML columns using XQuery, and explore many of the sample queries included in
this article.
"Enhance business insight and scalability of XML data with new DB2 9.7 pureXML
features" (developerWorks, April 2009): Get an overview of the latest DB2 pureXML features.
Transaction Processing over XML (TPOX) benchmark: Learn more about the TPoX benchmark, the source of this article's sample data.
developerWorks Information Management zone: Learn more about Information Management.
Find technical documentation, how-to articles, education, downloads, product information,
and more.
Stay current with developerWorks technical events and webcasts.
Get products and technologies
DB2 Express-C 9.7: Download the free community edition of the DB2 Express database server, which includes pureXML.
DB2 9.7 for Linux, UNIX, and Windows: Download a free trial version of DB2 9.7.
Build your next development project with IBM trial software, available for download directly
from developerWorks.
Discuss
Participate in the discussion forum for this content.
Participate in the DB2 pureXML forum or one of the DataStage forums.
Participate in developerWorks blogs and get involved in the My developerWorks community;
with your personal profile and custom home page, you can tailor developerWorks to your
interests and interact with other developerWorks users.
Amir Bar-or
Amir Bar-or is a senior architect and manager in the Enterprise Information
Management Group at IBM's Massachusetts Laboratory. He has more than 10 years of
experience in database research and development as a researcher in HP Labs and
then in IBM SWG. Today, he is leading the development of DataStage future XML
capabilities.
Cynthia M. Saracco
Cynthia M. Saracco is a senior solutions architect at IBM's Silicon Valley Laboratory
who specializes in emerging technologies and database management topics. She
has 23 years of software industry experience, has written 3 books and more than 60
technical papers, and holds 7 patents.
Paul Stanley
Paul Stanley is a senior architect in the Enterprise Information Management Group
in Boca Raton, FL. He has been architecting and managing the development of
connectivity components for WebSphere Transformation Extender and InfoSphere
DataStage for more than 12 years.