
The Data Warehousing and Business Intelligence

DataStage is a tool set for designing, developing, and running applications that populate one
or more tables in a data warehouse or data mart. It consists of client and server components.

Client Components

DataStage has four client components which are installed on any PC running Windows 95,
Windows 2000, or Windows NT 4.0 with Service Pack 4 or later:

• DataStage Designer. A design interface used to create DataStage applications (known as jobs). Each job specifies the data sources, the transforms required, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run by the Server.
• DataStage Director. A user interface used to validate, schedule, run, and monitor
DataStage jobs.
• DataStage Manager. A user interface used to view and edit the contents of the
Repository.
• DataStage Administrator. A user interface used to configure DataStage projects and
users.

Server Components

There are three server components which are installed on a server:

• Repository. A central store that contains all the information required to build a data
mart or data warehouse.
• DataStage Server. Runs executable jobs that extract, transform, and load data into a
data warehouse.
• DataStage Package Installer. A user interface used to install packaged DataStage
jobs and plug-ins.

DataStage Features

DataStage has the following features to aid the design and processing required to build a data
warehouse:

• Uses graphical design tools. With simple point-and-click techniques you can draw a
scheme to represent your processing requirements.
• Extracts data from any number or type of database.
• Handles all the meta data definitions required to define your data warehouse. You can
view and modify the table definitions at any point during the design of your
application.
• Aggregates data. You can modify SQL SELECT statements used to extract data.
• Transforms data. DataStage has a set of predefined transforms and functions you can
use to convert your data. You can easily extend the functionality by defining your own
transforms to use.
• Loads the data warehouse.

You always enter DataStage through a DataStage project. When you start a DataStage client
you are prompted to connect to a project. Each project contains:

• DataStage jobs.
• Built-in components. These are predefined components used in a job.

• User-defined components. These are customized components created using the DataStage Manager or DataStage Designer.

A complete project may contain several jobs and user-defined components.

There is a special class of project called a protected project. Normally nothing can be added,
deleted, or changed in a protected project. Users can view objects in the project, and perform
tasks that affect the way a job runs rather than the job's design. Users with Production
Manager Status can import existing DataStage components into a protected project and
manipulate projects in other ways.

A DataStage job populates one or more tables in the target database. There is no limit to the
number of jobs you can create in a DataStage project. DataStage jobs are defined using
DataStage Designer, but you can view and edit some job properties using the DataStage
Manager. The job design contains:

• Stages to represent the processing steps required
• Links between the stages to represent the flow of data

There are three basic types of DataStage job:

• Server jobs. These are compiled and run on the DataStage server. A server job will
connect to databases on other machines as necessary, extract data, process it, then
write the data to the target data warehouse.
• Parallel jobs. These are available only if you have Enterprise Edition installed. Parallel
jobs are compiled and run on a DataStage UNIX server, and can be run in parallel on
SMP, MPP, and cluster systems.
• Mainframe jobs. These are available only if you have Enterprise MVS Edition
installed. A mainframe job is compiled and run on the mainframe. Data extracted by
such jobs is then loaded into the data warehouse.

There are two other entities that are similar to jobs in the way they appear in the DataStage
Designer, and are handled by it. These are:

• Shared containers. These are reusable job elements. They typically comprise a
number of stages and links. Copies of shared containers can be used in any number of
server jobs and edited as required.
• Job Sequences. A job sequence allows you to specify a sequence of DataStage jobs
to be executed, and actions to take depending on results.

Server and Mainframe jobs consist of individual stages. Each stage describes a data source, a
particular process, or a data mart. For example, one stage may extract data from a data
source, while another transforms it. Stages are added to a job and linked together using the
Designer.

There are three types of stage:

• Built-in stages. Supplied with DataStage. Used for extracting, aggregating, transforming, or writing data.
• Plug-in stages. Additional stages separately installed to perform specialized tasks that
the built-in stages do not support.
• Job Sequence Stages. Special built-in stages which allow you to define sequences of
activities to run.

The following diagram represents the simplest job you could have: a data source, a
Transformer stage, and the final database. The links between the stages represent the flow of
data into or out of a stage.

Data source -> Transformer stage -> Target data warehouse

You must specify the data you want at each stage, and how it is handled. For example, do you
want all the columns in the source data, or only a select few? Should the data be aggregated
or converted before being passed on to the next stage?

You can use DataStage with MetaBrokers in order to exchange metadata with other data
warehousing tools. You might, for example, import table definitions from a data modelling
tool.

DataStage Manager

The DataStage Manager is used to view and edit the contents of the Repository. Some of
these tasks can also be performed from the DataStage Designer. You can use the DataStage
Manager to:

• Import table or stored procedure definitions
• Create table or stored procedure definitions, data elements, custom transforms, server job routines, mainframe routines, machine profiles, and plug-ins

There are also more specialized tasks that can only be performed from the DataStage
Manager. These include:

• Performing usage analysis queries.
• Reporting on Repository contents.
• Importing, exporting, and packaging DataStage jobs.

The DataStage Manager is a means of viewing and managing the contents of the Repository.

You can use the DataStage Manager to:

• Create items
• Rename items
• Select multiple items
• View or edit item properties
• Delete items
• Delete a category
• Copy items
• Move items between categories
• Create empty categories

DataStage Designer

The DataStage Designer is a graphical design tool used by developers to design and develop a
DataStage job.

By default, DataStage initially starts with no jobs open. You can choose to create a new job of one of the following types:

• Server job. These run on the DataStage Server, connecting to other data sources as
necessary.
• Parallel job. These are compiled and run on the DataStage server in a similar way to
server jobs, but support parallel processing on SMP, MPP, and cluster systems.
• Mainframe job. These are available only if you have installed Enterprise MVS Edition.
Mainframe jobs are uploaded to a mainframe, where they are compiled and run.
• Server shared containers. These are reusable job elements. Copies of shared
containers can be used in any number of server jobs and edited as required. They can
also be used in parallel jobs to make server job functionality available.
• Parallel shared containers. These are reusable job elements. Copies of shared
containers can be used in any number of parallel jobs and edited as required.
• Job sequences. A job sequence allows you to specify a sequence of DataStage server
and parallel jobs to be executed, and actions to take depending on results.

Or you can choose to open an existing job of any of these types. You can use the DataStage
options to specify that the Designer always opens a new server, parallel, or mainframe job,
server or parallel shared container, or job sequence when it starts.

The right mouse button accesses various shortcut menus in the DataStage Designer window.

Specifying Designer Options

You can specify default display settings and the level of prompting used when the Designer is
started.

To specify the Designer options, choose Tools > Options... . The Options dialog box appears. The dialog box has a tree in the left pane containing a number of branches, each giving access to pages of settings for individual areas of the DataStage Designer, as follows:

• Appearance branch
   • General options
   • Repository Tree options
   • Palette options
   • Graphical Performance Monitor options
• Default branch
   • General options
   • Mainframe options
• Expression Editor branch
   • Server and parallel options
• Job Sequencer branch
   • SMTP Defaults
   • Default Trigger Colors
• Meta data branch
   • General options
• Printing branch
   • General options
• Prompting branch
   • General options
   • Confirmation options
• Transformer branch
   • General options

Architecture Approach:

Assessing Your Data - Before you design your application, you must assess your data.
DataStage jobs can be quite complex and so it is advisable to consider the following before
starting a job:

• The number and type of data sources. You will need a stage for each data source you
want to access. For each different type of data source you will need a different type of
stage.
• The location of the data. Is your data on a networked disk or a tape? You may find
that if your data is on a tape, you will need to arrange for a custom stage to extract
the data.

• Whether you will need to extract data from a mainframe source. If this is the case,
you will need Enterprise MVS Edition installed and you will use mainframe jobs that
actually run on the mainframe.

• The content of the data. What columns are in your data? Can you import the table
definitions, or will you need to define them manually? Are definitions of the data items
consistent between data sources?
• The data warehouse. What do you want to store in the data warehouse and how do
you want to store it?

NOTE - You must also determine the data you want to load into your data mart or data
warehouse. Not all the data may be eligible.

Salient Activities in a DataStage Build Initiative

Import Table Definitions - Table definitions are the key to your DataStage project and
specify the data to be used at each stage of a DataStage job. Table definitions are stored in
the Repository and are shared by all the jobs in a project. You need, as a minimum, table
definitions for each data source and one for each data target in the data warehouse.

You can view, import, or create table definitions using the DataStage Manager or the
DataStage Designer.

The following properties are stored for each table definition:

• General information about the table or file that holds the data records
• Column definitions describing the individual columns in the record

CREATING TRANSFORMS - Transforms are used in the Transformer stage to convert your
data to a format you want to use in the final data mart. Each transform is built from functions
or routines used to convert the data from one type to another.

You can use the built-in transforms supplied with DataStage. If the built-in transforms are not
suitable, or you want a specific transform to act on a specific data element, you can create
custom transforms.

You can enter or view the definition of a transform in the Transform dialog box.

To provide even greater flexibility, you can also define your own custom routines and functions
from which to build custom transforms. There are three ways of doing this:

• Entering the code within DataStage (using BASIC functions); a small example is sketched after this list
• Creating a reference to an externally cataloged routine
• Importing external ActiveX (OLE) functions
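
As a minimal illustration only (the argument name Arg1, the formatting choice, and the null handling are assumptions for this sketch, not anything prescribed by DataStage), the body of a server routine entered through the Routine editor might look like this:

* Hypothetical transform function body in DataStage BASIC.
* Arg1 is the single input argument; the result is returned in Ans.
* Trim surrounding blanks and fold the value to upper case, mapping
* the null value to an empty string.
If IsNull(Arg1) Then
   Ans = ""
End Else
   Ans = UpCase(Trim(Arg1))
End

Such a routine can then be referenced from a custom transform definition, or called directly in a Transformer stage derivation.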

USING STORED PROCEDURES - If you are accessing data from or writing data to a
database via an ODBC connection, you can use a stored procedure to define the data to use. A
stored procedure can:

• Have associated parameters, which may be input or output
• Return a value (like a function call)
• Create a result set in the same way as an SQL SELECT statement

The definition for a stored procedure (including the associated parameters and meta data) is
stored in the Repository. These stored procedure definitions can be used when you edit an
ODBC stage in your job design.

You can import, create, or edit a stored procedure definition using the DataStage Manager or
DataStage Designer.

DataStage supports the use of stored procedures (with or without input arguments) and the
creation of a result set, but does not support output arguments or return values. A stored
procedure may have a return value or output parameters defined, but these are ignored at run
time.

When writing stored procedures for use with DataStage, you should follow these
guidelines:

• Group parameter definitions for output parameters after the definitions for input
parameters.
• When using raiserror to return a user error, set the severity so that the error is treated as informational.

Creating DataStage Jobs - A DataStage job populates one or more tables in the target
database. There is no limit to the number of jobs you can create in a DataStage project.

Jobs are designed and developed using the DataStage Designer. The DataStage Designer is a
graphical design tool used by developers to design and develop a DataStage job.

A job design contains:

• Stages to represent the data sources, data mart, and processing steps required
• Links between the stages to represent the flow of data

There are three different types of job within DataStage:

• Server jobs. These are available if you have installed DataStage Server. They run on
the DataStage Server, connecting to other data sources as necessary.
• Parallel jobs. These are only available if you have installed Enterprise Edition. These
run on DataStage servers that are SMP, MPP, or cluster systems.
• Mainframe jobs. These are only available if you have installed Enterprise MVS
Edition. Mainframe jobs are uploaded to a mainframe, where they are compiled and
run.

Stages are linked together using the tool palette. The stages that appear in the tool palette
depend on whether you have a server, parallel, or mainframe job open, and on whether you
have customized the tool palette.

This on-line help tells you about the individual stage types that each type of job supports, and
is designed to provide you with quick information when you are actually working on a job. For
general background information about each type of job, and for design tips, see the Manuals
provided in PDF format (or, optionally, in printed form). The manuals are:

• Server Job Developer's Guide
• Parallel Job Developer's Guide
• Mainframe Job Developer's Guide

Note that before you start to develop your job, you must:

1. Assess your data.
2. Create your data warehouse.
3. Define table or stored procedure definitions.
4. Optionally, create and assign data elements.

The STAGE in a DataStage Job - A job consists of stages linked together which describe the
flow of data from a data source to a final data warehouse. A stage usually has at least one
data input and one data output. However, some stages can accept more than one data input,
and output to more than one stage.

Server jobs, parallel jobs, and mainframe jobs have different types of stages. The stages that
are available in the DataStage Designer depend on the type of job that is currently open in the
Designer.

Plug-In Stage - You may find that the built-in stage types do not meet all your requirements
for data extraction or transformation. In this case, you need to use a plug-in stage. The
function and properties of a plug-in stage are determined by the plug-in used when the stage is inserted. Plug-ins are written to perform specific tasks, for example, to bulk load data into a data mart.

Two plug-ins are always installed with DataStage: BCPLoad and Orabulk. You can also choose
to install a number of other plug-ins when you install DataStage.

LINK – A DataStage job's processing flows along the links created in the job. To link stages, do one of the following:

• Click the Link shortcut in the General group of the tool palette. Click the first stage
and drag the link to the second stage. The link is made when you release the mouse
button.
• Use the mouse to select the first stage. Position the mouse cursor on the edge of a
stage until the cursor changes to a circle. Click and drag the mouse to the other stage.
The link is made when you release the mouse button.
• Use the mouse to point at the first stage and right click then drag the link to the
second stage and release it.

The Transformer stage allows you to specify the execution order of links coming into and going out of the stage. When looking at a job design in the DataStage Designer, you can check the link execution order as follows:

• Place the mouse pointer over a link that is an input to or an output from a Transformer
stage. A ToolTip appears displaying the message:

Input execution order = n

for input links, and:

Output execution order = n

for output links. In both cases n gives the link’s place in the execution order. If an input link is
no. 1, then it is the primary link.

Where a link is an output from one Transformer stage and an input to another Transformer
stage, then the output link information is shown when you rest the pointer over it.

JOB Properties:

Each job in a project has properties including optional descriptions and job parameters. You
can view and edit the job properties from the DataStage Designer or the DataStage Manager:

• From the Designer, open the job in the DataStage Designer window and choose Edit >
Job Properties.
• From the Manager, double-click a job in the DataStage Manager window display area
or select the job and choose File > Properties.

The Job Properties dialog box appears. The dialog box differs depending on whether it is a server job, a parallel job, a mainframe job, or a job sequence.

A server job has up to six pages: General, Parameters, Job control, NLS, Performance,
and Dependencies. Note that the NLS page is not available if you open the dialog box from
the Manager, even if you have NLS installed.

Parallel jobs have up to eight pages: General, Parameters, Job Control, Execution, NLS,
Dependencies, Defaults, and, if the Generated OSH visible option has been selected in
the Administrator client, Generated OSH. Note that the NLS page is not available if you open
the dialog box from the Manager, even if you have NLS installed.

A mainframe job has five pages: General, Parameters, Environment, Extensions, and
Operational meta data.

A job sequence has up to four pages: General, Parameters, Job Control, and
Dependencies.

Code Reusability - A container is a group of stages and links. Containers enable you to
simplify and modularize your server job designs by replacing complex areas of the diagram
with a single container stage. You can also use shared containers as a way of incorporating
server job functionality into parallel jobs.

DataStage provides two types of container:

• Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window.

The main purpose of using a DataStage local container is to simplify a complex design visually
to make it easier to understand in the Diagram window. If the DataStage job has lots of
stages and links, it may be easier to create additional containers to describe a particular
sequence of steps. Containers are linked to other stages or containers in the job by input and
output stages.

• Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container:
   • Server shared container. Used in server jobs (can also be used in parallel jobs).
   • Parallel shared container. Used in parallel jobs.

Shared containers also help you to simplify your design but, unlike local containers, they are
reusable by other jobs. You can use shared containers to make common job components
available throughout the project.

You can also include server shared containers in parallel jobs as a way of incorporating server
job functionality into a parallel stage (for example, you could use one to make a server plug-in
stage available to a parallel job).

Job Sequences - DataStage provides a graphical Job Sequencer which allows you to specify a
sequence of server or parallel jobs to run. The sequence can also contain control information,
for example, you can specify different courses of action to take depending on whether a job in
the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled
and run using the DataStage Director. It appears in the DataStage Repository and in the
DataStage Director client as a job.

The job sequence supports the following types of activity:

• Job. Specifies a DataStage server job.
• Routine. Specifies a routine. This can be any routine in the DataStage Repository (but not transforms).
• ExecCommand. Specifies an operating system command to execute.
• Email Notification. Specifies that an email notification should be sent at this point of the sequence (uses SMTP).
• Wait-for-file. Waits for a specified file to appear or disappear.
• Run-activity-on-exception. There can only be one of these in a job sequence. It is executed if a job in the sequence fails to run (other exceptions are handled by triggers).

TRIGGERS

The control flow in the sequence is dictated by how you interconnect activity icons with
triggers. There are three types of trigger:

• Conditional. A conditional trigger fires the target activity if the source activity fulfills the specified condition. The condition is defined by an expression, and can be one of the following types:
   • OK. Activity succeeds.
   • Failed. Activity fails.
   • Warnings. Activity produced warnings.
   • ReturnValue. A routine or command has returned a value.
   • Custom. Allows you to define a custom expression.
   • User status. Allows you to define a custom status message to write to the log.
• Unconditional. An unconditional trigger fires the target activity once the source activity completes, regardless of what other triggers are fired from the same activity.
• Otherwise. An otherwise trigger is used as a default where a source activity has multiple output triggers, but none of the conditional ones have fired.

Different activities can output different types of trigger:

• Wait-for-file, ExecCommand: Unconditional; Otherwise; Conditional - OK; Conditional - Failed; Conditional - Custom; Conditional - ReturnValue
• Routine: Unconditional; Otherwise; Conditional - OK; Conditional - Failed; Conditional - Custom; Conditional - ReturnValue
• Job: Unconditional; Otherwise; Conditional - OK; Conditional - Failed; Conditional - Warnings; Conditional - Custom; Conditional - UserStatus
• Nested Condition: Unconditional; Otherwise; Conditional - Custom
• Run-activity-on-exception, Sequencer, Email notification: Unconditional

CONTROL ENTITIES - The Job Sequencer provides additional control entities to help control
execution in a job sequence. Nested Conditions and Sequencers are represented in the job
design by icons and joined to activities by triggers.

DataStage Job Debugging

The DataStage debugger provides you with basic facilities for testing and debugging your job
designs. The debugger is run from the DataStage Designer and can be invoked from many places within it.

The debugger enables you to set breakpoints on the links in your job. When you run the job in
debug mode, the job will stop when it reaches a breakpoint. You can then step to the next
action (reading or writing) on that link, or step to the processing of the next row of data
(which may be on the same link or another link). Any breakpoints you have set remain if the
job is closed and reopened. Breakpoints are validated when the job is compiled. If a link is
deleted, or either end moved, the breakpoint is deleted. If a link is deleted and another of the
same name created, the new link does not inherit the breakpoint. Breakpoints are not
inherited when a job is saved under a different name, exported, or upgraded.

End of Build - Releasing a Job

If you are developing a job for users on another DataStage system, you must label the job as
ready for deployment before you can package it.

To label a job for deployment, you must release it. A job can be released when it has been
compiled and validated successfully at least once in its life.

Jobs are released using the DataStage Manager. To release a job:

1. From the DataStage Manager, browse to the required category in the Jobs branch in
the project tree.
2. Select the job you want to release in the display area.
3. Choose Tools > Release job. The Job Release dialog box appears.
4. Select the job that you want to release.
5. Click Release Job to release the selected job, or Release All to release all the jobs in
the tree.

A physical copy of the chosen job is made (along with all the routines and code required to run
the job) and it is recompiled.

The released job is automatically assigned a name and version number using the format
jobname%reln.n.n, where jobname is the name of the job you chose to release and n.n.n is
the job version number. When you refer to a job by its released name, this is known as a
"fixed job release", and always equates to that particular version of the job.

If you want to develop and enhance a job design, you must edit the original job. To use the
changes you have made, you must release the job again.

Note: Released jobs cannot be copied or renamed using the Manager.

Integrating with an existing EDW - MetaBrokers allow you to exchange enterprise meta
data between DataStage and other data warehousing tools. For example, you can use
MetaBrokers to import table definitions into DataStage that you have set up using a data
modeling tool. Similarly you can export meta data from a DataStage job to a business
intelligence tool to help it in its analysis of your data warehouse.

Tool Customization Using BASIC programming:

DataStage BASIC is a business-oriented programming language designed to work efficiently with the DataStage environment. It is easy for a beginning programmer to use yet powerful enough to meet the needs of an experienced programmer.

The power of DataStage BASIC comes from statements and built-in functions that take
advantage of the extensive database management capabilities of DataStage. These benefits
combined with other BASIC extensions result in a development tool well-suited for a wide
range of applications.

DataStage BASIC programmers should understand the meanings of the following terms:
• BASIC program
• Source code
• Object code
• Variable
• Function
• Keyword.

BASIC Program: A BASIC program is a set of statements directing the computer to perform
a series of tasks in a specified order. A BASIC statement is made up of keywords and
variables.

Source Code: Source code is the original form of the program written by the programmer.

Object Code: Object code is compiler output, which can be executed by the DataStage RUN
command or called as a subroutine.

Variable: A variable is a symbolic name assigned to one or more data values stored in
memory. A variable's value can be numeric or character string data or the null value; it can be assigned by the programmer or be the result of operations performed by the program. Variable names can be as long as the physical line, but only the first 64 characters
are significant. Variable names begin with an alphabetic character and can include
alphanumeric characters, periods ( . ), dollar signs ( $ ), and percent signs ( % ). Upper- and
lowercase letters are interpreted as different; that is, REC and Rec are different variables.

Function: A BASIC intrinsic function performs mathematical or string manipulations on its arguments. It is referenced by its keyword name and is followed by the required arguments enclosed in parentheses. Functions can be used in expressions; in addition, function arguments can be expressions that include functions. DataStage BASIC contains both numeric and string functions.
• Numeric functions. BASIC can perform certain arithmetic or algebraic calculations,
such as calculating the sine (SIN), cosine (COS), or tangent (TAN) of an angle passed
as an argument.

• String functions. A string function operates on ASCII character strings. For example, the TRIM function deletes extra blank spaces and tabs from a character string, and the STR function generates a particular character string a specified number of times. A few illustrative uses follow this list.
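
Purely as an illustration (the variable names and values below are arbitrary), intrinsic functions are used like this:

* Illustrative uses of BASIC intrinsic functions.
RawName = "   data  warehouse   "
CleanName = Trim(RawName)    ;* extra blanks and tabs removed
Underline = Str("-", 20)     ;* a string of 20 hyphen characters
SineValue = Sin(30)          ;* trigonometric sine of a 30-degree angle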

Keyword: A BASIC keyword is a word that has special significance in a BASIC program
statement. The case of a keyword is ignored; for example, READU and readu are the same
keyword.

BASIC programming allows a developer to perform a wide range of activities. A summarized list is given below:

• Optional statement labels (that is, statement numbers)
• Statement labels of any length
• Multiple statements allowed on one line
• Computed GOTO statements
• Complex IF statements
• Multiline IF statements
• Priority CASE statement selection
• String handling with variable length strings up to 2^32 – 1 characters
• External subroutine calls
• Direct and indirect subroutine calls
• Magnetic tape input and output
• Retrieve data conversion capabilities
• DataStage file access and update capabilities
• File-level and record-level locking capabilities
• Pattern matching

Best Practices:

Design Approach

It is always preferable to follow a lifecycle framework when deploying iterations to the data warehouse. As a warehouse matures, you find out where you made short-sighted decisions in the initial framework and methodology, so it pays to ask the right questions right off the bat. Consider the long-term effects of building 500+ jobs to support the load of an enterprise data warehouse. You are going to have iterative releases where more subject areas and tables are deployed, as well as bug fixes and corrections to existing tables and processes. You are going to want to set up job design standards and shared/common libraries of functions, routines, etc. You are also going to need to do things like set up a consistent parameter framework where all of your jobs are location/host independent.

Release Approach

I recommend considering the merits of a release control system, whereby Iteration #1 of the warehouse is assigned a release name and the project is created under that name, for example EDW_1_0. As you build and work on the next release, EDW_1_1, in a project of the same name, EDW_1_0 remains available for bug fixes to the current production release.
EDW_1_1 would be a full code set of EDW_1_0, plus the enhancements and revisions. Since
you're going to have to coordinate database changes with the implementation of the code
changes, this approach will allow you to create the EDW_1_1 project in the production
environment ahead of time, and deploy a completed release into the project and have it
compiled and in place awaiting your implementation day. On implementation day, you have
the database changes enacted for that release, and you update whatever schedule pointers
are necessary to reflect that EDW_1_1 is now the current released project.

Synchronizing Data Model changes with the Warehouse

As the ETL Architect/Warehouse Architect, one has a challenge in coordinating code changes with database changes. Nothing has been said here about the presentation layer (Business Objects, MicroStrategy, etc.), but if you are dealing with a mature warehouse you also have to toss in the OLAP reporting changes that go along with the database changes. It's a fully integrated approach and you have to consider the full
lifecycle crossing from back room to front office. By versioning your full EDW application at a
project level, you gain the ability to work on maintenance issues on the current release
(EDW_1_0), while developing the third generation release (EDW_1_2), while the second
generation (EDW_1_1) is undergoing user acceptance testing. This type of approach follows
the iterative approach espoused by Inmon and the lifecycle of Kimball.

As for downstream data marts/EDWs, they too should be developed in versioned projects, as each data mart/EDW could be running on its own independent lifecycle. You may have a data mart/EDW enhancement that is delayed because the EDW release changes aren't in place. Trying to bring a data mart/EDW enhancement online at the same time as an EDW enhancement can be VERY difficult. If you have a high-volume issue with a backfill, this could take the data mart/EDW offline during the backfill process, which may be undesirable. If you're running independent teams working on the EDW and the marts, this
coordination effort is like the Russians and the US working on the international space station:
sometimes things go wrong in the translation and the parts don't mate perfectly.

Incorporating and Maintaining Plug-n-Play Features in the DWH

It is better to have a common/shared library of routines and functions that you're going to
keep synchronized across all of the projects. You're going to have shell scripts, SQL scripts,
control files, lots of pieces and parts outside of the ETL tool. You're going to want a good
version control product that can manage everything, from your data model to your shell
scripts to your DS objects. I recommend something robust like PVCS to manage your objects.
From that standpoint, you can now tag and manage your common/shared objects across the
board and package a complete release.

On that note, you're going to want to setup release versioned directory structures within your
system environment so that each version can have a discrete working environment. Your
scripts, control files, etc have to follow the release, as well as DataStage's working directories
for sequential files, hash files, and other sundry pieces. For example, /var/opt/EDW/edw_1_0
could be the base directory for assorted subdirectories to contain runtime files, as well as the
script, control file, and log folders for your release. This approach allows you to seamlessly
migrate your application across the multiple hosts required for development, system test, user
acceptance/QA testing, and production.

Other Guidelines:

• Standard Naming Conventions.
• Use of Enterprise Standard Event Handler shared containers for across-the-board
activities like error handling (including error prioritizing), Event handling (like sending
notifications), generating log files etc.
• Reduce number of active database lookups in a job. Use pre-loaded hash files for the
same.
• Incorporate Checkpoint Restart logic (if Business requires).
• Achieve parameterization as far as practicable.
• Use synonyms vs. fully qualified table names.
• Convert repetitive transformations to routines.

• Use the appropriate native plug-in stage if only one type of RDBMS is in place in the data warehouse; for example, if the DWH runs on Oracle 8/9.x only, use the OCI9 stage and not the ODBC stage.

Points of Caution:

Organizing a DataStage project depends on the number of jobs you expect to have. The number of jobs roughly corresponds to the number of tables in each target. There is a trade-off between the benefits gained by having fewer projects and the complexity of MetaStage and Reporting Assistant, the tools that can extract the ETL business rules and allow you to report against them. I would try to keep the number of jobs under 250 in each project. If it gets over 1000, you will see some performance loss when browsing through the jobs; DataStage itself seems to take longer to do things like pull up a job. Some platforms have much less of an issue with this.

If you can separate your jobs into projects that never overlap then do it. If there is some
overlap in functionality then you cannot easily run jobs in 2 separate projects. Reusability is
not an issue. Jobs usually cannot be reused. Routines are easily copied from one project to
another. Routines are seldom changed. Either they work or they do not work. Replicating
metadata is not a problem either. It does not take long to re-import table definitions or export
them and import them into another project.

NOTE - If you do not separate then you may have issues in isolating sensitive data. Financial
data may be sensitive and need specific developers working on it.

Here are some of my observations based on my experience with DataStage:

• 500+ jobs in a project causes a long refresh time in the DataStage Director. During
this refresh, your Director client is completely locked up. Any edit windows open are
hung until the refresh completes.
• Increasing the refresh interval to 30 seconds mitigates the occurrence of refresh, but
does not lessen the impact of the refresh.
• The usage analysis links on import add a lot of overhead to the import process.
• Compiling a routine can take minutes, even a 1 line routine, depending on how many
jobs there are and how many jobs use the routine.
• A Director refresh will hang a Monitor dialog box until the refresh completes.

SOME PRACTICAL TOPICS

Executing DS jobs through AutoSys/UNIX environments – Enterprise schedulers like AutoSys on the UNIX platform can be used to schedule DataStage jobs. To interface DataStage jobs with an external scheduler, we need to use the Command Line Interface (CLI) of DataStage.
Command Syntax:
dsjob [-file <file> <server> | [-server <server>][-user <user>][-password
<password>]] <primary command> [<arguments>]

Valid primary command options are:


-run
-stop
-lprojects
-ljobs
-linvocations
-lstages
-llinks
-projectinfo
-jobinfo
-stageinfo
-linkinfo
-lparams
-paraminfo
-log
-logsum
-logdetail
-lognewest
-report
-jobid

So by using the various command options we can get the relevant information about a job or about the project. A DataStage job failure is passed back to AutoSys as an exit code. The simplest way to achieve this is to use the -jobstatus parameter when invoking the job. The entire logic can be written in a shell file and invoked through an AutoSys command job. The sample code for the same is given below:

#Get Job Status

#$DSSERVER - DataStage Server name. Information is available from the prod.profile file
#$DSUSERID - DataStage Server User ID. Information is available from the prod.profile file
#$DSPASSWORD - DataStage Server Password. Information is available from the prod.profile file
#$PROJECTNAME - The name of the DataStage project. Information is available from the prod.profile file
#${BinFileDirectory} - the bin directory of the DataStage Engine. Information is available from the prod.profile file
#$1 - Job Name. Passed as command line parameter
#$2 - Log file name. Passed as command line parameter

#Execute the DataStage job
$BinFileDirectory/dsjob -server $DSSERVER -user $DSUSERID -password $DSPASSWORD -run $PROJECTNAME $1

#Loop to check the status of the job from the defined log file
while [ 1 -eq 1 ]
do
   $BinFileDirectory/dsjob -server $DSSERVER -user $DSUSERID -password $DSPASSWORD -jobinfo $PROJECTNAME $1 > $2
   jobstatus=`grep 'Job Status' $2 | cut -d':' -f 2 | cut -d'(' -f 1`
   echo $jobstatus
   if [[ $jobstatus != ' RUNNING ' ]]; then
      if [[ $jobstatus != ' RUN OK ' ]]; then
         auditstatus='FAILURE'
         echo $auditstatus
         exit 1
      else
         auditstatus='SUCCESS'
         echo $auditstatus
         exit 0
      fi
   fi
   sleep 30   #wait before polling the job status again
done
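
As an illustration only (the script name is hypothetical), if the logic above is saved as run_ds_job.sh, an AutoSys command job would invoke it as run_ds_job.sh <JobName> <LogFileName>, and the exit code of 0 or 1 returned by the script is what AutoSys uses to mark the job as success or failure.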

DS job options on "Restartability", reaching commit points, logs and archives -

Restartability always carries with it a burden of needing to stage data on disk. You have to design for this, since DataStage is intended to keep data in memory as much as possible (for speed). So this should be implemented only if there is a business requirement; otherwise one should go for the option of a START-OVER. Here is an overview of the approach to implement a checkpoint restart:

A hierarchy of control jobs (job sequences) is the easiest way to accomplish Restartability. It
is not advisable to abort; I prefer to log warnings and other restart status information, so that
recovery can be 100% automatic. However, this does require customizing the code generated
when a job sequence is compiled. In fact, the main control jobs should never abort. Use
DSJ.ERRNONE as the second argument for DSAttachJob, and never call DSLogFatal. Never use
ABORT or STOP statements. Never return non-zero codes from before/after subroutines
(instead pass results and status as return values).

A good approach is to develop a job control library that reads a simple dependency matrix, which greatly extends the ability to manage hundreds of jobs in a single process. Once you have a dependency tree, you can have a job control process manage the execution of the jobs and track completed jobs and waiting jobs. You control the absolute level of restart capabilities, milestone tracking, etc. You also gain the ability to start and stop at milestones, etc. You can customize parameter value assignments to each job's needs. (Parameters can be read from a file and set in a job at runtime.)
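
As a hedged sketch of what such a wrapper can look like (the job name, parameter names, and values are illustrative assumptions, and error handling is kept to a minimum), the DataStage BASIC job control API can be used along these lines:

$INCLUDE DSINCLUDE JOBCONTROL.H
* Illustrative wrapper: run one job from a control routine without
* aborting the controller. Job and parameter names are examples only.
JobName = "JobLoadCustomer"

* Attach with DSJ.ERRNONE so problems are reported through return
* values rather than by aborting the calling routine.
hJob = DSAttachJob(JobName, DSJ.ERRNONE)

* Supply parameter values (these could equally be read from a file).
ErrCode = DSSetParam(hJob, "SourceDir", "/var/opt/EDW/edw_1_0/data")
ErrCode = DSSetParam(hJob, "ProcessDate", "2004-01-31")

* Start the job and wait for it to finish.
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)

* Record the outcome instead of calling DSLogFatal or ABORT.
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNOK Or Status = DSJS.RUNWARN Then
   Call DSLogInfo(JobName : " finished with status " : Status, "JobControl")
End Else
   Call DSLogWarn(JobName : " failed with status " : Status, "JobControl")
End

ErrCode = DSDetachJob(hJob)

A dependency-driven controller essentially repeats this pattern for each job whose predecessors have completed successfully.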

The only challenge becomes maintaining the dependency tree. One can maintain the tree in an
Oracle (any RDBMS) table. It is a good approach because it allows full metadata exposure as
to the process execution flow. The dependency matrix can be an Excel Spreadsheet that can
be designed on the dependency tree by listing jobs and a space separated list of immediate
predecessor jobs.

Realize that DataStage has a wonderful API library of job control functions. One should
leverage the power of the underlying BASIC language to create a customized job control with
automatic 'Resurrection' ability where it picks up right from where it left off in the job stream.

Reaching Commit Points

The commit points can be controlled through error thresholds. The error threshold can be a global parameter maintained at the prod.profile level or be job specific. In the second case the error threshold for each job needs to be passed as a parameter to the particular job. There are two approaches that can be considered while using the error threshold:

Approach 1 – Figure out the count of erroneous records (as per the given business logic) and compare it with the error threshold. If the count of erroneous records is greater than the error threshold, have the control job abort the processing job and write to a log file. Use standard UNIX commands like grep/awk to read from the log file. This is mainly used where the DWH design is in the form of a batch architecture.

Approach 2 – Let the processing job continue with its activity and do the comparison with the
error threshold intermittently during the job activity. Have an external job monitor the
processing job and ABORT the processing job as soon as the error threshold is reached. This
approach is applied in a Real Time DWH.
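
As an illustration of Approach 2 (the job, stage, and link names, the threshold, and the polling interval below are assumptions for the sketch), a monitoring routine could poll the reject link row count and stop the job when the threshold is breached:

$INCLUDE DSINCLUDE JOBCONTROL.H
* Illustrative monitor: stop a running job once its reject link row
* count exceeds the error threshold. All names are examples only.
ErrorThreshold = 500
hJob = DSAttachJob("JobLoadFacts", DSJ.ERRNONE)

Loop
   RejectCount = DSGetLinkInfo(hJob, "xfmValidate", "lnkRejects", DSJ.LINKROWCOUNT)
   Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
Until Status <> DSJS.RUNNING Or RejectCount > ErrorThreshold Do
   Sleep 60    ;* wait a minute between polls
Repeat

If RejectCount > ErrorThreshold Then
   Call DSLogWarn("Error threshold exceeded; stopping job", "ThresholdMonitor")
   ErrCode = DSStopJob(hJob)
End

ErrCode = DSDetachJob(hJob)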

Process Logging
In all approaches it is advisable to have a DataStage job to dump the requisite contents from
DataStage job logs into, for example, delimited text files.

One could then create a job control routine that processes a particular selection of jobs in the project (maybe all of them, maybe just the ones with a status of DSJS.RUNWARN, DSJS.RUNFAILED or DSJS.CRASHED) and, for each of these, executes your "dump log" job.
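
A hedged sketch of such a routine is shown below; the "dump log" job name (DumpJobLog) and its parameter (JobToDump) are hypothetical and would be replaced by whatever the project actually uses:

$INCLUDE DSINCLUDE JOBCONTROL.H
* Illustrative routine: run a (hypothetical) DumpJobLog job against every
* job in the project that last finished with warnings or worse.
JobList = DSGetProjectInfo(DSJ.JOBLIST)
NumJobs = DCount(JobList, ",")

For i = 1 To NumJobs
   JobName = Field(JobList, ",", i)
   hJob = DSAttachJob(JobName, DSJ.ERRNONE)
   Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
   ErrCode = DSDetachJob(hJob)

   If Status = DSJS.RUNWARN Or Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
      hDump = DSAttachJob("DumpJobLog", DSJ.ERRNONE)
      ErrCode = DSSetParam(hDump, "JobToDump", JobName)
      ErrCode = DSRunJob(hDump, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(hDump)
      ErrCode = DSDetachJob(hDump)
   End
Next i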

Executing Jobs in a Controlled Sequence within the DS environment


DataStage provides a graphical Job Sequencer which allows you to specify a sequence of
server jobs or parallel jobs to run. The sequence can also contain control information; for
example, you can specify different courses of action to take depending on whether a job in the
sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and
run using the DataStage Director. It appears in the DataStage Repository and in the
DataStage Director client as a job.

Designing a job sequence is similar to designing a job. You create the job sequence in the
DataStage Designer, add activities (as opposed to stages) from the tool palette, and join these
together with triggers (as opposed to links) to define control flow. Each activity has properties
that can be tested in trigger expressions and passed to other activities further on in the
sequence. Activities can also have parameters, which are used to supply job parameters and
routine arguments. The job sequence itself has properties, and can have parameters, which
can be passed to the activities it is sequencing.

The job sequence supports the following types of activity:


• Job - Specifies a DataStage server or parallel job.
• Routine - Specifies a routine. This can be any routine in the DataStage Repository
(but not transforms).
• ExecCommand - Specifies an operating system command to execute.
• Email Notification - Specifies that an email notification should be sent at this point of
the sequence (uses SMTP).
• Wait-for-file - Waits for a specified file to appear or disappear.
• Run-activity-on-exception - There can only be one of these in a job sequence. It is
executed if a job in the sequence fails to run (other exceptions are handled by
triggers).

Control Entities
The Job Sequencer provides additional control entities to help control execution in a job
sequence. Nested Conditions and Sequencers are represented in the job design by icons and
joined to activities by triggers.
A nested condition allows you to further branch the execution of a sequence depending on a
condition. The Pseudo code:
Load/init jobA
Run jobA
If ExitStatus of jobA = OK then /*tested by trigger*/
    If Today = “Wednesday” then /*tested by nested condition*/
        run jobW
    If Today = “Saturday” then
        run jobS
Else
    run jobB

A sequencer allows you to synchronize the control flow of multiple activities in a job
sequence. It can have multiple input triggers as well as multiple output triggers.
The sequencer operates in two modes:
• ALL mode- In this mode all of the inputs to the sequencer must be TRUE for any of
the sequencer outputs to fire.
• ANY mode- In this mode, output triggers can be fired if any of the sequencer inputs
are TRUE.

Tips on performance and tuning of DS jobs


Here is an overview of some design techniques for getting the best possible performance
from DataStage jobs that one is designing.

Translating Stages and Links to Processes


When you design a job you see it in terms of stages and links. When it is compiled, the
DataStage engine sees it in terms of processes that are subsequently run on the server.
How does the DataStage engine define a process? It is here that the distinction between active
and passive stages becomes important. Active stages, such as the Transformer and Aggregator, perform processing tasks, while passive stages, such as the Sequential File stage and ODBC stage, read or write data sources and provide services to the active stages.

At its simplest, active stages become processes. But the situation becomes more complicated
where you connect active stages together and passive stages together. What happens when
you have a job that links two passive stages together? Obviously there is some processing
going on.

Under the covers DataStage inserts a cut-down transformer stage between the passive stages
which just passes data straight from one stage to the other, and becomes a process when the
job is run. What happens where you have a job that links two or more active stages together?
By default this will all be run in a single process. Passive stages mark the process boundaries,
all adjacent active stages between them being run in a single process.

Behavior of DataStage jobs on Single and Multiple Processor systems


The default behavior when compiling DataStage jobs is to run all adjacent active stages in a
single process. This makes good sense when you are running the job on a single processor
system.

When you are running on a multi-processor system it is better to run each active stage in a
separate process so the processes can be distributed among available processors and run in
parallel. There are two ways of doing this:
• Explicitly – by inserting IPC stages between connected active stages.
• Implicitly – by turning on inter-process row buffering either project wide
(using the DataStage Administrator) or for individual jobs (in the Job
Properties dialog box).

The IPC facility can also be used to produce multiple processes where passive stages are
directly connected. This means that an operation reading from one data source and writing to
another could be divided into a reading process and a writing process able to take advantage
of multiprocessor systems.

Partitioning and Collecting


With the introduction of enhanced multi-processor support from Release 6 onwards, there are
opportunities to further enhance the performance of server jobs by partitioning data.

The Link Partitioner stage allows you to partition data you are reading so it can be
processed by individual processors running on multiple processors. The Link Collector stage
allows you to collect partitioned data together again for writing to a single data target.

Diagnosing the Jobs


Once the jobs have been designed it is better to run some diagnostics to see if performance
could be improved.
There are two factors that may affect the performance of your DataStage job:
• It may be CPU limited
• It may be I/O limited

You can now obtain detailed performance statistics on a job to enable you to identify those
parts of a job that might be limiting performance, and so make changes to increase
performance.

The collection of performance statistics can be turned on and off for each active stage in a
DataStage job. This is done via the Tracing tab of the Job Run Options dialog box: select the
stage you want to monitor and select the Performance statistics check box. Use shift-click to
select multiple active stages to monitor from the list.

Interpreting Performance Statistics


The performance statistics relate to the per-row processing cycle of an active stage, and of
each of its input and output links. The information shown is:
• Percent- The percentage of overall execution time that this part of the process
used.
• Count- The number of times this part of the process was executed.
• Minimum - The minimum elapsed time in microseconds that this part of the process
took for any of the rows processed.
• Average - The average elapsed time in microseconds that this part of the process
took for the rows processed.

Care should be taken when interpreting these figures. For example, when in-process active-stage-to-active-stage links are used, the percent column will not add up to 100%. Also be aware that, in these circumstances, if you collect statistics for the first active stage, the entire cost of the downstream active stage is included in the active-to-active link. This distortion remains even
where you are running the active stages in different processes (by having inter-process row
buffering enabled) unless you are actually running on a multi-processor system.

If the Minimum figure and Average figure are very close, this suggests that the process is CPU
limited. Otherwise poorly performing jobs may be I/O limited. If the Job monitor window
shows that one active stage is using nearly 100% of CPU time this also indicates that the job
is CPU limited.

Additional Information to improve Job performance


CPU Limited Jobs – Single Processor Systems - The performance of most DataStage jobs
can be improved by turning in-process row buffering on and recompiling the job. This allows
connected active stages to pass data via buffers rather than row by row. You can turn in-process row buffering on for the whole project using the DataStage Administrator.
Alternatively, you can turn it on for individual jobs via the Performance tab of the Job
Properties dialog box.

CPU Limited Jobs - Multi-processor Systems – The performance of most DataStage jobs
on multiprocessor systems can be improved by turning on inter-process row buffering and
recompiling the job. This enables the job to run using a separate process for each active
stage, which will run simultaneously on a separate processor. You can turn inter-process row
buffering on for the whole project using the DataStage Administrator. Alternatively, you can
turn it on for individual jobs via the Performance tab of the Job Properties dialog box.

CAUTION: You cannot use inter-process row-buffering if your job uses COMMON blocks in
transform functions to pass data between stages. This is not recommended practice, and it is
advisable to redesign your job to use row buffering rather than COMMON blocks.

If you have one active stage using nearly 100% of CPU you can improve performance by
running multiple parallel copies of a stage process. This is achieved by duplicating the CPU-
intensive stage or stages (using a shared container is the quickest way to do this) and
inserting a Link Partitioner and Link Collector stage before and after the duplicated stages.

I/O Limited Jobs - Although it can be more difficult to diagnose I/O limited jobs and improve
them, there are certain basic steps you can take:
• If you have split processes in your job design by writing data to a Sequential file and
then reading it back again, you can use an Inter Process (IPC) stage in place of the
Sequential stage. This will split the process and reduce I/O and elapsed time as the
reading process can start reading data as soon as it is available rather than waiting for
the writing process to finish.
• If an intermediate sequential stage is being used to land a file so that it can be fed to
an external tool, for example a bulk loader, or an external sort, it may be possible to
invoke the tool as a filter command in the Sequential stage and pass the data directly to the tool.
• If you are processing a large data set you can use the Link Partitioner stage to split it into multiple parts without landing intermediate files.

If a job still appears to be I/O limited after taking one or more of the above steps, you can use the performance statistics to determine which individual stages are I/O limited. The following can be done:
1. Run the job with a substantial data set and with performance tracing enabled for each of the active stages.
2. Analyze the results and compare them for each stage. In particular, look for active stages that use less CPU than others, and which have one or more links where the average elapsed time is noticeably greater than the minimum.

Once you have identified the stage, the actions you take might depend on the types of passive stage involved in the process. Poorly designed hashed files can have particular performance implications. For all stage types you might consider:
• redistributing files across disk drives
• changing memory or disk hardware
• reconfiguring databases
• reconfiguring operating system

Hash File Design - Poorly designed hashed files can be a cause of disappointing
performance. Hashed files are commonly used to provide a reference table based on a single
key. Performing lookups can be fast on a well-designed file, but slow on a poorly designed
one. Another use is to host slowly-growing dimension tables in a star-schema warehouse
design. Again, a well designed file will make extracting data from dimension files much faster.

There are various steps you can take within your job design to speed up operations that read
and write hash files.

• Pre-loading - You can speed up read operations of reference links by pre-loading a hash file into memory. Specify this on the Hash File stage Outputs page.
• Write Caching - You can specify a cache for write operations such that data is written
there and then flushed to disk. This ensures that hashed files are written to disk in
group order rather than the order in which individual rows are written (which would by
its nature necessitate time-consuming random disk accesses). If server caching is
enabled, you can specify the type of write caching when you create a hash file; the file
then always uses the specified type of write cache. Otherwise you can turn write
caching on at the stage level via the Outputs page of the hash file stage.
• Pre-allocating - If you are using dynamic files you can speed up loading the file by
doing some rough calculations and specifying the minimum modulus accordingly. This
greatly enhances operation by cutting down or eliminating split operations. You can
calculate the minimum modulus as follows: minimum modulus = estimated data
size/(group size * 2048). When you have calculated your minimum modulus you
can create a file specifying it, or resize an existing file specifying it (using the RESIZE
command). A worked example is given after this list.
• Calculating static file modulus - You can calculate the modulus required for a static
file using a similar method as described above for calculating a pre-allocation modulus
for dynamic files: modulus = estimated data size/(separation * 512). When you
have calculated your modulus you can create a file specifying it (using the Create File
feature of the Hash file dialog box) - or resize an existing file specifying it (using the
RESIZE command).
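
As a worked example of the two formulas above, the BASIC fragment below calculates a
minimum modulus and a static modulus. The data size, group size and separation used here
are illustrative assumptions, not recommendations; the results are rounded up so that the file
is not undersized.

* Worked example only - data size, group size and separation are assumed values
EstimatedDataSize = 500000000                    ;* approx. 500 MB of data
GroupSize = 1                                    ;* dynamic file group size
Separation = 4                                   ;* static file separation
MinimumModulus = INT(EstimatedDataSize / (GroupSize * 2048)) + 1
StaticModulus = INT(EstimatedDataSize / (Separation * 512)) + 1
PRINT "Suggested minimum modulus : " : MinimumModulus
PRINT "Suggested static modulus  : " : StaticModulus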

DWH Architecture based on a Universal File Format (UFF):



Creating a UFF is another approach to architecting a DWH, where all the input file formats are
converted to a UFF. This approach aims at having a specific converter module per file that reads
the input file and converts it to the UFF based on the metadata. The metadata would capture
all the possible attributes that all the feed files would have, along with the layout of each file.

There would be a common module/process that would process the UFF and load it to the DWH.
This split-processing approach would act as a common post-load process that can run in
multiple threads and will not have any dependency on the processing of the individual input
files. In case the data source changes from a feed file to RDBMS table(s), this approach can
still be applied. Considering Oracle as the target database, one can implement an Oracle
Transparent or Procedural Gateway to pull data from the other RDBMS, convert it to a UFF
using the specific converter module and then use the common module to load the UFF to the
staging area.
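
The core of each converter module is a simple position-mapping step: every attribute in the
feed record is looked up in the metadata and its value is placed at that attribute's fixed
position in the UFF record. The condensed sketch below shows this idea using the same BASIC
intrinsics (dcount, field, EReplace) as the full pseudo code later in this section; the variable
names here are illustrative only.

* Condensed sketch of the UFF position mapping (names are illustrative)
* InRec      - one comma-delimited detail record from the feed file
* DataPos(x) - UFF position of the x-th attribute, resolved from the metadata
* TotalAttr  - total number of attributes known to the metadata
RecStr = STR(",", TotalAttr)                ;* empty UFF record: one comma per attribute
for x = 1 to dcount(InRec, ",")
Val = TrimB(TrimF(field(InRec, ",", x)))    ;* x-th value from the feed record
RecStr = EReplace(RecStr, ",", Val : ",", 1, DataPos(x))
next x
* RecStr now holds the UFF record and can be passed to the common load module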

There would also be the capability to process a file having a new attribute (outside the list).
The architecture would load the data for this attribute in the staging area but will not
propagate it to the Datamarts/EDW. In the staging area this attribute would be marked as
UNK (Unknown) and notification would be sent to the process owner(s) and/or support team
regarding the occurrence of this new attribute. After the process owner validates this attribute,
the UNK tag would be taken off and the attribute can flow to the Datamarts/EDW for
reporting.
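
A sketch of how the UNK tagging might look inside the converter is given below. It assumes,
purely for illustration, that the GetPosFromArray routine returns an empty string for an
attribute that is not yet present in the metadata; the routine and variable names are the ones
declared in the pseudo code that follows, but this exact behaviour is an assumption.

* Illustrative sketch only - assumes GetPosFromArray returns "" for an attribute
* that is not yet defined in the metadata table
Pos = GetPosFromArray(AttribStr, AttribPos, NewAttrib, MailInfo, hdbc)
If Pos = "" Then
AttribTag = "UNK"                   ;* load to the staging area only, do not propagate
MessageBody = NewAttrib : " is a new attribute in " : FileName : ". Please validate."
SendMailResult = SendMail(ToAddressList, FromAddress, Subject, MessageBody)
End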

A sample pseudo code in DataStage BASIC implementing the above approach is given below:
$INCLUDE DSINCLUDE JOBCONTROL.H
******************************************************************
*******************************************
* Define the Event And Error Logging Routines
EventNo = 0
Result = ""
Action = ""
ErrorNo = 0
OprnCode = ""
AMPM = ""
RR = 0
*Declare the Library Functions
Deffun InsertEventLog(EventNo, Result, Action, "") Calling "DSU.InsertEventLog" ;* Function for Event Logging
Deffun InsertErrorLog(ErrorNo, FileName, OprnCode, "") Calling "DSU.InsertErrorLog" ;* Function for Error Logging
Deffun UpdateProcessQue(FileName, ArgFileStatus, ArgOprnCode, ArgFileDateTime, "") Calling "DSU.UpdateProcessQue" ;* Gets information from a process queue - a workaround for implementing a REAL TIME DWH
*Deffun GetMetaDataString("") Calling "DSU.GetMetaDataString" - Get Metadata Information
Deffun GetPosFromArray(AttribStr, AttribPos, MetData(x), MailInfo, "") Calling "DSU.GetPosFromArray" ;* Get File Layout details
Deffun InsertUFFData(RecStr, FileName, "AHT", "") Calling "DSU.InsertUFFData" ;* Create and Load UFF to the Staging area
Deffun GetMetaDataCount("") Calling "DSU.GetMetaDataCount" ;* Sanity check whether any new attribute has been processed
* Next 3 functions are for Email Notification
Deffun GetFromMailAddress("Dummy") Calling "DSU.GetFromMailAddress"
Deffun GetToMailAddress(ArgOprnCd) Calling "DSU.GetToMailAddress"
Deffun SendMail(ArgToAddressList, ArgFromAddress, ArgSubject, ArgMessageBody) Calling "DSU.SendMail"

******************************************************************
*******************************************
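* Allocate the SQL environment and a connection handle, then connect to the staging
* database ("DSN NAME", "USID" and "PWD" below are placeholders)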

status = SQLAllocEnv(henv)
status = SQLAllocConnect(henv, hdbc)
status = SQLConnect(hdbc, "DSN NAME", "USID", "PWD")

iTotalCount = 0
******************************************************************
********************************************
******************************************************************
***************************
*Setting the Mailing Information

ToAddressList = GetToMailAddress("PROCESS Group")


FromAddress = GetFromMailAddress("dummy")
Subject = "EDSS Autogenerated message (Process Group Name)"
******************************************************************
****************************

* Check the file extension; if it is not .AHT then move to the Error\INVALIDFILE folder
If UpCase(Right(FileName, Len(FileName) - Index(FileName, ".", 1))) <> "AHT" then goto InvalidFile
******************************************************************
********************************************
* Open the raw data file for processing
OPENSEQ FilePath : "\WORK\": FileName to INFILE THEN
PRINT "INPUT FILE OPENED FOR PROCESSING" END
ELSE PRINT "DID NOT OPEN INPUT FILE"

Ans3 = InsertEventLog(6,"P",FileName : " picked from Queue.",hdbc)


******************************************************************
******************************************
* Function GetMetaDataCount is called to get the count of metadata entries from table META_DATA
iTotalCount = GetMetaDataCount(hdbc)

********************************************* Variable declarations *********************************************

* Stores the META DATA, which is then converted to comma delimited
dim MetData(iTotalCount)

* Stores the position of the META DATA held in the above array at
* their respective positions
dim DataPos(iTotalCount)

* Stores [Start Section], then the rows in comma delimited format with the
* headers, then the detail values; this repeats for each section.
* Virtual representation of the intermediate file, which was being
* generated earlier.
dim ArrTmp(300)
* Counter for the array ArrTmp; denotes the number of records (length of the array)
ArrTmpCnt = 0

* Stores the string that is to be inserted in the table STAGING_FILE_DATA;
* represents a UFF record with comma delimited values at their respective positions
RecStr = ""

MetaData = ""
DataVal = ""
ErrorCode = 0
CNT = 0
RR = 0
WR = 0
x = 0
iCnt = 0
iCtrTemp = 1
RecCount = 0
Ans = ""
Ans1 = ""
Ans2 = ""
Ans3 = ""
Ans4 = ""
Ans5 = ""
cn = 0
CheckStr = ""
RecNull = 0

* This label parses the details of a particular section
ReadSection:
* Reinitializing the variables
MetaData = ""
DataVal = ""
* If EOF is encountered then goto the EndOfFile section
*if Status()=1 then goto EndOfFile
* Parse the first line of raw data file to get the headers
READSEQ A FROM INFILE THEN MetaData = A ELSE ErrorCode = -101
* If the file is blank then do not process
if status() = 1 then goto EmptyFile
* Parse the second line of raw data file to get the details
READSEQ A FROM INFILE THEN DataVal = A ELSE ErrorCode = -102
If trimB(TrimF(DataVal)) = "" then goto IncompleteFile

* If EOF is encountered then goto the EndOfFile section
if Status()=1 then goto EndOfFile

* Parse the third line of raw data file, ignore this line
READSEQ A FROM INFILE THEN RR = RR + 1 ELSE ErrorCode = -103

* Parse the fourth line of raw data file to get the headers,
append the metadata
READSEQ A FROM INFILE THEN MetaData = MetaData : "," : A ELSE
ErrorCode = -104

* If EOF is encountered then goto the EndOfFile section
if Status()=1 then goto EndOfFile

CheckStr = DataVal
CheckStr = EReplace (CheckStr,",", "")
RecNull = ISNULL(CheckStr)

IF ((CheckStr = "") OR (RecNull = 1)) then goto IncompleteFile

* First element of the array denotes the start of a section
ArrTmp(iCtrTemp) = "[StartSection]"
iCtrTemp = iCtrTemp + 1
ArrTmp(iCtrTemp) = MetaData
iCtrTemp = iCtrTemp + 1

* Append the details of the current section to the array; 300 is
* considered as the upper limit of the details (assumption)
For CNT = 1 to 300
READSEQ A FROM INFILE THEN ArrTmp(iCtrTemp) = DataVal : "," : A
if Status()=1 then goto EndOfFile
iCtrTemp = iCtrTemp + 1
Next CNT

******************************************************************
******************************************
* This label takes care of EOF
EndOfFile:
* Nothing is done
******************************************************************
******************************************
ArrTmpCnt = ArrTmpCnt + 1

******************************************************************
******************************************
* This label creates the UFF and stores it in ArrTmp
* Making Log
Ans3 = InsertEventLog(14, "P", FileName : " UFF conversion initiated.", hdbc)
CreateUFF:
AttribStr = ""
AttribPos = ""
ArrTmpCnt = ArrTmpCnt + 1
tmpA = ArrTmp(ArrTmpCnt)

Ans1 = ArgMetaDataString
AttribStr = field(Ans1,"~",1,1)
AttribPos = field(Ans1,"~",2,1)
MailInfo = FileName : " in Operation AHT. "

* Loop over the delimited values in the record and store the value of
* the meta data at the respective position
for x = 1 to dcount(tmpA,",")
if TrimB(TrimF(EReplace(UpCase(field(tmpA,",",x)), char(34),""))) <> "" THEN
MetData(x) = TrimB(TrimF(EReplace(UpCase(field(tmpA,",",x)), char(34),"")))
* Call this function to get the position of the Meta data
Ans1 = GetPosFromArray(AttribStr,AttribPos,MetData(x),MailInfo,hdbc)
DataPos(x) = Ans1
End
next x
* Calling the function again to get the latest count of
* attributes, if any new attributes are added
iTotalCount = GetMetaDataCount(hdbc)
dim ArrUFF(iTotalCount)
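
* Build one UFF record for each detail row held in ArrTmp and insert it into the staging table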

for cn = 1 to iCtrTemp - 1
RecStr = ""
RecStr = STR(",",iTotalCount)
ArrTmpCnt = ArrTmpCnt+1
A = ArrTmp(ArrTmpCnt)

if TrimF(TrimB(A[1,1])) = "" then goto EndProcessing

for x = 1 to dcount(A,",")
RecStr = EReplace(RecStr, ",", TrimB(TrimF(EReplace(UpCase(field(A,",",x)), char(34),""))) : ",", 1, DataPos(x))
next x

* Call this function to insert the UFF record in STAGING_FILE_DATA
Ans2 = InsertUFFData(RecStr, FileName, "AHT", hdbc)
next cn
goto EndProcessing

******************************************************************
*************************************************

******************************************************************
*************************************************
InvalidFile:
strMove = ""
If StrOSType="NT" then
strMove = "move " : DQuote(FilePath : "\WORK\" : FileName) :
" " : FilePath : "\ERROR\INVALIDFILE\"
Call DSExecute("NT", strMove, Output, SystemReturnCode)
END ELSE
strMove = "mv " : DQuote(FilePath : "/WORK/" : FileName) : "
" : FilePath : "/ERROR/INVALIDFILE/"
Call DSExecute("SH", strMove, Output, SystemReturnCode)
END
*Log Errors in Error Log
Ans4=InsertErrorLog(3, FileName , "AHT",hdbc)
*Update Process Queue
Ans5 = UpdateProcessQue(FileName , "E", "AHT", "",hdbc)
*Log Actions in Event Log
Ans7 = InsertEventLog(9, "P", FileName : " Mail notification sent.", hdbc)
Ans8 = InsertEventLog(12, "E", FileName : " Queue status updated.", hdbc)

* Send notification for the invalid file
MessageBody = FileName : " is invalid. Please verify."
SendMailResult = SendMail(ToAddressList, FromAddress, Subject, MessageBody)
goto ExitProcess

******************************************************************
*************************************************
******************************************************************
*************************************************
IncompleteFile:
strMove = ""
CloseSeq INFILE
If StrOSType = "NT" then
strMove = "move " : DQuote(FilePath : "\WORK\" : FileName) :
" " : FilePath : "\ERROR\INVALIDFILE\"
Call DSExecute("NT", strMove, Output, SystemReturnCode)
END ELSE
strMove = "mv " : DQuote(FilePath : "/WORK/" : FileName) : "
" : FilePath : "/ERROR/INVALIDFILE/"
Call DSExecute("SH", strMove, Output, SystemReturnCode)
END

Ans4 = InsertErrorLog(3, FileName , "AHT",hdbc)


Ans5 = UpdateProcessQue(FileName , "E", "AHT", "",hdbc)

*Log Actions in Event Log
Ans9 = InsertEventLog(3, "E", FileName : " Error encountered while parsing the file ", hdbc)
Ans10 = InsertEventLog(9, "I", "Mail notification sent for file " : FileName, hdbc)
Ans11 = InsertEventLog(12, "I", "Queue status updated for " : FileName, hdbc)
* Send notification for the Incomplete file.
MessageBody = FileName : " is Incomplete. Please verify."
SendMailResult = SendMail(ToAddressList, FromAddress, Subject,
MessageBody)
goto ExitProcess
******************************************************************
*************************************************
******************************************************************
*************************************************
EmptyFile:
CloseSeq INFILE
strMove = ""
SendMailResult = ""
If StrOSType="NT" then
strMove = "move " : DQuote(FilePath : "\WORK\" :
FileName) : " " : FilePath : "\ERROR\EMPTYFILE\"
Call DSExecute("NT", strMove, Output,
SystemReturnCode)
END ELSE
strMove = "mv " : DQuote(FilePath : "/WORK/" :
FileName) : " " : FilePath : "/ERROR/EMPTYFILE/"
Call DSExecute("SH", strMove, Output,
SystemReturnCode)
END

Ans4=InsertErrorLog(2, FileName, "AHT",hdbc)

Ans5 = UpdateProcessQue(FileName, "E", "AHT", "",hdbc)

*Log Actions in Event Log
Ans10 = InsertEventLog(9, "I", "Mail notification sent for file " : FileName, hdbc)
* Send notification for the empty file.
MessageBody = FileName : " contains no records. Please verify."
SendMailResult = SendMail(ToAddressList, FromAddress, Subject, MessageBody)

goto ExitProcess
******************************************************************
*************************************************

******************************************************************
*************************************************
EndProcessing:
Ans = iTotalCount
CloseSeq INFILE
strMove = ""
If StrOSType="NT" then
strMove = "move " : DQuote(FilePath : "\WORK\" :
FileName) : " " : FilePath : "\PROCESSED\"
Call DSExecute("NT", strMove, Output,
SystemReturnCode)
END ELSE
strMove = "mv " : DQuote(FilePath : "/WORK/" :
FileName) : " " : FilePath : "/PROCESSED/"
Call DSExecute("SH", strMove, Output,
SystemReturnCode)
END
*Log Actions in Event Log
Ans11 = InsertEventLog(12, "U", "Queue status updated for " : FileName, hdbc)
Ans3 = InsertEventLog(15, "P", FileName : " UFF conversion done.", hdbc)
Ans10 = InsertEventLog(4, "U", FileName : " File load successful", hdbc)

*Update Process Queue
Ans5 = UpdateProcessQue(FileName, "P", "Process Group Name", "", hdbc)
******************************************************************
*************************************************
******************************************************************
*************************************************
ExitProcess:
Ans = iTotalCount
******************************************************************
*************************************************

Sanity Checks in a DWH


In the entire ETL process there are two key checkpoints that should pass a sanity check: the
pre-load stage (implement a pre-load sanity check) and the post-load stage (implement a
post-load sanity check).
The pre-load sanity check covers the following areas:

• Confirming successful load of the data for the previous business day.
• Confirming data accuracy between the actual feed file and the header/trailer
information.
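
As an illustration of the second check, the fragment below compares the detail-record count in
a feed file against the count carried in its trailer record, reusing names declared in the pseudo
code of the previous section (FilePath, FileName, hdbc, the mail variables and the DSU
routines). It is a minimal sketch only: the assumption that the header record starts with "H",
that the trailer starts with "T" and carries the expected count in its second comma-delimited
field, and the error code passed to InsertErrorLog are all illustrative, not part of an actual feed
specification.

* Pre-load sanity sketch - header/trailer layout and the error code are assumptions
DetailCount = 0
ExpectedCount = -1
OPENSEQ FilePath : "\WORK\" : FileName to CHKFILE ELSE ErrorCode = -201
LOOP
READSEQ Rec FROM CHKFILE ELSE EXIT
If Rec[1,1] = "H" Then CONTINUE             ;* skip the header record
If Rec[1,1] = "T" Then
ExpectedCount = field(Rec, ",", 2)          ;* expected detail count from the trailer
End Else
DetailCount = DetailCount + 1
End
REPEAT
CLOSESEQ CHKFILE
* Fail the pre-load sanity check if the counts do not match
If DetailCount <> ExpectedCount Then
Ans4 = InsertErrorLog(5, FileName, "AHT", hdbc)   ;* error code 5 is an assumed value
MessageBody = FileName : " failed the header/trailer count check. Please verify."
SendMailResult = SendMail(ToAddressList, FromAddress, Subject, MessageBody)
End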

The post-load sanity check covers the following areas:

• Process to delete duplicate records after the staging load.
• Process to allow re-keying if the dimensional lookup has failed during the
Staging to Fact load. This process can be architected in the following two ways:
o Abort the load process if there is a lookup failure.
o Create an intermediate keyed file holding the [key/value] pairs of the records to
be loaded to the fact table. The records that have failed the lookup go into an un-
keyed file with the key = UNK (Unknown). A decision can then be taken to load
these records with the key value UNK or to run them through the keying
process once again.

• Data validation checks on two or more fact tables after they are loaded through the
post-load process.

Sanity checks can be implemented in a shared container or at the database level, whichever is
appropriate.
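
For the database-level option, the first post-load check (removing duplicate records from the
staging table) could, for example, be pushed down to Oracle through the same SQLConnect-style
connection (hdbc) used in the pseudo code above. The sketch below is illustrative only: the
table and column names (STAGING_FILE_DATA, FILE_NAME, REC_STR) are assumptions, the
ROWID-based pattern is a common Oracle idiom rather than a prescribed part of this
architecture, and error handling and statement clean-up are omitted for brevity.

* Post-load sanity sketch - remove exact duplicates from the staging table
* Table and column names are assumptions; ROWID is the Oracle pseudo-column
status = SQLAllocStmt(hdbc, hstmt)
DupSQL = "DELETE FROM STAGING_FILE_DATA A "
DupSQL := "WHERE A.ROWID > (SELECT MIN(B.ROWID) FROM STAGING_FILE_DATA B "
DupSQL := "WHERE B.FILE_NAME = A.FILE_NAME AND B.REC_STR = A.REC_STR)"
status = SQLExecDirect(hstmt, DupSQL)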

Key Deliverable Documents:


Architecture/Approach Document (High Level) – Provides a bird’s eye view of the DWH
architecture.
Detailed Design Document – Provides low-level details of what has been discussed in the
Approach document.
FMEA (Failure Mode and Effects Analysis) Document – three questions need to be addressed in
the FMEA:
1) What risks are there to business if we do not make this change?
2) What other systems/components can this change impact?
3) What can go wrong in making this change?
The FMEA Document should have the following:
• Item/Process Step – whether internal (e.g. the process is part of the ETL load)
or external (e.g. arrival of a feed file from the data provider).
• Potential Failure Mode – What can fail or pose a risk of failure.
• Potential Effect(s) of Failure – Who or what will be affected if the failure
occurs.
• Severity – ranked on a scale of 1-10. A higher number indicates higher severity.
• Potential Cause(s) of Failure
• Occurrence – ranked on a scale of 1-10. A higher number indicates a higher
likelihood of occurrence.
• Current Controls – Any process in place to prevent, predict or notify when the
failure occurs.
• Detection – ranked on a scale of 1-10. A higher number indicates a greater
chance of the business being affected before the failure can be prevented.
• Risk Priority Number (RPN) – The product of Severity, Occurrence and
Detection (Severity*Occurrence*Detection). This allows the risks to be ranked
so that high-priority ones can be addressed first; for example, Severity 8,
Occurrence 3 and Detection 5 gives an RPN of 8*3*5 = 120.
• Recommended Actions – To prevent the failures.
• Responsibility and Target Completion Date – Who is supposed to implement
the recommended actions, and by when.
• Action Taken – Activity that was actually done to mitigate the risk.
• Risk re-assessment – After the recommended actions are implemented.
