04 April 2013
Develop IBM InfoSphere DataStage jobs that can be called from a command line or shell
script using UNIX pipes for more compact and efficient integration. This technique can save
storage space by avoiding the need to land intermediate files that would eventually feed an
ETL job. It also reduces overall execution time and lets you share the power of DataStage
jobs through remote execution.
Integration scenario
DataStage jobs are usually run to process data in batches, which are scheduled to run at specific
intervals. When there is no specific schedule to follow, the DataStage operator can start the job
manually through the DataStage and QualityStage Director client, or at the command line. If the
job is run at the command line, you would most likely do it as follows:
dsjob -run -param input_file=/path/to/in_file -param output_file=/path/to/out_file dstage1 job1
developerWorks
ibm.com/developerWorks/
In normal circumstances, in_file and out_file are stored in a file system on the machine where
DataStage runs. In Linux or UNIX, however, input and output can be piped through a series of
commands. For example, when the output of a program needs to be sorted and deduplicated, you
can do the following:
command | sort | uniq > /path/to/out_file
In this case, Figure 2 shows the flow of data, where the output of one command becomes the
input of the next, and the final output is landed in the file system.
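As a small, self-contained illustration of that flow, the sketch below uses printf as a stand-in for command (an assumption made only for the demo; the function names are hypothetical). It contrasts landing each intermediate result with streaming everything through anonymous pipes. Both forms produce the same result, but only the first one writes temporary files.

```shell
#!/bin/sh
# Stand-in for "command": emits a few unsorted, repeated lines.
produce() {
    printf 'pear\napple\npear\nfig\n'
}

# Variant 1: land every intermediate result in a temporary file.
with_files() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 1
    produce > "$tmp1"
    sort "$tmp1" > "$tmp2"
    uniq "$tmp2"
    rm -f "$tmp1" "$tmp2"
}

# Variant 2: stream through anonymous pipes; nothing is landed.
with_pipes() {
    produce | sort | uniq
}
```

Calling with_pipes > /path/to/out_file lands only the final result, which is the behavior the rest of the article aims to reproduce for a DataStage job.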
Assuming the intermediate processes produce many millions of lines, you avoid landing the
intermediate files, which saves both space in the file system and the time needed to write
those files. Unlike many programs or commands executed in UNIX, DataStage jobs do not take
standard input through a pipe. This article describes a method, and shows a script, that makes
this possible, along with some practical uses.
If the job should accept standard input and produce standard output like a regular UNIX command,
then it has to be called through a wrapper script as follows:
command1 | piped_ds_job.sh | command2 > /path/to/out_file
Or you may want to send the output directly to a file, as in the following:
command1 | piped_ds_job.sh > /path/to/out_file
The diagram in Figure 3 shows you how the script should be structured.
DataStage command line integration
The script will have to convert standard input into a named pipe, and also convert the output file of
the DataStage job into standard output. In the next sections, you will learn how to accomplish this.
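The structure can be tried out without DataStage. The following sketch is an illustration only: it uses sort with named input and output files as a stand-in for the job (an assumption; any program that reads one named file and writes another would do), and the function name and temporary paths are hypothetical.

```shell
#!/bin/sh
# Wrap a program that only understands file names so that it can be used
# in a pipeline. "sort" stands in for the DataStage job here (assumption).
pipe_wrap_demo() {
    dir=$(mktemp -d) || return 1
    in_fifo="$dir/in"
    out_fifo="$dir/out"
    mkfifo "$in_fifo" "$out_fifo" || return 1

    # 1. Start the file-based program in the background; it blocks until
    #    data arrives on the input pipe.
    sort -o "$out_fifo" "$in_fifo" &

    # 2. Copy whatever the program writes to standard output.
    cat "$out_fifo" &
    reader=$!

    # 3. Feed standard input into the input pipe, then wait for the reader.
    cat > "$in_fifo"
    wait "$reader"
    rm -rf "$dir"
}
```

With this in place, command1 | pipe_wrap_demo | command2 behaves like an ordinary filter even though the wrapped program never reads standard input itself.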
The DSX for this job is available in the downloads section of this article. The job simply takes a text
file, treats the full line as a single column, sorts it, and writes to the output file.
Additionally, the job will have to allow multiple instance execution. It should take the input line with
no separator and no quotes, and the output file will have the same characteristics.
You can now create the FIFOs (named pipes) and launch the job with dsjob. At this point, the job
waits until the input pipe starts receiving data. The code warns you if the DataStage job
execution has thrown an error, as shown in Listing 2.
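&
# Note: right after a command started with &, $? is the status of starting
# the background command, not the job's final exit status.
if [ $? -ne 0 ]; then
    # Report on standard error so the message cannot mix with the data stream.
    echo "error calling DataStage job." >&2
    rm -f "$infname" "$outfname"
    exit 1
fi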
At the end of the dsjob command, you see an ampersand. It is necessary because the job blocks
while waiting for the input named pipe to deliver data, and the script only starts streaming
data into that pipe a few lines further down.
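The following is a hedged sketch of how the FIFO creation and background launch might look; it is not the author's exact script. The variable names infname and outfname come from the fragment above, and the project, job, and parameter names come from the dsjob example earlier in the article. The temporary paths, the start_job function name, the job_pid variable, and the use of the process ID to keep names unique are assumptions.

```shell
#!/bin/sh
# Create the two named pipes and start the job in the background.
# Assumption: /tmp is writable and $$ is unique enough for concurrent runs.
start_job() {
    infname="/tmp/piped_ds_job.$$.in"
    outfname="/tmp/piped_ds_job.$$.out"

    mkfifo "$infname" "$outfname" || {
        echo "could not create named pipes." >&2
        return 1
    }

    # The job blocks until data arrives on $infname, so it must not run
    # in the foreground; the trailing & lets the script carry on.
    dsjob -run \
        -param input_file="$infname" \
        -param output_file="$outfname" \
        dstage1 job1 &
    job_pid=$!
}
```

Keeping the background process ID in job_pid makes it possible to call wait "$job_pid" once the data has been streamed, which is one way to obtain the exit status of the dsjob command itself.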
The following code prepares the output to be sent to standard output via a simple cat command.
As you can see, the cat command and the rm command are within parentheses, meaning that
those two commands are invoked in a sub-shell that is sent to the background (specified by the
ampersand at the end of the line), as shown in Listing 3.
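A minimal sketch of the construct just described, assuming outfname holds the path of the output named pipe as in the fragment from Listing 2 (the function name and the drain_pid variable are hypothetical), could look like this:

```shell
#!/bin/sh
# Stream the output named pipe to standard output, then delete it.
# Both commands run in one sub-shell, and the sub-shell runs in the
# background so that the script can go on to feed the input pipe.
drain_output() {
    outfname=$1
    (cat "$outfname"; rm -f "$outfname") &
    drain_pid=$!
}
```

Because cat returns only when the writer closes the pipe, the rm that follows it in the same sub-shell runs after the last line of output has been copied.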
Running rm in the same sub-shell is necessary so that, when the job has finished writing the
output, the temporary named pipe is removed. The code that follows tests whether the script was
called with a file as a parameter or is receiving the data from a pipe. After the input stream
(file or pipe) has been sent to the input named pipe, the script finishes by removing that pipe.
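The input-handling step might be sketched as follows. This is an illustration rather than the author's code: it assumes infname holds the path of the input named pipe and that an optional argument names a regular input file. The function name and the source_file variable are hypothetical.

```shell
#!/bin/sh
# Send either a named file or standard input into the input named pipe,
# then remove the pipe once all data has been delivered.
feed_input() {
    infname=$1
    source_file=$2            # optional: path of a file to read

    if [ -n "$source_file" ]; then
        cat "$source_file" > "$infname"
    else
        cat > "$infname"      # no file given: read the anonymous pipe
    fi
    rm -f "$infname"
}
```

In both branches the redirection blocks until a reader, here the job, has opened the pipe, so no data is written before the job is ready for it.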
You can name the script piped_ds_job.sh and execute it as mentioned previously:
command1 | piped_ds_job.sh > /path/to/out_file
Because the script can receive its input through an anonymous pipe, it allows the uses shown
in Listing 4.
The last sample, which uses SSH, assumes that you are executing from another machine, so the
DataStage job is effectively used as a service. It is also a representative example of how you
can bypass the file transmission (and, in this case, the decompression).
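The calls below sketch the kinds of use the text describes; they are illustrations, not the original listing. The WRAPPER and REMOTE variables, the function names, the remote account dsadm@dshost, and the relative wrapper path are all hypothetical, and gzip -dc is assumed as the decompression command.

```shell
#!/bin/sh
# Command used to reach the wrapper; override WRAPPER to test elsewhere.
WRAPPER=${WRAPPER:-./piped_ds_job.sh}
# Hypothetical account on the machine where DataStage runs.
REMOTE=${REMOTE:-dsadm@dshost}

# Input given as a file name argument.
run_on_file() {
    $WRAPPER "$1"
}

# Input arrives through an anonymous pipe; a compressed file is
# decompressed on the fly and never landed uncompressed.
run_on_gz() {
    gzip -dc "$1" | $WRAPPER
}

# Same, but the wrapper runs on the DataStage machine through SSH.
# Standard input and output travel over the SSH connection.
run_on_gz_remote() {
    gzip -dc "$1" | ssh "$REMOTE" "$WRAPPER"
}
```

For example, run_on_gz /path/to/in_file.gz > /path/to/out_file processes a compressed file without ever writing its uncompressed form to disk.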
Conclusion
The mechanism described in this article allows for more flexible DataStage job invocation at the
command line and in shell scripts. The wrapper script explained here can easily be customized to
make it more general. The technique is simple, can be implemented quickly for existing jobs, and
can turn them into services through remote execution via SSH. The benefits of not landing data
in a regular file are most notable when file sizes are on the order of tens of millions of rows,
but even if your data is not that large, the integration use case is still valuable.
Downloads
Name: job_and_script.zip
Size: 10KB