
Questions and Answers

SRC has 1 record and I want 10 records in the target; how is it possible?
1. Source -> Copy stage: take ten output links from the Copy stage, link them to a Funnel stage and then to a Sequential File stage; this way we get 10 records in the output.
2. Go to a Transformer stage and define a loop, specifying the loop condition as @ITERATION <= 10.
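A minimal Transformer loop sketch for option 2 (the link and column names are illustrative, not from the original):
Loop While condition: @ITERATION <= 10
Output column derivation: DSLink.col
The same input value is written once per iteration, giving 10 output rows for the single input row.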
Explain how a source file is populated. We can populate a source file in many ways, such as by creating a SQL query in Oracle, or by using a row generator or extract tool, etc.
Name the command line functions to import and To import the DS jobs, dsimport.exe is used and to export the DS jobs,
export the DS jobs dsexport.exe is used
Differentiate between a data file and a descriptor file. As the names imply, data files contain the data, and the descriptor file contains the description/information about the data in the data files.
Define Routines and their types. Routines are basically collections of functions defined in the DS Manager. They can be called via the Transformer stage. There are three types of routines: parallel routines, mainframe routines and server routines.
How can you write parallel routines in DataStage PX? We can write parallel routines in C or C++ and compile them. Such routines are also created in the DS Manager and can be called from a Transformer stage.
What is the method of removing duplicates without the Remove Duplicates stage? Duplicates can be removed by using the Sort stage with the option Allow Duplicates = false, or in a Transformer stage using stage variables.
Differentiate between the Join, Merge and Lookup stages. All three differ in the way they use memory, in their input requirements and in how they treat various records. Join and Merge need less memory as compared to the Lookup stage.
Differentiate between Symmetric Multiprocessing and Massive Parallel Processing. In Symmetric Multiprocessing (SMP), the hardware resources are shared by the processors, which run one operating system and communicate through shared memory. In Massive Parallel Processing (MPP), each processor accesses the hardware resources exclusively. This type of processing is also known as Shared Nothing, since nothing is shared, and it is faster than Symmetric Multiprocessing.
Define APT_CONFIG in Datastage It is the environment variable that is used to identify the *.apt file in
Datastage. It is also used to store the node information, disk storage
information and scratch information
Source is a flat file and has 200 records . These have to Use SEQ---> Transformer stage ----> 4 SEQ Files In transformer stage add
split across 4 outputs equally . 50 records in each . constraints as Mod(@INROWNUM,4) =0, Mod(@INROWNUM,4) =1,
The total number of records in the source may vary Mod(@INROWNUM,4) =2, Mod(@INROWNUM,4) =3
everyday ,according to the count records are to split
equally at 4 outputs
The input is a column cola with the values 1, 2, 3, and this should be populated at the output as cola = 1, 22, 333 (each value repeated as many times as its own value).
1. Take a Transformer and use the loop condition @ITERATION <= input.cola.
2. In a stage variable use Str(input_column, input_column); the Str function repeats the input string the number of times given in its second argument.
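A minimal worked illustration for option 2 (the link name DSLink is an assumption):
Str(DSLink.cola, DSLink.cola) returns "1" for input 1, "22" for input 2 and "333" for input 3, because Str(string, n) repeats the string n times.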

The source table has a column name with the data A, A, B, B, B, C, C, D, but I want the target table as name/count: A1, A2, B1, B2, B3, C1, C2, D1.
1. This we can do by using a Transformer: take 2 stage variables.
sv2: if sv1 = inputcol then sv2 + 1 else 1
sv1: inputcol
In the derivation: inputcol : sv2 --> outputcol
2. We can use an Aggregator stage with the option Count Rows (group by on the column name) and connect the column to the output; you will get the counts of the rows like:
Name CountRows
A 2
B 3
C 2
D 1
Then use a Transformer and write a Loop While condition as @ITERATION <= your count column (in this case CountRows), then append a new column to your output, derive it as @ITERATION from the input side, and change the name of the column at the output end to Count.
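A minimal stage-variable sketch for option 1 (the evaluation order matters: sv2 is computed before sv1 is refreshed, so sv1 still holds the previous row's name; link and column names are illustrative):
sv2 : If sv1 = DSLink.name Then sv2 + 1 Else 1
sv1 : DSLink.name
Output derivation: DSLink.name : sv2   (produces A1, A2, B1, B2, B3, C1, C2, D1)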
Explain some scenarios where a Sequential File stage runs in parallel. By changing the "number of readers per node" option to more than one, you can read a sequential file in parallel mode.

I have two columns in the source, ColA and ColB. The input is ColA = 100, ColB = ABCDEF, and I should achieve the output:
ColA ColB
100 A
100 B
100 C
100 D
100 E
100 F
1. Loop condition: @ITERATION <= Len(DSLink2.COLB) -- the loop ends when the length of the string is reached.
Derivation: Right(Left(DSLink2.COLB, @ITERATION), 1) -- the Right and Left functions extract one character per iteration.
2. Use a Pivot stage.

I have 3 jobs A,B & C, which are dependent each 1. This can be done in Director -->Add to Scheduler
other, I want to run A & C jobs daily and B job run only 2. create 2 sequencers, in 1st sequence job A&c and in the 2nd sequence
on Sunday. How can I do it only B will be there. Schedule the 1st Sequence through director for
daily run and schedule the 2nd to run only on sunday.
3. You can create a new job to test whether the current day is Sunday or not; if it is true then create a trigger file, else create a 0 KB file (or no file at all).
Then you can create only 1 sequence with jobs A and C one after the other, then the new job, and an Execute Command stage which will trigger job B if the trigger file is available; else it will not trigger the job and the sequence completes.
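A hedged shell sketch for the day-of-week check in option 3 (the flag-file path is an assumption; %u requires a reasonably modern date implementation):

#!/bin/sh
# Create a trigger file only when the current day is Sunday.
# date +%u prints the weekday number, 1 (Monday) .. 7 (Sunday).
if [ "$(date +%u)" -eq 7 ]; then
    touch /tmp/run_job_b.flag      # downstream Execute Command activity checks for this file
else
    rm -f /tmp/run_job_b.flag      # remove any stale flag so job B is not triggered
fi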

I have a source file having data like: 1. Sourcefile --> copy stage --> 1st link --> Removeduplicate stage -->
10 outputfile1 with 10,20,30,40,50,60,70
10 Copy stage-->2nd link --> aggregator stage (creates the row count)-->
10 filter stage-->filter1 (count>1) -->outputfile2 with 10,20,30,40 --
20 >Filter2(count=1)-->outputfile3 with 50,60,70
20 2. Use Sort Stage-->Define the key field, In Property, Key Change column
20 is TRUE.
30 Then use a Transformer, In constraint , KeyChange=1 for Unique record
30 O/P and KeyChange=0 for Duplicate O/P
40
40
50
60
70 I want three output from the above input file,
these output would be:
1) having only unique records no duplicates should be
there. Like:
2) having only duplicate records
3) only unique record

I have one source file which contains the below data in an input column DOB varchar(8):
20000303
20000409
1999043
1999047
Validate the date: if it is a valid date pass it, else pass the default 1999-12-31, converting varchar to date.
In the Transformer stage use StringToDate(datecol, format), wrapped in a validity check so that invalid values fall back to the default.
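A hedged Transformer derivation sketch (the link name, the %yyyy%mm%dd format string, and the three-argument form of IsValid are assumptions to verify against your DataStage version; if IsValid does not accept a format, compute the date in a stage variable and test it with IsValidDate instead):
If IsValid("date", DSLink.DOB, "%yyyy%mm%dd")
Then StringToDate(DSLink.DOB, "%yyyy%mm%dd")
Else StringToDate("1999-12-31", "%yyyy-%mm-%dd")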

What is the use of a node in DataStage? If we increase the nodes, what will happen?
1. In a grid environment a node is the place where the jobs are executed. There will be a Conductor node and multiple grid nodes which
are executing in parallel to make the processing faster. If the number of
nodes are increased it increases the Parallelism of the job and hence the
performance
2.A basic answer I have in mind is for load balancing/performance. For
example, you can group x number of jobs to nodeA, and another group
of jobs to nodeB. This way your jobs are not using just the default node.
You can do this by setting your $APT_CONFIG_FILE for each of your jobs.
Just don't get confused with the term Node Vs Processor. A processor
can have 1 or more nodes
Say I have 5 rows in a source table and for each row
10 rows matching in a lookup table and my range is for
lookup is 9 to 99. what will be the row count in output
table?
I have a table (Emp) with the columns Eid, Ename, Sal, month(sal), year(sal) and DOB (say 15th-Jan-1981). Design a job such that the output displays Ename, year(sal), tot(sal) and the current age, e.g. 18 yrs.
Using a Transformer, take the days between the current date and DOB divided by 365, or use the YearFromDate function, to derive the age.
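A hedged derivation sketch for the age column (the link name is illustrative and function availability varies by version):
YearFromDate(CurrentDate()) - YearFromDate(DSLink.DOB)
or, closer to an age in whole years:
DaysSinceFromDate(CurrentDate(), DSLink.DOB) / 365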
Suppose i am having a source file and 3 output tables Use SEQ---> Transformer stage ----> 3 SEQ Files In transformer stage add
and I want first row to be written to first table second constraints as Mod(@INROWNUM,3) =1, Mod(@INROWNUM,3) =2,
row to second table, third row to third table likewise Mod(@INROWNUM,3) =0
how can we achieve this using datastage without using
partitioning?
How can we identify updated records in datastage? you can use the change capture stage which will show you the data
Only updated records without having any row-id and before any update change made and after the update change was made
date column available.
How to get top five rows in DataStage? I tried to use 1. If the source in your case is a file, then we can use "Read first rows"
@INROWNUM,@OUTROWNUM system variables in property of Sequential file stage and specify 5 as the value.
transformer, but they are not giving unique sequential Or use the "Row number" property in the Sequential file stage and add a
numbers for every row filter stage with a clause RowNumber <= 5
Or use "Filter" property in the Sequential file stage and specify the
command "head -5" (without quotes)

If the source is a database like Oracle, use Rownumber pseudo column


and specify rownum < 6 to get top 5 rows. Or you can use "TOP 5" clause
if your source is sql server.

Or irrespective of source, use Head stage after the source stage and
specify 5 for the number of rows to display
2. We can use Head stage to get top n rows.
Keep the option "No of Rows (Per Partition) = 5"

Then run this head stage in sequential mode only


3.Use stage variables and build a counter like
svCounter + 1 => svCounter
and then you build a condition to your output link svCounter <= 5

What is a slowly changing dimension (SCD)? How do we handle it in DataStage?
How do we read a comma-delimited file in the Sequential File stage, and how can we remove the header and footer from a comma-delimited file?
For the Sequential File stage there is an option "First Line is Column Names"; set it to true so the header is removed, and you need to select the appropriate File End Type through which the footer is removed.
Now for the comma delimiter use: Field Defaults -> Delimiter = Comma.
Input file A contains 1,2,3,4,5,6,7,8,9,10 and input file B contains 6,7,8,9,10,11,12,13,14,15. Output file X should contain 1,2,3,4,5, output file Y should contain 6,7,8,9,10, and output file Z should contain 11,12,13,14,15. How can we do this in a single DS job in PX? Could you please give the logic to implement it?
1. Solve this by using the Change Capture stage. Use A as the source and B as the reference, both connected to the Change Capture stage; from the Change Capture stage connect to a Filter stage and then to the targets X, Y and Z. In the Filter stage: change_code = 2 goes to X [1,2,3,4,5], change_code = 0 goes to Y [6,7,8,9,10], change_code = 1 goes to Z [11,12,13,14,15].
2.Do a full outer join between two files and from transformer draw
three output links
1st link-->wherever left side is null
2nd link->wherever right side is null
3rd link->wherever match is there
3. In this scenario we need to use two processing stages: a Funnel stage and a Transformer stage. A total of 2 input files with different values are given; the 2 files need to be combined by using the Funnel stage with the Continuous option. Next we take a Transformer stage and apply constraints, based on which we can split into 3 files. The constraints to apply are like:
DSLink15.rowid <= 5
DSLink15.rowid > 5 and DSLink15.rowid <= 10
DSLink15.rowid > 10 and DSLink15.rowid <= 15
4. Add an extra column colA and colB to the files A and B respectively.
Let the value for colsA be a for all the rows in file A and the value for
colB be b in file B(using the column generator stage).Now join both the
files using join stage. Perform full outer join. Map the ID col, colA and
colB to output. Next pass it through a transformer.
5. create one px job.
src file= seq1 (1,2,3,4,5,6,7,8,9,10)
1st lkp = seq2 (6,7,8,9,10,11,12,13,14,15)
o/p - matching recs - o/p 1 (6,7,8,9,10)
not-matching records - o/p 2 (1,2,3,4,5)
2nd lkp:
src file - o/p 1 (6,7,8,9,10)
lkp file - seq 2 (6,7,8,9,10,11,12,13,14,15)
not matching recs - o/p 3 (11,12,13,14,15)
Transformer constraint:
1) file X - colA=a and colB<>b
2) file Y - colA=a and colB=b
3) file Z - colA<>a and colB=b
What are Routines and where/how are they written, and have you written any routines before?
DataStage has 2 types of routines; below are the 2 types:
1. Before/After Subroutines.
2. Transformer Routines/Functions.

Before/After Subroutines :

These are built-in routines.which can be called in before or after


subroutines. Below is the list of the same.

1.DSSendMail :Used to Send mail using Local send mail program.


2. DSWaitForFile: this routine is called to suspend a job until a named file either exists, or does not exist.
3.DSReport :Used to Generate Job Execution Report.
4. ExecDOS: this routine executes a command via an MS-DOS shell. The command executed is specified in the routine's input argument.
5. ExecDOSSilent: as ExecDOS, but does not write the command line to the job log.
6. ExecTCL: this routine executes a command via an InfoSphere Information Server engine shell. The command executed is specified in the routine's input argument.
7.ExecSH:This routine executes a command via a UNIX Korn shell.
8.ExecSHSilent:As ExecSH, but does not write the command line to the
job log.

Transformer Routines:

Transformer Routines are custom developed functions, as you all know


even DS has some limitations with the inbuilt functions (Trim, PadString, etc.); for example, in DS version 8.1 we don't have any function to return the ASCII value of a character. From 8.5 onwards the Seq() function has been introduced for the above-mentioned scenario.

These custom routines are developed in C++. Writing a routine in C++ and linking it to our DataStage project is a fairly simple task, as follows:

- Write the C++ code.
- Compile it with the required flags.
- Put the output file in a shared directory.
- Link it in DataStage.
- Use it in a Transformer like other functions.

Below is the Sample C++ Code :

#include <iostream>
#include <string>

using namespace std;


int addNumber(int a,int b)
{
return a+b;
}

Note: we need to make sure our code does not contain a main() function, as it is not required.

Compiling with the required flags:

Get the values of below 2 Environment variables.


1.APT_COMPILER
2.APT_COMPILER_OPT

Use them as below to compile the code from the Unix prompt:

$APT_COMPILER $APT_COMPILER_OPT File_Name.cpp (the file containing the C++ code)

- Once you run the above command it will create an object file at the same path with a .o extension. Now log in to the DS Designer; in the Routines folder do a right-click and select New Parallel Routine.
- In the newly opened window fill in the required details: the Name, the Type (External Function), the external subroutine name (the function name we need to access), the proper return type, and the complete path of the .o file.
- Now select the Arguments tab and add the required arguments, selecting proper data types for the arguments.
- Now you are all done. Go to your job, open any Transformer and in any expression select the ellipsis button [...]; you will get the list, and there select Routine. There you will find our new routine listed.

But remember a few points while writing the function, i.e. the C++ code:

- DataStage cannot accept a return type of string, so we need to design our function to return char* instead.
- The same applies to the input arguments too, so our function can accept only char*, not string. But later in the C++ code we can change it to string.
Datastage Architecture: there are mainly 3 parts: 1. DS Engine, 2. Metadata Repository, 3. Services.
If these 3 tiers are installed on a single server, it is called a single-tier architecture.
If the DS Engine is on one server and the Metadata Repository and WAS services are on another server, it is called a two-tier architecture.
If these 3 are on different servers, it is a three-tier architecture.
And don't forget that the DS client is installed on a Windows machine and a connection has to be established between the client and the server.
Apart from these, there are SMP (Symmetric Multi Processing, shared) and MPP (Massive Parallel Processing, share nothing) environments.
2. The project architecture would be like:
You have: 1 Source --------> 1 Staging Area --------> 1 Temporary area to store your transformed data -------> finally your target.
So it is a 4-layer architecture.

How to find if the next value in a column is incrementing or not:
for ex
100
200
300
400
If the curval greater than previous val then print
greater if lesser print lesser
For ex
100
200
150
400, Here 150<200 so print lesser
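One hedged approach is a Transformer with two stage variables, where the comparison is evaluated before the previous value is refreshed (names are illustrative, and the first row needs its own handling since svPrev still holds its initial value there):
svResult : If DSLink.val > svPrev Then "greater" Else "lesser"
svPrev : DSLink.val
Output derivation: svResult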
My input has a unique column-id with the values 1. In transformer using constraints we can achieve
10,20,30.....how can i get first record in one o/p 1). Link--> @inrownum=1
file,last record in another o/p file and rest of the 2).link --> lastrow()
records in 3rd o/p file? 3). link --> click the otherwise condition
How do you you delete header and footer on the
source sequential file and how do you create header
and footer on target sequential file using datastage?
A Sequences is calling activity 1, activity 2 and activity You have to check the " Do not checkpoint run " checkbox for activity 2.
3.while running, activity 1 and 2 got finished but 3 got If you set the checkbox for a job that job will be run if any of the job later
aborted. How can I design a sequence such that the in the sequence fails and the sequence is restarted
sequence has to run from activity 2 when I restart the 2. To make the job re-run from activity 3, we need to introduce
sequences restartability in the sequence job. For this below points have to be taken
care of in Job Sequence Adding Checkpoints: Checkpoints have to be
maintained for the sequence to be restartable. So, the option Add
checkpoints so sequence is restartable in the Job Sequence Properties,
should be checked. Also only then we would find the option Do not
checkpoint run under the Job tab in each of the job activities present in
the job sequence.

The option Do not checkpoint run in a job activity should not be checked, because if another activity, say activity 4, later in the sequencer fails and the sequencer is restarted, this activity will rerun again (which is unnecessary). However, if the logic demands it, then it has to be checked.

The option Reset if required, then run has to be chosen in the job
activity so that when an aborted sequence is re-run then, the aborted
job(due to which sequence is aborted) will re-run successfully. Hope this
helps..
I have a source file1 consist of two datatypes In Transformer stage there is one function IsInteger and IsChar , We can
file1: identify If IsInteger (column name) then file1 else file2
no(integer) 2. I think this Question is to confuse the Job Aspirant by using Datatypes
1 and all...
2 Its very simple... File1-->2 Columns. 1.NO(Integer) 2.DEPT(Char).
3 Target1: NO(Integer), Target2: DEPT(Char).
& Take a Copy stage and Draw 2 output links.
dept(char) In One Output, Map only 1Column i.e NO(Integer)
cs In 2nd Output, Map Only 1Column i.e DEPT(Char).
it Simple mapping and there is no need to use "Transformer Stage".
ie Correct me if I was wrong...
and I want to separate these two datatypes and load them
into target files
file2 & file3.
how can i do this in datastage and by using which
stage

Difference between JOIN , LOOKUP , MERGE The three stages differ mainly in the memory they use

DataStage doesn't know how large your data is, so cannot make an
informed choice whether to combine data using a join stage or a lookup
stage. Here's how to decide which to use:

if the reference datasets are big enough to cause trouble, use a join. A
join does a high-speed sort on the driving and reference datasets. This
can involve I/O if the data is big enough, but the I/O is all highly
optimized and sequential. Once the sort is over the join processing is
very fast and never involves paging or other I/O

Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links - as many as there are update input links.
2) Join Stage:
1.) It has n input links(one being primary and remaining being secondary
links), one output link and there is no reject link
2.) It has 4 join operations: inner join, left outer join, right outer join and
full outer join
3.) join occupies less memory, hence performance is high in join stage
4.) Here default partitioning technique would be Hash partitioning
technique
5.) Prerequisite condition for join is that before performing join
operation, the data should be sorted.

Look up Stage:
1.) It has n input links, one output link and 1 reject link
2.) It can perform only 2 join operations: inner join and left outer join
3.) Join occupies more memory, hence performance reduces
4.) Here default partitioning technique would be Entire

Merge Stage:
1.) Here we have n inputs master link and update links and n-1 reject
links
2.) in this also we can perform 2 join operations: inner join, left outer
join
3.) the hash partitioning technique is used by default
4.) Memory used is very less, hence performance is high
5.) sorted data in master and update links are mandatory

How can we retrieve the particular rows in dataset by $orchadmin dump [options] Sample.ds
using orchadmin command?
options : -p period(N) : Lists every Nth record from each partition
starting from first record

2.
orchadmin dump -part 0 -n 17 -field name input.ds

I have source like this steps-


Num, SeqNo,Ln,Qty - use transformer
101, 1 ,1,5 -define stage variable stgvar_Qty with value 1 means stgvar_Qty=1
I wanna target following below -define loop with condition @ITERATION<=INPUTLINK.Qty
Num,SeqNo,Ln,Qty as per urs example @ITERATION<=5
101, 1 , 1, 1 - column in derivation
101, 1 , 2, 1 INPUTLINK.Num - Num
101, 1 , 3, 1 INPUTLINK.SeqNum - SeqNum
101, 1 , 4, 1 @ITERATION - Ln
101, 1 , 5, 1 stgvar_Qty - Qty
Based on Qty value records will be incremented.If qty there is no need to define stgvar_Qty also , use direct value 1 for column
value is 4 then o/p will be like below Qty
Num,SeqNo,Ln,Qty
101, 1 , 1, 1
101, 1 , 2, 1
101, 1 , 3, 1
101, 1 , 4, 1
Datastage Configuration File and Usage The Datastage configuration file is specified at runtime by a
$APT_CONFIG_FILE variable.
Configuration file structure

Datastage EE configuration file defines number of nodes, assigns


resources to each node and provides advanced resource optimizations
and configuration.

The configuration file structure and key instructions:

a. node - a node is a logical processing unit. Each node in a configuration


file is distinguished by a virtual name and defines a number and speed of
CPUs, memory availability, page and swap space, network connectivity
details, etc.
b. fastname - defines the node's hostname or IP address.
c. pool - defines resource allocation. Pools can overlap across nodes or can be independent.
d. resource (resources) - names of disk directories accessible to each node.
The resource keyword is followed by the type of resource that a given
resource is restricted to, for instance resource disk, resource scratchdisk,
resource sort, resource bigdata
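A minimal single-node configuration file sketch in the standard *.apt syntax (the hostname and directory paths are assumptions, not values from the original):

{
    node "node1"
    {
        fastname "etlserver1"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}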
How to perform left outer join and right outer join in In Lookup stage properties, you will have constraints option. If you click
lookup stage on constraints button- you will get options like continue, drop, fail and
reject
If you select the option continue: it means left outer join operation will
be performed.
If you select the option drop: it means inner join operation will be
performed.
TABLE T1 with column c1 (values 1, 2, 2, 3, 4, 5, 5, 6, 7) and TABLE T2 with column c1 (values 3, 4, 5). These 2 are my source tables and I should get the o/p as 1, 2, 2, 6, 7. Design a parallel job in DataStage for this.
Use a Lookup stage and choose the Reject option: you will get the matched records in the main output (3, 4, 5, 5) and the unmatched ones in the reject file (1, 2, 2, 6, 7).

I have a source like this: a,b,c,1,2,3 (all of this in one column). I want the target as below:
a,b,c,1 (1st row)
a,b,c,2 (2nd row)
a,b,c,3 (3rd row)
1. We can do it using a Transformer. Take 3 stage variables (s1, s2, s3): for s1 map the input column, and for s2 write a condition like: if Alpha(inputcolumn) = true then Trim(s3 : "," : s1) else s3 : "," : inputcolumn. For s3 also write a condition: if Alpha(inputcolumn) = true then map s2 into s3 else map s3 to s3.
Input column ----> s1
if Alpha(inputcolumn) = true then Trim(s3 : "," : s1) else Trim(s3 : "," : inputcolumn) --------> s2
if Alpha(inputcolumn) = true then s2 else s3 -------> s3
In the constraint part we have to write the condition: Alnum(s2) = true
In the derivation part we have to map s2 to the output column.
I think it will work.
2) Source->TR->RD->Pivot->Target,by using these order of stages we can
get required output.
Transformer:
we have to concatenate the values by using loop(we will get like
a,b,c,1,2,3) after that we have to split it into separate fields using field
function.
o/p is:c1 c2 c3 c4 c5 c6 c7
1a
1ab
1abc
1abc1
1abc12
1abc123
(if we dont want this dummy column we can drop it here itself)
3) Source ..... Transformer ..... Pivot ... target
In the Transformer we can use the Field function and create new columns:
col1: Field(inputcol, ",", 4, 1)
col2: Field(inputcol, ",", 5, 1)
col3: Field(inputcol, ",", 6, 1)
And in the Pivot put the output cols: col1, col2, col3.
Then you can get the output.
RemoveDuplicate:
put condition retain last
o/p is: c1 c2 c3 c4 c5 c6 c7
1abc123
pivot;
in derivation of c5 column give c5,c6,c7
here we are converting columns into rows
o/p is: c2 c3 c4 c5 c6 c7
abc1
abc2
abc3
What are Sequencers? A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes: ALL mode, in which all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire, and ANY mode, in which output triggers can be fired if any of the sequencer inputs are TRUE.

Performance Tuning of Jobs By query optimization i.e you check your extract and insert query
2. From Server side you can increase the memory
3. By removing unnecessary stages which are not required i.e with the
proper job design
4. By using the constraint you can also increase the performance
5. Use parametrized jobs and sequence with the proper
parallel/sequential mapping flow
1. Filter first, then extract; don't extract everything and then filter. Use SQL instead of the table method when extracting. Say 1 million records are coming from the input table but there is a filter condition (Acct_Type = 'S') in the job as per the business documents, which results in only a few records, say 100.
2. Reduce the number of Transformer stages as much as possible.
3. Reduce stage variables.
4. Remove Sort stages where possible and apply the partitioning techniques at stage level (e.g. hash for Join, entire for Lookup).
5. Be careful while operating with joins. Stick to an inner join until the business needs a left outer join.
Use a Copy stage instead of a Transformer for simple operations like:
• placeholder between stages
• renaming columns
• dropping columns
• implicit (default) type conversions
Use stage variables wisely. The greater their number, the slower the transformer and the job is. A job should not be overloaded with stages, so split the job design into smaller jobs.
Reading or writing data from/to a sequential file is slow and this can be a bottleneck in case of huge data. So, in such cases, to have faster reading from the Sequential File stage the number of readers per node can be increased (the default value is one), or Read From Multiple Nodes can be set.
Ensure that RCP is not propagating unnecessary metadata to downstream stages.
Have a volumetric analysis done when you introduce
Lookup(Normal/Sparse), Join and Merge stage in the design.
Always use CONNECTOR stages for connecting to databases as they are
more robust and fast. If not available, use Enterprise stage. Plug-in
stages come next.
Examine the execution plans for SQL queries used in jobs and create
indices for appropriate columns. Having indices improves performance
drastically.
Some times dropping the index and loading the table and re-creating
once load is done may be a good option
What are the uses of the Copy stage (which simply copies its input link dataset to its output link datasets), and what purposes can it serve? Please give an example.
Use the Copy stage for simple operations like:
• multicasting the same input data to multiple output links
• placeholder between stages to avoid buffer issues
• renaming columns
• dropping columns
• implicit (default) type conversions

How do you use surrogate key in reporting 1) In a typical data warehouse environment, we normally have a
requirement to keep history. So, we would end up having multiple rows
for a given Primary key. So, we define a new column that doesn't have a
business meaning of its own but acts as the primary key in a dimension. If you
have Surrogate Keys defined in each of your dimensions, then your fact
table will have each of these keys from dimensions as foreign keys and
measures.

Coming to usage of Surrogate keys in reporting - they are not specifically


reported but, we build queries where the fact table is joined with each
of these dimensions on these keys.

2) In Slowly Changing Dimension (SCD) Implementation we use


Surrogate key as the primary key is being duplicated for the sake of
keeping history data for the records with the same pk

How to get the dataset record count? without using Use datastage dataset management utility[GUI].
orchadmin command 2. Use orchadmin utility from command line
3. dsrecords command
Dimensional Modelling types and significance. 1) Dimension modelling means to define a data warehouse architecture using any one of the available schemas.
Dimensional modelling is of 3 types: 1. conceptual model 2. logical model 3. physical model.
Conceptual model: gathering of all the requirements.
Logical model: define facts and dimensions.
Physical model: physically moving data to the data warehouse.
Data warehouse flow: gather all requirements ---> define facts and dimensions ---> load data (data warehouse) ---> generate reports ---> analysis ---> decision.
2) Dimensional Modelling:
It is a design methodology for designing a data warehouse with
dimensions& facts.
It is of three types 1) Conceptual Modelling 2) Logical Modelling 3)
Physical Modelling
1) Conceptual Modelling:
A datawarehouse architect gathers the business requirements.
Identify the facts, dimensions & relationships.
2) Logical Modelling:
Design the fact & dimension tables.
Create the relationship between fact & dimension tables.
3) Physical Modelling:
Execute the dimension & fact tables in database for loading..
If you are given a list of .txt files and asked to read only the first 3 files using the Sequential File stage, how will you do it?
In the Sequential File stage we can read a single file by using the Specific File read method.

But we can read more than one file by using File Pattern, or by adding multiple File properties with the different file names.

Metadata must be same.

If you have Numerical+Characters data in the source, Use following function in derivation of the transformer:
how will you load only Character data to the target?
Which functions will you use in Transformer stage? convert("123456789"," ",StringName)

Please note that the second argument contains nine


spaces.Here,StringName is the string for which the numerical values are
to be removed.

Hence,if your string's value is "Test1234"(StringName),it will be


converted to "Test".In this way,only character data will be loaded to the
target
2) Use following function in derivation of the transformer:

convert("123456789","",StringName)
T12e5st --> Test (desired result)

convert("123456789"," ",StringName)
T12e5st --> T e st (space will appear in place of digits)
The source Sequential File stage has 10 records and moves them to a Transformer stage; one output link gets 2 records and the reject link gets 5 records. But I want the remaining 3 records - how do I capture them?
1) Add a third output link and make its filter condition the negation of the first output link and the reject link. All the records that do not match the filter condition of the first output or of the reject link will be passed to the third link.
2) You can choose the Otherwise option in the constraints, so you will get the remaining records.

What is a Dataset and a Fileset? 1) I assume you are referring to the Lookup fileset; it is used only for Lookup stages.
Dataset: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs.
FileSet: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.

How do you perform commit and rollback in loading


jobs? What happens if the job fails in between and
What will you do?
How many number of ways that you can implement 3 ways to construct the scd2 in datastage 8.0.1
SCD2 ? Explain them 1)using SCD stage in processing stage
2)using change capture and change applay stages
3)using source file,lookup,transformers,filters,surrogate key generator
or stored procedures,target tables

Minimise the usage of Transformer (Instead of this use Copy, modify,


What are other Performance tunings you have done in Filter, Row Generator)
your last project to increase the performance of slowly Use SQL Code while extracting the data
running jobs? Handle the nulls
Minimise the warnings
Reduce the number of lookups in a job design
Use not more than 20stages in a job
Use IPC stage between two passive stages Reduces processing time
Drop indexes before data loading and recreate after loading data into
tables
Generally we cannot avoid the number of lookups if our requirement makes those lookups compulsory.
There is no hard limit on the number of stages (like 20 or 30), but we can break the job into smaller jobs and then use Data Set stages to store the intermediate data.
The IPC stage is provided in server jobs, not in parallel jobs.
Check the write cache of the hash file. If the same hash file is used for lookup as well as the target, disable this option.
If the hash file is used only for lookup then enable "Preload to memory". This will improve the performance. Also, check the order of execution of the routines.
Don't use more than 7 lookups in the same transformer; introduce new transformers if it exceeds 7 lookups.
Use the "Preload to memory" option in the hash file output.
Use "Write to cache" in the hash file input.
Write into the error tables only after all the transformer stages.
Reduce the width of the input record - remove the columns that you
would not use.
Cache the hash files you are reading from and writting into. Make sure
your cache is big enough to hold the hash files.
Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for
your hash files.
This would also minimize overflow on the hash file.

If possible, break the input into multiple threads and run multiple
instances of the job.
Staged the data coming from ODBC/OCI/DB2UDB stages or any database
on the server using Hash/Sequential files for optimum performance also
for data recovery in case job aborts.
Tuned the OCI stage for 'Array Size' and 'Rows per Transaction'
numerical values for faster inserts, updates and selects.
Tuned the 'Project Tunables' in Administrator for better performance.
Used sorted data for Aggregator.
Sorted the data as much as possible in DB and reduced the use of DS-
Sort for better performance of jobs
Removed the data not used from the source as early as possible in the
job.
Worked with DB-admin to create appropriate Indexes on tables for
better performance of DS queries
Converted some of the complex joins/business in DS to Stored
Procedures on DS for faster execution of the jobs.
If an input file has an excessive number of rows and can be split-up then
use standard logic to run jobs in parallel.
Before writing a routine or a transform, make sure that there is not the
functionality required in one of the standard routines supplied in the sdk
or ds utilities categories.
Constraints are generally CPU intensive and take a significant amount of
time to process. This may be the case if the constraint calls routines or
external macros but if it is inline code then the overhead will be minimal.
Try to have the constraints in the 'Selection' criteria of the jobs itself.
This will eliminate the unnecessary records even getting in before joins
are made.
Tuning should occur on a job-by-job basis.
Use the power of DBMS.
Try not to use a sort stage when you can use an ORDER BY clause in the
database.
Using a constraint to filter a record set is much slower than performing a SELECT with a WHERE clause.
Make every attempt to use the bulk loader for your particular database.
Bulk loaders are generally faster than using ODBC or OLE.
Types of views in DataStage Director: there are 3 types of views in DataStage Director:
a) Job View - dates of jobs compiled.
b) Log View - status of the job's last run.
c) Status View - warning messages, event messages, program-generated messages.
How do you execute a DataStage job from the command line prompt? Using the "dsjob" command as follows: dsjob -run -jobstatus projectname jobname
Functionality of Link Partitioner and Link Collector. Link Partitioner: it actually splits data into various partitions or data flows using various partition methods. Link Collector: it collects the data coming from the partitions, merges it into a single data flow and loads it to the target.
What is the difference between server jobs and parallel jobs?
Server jobs do not support the partitioning techniques, but parallel jobs do.
Server jobs do not support SMP/MPP environments, but parallel jobs support SMP and MPP.
Server jobs run on a single node, but parallel jobs run on multiple nodes.
Server jobs are preferred when the source data volume is low; when the data is huge, prefer parallel jobs.
What are the different types of containers? A group of stages and links in a job design is called a container. There are two kinds of containers: Local and Shared. Local containers only exist within the single job they are used in. Use shared containers to simplify complex job designs. Shared containers exist outside of any specific job; they are listed in the Shared Containers branch in the Manager and can be added to any job. Shared containers are frequently used to share a commonly used set of job components. A job container contains two unique stages: the Container Input stage is used to pass data into the container, and the Container Output stage is used to pass data out of the container.
How do you use a stored procedure in a DataStage job? Use the ODBC plug-in, pass one dummy column and give the procedure name in the SQL tab.
What is a hash file, and what are its types? A hash file is just like an indexed sequential file; the file is internally indexed on a particular key value. There are two types of hash file: static hash files and dynamic hash files.
What is job control ? A job control routine provides the means of controlling other jobs from
the current job . A set of one or more jobs can be validated, run ,reset ,
stopped , scheduled in much the same way as the current job can be .
Define the difference between active and passive stages. There are two kinds of stages: Passive stages define read and write access to data sources and repositories (Sequential, ODBC, Hashed). Active stages define how data is filtered and transformed (Transformer, Aggregator, Sort plug-in).
What is the Job control code Job control code used in job control routine to creating controlling job,
which invokes and run other jobs
You need to invoke a job from the command line that is multi-instance enabled. What is the correct syntax to start a multi-instance job?
dsjob -run -mode NORMAL <project> <jobname>.<invocation_id>
A client must support multiple languages in selected Choose Unicode setting in the extended column attribute.
text columns when reading from DB2 database. Choose NVar/NVarchar as data types
Which two actions will allow selected columns to
support such data
Which two system variables/techniques must be used "@PARTITIONNUM",
in a parallel Transformer derivation to generate a "@NUMPARTITIONS"
unique sequence of integers across partitions
You are experiencing performance issues for a given Run job with $APT_TRACE_RUN set to true.
job. You are assigned the task of understanding what Review the objectives of the job
is happening at run time for the given job. What are
the first two steps you should take to understand the
job performance issues
Your customer asks you to identify which stages in a $APT_PM_PLAYER_TIMING
job are consuming the largest amount of CPU time.
Which product feature would help identify these
stages
Unix command to check datastage jobs running at ps -ef | grep phantom
server
Unix command to check DataStage sessions running at the backend:
netstat -na | grep dsr
netstat -a | grep dsr
netstat -a | grep dsrpc
How to unlock a DataStage job: Cleanup Resources in Director
Clear Status File in Director
DS.Tools in Administrator
DS.Tools in Unix
Command to check the DataStage job status: dsjob -status

Parts of the configuration file: Node, ServerName, Pools, FastName, ResourceDisk, ResourceScratchDisk.
Where are DataStage temporary dataset files stored while running a DataStage parallel job? In the resource scratch disk (ResourceScratchDisk).
Which three defaults are set in DataStage Project level defaults for environment variables.
Administrator project level default for compile options
project level default for Runtime Column Propagation
Which three statements are true about National NLS must be selected during installation to use it.
Language Support (NLS) Within an NLS enabled DataStage environment, maps are used to
convert external data into UTF-16.
Reading or writing 7-bit ASCII data from a database does not require NLS
support
Upon which two conditions does the number of data The number of processing nodes in the default node pool.
files created by a File Set depend The number of disks in the export or default disk pool connected to each
processing node in the default node pool
You are working on a project that contains a large Use the Advanced Find feature contained in the Designer interface
number of jobs contained in many folders. You would
like to review the jobs created by the former
developer of the project. How can you find these jobs
Techniques you will use to abort a job in Transformer Create a dummy output link with a constraint that tests for the condition
stage to abort on set the "Abort After Rows" property to #.
What can you do from the Administrator client Set up user permissions for projects
Purge job log file
Set Environment variable default value
Add, delete, and move InfoSphere® DataStage® projects
Which two actions can improve sort performance in a Specify only the key columns which are necessary.
DataStage job Minimize the number of sorts used within a job flow.
Adjusting the "Restrict Memory Usage" option in the Sort stage
How do you execute datastage job from command line Using "dsjob" command as follows. dsjob -run -jobstatus projectname
prompt jobname
ex:$dsjob -run
and also the options like
-stop -To stop the running job
-lprojects - To list the projects
-ljobs - To list the jobs in project
-lstages - To list the stages present in job.
-llinks - To list the links.
-projectinfo - returns the project information(hostname and project
name)
-jobinfo - returns the job information(Job status,job runtime,endtime,
etc.,)
-stageinfo - returns the stage name ,stage type,input rows etc.,)
-linkinfo - It returns the link information
-lparams - To list the parameters in a job
-paraminfo - returns the parameters info
-log - add a text message to log.
-logsum - To display the log
-logdetail - To display with details like event_id,time,messge
-lognewest - To display the newest log id.
-report - display a report contains Generated time, start time,elapsed
time,status etc.,
-jobid - Job id information.
Difference between sequential file,dataset and fileset Sequential File:
1. Extract/load from/to seq file max 2GB
2. when used as a source at the time of compilation it will be converted
into native format from ASCII
3. Does not support null values
4. Seq file can only be accessed on one node.

Dataset:
1. It preserves partitioning: it stores data on the nodes, so when you read from a dataset you don't have to repartition the data.
2. it stores data in binary in the internal format of datastage. so it takes
less time to read/write from ds to any other source/target.
3. You cannot view the data without DataStage.
4. It creates 2 types of files for storing the data:
A) Descriptor File : Which is created in defined folder/path.
B) Data File : Created in Dataset folder mentioned in configuration file.
5. A dataset (.ds) file cannot be opened directly; you can follow an alternative way to achieve that: the Data Set Management utility in the client tools (such as Designer and Manager), or the command line utility ORCHADMIN.

Fileset:
1. It stores data in a format similar to that of a sequential file. The only advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.
2. you can view the data but in the order defined in partitioning scheme.
3. Fileset creates .fs file and .fs file is stored as ASCII format, so you could
directly open it to see the path of data file and its schema.
What are the main differences between the Lookup, Join and Merge stages? All are used to join tables, but here are the differences.
Lookup: when the reference data is small we use Lookup, because the data is stored in memory; if the reference data is very large then it will take time to load and to perform the lookup.
Join: if the reference data is very large then we go for Join, because it accesses the data directly from the disk, so the processing time will be less when compared to Lookup. But in Join we cannot capture the rejected data, so we go for Merge.
Merge: if we want to capture rejected data (when the join key is not matched) we use the Merge stage; for every detail (update) link there is a reject link to capture rejected data.
Significant differences that I have noticed are:
1) Number Of Reject Link
(Join) does not support reject link.
(Merge) has as many reject link as the update links (If there are n-input
links then 1 will be master link and n-1 will be the update link).
2) Data Selection
(Join) There are various ways in which data is being selected. e.g. we
have different types of joins inner outer( left right full) cross join etc. So
you have different selection criteria for dropping/selecting a row.
(Merge) Data in Master record and update records are merged only
when both have same value for the merge key columns
What are the different types of lookup? When should one use a sparse lookup in a job?
In DS 7.5 we have 2 types of lookup options available: 1. Normal 2. Sparse.
From DS 8.0.1 onwards, we have 3 types of lookup options available: 1. Normal 2. Sparse 3. Range.
Normal lookup: to perform this lookup the data will be stored in memory first and then the lookup will be performed, due to which it takes more execution time if the reference data is high in volume. A normal lookup takes the entire table into memory and performs the lookup there.
Sparse lookup: the SQL query is fired directly on the database for the related record, due to which execution is faster than a normal lookup; a sparse lookup performs the lookup directly at the database level, i.e. the reference link is directly connected to a DB2/OCI stage and a query is fired one by one on the DB table to fetch the result.
Range lookup: this helps you to search records based on a particular range; it will search only that particular range of records and provides good performance instead of searching the entire record set, i.e. define the range expression by selecting the upper bound and lower bound range columns and the required operators.
For example:
Account_Detail.Trans_Date >= Customer_Detail.Start_Date AND
Account_Detail.Trans_Date <= Customer_Detail.End_Date
Use and Types of Funnel Stage in Datastage The Funnel stage is a processing stage. It copies multiple input data sets
to a single output data set. This operation is useful for combining
separate data sets into a single large data set. The stage can have any
number of input links and a single output link.
The Funnel stage can operate in one of three modes:
•Continuous Funnel combines the records of the input data in no
guaranteed order. It takes one record from each input link in turn. If
data is not available on an input link, the stage skips to the next link
rather than waiting.
•Sort Funnel combines the input records in the order defined by the
value(s) of one or more key columns and the order of the output records
is determined by these sorting keys.
•Sequence copies all records from the first input data set to the output
data set, then all the records from the second input data set, and so on.
For all methods the meta data of all input data sets must be identical.
Name of columns should be same in all input links
What is the Diffrence Between Link Sort and Sort If the volume of the data is low, then we go for link sort.
Stage? If the volume of the data is high, then we go for sort stage.
Or Diffrence Between Link sort and Stage Sort "Link Sort" uses scratch disk (physical location on disk), whereas
"Sort Stage" uses server RAM (Memory). Hence we can change the
default memory size in "Sort Stage"
Using SortStage you have the possibility to create a KeyChangeColumn -
not possible in link sort.
Within a SortStage you have the possibility to increase the memory size
per partition,
Within a Sort stage you can define the 'don't sort' option on sort keys that are already sorted.
Link Sort and stage sort,both do the same thing.Only the Sort Stage
provides you with more options like the amount of memory to be
used,remove duplicates,sort in Ascending or descending order,Create
change key columns and etc.These options will not be available to you
while using Link Sort
what is main difference between change capture and Change Capture stage : compares two data set(after and before) and
change apply stages makes a record of the differences.
change apply stage : combine the changes from the change capture
stage with the original before data set to reproduce the after data set.

The Change Capture stage catches hold of the changes from two different datasets and generates a new column called change_code. The change_code has the values:
0-copy
1-insert
2-delete
3-edit/update

The Change Apply stage applies these changes back to those data sets based on the change_code column.

Remove duplicates using Sort Stage and Remove We can remove duplicates using both stages but in the sort stage we can
Duplicate Stages and Diffrence capture duplicate records using create key change column property.

1)The advantage of using sort stage over remove duplicate stage is that
sort stage allows us to capture the duplicate records whereas remove
duplicate stage does not.
2) Using a sort stage we can only retain the first record.
Normally we go for retaining last when we sort a particular field in
ascending order and try to get the last rec. The same can be done using
sort stage by sorting in descending order to retain the first record.
What is the use Enterprise Pivot Stage The Pivot Enterprise stage is a processing stage that pivots data
horizontally and vertically.
· Specifying a horizontal pivot operation : Use the Pivot Enterprise stage
to horizontally pivot data to map sets of input columns onto single
output columns.
Table 1. Input data for a simple horizontal pivot operation
REPID last_name Jan_sales Feb_sales Mar_sales
100 Smith 1234.08 1456.80 1578.00
101 Yamada 1245.20 1765.00 1934.22
Table 2. Output data for a simple horizontal pivot operation
REPID last_name Q1sales Pivot_index
100 Smith 1234.08 0
100 Smith 1456.80 1
100 Smith 1578.00 2
101 Yamada 1245.20 0
101 Yamada 1765.00 1
101 Yamada 1934.22 2
Specifying a vertical pivot operation: Use the Pivot Enterprise stage to
vertically pivot data and then map the resulting columns onto the
output columns.
Table 1. Input data for vertical pivot operation
REPID last_name Q_sales
100 Smith 1234.08
100 Smith 1456.80
100 Smith 1578.00
101 Yamada 1245.20
101 Yamada 1765.00
101 Yamada 1934.22
Table 2. Out put data for vertical pivot operation
REPID last_name Q_sales (January) Q_sales1 (February) Q_sales2
(March) Q_sales_average
100 Smith 1234.08 1456.80 1578.00 1412.96
101 Yamada 1245.20 1765.00 1934.22 1648.14
What is a staging area? Do we need it? What is the Data staging is actually a collection of processes used to prepare source
purpose of a staging area system data for loading a data warehouse. Staging includes the following
steps:
Source data extraction, Data transformation (restructuring),
Data transformation (data cleansing, value transformations),
Surrogate key assignments

What are active transformation / Passive Active transformation can change the number of rows that pass through
transformations it. (decrease or increase rows)
Passive transformation can not change the number of rows that pass
through it
What is a Data Warehouse A Data Warehouse is the "corporate memory". Academics will say it is a
subject oriented, point-in-time, inquiry only collection of operational
data.
Typical relational databases are designed for on-line transactional
processing (OLTP) and do not meet the requirements for effective on-
line analytical processing (OLAP). As a result, data warehouses are
designed differently than traditional relational databases.
What is the difference between a data warehouse and This is a heavily debated issue. There are inherent similarities between
a data mart the basic constructs used to design a data warehouse and a data mart. In
general a Data Warehouse is used on an enterprise level, while Data
Marts are used on a business division/department level. A data mart only
contains the required subject specific data for local analysis.
What is the difference between a W/H and an OLTP Typical relational databases are designed for on-line transactional
application processing (OLTP) and do not meet the requirements for effective on-
line analytical processing (OLAP). As a result, data warehouses are
designed differently than traditional relational databases.
Warehouses are Time Referenced, Subject-Oriented, Non-volatile (read
only) and Integrated.
OLTP databases are designed to maintain atomicity, consistency and
integrity (the "ACID" tests). Since a data warehouse is not updated,
these constraints are relaxed.

What is the difference between OLAP, ROLAP, MOLAP and HOLAP? ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Processing) applications.
ROLAP stands for Relational OLAP. Users see their data organized in
cubes with dimensions, but the data is really stored in a Relational
Database (RDBMS) like Oracle. The RDBMS will store data at a fine grain
level, response times are usually slow.

MOLAP stands for Multidimensional OLAP. Users see their data


organized in cubes with dimensions, but the data is stored in a Multi-
dimensional database (MDBMS) like Oracle Express Server. In a MOLAP
system lot of queries have a finite answer and performance is usually
critical and fast.

HOLAP stands for Hybrid OLAP, it is a combination of both worlds.


Seagate Software's Holos is an example HOLAP environment. In a HOLAP
system one will find queries on aggregated data as well as on detailed
data.
What is the difference between an ODS and a W/H An ODS (Operational Data Store) is an integrated database of
operational data. Its sources include legacy systems and it contains
current or near term data. An ODS may contain 30 to 90 days of
information.
A warehouse typically contains years of data (Time Referenced). Data
warehouses group data by subject rather than by activity (subject-
oriented). Other properties are: Non-volatile (read only) and Integrated
What is a star schema? Why does one design this way A single "fact table" containing a compound primary key, with one
segment for each "dimension," and additional columns of additive,
numeric facts.
Why?
It allows for the highest level of flexibility of metadata
Low maintenance as the data warehouse matures
Best possible performance
When should you use a STAR and when a SNOW- The star schema is the simplest data warehouse schema. Snow flake
FLAKE schema schema is similar to the star schema. It normalizes dimension table to
save data storage space. It can be used to represent hierarchies of
information.
What are facts and dimensions? Facts are numerical values such as number, amount, quantity, etc., which are used for analysis purposes with respect to dimensions. Dimensions are a set of related characteristic objects on which a user wants to analyze the facts. A single fact table can consist of a maximum of 255 characteristic infoobjects.
How to print/display the first line of a file? There are many ways to do this. However the easiest way to display the
first line of a file is using the [head] command.
$> head -1 file.txt
If you specify [head -2] then it would print first 2 records of the file.
Another way can be by using [sed] command. [Sed] is a very powerful
text editor which can be used for various text manipulation purposes
like this.
$> sed '2,$ d' file.txt

How to print/display the last line of a file? The easiest way is to use the [tail] command.
$> tail -1 file.txt
If you want to do it using the [sed] command, here is what you should write:
$> sed -n '$ p' file.txt
From our previous answer, we already know that '$' stands for the last
line of the file. So '$ p' basically prints (p for print) the last line to the
standard output. The '-n' switch puts [sed] into silent mode so that
[sed] does not print anything else in the output.
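With most [awk] implementations the last line can also be printed from the END block, since $0 still holds the final record there (an added sketch, not part of the original answer):
$> awk 'END {print}' file.txt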
How to display n-th line of a file The easiest way to do it will be by using [sed], I guess. Based on what we
already know about [sed] from our previous examples, we can quickly
deduce this command:
$> sed -n '<n> p' file.txt
You need to replace <n> with the actual line number. So if you want to
print the 4th line, the command will be:
$> sed -n '4 p' file.txt
Of course you can do it by using the [head] and [tail] commands as well,
like below:
$> head -<n> file.txt | tail -1
Again, replace <n> with the actual line number. So if you want to print
the 4th line, the command will be:
$> head -4 file.txt | tail -1
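An [awk] one-liner does the same job (an added sketch; it prints the 4th line and stops reading the file):
$> awk 'NR==4 {print; exit}' file.txt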
How to remove the first line / header from a file We already know how [sed] can be used to delete a certain line from the
output by using the 'd' command. So if we want to delete the first line, the
command should be:
$> sed '1 d' file.txt
But the issue with the above command is, it just prints out all the lines
except the first line of the file on the standard output. It does not really
change the file in-place. So if you want to delete the first line from the
file itself, you have two options.
Either you can redirect the output of the file to some other file and then
rename it back to original file like below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or, you can use the inbuilt [sed] switch '-i', which changes the file in-
place. See below:
$> sed -i '1 d' file.txt
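Another common alternative (an added sketch, assuming a tail that supports the standard '-n +K' form) is to start printing from the second line and redirect to a new file:
$> tail -n +2 file.txt > new_file.txt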
How to remove the last line/ trailer from a file in Unix Always remember that in [sed] the address '$' refers to the last line. So using
script this knowledge we can deduce the below command:
$> sed -i '$ d' file.txt
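With GNU coreutils, [head] also accepts a negative count, which prints everything except the last line (an added sketch; this form is not available in every head implementation):
$> head -n -1 file.txt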
How to remove certain lines from a file in Unix If you want to remove line <m> to line <n> from a given file, you can
accomplish the task in a similar way to the method shown above. Here is
an example:
$> sed -i '5,7 d' file.txt
The above command will delete line 5 to line 7 from the file file.txt.
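The same range can be filtered out with [awk] by printing only the lines outside it (an added sketch that writes to a new file rather than editing in place):
$> awk 'NR < 5 || NR > 7' file.txt > new_file.txt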
How to remove the last n-th line from a file This is a bit tricky. Suppose your file contains 100 lines and you want to
remove the last 5 lines. Now if you know how many lines there are in the
file, then you can simply use the above method and remove
all the lines from 96 to 100, like below:
$> sed -i '96,100 d' file.txt # alternative to the command [head -95 file.txt]
But you will not always know the number of lines present in the file (the
file may be generated dynamically, etc.). In that case there are many
different ways to solve the problem, some of them quite complex and
fancy. But let's first do it in a way that is easy to understand and
remember. Here is how it goes:
$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt
As you can see there are two commands. The first one (before the semi-
colon) calculates the total number of lines present in the file and stores
it in a variable called "tt". The second command (after the semi-colon)
uses that variable and works in exactly the same way as shown in the
previous example.
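If GNU [head] is available, the whole thing collapses into one command (an added sketch; the negative count form is GNU-specific):
$> head -n -5 file.txt > new_file.txt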
How to check the length of any line in a file We already know how to print one line from a file, which is this:
$> sed -n '<n> p' file.txt
where <n> is to be replaced by the actual line number that you want to
print. Once you know that, it is easy to print out the length of this line by
piping it to the [wc] command with the '-c' switch:
$> sed -n '35 p' file.txt | wc -c
The above command will print the length of the 35th line of file.txt (note
that [wc -c] counts the trailing newline character as well).
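The line length can also be obtained directly with [awk], without the extra newline being counted (an added sketch, not part of the original answer):
$> awk 'NR==35 {print length($0)}' file.txt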
How to get the nth word of a line in Unix Assuming the words in the line are separated by spaces, we can use the
[cut] command. [cut] is a very powerful and useful command and it is
really easy to use. All you have to do to get the n-th word from the line is
issue the following command:
$> cut -f<n> -d' '
The '-d' switch tells [cut] what the delimiter (or separator) in the file is,
which is a space ' ' in this case. If the separator were a comma, we
would have written -d',' instead. So, suppose I want to find the 4th word
from the string "A quick brown fox jumped over the lazy cat"; we will
do something like this:
$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '
And it will print "fox".
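[awk] splits on any run of whitespace by default, so it handles multiple spaces between words more gracefully than [cut] (an added sketch):
$> echo "A quick brown fox jumped over the lazy cat" | awk '{print $4}'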
How to reverse a string in unix Pretty easy. Use the [rev] command.
$> echo "unix" | rev
xinu
How to get the last word from a line in Unix file We will make use of two commands that we learnt above to solve this.
The commands are [rev] and [cut]. Here we go.
Let's imagine the line is: “C for Cat”. We need “Cat”. First we reverse the
line. We get “taC rof C”. Then we cut the first word, we get 'taC'. And
then we reverse it again.
$>echo "C for Cat" | rev | cut -f1 -d' ' | rev
Cat
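[awk] can also pull out the last word directly, because $NF refers to the last field of the line (an added sketch, not part of the original answer):
$> echo "C for Cat" | awk '{print $NF}'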
How to get the n-th field from a Unix command output We know we can do it with [cut]. For example, the below command extracts
the first field from the output of the [wc -c] command:
$> wc -c file.txt | cut -d' ' -f1
109
But I want to introduce one more command to do this here, and that is the
[awk] command. [awk] is a very powerful command for text
pattern scanning and processing. Here we will see how we may use
[awk] to extract the first field (or first column) from the output of
another command. As above, suppose I want to print the first column
of the [wc -c] output. Here is how it goes:
$> wc -c file.txt | awk ' {print $1}'
109
The basic syntax of [awk] is like this:
awk 'pattern {action}'
The pattern part can be left blank or omitted entirely, like below:
$> wc -c file.txt | awk '{print $1}'
109
In the action part, we have asked [awk] to take the action of printing
the first column ($1). More on [awk] later.
How to replace the n-th line in a file with a new line in This can be done in two steps. The first step is to remove the n-th line,
Unix and the second step is to insert a new line at the n-th line position. Here
we go.
Step 1: remove the n-th line
$> sed -i'' '10 d' file.txt # d stands for delete
Step 2: insert a new line at the n-th line position
$> sed -i'' '10 i This is the new line' file.txt # i stands for insert
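The two steps can also be collapsed into a single substitution that overwrites whatever is on line 10 (an added sketch; the behaviour of in-place '-i' editing differs slightly between GNU and BSD sed):
$> sed -i'' '10 s/.*/This is the new line/' file.txt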
How to show the non-printable characters in a file Open the file in the vi editor. Go to command mode by pressing [Escape]
and then [:]. Then type [set list] and press [Enter]. This will show you all the
non-printable characters, e.g. Ctrl-M characters (^M), in the file.
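Outside of vi, [cat] can do the same check from the command line; its '-v' switch displays non-printing characters such as ^M in a visible form (an added sketch):
$> cat -v file.txt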
Diff b/w Compile and Validate The Compile option only checks mandatory requirements such as link
requirements and stage options; it will not check whether the database
connections are valid.
Validate is equivalent to running a job except for the extraction/loading of
data. That is, the Validate option will test database connectivity by making
connections to the databases.
Field mapping using Transformer stage: Right("0000000000":Trim(Lnk_Xfm_Trans.link),18)
Requirement:
the field must be right justified and zero filled; take the last 18
characters.
Note: if the trimmed input can be shorter than 8 characters, the zero pad
itself should be at least 18 characters long (for example Str("0",18)) so
that the result is always 18 characters.
We have a source which is a sequential file with Type this command in PuTTY: sed '1d;$d' file_name > new_file_name (run
header and footer. How to remove the header and it as a before-job subroutine of the job, then use the new file in the
footer while reading this file using sequential file stage Sequential File stage).
of Datastage?
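If the Sequential File stage in your DataStage version exposes a Filter option, the same command can, as far as I know, be supplied there so the header and footer are stripped on the fly while the file is read, avoiding the intermediate file:
sed '1d;$d'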