SRC has 1 record; I want 10 records in the target. How is it possible?
1. Source ---> Copy stage with ten output links ---> Funnel stage ---> target sequential file. The Funnel combines the ten copies of the row, so we get 10 records in the output.
2. Use a Transformer stage with looping: under system variables select @ITERATION and specify the loop condition @ITERATION <= 10 (see the sketch below).
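A minimal sketch of the Transformer looping setup for option 2 (the link and column names are illustrative):
Loop Condition:            @ITERATION <= 10
Output column derivation:  DSLink1.col1    (the same input row is written once per iteration, producing 10 output rows)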
Explain how a source file is populated: We can populate a source file in many ways, for example by creating a SQL query in Oracle or by using a Row Generator / extract tool, etc.
Name the command line functions to import and export the DS jobs: To import DS jobs, dsimport.exe is used, and to export DS jobs, dsexport.exe is used.
Differentiate between data file and descriptor file: As the name implies, data files contain the data, and the descriptor file contains the description/information about the data in the data files.
Define Routines and their types: Routines are basically collections of functions defined in the DS Manager. They can be called via the Transformer stage. There are three types of routines: parallel routines, mainframe routines and server routines.
How can you write parallel routines in DataStage PX: We can write parallel routines in C or C++ and compile them with a C++ compiler. Such routines are also created in the DS Manager and can be called from a Transformer stage.
What is the method of removing duplicates without the Remove Duplicates stage: Duplicates can be removed by using the Sort stage with the option Allow Duplicates = false, or in a Transformer stage using stage variables.
Differentiate between Join, Merge and Lookup stages: All three differ in the way they use memory, in their input requirements and in how they treat various records. Join and Merge need less memory compared to the Lookup stage.
Differentiate between Symmetric Multiprocessing and Massively Parallel Processing: In Symmetric Multiprocessing, the hardware resources are shared by the processors. The processors run one operating system and communicate through shared memory. In Massively Parallel Processing, each processor accesses the hardware resources exclusively. This type of processing is also known as Shared Nothing, since nothing is shared, and it is faster than Symmetric Multiprocessing.
Define APT_CONFIG in Datastage It is the environment variable that is used to identify the *.apt file in
Datastage. It is also used to store the node information, disk storage
information and scratch information
Source is a flat file with 200 records. These have to be split equally across 4 outputs, 50 records in each. The total number of records in the source may vary every day; according to the count, the records are to be split equally across the 4 outputs.
Use SEQ ---> Transformer stage ---> 4 SEQ files. In the Transformer stage add the constraints Mod(@INROWNUM,4) = 0, Mod(@INROWNUM,4) = 1, Mod(@INROWNUM,4) = 2 and Mod(@INROWNUM,4) = 3, one on each output link.
Input is
cola
1
2
3
and this should be populated at the output as
cola
1
22
333
1. Take a Transformer and use the system variable @ITERATION in a loop: @ITERATION <= input.cola.
2. Or use a stage variable with Str(input_column, input_column): the Str function repeats the input string the number of times given in its second argument.
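A minimal sketch of the Str approach (the link name DSLink3 is illustrative):
Output derivation:  Str(DSLink3.cola, DSLink3.cola)
For cola = 3 this evaluates to Str("3", 3) = "333", giving the repeated values shown above.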
I have two columns in the source, Col A and Col B. Input is like
Cola Colb
100 ABCDEF
and I should achieve the output as
Cola Colb
100 A
100 B
100 C
100 D
100 E
100 F
1. Transformer looping - Loop Condition: @ITERATION <= Len(DSLink2.COLB) (the loop ends when the iteration count reaches the length of the string). Derivation for Colb: Right(Left(DSLink2.COLB, @ITERATION), 1) - the Left and Right functions extract one character per iteration.
2. Use a Pivot stage.
I have 3 jobs A, B & C, which are dependent on each other. I want to run jobs A & C daily and job B only on Sunday. How can I do it?
1. This can be done in Director --> Add to Scheduler.
2. Create 2 sequencers: in the 1st, sequence jobs A & C, and in the 2nd sequence only B will be there. Schedule the 1st sequence through Director for a daily run and schedule the 2nd to run only on Sunday.
3. You can create a new job to test whether the current day is Sunday or not. If it is true then create a file, else create a 0 KB file. Then you can create only 1 sequence with jobs A and C one after the other, then the new job, and an Execute Command stage which will trigger job B if the file is available; otherwise it will not trigger the job and will complete the sequence.
I have a source file having data like:
10
10
10
20
20
20
30
30
40
40
50
60
70
I want three outputs from the above input file:
1) only unique values, no duplicates (10,20,30,40,50,60,70)
2) only the values that have duplicates (10,20,30,40)
3) only the values that occur once (50,60,70)
1. Sourcefile --> Copy stage --> 1st link --> Remove Duplicates stage --> outputfile1 with 10,20,30,40,50,60,70. Copy stage --> 2nd link --> Aggregator stage (creates the row count) --> Filter stage --> filter1 (count > 1) --> outputfile2 with 10,20,30,40; filter2 (count = 1) --> outputfile3 with 50,60,70.
2. Use a Sort stage: define the key field and, in the properties, set Create Key Change Column to TRUE. Then use a Transformer with the constraint KeyChange = 1 for the unique-record output and KeyChange = 0 for the duplicate output.
I have one source file which contains the below data in input column DOB varchar(8):
20000303
20000409
1999043
1999047
Validate the date: if it is a valid date pass it, else pass the default 1999-12-31, and convert varchar to date.
In the Transformer stage use StringToDate(datecol, format).
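A minimal sketch of the full derivation (the link name and the format strings are assumptions; the format must match the incoming data):
If IsValid("date", StringToDate(DSLink.DOB, "%yyyy%mm%dd"))
Then StringToDate(DSLink.DOB, "%yyyy%mm%dd")
Else StringToDate("1999-12-31", "%yyyy-%mm-%dd")
Rows such as 1999043, which are not valid dates, then receive the default 1999-12-31.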
What is the use of a node in DataStage? If we increase the nodes, what will happen?
1. In a grid environment a node is the place where the jobs are executed. There will be a Conductor node and multiple grid nodes which
are executing in parallel to make the processing faster. If the number of
nodes are increased it increases the Parallelism of the job and hence the
performance
2.A basic answer I have in mind is for load balancing/performance. For
example, you can group x number of jobs to nodeA, and another group
of jobs to nodeB. This way your jobs are not using just the default node.
You can do this by setting your $APT_CONFIG_FILE for each of your jobs.
Just don't get confused with the term Node Vs Processor. A processor
can have 1 or more nodes
Say I have 5 rows in a source table and for each row there are 10 matching rows in a lookup table, and my range for the lookup is 9 to 99. What will be the row count in the output table?
I have a table (Emp) with the columns Eid, Ename, Sal, month(sal), year(sal) and DOB (say 15th-Jan-1981). Design a job such that the output displays Ename, year(sal), tot(sal) and the current age, e.g. 18 yrs.
Using a Transformer: days between the dates divided by 365, or the year-from-date function.
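A minimal sketch of the age derivation in a parallel Transformer (the link name is illustrative; this year-based calculation ignores whether the birthday has already passed this year):
Age = YearFromDate(CurrentDate()) - YearFromDate(Lnk_Emp.DOB)
tot(sal) per Ename and year can then be produced with an Aggregator stage grouping on Ename and year(sal) with Sum(Sal).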
Suppose i am having a source file and 3 output tables Use SEQ---> Transformer stage ----> 3 SEQ Files In transformer stage add
and I want first row to be written to first table second constraints as Mod(@INROWNUM,3) =1, Mod(@INROWNUM,3) =2,
row to second table, third row to third table likewise Mod(@INROWNUM,3) =0
how can we achieve this using datastage without using
partitioning?
How can we identify updated records in DataStage (only updated records), without having any row-id or date column available? You can use the Change Capture stage, which compares the data before the update with the data after the update and flags the changes.
How to get the top five rows in DataStage? I tried to use the @INROWNUM and @OUTROWNUM system variables in a Transformer, but they are not giving unique sequential numbers for every row.
1. If the source in your case is a file, then we can use the "Read first rows" property of the Sequential File stage and specify 5 as the value.
Or use the "Row number" property in the Sequential File stage and add a Filter stage with the clause RowNumber <= 5.
Or use "Filter" property in the Sequential file stage and specify the
command "head -5" (without quotes)
Or irrespective of source, use Head stage after the source stage and
specify 5 for the number of rows to display
2. We can use Head stage to get top n rows.
Keep the option "Number of Rows (Per Partition) = 5"
Before/After Subroutines:
Transformer Routines:
Write the routine as a C++ function, including whatever standard C/C++ headers the code needs.
Note: we need to make sure our code does not contain a main() function, as it is not required.
• Compile the source file; this creates an object file at the same path with a .o extension. Now log in to DS Designer, right-click the Routines folder and select New Parallel Routine.
• In the newly opened window put the required details: give the routine a Name, select the Type as External Function, set the external subroutine name to the name of the function we need to access, select the proper return type, and also provide the complete path of the .o file.
• Now select the Arguments tab, add the required arguments and select the proper data types for the arguments.
• Now you are all done. Go to your job, open any Transformer and in any expression select the ellipsis button [...]; you will get the list, and there select Routine. There you will find our new routine listed.
But remember a few points while writing the function, i.e. the cpp code:
• The return value should be passed back as char*, not as a C++ string.
• The same applies to input arguments too, so our function can accept char* only, not string. Later in the cpp code we can change it to a string.
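A minimal sketch of such a routine, with an assumed file name string_upper.cpp and an illustrative function that upper-cases its input (not taken from the original document):

#include <string.h>
#include <ctype.h>

/* Example parallel routine: returns an upper-cased copy of the input.
   Note: no main() function, and both the argument and the return type are char*. */
char* string_upper(char* input)
{
    static char out[256];                 /* buffer handed back to the Transformer */
    int i;
    for (i = 0; input[i] != '\0' && i < 255; i++)
        out[i] = toupper(input[i]);
    out[i] = '\0';
    return out;
}

Compile it to an object file, e.g. g++ -c string_upper.cpp -o string_upper.o (the exact compiler and flags are an assumption; use the compiler configured in $APT_COMPILER / $APT_COMPILEOPT for your project), and register string_upper as the external subroutine name together with the path of the .o file in the Parallel Routine definition.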
Datastage Architecture: There are mainly 3 parts: 1. DS Engine 2. Metadata Repository 3. Services.
If these 3 tiers are installed on a single server, it is called a single-tier architecture.
If the DS Engine is on one server and the Metadata Repository and WAS Services are on another server, it is called a 2-tier architecture.
If these 3 are on different servers, it is a 3-tier architecture.
And don't forget that the DS Client is installed on a Windows machine and a connection has to be established from the server to the client.
Apart from these, there are SMP (Symmetric Multi Processing, shared) and MPP (Massive Parallel Processing, share nothing) environments.
2. Project architecture would be like:
You have: 1 Source --------> 1 Staging Area --------> 1 Temporary area to store your transformed data --------> finally your target.
So it is a 4-layer architecture.
The option "Do not checkpoint run" in a job activity should not be checked, because if another activity, say activity4, later in the sequencer fails and the sequencer is restarted, this activity will rerun again (which is unnecessary). However, if the logic demands it, then it has to be checked.
The option "Reset if required, then run" has to be chosen in the job activity so that when an aborted sequence is re-run, the aborted job (due to which the sequence was aborted) will re-run successfully.
I have a source file1 consisting of two datatypes:
file1:
no(integer)
1
2
3
dept(char)
cs
it
ie
and I want to separate these two datatypes and load them into target files file2 and file3. How can I do this in DataStage, and by using which stage?
1. In the Transformer stage there are functions IsInteger and IsChar; we can identify the rows with: If IsInteger(column name) then file2 else file3.
2. I think this question is meant to confuse the job aspirant by mentioning datatypes and all... It is very simple: file1 has 2 columns, 1. NO (Integer) and 2. DEPT (Char); Target1: NO (Integer), Target2: DEPT (Char). Take a Copy stage and draw 2 output links. In one output, map only one column, NO (Integer); in the 2nd output, map only one column, DEPT (Char). It is a simple mapping and there is no need to use a Transformer stage. Correct me if I was wrong...
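If IsInteger/IsChar are not available in your Transformer version, a hedged alternative for option 1 is the IsValid function in the output-link constraints (link and column names are illustrative):
Numeric link constraint:    IsValid("int32", DSLink.col)
Character link constraint:  Not(IsValid("int32", DSLink.col))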
Difference between JOIN , LOOKUP , MERGE The three stages differ mainly in the memory they use
DataStage doesn't know how large your data is, so cannot make an
informed choice whether to combine data using a join stage or a lookup
stage. Here's how to decide which to use:
if the reference datasets are big enough to cause trouble, use a join. A
join does a high-speed sort on the driving and reference datasets. This
can involve I/O if the data is big enough, but the I/O is all highly
optimized and sequential. Once the sort is over the join processing is
very fast and never involves paging or other I/O
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links, as many as there are update input links.
2) Join Stage:
1.) It has n input links(one being primary and remaining being secondary
links), one output link and there is no reject link
2.) It has 4 join operations: inner join, left outer join, right outer join and
full outer join
3.) join occupies less memory, hence performance is high in join stage
4.) Here default partitioning technique would be Hash partitioning
technique
5.) Prerequisite condition for join is that before performing join
operation, the data should be sorted.
Look up Stage:
1.) It has n input links, one output link and 1 reject link
2.) It can perform only 2 join operations: inner join and left outer join
3.) Join occupies more memory, hence performance reduces
4.) Here default partitioning technique would be Entire
Merge Stage:
1.) Here we have n input links (one master link and n-1 update links) and n-1 reject links
2.) in this also we can perform 2 join operations: inner join, left outer
join
3.) the hash partitioning technique is used by default
4.) Memory used is very less, hence performance is high
5.) sorted data in master and update links are mandatory
How can we retrieve the particular rows in dataset by $orchadmin dump [options] Sample.ds
using orchadmin command?
options : -p period(N) : Lists every Nth record from each partition
starting from first record
2.
orchadmin dump -part 0 -n 17 -field name input.ds
Performance Tuning of Jobs:
1. By query optimization, i.e. check your extract and insert queries.
2. From the server side you can increase the memory.
3. By removing unnecessary stages which are not required, i.e. with a proper job design.
4. By using constraints you can also increase the performance.
5. Use parameterized jobs and sequences with the proper parallel/sequential mapping flow.
1. First filter, then extract; do not extract everything and then filter. Use SQL (a user-defined query) instead of the table read method when extracting. Say 1 million records are coming from the input table but there is a filter condition (Acct_Type=S) in the job as per the business documents which results in only a few records, say 100.
2. Reduce the number of Transformer stages as much as possible.
3. Reduce stage variables.
4. Remove Sort stages and apply the partitioning techniques at stage level (e.g. for Join - hash, for Lookup - entire).
5. Be careful while operating with joins. Be specific with an inner join unless the business needs a left outer join.
Use a Copy stage instead of a Transformer for simple operations like:
• placeholder between stages
• renaming columns
• dropping columns
• implicit (default) type conversions
Use stage variables wisely: the more of them there are, the slower the Transformer and the job are. A job should not be overloaded with stages, so split the job design into smaller jobs.
Reading or writing data from/to a sequential file is slow and this can be a bottleneck in case of huge data. So, in such cases, to have faster reading from the Sequential File stage, the number of readers per node can be increased (the default value is one), or "Read from multiple nodes" can be set.
Ensure that RCP is not propagating unnecessary metadata to downstream stages.
Have a volumetric analysis done when you introduce
Lookup(Normal/Sparse), Join and Merge stage in the design.
Always use CONNECTOR stages for connecting to databases as they are
more robust and fast. If not available, use Enterprise stage. Plug-in
stages come next.
Examine the execution plans for SQL queries used in jobs and create
indices for appropriate columns. Having indices improves performance
drastically.
Sometimes dropping the index, loading the table and re-creating the index once the load is done may be a good option.
What are the uses of the Copy stage? It copies the input link dataset to the output link datasets; can it have any other purposes? Please send an example.
Use the Copy stage for simple operations like:
• multicasting the same input data to multiple output links
• placeholder between stages to avoid buffer issues
• renaming columns
• dropping columns
• implicit (default) type conversions
How do you use surrogate key in reporting 1) In a typical data warehouse environment, we normally have a
requirement to keep history. So, we would end up having multiple rows
for a given Primary key. So, we define a new column that doesn't have a
business meaning of its own but acts as a Primary key in a dimension. If you
have Surrogate Keys defined in each of your dimensions, then your fact
table will have each of these keys from dimensions as foreign keys and
measures.
How to get the dataset record count? without using Use datastage dataset management utility[GUI].
orchadmin command 2. Use orchadmin utility from command line
3. dsrecords command
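For example, a hedged sketch of the dsrecords option (the exact path of the utility and its output format may vary by version):
$> dsrecords mydata.ds
This prints the number of records held in the data set described by mydata.ds.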
Dimensional Modelling types and significance:
1) Dimensional modelling means defining a data warehouse architecture using any one of the available schemas.
Dimensional modelling is of 3 types: 1. conceptual model 2. logical model 3. physical model.
Conceptual model: gathering of all the requirements.
Logical modelling: define facts and dimensions.
Physical model: physically moving data to the data warehouse.
Data warehouse flow: gather all requirements ---> define facts and dimensions ---> load data (data warehouse) ---> generate reports ---> analysis ---> decision.
2) Dimensional Modelling:
It is a design methodology for designing a data warehouse with dimensions & facts.
It is of three types: 1) Conceptual Modelling 2) Logical Modelling 3) Physical Modelling
1) Conceptual Modelling:
A datawarehouse architect gathers the business requirements.
Identify the facts, dimensions & relationships.
2) Logical Modelling:
Design the fact & dimension tables.
Create the relationship between fact & dimension tables.
3) Physical Modelling:
Execute the dimension & fact tables in database for loading..
If you are given a list of .txt files and asked to read only the first 3 files using the Sequential File stage, how will you do it?
In the Sequential File stage we can read a single file by using the Specific File(s) read method; we can read more than one file by listing several specific file names, or by using the File Pattern read method.
If you have numeric + character data in the source, how will you load only the character data to the target? Which functions will you use in the Transformer stage?
Use the following function in a derivation of the Transformer:
Convert("123456789", "", StringName)
T12e5st --> Test (desired result)
Convert("123456789", " ", StringName)
T12e5st --> T e st (a space will appear in place of each digit)
(Include "0" in the first argument, i.e. Convert("0123456789", "", StringName), if zeros also have to be dropped.)
The source is a Sequential File stage with 10 records and moves to a Transformer stage; one output link gets 2 records and the reject link gets 5 records. I want the remaining 3 records; how can I capture them?
1) Add a third output link and make its filter condition the negation of the first output link's condition and of the reject condition. All the records that do not match the filter condition of the first output or the condition of the reject link will be passed to the third link.
2) You can choose the Otherwise option in the constraints, so you will get the remaining records.
What is Dataset and Fileset: 1) I assume you are referring to the Lookup fileset only; it is used only by Lookup stages. Dataset: DataStage parallel extender jobs use data
sets to manage data within a job. You can think of each link in a job as
carrying a data set. The Data Set stage allows you to store data being
operated on in a persistent form, which can then be used by other
DataStage jobs.FileSet: DataStage can generate and name exported files,
write them to their destination, and list the files it has generated in a file
whose extension is, by convention, .fs. The data files and the file that
lists them are called a file set. This capability is useful because some
operating systems impose a 2 GB limit on the size of a file and you need
to distribute files among nodes to prevent overruns.
If possible, break the input into multiple threads and run multiple
instances of the job.
Staged the data coming from ODBC/OCI/DB2UDB stages or any database
on the server using Hash/Sequential files for optimum performance also
for data recovery in case job aborts.
Tuned the OCI stage for 'Array Size' and 'Rows per Transaction'
numerical values for faster inserts, updates and selects.
Tuned the 'Project Tunables' in Administrator for better performance.
Used sorted data for Aggregator.
Sorted the data as much as possible in DB and reduced the use of DS-
Sort for better performance of jobs
Removed the data not used from the source as early as possible in the
job.
Worked with DB-admin to create appropriate Indexes on tables for
better performance of DS queries
Converted some of the complex joins/business in DS to Stored
Procedures on DS for faster execution of the jobs.
If an input file has an excessive number of rows and can be split-up then
use standard logic to run jobs in parallel.
Before writing a routine or a transform, make sure that there is not the
functionality required in one of the standard routines supplied in the sdk
or ds utilities categories.
Constraints are generally CPU intensive and take a significant amount of
time to process. This may be the case if the constraint calls routines or
external macros but if it is inline code then the overhead will be minimal.
Try to have the constraints in the 'Selection' criteria of the jobs itself.
This will eliminate the unnecessary records even getting in before joins
are made.
Tuning should occur on a job-by-job basis.
Use the power of DBMS.
Try not to use a sort stage when you can use an ORDER BY clause in the
database.
Using a constraint to filter a record set is much slower than performing the filtering with a SELECT ... WHERE clause in the database.
Make every attempt to use the bulk loader for your particular database.
Bulk loaders are generally faster than using ODBC or OLE.
Types of views in Datastage Director: There are 3 types of views in DataStage Director:
a) Job View - dates the jobs were compiled
b) Log View - status of the job's last run
c) Status View - warning messages, event messages, program-generated messages
How do you execute a DataStage job from the command line prompt? Using the "dsjob" command, as follows:
dsjob -run -jobstatus projectname jobname
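For example, a sketch with an optional job parameter (the project, job and parameter names are illustrative):
$> dsjob -run -jobstatus -param LOAD_DATE=2020-01-01 MyProject MyLoadJob
-jobstatus makes dsjob wait for the job to finish and return an exit code derived from the job status.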
Functionality of Link Partitioner and Link Collector Link Partitioner: It actually splits data into various partitions or data
flows usingvarious partition methods.Link Collector: It collects the data
coming from partitions, merges it into a single dataflow and loads to
target
What is the difference between Server Jobs and Parallel Jobs?
Server jobs do not support partitioning techniques, but parallel jobs do.
Server jobs do not support SMP/MPP, but parallel jobs do.
Server jobs run on a single node, but parallel jobs run on multiple nodes.
Prefer server jobs when the source data volume is low; when the data is huge, prefer parallel jobs.
What are the different types of containers? A group of stages and links in a job design is called a container. There are two kinds of containers: Local and Shared. Local containers only exist within the single job they are used in. Use shared containers to simplify complex job designs. Shared containers exist outside of any specific job; they are listed in the Shared Containers branch of the repository and can be added to any job. Shared containers are frequently used to share a commonly used set of job components. A job container contains two unique stages: the Container Input stage is used to pass data into the container, and the Container Output stage is used to pass data out of the container.
How do you use a procedure in a DataStage job? Use the ODBC plug-in stage, pass one dummy column, and give the procedure name in the SQL tab.
What is a hash file? What are its types? A hash file is just like an indexed sequential file; the file is internally indexed on a particular key value. There are two types of hash file: Static Hash File and Dynamic Hash File.
What is job control ? A job control routine provides the means of controlling other jobs from
the current job . A set of one or more jobs can be validated, run ,reset ,
stopped , scheduled in much the same way as the current job can be .
Define the difference between Active and Passive Stages: There are two kinds of stages:
Passive stages define read and write access to data sources and repositories: Sequential, ODBC, Hashed.
Active stages define how data is filtered and transformed: Transformer, Aggregator, Sort plug-in.
What is the Job control code Job control code used in job control routine to creating controlling job,
which invokes and run other jobs
You need to invoke a job from the command line that is multi-instance enabled. What is the correct syntax to start a multi-instance job?
dsjob -run -mode NORMAL project jobname.invocation_id (the invocation id appended after the job name identifies the instance)
A client must support multiple languages in selected Choose Unicode setting in the extended column attribute.
text columns when reading from DB2 database. Choose NVar/NVarchar as data types
Which two actions will allow selected columns to
support such data
Which two system variables/techniques must be used "@PARTITIONNUM",
in a parallel Transformer derivation to generate a "@NUMPARTITIONS"
unique sequence of integers across partitions
You are experiencing performance issues for a given Run job with $APT_TRACE_RUN set to true.
job. You are assigned the task of understanding what Review the objectives of the job
is happening at run time for the given job. What are
the first two steps you should take to understand the
job performance issues
Your customer asks you to identify which stages in a $APT_PM_PLAYER_TIMING
job are consuming the largest amount of CPU time.
Which product feature would help identify these
stages
Unix command to check datastage jobs running at ps -ef | grep phantom
server
Unix Command to check Datastage sessions running at the backend:
netstat -na | grep dsr
netstat -a | grep dsr
netstat -a | grep dsrpc
How to unlock a Datastage job: Clean Up Resources in Director
Clear Status File in Director
DS.Tools in Administrator
DS.Tools in Unix
Command to check the Datastage Job Status: dsjob -status
Dataset:
1. It preserves partitioning; it stores data on the nodes, so when you read from a dataset you do not have to repartition the data.
2. It stores data in binary in the internal format of DataStage, so it takes less time to read/write from a dataset to any other source/target.
3. You cannot view the data without DataStage.
4. It creates 2 types of files for storing the data:
A) Descriptor File : Which is created in defined folder/path.
B) Data File : Created in Dataset folder mentioned in configuration file.
5. A dataset (.ds) file cannot be opened directly; to view it you can use the Data Set Management utility in the client tools (such as Designer and Manager) or the command-line utility orchadmin.
Fileset:
1. It stores data in a format similar to that of a sequential file. The only advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.
2. You can view the data, but in the order defined by the partitioning scheme.
3. Fileset creates .fs file and .fs file is stored as ASCII format, so you could
directly open it to see the path of data file and its schema.
What are the main differences between the Lookup, Join and Merge stages? All are used to join tables, but note the differences.
Lookup: when the reference data is very small we use a lookup, because the data is stored in memory; if the reference data is very large then it will take time to load it and to perform the lookup.
Join: if the reference data is very large then we go for a join, because it accesses the data directly from disk, so the processing time will be less compared to a lookup. But in a join we cannot capture the rejected data, so we go for a merge.
Merge: if we want to capture rejected data (when the join key does not match) we use the Merge stage; for every detail (update) link there is a reject link to capture rejected data.
Significant differences that I have noticed are:
1) Number Of Reject Link
(Join) does not support reject link.
(Merge) has as many reject link as the update links (If there are n-input
links then 1 will be master link and n-1 will be the update link).
2) Data Selection
(Join) There are various ways in which data is being selected. e.g. we
have different types of joins inner outer( left right full) cross join etc. So
you have different selection criteria for dropping/selecting a row.
(Merge) Data in Master record and update records are merged only
when both have same value for the merge key columns
What are the different types of lookup? When should one use a sparse lookup in a job?
In DS 7.5 we have 2 types of lookup options available: 1. Normal 2. Sparse.
From DS 8.0.1 onwards, we have 3 types of lookup options available: 1. Normal 2. Sparse 3. Range.
Normal lookup: to perform this lookup the reference data is stored in memory first and then the lookup is performed, due to which it takes more execution time if the reference data is high in volume. A normal lookup takes the entire table into memory and performs the lookup there.
Sparse lookup: the SQL query is fired directly on the database for each related record, due to which execution is faster than a normal lookup; a sparse lookup performs the lookup directly at the database level, i.e. the reference link is directly connected to a DB2/OCI stage and a query is fired on the DB table one row at a time to fetch the result.
Range lookup: this helps you search records based on a particular range; it searches only that particular range of records and gives good performance instead of searching the entire record set, i.e. define the range expression by selecting the upper-bound and lower-bound range columns and the required operators.
For example:
Account_Detail.Trans_Date >= Customer_Detail.Start_Date AND
Account_Detail.Trans_Date <= Customer_Detail.End_Date
Use and Types of Funnel Stage in Datastage The Funnel stage is a processing stage. It copies multiple input data sets
to a single output data set. This operation is useful for combining
separate data sets into a single large data set. The stage can have any
number of input links and a single output link.
The Funnel stage can operate in one of three modes:
•Continuous Funnel combines the records of the input data in no
guaranteed order. It takes one record from each input link in turn. If
data is not available on an input link, the stage skips to the next link
rather than waiting.
•Sort Funnel combines the input records in the order defined by the
value(s) of one or more key columns and the order of the output records
is determined by these sorting keys.
•Sequence copies all records from the first input data set to the output
data set, then all the records from the second input data set, and so on.
For all methods the meta data of all input data sets must be identical.
Name of columns should be same in all input links
What is the difference between Link Sort and the Sort Stage (or between link sort and stage sort)?
If the volume of the data is low, then we go for link sort; if the volume of the data is high, then we go for the Sort stage.
Link sort uses scratch disk (a physical location on disk), whereas the Sort stage uses server RAM (memory); hence we can change the default memory size in the Sort stage.
Using the Sort stage you have the possibility to create a KeyChange column, which is not possible in link sort.
Within a Sort stage you have the possibility to increase the memory size per partition.
Within a Sort stage you can define the "don't sort" option on sort keys that are already sorted.
Link sort and stage sort both do the same thing. Only the Sort stage provides you with more options, like the amount of memory to be used, remove duplicates, sort in ascending or descending order, create key change columns, etc. These options are not available to you while using link sort.
what is main difference between change capture and Change Capture stage : compares two data set(after and before) and
change apply stages makes a record of the differences.
change apply stage : combine the changes from the change capture
stage with the original before data set to reproduce the after data set.
The Change Apply stage applies these changes back to those data sets based on the change code column.
Remove duplicates using Sort Stage and Remove We can remove duplicates using both stages but in the sort stage we can
Duplicate Stages and Diffrence capture duplicate records using create key change column property.
1)The advantage of using sort stage over remove duplicate stage is that
sort stage allows us to capture the duplicate records whereas remove
duplicate stage does not.
2) Using a sort stage we can only retain the first record.
Normally we want to retain the last record when we sort a particular field in ascending order and try to get the last record; the same can be achieved with the Sort stage by sorting in descending order and retaining the first record.
What is the use Enterprise Pivot Stage The Pivot Enterprise stage is a processing stage that pivots data
horizontally and vertically.
· Specifying a horizontal pivot operation : Use the Pivot Enterprise stage
to horizontally pivot data to map sets of input columns onto single
output columns.
Table 1. Input data for a simple horizontal pivot operation
REPID last_name Jan_sales Feb_sales Mar_sales
100 Smith 1234.08 1456.80 1578.00
101 Yamada 1245.20 1765.00 1934.22
Table 2. Output data for a simple horizontal pivot operation
REPID last_name Q1sales Pivot_index
100 Smith 1234.08 0
100 Smith 1456.80 1
100 Smith 1578.00 2
101 Yamada 1245.20 0
101 Yamada 1765.00 1
101 Yamada 1934.22 2
Specifying a vertical pivot operation: Use the Pivot Enterprise stage to
vertically pivot data and then map the resulting columns onto the
output columns.
Table 1. Input data for vertical pivot operation
REPID last_name Q_sales
100 Smith 1234.08
100 Smith 1456.80
100 Smith 1578.00
101 Yamada 1245.20
101 Yamada 1765.00
101 Yamada 1934.22
Table 2. Output data for vertical pivot operation
REPID last_name Q_sales (January) Q_sales1 (February) Q_sales2 (March) Q_sales_average
100 Smith 1234.08 1456.80 1578.00 1412.96
101 Yamada 1245.20 1765.00 1934.22 1648.14
What is a staging area? Do we need it? What is the Data staging is actually a collection of processes used to prepare source
purpose of a staging area system data for loading a data warehouse. Staging includes the following
steps:
Source data extraction, Data transformation (restructuring),
Data transformation (data cleansing, value transformations),
Surrogate key assignments
What are active transformation / Passive Active transformation can change the number of rows that pass through
transformations it. (decrease or increase rows)
Passive transformation can not change the number of rows that pass
through it
What is a Data Warehouse A Data Warehouse is the "corporate memory". Academics will say it is a
subject oriented, point-in-time, inquiry only collection of operational
data.
Typical relational databases are designed for on-line transactional
processing (OLTP) and do not meet the requirements for effective on-
line analytical processing (OLAP). As a result, data warehouses are
designed differently than traditional relational databases.
What is the difference between a data warehouse and This is a heavily debated issue. There are inherent similarities between
a data mart the basic constructs used to design a data warehouse and a data mart. In
general a Data Warehouse is used on an enterprise level, while Data
Marts is used on a business division/department level. A data mart only
contains the required subject specific data for local analysis.
What is the difference between a W/H and an OLTP Typical relational databases are designed for on-line transactional
application processing (OLTP) and do not meet the requirements for effective on-
line analytical processing (OLAP). As a result, data warehouses are
designed differently than traditional relational databases.
Warehouses are Time Referenced, Subject-Oriented, Non-volatile (read
only) and Integrated.
OLTP databases are designed to maintain atomicity, consistency and
integrity (the "ACID" tests). Since a data warehouse is not updated,
these constraints are relaxed.
What is the difference between OLAP, ROLAP, MOLAP and HOLAP? ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Processing) applications.
ROLAP stands for Relational OLAP. Users see their data organized in
cubes with dimensions, but the data is really stored in a Relational
Database (RDBMS) like Oracle. The RDBMS will store data at a fine grain
level, response times are usually slow.
How to print/display the last line of a file? The easiest way is to use the [tail] command.
$> tail -1 file.txt
If you want to do it using [sed] command, here is what you should write:
$> sed -n '$ p' test
From our previous answer, we already know that '$' stands for the last
line of the file. So '$ p' basically prints (p for print) the last line in
standard output screen. '-n' switch takes [sed] to silent mode so that
[sed] does not print anything else in the output.
How to display n-th line of a file The easiest way to do it will be by using [sed] I guess. Based on what we
already know about [sed] from our previous examples, we can quickly
deduce this command:
$> sed -n '<n> p' file.txt
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> sed -n '4 p' file.txt
Of course you can do it by using the [head] and [tail] commands as well, like below:
$> head -<n> file.txt | tail -1
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> head -4 file.txt | tail -1
How to remove the first line / header from a file We already know how [sed] can be used to delete a certain line from the
output – by using the'd' switch. So if we want to delete the first line the
command should be:
$> sed '1 d' file.txt
But the issue with the above command is, it just prints out all the lines
except the first line of the file on the standard output. It does not really
change the file in-place. So if you want to delete the first line from the
file itself, you have two options.
Either you can redirect the output of the file to some other file and then
rename it back to original file like below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or, you can use the inbuilt [sed] switch '-i', which changes the file in-place. See below:
$> sed -i '1 d' file.txt
How to remove the last line/ trailer from a file in Unix Always remember that [sed] switch '$' refers to the last line. So using
script this knowledge we can deduce the below command:
$> sed -i '$ d' file.txt
How to remove certain lines from a file in Unix: If you want to remove lines <m> to <n> from a given file, you can accomplish the task in a similar method to the one shown above. Here is an example:
$> sed -i '5,7 d' file.txt
The above command will delete line 5 to line 7 from the file file.txt
How to remove the last n-th line from a file This is bit tricky. Suppose your file contains 100 lines and you want to
remove the last 5 lines. Now if you know how many lines are there in the
file, then you can simply use the above shown method and can remove
all the lines from 96 to 100 like below:
$> sed -i '96,100 d' file.txt # alternative to the command [head -95 file.txt]
But not always you will know the number of lines present in the file (the
file may be generated dynamically, etc.) In that case there are many
different ways to solve the problem. There are some ways which are
quite complex and fancy. But let's first do it in a way that we can
understand easily and remember easily. Here is how it goes:
$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt
As you can see there are two commands. The first one (before the semi-
colon) calculates the total number of lines present in the file and stores
it in a variable called “tt”. The second command (after the semi-colon),
uses the variable and works in the exact way as shows in the previous
example
How to check the length of any line in a file We already know how to print one line from a file which is this:
$> sed –n ' p' file.txtWhere is to be replaced by the actual line number
that you want to print. Now once you know it, it is easy to print out the
length of this line by using [wc] command with '-c' switch.$> sed –n '35
p' file.txt | wc –c
The above command will print the length of 35th line in the file.txt.
How to get the nth word of a line in Unix Assuming the words in the line are separated by space, we can use the
[cut] command. [cut] is a very powerful and useful command and it's
real easy. All you have to do to get the n-th word from the line is issue
the following command:
$> cut -f<n> -d' '
The '-d' switch tells [cut] what the delimiter (or separator) in the file is, which is a space ' ' in this case. If the separator were a comma, we could have written -d',' instead. So, suppose I want to find the 4th word from the below string: "A quick brown fox jumped over the lazy cat"; we will do something like this:
$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '
And it will print “fox”
How to reverse a string in unix Pretty easy. Use the [rev] command.
$> echo "unix" | rev
xinu
How to get the last word from a line in Unix file We will make use of two commands that we learnt above to solve this.
The commands are [rev] and [cut]. Here we go.
Let's imagine the line is: “C for Cat”. We need “Cat”. First we reverse the
line. We get “taC rof C”. Then we cut the first word, we get 'taC'. And
then we reverse it again.
$>echo "C for Cat" | rev | cut -f1 -d' ' | rev
Cat
How to get the n-th field from a Unix command output: We know we can do it with [cut]. For example, the below command extracts the first field from the output of the [wc -c] command
$>wc -c file.txt | cut -d' ' -f1
109
But I want to introduce one more command to do this here. That is by
using [awk] command. [awk] is a very powerful command for text
pattern scanning and processing. Here we will see how may we use of
[awk] to extract the first field (or first column) from the output of
another command. Like above suppose I want to print the first column
of the [wc -c] output. Here is how it goes:
$>wc -c file.txt | awk ' ''{print $1}'
109
The basic syntax of [awk] is like this:
awk 'pattern space''{action space}'
The pattern space can be left blank or omitted, like below:$>wc -c
file.txt | awk '{print $1}'
109
In the action space, we have asked [awk] to take the action of printing
the first column ($1). More on [awk] later.
How to replace the n-th line in a file with a new line in Unix: This can be done in two steps. The first step is to remove the n-th line, and the second step is to insert a new line at the n-th line position. Here we go.
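A hedged sketch of both steps with GNU sed, replacing line 5 (the line number and the replacement text are illustrative):
$> sed -i '5 d' file.txt                           # step 1: delete the 5th line
$> sed -i '4 a\This is the new 5th line' file.txt  # step 2: append the new text after line 4, i.e. at position 5
A one-step alternative is the change command: $> sed -i '5 c\This is the new 5th line' file.txt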
How to show the non-printable characters in a file Open the file in VI editor. Go to VI command mode by pressing [Escape]
and then [:]. Then type [set list]. This will show you all the non-printable
characters, e.g. Ctrl-M characters (^M) etc., in the file
Diff b/w Compile and Validate Compile option only checks for all mandatory requirements like link
requirements, stage options and all. But it will not check if the database
connections are valid.
Validate is equivalent to Running a job except for extraction/loading of
data. That is, validate option will test database connectivity by making
connections to databases
Field mapping using a Transformer stage:
Requirement: the field will be right justified, zero filled; take the last 18 characters.
Right("0000000000" : Trim(Lnk_Xfm_Trans.link), 18)
We have a source which is a sequential file with a header and footer. How do we remove the header and footer while reading this file using the Sequential File stage of DataStage?
Type this command in PuTTY: sed '1d;$d' file_name > new_file_name (run it in the before-job subroutine, then use the new file in the Sequential File stage).