Aggregate and rollup perform the same action, but rollup displays intermediate results in main memory, whereas Aggregate does not support intermediate results.
What kinds of layouts does Ab Initio support?
Basically, Ab Initio supports serial and parallel layouts, and a graph can use both at the same time. A parallel layout depends on the degree of data parallelism: if the multifile system is 4-way parallel, a component in a graph can run 4-way parallel when its layout is defined with the same degree of parallelism.
graph there should be call to abc.ksh.
A lookup file consists of data records that can be held in main memory. This lets a transform function retrieve records much faster than retrieving them from disk, and lets the transform component process the data records of multiple files quickly.
How do you truncate a table?
From Ab Initio, use the Run SQL component with the DDL statement "truncate table", or use the Truncate Table component.
/* string_lrtrim is used to trim leading and trailing spaces */
In an RDBMS, the relationship between two tables is represented as a primary key and foreign key relationship. The table with the primary key is the parent table and the table with the foreign key is the child table. The requirement is that both tables share a matching column.
What is the difference between clustered and non-clustered indices, and why do you use a clustered index?
What is an outer join?
An outer join is used when one wants to select all the records from a port, whether or not they satisfy the join criteria.
without join key. Key should be {}.
What is the purpose of having stored procedures in a database?
The main purpose of a stored procedure is to reduce network traffic: all of its SQL statements execute inside the database, so it also runs faster.
Why might you create a stored procedure with the "with recompile" option?
Recompile is useful when the tables referenced by the stored procedure undergo a lot of modification, deletion, or addition of data. Under heavy modification activity the execution plan becomes outdated and the stored procedure's performance goes down. If we create the stored procedure with the recompile option, SQL Server won't cache a plan for it, and it will be recompiled every time it is run.
Describe the elements you would review to ensure multiple scheduled batch jobs do not collide with each other.
Review the dependencies between the jobs, because every job may depend upon another job. For example, the next job should execute only if the first job's result is successful; otherwise it must not run.
Have you worked with packages?
Ans: Multistage transform components use packages by default. However, a user can create their own set of functions in a transform function and include it in other transform functions.
Have you used the rollup component? Describe how.
Ans: If the user wants to group the records on particular field values, then rollup is the best way to do that. Rollup is a multi-stage transform function and it contains the following mandatory functions:
1. initialise
2. rollup
3. finalise
You also need to declare one temporary variable if you want to get counts for a particular group.
For each group, it first calls the initialise function once, then calls the rollup function for each of the records in the group, and finally calls the finalise function once after the last rollup call.
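The calling sequence described above can be sketched in Python rather than Ab Initio DML. This is only an illustration under assumptions: the three functions mirror the stage names from the answer, while the record fields (`cust`, `amount`), the temporary record, and the `run_rollup` driver are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def initialise(first_record):
    # Called once per group: set up the temporary (accumulator) record.
    return {"count": 0, "total": 0}

def rollup(temp, record):
    # Called once for every record in the group.
    temp["count"] += 1
    temp["total"] += record["amount"]
    return temp

def finalise(temp, last_record):
    # Called once after the last rollup call: build the output record.
    return {"cust": last_record["cust"], "count": temp["count"], "total": temp["total"]}

def run_rollup(records, key):
    out = []
    # groupby needs key-sorted input, so sort first.
    ordered = sorted(records, key=itemgetter(key))
    for _, group in groupby(ordered, key=itemgetter(key)):
        group = list(group)
        temp = initialise(group[0])
        for rec in group:
            temp = rollup(temp, rec)
        out.append(finalise(temp, group[-1]))
    return out

records = [
    {"cust": "A", "amount": 10},
    {"cust": "B", "amount": 5},
    {"cust": "A", "amount": 7},
]
result = run_rollup(records, "cust")
```

The `count` field plays the role of the temporary variable mentioned in the answer: it exists only between initialise and finalise.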
How do you add default rules in transformer?
Ans: 1) If it is not already displayed, display the Transform Editor Grid.
2) Click the Business Rules tab if it is not already displayed.
Add Default Rules opens the Add Default Rules dialog. Select one of the following:
Match Names: generates a set of rules that copies input fields to output fields with the same name.
Use Wildcard (.*) Rule: generates one rule that copies input fields to output fields with the same name.
In the case of reformat, if the destination field names are the same as, or a subset of, the source fields, then there is no need to write anything in the reformat xfr, provided you only want to reduce the set of fields or split the flow into a number of flows and need no other real transform.
What is the difference between partitioning with key and round robin?
Ans: Partition by Key (hash partition) is a partitioning technique used to partition data when the keys are diverse. If one key value is present in large volume, there can be large data skew. Even so, this method is used more often for parallel data processing.
Round robin partition is another partitioning technique that distributes the data uniformly across the destination data partitions. The skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4 players in a round-robin manner.
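The contrast can be illustrated with a small Python sketch. It is not Ab Initio code: the helper names, the use of `zlib.crc32` as the hash, and the lopsided "deck" (40 records of one key value, 12 of another) are assumptions chosen to make the skew visible.

```python
import zlib

def partition_by_key(records, key, n):
    # The same key value always lands in the same partition.
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[zlib.crc32(str(rec[key]).encode()) % n].append(rec)
    return parts

def partition_round_robin(records, n):
    # Record i goes to partition i mod n, like dealing cards.
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

# Hypothetical data: 52 records, but only two distinct key values.
deck = [{"card": i, "suit": "hearts" if i < 40 else "spades"} for i in range(52)]

rr_sizes = [len(p) for p in partition_round_robin(deck, 4)]
key_parts = partition_by_key(deck, "suit", 4)
key_sizes = [len(p) for p in key_parts]
```

Round robin gives four partitions of 13 records each, while the key-based partitions are uneven because one key value dominates. In exchange, key partitioning keeps all records with the same key together, which is what key-based components such as join or rollup rely on.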
How do you improve the performance of a graph?
Ans: There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum values of max core for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join components and, if possible, replace them by in-memory join/hash join
5) Use only the required fields in the sort, reformat, and join components
6) Use phasing/flow buffers in case of merge and sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with the proper driving port
8) For a large dataset don't use broadcast as partitioner
9) Minimise the use of regular expression functions like re_index in the transform functions
10) Avoid repartitioning of data unnecessarily
Try to run the graph as long as possible in MFS. For this, the input files should be partitioned and, if possible, the output file should also be partitioned.
What is the difference between look-up file and look-up, with a relevant example?
Ans: Generally, a lookup file represents one or more serial files (flat files). The amount of data is small enough to be held in memory, which allows transform functions to retrieve records much more quickly than they could from disk.
A lookup is a component of an Ab Initio graph where we can store data and retrieve it by using a key parameter. A lookup file is the physical file where the data for the lookup is stored.
How many components are in your most complicated graph?
Ans: It depends on the type of components you use. Usually, avoid using very complicated transform functions in a graph.
Explain what is lookup?
Ans: A lookup is basically a specific dataset which is keyed. It can be used to map values as per the data present in a particular file (serial or multifile). The dataset can be static as well as dynamic (when the lookup file is generated in a previous phase and used as a lookup file in the current phase). Sometimes hash joins can be replaced by a reformat with a lookup, if one of the inputs to the join contains a small number of records with a slim record length.
Ab Initio has built-in functions to retrieve values using the key for the lookup.
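The idea of replacing a join with a reformat plus lookup can be sketched in Python, with a dictionary standing in for the in-memory keyed dataset. The function names and the sample fields (`code`, `name`, `country_name`) are hypothetical, and this does not use or imitate Ab Initio's own lookup functions.

```python
def build_lookup(records, key):
    # Load the (small) lookup dataset into memory, indexed by its key.
    return {rec[key]: rec for rec in records}

def reformat_with_lookup(records, lookup, key):
    # Enrich each input record from the lookup instead of running a join.
    out = []
    for rec in records:
        match = lookup.get(rec[key])
        out.append({**rec, "country_name": match["name"] if match else None})
    return out

countries = [{"code": "IN", "name": "India"}, {"code": "US", "name": "United States"}]
orders = [{"id": 1, "code": "US"}, {"id": 2, "code": "XX"}]
enriched = reformat_with_lookup(orders, build_lookup(countries, "code"), "code")
```

This only pays off when the keyed dataset is small enough to sit in memory, which is the "small number of records with a slim record length" condition in the answer.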
What is a ramp limit?
Ans: The limit parameter contains an integer that represents a number of reject events.
The ramp parameter contains a real number that represents a rate of reject events in the number of records processed.
Number of bad records allowed = limit + (number of records * ramp).
Ramp is basically a percentage value (from 0 to 1).
These two together provide the threshold value of bad records.
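The threshold formula can be checked with a few lines of Python. The function names are made up for illustration, and treating "more rejects than the threshold" as the abort condition is an assumption; the source only gives the formula itself.

```python
def bad_records_allowed(limit, ramp, records_processed):
    # Threshold from the answer: limit + number of records * ramp.
    return limit + records_processed * ramp

def should_abort(rejects, limit, ramp, records_processed):
    # Assumed rule: abort once the rejects seen so far exceed the threshold.
    return rejects > bad_records_allowed(limit, ramp, records_processed)

# With limit = 50 and ramp = 0.01, the allowance grows by 1 for every
# 100 records processed.
allowed_after_10000 = bad_records_allowed(50, 0.01, 10000)
```

With ramp set to 0 the threshold is the fixed limit, and with limit set to 0 it is a pure rate.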
What is the latest version that is available in Ab-initio?
How to take the input data from an excel sheet?
Which one is faster for processing fixed length dmls or delimited dmls and why ?
What is the use of Aggregate when we have Rollup? Since the Rollup component in Ab Initio is used to summarize groups of data records, where would we use Aggregate?
Describe the process steps you would perform when defragmenting a data table. This table contains mission critical data.
When running a stored procedure definition script how would you guarantee the definition could be rolled back in the
event of problems.
Describe how you would ensure that database object definitions (Tables, Indices, Constraints, Triggers, Users, Logins,
Connection Options, and Server Options etc) are consistent and repeatable between multiple database instances (i.e.: a
test and production copy of a database).
What is a cursor? Within a cursor, how would you update fields on the row just fetched?
How would you find out whether a SQL query is using the indices you expect?
When using multiple DML statements to perform a single unit of work, is it preferable to use implicit or explicit transactions,
and why.
Describe the elements you would review to ensure multiple scheduled batch jobs do not collide with each other.
What is semi-join
What do you have to give the value for the Record Required parameter for a natural join?
What is Adhoc File System? Give me a scenario where you used it.
What are the different commands that you used when writing wrappers?
What do the hidden files in a sandbox represent and what does start.ksh represent?
What is the difference between sandbox and EME, can we perform checkin and checkout through sandbox/ Can anybody
explain checkin and checkout?
What are different things that you have to consider when loading data into a table?
What are differences between different GDE versions(1.10,1.11,1.12,1.13and 1.15)? What are differences between different
versions of Co-op?
What are the different versions and releases of ABinitio (GDE and Co-op version)
How to create a computer program that computes the monthly interest charge on a credit card account?
What is the difference between partitioning with key and round robin?
Can anyone tell me what happens when the graph runs? i.e. the Co>Operating System will be at the host, and we are running the graph at some other place. How the ...
How would you do performance tuning for an already built graph? Can you let me know some examples?
How to execute the graph from start to end stages? And how to run a graph on a non-Ab Initio system?
What are the most commonly used components in an Ab Initio graph? Can anybody give me a practical example of a transformation of data, say customer data in a credit card company, into meaningful output based on business rules?
Please let me know whether we have ab initio GDE version 1.14 and what is the latest GDE version and Co-op version?
Please give us insight on Enterprise Meta Environment, and some possible questions on that.
What error would you get when you use Partition by Round Robin and Join?
Have you ever encountered an error called depth not equal? (This occurs when you extensively create graphs; it is a trick question.)
What is the difference between look-up file and look-up, with a relevant example?
In which scenarios would you use Partition by Key and also, Partition by Round Robin and differences between the both?
What are the different dimension tables that you used and some columns in the fact table?
What is m_dump
What is the function you would use to transfer a string into a decimal?
For data parallelism, we can use partition components. For component parallelism, we can use replicate component. Like
this which component(s) can we use for pipeline parallelism?
What is meant by Co>Operating System and why is it special for Ab Initio?
How do you retrieve data from a database into a source, and which component is used for this?
How do we run sequences of jobs, where the output of job A is the input to job B? How do we coordinate the jobs?
What are the compilation errors you came across while executing your graphs?
What is depth_error?
Difference between conventional loading and direct loading ? When it is used in real time .
During the execution of graph, let us say you lost the network connection, would you have to start the process all over
again or does it start from where it stopped?
Define Multi file system. Can you create multifile system on the same server? Also, if you have a table that has Name,
Address, Status, Position attributes, can Name and Address be on one partition and Status and Position in the other
partition?
What is a sandbox? Did the co-operating system version 2.8 have sandbox, if not how would you store the respective files?
How did you do version control? Which tool did you use?
What are the usual errors that you encounter during ETL process apart from compilation process?
Were you involved in production support? What were the different kinds of problems that you encountered?
How do you count the number of records in a multifile system without using GDE?
What does Scan and Rollup component do and give a scenario where you used them?
Did you ever used user defined functions or packages? If yes, give a scenario.
Sometimes you have to use dynamic length strings. Can you give me one circumstance where you need it?
Why might you create a stored procedure with the with recompile option?
How to schedule graphs in Ab Initio, like workflow scheduling in Informatica? And where must we use Unix shell scripting in Ab Initio?
2 :: When using multiple DML statements to perform a single unit of work, is it preferable to use implicit or explicit transactions, and why?
Explicit, because implicit transactions are used for internal processing, while explicit transactions are opened by the user for the data required.
7 :: Describe how you would ensure that database object definitions (Tables, Indices, Constraints, Triggers, Users, Logins, Connection Options, and Server Options etc.) are consistent and repeatable between multiple database instances (i.e. a test and production copy of a database)?
Take an entire database backup and restore it in a different instance. Take statistics of all valid and invalid objects and match them. Refresh periodically.
8 :: How would you find out whether a SQL query is using the indices you expect?
Review the query's execution plan with EXPLAIN PLAN. The plan shows whether the expected indexes are used or not.
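As a runnable illustration of reading an execution plan, the sketch below uses SQLite through Python's standard `sqlite3` module as a stand-in; other databases have their own plan commands and output formats. The table, index name, and `plan_for` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

def plan_for(sql, params=()):
    # Each plan row ends with a human-readable detail string.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

indexed_plan = plan_for("SELECT * FROM orders WHERE customer_id = ?", (42,))
scan_plan = plan_for("SELECT * FROM orders WHERE amount > ?", (1,))

uses_index = any("idx_orders_customer" in step for step in indexed_plan)
scan_uses_index = any("idx_orders_customer" in step for step in scan_plan)
```

The first query filters on the indexed column, so its plan names the index; the second filters on an unindexed column and falls back to scanning the table.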
10 :: When running a stored procedure definition script how would you guarantee the definition could be rolled back in the event of problems?
Quite a few factors determine the approach: what type of version control is used, the size of the change, the impact of the change, whether it is a new procedure or a replacement for an existing one, and so on.
If it is new, then just drop the wrong one.
If it is a replacement, consider how big the change is and what its possible impact will be. Depending on that, you can have the entire database backed up, or just create a script of your original procedure before changing it, or simply edit the file back to the original and reapply it. You may also rename the old procedure as old and then work on the new one.
A few issues to keep in mind are synonyms, dependencies, grants, and any job calling the procedure at the time of the change. In a nutshell, the scenarios vary and so do the solutions.
12 :: Describe the process steps you would perform when defragmenting a data table. This table contains mission critical data?
There are several ways to do this:
1) We can move the table in the same or another tablespace and rebuild all the indexes on the table.
alter table <table_name> move tablespace <tablespace_name>
This activity reclaims the fragmented space in the table.
analyze table <table_name> compute statistics
This captures the updated statistics.
2) A reorg could be done by taking a dump of the table, truncating the table, and importing the dump back into the table.
14 :: What is a cursor? Within a cursor, how would you update fields on the row just fetched?
The Oracle engine uses work areas for its internal processing in order to execute a SQL statement; such a work area is called a cursor. There are two types of cursors: implicit and explicit. An implicit cursor is used for internal processing, and an explicit cursor is opened by the user for the data required. To update the row just fetched, declare the cursor with FOR UPDATE and use UPDATE ... WHERE CURRENT OF <cursor_name>.
The main difference between DML and XFR is that DML represents the format of the metadata, whereas XFR represents the transform functions, which contain the business rules.
A simple example is to use the start script. In the start script, give an export such as:
export DT=`date '+%m%d%y'`
Now the variable DT will have today's date before the graph is run. Somewhere in the graph transform we can then use this variable as:
out.process_dt::$DT;
which provides the value from the shell.
30 :: What are the differences between different GDE versions (1.10, 1.11, 1.12, 1.13 and 1.15)?
... is a non-key version and the rest are key versions.
Broadcast - Takes data from multiple inputs, combines it, and sends it to all the output ports.
Eg - You have 2 incoming flows (this can be data parallelism or component parallelism) on the Broadcast component, one with 10 records and the other with 20 records. Then all the outgoing flows (there can be any number of them) will have 10 + 20 = 30 records.
Replicate - It replicates the data of a particular partition and sends it out to multiple out ports of the component, but maintains the partition integrity.
Eg - Your incoming flow to Replicate has a data parallelism level of 2, with one partition having 10 records and the other having 20 records. Now suppose you have 3 output flows from Replicate. Then each flow will have 2 data partitions with 10 and 20 records respectively.
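The two examples above can be reproduced with a short Python sketch. Lists of lists stand in for partitioned flows; the function names and data are illustrative only and say nothing about how the real components are implemented.

```python
def broadcast(input_partitions, n_out):
    # Combine all incoming partitions and send the full set to every output.
    combined = [rec for part in input_partitions for rec in part]
    return [list(combined) for _ in range(n_out)]

def replicate(input_partitions, n_out):
    # Copy the flow to every output while keeping each partition separate.
    return [[list(part) for part in input_partitions] for _ in range(n_out)]

# Two partitions holding 10 and 20 records.
flows = [list(range(10)), list(range(100, 120))]

broadcast_sizes = [len(out) for out in broadcast(flows, 3)]
replicate_sizes = [[len(part) for part in out] for out in replicate(flows, 3)]
```

Every broadcast output carries all 30 records, while every replicate output still has two partitions of 10 and 20 records.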
35 :: What is m_dump?
The m_dump command prints the data in a formatted way.
You can use $mpjret in the end script, like:
if [ "$mpjret" -eq 0 ]; then
    echo "success"
else
    mailx -s "[graphname] failed" mailid
fi
25 :: I am unable to connect to the server database (Oracle) from the GDE (db config file) on my local system, although I have set everything. What should I check?
First check the properties in Internet Options, and then from the command prompt try telnet against the Ab Initio host IP address.
Skew is the measure of the flow of data to each partition.
Suppose the input is coming from 4 files and the total size is roughly 1 GB:
1 GB = (100 MB + 200 MB + 300 MB + 500 MB), approximately
1000 MB / 4 = 250 MB
(100 - 250) / 500 = -150 / 500
Calculate it yourself; it will come out as a negative value. Calculate for 200, 500, 300 as well.
A positive value of skew is always desirable.
http://en.wikipedia.org/wiki/Ab_Initio
http://www.abinitio.com
http://www.patents.com/Ab-Initio-SoftwareCorporation/Lexington/MA/301339/company/
http://www.bi-nerd.com/ab-initio-the-dark-horse-of-etl/
http://www.linkedin.com/companies/ab-initio
Ab Initio is a private company, its main offices are in Lexington, Massachusetts (near Boston, USA - since 1994),
but they have offices all over the world (as you can see on their web site). They have very good talented devoted
people. I've heard that when you are calling their customer service - there is a 75% chance that you will speak
with a Ph.D.. It may very well be true. The company was formed by former employees of the Thinking Machines
Corporation. Some key people: Craig W. Stanfill, Richard A. Shapiro, Stephen A. Kukolich.
Ab Initio also uses its own people as well as independent consulting firms to build proof of concept for a client, and
then to guide clients in using their tools.
Unfortunately Ab Initio provides very little information about their solutions to the general public. So, without getting into details, most Ab Initio functionality can be scripted using several commands which you can give from the prompt (with many options):
m_* commands (for example, m_shutdown, m_mkfs, m_cp, etc.) are used for administering
air ... (some options) - to work with the EME (basically a specialized version control system)
"Co>Operating System"
"Component Library"
"Data Profiler"
"Conduct>It"
The main power of Ab Initio, parallelism, is achieved via its "Co>Operating System", which provides the facilities for parallel execution (multiple CPUs and/or multiple boxes), platform-independent data transport, checkpointing, and process monitoring. A lot of attention is devoted to monitoring resources (CPU, memory). Multi-file, multi-directory.
Component Library - a set of software modules to perform sorting, data transforming, and high-speed data loading and unloading tasks.
Ab Initio tools incorporate best practices, such as check-pointing, rerunnability, tagging everything with unique Ids, etc.
Unfortunately Ab Initio doesn't advertise or publish any information. So there are just bits and pieces here and
there. Here is an interesting blog:
http://www.geekinterview.com/Interview-Questions/Data-Warehouse/Abinitio
Phases vs Checkpoints
Ans: Phases are used to break the graph into pieces. Temporary files created during a phase will be deleted after its completion. Phases are used to manage the resource-consuming (memory, CPU, disk) parts of the application separately and effectively.
Checkpoints are created for recovery purposes. These are points where everything is written to disk. You can recover to the latest saved point and rerun from it.
You can have phase breaks with or without checkpoints.
xfr
Ans: A new sandbox will have many directories: mp, dml, xfr, db, ... xfr is a directory where you put files with the extension .xfr containing your own custom functions (and then use: include "somepath/xfr/yourfile.xfr"). Usually XFR stores mapping.
three types of parallelism
Ans: ... the graph) 3) Pipeline (sequential).
Multi-File System (MFS)
Memory requirements of a graph
Ans: Add the size of lookup files used in the phase (if multiple components use the same lookup, only count it once).
How to calculate a SUM
Ans: SCAN, ROLLUP, SCANWITHROLLUP, or Scan followed by Dedup Sort selecting the last record.
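The relationship between these options can be shown in Python: a scan emits a running total for every record, a rollup emits one total per key, and keeping only the last scan record per key gives the rollup result. The function names and sample rows are hypothetical, and the input is assumed to be already sorted by key.

```python
from itertools import accumulate, groupby
from operator import itemgetter

rows = [("A", 10), ("A", 5), ("B", 7), ("B", 1), ("B", 2)]  # sorted by key

def scan_sum(rows):
    # Scan: one output per input record, carrying the running total per key.
    out = []
    for key, group in groupby(rows, key=itemgetter(0)):
        for running in accumulate(value for _, value in group):
            out.append((key, running))
    return out

def rollup_sum(rows):
    # Rollup: one output per key, carrying the group total.
    return [(key, sum(v for _, v in group))
            for key, group in groupby(rows, key=itemgetter(0))]

def scan_then_keep_last(rows):
    # Scan followed by a dedup that keeps the last record of each key.
    last = {}
    for key, running in scan_sum(rows):
        last[key] = running
    return list(last.items())
```

Here `scan_then_keep_last(rows)` and `rollup_sum(rows)` produce the same list, which is why the scan-plus-dedup route is listed as an alternative to rollup.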
dedup sort with null key
Ans: If we don't use any key in the sort component while using the dedup sort, then the output depends on the keep parameter.
join on partitioned flow
Ans: file1 (A,B,C), file2 (A,B,D). We partition both files by "A", and then join by "A,B". Is it OK? Or should we partition by "A,B"? Not clear.
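One way to reason about the question: two records that agree on (A, B) necessarily agree on A, so after partitioning both inputs by A every matching pair sits in the same partition. The Python sketch below checks this on made-up data. It only addresses correctness of the result, not performance or any Ab Initio-specific behavior, and the `partition`, `join`, and `canon` helpers are hypothetical.

```python
def partition(rows, key_fields, n):
    # Hash-partition rows on the given key fields.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(tuple(row[f] for f in key_fields)) % n].append(row)
    return parts

def join(left, right, key_fields):
    # Plain inner join on the given key fields.
    index = {}
    for r in right:
        index.setdefault(tuple(r[f] for f in key_fields), []).append(r)
    return [{**l, **r}
            for l in left
            for r in index.get(tuple(l[f] for f in key_fields), [])]

file1 = [{"A": a, "B": b, "C": a * 10 + b} for a in range(6) for b in range(3)]
file2 = [{"A": a, "B": b, "D": a + b} for a in range(6) for b in range(0, 3, 2)]

# Join everything in one place, then join partition by partition.
serial = join(file1, file2, ["A", "B"])
parts1 = partition(file1, ["A"], 4)
parts2 = partition(file2, ["A"], 4)
parallel = [row for p1, p2 in zip(parts1, parts2) for row in join(p1, p2, ["A", "B"])]

def canon(rows):
    # Order-independent form for comparing the two results.
    return sorted(tuple(sorted(r.items())) for r in rows)
```

In this sketch the partition-wise join returns the same rows as the serial join. Partitioning by "A,B" would also be correct and may spread the data more evenly when A has few distinct values.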
checkin, checkout
Ans: You can do checkin/checkout using the wizard right from the GDE, using versions and tags.
how to have different passwords for QA and production
How to get records 50-75 out of 100
How to convert a serial file into FFS
project parameters vs. sandbox parameters
Ans: When you check out a project into your sandbox, you get project parameters. Once in your sandbox, you can refer to them as sandbox parameters.
BadStraightflow
merging graphs
Ans: You cannot merge two Ab Initio graphs. You can use the output of one graph as input for another. You can also copy/paste the contents between graphs. See also about using .plan
partitioning, repartitioning, departitioning
lookup file
Ans: For large amounts of data use an MFS lookup file (instead of serial).
indexing
Environment project
Aggregate vs Rollup
EME, GDE, Co-operating system
Continuous components
Ans: Continuous components produce useful output while running continuously. For example: Continuous rollup, Continuous update, batch subscribe.
deadlock
Ans: A deadlock occurs when two or more processes each wait for a resource held by another, so none of them can proceed. To avoid it, use phasing and resource pooling.
environment
wrapper script
multistage component
Dynamic DML
lock
Ans: A user can lock the graph for editing so that others will see the message and cannot edit the same graph.
join vs lookup
Ans: Lookup is good for speed with small files (it loads the whole file into memory). For large files use join. You may need to increase the max-core limit to handle big joins.
multi update
scheduler
Api and Utility modes in input table
Ans: These are database interfaces (api uses SQL; utility uses bulk loads, whatever the vendor provides).
lookup file
Calling stored proc in DB
Ans: You can call a stored procedure (for example, from an input component). In fact, you can even write a stored procedure in Ab Initio. Make it "with recompile" to assure good performance.
Frequently used functions
data validation
driving port
Ans: When joining inputs (in0, in1, ...), one of the ports is used as the "driving" port (by default, in0). The driving input is usually the largest one, whereas the smallest can have the "Sorted-Input" parameter set to "Input need not be sorted" because it will be loaded completely into memory.
Ab Initio vs Informatica for ETL
Ans: ... amounts of data, easy to build and run. Generates scripts which can be easily modified as needed (if something couldn't be done in the ETL tool itself). The scripts can be easily scheduled using any external scheduler and easily integrated with other systems.
Ab Initio doesn't require a dedicated administrator.
override key
Ans: The override key option is used when we need to join 2 fields which have different field names.
control file
Ans: The control file should be in the multifile directory (it contains the addresses of the serial files).
max-core
Ans: The max-core parameter (for example, sort 100 MBytes) specifies the amount of memory used by a component (like Sort or Rollup), per partition, before spilling to disk. Usually you don't need to change it; just use the default value. Setting it too high may degrade performance because of OS swapping, and may degrade the performance of other components.
Input Parameters
Ans: graph > select parameters tab > click "create" - and create a parameter. Usage: $paramname. Edit > parameters. These parameters will be substituted during run time. You may need to declare your parameter scope as formal.
Error Trapping
Ans: Each component has reject, error, and log ports. Reject captures rejected records, error captures the corresponding error, and log captures the execution statistics of the component. You can control the reject status of each component by setting the reject threshold to Never Abort, Abort on first reject, or by setting ramp/limit. You can also use the force_error() function in a transform function.
How to see resource usage
Ans: In the GDE go to View > Tracking Details; you will see each component's CPU and memory usage, etc.
assign keys component
Join in DB vs join in Ab Initio
Join with DB
Data Skew
Ans: skew = (partition size - avg. partition size) * 100 / (size of the largest partition)
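The skew formula translates directly into a few lines of Python. The function name and the sample partition sizes are made up for illustration.

```python
def skew_percent(partition_sizes):
    # skew = (partition size - average partition size) * 100 / largest partition
    average = sum(partition_sizes) / len(partition_sizes)
    largest = max(partition_sizes)
    return [(size - average) * 100 / largest for size in partition_sizes]

# Four partitions of 100, 200, 300 and 400 MB: the average is 250 MB.
skews = skew_percent([100, 200, 300, 400])
balanced = skew_percent([250, 250, 250, 250])
```

Partitions smaller than the average get a negative skew, larger ones a positive skew, and a perfectly balanced layout gives zero for every partition.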
dbc vs cfg
Ans: .dbc - database configuration file (dbname, nodes, version, user/pwd); resides in the db directory.
.cfg - any type of config file, for example a remote connection config (name of remote server, user/pwd to connect to db, location of OS on remote machine, connection method). The .cfg file resides in the config dir.
types of partitions
unused port
tuning performance
Ans: Use Ad Hoc MFS to read many serial files in parallel, and use the concat component.
Use lookup local rather than lookup (especially for big lookups).
When getting data from a database, make sure your queries are fast (use indexes, etc.). If possible, do the necessary selection / aggregation / sorting in the database before getting data into Ab Initio.
Components like join / rollup should have the option "Input must be sorted" if they are placed after a sort component.
delta table
scan vs rollup
packages
Reformat vs "Redefine Format"
Conditional DML
SORTWITHINGROUP
passing a condition as a parameter
Passing file name as a parameter
Ans:
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
# Export script parameter 1 as INPUT_FILE_NAME
export INPUT_FILE_NAME="$1"
# This graph uses the input file.
cd $AI_RUN
./my_graph1.ksh
# This graph also uses the input file.
./my_graph2.ksh
exit 0
How to remove header and trailer lines?
Ans: Use conditional DML, where you can separate detail from header and trailer. For validations, use reformat with count: 3 (out0: header, out1: detail, out2: trailer).
How to create a multi file system on Windows
Vector
Dependency Analysis
Surrogate key
Ans: There are many ways to create a surrogate key. For example, you can use the next_in_sequence() function in your transform. Or you can use the "Assign key values" component. Or you can write a stored procedure and call it.
Note: if you use partitions, then do something like this:
(next_in_sequence()-1)*no_of_partition()+this_partition()
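To see why the partitioned formula yields unique keys, it can be simulated in Python. The sketch assumes that the sequence starts at 1 independently in each partition and that partition numbers start at 0; both are assumptions made for this illustration, and the helper names are hypothetical.

```python
def surrogate_key(next_in_sequence, no_of_partitions, this_partition):
    # Formula from the note: (next_in_sequence - 1) * partitions + partition index.
    return (next_in_sequence - 1) * no_of_partitions + this_partition

def keys_for(partitions, records_per_partition):
    keys = []
    for part in range(partitions):                        # assumed 0-based
        for seq in range(1, records_per_partition + 1):   # assumed to start at 1
            keys.append(surrogate_key(seq, partitions, part))
    return keys

# 4 partitions with 5 records each.
keys = keys_for(4, 5)
```

Each partition strides through the key space in steps of the partition count, offset by its own index, so no two partitions ever produce the same value.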
.abinitiorc
.profile
Ans: Your ksh init file (environment, aliases, path variables, history file settings, command prompt settings, etc.).
data mapping, data modelling
How to execute the graph
Ans: From the GDE, the whole graph or by phases. From a checkpoint. Also using ksh scripts.
Write Multiple Files
Testing
Ans: Run the graph and see the results. Use components from the Validate category.
Sandbox vs EME
Ans: The sandbox is your private area where you develop and test. Only one project and one version can be in the sandbox at any time. The EME Datastore contains all versions of the code that have been checked into it (source control).
Layout
Ans: Where the data files are and where the components are running. For example, for data: serial or partitioned (multi-file). The layout is defined by the location of the file (or a control file for the multifile). In the graph the layout can propagate automatically (for multifile you have to provide details).
Latest versions
Graph parameters
Ans: The menu Edit > Parameters allows you to specify private parameters for the graph. They can be of 2 types: local and formal.
Plan>It
Ans: You can define pre- and post-processes and triggers. You can also specify methods to run on success or on failure of the graphs.
Frequently used components
Ans: lookup / lookup_local
reformat
gather / concatenate
join
runsql
join with db
compression components
filter by expression
rollup
trash
running on hosts
conventional loading vs direct loading
semi-join
Ans: Ab Initio online help gives 3 examples of joins: inner join, outer join, and semi join.
For inner join, the 'record_requiredN' parameter is true for all "in" ports.
For semi join, it is true for both ports (like inner join), but the dedup option is set only on one side.
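The three join types can be compared with a small Python sketch. It models the semantics only: the function names and sample records are hypothetical, and the semi join is written as "deduplicate one side on the key, then keep matching records from the other side", following the description above.

```python
def inner_join(in0, in1, key):
    # A record is required on both ports.
    return [(a, b) for a in in0 for b in in1 if a[key] == b[key]]

def full_outer_join(in0, in1, key):
    # Matches plus the unmatched records from either port.
    out = inner_join(in0, in1, key)
    keys0 = {a[key] for a in in0}
    keys1 = {b[key] for b in in1}
    out += [(a, None) for a in in0 if a[key] not in keys1]
    out += [(None, b) for b in in1 if b[key] not in keys0]
    return out

def semi_join(in0, in1, key):
    # Dedup in1 on the key first, so each in0 record appears at most once.
    keys1 = {b[key] for b in in1}
    return [a for a in in0 if a[key] in keys1]

customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"id": 1, "amt": 5}, {"id": 1, "amt": 7}, {"id": 4, "amt": 9}]

inner = inner_join(customers, orders, "id")
outer = full_outer_join(customers, orders, "id")
semi = semi_join(customers, orders, "id")
```

Customer 1 appears twice in the inner join because it has two orders, but only once in the semi join, which is the effect of deduplicating one side.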