
Part Two

By Riman Bhaumik
We Will Learn

Lookup
ICFF
Priority Assignment
Instrumentation Parameters
Diagnostic Ports
max-core
NULL in Ab Initio
PDL (Parameter Definition Language)
Component Folding
Dynamic Script Generation
In-memory v/s Sorted
Broadcast v/s Replicate
.WORK
Datasets & Lineage
ABLOCAL
Lookup
Lookup files provide a fast way for a transform component to process the data
records of multiple files when one of them is small enough to fit in memory.

A lookup file is an indexed dataset, and it actually consists of two files:
one file holds data and the other holds a hash index into the data file. The index file is
usually created on the fly.

Storing data in a memory-resident lookup file, as opposed to a disk-resident input file,
increases graph performance by eliminating the graph's need to access the disk repeatedly.

Of course, this speed enhancement comes at the cost of added memory demand. Also,
lookup data can be either serial or spread across a multifile.
Typically, you only use a dataset as a lookup file if the dataset is small enough to avoid
pushing up against the 2 GB component memory limit.

Partitioned Lookup is one way to reduce per-process memory demand: a 4 GB lookup
file partitioned 4 ways requires only a gigabyte for each component partition.

Block-Compressed Lookup: Only the index resides in memory. The lookup function
uses the index file to locate the proper block, reads the indicated block from disk,
decompresses the block in memory, and searches it for matching records.

ICFF
A typical application for ICFFs:

Large amounts of static data

Frequent addition of new data

Very short response times


ICFFs present advantages in a number of categories:

Disk requirements: Because ICFFs store compressed data in flat files without the
overhead associated with a DBMS, they require much less disk storage capacity than
databases, on the order of 10 times less.

Memory requirements: Because ICFFs organize data in discrete blocks, only a small
portion of the data needs to be loaded in memory at any one time.

Speed: ICFFs allow you to create successive generations of updated information
without any pause in processing. This means the time between a transaction taking
place and the results of that transaction being accessible can be a matter of seconds.

Performance: Making large numbers of queries against database tables that are
continually being updated can slow down a DBMS. In such applications, ICFFs
outperform databases.

Volume of data: ICFFs can easily accommodate very large amounts of data, so large,
in fact, that it can be feasible to take hundreds of terabytes of data from archive tapes,
convert it into ICFFs, and make it available for online access and processing.
Priority Assignment
It is the order of evaluation assigned to rules that target
the same output field in a transform function.
The rule with the lowest-numbered priority is
evaluated before rules with higher-numbered priorities. A
rule without an assigned priority is evaluated last.

out.ssn :1: in1.ssn;
out.ssn :2: in2.ssn;
out.ssn :3: "999999999";

To change the priority of a rule in the Transform Editor, right-click the transform rule
and select Set Priority from the pop-up menu. Alternatively, priorities can be edited in
Text Mode.
With prioritized rules, you can attach more than one rule to a single output field. Rules are
attempted in order of priority until one of them yields a non-NULL value, which is then
assigned to the output.
Instrumentation Parameters
Limit: the number of errors to tolerate

Ramp: the scale of errors to tolerate per input record read

Tolerance value = limit + ramp * (total number of records read)

For example, with limit = 1 and ramp = 0.01, after 1,000 records have been read the
component tolerates 1 + 0.01 * 1000 = 11 errors.

Typical Limit and Ramp settings:

Limit = 0, Ramp = 0.0: Abort on any error
Limit = 50, Ramp = 0.0: Abort after 50 errors
Limit = 1, Ramp = 0.01: Abort if more than 1 in 100 records causes an error
Limit = 1, Ramp = 1: Never abort
Diagnostic Ports
Every transform component has diagnostic ports. Click Show Optional Ports to
view them.

REJECT: input records that caused an error
ERROR: the associated error message
LOG: logging records
max-core
The max-core parameter specifies the maximum amount of memory that can be
allocated for the component.

The maximum allowable value for max-core is 2147483647 (2^31 - 1).

If the component has enough memory allocated to it, it can do everything it needs to
do in memory, without having to write anything temporarily to disk.

The max-core setting is an upper limit: the entire amount specified will not be
allocated unless necessary.

For SORT, the default value is 100 MB. For example, when sorting 1 GB of data with the
process running in serial, the number of temporary files that will be created is:
3 × 1000 MB / 100 MB = 30 files

If the component is running in parallel, the value of max-core represents the maximum
memory usage per partition, not the sum for all partitions.
NULL in Ab Initio
NULL represents the absence of a value. When the Co>Operating System
cannot evaluate an expression, it produces NULL.

In DML, the NULL value represents an unknown or missing piece of data.

A NULL is a special value that represents an unknown, missing, not applicable,
or undefined value.

NULLs are treated completely differently from ordinary values, including empty
or space values.

Ab Initio can interpret any designated value as a NULL, but this must be flagged in the
record format. Without such a flag, an additional byte (or more) is added to each data
record to indicate which of the fields that are permitted NULL values actually contain
NULL data.
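
As a quick illustration, DML provides functions such as is_null and first_defined for working
with NULLs, and the m_eval utility can evaluate a DML expression from the command line
(a hedged sketch; output formatting varies by Co>Operating System version):

m_eval 'is_null(NULL)'           # tests whether an expression evaluates to NULL
m_eval 'first_defined(NULL, 0)'  # returns the first non-NULL argument, here 0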
Parameter Definition Language

It is a simple set of rules and notations for referencing parameter values.

In graphs you "use" PDL by specifying PDL interpretation, either for a transform or record
format definition (in a component's Parameters or Ports tab), or for a parameter definition
(in a graph or the GDE's Graph Parameters Editor).

AB_DML_DEFS: a graph or project parameter that you declare, containing DML declarations
and definitions for use within inline DML in other parameter definitions.

A DML function or transform defined in AB_DML_DEFS can be called in any subsequent
parameter definition.
Component Folding
A feature of the Co>Operating System that reduces graph memory use.
It reduces the number of processes created by graphs and can enhance performance.
It effectively combines several separate graph components into a single group at runtime,
so that each partition of that group runs as a single process.

Requirements
a. Same phase and layout
b. Connected by straight flows
c. Components must be foldable
d. Fed by a single root component (i.e., the furthest-upstream component must be a single component)
Note: A Join is foldable only with the in-memory option.

PROS
a. Fewer processes
b. Less inter-process communication

CONS
a. Loss of pipeline parallelism
b. Less total address space
c. Internal buffering
Without component folding, every partition of every graph component creates a separate
OS process at runtime.
So a graph with a 2-way layout and 3 components would create 6 processes.

With component folding, the Co>Operating System scans a graph at runtime and, where
possible, combines the logic of multiple components into a single folded group, creating
one process for each partition.
So a graph with a 2-way layout and 3 components would create 2 processes.
Dynamic Script Generation
Dynamic script generation is a feature of Co>Operating System versions 2.14 and higher that
gives you the option of running a graph without having to deploy it from the GDE. Enabling
it also makes it possible to use PDL in your graphs, and to use the Co>Operating System's
component folding feature to improve your graphs' performance.

Dynamic script generation creates a reduced script which does not include the code
representing the components of your graph. Instead it includes commands that set up the
run host and its environment, then initiate the graph's execution using the air sandbox run
command.
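
A minimal sketch of what that reduced script boils down to (the sandbox path and pset name
below are assumptions for illustration):

cd /disk1/sand/my_project          # the developer's sandbox (hypothetical location)
air sandbox run pset/my_graph.pset # resolves parameters and runs the graph without a deployed script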
In-memory v/s Sorted
An in-memory ROLLUP or JOIN requires more memory (although even that
might not be true if you consider your whole graph), but might enable you to
avoid sorting.

If the data volume is very large or if other components in the graph or
subsequent runs of this graph or other graphs can take advantage of sorted
data, it might be better to sort the data and then execute an in-memory
ROLLUP or JOIN.

In-memory ROLLUP and JOIN components are most efficient when they have
sufficient memory allocated to them.
When data is written to disk, relative performance degrades, but the
application does not fail. Of course, writing to disk should be avoided
if possible.

One should be careful if data volumes are growing over time, as one day max-
core might be exceeded and the data will spill to disk. When that happens,
performance will suddenly degrade.

With an in-memory JOIN, one input is designated as the driving input (usually
the input with the largest volume). All records in the non-driving inputs are
loaded into memory before the driving input is read. So, in deciding whether to
use an in-memory JOIN, consider whether one of the inputs is small enough to
fit in memory.
BROADCAST v/s REPLICATE
BROADCAST and REPLICATE are similar components.

Replicate is generally used to increase component parallelism, emitting multiple
straight flows to separate pipelines.

Broadcast is used to increase data parallelism by feeding records to fan-out or
all-to-all flows.

Both copy each input record to multiple downstream components and both increase
parallelism. The only difference between them lies in how their flows are set up and their
layouts are propagated in the GDE.

Basically, Broadcast is treated like a partitioner component that defines the transition from
one layout to another.
Replicate allows multiple outputs for a given layout and propagates the layout from the
input to the output.
Scenarios
Serial Input and 4-way MFS Output
Both REPLICATE and BROADCAST default to fan-out flows. If we have 12 input records
and four output partitions, the 12 input records are copied to each of the four output
partitions, resulting in 48 records being copied.

4-way MFS Input and 4-way MFS Output


Broadcast defaults to an all-to-all flow, and the flow from Replicate defaults to a straight
flow. Therefore, Broadcast copies each of the 12 input records to each of the four output
partitions and writes 48 records. However, because the default for Replicate is a straight
flow, it copies the 12 records from their original input partitions into the corresponding
output partitions for a total of 12 records written along the flow.

4-way MFS Input and 4-way MFS Output (Forced Flow)


If the Broadcast has a straight flow, it copies the 12 records from their input partitions into
the corresponding output partitions for a total of 12 records written along the flow.

If Replicate has an all-to-all flow, it copies each of the 12 input records to each of the
output partitions for a total of 48 records.

If both Broadcast and Replicate have fan-out flows, because the layout is the same on both
sides of the fan-out flow, it behaves like a straight flow.
Datasets & Lineage
Data files, tables, and queues are represented in the EME by logical
datasets (also called EME datasets), which are placeholders that contain
metadata: the location and record format of the actual, physical data. They
do not contain the data itself. (Physical data files are usually not checked in
to the EME.)

When a graph is checked in, the EME inspects all the file and table
components in the graph, comparing them with logical datasets that already
exist in the EME datastore. If no corresponding dataset is found, the EME
creates one, by default in one of the following locations:
FILE: in the location derived from the component URL
TABLE: in the tables directory of the project where the .dbc file is located

It is common for many graphs to use the same dataset. For example, the
following diagram shows two graphs that use the same file, customers.dat:
one as output and the other as input.
Sometimes the filename of a data file changes. For example, you get a
new file containing sales data every day, and the filename includes the date.
The contents of the file differ from day to day, but the files are logically
equivalent.

In the EME datastore, the same logical dataset can maintain lineage in a
number of ways:
Using parameters and naming conventions, datasets that are logically the same will
automatically map to the same logical (EME) dataset.

In cases where names or locations change, parameters can be used to specify the
logical dataset names. Then create EME-specific parameter definitions that map the
logically equivalent datasets to identical names and locations.

Another option for mapping logically equivalent datasets is to use the dataset mapping
feature in the GDE.
ABLOCAL

Some complex SQL statements contain grammar that is not recognized by the
Ab Initio Parser when unloading in parallel.
In those cases you can use the ABLOCAL construct to prevent the INPUT TABLE
component from parsing the SQL. It will then get passed through to the database.
It also specifies which table to use for the PARALLEL clause.

SELECT a.cust_id FROM customer_info a
WHERE a.cust_type=1
AND a.account_open_date<(SYSDATE-30);

needs to be changed to:

SELECT a.cust_id FROM customer_info a
WHERE (ABLOCAL(customer_info a))
AND a.cust_type=1
AND a.account_open_date <(SYSDATE-30);

The error returned when the parallel unload fails to parse the original SQL looks like this:

ABINITIO(DB00112): The following errors were returned:
ABINITIO(DB00112): ----------------------------------------------
ABINITIO(DB00038): Failed getting the default host layout
ABINITIO(DB00038): for the SQL:
SELECT a.cust_id
FROM customer_info a
WHERE a.cust_type=1
AND a.account_open_date<(SYSDATE-30)
ABINITIO(DB00007): Failed parsing SQL
<SELECT a.cust_id
FROM customer_info a
WHERE a.cust_type=1
AND a.account_open_date<(SYSDATE-30)>.
ABINITIO: SQL parsing error: Error.
ABINITIO: Unexpected text found: -30
ABINITIO(*): Possible Resolutions:
ABINITIO(*): Most likely this SQL was being parsed to
ABINITIO(*): generate parallel queries. Please
ABINITIO(*): email this SQL to Ab Initio so that we
ABINITIO(*): may add automatic parallel support for this SQL
ABINITIO(*): in the future. But as a work around, you can
ABINITIO(*): bypass the SQL parse by putting the ABLOCAL
ABINITIO(*): clause in your SQL query.
ABINITIO(*): This ABLOCAL "hint" will give Ab Initio the needed
ABINITIO(*): information to generate the parallel queries
ABINITIO(*): without having to parse the SQL.
.WORK
When you run a graph, the Co>Operating System creates .WORK directories in
the directories of the layouts of program components.

During the execution of the graph, if components need to write temporary
files, the Co>Operating System creates unique subdirectories in the
appropriate .WORK directories, and the components write files in these
unique subdirectories.

The locations of the .WORK directories are based on the layout of a program
component.

These files are cleaned up on successful job completion or after rolling back
the job with m_rollback -d.
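
For example (the recovery-file name is hypothetical):

m_rollback -d my_graph.rec   # roll back the failed job and delete its temporary files, including those under .WORK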

NOTE: .WORK contains hidden files and should never be manually removed
and recreated.
We Will Learn

Sandboxes
EME
EME Projects
Project Types
Checkin
Checkout
Tag
Lock
Sandbox States
Dependency Analysis
Branch
Parameter
Parameter Attributes
Sandboxes
Sandboxes are the file-system directories in which graph development actually occurs.

The sandbox directory (folder) contains a set of specific subdirectories for graphs and
graph-related files.

The sandbox directory can have any name; the subdirectories have standard names that
indicate their functions.

A set of parameters defines the sandbox's properties.
Enterprise Meta>Environment
EME is an object-oriented data storage system that version-controls and
manages various kinds of information associated with Ab Initio applications, i.e.
metadata, which may range from design information to operational data.

In simple terms, it is a repository that contains data about data: metadata.

Enterprise metadata integrates technical and business metadata and enables you
to grasp the entirety of your data processing, from operations to analytical
systems.

Technical metadata: application-related transform rules, record formats, and execution
statistics.

Business metadata: user-defined documentation of items the company deems important,
such as logical models, data dictionaries, and roles and responsibilities.
The EME enables deep analysis of applications and the relationships between them:

Dependency analysis answers questions of data lineage:
Where did a given value come from?
How was the output value computed?
Which applications produce and depend on this data?

Impact analysis lets developers understand the consequences of proposed modifications:
If this piece changes, what else will be affected?
If this source format changes, which applications will be affected?

In addition, all objects stored in the EME are versioned, making it possible to examine the
state of things as of last week, last month, or last year and to compare it with the state of
things today.

The EME collects job-tracking (or execution) information that enables trend analysis and
capacity planning.

How fast is our data growing?
How long did that application take to run?
How much data did it process, and at what rate?
What resources did the application consume?
When will we need to add another server?
EME Projects
Every EME datastore is grouped into projects.

The datastore at its highest level contains a /Projects directory. Each of its immediate contents
represents a separate project, or a subdirectory containing projects.
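
For instance, assuming a standard datastore layout, the air utility can be used to browse it
(a hedged sketch, not exact output):

air object ls /Projects   # lists the projects (and project subdirectories) in the datastore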

A datastore can contain multiple projects.

Projects can include other projects; they are called client projects when they do this.

The included projects are called common projects.

The complete parameter environment of the common project becomes accessible to the
client project.
The directory tree of a project contains subdirectories for transform files, record format
files, graphs, plans, and so on.

EME projects held in the datastore cannot be directly manipulated.

To work on an EME project, a working area is set up in the machine. This working area is
called a sandbox.
The contents of a sandbox, when put under source control in the EME datastore, become
a project.

The project is the "master copy" of the sandbox contents; users check out copies of files
in the project to work on in their sandboxes, and check in the altered files to update the
project copies under source control.

Multiple users can set up sandboxes based on the same project, thus creating multiple
copies of it. Each sandbox can be associated with only one project, but a project can be
(and usually is) associated with multiple sandboxes. Any user working on the project will
have his or her own sandbox in which to work.

When a project is checked in to the EME, certain contents of the sandbox are copied into
the EME datastore project area. When a project is checked out, the filesystem copies of
the project contents are made up-to-date in a sandbox.
Only one version of the code can be held within the sandbox at any time.
The EME datastore contains all versions of the code that have been checked into it.

[Diagram: check-in and check-out between a user sandbox and Project1 in the EME datastore.
The sandbox holds only the current version of File1 (v5), while the datastore project holds
every version, v1 through v5.]
Private & Public Projects
Two types of project are available in an Ab Initio environment:

Public Projects: a public project is one that is visible to other projects. The public
project's parameters thus become visible and accessible to the including (client) project.
The environment project (stdenv) is a special public project which is automatically
included in all other projects in an Ab Initio environment.

Private Projects: a private project is one that cannot be (or is not supposed to be)
accessible to other projects; in this respect it is the opposite of a public project.
Most graph development typically goes on in private projects; public projects allow
work to be shared among including projects.
Checkin
Checkin is the process by which projects, graphs, or files are copied from a
sandbox to a project in an EME datastore.

Checkin places the new versions under source control and makes them
available to other users.

Checkin can be done through the GDE's Checkin Wizard or at the command line
using the air command, as in the sketch below.
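
A hedged command-line sketch (the project path, sandbox path, and graph name are assumptions;
exact options depend on the Co>Operating System version):

air project import /Projects/my_project -basedir /disk1/sand/my_project -files mp/my_graph.mp   # check one graph in from the sandbox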

The checkin process involves:

The GDE imports the specified files into the EME datastore.
By default, dependency analysis is performed in the EME on the checked-in files.
If no errors have occurred, the EME automatically increments the version number of the objects
checked in. The checked-in files are tagged if a tag is specified during the checkin process.
Files marked for removal are removed from the current version of the directory in the datastore.
By default, all file locks are released.
Checkout
Checking an object out from a datastore is done to update the
sandbox with the latest EME datastore version of that object
or to add the object if not present in the sandbox.

Specific versions of graphs and other files, as well as entire projects, can be
checked out.

After an object is checked out, the sandbox copy is still read-only.
In order to edit the object in the sandbox, it must first be locked so that no
one else can edit their sandbox's version of the object. When the edited object
is checked in, the lock is released by default.
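
A matching command-line sketch for checkout (same hypothetical paths as in the checkin example;
options vary by version):

air project export /Projects/my_project -basedir /disk1/sand/my_project -files mp/my_graph.mp   # check the graph out into the sandbox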
Tag
Identification of versions of objects in the EME datastore can be done not
only by their version numbers, but also by labeling them with tags.
A tag is a string of text that we can associate with one or more objects.
The objects can be at different versions.

We can tag objects either during checkin or at any time at the command line, through
certain air utility commands.
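
One possible command-line form (the tag name and object path are hypothetical; verify the
air tag syntax against your Co>Operating System version):

air tag create REL_2024_06 /Projects/my_project/mp/my_graph.mp   # tag the current version of this graph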
A tag is global to the datastore; its scope is not limited to a specific project.

Thus, tag names cannot be reused even when they point to objects in different
projects.

Tags are datastore objects, and just like any other datastore objects, they are
versioned.

The most common use of tags is to identify and group all the objects in a given
release. Typically, you create a tag that contains the current versions of objects at a
given point in time, for example, a tag of all objects that you want to migrate to
Production immediately after the QA team completes testing.

Tags make it convenient to manipulate groups of objects. Once tagged, objects can
be accessed indirectly by specifying the tag rather than listing the objects
themselves.

Tagged objects can easily be promoted to another EME datastore. Objects tagged
on checkin to the datastore can easily be checked out to a sandbox according to the
tag name, thereby ensuring that all developers access the same specific version.
Lock

Locking sets a lock on a graph or file in an EME project.

Locking a file gives the lock owner exclusive permission to modify and save
the file, and prevents others from making changes to the same file.

When a graph or file is locked, the Lock button looks like a closed lock;
when the file is checked in or the lock is released, the Lock button changes to
look like an open lock.

If the graph has already been locked by another user, the lock icon is red,
denoting that there is already a lock on it.
Sandbox States
The following diagram illustrates the different states a checked-in file in a sandbox can
be in with respect to its datastore, and how these states can arise:

Solid ovals represent sandbox states.

Shaded ovals represent the states that require you to take immediate action at
the command line or to manually resolve the conflicts and remove the .conflict
file(s). Otherwise, you cannot check in or check out files.

Solid lines represent the actions you take in the GDE or at the command line
that cause transitions between states.

Dashed lines represent actions that another user takes that cause state
transitions.
Dependency Analysis
Dependency analysis is the process by which the EME examines a
project in its entirety and traces how data is transformed and transferred,
field by field, from component to component, within and between graphs and
across projects.

Once an analysis is invoked, the results are viewable in a Web browser:

How operations that occur later (downstream) in a graph are affected by components earlier
(upstream) in the graph.
What data is operated on by each component, and what the operations are.
All details of every component in the project.
What happens to each field in a dataset throughout a graph and across graphs and projects.
All graphs that use a particular dataset.
Issues that might prevent the graph from working in a different environment.
Dependency analysis can save time and resources in a number of ways:

Code analysis and data lineage: We can see an up-to-date view of the current project,
what data and graphs already exist, and how they were created, which may help us
avoid duplicating development work or data. We can see how a field gets created
and how its value changes within a graph, across graphs, and across projects (lineage).
We can also assess the impact of planned changes: for example, if we were to add a
field to a particular dataset, which graphs would be affected?

Quality control: Dependency analysis helps assess whether a graph matches its
original specifications and whether it meets its goals. It also provides information
about how a graph will run outside the development environment. For example, graphs
without analysis warnings are more likely to run smoothly when deployed in the
production environment or migrated to a different datastore. Dependency analysis
even detects certain types of runtime errors, such as DML parsing problems.

Transparency and accountability: The results of dependency analysis are useful to
people throughout an organization, beyond the development process. Anyone who
uses the organization's data may be interested in discovering, for example, how a
particular data set is derived. Employees may need to know why their latest weekly
sales report suddenly looks different, or which fields are used to calculate a particular
metric.
Branch
A branch is similar to a copy of an EME datastore, with the advantage that, unlike a real
copy, a branch uses very little extra disk space. Changes made on a branch are
independent of the original (parent) branch from which the branch derived.

In every datastore, the branch you start with is called the main branch. Child branches
of the main branch may themselves have child branches.

A child branch is identical to its parent branch up to the point (version or tag) at
which you created the branch.

In the following figure, branch1 starts out as a snapshot of the main branch at tag2.
A branch has access to all tags that were created on its parent as well as to its own tags
(unless purging has been run). For example, if branch1 is created at the tag2 version on
the main branch, then running air tag list on branch1 would show tag1 and tag2 but not
tag3, created after the branch-off version.
Although branching has many uses, its major application is to help you easily create
software patch releases.

Another benefit is that groups of developers can work in teams independently of one
another without interfering with each other's work.

For example, suppose various teams are working on their respective projects, all of which
depend on another team working on project infrastructure. If the infrastructure
team makes changes, it may break things for the other teams. In this case the
infrastructure team should work on a separate branch and post code to the dependent
teams in a controlled manner.

In some ways this is similar to a release process, but it is a release within development,
not a release to test or production.

Limitation:
The branching feature does not provide a way to automatically merge changes made on
a child back into the original parent branch, usually the main branch. If you want to
merge modifications in a release branch into the main branch, you must do so manually.
Parameter
A parameter is a name-value pair with some additional attributes that determine when and how to
interpret or resolve its value.

Parameters are used to provide logical names for physical locations and should always be used
instead of hard-coded paths in graphs.

The value of a parameter may be derived from a previously defined parameter by referencing it as
$NAME, where NAME is the parameter's name.
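
For example (INPUT_DIR and INPUT_FILE are hypothetical parameter names; AI_SERIAL is a commonly
used sandbox parameter for the serial data area), parameter values are typically built up through
such references:

INPUT_DIR   $AI_SERIAL/input            # resolves against the previously defined AI_SERIAL
INPUT_FILE  $INPUT_DIR/customers.dat    # resolves against INPUT_DIR defined above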

Private project parameters: these are the parameters that allow you to access any object within the
immediate project, in other words, anything found within the sandbox.

Public project parameters: these are parameters in projects intended to be included by other private
projects.

abenv parameters: abenv is a built-in Co>Operating System project which contains the default
definitions for all environment-wide Ab Initio Environment parameters. The abenv project is included
into the environment by the stdenv project. You modify abenv's parameter values by means of overrides
from your private and public projects; you do not edit abenv's parameters directly.

Environment project parameters: an environment project, called stdenv, is created and automatically
included as a common project in all private or public projects in an Ab Initio Environment. Its
parameters define various environment-specific values. It also includes abenv as a common project. In
combination, the abenv and stdenv parameters define the Ab Initio Environment.
Parameter Attributes
The attributes for a selected parameter are displayed in the Attributes pane of
the Parameters Editor:

Type: the parameter's data type, e.g. Expression, Choices, Float, Integer, etc.

Input: specifies whether this is an input or a local parameter.
Input parameters receive their values as input from users of the object that the parameters
appear in.
Local parameters contain values determined by internal mechanisms or specified by
developers; they are not accessible to users.

Location: specifies how the parameter's value is stored, either Embedded or File.

Required: specifies whether the parameter must have a value or not.

Description: an optional comment describing the parameter.

Kind: if the selected parameter is an input parameter, specifies how it receives its value.
Environment: the parameter's value is supplied by an environment variable with the same
name as the parameter.
Implicit: the parameter receives its value only from an input values set (input .pset) file.
Keyword: the value passed to the parameter is identified on the script's command line by a
keyword preceding it.
Positional: the parameter's value is identified by its position on the command line.

Interpretation: specifies how special characters in the parameter's value are interpreted.
The five possible interpretations are:
PDL
shell
$ substitution
${ } substitution
constant

Export to Environment: specifies whether the parameter is exported to the shell
environment as a shell variable.
