Two
By Riman Bhaumik
We Will Learn
Lookup
Priority Assignment
Instrumentation Parameters
Diagnostic Ports
max-core
NULL in Ab Initio
PDL Features
Dynamic Script Generation
Component Folding
.WORK
Datasets & Lineage
ablocal()
Lookup
Lookup files provide a fast way for a transform component to process the data records of multiple files when one of them is small enough to fit in memory.
Of course, this speed enhancement comes at the cost of added memory demand. Lookup data can be either serial or spread across a multifile.
Typically, you only use a dataset as a lookup file if the dataset is small enough to avoid
pushing up against the 2 GB component memory limit.
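As a sketch of the idea, a transform rule can call the DML lookup() function against the label of a LOOKUP FILE component. The file label, key, and field names below are hypothetical:

```
out :: reformat(in) =
begin
  // "Customers" is the label of a lookup file keyed on cust_id;
  // lookup() returns the matching record, from which .name is taken.
  out.cust_name :: lookup("Customers", in.cust_id).name;
  out.* :: in.*;
end;
```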
Block-Compressed Lookup: Only the index resides in memory. The lookup function
uses the index file to locate the proper block, reads the indicated block from disk,
decompresses the block in memory, and searches it for matching records.
Disk requirements: Because ICFFs (indexed compressed flat files) store compressed data in flat files without the overhead associated with a DBMS, they require much less disk storage capacity than databases, on the order of 10 times less.
Memory requirements: Because ICFFs organize data in discrete blocks, only a small portion of the data needs to be loaded in memory at any one time.
Performance: Making large numbers of queries against database tables that are continually being updated can slow down a DBMS. In such applications, ICFFs outperform databases.
Volume of data: ICFFs can easily accommodate very large amounts of data: so large, in fact, that it can be feasible to take hundreds of terabytes of data from archive tapes, convert it into ICFFs, and make it available for online access and processing.
Priority Assignment
Priority is the order of evaluation assigned to rules that target the same output field in a transform function. The rule with the lowest-numbered priority is evaluated before rules with higher-numbered priorities; a rule without an assigned priority is evaluated last.
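As an illustrative sketch (field names made up), priorities in a DML transform are written between the colons of a rule:

```
out :: categorize(in) =
begin
  // Priority 1 is evaluated first; if it does not produce a value,
  // priority 2 is tried; the unnumbered rule is evaluated last.
  out.region :1: if (in.country == "US") "DOMESTIC";
  out.region :2: if (in.is_partner) "PARTNER";
  out.region :: "INTERNATIONAL";
end;
```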
max-core
If the component has enough memory allocated to it, it can do everything it needs to do in memory, without having to write anything temporarily to disk.
The max-core setting is an upper limit: the entire amount specified will not be allocated unless necessary.
For SORT, the default value is 100 MB. For example, when sorting 1 GB of data with the process running in serial, the number of temporary files that will be created is:
3 × 1000 MB / 100 MB = 30 files
If the component is running in parallel, the value of max-core represents the maximum memory usage per partition, not the sum for all partitions.
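The 3 × data / max-core estimate above can be sketched as a back-of-the-envelope calculation (Python used purely for the arithmetic; the factor of 3 is taken from the example):

```python
import math

def sort_temp_files(data_mb, max_core_mb, factor=3):
    """Estimated temp files for a serial SORT: factor * data / max-core."""
    return math.ceil(factor * data_mb / max_core_mb)

print(sort_temp_files(1000, 100))  # 30 files for 1 GB at the 100 MB default
```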
NULL in Ab Initio
NULL represents the absence of a value. When the Co>Operating System
cannot evaluate an expression, it produces NULL.
Ab Initio can interpret any value as a NULL, but this must be flagged. Without such a flag, an additional byte (or more) is added to each data record to indicate which of the fields permitted to hold NULL values actually contain NULL data.
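As an illustrative sketch (field names hypothetical), a DML record format can designate a specific value as a field's NULL representation:

```
record
  decimal(8) id;
  // NULL("") marks name as nullable, using the empty string
  // as its NULL representation.
  string(",") name = NULL("");
end
```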
Component Folding
Requirements:
a. Same phase and layout
b. Connected by straight flows
c. Components must be foldable
d. Fed by a single root component (i.e., the furthest-upstream component must be a single one)
Note: JOIN is foldable only with the "in-memory" option.
PROS
a. Fewer processes
b. Less inter-process communication
CONS
a. Loss of pipeline parallelism
b. Less total address space
c. Internal buffering
Without component folding, every partition of every graph component creates a separate OS process at runtime. So a graph with a 2-way layout and 3 components would create 6 processes.
With component folding, the Co>Op scans a graph at runtime and, where possible, combines the logic of multiple components, turning them into a single folded group and creating one process for each partition. So a graph with a 2-way layout and 3 components would create 2 processes.
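The process counts above follow from a simple rule, sketched here in Python for the arithmetic (not part of the Co>Op itself):

```python
def process_count(components, partitions, folded_groups=None):
    """Without folding: one OS process per component per partition.
    With folding: one process per folded group per partition."""
    groups = components if folded_groups is None else folded_groups
    return groups * partitions

print(process_count(3, 2))     # 6 processes: 3 components x 2-way layout
print(process_count(3, 2, 1))  # 2 processes when all 3 components fold
```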
Dynamic Script Generation
Dynamic script generation is a feature of Co>Operating Systems 2.14 and higher that gives
the option of running a graph without having to deploy it from the GDE. Enabling it also
makes it possible to use PDL in your graphs, and to use the Co>Operating System
component folding feature to improve your graphs' performance.
Dynamic script generation creates a reduced script which does not include the code representing the components of your graph. Instead, it includes commands that set up the run host and its environment, then initiate the graph's execution using the air sandbox run command.
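For example, the generated script ultimately launches the graph with a command along these lines (the sandbox path and file name here are hypothetical):

```
air sandbox run /home/dev/myproject/pset/process_customers.pset
```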
In-memory v/s Sorted
An in-memory ROLLUP or JOIN requires more memory (although even that
might not be true if you consider your whole graph), but might enable you to
avoid sorting.
In-memory ROLLUP and JOIN components are most efficient when they have
sufficient memory allocated to them.
When data is written to disk, relative performance degrades, but the application does not fail. Of course, writing to disk should be avoided if possible.
One should be careful if data volumes are growing over time: one day max-core might be exceeded and the data will spill to disk. When that happens, performance will suddenly degrade.
With an in-memory JOIN, one input is designated as the driving input (usually
the input with the largest volume). All records in the non-driving inputs are
loaded into memory before the driving input is read. So, in deciding whether to
use an in-memory JOIN, consider whether one of the inputs is small enough to
fit in memory.
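The driving/non-driving idea can be sketched in Python (illustrative only; Ab Initio's actual JOIN is a compiled component, and all names here are made up):

```python
# The small (non-driving) input is loaded into a hash table first;
# the large (driving) input is then streamed past it record by record.
def in_memory_join(driving, non_driving, key):
    table = {key(r): r for r in non_driving}   # must fit in memory
    for rec in driving:                        # streamed, any size
        match = table.get(key(rec))
        if match is not None:
            yield rec, match

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
big = [{"id": 2, "v": 10}, {"id": 3, "v": 20}]
print(list(in_memory_join(big, small, key=lambda r: r["id"])))
```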
BROADCAST v/s REPLICATE
BROADCAST and REPLICATE are similar components.
Both copy each input record to multiple downstream components and both increase
parallelism. The only difference between them lies in how their flows are set up and their
layouts are propagated in the GDE.
Basically, Broadcast is treated like a partitioner component that defines the transition from
one layout to another.
Replicate allows multiple outputs for a given layout and propagates the layout from the
input to the output.
Scenarios
Serial Input and 4-way MFS Output
Both REPLICATE and BROADCAST default to fan-out flows. If we have 12 input records and four output partitions, the 12 input records are copied to each of the four output partitions, resulting in 48 records being copied.
If REPLICATE has an all-to-all flow, it copies each of the 12 input records to each of the output partitions, for a total of 48 records.
If both BROADCAST and REPLICATE have fan-out flows and the layout is the same on both sides of the fan-out flow, the flow behaves like a straight flow.
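The record counts in this scenario reduce to one multiplication, sketched here for completeness:

```python
def records_copied(input_records, output_partitions):
    # Fan-out (and all-to-all) from a serial input: every input record
    # is copied to every output partition.
    return input_records * output_partitions

print(records_copied(12, 4))  # 48, as in the scenario above
```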
Datasets & Lineage
Data files, tables, and queues are represented in the EME by logical datasets (also called EME datasets), which are placeholders that contain metadata such as the location and record format of the actual, physical data. They do not contain the data itself. (Physical data files are usually not checked in to the EME.)
When a graph is checked in, the EME inspects all the file and table
components in the graph, comparing them with logical datasets that already
exist in the EME datastore. If no corresponding dataset is found, the EME
creates one, by default in one of the following locations:
FILE: in the location derived from the component URL
TABLE: in the tables directory of the project where the .dbc file is located
It is common for many graphs to use the same dataset. For example, the following diagram shows two graphs that use the same file, customers.dat: one as output and the other as input.
Sometimes the filename of a data file changes. For example, you get a
new file containing sales data every day, and the filename includes the date.
The contents of the file differ from day to day, but the files are logically
equivalent.
In the EME datastore, the same logical dataset can maintain lineage in a number of ways.
Using parameters and naming conventions, datasets that are logically the same will
automatically map to the same logical (EME) dataset.
In cases where names or locations change, parameters can be used to specify the logical dataset names. Then create EME-specific parameter definitions that map the logically equivalent datasets to identical names and locations.
Another option for mapping logically equivalent datasets is to use the dataset mapping
feature in the GDE.
ABLOCAL
Some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when unloading in parallel.
In those cases you can use the ABLOCAL construct to prevent the INPUT TABLE component from parsing the SQL; it is then passed through to the database. ABLOCAL also specifies which table to use for the PARALLEL clause.
For example, this query:

    SELECT a.cust_id FROM customer_info a
    WHERE a.cust_type=1
    AND a.account_open_date<(SYSDATE-30);

fails with:

    ABINITIO(DB00112): The following errors were returned:
    ABINITIO(DB00112): ----------------------------------------------
    ABINITIO(DB00038): Failed getting the default host layout
    ABINITIO(DB00038): for the SQL:
    SELECT a.cust_id
    FROM customer_info a
    WHERE a.cust_type=1
    AND a.account_open_date<(SYSDATE-30)
    ABINITIO(DB00007): Failed parsing SQL
    <SELECT a.cust_id
    FROM customer_info a
    WHERE a.cust_type=1
    AND a.account_open_date<(SYSDATE-30)>.
    ABINITIO: SQL parsing error: Error.
    ABINITIO: Unexpected text found: -30
    ABINITIO(*): Possible Resolutions:
    ABINITIO(*): Most likely this SQL was being parsed to
    ABINITIO(*): generate parallel queries. Please
    ABINITIO(*): email this SQL to Ab Initio so that we
    ABINITIO(*): may add automatic parallel support for this SQL
    ABINITIO(*): in the future. But as a work around, you can
    ABINITIO(*): bypass the SQL parse by putting the ABLOCAL
    ABINITIO(*): clause in your SQL query.
    ABINITIO(*): This ABLOCAL "hint" will give Ab Initio the needed
    ABINITIO(*): information to generate the parallel queries
    ABINITIO(*): without having to parse the SQL.

and needs to be changed to:

    SELECT a.cust_id FROM customer_info a
    WHERE (ABLOCAL(customer_info a))
    AND a.cust_type=1
    AND a.account_open_date <(SYSDATE-30);
.WORK
When you run a graph, the Co>Operating System creates .WORK directories in the layout directories of program components; their locations are based on each component's layout.
These files are cleaned up on successful job completion or after rolling back
the job with m_rollback -d.
NOTE: .WORK contains hidden files and should never be manually removed
and recreated.
We Will Learn
Sandboxes
EME
EME Projects
Project Types
Checkin
Checkout
Tag
Lock
Sandbox States
Dependency Analysis
Branch
Parameter
Parameter Attributes
Sandboxes
Sandboxes are the file-system directories in which graph development actually occurs. A set of parameters defines the sandbox's properties.
Enterprise Meta>Environment
EME is an object-oriented data storage system that version-controls and manages various kinds of information associated with Ab Initio applications, i.e. metadata, which may range from design information to operational data.
In addition, all objects stored in the EME are versioned, making it possible to examine the
state of things as of last week, last month, or last year and to compare it with the state of
things today.
The EME collects job-tracking (or execution) information that enables trend analysis and
capacity planning.
The datastore at its highest level contains a /Projects directory. Each of its immediate contents
represents a separate project, or a subdirectory containing projects.
Projects can include other projects; they are called client projects when they do this.
The complete parameter environment of the common project becomes accessible to the client project.
The directory tree of a project contains subdirectories for transform files, record format
files, graphs, plans, and so on.
To work on an EME project, a working area is set up on the machine. This working area is called a sandbox.
The contents of a sandbox, when put under source control in the EME datastore, become
a project.
The project is the "master copy" of the sandbox contents; users check out copies of files
in the project to work on in their sandboxes, and check in the altered files to update the
project copies under source control.
Multiple users can set up sandboxes based on the same project, thus creating multiple
copies of it. Each sandbox can be associated with only one project, but a project can be
(and usually is) associated with multiple sandboxes. Any user working on the project will
have his or her own sandbox in which to work.
When a project is checked in to the EME, certain contents of the sandbox are copied into
the EME datastore project area. When a project is checked out, the filesystem copies of
the project contents are made up-to-date in a sandbox.
Only one version of the code can be held within the sandbox at any time.
The EME datastore contains all versions of the code that have been checked into it.
[Diagram: check-in and check-out between a sandbox and the EME datastore. The sandbox holds a single version of File1 (v5), while the datastore project holds all checked-in versions (v1 through v5); check-in copies the sandbox version into the datastore, and check-out copies a version back into the sandbox.]
Private & Public Projects
Two types of project are available in an Ab Initio environment:
Public projects: A public project is one that is visible to other projects. The public project's parameters thus become visible and accessible to the including (client) project. The environment project (stdenv) is a special public project which is automatically included in all other projects in an Ab Initio environment.
Private projects: A private project is one that cannot be (or is not supposed to be) accessible to other projects; in this respect it is the opposite of a public project. Most graph development typically goes on in private projects; public projects allow work to be shared among including projects.
Checkin
Checkin is the process by which projects, graphs, or files are copied from a sandbox to a project in an EME datastore.
Checkin places the new versions under source control and makes them
available to other users.
We can tag objects either during checkin or at any time at the command line, through certain air utility commands.
Tag
A tag is global to the datastore; its scope is not limited to a specific project. Thus, tag names cannot be reused, even when they point to objects in different projects.
Tags are datastore objects, and just like any other datastore objects, they are
versioned.
The most common use of tags is to identify and group all the objects in a given release. Typically, you create a tag that contains the current versions of objects at a given point in time: for example, a tag of all objects that you want to migrate to Production immediately after the QA team completes testing.
Tags make it convenient to manipulate groups of objects. Once tagged, objects can
be accessed indirectly by specifying the tag rather than listing the objects
themselves.
Tagged objects can easily be promoted to another EME datastore. Objects tagged
on checkin to the datastore can easily be checked out to a sandbox according to the
tag name, thereby ensuring that all developers access the same specific version.
Lock
Locking a file gives the lock owner exclusive permission to modify and save the file, and prevents others from making changes to the same file.
When a graph or file is locked, the Lock button looks like a closed lock; when the checkin completes or the lock is released, the Lock button changes to look like an open lock.
If the graph has already been locked by another user, the lock is shown in red, denoting that there is already a lock on it.
Sandbox States
The following diagram illustrates the different states a checked-in file in a sandbox can
be in with respect to its datastore, and how these states can arise:
Dependency Analysis
Dependency analysis shows:
How operations that occur later (downstream) in a graph are affected by components earlier (upstream) in the graph.
What data is operated on by each component, and what the operations are.
What happens to each field in a dataset throughout a graph and across graphs and projects.
Issues that might prevent the graph from working in a different environment.
Dependency analysis can save time and resources in a number of ways:
Code analysis and data lineage: We can see an up-to-date view of the current project, what data and graphs already exist, and how they were created, which may help avoid duplicating development work or data. We can see how a field gets created and how its value changes within a graph, across graphs, and across projects (lineage). We can also assess the impact of planned changes: for example, if you were to add a field to a particular dataset, which graphs would be affected?
Quality control: Dependency analysis helps assess whether a graph matches its original specifications and whether it meets its goals. It also provides information about how a graph will run outside the development environment. For example, graphs without analysis warnings are more likely to run smoothly when deployed in the production environment or migrated to a different datastore. Dependency analysis even detects certain types of runtime errors, such as DML parsing problems.
Branch
In every datastore, the branch you start with is called the main branch. Child branches of the main branch may themselves have child branches.
Another benefit is that groups of developers can work in teams independently of one another without interfering with each other's work.
For example, suppose various teams are working on their respective projects, all of which depend on another team working on project infrastructure. If the infrastructure team makes changes, it may break things for the other teams. In this case the infrastructure team should work on a separate branch and post code to the dependent teams in a controlled manner.
In some ways this is similar to a release process, but it is a release within development,
not a release to test or production.
Limitation:
The branching feature does not provide a way to automatically merge changes made on
a child back into the original parent branch, usually the main branch. If you want to
merge modifications in a release branch into the main branch, you must do so manually.
Parameter
A parameter is a name-value pair with some additional attributes that determine when and how to interpret or resolve its value.
Parameters are used to provide logical names for physical locations and should always be used instead of hard-coded paths in graphs.
The value of a parameter may be derived from a previously defined parameter by referring to it as $NAME, where NAME is the parameter's name.
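The $NAME substitution can be illustrated in shell terms. The paths below are hypothetical; AI_PROJECT and AI_SERIAL follow common sandbox parameter naming conventions:

```shell
# AI_SERIAL is derived from AI_PROJECT via a $NAME reference,
# so re-pointing AI_PROJECT moves every derived path with it.
AI_PROJECT=/data/dev/myproject
AI_SERIAL=$AI_PROJECT/serial
echo "$AI_SERIAL"
```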
Private project parameters: These are the parameters that allow you to access any object within the immediate project, in other words, anything found within the sandbox.
Public project parameters: These are parameters in projects intended to be included by other private projects.
abenv parameters: abenv is a built-in Co>Operating System project which contains the default definitions for all environment-wide Ab Initio Environment parameters. The abenv project is included into the environment by the stdenv project. You modify abenv's parameter values by means of overrides from your private and public projects; you do not edit abenv's parameters directly.
Environment project parameters: An environment project, called stdenv, is created and automatically included as a common project in all private and public projects in an Ab Initio Environment. Its parameters define various environment-specific values. It also includes abenv as a common project. In combination, the abenv and stdenv parameters define the Ab Initio Environment.
Parameter Attributes
The attributes for a selected parameter are displayed in the Attributes pane of the Parameters Editor:
Type: The parameter's data type, e.g. Expression, Choices, Float, Integer, etc.
Environment: The parameter's value is supplied by an environment variable with the same name as the parameter.
Implicit: The parameter receives its value only from an input values set (input .pset) file.
Keyword: The value passed to the parameter is identified on the script's command line by a keyword preceding it.
Positional: The parameter's value is identified by its position on the command line.
Interpretation: Specifies how special characters in the parameter's value are interpreted. The 5 possible interpretations are:
PDL
shell
$ substitution
${ } substitution
constant