
Incremental Loads

Challenge

Data warehousing incorporates large volumes of data, making the process of loading into the
warehouse without compromising its functionality increasingly difficult. The goal is to create a load
strategy that will minimize downtime for the warehouse and allow quick and robust data management.

Description

As time windows shrink and data volumes increase, it is important to understand the impact of a
suitable incremental load strategy. The design should allow data to be incrementally added to the data
warehouse with minimal impact to the overall system. The following pages describe several possible
load strategies.

Considerations

 Incremental Aggregation – loading deltas into an aggregate table.
 Error un/loading data – strategies for recovering, reloading, and unloading data.
 History tracking – keeping track of what has been loaded and when.
 Slowly changing dimensions – Informatica Wizards for generic mappings (a good start to an
incremental load strategy).

Source Analysis

Data sources typically fall into the following possible scenarios:

 Delta Records - Records supplied by the source system include only new or changed records. In
this scenario, all records are generally inserted or updated into the data warehouse.
 Record Indicator or Flags - Records include columns that specify how the record should be
applied to the warehouse. Records can be selected based upon this flag to allow for inserts,
updates, and deletes (see the sketch after this list).
 Date stamped data - Data is organized by timestamps. Data will be loaded into the warehouse
based upon the last processing date or the effective date range.
 Key values are present - When only key values are present, data must be checked against what
has already been entered into the warehouse. All values must be checked before entering the
warehouse.
 No Key values present - Surrogate keys will be created and all data will be inserted into the
warehouse based upon validity of the records.
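
As a sketch of the record-indicator scenario (the table and column names here are illustrative, not from any specific source system), the extraction query can select records based on the flag:

SELECT order_id, amount, record_flag
FROM src_orders
WHERE record_flag IN ('I', 'U', 'D')

A router transformation can then send each row down an insert, update, or delete path according to the flag value.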

Identify Which Records Need to be Compared

Once the sources are identified, it is necessary to determine which records will be entered into the
warehouse and how. Here are some considerations:

 Compare with the target table. Determine if the record exists in the target table. If the record
does not exist, insert the record as a new row. If it does exist, determine if the record needs to
be updated, inserted as a new record, or removed (deleted from target or filtered out and not
added to the warehouse). This occurs in cases of delta loads, timestamps, keys or surrogate
keys.
 Record indicators. Record indicators can be beneficial when lookups into the target are not
necessary. Take care to ensure that the record exists for updates or deletes or the record can
be successfully inserted. More design effort may be needed to manage errors in these
situations.

Determine the Method of Comparison

1. Joins of Sources to Targets. Records are directly joined to the target using Source Qualifier join
conditions or using joiner transformations after the source qualifiers (for heterogeneous sources).
When using joiner transformations, take care to ensure the data volumes are manageable.
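
As a sketch, assuming the source and target share an order_id key and live in the same database (all names here are illustrative), a SQL override can detect new records with an outer join:

SELECT s.*
FROM src_orders s
LEFT OUTER JOIN dw_orders t ON s.order_id = t.order_id
WHERE t.order_id IS NULL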

2. Lookup on target. Using the lookup transformation, lookup the keys or critical columns in the target
relational database. Keep in mind the caches and indexing possibilities.
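
For example, if the lookup returns the target's key column (lkp_ORDER_ID is an illustrative port name), a downstream Update Strategy expression can flag each row:

IIF(ISNULL(lkp_ORDER_ID), DD_INSERT, DD_UPDATE)

A null lookup return means the row does not yet exist in the target and should be inserted; otherwise it follows the update path.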

3. Load table log. Generate a log table of records that have been already inserted into the target
system. You can use this table for comparison with lookups or joins, depending on the need and
volume. For example, store keys in a separate table and compare source records against this log table
to determine load strategy.
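
A minimal sketch of this approach, assuming a log table named load_log that stores the keys already loaded (names are illustrative):

SELECT s.*
FROM src_orders s
WHERE NOT EXISTS (SELECT 1 FROM load_log l WHERE l.order_id = s.order_id)

After a successful load, the new keys are appended to load_log so the next run skips them.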

Source Based Load Strategies

Complete Incremental Loads in a Single File/Table

The simplest method of incremental loads is from flat files or a database in which all records will be
loaded. This particular strategy requires bulk loads into the warehouse, with no overhead on
processing of the sources or sorting the source records.

Loading Method

Data can be loaded directly from these locations into the data warehouse. There is no additional
overhead produced in moving these sources into the warehouse.

Date Stamped Data

This method involves data that has been stamped using effective dates or sequences. The incremental
load can be determined by dates greater than the previous load date or data that has an effective key
greater than the last key processed.

Loading Method

With the use of relational sources, the records can be selected based on this effective date and only
those records past a certain date will be loaded into the warehouse. Views can also be created to
perform the selection criteria so the processing will not have to be incorporated into the mappings.
Placing the load strategy into the ETL component, however, keeps it more flexible and controllable
for the ETL developers and visible in the metadata.
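
If the view approach is used, a sketch might look like the following, assuming the last load date is kept in a control table (etl_control and all other names are illustrative):

CREATE OR REPLACE VIEW v_src_orders_delta AS
SELECT *
FROM src_orders s
WHERE s.modify_date > (SELECT c.last_load_date
                       FROM etl_control c
                       WHERE c.table_name = 'SRC_ORDERS')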

Non-relational data can be filtered as records are loaded based upon the effective dates or sequenced
keys. A router transformation or a filter can be placed after the source qualifier to remove old
records.

To compare the effective dates, you can use mapping variables to provide the previous date
processed. The alternative is to use control tables to store the date and update the control table after
each load.
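
A minimal control-table sketch (names are illustrative): the session reads the stored date before the load and updates it afterward, typically as a post-session SQL command so it only runs after a successful load.

SELECT last_load_date FROM etl_control WHERE table_name = 'SRC_ORDERS'

UPDATE etl_control
SET last_load_date = SYSDATE
WHERE table_name = 'SRC_ORDERS'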

For detailed instruction on how to select dates, refer to Best Practice: Variable and Mapping
Parameters.

Changed Data based on Keys or Record Information

Data that is uniquely identified by keys can be selected based upon selection criteria. For example,
records that contain key information such as primary keys or alternate keys can be checked against
what has already been entered into the data warehouse. If a record exists, you can then decide
whether to update it or discard the source record.

Load Method

It may be possible to do a join with the target tables in which new data can be selected and loaded
into the target. It may also be feasible to lookup in the target to see if the data exists or not.
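
As a sketch of the join approach for change detection, assuming matching rows should only be updated when a tracked column differs (all names illustrative):

SELECT s.*
FROM src_orders s
INNER JOIN dw_orders t ON s.order_id = t.order_id
WHERE s.amount <> t.amount
   OR s.status <> t.status

Rows returned by this query already exist in the target but carry changed values, so they follow the update path; the outer-join query shown earlier identifies the inserts.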

Target Based Load Strategies

Load Directly into the Target

Loading directly into the target is possible when the data will be bulk loaded. The mapping will be
responsible for error control, recovery and update strategy.

Load into Flat Files and Bulk Load using an External Loader

The mapping will load data directly into flat files. An external loader can be invoked at that point to
bulk load the data into the target. This method reduces the load times (with less downtime for the
data warehouse) and also provides a means of maintaining a history of data being loaded into the
target. Typically this method is used only for inserts into the warehouse.
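
For example, with an Oracle target, the flat file produced by the mapping could be bulk loaded with SQL*Loader. The command and control file below are a minimal illustrative sketch, not tied to any specific schema:

sqlldr userid=dw_user/dw_pass control=orders.ctl log=orders.log

-- orders.ctl
LOAD DATA
INFILE 'orders.dat'
APPEND
INTO TABLE dw_orders
FIELDS TERMINATED BY ','
(order_id, amount, create_date DATE 'MM/DD/YYYY')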

Load into a Mirror Database

The data will be loaded into a mirror database to avoid downtime of the active data warehouse. After
data has been loaded, the databases are switched, making the mirror the active database and the
active database the mirror.
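
One way to implement the switch, assuming the two copies live in separate schemas and users reach the warehouse through synonyms (a site-specific assumption), is to repoint the synonyms at the freshly loaded schema:

CREATE OR REPLACE PUBLIC SYNONYM dw_orders FOR dw_mirror.dw_orders

Repointing the synonyms makes the switch near-instant for query users; the previously active schema then becomes the mirror for the next load.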

Using Mapping Variables and Parameter Files

A mapping variable can be used to perform incremental loading. The mapping variable is used in the
join condition in order to select only the new data that has been entered, based on the create_date or
the modify_date, whichever date can be used to identify a newly inserted record. The source system
must have a reliable date to use. Here are the steps involved in this method:

Step 1: Create Mapping Variable.

In the Informatica Designer, with the mapping designer open, go to the menu and select Mappings,
then select Parameters and Variables.

Name the variable and, in this case, make your variable a date/time. For the Aggregation option,
select MAX.

In the same screen, state your initial value. This is the date at which the load should start. The date
must follow one of these formats:

 MM/DD/RR
 MM/DD/RR HH24:MI:SS
 MM/DD/YYYY
 MM/DD/YYYY HH24:MI:SS

Step 2: Use the Mapping Variable in the Source Qualifier

The select statement will look like the following:

Select * from tableA
Where CREATE_DATE > to_date('$$INCREMENT_DATE', 'MM/DD/YYYY HH24:MI:SS')

Note that the format mask must match the format of the variable's stored value; here it matches the MM/DD/YYYY HH24:MI:SS format listed above.

Step 3: Use the Mapping Variable in an Expression

For the purpose of this example, use an expression to work with the variable functions to set and use
the mapping variable.

In the expression, create a variable port and use the SETMAXVARIABLE variable function as follows:

SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)

CREATE_DATE is the date for which you would like to store the maximum value.
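
A sketch of the Expression transformation ports (port names are illustrative):

CREATE_DATE (input port, date/time) - the incoming source date
v_MAX_DATE (variable port, date/time) = SETMAXVARIABLE($$INCREMENT_DATE, CREATE_DATE)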

You can use the variable functions in the following transformations:

 Expression
 Filter
 Router
 Update Strategy

For each row, the variable holds the maximum of the incoming source value and its current value. So
if one row comes through with 9/1/2001, the variable takes that value; if all subsequent rows are
LESS than that, 9/1/2001 is preserved.

After the mapping completes, that is the PERSISTENT value stored in the repository for the next run of
your session. You can view the value of the mapping variable in the session log file.

The benefit of using a mapping variable for incremental loading is that it allows the session to use
only the new rows of data. No table is needed to store the max(date) since the variable takes care of it.
