
Implement an Error Event Schema with Oracle Data Integrator

By Michael Rainey ◾ Michelle Malcher, Editor

Quality is often overlooked or underutilized when integrating data into a data warehouse. The typical scenario has multiple distributed data sources in different formats, such as files, databases, legacy systems and XML.

Because it is often difficult to know the level of quality checking occurring on each source, the burden is placed on the data warehouse. Luckily, with the Kimball ETL subsystems methodology and Oracle Data Integrator 12c (ODI), the problem of data quality can be addressed with consistency and ease.

The ETL Subsystems and Error Event Schema

The Kimball Group has long been the authoritative source for dimensional modeling and data warehouse implementation and management methodologies. One of the core pieces of information gained from their approach is that of the ETL subsystems — 34 of them, to be exact. The Kimball ETL subsystems are categorized into four major focus areas: Extracting; Cleaning and Conforming; Delivering Data for Presentation; and Managing the ETL Environment. This article will focus on just two of the 34 subsystems, both in the Cleaning and Conforming category: the Data Cleansing System and the Error Event Schema.

The Data Cleansing System, or Subsystem 4, according to the “Data Warehouse Lifecycle Toolkit” (Wiley Publishing Inc., 2008), is all about data quality. Oracle Data Integrator 12c (ODI) is the premier data integration technology from Oracle. In ODI, this boils down to creating quality screens: tests against the data that determine whether a record meets quality standards, and that define how the system responds to a quality event, such as a failed quality screen. There are several types of quality screens that can be created. Column screens validate the contents of an individual column, such as a check for NULL values or a specific data format. Structure screens test data across records and tables, with a common check being a foreign key reference. Finally, complex screens can be created to validate data against complex business rules or aggregate threshold checks. For simplicity, we’ll focus on the column screen quality check in this article, but the concepts apply to all three types of screens.

The Error Event Schema, or Subsystem 5, is a dimensional schema designed to track the data quality screens, the data that failed those checks, and when the quality screening occurred. This schema creates a single place for data warehouse administrators to capture, track and analyze all errors that occur as a function of data warehouse processing. As shown in Figure 1, the dimensional model has a main fact table to capture the error event. There are also three dimension tables: the Date Dimension contains the date of the error, the Batch Dimension captures the batch ETL process run instance, and the Screen Dimension tracks the quality screen that produced the error. There is also an additional fact table for the Error Event Detail. This table tracks the error by actual column, ensuring that the attributes used in those complex business rules can be captured individually. Because we are focusing on the column quality checks, we will only use the four base tables in our example.
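To make the three screen types concrete, here is a sketch of each expressed as a SQL test. This is illustrative only: the PROJECTS table and its ZIP column appear later in this article's example, while the SCHOOLS table and SCHOOL_ID column are hypothetical stand-ins.

-- Column screen: an individual column must meet a format rule.
SELECT COUNT(*) FROM PROJECTS
WHERE ZIP IS NULL OR LENGTH(ZIP) <> 5;

-- Structure screen: a cross-record/cross-table rule, here a foreign key
-- reference to a hypothetical SCHOOLS table.
SELECT COUNT(*)
FROM PROJECTS P
LEFT JOIN SCHOOLS S ON P.SCHOOL_ID = S.SCHOOL_ID
WHERE S.SCHOOL_ID IS NULL;

-- Complex screen: a business rule or aggregate threshold, e.g. fail the
-- screen if more than 5 percent of rows have an invalid ZIP length.
SELECT CASE WHEN SUM(CASE WHEN LENGTH(ZIP) <> 5 THEN 1 ELSE 0 END)
            > 0.05 * COUNT(*) THEN 'FAIL' ELSE 'PASS' END AS SCREEN_RESULT
FROM PROJECTS;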

Figure 1: Error Event Schema
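For readers who want to build the schema out, a minimal DDL sketch of the four base tables shown in Figure 1 might look like the following. The table and column names here are illustrative assumptions, not definitions taken from the article.

CREATE TABLE DATE_DIM (
  DATE_KEY      NUMBER PRIMARY KEY,
  CALENDAR_DATE DATE
);

CREATE TABLE BATCH_DIM (
  BATCH_KEY      NUMBER PRIMARY KEY,   -- one row per Load Plan run instance
  LOAD_PLAN_NAME VARCHAR2(400),
  START_DATE     DATE,
  END_DATE       DATE
);

CREATE TABLE SCREEN_DIM (
  SCREEN_KEY  NUMBER PRIMARY KEY,
  SCREEN_NAME VARCHAR2(400),
  SCREEN_TYPE VARCHAR2(30),             -- column, structure or complex
  TABLE_NAME  VARCHAR2(128),            -- data store the screen tests
  SCREEN_SQL  CLOB                      -- the quality check expression
);

CREATE TABLE ERROR_EVENT_FACT (
  DATE_KEY        NUMBER REFERENCES DATE_DIM (DATE_KEY),
  BATCH_KEY       NUMBER REFERENCES BATCH_DIM (BATCH_KEY),
  SCREEN_KEY      NUMBER REFERENCES SCREEN_DIM (SCREEN_KEY),
  ERROR_ROW_COUNT NUMBER                -- rows failing the screen in this batch
);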
Oracle Data Integrator 12c
With a framework designed to extract and load across heterogeneous data sources, ODI is built for performance and for managing business rules. Transformations are completed using the target system, rather than needing an intermediate ETL server. This, along with a declarative design approach to building mappings, makes ODI a flexible and performance-focused technology. It also has accessible metadata, which will be discussed later on.
Data quality in ODI 12c is possible through the use of logical constraints created on the data store, a logical representation of the underlying physical source or target object, such as a table, file or XML document element. The logical constraints can be defined as one of three different types, as well as the mandatory constraint (NOT NULL, for the Oracle Database folks) that can be placed on each column. The primary key constraint on the data store forces the uniqueness of one or more columns. The foreign key constraint enforces relationships between tables, defining how the primary key columns of one table relate to the columns of the other.

These logical constraints are stored as metadata in the ODI work repository and can be accessed by the ODI Substitution API via a Check Knowledge Module (CKM). The CKM is a code template that can be reused across mappings to perform logical constraint checks against static tables or during execution of a mapping. At design time for a given mapping, the flow control option can be enabled in the Integration Knowledge Module, allowing the CKM steps, which check each logical constraint defined on the target data store, to be executed at runtime. The CKM also creates the error table in the work schema, prefixed with E$, to store the records that fail the constraint checks.
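Conceptually, the flow control step that the CKM generates for a condition constraint follows an insert-as-select pattern along these lines. This is a simplified sketch, not the exact code ODI emits; the ODI_-prefixed audit columns are the ones ODI 12c typically adds to E$ tables, I$_PROJECTS is the flow (staging) table for the mapping, and the constraint name is hypothetical.

INSERT INTO DW_WRK.E$_PROJECTS
  (ODI_ERR_TYPE, ODI_ERR_MESS, ODI_CHECK_DATE,
   ODI_CONS_NAME, ODI_CONS_TYPE, ODI_SESS_NO, ZIP /* ...other source columns */)
SELECT
  'F',                           -- 'F' = flow control check
  'ZIP code length must be 5',   -- message defined on the constraint
  SYSDATE,
  'COND_ZIP_LENGTH',             -- hypothetical constraint name
  'CK',                          -- condition-type constraint
  :SESS_NO,                      -- session running the check
  ZIP
FROM DW_WRK.I$_PROJECTS
WHERE NOT (LENGTH(ZIP) = 5);     -- rows violating the condition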

Working Through a Data Quality Example


Let’s take a look now and see how everything we’ve discussed so far fits together. First, we’ll create a logical constraint on a data store that will be used as the target in our mapping. For this example, the DonorsChoose.org data set will be used, with a focus on the projects table in particular. DonorsChoose is an organization that essentially crowdfunds donations for schools throughout the United States. Teachers submit a project idea and donation request to the site, such as a request for iPads for their entire classroom or funding for a field trip to the local museum. Then, folks like you and I can go out and choose a project to donate to, with the ability to select based on different attributes of the school, such as location, poverty level or project subject area, among others. The organization allows public access to their data set, hoping that some advanced analytics performed against the data set may lead to more donations for the schools. It’s a great organization, and I recommend you check it out. But for now, back to the subject of data quality screens and how we can check the quality of the DonorsChoose projects table.

Some data profiling has already been completed on the table using Oracle Enterprise Data Quality, so it is known that there may be a slight data quality issue with the ZIP code column and its length.

Figure 2: ZIP Code Column Length

As you can see, there’s a very significant outlier in the length data profile for the ZIP code. A length of four characters is clearly an anomaly, and we’ll want to ensure this data does not flow through to the target table. First, a condition constraint will be created for the ZIP code column on the PROJECTS target data store, checking that the length of the data must always be five characters. This column-based quality screen will ensure that only the data with a valid length is moved through to the target. All other data, with invalid lengths, will be pushed to the error (E$_PROJECTS) table, allowing analysis and handling of the data quality issue at a later time.

Figure 3: ODI Condition
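Figure 3 shows the condition as defined in ODI Studio. The heart of it is a single Where clause on the PROJECTS data store, something along these lines (the exact expression depends on how you define the constraint):

-- ODI condition Where clause: rows must satisfy this expression;
-- rows that do not are routed to E$_PROJECTS when flow control runs.
LENGTH(PROJECTS.ZIP) = 5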

When performing the load from the file opendata_projects.csv to the PROJECTS table in an Oracle Database schema, the flow control option will be enabled on the Integration Knowledge Module. This option will activate the CKM steps, allowing the condition on the ZIP code column, and any other constraints on the target data store, to be evaluated.

Note: This will add a step to the executed process for each and every constraint, including the check for nullability, so be sure to manage those constraints that do not need to be logically checked during the data load, thus reducing overhead during the data transformation.

Let’s run the mapping and see what happens.

Figure 4: ODI CKM Steps

We can see that at least one error row occurred during execution of the load to the target PROJECTS table, based on the yellow warning sign with an exclamation point at the session level of the output. This can be validated by looking at the SNP_CHECK table, which will show all of the failed constraints and the number of times each failed. To dig deeper, the exact row(s) that failed the constraints can be found in the E$_Target_Table_Name table (in this example, E$_PROJECTS).

Figure 5: ODI Condition Error
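To see the failing rows themselves, the error table can be queried directly. A sketch, again assuming the standard ODI 12c audit columns on the E$ table:

SELECT ODI_CONS_NAME, ODI_CONS_TYPE, ODI_ERR_MESS, ZIP
FROM DW_WRK.E$_PROJECTS
ORDER BY ODI_CHECK_DATE DESC;   -- most recent check failures first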

So now we have seen a data quality check, or quality screen, performed within ODI 12c, and we know that the failed data will reside in both a summary and a detail table. Now what does this mean for Subsystem 5, the Error Event Schema? Let’s take a look and see how the physical tables in the ODI repository and work schema map to the logical Error Event Schema.

First, let’s break down the dimensions, as the physical tables referenced may not make too much sense at this point. The table SNP_LPI_RUN will store all of the data for each Load Plan execution. A Load Plan is a top-level execution object in ODI that helps to organize the execution of scenarios (compiled packages, procedures or mappings) in a sequence of steps. These steps can be executed in parallel or sequentially and can also include conditional branching based on an ODI variable value. From the SNP_LPI_RUN table, we can capture each batch execution ID and store it in the batch dimension. Next, we have the sources for the screen dimension: SNP_COND (condition constraints), SNP_KEY (primary key constraints) and SNP_JOIN (foreign key constraints).
Each of these physical ODI repository tables stores a quality screen type and can be used to load the screen dimension. Finally, we have the error table, E$_PROJECTS. This is the table that tracks each failed constraint for the target data store during execution of the mapping. Each failed constraint check for this target will be captured in the error table.

Figure 6: Error Event Schema

Loading the Error Event Schema

The physical tables that map to the logical Error Event Schema model have now been identified. Next, the ETL that will load that physical schema must be built. Using various tables in the ODI work repository, along with the E$ tables, we can create the Error Event Schema fact and dimensions. First, let’s look at the dimensions.

Note: In this example, the ODI work repository schema is ODI_REPO and the work schema is DW_WRK.

The BATCH_DIM is simply based on a Load Plan execution instance.

SELECT LPI.I_LP_INST, LPI.NB_RUN, LPI.LOAD_PLAN_NAME, LPI.START_DATE, LPI.END_DATE
FROM ODI_REPO.SNP_LPI_RUN LPI;
Code: BATCH_DIM source

The SCREEN_DIM source query is a bit more involved but still not too difficult to write. The driving table for this example is SNP_COND, which stores the condition constraints in the ODI work repository. The table (data store) and model in which that table resides are also added as attributes. Finally, the SQL for the condition must be added by joining to the SNP_TXT_HEADER table. If you recall from earlier, there is more than one type of constraint that can be tracked in the ODI metadata-based Error Event Schema. For this article, the foreign key constraints, found in the SNP_JOIN table, and the primary key constraints, from the SNP_KEY table, were left out of the code. Finishing up the full ETL for the SCREEN_DIM and including those additional quality screen types has been left as an exercise for the reader; a sketch of one possible starting point follows the SCREEN_DIM source below.
SELECT C.I_COND, M.COD_MOD, T.TABLE_NAME, C.COND_NAME, C.COND_TYPE, TXT.FULL_TEXT
FROM ODI_REPO.SNP_COND C
INNER JOIN ODI_REPO.SNP_TXT_HEADER TXT ON C.I_TXT_COND_SQL = TXT.I_TXT
INNER JOIN ODI_REPO.SNP_TABLE T ON C.I_TABLE = T.I_TABLE
INNER JOIN ODI_REPO.SNP_MODEL M ON T.I_MOD = M.I_MOD;
Code: SCREEN_DIM source
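As a starting point for that exercise, the other two screen types could be folded in with a UNION ALL along the following lines. The SNP_COND columns match the query above; the SNP_KEY and SNP_JOIN column names shown here are illustrative placeholders patterned on SNP_COND, so verify them against the repository tables in your ODI version before use.

SELECT C.I_COND AS SCREEN_ID, C.COND_NAME AS SCREEN_NAME,
       'CONDITION' AS SCREEN_TYPE, T.TABLE_NAME
FROM ODI_REPO.SNP_COND C
INNER JOIN ODI_REPO.SNP_TABLE T ON C.I_TABLE = T.I_TABLE
UNION ALL
SELECT K.I_KEY, K.KEY_NAME, 'PRIMARY KEY', T.TABLE_NAME     -- placeholder columns
FROM ODI_REPO.SNP_KEY K
INNER JOIN ODI_REPO.SNP_TABLE T ON K.I_TABLE = T.I_TABLE
UNION ALL
SELECT J.I_JOIN, J.JOIN_NAME, 'FOREIGN KEY', T.TABLE_NAME   -- placeholder columns
FROM ODI_REPO.SNP_JOIN J
INNER JOIN ODI_REPO.SNP_TABLE T ON J.I_TABLE = T.I_TABLE;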
The DATE_DIM is a standard date dimension, so there’s no need to go into detail on that table. Let’s dig into the fact table now, which is a bit more involved than the two dimensions.

Start first with the main source of the fact: the E$-prefixed error tables. In this example, we’re only using one constraint to check one table. In real life, there could be hundreds of these E$ tables that must flow data into the fact. To simplify the overall mapping, and also to lessen the impact of adding in new E$ tables, a reusable mapping can be created. This reusable mapping will contain all E$ tables combined with a set component that performs a UNION between each of the common attributes. The output of the reusable mapping can then be added to the main mapping, generating an inline view during code execution.

Figure 7: RMAP Error Tables Set

Another reusable mapping that will be added to the final mapping SQL, generating another inline view, is built to return the target table, or tables, for a given mapping. This code walks through the mapping components and their connection points, checking for any data store component that has an output connection point that is not also an input to another component, and returns those table names. Here’s a look at the code behind the reusable mapping.

SELECT C.I_COND, M.I_MAPPING
FROM ODI_REPO.SNP_MAPPING M
INNER JOIN ODI_REPO.SNP_MAP_COMP MC ON M.I_MAPPING = MC.I_OWNER_MAPPING
INNER JOIN ODI_REPO.SNP_MAP_CP CP ON MC.I_MAP_COMP = CP.I_OWNER_MAP_COMP
INNER JOIN ODI_REPO.SNP_MAP_REF MR ON MC.I_MAP_REF = MR.I_MAP_REF
INNER JOIN ODI_REPO.SNP_TABLE T ON MR.I_REF_ID = T.I_TABLE
INNER JOIN ODI_REPO.SNP_MODEL MDL ON T.I_MOD = MDL.I_MOD
LEFT OUTER JOIN ODI_REPO.SNP_COND C ON T.I_TABLE = C.I_TABLE
WHERE CP.DIRECTION = 'O'   -- connection point direction: out
AND CP.I_MAP_CP NOT IN     -- connection point is not a starting CP
  (SELECT I_START_MAP_CP FROM ODI_REPO.SNP_MAP_CONN);
Code: Get the Mapping target table

Several additional tables will be used to round out the fact table load. The tables used from the ODI work repository will join to the session number that was recorded in the E$ table for a particular data quality screen failure, working their way up to the Load Plan instance run. This will provide the link from quality screen to batch run for the fact table load.

Figure 8: Fact Map

Here’s a look at the main code logic (without dimension lookups, the E$ table reusable mapping, etc.).

SELECT ...
FROM DW_WRK.E$_PROJECTS PROJ
INNER JOIN ODI_REPO.SNP_SESSION SESS ON PROJ.ODI_SESS_NO = SESS.GLOBAL_ID
INNER JOIN ODI_REPO.SNP_LPI_STEP_LOG LPSL ON SESS.SESS_NO = LPSL.SESS_NO
INNER JOIN ODI_REPO.SNP_LPI_STEP LPIS ON LPSL.I_LP_INST = LPIS.I_LP_INST
  AND LPSL.I_LP_STEP = LPIS.I_LP_STEP
INNER JOIN ODI_REPO.SNP_LPI_RUN LPI ON LPIS.I_LP_INST = LPI.I_LP_INST
INNER JOIN ODI_REPO.SNP_SCEN SCEN ON SESS.SCEN_NAME = SCEN.SCEN_NAME
  AND SESS.SCEN_VERSION = SCEN.SCEN_VERSION
INNER JOIN ODI_REPO.SNP_SESS_TASK_LOG TL ON LPSL.SESS_NO = TL.SESS_NO
INNER JOIN
  (SELECT C.I_COND, M.I_MAPPING
   FROM ODI_REPO.SNP_MAPPING M
   INNER JOIN ODI_REPO.SNP_MAP_COMP MC ON M.I_MAPPING = MC.I_OWNER_MAPPING
   INNER JOIN ODI_REPO.SNP_MAP_CP CP ON MC.I_MAP_COMP = CP.I_OWNER_MAP_COMP
   INNER JOIN ODI_REPO.SNP_MAP_REF MR ON MC.I_MAP_REF = MR.I_MAP_REF
   INNER JOIN ODI_REPO.SNP_TABLE T ON MR.I_REF_ID = T.I_TABLE
   INNER JOIN ODI_REPO.SNP_MODEL MDL ON T.I_MOD = MDL.I_MOD
   LEFT OUTER JOIN ODI_REPO.SNP_COND C ON T.I_TABLE = C.I_TABLE
   WHERE CP.DIRECTION = 'O'   -- connection point direction: out
   AND CP.I_MAP_CP NOT IN     -- connection point is not a starting CP
     (SELECT I_START_MAP_CP FROM ODI_REPO.SNP_MAP_CONN)
  ) C ON SCEN.I_MAPPING = C.I_MAPPING
WHERE LPIS.IND_ENABLED = '1' AND TL.TASK_STATUS = 'D' AND TL.NB_ERR > 0;
Code: ERROR_EVENT_FACT source

The load of the Error Event Schema can be scheduled to run at whatever interval is necessary. Often the data will only need to be loaded once per day, after the main batch ETL processing has completed. But the process could be run at any time, given the requirements for analysis of the Error Event Schema data.

Conclusion

When it comes to data warehousing, it’s important to have a good handle on the quality of the data flowing into and through the data warehouse. Remember, this data is ultimately used to make daily task decisions and long-term strategic decisions throughout your company. Capturing the bad quality data and understanding the cause of these data issues will help: not only will the data output improve, but ultimately so will the ETL code and, hopefully, the source system data constraints as well. Following the Kimball ETL subsystems, specifically the quality screens and Error Event Schema, will enhance the development, management and monitoring of the data warehouse. And as shown here, ODI 12c has the capability to implement the ETL subsystems and, through the use of logical constraints and metadata tables, can load the Error Event Schema.

Contact
Michael Rainey is the data integration lead at Rittman Mead America, driving the growth
of the data integration practice. Focused on building a team of Oracle data integration
experts, Michael provides technical direction and leadership. His expertise in Oracle Data
Integrator and Oracle GoldenGate enables him to help the Rittman Mead data integration
team deliver excellent products and solutions through project oversight. He also is the lead
instructor for the Rittman Mead Oracle Data Integrator Bootcamp in America. Michael enjoys
sharing his knowledge through blog posts, articles and presentations at many of the great
Oracle conferences throughout the world.

