Source System: An operational system of record whose function is to capture the transactions of the business. A source system is often called a legacy system in a mainframe environment. The main priorities of the source system are uptime and availability. Queries against source systems are narrow, account-based queries that are part of the normal transaction flow and are severely restricted in their demands on the legacy system. Source systems have keys that make certain things unique, such as product keys or customer keys. We call these source system keys production keys, and we treat them as attributes, just like any other textual description. We never use the production keys as the keys within our data warehouse.

Data Staging Area: A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse. The data staging area is everything between the source system and the presentation server. The key defining restriction on the data staging area is that it does not provide query and presentation services.

Presentation Server: The target physical machine on which the data warehouse data is organized and stored for direct querying by end users, report writers, and other applications. Three very different systems are required for a data warehouse to function: the source system, the data staging area, and the presentation server. The source system should be thought of as outside the data warehouse, since we have no control over the content and format of the data in the legacy system. The data staging area is the initial storage and cleaning system for data that is moving toward the presentation server, and it may consist of nothing more than a system of flat files. It is the presentation server where we insist that the data be presented and stored in a dimensional framework. If the presentation server is based on a relational database, the tables will be organized as star schemas. If the presentation server is based on non-relational on-line analytic processing (OLAP) technology, the data will still have recognizable dimensions.

Dimensional Model: A specific discipline for modeling data that is an alternative to entity-relationship (E/R) modeling. A dimensional model contains the same information as an E/R model but packages the data in a symmetric format whose design goals are user understandability, query performance, and resilience to change. The main components of a dimensional model are fact tables and dimension tables. A fact table is the primary table in each dimensional model and is meant to contain measurements of the business; we use the word fact to represent a business measure. Every fact table represents a many-to-many relationship, and every fact table contains a set of two or more foreign keys that join to their respective dimension tables. A dimension table is one of a set of companion tables to a fact table. Each dimension is defined by its primary key, which serves as the basis for referential integrity with any fact table to which it is joined. Most dimension tables contain many textual attributes (fields).
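To make the star schema idea concrete, here is a minimal sketch using Python's standard sqlite3 module as a stand-in relational presentation server. The table and column names (sales_fact, product_dim, date_dim) are illustrative, not prescribed by the architecture; note that the dimension tables carry warehouse-assigned surrogate keys, while the production key (the SKU) appears only as an attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: surrogate primary keys plus descriptive textual attributes.
# The production key (sku) from the source system is stored only as an attribute.
cur.execute("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,   -- surrogate key owned by the warehouse
        sku         TEXT,                  -- production key, treated as an attribute
        description TEXT,
        category    TEXT
    )""")
cur.execute("""
    CREATE TABLE date_dim (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        month     TEXT,
        year      INTEGER
    )""")

# Fact table: foreign keys to each dimension plus numeric measurements (facts).
cur.execute("""
    CREATE TABLE sales_fact (
        date_key     INTEGER REFERENCES date_dim(date_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        units_sold   INTEGER,
        dollar_sales REAL
    )""")

cur.execute("INSERT INTO product_dim VALUES (1, 'A-100', 'Widget', 'Hardware')")
cur.execute("INSERT INTO date_dim VALUES (1, '2024-01-15', 'January', 2024)")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 3, 29.85)")

# A typical dimensional query: constrain on dimension attributes, aggregate the facts.
cur.execute("""
    SELECT p.category, d.month, SUM(f.dollar_sales)
    FROM sales_fact f
    JOIN product_dim p ON f.product_key = p.product_key
    JOIN date_dim d    ON f.date_key    = d.date_key
    GROUP BY p.category, d.month
""")
print(cur.fetchall())   # [('Hardware', 'January', 29.85)]
```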
Business Process: A coherent set of business activities that make sense to the business users of our data warehouses. A business process is usually a set of activities such as order processing or customer pipeline management, but business processes can overlap.

Data Mart: A logical subset of the complete data warehouse. A data warehouse is made up of the union of all its data marts. A data mart is typically sponsored and built by a single part of the business and is usually organized around a single business process. Every data mart must be represented by a dimensional model and, within a single data warehouse, all such data marts must be built from conformed dimensions and conformed facts. This is the basis of the Data Warehouse Bus Architecture. There are two contrasting points of view about top-down vs. bottom-up data warehouses. The extreme top-down perspective is that a completely centralized, tightly designed master database must be completed before parts of it are summarized and published as individual data marts. The extreme bottom-up perspective is that an enterprise data warehouse can be assembled from disparate and unrelated data marts.

Data Warehouse: The queryable source of data in the enterprise. The data warehouse is nothing more than the union of all the constituent data marts. A data warehouse is fed from the data staging area. The data warehouse manager is responsible both for the data warehouse and the data staging area.

Operational Data Store (ODS): The ODS was meant to serve as the point of integration for operational systems. This was especially important for legacy systems that grew up independent of each other. Banks, for example, typically had several independent systems set up to support different products: loans, checking accounts, savings accounts, and so on. The advent of teller support computers and the ATM helped push many banks to create an operational data store to integrate current balances and recent history from these separate accounts under one customer number.

OLAP (On-Line Analytic Processing): OLAP vendors' technology is non-relational and is almost always based on an explicit multidimensional cube of data. OLAP databases are also known as multidimensional databases, or MDDBs. OLAP installations would be classified as small, individual data marts when viewed against the full range of data warehouse applications.

ROLAP (Relational OLAP): A set of user interfaces and applications that give a relational database a dimensional flavor.

MOLAP (Multidimensional OLAP): A set of user interfaces, applications, and proprietary database technologies that have a strongly dimensional flavor.
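As a rough illustration of the difference, the sketch below (with made-up dimension values and measures) contrasts a ROLAP-style request, which is answered by aggregating relational rows on demand, with a MOLAP-style cube, in which every combination of dimension values is pre-aggregated into a directly addressable cell.

```python
from itertools import product as cross

# Detailed facts as they might sit in relational star schema tables (illustrative data).
facts = [
    {"month": "Jan", "category": "Hardware", "dollar_sales": 100.0},
    {"month": "Jan", "category": "Software", "dollar_sales": 250.0},
    {"month": "Feb", "category": "Hardware", "dollar_sales": 175.0},
]

# ROLAP flavor: each request aggregates the relational rows on demand
# (in practice the tool would emit SQL with a GROUP BY against the star schema).
def rolap_total(month, category):
    return sum(f["dollar_sales"] for f in facts
               if f["month"] == month and f["category"] == category)

# MOLAP flavor: pre-compute an explicit cube so each cell is a direct lookup.
months = ["Jan", "Feb"]
categories = ["Hardware", "Software"]
cube = {(m, c): rolap_total(m, c) for m, c in cross(months, categories)}

print(rolap_total("Jan", "Hardware"))   # 100.0, computed on demand
print(cube[("Jan", "Hardware")])        # 100.0, read from a pre-aggregated cell
```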
End User Application: A collection of tools that query, analyze, and present information targeted to support a business need.

End User Data Access Tool: A client of the data warehouse. In a relational data warehouse, such a client maintains a session with the presentation server, sending a stream of separate SQL requests to the server. Eventually the end user data access tool is done with the SQL session and turns around to present a screen of data or a report, a graph, or some other higher form of analysis to the user.

Ad Hoc Query Tool: A specific kind of end user data access tool that invites the user to form their own queries by directly manipulating relational tables and their joins.

Modeling Applications: A sophisticated kind of data warehouse client with analytic capabilities that transform or digest the output from the data warehouse. Modeling applications include:
- Forecasting models that try to predict the future
- Behavior scoring models that cluster and classify customer purchase behavior or customer credit behavior
- Allocation models that take cost data from the data warehouse and spread the costs across product groupings or customer groupings
- Most data mining tools

Metadata: All of the information in the data warehouse environment that is not the actual data itself.

Basic Processes of the Data Warehouse

Data staging is a major process that includes, among others, the following sub-processes: extracting, transforming, loading and indexing, and quality assurance checking.

Extracting. The extract step is the first step of getting data into the data warehouse environment. Extracting means reading and understanding the source data, and copying the parts that are needed to the data staging area for further work.

Transforming. Once the data is extracted into the data staging area, there are many possible transformation steps, including:
- Cleaning the data by correcting misspellings, resolving domain conflicts, dealing with missing data elements, and parsing into standard formats
- Purging selected fields from the legacy data that are not useful for the data warehouse
- Combining data sources, by matching exactly on key values or by performing fuzzy matches on non-key attributes, including looking up textual equivalents of legacy system codes
- Creating surrogate keys for each dimension record in order to avoid a dependence on legacy-defined keys, where the surrogate key generation process enforces referential integrity between the dimension tables and the fact tables
- Building aggregates for boosting the performance of common queries
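A minimal sketch of two of these transformation steps, assuming the staged extract arrives as simple Python dictionaries with made-up field names: misspelled category values are cleaned into a standard domain, and each distinct production key is assigned a warehouse surrogate key that the fact rows then reference.

```python
# Raw extract as it might land in the staging area (illustrative records).
staged_rows = [
    {"sku": "A-100", "category": "hrdware",  "units": 3, "sale_date": "2024-01-15"},
    {"sku": "B-200", "category": "Software", "units": 1, "sale_date": "2024-01-15"},
    {"sku": "A-100", "category": "Hardware", "units": 2, "sale_date": "2024-01-16"},
]

# Cleaning: resolve misspellings and domain conflicts into a standard set of values.
CATEGORY_FIXES = {"hrdware": "Hardware", "hardware": "Hardware", "software": "Software"}

def clean_category(value):
    return CATEGORY_FIXES.get(value.lower(), value.title())

# Surrogate key assignment: the production key (sku) becomes just an attribute;
# the warehouse hands out its own integer keys in arrival order.
surrogate_keys = {}   # production key -> surrogate key
product_dim = []      # dimension records
fact_rows = []        # fact records referencing surrogate keys only

for row in staged_rows:
    sku = row["sku"]
    if sku not in surrogate_keys:
        surrogate_keys[sku] = len(surrogate_keys) + 1
        product_dim.append({"product_key": surrogate_keys[sku],
                            "sku": sku,
                            "category": clean_category(row["category"])})
    fact_rows.append({"product_key": surrogate_keys[sku],
                      "sale_date": row["sale_date"],
                      "units": row["units"]})

print(product_dim)  # two dimension records, surrogate keys 1 and 2
print(fact_rows)    # three fact records joined to the dimension via surrogate keys
```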
Loading and Indexing. At the end of the transformation process, the data is in the form of load record images. Loading in the data warehouse environment usually takes the form of replicating the dimension tables and fact tables and presenting these tables to the bulk loading facilities of each recipient data mart. Bulk loading is a very important capability, to be contrasted with record-at-a-time loading, which is far slower. The target data mart must then index the newly arrived data for query performance, if it has not already done so.

Quality Assurance Checking. Quality assurance can be checked by running a comprehensive exception report over the entire set of newly loaded data. All reported values must be consistent with the time series of similar values that preceded them. The exception report is probably built with the data mart's end user report writing facility. (A small sketch of such a consistency check appears at the end of this section.)

Release/Publishing. When each data mart has been freshly loaded and quality assured, the user community must be notified that the new data is ready. Publishing also communicates the nature of any changes that have occurred in the underlying dimensions and any new assumptions that have been introduced into the measured or calculated facts.

Updating. Contrary to the original religion of the data warehouse, modern data marts may well be updated, sometimes frequently. Incorrect data should obviously be corrected. Changes in labels, changes in hierarchies, changes in status, and changes in corporate ownership often trigger necessary changes in the original data stored in the data marts that comprise the data warehouse, but in general these are managed load updates, not transactional updates.

Querying. Querying is a broad term that encompasses all the activities of requesting data from a data mart, including ad hoc querying by end users, report writing, complex decision support applications, requests from models, and full-fledged data mining. Querying never takes place in the data staging area; by definition, querying takes place on a data warehouse presentation server. Querying, obviously, is the whole point of using the data warehouse.

Auditing. At times it is critically important to know where the data came from and what calculations were performed. Audit records are linked directly to the real data in such a way that a user can ask for the audit record (the lineage) of the data at any time.

Securing. The valuable, sensitive data in the warehouse must be protected from hackers, snoopers, and industrial spies. The data warehouse team must now include a new senior member: the data warehouse security architect. Data warehouse security must be managed centrally, from a single console.

Backing Up and Recovering. Since data warehouse data is a flow of data from the legacy systems on through to the data marts and eventually onto the users' desktops, a real question arises about where to take the necessary snapshots of the data for archival purposes and disaster recovery. Additionally, it may be even more complicated to back up and recover all of the metadata that greases the wheels of the data warehouse operation.
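The quality assurance idea above can be made concrete with one very small check: compare a newly loaded daily total against the recent history of the same measure and flag anything that deviates too far. The data and the three-standard-deviation threshold are made up for illustration; a real exception report would apply many such rules across the whole load.

```python
import statistics

# Recent history of one measure (e.g., daily dollar sales) already in the data mart,
# followed by the value just loaded. All numbers are illustrative.
history = [1020.0, 980.5, 1010.0, 995.0, 1005.5, 990.0, 1015.0]
newly_loaded = 2400.0

mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Flag the new value if it falls more than 3 standard deviations from recent history.
THRESHOLD_SIGMAS = 3
if abs(newly_loaded - mean) > THRESHOLD_SIGMAS * stdev:
    print(f"EXCEPTION: {newly_loaded} is inconsistent with recent history "
          f"(mean {mean:.1f}, stdev {stdev:.1f})")
else:
    print("Newly loaded value is consistent with the preceding time series.")
```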