Data Warehousing

PROGRAM: BACHELOR OF COMPUTER APPLICATION

SEMESTER: 6th SEMESTER
SUBJECT CODE & NAME: BC0058- DATA WAREHOUSING

Q1. DIFFERENTIATE BETWEEN OLTP AND DATA WAREHOUSE?

Differences between OLTP and Data Warehouse
Application databases are OLTP (On-Line Transaction Processing) systems where every transaction has to
be recorded as and when it occurs. Consider the scenario where a bank ATM has disbursed cash to a
customer but was unable to record this event in the bank records. If this happens frequently, the bank
wouldn't stay in business for too long. So the banking system is designed to make sure that every
transaction gets recorded within the time you stand before the ATM machine.

A Data Warehouse (DW), on the other hand, is a database (yes, you are right, it's a database) that is
designed for facilitating querying and analysis. Often designed as OLAP (On-Line Analytical Processing)
systems, these databases contain read-only data that can be queried and analyzed far more efficiently as
compared to your regular OLTP application databases. In this sense an OLAP system is designed to be
read-optimized.
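
As a rough illustration of this contrast, the sketch below uses Python's built-in sqlite3 module. The table and column names (atm_txn, account_id, amount, txn_day) are hypothetical and not part of any real banking system. One function records a single withdrawal the way an OLTP system would; the other runs the kind of read-only aggregate query a Data Warehouse is designed for.

```python
import sqlite3

def record_withdrawal(conn, account_id, amount, txn_day):
    """OLTP-style work: one small write, committed as soon as it occurs."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO atm_txn (account_id, amount, txn_day) VALUES (?, ?, ?)",
            (account_id, amount, txn_day),
        )

def daily_totals(conn):
    """Warehouse-style work: a read-only aggregate over many rows."""
    rows = conn.execute(
        "SELECT txn_day, SUM(amount) FROM atm_txn GROUP BY txn_day ORDER BY txn_day"
    )
    return rows.fetchall()

# Hypothetical table; a single in-memory database is used only to keep the sketch short.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atm_txn (account_id TEXT, amount REAL, txn_day TEXT)")
record_withdrawal(conn, "A-1", 100.0, "2024-01-01")
record_withdrawal(conn, "A-2", 50.0, "2024-01-01")
record_withdrawal(conn, "A-1", 20.0, "2024-01-02")
print(daily_totals(conn))
```

In practice the two workloads would run against separate databases, which is exactly the separation discussed next.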

Separation from your application database also ensures that your business intelligence solution is scalable
(your bank and ATMs don't go down just because the CFO asked for a report), better documented and
managed.

Creation of a DW leads to a direct increase in quality of analysis as the table structures are simpler (you
keep only the needed information in simpler tables), standardized (well-documented table structures), and
often de-normalized (to reduce the linkages between tables and the corresponding complexity of queries).
Having a well-designed DW is the foundation for successful BI (Business Intelligence)/Analytics
initiatives, which are built upon it.

Data Warehouses usually store many months or years of data. This is to support historical analysis. OLTP
systems usually store data from only a few weeks or months. The OLTP system stores historical data
only as needed to successfully meet the requirements of the current transaction.

PROPERTY            OLTP                  DATA WAREHOUSE
Data Model          3NF                   Multidimensional
Indexes             Few                   Many
Joins               Many                  Some
Duplicate Data      Normalized            Denormalized
Aggregate Data      Rare                  Common
Queries             Mostly Predefined     Mostly Ad Hoc
Nature of Queries   Mostly Simple         Mostly Complex
Updates             All the Time          Not Allowed, Only Refreshed
Historical Data     Often Not Available   Essential

Q2. WHAT ARE THE KEY ISSUES IN PLANNING A DATA WAREHOUSE?

Planning Data Warehouses and Key Issues

More than any other factor, improper planning and inadequate project management tend to result in
failures. First and foremost, determine if your company really needs a Data Warehouse. Is it really ready
for one? You need to develop criteria for assessing the value expected from your Data Warehouse. Your
company has to decide on the type of Data Warehouse to be built and where to keep it. You have to
ascertain where the data is going to come from and even whether you have all the needed data. You have
to establish who will be using the Data Warehouse, how they will use it, and at what times.

We will discuss the various issues related to the proper planning of a Data Warehouse.

You will learn how a Data Warehouse project differs from the types of projects you were involved in
in the past. We will study the guidelines for making your Data Warehouse projects a success.

Key Issues during Data Warehouse Construction

Planning for your Data Warehouse begins with a thorough consideration of the key issues. Answers to the
key questions are vital for the proper planning and the successful completion of the project. Therefore,
let us consider the pertinent issues, one by one.

Values and Expectations. Some companies jump into Data Warehousing without assessing the value to be
derived from their proposed Data Warehouse. Of course, first you have to be sure that, given the culture
and the current requirements of your company, a Data Warehouse is the most viable solution. After you
have established the suitability of this solution, only then can you begin to enumerate the benefits and
value propositions.

Risk Assessment. Planners generally associate project risks with the cost of the project. If the project
fails, how much money will go down the drain? But the assessment of risks is more than calculating the
loss from the project costs. What are the risks faced by the company without the
benefits derivable from a Data Warehouse? What losses are likely to be incurred? What opportunities
are likely to be missed?

Q3. EXPLAIN THE SOURCE DATA COMPONENT AND DATA STAGING
COMPONENT OF DATA WAREHOUSE ARCHITECTURE

Components of Data Warehouse Architecture

The major components of DWH Architecture are:

Source Data Component
    Production Data
    Internal Data
    Archived Data
    External Data
Data Staging Component
    Data Extraction
    Data Transformation
    Data Loading

Source Data Component
Production Data. This category of data comes from the various operational systems of the enterprise.
Based on the information requirements in the Data Warehouse, you choose segments of data from the
different operational systems. While dealing with this data, you come across many variations in the
data formats. You also notice that the data resides on different hardware platforms. Further, the
data is supported by different database systems and operating systems. This is the data from many
vertical applications.
In operational systems, information queries are narrow. You query an operational system for
information about specific instances of business objects. You may want just the name and address of
a single customer. Or, you may need the orders placed by a single customer in a single week. Or, you
may just need to look at a single invoice and the items billed on that single invoice. In operational
systems, you do not have broad queries. You do not query the operational system in unexpected ways.
The queries are all predictable. Again, you do not expect a particular query to run across different
operational systems. What does all this mean? There is no conformance of data among the various
operational systems of an enterprise. A term like an account may have different meanings in different
systems.
The significant and disturbing characteristic of production data is disparity. Your great challenge is
to standardize and transform the disparate data from the various production systems, convert the
data, and integrate the pieces into useful data for storage in the Data Warehouse.

Internal Data. In every organization, users keep their private spreadsheets, documents, customer
profiles, and sometimes even departmental databases. This is the internal data, parts of which could
be useful in a Data Warehouse for analysis.
If your organization does business with the customers on a one-to-one basis and the contribution of
each customer to the bottom line is significant, then detailed customer profiles with ample
demographics are important in a Data Warehouse. Profiles of individual customers become very
important for consideration. When your account representatives talk to their assigned customers or
when your marketing department wants to make specific offerings to individual customers, you need
the details. Although much of this data may be extracted from production systems, individuals and
departments in their private files hold a lot of it.
You cannot ignore the internal data held in private files in your organization. It is a collective
judgment call on how much of the internal data should be included in the Data Warehouse. The IT
department must work with the user departments to gather the internal data.
Internal data adds complexity to the process of transforming and integrating the data
before it can be stored in the Data Warehouse. You have to determine strategies for collecting data
from spreadsheets, find ways of taking data from textual documents, and tie into departmental
databases to gather pertinent data from those sources. Again, you may want to schedule the
acquisition of internal data. Initially, you may want to limit yourself to only some significant portions
before going live with your first data mart.

Archived Data. Operational systems are primarily intended to run the current business. In every
operational system, you periodically take the old data and store it in archived files. The circumstances
in your organization dictate how often and which portions of the operational databases are archived
for storage. Some data is archived after a year.
Sometimes data is left in the operational system databases for as long as five years. Many different
methods of archiving exist. There are staged archival methods. At the first stage, recent data is
archived to a separate archival database that may still be online. At the second stage, the older data
is archived to flat files on disk storage. At the next stage, the oldest data is archived to tape
cartridges or microfilm and even kept off-site.
As mentioned earlier, a Data Warehouse keeps historical snapshots of data. You essentially need
historical data for analysis over time. For getting historical information, you look into your archived
data sets. Depending on your Data Warehouse requirements, you have to include sufficient historical
data. This type of data is useful for detecting patterns and analyzing trends.

External Data. Most executives depend on data from external sources for a high percentage of the
information they use. They use statistics relating to their industry produced by external agencies.
They use market share data of competitors. They use standard values of financial indicators for their
business to check on their performance.

For example, the Data Warehouse of a car rental company contains data on the current production
schedules of the leading automobile manufacturers. This external data in the Data Warehouse helps
the car rental company plan for their fleet management. The purposes served by such external data
sources cannot be fulfilled by the data available within your organization itself. The insights gleaned
from your production data and your archived data are somewhat limited. They give you a picture based
on what you are doing or have done in the past. In order to spot industry trends and compare
performance against other organizations, you need data from external sources.
Usually, data from outside sources does not conform to your formats. You have to do conversions of
data into your internal formats and data types. You have to organize the data transmissions from the
external sources. Some sources may provide information at regular, stipulated intervals. Others may
give you the data on request. You need to accommodate the variations.
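
A minimal sketch of such a conversion, assuming (purely for illustration) that external feeds deliver dates in one of three formats and that the internal format is ISO 8601. The list KNOWN_FORMATS and the function name to_internal_date are hypothetical.

```python
from datetime import datetime

# Assumed formats used by external sources; adjust to the real feeds.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def to_internal_date(text):
    """Convert an externally supplied date string to the internal ISO format."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized external date format: {text!r}")

print(to_internal_date("31/12/2023"))
print(to_internal_date("20231231"))
```

Rejecting unknown formats with an error, rather than guessing, keeps badly formatted external records out of the Data Warehouse until someone has looked at them.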

Data Staging Component
After you have extracted data from various operational systems and from external sources, you have
to prepare the data for storing in the Data Warehouse. The extracted data coming from several
disparate sources needs to be changed, converted, and made ready in a format that is suitable to be
stored for querying and analysis.
Three major functions need to be performed for getting the data ready. You have to extract the
data, transform the data, and then load the data into the Data Warehouse storage. These three
major functions of extraction, transformation, and preparation for loading take place in a staging
area. The data-staging component consists of a workbench for these functions. Data staging provides
a place and an area with a set of functions to clean, change, combine, convert, deduplicate, and
prepare source data for storage and use in the Data Warehouse.
Why do you need a separate place or component to perform the data preparation? Can't you move the
data from the various sources into the Data Warehouse storage itself and then prepare the data?
When we implement an operational system, we are likely to pick up data from different sources, move
the data into the new operational system database, and run data conversions. Why can't this method
work for a Data Warehouse? The essential difference here is this: in a Data Warehouse you pull in
data from many source operational systems. Remember that data in a Data Warehouse is subject-
oriented and cuts across operational applications. A separate staging area, therefore, is a necessity
for preparing data for the Data Warehouse.
Now that we have clarified the need for a separate data-staging component, let us understand what
happens in data staging. We will now briefly discuss the three major functions that take place in the
staging area.
Data Extraction. This function has to deal with numerous data sources. You have to employ the
appropriate technique for each data source. Source data may be from different source machines in
diverse data formats. Part of the source data may be in relational database systems. Some data may
be in legacy network and hierarchical data models. Many data sources may still be in flat files.
You may want to include data from spreadsheets and local departmental data sets. Data extraction
may become quite complex.
Tools are available on the market for data extraction. You may want to consider using outside tools
suitable for certain data sources. For the other data sources, you may want to develop in-house
programs to do the data extraction. Purchasing outside tools may entail high initial costs. In-house
programs, on the other hand, may mean ongoing costs for development and maintenance.
After you extract the data, where do you keep the data for further preparation? You may perform
the extraction function in the legacy platform itself if that approach suits your framework. More
frequently, Data Warehouse implementation teams extract the source into a separate physical
environment from which moving the data into the Data Warehouse would be easier. In the separate
environment, you may extract the source data into a group of flat files, or a data-staging relational
database, or a combination of both.

Data Transformation. In every system implementation, data conversion is an important function. For
example, when you implement an operational system such as a magazine subscription application, you
have to initially populate your database with data from the prior system records. You may be
converting over from a manual system. Or, you may be moving from a file-oriented system to a modern
system supported with relational database tables. In either case, you will convert the data from the
prior systems. So, what is so different for a Data Warehouse? How is data transformation for a Data
Warehouse more involved than for an operational system?
Again, as you know, data for a Data Warehouse comes from many disparate sources. If data
extraction for a Data Warehouse poses great challenges, data transformation presents even greater
challenges. Another factor in the Data Warehouse is that the data feed is not just an initial load. You
will have to continue to pick up the ongoing changes from the source systems. Any transformation
tasks you set up for the initial load will be adapted for the ongoing revisions as well.
You perform a number of individual tasks as part of data transformation. First, you clean the data
extracted from each source. Cleaning may just be correction of misspellings, or may include resolution
of conflicts between state codes and zip codes in the source data, or may deal with providing default
values for missing data elements, or elimination of duplicates when you bring in the same data from
multiple source systems.
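
The cleaning tasks just listed can be sketched as follows. This is only an illustration under assumed field names (customer_id, city, country) and an assumed lookup table of known misspellings; real cleaning rules depend on the source systems.

```python
def clean_records(records, spelling_fixes, default_country="UNKNOWN"):
    """Fix known misspellings, fill in missing values, and drop duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        city = (rec.get("city") or "").strip()
        city = spelling_fixes.get(city, city)             # correct known misspellings
        country = rec.get("country") or default_country   # default for missing data
        key = (rec["customer_id"], city, country)
        if key in seen:                                   # same data from another source
            continue
        seen.add(key)
        cleaned.append(
            {"customer_id": rec["customer_id"], "city": city, "country": country}
        )
    return cleaned

rows = [
    {"customer_id": 1, "city": "Bostn", "country": "US"},
    {"customer_id": 1, "city": "Boston", "country": "US"},  # duplicate once corrected
    {"customer_id": 2, "city": "Pune", "country": None},    # missing country
]
print(clean_records(rows, {"Bostn": "Boston"}))
```

Note that the duplicate is detected only after the misspelling has been corrected, which is why cleaning steps are usually applied in a fixed order.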
Standardization of data elements forms a large part of data transformation. You standardize the
data types and field lengths for the same data elements retrieved from the various sources. Semantic
standardization is another major task. You resolve synonyms and homonyms. When two or more terms
from different source systems mean the same thing, you resolve the synonyms. When a single term
means many different things in different source systems, you resolve the homonym.
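
One simple way to picture synonym and homonym resolution is a lookup keyed by source system and source term. The system names and terms in TERM_MAP below are invented for illustration only.

```python
# Hypothetical mapping from (source system, source term) to one warehouse term.
TERM_MAP = {
    ("billing", "client"): "customer",           # synonym: different names, same meaning
    ("sales", "customer"): "customer",
    ("billing", "account"): "billing_account",   # homonym: same name, different meanings
    ("banking", "account"): "deposit_account",
}

def standardize_term(source_system, term):
    """Return the warehouse term, or raise if the pair has not been resolved yet."""
    try:
        return TERM_MAP[(source_system, term.lower())]
    except KeyError:
        raise ValueError(
            f"unresolved term {term!r} from source system {source_system!r}"
        ) from None

print(standardize_term("billing", "Client"))
print(standardize_term("banking", "account"))
```

Keying on the source system as well as the term is what allows the same word, such as "account", to map to different warehouse terms.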

Data transformation involves many forms of combining pieces of data from the different sources. You
combine data from a single source record or related data elements from many source records. On the
other hand, data transformation also involves purging source data that is not useful and separating
out source records into new combinations. Sorting and merging of data takes place on a large scale in
the data staging area.
In many cases, the keys chosen for the operational systems are field values with built-in meanings.
For example, the product key value may be a combination of characters indicating the product
category, the code of the warehouse where the product is stored, and some code to show the
production batch. Primary keys in the Data Warehouse cannot have built-in meanings. Data
transformation also includes the assignment of surrogate keys derived from the source system
primary keys.
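
A minimal sketch of surrogate key assignment is shown below. The class name SurrogateKeyMap and the sample product codes are hypothetical; the point is only that the warehouse key is a plain integer with no built-in meaning, while the mapping back to the source key is kept separately.

```python
import itertools

class SurrogateKeyMap:
    """Hands out meaningless integer keys, one per (source system, source key) pair."""

    def __init__(self):
        self._next = itertools.count(1)
        self._keys = {}

    def key_for(self, source_system, source_key):
        natural = (source_system, source_key)
        if natural not in self._keys:
            self._keys[natural] = next(self._next)   # assign the next unused integer
        return self._keys[natural]

keys = SurrogateKeyMap()
print(keys.key_for("pos", "GRO-W12-B0457"))   # first key seen
print(keys.key_for("erp", "GRO-W12-B0457"))   # same text from another source system
print(keys.key_for("pos", "GRO-W12-B0457"))   # repeated lookup returns the same key
```

Whether identical source keys from two systems should share one surrogate key is a design decision; this sketch assumes they should not until the records have been matched.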
A grocery chain point-of-sale operational system keeps the unit sales and revenue amounts by
individual transactions at the checkout counter at each store. But in the Data Warehouse, it may not
be necessary to keep the data at this detailed level. You may want to summarize the totals by product
at each store for a given day and keep the summary totals of the sale units and revenue in the Data
Warehouse storage. In such cases, the data transformation function would include appropriate
summarization.
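
The grocery example can be sketched as a roll-up from checkout transactions to daily totals per store and product. The field names (store, product, day, units, revenue) are assumed for illustration.

```python
from collections import defaultdict

def summarize_sales(transactions):
    """Roll checkout-level rows up to one row per (store, product, day)."""
    totals = defaultdict(lambda: [0, 0.0])   # [units, revenue]
    for t in transactions:
        bucket = totals[(t["store"], t["product"], t["day"])]
        bucket[0] += t["units"]
        bucket[1] += t["revenue"]
    return {key: tuple(vals) for key, vals in totals.items()}

sales = [
    {"store": "S1", "product": "milk", "day": "2024-05-01", "units": 2, "revenue": 2.5},
    {"store": "S1", "product": "milk", "day": "2024-05-01", "units": 4, "revenue": 5.0},
    {"store": "S2", "product": "milk", "day": "2024-05-01", "units": 1, "revenue": 1.25},
]
print(summarize_sales(sales))
```

Once the data is summarized this way, the individual checkout transactions can no longer be recovered from the warehouse, so the level of detail has to be chosen with the expected analyses in mind.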
When the data transformation function ends, you have a collection of integrated data that is cleaned,
standardized, and summarized. You now have data ready to load into each data set in your Data
Warehouse.

Data Loading. Two distinct groups of tasks form the data loading function. When you complete the
design and construction of the Data Warehouse and go live for the first time, you do the initial
loading of the data into the Data Warehouse storage. The initial load moves large volumes of data
using up substantial amounts of time. As the Data Warehouse starts functioning, you continue to
extract the changes to the source data, transform the data revisions, and feed the incremental data
revisions on an ongoing basis. The figure below illustrates the common types of data movements from
the staging area to the Data Warehouse storage.
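
The two groups of loading tasks can be sketched with Python's built-in sqlite3 module. The table dw_product and its columns are hypothetical, and replacing rows by key is only one possible way to apply revisions; a warehouse that must preserve history would add new rows instead.

```python
import sqlite3

def initial_load(conn, rows):
    """First go-live: create the target table and bulk-insert everything."""
    conn.execute(
        "CREATE TABLE dw_product (product_key INTEGER PRIMARY KEY, name TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO dw_product VALUES (?, ?)", rows)

def incremental_load(conn, changed_rows):
    """Ongoing feed: apply only new or changed rows, replacing by key."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO dw_product VALUES (?, ?)", changed_rows
        )

dw = sqlite3.connect(":memory:")
initial_load(dw, [(1, "milk"), (2, "bread")])
incremental_load(dw, [(2, "brown bread"), (3, "eggs")])   # one change, one new row
print(dw.execute("SELECT * FROM dw_product ORDER BY product_key").fetchall())
```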

Q4. DISCUSS THE EXTRACTION METHODS IN DATA WAREHOUSES.

Extraction Methods in Data Warehouses
The extraction method you choose depends heavily on the source system and also on the
business needs in the target Data Warehouse environment. Very often, it is not possible to add
logic to the source systems to support incremental extraction of data, because of the
performance impact or the increased workload on these systems. Sometimes the customer is not
even allowed to add anything to an out-of-the-box application.
The estimated amount of the data to be extracted and the stage in the ETL process (initial load or
maintenance of data) may also impact the decision of how to extract, from a logical and a physical
perspective. Basically, you have to decide how to extract data logically and physically.

Logical Extraction Methods
There are two kinds of logical extraction:
Full Extraction
Incremental Extraction

Full Extraction
The data is extracted completely from the source system. Since this extraction reflects all the data
currently available on the source system, there's no need to keep track of changes to the data source
since the last successful extraction. The source data will be provided as-is and no additional logical
information (for example, timestamps) is necessary on the source site. An example for a full
extraction may be an export file of a distinct table or a remote SQL statement scanning the complete
source table.
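
As a sketch of a full extraction, the function below scans a complete source table and writes it unchanged to a CSV flat file. The table orders and its columns are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io
import sqlite3

def full_extract(conn, table, out):
    """Scan the whole source table and write it as-is to a CSV flat file."""
    cursor = conn.execute(f"SELECT * FROM {table}")   # table name is assumed trusted
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])   # header row
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
flat_file = io.StringIO()          # stands in for a real file on disk
print(full_extract(source, "orders", flat_file), "rows extracted")
```

No timestamp or change tracking is needed here, because every run simply takes everything that is currently in the table.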

Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in history
will be extracted. This event may be the last time of extraction or a more complex business event like
the last booking day of the fiscal period. To identify this delta, there must be a way to
identify all the information that has changed since that specific event. This information can be
provided either by the source data itself, such as an application column holding the last-changed timestamp, or
by a change table where an additional mechanism keeps track of the changes alongside the
originating transactions. In most cases, using the latter method means adding extraction logic to the
source system.
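
The timestamp-based variant can be sketched as follows, assuming a hypothetical orders table with a last_changed column that stores ISO 8601 text, so that string comparison matches chronological order.

```python
import sqlite3

def incremental_extract(conn, last_extracted_at):
    """Return the rows changed since the previous successful extraction."""
    return conn.execute(
        "SELECT order_id, amount, last_changed FROM orders "
        "WHERE last_changed > ? ORDER BY last_changed",
        (last_extracted_at,),
    ).fetchall()

source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, amount REAL, last_changed TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 9.5, "2024-03-01T10:00:00"),
        (2, 20.0, "2024-03-02T08:30:00"),
        (3, 5.0, "2024-03-03T17:45:00"),
    ],
)
delta = incremental_extract(source, "2024-03-01T23:59:59")
print(delta)   # only the rows changed after the previous extraction
```

The extraction job has to remember the timestamp of its last successful run; a timestamp column alone also cannot reveal rows that were deleted from the source.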

Physical Extraction Methods
Depending on the chosen logical extraction method and the capabilities and restrictions on the source
side, the extracted data can be physically extracted by two mechanisms. The data can either be
extracted online from the source system or from an offline structure. Such an offline structure
might already exist or it might be generated by an extraction routine.
The following are the methods of physical extraction:
Online Extraction
Offline Extraction

Online Extraction
The data is extracted directly from the source system itself. The extraction process can connect
directly to the source system to access the source tables themselves or to an intermediate system
that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note
that the intermediate system is not necessarily physically different from the source system. With
online extractions, you need to consider whether the distributed transactions are using original
source objects or prepared source objects.

Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the original
source system. The data already has an existing structure (for example, redo logs, archive logs or
transportable tablespaces) or was created by an extraction routine.

You should consider the following structures:
Flat Files:
Data is in a defined, generic format. Additional information about the source object is necessary for
further processing.
Dump Files:
An Oracle-specific format in which information about the contained objects is included.
Redo and Archive Logs:
Redo logs comprise files in a proprietary format which log a history of all changes made to the
database. Each redo log file consists of redo records. A redo record (redo entry) holds a group of
change-vectors, each of which describes or represents a change made to a single block in the
database.
For example, if a user UPDATEs a salary-value in an employee-table, the DBMS generates a redo
record containing change-vectors that describe changes to the data segment block for the table. And
if the user then COMMITs the update, Oracle generates another redo record and assigns the change
a "system change number" (SCN).
A single transaction may involve multiple changes to data blocks, so it may have more than one redo
record.
Filled groups of redo log files can be saved to one or more offline destinations, known collectively as the archived redo
log, or more simply the archive log. The process of turning redo log files into archived redo log files is
called archiving. This process is only possible if the database is running in ARCHIVELOG mode. You
can choose automatic or manual archiving.

Q5. WRITE SHORT NOTES ON: (i) RAID 0 (ii) RAID 1
(i) RAID 0 (Striping)
RAID 0 provides data striping. It takes data that needs to be stored and distributes it evenly
between two or more hard drives. Because the system considers the two hard drives as one logical
hard drive, the data is stored only once.
In a two-drive setup, for example, RAID 0 saves and accesses data quickly and efficiently. Rather
than handling one block at a time, RAID 0 stores and retrieves two blocks of data simultaneously, one on each drive.
Theoretically, the time it takes to save and access information is cut in half over a single drive
system.
RAID 0 is popular for video and image production and editing, pre-press applications, and other
applications requiring high bandwidth. However, RAID 0 does not provide fault tolerance: if one drive
fails, the information on it is lost, and because every file is spread across all the drives, the array as a whole becomes unusable.
RAID 0 does not implement error checking, so any error is unrecoverable. More disks in the
array means higher bandwidth, but also a greater risk of data loss.
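
Striping can be pictured with a toy model in which each drive is a Python list of blocks. This is only an illustration of the round-robin layout; the function names and the tiny block size are assumptions, and real controllers work on much larger stripes in hardware or in the operating system.

```python
def stripe(data, n_drives, block_size):
    """Split data into blocks and deal them round-robin across the drives."""
    drives = [[] for _ in range(n_drives)]
    for i in range(0, len(data), block_size):
        drives[(i // block_size) % n_drives].append(data[i:i + block_size])
    return drives

def unstripe(drives):
    """Reassemble the original bytes by reading the drives in round-robin order."""
    out = []
    for row in range(max(len(d) for d in drives)):
        for d in drives:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)

drives = stripe(b"ABCDEFGHIJ", n_drives=2, block_size=2)
print(drives)             # each drive holds every other block
print(unstripe(drives))   # all drives are needed to rebuild the data
```

Because each drive holds only every n-th block, no single drive contains a complete copy of anything, which is why the loss of one drive cannot be repaired from the others.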

(ii) RAID 1 (Shadowing / Mirroring / Duplexing)
RAID level 1 refers to maintaining duplicate sets of all data on separate disk drives. Of the RAID
levels, level 1 provides the highest data availability since two complete copies of all information are
maintained. In addition, read performance may be enhanced if the array controller allows
simultaneous reads from both members of a mirrored pair. During writes, there will be a minor
performance penalty when compared to writing to a single disk. Higher availability will be achieved if
both disks in a mirror pair are on separate I/O buses, which is known as duplexing.
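
Mirroring can be sketched with a similar toy model, in which each drive is a dictionary from block number to data. The class MirroredPair is hypothetical and ignores rebuilds, controllers, and buses; it only shows that every write goes to both drives and that a read survives the failure of one of them.

```python
class MirroredPair:
    """Toy RAID 1: every write goes to both drives; a read needs only one."""

    def __init__(self):
        self.drives = [{}, {}]          # block number -> bytes, one dict per drive
        self.failed = [False, False]

    def write(self, block_no, data):
        for i, drive in enumerate(self.drives):
            if not self.failed[i]:
                drive[block_no] = data  # the same data is written twice

    def read(self, block_no):
        for i, drive in enumerate(self.drives):
            if not self.failed[i]:
                return drive[block_no]  # first healthy drive answers
        raise IOError("both drives in the mirror have failed")

mirror = MirroredPair()
mirror.write(0, b"payroll")
mirror.failed[0] = True                 # simulate the loss of one drive
print(mirror.read(0))                   # the data is still available
```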

Q6. WHAT IS METADATA MANAGEMENT? EXPLAIN INTEGRATED
METADATA MANAGEMENT WITH A BLOCK DIAGRAM.

Metadata Management
The purpose of Metadata management is to support the development and administration of data
warehouse infrastructure as well as analysis of the data over time.
Metadata is widely considered a promising driver for improving the effectiveness and efficiency of data
warehouse usage, development, maintenance and administration. Data warehouse usage can be
improved because metadata provides end users with additional semantics necessary to reconstruct
the business context of data stored in the data warehouse.

Integrated Metadata Management
An integrated Metadata Management supports all kinds of users who are involved in the data
warehouse development process. End users, developers and administrators can use/see the Metadata.
Developers and administrators mainly focus on technical Metadata but make use of business Metadata
if they want. Developers and administrators need metadata to understand transformations of object
data and underlying data flows as well as the technical and conceptual system architecture.
Several Metadata management systems are in existence. One such tool is the Integrated
Metadata Repository System (IMRS). It is a metadata management tool used to support a corporate
data management function and is intended to provide metadata management services. Thus, the IMRS
will support the engineering and configuration management of data environments incorporating
e-business transactions, complex databases, federated data environments, and data warehouses / data
marts. The metadata contained in the IMRS is used to support application development, data
integration, and the system administration functions needed to achieve data element semantic
consistency across a corporate data environment, and to implement integrated or shared data
environments.
Like data warehouse development, Metadata management has several sub-processes.
Some of them are listed below:
Metadata definition
Metadata collection
Metadata control
Metadata publication to the right people at the right time
Determining what kind of data is to be captured
