
INTERNAL RESEARCH ASSIGNMENT

Name of the candidate

Amit Kumar

Enrollment no.

00415904412

Course

MCA 4th sem

Batch

2012-2015

Subject

Data Warehousing Data Mining

Subject code

MCA- 204

Subject Teacher's name

Dr. Manoj Kr. Gupta

Assignment Submission Form


(For office use only)

Enrolment No. : 00415904412

Name: Amit Kumar


Course: MCA

Batch : 2012-2015

Section: 4th
Subject Name: Data Warehousing Data Mining

Subject Code : MCA-204

Faculty Name: Dr. Manoj Kr. Gupta


Date of submission of assignment

: ..

Signature of Student

Receipt
Assignment Submission Form

Enrolment No. : 00415904412

Name: Amit Kumar


Course: MCA

Batch : 2012-2015

Section: 4th
Subject Name: Data Warehousing Data Mining

Subject Code : MCA-204

Faculty Name: Dr. Manoj Kr. Gupta


Date of submission of assignment

: ..

Signature of receiver

ACKNOWLEDGEMENT

It is my profound privilege and pleasure to express my overwhelming sense of gratitude,
devotion and regard to my research assignment guide, Dr. Manoj Kr. Gupta, for his
valuable suggestions, timely guidance and words of encouragement during the assignment
work. Without his co-operation this assignment would not have taken the form it has today.
I am also very grateful to Dr. Manoj Kr. Gupta for his kind support and guidance in the
accomplishment of this assignment.

Amit Kumar
Roll No. : 00415904412
COURSE: MCA 4th semester

TABLE OF CONTENTS

S. No.  CONTENT
1.      Abstract
2.      Problem statement
3.      Introduction
4.      Characteristics of data warehouse: brief descriptions
5.      Architecture of the data warehouse of XYZ Publishing Corporation
6.      Operation of data warehouse and OLAP
7.      Operations of OLAP
8.      Merits of OLAP
9.      Data extraction
10.     Data extraction techniques
11.     Evaluation of the techniques for the publishing company
12.     Future scope
13.     References

Abstract

Data warehousing has become very popular among organizations seeking to use information
technology to gain a competitive advantage. Moreover, many vendors, having noticed this
trend, have begun to offer various kinds of hardware, software, and tools to help data
warehouses function more effectively. In this research assignment, I summarize the
development of data warehousing and the basic terminology needed to understand it, and I
present the case of a publishing company exploring the operations of a data warehouse,
making a case for OLAP. The study describes data warehousing and the merits of OLAP, why
OLAP is essential in this environment, and which technique is best for data extraction. All
the extraction techniques are studied to find the one most suitable for the company's criteria.

Problem Statement
As a senior analyst on the project team of a publishing company exploring the
operations for a data warehouse, make a case for OLAP. Describe the merits of OLAP
and how it will be essential in your environment. For the same company, determine which
tool is the best one for data extraction. Study all the tools and find which one will be best
suited to these criteria.

Introduction
The publishing industry is changing rapidly to meet the needs of a large number of customers,
and as a result customer relationships are becoming more important. To stay on the leading
edge of its industry and improve its customer relationships, the XYZ publishing house turned
to a data warehouse to manage its large volumes of data.
XYZ Publishing Corporation has a large volume of data to store, save, manage and process,
since it has branches in several Indian cities and states such as Delhi, Mumbai, Pune, Calcutta,
Madras, Bangalore and Kerala. XYZ Publishing Corporation needs to store data such as the
names of its customers, the number of copies sold per year, month, week, day and hour,
customer addresses, the stock in its warehouses and stores, and the number of employees
working in each store, along with their names, salaries and contact details. All such
multi-dimensional information needs to be stored in the data warehouse. And if data is stored
along multiple dimensions, then when we need to pull data out of the data warehouse we are
required to perform certain operations.
One of the key developments in the Information Systems (IS) field is data warehousing.
Unlike On-Line Transaction Processing (OLTP) databases, which are application-oriented,
detailed, and operational, a DW is a subject-oriented, integrated, non-volatile, time-variant,
non-updatable collection of data that supports management decision-making processes
and business intelligence. DWs are widely perceived as valuable devices for acquiring
information from multiple sources and delivering it to managers and analysts who may be
using different software or computer platforms with special features and capabilities. DWs
are meant to support managers with answers to important business questions that require
analytics such as pivoting, drill-downs, roll-ups, aggregations, and data slicing and dicing.
Moreover, all levels of management decision-making are supported by the DW through the
collection, integration, transformation, and interpretation of both internal and external data,
and it has been elaborated in the literature how a DW can provide useful and valuable
information and knowledge at the strategic, management control, knowledge and operational
levels.

Characteristics of data warehouse: brief descriptions

Subject-oriented: Data are grouped by subject. For example, data on customers are grouped
and stored as an interrelated set.

Integrated: Data are stored in a globally consistent format. This implies cleansing the data so
that the data have consistent naming conventions and physical attributes.

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about
your company's sales data, you can build a warehouse that concentrates on sales. Using this
warehouse, you can answer questions like "Who was our best customer for this item last
year?" This ability to define a data warehouse by subject matter, sales in this case, makes the
data warehouse subject oriented.

Integrated
Integration is closely related to subject orientation. Data warehouses must put data from
disparate sources into a consistent format. They must resolve such problems as naming
conflicts and inconsistencies among units of measure. When they achieve this, they are said
to be integrated.

Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is
logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very
much in contrast to online transaction processing (OLTP) systems, where performance
requirements demand that historical data be moved to an archive. A data warehouse's focus
on change over time is what is meant by the term time variant.

Before digging further into the main study objectives, it is crucial to differentiate a DW, the
repository of summarized data, from data warehousing, which revolves around the
development, management, methods, and practices that define how these summarized data
are acquired, integrated, interpreted, managed, and used within business organizations.
Business intelligence is rooted in interpreting the data acquired through environmental
scanning with respect to the context of a business task, and it is supposed to provide
tactical and strategic information to decision-makers so that they are able to manage and
coordinate operations and processes in their business organizations. For the purposes
of business intelligence, many analytical tools have been developed, such as Excel, reporting
tools, dashboards, OLAP, and data mining. Business intelligence revolves around knowledge
discovery and inference, analysing the data stored in the DW to acquire valuable information.

Architecture of the Data Warehouse of XYZ Publishing Corporation

Operation of data warehouse and OLAP


Various tools have been developed for this purpose, such as Excel, reporting tools, dashboards,
OLAP, and data mining. Among all of these, XYZ Publishing Corporation uses OLAP.
OLAP stands for online analytical processing, an approach to answering multi-dimensional
analytical queries swiftly. OLAP is part of the broader category of business intelligence, which
also encompasses relational databases, report writing and data mining. OLAP tools enable
users to analyze multidimensional data interactively from multiple perspectives. OLAP
consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing
and dicing. Consolidation involves the aggregation of data that can be accumulated and
computed in one or more dimensions.
In other words OLAP (online analytical processing) is computer processing that enables a
user to easily and selectively extract and view data from different points of view. For
example, a user can request that data be analyzed to display a spreadsheet showing all of a
company's beach ball products sold in Florida in the month of July, compare revenue figures
with those for the same products in September, and then see a comparison of other product
sales in Florida in the same time period. To facilitate this kind of analysis, OLAP data is
stored in a multidimensional database. Whereas a relational database can be thought of as
two-dimensional, a multidimensional database considers each data attribute (such as product,
geographic sales region, and time period) as a separate "dimension." OLAP software can
locate the intersection of these dimensions and display it. Attributes such as time periods can
be broken down into sub-attributes.
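
By analogy, for XYZ Publishing Corporation such a multidimensional request might compare
copies sold across states and months. The query below is a minimal sketch only, assuming a
hypothetical star schema (sales_fact, product_dim, store_dim, date_dim); the table, column and
state names are illustrative assumptions, not the company's actual design.

-- Compare copies sold for a product category across two states and two months.
SELECT d.month_name,
       s.state,
       SUM(f.copies_sold) AS total_copies
FROM   sales_fact  f
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   store_dim   s ON s.store_key   = f.store_key
JOIN   date_dim    d ON d.date_key    = f.date_key
WHERE  p.category   = 'Textbooks'
AND    s.state     IN ('Delhi', 'Maharashtra')
AND    d.month_name IN ('July', 'September')
GROUP BY d.month_name, s.state;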
OLAP can be used for data mining or the discovery of previously undiscerned relationships
between data items. An OLAP database does not need to be as large as a data warehouse,
since not all transactional data is needed for trend analysis. Using Open Database
Connectivity (ODBC), data can be imported from existing relational databases to create a
multidimensional database for OLAP.
Two leading OLAP products are Hyperion Solutions' Essbase and Oracle's Express Server.
OLAP products are typically designed for multiple-user environments, with the cost of the
software based on the number of users.

Operations of OLAP
OLAP supports various operations. Some of the popular operations are listed below:

Roll-up

Takes the current aggregation level of fact values and does a further aggregation on
one or more of the dimensions.

Equivalent to doing a GROUP BY on this dimension using its attribute hierarchy.

Decreases the number of dimensions - removes row headers.

SELECT [attribute list], SUM([attribute list])
FROM [table names]
WHERE [condition list]
GROUP BY [grouping list];
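
As an illustration of roll-up against the hypothetical star schema assumed earlier for XYZ
Publishing Corporation (the table and column names remain assumptions), day-level sales can
be rolled up the time hierarchy to the month level:

-- Roll-up: aggregate fact values from the day level to the month level
-- of the time dimension hierarchy.
SELECT d.year,
       d.month_name,
       SUM(f.copies_sold) AS monthly_copies
FROM   sales_fact f
JOIN   date_dim   d ON d.date_key = f.date_key
GROUP BY d.year, d.month_name;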

Drill-down

Opposite of roll-up.

Summarizes data at a lower level of a dimension hierarchy, thereby viewing data at a
more specialized level within a dimension.

Increases the number of dimensions - adds new row headers.
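
Drilling back down simply regroups the same facts at a finer level of the hierarchy; a sketch
against the same assumed schema:

-- Drill-down: regroup the monthly figures at the day level,
-- adding day_of_month as a new row header.
SELECT d.year,
       d.month_name,
       d.day_of_month,
       SUM(f.copies_sold) AS daily_copies
FROM   sales_fact f
JOIN   date_dim   d ON d.date_key = f.date_key
GROUP BY d.year, d.month_name, d.day_of_month;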

Slice

Performs a selection on one dimension of the given cube, resulting in a sub-cube.

Reduces the dimensionality of the cube.

Sets one or more dimensions to specific values and keeps a subset of dimensions for
selected values.

Dice

Defines a sub-cube by performing a selection on one or more dimensions.

Refers to range select condition on one dimension, or to select condition on more than
one dimension.

Reduces the number of member values of one or more dimensions.
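
Against the same assumed schema, a slice fixes one dimension to a single value, while a dice
restricts two or more dimensions to chosen sets or ranges; for example:

-- Slice: fix the store dimension to a single state.
SELECT p.title, d.month_name, SUM(f.copies_sold) AS copies
FROM   sales_fact  f
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   store_dim   s ON s.store_key   = f.store_key
JOIN   date_dim    d ON d.date_key    = f.date_key
WHERE  s.state = 'Delhi'
GROUP BY p.title, d.month_name;

-- Dice: restrict several dimensions at once to selected values and ranges.
SELECT p.category, s.state, d.year, SUM(f.copies_sold) AS copies
FROM   sales_fact  f
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   store_dim   s ON s.store_key   = f.store_key
JOIN   date_dim    d ON d.date_key    = f.date_key
WHERE  s.state IN ('Delhi', 'Maharashtra')
AND    d.year BETWEEN 2013 AND 2014
GROUP BY p.category, s.state, d.year;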

Pivot (or rotate)

Rotates the data axis to view the data from different perspectives.

Groups data with different dimensions.

Drill-across

Accesses more than one fact table that is linked by common dimensions.

Combines cubes that share one or more dimensions.

Drill-through

Drills through the bottom level of a data cube down to its back-end relational tables.

Cross-tab

Spreadsheet-style row/column aggregates.
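
A pivot or cross-tab can be approximated in plain SQL with conditional aggregation. The
sketch below uses the same assumed schema and illustrative state names, showing months as
rows and states as columns:

-- Cross-tab / pivot: one row per month, one aggregate column per state.
SELECT d.month_name,
       SUM(CASE WHEN s.state = 'Delhi'       THEN f.copies_sold ELSE 0 END) AS delhi_copies,
       SUM(CASE WHEN s.state = 'Maharashtra' THEN f.copies_sold ELSE 0 END) AS maharashtra_copies,
       SUM(f.copies_sold) AS total_copies
FROM   sales_fact f
JOIN   store_dim  s ON s.store_key = f.store_key
JOIN   date_dim   d ON d.date_key  = f.date_key
GROUP BY d.month_name;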

Merits of OLAP
OLAP applications offer the publishing corporation many benefits in its budgeting, planning,
publishing and customer-analysis processes. They put business context around the underlying
data sets and transform data into intelligible information for making business decisions,
increasing timeliness, analysis and visibility along the way.
Eliminate manual tasks. OLAP eliminates the need to compile numerous spreadsheets,
since all data is shared on a centralized server across the organization; therefore, everyone is
always looking at the same data. Further, multiple roll-ups in OLAP allow users to view the
same data from multiple points of view with a simple click of a button. Data is aggregated
and summarized across multiple business units quickly and securely, providing different
consolidation and sub-consolidation levels on the fly.
Improve data analysis. Being multi-dimensional, OLAP business models allow users to
analyse data across multiple business perspectives. The ability to analyse data in an ad hoc
manner is inherent in the technology. The ability to drill through reported data to transactional
details is also possible. OLAP gives users the ability to drag and drop and slice and dice the
data to manipulate information to test different scenarios or find answers to business
questions. Detailed as well as high-level consolidated data can be reviewed and analysed
within seconds.
Enhanced visibility. With multiple levels of security, from database-level to cell-based, users
can have direct access to their own data, eliminating the need to rely on IT to create queries
and produce reports. This puts information into the hands of the decision makers and brings
confidence back to analysing information. Stakeholders receive the right information at the
right time.

Data Extraction
Data extraction is the act or process of retrieving data out of data sources for further data
processing or data storage (data migration). As an IT professional, you must have participated
in data extractions and conversions when implementing operational systems. When you went
from a file-oriented order entry system to a new order processing system using relational
database technology, you may have written data extraction programs to capture data from the
files to get the data ready for populating the relational database.
Two major factors differentiate the data extraction for a new operational system from the
data extraction for a data warehouse. First, for a data warehouse, you have to extract data
from many disparate sources. Next, for a data warehouse, you have to extract data on the
changes for ongoing incremental loads as well as for a one-time initial full load. For
operational systems, all you need is one-time extractions and data conversions.
These two factors increase the complexity of data extraction for a data warehouse and,
therefore, warrant the use of third-party data extraction tools in addition to in-house programs
or scripts. Third-party tools are generally more expensive than in-house programs, but they
record their own metadata. On the other hand, in-house programs increase the cost of
maintenance and are hard to maintain as source systems change. If your company is in an
industry where frequent changes to business conditions are the norm, then you may want to
minimize the use of in-house programs. Third-party tools usually provide built-in flexibility.
All you have to do is change the input parameters for the third-party tool you are using.
Effective data extraction is a key to the success of your data warehouse. Therefore, you need
to pay special attention to the issues and formulate a data extraction strategy for your data
warehouse. Here is a list of data extraction issues:

Source identification: identify source applications and source structures.

Method of extraction: for each data source, define whether the extraction process is
manual or tool-based.

Extraction frequency: for each data source, establish how frequently the data
extraction must be done: daily, weekly, quarterly, and so on.

Time window: for each data source, denote the time window for the extraction
process.

Job sequencing: determine whether the beginning of one job in an extraction job
stream has to wait until the previous job has finished successfully.

Exception handling: determine how to handle input records that cannot be extracted.

Source identification
Let us consider the first of the above issues, namely, source identification. We will deal with
the rest of the issues later as we move through the remainder of this section. Source
identification, of course, encompasses the identification of all the proper data sources. It does
not stop with just the identification of the data sources. It goes beyond that to examine and
verify that the identified sources will provide the necessary value to the data warehouse. Let
us walk through the source identification process in some detail.

Assume that a part of your database, maybe one of your data marts, is designed to provide strategic information on the
fulfilment of orders. For this purpose, you need to store historical information about the
fulfilled and pending orders. If you ship orders through multiple delivery channels, you need
to capture data about these channels. If your users are interested in analyzing the orders by
the status of the orders as the orders go through the fulfilment process, then you need to
extract data on the order statuses.
In the fact table for order fulfilment, you need attributes about the total order amount,
discounts, commissions, expected delivery time, actual delivery time, and dates at different
stages of the process. You need dimension tables for product, order disposition, delivery
channel, and customer. First, you have to determine if you have source systems to provide
you with the data needed for this data mart. Then, from the source systems, you have to
establish the correct data source for each data element in the data mart. Further, you have to
go through a verification process to ensure that the identified sources are really the right ones.
This Figure describes a stepwise approach to source identification for order fulfilment.
Source identification is not as simple a process as it may sound. It is a critical first process in
the data extraction function. You need to go through the source identification process for
every piece of information you have to store in the data warehouse. As you might have
already figured out, source identification needs thoroughness, lots of time, and exhaustive
analysis.

Data Extraction Techniques


Before examining the various data extraction techniques, you must clearly understand the
nature of the source data you are extracting or capturing. Also, you need to get an insight into
how the extracted data will be used. Source data is in a state of constant flux. Business
transactions keep changing the data in the source systems. In most cases, the value of an
attribute in a source system is the value of that attribute at the current time. If you look at
every data structure in the source operational systems, the day-to-day business transactions
constantly change the values of the attributes in these structures. When a customer moves to
another state, the data about that customer changes in the customer table in the source system.
When two additional package types are added to the way a product may be sold, the product
data changes in the source system. When a correction is applied to the quantity ordered, the
data about that order gets changed in the source system.
Data in the source systems are said to be time-dependent or temporal. This is because
source data changes with time. The value of a single variable varies over time. Again, take the
example of the change of address of a customer for a move from New York State to
California. In the operational system, what is important is that the current address of the
customer has CA as the state code. The actual change transaction itself, stating that the
previous state code was NY and the revised state code is CA, need not be preserved. But
think about how this change affects the information in the data warehouse. If the state code is
used for analyzing some measurements such as sales, the sales to the customer prior to the
change must be counted in New York state and those after the move must be counted in
California. In other words, the history cannot be ignored in the data warehouse. This brings
us to the question: how do you capture the history from the source systems? The answer
depends on how exactly data is stored in the source systems. So let us examine and
understand how data is stored in the source operational systems.

Data in Operational Systems


These source systems generally store data in two ways. Operational data in the source system
may be thought of as falling into two broad categories. The type of data extraction technique
you have to use depends on the nature of each of these two categories.
Current Value- Most of the attributes in the source systems fall into this category. Here the
stored value of an attribute represents the value of the attribute at this moment of time. The

values are transient or transitory. As business transactions happen, the values change. There is
no way to predict how long the present value will stay or when it will get changed next.
Customer name and address, bank account balances, and outstanding amounts on individual
orders are some examples of this category.
What is the implication of this category for data extraction? The value of an attribute
remains constant only until a business transaction changes it. There is no telling when it will
get changed. Data extraction for preserving the history of the changes in the data warehouse
gets quite involved for this category of data.
Periodic Status- This category is not as common as the previous category. In this category,
the value of the attribute is preserved as the status every time a change occurs. At each of
these points in time, the status value is stored with reference to the time when the new value
became effective. This category also includes events stored with reference to the time when
each event occurred. Look at the way data about an insurance policy is usually recorded in
the operational systems of an insurance company. The operational databases store the status
data of the policy at each point of time when something in the policy changes. Similarly, for
an insurance claim, each event, such as claim initiation, verification, appraisal, and
settlement, is recorded with reference to the points in time.
For operational data in this category, the history of the changes is preserved in the source
systems themselves. Therefore, data extraction for the purpose of keeping history in the data
warehouse is relatively easier. Whether it is status data or data about an event, the source
systems contain data at each point in time when any change occurred.
Please study Figure and confirm your understanding of the two categories of data stored in
the operational systems. Pay special attention to the examples.

Example- Having reviewed the categories indicating how data is stored in the operational
systems, we are now in a position to discuss the common techniques for data extraction.
When you deploy your data warehouse, the initial data as of a certain time must be moved to
the data warehouse to get it started. This is the initial load. After the initial load, your data
warehouse must be kept updated so the history of the changes and statuses are reflected in the
data warehouse. Broadly, there are two major types of data extractions from the source
operational systems: as is (static) data and data of revisions.
As is or static data is the capture of data at a given point in time. It is like taking a
snapshot of the relevant source data at a certain point in time. For current or transient data,
this capture would include all transient data identified for extraction. In addition, for data
categorized as periodic, this data capture would include each status or event at each point in
time as available in the source operational systems.
You will use static data capture primarily for the initial load of the data warehouse.
Sometimes, you may want a full refresh of a dimension table. For example, assume that the
product master of your source application is completely revamped. In this case, you may find
it easier to do a full refresh of the product dimension table of the target data warehouse. So,
for this purpose, you will perform a static data capture of the product data.
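
A full refresh of the product dimension could look like the following sketch. It assumes an
illustrative staged copy of the revamped product master (source_product_master) and a target
table product_dim; the names and columns are assumptions, not a prescribed design.

-- Static capture / full refresh of the product dimension.
TRUNCATE TABLE product_dim;

INSERT INTO product_dim (product_key, isbn, title, category, list_price)
SELECT product_id, isbn, title, category, list_price
FROM   source_product_master;   -- staged extract of the revamped product master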

Data of revisions is also known as incremental data capture. Strictly, it is not incremental
data but the revisions since the last time data was captured. If the source data is transient, the
capture of the revisions is not easy. For periodic status data or periodic event data, the
incremental data capture includes the values of attributes at specific times. Extract the
statuses and events that have been recorded since the last date of extract.
Incremental data capture may be immediate or deferred. Within the group of immediate
data capture there are three distinct options. Two separate options are available for deferred
data capture.

Immediate Data Extraction


In this option, the data extraction is real-time. It occurs as the transactions happen at the
source databases and files. Figure shows the immediate data extraction options.

Now let us go into some details about the three options for immediate data extraction.
Capture through Transaction Logs. This option uses the transaction logs of the DBMSs
maintained for recovery from possible failures. As each transaction adds, updates, or deletes a
row from a database table, the DBMS immediately writes entries on the log file. This data
extraction technique reads the transaction log and selects all the committed transactions.
There is no extra overhead in the operational systems because logging is already part of the
transaction processing.
You have to make sure that all transactions are extracted before the log file gets refreshed.
As log files on disk storage get filled up, the contents are backed up on other media and the
disk log files are reused. Ensure that all log transactions are extracted for data warehouse
updates.
If all of your source systems are database applications, there is no problem with this
technique. But if some of your source system data is on indexed and other flat files, this
option will not work for these cases. There are no log files for these non database
applications. You will have to apply some other data extraction technique for these cases.
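
As one concrete possibility (an assumption about the source platform, not a statement about
XYZ's actual systems), a PostgreSQL source with wal_level set to logical can expose committed
changes from its transaction log through a logical replication slot; the slot and plugin names
below are illustrative.

-- Create a replication slot once for the warehouse extract.
SELECT pg_create_logical_replication_slot('dw_extract_slot', 'test_decoding');

-- Each extraction run then reads the committed changes recorded since the last read.
SELECT lsn, xid, data
FROM   pg_logical_slot_get_changes('dw_extract_slot', NULL, NULL);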
While we are on the topic of data capture through transaction logs, let us take a side
excursion and look at the use of replication. Data replication is simply a method for creating
copies of data in a distributed environment. Please refer to Figure illustrating how replication
technology can be used to capture changes to source data.

The appropriate transaction logs contain all the changes to the various source database tables.
Here are the broad steps for using replication to capture changes to source data:

Identify the source system DB table

Identify and define target files in staging area

Create mapping between source table and target files

Define the replication mode

Schedule the replication process

Capture the changes from the transaction logs

Transfer captured data from logs to target files

Verify transfer of data changes

Confirm success or failure of replication

In metadata, document the outcome of replication

Maintain definitions of sources, targets, and mappings

Capture through Database Triggers. Again, this option is applicable to your source systems
that are database applications. As you know, triggers are special stored procedures (programs)
that are stored on the database and fired when certain predefined events occur.
You can create trigger programs for all events for which you need data to be captured. The
output of the trigger programs is written to a separate file that will be used to extract data for
the data warehouse. For example, if you need to capture all changes to the records in the
customer table, write a trigger program to capture all updates and deletes in that table.
Data capture through database triggers occurs right at the source and is therefore quite
reliable. You can capture both before and after images. However, building and maintaining
trigger programs puts an additional burden on the development effort. Also, execution of
trigger procedures during transaction processing of the source systems puts additional
overhead on the source systems. Further, this option is applicable only for source data in
databases.
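
A minimal sketch of such a trigger, assuming MySQL-style syntax and hypothetical customer
and customer_changes tables (the real table layout would differ):

-- Write every update to the customer table into a change table
-- that the warehouse extract program reads later.
CREATE TRIGGER customer_after_update
AFTER UPDATE ON customer
FOR EACH ROW
  INSERT INTO customer_changes (customer_id, old_state, new_state, change_type, changed_at)
  VALUES (NEW.customer_id, OLD.state, NEW.state, 'UPDATE', NOW());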
Capture in Source Applications. This technique is also referred to as application-assisted
data capture. In other words, the source application is made to assist in the data capture for
the data warehouse. You have to modify the relevant application programs that write to the

source files and databases. You revise the programs to write all adds, updates, and deletes to
the source files and database tables. Then other extract programs can use the separate file
containing the changes to the source data.
Unlike the previous two cases, this technique may be used for all types of source data
irrespective of whether it is in databases, indexed files, or other flat files. But you have to
revise the programs in the source operational systems and keep them maintained. This could
be a formidable task if the number of source system programs is large. Also, this technique
may degrade the performance of the source applications because of the additional processing
needed to capture the changes on separate files.

Deferred Data Extraction


In the cases discussed above, data capture takes place while the transactions occur in the
source operational systems. The data capture is immediate or real-time. In contrast, the
techniques under deferred data extraction do not capture the changes in real time. The capture
happens later. Please see Figure showing the deferred data extraction options.

Now let us discuss the two options for deferred data extraction.
Capture Based on Date and Time Stamp. Every time a source record is created or updated it
may be marked with a stamp showing the date and time. The time stamp provides the basis
for selecting records for data extraction. Here the data capture occurs at a later time, not while
each source record is created or updated. If you run your data extraction program at midnight
every day, each day you will extract only those with the date and time stamp later than
midnight of the previous day. This technique works well if the number of revised records is
small.
Of course, this technique presupposes that all the relevant source records contain date and
time stamps. Provided this is true, data capture based on date and time stamp can work for
any type of source file. This technique captures the latest state of the source data. Any
intermediary states between two data extraction runs are lost.
Deletion of source records presents a special problem. If a source record gets deleted in
between two extract runs, the information about the delete is not detected. You can get around
this by marking the source record for delete first, do the extraction run, and then go ahead and

physically delete the record. This means you have to add more logic to the source
applications.
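
A sketch of a timestamp-based extraction run, assuming each source row carries a
last_updated column and a soft-delete flag, and that the previous cutoff is kept in an
illustrative extract_control table:

-- Pull only the rows revised since the previous extraction run.
SELECT order_id, customer_id, order_status, order_amount, is_deleted, last_updated
FROM   orders
WHERE  last_updated > (SELECT last_extract_time
                       FROM   extract_control
                       WHERE  source_name = 'orders');

-- After a successful load, advance the cutoff for the next run.
UPDATE extract_control
SET    last_extract_time = CURRENT_TIMESTAMP
WHERE  source_name = 'orders';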
Capture by Comparing Files. If none of the above techniques are feasible for specific
source files in your environment, then consider this technique as the last resort. This
technique is also called the snapshot differential technique because it compares two snapshots
of the source data. Let us see how this technique works.
Suppose you want to apply this technique to capture the changes to your product data.
While performing today's data extraction for changes to product data, you do a full file
comparison between today's copy of the product data and yesterday's copy. You also compare
the record keys to find the inserts and deletes. Then you capture any changes between the two
copies.
This technique necessitates the keeping of prior copies of all the relevant source data.
Though simple and straightforward, comparison of full rows in a large file can be very
inefficient. However, this may be the only feasible option for some legacy data sources that
do not have transaction logs or time stamps on source records.
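
Once both copies have been loaded into tables, the comparison itself can be expressed as a
full outer join of the two snapshots (in a database that supports FULL OUTER JOIN). The
sketch below assumes illustrative product_snapshot_today and product_snapshot_yesterday
tables keyed on product_id:

-- Snapshot differential: classify each key as an insert, delete, or update.
SELECT COALESCE(t.product_id, y.product_id) AS product_id,
       CASE
         WHEN y.product_id IS NULL THEN 'INSERT'
         WHEN t.product_id IS NULL THEN 'DELETE'
         ELSE 'UPDATE'
       END AS change_type
FROM   product_snapshot_today t
FULL OUTER JOIN product_snapshot_yesterday y
       ON y.product_id = t.product_id
WHERE  y.product_id IS NULL
   OR  t.product_id IS NULL
   OR  t.title <> y.title
   OR  t.list_price <> y.list_price;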

Evaluation of the Techniques for the Publishing Company


To summarize, the following options are available for data extraction

Capture of static data

Capture through transaction logs

Capture through database triggers

Capture in source applications

Capture based on date and time stamp

Capture by comparing files

You are faced with some big questions. Which of these techniques are applicable in the
publishing company's environment? Which techniques should we use? You will be using the
static data capture technique in at least one situation: when you populate the data warehouse
initially at the time of deployment. After that, you will usually find that you need a
combination of a few of these techniques for your environment. If you have old legacy
systems, you may even need the file comparison method. Figure highlights the advantages
and disadvantages of the different techniques.

Please study it carefully and use it to determine the techniques you would need to use in your
environment.
Let us make a few general comments. Which of the techniques are easy and inexpensive to
implement for the publishing company's database extraction? Consider the techniques of using
transaction logs and database triggers. Both of these techniques are already available through
the database products. Both are comparatively cheap and easy to implement. The technique
based on transaction logs is perhaps the most inexpensive. There is no additional overhead on
the source operational systems. In the case of database triggers, there is a need to create and
maintain trigger programs. Even here, the maintenance effort and the additional overhead on
the source operational systems are not that much compared to other techniques.
Data capture in source systems could be the most expensive in terms of development and
maintenance. This technique needs substantial revisions to existing source systems. For many
legacy source applications, finding the source code and modifying it may not be feasible at
all. However, if the source data does not reside on database files and date and time stamps are
not present in source records, this is one of the few available options.
What is the impact on the performance of the source operational systems? Certainly, the
deferred data extraction methods have the least impact on the operational systems. Data
extraction based on time stamps and data extraction based on file comparisons are performed
outside the normal operation of the source systems. Therefore, these two are preferred options
when minimizing the impact on operational systems is a priority. However, these deferred
capture options suffer from some inadequacy. They track the changes from the state of the
source data at the time of the current extraction as compared to its state at the time of the
previous extraction. Any interim changes are not captured. Therefore, wherever you are
dealing with transient source data, you can only come up with approximations of the history.
So what is the bottom line? Use the data capture technique in source systems sparingly
because it involves too much development and maintenance work. For your source data on
databases, capture through transaction logs and capture through database triggers are obvious
first choices. Between these two, capture through transaction logs is a better choice because
of better performance. Also, this technique is applicable to non-relational databases. The file
comparison method is the most time-consuming for data extraction. Use it only if all others
cannot be applied.

Future scope of data warehousing and data extraction in the publishing company
Ask any data warehouse developer what media data will reside on and the automatic answer
is high performance disk storage. Most data warehouse developers have never built a
system on anything but high performance disk storage during their entire career. Indeed
many data warehouse developers are not even aware that there are alternatives to high
performance disk storage.

There are many reasons why the volume of data in the warehouse is exploding:

data warehouses carry historical data,

data warehouses carry detailed data,

data warehouses carry data for which there is no known need,

data warehouses carry ecommerce data, and so forth.

In a word, the volumes of data found in the data warehouse surpass anything ever seen
before.
But when you look into the future and see what is in store for data warehousing when it
comes to storage, surprisingly the answer comes back - the future of data warehousing is
NOT high performance disk storage, despite the strong track record of disk storage for the
past twenty years and the protestations of the storage vendor. Instead high performance disk
storage plays only a secondary role in the future of data warehousing. The real future of data
warehousing is in a storage media collectively known as "alternative storage".
Alternative storage consists of two forms of storage - near line storage and/or secondary
storage. Near line storage is siloed tape storage, where siloed cartridges of tape storage are
managed robotically. The technology for siloed tape storage has been around for a long time
and is certainly proven and mature technology.
Secondary storage is a form of disk storage but whose disk is slower, significantly less
expensive and less cached than high performance storage.
There are lots of reasons why alternative storage fits well with the data warehouse
environment. Perhaps the most fundamental reason why there is such a good fit is that data
warehouse data is very stable. The nature of data in a warehouse is that the data is put into the
warehouse in a time stamped snapshot mode. If there is a change in the data that the
warehouse needs to be aware of, a new snapshot is made. The old snapshot of data remains
undisturbed. Because of this mode of storing data, no updates are made into the data
warehouse. Ultimately, this style of storage and processing results in very stable data. The
stability of the data fits very nicely with the "write once" nature of data found in near line storage.
But there are some other reasons why data warehouse data fits nicely on alternative storage.
The next reason is that the queries that operate on warehouse data need long streams of data,
and often times that data is stored sequentially. Unlike a job stream for online processing
where there is constant demand for different units of data from different parts of the disk
device, in data warehouse processing the processing that occurs is fundamentally different.
Both near line storage and secondary storage fit this model of a job stream very nicely.
Another very important reason for alternative storage is that of the need to store many, many
records in the data warehouse. Because data warehouses store detailed and historical data,
they contain far more data than their online, OLTP brethren. The ability to store far more data
on near line and/or secondary storage is a very important reason why high performance disk
storage is not the future of data warehousing.
Not only can much greater volumes of data be stored in alternative storage, but those massive
volumes can be stored much less expensively than on high performance disk storage. How
much cheaper? About an order of magnitude less expensively.
One can hear the high performance disk vendor proclaim - "but hardware is getting cheaper
all the time". Indeed the rate at which secondary storage and near line storage is getting
cheaper is at a faster rate than high performance storage. The hardware vendors who wish to

maintain the status quo have been saying this for as long as there has been a computer
industry.
There is yet another powerful reason why high performance disk storage is not the future of
data warehousing and that reason is that - IRONICALLY, AND MUCH TO THE CHAGRIN
OF THE HIGH PERFORMANCE VENDORS - performance gets BETTER, not WORSE
when you move your data to near line storage or secondary storage. The reason why
performance gets better by moving data to near line or secondary storage is because of the
phenomenon in data warehousing called "dormant data". Dormant data is data that is seldom
or never used. In the early days of data warehousing when the warehouse is new and small,
there is little or no dormant data. But as the warehouse matures, the volume of data rises and
the patterns of usage of the data stabilize. Soon only a fraction of the data warehouse is being
used. At this point, the dormant data is moved to alternative storage. Performance for the
remaining actively used data picks up dramatically. If dormant data is left on high
performance disk storage, the dormant data "gets in the way" of query processing. Data that
is needed for the query is hidden by the masses of data that is not regularly needed. But by
moving dormant data to alternative storage, performance is greatly enhanced.
But the greatest advantage of selecting alternative storage as the basis for the data in the data
warehouse environment is that the designer can choose the lowest level of granularity desired
for the data warehouse. When high performance disk storage is used as the only medium on
which data is stored, then the designer ends up being restricted as to how much detailed data
can be placed in the data warehouse. The telecommunications designer must aggregate or
summarize detailed call level detail. The bank designer must add together checking and ATM
activity into a monthly aggregate record. The retailing executive must summarize POS data to
the store level and/or to the daily level. In short, placing the data warehouse on disk storage
forces a compromise to occur. But when the medium the bulk of the data in the warehouse is
stored on is alternative storage, the designer can afford to store data at the lowest level of
detail that exists. In doing so the data warehouse ends up with a great deal more functionality
than if the warehouse were stored on high performance disk storage.
There are then some very powerful reasons why the medium of storage for the data
warehouse should be alternative storage. Admittedly some of the data warehouse data - the
actively used component of the warehouse - will be stored on high performance disk storage.

But the vast majority of the data stored in the warehouse will reside on slower, less expensive
alternative storage.
The notion that data should be stored on different media based on the volume and usage
characteristics of the data is not a new idea. Years ago there was the notion of technology
called HSM - hierarchical storage management. HSM was the intellectual predecessor of
alternative storage. The primary difference between HSM and alternative storage is that
alternative storage operates at the row or record level while HSM operates at the table or data
set level. Management of storage at the table or data base level is simply unthinkable for the
volumes of data and the kind of processing that occurs in the data warehouse.
In order to make the alternative storage architecture perform at the optimal level, two types of
software are needed. The first type of software that is needed is that of the activity monitor.
The activity monitor sits between the data warehouse dbms server and the users and collects
information about the activity that is occurring inside the data warehouse. Once collected the
data warehouse administrator is in a position to be able to know what data is and is not being
used in the actively used portion of the warehouse. With that knowledge the data warehouse
administrator is able to precisely determine what data belongs in actively used storage and
what data belongs in alternative storage.
The second type of software that is needed for the data warehouse environment that operates
on alternative storage is software that can be called a cross media storage manager. The job of
the cross media storage manager is to manage the traffic between the actively used storage
and alternative storage. The traffic can be managed by actually moving data to and from one
component to the other or can be used to satisfy query processing where the data resides in
either actively used storage or alternative storage.
Both types of software are needed in order for alternative storage to operate effectively. As a
rule the activity monitor is first used to determine how much data needs to be placed in
alternative storage. After the decision is made to place data in alternative storage, cross media
storage manager and alternative storage are purchased and installed.
The alternative storage solution for data warehousing is a compelling story. For warehouses
that will grow to any size at all, alternative storage is not an option - it is plainly mandatory.

What then are the obstacles to the success and adoption of alternative storage? The primary
obstacle is a familiar one to those who have been around the information processing
community a while. The attitude of "well, we didn't use to do it that way before..." is the
primary reason why people do not immediately adopt alternative storage. And the vendors...
the vendors have made so much money for so long selling disk storage as if all there were
was OLTP online processing. The very success of the high performance disk vendors traps
them into thinking that their world will remain static forever. The high performance disk
storage vendors want to stick their head in the sand and pretend that the world is not
changing, such has been their success.

References

Data Warehousing Fundamentals By Paulraj Ponniah

Data Warehousing: Using the Wal-Mart Model (The Morgan Kaufmann Series in Data
Management Systems).

Mastering Data Warehouse Design: Relational and Dimensional Techniques.

www.wikipedia.com

www.google.com
