
Dr. Babasaheb Ambedkar Marathwada University, Aurangabad

G. S. Mandal’s

MARATHWADA INSTITUTE OF TECHNOLOGY


CIDCO, AURANGABAD

A seminar report
Data Warehousing

Submitted by –
Mr. Aniket Deshpande
MSc(CS)

Guided by –
(Prof. Suraj Raut)
In the fulfillment of the degree
Master of Science (Computer Science)
Department of Computer Science & Information Technology
Academic Year : 2018-19
G. S. Mandal’s
MARATHWADA INSTITUTE OF TECHNOLOGY
CIDCO, AURANGABAD

Certificate

This is to certify that Mr. Aniket Deshpande has successfully
completed the seminar entitled “Data Warehousing” in the fulfillment of
the degree MSc (CS) in the academic year 2018-19 in the Department of
Computer Science & Information Technology.

During the project work, he has done the work very sincerely.

HOD Seminar Guide


(Prof. S.A.Vyavahare) (Prof. Suraj Raut)

External Examiner Principal


(Dr. M.E.Jadhav)
Acknowledgement

It is my proud privilege to complete this seminar work. This is the only page
where I have the opportunity to express my emotions and gratitude from the bottom of
my heart.

It is my great pleasure to express sincere and deep gratitude to my guide,
Prof. Suraj Raut, Marathwada Institute of Technology, CIDCO, Aurangabad, for his
valuable and firm suggestions, guidance and constant support throughout this work. I am
also thankful for the various resources and infrastructure facilities provided to me.
I also offer my most sincere thanks to Dr. Mukti Jadhav, Principal,
Marathwada Institute of Technology, CIDCO, Aurangabad, and to my colleagues and the staff
members of the Computer Science and Information Technology Department, Marathwada
Institute of Technology, CIDCO, for the cooperation they provided in many ways.

Mr. Aniket Deshpande


MSc (CS) II Year
Index
 Introduction
 History
 Data Warehousing Architecture
 Data Warehousing Processes
 Components of Data Warehousing
 Types of Data Warehousing
 Security of Data Warehousing
 Applications of Data Warehousing
 Advantages of Data Warehousing
 Disadvantages of Data Warehousing
 Conclusion
 Reference
Introduction

“The data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment.” Data entering the data warehouse
comes from the operational environment in almost every case. Data warehousing provides
architectures and tools for business executives to systematically organize, understand, and
use their data to make strategic decisions. A large number of organizations have found that
data warehouse systems are valuable tools in today’s competitive, fast-evolving world. In the
last several years, many firms have spent millions of dollars building enterprise-wide data
warehouses. Many people feel that, with competition mounting in every industry, data
warehousing is the latest must-have marketing weapon: a way to keep customers by learning
more about their needs.

So you may ask, “What exactly is a data warehouse?” Data warehouses have been defined
in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a
data warehouse refers to a database that is maintained separately from an organization’s
operational databases. Data warehouse systems allow for the integration of a variety of
application systems. They support information processing by providing a solid platform of
consolidated historical data for analysis.

Many sales analysis systems and executive information systems (EIS) already get their data
from summary files rather than operational transaction files. Using summary files instead of
operational data is in essence what data warehousing is all about, and data warehousing is a
more formalized methodology for this technique. Some data warehousing tools neglect the
importance of modelling and building a data warehouse and focus only on the storage and
retrieval of data. These tools might have strong analytical facilities, but they lack the
qualities needed to build and maintain a corporate-wide data warehouse. These tools belong
on the PC rather than the host. A corporate-wide (or division-wide) data warehouse needs to
be scalable, secure, open and, above all, suitable for publication.

A data warehouse is defined as “a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management’s decision-making process.”

In this definition the data is:

 Subject-oriented, as the warehouse is organized around the major subjects of the
enterprise rather than the major application areas.
 Integrated, because the source data comes together from different enterprise-wide
application systems.
 Time-variant, because data in the warehouse is only accurate and valid at some point in
time or over some time interval.
 Non-volatile, as the data is not updated in real time but is refreshed on a regular
basis from the different data sources.
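
The time-variant and non-volatile properties can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `SnapshotStore` class (not part of any real product) in which every periodic refresh appends dated records and nothing is overwritten, so a value can be read as it stood at a given date.

```python
from datetime import date


class SnapshotStore:
    """Hypothetical append-only store: each load adds dated rows, none are overwritten."""

    def __init__(self):
        self._rows = []  # list of (load_date, subject, key, value)

    def load(self, load_date, subject, records):
        """Periodic refresh: append the records together with their load date."""
        for key, value in records.items():
            self._rows.append((load_date, subject, key, value))

    def as_of(self, subject, key, when):
        """Return the latest value loaded on or before `when`, or None if there is none."""
        candidates = [
            (d, v) for d, s, k, v in self._rows
            if s == subject and k == key and d <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


# Illustrative data: the same customer record refreshed on two dates.
store = SnapshotStore()
store.load(date(2019, 1, 1), "customer", {"C1": "Pune"})
store.load(date(2019, 2, 1), "customer", {"C1": "Aurangabad"})
```

Because the January row is kept after the February refresh, a query dated mid-January still sees the older value, which is what makes historical analysis possible.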
History

In the 1990s, as organizations of scale began to need more timely data about their business,
they found that traditional information systems technology was simply too cumbersome to
provide relevant data efficiently and quickly. Completing reporting requests could take days
or weeks using antiquated reporting tools that were designed more or less to 'execute' the
business rather than 'run' the business. From this idea, the data warehouse was born as a
place where relevant data could be held for completing strategic reports for management. The
key here is the word 'strategic' as most executives were less concerned with the day to day
operations than they were with a more overall look at the model and business functions.
As with all technology, over the course of the latter half of the 20th century, we saw
increased numbers and types of databases. Many large businesses found themselves with data
scattered across multiple platforms and variations of technology, making it almost impossible
for any one individual to use data from multiple sources. A key idea within data warehousing
is to take data from multiple platforms and technologies (as varied as spreadsheets, DB2
databases, IDMS records, and VSAM files) and place them in a common location that uses a
common querying tool. In this way operational databases could be held on whatever system
was most efficient for the operational business, while the reporting and strategic information
could be held in a common location using a common language. Data warehouses take this
a step further by giving the data itself commonality, defining what each term means
and keeping it standard.
All of this was designed to make decision support more readily available without
affecting day-to-day operations. One aspect of a data warehouse that should be stressed is that
it is NOT a location for ALL of a business’s data, but rather a location for data that is
'interesting'. Data that is interesting will assist decision makers in making strategic decisions
relative to the organization's overall mission.
Data Warehousing Architecture
 OPERATIONAL DATA SOURCES: data for the DW is supplied from mainframe
operational data held in first-generation hierarchical and network databases,
departmental data held in proprietary file systems, private data held on
workstations and private servers, and external systems such as the Internet,
commercially available databases, or databases associated with an organization’s
suppliers or customers.
 OPERATIONAL DATA STORE (ODS): a repository of current and integrated
operational data used for analysis. It is often structured and supplied with data in
the same way as the data warehouse, but may in fact simply act as a staging area
for data to be moved into the warehouse.
 LOAD MANAGER: also called the frontend component, it performs all the
operations associated with the extraction and loading of data into the warehouse.
These operations include simple transformations of the data to prepare the data for
entry into the warehouse.
 WAREHOUSE MANAGER: performs all the operations associated with the
management of the data in the warehouse. The operations performed by this
component include analysis of data to ensure consistency, transformation and
merging of source data, creation of indexes and views, generation of
denormalizations and aggregations, and archiving and backing up data.
 QUERY MANAGER: also called the backend component, it performs all the
operations associated with the management of user queries. The operations
performed by this component include directing queries to the appropriate tables
and scheduling the execution of queries. The warehouse itself holds detailed data,
lightly and highly summarized data, and archive/backup data.
 END USER ACCESS TOOLS: can be categorized into five main groups: data
reporting and query tools, application development tools, executive information
system (EIS) tools, online analytical processing (OLAP) tools, and data mining
tools.
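
How the load manager, warehouse manager and query manager hand data to one another can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function names, the `region` and `amount` fields and the two supported questions are all hypothetical, and the "warehouse" is a plain dictionary rather than a database.

```python
from collections import defaultdict


def load_manager(source_rows):
    """Extract rows and apply simple transformations before loading."""
    staged = []
    for row in source_rows:
        staged.append({
            "region": row["region"].strip().title(),  # normalise the text value
            "amount": float(row["amount"]),           # normalise the numeric type
        })
    return staged


def warehouse_manager(staged_rows):
    """Keep the detailed rows and build a summary (aggregation) by region."""
    summary = defaultdict(float)
    for row in staged_rows:
        summary[row["region"]] += row["amount"]
    return {"detail": staged_rows, "summary_by_region": dict(summary)}


def query_manager(warehouse, question):
    """Direct a query to the most suitable table."""
    if question == "total_by_region":
        return warehouse["summary_by_region"]  # answered from the summary
    if question == "all_sales":
        return warehouse["detail"]             # answered from the detailed data
    raise ValueError(f"unsupported question: {question}")


# Illustrative source rows with inconsistent formatting.
sources = [
    {"region": " north ", "amount": "100.0"},
    {"region": "South", "amount": "50"},
    {"region": "north", "amount": "25.5"},
]
warehouse = warehouse_manager(load_manager(sources))
totals = query_manager(warehouse, "total_by_region")
```

The point of the split is that end users only ever talk to the query manager, which can answer from summaries without touching the detailed rows.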
Data Warehousing Processes

The process of extracting data from source systems and bringing it into the data warehouse
is commonly called ETL, which stands for extraction, transformation, and loading.
However, the acronym ETL is perhaps too simplistic, because it omits some other phases
in the process of creating data warehouse data from the data sources, such as data
cleansing and transportation. Here, I refer to the entire process for building a data
warehouse, including the five phases mentioned above (extraction, cleansing,
transformation, transportation, and loading), as ETL. In addition, after the data
warehouse (detailed data) is created, several data warehousing processes that are relevant
to implementing and using the data warehouse are needed, including data
summarization, data warehouse maintenance, data lineage tracing, query rewriting, and
data mining.

Extraction in Data Warehouse


Extraction is the operation of extracting data from a source system for future use in a data
warehouse environment. This is the first step of the ETL process. After extraction, data
can be transformed and loaded into the data warehouse. The extraction process does not need to
involve complex algebraic database operations, such as joins and aggregate functions. Its
focus is determining which data needs to be extracted and bringing that data into the data
warehouse, specifically into the staging area. However, the data sources might be very
complex and poorly documented, so designing and creating the extraction process is
often the most time-consuming task in the ETL process, and even in the entire data
warehousing process. The data normally has to be extracted not just once but several
times, in a periodic manner, to supply all changed data to the data warehouse and keep it
up to date. Thus, data extraction is used not only in the process of building the data
warehouse, but also in the process of maintaining it.
Full Extraction
The data is extracted completely from the data sources. As this extraction reflects all the
data currently available on the data source, there is no need to keep track of changes to
the data source since the last successful extraction. The source data will be provided as-is
and no additional logical information (e.g., timestamps) is necessary on the source site.
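
A full extraction can be sketched as a plain copy of every row into the staging area. The example below is a minimal illustration using Python's built-in `sqlite3` module as a stand-in source system; the `full_extract` helper and the `orders` table are assumptions made for the example, not part of any specific tool.

```python
import sqlite3


def full_extract(source_conn, table):
    """Copy every row of `table` as-is; no change tracking is needed."""
    # Table names cannot be bound as SQL parameters, so `table` must come
    # from trusted code, never from user input.
    cursor = source_conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Illustrative source system with a hypothetical `orders` table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

staging = full_extract(source, "orders")
```

Note that the source needs no timestamps or change tables here; the cost is that every run moves the whole table.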

Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back
in history is extracted. The event may be the last time of extraction or a more
complex business event like the last sale day of a fiscal period. To identify this delta
change, there must be a way to identify all the information that has changed since this
specific time event. This information can be provided either by the source data itself or by a
change table, where an appropriate additional mechanism keeps track of the changes
besides the originating transaction. In most cases, using the latter method means adding
extraction logic to the data source. To preserve the independence of the data sources, many
data warehouses do not use any change-capture technique as part of the extraction process and
instead use full extraction logic. After a full extraction, the entire extracted data from the
data sources can be compared with the previously extracted data to identify the changed
data. This approach may not have a significant impact on the data source, but it clearly can
place a considerable burden on the data warehouse processes, particularly if the data
volumes are large. Incremental extraction, also called Change Data Capture, is an
important consideration for extraction. It can make the ETL process much more
efficient, and it is especially useful when incremental view
maintenance is selected as the maintenance approach for the data warehouse.
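
The two ways of finding the delta described above can be sketched side by side. Both functions are hypothetical helpers written for this report: `extract_since` assumes the source rows carry an `updated_at` timestamp in ISO format, while `diff_snapshots` assumes no help from the source and simply compares the current full extract with the previous one.

```python
def extract_since(rows, last_extracted_at):
    """Timestamp-based delta: rows modified after the previous extraction.

    ISO-8601 date strings in the same format compare correctly as text.
    """
    return [r for r in rows if r["updated_at"] > last_extracted_at]


def diff_snapshots(previous, current, key="id"):
    """Compare two full extracts and classify the changes."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    inserted = [curr[k] for k in curr if k not in prev]
    deleted = [prev[k] for k in prev if k not in curr]
    updated = [curr[k] for k in curr if k in prev and curr[k] != prev[k]]
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


# Illustrative extracts taken on two different days.
previous = [
    {"id": 1, "amount": 10.0, "updated_at": "2019-01-01"},
    {"id": 2, "amount": 20.0, "updated_at": "2019-01-01"},
]
current = [
    {"id": 1, "amount": 10.0, "updated_at": "2019-01-01"},
    {"id": 2, "amount": 25.0, "updated_at": "2019-01-05"},
    {"id": 3, "amount": 5.0, "updated_at": "2019-01-06"},
]

delta = extract_since(current, "2019-01-02")
changes = diff_snapshots(previous, current)
```

The first approach reads only the changed rows but depends on the source maintaining timestamps; the second leaves the source untouched but has to hold and compare two complete extracts, which is the burden on the warehouse side mentioned above.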
Components of Data Warehousing

Operational data sources


Data for the DW is supplied from mainframe operational data held in first-generation hierarchical
and network databases, departmental data held in proprietary file systems, private data held
on workstations and private servers, and external systems such as the Internet, commercially
available databases, or databases associated with an organization’s suppliers or customers.

Operational data store


A repository of current and integrated operational data used for analysis. It is often
structured and supplied with data in the same way as the data warehouse, but may in fact
simply act as a staging area for data to be moved into the warehouse.

Load manager
Also called the frontend component, it performs all the operations associated with the
extraction and loading of data into the warehouse. These operations include simple
transformations of the data to prepare the data for entry into the warehouse.

Warehouse manager
Performs all the operations associated with the management of the data in the warehouse.
The operations performed by this component include analysis of data to ensure consistency,
transformation and merging of source data, creation of indexes and views, generation of
denormalizations and aggregations, and archiving and backing up data.
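
Several of the warehouse manager's tasks (indexes, views, denormalizations and aggregations) map directly onto SQL statements. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `product` and `sale` tables and all object names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, product_id INTEGER,
                       sale_date TEXT, amount REAL);
    INSERT INTO product VALUES (1, 'Books'), (2, 'Toys');
    INSERT INTO sale VALUES (1, 1, '2019-01-05', 200.0),
                            (2, 1, '2019-01-06', 100.0),
                            (3, 2, '2019-01-06', 50.0);

    -- Index to speed up lookups by date.
    CREATE INDEX idx_sale_date ON sale (sale_date);

    -- Denormalization: store the category next to each sale to avoid a join.
    CREATE TABLE sale_denorm AS
        SELECT s.sale_id, s.sale_date, s.amount, p.category
        FROM sale AS s JOIN product AS p ON p.product_id = s.product_id;

    -- Aggregation: pre-computed totals per category.
    CREATE TABLE sales_by_category AS
        SELECT category, SUM(amount) AS total_amount
        FROM sale_denorm GROUP BY category;

    -- View: a named query that end users can read like a table.
    CREATE VIEW v_daily_sales AS
        SELECT sale_date, SUM(amount) AS total_amount
        FROM sale GROUP BY sale_date;
""")

totals = dict(conn.execute("SELECT category, total_amount FROM sales_by_category"))
daily = dict(conn.execute("SELECT sale_date, total_amount FROM v_daily_sales"))
```

The denormalized and aggregated tables trade extra storage for faster reads, which suits a warehouse where data is loaded periodically and queried often.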
Types of Data Warehousing

There are three main types of data warehouse:

1. Enterprise Data Warehouse
2. Operational Data Store
3. Data Mart

Enterprise Data Warehouse:


An enterprise data warehouse (EDW) is a system used for reporting and data analysis, and is
considered a core component of business intelligence. DWs are central repositories of
integrated data from one or more disparate sources. They store current and historical data in
one single place that is used for creating analytical reports for workers throughout the
enterprise.
The data stored in the warehouse is uploaded from the operational systems. The data may
pass through an operational data store and may require data cleansing for additional operations
to ensure data quality before it is used in the DW for reporting.

Operational Data Store:


An operational data store (ODS) is used for operational reporting and as a source of data for the
enterprise data warehouse. It is a complementary element to an EDW in a decision support
landscape, and is used for operational reporting, controls and decision making, as opposed to
the EDW, which is used for tactical and strategic decision support.
An ODS is a database designed to integrate data from multiple sources for additional
operations on the data, for reporting, controls and operational decision support. It is not an
intrinsic part of an enterprise data hub (EDH) solution, although an EDH may be used to
subsume some of the processing performed by an ODS and the EDW. An EDH is a broker of
data; an ODS is certainly not.
Data Mart:
A subset of a data warehouse that supports the requirements of a particular department or
business function.
The characteristics that differentiate data marts from data warehouses include:
 A data mart focuses only on the requirements of users associated with one department or
business function.
 As data marts contain less data than data warehouses, they are more
easily understood and navigated.
 Data marts do not normally contain detailed operational data, unlike data warehouses.
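
As a rough illustration of these characteristics, a data mart can be derived from warehouse data by keeping one department's rows and storing them in summarized form. The sketch uses Python's built-in `sqlite3` module; the `warehouse_fact` and `sales_mart` tables are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_fact (department TEXT, month TEXT, amount REAL);
    INSERT INTO warehouse_fact VALUES
        ('Sales', '2019-01', 120.0),
        ('Sales', '2019-01', 80.0),
        ('Sales', '2019-02', 50.0),
        ('HR',    '2019-01', 30.0);

    -- The mart keeps one department only, summarised by month.
    CREATE TABLE sales_mart AS
        SELECT month, SUM(amount) AS total_amount
        FROM warehouse_fact
        WHERE department = 'Sales'
        GROUP BY month;
""")

mart_rows = conn.execute(
    "SELECT month, total_amount FROM sales_mart ORDER BY month"
).fetchall()
```

The mart ends up with fewer rows than the warehouse and no row-level detail, which is why it is easier to navigate but less flexible.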
Security of Data Warehousing

A data warehouse is an integrated repository derived from multiple source (operational and
legacy) databases. The data warehouse is created either by replicating the different source
data or by transforming them to a new representation. This process involves reading, cleaning,
aggregating and storing the data in the warehouse model. Software tools are used to
access the warehouse for strategic analysis, decision-making, and marketing types of
applications. It can be used for inventory control of shelf stock in many department stores.
Medical and human genome researchers can create research data that can be
either marketed or used by a wide range of users. The information and access privileges in a
data warehouse should mimic the constraints of the source data. A recent trend is to create
web-based data warehouses in which multiple users can create components of the warehouse and
keep an environment that is open to third-party access and tools. Given the opportunity, users ask
for lots of data in great detail. Since source data can be expensive, its privacy and security
must be assured. The idea of adaptive querying can be used to limit access after some data
has been offered to the user. Based on the user profile, access to warehouse data can be
restricted or modified.

1. Replication control
Replication can be viewed in a slightly different manner than perceived in
traditional literature. For example, an old copy can be considered a replica of the
current copy of the data. Slightly out-of-date data can be considered a good
substitute for some users. The basic idea is that the warehouse either keeps different
replicas of the same items or creates them dynamically.
Legitimate users get the most consistent and complete copy of the data, while
casual users get a weak replica. Such a replica may be enough to satisfy the user's need
but does not provide information that can be used maliciously or to breach privacy. We
have formally defined the equivalence of replicas, and this notion can be used to
create replicas for different users. The replicas may be kept at one central site or
distributed to proxies that can serve the users efficiently. In some cases the user may
be given the weak replica first and an upgraded replica later, if the user is willing to
pay for it or deserves it.

2. Aggregation and Generalization

The concept of a warehouse is based on the idea of using summaries and consolidators.
This implies that source data is not available in raw form, which leads to ideas that can be used
for security. Some users can get aggregates only over a large number of records, whereas
others can be given aggregates over small data instances. The granularity of aggregation can be lowered
for genuine users. The generalization idea can be used to give users high-level information at
first, with the lower-level details given after the security constraints are satisfied. For
example, the user may initially be given an approximate answer based on some
generalization over the domains of the database. Inheritance is another notion that allows
increasing capability of access for users. Users can inherit access to related data after
having access to some data item.
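
One simple way to realise aggregation-based security is to refuse to return an aggregate unless it covers enough records. The sketch below is only an illustration of that idea: the `aggregate_with_threshold` function, the `dept` and `salary` fields and the threshold values are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate_with_threshold(records, group_key, value_key, min_group_size):
    """Return the average per group, hiding groups smaller than the threshold."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec[value_key])
    return {
        group: sum(values) / len(values)
        for group, values in groups.items()
        if len(values) >= min_group_size
    }


# Illustrative records: department B has a single member.
records = [
    {"dept": "A", "salary": 100},
    {"dept": "A", "salary": 200},
    {"dept": "A", "salary": 300},
    {"dept": "B", "salary": 500},
]

casual_view = aggregate_with_threshold(records, "dept", "salary", min_group_size=3)
trusted_view = aggregate_with_threshold(records, "dept", "salary", min_group_size=1)
```

With a threshold of three, the single-member department is suppressed because its "average" would reveal one person's value; lowering the threshold for a genuine user corresponds to lowering the granularity of aggregation.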

3. Exaggeration and Misleading

These concepts can be used to distort the data. A view may be available to support a
particular query, but the values in the view may be overstated. For security reasons, the
quality of views may depend on the user involved, and a user can be given an exaggerated view
of the data. For example, instead of giving any specific sales figures, views may scale up and give
only exaggerated data. In certain situations warehouse data can give some misleading
information: information that may be partially incorrect or whose correctness is difficult
to verify. For example, a view of a company’s annual report may
contain the net profit figure including the profit from sales of properties (not the actual sales
of products).
4. Anonymity

Anonymity provides privacy for both the user and the warehouse data. A user does not know the
source warehouse for his query, and the warehouse does not know who the user is or which
particular view the user is accessing (a view may be constructed from many source databases for
that warehouse). Note that a user must belong to the group of registered users and, similarly, a
user must get data only from legitimate warehouses. In such cases, encryption is
used to secure the connection between the users and the warehouse so that no outside user (a user
who has not registered with the warehouse) can access the warehouse.

5. User Profile Based Security

A user profile is a representation of the preferences of an individual user. User profiles
can help in authentication and in determining the levels of security for access to warehouse data.
A user profile must describe how and what has to be represented pertaining to the user's
information and security-level authorization needs. The growth in warehouses has made
access to relevant information in reasonable time difficult, because the large
number of sources differ in terms of context and representation. A warehouse can use data
category details in determining access control. For example, if a user would like to access
an unpublished annual company report, the warehouse server may deny access to it. The
other alternative is to construct a view that reflects only the projected sales and profit report. Such a
construction of a view may be transparent to the user. A server can use the data given in the profile
to decide whether the user should be given access to associated graphical image data. The
server has the option to reduce the resolution or alter the quality of images before making
them available to users.
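
The idea of constructing a restricted view from the user profile can be sketched as follows. Everything here is hypothetical: the `UserProfile` class, the numeric `clearance` scale and the report fields are invented to illustrate the mechanism, not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    name: str
    clearance: int  # hypothetical scale: a higher number means more access


# Hypothetical report: each field records the minimum clearance needed to see it.
REPORT = {
    "projected_sales": {"value": 1200, "min_clearance": 1},
    "projected_profit": {"value": 300, "min_clearance": 1},
    "unpublished_net_profit": {"value": 275, "min_clearance": 3},
}


def build_view(profile, report=REPORT):
    """Return only the fields this profile is cleared to see."""
    return {
        field: entry["value"]
        for field, entry in report.items()
        if profile.clearance >= entry["min_clearance"]
    }


analyst_view = build_view(UserProfile("analyst", clearance=1))
director_view = build_view(UserProfile("director", clearance=3))
```

The analyst receives a well-formed view without any sign that a field was withheld, which matches the point that view construction may be transparent to the user.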
Applications of Data Warehousing

Exploiting Data for Business Decisions


The value of a decision support system depends on its ability to provide the decision-maker
with relevant information that can be acted upon at an appropriate time. This means that the
information needs to be:
 Applicable: The information must be current, pertinent to the field of interest and at
the correct level of detail to highlight any potential issues or benefits.
 Conclusive: The information must be sufficient for the decision-maker to derive
actions that will bring benefit to the organization.
 Timely: The information must be available in a time frame that allows decisions to
be effective.
Each of these requirements has implications for the characteristics of the
underlying system.
To be effective, a decision support system requires access to all relevant data sources,
potentially at a detailed level. It must also be quick to return both ad-hoc and
pre-defined results so that the decision-maker can investigate to an appropriate level of
depth without affecting performance in other areas.

Decision Support through Data Warehousing


One approach to creating a decision support system is to implement a data warehouse, which
integrates existing sources of data with accessible data analysis techniques. An organization’s
data sources are typically departmental or functional databases that have evolved to service
specific and localized requirements.
Integrating such highly focused resources for decision support at the enterprise level requires
the addition of other functional capabilities:
 Fast query handling: Data sources are normally optimised for data storage and
processing, not for their speed of response to queries.
 Increased data depth: Many business conclusions are based on the comparison of
current data with historical data. Data sources are normally focussed on the present
and so lack this depth.
 Business language support: The decision-maker will typically have a background
in business or management, not in database programming. It is important that such a
person can request information using words and not syntax.
A data warehouse meets these requirements by combining the data from the various and
disparate sources into a central repository on which analysis can be performed. This
repository is normally a relational database that provides both the capacity for extra data
depth and the support for servicing queries. Analytical functions are provided by a separate
component which is optimized for extracting, assembling and presenting summary
information in response to word-based queries, such as “show me last week’s sales by
region”.
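
Behind a word-based request such as “show me last week’s sales by region”, the analytical component ultimately has to run a grouped query against the repository. The sketch below shows one possible translation using Python's built-in `sqlite3` module. The `sales` table, its columns and the reading of “last week” as the seven days before a given date are all assumptions made for the example.

```python
import sqlite3
from datetime import date, timedelta


def last_week_sales_by_region(conn, today):
    """Total sales per region for the 7 days before `today` (an assumed meaning of 'last week')."""
    start = (today - timedelta(days=7)).isoformat()
    end = today.isoformat()
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "WHERE sale_date >= ? AND sale_date < ? "
        "GROUP BY region ORDER BY region",
        (start, end),
    )
    return dict(rows)


# Illustrative repository with a hypothetical `sales` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "2019-03-04", 100.0),
        ("East", "2019-03-06", 40.0),
        ("West", "2019-03-05", 70.0),
        ("West", "2019-02-20", 999.0),  # outside the window, so excluded
    ],
)

result = last_week_sales_by_region(conn, date(2019, 3, 8))
```

The decision-maker never sees the SQL; the value of the analytical layer lies in hiding this translation step.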

The proliferation of data warehouses is highlighted by the “customer loyalty” schemes that
are now run by many leading retailers and airlines. These schemes illustrate the potential of
the data warehouse for “micromarketing” and profitability calculations, but there are other
applications of equal value, such as:
 Stock Control
 Product category management
 Basket analysis
 Fraud analysis
All of these applications offer a direct payback to the customer by facilitating the
identification of areas that require attention. This payback, especially in the fields of
fraud analysis and stock control, can be of high and immediate value.
Advantages of Data Warehousing

Major Advantage:
No need for the “Level” indicator in the dimension tables, since no aggregated data is
stored with lower-level detail.

The successful implementation of a data warehouse can bring major benefits to an
organization, including:

 Potential high returns on investment


Implementation of data warehousing by an organization requires a huge investment, typically
from Rs 10 lakh to Rs 50 lakh. However, a study by the International Data Corporation (IDC) in
1996 reported that average three-year returns on investment (ROI) in data warehousing
reached 401%.

 Competitive advantage
The huge returns on investment for those companies that have successfully implemented a
data warehouse are evidence of the enormous competitive advantage that accompanies this
technology. The competitive advantage is gained by allowing decision-makers access to data
that can reveal previously unavailable, unknown, and untapped information on, for example,
customers, trends, and demands.

 Increased Productivity of corporate decision-makers


Data warehousing improves the productivity of corporate decision-makers by creating an
integrated database of consistent, subject-oriented, historical data. It integrates data from
multiple incompatible systems into a form that provides one consistent view of the
organization. By transforming data into meaningful information, a data warehouse allows
business managers to perform more substantive, accurate, and consistent analysis.
 More effective decision making
Data warehousing helps to reduce the overall cost of the product by reducing the number of
channels.

 Saves time and money


Keeping all the data in one place saves users time when accessing a specific set of data.
They can make rapid decisions on key enterprise actions, as enterprises do not spend extra
time analyzing unordered data from multiple sources.
Running a data warehouse does not require much IT support and does not involve a
large number of channels, which helps keep it cost-effective. Similarly, business
executives interested in querying data do not have to wait for other IT processes to finish before
retrieving data. The business continues to run at all times, without any time lag
or reliance on external sources.

 Increased query and system performance


Data warehouses are also designed with speed of data retrieval and analysis in mind. You are
able to store large amounts of data and rapidly query it. These systems are built differently
than operational systems, which are more focused on creating and modifying data. Data
warehouses, on the other hand, are built specifically for analysis and retrieval rather than the
upkeep of individual records.

 Timely access to data


With data warehousing, users and business leadership have access to data from multiple
sources as needed. This way, only a small amount of time is spent on the actual retrieval
process. Scheduled data integration, or ETL, is an important aspect of warehousing because it
consolidates data from multiple sources and transforms it into a useful format. This allows
the user to easily access data from one interface, lessening the reliance on your IT team. In
short, the use of query and analysis tools within a data warehouse allows you to “spend more
time performing data analysis and less time gathering data.”
 Enhanced quality and consistency:
Data warehouse deployment involves the conversion of data from numerous sources and
transformation into a common format. This means that data from multiple business
departments and processes is standardized and consistent. In addition, individual units like
sales, marketing and operations will all use the same data repository for queries and reports.
This allows each department to produce results that align with other teams within the
organization.
Disadvantages of Data Warehousing

Disadvantage:
Dimension tables are still very large in some cases, which can slow performance; front-
end must be able to detect existence of aggregate facts, which requires more extensive
metadata.

 Underestimation of resources for data loading

Sometimes we underestimate the time required to extract, clean, and load the data into the
warehouse. This may take a significant proportion of the total development time, although
some tools are available to reduce the time and effort spent on this process.

 Hidden problems with source systems


Sometimes hidden problems associated with the source systems feeding the data warehouse
are identified only after years of being undetected. For example, when entering the details of a
new property, certain fields may allow nulls, which may result in staff entering incomplete
property data, even when the data is available and applicable.
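
Problems of this kind can be surfaced early by profiling the source data before it is loaded. The sketch below is a minimal, hypothetical check: the `missing_value_rates` function and the property fields are invented for illustration, and a real profiling step would cover many more conditions.

```python
def missing_value_rates(rows, fields):
    """Return, per field, the fraction of rows where the value is None or blank."""
    rates = {}
    for field in fields:
        missing = sum(
            1 for row in rows
            if row.get(field) is None or str(row.get(field)).strip() == ""
        )
        rates[field] = missing / len(rows) if rows else 0.0
    return rates


# Illustrative property records with gaps in optional fields.
properties = [
    {"address": "12 Main St", "rooms": 3, "registered_on": "2018-04-01"},
    {"address": "7 Hill Rd", "rooms": None, "registered_on": ""},
    {"address": "3 Lake Ave", "rooms": 2, "registered_on": None},
    {"address": "9 Park Ln", "rooms": None, "registered_on": "2019-01-15"},
]

rates = missing_value_rates(properties, ["address", "rooms", "registered_on"])
```

A high missing rate on a field that the warehouse needs is a signal to fix the source system's validation rather than to patch the data during loading.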

 Required data not captured


In some cases, data that is very important for the purposes of the data warehouse is not
captured by the source systems. For example, the date of registration of a
property may not be used in the source system, but it may be very important for analysis.

 Increased end-user demands


After some of the end users' queries have been satisfied, requests for support from staff may increase
rather than decrease. This is caused by the users' increasing awareness of the
capabilities and value of the data warehouse. Another reason for increasing demand is that
once a data warehouse is online, the number of users and queries often
increases, together with requests for answers to more and more complex queries.
 Data homogenization
Data warehousing requires the data formats of different data sources to be made similar.
This can result in the loss of some important value of the data.
 Long-duration projects
The building of a warehouse can take up to three years, which is why some organizations are
reluctant to invest in a data warehouse. Sometimes only the historical data of a particular
department is captured, resulting in data marts. Data marts support only
the requirements of a particular department and limit the functionality to that department or
area.
Conclusion

Since the primary task of management is effective decision making, the primary task of
research, and subsequently data warehouses, is to generate accurate information for use in
that decision making.
It is imperative that an organization’s data warehousing strategies reflect changes in the
internal and external business environment in addition to the direction in which the business
is traveling.
Playing an integral role in the growth, development and success of an organization, data
warehouses facilitate meaningful research which facilitates effective management.
Data warehousing is not a new phenomenon. All large organizations already have data
warehouses. But they are just not managing them. Over the next few years, the growth of
data warehousing is going to be enormous with new products and technologies coming out
frequently. In order to get the most out of this period, it is going to be important that data
warehouse planners and developers have a clear idea of what they are looking for and then
choose strategies and methods that will provide them with performance today and flexibility
for tomorrow.
Reference

 en.wikipedia.org/wiki/Data_warehouse
 www.stuudymafia.org
 Seminarstopics.com
 Google.com
