
FALL 2013 ASSIGNMENT

PROGRAM: BACHELOR OF COMPUTER APPLICATION


SEMESTER: 6TH SEM
SUBJECT CODE & NAME: BC0058 DATA WAREHOUSING

Q. No 1: Differentiate between OLTP and Data Warehouse.


ANSWER: Data Warehousing and Online Analytical Processing
A data warehouse is often used as the basis for a decision-support system (also referred to from
an analytical perspective as a business intelligence system). It is designed to overcome some of
the problems encountered when an organization attempts to perform strategic analysis using
the same database that is used to perform online transaction processing (OLTP).
A typical OLTP system is characterized by having large numbers of concurrent users actively
adding and modifying data. The database represents the state of a particular business function
at a specific point in time, such as an airline reservation system. However, the large volume of
data maintained in many OLTP systems can overwhelm an organization. As databases grow
larger with more complex data, response time can deteriorate quickly due to competition for
available resources. A typical OLTP system has many users adding new data to the database
while fewer users generate reports from the database. As the volume of data increases, reports
take longer to generate.
As organizations collect increasing volumes of data by using OLTP database systems, the need to
analyze data becomes more acute. Typically, OLTP systems are designed specifically to manage
transaction processing and minimize disk storage requirements by a series of related,
normalized tables. However, when users need to analyze their data, a myriad of problems often
prohibits the data from being used:
- Users may not understand the complex relationships among the tables, and therefore cannot generate ad hoc queries.
- Application databases may be segmented across multiple servers, making it difficult for users to find the tables in the first place.
- Security restrictions may prevent users from accessing the detail data they need.
- Database administrators prohibit ad hoc querying of OLTP systems, to prevent analytical users from running queries that could slow down the performance of mission-critical production databases.
By copying an OLTP system to a reporting server on a regularly scheduled basis, an organization
can improve response time for reports and queries. Yet a schema optimized for OLTP is often
not flexible enough for decision support applications, largely due to the volume of data involved
and the complexity of normalized relational tables.

For example, each regional sales manager in a company may wish to produce a monthly
summary of the sales per region. Because the reporting server contains data at the same level of
detail as the OLTP system, the entire month's data is summarized each time the report is
generated. The result is longer-running queries that lower user satisfaction.
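To make the pre-aggregation point concrete, here is a minimal Python sketch. The regions, months, and figures are invented for illustration; a real warehouse would maintain the summary table with scheduled loads rather than in memory.

```python
from collections import defaultdict

# Hypothetical detail rows as they might sit on a reporting server:
# (region, month, amount) -- values are illustrative only.
sales = [
    ("North", "2013-01", 120.0),
    ("North", "2013-01", 80.0),
    ("South", "2013-01", 200.0),
    ("North", "2013-02", 50.0),
]

def report_from_detail(rows, month):
    """The slow path: re-summarize every detail row on each report run."""
    totals = defaultdict(float)
    for region, m, amount in rows:
        if m == month:
            totals[region] += amount
    return dict(totals)

# A warehouse instead maintains the summary once, so each monthly
# report becomes a cheap lookup instead of a full scan.
summary = defaultdict(float)
for region, month, amount in sales:
    summary[(region, month)] += amount

print(report_from_detail(sales, "2013-01"))
print(summary[("North", "2013-01")])
```

With four rows the difference is invisible, but the scan in `report_from_detail` grows with the month's transaction volume, while the summary lookup does not.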
Additionally, many organizations store data in multiple heterogeneous database systems.
Reporting is more difficult because data is not only stored in different places, but in different
formats.
Data warehousing and online analytical processing (OLAP) provide solutions to these problems.
Data warehousing is an approach to storing data in which heterogeneous data sources (typically
from multiple OLTP databases) are migrated to a separate homogenous data store. Data
warehouses provide these benefits to analytical users:
- Data is organized to facilitate analytical queries rather than transaction processing.
- Differences among data structures across multiple heterogeneous databases can be resolved.
- Data transformation rules can be applied to validate and consolidate data when data is moved from the OLTP database into the data warehouse.
- Security and performance issues can be resolved without requiring changes in the production systems.
Sometimes organizations maintain smaller, more topic-oriented data stores called data marts.
In contrast to a data warehouse which typically encapsulates all of an enterprise's analytical
data, a data mart is typically a subset of the enterprise data targeted at a smaller set of users or
business functions.
Whereas a data warehouse or data mart are the data stores for analytical data, OLAP is the
technology that enables client applications to efficiently access the data. OLAP provides these
benefits to analytical users:
- Pre-aggregation of frequently queried data, enabling a very fast response time to ad hoc queries.
- An intuitive multidimensional data model that makes it easy to select, navigate, and explore the data.
- A powerful tool for creating new views of data based upon a rich array of ad hoc calculation functions.
- Technology to manage security, client/server query management and data caching, and facilities to optimize system performance based upon user needs.
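The multidimensional model can be sketched with an ordinary dictionary standing in for a cube. The dimensions (region, product, month) and cell values below are invented; a real OLAP server would also pre-compute the aggregates rather than scanning cells.

```python
# A tiny in-memory "cube" keyed by (region, product, month).
cube = {
    ("North", "widgets", "2013-01"): 100,
    ("North", "gadgets", "2013-01"): 40,
    ("South", "widgets", "2013-01"): 60,
    ("North", "widgets", "2013-02"): 70,
}

DIMS = ("region", "product", "month")

def slice_cube(cube, **fixed):
    """Keep only the cells matching the fixed dimension values."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] == val for d, val in fixed.items())}

def rollup(cube, dim):
    """Aggregate away every dimension except `dim`."""
    i = DIMS.index(dim)
    out = {}
    for key, value in cube.items():
        out[key[i]] = out.get(key[i], 0) + value
    return out

print(rollup(cube, "region"))                        # total sales per region
print(slice_cube(cube, region="North", month="2013-01"))
```

Slicing and rolling up are exactly the "select, navigate, and explore" operations the text describes, expressed over one shared set of dimensions.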
The terms data warehousing and OLAP are sometimes used interchangeably. However, it is
important to understand their differences because each represents a unique set of technologies,
administrative issues, and user implications.
SQL Server Tools for Data Warehousing and OLAP
Microsoft SQL Server provides several tools for building data warehouses and data marts, and
OLAP systems. Using DTS Designer, you can define the steps, workflow, and transformations
necessary to build a data warehouse from a variety of data sources. After the data warehouse is
built, you can use Microsoft SQL Server OLAP Services, which provides a robust OLAP server
that can be used to analyze data stored in many different formats, including SQL Server and
Oracle databases.

2 What are the key issues in Planning a Data Warehouse?


ANSWER: Poor planning and improper project management practice are the main factors in data warehouse project failures. First of all, make sure that your company really needs a data warehouse to support its business. Then prepare criteria for assessing the value expected from the data warehouse. Decide on the software for the project and determine where the data warehouse will collect its data from. You also need rules on who will use the data and who will operate the new system. The steps in planning a data warehouse are elaborated one by one below.
Important Key Issues
How do you make sure that the company really needs the data warehouse? The best way to find out is by answering the key questions below.
1. Value and Expectations.
Will your data warehouse help management do better planning? Will the system help them make the right decisions? How much could the system increase the company's market share? What does management expect from the data warehouse? All of these questions are the starting point for evaluating your project plan.
They also serve as end-to-end guidelines through every project phase: whenever the project faces difficulties and needs a decision, simply go back to these primary questions.
2. Risk Assessment.
Assessing risk in an IT project is more than calculating the loss of the project costs. You should also consider the risk to the company of not implementing the system: how many opportunities would the company miss? What is the possible impact on the company's business plan if the project is not finished on schedule? All of these need to be included in your risk assessment, in addition to the potential loss of the project costs.
3. Top-down or Bottom-up.
The top-down approach starts with an enterprise-wide data warehouse: data from across the enterprise is processed in the data warehouse and then used to feed departmental and subject-area data marts. The bottom-up approach starts from individual data marts and combines them into the enterprise data warehouse.
To choose between the two approaches for your company's data warehouse, consider the following: Do you have enough resources, time, and budget to build a corporate-wide data warehouse, and so gain the advantages of a fully unified warehouse? Or does your company need to prove the data warehouse's usefulness first by implementing a small number of data marts and then continuing with further data marts?
4. Build or Buy.
A data warehouse involves a wide range of functions, such as data extraction, data transformation, and loading data into storage. You have to decide whether to buy all these functions from a vendor or to build some of them yourself, customized to your company's business needs.
5. Single Vendor or Best-of-Breed.
Choosing a single-vendor solution has a few advantages:
- High level of integration among the tools
- Consistent look and feel
- Seamless cooperation among components
- Centrally managed information exchange
- Overall price is negotiable
Advantages of a best-of-breed vendor selection:
- You can build an environment that fits your organization
- No need to compromise between database and support tools
- You can select the products best suited for each function

3 Explain Source Data Component and Data Staging Components of Data Warehouse
Architecture.
ANSWER: Data Warehouse Architecture
Different data warehousing systems have different structures. Some may have an ODS
(operational data store), while some may have multiple data marts. Some may have a small
number of data sources, while some may have dozens of data sources. In view of this, it is far
more reasonable to present the different layers of a data warehouse architecture rather than
discussing the specifics of any one system.
In general, all data warehouse systems have the following layers:

- Data Source Layer
- Data Extraction Layer
- Staging Area
- ETL Layer
- Data Storage Layer
- Data Logic Layer
- Data Presentation Layer
- Metadata Layer
- System Operations Layer

The picture below shows the relationships among the different components of the data
warehouse architecture:

Each component is discussed individually below:

Data Source Layer


This represents the different data sources that feed data into the data warehouse. The data sources can be of any format: plain text files, relational databases, other types of databases, Excel files, and so on can all act as data sources.
Many different types of data can be a data source:
Operations -- such as sales data, HR data, product data, inventory data, marketing data,
systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely to be some minimal data cleansing, but unlikely to be any major data transformation.
Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse / data
mart. Having one common area makes it easier for subsequent data processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from a
transactional nature to an analytical nature. This layer is also where data cleansing happens.
The ETL design phase is often the most time-consuming phase in a data warehousing project,
and an ETL tool is often used in this layer.
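The three stages this layer performs can be sketched in a few lines of Python. The CSV content, field names, and cleansing rule below are invented for illustration; a real pipeline would read from the staging area and bulk-load into the warehouse.

```python
import csv
import io

# Extract: a CSV string stands in for a staging-area flat file.
raw = "id,region,amount\n1,north,100\n2,SOUTH,50\n3,north,25\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cleanse (standardize case) and aggregate from the
# transactional grain to an analytical grain (totals per region).
totals = {}
for row in rows:
    region = row["region"].strip().title()   # 'north' / 'SOUTH' -> 'North' / 'South'
    totals[region] = totals.get(region, 0.0) + float(row["amount"])

# Load: in a real system this would be a bulk insert into the
# warehouse's fact table; here it is just a list of dicts.
warehouse_fact_table = [{"region": r, "total_sales": t}
                        for r, t in sorted(totals.items())]
print(warehouse_fact_table)
```

The case-standardization step is where the "data cleansing" mentioned above happens; the aggregation is the shift from a transactional to an analytical nature.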
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of
entities can be found here: data warehouse, data mart, and operational data store (ODS). In any
given system, you may have just one of the three, two of the three, or all three types.
Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the underlying
data transformation rules, but do affect what the report looks like.
Data Presentation Layer

This refers to the information that reaches the users. It can take the form of a tabular or graphical report in a browser, an emailed report that is automatically generated and sent every day, or an alert that warns users of exceptions, among others. Usually an OLAP tool and/or a reporting tool is used in this layer.
Metadata Layer
This is where information about the data stored in the data warehouse system is stored. A logical
data model would be an example of something that's in the metadata layer. A metadata tool is
often used to manage metadata.
System Operations Layer
This layer includes information on how the data warehouse system operates, such as ETL job
status, system performance, and user access history.

4 Discuss the Extraction Methods in Data Warehouses.


ANSWER: The extraction method chosen for a data warehouse depends on the source system, performance, and business requirements. There are two categories of extraction, logical and physical, described in detail below.
Logical extraction
There are two types of logical extraction methods:
Full Extraction: Full extraction is used when the data needs to be extracted and loaded for the
first time. In full extraction, the data from the source is extracted completely. This extraction
reflects the current data available in the source system.
Incremental Extraction: In incremental extraction, the changes in the source data are tracked since the last successful extraction, and only those changes are extracted and loaded. Changes can be detected from source data that carries a last-changed timestamp. Alternatively, a change table can be created in the source system to keep track of changes to the source data.
One more method of getting the incremental changes is to extract the complete source data and then take the difference (a minus operation) between the current extraction and the last extraction. This approach can cause performance issues.
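The minus operation between two full extractions can be sketched as follows. The snapshots are illustrative dictionaries keyed by a hypothetical primary key; the diff yields inserts and updates, and the missing keys yield deletes.

```python
# Two full extractions of the same source table, keyed by primary key.
# Rows and keys are invented for illustration.
last_extraction = {1: ("Alice", "North"), 2: ("Bob", "South")}
current_extraction = {1: ("Alice", "North"),
                      2: ("Bob", "East"),      # updated row
                      3: ("Carol", "West")}    # newly inserted row

# The "minus" between snapshots yields inserts and updates...
changed = {k: v for k, v in current_extraction.items()
           if last_extraction.get(k) != v}

# ...and keys present before but absent now are deletes.
deleted = set(last_extraction) - set(current_extraction)

print(changed)   # only rows 2 and 3 need to be loaded
print(deleted)   # nothing was deleted in this example
```

The performance issue mentioned above is visible here: both snapshots must be held and compared in full, which is expensive for large source tables compared with timestamp- or change-table-based tracking.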
Physical extraction
The data can be extracted physically by two methods:

Online Extraction: In online extraction the data is extracted directly from the source system.
The extraction process connects to the source system and extracts the source data.
Offline Extraction: The data from the source system is dumped outside of the source system
into a flat file. This flat file is used to extract the data. The flat file can be created by a routine
process daily.
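A minimal sketch of offline extraction: the source rows are dumped to a flat file by one routine, and the extraction process later reads only the file, never touching the source system. The file name and rows are invented for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows dumped nightly by a routine on the source system.
source_rows = [("1", "North", "100"), ("2", "South", "50")]

path = os.path.join(tempfile.gettempdir(), "nightly_dump.csv")  # illustrative path
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(source_rows)

# The extraction process reads the flat file, decoupled from the
# source system -- the defining trait of offline extraction.
with open(path, newline="") as f:
    extracted = [tuple(row) for row in csv.reader(f)]

print(extracted == source_rows)
```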

5 Define the process of Data Profiling, Data Cleansing and Data Enrichment.
ANSWER: Data quality is a critical factor for the success of enterprise intelligence initiatives.
Bad data on one system can easily and rapidly propagate to other systems. If information shared
across the organisation is contradictory, inconsistent or inaccurate, then interactions with
customers, suppliers and others will be based on inaccurate information, resulting in higher
costs, reduced credibility and lost business.
SAS Data Integration provides a single environment that seamlessly integrates data quality within the data integration process, taking users from profiling and rules creation through execution and monitoring of results. Organisations can transform and combine disparate data, remove inaccuracies, standardise on common values, parse values, and cleanse dirty data to create consistent, reliable information.
Rules can be built quickly while profiling data, and then incorporated automatically into the
data transformation process. This speeds the development and implementation of cleansed
data. A workflow design environment facilitates the easy augmentation of existing data with new
information to increase the usefulness and value of all enterprise data.
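Profiling followed by rule-based cleansing can be illustrated with a short sketch. The records, the `COUNTRY_RULES` table, and the field names are all invented; they stand in for the profiling and standardization steps described above, not for any SAS feature.

```python
# Illustrative customer records with typical quality problems:
# inconsistent codings and a missing value.
records = [
    {"name": "Alice", "country": "USA"},
    {"name": "bob",   "country": "U.S.A."},
    {"name": None,    "country": "usa"},
]

def profile(records, field):
    """Profiling: count missing values and distinct codings per field."""
    values = [r[field] for r in records]
    return {"missing": values.count(None),
            "distinct": len({v for v in values if v is not None})}

# Cleansing: standardize on common values via a simple rule table
# (rules built while profiling, as the text describes).
COUNTRY_RULES = {"usa": "USA", "u.s.a.": "USA"}  # assumed rules

def cleanse(record):
    c = (record["country"] or "").lower()
    record["country"] = COUNTRY_RULES.get(c, record["country"])
    if record["name"]:
        record["name"] = record["name"].title()
    return record

print(profile(records, "country"))   # three distinct codings of one country
cleaned = [cleanse(dict(r)) for r in records]
print({r["country"] for r in cleaned})
```

Profiling first reveals that "country" has three spellings of one value; the rule built from that finding then collapses them to a single standard coding, which is the profiling-to-rules-to-execution flow the text outlines.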
Key Benefits

- Speeds the delivery of credible information by embedding data quality into batch and real-time processes.
- Reduces costly errors by preventing the propagation of bad data and correcting mistakes at the source.
- Keeps data current and accurate with regular auditing and cleansing.
- Standardises data from multiple sources and reduces redundancy in corporate data to support more accurate reporting, analysis and business decisions.
- Adds value to existing data by generating and/or appending information from other sources.
Key Features
- Database/data warehouse/data mart cleansing through a variety of techniques, including standardization, transformation and rationalization, while maintaining an accurate audit trail.
- Data profiling to identify incomplete, inaccurate or ambiguous data.
- Data enrichment and augmentation.
- Creation of reusable data quality business rules that are callable through custom exits, message queues and Web services.
- Real-time transaction cleansing using standard business rules.
- Data summarization: compresses large static databases into representative points, making them more amenable to subsequent analysis.
- Support for more than 20 worldwide regions with specific language awareness and localizations.

6 What is Metadata Management? Explain Integrated Metadata Management with a block diagram.
ANSWER: Meta-data management (also written metadata management, without the hyphen) involves managing data about other data, where this "other data" is generally referred to as content data. The term is used most often in relation to digital media, but older forms of metadata are catalogs, dictionaries, and taxonomies. For example, the Dewey Decimal Classification, developed for libraries in 1876, is a metadata management system for books.
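The content-data/metadata distinction can be made concrete with a toy catalog in the spirit of the Dewey example. The titles, classification number, and subject below are illustrative only.

```python
# Content data: the items themselves.
books = [
    {"title": "Moby-Dick", "text": "Call me Ishmael..."},
]

# Metadata: data *about* that data -- a catalog entry, like a library
# card (the Dewey class and subject here are illustrative).
catalog = {
    "Moby-Dick": {"dewey_class": "813", "subject": "American fiction"},
}

def find_by_subject(catalog, subject):
    """Locate content via its metadata, as a library catalog does."""
    return [title for title, meta in catalog.items()
            if meta["subject"] == subject]

print(find_by_subject(catalog, "American fiction"))
```

Managing the `catalog` independently of the `books` is, in miniature, what a metadata repository does for schemas, tables, and reports across an integration suite.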
Tools for data profiling, data modeling, data transformation, data quality, and business
intelligence play a key role in data integration. The integrated metadata management
capabilities of IBM InfoSphere Information Server enable these tools to work together to
meet your enterprise goals.
Metadata management in InfoSphere Information Server offers many advantages:

- Sharing metadata throughout the suite from a single metadata repository creates accurate, consistent, and efficient processes.
- Changes that you make to source systems can be quickly identified and propagated throughout the flow of information.
- You can identify downstream changes and use them to revise information in the source systems.
- You can track and analyze the data flow across departments and processes.
- Metadata is shared automatically among tools.
- Glossary definitions provide business context for metadata that is used in jobs and reports.
- Data stewards take responsibility for metadata assets, such as schemas and tables, that they have authority over.
- By using data lineage, you can focus on the end-to-end integration path, from the design tool to the business intelligence (BI) report, or drill down to view any element of the lineage.
- You can eliminate duplicate or redundant metadata to create a single, reliable version that can be used by multiple tools.

Managing metadata
The metadata repository of IBM InfoSphere Information Server stores metadata from suite tools, external tools, and databases, and enables sharing among them. You can import metadata into the repository from various sources, export metadata by various methods, and transfer metadata assets between design, test, and production repositories.
