INTRODUCTION

TO
BUSINESS ANALYTICS
IBM 2201

1 What is BIG DATA?


2 Objectives
 Get an overview of big data that covers:
 What is “Big Data”
 Need for Big Data
 Characteristics of Big Data: the 4 Vs of Big Data
 Importance & Risks of Big Data
 The structure of Big Data
 What is Big Data Analytics? Benefits
 Big Data Adoption
 Applications
4 What is big data?
Examples of big data?

 https://www.youtube.com/watch?v=tkOwlXUaGMM&t=250s
What is Big Data ?
 Definition: “Big Data” is data whose scale,
distribution, diversity, and/or timeliness
require the use of new technical
architectures and analytics to enable
insights that unlock new sources of
business value.

 What makes data “big data”?


• Huge volume of data (for instance, tools that can
manage billions of rows and billions of columns)
• Complexity of data types and structures, with an
increasing volume of unstructured data (80-90% of the
data in existence is unstructured), part of the Digital
Shadow or “Data Exhaust”
• Speed or velocity of new data creation

Copyright © 2011 EMC Corporation. All Rights Reserved. Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity
What is big data?
6
 Big Data is a general term used to describe the
voluminous amount of unstructured and semi-structured
data captured from social media, CCTV, sensors, smart
watches, etc.
 The term Big Data is often used when speaking about
petabytes and exabytes of data.
 A primary goal for looking at big data is to discover
repeatable business patterns.
 For example, if a customer buys a specific type and color of
clothing in a shop, only that transaction data would be available.
However, the customer might have an occasion for which the
purchase was made, and that occasion, and the customer’s
relationship with it, can be captured from external data created
by that customer, for example on social media.

Why “big data” is growing!
7

https://www.youtube.com/watch?v=9s-vSeWej1U
7 A Growing Interconnected and Instrumented World
 12+ TBs of tweet data every day
 30 billion RFID tags today (1.3B in 2005)
 4.6 billion camera phones worldwide
 100s of millions of GPS-enabled devices sold annually
 ? TBs of data every day
 25+ TBs of log data every day
 2+ billion people on the Web by end 2011
 76 million smart meters in 2009… 200M by 2014
9 Videos
 Does social media have the power to change the world?
 https://www.youtube.com/watch?v=Uppg_2nGo54
 Are you ready for digitization?
 https://www.youtube.com/watch?v=ystdF6jN7hc
10 Need for Big Data
 Big Data can unlock significant value by making information
transparent as organizations create and store more transactional
data in digital form. In fact, some leading companies are using their
ability to collect and analyze big data to conduct controlled
experiments to make better management decisions.
 Big Data allows ever-narrower segmentation of customers
and therefore much more precisely tailored products or
services
 Sophisticated analytics can substantially improve decision-
making, minimize risks, and unearth valuable insights that
would otherwise remain hidden
 Big Data can be used to develop the next generation of
products and services

11 IBM’s Big Data: 4V’s
Characteristics of Big Data
12

 1) Volume
 Volume indicates the amount of data for analysis.
 The characteristic most associated with big data, volume refers to
the massive quantities of data.
 Data volumes continue to increase at an unprecedented rate.
 2) Velocity
 Data in motion. The speed at which data is created,
processed and analyzed continues to accelerate.
 Velocity impacts latency – the lag time between when data is
created or captured, and when it is accessible.
 Data is continually being generated at a pace that is
impossible for traditional systems to capture, store and
analyze.

Velocity (Speed)
 Data is being generated fast and needs to be processed fast
 Late decisions → missed opportunities
 Examples
 E-Promotions: based on your current location, your
purchase history and what you like → send promotions right
now for the store next to you
 Healthcare monitoring: sensors monitoring your
activities and body → any abnormal measurement requires
immediate reaction
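The velocity examples above can be sketched as stream processing. This is a minimal Python illustration with hypothetical heart-rate readings and thresholds (none of these numbers come from the slides): each measurement is checked the moment it arrives, instead of being batched for later analysis.

```python
# Minimal velocity sketch: process readings as they arrive and react
# immediately to abnormal values (hypothetical heart-rate data).

def monitor(readings, low=60, high=100):
    """Yield a (time, value) alert the moment an abnormal reading arrives."""
    for t, value in readings:
        if not (low <= value <= high):
            yield (t, value)

stream = [(0, 72), (1, 75), (2, 130), (3, 74)]   # beats per minute
alerts = list(monitor(stream))
print(alerts)   # [(2, 130)]
```

A traditional batch system would only discover the abnormal reading after loading and querying the data; a streaming check reacts within the same iteration.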

IBM Big Data and Analytics at work in Banking

https://www.youtube.com/watch?v=1RYKgj-QK4I
Characteristics of Big Data
 3) Variety
 Different types of data and data sources. Variety is about
managing the complexity of multiple data types, including structured,
semi-structured and unstructured data.
 Organizations need to integrate and analyze data from a complex
array of both traditional and non-traditional information sources.
 Explosion of sensors, smart devices and social collaboration
technologies, generates data in countless forms like text, web data,
tweets, sensor data, audio, video and more.

 4) Veracity
 Veracity is about the uncertainty and correctness of data
 Huge amounts of money are spent by organizations because of data
quality issues.
 Decision makers are not confident in the data they use
for decision making.
https://www.youtube.com/watch?v=wVAWAeOIIII
Module 1: Introduction to BDA
15 Veracity
 Establishing confidence in data is one of the biggest
challenges.
 Uncertainty of data, or veracity, is a very
important characteristic of Big Data.
 Nearly 27% of respondents to a research study
expressed that they were unsure of how much of
their data was inaccurate.
 Poor data quality costs the US economy around
$3.1 trillion a year. 1 in every 3 business leaders
does not trust the information they use to make
decisions.
16
Types of big data

 Structured data
 Semi-structured data
 Unstructured data

https://www.youtube.com/watch?v=mnoqT8nihT8
Variety – Complex Data Structures
Data growth is increasingly unstructured. From more structured to less structured:
• Structured: data containing a defined data type, format, structure
• Example: transaction data and OLAP
• Semi-structured: textual data files with a discernable pattern, enabling parsing
• Example: XML data files that are self-describing and defined by an XML schema
• Unstructured: data that has no inherent structure and is usually stored as different types of files
• Example: text documents, PDFs, images, audio, pictures and video
Structured data

 Structured data is data that has been organized into a formatted repository,
typically a database, so that its elements can be made addressable for more
effective processing and analysis.

 It refers to data that has a defined length and format.

 Ex. numbers, dates, and groups of words and numbers called strings.

 It’s usually stored in a database.

Semi-structured data
19
 Semi-structured data is a form of structured data that does
not conform to the formal structure of data models
associated with relational databases or other forms of
data tables, but nonetheless contains tags or other markers
to separate semantic elements and enforce hierarchies of
records and fields within the data.
 Semi-structured data also includes documents on the web
that contain hyperlinks to other documents; these cannot be
modeled in a natural relational data model because the
pattern of hyperlinks is not regular across documents.
 Example: XML, log file
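As a small sketch of how tags separate semantic elements and enforce hierarchy, here is a Python example parsing a hypothetical XML fragment with the standard library (the customer records are invented for illustration):

```python
# Tags give semi-structured data its hierarchy: the two <customer>
# records share markers but not an identical structure (c2 has no city).
import xml.etree.ElementTree as ET

doc = """
<customers>
  <customer id="c1"><name>Alice</name><city>Boston</city></customer>
  <customer id="c2"><name>Bob</name></customer>
</customers>
"""

root = ET.fromstring(doc)
names = [c.findtext("name") for c in root.iter("customer")]
cities = [c.findtext("city") for c in root.iter("customer")]
print(names)    # ['Alice', 'Bob']
print(cities)   # ['Boston', None]
```

Unlike a relational table, the second record simply omits a field; the markup still lets a parser recover every element that is present.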

20 Semi-structured data
 For example, a clickstream log may look like :
2017-11-01 14:27:57,944-INFO :
com.ovaledge.oasis.dao.DomainDaoImpl - RUNNING
QUERY: Select * from domain where
DOMAINTYPE='DATAAPP_CATEGORY’;
 where we can see the structure but need some rules
to extract the details.
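One possible set of such rules, sketched in Python: a regular expression that recovers the timestamp, log level, emitting class and message from the clickstream line above (the pattern is an assumption about the format, not a documented specification):

```python
# A possible set of "rules" for the clickstream log line: a regular
# expression with named groups for each recoverable field.
import re

LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r"-(?P<level>\w+) :\s*"
    r"(?P<source>[\w.]+) - (?P<message>.*)"
)

line = ("2017-11-01 14:27:57,944-INFO : "
        "com.ovaledge.oasis.dao.DomainDaoImpl - RUNNING QUERY: "
        "Select * from domain where DOMAINTYPE='DATAAPP_CATEGORY';")

record = LOG_PATTERN.match(line).groupdict()
print(record["level"])    # INFO
print(record["source"])   # com.ovaledge.oasis.dao.DomainDaoImpl
```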

Unstructured Data
 Text written in various forms: web pages, emails, chat messages, PDF files, word documents, etc.
 Other examples: applications, music (audio), movies (video), X-rays, pictures
 Real-time/fast data generates unstructured data:
 Mobile devices (tracking all objects all the time)
 Social media and networks (all of us are generating data)
 Scientific instruments (collecting all sorts of data)
 Sensor technology and networks (measuring all kinds of data)
 Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
23 Big Data Analytics
 Big data analytics is the process of examining
large amounts of data of different types (big
data) to uncover hidden patterns, unknown
correlations and other useful information.
 Such information can provide competitive advantages
over rival organizations and result in business benefits,
such as more effective marketing and increased revenue
 Goal: to help companies make better business decisions
by enabling data scientists and other users to analyze huge
volumes of transaction data as well as other data sources
that may be left untapped by conventional business
intelligence (BI) programs.
https://www.youtube.com/watch?v=LtScY2guZpo
24 Big Data Adoption
 The term “big data adoption” is used here to
represent a natural progression of the data,
sources, technologies and skills that are
necessary to create a competitive advantage
in the globally integrated marketplace.
 The four main stages of big data adoption and
progression are Educate, Explore, Engage and
Execute

Big Data Analysis Adoption
25
Structure
1. Educate – build a base of knowledge
 Most organizations in this stage are studying the potential
benefits of big data technologies and analytics, and are trying
to understand how big data can help address important
business opportunities in their own industries or markets.
2. Explore – define the business case and
roadmap
 In this stage organizations get down to formal in-house
discussions about how to use big data to solve important
business challenges.
3. Engage – embracing big data
 Organizations begin to prove the business value of big data,
as well as perform an assessment of their technologies and
skills.
Big Data Analysis
26
Adoption Structure
4) Execute: Implementing big data at scale
In the Execute stage, big data and analytics capabilities are
more widely operationalized and implemented within the
organization.
The small number of organizations in the Execute stage is
consistent with the implementations we see in the marketplace.
Importantly, these leading organizations are leveraging big data
to transform their businesses and thus are deriving the greatest
value from their information assets.

27 Benefits of Big Data Analytics
 Anything involving customers could benefit from big data
analytics.
 This includes better-targeted social-influencer marketing,
customer-base segmentation, and recognition of sales and
market opportunities.
 Business intelligence in general can benefit from big data
analytics
 This could result in more numerous and accurate business
insights, an understanding of business change, better planning
and forecasting, and the identification of root causes of cost.
 Specific analytic applications are likely beneficiaries of big
data analytics
 Big data analytics might help automate decisions for real-time
business processes such as loan approvals or fraud detection.


https://www.youtube.com/watch?v=QvyQFXbgW2c
28 Barriers to Big Data Analytics
 Inadequate staffing and skills are the
leading barriers to big data analytics,
making it hard to turn big data into
something usable for end users
 A lack of business support can hinder a big
data analytics program because of its cost and
the lack of a compelling business case
 Limitations of current database software,
which often lacks in-database analytics,
can also be a barrier to big data analytics
29 Risks of Big Data
 We gather data of different types from different
sources, to the extent that we often don’t know exactly what
it contains; big data therefore carries its own special risks.
 You don’t know whether all or just a tiny piece of it might be
essential to corroborate your compliance with some
government regulation.
 You can’t have a perfect predictive model of how the future
business and regulatory environments are going to evolve.
 But you can have a comprehensive data-risk mitigation
program that will help you deal with new challenges as
they emerge.

The danger of Big Data


https://www.youtube.com/watch?v=y8yMlMBCQiQ
Real-Time Analytics/Decision Requirement
 Product recommendations that are relevant and compelling
 Learning why customers switch to competitors and their offers, in time to counter
 Influencing behavior
 Improving the marketing effectiveness of a promotion while it is still in play
 Friend invitations to join a game or activity that expands business
 Preventing fraud as it is occurring, and preventing more proactively

What is Real-time Analytics?
https://www.youtube.com/watch?v=ioHwEsARPWI
31 Applications of Data Analytics – Use Case
 Telecommunication services
 Problem:
 Legacy systems are used to gain insights from internally generated
data, facing issues of high storage costs, long data loading times, and
long administration processes.
 How big data analytics can help:
 DR processing
 Churn prediction
 Geomapping / marketing
 Network monitoring
32 Applications of Data Analytics – Use Case
 Financial Services
 Problem:
 Managing several petabytes of data, growing at 40-100%
per year, under increasing pressure to prevent fraud and
comply with regulations.
 How big data analytics can help:
 Fraud detection
 Risk management
 360° view of the customer
33 Applications of Data Analytics – Use Case
 Transportation services
 Problem:
 Traffic congestion has been increasing worldwide as a result of
increased urbanization and population growth, reducing the efficiency
of transportation infrastructure and increasing travel time and fuel
consumption.
 How big data analytics can help:
 Real-time analysis of weather and traffic congestion data streams to
identify traffic patterns, reducing transportation costs.
34
Applications of Data
Analytics – Use Case
 Healthcare and Life Sciences
 Problem:
 Vast quantities of real-time information are starting to come from
wireless monitoring devices that postoperative patients and those
with chronic diseases are wearing at home and in their daily lives.
 How big data analytics can help:
 Epidemic early warning
 Intensive Care Unit and remote monitoring

35 Videos
 Demo: IBM Big Data and Analytics at work in Banking
 https://www.youtube.com/watch?v=ioHwEsARPWI
 What is Hadoop?
 https://www.youtube.com/watch?v=4DgTLaFNQq0
 What Is Big Data? & How Big Data Is Changing The World!
 https://www.youtube.com/watch?v=G_e3r4S2g80
Summary
 Terminologies of Big Data
 Need for Big Data
 The Characteristics of Big Data (3V
and 4V)
 Types of Data
 Big Data Analytics Adoption
Chapter 6
INTRODUCTION TO DATA
MINING
Learning objectives:

 After this lesson, you will be able to:
What is Data Mining?
Describe the various techniques in Data mining
process
Understand the KDD Process model
Describe the various phases of CRISP-DM
Applications of Data Mining
Definition of Data mining
 Data mining is the process of discovering interesting knowledge, such
as unknown patterns, associations or significant structures, from large
amounts of data stored in databases, data warehouses or other
information repositories.
 Another definition of data mining: data mining is an iterative process
of creating predictive and descriptive models, by uncovering previously
unknown trends and patterns in vast amounts of data, in order to
support decision making.
 Data mining is a subset of Business Analytics
 There is a need to turn data into useful information and knowledge for
broad applications including
 Market analysis
 Business management
 Decision support
 Customer segmentation and behavior
 Etc.
How does data mining work?

 Data mining builds models to discover


patterns among attributes presented in the
data set.
 Models are:
 Mathematical representations (simple
linear relationships and highly non-linear
relationship) that identify patterns among
attributes of the things such as customers
with products
 Some of these patterns are explanatory
and others are predictive (foretelling
future values of certain attributes)
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 purchases at department/
grocery stores
 Bank/Credit Card
transactions
 Computers have become cheaper and more powerful
 Competitive Pressure is Strong
 Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
What is (not) Data Mining?

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Examples of data mining
applications
 Regarding temporal data, for instance, banking data can be
mined for changing trends, which may aid in the scheduling
of bank tellers according to the volume of customer traffic.
 Stock exchange data can be mined so that trends that could
help to plan investment strategies can be uncovered
 Computer network data streams can be mined to detect
intrusions based on the anomaly of message flows, which
may be discovered by clustering, dynamic construction of
stream models or by comparing the current frequent
patterns with those at a previous time.
 With spatial data, look for patterns that describe changes in
metropolitan poverty rates based on city distances from
major highways. By examining the relationships among a set
of spatial objects, which subsets of objects are spatially auto
correlated or associated can be discovered.
Industry examples of DM applications
 Sales / Marketing
 Identify buying patterns from customers
 Find the association among customer demographic characteristics
 Banking
 Credit card fraud detection
 Identify ‘loyal’ customers
 Insurance and Health Care
 Claims analysis, i.e., which medical procedures are claimed together
 Predict the customers who will buy new policies
 Transportation
 Determine the distribution schedules for the outlets
 Analyze loading patterns
 Medicine
 Characterize patient behavior in order to predict office visits
 Identify successful medical therapies for different diseases / illnesses
Take a break….
Watch a video

 Source of data mining


 https://www.youtube.com/watch?v=Y_JlkzzhAgw
Prediction
 Prediction refers to the act of telling about
the future by taking into account
experiences, opinions and other relevant
information.
 Depending on the nature of what is being
predicted, prediction can be classified as:
 Classification (the predicted thing, such as
tomorrow’s forecast, is a class label such as
“rainy” or “sunny”)
 Regression (the predicted thing, such as
tomorrow’s temperature, is a real number such as 65 F)
 Time-series, where the data consists of values of the
same variable captured and stored over
time at regular intervals, such as a stock price
Prediction techniques
 Classification : assign a new data record to one of several
predefined categories or classes. Also called supervised
learning.
 Classification approaches normally use a training set where
all objects are already associated with known class labels.
 The classification algorithm learns from the training set and
builds a model. The model is used to classify new objects.
 This method has been used in customer segmentation,
business modeling, and credit analysis.
 For example, after starting a credit policy, the
OurVideoStore managers could analyze the customers’
behaviour via their credit history, and accordingly label the
customers who received credit with three possible labels:
“safe”, “risky” and “very risky”. The classification analysis
would generate a model that could be used to either accept
or reject credit requests in the future.
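The credit-labeling scenario can be sketched with a tiny supervised classifier. The following Python example uses a 1-nearest-neighbour model on hypothetical customer features (income, payment delays); the data and labels are invented for illustration, not taken from the slides:

```python
# A toy training set of labelled customers and a 1-nearest-neighbour
# classifier: the model assigns a new record the label of the most
# similar known customer (hypothetical data).

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(training_set, new_record):
    """Return the class label of the closest training example."""
    _, label = min(training_set, key=lambda item: euclidean(item[0], new_record))
    return label

# (income in $k, payment delays last year) -> credit label
training = [
    ((80, 0), "safe"),
    ((60, 1), "safe"),
    ((35, 4), "risky"),
    ((20, 9), "very risky"),
]

print(classify(training, (70, 1)))   # safe
print(classify(training, (22, 8)))   # very risky
```

A real system would learn a richer model (for example, a decision tree) from far more records, but the shape is the same: learn from labelled examples, then classify new objects.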
Associations
 Association rule learning is a popular
and well-researched technique for
discovering interesting relationships among
variables in large databases.
 With the help of bar-code scanners, association
rules for discovering regularities among products
can be captured by the system.
 Types of associations:
 Link analysis : the linkage among many
objects of interest is discovered automatically,
such as the link between web pages and
referential relationships among groups of
academic publication authors
Associations techniques
 Market-basket: detect sets of attributes/items that
frequently have association relationships or correlations
among them, e.g. 90% of the people who buy cookies
also buy milk (60% of all grocery shoppers buy both)
 In data mining, association rules are useful for
analyzing and predicting customer behavior. They
play an important part in shopping basket data
analysis, product clustering, catalog design and
store layout.
 Sequence mining (categorical): discover
sequences of events that commonly occur together,
e.g. in a set of DNA sequences, ACGTC is followed by
GTCA after a gap of 9, with 30% probability
 One event follows another; for example, when a flu
outbreak happens, gloves will be in shortage
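The cookies-and-milk rule can be computed directly from transactions. A minimal Python sketch with an invented toy basket list (the 90%/60% figures above are the slide's, not reproduced here):

```python
# Minimal market-basket sketch (hypothetical transactions): compute
# support and confidence for the rule {cookies} -> {milk}.

transactions = [
    {"cookies", "milk"},
    {"cookies", "milk", "bread"},
    {"cookies"},
    {"milk", "bread"},
    {"cookies", "milk"},
]

def support(itemset):
    """Fraction of all transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"cookies", "milk"}))       # 0.6  (3 of 5 baskets)
print(confidence({"cookies"}, {"milk"}))  # 0.75 (3 of 4 cookie baskets)
```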
Association rules
Clustering
 Clustering: a method of automatically assigning a set of
objects into groups or segments based on similarities.
 Unlike classification, in clustering the class labels are
unknown.
 As the selected algorithm goes through the data set,
identifying commonalities among things based on their
characteristics, the clusters are established.
 Clustering techniques include optimization.
 The goal of clustering is to create groups so that the
members within each group have maximum similarity
and the members across groups have minimum
similarity.
Clustering techniques
 Cluster analysis is a means of identifying
classes of items so that items in a cluster
have more in common with each other
than with items in other clusters.
 Example: create customer segmentation
based on income, age, race, location, etc.
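The segmentation idea can be sketched with k-means on a single attribute. This minimal Python example clusters hypothetical customer incomes (numbers invented for illustration) by alternating two steps: assign each point to its nearest center, then move each center to the mean of its members.

```python
# Minimal 1-D k-means sketch: no class labels are given; the algorithm
# groups points by similarity alone (hypothetical incomes in $k).
import statistics

def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [statistics.mean(members) if members else c
                   for c, members in clusters.items()]
    return sorted(centers)

incomes = [21, 23, 25, 78, 80, 84]           # two obvious income segments
print(kmeans_1d(incomes, centers=[20, 90]))  # two cluster centers, one per segment
```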
Data Mining Techniques
 Outlier Analysis: find the record(s) that is
(are) the most different from the other
records, i.e., find all outliers. Outliers are
data elements that cannot be grouped in a
given class or cluster.
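A minimal outlier-analysis sketch in Python on hypothetical purchase amounts: flag any value more than two standard deviations from the mean, i.e. a record that does not fit the main group.

```python
# Minimal outlier detection: values far from the mean (in units of the
# standard deviation) cannot be grouped with the rest (hypothetical data).
import statistics

def find_outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

amounts = [12, 14, 13, 15, 11, 14, 95]   # 95 does not fit any group
print(find_outliers(amounts))   # [95]
```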
Example of using Data
Mining
Data Mining versus Statistics

Data Mining:
 Starts with a loosely defined discovery statement and uses all existing data (i.e. observational and secondary data) to discover novel patterns and relationships
 Data sets in data mining are as “big” as possible

Statistics:
 Starts with a well-defined proposition and collects sample data (i.e. primary data) to test the hypothesis
 Statistics looks for the right size of data (if more data is available than required for the statistical analysis, usually a sample of the data is used)
Data Visualization
Take a break…
watch a video
 How Facebook Data Mining, And Your Info, Is
Influencing The 2016 Election | TODAY
https://www.youtube.com/watch?v=i-rIYadXoms
Knowledge Discovery in Database
(KDD)
 Knowledge Discovery from Data (KDD), refers to the
broad process of finding knowledge in data that
emphasizes the "high-level" application of particular
data mining methods.
 The unifying goal of the KDD process is to extract knowledge
from data in the context of large databases, done by
using data mining methods
 KDD refers to the entire process of discovering useful
knowledge from data.
 This process involves making decision of what qualifies
as knowledge by evaluating and possibly interpreting
the patterns. It also includes the choice of encoding
schemes, preprocessing, sampling, and projections of
the data prior to the data mining step.
KDD: A Definition

 KDD is the automatic extraction of non-obvious, hidden knowledge from large
volumes of data.
 10^6–10^12 bytes: we never see the whole data set, so we put it in the
memory of computers and then run data mining algorithms.
 What is the knowledge? How do we represent and use it?
Knowledge Discovery Process
Steps in the KDD process
 The Knowledge Discovery in Databases process comprises a few steps
leading from raw data collections to some form of new knowledge.
 The iterative process consists of the following steps:
 Data cleaning: also known as data cleansing, a phase in which noisy
data, irrelevant data and missing data are handled or removed from
the collection.
 Data integration: at this stage, multiple data sources, often heterogeneous,
may be combined in a common source.
 Data selection: at this step, the data relevant to the analysis is decided on
and retrieved from the data collection.
 Data transformation: also known as data consolidation, it is a phase in
which the selected data is transformed into forms appropriate for the mining
procedure.
 Data mining: it is the crucial step in which clever techniques are applied to
extract patterns potentially useful. Searching for patterns of interest in a
particular representational form or a set of such representations, including
classification rules or trees, regression, and clustering
 Pattern evaluation: in this step, strictly interesting patterns representing
knowledge are identified based on given measures.
 Knowledge representation: is the final phase in which the discovered
knowledge is visually represented to the user. This essential step uses
visualization techniques to help users understand and interpret the data
mining results.
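The steps above can be strung together on a toy data set. Everything in this Python sketch (the records, the attributes, the "high spender" pattern) is hypothetical and only illustrates cleaning, selection/transformation, mining and reporting in sequence:

```python
# Minimal KDD pipeline sketch on hypothetical customer records.

raw = [
    {"age": 25, "spend": 200},
    {"age": None, "spend": 150},   # record with a missing value
    {"age": 40, "spend": 900},
    {"age": 38, "spend": 850},
]

# 1. Data cleaning: drop records with missing values.
cleaned = [r for r in raw if all(v is not None for v in r.values())]

# 2-3. Data selection/transformation: keep the relevant attribute and
# bucket it into a mining-friendly categorical form.
transformed = ["high" if r["spend"] > 500 else "low" for r in cleaned]

# 4. Data mining: find the dominant pattern (here, the modal segment).
pattern = max(set(transformed), key=transformed.count)

# 5-6. Pattern evaluation / knowledge representation: report it.
print(f"dominant segment: {pattern}")
```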
3 methodologies of KDD
model
 Fayyad et al. (Computer science)
 E.g., WEKA
 SEMMA (SAS) (Statistics)
 SAS Enterprise Miner
 CRISP-DM (SPSS, OHRA) (Business)
 SPSS
Methodology of KDD –
CRISP-DM
 CRISP-DM
 Stands for Cross Industry Standard Process for
Data Mining
 A non-proprietary, documented, and freely
available data mining model.
 It was developed by industry leaders with
input from more than 200 data mining users
and data mining tool and service providers.
 It is an industry-, tool- and application-neutral
model.
 This model encourages best practices and
offers organizations the structure needed to
realize better, faster results from data mining.
Six phases in CRISP-DM
CRISP-DM (elaborate view)
Six phases of CRISP-DM
1. Business Understanding
This initial phase focuses on understanding the project objectives
and requirements from a business perspective, and then converting
this knowledge into a data mining problem definition, and a
preliminary plan designed to achieve the objectives.
Such as “What are the common characteristics of the customers
we have lost to our competitors recently?”
2. Data Understanding
The data understanding phase starts with an initial data
collection. It proceeds with activities
 ▪ To get familiar with the data,
 ▪ To identify data quality problems,
 ▪ To discover first insights into the data, or to
 ▪ Detect interesting subsets to form hypotheses for hidden
information.
Six phases of CRISP-DM
3. Data Preparation
The data preparation phase covers all activities to
construct the final dataset (data that will be fed into
the modeling tool(s)) from the initial raw data.
Data preparation tasks are likely to be performed
multiple times, and not in any prescribed order. Tasks
include table, record, and attribute selection as well as
transformation and cleaning of data for modeling tools.
4. Modeling
In this phase, several modeling techniques are chosen
and applied, and their parameters are calibrated to optimal
values. Typically, several techniques can be applied to the
same data mining problem type.
Six phases of CRISP-DM
5. Evaluate Results
The accuracy and generality of the model were dealt
with in the previous evaluation steps. The degree to which
the model meets the business objectives is assessed in
this step.
This step also seeks to determine whether there is some
valid business reason why the model is deficient. If time and
budget permit, the model(s) can also be tested in the real
application, which is another option for evaluation.
6. Deployment
The end of the project is not just the creation of the
model. Though the purpose of the model is to increase
knowledge of the data, the knowledge gained needs to
be organized and presented in such a way that the client
can use it.
KDD vs. DM
 DM is a component of the KDD process that is
mainly concerned with means by which
patterns and models are extracted and
enumerated from the data
 DM is quite technical
 Knowledge discovery involves evaluation and
interpretation of the patterns and models to
make the decision of what constitutes
knowledge and what does not
 KDD requires a lot of domain understanding
 The DM and KDD are often used
interchangeably
 Perhaps DM is a more common term in
business world, and KDD in academic world
The end.

 Video: Data Mining and Business Intelligence
 https://www.youtube.com/watch?v=peSNJ5bfjX0
 How data mining works
 https://www.youtube.com/watch?v=W44q6qszdqY
Chapter 7
Data Warehouse &
OLAP

Database Systems: Design, Implementation, and Management


4th Edition

Peter Rob & Carlos Coronel


The Need for Data Analysis

 Constant pressure from external and internal forces


requires prompt tactical and strategic decisions.
 The decision-making cycle time is reduced, while
problems are increasingly complex with a growing
number of internal and external variables.
 Managers need support systems for facilitating
quick decision making in a complex environment.
 Decision support systems (DSS).
 Building a Stock Decision Support Tool in Microsoft
Excel 2010: https://www.youtube.com/watch?v=iXfxxHx21so
Data warehouse

 A data warehouse is a database that provides


support for decision making
 A data warehouse database must be:
 Integrated
 Subject-oriented
 Time-variant
 Non-volatile
 Benefits of Data warehouse (video)
The Data Warehouse

 The Data Warehouse is an integrated,


subject-oriented, time-variant, non-volatile
database that provides support for decision
making.

 Subject-oriented as the warehouse is organized


around the major subjects of the enterprise (such
as customers, products, and sales) rather than
the major application areas (such as customer
invoicing, stock control, and product sales). This
is reflected in the need to store decision-support
data rather than application-oriented data.
The Data Warehouse
 Integrated because of the coming together of source data
from different enterprise-wide applications systems. The
source data is often inconsistent using, for example, different
formats. The integrated data source must be made consistent
to present a unified view of the data to the users.
 Time-variant because data in the warehouse is only
accurate and valid at some point in time or over some time
interval. The time-variance of the data warehouse is also
shown in the extended time that the data is held, the implicit
or explicit association of time with all data, and the fact that
the data represents a series of snapshots.
 Non-volatile as the data is not updated in real time but is
refreshed from operational systems on a regular basis. New
data is always added as a supplement to the database, rather
than a replacement. The database continually absorbs this
new data, incrementally integrating it with the previous data.
Table 13.6A Comparison Of Data Warehouse And Operational
Database Characteristics
Creating A Data Warehouse

Figure 13.3
A Data Warehouse Framework and Views
The Data Warehouse
Twelve Rules That Define a Data Warehouse
1. The Data Warehouse and operational
environments are separated.
2. The Data Warehouse data are integrated.
3. The Data Warehouse contains historical data
over a long time horizon.
4. The Data Warehouse data are snapshot data
captured at a given point in time.
5. The Data Warehouse data are subject-oriented.
6. The Data Warehouse data are mainly read-only
with periodic batch updates from operational data.
No online updates are allowed.
7. The Data Warehouse development life cycle
differs from classical systems development. The
Data Warehouse development is data driven; the
classical approach is process driven.
The Data Warehouse
8. The Data Warehouse contains data with several
levels of detail: current detail data, old detail data,
lightly summarized data, and highly summarized data.
9. The Data Warehouse environment is characterized
by read-only transactions to very large data sets. The
operational environment is characterized by numerous
update transactions to a few data entities at a time.
10. The Data Warehouse environment has a system
that traces data resources, transformation, and
storage.
11. The Data Warehouse’s metadata are a critical
component of this environment. The metadata identify
and define all data elements. The metadata provide
the source, transformation, integration, storage,
usage, relationships, and history of each data element.
12. The Data Warehouse contains a charge-back
mechanism for resource usage that enforces optimal
use of the data by end users.
Architecture of Web-Based
Data Warehousing
OLAP vs. OLTP
 We can divide IT systems into transactional (OLTP)
and analytical (OLAP).
 In general, we can assume that OLTP systems provide source data
to data warehouses, whereas OLAP systems help to analyze it.
OLTP
 OLTP deals with recording the real-time transactions
used in operational systems, such as the transactions
that happen in e-commerce and in banking ATM systems.
 OLTP (On-line Transaction Processing) is characterized by
a large number of short on-line transactions (INSERT,
UPDATE, DELETE).
 The main emphasis for OLTP systems is put on very fast
query processing, maintaining data integrity in multi-
access environments and an effectiveness measured
by number of transactions per second.
 An OLTP database holds detailed, current data; the
schema used to store transactional data is the
entity-relationship model (usually 3NF). 
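The OLTP properties listed above (short INSERT/UPDATE transactions, data integrity under concurrent access) can be illustrated with a small sketch. The schema and the `transfer` helper are hypothetical, invented for this example; SQLite's connection context manager stands in for the transaction handling a real banking system would use.

```python
# Minimal sketch (hypothetical schema) of an OLTP-style transaction:
# short, real-time INSERT/UPDATE statements with integrity enforced.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE account (
                  id      INTEGER PRIMARY KEY,
                  balance REAL NOT NULL CHECK (balance >= 0))""")
conn.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")

def transfer(src, dst, amount):
    """One short ATM-style transaction: both updates commit or neither does."""
    with conn:  # BEGIN ... COMMIT (or ROLLBACK on error) keeps data consistent
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(1, 2, 30.0)
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(balances)  # {1: 70.0, 2: 80.0}
```

Note the contrast with OLAP: the transaction touches only two rows and must finish fast, and the `CHECK` constraint is the database maintaining integrity even if a buggy client attempts an overdraft.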
On-Line Analytical Processing
 On-Line Analytical Processing (OLAP) deals
with analyzing the data stored in the data
warehouse.
 It is an advanced data analysis environment that
supports decision making, business modeling,
and operations research activities.
 Four Main Characteristics of OLAP
 Use multidimensional data analysis techniques
 Provide advanced database support
 Provide easy-to-use end user interfaces
 Support client/server architecture
OLAP
 OLAP (On-line Analytical Processing) is characterized
by a relatively low volume of transactions.
 Queries are often very complex and involve
aggregations.
 For OLAP systems, response time is an
effectiveness measure. OLAP applications are widely
used in data mining. An OLAP database holds
aggregated, historical data, stored in multi-
dimensional schemas (usually the star schema). 
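The "complex queries involving aggregations" point above can be made concrete with a small sketch. The `sales` table and its rows are hypothetical, invented for illustration; the query shape (GROUP BY with an aggregate over historical data, read-only) is the typical OLAP pattern the slide describes.

```python
# Hypothetical sketch of OLAP-style access: one read-only aggregation
# over historical data, rather than many tiny update transactions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, units INTEGER, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (2015, "North", 10, 100.0), (2015, "South", 20, 180.0),
    (2016, "North", 15, 160.0), (2016, "South", 25, 240.0),
])

# Aggregate revenue by year -- the typical OLAP query shape (GROUP BY + SUM).
result = conn.execute("""
    SELECT year, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY year
    ORDER BY year""").fetchall()
print(result)  # [(2015, 280.0), (2016, 400.0)]
```

A real warehouse would run this over millions of rows, which is why response time, not transactions per second, is the effectiveness measure.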
More video
 Introduction to OLAP
 https://www.youtube.com/watch?v=2ryG3Jy6eIY
 Excel Tutorial: What is Business Intelligence and
an OLAP Cube?
 https://www.youtube.com/watch?v=yoE6bgJv08E
On-Line Analytical
Processing
 Multidimensional Data Analysis Techniques
 The processing of data in which data are
viewed as part of a multidimensional
structure.
 Multidimensional view allows end users to
consolidate or aggregate data at different
levels.
 Multidimensional view allows a business
analyst to easily switch business
perspectives.
 Refer to example: Excel
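The "switch business perspectives" idea above can be sketched without any OLAP tool: the same facts are consolidated along whichever dimension the analyst picks. The data and the `consolidate` helper are hypothetical, invented for this example.

```python
# Hypothetical sketch of switching perspectives on the same facts:
# the same rows are aggregated along whichever dimension the analyst chooses.
from collections import defaultdict

sales = [  # (product, region, quarter, units)
    ("X", "East", "Q1", 10), ("X", "West", "Q1", 5),
    ("Y", "East", "Q1", 7),  ("Y", "West", "Q2", 12),
]

def consolidate(rows, dim):
    """Aggregate units along one dimension (0=product, 1=region, 2=quarter)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim]] += row[3]   # row[3] holds the units fact
    return dict(totals)

print(consolidate(sales, 0))  # by product: {'X': 15, 'Y': 19}
print(consolidate(sales, 1))  # by region:  {'East': 17, 'West': 17}
```

This is essentially what an Excel pivot table does when you drag a different field into the rows area: same facts, different consolidation dimension.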
Figure 13.4 Operational Vs. Multidimensional View Of Sales
Figure 13.5 Integration Of OLAP With A Spreadsheet Program
INTEGRATION OF OLAP WITH A SPREADSHEET
PROGRAM - Pivot table in Excel
On-Line Analytical
Processing
 OLAP Architecture
 Three Main Modules
 OLAP Graphical User Interface (GUI)
 OLAP Analytical Processing Logic
 OLAP Data Processing Logic
 OLAP systems are designed to use
both operational and Data Warehouse
data.
As Figure 13.17 illustrates, OLAP systems are designed to use both operational and
data warehouse data. The figure shows the OLAP system components on a single computer,
but this single-user scenario is only one of many. In fact, one problem with the
installation shown here is that each data analyst must have a powerful computer to store
the OLAP system and perform all data processing locally.
Types of On-Line Analytical
Processing
 Relational OLAP (ROLAP)
o Relational On-Line Analytical Processing (ROLAP)
provides OLAP functionality by using relational
database and familiar relational query tools.
 Multidimensional OLAP (MOLAP)
o MOLAP extends OLAP functionality to multidimensional
databases (MDBMS).
o MDBMS end users visualize the stored data as a
multidimensional cube known as a data cube.
o Data cubes are created by extracting data from the
operational databases or from the data warehouse.
o Watch the video:
 https://www.youtube.com/watch?v=LzmAbi5ZOhE
ROLAP – using query tool
Multidimensional OLAP
(continued)
Relational Vs. Multidimensional OLAP
Table 13.8
Star Schema
• The star schema is a data-modeling technique used
to map multidimensional decision support data into a
relational database.
• Star schemas yield an easily implemented model for
multidimensional data analysis while still preserving
the relational structure of the operational database.
• A star schema has four components:
• Facts
• Dimensions
• Attributes
• Attribute hierarchies
Star Schema
• Facts
• Facts are numeric measurements (values) that represent a
specific business aspect or activity. For example, sales
figures are numeric measurements that represent product
and service sales.
• Facts commonly used in business data analysis are units,
costs, prices, and revenues. Facts are normally stored in a
fact table that is the center of the star schema.
• The fact table contains facts that are linked through their
dimensions, which are explained in the next section.
• Facts can also be computed or derived at run time. Such
computed or derived facts are sometimes called metrics to
differentiate them from stored facts.
• The fact table is updated periodically with data from
operational databases.
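The distinction above between stored facts and derived facts ("metrics") can be shown in a few lines. The fact rows and the `revenue` helper are hypothetical, invented for this example: units and unit price are stored, while revenue is computed at run time rather than kept in the fact table.

```python
# Hypothetical sketch of stored facts vs. derived facts ("metrics"):
# units and unit_price are stored; revenue is computed at query time.
fact_sales = [
    {"product": "X", "units": 10, "unit_price": 4.0},
    {"product": "Y", "units": 3,  "unit_price": 9.0},
]

def revenue(fact):
    """Derived metric: not stored in the fact table, computed on demand."""
    return fact["units"] * fact["unit_price"]

total = sum(revenue(f) for f in fact_sales)
print(total)  # 67.0
```

Computing the metric on demand keeps the fact table smaller and guarantees the derived value never drifts out of sync with the stored facts it is computed from.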
Star Schema
• Dimensions
• Dimensions are qualifying characteristics that
provide additional perspectives to a given fact. For
instance, sales might be compared by product from
region to region and from one time period to the
next.
• The kind of problem typically addressed by a BI
system might be to compare the sales of unit X by
region for the first quarters of 2006 through 2016.
• In that example, sales have product, location, and
time dimensions. In effect, dimensions are the
magnifying glass through which you study the facts.
• Such dimensions are normally stored in dimension
tables. Figure 13.6 depicts a star schema for sales
with product, location, and time dimensions.
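A star schema like the sales example above (product, location, and time dimensions around a central fact table) can be sketched directly in SQL. The table and column names below are hypothetical, chosen to mirror the description rather than taken from Figure 13.6 itself.

```python
# Hypothetical sketch of a sales star schema: a central fact table whose
# foreign keys point at product, location, and time dimension tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE fact_sales (                          -- the center of the star
    product_id  INTEGER REFERENCES dim_product,
    location_id INTEGER REFERENCES dim_location,
    time_id     INTEGER REFERENCES dim_time,
    units       INTEGER,                           -- stored facts
    revenue     REAL);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Unit X')")
conn.execute("INSERT INTO dim_location VALUES (1, 'North')")
conn.execute("INSERT INTO dim_time VALUES (1, 'Q1', 2016)")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 10, 100.0)")

# Facts are studied *through* their dimensions: join the fact table outward.
row = conn.execute("""
    SELECT p.name, l.region, t.quarter, f.revenue
    FROM fact_sales f
    JOIN dim_product p  ON f.product_id  = p.product_id
    JOIN dim_location l ON f.location_id = l.location_id
    JOIN dim_time t     ON f.time_id     = t.time_id""").fetchone()
print(row)  # ('Unit X', 'North', 'Q1', 100.0)
```

The query shape shown (fact table joined to each dimension) is the "magnifying glass" metaphor in SQL form: the numeric facts only become meaningful once qualified by their dimensions.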
A Simple Star Schema
Star Schema
• Attributes
 Each dimension table contains attributes. Attributes
are often used to search, filter, or classify facts.
 Dimensions provide descriptive characteristics
about the facts through their attributes.
Table 13.10 Possible Attributes For Sales Dimensions
Star Schema
• OLAP consists of three basic analytical operations:
• consolidation (roll-up)
• Consolidation involves the aggregation of data that can be accumulated
and computed in one or more dimensions.
• For example, all sales offices are rolled up to the sales department or
sales division to anticipate sales trends
• drill-down
• Drill-down is a technique that allows users to navigate
from summarized data down to the underlying details.
• For instance, users can view the sales by individual products that make
up a region's sales
• slicing and dicing.
• Slicing and dicing is a feature whereby users can take out a
specific set of data from the OLAP cube (slicing) and view the
slices from different viewpoints (dicing).
• These viewpoints are sometimes called dimensions (such as looking at
the same sales by salesperson or by date or by customer or by product
or by region, etc.)
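The three operations above (roll-up, drill-down, slice) can be sketched on a tiny fact list. The rows and column positions are hypothetical, invented for illustration only.

```python
# Hypothetical sketch of the three basic OLAP operations on a small fact list.
from collections import defaultdict

facts = [  # (region, office, product, units)
    ("East", "Boston", "X", 10), ("East", "Boston", "Y", 4),
    ("East", "NYC",    "X", 6),  ("West", "LA",     "X", 8),
]

# Roll-up (consolidation): aggregate offices up to their region.
rollup = defaultdict(int)
for region, office, product, units in facts:
    rollup[region] += units
print(dict(rollup))   # {'East': 20, 'West': 8}

# Drill-down: from a region's total back to its office-level detail.
east_detail = [(office, units) for region, office, product, units in facts
               if region == "East"]
print(east_detail)    # [('Boston', 10), ('Boston', 4), ('NYC', 6)]

# Slice: fix one dimension (product = 'X') and examine the remaining ones.
slice_x = [f for f in facts if f[2] == "X"]
print(len(slice_x))   # 3
```

In SQL terms, roll-up is a GROUP BY at a coarser level of the attribute hierarchy, drill-down is the same query at a finer level, and a slice is a WHERE clause that pins one dimension to a single value.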
Example of Aggregation in
A Location Attribute Hierarchy
Figure 13.15
Attribute Hierarchies In Multidimensional Analysis
Figure 13.16
Data Warehouse Implementation Road Map
Figure 13.21
• Refer to the following video about “Data Warehouse Architecture”
• https://www.youtube.com/watch?v=CHYPF7jxlik
• Excel Tutorial: What is Business Intelligence and an OLAP Cube?
• https://www.youtube.com/watch?v=yoE6bgJv08E
• Data Cube Operations – SQL Queries
• https://blogs.perficient.com/2017/08/02/data-cube-operations-sql-queries/