TO BUSINESS ANALYTICS
IBM 2201
Objectives
Get an overview of big data that covers:
What is “Big Data”?
The need for Big Data
Characteristics of Big Data: the 4 Vs of Big Data
Importance and risks of Big Data
The structure of Big Data
What is Big Data Analytics? Benefits
Big Data adoption
Applications
What is big data?
Examples of big data?
https://www.youtube.com/watch?v=tkOwlXUaGMM&t=250s
What is Big Data?
Definition: “Big Data” is data whose scale,
distribution, diversity, and/or timeliness
require the use of new technical
architectures and analytics to enable
insights that unlock new sources of
business value.
Copyright © 2011 EMC Corporation. All Rights Reserved. Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity
What is big data?
Big Data is a general term used to describe the voluminous amount of
unstructured and semi-structured data captured from social media,
CCTV, sensors, smart watches, etc.
The term Big Data is often used when speaking about petabytes and
exabytes of data.
A primary goal of looking at big data is to discover repeatable
business patterns.
For example, if a customer buys a specific type and color of cloth in
a shop, only that transaction data would be available. However, the
customer might be buying it for a particular occasion, and that
occasion, and its relationship to the purchase, can be captured from
the external data that the same customer creates, for example on
social media.
Why “big data” is growing!
https://www.youtube.com/watch?v=9s-vSeWej1U
A Growing Interconnected and Instrumented World
25+ TBs of log data every day
? TBs of data every day
2+ billion people on the Web by end 2011
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
Videos
Does social media have the power to change the world?
https://www.youtube.com/watch?v=Uppg_2nGo54
Are you ready for digitization?
https://www.youtube.com/watch?v=ystdF6jN7hc
Need for Big Data
Big Data can unlock significant value by making information
transparent, as organizations create and store more transactional
data in digital form. In fact, some leading companies are using their
ability to collect and analyze big data to conduct controlled
experiments and make better management decisions.
Big Data allows ever-narrower segmentation of customers
and therefore much more precisely tailored products or
services
Sophisticated analytics can substantially improve decision-
making, minimize risks, and unearth valuable insights that
would otherwise remain hidden
Big Data can be used to develop the next generation of
products and services
IBM’s Big Data: 4V’s
Characteristics of Big Data
1) Volume
Volume indicates the amount of data for analysis. As the
characteristic most associated with big data, volume refers to the
mass quantities of data.
Data volumes continue to increase at an unprecedented rate.
2) Velocity
Data in motion. The speed at which data is created, processed, and
analyzed continues to accelerate.
Velocity affects latency – the lag time between when data is created
or captured and when it is accessible.
Data is continually generated at a pace that is impossible for
traditional systems to capture, store, and analyze.
Velocity (Speed)
Data is being generated fast and needs to be processed fast.
Late decisions lead to missed opportunities.
Examples:
E-Promotions: based on your current location and your purchase
history, send promotions right now for the store next to you.
Healthcare monitoring: sensors monitor your activities and body; any
abnormal measurement requires an immediate reaction.
https://www.youtube.com/watch?v=1RYKgj-QK4I
Characteristics of Big Data
3) Variety
Different types of data and data sources. Variety is about managing
the complexity of multiple data types, including structured,
semi-structured, and unstructured data.
Organizations need to integrate and analyze data from a complex
array of both traditional and non-traditional information sources.
The explosion of sensors, smart devices, and social collaboration
technologies generates data in countless forms: text, web data,
tweets, sensor data, audio, video, and more.
4) Veracity
Veracity is about the uncertainty and correctness of data.
Organizations spend huge amounts of money because of data quality
issues.
Decision makers are not confident in the data they use for decision
making.
https://www.youtube.com/watch?v=wVAWAeOIIII
Module 1: Introduction to BDA
Veracity
Establishing confidence in data is one of the biggest challenges.
Uncertainty of data, or veracity, is a very important characteristic
of Big Data.
Nearly 27% of respondents to a research study were unsure how much
of their data was inaccurate.
Poor data quality costs the US economy around $3.1 trillion a year.
1 in every 3 business leaders does not trust the information they
use to make decisions.
Types of big data
Structured data
Semi-structured data
Unstructured data
https://www.youtube.com/watch?v=mnoqT8nihT8
Variety – Complex Data Structures
Data growth is increasingly unstructured.
Structured Databases
Structured data is data that has been organized into a formatted
repository, typically a database, so that its elements can be made
addressable for more effective processing and analysis.
It refers to data that has a defined length and format.
Examples: numbers, dates, and groups of words and numbers called
strings.
It is usually stored in a database.
Semi-structured data
Semi-structured data is a form of structured data that does not
conform to the formal structure of data models associated with
relational databases or other forms of data tables, but nonetheless
contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data.
An example of semi-structured data is a set of documents on the web
containing hyperlinks to other documents; it cannot be modeled in
the natural relational data model because the pattern of hyperlinks
is not regular across documents.
Examples: XML, log files
Semi-structured data
For example, a clickstream log entry may look like:
2017-11-01 14:27:57,944-INFO :
com.ovaledge.oasis.dao.DomainDaoImpl - RUNNING
QUERY: Select * from domain where
DOMAINTYPE='DATAAPP_CATEGORY’;
Here we can see the structure, but we need some rules to extract the
details.
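One such “rule” can be expressed as a regular expression. Below is a minimal Python sketch that parses the log entry above into its parts; the field names (timestamp, level, source, message) are assumptions about the format, not something the slide specifies.

```python
import re

# Assumed layout: "<timestamp>-<level> : <source> - <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r"-(?P<level>\w+) :\s*"
    r"(?P<source>[\w.]+) - (?P<message>.*)"
)

line = ("2017-11-01 14:27:57,944-INFO : "
        "com.ovaledge.oasis.dao.DomainDaoImpl - RUNNING "
        "QUERY: Select * from domain where DOMAINTYPE='DATAAPP_CATEGORY';")

match = LOG_PATTERN.match(line)
if match:
    record = match.groupdict()
    print(record["timestamp"])  # 2017-11-01 14:27:57,944
    print(record["level"])      # INFO
```

Once parsed into named fields, the entry can be loaded into a structured store and queried like any other structured data.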
Unstructured Data
Unstructured data is text written in various forms – web pages,
emails, chat messages, PDF files, Word documents, etc. – as well as
media such as X-ray pictures.
Real-time/fast data sources, such as mobile devices tracking objects
all the time, also generate unstructured data.
https://www.youtube.com/watch?v=LtScY2guZpo
Big Data Adoption
The term “big data adoption” is used here to
represent a natural progression of the data,
sources, technologies and skills that are
necessary to create a competitive advantage
in the globally integrated marketplace.
The four main stages of big data adoption and
progression are Educate, Explore, Engage and
Execute
Big Data Analysis Adoption Structure
1. Educate – build a base of knowledge
Most organizations in this stage are studying the potential
benefits of big data technologies and analytics, and are trying
to understand how big data can help address important
business opportunities in their own industries or markets.
2. Explore – define the business case and
roadmap
In this stage organizations get down to formal in-house
discussions about how to use big data to solve important
business challenges.
3. Engage – embracing big data
Organizations begin to prove the business value of big data, as well
as perform an assessment of their technologies and skills.
Big Data Analysis Adoption Structure
4) Execute: Implementing big data at scale
In the Execute stage, big data and analytics capabilities are
more widely operationalized and implemented within the
organization.
The small number of organizations in the Execute stage is
consistent with the implementations we see in the marketplace.
Importantly, these leading organizations are leveraging big data
to transform their businesses and thus are deriving the greatest
value from their information assets.
Benefits of Big Data Analytics
Anything involving customers could benefit from big data
analytics.
This includes better-targeted social-influencer marketing,
customer-base segmentation, and recognition of sales and
market opportunities.
Business intelligence in general can benefit from big data
analytics
This could result in more numerous and accurate business
insights, an understanding of business change, better planning
and forecasting, and the identification of root causes of cost.
Specific analytic applications are likely beneficiaries of big data
analytics.
For example, big data analytics might help automate decisions for
real-time business processes such as loan approvals or fraud
detection.
https://www.youtube.com/watch?v=QvyQFXbgW2c
Barriers to Big Data Analytics
Inadequate staffing and skills are the leading barriers to big data
analytics; without them, organizations cannot make big data usable
for end users.
A lack of business support can hinder a big data analytics program,
both in funding and in building a compelling business case.
Problems with the current database software, such as a lack of
in-database analytics, can also be a barrier to big data analytics.
Risks of Big Data
We gather data from different sources and of different types, to the
extent that we often don’t know exactly what it contains; as a
result, big data carries its own special risks.
You don’t know whether all of it, or just a tiny piece of it, might
be essential to corroborate your compliance with some government
regulation.
You can’t have a perfect predictive model of how the future business
and regulatory environments are going to evolve, but you can have a
comprehensive data-risk mitigation program that will help you deal
with new challenges as they emerge.
Examples of big data applications:
Product recommendations that are relevant and compelling
Learning why customers switch to competitors and their offers, in
time to counter
Influencing customer behavior
Friend invitations to join a game or activity that expands business
Improving the marketing effectiveness of a promotion while it is
still in play
Preventing fraud as it is occurring, and preventing more fraud
proactively
Applications of Data Analytics – Use Case
Telecommunication services
Problem:
Legacy systems used to gain insights from internally generated data
face high storage costs, long data-loading times, and long
administration processes.
Applications of Data Analytics – Use Case
Financial Services
Problem:
Manage several petabytes of data, growing at 40–100% per year, under
increasing pressure to prevent fraud and comply with regulations.
How big data analytics can help:
Fraud detection
Risk management
360°View of the Customer
Applications of Data Analytics – Use Case
Transportation services
Problem:
Traffic congestion has been increasing worldwide as a result of
increased urbanization and population growth, reducing the
efficiency of transportation infrastructure and increasing travel
time and fuel consumption.
How big data analytics can help:
Real-time analysis of weather and traffic congestion data streams to
identify traffic patterns and reduce transportation costs.
Applications of Data Analytics – Use Case
Healthcare and Life Sciences
Problem:
Vast quantities of real-time information are starting to come from
wireless monitoring devices that postoperative patients and those
with chronic diseases are wearing at home and in their daily lives.
How big data analytics can help:
Epidemic early warning
Intensive Care Unit and remote monitoring
Video
What is Hadoop?
https://www.youtube.com/watch?v=4DgTLaFNQq0
Summary
Terminologies of Big Data
The need for Big Data
The characteristics of Big Data (3 Vs and 4 Vs)
Types of data
Big Data analytics adoption
Chapter 6
INTRODUCTION TO DATA MINING
Learning objectives:
Prediction
Prediction refers to the act of telling about the future by taking
into account experiences, opinions, and other relevant information.
Depending on the nature of what is being predicted, prediction can
be classified as:
Classification (the predicted thing is a class label, such as
“rainy” or “sunny” for tomorrow’s forecast)
Regression (the predicted thing is a real number, such as 65°F for
tomorrow’s temperature)
Time-series (the data consists of values of the same variable
captured and stored over time at regular intervals, such as a stock
price)
Prediction techniques
Classification: assign a new data record to one of several
predefined categories or classes. Also called supervised learning.
Classification approaches normally use a training set in which all
objects are already associated with known class labels.
The classification algorithm learns from the training set and builds
a model. The model is then used to classify new objects.
This method has been used in customer segmentation, business
modeling, and credit analysis.
For example, after starting a credit policy, the OurVideoStore
managers could analyze their customers’ behavior with respect to
credit and label the customers who received credit with three
possible labels: “safe”, “risky”, and “very risky”. The
classification analysis would generate a model that could be used to
accept or reject credit requests in the future.
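The credit-labeling example can be sketched with a toy classifier. Below is a minimal nearest-neighbour classifier in plain Python: it labels a new customer with the label of the most similar training record. The features (income, late payments per year) and the training records are invented for illustration; a real system would use a proper library and far more data.

```python
def nearest_neighbor(train, query):
    """Classify `query` with the label of its closest training record."""
    def dist(a, b):
        # squared Euclidean distance between two feature tuples
        return sum((x - y) ** 2 for x, y in zip(a, b))
    point, label = min(train, key=lambda rec: dist(rec[0], query))
    return label

# Invented training set: (income in $1000s, late payments per year) -> label
training_set = [
    ((80, 0), "safe"),
    ((60, 1), "safe"),
    ((40, 4), "risky"),
    ((20, 9), "very risky"),
]

print(nearest_neighbor(training_set, (75, 1)))   # safe
print(nearest_neighbor(training_set, (25, 8)))   # very risky
```

This is the supervised-learning pattern described above: learn from labeled objects, then apply the learned model to unlabeled ones.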
Associations
Association rule learning is a popular and well-researched data
mining technique for discovering interesting relationships among
variables in large databases.
With the help of bar-code scanners, the system can capture
regularities among products and discover association rules.
Types of associations:
Link analysis: the linkage among many objects of interest is
discovered automatically, such as the links between web pages or the
referential relationships among groups of academic publication
authors.
Association techniques
Market-basket analysis: detect sets of attributes/items that
frequently occur together or are correlated, e.g. 90% of the people
who buy cookies also buy milk (60% of all grocery shoppers buy
both).
In data mining, association rules are useful for analyzing and
predicting customer behavior. They play an important part in
shopping-basket data analysis, product clustering, catalog design,
and store layout.
Sequence mining (categorical): discover sequences of events that
commonly occur together, e.g. in a set of DNA sequences, ACGTC is
followed by GTCA after a gap of 9, with 30% probability.
One thing comes after another; for example, when a flu outbreak
happens, gloves will be in short supply.
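The market-basket idea above boils down to two measures: support (how often an itemset appears in all transactions) and confidence (how often the consequent appears given the antecedent). The sketch below computes both for the cookies-and-milk rule; the transactions are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = support(A and B) / support(A)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Invented shopping baskets
baskets = [
    {"cookies", "milk"},
    {"cookies", "milk", "bread"},
    {"cookies", "jam"},
    {"milk", "bread"},
    {"cookies", "milk", "eggs"},
]

print(support(baskets, {"cookies", "milk"}))       # 0.6
print(confidence(baskets, {"cookies"}, {"milk"}))  # ~0.75
```

Algorithms such as Apriori scale this idea to millions of transactions by pruning itemsets whose support falls below a threshold.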
Clustering
Clustering: a method of automatically assigning a set of objects
into groups or segments based on similarities.
Unlike classification, in clustering the class labels are unknown.
As the selected algorithm goes through the data set, identifying
commonalities based on the objects’ characteristics, the clusters
are established.
Clustering techniques include optimization: the goal of clustering
is to create groups so that the members within each group have
maximum similarity and the members across groups have minimum
similarity.
Clustering techniques
Cluster analysis is a means of identifying
classes of items so that items in a cluster
have more in common with each other
than with items in other clusters.
Example: create customer segmentation
based on income, age, race, location, etc.
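The segmentation process described above can be sketched with a tiny k-means loop: assign each point to the nearest centroid, move each centroid to the mean of its group, and repeat. The customer data (age, income) and the starting centroids are invented for illustration.

```python
def kmeans(points, centroids, rounds=10):
    """Very small k-means: returns final centroids and the groupings."""
    for _ in range(rounds):
        # Assignment step: each point joins its nearest centroid's group
        groups = {c: [] for c in range(len(centroids))}
        for p in points:
            closest = min(range(len(centroids)),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            groups[closest].append(p)
        # Update step: move each centroid to the mean of its group
        centroids = [tuple(sum(vals) / len(vals) for vals in zip(*grp))
                     if grp else centroids[c]
                     for c, grp in groups.items()]
    return centroids, groups

# Invented customers: (age, income in $1000s)
customers = [(22, 25), (25, 28), (24, 22), (48, 80), (52, 90), (50, 85)]
centroids, segments = kmeans(customers, centroids=[(20, 20), (60, 100)])
print(centroids)  # one centroid per segment: young/low-income vs older/high-income
```

Note that no labels were supplied: the two customer segments emerge purely from the similarity of the records, which is exactly what distinguishes clustering from classification.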
Data Mining Techniques
Outlier Analysis: find the record(s) that is
(are) the most different from the other
records, i.e., find all outliers. Outliers are
data elements that cannot be grouped in a
given class or cluster.
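A simple way to realize outlier analysis is to flag values that lie more than a few standard deviations from the mean. The threshold of 2 standard deviations and the sales figures below are assumptions for illustration; real outlier detection often uses more robust statistics.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values farther than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) > threshold * stdev]

daily_sales = [100, 98, 103, 101, 99, 102, 400]  # one anomalous day
print(find_outliers(daily_sales))  # [400]
```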
Example of using Data Mining
Data Mining versus Statistics
With 10^6–10^12 bytes we never see the whole data set, so we put it
in the memory of computers.
What is the knowledge? How do we represent and use it?
Knowledge Discovery Process
Steps in the KDD process
The Knowledge Discovery in Databases (KDD) process comprises several
steps leading from raw data collections to some form of new
knowledge.
The iterative process consists of the following steps:
Data cleaning: also known as data cleansing, a phase in which noisy,
irrelevant, and missing data are removed from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous,
may be combined in a common source.
Data selection: at this step, the data relevant to the analysis is decided on
and retrieved from the data collection.
Data transformation: also known as data consolidation, it is a phase in
which the selected data is transformed into forms appropriate for the mining
procedure.
Data mining: the crucial step, in which clever techniques are
applied to extract potentially useful patterns – searching for
patterns of interest in a particular representational form or a set
of such representations, including classification rules or trees,
regression, and clustering.
Pattern evaluation: in this step, strictly interesting patterns representing
knowledge are identified based on given measures.
Knowledge representation: is the final phase in which the discovered
knowledge is visually represented to the user. This essential step uses
visualization techniques to help users understand and interpret the data
mining results.
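The steps above can be walked through on a toy data set. The sketch below runs invented sensor readings through cleaning, selection, transformation, and a trivial “mining” step; real pipelines use far richer tools, and the data and thresholds here are assumptions for illustration.

```python
raw = [{"sensor": "t1", "temp": 21.5}, {"sensor": "t1", "temp": None},
       {"sensor": "t2", "temp": 22.0}, {"sensor": "t2", "temp": 999.0}]

# Data cleaning: remove missing and clearly invalid readings
cleaned = [r for r in raw if r["temp"] is not None and r["temp"] < 100]

# Data selection: keep only the attribute relevant to the analysis
selected = [r["temp"] for r in cleaned]

# Data transformation: rescale to the [0, 1] range for the mining step
lo, hi = min(selected), max(selected)
transformed = [(t - lo) / (hi - lo) for t in selected]

# Data mining (trivially): extract the average as a discovered "pattern"
pattern = sum(selected) / len(selected)
print(pattern)  # 21.75
```

Pattern evaluation and knowledge representation would then judge whether this result is interesting and present it visually to the user.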
Three methodologies of the KDD model
Fayyad et al. (computer science) – e.g., WEKA
SEMMA (SAS) (statistics) – SAS Enterprise Miner
CRISP-DM (SPSS, OHRA) (business) – SPSS
Methodology of KDD – CRISP-DM
CRISP-DM
Stands for Cross Industry Standard Process for
Data Mining
A non-proprietary, documented, and freely
available data mining model.
It was developed by industry leaders with
input from more than 200 data mining users
and data mining tool and service providers.
It is an industry-, tool- and application-neutral
model.
This model encourages best practices and
offers organizations the structure needed to
realize better, faster results from data mining.
Six phases in CRISP-DM
CRISP-DM (elaborate view)
Six phases of CRISP-DM
1. Business Understanding
This initial phase focuses on understanding the project objectives
and requirements from a business perspective, and then converting
this knowledge into a data mining problem definition, and a
preliminary plan designed to achieve the objectives.
Such as “What are the common characteristics of the customers
we have lost to our competitors recently?”
2. Data Understanding
The data understanding phase starts with an initial data
collection. It proceeds with activities
▪ To get familiar with the data,
▪ To identify data quality problems,
▪ To discover first insights into the data, or to
▪ Detect interesting subsets to form hypotheses for hidden
information.
Six phases of CRISP-DM
3. Data Preparation
The data preparation phase covers all activities to
construct the final dataset (data that will be fed into
the modeling tool(s)) from the initial raw data.
Data preparation tasks are likely to be performed
multiple times, and not in any prescribed order. Tasks
include table, record, and attribute selection as well as
transformation and cleaning of data for modeling tools.
4. Modeling
In this phase, various modeling techniques are selected and applied,
and their parameters are calibrated to optimal values. Typically,
several techniques can be applied to the same data mining problem
type.
Six phases of CRISP-DM
5. Evaluate Results
The accuracy and generality of the model were dealt with in the
previous evaluation steps. In this step, the degree to which the
model meets the business objectives is assessed.
This step also seeks to determine whether there is some valid
business reason why the model is deficient. If time and budget
permit, another option is to test the model(s) in a trial
application in the real environment.
6. Deployment
The end of the project is not just the creation of the model.
Though the purpose of the model is to increase knowledge of the
data, the knowledge gained needs to be organized and presented in
such a way that the client can use it.
KDD vs. DM
DM is a component of the KDD process that is
mainly concerned with means by which
patterns and models are extracted and
enumerated from the data
DM is quite technical
Knowledge discovery involves evaluation and
interpretation of the patterns and models to
make the decision of what constitutes
knowledge and what does not
KDD requires a lot of domain understanding
DM and KDD are often used interchangeably; perhaps DM is the more
common term in the business world, and KDD in the academic world.
The end.
Figure 13.3
A Data Warehouse Framework and Views
The Data Warehouse
Twelve Rules That Define a Data Warehouse
1. The Data Warehouse and operational
environments are separated.
2. The Data Warehouse data are integrated.
3. The Data Warehouse contains historical data
over a long time horizon.
4. The Data Warehouse data are snapshot data
captured at a given point in time.
5. The Data Warehouse data are subject-oriented.
6. The Data Warehouse data are mainly read-only
with periodic batch updates from operational data.
No online updates are allowed.
7. The Data Warehouse development life cycle
differs from classical systems development. The
Data Warehouse development is data driven; the
classical approach is process driven.
The Data Warehouse
8. The Data Warehouse contains data with several
levels of detail; current detail data, old detail data,
lightly summarized, and highly summarized data.
9. The Data Warehouse environment is characterized
by read-only transactions to very large data sets. The
operational environment is characterized by numerous
update transactions to a few data entities at the time.
10. The Data Warehouse environment has a system
that traces data resources, transformation, and
storage.
11. The Data Warehouse’s metadata are a critical
component of this environment. The metadata identify
and define all data elements. The metadata provide
the source, transformation, integration, storage,
usage, relationships, and history of each data element.
12. The Data Warehouse contains a charge-back
mechanism for resource usage that enforces optimal
use of the data by end users.
Architecture of Web-Based Data Warehousing
OLAP vs. OLTP
Introduction to OLAP
https://www.youtube.com/watch?v=2ryG3Jy6eIY
Excel Tutorial: What is Business Intelligence and an OLAP Cube?
https://www.youtube.com/watch?v=yoE6bgJv08E
On-Line Analytical Processing
Multidimensional Data Analysis Techniques
The processing of data in which data are
viewed as part of a multidimensional
structure.
Multidimensional view allows end users to
consolidate or aggregate data at different
levels.
Multidimensional view allows a business
analyst to easily switch business
perspectives.
Refer to example : Excel
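The consolidation (“roll-up”) described above can be sketched without any OLAP engine: sales facts with product, region, and quarter dimensions, aggregated along whichever dimension the analyst chooses. The figures and dimension names are invented for illustration.

```python
from collections import defaultdict

# Invented sales facts: (product, region, quarter, amount)
sales = [
    ("widget", "east", "Q1", 100),
    ("widget", "west", "Q1", 150),
    ("gadget", "east", "Q1", 200),
    ("widget", "east", "Q2", 120),
]

def roll_up(facts, dimension):
    """Aggregate the sales measure along one dimension
    (0 = product, 1 = region, 2 = quarter)."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact[3]
    return dict(totals)

print(roll_up(sales, 0))  # {'widget': 370, 'gadget': 200}
print(roll_up(sales, 1))  # {'east': 420, 'west': 150}
```

Switching the `dimension` argument is the code equivalent of the analyst switching business perspectives in a pivot table.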
Figure 13.4 Operational Vs. Multidimensional View Of Sales
Figure 13.5 Integration Of OLAP With A Spreadsheet Program – Pivot
table in Excel
On-Line Analytical Processing
OLAP Architecture
Three Main Modules
OLAP Graphical User Interface (GUI)
OLAP Analytical Processing Logic
OLAP Data Processing Logic
Table 13.8
Star Schema
• Facts
• Facts are numeric measurements (values) that represent a
specific business aspect or activity. For example, sales
figures are numeric measurements that represent product
and service sales.
• Facts commonly used in business data analysis are units,
costs, prices, and revenues. Facts are normally stored in a
fact table that is the center of the star schema.
• The fact table contains facts that are linked through their
dimensions, which are explained in the next section.
• Facts can also be computed or derived at run time. Such
computed or derived facts are sometimes called metrics to
differentiate them from stored facts.
• The fact table is updated periodically with data from
operational databases.
Star Schema
• Dimensions
• Dimensions are qualifying characteristics that
provide additional perspectives to a given fact. For
instance, sales might be compared by product from
region to region and from one time period to the
next.
• The kind of problem typically addressed by a BI
system might be to compare the sales of unit X by
region for the first quarters of 2006 through 2016.
• In that example, sales have product, location, and
time dimensions. In effect, dimensions are the
magnifying glass through which you study the facts.
• Such dimensions are normally stored in dimension
tables. Figure 13.6 depicts a star schema for sales
with product, location, and time dimensions.
A Simple Star Schema
Star Schema
• Attributes
Each dimension table contains attributes. Attributes
are often used to search, filter, or classify facts.
Dimensions provide descriptive characteristics
about the facts through their attributes.
Figure 13.15
Attribute Hierarchies In Multidimensional Analysis
Figure 13.16
Data Warehouse Implementation Road Map
Figure 13.21
• Refer to the following video about “Data Warehouse Architecture”
• https://www.youtube.com/watch?v=CHYPF7jxlik