
Course Material

Business Analytics & Data Mining


Chapter-1
Introduction
Chapter-1: Introduction

Introduction:
o Introduction to data mining and business analytics – current status and examples
o Types of problems encountered, with examples; concepts of the data generation process
About the Business:

 The `Soft' aspect brings in the ability of an Organisation to address overall sensitivity to all the stakeholders.

 Anybody who is impacted by the products and processes of an Organisation is called a stakeholder.

 Stakeholders are:
 Owners/ Investors/ Shareholders/ Management
 Customer
 Government
 Design Collaborator
 Employee
 Supplier
 Supply Chain
 Society
 Intellectual community
 Financial/ leasing institutes etc.


About the Business:

 A customer can be defined as the `ultimate recipient of the products or services for their respective use'.

 Addressing sensitivity has three important issues:
o The customer has to be satisfied.
o Governments (legal requirements) have to be at least complied with.
o We need to be sensitive to all other stakeholders. Sensitivity need not mean satisfaction. Sensitivity may mean transparency on values/ policies/ practices/ processes/ requirements/ attitude or behaviour etc.

 Stakeholders support the Organisation to ensure long-term survival. Any gain in sensitivity to the stakeholders would keep the Organisation better focused.

 Hence the real meaning of the business is
o Business = Hard (Finance) + Soft (Sensitivity to the stakeholders)
About the Business:

[Diagram: Increase Profit (Hard) + Ability (Soft). With existing resources: more output from less resources or less investment (reducing opportunities for defects and waste through Value Stream Mapping and Kaizen – DMAIC, Lean Six Sigma) and more output from the same resources (problem solving to reduce defects – DMAIC, Six Sigma), improving cash flow and margins (the bottom line). With more resources/investments: expand capacity, increase market share, and introduce new products that prevent defects so that output is Six Sigma from the beginning (DFSS – DMADV), driving revenue (the top line).]


About the Business:

[Diagram: Increase Profitability (Soft) – non-financial metrics, e.g. Customer Satisfaction Index, Employee Satisfaction Index, Image. With existing resources: better sensitivity from less resources or less investment (reducing opportunities for defects and waste through Value Stream Mapping and Kaizen – DMAIC, Lean Six Sigma) and better sensitivity from the same resources (problem solving – DMAIC, Six Sigma), addressing employees, management, suppliers and society. With more resources/investments: customers look for more, new markets open up, capacity is expanded and new products prevent defects and waste so that output is Six Sigma from the beginning (DFSS – DMADV), addressing the customer and government and ultimately all the stakeholders.]
About the Business- Core & Core
Processes:

 Any organisation will have a core and core processes. The core is the kind of products and services it offers to the customer to pursue a purpose for the benefit of the society.

 Core does not gel well with the concept of `diversification'; rather, it promotes the concept of `expansion'.

 Core is basically the product portfolio connected to the purpose of the customers, following the laws of the land.

 Product = Product + Service with Finance. All three, in combination or in isolation, can be sold to the customers; this goes well in line with expansion of the `core', not `diversification'. Almost 44% of the revenue of GE comes from selling services, yet GE's core is in electrical products. Product application, spares, and financial loan or lease may be well connected around the core products for sale.
About the Business- Core & Core
Processes:

[Diagram: the core processes P1–P6 – Confirm Order, Develop New Product/Process, Fulfill Order, Deliver Order, Collect Money, Collect Customer Feedback. P1 is broken down as follows:
P1.1 – Create product / offer portfolio (initial demand; marketing works on the unknown, unseen customer)
P1.2 – Assess Demand (A): D1 = your sales, D2 = competitor's sales, D3 = demand created by you as well as your competitor
P1.3 – Create Demand (C); Total Demand (D) = A + C
P1.4 – Generate Enquiry (E): how is an enquiry generated?
P1.5 – Proposal you have sent (P)
P1.6 – Win
Key points: the P/E ratio is very important; the number of Regrets R = (E - P) helps in DFSS project identification (new product or new process). Understanding potential = Marketing; realising potential = Sales.]
About the Business- Customer Loyalty
Strategy:

 Entry Strategy
 Salesmanship – the way to win the customer
 First Serving Strategy
 Requires more toil
 Requires strong commitment
 Demonstration of honesty to the customer
 Retention Strategy
 Being trustworthy to the customer

 "It costs 6 – 7 times more to acquire a new customer than retain an existing one." – Bain & Company
 "The probability of selling to an existing customer is 60 – 70%. The probability of selling to a new prospect is 5 – 20%." – Marketing Metrics
 "A 2% increase in customer retention has the same effect as decreasing costs by 10%." – Leading on the Edge of Chaos, Emmet Murphy & Mark Murphy
Business Analytics
 Business analytics (BA) refers to the skills,
technologies, applications and practices for continuous
iterative exploration and investigation of past business
performance to gain insight and drive business planning.
(Beller, Michael J.; Alan Barnett (2009-06-18). "Next Generation Business Analytics". Lightship Partners LLC. )

 Business analytics focuses on developing new insights


and understanding of business performance based on
data and statistical methods.

 Business analytics makes extensive use of data, statistical


and quantitative analysis, explanatory and predictive
modeling, and fact-based management to drive decision
making.
Contents
 Why Business Analytics?
 What is Business Analytics? What are business
problems?
 What are the assumptions / pitfalls in solving BA
problems
 What are supervised and unsupervised learning?
What variables are used to solve BA problems?
 What are the constituents of Business Analytic
systems? What are the differences between BA
and BI?
 The way forward – the organization of the course

Data and Extraction of Information - Current Scenario

• The growth of data availability is mind-boggling. According to Intel the quantity of information
generated from dawn of human history till 2003 – some 5 exabytes – is now created every two
days
• Data processing and storage costs have decreased by a factor of 1000 over the past decade
• Technologies like Hadoop and MapReduce eliminate the need to structure the data in rigidly
defined formats – a costly, labour-intensive proposition
• Powerful techniques for analyzing data to extract various insights have been developed and
software tools are available to enable easy implementation
• Advanced statistical, optimization, machine-learning and data-mining techniques enable
extraction of hitherto unavailable insights

• At present, technology allows us to keep a lot of data

• It should be possible for us to learn a lot about why things
happen and what could happen in future in addition to
uncovering interesting patterns
• This knowledge can often provide competitive advantage apart
from having positive influence on costs and customer satisfaction
• Ability to store the right data in appropriate structure and extract
meaningful information from the same is, therefore, becoming
crucial for business success
What is a Business Problem

 A problem is a difference between the desired


state and the current state
 We assume that the desired state has a higher
level of business benefits compared to the
current state
 The business benefit may be in terms of revenue
/ customer retention / employee satisfaction /
ease of operations (productivity improvement /
defect reduction / availability)…
 The benefits may be measurable in quantitative
terms; they must definitely be observable
Some Examples
• An automobile manufacturer wants to understand how the
fault, failure and usage related data captured through the
sensors may be used to classify the condition of vehicles so
that preventive maintenance may be carried out optimally.
• Similar situations are applicable to manufacturers of mining
equipment, aircraft, and white goods.
• Insurers may wish to classify drivers as very risky, risky, safe
etc. on the basis of their driving habits so that insurance
premium may be fixed intelligently rather than offering one-
size-fits-all policy
• A company engaged in oil exploration may need to estimate
the time and expenses of drilling under different geological
conditions before taking up a drilling assignment
• A company may wish to forecast the total demand based on
past demands, initiatives taken up by the company as well as
past and current economic conditions
Some Examples
• Manufacturers of consumer electronics may need to
understand the sentiment of people communicating over
social media about their products
• A large retailer may like to understand the impact of
impending natural disasters (known through forecasts)
like a hurricane on purchase behaviour
• An e-commerce company may wish to know the impact of
changes made on the portal on the quantum of
sales
• Credit card as well as health insurance companies may
wish to identify fraudulent transactions so that
appropriate actions may be initiated
• A retailer may like to suggest additional products a
customer may be willing to buy on the basis of the
current as well as past surfing data
Assumptions
 Habits / behaviours / usages are measurable
 Sentiments of buyers of white goods

 Driving habits of car owners

 Characteristics of credit card transactions

 Surfing habits of individuals

 Geological conditions characterizing drilling


difficulty
 Characteristics of potential buyers entering

retail stores

Assumptions (Continued…)
 In case we are interested in a specific outcome,
characteristics of the same should be measurable
 Status of a sales offer – a consumer may or may not buy
the product that she enquired about. Accordingly the
outcome variable may be binary – 0 or 1.
 Time to failure for a television – the number of hours the
TV set has operated before failing. The outcome variable
will be a real number starting from 0.
 The number of near misses or minor accidents a driver
had during a period (or for driving certain distance). The
outcome will be an integer count starting from 0
 The perception score like outstanding/ very good/ good/
acceptable/ poor given by a customer regarding the quality
of service of a restaurant. Here the outcome is an ordered
categorical variable with 5 possible values.

Assumptions (Continued…)
 It is assumed that the behaviour / usage / habits
as well as outcome are measured on the same
entity. The following must necessarily be true
 The entity being studied should be well
defined and clearly identifiable
 The outcome and the characteristics of the
entity to be measured must be known in
advance
 The characteristics of the entity as well as the
outcome must be observable as well as
measurable

Assumptions (Continued…)
 In some cases we may not have a clearly defined
outcome
 We may like to assess the sentiments as positive,

negative or neutral without connecting it to sales


 In order to detect medical frauds we may group

transactions with respect to similarity but assess


the chance of a group being fraudulent later
 We may identify many variables that are
supposed to constitute customer satisfaction and
then group these variables to identify a set of
dimensions of customer satisfaction

Two Types of Problems of BA

 Supervised Analytics
 When the response is predefined and we are

trying to establish relationship between the


response (Y) and the input (X) variables
 In supervised analytics we usually try to develop a
functional relationship of the form y = f(x), where x
represents the vector of input variables
 Unsupervised Analytics
 We do not have any specific response. We are

trying to find some patterns of the input


variables
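
To make the distinction concrete, here is a minimal R sketch (R is the software used later in this material); the built-in mtcars and iris data sets are used purely as illustrative assumptions.

# Supervised analytics: a response y (mpg) is related to input variables
# x (wt, hp) through a fitted function of the form y = f(x)
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# Unsupervised analytics: no response variable; we only look for patterns
# (here, three clusters) among the input variables
cl <- kmeans(iris[, 1:4], centers = 3)
table(cl$cluster)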
Understanding Variables

 The outcome variable is called response. It is also


often called Y variable or the output variable.
 The variables measuring the characteristics are
called the explanatory variables and are often
referred to as the X variables. These variables are
often called input variables or features.

Examples of Supervised Analytics

 An HR professional is trying to assess the impact


of characteristics of individuals like age at joining,
gender, current marital status, number of years
with the organization, growth in the organization
and such other variables on the length of stay.
Here the response is the length of stay and the
other variables are the input variables. The
individual employees are the units of analysis
(sampling units).

Examples of Supervised Analytics (Continued…)
 An engineer in a chemical plant wants to
understand the relationship between the
characteristics of a batch, e.g. the proportion of
certain ingredients and the maximum
temperature and pressure, with the batch
quality. Here batch quality is the response and
the individual batches are the units of analysis
(sampling units).
 Note that defining the unit of analysis may not
be easy in case of continuous production

Examples of Supervised Analytics (Continued…)
 An investment consultant is engaged in
estimating the possible closing values of a
particular stock on a daily basis. She uses the
data on the past values to make the prediction
for future days. Here the particular stock being
studied is the entity. The past closing values (on
daily basis) are the input (X) variables. The
value for the next day (if you have values for
day 1, 2, ….n then the value for day n + 1) is
the response (Y).

Examples of Unsupervised Analytics
 A large retailer wants to open a new outlet. Noting that
a new outlet is expensive, the company is planning to
survey the proposed locations to assess potential.
However, a large number of locations have been
identified and the company finds that surveying all
locations is also a costly proposition. In order to reduce
costs, the company collects secondary data relevant to
the potential of the locations and develops a small
number of similar clusters such that only one location
from each cluster may be surveyed. Once the ‘best’
cluster is chosen, survey may be conducted for a few of
the locations within the cluster so that the company can
arrive at a ‘good’ location at a low cost.

Unsupervised Analytics (Continued…)
 Software development is a skill intensive activity and it is
important to develop a holistic methodology to measure
the skills of individual software developers. Skill is
unobservable but it has many constitutive components
that may be observed and measured at least as expert
rating. It is important to group these constituent
components into a few broad dimensions so that scales
for measuring these dimensions may be constructed
using a subset of the proposed constitutive components.
It is also important to assess whether the proposed
constitutive components cover all the important
dimensions. It may be noted that the individual software
developers form the units of analysis and the
constitutive skill components are the input (X) variables.
It may further be noted that there are no responses (Y
variables).

Three Pillars of Business Analytics
 Acquisition, storage and preliminary compilation of data.
This includes acquisition of unstructured data using
technology; preliminary compilation includes
visualization and descriptive analyses
 In depth analyses of data using statistical and machine
learning techniques. At this stage we test hypotheses,
build models to uncover relationships and discover
patterns that are not easily visible
 Understanding the business perspective such that the
problems may be appropriately formulated, interesting
hypotheses may be proposed, right variables may be
identified, and the results may be communicated to the
business users in their language
Components of Business Analytics
[Diagram: the three components of business analytics]

1. Data acquisition, engineering and processing (mostly compilation): operational databases, data warehouses, online processing and mining, enterprise information management systems, data acquisition and cleaning, big data technologies like Hadoop & MapReduce.

2. Application of statistical and data mining tools on specific business problems: formulation of the business problem in statistical terms, breaking down a problem into a set of canonical analytic tasks, identification of statistical / data mining tools, verification of assumptions, avoiding traps like selection bias, processing and interpreting data, model validation, presentation of quantitative solutions, design of data collection plans, analysis of data collected on a campaign basis (e.g. surveys).

3. Feedback to business and implementation: understanding the organizational structure and skills required for effective implementation, identification of important business problems, understanding how information is used and created by line managers (understanding of cognitive and behavioral sciences), setting up measurement systems to assess success, linking business analytics to the strategy of the firm.

Identification of types and classes of problems (horizontals and verticals) cuts across all of the above.
Business Analytics Process

[Flowchart: Business Understanding leads to the Problem Statement, Problem Formulation and Data Understanding; Data Preparation draws on Operational Databases feeding a Data Repository; Model Building and Validation follows, and finally Deployment.]
Data Preparation
 Most organizations maintain data to support regular operations.
These are referred to as operational data.
 For instance, procurement department maintains data on
vendors, prices, and time to supply; manufacturing department
maintains data on defects, production, and manufacturability;
and engineering / design department maintains data on changes
made to drawings of parts procured from vendors. However,
improvement of manufacturability requires data from all three
departments. Often, getting data for the same entity is difficult.
For example, which parts supplied by which vendors according to
which drawing numbers were used in a particular assembly, and
what the results were, may not be easy to gather.
 Data preparation requires compiling different operational data to
obtain an overall holistic view. This activity is often the same as
developing a data warehouse.

Comparison of BA and BI
 Business Intelligence (BI) involves
 Developing warehouses from operational data
 Providing elementary capabilities for data visualization and
descriptive analyses like
 Reporting quantum of sales, showing trends, allowing putting up
business specific alerts, allowing users to get different views
 Business Analytics (BA) involves
 Identifying the possible cause rather than only answering ‘what’
and ‘where’ addressed by BI
 Predicting possible outcomes and even automating decisions like
making suggestions to buyers for improving sales
 Uncovering interesting patterns that can lead to valuable insights

What Have We Learnt?

 Business analytics involves using large amount of


quantitative data to understand relationships or uncover
patterns. Thus business analytics involves learning from
data.
 There are two types of variables – response (Y) and
explanatory (X). The response variables are used to
measure outcomes and the explanatory variables
characterize the entity being studied
 There are two types of problems
 Understanding relationship between some response of interest
with some input variables – called supervised analytics
 Uncovering interesting patterns within the explanatory variables
– called unsupervised analytics

What Have We Learnt? (Continued…)
 The supervised analytics problem is often looked at as a
problem of fitting a function like y = f(x) on the basis of
the quantitative data
 The measurements of the X and Y variables must be
carried out on the same entity. Thus successful
application of BA techniques requires identifying the
entities to be studied and ensuring that the identified
variables are measurable. Ensuring that all
observations are taken on the same entity is of
utmost importance

What Have We Learnt? (Continued…)
 Identification of the variables and defining the
problem require substantive knowledge. This
does not come under the purview of the
statistical / machine learning techniques that we
will be covering in this course
 In case of supervised learning, care must be
taken to ensure that the X and Y variables are
expected to be related from substantive
perspective
 In case of unsupervised learning, the identified
patterns must make business sense

What Have We Learnt? (Continued…)
 Business analytics have three important
constituents – data acquisition, storage,
compilation and preliminary analyses; in depth
analyses using statistical / machine learning
techniques; and managerial and business
understanding for problem formulation as well
as effective communication with business users

What Have We Learnt? (Continued…)
 Usually businesses maintain data to support their
regular operations. These are called operational
data
 Successful business analytics requires connecting
different operational data and developing a data
warehouse. This activity is referred to as data
preparation
 Business analytics differs from BI as it offers in-depth
insights not available from preliminary compilations
and visualization. BI is restricted to development of
data warehouses and providing preliminary
descriptive and visualization tools

Review Questions
 What is meant by learning from data? How is it related to
business problems?
 What are supervised and unsupervised learning (analytic)
problems? Give examples.
 What are response and explanatory variables? Give examples.
 What is a sampling unit (unit of analysis)? How and why is it
important in the context of business analytics?
 What is the role of substantive knowledge in the context of
business analytics?
 What are the constituents of business analytics?
 What is operational data? What is a data warehouse?
 What is meant by data preparation? How is it relevant to BA?
 What are the key differences between BA and BI?

Coverage
 In this course we will primarily look at formulation of
problems; understanding and preparing data; and
developing models to solve the identified problems
 The data preparation will be covered partially as we
will look into the issues of treatment of missing data
but will not cover technical issues of data capture or
the maintenance of data warehouses
 Concepts of modeling (we will refer to it as
statistical / machine learning techniques that forms
the core of data science) will be covered in detail
 Some deployment and managerial issues will be
discussed from theoretical standpoint as well as
through case examples

About the Business:
An Organisation is in business to ensure sustainable growth in profit.
 Financial results do not mean profit alone. The Organisation looks for:
 Quantum of money – Revenue (Top line)
 Quality of money – Margin/ Savings/ Profit (Bottom line)
 Speed of money – Liquidity/ Cash flow
 Ease of money – Ease of doing business (Soft Money)

 The first three are termed the Hard aspects of the business – sometimes
spelt out as `looking for hard savings', i.e. short-term financial results (Hard Money).
 The fourth one calls for Soft gain. Customers, employees, Management and
Suppliers should feel comfortable doing business or working with the
Organisation. Soft money is the survival aspect.
About the Business:
The `Soft' aspect brings in the ability of an Organisation
to address overall sensitivity to all the stakeholders.
 Anybody who is impacted by the products and
processes of an Organisation is called a stakeholder.
 Stakeholders are:
 Owners/ Investors/ Shareholders/ Management
 Customer
 Government
 Financial/ leasing institutes etc.
 Employee
 Supply Chain
 Society
 Intellectual community
 Design Collaborator
 Supplier
About the Business
 A customer can be defined as the `ultimate recipient of the
products or services for their respective use'.

 Addressing sensitivity has three important issues:

1. The customer has to be satisfied.
2. Governments (legal requirements) have to be at least complied with.
3. We need to be sensitive to all other stakeholders. Sensitivity need not
mean satisfaction. Sensitivity may mean transparency on values/
policies/ practices/ processes/ requirements/ attitude or behavior etc.

 Stakeholders support the Organisation to ensure long-term survival.


Any gain in sensitivity to the stakeholders would keep the Organisation
better focused.
Hence the real meaning of the business is
Business = Hard (Finance) + Soft (Sensitivity to the stakeholders)
About the Business
 Any organisation will have a core and core processes. The core is the
kind of products and services it offers to the customer to pursue
a purpose for the benefit of the society.

 Core does not gel well with the concept of `diversification'; rather,
it promotes the concept of `expansion'.

 Core is basically the product portfolio connected to the purpose
of the customers, following the laws of the land.

 Product= Product + Service with Finance- All the three in


combination or in isolation can be sold to the customers; yet it
goes well in line with the expansion of the `core’ not
`diversification’. Almost 44% of the revenue of GE comes from
selling services, yet the GE’s core is in electrical products.
Product application, spares, financial loan or lease may be well
connected around the core products for sale.
Basic domains within analytics

 Retail Sales analytics


 Financial services analytics
 Risk and credit analytics
 Marketing analytics
 Behavioral analytics
 Fraud analytics
 And many more……..
Where to start?

 DATA
 Lots of data is being collected and warehoused
 Web data, e-commerce

 purchases at department/

grocery stores
 Bank/Credit Card

transactions
 Data transactions at Mall

 .........
Types of Attributes

 There are two basic types of data

 Qualitative (Nominal, Ordinal)


 Quantitative/ Numeric

 Examples

 Need to take account of the data type in the analysis.


Data Preparation

 Missing values-

 Ignore the tuple,


 manually filling,
 replace by global constant,
 use attribute mean,
 use attribute mean of the corresponding class,
 use most probable value.
Data preparation

 Outlier treatment
 Meaning

 Influence

 Remedy
Data integration and Transformation

 Merging data from various sources


 Aggregation (daily to monthly to annually etc.)
 Normalization (min-max normalization, z-score)
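
As a minimal R sketch of the two normalizations named above (the vector x below is made-up illustrative data):

x <- c(10, 20, 35, 50, 80)

# Min-max normalization: rescales x to the interval [0, 1]
x_minmax <- (x - min(x)) / (max(x) - min(x))

# z-score standardization: zero mean and unit standard deviation
x_z <- (x - mean(x)) / sd(x)    # equivalently, scale(x)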
Evolution of Database system Technology

 Data collection and Database creation (1960s and


earlier)

 DBMS( 1970s-early 1980s)

 Advanced Data Analysis (Data warehousing and


Data mining) (1980s-present)
EDA

 Summary

 Histograms (Spread, Shape, Location)

 Boxplots
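
A minimal R sketch of these EDA tools, using the built-in mtcars data purely as an example:

summary(mtcars$mpg)                            # location and spread in one line
hist(mtcars$mpg, main = "Histogram of mpg",
     xlab = "Miles per gallon")                # spread, shape, location
boxplot(mtcars$mpg, main = "Boxplot of mpg")   # median, quartiles, outliers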
Multivariate Data Analysis

Conventional Analysis :

 Summary
 Dispersion
 Association (between attributes)
 Correlation (between numeric variables)
 Group comparison
Data Reduction
 Principal component Analysis
 Variables under study are transformed to new
variables, which are linear combinations of
the variables under study.
 It is a projection of a high-dimensional space onto a
low-dimensional space.
 The new variables are called principal
components. The first PC explains the
maximum variation in the data, then the second PC,
and so on. These PCs are uncorrelated with each
other. (A small R sketch follows this list.)
 Factor Analysis
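
A minimal R sketch of principal component analysis with prcomp(), on the built-in mtcars data (chosen only as an illustration); the variables are scaled before projection:

pca <- prcomp(mtcars, scale. = TRUE)
summary(pca)          # proportion of variance explained by each PC
head(pca$x[, 1:2])    # scores of the observations on the first two PCs
# factanal() provides a comparable starting point for factor analysis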
Multiple Regression Analysis
 Concept of dependent and independent variables

 Basic assumptions

 Linear and nonlinear regression

 Use of regression

 Subset selection

 Effect of violation of assumption
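
A minimal R sketch touching several of the points above, again with the built-in mtcars data (mpg as the dependent variable, wt and hp as independent variables; the choice of variables is an illustrative assumption):

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)                          # coefficients, R-squared, significance
plot(fit)                             # residual plots for checking assumptions
step(fit, direction = "backward")     # a simple form of subset selection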


Data Mining

 What is data mining?


 Extracting or mining knowledge from large

amount of data. (Like gold mining, it is actually knowledge


mining!)

 It is an interdisciplinary field, the confluence of


a set of disciplines, including database
systems, statistics, machine learning,
visualization, and information science.
Core Ideas in Data Mining

 Data Reduction
 Data Exploration
 Data Visualization
 Classification
 Prediction
 Association Rules or Affinity Analysis
Data Mining-Whose domain

[Diagram: data mining lies at the intersection of four communities – the Domain Expert, Computer Science, Mathematics and Statistics.]
Supervised Learning

 This type of learning is “similar” to human


learning from experience. Since computers have
no experience, we provide previous data, called
training data as a substitute. It is analogous to
learning from a teacher and hence the name
supervised. Two such tasks in data mining are
classification and prediction. In classification,
data attributes are related to a class label while in
prediction they are related to a numerical value.
Unsupervised Learning

 In this type of learning, we discover patterns in


data attributes to learn or better understand the
data. Clustering algorithms are used to discover
such patterns, i.e., to determine data clusters.
Algorithms are employed to organize data into
groups (clusters) where members in a group
are similar in some way and different from
members in other groups.
Major Data Mining Tasks

 Prediction Methods
 Use some variables to predict unknown or

future values of other variables.

 Description Methods
 Find human-interpretable patterns that

describe the data.


Steps involved in Data Mining

1. Develop an understanding of the purpose of the Data


Mining Project
2. Obtain the data set to be used in the analysis
3. Explore, Clean and preprocess the data
4. Reduce the data and partition it into possibly three sets-
training, validation and testing
5. Determine the data mining task (Classification,
prediction, clustering…)
6. Choose an appropriate Data mining technique
7. Use algorithms to perform the task
8. Interpret the results of analysis
9. Deploy the model
Training, Validation and Test

 In data mining applications we seek a model that learns


well from the available data as well as has good
generalization performance.
 Towards this objective, we need a way to manage model
complexity and a method to measure performance of the
chosen model.
 A common approach is to divide the data into three sets,
training, validation and test. Training data are used to
learn or develop candidate models, validation set is used
to select a model and test set is used for assessing model
performance on future data.
 However, in many applications only two sets are created,
training and test.
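
A minimal R sketch of such a three-way split on the built-in mtcars data; the 60/20/20 proportions are an illustrative assumption, not a prescription:

set.seed(123)                                   # for reproducibility
n   <- nrow(mtcars)
idx <- sample(1:3, size = n, replace = TRUE, prob = c(0.6, 0.2, 0.2))
train      <- mtcars[idx == 1, ]                # used to fit candidate models
validation <- mtcars[idx == 2, ]                # used to select a model
test       <- mtcars[idx == 3, ]                # used to assess performance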
Big Data Mining

Three pillars – statistical and machine learning;


big data and data warehousing technology;
managerial usages, applications, and
competing with analytics
Where to start?

 DATA
 Lots of data is being collected and warehoused
 Web data, e-commerce

 purchases at department/

grocery stores
 Bank/Credit Card

transactions
 Data transactions at Mall

 .........
Data

[Diagram: data divides into Qualitative (Ordinal, Nominal) and Quantitative/Cardinal (Discrete, Continuous) types.]


Understanding Data

 Do all numbers have the same type?



Types of Attributes

 There are two basic types of data

 Attribute (Nominal, Ordinal)


 Numeric

 Examples

 Need to take account of the data type in the analysis.


Data Mining

 What is data mining?


 Extracting or mining knowledge from large

amount of data. (Like gold mining, it is actually knowledge


mining!)

 It is an interdisciplinary field, the confluence of


a set of disciplines, including database
systems, statistics, machine learning,
visualization, and information science.



Core Ideas in Data Mining

 Data Reduction
 Data Exploration
 Data Visualization
 Classification
 Prediction
 Association Rules or Affinity Analysis
 Predictive Analytics
What is Data Mining?

Origins of Data Mining

 Draws ideas from machine learning/AI, pattern recognition,


statistics, and database systems
 Traditional techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data

[Diagram: data mining at the confluence of Statistics, Machine Learning / AI / Pattern Recognition, and Database systems.]
Problem

 Formulation of some data mining problems
(domain-wise)
Classification: Definition

 Given a collection of records (training set )


 Each record contains a set of attributes, one

of the attributes is the class or Label


 Find a model for class attribute as a

function of the values of other attributes.


(Relate class attribute with the rest of the
attributes)
 Goal: previously unseen records should be

assigned a class as accurately as possible.


Classification: Example 1

 Direct Marketing
 Goal: Reduce cost of mailing by

capturing/identifying a set of consumers


likely to buy a new cell-phone product.
Classification: Example 1
 Approach:
 Use the data for a similar product introduced
before.
 We know which customers decided to buy and
which decided otherwise. This {buy, don’t buy}
decision forms the class attribute.
 Collect various demographic, lifestyle, and
company-interaction related information about all
such customers.
 Type of business, where they stay, how much they earn,
etc.
 Use this information as input attributes to learn a
classifier model.
Classification: Example

 Fraud Detection
 Goal: Predict fraudulent cases in credit card
transactions.
Classification: Example

 Approach:
 Use credit card transactions and the information on
its account-holder as attributes.
 When does a customer buy, what does he buy, how often
he pays on time, etc
 Label past transactions as fraud or fair transactions.
This forms the class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit
card transactions on an account.
Classification: Example

 Customer Attrition/Churn:
 Goal: To predict whether a customer is likely to

be lost to a competitor.
Classification: Example

 Approach:
 Use detailed record of transactions with each of the
past and present customers, to find attributes.
 How often the customer calls, where he calls, what time-
of-the day he calls most, his financial status, marital
status, etc.
 Label the customers as loyal or disloyal.
 Find a model for loyalty.
Classification Techniques
 Discriminant Analysis
 Nearest Neighbour Rule
 Logistic Regression
 Bayesian Classifier
 Decision Tree and CART
 And many more like ANN, SVM etc..
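
A minimal R sketch of two of the techniques listed above, using the built-in iris data (three classes) only as an illustration; the rpart package is assumed to be installed:

library(rpart)                                   # CART-style decision trees
tree <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(tree, iris, type = "class")
table(Predicted = pred, Actual = iris$Species)   # confusion matrix

# Logistic regression for a binary class label (versicolor vs the rest)
iris$is_versicolor <- as.integer(iris$Species == "versicolor")
logit <- glm(is_versicolor ~ Sepal.Length + Sepal.Width,
             data = iris, family = binomial)
summary(logit)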
Chapter-2
Big Data
CHAPTER- 2: BIG DATA- BIG TICKET
PROJECT IN THE WORLD

STARTS WITH RECOGNISING BIG


MESS CREATED BY HUGE MULTI-
DIMENSIONAL DATA HAVING NO CLUE
TO CONCLUDE
BACKGROUND
• WE ARE TALKING ABOUT THE DATA WHICH ARE AVAILABLE FOR AN ORGANISATION, BY THE
ORGANISATION AND BY ESTABLISHMENTS IN THE CONTEXT OF THE ORGANISATION.

• DATA WHICH ARE AVAILABLE TO UNDERSTAND AN ORGANISATION FROM THE EXTERNAL
STAKEHOLDERS – VIZ. CUSTOMERS, SUPPLY CHAIN, COMPETITORS, GOVERNMENT, SOCIETY,
SUPPLIERS, FINANCIAL INSTITUTIONS, INTELLECTUAL COMMUNITY…

• THESE DATA ARE INPUTS AND ARE AVAILABLE IN BITS AND PIECES, USUALLY NOT AT REGULAR
INTERVALS OF TIME, IN THE CONTEXT OF TACTICAL SURVEYS, LITERATURE SEARCH AND
SPECIAL-PURPOSE INVESTIGATIONS. DATA ARE AVAILABLE IN DIFFERENT POCKETS AND IN AN
APPARENTLY NON-SYNCHRONISABLE MANNER.

• IN A NUTSHELL, IF WE WANT TO UNDERSTAND THE PERCEPTIONS AND EXPERIENCE OF
EXTERNAL STAKEHOLDERS ABOUT THE ORGANISATION OVER A PERIOD OF TIME, IT IS
APPARENTLY VERY DIFFICULT.

• SHALL WE LEAVE IT AT THAT, OR CAN SUCH HELTER-SKELTER BIG DATA, LYING IN DIFFERENT
POCKETS OVER A PERIOD OF TIME, BE PUT INTO SOME ARRANGEMENT FROM WHICH MEANINGFUL
ANALYSIS CAN BE EXTRACTED AND THE FINDINGS USED FOR EFFECTIVE STRATEGY OR TACTICS?
BIG DATA- SAMPLE OF CONFUSION
ORGANISATION – INTERNAL: OWNER/ MANAGEMENT, EMPLOYEES, CONTRACT EMPLOYEES

EXTERNAL: CUSTOMER, COMPETITORS, SUPPLY CHAIN, TECHNOLOGY COLLABORATOR,
GOVERNMENT, SUPPLIERS, FINANCIAL INSTITUTE, MEDIA/ NEWSPAPER, SOCIETY

ALL SPEAK OR CARRY OUT TRANSACTIONS DIFFERENTLY, AT DIFFERENT TIME INTERVALS AND
IN DIFFERENT FORMS, AND RECORDS ARE HARDLY IN ONE-TO-ONE CORRESPONDENCE,
MAKING IT DIFFICULT TO ANALYSE.
EXPECTED QUESTIONS TO BIG DATA

-WHAT IS THE STATUS OF OUR PRODUCTS IN THE MARKET?


- HOW ARE COMPETITORS PERFORMING?
PIECEMEAL ANSWERS TO SUCH PIECEMEAL QUESTIONS ARE NOT
ADEQUATE- ABHIMANYU GOT KILLED BECAUSE SAPTARATHI
ATTACKED TOGETHER.

- COMBINE THEM IN A MULTIVARIATE ARRANGEMENT

- LOOK AT THEM IN THE CHRONOLOGY OF TIME

- UNDERSTAND LEVEL OF OVERALL SENSITIVITY OF ALL THE


STAKEHOLDERS

-
Large-Scale Data Management
Big Data Analytics
Data Science and Analytics

• How to manage very large amounts of data and extract value


and knowledge from them
Definition of Big Data

• No single standard definition…

“Big Data” is data whose scale, diversity,


and complexity require new architecture,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
Characteristics of Big Data:
1-Scale (Volume)

• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35zb
• Data volume is increasing exponentially

Exponential increase in
collected/generated data
Characteristics of Big Data:
2-Complexity (Variety)

• Various formats, types, and


structures
• Text, numerical, images, audio,
video, sequences, time series, social
media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be
generating/collecting many types of
data

To extract knowledge all these


types of data need to linked together
Characteristics of Big Data:
3-Speed (Velocity)

• Data is being generated fast and needs to be
processed fast
• Online Data Analytics
• Late decisions mean missing opportunities
• Examples
– E-Promotions: Based on your current location, your purchase history and
what you like, send promotions right now for the store next to you

– Healthcare monitoring: sensors monitoring your activities and body;
any abnormal measurements require immediate reaction
Big Data: 3V’s
Some Make it 4V’s
What’s driving Big Data

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time orientation

versus the traditional setting:

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data Technology

Defining Metadata

 Does data about data mean anything?

 Librarians equate it with a complete


bibliographic record
 Information technologists equate it to
database schema or definitions of the data
elements
 Archivists include context information,
restrictions and access terms, index terms, etc.
Bibliographic Metadata

 Providing a description of the information


package along with other information necessary
for management and preservation
 Encoding
 Providing access to this description

 Predominantly discovery and retrieval


Evolution of Database system Technology

 Data collection and Database creation (1960s and


earlier)

 DBMS( 1970s-early 1980s)

 Advanced Data Analysis (Data warehousing and


Data mining) (1980s-present)



Chapter-3
Data Preparation
Tasks in data preparation

- Know your data


- Data cleaning

- Sorting
- Merging

- Aggregating
- Creating, recoding, renaming a variable

- Partitioning data or Sampling from data


- Preparing data for time series
Know your data

View data or variables in data

# list objects in the working environment


ls()

# list the variables in mtcars


names(mtcars)

# list the structure of mtcars


str(mtcars)
Know your data
View data or variables in data
# dimensions of an object data
dim(mtcars)
# class of an object data(numeric, matrix, data frame, etc)
class(mtcars)
# print mtcars data
mtcars
# print first 10 rows of mtcars
head(mtcars, n=10)
# print last 5 rows of mtcars
tail(mtcars, n=5)
# Extracting a column
mtcars[,1]   # extracts the 1st column
mtcars$cyl   # extracts the column named 'cyl' from data mtcars
Tasks in data preparation

- Know your data


- Data cleaning

- Sorting
- Merging

- Aggregating
- Creating, recoding, renaming a variable

- Partitioning data or Sampling from data


- Preparing data for time series
Data cleaning

Data cleaning includes actions on missing values and outliers.
Missing observations may be numeric or attribute type.

1) Numeric-
Delete observation(s)
Replace by global constant, mean, mode etc.
2) Attribute-
Delete observation(s)
Experts suggestion(s)
Data cleaning...

- Deleting missing cases (rows)

mydata = read.csv("D:/...mydata.csv", header = TRUE)
is.na(mydata)                 ## logical output; shows TRUE for NA - hard to inspect for large data
length(which(is.na(mydata)))  ## if zero, no missing (NA); otherwise some are NA's

mydata[complete.cases(mydata), ]   ### removes all cases with missing values
na.omit(mydata)                    ### removes all cases with missing values
mydata[!complete.cases(mydata), ]  ## separates only the cases with NA's

This procedure can be used for data sets with both numeric and non-numeric features.
The data set without NA's can be saved and used for further processing.
Data cleaning...

[Screenshots: a data set with 6 missing observations (possibly from different features) and the same data set after the NA's are removed.]
Data cleaning...

- Replacing missing cases with the mean

We need code that identifies a column with missing observations and replaces
them by a specified value or a value computed from the existing values (a sketch follows below).

Some advanced techniques like kNN, Decision Trees and CART can be used for imputation.
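
A minimal R sketch of mean imputation; the data frame mydata and column x1 follow the naming used elsewhere in these notes and are assumptions here:

miss <- is.na(mydata$x1)                            # flag the missing entries
mydata$x1[miss] <- mean(mydata$x1, na.rm = TRUE)    # replace by the column mean

# the same idea applied to every numeric column of the data frame
for (j in which(sapply(mydata, is.numeric))) {
  mydata[[j]][is.na(mydata[[j]])] <- mean(mydata[[j]], na.rm = TRUE)
}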
Data cleaning...

[Screenshots: the data before and after treating the 6 missing observations (possibly from different features).]
Data cleaning...

- Outlier detection and removal

The performance of the model/tool used for a particular task gets affected
badly by the presence of outliers.
Causes:
- Misclassification, if the tool is used for classification
- Large prediction error, if used for predicting a numeric value

An outlier in numeric data can be a very large or very small value (far
away from the natural flow of the data).

A numeric value x can be declared an outlier if
x 'not in' (mean(x) - t*sd(x), mean(x) + t*sd(x))
Usually t = 3.
Data cleaning...

[Plot: the data with one outlying value marked as an outlier.]
Data cleaning...

To find these outliers, one can use in-built functions or write simple R code.
Below is a simple R sketch which removes outliers from your data; similar code
can detect and remove outliers from a file with several numeric features.
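
A minimal R sketch of the mean ± t*sd rule above (with t = 3); the data frame mydata with numeric column x1 is an illustrative assumption:

is_outlier <- function(x, t = 3) {
  lower <- mean(x, na.rm = TRUE) - t * sd(x, na.rm = TRUE)
  upper <- mean(x, na.rm = TRUE) + t * sd(x, na.rm = TRUE)
  !is.na(x) & (x < lower | x > upper)      # TRUE for values outside the band
}

clean <- mydata[!is_outlier(mydata$x1), ]  # keep rows where x1 is not an outlier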
Tasks in data preparation

- Know your data


- Data cleaning

- Sorting
- Merging

- Aggregating
- Creating, recoding, renaming a variable

- Partitioning data or Sampling from data


- Preparing data for time series
Data sorting

To sort a data frame in R, use the order( ) function. By default, sorting is


ASCENDING. Prepend the sorting variable by a minus sign to indicate
DESCENDING order.
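
A minimal R sketch with the built-in mtcars data referred to earlier:

mtcars[order(mtcars$mpg), ]               # ascending by mpg
mtcars[order(-mtcars$mpg), ]              # descending by mpg (note the minus sign)
mtcars[order(mtcars$cyl, -mtcars$mpg), ]  # by cyl, then descending mpg within cyl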
Data sorting...

[Screenshot: the mtcars data ordered by mpg.]
Tasks in data preparation

- Know your data


- Data cleaning

- Sorting
- Merging

- Aggregating
- Creating, recoding, renaming a variable

- Partitioning data or Sampling from data


- Preparing data for time series
Data merging
Adding Columns
To merge two data frames (datasets) horizontally, use
the merge function. In most cases, you join two data frames by one or more
common key variables (i.e., an inner join).
Adding Rows
To join two data frames (datasets) vertically, use the rbind function.
The two data frames must have the same variables, but they do not have to be
in the same order.
Data merging...
# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by = "ID")
# merge two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by = c("ID", "Country"))
# Adding new rows
total <- rbind(dataframeA, dataframeB)
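
A small self-contained illustration of the two operations, with made-up data frames (names and values are assumptions):

dataframeA <- data.frame(ID = c(1, 2, 3), Sales = c(10, 20, 30))
dataframeB <- data.frame(ID = c(2, 3, 4), Region = c("N", "S", "E"))

merge(dataframeA, dataframeB, by = "ID")            # inner join on the key ID
rbind(dataframeA, data.frame(ID = 4, Sales = 40))   # append a row with the same variables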
Data merging...

[Screenshots: the first data set, the second data set, and the merged data set.]
Tasks in data preparation

- Know your data


- Data cleaning

- Sorting
- Merging

- Aggregating
- Creating, recoding, renaming a variable

- Partitioning data or Sampling from data


- Preparing data for time series
Data aggregation

It is relatively easy to collapse data in R using one or more BY variables and a


defined function.
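
A minimal R sketch with aggregate(): the mean mpg and hp of the built-in mtcars data, collapsed BY number of cylinders and transmission type:

aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)
aggregate(mpg ~ cyl, data = mtcars, FUN = length)   # count of cars per cylinder group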
Data aggregation...

[Screenshots: example data before and after aggregation.]
Tasks in data preparation

- Know your data


- Data cleaning

- Sorting
- Merging

- Aggregating
- Creating, recoding, renaming a variable

- Partitioning data or Sampling from data


- Preparing data for time series
Creating, recoding, renaming a variable
Creating new variables
Use the assignment operator <- to create new variables. A wide array
of operators and functions are available here.

# Three examples for doing the same computations

mydata$sum <- mydata$x1 + mydata$x2


mydata$mean <- (mydata$x1 + mydata$x2)/2

attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)

mydata <- transform( mydata, sum = x1 + x2, mean = (x1 + x2)/2)


Creating, recoding, renaming a variable...
Recoding variables
In order to recode data, you will probably use one or more of
R's control structures.

# create 2 age categories


mydata$agecat <- ifelse(mydata$age > 70,
c("older"), c("younger"))

# another example: create 3 age categories


attach(mydata)
mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"
detach(mydata)
Creating, recoding, renaming a variable...

Renaming variables
You can rename variables programmatically or interactively.

# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname = "newname"))

# you can re-enter all the variable names in order, changing the ones you
# need to change; the limitation is that you need to enter all of them!
names(mydata) <- c("x1", "age", "y", "ses")
# OR rename only the variable at a given position (here the 4th) in the data set
names(mydata)[4] <- "New_name"
Creating, recoding, renaming a variable...
Tasks in data preparation

- Know your data


- Data cleaning

- Sorting
- Merging

- Aggregating
- Creating, recoding, renaming a variable

- Partitioning data or Sampling from data


- Preparing data for time series
Partitioning data or Sampling from data

[Screenshots: R code and output for sampling from and partitioning a data set; a sketch follows below.]
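
A minimal R sketch of simple random sampling and a two-way train/test partition of the built-in mtcars data; the 70/30 split is an illustrative assumption:

set.seed(1)                                       # for reproducibility
mysample <- mtcars[sample(nrow(mtcars), 10), ]    # random sample of 10 rows

train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]                      # 70% of the rows
test  <- mtcars[-train_idx, ]                     # the remaining 30%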
Tasks in data preparation

- Know your data


- Data cleaning

- Sorting
- Merging

- Aggregating
- Creating, recoding, renaming a variable

- Partitioning data or Sampling from data


- Preparing data for time series
Preparing data for time series

Time series data may be recorded at different frequencies: Yearly, Quarterly, Monthly or Daily.

[Screenshots: R code and output for preparing yearly, quarterly, monthly and daily time series; a sketch follows below.]
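
A minimal R sketch of turning a numeric vector into a time series object with ts(); the monthly sales figures below are made up for illustration:

sales <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
           115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140)
sales_ts <- ts(sales, start = c(2017, 1), frequency = 12)  # monthly series
# frequency = 4 gives a quarterly series, frequency = 1 a yearly one
plot(sales_ts)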
Chapter-4
Visual Analytics and Exploratory Data Analysis
Chapter-4: Visual Analytics and Exploratory
Data Analysis

◦ Data Types
◦ Basic Statistics
◦ Summary of Statistics
◦ Data Visualisation
◦ Distributions
◦ Test of Hypothesis
Linkages within Summary of Statistics, Data
Visualisation & Exploratory Data Analysis- Univariate
Linkages within Summary of Statistics, Data
Visualisation & Exploratory Data Analysis- Pairwise
Basic Concepts of Probability
Learning Objectives
 Understand the concepts of events and
probability of events
 Understand the notion of conditional
probabilities and independence of different kinds
 Understand the concept of inverse probabilities
and Bayes’ theorem
 Understand specific concepts of lift, support,
sensitivity and specificity
 Develop ability to use these concepts for
formulation of business problems and providing
solutions to the same
Experiments
 An experiment is a process – real or hypothetical
– that can be repeated many times and whose
possible outcomes are known in advance
 Notes:
1. This is an intuitive definition but conveys the
meaning
2. We are discussing about experiments where
we agree about the possible outcomes at
the outset
Examples of Experiments
 Many activities in business, economics,
manufacturing and other areas may be considered
as experiments
 A customer walks into a retail outlet. The total
value of goods bought in a single trip may be
considered to be an experiment. The possible
value could be any real number ≥ 0
 The prepaid balance of the customer of a telecom
service provider might have become close to
zero. The customer may or may not buy further
talk time in a given period. The act of buying or
not buying may be looked at as an experiment
with two possible outcomes – 0 or 1.
More Examples
 The fuel consumption of a car as it travels may
be considered to be an experiment. The fuel
consumed per kilometer travelled in any given
journey may be the outcome of the experiment
and it may assume any positive value.
 A restaurant may approach its patrons and
request them to rate their service in a one to
five scale. The experiment has 5 possible
outcomes assuming that no customer declines to
provide a feedback
Examples (Continued…)
 A software development company tests the use cases as they
are developed. The testing may be considered to be the
experiment and the number of defects observed may be the
outcome. Observe that the outcome is an integer ≥ 0
 Notes:
1. Notice that thinking in terms of experiments forces the
analyst to concentrate on the entity defined in the
previous section.
2. The experiment and its outcomes are necessarily
idealized. Surely a use case cannot have 1010 defects, nor
can the fuel efficiency be 500000 km / lt. Also,
defining a use case or a journey rigorously may not be
possible. However, we need to keep in mind that any
theory necessarily involves idealization.
Sample Space and Events

 The agreed set of all possible outcomes of the


idealized experiment is called the sample space.
In other words the sample space is the entire set
of possible outcomes.
 Every specific (indecomposable) outcome of an
experiment is called a simple event. Thus the
sample space consists of all possible simple
events
 An aggregate of certain simple events is called a
compound event
Examples of Events

 Consider the case of testing the use cases of a


software. The event that a particular use case
has 3 defects is a simple event whereas the
statement that a use case has at least 3 defects
describes a compound event.
 The simple events will be called the sample
points or points. Every indecomposable result of
the (idealized) random experiment is represented
by one, and only one, sample point.
Relative Frequency Interpretation of Probability
 The probability of an event is often defined to be
the relative frequency of the event when the
experiment has been conducted a large number of
times
 The relative frequency is computed as the number
of times the specific event occurs out of the total
number of times the experiment was conducted. It
is a number between 0 and 1 and a value like 0.6
indicates that in the long run the event will occur 60
times out of a hundred.
 This description is deliberately vague but supplies a
picturesque intuitive background sufficient for our
purpose
Case Example
1. Suppose we are interested in studying the effort required to resolve specific
service requests. (Look at the ticket resolution effort data). Here servicing the
requests constitute the experiment. The service requests are the entities being
studied. The effort required to service a particular request is the outcome and
it is the response variable as well.

2. The sample space consists of all possible values of effort. In this example the
sample points are values like 1, 4, 5, 6…and so on.

3. An event defined as the effort required = 10 hours is a simple event

4. Event that effort required > 50 hours; or 10 < effort < 100 hours are
examples of compound events

5. The experiment was conducted 6460 times – a fairly large number.

6. The relative frequencies of different events could be estimated as k / 6460


where k gives the number of times the event occurred
Example (Continued…)

 Consider the event A = Effort ≤ 10. We note that


there are 181 such observations out of a total of
6460 trials (each individual run of the experiment
is referred to as a trial). Thus the estimated
probability of the event A is given by P(A) = 181
/ 6460 = 0.028
 Consider another event B – 11 ≤ Effort ≤ 20.
There are 736 observations satisfying these
characteristics and hence the probability of this
event may be estimated as P(B) = 736 / 6460 =
0.114
Example-cum-Exercise

Suppose a telecom service provider has carried out a survey to find the level
of importance customers attach to various aspects of their experience of
using the service. Suppose the importance is given in a seven point scale (1
to 7) where 1 means least importance and 7 stands for the highest
importance. One of the aspects of customer experience is accuracy of bills
and suppose that the survey has yielded the following result
Value Frequency
1 1
2 3
3 6
4 13
5 72
6 135
7 130
Let A be the event that a randomly selected customer will consider the
importance of accurate billing to be 6 or more on a 7 point scale. What is
P(A)? How did you arrive at the value? Can you identify the experiment, the
entity involved and the sample space?
Example-cum-Exercise

P(A) = P{importance >= 6} = (135 + 130) / total frequency = 265 / 360 = 0.7361

The sample space is Ω = {1, 2, 3, 4, 5, 6, 7}; the experiment is asking a randomly
selected customer to rate the importance of accurate billing, and the entity is the
individual customer.
Note – BA thru Probability Models
 In business scenario we are often looking at the
possible outcomes of random experiments
 Some of the outcomes (events) are favourable to us
and we want to estimate their probability
 In a real life scenario usage of deterministic model
(when the outcome is known with certainty) is
practically useless
 Once we understand the probabilities of the events
we act such that the risks are minimized or the
chance of favourable events are maximized
 Probability models refer to quantitative models to
estimate probability of different events.
Concept of Joint Probability
• Let A and B be two events with probabilities P(A) and P(B). Suppose we are
interested in finding out the probability of the event A∩B (read: A intersection B)
• The event A∩B denotes the joint occurrence of A and B
• Examples: Suppose in the context of a retail store, A denotes the event that a
customer buys bread and B denotes the event that a customer buys butter. Then
A∩B denotes the event that the customer buys both bread and butter.
• Another example: Suppose a travel company places online ads for hotel booking,
air ticketing and car hire. Let A, B and C be the events that a prospective
customer visiting the site books a hotel room, buys an air ticket through the travel
company and hires a car, respectively. Then A∩B indicates that the customer books
a hotel room as well as an air ticket through the travel company. What will A∩C, B∩C
and A∩B∩C indicate?
• In the previous case suppose N people have visited the site. Let N_A, N_B and N_C
denote the number of customers who booked a hotel room, bought an air ticket and hired
a car. Let N_AB, N_BC and N_AC be the number of cases where the customers booked
two of the services and N_ABC be the number of cases where the customer booked
all three services. Find P(A∩B), P(A∩C), P(B∩C) and P(A∩B∩C).
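
Following the relative-frequency view used throughout this chapter, these probabilities are estimated directly from the counts: P(A∩B) ≈ N_AB / N, P(A∩C) ≈ N_AC / N, P(B∩C) ≈ N_BC / N and P(A∩B∩C) ≈ N_ABC / N.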
Example
Consider the previous case of studying the effort required to service tickets.
Note that tickets may be of three different complexities – simple, tickets of
medium complexity or very complex tickets. There are 4026, 2306 and 128
tickets of these different complexities. Let A be the event that a ticket is
simple. The probability of this event may be estimated as P(A) = 4026 / 6460
= 0.6232.
Let B be the event that the effort is between 11 to 20 units and the probability
of this event was estimated to be 736 / 6460 = 0.114.
The event A∩B consists of all the simple tickets with effort between 11 and 20
units. Note that there are 583 such observations and hence the estimated
probability is 583 / 6460 = 0.09

Note
When a large amount of data are available, the probabilities of
different events may be estimated empirically from data
Conditional Probability
 Conditional probability of event A given event B
– written as P(A|B) – is the relative frequency of
A given that B has happened.
 Conditional probability P(A|B) = P(A∩B) / P(B).
In terms of counts, P(A|B) = N_AB / N_B

• P(A|B) is defined only if B ≠ φ, i.e. only if P(B) > 0

• Note that P(A|B) and P(B|A) are not the same.
Example
Consider the example of servicing tickets. Let A be the
event that the ticket to be resolved is simple and let B
be the event that the effort required to service the
ticket is between 11 and 20 units. We have seen that
P(A) = 0.6232, P(B) = 0.114, P(AᴖB) = 0.09
Now P(B|A) = 0.09 / 0.6232 = 0.1444
Notice that when the complexity is known to be simple, the probability of event B
increases from about 0.114 to about 0.144, i.e. by a factor of roughly 1.27. This
factor is often referred to as lift, and in business analytics we often aim at
discovering appropriate conditioning events that increase the probability of an
event of interest, e.g. the quantum of effort required to service tickets.
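The numbers quoted in this example can be reproduced directly from the counts. A minimal Python sketch (using only the totals given on these slides):

N = 6460            # total number of tickets
n_A = 4026          # simple tickets
n_B = 736           # tickets with effort between 11 and 20 units
n_AB = 583          # simple tickets with effort between 11 and 20 units

p_A = n_A / N                 # P(A) ≈ 0.6232
p_B = n_B / N                 # P(B) ≈ 0.1139
p_AB = n_AB / N               # P(A ∩ B) ≈ 0.0902
p_B_given_A = p_AB / p_A      # P(B | A) ≈ 0.1448
lift = p_B_given_A / p_B      # ≈ 1.27

print(p_A, p_B, p_AB, p_B_given_A, lift)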
Exercise
Examine the data file containing the ticket resolution effort.
Let A be the event that the complexity is simple
Let B be the event that the service domain type is
application support
Let C be the event that the effort required is between 20
and 30 units – both 20 and 30 included.
1. Find the conditional probability of C given A and B
2. Is the conditional probability higher than the direct probability?
What is your conclusion?
3. Do you think complexity is more important to find the
probability of B compared to service domain type? Explain.
Exercise- Work out
Total trials: 6460
A: Complexity – (1) Simple: 4026, (2) Medium: 2306, (3) Complex: 128
B: Service type – (1) Database support: 4080, (2) Application support: 2380
C: Effort – (1) ≤ 10: 181, (2) 11–20: 736, (3) 21–30: 2536, rest: …
P{C(3) │ A(1) ∩ B(2)} = …
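One possible way to carry out this work-out from the raw data with pandas. The file name and the column names (complexity, service_type, effort) are assumptions made purely for illustration; adjust them to match the actual ticket data file.

import pandas as pd

# hypothetical file / column names for the ticket resolution data
df = pd.read_csv("ticket_effort.csv")

A = df["complexity"] == "Simple"
B = df["service_type"] == "Application support"
C = df["effort"].between(20, 30)            # both 20 and 30 included

p_C = C.mean()                              # direct probability P(C)
p_C_given_AB = C[A & B].mean()              # conditional probability P(C | A ∩ B)

print(p_C, p_C_given_AB, p_C_given_AB / p_C)   # last value is the lift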
An Observation
 It is important to note that the conditional
events A|B and B|A are very different
 Let A be the event that the ticket being serviced is simple
 Let B be the event that the effort required is between 11 and
20 units
 A|B is the event that the ticket is simple given that the effort
required was between 11 and 20 units. On the other hand
B|A is the event that the effort required is between 11 and 20
units given that the ticket is simple
 Notice that P(A | B) = 0.789 whereas P(B | A) = 0.144 only.
Another Example
 An epidemiologist wants to assess the impact of smoking on the
incidence of lung cancer. From hospital records she collected data
on 100 patients of lung cancer and she also collected data on 300
persons not suffering from lung cancer. She has classified the 400
samples into smokers and non smokers and the observations are
summarized below
                 Lung Cancer: Yes   Lung Cancer: No   Total
Smoker: Yes             69               137           206
Smoker: No              31               163           194
Total                  100               300           400
 Let A be the event that a person has lung cancer and let B be the
event that the person is a smoker. Can you estimate P(A│B) from
the table given above?
Comment
 Notice that it may not be possible for us to estimate certain conditional
probabilities from a given data set. Here the numbers of cancer and non-cancer
cases were fixed by design (100 and 300), so P(B│A) – the probability of being a
smoker given lung cancer – can be estimated, but P(A│B) cannot be estimated
without knowing the prevalence of lung cancer in the population.
 You must be careful about how the experiments were carried out and which
probabilities are at all estimable.
Conditional Probability (Continued…)
 We have noted that P(A│B) may be very different from P(A)
and we can often use this to our advantage
 Note that A│B looks only at the occurrences of A within B (the subset A∩B of A).
For example A may denote the event that a machine fails. On a random day the
chance of failure may be 0.0001 or 1 in 10000.
 However, given certain conditions described through the event B, P(A│B) may
increase to 0.01.
 Thus presence of condition B leads to a 100 fold increase of the probability of
failure on any given day. If the condition persists for 10 days, the chance of at
least one failure becomes 1 − (1 − 0.01)^10 ≈ 0.096, assuming failures across days
are independent.
 Another example: We know that probability of a heart
attack during a period of say one year for a randomly selected
Indian male may be fairly low. However, this risk may
increase significantly for a given combination of age, genetic
disposition, smoking habit, BMI, level of blood sugar and LDL.
Usage of Conditional Probability in BA
 In many analytics problems our job is to find the event B that significantly
increases / decreases the chance of occurrence of an event of interest. Recall
that in this context we have already touched upon the concept of lift.
 Defining the event of importance (say A) and discovering the conditioning event
B such that P(A│B) is much greater than P(A) is of crucial importance in BA.
Concept of Independence
 We say that events A and B are independent in case P(A│B) = P(A), i.e. the
probability of A is not impacted by the occurrence of event B.
 This definition implies that when A and B are independent, P(AB) = P(A).P(B)
[Note that P(A│B) is defined only when P(B) > 0. However, the definition of
independence – sometimes referred to as stochastic independence – is accepted
even when P(B) = 0]
 Notes:
1. If A and B are independent, then Ac and Bc are independent. In
fact it can be easily shown that Ac and B , and A and Bc are
also independent.
2. If A, B, C are mutually independent then P(ABC) =
P(A).P(B).P(C). In fact this multiplication rule is applicable for
any number of events
Example of Independence
 Consider the example of effort required to service tickets. Let us consider two
events A and B where
 A = Complexity is simple
 B = Service domain is database support
 The estimates of the probabilities are
 P(A) = 4026 / 6460 = 0.623; P(B) = 4080 / 6460 = 0.632; P(AB) =
2894 / 6460 = 0.448
 Note that P(A).P(B) = 0.623 * 0.632 = 0.394 is not very different
from P(AB)
 Thus we may, prima facie, entertain a hypothesis that service
domain and complexity are independent of each other.
 Notice that this will not be applicable at all for ticket
complexity and effort
Usage of Assumption of Independence
 Note that when events are independent, their joint probability may be computed
as a product of the probabilities of the individual events
 This multiplicative rule may be applied to compute the probabilities of many
different combinations of events as explained in the next section
Usage of Independence - Examples
 Suppose you are tossing a fair coin. Thus the
probability that a toss results in a head is 0.5.
Assuming that tosses are independent of each
other, what is the chance that 3 tosses will result
in 3 heads?
 Suppose a machine has 20 different parts.
Suppose the parts fail independently of each
other and on any given day a part fails with 1%
chance only. Suppose the machine continues to
operate if all parts are operational and fails if one
or more parts fail. What is the chance that the
machine will fail on a randomly selected day?
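Both questions reduce to the multiplication rule for independent events. A sketch of the arithmetic, assuming the probabilities stated above:

# three independent tosses of a fair coin, all resulting in heads
p_three_heads = 0.5 ** 3                  # = 0.125

# machine with 20 parts, each failing independently with probability 0.01 on a day;
# the machine fails if at least one part fails
p_machine_fails = 1 - (1 - 0.01) ** 20    # ≈ 0.182

print(p_three_heads, p_machine_fails)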
Usage of Independence and Conditional
Probability
Consider the case of manpower supply in a large IT service
company. The company has many customers (called accounts)
and human resources are required to service the customers. The
account managers demand resources and, following the policy of the
company, each demand is for one resource. However, some of
these demands are abandoned. Abandonment is an indication of
inability to plan adequately.
In order to judge the effectiveness of planning, the company
may think of computing the conditional probabilities of
abandonment for different accounts.
Look at the manpower demand data and compute the
conditional probability of abandonment for the top three
accounts. Do you think the managers of these three accounts
are equally efficient?
Usage (Continued…)
Note that the demand depends on factors like technology category and role. Thus
we may define events as follows
A = Demand is for a particular role
B = Demand is for a particular technology category
Suppose A and B are independent.
We can then find the probability of demand for any combination, if we know the
probabilities of the individual events.
Note that it is usually easy to estimate the probabilities of individual events.
However, estimating probabilities of many events occurring jointly requires an
enormous amount of data and is often not practicable. The independence
assumption comes in handy in such situations.
Concept of Total Probability
 Let B1, B2, ……, Bp be a set of mutually exclusive and collectively exhaustive
events such that Bj ≠ ∅ for j = 1, 2, …p [i.e. P(Bj) ≠ 0 for j = 1, 2, …p]
 Let A be any other event.
 Then P(A) = ΣP(A│Bj) P(Bj) for j = 1, 2, …p
(Why? Because A = ∪j (A ∩ Bj), where the events A ∩ Bj are mutually exclusive
and P(A ∩ Bj) = P(A│Bj) P(Bj).)
Exercise
In a certain county 60% of registered voters support party A, 30% support party
B and 10% are independents. When those voters were asked about increasing
military spending, 40% of the supporters of A opposed it, 65% of the supporters
of B opposed it and 55% of the independents opposed it. What is the probability
that a voter selected randomly in this county opposes increased military
spending?
Exercise- Workout
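A possible work-out of the previous exercise using the total probability rule (numbers exactly as given in the exercise):

p_support = {"A": 0.60, "B": 0.30, "Ind": 0.10}        # P(Bj) for the three groups
p_oppose_given = {"A": 0.40, "B": 0.65, "Ind": 0.55}   # P(oppose | Bj)

p_oppose = sum(p_oppose_given[g] * p_support[g] for g in p_support)
print(p_oppose)   # 0.6*0.4 + 0.3*0.65 + 0.1*0.55 = 0.49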
Examples
1. Suppose a retail company has divided the clientele
into 5 different groups. The company carries out
different campaigns to target the different groups
and estimates the probability that a person
belonging to a particular group will buy some
product once he comes to the shop. The relative
frequency of the groups are known. How will you
estimate the probability that a customer walking
into a shop will end up buying a product?
Formulate the problem in mathematical terms.
Explain how you will use this knowledge to improve
your campaigning strategy.
Bayes’ Theorem
 Bayes’ theorem allows us to look at probability from an inverse
perspective
 Bayes’ theorem states that
P(B│A) = P(A│B) P(B) / P(A)
 Let B1, B2, ……, Bp be a set of mutually exclusive and
collectively exhaustive events such that Bj ≠ ∅ for j = 1, 2, …p.
In this set up Bayes’ theorem may be stated as
P(Bj│A) = P(A│Bj) P(Bj) / (ΣP(A│Bj) P(Bj)), j = 1, 2, …p
 This simple yet intelligent way of looking at probability is often
very effective. We may not be able to find P(Bj│A) directly but
it may be far easier to estimate P(A│Bj).
 Construct examples of the previous statement. Recall the
example of smoking and lung cancer. Can you use Bayes’
theorem to estimate probability of lung cancer given smoking
habit?
Application of Bayes’ Theorem
 Suppose I divide my email into three categories: A1 = spam; A2 = administrative
and A3 = technical. From previous experience I find that P(A1) = 0.3; P(A2) = 0.5;
and P(A3) = 0.2. Let B be the event that the email contains the word "free" and
has at least one occurrence of the character "!". From previous experience I have
noted that P(B│A1) = 0.95; P(B│A2) = 0.005 and P(B│A3) = 0.001. I receive an
email with the word "free" and the "!". What is the probability that it is spam?
 Notice that we have used Bayes’ theorem to construct a spam filter. In the
subsequent slides we will see how this simple concept may be extended to
construct powerful classification mechanisms
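The probability asked for above follows directly from Bayes’ theorem. A small Python sketch using the numbers on this slide:

priors = {"spam": 0.30, "admin": 0.50, "tech": 0.20}          # P(Aj)
likelihood = {"spam": 0.95, "admin": 0.005, "tech": 0.001}    # P(B | Aj)

p_B = sum(likelihood[c] * priors[c] for c in priors)          # total probability of B
posterior_spam = likelihood["spam"] * priors["spam"] / p_B

print(posterior_spam)    # ≈ 0.99, so the email is almost certainly spam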
Sensitivity and Specificity
 While Bayes’ theorem may be used to construct classification
mechanisms, it may also be used to evaluate their performance
 Let B denote the event of interest – say the failure of a machine or
the event that a person has a particular disease
 Let A be the event that the classifier gives a positive response
 AC is the event that classifier gives a negative response
 Note the difference between actual occurrence and a positive
response from the classification technique
Questions:
a. What is the difference between the two conditional events
A│B and B│A?
b. Which probability are we interested in?
c. Is it possible to estimate the probability of interest
directly? If yes, how? If not, why not?
Sensitivity and Specificity (Continued…)
 P(A│B) is the conditional probability of a positive
response given that the event has actually occurred. This
probability is called the sensitivity. Higher the probability
of a positive response from the classifier when the
underlying condition is truly positive, higher is the
sensitivity.
 P(A│Bc) is the probability of a positive response when
the underlying condition is actually negative. 1 −
P(A│Bc) is called the specificity. Lower the value of
P(A│Bc) higher is the specificity. Thus specificity is also
given by P(Ac│Bc)
False Positive and False Negative
 P(A│Bc) is the probability of getting a false positive. This gives the probability
of a positive response when the event of interest actually did not happen.
 P(Ac│B) is the probability of getting a false negative. This gives the probability
of a negative response when the event of interest actually happened.
Note
A sensitive instrument does not give false negative results and a specific
instrument does not give false positive results
Events of Interest
 Note that sensitivity and specificity do not give the
probabilities of the events of interest
 We are actually interested in positive and negative predictive
values (abbreviated as PPV and NPV respectively) defined as
 PPV = P(B│A) = P(A│B) P(B) / P(A) – by Bayes’ theorem
 NPV = P(Bc│Ac) = P(Ac│Bc) P(Bc) / P(Ac) – by Bayes’ theorem
 Notice that PPV and NPV cannot be found directly whereas
sensitivity and specificity can be.
 Also P(A) = P(A ∩ B) + P(A ∩ Bc)
= P(A│B) P(B) + P(A│Bc) P(Bc)
= Sensitivity. P(B) + (1 – Specificity)(1 – P(B))
 Thus we can find PPV and NPV provided we know sensitivity,
specificity and prevalence of the particular event of interest in
the population (i.e. if we know P(B))
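A sketch of the PPV / NPV computation from sensitivity, specificity and prevalence. The numerical values used here are illustrative assumptions (they are not from the slides); they show how a rare event can depress the PPV even for a good classifier:

sensitivity = 0.95        # P(A | B)   – assumed value
specificity = 0.90        # P(Ac | Bc) – assumed value
prevalence  = 0.01        # P(B)       – assumed value

p_A = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)   # P(A)
ppv = sensitivity * prevalence / p_A
npv = specificity * (1 - prevalence) / (1 - p_A)

print(ppv, npv)   # with a rare event even a good classifier can have a modest PPV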
Why is this Important?
Suppose we are trying to develop a classification model to understand what leads
to failure of vehicles. It is not possible to conduct experiments where we observe
the impact of different conditions on the event of failure of vehicles in a given
period of time. However, whenever vehicles fail, the failures will be reported.
Suppose the conditions are captured by sensors. Thus we will have data on
conditions given that failure has happened. We can, therefore, estimate the
probability of different conditions given that failure has happened. From warranty
report data we can also estimate the unconditional probability of failure. We can,
therefore, use the methodology given above to classify whether vehicles will fail
under given conditions
Further Insights
 Note that the previous discussions show how we can estimate
P(B│A) where B is the failure event, when P(A│B) and P(B)
are known
 Generally we would like to estimate the conditional probability
of failure given many rather than only one event.
 Thus we may like to estimate P(B│A1 A2 …. Ak). Note that
using Bayes’ theorem, we know
P(B│A1 A2 …. Ak) = P(A1 A2 …. Ak│B) P(B) / P(A1 A2 …. Ak)
 In the next section we will see how the previous concepts, including the Bayes’
optimal classification rules and the concept of conditional independence to be
introduced next, may be used to solve classification problems
Concept of Conditional Independence
 Let A, B and C be three events
 A and B are said to be conditionally independent given C, in case
P(A│B ∩ C) = P(A│C)
 Conditional independence is often a reasonable assumption as we
show in the subsequent examples
Consider the following events
A = Event that lecture is delivered by Amitava (there are two teachers – Amitava
and Boby)
B = Event that lecturer arrives late
C = Event that lecture concerns stat theory (theory and practical are taught)
Suppose Amitava has a higher chance of delivering the lecture on stat theory
Suppose Amitava is also more likely to arrive late
Notice that, once we know the lecturer is Amitava, the probability that the lecture
is on stat theory does not change further on learning that the lecturer arrived late.
Thus P(C │ A∩B) = P(C │ A)
Implication of Conditional Independence
 Let A and B be conditionally independent given C. Note
that
 P(AB│C) = P(ABC) / P(C)
= P(A│BC) P(BC) / P(C)
= P(A│C) P(B│C) P(C) / P(C)
(Why? Because conditional independence gives P(A│BC) = P(A│C), and P(BC) = P(B│C) P(C).)
 Thus we get P(AB│C) = P(A│C) P(B│C)
 In general we may say that when A1, A2, …. Ak are
conditionally independent given B
P(A1 A2 …. Ak│B) = P(A1│B) P(A2│B)…. P(Ak│B)
Naïve Bayes’ Classification
 The concepts of Bayes’ theorem, Bayes’ optimality criterion for classification and
conditional independence may be combined to develop a classification methodology
 Suppose a response variable R takes k different values. Let us assume that these
values are 1, 2, …k without loss of generality.
 Let A1 A2 …. An be n different events defined in terms of explanatory variables.
 We want to estimate the probability P(R = j / A1 A2 …. An) for different combinations
of A1 A2 …. An.
 Once these probabilities are estimated for all j for a given combination of A1 A2 …. An,
we try to find j that maximizes this probability. From Bayes’ optimality criterion, for a
given combination of A1 A2 …. An, the response is allocated to class j that maximizes
the probability.
 We have already shown that P(B│A1 A2 …. An) = P(A1 A2 …. An│B) P(B) / P(A1 A2 …. An)
 Since the denominator is constant across the classes, we allocate the response to that class
for which the numerator is maximum
 Under the assumption of conditional independence of A1, A2, …. An given B (here B stands for
the event R = j), we get P(A1 A2 …. An│B) = P(A1│B) P(A2│B) … P(An│B). Generally these
probabilities can be estimated and consequently a classification mechanism may be developed
Example
Consider the problem where data were collected for customers of computers.
We need to develop a classification mechanism so that customers may be
classified as buyers or non-buyers given the profile. We will use Naïve Bayes’
classification methodology to accomplish this objective.
Data Table
Age Income Student Credit Rating Buys Computer
≤ 30 High No Fair No
≤ 30 High No Excellent No
31 – 40 High No Fair Yes
> 40 Medium No Fair Yes
> 40 Low Yes Fair Yes
> 40 Low Yes Excellent No
31 – 40 Low Yes Excellent Yes
≤ 30 Medium No Fair No
≤ 30 Low Yes Fair Yes
> 40 Medium Yes Fair Yes
≤ 30 Medium Yes Excellent Yes
31 – 40 Medium No Excellent Yes
31 – 40 High Yes Fair Yes
> 40 Medium No Excellent No
Classification Mechanism
 The classifier aims at developing a method such that optimal
allocation to one of the classes (buys computer / does not buy
computer) is made for any customer with a given combination of
age, income, status (student or not) and credit rating
 Let B be the response variable that takes two values. B = 0 means
the customer does not buy computer and 1 means s/he buys
computer
 Now P(B = 0 / Age, Income, Status, Credit Rating) and P(B = 1 /
Age, Income, Status, Credit Rating) needs to be found using the
Naïve Bayes’ theory
 We know that rather than estimating these probabilities exactly, it is enough to
find values proportional to them, since the denominator is common to both classes
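A minimal Naïve Bayes sketch for the data table above. The test profile (Age ≤ 30, Income = Medium, Student = Yes, Credit Rating = Fair) is an assumed new customer, not part of the original slides; the computed scores are values proportional to the posterior probabilities of the two classes.

from collections import Counter

# (Age, Income, Student, Credit Rating, Buys Computer) – the table on the previous slide
rows = [
    ("<=30",  "High",   "No",  "Fair",      "No"),
    ("<=30",  "High",   "No",  "Excellent", "No"),
    ("31-40", "High",   "No",  "Fair",      "Yes"),
    (">40",   "Medium", "No",  "Fair",      "Yes"),
    (">40",   "Low",    "Yes", "Fair",      "Yes"),
    (">40",   "Low",    "Yes", "Excellent", "No"),
    ("31-40", "Low",    "Yes", "Excellent", "Yes"),
    ("<=30",  "Medium", "No",  "Fair",      "No"),
    ("<=30",  "Low",    "Yes", "Fair",      "Yes"),
    (">40",   "Medium", "Yes", "Fair",      "Yes"),
    ("<=30",  "Medium", "Yes", "Excellent", "Yes"),
    ("31-40", "Medium", "No",  "Excellent", "Yes"),
    ("31-40", "High",   "Yes", "Fair",      "Yes"),
    (">40",   "Medium", "No",  "Excellent", "No"),
]

classes = Counter(r[-1] for r in rows)       # prior counts: Yes = 9, No = 5
test = ("<=30", "Medium", "Yes", "Fair")     # assumed new customer profile

scores = {}
for c, n_c in classes.items():
    score = n_c / len(rows)                  # prior P(B = c)
    for j, value in enumerate(test):
        # count of this attribute value within class c
        n_cv = sum(1 for r in rows if r[-1] == c and r[j] == value)
        score *= n_cv / n_c                  # estimate of P(A_j = value | B = c)
    scores[c] = score

print(scores)                        # values proportional to the posteriors
print(max(scores, key=scores.get))   # allocate to the class with the larger score ("Yes" here)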
Realistic Example
 Consider the problem of estimating the time required to service a ticket given
certain classifications like process type, complexity of ticket, service domain,
and job level of the server. Suppose the effort required to service tickets is
grouped into 5 categories, say – very low, low, medium, high and very high. The
problem is to classify a ticket into one of these categories, on the basis of its
ticket categories, as soon as it arrives.
Examples of Usage of the Concepts
 A machine has many different sensors that capture data on a number of variables –
say temperature, speed, vibration, and so on. Suppose these data are continuous –
captured in a ratio scale. The machine may fail in many different modes – including
degradation of function or development of fault codes that may not lead to stoppage
of function. The failure is a categorical variable. Our problem is to allocate the
machine to one of these classes given the sensor data.
 Inspection of products may be expensive. Thus we may need to develop a filtering
mechanism on the basis of automatically collected data to classify products as good
or bad. This application is very similar to spam filtering we have discussed earlier
 Suppose a manufacturer has installed an automatic sorting device at a great cost.
The concepts of sensitivity and specificity – in particular positive predictive value
and negative predictive value – may be used to assess the justification of the
investment
 A manufacturer may like to estimate the probability of failure of a certain mission
under given conditions – say, for example, an R&D project. It is usually easier to
look at failed and successful projects – a so-called case-control study – and assess
the conditional probabilities of the conditions. Bayes’ theorem may then be used
to find the probability of failure given the conditions. Accordingly the company
may be guided about making prudent investments
 In real life confusion between A / B and B / A is fairly common. An understanding of
this topic should act as a guard.
Review Questions
 What is an event? What is a random variable? What is a sample
space? Give an example of a sample space. Can you connect the
sample space with a random variable?
 What is probability? What are the axioms of probability?
 Define conditional probability. Define independence.
 What is the meaning of ‘lift’? Can you think of supervised
learning model as an exercise of conditional probability?
 What do you mean by mutual independence? What is pairwise
independence? What is conditional independence?
 What is Bayes’ theorem?
 What do you understand by sensitivity and specificity? What
do you mean by false positive and false negative? Can you
relate the concepts of sensitivity and specificity with false
positive and false negative?
Exercise
 Consider the manpower supply data. Note that the resources demanded may
eventually have 4 different statuses – abandoned; closed and fulfilled, or simply
fulfilled; expired; and closed and not fulfilled. The eventual status will depend on
a number of factors like role, technology, month of the year, lead time allowed,
and the account. Develop a classification model to estimate the eventual status
of a demand given the explanatory variables as described.
Type of Data
 Qualitative or Categorical: The objects being studied are grouped into categories
based on some qualitative trait. The resulting data are merely labels or categories.
   Nominal: A type of categorical data in which objects fall into unordered
categories; e.g. city, ethnicity, smoking status
   Ordinal: A type of categorical data in which order is important; e.g. degree of
illness, satisfaction level.
 Quantitative or Numeric: The objects being studied are ‘measured’ based on some
quantitative trait. The resulting data are sets of numbers.
   Attribute or Discrete: The characteristics which assume only isolated values in
their range of variation; e.g. no. of complaints, no. of students.
   Variable or Continuous: The characteristics which may assume any value
within their range of variation; e.g. pulse, height, diameter.
Introduction to Random Variables and Distributions
Content
 We will cover concepts of
   Random variables and data generation process
   Measurement scales
   Sample and population
   Random sample
   Statistics
   Descriptive and inferential problems
Concepts of Random Variables
Random Variables
 Random variable is the link between the sample space and the numerical values
that we get from random experiments
 Random variable is a real valued function defined on the sample space
Examples of Random Variables
1. A buyer in a shopping mall may or may not buy. The sample space has two
points and the random variable may be allocated values like 0 and 1
2. The level of recovery of a patient after a hip replacement surgery may be
excellent, very good, OK, bad, and very bad. The random variable for this
sample space may be allocated values like 1, 2, 3, 4, and 5.
3. The weight of new born babies can be any real number greater than 0. The
sample space consists of all positive real numbers. The actual weight is directly
mapped to the random variable
4. The number of cars sold from a retail outlet may be any nonnegative integer.
The sample space consists of all nonnegative integers. The actual number sold
represents the random variable and is directly mapped to the same.
Note
 The random experiment may result in numeric
or nonnumeric values. Accordingly the
measurement scale of the random variable will
differ
 When the sample space consists of nonnumeric
values that do not have any implicit ordering,
the corresponding measurement scale is said to
be nominal. Example: colour of the car chosen
by the customer; time interval (morning,
afternoon, evening, night) when telephone call
was made
Note (Continued…)
 When the sample space consists of nonnumeric values with an implicit ordering,
like the level of post-surgery recovery, the random variable has an ordinal
measurement scale.
 Other scales of measurement are numeric.
Although numeric measures may be divided into
two classes – interval and ratio, we will not make
the distinction at this point of time
 When values of a random variable are in nominal
or ordinal scale, it is referred to as categorical
data
Random Variables
 Random variables take values from a predefined set – it
is a mapping from the sample space. Thus all possible
values of the random variable are known in advance.
However, the exact value that will occur in the next
occurrence is unknown.
 When a random experiment is conducted, some specific
event (point in the sample space) occurs. However,
which event will occur is unknown till the experiment is
concluded. We have noted this when we discussed
probability
 We do not know when a machine will fail, what will be
power consumption of a device in a given period, what
will be the efficiency of a machine, how many defective
products will be produced during the next production
run…
Note
 Random variables should not be confused with data
 Random variables are theoretical constructs. On the other hand, data are realized
values of the random variables
 Random variables are governed by the laws of
probability. Data collected over a large number of
entities should agree with the theorized
probability model
Sample and Population
 Random experiments are carried out on
sampling units (we referred to this as entities as
well).
 Potentially there are infinitely many entities. And
the random experiment may be conducted on
the same entity multiple times.
 Suppose you are computing the fuel efficiency of cars of a particular make. There
are many cars and you may drive the same car several times
Sample and Population (Continued…)
 As we observe the behaviour of a subset of all
possible entities, we are observing a sample
 The hypothetical (theoretical) case of all possible
observations of all possible entities constitute
the population
 A random sample is one where the entities are
chosen with equal probability, i.e. all entities
have the same chance of being represented.
Note that this is different from taking a
haphazard sample
Sample and Population (Continued…)
 Identifying the entities and observing their
behaviour (conducting the random experiment)
is called drawing samples
 In statistical computations we assume that the
samples are drawn randomly. Random drawing
enforces independence (one drawing does not
influence another)
 The laws of probability are applicable only if the
samples are drawn randomly. As a practitioner,
you must be conscious of this very important
(probably the most important) assumption
Concept of Distributions
 It is often assumed that the process that generates the random
variable can be modeled using a mathematical function. This function
enables us to calculate the probability that the values of a random
variable will fall within any given range
 The mathematical function is defined in terms of the random variable
and some unknown but fixed constants called parameters
 The distribution (mathematical function) characterizes the pattern of
variation of the random variable. If we know the distribution, we
know the probabilities of all events
 We may construct the frequency distribution as well to visualize the
pattern of variation. When we have very large number of
observations drawn randomly from a population, the frequency
distribution may be a good approximation of the distribution.
 The frequency distribution is a non-parametric way of looking at a
distribution and it has its own advantages and disadvantages.
Concept of Density and Mass Functions
• Random variables may be discrete or continuous. A random variable is said to be
discrete if it takes values from a set X = {x1, x2, …}, where X is countable (a set is
said to be countable if either it is finite or can be put in a one-to-one
correspondence with the integers)
• Discrete random variables have probability mass functions (pmf) that have point
masses
• Continuous random variables have density functions (pdf) as shown below. At any
point x the pdf gives the height of the curve, and the area under the curve from x0
to x1 gives the probability that the random variable assumes a value in that range.
In this sense the pdf behaves like density, where mass can be computed from
density and volume
[Figure: probability density curve P(x); the shaded area between x0 and x1 gives P(x0 ≤ X ≤ x1)]
Properties of Mass and Density Functions
 Let fX(x) be a probability mass function (pmf). Then fX(x) ≥ 0 for all x ∈ R and
Σi fX(xi) = 1
 In case the random variable X is continuous, we have fX(x) ≥ 0 for all x ∈ R and
the integral of fX over the entire real line (the total area under the curve) equals 1.
 For a continuous distribution we integrate over the range x0 to x1 to get the
probability. For a pmf we take the sum.
 Note that for a continuous random variable the
probability of any single point is zero
Cumulative Distribution Function
 The Cumulative Distribution Function (often referred to
as Distribution Function) is the function FX : R to [0, 1]
defined by FX(x) = P(X ≤ x)
 Recall that we use the CDF (often referred to as DF)
through ogive
 When either f (pdf or pmf) or F (CDF) is available, we
can compute probability of all events concerning the
random experiment or equivalently probability of all
subsets of values of the random variable
 Knowledge of the density, mass or distribution function
is, therefore a convenient way to understand the pattern
of variation of a random variable from a quantitative
perspective
Three Usages of Distributions
 Distributions of random variables may be used from three perspectives, namely
   To find the probability of any event
   To compare two or more distributions; or to compare the effectiveness of
some actions
   To simulate real life scenarios using the distributional models to develop an
understanding of the phenomenon
Note on Simulation / System Modeling
 Most real life scenarios involve random variables. Complex
systems may involve many random variables.
 Investment for retirement involves random variables like market
trends, interest rates, rate of inflation, length of life, and medical
costs
 Time to complete a project depends on productivity, rates of
attrition, political stability, technical surprises, market conditions
and so on
 The values of the random variables are not known in
advance, but the distributions are known. This enables us to
look at the expected outcomes by generating values of the
random variables many times
 Modeling a system as a set of mathematical equations
involving random variables is called system modeling.
 Generating values of the random variables to study the
outcomes of the system many times is called simulation
Monte Carlo Simulation
 Monte Carlo simulation in its simplest form is a random number generator that is
useful for forecasting, estimation and risk analysis
 A simulation creates numerous scenarios of a phenomenon by repeatedly picking
values for the different random variables defining the phenomenon from a user
defined probability distribution
Example – I
 Your company is a retailer in perishable goods and you
are tasked with finding the optimal level of inventory to
be maintained. It is known that if your inventory exceeds
actual demand, you incur a cost of $100 as the goods
are perishable. On the other hand you incur a cost of
$175 for bringing the item on an emergency basis if the
demand exceeds the inventory. These costs are on a per
unit basis and suppose that the inventory and demand
are computed on a monthly basis. Suppose you have
collected data on actual demand for 25 months and the
data are as follows: 12, 11, 7, 0, 0, 2, 7, 0, 11, 12, 0, 9,
3, 5, 0, 2, 1, 10, 3, 2, 8, 5, 1, 0, 9. Can you use the
previous concepts to find the optimal level of inventory?
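One possible (nonparametric) attack on Example I: treat the 25 observed demands as an empirical distribution and pick the inventory level with the lowest average monthly cost. A sketch, assuming the stated per-unit costs apply each month:

demand = [12, 11, 7, 0, 0, 2, 7, 0, 11, 12, 0, 9, 3, 5, 0, 2, 1, 10, 3, 2, 8, 5, 1, 0, 9]
OVERSTOCK, SHORTAGE = 100, 175      # cost per unit: perished stock vs emergency supply

def avg_cost(level):
    # average monthly cost if we always hold `level` units of inventory
    return sum(OVERSTOCK * max(level - d, 0) + SHORTAGE * max(d - level, 0)
               for d in demand) / len(demand)

costs = {level: avg_cost(level) for level in range(max(demand) + 1)}
best = min(costs, key=costs.get)
print(best, costs[best])            # inventory level with the lowest average cost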
Example – II
 You are responsible for maintaining the inventory
level for a raw material in a manufacturing plant.
You have noted that the quantity demanded on
any given day varies over a range. You have also
noted that once you place an order, the number
of days taken by the supplier to supply the
material varies. You need to compute the
different reordering levels (the inventory level
that triggers an order) indicating the
corresponding service levels where service level
gives the probability that a demand placed on
your store will be met. How will you formulate
this problem?
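A possible Monte Carlo formulation of Example II. The distributions used here (Poisson daily demand with mean 20 and a lead time uniform over 2–5 days) are purely illustrative assumptions; in practice both would be estimated from the data described in the example.

import math
import random

random.seed(1)

def random_poisson(mean):
    # simple Poisson generator (Knuth's method), to keep the sketch dependency-free
    L, k, p = math.exp(-mean), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

def service_level(reorder_level, n_runs=10_000):
    # fraction of simulated replenishment cycles in which the total demand during
    # the supplier's lead time does not exceed the reorder level
    met = 0
    for _ in range(n_runs):
        lead_time = random.randint(2, 5)                                   # assumed lead time (days)
        lead_demand = sum(random_poisson(20) for _ in range(lead_time))    # assumed demand per day
        met += lead_demand <= reorder_level
    return met / n_runs

for level in (60, 80, 100, 120):
    print(level, service_level(level))   # reorder level vs estimated service level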
General Approach to Simulation
 Identify the random variables involved in the problem being considered
 Develop a model to describe the problem / phenomenon being studied. The model
connects the different random variables. Every combination of values of the
random variables gives rise to a specific occurrence of the phenomenon
 Simulation therefore involves generating values of the identified random variables
and combining these values as specified by the simulation model
Two Types of Simulation
 Parametric: In this type of simulation a specific form of distribution – say binomial,
Poisson, normal etc. – is assumed. In these cases the parameters of the assumed
distributions need to be known.
 Nonparametric: In this type of simulation we use empirical distributions obtained
from data. This method has the advantage that no distributional assumptions need
to be made. However, a large amount of data is required for effective
implementation of this method.
Summary
 Random variables are real valued functions of the sample space. We
allocate real values to the outcomes of random experiments thru
random variables
 Random variable is not data – it is a theoretical construct. As the
random experiment is conducted on the entities, data are generated
as values of random variable
 All possible values of a random variable are known in advance (the
sample space is known). However, the exact outcome in a specific
case is unknown.
 As the random experiment is conducted on different entities, sample
is generated. When the entities are chosen randomly, the sample is
a random sample. Randomness is the most important assumption
for the laws of probability to be applicable. The potentially infinite
set of experiments that may be conducted on all possible entities is
called the population
Summary (Continued…)
 Random variables have distributions. These are mathematical
functions that specify the probability of different events (i.e.
probabilities of different sets of values that may be realized by the
random variable)
 When the distribution is known the random variable is completely
known in terms of probability
 The distributions may be thought of as the data generating
mechanisms. These are theoretical constructs and are useful when
the observed data matches the theoretical prediction in the long run
 Many complex real life problems may be described as mathematical
functions of random variables. These mathematical descriptions are
called system models
 Knowledge of the distributions enable us to simulate the systems
and examine the possible outcomes in different situations
 Knowledge of distributions also enable us to compare random
variables.
SOME IMPORTANT DISCRETE
DISTRIBUTIONS
Binomial Distribution
 In many real life situations it is important to determine the
number or proportion of successes/failures when trials are
conducted independently of each other. For example one may
count the number of successful bids, the proportion of
automobiles that pass a test in the first go, the number of
program units found to be defect free when tested for the
first time and so on. These random variables can be modeled
using the Binomial distribution.
 The Binomial distribution works as follows:
 Let us assume that a trial has been conducted n times. Note that
the trials must be independent of each other.
 Let p be the proportion of success. Note that this proportion must
remain same over the period of experimentation.
 Let the random variable, number of successes, be X.
 Then the probability mass function is as given on the next slide
Binomial Distribution (Continued…)
 The Binomial distribution works as follows:
 Let us assume that a trial has been conducted n times.
Note that the trials must be independent of each other.
 Let p be the proportion of success. Note that this
proportion must remain reasonably same over the period
of experimentation.
 Let the random variable, number of successes, be X.
 Then the probability mass function is given by
P(X = x) = nCx p^x (1 − p)^(n − x); x = 0, 1, 2, ....., n
 In case we are interested in the probability mass function of the proportion of
successes X/n rather than the number of successes, the same probabilities apply:
P(X/n = x/n) = nCx p^x (1 − p)^(n − x); x = 0, 1, 2, …, n
Usages of Binomial Distribution
 The Binomial distribution (for a fixed number of trials n) is a one parameter
distribution. The proportion of successes p is the unknown parameter. The
independent trials of the binomial distribution are called Bernoulli trials
 We estimate p from data (how? – typically by the observed proportion of successes)
 Given n and p, we can compute the probability of any event
Examples
1. Suppose in a retail shop, 30% of the customers
who come in buys something. In a particular hour,
10 people have come to the shop. What is the
chance that
a. Exactly 5 people will buy
b. Less than 3 people will buy
2. Suppose the manager of a computer institute
claims that 40% of the students who come for
enquiry enroll in the course. To verify this claim
you have collected data for three days. You
observed that 20 people came for enquiry and 5 of
them joined. Does your data support or refute the
claim of the manager? Explain.
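A sketch of the computations using the binomial distribution (scipy):

from scipy.stats import binom

# Example 1: n = 10 customers, p = 0.3 chance of buying
n, p = 10, 0.3
p_exactly_5 = binom.pmf(5, n, p)     # P(X = 5)  ≈ 0.103
p_less_than_3 = binom.cdf(2, n, p)   # P(X < 3) = P(X ≤ 2) ≈ 0.383
print(p_exactly_5, p_less_than_3)

# Example 2: under the manager's claim (p = 0.4, n = 20 enquiries),
# how unusual is observing only 5 enrolments?
print(binom.cdf(5, 20, 0.4))         # P(X ≤ 5) ≈ 0.126, so the data alone do not strongly refute the claim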
Exercises and Examples
1. Suppose the chance that a particular part of an aircraft will fail during a 3 hour or
longer flight is 0.05. At a point of time only one part operates, and in order to
improve reliability the aircraft designer has provided 4 redundant parts (i.e. a total
of 5 parts). In case fewer than three parts are found to be operative at the end of
a flight, a severe fault code is generated, and the design is expected to ensure that
the probability of this event is less than one in 10000. Do you think the current
design is adequate?
Work-out- Exercises and Examples
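A sketch of the work-out: with 5 parts failing independently with probability 0.05 each, a severe fault code means fewer than 3 operative parts, i.e. 3 or more failures.

from scipy.stats import binom

p_severe = binom.sf(2, 5, 0.05)   # P(number of failed parts ≥ 3)
print(p_severe)                   # ≈ 0.00116, i.e. about 12 in 10,000

# The requirement is less than 1 in 10,000 (0.0001), so on this calculation
# the current design does not appear adequate.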
Geometric Distribution
 Let the random variable X be the number of Bernoulli trials required to get the
first success, where the trials are conducted independently of each other and the
probability of success remains constant
 X takes values 1, 2, 3, ….
 P(X = k) = p (1 – p)^(k – 1), k ≥ 1
Negative Binomial Distribution
 This is an extension of the geometric distribution
 In this distribution we try to model the number of trials
required to get r successes. The random variable is
denoted as X, where X is the excess number of trials
required over r to get r successes
 P(X = x) = [(x + r – 1)! / ((r – 1)! x!)] p^r (1 – p)^x, x = 0, 1, …;
0 < p < 1
Examples
1. Suppose in a manufacturing environment defects are rather rare. We may use
the geometric distribution to estimate the number of attempts required to get
the first defective item.
2. We may use the geometric distribution to
estimate the number of good products produced
before getting the first defective product
3. The negative binomial distribution may be used
to find the number of sales calls required to
close a fixed number of orders
Poisson Distribution
 In a real life situation Poisson distribution is applicable
whenever a particular trial is conducted many times,
independently of each other and the probability of a
particular event in any one trial is very small.
 Take, for instance, the number of defects found in a
specification document. It may be assumed that the
probability of making a mistake in any one line (or a
small segment) is very small and the errors occur
independently of each other.
 Going by the same logic, number of accidents, number of
breakdowns, number of unreadable messages and similar
random variables will follow Poisson Distribution. (Why?)
 When p is small but n is large so that np remains finite,
the Binomial distribution approaches the Poisson
distribution. Note that large n typically means n > 30.
Poisson Distribution (Continued…)
 Let  be the average number of occurrences of
an event.
 Let X be the random variable.
 Then p(X=x) = (e- x) / x ; x = 0, 1, 2, …..., ∞
 Note that the only parameter of a Poisson
distribution is , the average number of
occurrences of the event.
Note
Poisson distribution is typically used for count data –
number of breakdowns, number of accidents, number of
defects on a product, number of faults generated during
a production run of a machine, number of customers
who come to a store
Examples
 Suppose in the tool crib of a machine shop, on an average 2 requests for issuing
tools are made in an hour. What is the chance that in a particular hour
   5 or more requests will come
   No request will come
 Suppose you are in charge of manpower planning for a team responsible for
maintaining software. Consider a simplistic environment where only one type of
request comes. From your past observation you have seen that on an average 3
requests come in a day and it usually takes a full day to service a request.
Considering this, you have allocated three persons.
   What is the chance that at least one person will remain unutilized on any
given day?
   What is the chance that you will not be able to service customers, assuming
that customer service will be adversely impacted in case 4 or more requests
arrive on any given day?
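A sketch of the Poisson computations for both examples (λ = 2 per hour for the tool crib, λ = 3 per day for the software team):

from scipy.stats import poisson

# tool crib: λ = 2 requests per hour
print(poisson.sf(4, 2))    # P(X ≥ 5) ≈ 0.053
print(poisson.pmf(0, 2))   # P(X = 0) = e^(-2) ≈ 0.135

# software team: λ = 3 requests per day, 3 persons available
print(poisson.cdf(2, 3))   # P(X ≤ 2) ≈ 0.42 – at least one person unutilized
print(poisson.sf(3, 3))    # P(X ≥ 4) ≈ 0.35 – customer service adversely impacted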
Discrete Uniform Distribution
 This is often referred to as the equally likely outcome distribution and is
equivalent to assuming minimum knowledge about the distribution
 A random variable X is said to follow a discrete uniform distribution in case it
takes values 1, 2, …, k such that P(X = j) = 1 / k for all j = 1, 2, …, k
CONTINUOUS
DISTRIBUTIONS
Normal Distribution
 One of the most common distributions for
continuous random variables is the Normal
distribution.
 When a random variable is expected to be roughly
symmetrical around its mean and tend to cluster
around the mean value, the variable follows a
Normal distribution.
 Many continuous variables like length, weight or
other dimensions / characteristics of a manufactured
product; the difference between planned and actual
effort or time; amount of material consumed during
a production run and so on are expected to follow
normal distribution
PDF of Normal Distribution
 Let X be a random variable which follows the Normal distribution.
 Let μ be the mean value of X.
 Let σ be the standard deviation of X.
 Then the probability density function is given by
f(x) = {1 / (σ √(2π))} exp[−(x − μ)² / (2σ²)]
 The two parameters of the Normal distribution are μ and σ, where μ and σ²
represent the mean and variance of the random variable
Importance of Normal Distribution
 Two main reasons for the importance of the Normal distribution are:
   The tendency of sums or averages of independently drawn random
observations from markedly non-normal distributions to closely approximate
Normal distributions (see the example on the next page), and
   The robustness or insensitivity of many commonly used statistical procedures
to departures from theoretical normality.
Convergence to Normal Distribution – An Example
Convergence to Normal Distribution (Continued…)
 Discuss the different laws of large numbers and the convergence of the binomial
and Poisson distributions to the normal, in addition to the convergence of means
and sums
Uniform Distribution
 Let X be a random variable that takes values between two real numbers a and b
(b > a) with the pdf f(x) = 1 / (b – a) whenever a ≤ x ≤ b and 0 otherwise
 It is important to note that in a uniform distribution the maximum and minimum
values are fixed and all values between the minimum and maximum occur with
equal likelihood
Usage of Uniform Distribution
 This distribution is very important from the
perspective of random number generation. The
computer generated random numbers are drawn
from uniform distribution. In general these
random numbers follow the U(0,1) distribution
 The distribution function of any continuous random variable, evaluated at the
random variable itself, follows the uniform distribution, i.e. if F is the distribution
function and X is a random observation then Y = F(X) ~ U(0,1). This is a very
important property and we will see its usage later
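This property is what makes uniform random numbers so useful: inverting a distribution function turns U(0,1) draws into draws from that distribution. A minimal sketch for the exponential distribution (introduced on a later slide), whose inverse CDF is −ln(1 − u)/λ; the rate value below is an illustrative assumption:

import math
import random

random.seed(0)
lam = 0.5                                         # assumed rate of the target exponential

u = [random.random() for _ in range(10_000)]      # U(0, 1) draws
x = [-math.log(1 - ui) / lam for ui in u]         # inverse-CDF (probability integral) transform

print(sum(x) / len(x))   # sample mean ≈ 1/λ = 2, as expected for the exponential distribution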
Exponential Distribution
 This distribution is widely used to describe events
recurring at random points in time. Typical examples
include time between arrival of customers at a
service booth; time between failures of a machine
and so on.
 The exponential and Poisson distributions are related.
In case the time between successive occurrences of
an event follows an exponential distribution then the
number of occurrences in a given period follows a
Poisson distribution.
 The probability density function of the exponential distribution is given by
f(x) = λ e^(−λx), x ≥ 0, λ > 0
 The mean of the exponential distribution is 1/λ
Weibull Distribution
 This distribution describes data resulting from life and fatigue tests. It is often
used to describe failure times in reliability studies as well as breaking strength of
materials. Weibull distributions are also used to represent various physical
quantities like wind speed
 Weibull pdf: f(x) = (α / β^α) x^(α − 1) exp[−(x / β)^α], where x ≥ 0, α > 0, β > 0
Summary of the Distributions
Name                Parameters   Mean              Variance
Binomial            p            np                np(1 – p)
Poisson             λ            λ                 λ
Geometric           p            1 / p             (1 – p) / p²
Negative Binomial   p            r(1 – p) / p      r(1 – p) / p²
Discrete Uniform    K            (K + 1) / 2       (K – 1)(K + 1) / 12
Uniform             a, b         (a + b) / 2       (b – a)² / 12
Normal              μ, σ²        μ                 σ²
Exponential         λ            1 / λ             1 / λ²
Weibull             α, β         β Γ(1 + 1/α)      β² {Γ(1 + 2/α) – (Γ(1 + 1/α))²}
Introduction to Descriptive
Statistics
Prologue
 The most basic statistical process is enumeration – the collection and
listing of individual cases. The fundamental idea of statistics is that
useful information can be accrued by compiling data. While individual
observations like time to pay an invoice, closing value of a stock on any
given day, time to travel from one place to another in a city, number of
resources demanded by an account manager in a given month and so on
do not tell us much by themselves; taken together, the data speak clearly.
 In order to look at a body of data, it needs to be summarized.
Summary is the amalgamation of data – mostly noisy data – to reveal
interesting common features.
 We also use the process of comparison. Comparison is the process
opposite to summary, the pulling apart of a data set to reveal
interesting differences
 Descriptive Statistics give us a set of tools to carry out
summarizations and comparisons of one or more variables. It also
includes some techniques to visualize the data in a manner such that
interesting characteristics of the underlying data generation process
and the random variables involved may become apparent. The tools
consist of exploring the relationship between two or more variables as
well.
Understanding One Variable
Example 1
Look at the data on the number of days required to obtain payment of invoices.
Suppose we have a large amount of data and we may be willing to have a tool
that allows us to estimate the probability of different events. (What are the
events in this case? What is the data generation process and what are the
random variables?)
Example I (Continued..)
 We need a tool that summarizes the pattern of
variation
 We use tools like frequency distribution and
histogram to understand how many observations
lie in different intervals of the random variable
 Different events may be thought of as the time
required to pay invoices. Frequency distribution
and histogram are two tools to estimate these
probabilities.
 Usage of these tools require a large amount of
data
Example 2
Look at the same data. Suppose you are interested
in assessing the risk involved in managing your
cash flow. Thus you may have problems in case
payments get unduly delayed. You need a tool /
measure to develop a quantitative understanding
regarding that risk. (Can you identify the random
variables involved? Can you state the risk in
quantitative terms?)
Example 2 (Continued…)
 While the average time to pay may not be very
high, there may be a few cases when the time
becomes very large
 This kind of a pattern of variation is said to be
skewed. When the distribution of a random
variable is skewed, large values become more
likely.
Example 3
You have an important function at home and need to
reach on time. Unfortunately, you got delayed in the
office. There are two alternative routes. Both these
routes take about the same time but one possibly has
a larger variation than the other. You would like to
have a quantitative measure of variation so that you
could compare the two routes. We often have similar
situations in investment as well.
Example 4
You are the manager of a retail store and you want to arrive at a ball park
estimate of the quantity of different products that you will stock. You need a tool
to arrive at this figure.
Comments
1. All the problems stated above require you to develop a quantitative
understanding of the pattern of variation, typical values, quantum of variation
and other important characteristics of a random variable on the basis of collected
data (these characteristics are empirically computed from the data without
making any assumptions)
2. The pattern of variation (that also enables the analyst to make a
variety of probabilistic statements) is captured using various tools
like frequency distribution, histogram, and percentiles
3. The typical values and variations are obtained from the average /
median (the 50th percentile) and some measure of variation like
standard deviation.
4. Other important characteristics include skewness and kurtosis of a
random variable. These measures help forming ideas about
extreme values and risks
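A sketch of these summaries with pandas. The series below is a placeholder standing in for the invoice payment-days data; the numbers are assumptions used only for illustration:

import pandas as pd

# placeholder data; in practice read the payment-days column from the invoice data file
days = pd.Series([12, 35, 41, 44, 47, 52, 55, 58, 63, 70, 95, 120, 180])

print(days.mean(), days.median(), days.std())          # location and spread
print(days.quantile([0.10, 0.25, 0.50, 0.75, 0.90]))   # percentiles
print(days.skew())                                     # positive value indicates a right (positive) skew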
Measures of Location
 The simplest summary measures are the measures
of location. Three most important measures are
mean, median, and mode
 We will discuss about mean and median. Mode is
not used much in the context of business analytics
 These measures give an idea about the typical value
assumed by a random variable
 Suppose you will carry out an activity many times.
In these cases choosing the alternative with a more
attractive mean is likely to make sense. Examples
 There are two routes to go back home from office. The
time taken by route 1 is lower on an average
 There are three alternative methods to carry out a
chemical process. Method 2 has a higher average yield
compared to the other methods
Median
 In certain cases we may use median instead of mean
 Median is defined to be a value such that 50% of the values
of the random variable is less than or equal to this value.
Technically we may write xm to be the median of a random
variable X when F(xm) = 0.5. We may consider the smallest
xm or we may take an average of x1 and x2 (x1 < x2) where
both x1 and x2 satisfy the condition F(xj) = 0.5, j = 1,2
 We often prefer median when the underlying distribution is
not symmetric and the sample size is not very large. This is
particularly true for artificial variables that can have very large
values. Examples are sale of books, revenue earned by
specialists, market capitalization of companies, number of hits
for certain websites and so on. For such variables, mean may
represent a skewed picture even for large sample – average
national income is a good example
Percentiles
 The value xp is called the pth percentile in case F(xp) = p,
0 ≤ p ≤ 1. Median is a special case of a percentile (it is
the 50th percentile). As already mentioned, xp may not
be unique.
 When we know the mean, median and other percentiles
of a distribution, we develop a feel about the random
variable and the data generation process
 When the mean and the median are about same and the
percentiles are about equidistant from the median, the
distribution is symmetric. Otherwise it is skewed – a
concept we will study in greater detail
 The percentiles – when computed from a large sample –
gives us reasonable idea about the probability of many
events
Measures of Dispersion
 These are summary measures of the spread of a
random variable
 Three typical measures are standard deviation, inter
quartile range and range
 In a large sample case, standard deviation (the square root of the mean squared
deviation from the mean) is the most popular and stable measure.
 Standard deviation is a measure of risk. Higher the
standard deviation, higher is the spread and hence
higher is the chance that the random variable can
take widely different values. Thus increased
variability reduces predictability and hence increases
risk.
An Important Point
 The mean, median, standard deviation and
percentiles are properties of some random variables.
 As we are looking at properties of random variables,
we must be in a position to identify the random
variables explicitly. We should try to ensure that the
data generation process is also identified.
 The summary measures that we obtain are sample
observations and may change if the sample changes
 In our study of random variables using descriptive
statistics, the issues of sampling fluctuations are not
considered.
Usages
 The mean, median, standard deviation and percentiles are used for both
summarization and comparison. Examples
   We may compare the time taken to service tickets of different complexities or
other types
   We may compare two or more customers with respect to their time to pay
invoices
   We may summarize the price of certain commodities in different markets in a
city in a given month
Frequency Distribution
Notes
1. The frequency distribution is a technique to count the number of observations
between different intervals of the random variable being studied.
2. Usually the intervals are of equal length and are referred to as class intervals. The
number of class intervals is taken as sqrt(N) where N is the number of
observations. However, if N is very large, the number of classes is restricted to
about 25.
3. Frequency distribution and histograms require a large number of observations. You
should look at a sample size of 100 or more at the very least
4. Let fj be the frequency of the jth class and let N = Σ fj. The relative frequencies are
defined to be fj / N. Cumulative frequencies are defined to be Fj = Σ fi, i = 1,2,..j,
and cumulative relative frequencies are defined as Fj / N
Histogram
 A histogram is a visual depiction of the frequency distribution
 In order to construct a histogram, we construct adjacent bars for the different
class intervals with heights proportional to the frequency or relative frequency of
the classes.
Example
[Figure: histogram of a roughly symmetric distribution, centred between 40 and 50]
Notes
1. The above figure is an example of a histogram
2. In this case the heights of the bars are proportional to
the frequencies. Note that this is same as constructing
the bars with their heights proportional to the relative
frequency
Example (Continued…)
 The histogram and frequency distribution make
several points apparent
 Shape of the distribution: The distribution in the
previous slide is roughly symmetric
 The location is between 40 and 50 (the region
of maximum concentration)
 We get an idea about the variation
 We can estimate the probabilities of different
events from the frequency distribution. The
relative frequencies and the cumulative relative
frequencies provide estimates of the density
function and the cumulative distribution function
respectively
Inferences
 The pdf and cdf estimated from the frequency
distributions are non-parametric estimates of
these functions
 The estimate may be written as
Fest(x) = Σ I(Xi ≤ x) / N, where I(Xi ≤ x) = 1 if Xi ≤ x
and 0 otherwise
 We can use the Dvoretzky–Kiefer–Wolfowitz
(DKW) inequality to construct a confidence band
DKW Confidence Band
 This is a non-parametric confidence band
 Let L(x) = max {Fest(x) – Ɛ, 0}
 Let U(x) = min {Fest(x) + Ɛ, 1}
 Ɛ = sqrt(ln(2/α) / (2N)), where we are looking at a
100(1 – α)% confidence band (a short R sketch is
given below)
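A minimal R sketch, on simulated data, of the empirical CDF with a 95% DKW band:

x <- rnorm(200, mean = 50, sd = 10)       # illustrative data
N <- length(x); alpha <- 0.05
eps <- sqrt(log(2 / alpha) / (2 * N))     # DKW half-width
Fest <- ecdf(x)                           # empirical CDF: Fest(x) = sum(I(Xi <= x)) / N
grid <- sort(x)
plot(Fest, main = "Empirical CDF with 95% DKW band")
lines(grid, pmin(Fest(grid) + eps, 1), lty = 2)   # U(x)
lines(grid, pmax(Fest(grid) - eps, 0), lty = 2)   # L(x)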
Histogram Example – Positive Skew
• The histogram given above has a positive or right skew. Although
many values (time to pay an invoice) are low, some invoices take
very long to be paid. Such behaviour increases the risk of not
getting paid for a long time, which may lead to cash flow
problems.
Histogram Example – Negative Skew
The histogram given above has a negative skew. In this
case the median is larger than the mean, while for a
positively skewed distribution the mean is higher than
the median.
When a distribution is negatively skewed, we can get
very low values and that may be a cause of risk in
specific situations.
Exercise
The above histogram gives the distribution of weight of
adults taken randomly from three different countries.
Examine the histogram and give your comments.
Histogram Example
• Consider the payment cycle time of two different
customers. The mean and standard deviation of the cycle
times are 78.6 and 67.5 for customer 1 and 79.2 and
65.7 for customer 2.
• However, the two histograms are very different, showing
that the payment patterns are not the same at all – a fact not
detected by the mean and standard deviation.
Histogram Example (Continued…)
Customer 1
• The histogram given above for invoice payment time is skewed to the right.
• Although the average payment time is 78 days (well below the agreed 90
days limit), many invoices take much longer.
• It appears that there is a systemic issue (maybe invoices contain errors,
maybe they are sent without verification of completion of work, maybe
they are not sent electronically) and focusing on the delinquent invoices may
not be of much help.
Customer 2
• The histogram given above shows a different pattern. In this case
most invoices are paid within 90 days. However, a few take much
longer, thereby increasing the average time.
• The average time in this case is 79 days – slightly greater than in the
previous case. However, control is much easier as we probably have
to focus on a few specific cases of delinquency.
Distribution of Dimensions
Look at the distribution of dimensions of some component produced by
three machines. What are your comments?
Concept of Ogives
1. Ogives provide a visual description of the cumulative distribution
function.
2. The curve varies between 0 and 1; it asymptotically approaches 1 at
one end and 0 at the other.
3. We can construct two types of ogives – less than type and greater
than type (what are they?)
4. The ogive of a random variable X may lie above (greater than or equal
to) the ogive of Y everywhere. In case that happens, what will it mean?
Slightly Distorted Example of an Ogive
Suppose a telecom service provider has carried out a survey to understand the importance the
customers attach to various aspects of service. The opinion of the customer has been taken on a 7 point
(1 to 7) scale where 1 implies almost no importance and 7 indicates the highest level of importance.
The data collected for 357 customers for two service characteristics, namely store experience and
consistency of service delivery, are tabulated below:

Rating    Store Experience                 Consistency of Service Delivery
          Freq    Prop     Cum Prop        Freq    Prop     Cum Prop
1         4       0.011    0.011           1       0.003    0.003
2         6       0.017    0.028           1       0.003    0.006
3         7       0.020    0.048           5       0.014    0.020
4         35      0.100    0.148           34      0.095    0.115
5         122     0.341    0.489           97      0.272    0.387
6         134     0.375    0.864           155     0.434    0.821
7         49      0.136    1.000           64      0.179    1.000
Total     357     1.000                    357     1.000

Do you see any pattern? Do you understand the profile?
Skewness
 Skewness indicates how the distribution is pulled to one
side or the other. This helps taking many decisions
 In an investment scenario a left skew is preferred to a right
skew. A left skew indicates that lower values are less likely. Thus
when the random variable is the return of a stock, a left skewed
distribution is preferable
 Suppose X and Y represent the productivity in two different
circumstances – say when two different tools are being used.
Suppose X and Y have the same average, median and variance.
However, suppose X is negatively skewed and Y is positively
skewed. How will you compare X and Y (or the tools)? Can you
think about the ogives in these two cases?
Kurtosis
 Kurtosis measures the peakedness of a distribution
 Kurtosis is expected to be 3.0 for a normal distribution, and such
distributions are called mesokurtic
 A distribution with kurtosis < 3.0 is called
platykurtic and when it is > 3.0 the distribution
is called leptokurtic
 The measure of kurtosis is important as leptokurtic
distributions are more prone to catastrophic
events (a short R sketch of skewness and kurtosis
is given below)
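A minimal R sketch, using base R only, of the sample skewness and kurtosis computed from standardized moments (simulated data):

x <- rnorm(500)                              # illustrative data
m <- mean(x); s <- sd(x); n <- length(x)
skewness <- sum((x - m)^3) / (n * s^3)       # near 0 for a symmetric distribution
kurtosis <- sum((x - m)^4) / (n * s^4)       # near 3 for a normal (mesokurtic) case
c(skewness = skewness, kurtosis = kurtosis)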
Box Plot – Another Visual Summary
 Box Plot provides another (other than frequency
distribution / histogram) visual summary of a
distribution
 The plot is called a Box and Whisker plot.
 The plot consists of a box with two ends as the
3rd and 1st quartile respectively. The median is
shown as a line inside the box. Two whiskers are
drawn from the 3rd and 1st quartiles to the
maximum and minimum values respectively.
 Thus in one figure the box plot displays the
central tendency, the dispersion as well as the
skew.
Example
• An example box plot is shown in
this slide. The upper and lower
whiskers represent the maximum
and the minimum values
respectively. The upper and lower
lines of the box represent the 3rd
and the 1st quartiles. The median is
shown inside the box.
• Box plot is not only a great way to
present summary description of a
random variable. It is an excellent
tool to compare a number of
distributions as well
Summary
 Measures like mean, median, percentiles, standard deviation, skewness and
kurtosis are properties as well as summary measures of random variables. It is
important to understand the random variables and the underlying data
generation process.
 The mean and median give the location or the typical values. In the long run
we expect these values to occur. The percentiles help understand the pattern
and give an idea about the probability of different events. Standard deviation gives
the spread and helps assess the risk.
 Skewed distributions alter the probability of occurrence. Distributions with
a positive skew will produce lower values more frequently. Distributions with a
negative skew will have a preponderance of higher values. The ogives help us
identify this from a visual perspective.
 Kurtosis helps us assess the chance of catastrophic events. Leptokurtic
distributions are more likely to have such events
 The frequency distribution / histogram coupled with DKW confidence band is
very useful for estimation of probability of different events
 The box plot is used to summarize the central tendency, dispersion and skew
of the distribution of a random variable. This may be used for comparison as
well – we will cover this aspect later
RELATIONSHIP BETWEEN
TWO OR MORE VARIABLES
Relationship Between Two Variables
 In real life business analytics we often come
across situations where we need to estimate or
classify a variable on the basis of other variables
(What is this analysis called? What are these
variables called?)
 Understanding the relationship between these
variables is, therefore, a basic requirement
 We examine this relationship graphically using a
scatter diagram (an example was given in the
previous chapter)
Relationship between odometer reading and a used car's
selling price
 A car dealer wants to find the relationship between the
odometer reading and the selling price of used cars.
 A random sample of 100 cars is selected, and the data
recorded. The odometer reading is the independent variable (x)
and the selling price is the dependent variable (y).

Car    Odometer    Price
1      37388       5318
2      44758       5061
3      45833       5008
4      30862       5795
5      31705       5784
6      34010       5359
…      …           …
Example – Scatter Diagram

[Scatter plot of Price (y axis, roughly 4500 to 6000) against Odometer (x axis, roughly 19000 to 49000), with a fitted line]

• The diagram given above is called a scatter diagram
• In this diagram the X axis represents the explanatory (independent)
variable and the Y axis represents the response variable
• The line shown in the graph is the line of best fit and we will discuss
this in a later section
Quantifying the Relationship
 The strength of the relationship is measured using
correlation coefficient
 The correlation coefficient measures the strength of the
linear relationship and has the following properties
 Ranges between –1 and 1
 Unit-less
 The closer to –1, the stronger the negative linear relationship
 The closer to 1, the stronger the positive linear relationship
 Value of ± 1 indicates perfect relationship and a value of 0
indicates no relationship
Computation of Correlation Coefficient
• For data collected on a ratio scale the product moment correlation (also called Pearson's
correlation coefficient) is computed
• The computation is carried out as follows:

cov(x, y) = Σ (xi – x̄)(yi – ȳ) / (n – 1), summing over i = 1, …, n

where cov stands for covariance and a positive (negative) value indicates a positive
(negative) relationship. Zero indicates the absence of a linear relationship.

Correlation coefficient r = cov(x, y) / sqrt(var(x) · var(y))

A short R illustration of these computations follows.
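A minimal R sketch of the covariance and correlation computations, on the built-in mtcars data (the choice of variables is illustrative):

data(mtcars)
x <- mtcars$wt; y <- mtcars$mpg
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)   # covariance from the formula
cov(x, y)                                              # same, using the built-in function
cov(x, y) / sqrt(var(x) * var(y))                      # correlation coefficient r
cor(x, y)                                              # same, using cor()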
Scatter Plots of Data with Various Correlation Coefficients

[Six illustrative scatter plots of Y against X with r = –1, r = –0.6, r = 0, r = +1, r = +0.3 and r = 0]
Exercise
Look at the above scatter plots and comment about the
relationship between the variables.
An Example
Portugal is a top wine exporting country holding about 3.2% of the world
market in 2005. The wine industry is investing in technology and is trying to
find the externally controllable parameters that could impact the wine taste. In
order to understand the relationship between the different parameters and
wine taste, a large experiment was conducted. The description of the data
collected as part of the exercise is given below:
1. Independent variables: There are 11 independent variables, namely – fixed
acidity, volatile acidity, citric acid, residual sugar, chlorides, free SO2, total
SO2, density, pH, sulphates and alcohol.
2. Response: Quality of wine measured on an ordinal scale
Alcohol vs. Wine Quality

[Scatter / dot plot of wine quality (y axis, roughly 0 to 9) against alcohol content (x axis, roughly 0 to 16)]

The scatter diagram of alcohol vs. wine quality is given
above. However, the relationship between these
variables is not clear from this diagram. Incidentally, the
plot used here is also called a dot plot.
Problems of Scatter Diagram
 In scatter diagram we plot data of each individual
pair of observations.
 While a scatter diagram is usually effective in
detecting patterns, it suffers from a number of
deficiencies
 When the number of observations is large, the
plot may get cluttered. As a result the actual
pattern may become hard to detect
 The range of the data may be large and only a few
(but not a negligible number of) points may lie in a
large part of the overall range. Depicting the pattern
in such a case turns out to be difficult
Concept of Mean Function
 The mean function finds the mean of the
response variable for different ranges of the
explanatory (input / predictor) variable
 We usually divide the explanatory variable into a
number of sub-ranges depending on the number
of observations (in order to get a stable estimate of
the average, each range needs to have at least 30
observations); a short R sketch is given below
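A minimal R sketch of a mean function, with simulated x and y and quartile-based sub-ranges (all names and values are illustrative):

set.seed(1)
x <- runif(300, 8, 14)                    # e.g. alcohol content
y <- 2 + 0.4 * x + rnorm(300, sd = 0.8)   # e.g. a quality-like response
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
table(bins)              # check that each sub-range has enough observations
tapply(y, bins, mean)    # the mean function: average of y within each sub-range of x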
A Variant of a Mean Function
Consider the following table showing the number of cases of different
quality of wine as the alcohol content changes (a reconstruction of the
earlier scatter diagram in terms of a mean function):

Wine Quality      Alcohol Content                                                 Total
                  Low           Moderate       High            Very High
                  (8.4 – 9.5)   (9.5 – 10.1)   (10.1 – 11.1)   (11.1 – 14.1)
Poor (3,4)        12            16             15              10                53
Moderate (5)      185           254            142             47                628
Good (6)          73            122            186             180               561
Excellent (7,8)   2             17             48              130               197
Total             272           409            391             367               1439

Can you see the pattern now? Can you compare this with the scatter
diagram?
Extension of Mean Functions
 When a large volume of data is available, the mean function
may be extended to incorporate multiple input variables to
understand the behaviour of the output (target) variable
 The essential idea is that when a number of explanatory
variables have similar values, the behaviour of the outcome is
expected to be similar
 This method is a preliminary version of the nearest neighbour
algorithms we will study in greater detail later.
 The method requires construction of tables and classifying
new data points into a unique cell of the table. Consequently,
the methods will be referred to as table lookup methods
 The table lookup methods can often be implemented through
SQL if the right data are available
Structure
In order to carry out a table lookup, the table may be
constructed as follows:

X1   X2   X3   …   Xp   Output

• Each column of the table except the one for the output
represents an input variable
• Each input variable is divided into a number of groups.
For categorical variables, the individual values or some
combination is taken. Ratio scale variables may be
grouped using percentiles
• The output column holds the average or median value of the
target variable for the cell
An Example
 Websites providing estimates of the price of used
cars are an example of a simple table lookup
method
 The interface allows the user to specify the make,
model, year, mileage range, fuel type etc. The
estimated price is an average for the class
 The estimates are generally pretty good
 The home price estimate (the example we just tried)
is another example.
Issues with Table Lookup
 If the different dimensions (X variables) of the
table are correlated, some cells may have very
few observations. Consequently, the average /
proportion arrived at from the cell may not be
stable
 The number of cells increases quickly. Even if we
have just 5 dimensions, each broken up into 4
parts, we land up with many cells! (How many?)
For a realistic BA problem this is a formidable
barrier.
Construction of Lookup Tables (Already Discussed)
 Choose dimensions that are likely to impact the target
 Try to ensure that the dimensions are not strongly
correlated (this is often hard to ensure and may not
have much impact – we will also look at methods to take
care of this situation called multicollinearity)
 Use percentiles / combination of categories for cell
entries
 Populate the table. In case some cells are empty / near
empty, try reducing dimensions or reduce the fineness of
partition
 Occasionally some cells may be empty by design, e.g. a
product may not be sold in some country.
 Define an appropriate function of the target variable for
each cell
RFM – A Widely Used Lookup Model
 In direct marketing RFM is a familiar acronym for
Recency, Frequency and Monetary
 Assumptions: customers who purchased
 recently are more likely to purchase in the near future
 frequently are more likely to continue
 more in value terms are likely to bring more value
 A three dimensional (R / F / M) lookup table is
constructed; a short R sketch of such a table is given below
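A minimal R sketch of building an RFM lookup table, assuming a hypothetical transaction data frame tx with columns customer_id, date and amount (simulated here; all names are illustrative):

set.seed(1)
tx <- data.frame(customer_id = sample(1:100, 1000, replace = TRUE),
                 date   = Sys.Date() - sample(0:365, 1000, replace = TRUE),
                 amount = round(runif(1000, 10, 500), 2))
tx$days_ago <- as.numeric(Sys.Date() - tx$date)

rec <- aggregate(days_ago ~ customer_id, tx, min)     # recency: days since last purchase
frq <- aggregate(days_ago ~ customer_id, tx, length)  # frequency: number of purchases
mon <- aggregate(amount   ~ customer_id, tx, sum)     # monetary: total purchase value
rfm <- Reduce(function(a, b) merge(a, b, by = "customer_id"),
              list(setNames(rec, c("customer_id", "recency")),
                   setNames(frq, c("customer_id", "frequency")),
                   setNames(mon, c("customer_id", "monetary"))))

# Score each dimension into (up to) five groups; the cells of the lookup table
# are the R / F / M score combinations.
score <- function(v, reverse = FALSE) {
  brks <- unique(quantile(v, probs = seq(0, 1, 0.2)))
  s <- cut(v, breaks = brks, include.lowest = TRUE, labels = FALSE)
  if (reverse) max(s) + 1 - s else s
}
rfm$R <- score(rfm$recency, reverse = TRUE)   # more recent purchases get a higher score
rfm$F <- score(rfm$frequency)
rfm$M <- score(rfm$monetary)
table(rfm$R, rfm$F)   # some cells are sparse - the correlation issue raised in the questions below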
Questions
1. RFM cells are correlated and certain combinations are
unlikely to exist. Identify the combinations
2. How can you possibly use the RFM lookup table?
Notes on RFM
 The cells of the RFM lookup table may be
considered as experimental blocks. The entities /
experimental units within a block are expected to
be similar
 Customers belonging to certain cells are likely to
be more lucrative compared to customers
belonging to other cells
 Direct marketing often makes use of the above
two points
Usages of RFM
 Targeting: Customers in certain cells may be targeted
more vigorously compared to customers in other cells
 Test and measure: The effectiveness of campaigns
may be tested by targeting randomly selected
customers within cells and then comparing them with
a control sample of the same cell who have not been
made any offer
 Assessment of incremental gains: The test and
measure may be carried out in different cells to
estimate where the campaign effect is maximum.
(Note: Cell with the maximum gain may not have the
maximum response)
Concepts of Stratification
 In many applications we are interested in some variable
and we are aware that this variable is impacted by many
other variables.
 These other variables are called the explanatory
variables (we often refer to the variable of interest as Y
and the variables impacting Y as X variables)
 We intend to examine how Y changes as X changes and
in order to study this we look at the different subsets of
Y obtained for different values (conditions) of X.
 This process of examining the changes of Y as X
changes is known as stratification
 In the stratification exercise we are essentially
estimating the conditional averages / medians of Y for
different values of X. Note that in computing the mean
function, we are essentially doing the same
Usage of Box Plots for Stratification
 In some applications we may like to look at the
variable of interest (called the Y variable or response
variable) for different values of X
 In order to compare Y, we may use the box plot
A Hypothetical Example
Another Example
 Examine the data on effort required for servicing
tickets of different types.
 Note that tickets were classified on complexity
as well as a couple of other classifications
 The random variable effort to service tickets
was stratified on the basis of the classification
criteria and the box plots were drawn for each
class
 The box plot is given in the subsequent slide.
Give your comments
Box plot of effort based on different combinations

[Box plots of effort for each ticket class: IM-C-DB, IM-M-App, IM-S-App, IM-S-DB, RM-M-App, RM-S-App, RM-S-DB, SR-C-DB, SR-M-DB, SR-S-DB]
Exercise
 Examine the manpower supply data. We want to
estimate the demand. Answer the following:
 What random variable are you studying?
 Can you characterize the random variable? What
does characterization mean? How will you
proceed?
 Examine the data carefully. How could you
possibly stratify the response? How will you know
whether the stratification is effective? What
techniques will you use?
 Can you help the management in predicting the
demand?
Review Questions
1. What is a scatter diagram?
2. What is correlation coefficient? What is its range?
3. What does a correlation coefficient ± 1 indicate?
4. Why do you think covariance measures linear
relationship?
5. What is mean function?
6. What is a table lookup method? What are some of
its advantages and disadvantages?
7. What is RFM? How is it used in the context of
direct marketing?
Usages
 Scatter diagrams and mean functions enable you to
understand the form of the relationship between two
variables
 You should be able to use this knowledge to choose
the right model later – at a stage when you fit
models
 The particular form of contingency table is a good
starting point for exploratory analyses. These
analyses typically enhance understanding regarding
relationship between variables substantially
 Lookup tables, and RFM as a special case of lookup
tables, are the starting point of big data analytics.
These are simple non-parametric techniques that can
be effectively used when large data are present.
Exercise - Estimation of Home Price
Suppose we are interested in understanding how the
different characteristics of locations (towns) impact the
median value of owner occupied homes. Analyze the data
given to you (Boston Housing Data) and give your
comments on how the different input variables impact the
output variable. The list of variables is given below.

CRIM      per capita crime rate by town
ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS     proportion of non-retail business acres per town
CHAS      Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX       nitric oxides concentration (parts per 10 million)
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built prior to 1940
DIS       weighted distances to five Boston employment centres
RAD       index of accessibility to radial highways
TAX       full-value property-tax rate per $10,000
PTRATIO   pupil-teacher ratio by town
B         1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      median value of owner-occupied homes in $1000s
Data Distribution & Summarisation
 Data do vary and this variation can be predictable.
 Three aspects are important to summarise the data and they can be
linked to understand the distribution of data:
 Centering – Average
 Spread – Standard Deviation
 Shape:
 Skewness: Skewness
 Peakedness: Kurtosis
 Predictable distributions are characterized using some or all of these
aspects – called parametric distributions.
 However, it is not always guaranteed that such a parametric distribution
will exist. In such a non-parametric context, the median and quartile range are
of great use.
Data Distribution & Summarisation
 Average or Mean - Sum of all data divided by the number of data points
 Median - Middle data value when the data is ranked from min. to max.
 Mode - Data value which has the maximum frequency.
 Maximum - Largest of the data points.
 Minimum - Smallest of the data points
 Range - Difference between Maximum & Minimum
 Standard Deviation - s = sqrt( Σ (xi – x̄)² / (n – 1) )
Tabular Output of Summary Statistical Calculations

[Minitab dialog: 1. double click on C9 to select it; 2. select "By variable" and double click on C10; 3. click OK]

Does this output agree with the scatter plot points of interest?
Graphical Output of Summary Statistical Calculations

[Minitab dialog: 1. double click on C9; 2. click on "By Variables" and select Ang; 3. click the Graphs button to bring up the Graph dialogue box; 4. click OK to create the final graph]

Compare the summary histogram to the scatter plot values for angle 174 –
more on histograms in the next slide.
Probability Distributions

We need to quantify/verify our conclusions from the descriptive
statistics investigation:

• Introduce probability distributions for use in analytical statistical methods
• Remove the subjectivity from the use of descriptive statistics
investigations
Probability Distributions: Terminology

Given a Random Variable:
• Probability Density Function (PDF)
– The frequency of occurrence for the variable.
• Cumulative Distribution Function (CDF)
– The probability that the variable is less than or equal to some
value
– It is the area under the PDF and ranges from 0 to 1.
• Reliability Function
– The probability that the variable is greater than some value
– It is 1 – CDF
Continuous Distribution: Normal

Description
• The normal distribution (also called the Gaussian distribution) is
the most commonly used distribution in statistics. Two
parameters (μ (mu) and σ (sigma)) are required to specify the
distribution.

The distribution:  p(x; μ, σ²) = (1 / (σ sqrt(2π))) · exp( –(x – μ)² / (2σ²) )

Notes
• The normal distribution closely matches the distribution of many
random processes, especially measuring processes
Normal CDF & Reliability Function

[Plot of the cumulative distribution function, P(x ≤ X), rising from 0 to 1 over x = –4 to 4]
[Plot of the reliability function, P(x > X), falling from 1 to 0 over x = –4 to 4]
Parameters of the Normal Distribution

μ (mu), a measure of central tendency, is the mean or average of all
values in the population. When only a sample of the population is being
described, the mean is more properly denoted by x̄ (x bar).

σ (sigma) is a measure of dispersion or variability. With smaller values
of σ, all values in the population lie closer to the mean. When only a
sample of the population is being described, the standard deviation is
more properly denoted by s.

Both μ and σ are specific values for any given population, and they
change as the members of the population (the distribution) vary.
A Plot of the Normal Distribution

[Normal curve annotated with the mean or average (μ or x̄), the standard deviation (σ), the lower specification limit (LSL), the upper specification limit (USL) and the probability of a defect beyond the USL]
Formal Definitions of Moments: Statistical Expectation

The mean
• also called Expected Value or First Moment
• the mean is a measure of central tendency, i.e., "Where is the
center of the distribution?"

μ = E(X)
  = Σ xi f(xi)            for discrete variables
  = ∫ x f(x) dx           for continuous variables

Variance
• also called Second Moment
• variance is a measure of spread in the distribution

σ² = Var(X) = E((X – μ)²)
   = Σ (xi – μ)² f(xi)    for discrete variables
   = ∫ (x – μ)² f(x) dx   for continuous variables
Higher Order Moments

Third order moment
• the third order moment is a measure of asymmetry in the distribution
• when normalized, a skewness value is computed
• skewness is zero for symmetric distributions

μ3 = E[(X – μ)³],   skewness = μ3 / σ³

Fourth order moment
• the fourth order moment is a measure of the tail-heaviness of the
distribution
• when normalized, a kurtosis value is computed

μ4 = E[(X – μ)⁴],   kurtosis = μ4 / σ⁴
Moments for Distributions

distribution    mean                 variance               skewness                     kurtosis
binomial        np                   np(1 – p)              (1 – 2p)/sqrt(np(1 – p))     3 + (1 – 6p(1 – p))/(np(1 – p))
Poisson         λt                   λt                     1/sqrt(λt)                   3 + 1/(λt)
normal          μ                    σ²                     0                            3
uniform         (xUSL + xLSL)/2      (xUSL – xLSL)²/12      0                            1.8
chi-squared     ν                    2ν                     sqrt(8/ν)                    3(ν + 4)/ν

We typically only worry about mean and variance.

Only if we are experiencing unusually high defect rates
for a given mean and variance do we worry about
skewness and kurtosis. More on the other tabled
distributions later.
Estimators

For a Population Recall
• definition: all possible observations of the random variable (often an infinite
number)
• characterized by the probability density function (PDF) of the random variable
• associated moments can be computed from the PDF
And for a Sample
• definition: a discrete subset of the population
• a PDF cannot be generated from sample data
• however, estimates of the population PDF and moments can be obtained from
sample data

Moment estimators
x̄ = μ̂ = Σ xi / n
s² = σ̂² = Σ (xi – x̄)² / (n – 1)
μ̂3 = Σ (xi – x̄)³ / (n – 1)
μ̂4 = Σ (xi – x̄)⁴ / (n – 1)
Degrees of Freedom Reading Material

σ̂ = s = sqrt( Σ (Xi – X̄)² / (n – 1) )

Why n-1?
The use of n–1 is a mathematical device used for the purpose of deriving an unbiased estimator of the
population variance. In the given context, n–1 is referred to as “degrees of freedom.” When the total
sums-of-squared deviations is given and the pair-wise deviation contrasts are made for n observations,
the last contrast is fixed; hence, there are n–1 degrees of freedom from which to accumulate the total.
More specifically, degrees of freedom can be defined as (n–1) independent contrasts out of n
observations. For example, in a sample with n = 5, measurements X1, X2, X3, X4, and X5 are made. The
additional contrast, X1–X5, is not independent because its value is known from
(X1–X2) + (X2–X3) + (X3–X4) + (X4–X5) = (X1–X5)
Therefore, for a sample of n = 5, there are four (n–1) independent contrasts of “degrees of freedom.” In
this instance, all but one of the contrasts are free to vary in magnitude, given that the total is fixed.
Thus, when n is large, the degree of bias is small; therefore, there is little need for such a corrective
device.
Review Exercise: Another Look at Means and Standard Deviations
Minitab File: Catapultnew2.mtw

[Minitab Row Statistics dialog: 1. click on "Mean"; 2. click on "Input variables" and select Rep1-Rep3; 3. enter C12 in "Store result in:"; 4. click OK; 5. repeat steps 1-4, this time selecting "Standard deviation" and storing the result in C13]
Review Exercise Results

[Table of sample means and sample standard deviations for Operator 1, Operator 2 and Operator 3]

These estimates confirm our earlier conclusions. They agree
with the graphical output from the summary of the descriptive
statistics.
Distributions

Objectives of Module
 Continuation of probability topics: Binomial and Poisson
distributions
 More on the Normal distribution: standardization, z-values and
Normal probability tables, the central limit theorem
 "Theory" behind 6-sigma
 relationship of binomial to Poisson
 relationship of binomial/Poisson to Normal
 introduction to capability: the z statistic
Overview of the six-sigma approach

Defect: Non-conformance to customer CTQs

[Diagram: values between the maximum allowable lower limit and the maximum allowable upper limit (the maximum range of variation around the target value) satisfy the customer; values outside either limit dissatisfy the customer]
Discrete Distribution: Binomial

Description
• assume n independent trials of a test are run, with each trial
having a p chance of failure
• what chance is there of x failures occurring over the n trials?
The binomial distribution describes this:

b(x; n, p) = C(n, x) p^x (1 – p)^(n – x)   for x = 0, 1, 2, …, n
where C(n, x) = n! / (x! (n – x)!)

Example
• We are installing 10 bolts in a system, each of which has a 20%
chance of being installed incorrectly. What is the chance of 2
bolts being incorrectly installed?

b(2; 10, 0.2) = C(10, 2) (0.2)² (1 – 0.2)^(10 – 2) = 0.3020

(A short R check of this example is given below.)
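A minimal R check of the bolt example using the base R binomial functions:

dbinom(2, size = 10, prob = 0.2)           # P(exactly 2 of 10 bolts installed incorrectly) = 0.3020
sum(dbinom(0:10, size = 10, prob = 0.2))   # all probabilities add up to 1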
Binomial PDF

[Bar plot of the binomial(n = 10, p = 0.2) probabilities for x = 0, 1, …, 10: 0.1074, 0.2684, 0.3020, 0.2013, 0.0881, 0.0264, 0.0055, 0.0008, 0.0001, 0.0000, 0.0000]

The sum of all probabilities adds up to 1.0
Moments for the Binomial Distribution
Recall from Module 1 the moments tabled above (see "Moments for Distributions").

The moments for the binomial are obtained directly from the
summation formulas for discrete distributions. The variance is
usually written as npq, where q = 1 – p.
Estimation for the Binomial Distribution
Recall from Module 1 the moment estimators given earlier (see "Estimators").

It can be shown that the estimate for the mean reduces to np̂ and the
variance to np̂q̂. Thus, for the 9 observations of the catapult data
for angle 150 the estimate of the binomial mean is 3.0 and the estimate of
the variance is 2.0.
Overview of the six-sigma approach
Based on the Catapult data at angle 150 the customer CTQ is
not met – the defect probability is too large. What could be
done to meet the given CTQ:
• Somehow reduce the variability of operator 2's performance
• Somehow shift the mean of operator 3
• In the last Module on probability and statistics we will learn how
to make tests to determine if operator 2's variability exceeds
that of operators 1 & 3, and if the mean of operator 3 is biased
upward with respect to operators 1 & 2

For now treat the data as coming from a single process
Overview of the six-sigma approach

The binomial distribution is unwieldy in processes that involve many distinct
steps, each of which has numerous operations:
• Six-sigma treats all defects from all steps and operations
mathematically identically – a defect is a defect, is a defect, …
• Six-sigma assumes that the number of possibilities n for
occurrence of a defect in any step is very large and the
associated p is very small
• With n large, p small, and np constant, the binomial
distribution can be approximated with the Poisson distribution

As a first step Six-sigma replaces the binomial with the Poisson distribution
Discrete Distribution: Poisson

Description
• assume that there are so many opportunities for a defect that
practically they cannot be counted, but that the number of defects over
a given period of time can be readily counted
• what chance is there of x failures occurring over that given period of
time?
As indicated the Poisson distribution describes this:

p(x; λt) = e^(–λt) (λt)^x / x!   for x = 0, 1, 2, …

Example
• I-90 experiences 3.6 traffic accidents per day between Albany and
Buffalo. What are the chances of zero accidents occurring on any given
day?

λ = 3.6 accidents / day, t = 1 day
p(0; 3.6) = e^(–3.6) (3.6)⁰ / 0! = 0.0273

(A short R check of this example is given below.)
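A minimal R check of the traffic-accident example using the base R Poisson functions:

dpois(0, lambda = 3.6)   # P(zero accidents in a day) = 0.0273
ppois(2, lambda = 3.6)   # P(at most two accidents in a day)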
Discrete Distribution: Poisson

The Poisson increment does not have to be time – area,
volume, etc. also work:
• Count the number of yarn defects in a square yard of cloth
• Count the number of chromosome interchanges in cells due to
irradiation by X-rays
• Count the number of bacteria in Petri plates, etc.
Poisson PDF

[Bar plot of the Poisson(λt = 3.6) probabilities for x = 0, 1, 2, …: 0.0273, 0.0984, 0.1771, 0.2125, 0.1912, 0.1377, 0.0826, 0.0425, 0.0191, …, 0.0028]

The sum of all probabilities adds up to 1.0
Moments for the Poisson Distribution
Recall from Module 1 the moments tabled above (see "Moments for Distributions").

The moments for the Poisson are obtained directly from the
summation formulas for discrete distributions. Note that when t = 1
the mean and variance are both equal to λ.
Estimation for the Poisson Distribution
Recall from Module 1 the moment estimators given earlier (see "Estimators").

It can be shown that the estimate for the mean reduces to λ̂t
and the variance to λ̂t, where λ̂t is given by x̄ from the
observed Poisson distribution.
The Poisson Approximation to the Binomial

Consider the binomial random variable describing the defect distribution for the
Catapult data for the angle of 150 and a unit increment Poisson random variable.
Define λ = np in the Poisson distribution. Then

(1 – λ/n)^n → e^(–λ) as n grows,

from the limit formula defining the exponential function. Applying this
formula to the example data (λ̂ = 3) yields:

(2/3)^9 = 0.02601  vs.  e^(–3) = 0.04978

The approximation is not that poor even with a p of 1/3 and an n of 9.
With a defect probability of 1/100 and an n of 100 the left-hand side
becomes 0.36603 and the right-hand side becomes 0.36788 (= 1/e)!

The Poisson can be a very good approximation to the binomial; a short
R check of both comparisons follows.
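A minimal R check of the two comparisons above (dbinom and dpois are base R):

dbinom(0, size = 9, prob = 1/3)      # (2/3)^9 = 0.02601
dpois(0, lambda = 3)                 # e^(-3)  = 0.04979
dbinom(0, size = 100, prob = 0.01)   # 0.36603
dpois(0, lambda = 1)                 # 0.36788 = 1/e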
More on the Normal Distribution: Standardization

Why standardize the Normal distribution:
• Tabled values of the normal distribution are given for a normal
with mean 0 and standard deviation 1
• Standard normal quantiles represent z-values in Six-sigma
• Standardizing facilitates comparisons whatever the original
response units.
Normal PDF

Example
• if μ = 2 and σ = 0.5, what are the chances of a number less than
2.3 occurring?

P(x < 2.3) = ∫ from –∞ to 2.3 of p(x; μ, σ²) dx = 0.7257

[PDF plot of the normal distribution with μ = 2 and σ = 0.5, with the area to the left of 2.3 shaded]

The total area under the curve is 1.0
More on the Normal PDF

Plus or minus sigma?
• often, we refer to confidence intervals in terms of a ±kσ value
• exactly what does ±2σ or ±3σ mean?

[Plot of the normal PDF showing that ±1σ covers 68.3%, ±2σ covers 95.4% and ±3σ covers 99.7% of the area]

Notes
• A range of ±1.960σ covers exactly 95%
• A range of ±2.576σ covers exactly 99%
The Standard Normal Curve

Normalization
• when the following normalization is applied, a new random
variable is generated that has mean zero and variance one:

z = (x – μ) / σ

• this allows us to use standard probability tables to integrate the
area under the curve: in Six-sigma this is called the z-transform

For the earlier example, z = (2.3 – 2) / 0.5 = 0.6 and the area to the
left of 0.6 is 0.7257.

[Standard normal PDF plot over –4 to 4 with the area to the left of 0.6 shaded]

Beware: some tables contain the area of the curve to the left of z, others
the area to the right! (A short R check is given below.)
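A minimal R check of the same probability via the z-transform and pnorm:

pnorm(2.3, mean = 2, sd = 0.5)   # 0.7257, computed directly
z <- (2.3 - 2) / 0.5             # z = 0.6
pnorm(z)                         # 0.7257 from the standard normal
1 - pnorm(z)                     # right-tail area, as tabulated in the single-tail z table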
Single-Tail z Table
(Values of z from 0.00 to 3.99)
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.00 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641
0.10 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
0.20 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
0.30 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
0.40 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
0.50 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
0.60 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
0.70 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
0.80 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
0.90 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611
1.00 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379
1.10 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170
1.20 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985
1.30 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823
1.40 .0808 .0793 .0778 .0764 .0749 .0735 .0721 .0708 .0694 .0681
1.50 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .0559
1.60 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455
1.70 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367
1.80 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .0294
1.90 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .0233
2.00 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183
2.10 .0179 0.0174 .0170 .0166 0.162 .0158 .0154 .0150 .0146 .0143
2.20 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
2.30 .01072 .01044 .01017 .00990 .00964 .00939 .00914 .00889 .00866 .00842
2.40 .00820 .00798 .00776 .00755 .00734 .00714 .00695 .00676 .00657 .00639
2.50 .00621 .00604 .00587 .00570 .00554 .00539 .00523 .00509 .00494 .00480
2.60 .00466 .00453 .00440 .00427 .00415 .00402 .00391 .00379 .00368 .00357
2.70 .00347 .00336 .00326 .00317 .00307 .00298 .00289 .00280 .00272 .00264
2.80 .00256 .00248 .00240 .00233 .00226 .00219 .00212 .00205 .00199 .00193
2.90 .00187 .00181 .00175 .00169 .00164 .00159 .00154 .00149 .00104 .00139
3.00 .00135 .00131 .00126 .00122 .00118 .00114 .00111 .00107 .00144 .00100
3.10 .000968 .000936 .000904 .000874 .000845 .000816 .000789 .000762 .000736 .000711
3.20 .000687 .000664 .000641 .000619 .000598 .000577 .000538 .000538 .000519 .000501
3.30 .000483 .000467 .000450 .000434 .000419 .000404 .000376 .000376 .000362 .000350
3.40 .000337 .000325 .000313 .000302 .000291 .000280 .000260 .000260 .000251 .000242
3.50 .000233 .000224 .000216 .000208 .000200 .000193 .000179 .000179 .000172 .000165
3.60 .000159 .000153 .000147 .000142 .000136 .000131 .000121 .000121 .000112 .000112
3.70 1.08E-4 1.04E-4 9.96E-5 9.58E-5 9.20E-5 8.84E-5 8.16E-5 8.18E-5 7.8E-5 7.53E-5
3.80 7.24E-5 6.95E-5 6.67E-5 6.41E-5 6.15E-5 5.91E-5 5.44E-5 5.46E-5 5.22E-5 5.01E-5
3.90 4.81E-5 4.62E-5 4.43E-5 4.25E-5 4.08E-5 3.91E-5 3.60E-05 3.61E-5 3.45E-5 3.31E-5
Single-Tail z Table
(Values of z from 4.00 to 7.99)

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

4.00 3.17E-5 3.04E-5 2.91E-5 2.79E-5 2.67E-5 2.56E-5 2.45E-5 2.35E-5 2.25E-5 2.16E-5
4.10 2.07E-5 1.98E-5 1.90E-5 1.81E-5 1.74E-5 1.66E-5 1.59E-5 1.52E-5 1.46E-5 1.40E-5
4.20 1.34E-5 1.28E-5 1.22E-5 1.17E-5 1.12E-5 1.07E-5 1.02E-5 9.78E-6 9.35E-6 8.94E-6
4.30 8.55E-6 8.17E-6 7.81E-6 7.46E-6 7.13E-6 6.81E-6 6.51E-6 6.22E-6 5.94E-6 5.67E-6
4.40 5.42E-6 5.17E-6 4.94E-6 4.72E-6 4.50E-6 4.30E-6 4.10E-6 3.91E-6 3.74E-6 3.56E-6
4.50 3.40E-6 3.24E-6 3.09E-6 2.95E-6 2.82E-6 2.68E-6 2.56E-6 2.44E-6 2.33E-6 2.22E-6
4.60 2.11E-6 2.02E-6 1.92E-6 1.83E-6 1.74E-6 1.66E-6 1.58E-6 1.51E-6 1.44E-6 1.37E-6
4.70 1.30E-6 1.24E-6 1.18E-6 1.12E-6 1.07E-6 1.02E-6 9.69E-7 9.22E-7 8.78E-7 8.35E-7
4.80 7.94E-7 7.56E-7 7.19E-7 6.84E-7 6.50E-7 6.18E-7 5.88E-7 5.59E-7 5.31E-7 5.05E-7
4.90 4.80E-7 4.56E-7 4.33E-7 4.12E-7 3.91E-7 3.72E-7 3.53E-7 3.35E-7 3.18E-7 3.02E-7
5.00 2.87E-7 2.73E-7 2.59E-7 2.46E-7 2.33E-7 2.21E-7 2.10E-7 1.99E-7 1.89E-7 1.79E-7
5.10 1.70E-7 1.61E-7 1.53E-7 1.45E-7 1.38E-7 1.30E-7 1.24E-7 1.17E-7 1.11E-7 1.05E-7
5.20 9.98E-8 9.46E-8 8.96E-8 8.49E-8 8.04E-8 7.62E-8 7.22E-8 6.84E-8 6.47E-8 6.13E-8
5.30 5.80E-8 5.49E-8 5.20E-8 4.92E-8 4.66E-8 4.41E-8 4.17E-8 3.95E-8 3.73E-8 3.53E-8
5.40 3.34E-8 3.16E-8 2.99E-8 2.82E-8 2.67E-8 2.52E-8 2.39E-8 2.26E-8 2.13E-8 2.01E-8
5.50 1.90E-8 1.80E-8 1.70E-8 1.61E-8 1.52E-8 1.43E-8 1.35E-8 1.28E-8 1.21E-8 1.14E-8
5.60 1.07E-8 1.01E-8 9.57E-9 9.04E-9 8.53E-9 8.04E-9 7.59E-9 7.16E-9 6.75E-9 6.37E-9
5.70 6.01E-9 5.67E-9 5.34E-9 5.04E-9 4.75E-9 4.48E-9 4.22E-9 3.98E-9 3.75E-9 3.53E-9
5.80 3.33E-9 3.13E-9 2.95E-9 2.78E-9 2.62E-9 2.47E-9 2.32E-9 2.19E-9 2.06E-9 1.94E-9
5.90 1.82E-9 1.72E-9 1.62E-9 1.52E-9 1.43E-9 1.35E-9 1.27E-9 1.19E-9 1.12E-9 1.05E-9
6.00 9.90E-10 9.31E-10 8.75E-10 8.23E-10 7.73E-10 7.27E-10 6.83E-10 6.42E-10 6.03E-10 5.67E-10
6.10 5.32E-10 5.00E-10 4.70E-10 4.41E-10 4.14E-10 3.89E-10 3.65E-10 3.43E-10 3.22E-10 3.02E-10
6.20 2.83E-10 2.66E-10 2.50E-10 2.34E-10 2.20E-10 2.06E-10 1.93E-10 1.81E-10 1.70E-10 1.59E-10
6.30 1.49E-10 1.40E-10 1.31E-10 1.23E-10 1.15E-10 1.08E-10 1.01E-10 9.49E-11 8.89E-11 8.33E-11
6.40 7.80E-11 7.31E-11 6.85E-11 6.41E-11 6.00E-11 5.62E-11 5.26E-11 4.92E-11 4.61E-11 4.31E-11
6.50 4.04E-11 3.78E-11 3.53E-11 3.30E-11 3.09E-11 2.89E-11 2.70E-11 2.53E-11 2.36E-11 2.21E-11
6.60 2.07E-11 1.93E-11 1.81E-11 1.69E-11 1.58E-11 1.47E-11 1.38E-11 1.29E-11 1.20E-11 1.12E-11
6.70 1.05E-11 9.79E-12 9.14E-12 8.53E-12 7.96E-12 7.43E-12 6.94E-12 6.48E-12 6.04E-12 5.64E-12
6.80 5.26E-12 4.91E-12 4.58E-12 4.27E-12 3.98E-12 3.71E-12 3.46E-12 3.23E-12 3.01E-12 2.81E-12
6.90 2.62E-12 2.44E-12 2.27E-12 2.12E-12 1.97E-12 1.84E-12 1.71E-12 1.59E-12 1.49E-12 1.38E-12
7.00 1.29E-12 1.20E-12 1.12E-12 1.04E-12 9.68E-13 9.01E-13 8.38E-13 7.80E-13 7.62E-13 6.75E-13
7.10 6.28E-13 5.84E-13 5.43E-13 5.05E-13 4.70E-13 4.37E-13 4.06E-13 3.78E-13 3.51E-13 3.26E-13
7.20 3.03E-13 2.82E-13 2.62E-13 2.43E-13 2.26E-13 2.10E-13 1.95E-13 1.81E-13 1.68E-13 1.56E-13
7.30 1.45E-13 1.35E-13 1.25E-13 1.16E-13 1.08E-13 9.99E-14 9.27E-14 8.60E-14 7.98E-14 7.40E-14
7.40 6.86E-14 6.37E-14 5.90E-14 5.47E-14 5.07E-14 4.70E-14 4.36E-14 4.04E-14 3.75E-14 3.47E-14
7.50 3.22E-14 2.98E-14 2.76E-14 2.56E-14 2.37E-14 2.19E-14 2.03E-14 1.88E-14 1.74E-14 1.61E-14
7.60 1.49E-14 1.38E-14 1.28E-14 1.18E-14 1.10E-14 1.01E-14 9.38E-15 8.68E-15 8.03E-15 7.42E-15
7.70 6.86E-15 6.35E-15 5.87E-15 5.43E-15 5.02E-15 4.64E-15 4.29E-15 3.96E-15 3.66E-15 3.38E-15
7.80 3.12E-15 2.89E-15 2.67E-15 2.46E-15 2.27E-15 2.10E-15 1.94E-15 1.79E-15 1.65E-15 1.53E-15
7.90 1.41E-15 1.30E15 1.20E-15 1.11E-15 1.02E-15 9.42E-16 8.69E-16 8.01E-16 7.39E-16 6.82E-16
z Transformation

Instructions: locate the z-value in the Single-Tail z Table using this three-step process.
1. Find the whole number and the first decimal place in the first column (titled z).
2. Locate the appropriate column title for the second decimal place (columns are titled 0.00 to 0.09).
3. The correct probability is located at the intersection of the row and column you have chosen.

Example: locate 0.60 in the Single-Tail z Table using the same three steps.
1. Find the value 0.6 in the first column. We will call this row 0.6.
2. Locate 0.00, which is the first column. We will call this column 0.00.
3. The intersection of row 0.6 and column 0.00 is 0.2743. The area to the left of 0.60 is
1 - 0.2743 = 0.7257.
Exercise 1: z Transformation

1. For a measured characteristic, the mean is 18.61 and the standard deviation of the process
is 1.00. Use the formula listed below to convert the measurement, 20.00, into a z-value.

z = (x – μ) / σ
z = _____

2. Locate the probability in the Single-Tail z-table on the previous pages.

3. If z = 1.39, what is the probability that a randomly selected member of the population will be
greater than or equal to z?

Answer: _____________

[Standard normal curve sketch marked from –3σ to +3σ]
Exercise 1: z Transformation

4. For the same z (z = 1.39), what is the probability that a randomly
selected member of the population will be less than or equal to z?

Answer: _____________

[Standard normal curve sketch marked from –3σ to +3σ]
Answers to Exercise 1: z Transformation

1. z = (x – μ) / σ = (20.00 - 18.61) / 1.00 = 1.39

2. Locate the probability using the instructions on
page 3.23. (.0823)

3. .0823 or 8.23%

4. 1.00 - 0.0823 = 0.9177 or 91.77%
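The same answers can be checked in R with pnorm (a minimal sketch):

z <- (20.00 - 18.61) / 1.00   # 1.39
1 - pnorm(z)                  # P(X >= z) = 0.0823
pnorm(z)                      # P(X <= z) = 0.9177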
z-Values and Their Application

A z-value is the distance from a specification limit to the mean or target
of a distribution, expressed in multiples of the standard deviation of the
distribution. If x is the upper specification limit, you can use the z-value
to determine the probability of producing product above the upper
specification.

[Normal curve with the defect probability shaded above the USL]

Z_USL = (USL – μ) / σ
z-Values and Their Application

If x is the lower specification limit, you can use the z-value to determine
the probability of producing product below the lower specification.

[Normal curve with the defect probability shaded below the LSL]

Z_LSL = (μ – LSL) / σ
z-Values and Their Application

For a two-sided distribution, the sum of the probabilities for the upper and
lower specification limits tells you the total probability of producing out-
of-spec product.

[Normal curve centred on the target with defect probabilities shaded below the LSL and above the USL]
Process Capability Catapult Example

From the descriptive statistics graphs option in Minitab, grouping by Angle:
x̄ = 50.53, s = 2.79 for the nine observations at angle 150.

Z_USL = (USL – x̄) / s = (53 – 50.53) / 2.79 = _______     P(defect_USL) = _______
Z_LSL = (x̄ – LSL) / s = (50.53 – 47) / 2.79 = _______     P(defect_LSL) = _______
P(defect_TOTAL) = _______
Z = _______ (from Z table)

P(defect_TOTAL) is analogous to the defect probability for the
Catapult example studied in the first module on probability and statistics.
(A short R sketch of this calculation is given below.)
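A minimal R sketch of the capability calculation, using the values quoted in the example above:

xbar <- 50.53; s <- 2.79; USL <- 53; LSL <- 47
z_usl <- (USL - xbar) / s                            # Z for the upper specification limit
z_lsl <- (xbar - LSL) / s                            # Z for the lower specification limit
p_defect <- (1 - pnorm(z_usl)) + (1 - pnorm(z_lsl))  # total out-of-spec probability
qnorm(1 - p_defect)                                  # equivalent single Z value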
The Central Limit Theorem
Definition
• under certain conditions, if n random numbers are added or averaged
together, the distribution of their sum/average approaches a normal
distribution as n approaches infinity
Comments
• the value of n required for effective normality varies with the
distribution of the random numbers included in the summation
• for example, n=5 works with random numbers that come from a
uniform distribution, while n=100+ is required for numbers that
come from an exponential distribution
• for the Cauchy distribution no n, no matter how large, works
Practical meaning
• many of the statistical quantities used in DFSS are effectively
summations of random numbers
• therefore, these quantities can be treated as normally distributed
even though the original numbers on which they are based are not
normally distributed! (A small simulation sketch follows.)
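A minimal R simulation of the central limit theorem, using the uniform and exponential examples mentioned above (sample sizes are illustrative):

par(mfrow = c(1, 2))
averages_unif <- replicate(10000, mean(runif(5)))    # n = 5 is usually enough for uniform data
hist(averages_unif, breaks = 40, main = "Means of 5 uniforms")
averages_exp <- replicate(10000, mean(rexp(100)))    # exponential data needs n = 100+
hist(averages_exp, breaks = 40, main = "Means of 100 exponentials")
par(mfrow = c(1, 1))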
The Central Limit Theorem Applied to the Catapult Example

The central limit theorem is applied in the standardized form.
Let's consider evaluating the probability of zero defects for the
Catapult example:
• Binomial: P{0} = 0.02601; Normal approximation: ∫ from –∞ to 1/2 of p(x; 3, 2) dx = 0.03855
• Poisson: P{0} = 0.04979; Normal approximation: ∫ from –∞ to 1/2 of p(x; 3, 3) dx = 0.07446

Finally, Six-sigma replaces the Poisson with the Normal

[Plot: the Normal approximation to the binomial for the Catapult data at angle 150, p̂ = 1/3]
[Plot: the Normal approximation to the Poisson for the Catapult data at angle 150, λ̂ = 3.0]
Conclusions associated with the Central Limit Theorem
Recall that the Normal (assumed actual distribution) fit from the z-value
calculations gives: P{0} = (1 – 0.2910)^9 = 0.04527

• Thus the binomial, Poisson, and normal distributions
applied directly to the Catapult example data for angle
150 give fairly comparable results.

• Also, generally in the DFSS setting the Poisson
approximation to the binomial is good, while the normal
approximation, especially to the Poisson, is not good
unless the Poisson mean is large (>> 1).

Treat z-values only as capability measures!
Using Z as a Measure of Capability

Z = (SL – μ) / σ

[Figure: two normal curves plotted against a target T and USL, one with Z = 3 (3σ capability) and one with Z = 6 (6σ capability). As variation decreases, capability increases and, as a consequence, the standard deviation (σ) gets smaller which, in turn, decreases the probability of a defect.]

More on z-values in the Process Capability module
Use of the Poisson Distribution

The direct application of the Poisson approximation to the binomial used in
the Catapult example is not generally used in Six-sigma; instead:
• In the example assume there is no measurement error. Then all
variability is due to differences in setting the cocking angle to 150
(ignoring the peculiarities associated with operators 2 & 3). Now
assume that setting the cocking angle consists of many small steps (m),
each of which can be done correctly or incorrectly (a failure) with
probability p. Apply the Poisson approximation to this binomial
distribution to yield λ = mp.
• This second binomial distribution may or may not be observed. The
DFSS approach when this binomial is observed will be studied in the
Process Capability module. In this case λ is called defects per unit or
DPU. DPU is NOT a probability.
• When the second process is not observed, set np = λ = mp.
Use of the Poisson Distribution in Six-sigma (con't)

np = λ = mp

p not observed:
• Observations of the characteristic defining the CTQ are available, as in the
Catapult example: proceed as in the example using either the Normal
directly (z-value calculations) or the Poisson or Normal approximations.
• Only success/failure observations are available: can only use the Poisson
or Normal approximations (no direct observations on the CTQ).

p observed:
• Follow the methodology developed in the Process Capability module – it
ends with the Normal approximation.

Both paths end in z-values and the Normal approximation
Recap of Session

Review of the use of the Poisson distribution in Six-sigma
• Based on many "small" opportunities for a single possible
defect
• Defects at this second level may or may not be observed

Look for "defects" in the Process Capability module!!!
Histogram
Purpose: To display variation in a data set. Converts an unorganized set of data or group
of measurements into a coherent picture.

When: To determine if the process is on target meeting customer requirements. To
determine if variation in the process is normal or if something has caused it to vary in
an unusual way.

How: Count the number of data points
Determine the range (R) for the entire set
Divide the range value into classes (K)
Determine the class width (H) (H = R/K)
Determine the class boundary, end point: (first point = lowest data value + H)
Construct a frequency table based on the values computed in the previous step
Construct a histogram based on the frequency table
(A short R sketch of these steps is given below.)

[Histogram of test grades: number of students (1 to 14) against grade classes from 55 to 100]
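A minimal R sketch of the same construction steps, on simulated test grades (the data are illustrative):

grades <- sample(55:100, 80, replace = TRUE)
N <- length(grades)                  # 1. count the data points
R <- diff(range(grades))             # 2. range for the entire set
K <- ceiling(sqrt(N))                # 3. number of classes
H <- R / K                           # 4. class width
breaks <- min(grades) + H * (0:K)    # 5. class boundaries
table(cut(grades, breaks, include.lowest = TRUE))    # 6. frequency table
hist(grades, breaks = breaks, main = "Test Grades")  # 7. histogram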
Histogram Example
Minitab File: Catapultnew2.mtw
Histogram Output

[Minitab dialog: double click on the C9 (Dist) line to select it as the variable for each group, enter Ang under "Group variables", and click OK]

Does the angle make a difference?
Histogram Exercise

Make histograms of the nine groupings of OpXang using
the descriptive statistics graphics option.

What are your observations?

How do these histograms compare with the scatter
plot?
Dot Plot
Purpose: To display variation in a process.
Quick graphical comparison of two or more processes.

When: First stages of data analysis.

How: Create an X axis


Scale the axis per the range in the data
Place a dot for each value along the X axis
Stack repeat dots
.
.:
.: .:: :: :
:: .:: ..:::.:: : .
.: . :.:: :::::::::::::::..:.:.: ..
+---------+---------+---------+---------+---------+-------
patrn24
.
:
:
:. : .
. .: : : : :: : : : : ..
: .: :: :.::. :::.:::: : :. .: : ::
.:::::::.::::: ::::::::::: ::.::: :: .::: .
. ::::::::::::::::::::::::::::::::::::::.:::: :
+---------+---------+---------+---------+---------+-------
patrn60
0.0000 0.0050 0.0100 0.0150 0.0200 0.0250
Dot Plot Example
Minitab File: Catapultnew2.mtw
Dot Plot Output

[Minitab dialog: 1. double click on C9; 2. check "By variable" and select Ang; 3. check "Same scale for all variables"; 4. click OK]

Contrast the dot plot results with those from your histogram analyses.
Box and Whisker Plot

Purpose: To begin an understanding of the distribution of
the data.
To get a quick, graphical comparison of two processes.
When: First stages of data analysis
How: Minitab!

The Box and Whisker plot allows us to visually compare the
averages and variability of two or more data sets. It will also
identify unusual data points (outliers).
Box and Whisker Plot
Max and min observations computed from a formula by Minitab

[Annotated box plot showing, from top to bottom: an outlier (*), the maximum observation, the 75th percentile, the median (50th percentile), the 25th percentile and the minimum observation]
Box and Whisker Plot Example
Minitab File: Catapultnew2.mtw

[Minitab dialog: double clicking a variable under "Y" chooses the variable to graph; double clicking a variable under "X" chooses the grouping variable; then click OK]
Box and Whisker Plot Output

What are your observations? Comments about Operator 2? How do these
results compare to our earlier conclusions?
Box and Whisker Plot Output: Summary
Points of Interest from Scatter Plot:
•Overall all operators probably have similar within cell variability (important for
ANOVA and regression analyses)
•However, operator 1 probably has somewhat less within cell variability than the
other two
•Operator 3 appears to produce biased results--possibly due to a different stop
position than operators 1 & 2 used (Beginning RCA)
•Apparently three different quadratic curves (regression analysis) would explain
all but the within cell variability

The box plot shows much the same results as the scatter plot, but with more
detail with respect to the operator within cell variability. Thus I would amend the
above comments:
•Overall cell variability seems to increase in going from operator 1 to 3 to 2--a
test of this can be made. For the time being assume all within cell variability is
the same.
•Inspection of both plots indicates a change in the “relationship” to angle for
each operator as the angle increases. This is called interaction in statistical
jargon.
When all else fails, look at the data--always!
Basic Statistics and Visualization in R
Contents

Exploring individual variable


Exploring multiple variables
Visualization
Exploring individual variable
First extract individual variable as,
>datafilename$variablename

>Bank$Amount

Function
Mean, median and range: mean(), median(), range()
Quartiles and percentiles: quantile()
Exploring individual variable
Attach any dataset and see the structure of the data. In the mtcars
example used below, all variables are numeric. (A short sketch follows.)
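A minimal sketch of attaching a dataset and inspecting its structure, using the built-in mtcars data as in the examples that follow:

data(mtcars)
attach(mtcars)   # makes columns such as mpg directly accessible
str(mtcars)      # structure of the data: 32 observations, all numeric variables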
Exploring individual variable
Consider variable ‘mpg’.

Mean:
> length(mpg) ##number of observation
[1] 32
> sum(mpg)/32
[1] 20.09062
> mean(mpg)
[1] 20.09062
> median(mpg)
[1] 19.2
> range(mpg) ## returns the minimum and maximum of mpg
[1] 10.4 33.9
Exploring individual variable
> var(mpg)
[1] 36.3241
> sqrt(var(mpg))
[1] 6.026948
>quantile(mpg)
    0%    25%    50%    75%   100%
10.400 15.425 19.200 22.800 33.900     (the 50% value is the median)

> quantile(mpg, c(0.05, 0.95))
    5%    95%
11.995 31.300
Exploring individual variable
In R mode can not be directly obtained, need to write a function
for mode.

>Mode<-function(x){ux<-unique(x)
ux[which.max(tabulate(match(x, ux)))] }
>Mode(mpg)
[1] 21
Exploring multiple variable
The data contain more than one variable.

Function summary()
numeric variables: minimum, maximum, mean, median, and
the first (25%) and third (75%) quartiles
categorical variables (factors): frequency of every level
Exploring multiple variable
Aggregating Data
It is relatively easy to collapse data in R using one or more BY variables and
a defined function.
# aggregate data frame iris by Species returning means
# for numeric variables

> library(MASS)
> z<-aggregate(iris[,-5], by=list(Species), FUN=mean, na.rm=TRUE)
>z
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Exploring multiple variable

[summary() output: numeric variables show the minimum, quartiles, median, mean and maximum; categorical variables (factors) show the frequency of every level]
Exploring multiple variable
Correlation:
It is a measure of linear relationship between two or more
numeric variables.

[Output screenshots: correlation between 2 variables; correlation matrix for more than 2 variables]
(A short sketch with cor() is given below.)
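A minimal sketch of both uses of cor(), on mtcars (the choice of variables is illustrative):

cor(mtcars$mpg, mtcars$wt)            # correlation between two variables
cor(mtcars[, c("mpg", "wt", "hp")])   # correlation matrix for more than two variables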
Basic Statistics

Consider any data set and try these basic statistics.
Visualization

Bar-plot:
Bar plots need not be based on counts or frequencies. You can create
bar plots that represent means, medians, standard deviations, etc.

1) Simple Bar Plot
2) Stacked Bar Plot
3) Grouped Bar Plot
Visualization
Simple bar plot:
> counts <- table(mtcars$gear)
> barplot(counts, main="Car Distribution", xlab="Number of Gears")
Visualization
By default, the categorical axis line is suppressed. Include the
option axis.lty=1 to draw it.
Visualization
# Simple Horizontal Bar Plot with Added Labels
>counts <- table(mtcars$gear)
>barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"), axis.lty=1)
Visualization
Stacked Bar Plot
# Stacked Bar Plot with Colors and Legend
> counts <- table(mtcars$vs, mtcars$gear)
> barplot(counts, main="Car Distribution by Gears and VS", xlab="Number of Gears",
    col=c("darkblue","red"), legend = rownames(counts), axis.lty=1)
Visualization
Grouped Bar Plot
> counts <- table(mtcars$vs, mtcars$gear)
> barplot(counts, main="Car Distribution by Gears and VS", xlab="Number of Gears",
    col=c("darkblue","red"), legend = rownames(counts), beside=TRUE, axis.lty=1)
Visualization
Scatter plot and Line chart
> x <- seq(1, 10)        # create some data
> y <- 2*x
> plot(x, y)             # scatter plot
> lines(x, y, type="l")  # other types: "p","l","o","b","c","s","S","h"
Visualization
plot(x, y) gives a scatter plot; plot(x, y, type="l") gives a line chart.
Use par(mfrow=c(m, n)) to show several plots together in one window.
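A minimal sketch (reusing the x and y created on the previous slide):

> par(mfrow=c(1, 2))       # 1 row, 2 columns of plots
> plot(x, y)               # scatter plot in the first panel
> plot(x, y, type="l")     # line chart in the second panel
> par(mfrow=c(1, 1))       # reset to a single plot per window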
Visualization

Get help on plot and see argument “type” which can be added in plot. Try
same for different plots.
Visualization

Simple Pie Chart

# Simple Pie Chart


>slices <- c(10, 12,4, 16, 8)
>lbls <- c("US", "UK", "Australia", "Germany", "France")
>pie(slices, labels = lbls, main="Pie Chart of Countries")
Visualization
Pie Chart with Annotated Percentages
> slices <- c(10, 12, 4, 16, 8)
> lbls <- c("US", "UK", "Australia", "Germany", "France")
> pct <- round(slices/sum(slices)*100)
> lbls <- paste(lbls, pct) # add percents to labels
> lbls <- paste(lbls,"%",sep="") # add % to labels
> pie(slices,labels = lbls, col=rainbow(length(lbls)), main="Pie Chart of
Countries")
Visualization
3D Exploded Pie Chart

>library(plotrix)
>slices <- c(10, 12, 4, 16, 8)
>lbls <- c("US", "UK", "Australia", "Germany", "France")
>pie3D(slices,labels=lbls, explode=0.1, main="Pie Chart of Countries ")
Visualization
Creating Annotated Pies from a data frame
# Pie Chart from data frame with Appended Sample Sizes
>mytable <- table(iris$Species)
>lbls <- paste(names(mytable), "\n", mytable, sep="")
>pie(mytable, labels = lbls, main="Pie Chart of Species\n (with sample
sizes)")
Visualization
Histogram

# Simple Histogram
> hist(mtcars$mpg)

# Colored Histogram with Different Bins
> hist(mtcars$mpg, breaks=12, col="red")
Visualization
# Add a Normal curve
>x <- mtcars$mpg
>h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon", main="Histogram with Normal Curve")
>xfit<-seq(min(x), max(x), length=40)
>yfit<-dnorm(xfit, mean=mean(x), sd=sd(x))
>yfit <- yfit*diff(h$mids[1:2])*length(x)
>lines(xfit, yfit, col="blue", lwd=2)
Visualization
Boxplot
# Boxplot of MPG by Car Cylinders
> boxplot(mpg~cyl, data=mtcars, main="Car Milage Data",
    xlab="Number of Cylinders", ylab="Miles Per Gallon")
Visualization
# Notched Boxplot of Tooth Growth Against 2 Crossed Factors
# boxes colored for ease of interpretation
> boxplot(len~supp*dose, data=ToothGrowth, notch=TRUE, col=c("gold","darkgreen"),
    main="Tooth Growth", xlab="Supplement and Dose")
Visualization
Saving plots:
Plots can be saved in these formats: pdf, jpeg, bmp, png.
The plot is written to the given location (file path) in the specified format.
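A minimal sketch using base R graphics devices (the file name is only an example path):

> pdf("mpg_hist.pdf")      # open a pdf device at the chosen location
> hist(mtcars$mpg)         # any plotting commands now draw into the file
> dev.off()                # close the device to finish writing the file
# png(), jpeg() and bmp() work the same way for the other formats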
Visualization

Take any data set from R and try all visualization techniques
Test of Hypothesis
Handout
EXPLANATIONS OF TOH

 Because of variation, no two things will be exactly alike.


 The question is whether differences you see between
samples, groups, processes, etc., are due to random,
common-cause variation, or if there is a real difference.
 To help us make this decision, various hypothesis tests
provide ways of estimating common-cause variation for
different situations.
 They test whether a difference is significantly bigger than the
common-cause variation we would expect for the situation.
 If the answer is no, there is no statistical evidence of a difference.
 If the answer is yes, conclude the groups are significantly different.
 Hypothesis tests take advantage of larger samples because
the variation among averages decreases as the sample
size increases.
EXPLANATIONS OF TOH

 Tests the null hypothesis—


H0: no difference between groups

 Against the alternative hypothesis—


Ha: groups are different

 Obtain a P-value for the null hypothesis—


 Use the data and the appropriate hypothesis test
statistic to obtain a P-value (We’ll use Minitab to do
this.)
 If P < .05, reject H0 and conclude Ha
 If P ≥ .05, we cannot reject H0
TWO TYPES OF ERRORS

 There are four possible outcomes to any


decision we make based on a hypothesis test:
We can decide the groups are the same or
different, and we can be right or wrong.
                                          Actual (Truth)
                                   Groups are Same   Groups are Different

Conclusion   Accept H0
or           (Groups are Same):       No Error          Type II Error
Decision     Reject H0
             (Groups are Different):  Type I Error      No Error
TEST OF NORMALITY

Definition:
The Normal Curve is a probability distribution where
the most frequently occurring value is in the
middle and other probabilities tail off
symmetrically in both directions. This shape is
sometimes called a bell-shaped curve.
Z – STANDARD NORMAL VARIATE

A "Standard" Normal Distribution has
    mean = 0
    st. dev. = 1
and its horizontal scale runs from about -3 to +3; a Z-value can fall anywhere on this scale.

 Z-value: how many standard deviations the value-of-interest is away from the mean:

    Z = (value of interest − X̄) / S
AREA CALCULATION

 The area under the standard Normal curve = probability; the total area = 1.

 What is the probability a Z-value will be ≤ zero?
   The area to the left of 0 is .5, so the probability = .5 or 50%.

 What is the probability a Z-value will be ≥ 2.84?
   It is the area under the curve beyond that value-of-interest (2.84).
P - VALUE

 P-value =
  Tail area
  Area under the curve beyond the value-of-interest
  Probability of being at the value-of-interest or beyond

 For a one-sided test the P-value is a single tail area; for a two-sided test it is the sum of the two tail areas (A + B) beyond the two values-of-interest.

 A small P-value (0 to .05) means:
  The probability is small that the value-of-interest did, indeed, come from that distribution. . . . It is likely to have come from some other distribution.
Steps of Hypothesis test

1. Set Hypothesis: H0: with equality against H1 < or > or <>


2. Decide the type of hypothesis.
3. Collect data
4. Test hypothesis (Run minitab)
5. Examine p-value.
6. Conclude- If p<0.05, reject Null Hypothesis (H0).
NORMAL PROBABILITY PLOT

Here is a sample Normal probability plot generated in Minitab (n = 25).


Graph > Probability Plot
 If the data are Normal, the points will fall on a “straight” line.
 “Straight” means within the 95% confidence bands.
 You can say the data are Normal if approximately 95% of the data

points fall within the confidence bands.

[Normal probability plot (Minitab), ML Estimates: Mean = 40.1271, StDev = 4.86721; Percent on the Y-axis, Data on the X-axis, with 95% confidence bands around the fitted line.]
NORMAL PROBABILITY PLOT

[Normal probability plot with ten equally spaced percentiles from the Normal distribution marked; ML Estimates: Mean = 40.1271, StDev = 4.86721.]

 Data values are on the X-axis.
 Percentiles of the Normal distribution are on the Y-axis (the unequal spacing of the lines is deliberate).
 Equally spaced percentiles divide the Normal curve into equal areas.
 The percentiles match the percents on the vertical axis of the Normal probability plot.
NORMAL PROBABILITY PLOT

[Two Normal probability plots side by side:
  left  – ML Estimates: Mean = 40.1271, StDev = 4.86721
  right – ML Estimates: Mean = 1.13627, StDev = 1.07363]

Conclusion for the left plot: not a serious departure from Normality.
Conclusion for the right plot: there is a serious departure from Normality.
EXAMPLE OF NORMALITY TEST

Data of weekly expenditures are as under:

Week   Expenditure (in Lakh Rs.)
1      4.25
2      4.78
3      3.95
4      3.86
5      3.72
6      5.17
7      5.07
8      4.65
9      4.70
10     4.35

We want to test whether the data follow a Normal distribution or not.

Conclusion
– Not a serious departure from Normality.
EXAMPLE OF NORMALITY TEST

Conclusion
– P>0.05 indicates the data comes from Normal
population.
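As a cross-check outside Minitab, a minimal R sketch for the same ten weekly values could use a Q-Q plot and the Shapiro-Wilk test:

> exp_lakh <- c(4.25, 4.78, 3.95, 3.86, 3.72, 5.17, 5.07, 4.65, 4.70, 4.35)
> qqnorm(exp_lakh); qqline(exp_lakh)   # points close to the line suggest Normality
> shapiro.test(exp_lakh)               # p > 0.05: no evidence against Normality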
WHAT IS TEST OF HYPOTHESIS?

 A hypothesis test is a procedure that


summarizes data so you can detect
differences among groups.
 A hypothesis test is used to make comparisons between two or more groups.

Type of Data   What You Can Compare      Example
Discrete       Proportions               Is the % on-time deliveries for Supplier A the same as for Supplier B?
Continuous     Averages                  Is the average production volume the same for all three shifts?
Continuous     Variation                 Do results from the group using the New Method vary less than the results from the group using the Old Method?
Continuous     Shapes or Distributions   How does the distribution of cycle time compare for various methods?
1 Sample t-Test

 Situation: the average of a sample is to be compared with a specific value when the standard deviation is not known and has to be estimated from the available data. e.g. we want to know whether the run length of a sample of 20 plates crosses 1.2 lakh or not.

 We judge the difference between the average of the 20 run-length values and 1.2 by using a statistical test called a t-test (we will use Minitab to do it).

 H0: Average = 1.2 against H1: Average <> 1.2

    t(n-1) = |X̄ − 1.2| / (s.d. / √n)
1 Sample t-Test

 The t-distribution:
  Has more variation than a Z distribution (and thus different areas in the tails, meaning different P-values).
  Its spread of variation depends on the "degrees of freedom" (df).
  We won't go into details here about df, but think of it as the amount of information left in the sample after estimating the means and standard deviations.
  It is appropriate to use since we estimated the means and variances (proven through statistical theory).
DATA OF 1 SAMPLE t-TEST

Sample Plates Run Length


1 1.25
2 1.2
3 1.3
4 1.35
5 1.18
6 1.28
7 1.24
8 1.2
9 1.32
10 1.28
11 1.16
12 1.32
13 1.37
14 1.35
15 1.27
16 1.26
17 1.21
18 1.34
19 1.3
20 1.2
DATA OF 1 SAMPLE t-TEST

One-Sample T: Run Length

Variable     N    Mean    StDev   SE Mean   95.0% CI
Run Length   20   1.2690  0.0627  0.0140    (1.2397, 1.2983)

One-Sample T: Run Length

Test of mu = 1.2 vs mu not = 1.2

Variable     N    Mean    StDev   SE Mean
Run Length   20   1.2690  0.0627  0.0140

Variable     95.0% CI            T      p-value
Run Length   (1.2397, 1.2983)    4.93   0.000

Conclusion
– P = 0.000 (to be interpreted as p < 0.001) indicates the average run length is confidently above 1.2 lakh.
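A minimal R sketch of the same one-sample test (the 20 run-length values are taken from the data slide above):

> run_length <- c(1.25, 1.20, 1.30, 1.35, 1.18, 1.28, 1.24, 1.20, 1.32, 1.28,
+                 1.16, 1.32, 1.37, 1.35, 1.27, 1.26, 1.21, 1.34, 1.30, 1.20)
> t.test(run_length, mu = 1.2)   # two-sided one-sample t-test of mean = 1.2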
2 SAMPLE t - TEST

 The t-test
 Is a test of hypothesis for comparing two
averages.
 The hypothesis is that the two group averages
are the same.
 Their difference = 0
 If P-value is low, reject the hypothesis.
 By convention, a P-value is considered low if it is < .05
 Common notation
    Null hypothesis         H0: meanA = meanB
    Alternative hypothesis  Ha: meanA ≠ meanB
T - TEST

We judge the difference between two group averages by using a statistical test called a t-test (we will use Minitab to do it):

    t(A-B) = ( (X̄A − X̄B) − 0 ) / S(A−B)

where S(A−B) is the standard error of the difference between the two averages.

 The t-distribution:
 Has more variation than a Z distribution (and thus
different areas in the tails, meaning different P-values).
 Its spread of variation depends on the “degrees of
freedom” (df).
 We won’t go into details here about df, but think of it as
the amount of information left in the sample after
estimating the means and standard deviations of the two
groups.
T - TEST

Stat > Basic Statistics > 2 Sample t… > Graphs > (Select both plots)

Notes on the dialog:
– Use the "samples in one column" option if all the data are in one column and the group labels are in another column (our data are not set up this way).
– Leave the check box unchecked (the example does not assume equal variances, as the later output with DF = 117 shows).
T - TEST

[Boxplots of Std and New (means indicated by solid circles) and dotplots of Std and New (means indicated by lines); both groups range roughly from 10 to 20.]
T - TEST

Session Window Output

Two Sample T-Test and Confidence Interval

Two sample T for Std vs New
      N     Mean    StDev   SE Mean
Std   100   15.03   1.88    0.19
New   50    14.11   1.54    0.22

(Standard error of the mean = st. dev. of the average = s / √n)

95% CI for mu Std - mu New: ( 0.35, 1.49)
T-Test mu Std = mu New (vs not =): T = 3.19  P = 0.0018  DF = 117

The confidence interval is for the average difference Std - New; the value of t is discussed on the next page. Draw conclusions by looking at the P-value: is it < .05?

Conclusions
Since the P-value is small (< .05), conclude there is a statistically significant difference in the average research time between the two methods. (Or, "the average research time is not the same for the two methods.")
Note: The P-value is different from the .0007 reported before because we more appropriately used the t-distribution instead of the Normal (Z) distribution.
EXAMPLE OF 2-SAMPLE t - TEST

• Data: 10 sample plates from the same batch are tested on two printers to see whether the average run length differs significantly or not.

• Hypothesis: H0: Average for Printer-1 = Average for Printer-2


against H1: Average for Printer-1 <> Average for Printer-2

• Data:
Run Length_Pr-A Run Length_Pr-B
1.25 1.32
1.2 1.37
1.3 1.35
1.18 1.27
1.28 1.26
1.24 1.21
1.2 1.34
1.32 1.3
1.28 1.2
1.16 1.35
EXAMPLE OF 2-SAMPLE t - TEST
EXAMPLE OF 2-SAMPLE t - TEST

Two-Sample T-Test and CI: Run Length_Pr-A, Run Length_Pr-B


Two-sample T for Run Length_Pr-A vs Run Length_Pr-B
N Mean StDev SE Mean
Run Leng 10 1.2410 0.0543 0.017
Run Leng 10 1.2970 0.0600 0.019

Difference = mu Run Length_Pr-A - mu Run Length_Pr-B


Estimate for difference: -0.0560
95% CI for difference: (-0.1100, -0.0020)
T-Test of difference = 0 (vs not =): T-Value = -2.19 P-Value = 0.043 DF = 17

• Conclusion: Printer-B’s run length is significantly higher than that of


Printer-A.
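A minimal R sketch of the same comparison (the two columns of run lengths are taken from the data slide above):

> prA <- c(1.25, 1.20, 1.30, 1.18, 1.28, 1.24, 1.20, 1.32, 1.28, 1.16)
> prB <- c(1.32, 1.37, 1.35, 1.27, 1.26, 1.21, 1.34, 1.30, 1.20, 1.35)
> t.test(prA, prB)   # Welch two-sample t-test; compare the p-value with 0.05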
ONE WAY ANOVA

 Situation: the averages of more than two samples are to be compared. e.g. we want to know whether the run lengths of three presses, with 10 plates each, differ significantly or not.
 Data:

Run Length_Pr-A Run Length_Pr-B Run Length_Pr-C


1.25 1.32 1.3
1.2 1.37 1.18
1.3 1.35 1.28
1.18 1.27 1.26
1.28 1.26 1.21
1.24 1.21 1.34
1.2 1.34 1.3
1.32 1.3 1.35
1.28 1.2 1.27
1.16 1.35 1.26
ONE WAY ANOVA
ONE WAY ANOVA

One-way ANOVA: Run Length_Pr-A, Run Length_Pr-B, Run Length_Pr-C

Analysis of Variance
Source DF SS MS F P
Factor 2 0.01592 0.00796 2.57 0.095
Error 27 0.08375 0.00310
Total 29 0.09967
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ------+---------+---------+---------+
Run Leng 10 1.2410 0.0543 (----------*---------)
Run Leng 10 1.2970 0.0600 (----------*---------)
Run Leng 10 1.2750 0.0525 (---------*----------)
------+---------+---------+---------+
Pooled StDev = 0.0557 1.225 1.260 1.295 1.330

 Conclusion: The presses do not differ significantly.


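A minimal R sketch of the same one-way ANOVA (prA and prB as in the two-sample sketch above; prC is the third column of the data):

> prC   <- c(1.30, 1.18, 1.28, 1.26, 1.21, 1.34, 1.30, 1.35, 1.27, 1.26)
> run   <- c(prA, prB, prC)
> press <- factor(rep(c("A", "B", "C"), each = 10))
> summary(aov(run ~ press))   # compare the F-test p-value with 0.05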
CHI SQUARE TEST

• Situation: Frequencies of more than two attributes to be


observed whether or not they are associated.
• Example: Like to know whether or not the market
segment and regions are associated in complaints.
• Data:
No. of Complaints Regions Segments
6 East Newspaper
12 East Printing
11 West Newspaper
18 West Printing
3 North Newspaper
9 North Printing
7 South Newspaper
19 South Printing
CHI SQUARE TEST

Tabulated Statistics: Regions, Segments

Rows: Regions Columns: Segments

Newspaper Printing All

East 6 12 18
5.72 12.28 18.00

North 3 9 12
3.81 8.19 12.00

South 7 19 26
8.26 17.74 26.00

West 11 18 29
9.21 19.79 29.00

All 27 58 85
27.00 58.00 85.00

Chi-Square = 1.064, DF = 3, P-Value = 0.786


1 cells with expected counts less than 5.0

Cell Contents --
Count
Exp Freq

 Conclusion: Region & Market segments are not


associated for the complaints.
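A minimal R sketch of the same test, building the 4 x 2 table of complaint counts from the slide:

> counts <- matrix(c( 6, 12,
+                     3,  9,
+                     7, 19,
+                    11, 18), nrow = 4, byrow = TRUE,
+                  dimnames = list(c("East", "North", "South", "West"),
+                                  c("Newspaper", "Printing")))
> chisq.test(counts)   # Pearson chi-square; should match the Minitab result (p ≈ 0.79),
                       # with a warning because one expected count is below 5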
EXERCISE-13: CHI SQUARE TEST

In a cancer hospital, the records identify the following data.
Area Smoking Status No of cancer patients


Rural Smoker 8
Rural Non-smoker 4
Rural Ex-smoker 8
Urban Smoker 19
Urban Non-smoker 7
Urban Ex-smoker 3

Conclude whether Area and Smoking status has got any


association on cancer.
REGRESSION ANALYSIS
OVERVIEW

 Situation: one response variable (y) and one or more independent predictor variables (x's), when the exact relationship of the predictors to the response is not known. e.g. whether promotional cost of sales or a higher experience level of sales executives leads to an increase in sales can be explored using regression analysis.
 The linear regression equation that deals with only one predictor x on response y is termed Simple Linear Regression (SLR).
 The linear regression equation that deals with more than one predictor (x's) on response y is termed Multiple Linear Regression (MLR).
OVERVIEW
 Regression analysis deals with the following issues:
• Nature of relationship: examined by scatter plot for
SLR. The scatter plot apparently shows a pattern whether
or not x and y are dependent with linear, non-linear or
none equation.
• Relationship: the best linear regression equation is established as:
   o y = a + b x + error                           for SLR
   o y = a + b1 x1 + b2 x2 + … + bp xp + error     for MLR
• Strength of Relationship: This is determined from the
square of correlation co-efficient, R2, termed as co-
efficient of determination. Higher the R2, better the linear
relationship to exist. It varies from 0 to 100%: A rule of
thumb for R2 is provided as under:
OVERVIEW

o More than 80% and with the conditions that R2 and


Adj- R2 are closer and S, the standard error of the
residuals are less; the regression equation can be
used for prediction of y through x.
o 40 to 80%, the predictor alone is not good enough;
but it provides a strong indication to influence y.
o 10-40%: weak relationship and
o <10% may treated as no relationship.
• Purity/Sanctity of the Relationship: R2 does not provide the true strength of the relationship. For MLR, adding more variables will lead to an increase in R2, but Adjusted R2 will fall if the added variables do not help. So Adjusted R2 provides the true strength of the relationship without the influence of nuisance data or nuisance variables.
OVERVIEW

• Precision of the Relationship: Even with high R2, if the


residuals (difference between actual response and
predicted response) are high, the predicted value won’t
suffice the purpose. Standard deviation of these residuals
has to be adequately low to enable the investigator to
make sense of prediction. It is denoted by S.
SIMPLE LINEAR REGRESSION

 The fitted equation is:


Y = a + b x + e, Y is called the fitted value or
predicted value.
 Example:
• Height and Weight of an individual are thought to be related, but given the height of an individual it is difficult to predict exactly what weight the individual has. This is a fit case for Simple Linear Regression with y as weight and x as height of the individual.

• Data of 17 individuals with corresponding height and weight are available as under:

Name SM SV AK RP AD SB RK SR BM DK AV RM SP PB SM ST DV
Weight (Kg) 72 92 78 109 73 74 75 75 60 65 46 64 58 62 61 65 89
Height (cm.) 167 187 167 183 179 176 178 175 165 170 160 165 163 163 160 165 179
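A minimal R sketch of this SLR (the weights and heights are taken from the table above):

> weight <- c(72, 92, 78, 109, 73, 74, 75, 75, 60, 65, 46, 64, 58, 62, 61, 65, 89)
> height <- c(167, 187, 167, 183, 179, 176, 178, 175, 165, 170, 160, 165, 163, 163, 160, 165, 179)
> fit <- lm(weight ~ height)
> summary(fit)   # look at R-squared, adjusted R-squared and the residual standard error S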
SIMPLE LINEAR REGRESSION
 Example: (contd…)
SIMPLE LINEAR REGRESSION

 Example: (contd…)

• The regression seems to fit the line moderately well, but the high S (residual standard deviation) makes the equation inadequate for prediction.
Error-Diagnostic and Autocorrelation of Error
Coefficient Table:
Confidence Interval:
Prediction Interval:
MULTIPLE LINEAR REGRESSION

 The fitted equation is:


Y = a + b1 x1 + b2x2 +…+bp xp + e, Y is
called the fitted value or predicted value.
 Example:
• Height and Pulse Rate are expected to predict the Weight of an individual, but given the height and pulse rate of an individual it is difficult to predict exactly what weight the individual has. This is a fit case for Multiple Linear Regression with y as weight and x1, x2 as height and pulse rate respectively of the individual.

• Data of 17 individuals with corresponding height and


weight are available as under:
Name SM SV AK RP AD SB RK SR BM DK AV RM SP PB SM ST DV
Weight 72 92 78 109 73 74 75 75 60 65 46 64 58 62 61 65 89
Height 167 187 167 183 179 176 178 175 165 170 160 165 163 163 160 165 179
Pulse Rate 72 74 70 75 72 73 74 72 74 73 72 74 73 71 72 73 72
MULTIPLE LINEAR REGRESSION

 Example: (contd…)
MULTIPLE LINEAR REGRESSION

 Example: (contd…)
Regression Analysis: Weight versus Height, Pulse Rate

The regression equation is


Weight = - 161 + 1.52 Height - 0.38 Pulse Rate + error

Predictor Coef SE Coef T P


Constant -160.9 119.8 -1.34 0.201
Height 1.5245 0.2699 5.65 0.000
Pulse Rate -0.381 1.794 -0.21 0.835

S = 8.315 R-Sq = 72.5% R-Sq(adj) = 68.5%

• Inclusion of Pulse Rate in the SLR model has increased R-sq only from 72.4% to 72.5%, but there has been a drastic fall in Adjusted R-sq, indicating the futility of pulse rate for predicting weight. Moreover, the p-value of Pulse Rate, 0.835 (> 0.05), confidently concludes that Pulse Rate is not significantly adding value to the relationship. Hence the SLR with height is better than the MLR in this respect.
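A minimal R sketch of the MLR (pulse rates from the table above; weight and height as in the SLR sketch):

> pulse <- c(72, 74, 70, 75, 72, 73, 74, 72, 74, 73, 72, 74, 73, 71, 72, 73, 72)
> fit2 <- lm(weight ~ height + pulse)
> summary(fit2)   # compare R-sq / adjusted R-sq and the p-value of pulse with the height-only model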
EXERCISE REGRESSION

 Exercise
• Data of square meter Plate sales and the cost of sales for
15 months are as under:
Month Sq. M. Sales (‘00000) Cost of Sales (Rs. In Lakh)
Apr-11 0.9 15
May-11 0.9 10
Jun-11 0.8 11
Jul-11 1.0 18
Aug-11 0.8 8
Sep-11 0.9 12
Oct-11 1.1 15
Nov-11 1.1 16
Dec-11 1.0 12
Jan-12   0.9   9
Feb-12   1.0   13
Mar-12   0.9   10
Apr-12   1.2   14
May-12   1.1   17
Jun-12   1.0   14

• Do the Regression Analysis with this phenomenon.
Prediction Interval:
Unusual Observations & Cook's Distance:
Multicollinearity:
How Good the Regression Model Is and Mallows' Cp:

The predicted residual error sum of squares (PRESS) statistic is a form of cross-validation used in regression analysis to provide a summary measure of the fit of a model to a sample of observations that were not themselves used to estimate the model. It is calculated as the sum of squares of the prediction residuals for those observations.
Once a fitted model has been produced, each observation in turn is removed and the model is refitted using the remaining observations. The out-of-sample predicted value is calculated for the omitted observation in each case, and the PRESS statistic is calculated as the sum of the squares of all the resulting prediction errors.
Given this procedure, the PRESS statistic can be calculated for a number of candidate model structures for the same dataset, with the lowest values of PRESS indicating the best structures. Models that are over-parameterised (over-fitted) tend to give small residuals for observations included in the model-fitting but large residuals for observations that are excluded.
POWER OF THE TEST
Using the Power of the Test for Good Hypothesis Testing

Table 1: Possible Outcomes of a Hypothesis Test

Reality      Decision: Accept Ho                     Decision: Reject Ho
Ho is true   Good decision                           Type I error
             (p = 1 − α, the confidence level)       (p = α, the significance level)
Ha is true   Type II error (p = β)                   Good decision
                                                     (p = 1 − β, the power of the test)

What should every good hypothesis test ensure? Ideally, it should make the probabilities of both a Type I error and a Type II error very small. The probability of a Type I error is denoted α and the probability of a Type II error is denoted β.

Understanding α
Recall that in every test a significance level is set, normally α = 0.05. In other words, one is willing to accept a probability of 0.05 of being wrong when rejecting the null hypothesis. This is the α risk one is willing to take, and setting α at 0.05, or 5 percent, means one is willing to be wrong 5 out of 100 times when rejecting Ho. Hence, once the significance level is set, there is really nothing more that can be done about α.
Understanding β and 1 − β
Suppose the null hypothesis is false. One would want the hypothesis test to reject it all the time. Unfortunately, no test is foolproof, and there will be cases where the null hypothesis is in fact false but the test fails to reject it. In this case, a Type II error would be made. β is the probability of making a Type II error, and β should be as small as possible. Consequently, 1 − β is the probability of correctly rejecting a null hypothesis (because it is in fact false), and this number should be as large as possible.

The Power of the Test

Rejecting a null hypothesis when it is false is what every good hypothesis test should do. Having a high value of 1 − β (near 1.0) means it is a good test, and having a low value (near 0.0) means it is a bad test. Hence, 1 − β is a measure of how good a test is, and it is known as the "power of the test."
The power of the test is the probability that the test will reject Ho when in fact it is false. Conventionally, a test with a power of 0.8 is considered good.
Statistical Power Analysis
Consider the following when doing a power analysis:
1. What hypothesis test is being used
2. Standardized effect size
3. Sample size
4. Significance level, or α
5. Power of the test, or 1 − β
The computation of power depends on the test used. One of the simplest examples for power computation is the t-test. Assume that there is a population mean of μ = 20, that a sample of n = 44 is collected, and that a sample mean and a sample standard deviation of s = 4 are found. Did this sample come from a population of mean = 20 if α is set at 0.05?

Ho: μ equals 20    Ha: μ does not equal 20
α = 0.05, two-tailed test
The critical value of t at 0.05 (two-tailed) for DF = 43 is 2.0167 (using spreadsheet software [e.g., Excel], TINV(0.05, 43) = 2.0167). Since the computed t is greater than the critical value, the null hypothesis is rejected. But how powerful was this test?

Computing the Value of 1 − β

The critical value of t at 0.05 (two-tailed) for DF = 43 is 2.0167. The following figure illustrates this graphically.

This t = ±2.0167 corresponds, in the hypothesized distribution, to 20 ± (s/√n)(2.0167) = 20 ± 0.603(2.0167), i.e. 21.216 and 18.784 (since s/√n = 4/√44 ≈ 0.603).
The next figure shows an alternative distribution with μ = 22 and s = 4. This is the original distribution shifted by two units to the right.
What is the probability of being less than 21.216 in this alternative distribution? That probability is β, accepting Ho when in fact it is false, because for any value in that region one would have accepted Ho under the original distribution. How does one find this β? The t-value of 21.216 in the alternative distribution is (21.216 − 22)/0.603 ≈ −1.3.
What is the corresponding probability of being less than t = −1.3? From the t-tables, using one tail and DF = 43, t = 1.3 gives 0.10026 (using spreadsheet software, TDIST(1.3, 43, 1) = 0.10026).
Hence β = 0.10026 and 1 − β ≈ 0.9, which was the power of the test in this example.

What Influences the Power of the Test?

Three key factors affect the power of the test.

Factor 1
The difference, or effect size, affects power. If the difference one was trying to detect was not 2 but 1, the overlap between the original distribution and the alternative distribution would have been greater. Hence β would increase and 1 − β, the power, would decrease. Hence, as effect size increases, power also increases.

Factor 2
The significance level, α, affects power. Imagine using a significance level of 0.1 in the example instead. What would happen?

Table 2: Using a Different Significance Level

Significance Level   DF   Critical t   Value in Original Distribution
0.05                 43   2.016692     21.21606538
0.10                 43   1.681071     21.01368563

The critical t would shift from 2.01669 to 1.68. This makes β smaller and 1 − β larger. Hence, as the significance level of the test increases, the power of the test also increases. However, this comes at a high price because the α risk also increases.
Factor 3
Sample size affects power. Why? Consider the one-sample t statistic

    t = (x̄ − μ0) / (s/√n)

How can t be increased? As t increases, it becomes easier to reject Ho. One way is to increase the numerator, the effect size; as the effect size increases, power also increases. Also, as the denominator, the standard error of the mean (SE mean = s/√n), decreases, t increases and consequently the power of the test also increases. How can the denominator be decreased? As the sample size increases, the SE mean decreases. Hence, as sample size increases, t also increases and the power of the test increases.
In general, to improve power, really only the sample size can be increased, because the significance level is usually fixed by industry (0.05 for Six Sigma) and there is not much that can be done to change the difference one is trying to detect.
Since a power of 0.8 is considered good enough, one can use statistical software to find the corresponding sample size that will need to be collected prior to hypothesis testing in order to obtain a good power of test.
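A minimal R sketch of such a power / sample-size calculation, reusing the numbers of the example above (power.t.test() is in base R's stats package):

> power.t.test(n = 44, delta = 2, sd = 4, sig.level = 0.05,
+              type = "one.sample", alternative = "two.sided")   # power for n = 44
> power.t.test(power = 0.8, delta = 2, sd = 4, sig.level = 0.05,
+              type = "one.sample")                              # n needed for power 0.8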
Power
In a hypothesis test, the likelihood that you will find a significant effect or difference
when one truly exists. Power is the probability that you will correctly reject the null
hypothesis when it is false.
A number of factors affect power:
· Sample size: increasing the sample size provides more information about the population and therefore increases power.
· α (the probability of Type I error): a larger α value increases power, because you are more likely to reject the null hypothesis with larger α values.
· σ (variability in the population): when σ is small, it is easier to detect a difference, which increases power.
· Magnitude of the population effect: the more similar the populations are, the more difficult it is to detect a difference; therefore, power decreases.
You can calculate power before you collect data (a prospective study) to ensure that your
hypothesis test will detect significant differences or effects. For example, a pharmaceutical
company wants to see how much power their hypothesis test has to detect differences among
three different diabetes treatments. To increase power, they can increase the sample size to get
more information about the population of diabetes patients using these medications. Also, they
can try to decrease error variance by following good sampling practices.
You can also calculate power to understand the power of tests that you have already conducted
(a retrospective study). For example, an automobile parts manufacturer performs an experiment
comparing the weight of two steel formulations, and the results are not statistically significant.
Using Minitab, the manufacturer can calculate power based on the minimum difference that they
would like to see. If the power to detect this difference is low, they may want modify the
experimental design to increase the power and continue to evaluate the same problem. However,
if the power is high, they may conclude that the two steel formulations are not different and
discontinue further experimentation.
Power equals 1 − β, where β is the probability of making a Type II error (failing to reject the null hypothesis when it is false). As α (the level of significance) increases, β decreases. Therefore, as α increases, power also increases. Keep in mind that increasing α also increases the probability of a Type I error (rejecting the null hypothesis when it is true).
Difference (power and sample size)
The difference between an actual population parameter and the hypothesized value;
for example, the difference between the mean width of all dowels produced on a
machine and the target width. Difference is also known as population effect, or simply,
effect.
Usually, the true population parameter is not known; therefore, samples are taken and
a statistical test, such as a t-test or a one-way ANOVA, is used to evaluate whether a
difference exists. You can use Minitab's power and sample size analysis to design a
test to detect the desired difference with the desired power. The power is the
likelihood that you will detect the desired difference when one truly exists.
For example, you would like to know if the mean fill weight of cereal boxes is within
0.5 oz of the target (20 oz). Historically, the standard deviation of fill weights for this
machine is 0.9 oz. With a sample size of 45, you can detect a difference of 0.5 or
greater with 95% power (95% likelihood that you will detect the difference if it
exists). If you take only 20 samples, the difference has to be at least 0.76 oz in order
to maintain 95% power.
Chapter-5
Classification
Chapter-5 Classification
Applications of DA

 DA is especially useful to understand the


differences and factors leading consumers to
make different choices allowing them to develop
marketing strategies which take into proper
account the role of the predictors.
 Examples
• Determinants of customer loyalty
• Shopper profiling and segmentation
• Determinants of purchase and non-purchase
Fisher’s linear discriminant analysis

 The discriminant function is the starting point


 Two key assumptions behind linear DA
(a) the predictors are normally distributed;
(b) the covariance matrices for the predictors within each of the groups
are equal.
 Departure from condition (a) should suggest use of
alternative methods (logistic regression)
 Departure from condition (b) requires the use of different
discriminant techniques (usually quadratic discriminant
functions).
 In most empirical cases, the use of linear DA is appropriate

Example (DA using MINITAB)

 IRIS Data
 MINITAB EXERCISE
Review and Discussions

 Questions ?
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is training set

 The model is represented as classification rules, decision trees,


or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model

 The known label of test sample is compared with the

classified result from the model


 Accuracy rate is the percentage of test set samples that are

correctly classified by the model


 Test set is independent of training set, otherwise over-fitting

will occur
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
ZeroR Classifier

 The idea behind the ZeroR classifier is to identify


the most common class value in the training set

 It always returns that value when evaluating an


instance

 It is frequently used as a baseline for evaluating


other algorithms.
ZeroR Algorithm

 Count frequency of each class

 For an unknown tuple, assign the class label of the class having the majority vote.

 This algorithm ignores any dependency of the class variable on the predictor variables.
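A minimal sketch of ZeroR in R (zeroR is a hypothetical helper, not a library function; it expects a vector of training class labels):

> zeroR <- function(train_labels) {
+   names(which.max(table(train_labels)))   # the most frequent class in the training set
+ }
> zeroR(iris$Species)   # this single class is then predicted for every new instance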
One R

 It is an improved version of ZeroR classifier

 Assume that the class lable depends only on one


predictor.

 Considering one predictor each time, compute


classification accuracy.

 Base the rule on the predictor that gives the best accuracy.
Example OneR

 Weather data

 The OneR rule is based on predictor OUTLOOK

 Limitations?
TOPIC

Predictive Analytics:
◦ K Nearest Neighbour
Process (1): Model Construction

Classification
Algorithms
Training
Data

NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction

Classifier

Testing
Data Unseen Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
A Typical (training) Data

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”

age?

<=30 overcast
31..40 >40
Output: A Decision Tree for “buys_computer”

age?

<=30 overcast
31..40 >40

yes
Output: A Decision Tree for “buys_computer”

age?

<=30 overcast
31..40 >40

student? yes

no yes
Output: A Decision Tree for “buys_computer”

age?

<=30 overcast
31..40 >40

student? yes credit rating?

excellent fair

no yes yes
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are discretized in
advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
 There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:

    Info(D) = − Σ_{i=1..m} p_i log2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1..v} ( |D_j| / |D| ) × Info(D_j)

 Information gained by branching on attribute A:

    Gain(A) = Info(D) − Info_A(D)
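A minimal R sketch of these formulas (entropy and info_gain are hypothetical helpers; d is assumed to be a data frame holding the buys_computer training table shown earlier):

> entropy <- function(labels) {
+   p <- table(labels) / length(labels)
+   -sum(p * log2(p))                                     # Info(D)
+ }
> info_gain <- function(d, attr, class) {
+   w    <- table(d[[attr]]) / nrow(d)                    # |Dj| / |D|
+   cond <- sapply(split(d[[class]], d[[attr]]), entropy) # Info(Dj) per partition
+   entropy(d[[class]]) - sum(w * cond)                   # Gain(A)
+ }
> # e.g. info_gain(d, "age", "buys_computer") should reproduce Gain(age) = 0.246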
Attribute Selection: Information Gain
 Class P: buys_computer = "yes" (9 tuples);  Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

 For attribute age (using the training table shown earlier):

   age      p_i   n_i   I(p_i, n_i)
   <=30     2     3     0.971
   31…40    4     0     0
   >40      3     2     0.971

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

   ("age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's; hence the (5/14) I(2,3) term.)

    Gain(age) = Info(D) − Info_age(D) = 0.246

 Similarly,
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
Computing Information-Gain for Continuous-Value
Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the value A in increasing order
 Typically, the midpoint between each pair of adjacent values is considered
as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information requirement for A is
selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of
tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)

 Information gain measure is biased towards attributes with a large number of


values
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain)
    SplitInfo_A(D) = − Σ_{j=1..v} ( |D_j| / |D| ) log2( |D_j| / |D| )

 GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex. (income splits D into partitions of sizes 4, 6 and 4):
    SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
 gain_ratio(income) = 0.029 / 1.557 = 0.019

 The attribute with the maximum gain ratio is selected as the splitting
attribute
Gini index (CART, IBM IntelligentMiner)

 If a data set D contains examples from n classes, the gini index, gini(D), is defined as

    gini(D) = 1 − Σ_{j=1..n} p_j²

   where p_j is the relative frequency of class j in D.
 If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as

    gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

 Reduction in impurity:

    Δgini(A) = gini(D) − gini_A(D)

 The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).

Gini index (CART, IBM IntelligentMiner)

 Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

    gini(D) = 1 − (9/14)² − (5/14)² = 0.459

 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2:

    gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

   but gini for the split {medium, high} is 0.30 and is thus the best, since it is the lowest.
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes
Comparing Attribute Selection Measures
 The three measures, in general, return good results but
 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one partition is much
smaller than the others
 Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized partitions and purity in
both partitions
Decision Tree Based Classification

 Advantages:
 Inexpensive to construct

 Extremely fast at classifying unknown records

 Easy to interpret for small-sized trees

 Accuracy is comparable to other classification

techniques for many simple data sets


Practical Issues of Classification

 Underfitting and Overfitting

 Missing Values

 Costs of Classification
Example

 DATA Iris
Multiple Linear Regression
CORRELATION
 If two variables X and Y, are related such that as
Y increases / decreases with another variable X, a
correlation is said to exist between them.

 A scatter diagram is a chart that pictorially depicts


the relationship between two such data types.
Some Examples of Relationship
• Cutting speed and tool life
• Moisture content and thread elongation
• Breakdown and equipment age
• Temperature and lipstick hardness
• Striking pressure and electrical current
• Temperature and percent foam in soft drinks
Scatter Diagram of Automotive Speed vs. Mileage
[Scatter plot: Mileage (km/Lit) on the Y-axis against Speed (km/h) on the X-axis.]
SCATTER DIAGRAM
• A scatter diagram depicts the relationship
as a pattern that can be directly read.
• If Y increases with X, then X and Y are
positively correlated.
• If Y decreases as X increases, then the two
types of data are negatively correlated.
• If no significant relationship is apparent
between X and Y, then the two data types
are not correlated.
DIFFERENT SCATTER DIAGRAM PATTERNS
DATA ON CONVEYOR SPEED AND SEVERED LENGTH
Sl. No. Conveyor Severed Sl. No. Conveyor Severed
Speed Length Speed Length
(cm/sec) (mm) (cm/sec) (mm)
1 8.1 1046 16 6.7 1024
2 7.7 1030 17 8.2 1034
3 7.4 1039 18 8.1 1036
4 5.8 1027 19 6.6 1023
5 7.6 1028 20 6.5 1011
6 6.8 1025 21 8.5 1030
7 7.9 1035 22 7.4 1014
8 6.3 1015 23 7.2 1030
9 7.0 1038 24 5.6 1016
10 8.0 1036 25 6.3 1020
11 8.0 1026 26 8.0 1040
12 8.0 1041 27 5.5 1013
13 7.2 1029 28 6.9 1025
14 6.0 1010 29 7.0 1020
15 6.3 1020 30 7.5 1022
Scatter Diagram for Conveyor Speed and Severed Length
[Scatter plot: Severed Length (mm) on the Y-axis against Conveyor Speed (cm/sec) on the X-axis.]
USES OF SCATTER DIAGRAM

 If an increase in Y depends on increase in


X, then, if X is controlled, Y will be
naturally controlled.

 If X is increased, Y will increase


somewhat. Then Y seems to have causes
other than X.
COVARIANCE
 A Statistic representing the degree to
which two variables vary together. It is
basically a number that reflects the
degree to which two variables vary
together.
The covariance is

    COV_XY = Σ_{i=1..N} (X_i − X̄)(Y_i − Ȳ) / (N − 1)
CORRELATION COEFFICIENT
 A measure of the relationship between variables.
 The most commonly used coefficient is Pearson Product-
Moment Correlation Coefficient (measure of linear
relationship denoted by ‘r’).
 ‘r’ lies between -1 and +1. r = 0 means no correlation.
 A positive value of ‘r’ implies positive correlation and
negative value implies negative correlation.

The Pearson correlation coefficient is

    r = COV_XY / (S.D._X × S.D._Y)
CORRELATION FROM MINITAB

 Open MINITAB Worksheet.

 Enter the data in separate columns of equal length

 Choose STAT > BASIC STATISTICS > CORRELATION

 In Variables, enter the columns containing data

 Interpret the data, R-sq and p value


REGRESSION
 Regression is the prediction of dependent
variable from knowledge of one or more
other independent variables.
 Regression Analysis is a statistical
technique for estimating the parameters
of an equation relating a particular value
of dependent variable to a set of
independent variables. The resulting
equation is called Regression Equation.
 Linear regression is the regression in
which the relationship is linear.
 Curvilinear regression is the regression in
which the best fitting line is a curve.
SIMPLE LINEAR REGRESSION
 Only a single predictor variable or independent
variable ‘X’ (e.g.: cutting speed) and a response
variable or dependent variable ‘Y’ (e.g: tool life).
The regression equation is

    Ŷ = a + b X

where
    Ŷ = predicted value of Y
    a = intercept (the predicted value of Y when X = 0)
    b = slope of the line (the amount of difference in Y associated with a 1-unit difference in X)
SIMPLE LINEAR REGRESSION FROM MINITAB

 Open MINITAB Worksheet and collect 30 to 50 pairs of


data.
 Choose STAT > REGRESSION > REGRESSION
 In Response, enter the column containing it and in
Predictors, enter the corresponding column.
 Click OK.
 R2 or R-sq or Coefficient of Determination
= SS Regression/SS Total = 1 - (SS Error / SS Total)
MULTIPLE LINEAR REGRESSION

 Many predictor variables or independent variables ‘X1,


X2, ….Xk’ (e.g.: gender, height) and a response variable
or dependent variable ‘Y’ (e.g.: weight).
The regression equation is

    Ŷ = a + b1 X1 + b2 X2 + ... + bk Xk

where
    Ŷ  = predicted value of Y
    a  = intercept (the predicted value of Y when all Xi = 0)
    bj = slope for Xj (the amount of difference in Y associated with a 1-unit difference in Xj), j = 1, 2, ..., k
MULTIPLE LINEAR REGRESSION FROM MINITAB

 Open MINITAB Worksheet and enter the data in


separate columns of equal length.
 Choose STAT > REGRESSION > REGRESSION
 In Response, enter the column containing it and in
Predictors, enter the corresponding columns.
 Click OK.
 R = Multiple correlation coefficient
STEPWISE REGRESSION
 Many predictor variables or independent variables
‘X1, X2, ….Xk’ (e.g.: gender, height) and a response
variable or dependent variable ‘Y’ (e.g.: weight).

 It begins by selecting the single independent


variable (entire set of predictors) that is the ‘best’
predictor which maximizes R2. Then it adds
(eliminates) variables in sequential manner, in order
of importance and at each step it increases R2.
STEPWISE REGRESSION FROM MINITAB

 Open MINITAB Worksheet and enter the data in


separate columns of equal length.
 Choose STAT > REGRESSION > STEPWISE
REGRESSION
 In Response, enter the column containing it and in
Predictors, enter the corresponding columns.
 Click OK.
 When you choose the stepwise method, you can enter a
starting set of predictor variables in Predictors in initial
model. These variables are removed if their p-values are
greater than the Alpha to enter value. If you want keep
variables in the model regardless of their p-values, enter
them in Predictors to include in every model in the main
dialog box.
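A minimal R sketch of stepwise selection (fit2 is the height + pulse model from the earlier MLR sketch; note that R's step() adds and drops predictors by AIC rather than by the p-value thresholds used in Minitab's dialog):

> step(fit2, direction = "both")   # sequentially add/drop predictors, reporting each step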
BEST SUBSETS REGRESSION
 Many predictor variables or independent
variables ‘X1, X2, ….Xk’ (e.g.: gender, height)
and a response variable or dependent
variable ‘Y’ (e.g.: weight).

 It generates regression models using the


maximum R2 criterion by first examining all
one-predictor regression models and then
selecting the two-predictor models giving the
largest R2. It examines all two-predictor
models, selects the two models with the
largest R2, and displays information on these
two models. This process continues until the
model contains all predictors.
BEST SUBSETS REGRESSION
FROM MINITAB

 Open MINITAB Worksheet and enter the data in separate


columns of equal length.
 Choose STAT > REGRESSION > BEST SUBSETS
REGRESSION
 In Response, enter the column containing it and in Free
predictors, enter the corresponding columns.
 Click OK.
 Cp = (SSEp / MSEm) - (n-2p) : where SSEp is SSE for
the best model with ‘p’ parameter and MSEm is the mean
square error for the model with all ‘m’ predictors.
 We look for models where Cp is small and is also close to
p, the number of parameters in the model.
NON-LINEAR REGRESSION
 Many predictor variables or independent
variables ‘X1, X2, ….Xk’ (e.g.: gender, height)
and a response variable or dependent
variable ‘Y’ (e.g.: weight).
The regression equation in two predictor variables is

    Ŷ = a + b1 X1 + b2 X2 + b3 X1² + b4 X2² + b5 X1 X2

which is called the full quadratic model of Y on X1 and X2.
NON-LINEAR REGRESSION FROM
MINITAB (only for One Predictor)

 Open MINITAB Worksheet and collect 30 to 50 pairs of


data.

 Choose STAT > REGRESSION > FITTED LINE PLOT

 In Response (Y), enter the column containing it and in


Predictor (X), enter the corresponding column.

 Choose the type of regression model.

 Click OK.
 Logistic Regression Model

 When it is used?
When the dependent (response) variable is a dichotomous
variable (i. e. it takes only two values, which usually represent
the occurrence or non-occurrence of some outcome event,
usually coded as 0 or 1) and the independent (input) variables
are continuous, categorical, or both.
 For example, in a medical study, the patient survives or dies as
a response and age, suffering from disease or not as predictors.
LR as classifier
 Logistic regression can be used for classifying a new observation, where the
class is unknown, into one of the classes, based on the values of its predictor
variables (called classification).

 The first step in logistic regression is to estimate the probability of belonging to each class. That means we get an estimate of P(Y=1), the probability of belonging to class 1 (which also tells us the probability of belonging to class 0). In the next step we use a cutoff value on these probabilities in order to classify each case into one of the classes.
Continued………..
Model:
The form of the model is

    log( p / (1 − p) ) = β0 + β1 X1 + ... + βk Xk

where p is the probability that Y = 1, i.e. P[Y=1], and X1, X2, ..., Xk are the independent variables (predictors). β0, β1, ..., βk are known as the regression coefficients, which have to be estimated from the data. Once the coefficients are estimated from the training set, we can estimate the class membership probabilities:

    P[Y=1] = p = 1 / (1 + e^(−X′β))    and    P[Y=0] = 1 − p

We use a cutoff value on these probabilities in order to classify each case into one of the classes.
Odds:

Logistic regression also produces odds ratios (O.R.) associated with each predictor. The "odds" of an event is defined as the probability of the outcome event occurring divided by the probability of the event not occurring:

    odds = p / (1 − p) = e^(X′β)

The odds ratio for a predictor is defined as the relative amount by which the odds of the outcome increase (O.R. greater than 1.0) or decrease (O.R. less than 1.0) when the value of the predictor variable is increased by 1.0 unit, keeping the others constant (fixed). The odds ratio for regressor xi, assuming all other predictor variables are constant, is

    OR = Odds(xi + 1) / Odds(xi) = e^(βi)
Example: - Acceptance of Personal Loan:

Here, we consider the example of acceptance of a personal loan by

Universal Bank. The banks dataset includes data on 5000 customers. The data

include customer demographic information (Age, Income etc.), customer response

to the last personal loan campaign (Personal Loan), and the customer’s

relationship with the bank (mortgage, securities account, etc.). Among these 5000

customers, only 480 (9.6%) accepted the personal loan that was offered to them in

a previous campaign. The goal is to find characteristics of customers who are most

likely to accept the loan offer in future mailings.

Data
Result:

We select 60% data randomly from the Universal Bank data for the training purpose and

remaining 40% data for validation.

Using Minitab we fit a logistic regression model on the randomly selected 60% data. After

fitting a model we obtain a confusion matrix based on remaining 40% data.

 We take only one regressor variable as Income

Confusion Matrix

                 Predicted
                 0       1      % Error
Actual   0       1754    57     3.15
         1       127     62     67.2

Total misclassification error = 9.2%.

The odds ratio is 1.04; that means a single-unit increase in income is associated with an increase in the odds that a customer accepts the offer by a factor of 1.04.


Continued…………..

 We take only one regressor variable as Age

Confusion Matrix

                 Predicted
                 0       1      % Error
Actual   0       1811    0      0
         1       189     0      100

Total misclassification error = 9.45%.

The odds ratio is 1; that means a single-unit increase in age is associated with an increase in the odds that a customer accepts the offer by a factor of 1 (i.e. no change).

 Similarly, if we take Education as the only regressor variable, the total misclassification error is 9.45%.


Continued…………….

 If we take regressor variable as Income, Age and Education.

Confusion Matrix

                 Predicted
                 0       1      % Error
Actual   0       1786    25     1.38
         1       74      115    39.15

Total misclassification error = 4.95%.

The odds ratio corresponding to Age is 1.01; that means a single-unit increase in age, holding income and education constant, is associated with an increase in the odds that a customer accepts the offer by a factor of 1.01.


Cutoff Value:-

Given the values of a set of predictors, we can predict the probability that each observation belongs to class 1. The next step is to set a cutoff on these probabilities so that each observation is classified into one of the two classes. This is done by setting a cutoff value, c, such that observations with probabilities above c are classified as belonging to class 1. For example, in the binary case, a cutoff of 0.5 means that cases with an estimated probability P[Y=1] > 0.5 are classified as belonging to class '1', whereas cases with P[Y=1] < 0.5 are classified as belonging to class '0'. The cutoff need not be set at 0.5.

Different cutoff values lead to different classifications and, consequently, different confusion matrices. A popular cutoff value for the two-class case is 0.5.
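A minimal R sketch of this workflow with glm() (bank and valid are assumed data frames for the training and validation partitions of the Universal Bank data, with columns named Personal.Loan and Income; the actual column names depend on how the file is read):

> fit <- glm(Personal.Loan ~ Income, data = bank, family = binomial)
> exp(coef(fit))                                            # odds ratios, e^beta
> p <- predict(fit, newdata = valid, type = "response")     # estimated P[Y = 1]
> pred_class <- ifelse(p > 0.5, 1, 0)                       # apply the 0.5 cutoff
> table(Actual = valid$Personal.Loan, Predicted = pred_class)   # confusion matrix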
Chapter-6
Clustering
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering: Some Applications

 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature
spaces
 Detect spatial clusters or for other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
 Clothing Industry
Measure the Quality of Clustering

 Dissimilarity/Similarity metric: similarity is expressed in
   terms of a distance function, typically a metric d(i, j)
 There is a separate "quality" function that measures the
   "goodness" of a cluster.
 The definitions of distance functions are usually very
   different for interval-scaled, boolean, categorical, ordinal,
   ratio, and vector variables.
 Weights should be associated with different variables
   based on applications and data semantics.
 It is hard to define "similar enough" or "good enough"
    the answer is typically highly subjective.
Data Structures

 Data matrix (two modes): n objects described by p variables

      | x11  ...  x1f  ...  x1p |
      | ...  ...  ...  ...  ... |
      | xi1  ...  xif  ...  xip |
      | ...  ...  ...  ...  ... |
      | xn1  ...  xnf  ...  xnp |

 Dissimilarity matrix (one mode): pairwise distances d(i, j)

      |   0                          |
      | d(2,1)    0                  |
      | d(3,1)  d(3,2)    0          |
      |   :       :       :          |
      | d(n,1)  d(n,2)   ...  ...  0 |
Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:

      s_f = (1/n) ( |x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f| )

   where m_f = (1/n) (x_1f + x_2f + ... + x_nf)

 Calculate the standardized measurement (z-score):

      z_if = (x_if - m_f) / s_f

 Using the mean absolute deviation is more robust than using the
   standard deviation
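A minimal NumPy sketch of this standardization, computing m_f, the mean absolute deviation s_f, and the resulting z-scores; the small matrix is made up for illustration.

import numpy as np

def standardize_mad(X):
    X = np.asarray(X, dtype=float)           # rows = objects, columns = variables
    m = X.mean(axis=0)                       # m_f: column means
    s = np.abs(X - m).mean(axis=0)           # s_f: mean absolute deviations
    return (X - m) / s                       # z-scores

X = [[1.0, 200.0],
     [2.0, 400.0],
     [3.0, 600.0]]
print(standardize_mad(X))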
Similarity and Dissimilarity Between Objects

 Distances are normally used to measure the similarity or
   dissimilarity between two data objects
 Some popular ones include the Minkowski distance:

      d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)

   where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are
   two p-dimensional data objects, and q is a positive integer
 If q = 1, d is the Manhattan distance:

      d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
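A minimal sketch of the Minkowski distance above; q = 1 gives the Manhattan distance and q = 2 the familiar Euclidean distance.

import numpy as np

def minkowski(x, y, q=2):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

a, b = [1, 2, 3], [4, 6, 3]
print(minkowski(a, b, q=1))   # Manhattan: 3 + 4 + 0 = 7
print(minkowski(a, b, q=2))   # Euclidean: 5.0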
Major Clustering Approaches (I)

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion
 Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
Typical Alternatives to Calculate the Distance between Clusters

 Single link: smallest distance between an element in one cluster
   and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster
   and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)
 Average: average distance between an element in one cluster and an
   element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
   dis(Ki, Kj) = dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e.,
   dis(Ki, Kj) = dis(Mi, Mj)
    Medoid: one chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

 Centroid: the "middle" of a cluster

      Cm = ( Σ_{i=1}^{N} t_i ) / N

 Radius: square root of the average distance from any point of the
   cluster to its centroid

      Rm = sqrt( Σ_{i=1}^{N} (t_i - Cm)^2 / N )

 Diameter: square root of the average mean squared distance between
   all pairs of points in the cluster

      Dm = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_i - t_j)^2 / ( N (N - 1) ) )
Partitioning Algorithms: Basic Concept

 Partitioning method: construct a partition of a database D of n objects
   into a set of k clusters such that the sum of squared distances is minimized:

      E = Σ_{m=1}^{k} Σ_{tmi ∈ Km} (Cm - tmi)^2

 Given a k, find a partition of k clusters that optimizes the chosen
   partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
    Partition objects into k nonempty subsets
    Compute seed points as the centroids of the clusters of the current
      partition (the centroid is the center, i.e., mean point, of the cluster)
    Assign each object to the cluster with the nearest seed point
    Go back to Step 2; stop when no assignments change
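A minimal NumPy sketch of these four steps; it is an illustrative implementation on made-up points, not a replacement for library routines such as scikit-learn's KMeans.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # arbitrary initial centers
    for _ in range(n_iter):
        # assign each object to the cluster with the nearest center
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # recompute each center as the mean of the objects assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                     # stop: nothing changed
            break
        centers = new_centers
    return labels, centers

X = [[1, 1], [1.5, 2], [8, 8], [9, 9], [0.5, 1.2], [8.5, 9.5]]
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)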
The K-Means Clustering Method

 Example (K = 2): arbitrarily choose K objects as the initial cluster centers;
   assign each object to the most similar center; update the cluster means;
   reassign and repeat until the assignments no longer change.
   (Figure: scatter plots illustrating the assign/update/reassign iterations.)
A Typical K-Medoids Algorithm (PAM)

 K = 2: arbitrarily choose k objects as the initial medoids; assign each
   remaining object to the nearest medoid; randomly select a nonmedoid object
   O_random; compute the total cost of swapping a medoid with O_random; perform
   the swap if it improves the quality; repeat the loop until no change.
   (Figure: scatter plots illustrating the assignment and swap steps, with total
   costs of 20 and 26 for the two configurations shown.)
Hierarchical Clustering

 Uses a distance matrix as the clustering criterion. This method does not
   require the number of clusters k as an input, but needs a termination condition
 Agglomerative (AGNES): starts from singletons a, b, c, d, e and merges step by
   step, e.g. {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}
 Divisive (DIANA): proceeds in the reverse direction, splitting the full set
   step by step back into singletons

Dendrogram: shows how the clusters are merged
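A minimal sketch of agglomerative (AGNES-style) clustering using SciPy on made-up points; method="single" corresponds to the single-link distance discussed earlier.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9], [0.5, 1.2], [8.5, 9.5]])

Z = linkage(X, method="single")                    # merge history (agglomerative)
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree as a dendrogram.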
Chapter-7
Affinity Analysis
Market Basket Analysis

Example

TID Red White Blue Orange Green Yellow


1 1 1 0 0 1 0
2 0 1 0 1 0 0
3 0 1 1 0 0 0
4 1 1 0 1 0 0
5 1 0 1 0 0 0
6 0 1 1 0 0 0
7 1 0 1 0 0 0
8 1 1 1 0 1 0
9 1 1 1 0 0 0
10 0 1 0 0 0 1
Using the transaction data above, the itemset supports and rule confidences are:

  Item Set              Support (count)
  Red                          6
  White                        7
  Blue                         6
  Orange                       2
  Green                        2
  Yellow                       1
  Red, White                   4
  Red, Blue                    4
  Red, Orange                  1
  Red, Green                   2
  Red, Yellow                  0
  White, Blue                  4
  White, Orange                2
  White, Green                 2
  White, Yellow                1
  Red, White, Blue             2
  Red, White, Green            2

  If                    Then                 Confidence (%)
  Red, White            Green                      50
  Red, Green            White                     100
  White, Green          Red                       100
  Red                   White, Green               33
  White                 Red, Green                 29
  Green                 Red, White                100
Association Rule Mining
 Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction

Market-Basket transactions:

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Beer, Eggs
  3     Milk, Diaper, Beer, Coke
  4     Bread, Milk, Diaper, Beer
  5     Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

 Itemset
    A collection of one or more items
    Example: {Milk, Bread, Diaper}
    k-itemset: an itemset that contains k items
 Support count (σ)
    Frequency of occurrence of an itemset
    E.g. σ({Milk, Bread, Diaper}) = 2
 Support
    Fraction of transactions that contain an itemset
    E.g. s({Milk, Bread, Diaper}) = 2/5
 Frequent Itemset
    An itemset whose support is greater than or equal to a minsup threshold

(Counts refer to the market-basket transactions listed above.)
Definition: Association Rule

 Association Rule
    An implication expression of the form X → Y, where X and Y are itemsets
    Example: {Milk, Diaper} → {Beer}
 Rule Evaluation Metrics
    Support (s): fraction of transactions that contain both X and Y
    Confidence (c): measures how often items in Y appear in transactions that contain X

Example (using the transactions above): {Milk, Diaper} → {Beer}

      s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
      c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
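A minimal sketch that computes the support and confidence of a candidate rule X → Y directly from the five transactions above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y):
    X, Y = set(X), set(Y)
    s = support_count(X | Y) / len(transactions)   # support of the rule
    c = support_count(X | Y) / support_count(X)    # confidence of the rule
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}))   # (0.4, 0.666...)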
Mining Association Rules

Example of rules (from the transactions above):
  {Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}    (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}    (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}    (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Illustrating Apriori Principle (Minimum Support = 3)

Items (1-itemsets):
  Item      Count
  Bread       4
  Coke        2
  Milk        4
  Beer        3
  Diaper      4
  Eggs        1

Pairs (2-itemsets) - no need to generate candidates involving Coke or Eggs:
  Itemset             Count
  {Bread, Milk}         3
  {Bread, Beer}         2
  {Bread, Diaper}       3
  {Milk, Beer}          2
  {Milk, Diaper}        3
  {Beer, Diaper}        3

Triplets (3-itemsets):
  Itemset                    Count
  {Bread, Milk, Diaper}        3

 If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
 With support-based pruning: 6 + 6 + 1 = 13.
Apriori Algorithm

 Method:
    Let k = 1
    Generate frequent itemsets of length 1
    Repeat until no new frequent itemsets are identified
       Generate length (k+1) candidate itemsets from
         length k frequent itemsets
       Prune candidate itemsets containing subsets of
         length k that are infrequent
       Count the support of each candidate by scanning
         the DB
       Eliminate candidates that are infrequent, leaving
         only those that are frequent
Example: Association Mining
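A minimal, illustrative Python sketch of Apriori-style association mining on the market-basket transactions above, using a minimum support count of 2 and following the generate, prune, count, eliminate steps listed in the method.

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]    # length-1 candidates
    frequent, k = [], 1
    while candidates:
        # count support of each candidate and eliminate the infrequent ones
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.append(level)
        # generate (k+1)-candidates and prune those with an infrequent k-subset
        prev = list(level)
        unions = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        candidates = [c for c in unions
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent

for level in frequent_itemsets(transactions, min_support=2):
    for itemset, count in level.items():
        print(sorted(itemset), count)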
References:

 Data Mining for Business Intelligence - Shmueli G., Patel N. R. and Bruce P. C., Wiley India
 Data Mining: Concepts and Techniques - Han J. and Kamber M., Academic Press
Chapter-8
Predictive Analytics-
Time Series Analysis
Introduction

A sequence of observations over regularly


spaced intervals of time. For example:
· Monthly unemployment rates for the past
five years
· Daily production at a manufacturing plant
for a month
· Decade-by-decade population of a state
over the past century
Time Series Models:

Decomposition model:

By default, Minitab uses a multiplicative model. Use the multiplicative model when the size of the
seasonal pattern in the data depends on the level of the data. This model assumes that as the data
increase, so does the seasonal pattern. Most time series exhibit such a pattern.

The multiplicative model is:


Yt = Trend * Seasonal * Error, where Yt is the observation at time t.

Method:
1. Smooth the data using a centered moving average with a length equal to the length of the
   seasonal cycle.
2. Divide the data by the moving average to obtain what are often referred to as raw
   seasonal values.
3. For corresponding time periods in the seasonal cycles, determine the median of the raw
   seasonal values. For example, if you have 60 consecutive months of data (5 years),
   determine the median of the 5 raw seasonal values corresponding to January, to February,
   and so on.
4. Adjust the medians of the raw seasonal values so that their average is one. These adjusted
   medians constitute the seasonal indices.
5. Use the seasonal indices to seasonally adjust the data.
6. Fit a trend line to the seasonally adjusted data using least squares regression.
7. The data can be detrended by either dividing the data by the trend component
   (multiplicative model) or subtracting the trend component from the data (additive model).
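A minimal sketch of the same kind of multiplicative decomposition using statsmodels rather than Minitab; the series below is synthetic, and with the course data you would pass the 60 monthly Trade values as a pandas Series instead.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

t = np.arange(60)
idx = pd.date_range("2002-04-01", periods=60, freq="MS")
y = pd.Series((320 + 1.1 * t) * (1 + 0.02 * np.sin(2 * np.pi * t / 12)), index=idx)

result = seasonal_decompose(y, model="multiplicative", period=12)
print(result.seasonal.head(12))                # estimated seasonal indices
detrended = y / result.trend                   # divide out the trend component
seasonally_adjusted = y / result.seasonal      # divide out the seasonal component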
Trend Analysis:

You collect employment data in a trade business over 60 months and wish to
predict employment for the next 12 months.
Sales Performance Sales Performance Sales Performance
Month Trade Food Metals Month Trade Food Metals Month Trade Food Metals
Apr-02 322 53.5 44.2 Apr-04 330 52.3 42.5 Apr-06 361 54.8 49.6
May-02 317 53 44.3 May-04 326 51.5 42.6 May-06 354 54.2 49.9
Jun-02 319 53.2 44.4 Jun-04 329 51.7 42.3 Jun-06 357 54.6 49.6
Jul-02 323 52.5 43.4 Jul-04 337 51.5 42.9 Jul-06 367 54.3 50.7
Aug-02 327 53.4 42.8 Aug-04 345 52.2 43.6 Aug-06 376 54.8 50.7
Sep-02 328 56.5 44.3 Sep-04 350 57.1 44.7 Sep-06 381 58.1 50.9
Oct-02 325 65.3 44.4 Oct-04 351 63.6 44.5 Oct-06 381 68.1 50.5
Nov-02 326 70.7 44.8 Nov-04 354 68.8 45 Nov-06 383 73.3 51.2
Dec-02 330 66.9 44.4 Dec-04 355 68.9 44.8 Dec-06 384 75.5 50.7
Jan-03 334 58.2 43.1 Jan-05 357 60.1 44.9 Jan-07 387 66.4 50.3
Feb-03 337 55.3 42.6 Feb-05 362 55.6 45.2 Feb-07 392 60.5 49.2
Mar-03 341 53.4 42.4 Mar-05 368 53.9 45.2 Mar-07 396 57.7 48.1
Apr-03 322 52.1 42.2 Apr-05 348 53.3 45
May-03 318 51.5 41.8 May-05 345 53.1 45.5
Jun-03 320 51.5 40.1 Jun-05 349 53.5 46.2
Jul-03 326 52.4 42 Jul-05 355 53.5 46.8
Aug-03 332 53.3 42.4 Aug-05 362 53.9 47.5
Sep-03 334 55.5 43.1 Sep-05 367 57.1 48.3
Oct-03 335 64.2 42.4 Oct-05 366 64.7 48.3
Nov-03 336 69.6 43.1 Nov-05 370 69.4 49.1
Dec-03 335 69.3 43.2 Dec-05 371 70.3 48.9
Jan-04 338 58.5 42.8 Jan-06 375 62.6 49.4
Feb-04 342 55.3 43 Feb-06 380 57.9 50
Mar-04 348 53.6 42.8 Mar-06 385 55.8 50

What would you do?


Trend Analysis:

First, try a time series plot:

(Figure: Time Series Plot of Trade - the 60 monthly Trade values plotted against
index 1-60, rising from roughly 310 to 400.)
Because there is an overall curvilinear pattern to the data, you use trend
analysis and fit a quadratic trend model.
Trend Analysis:

(Figure: Trend Analysis Plot for Trade - Quadratic Trend Model
  Yt = 320.76 + 0.509*t + 0.01075*t^2
  Accuracy Measures: MAPE 1.7076, MAD 5.9566, MSD 59.1305
  showing actual values and fits.)

Now, what are MAPE, MAD, & MSD?
Trend Analysis:

MAPE, or Mean Absolute Percentage Error, measures the accuracy of fitted time series values. It expresses
accuracy as a percentage:

      MAPE = ( Σ |(y_t - ŷ_t) / y_t| / n ) × 100

where y_t equals the actual value, ŷ_t equals the fitted value, and n equals the number of observations.

MAD, which stands for Mean Absolute Deviation, measures the accuracy of fitted time series values. It
expresses accuracy in the same units as the data, which helps conceptualize the amount of error:

      MAD = Σ |y_t - ŷ_t| / n

where y_t equals the actual value, ŷ_t equals the fitted value, and n equals the number of observations.

MSD stands for Mean Squared Deviation. MSD is always computed using the same denominator, n,
regardless of the model, so you can compare MSD values across models. MSD is a more sensitive measure
of an unusually large forecast error than MAD:

      MSD = Σ (y_t - ŷ_t)^2 / n

where y_t equals the actual value, ŷ_t equals the forecast value, and n equals the number of forecasts.

How do we forecast?
Trend Analysis:

(Figure: Trend Analysis Plot for Trade with forecasts - Quadratic Trend Model
  Yt = 320.76 + 0.509*t + 0.01075*t^2
  Accuracy Measures: MAPE 1.7076, MAD 5.9566, MSD 59.1305
  showing actual values, fits, and forecasts.)

How are the fits and residuals?
Trend Analysis:
Fits & Residuals: Sales Performance Sales Performance
Month Trade FITS1 RESI1 Month Trade FITS1 RESI1
Apr-02 322 321.2821 0.717901 Nov-04 354 348.0654 5.934649
May-02 317 321.8237 -4.823709 Dec-04 355 349.2732 5.726816
Jun-02 319 322.3868 -3.386809 Jan-05 357 350.5025 6.497491
Jul-02 323 322.9714 0.0286 Feb-05 362 351.7533 10.24668
Aug-02 327 323.5775 3.422517 Mar-05 368 353.0256 14.97437
Sep-02 328 324.2051 3.794943 Apr-05 348 354.3194 -6.319429
Oct-02 325 324.8541 0.145879 May-05 345 355.6347 -10.63472
Nov-02 326 325.5247 0.475323 Jun-05 349 356.9715 -7.971499
Dec-02 330 326.2167 3.783276 Jul-05 355 358.3298 -3.32977
Jan-03 334 326.9303 7.069738 Aug-05 362 359.7095 2.290468
Feb-03 337 327.6653 9.334708 Sep-05 367 361.1108 5.889214
Mar-03 341 328.4218 12.57819 Oct-05 366 362.5335 3.466469
Apr-03 322 329.1998 -7.199823 Nov-05 370 363.9778 6.022234
May-03 318 329.9993 -11.99933 Dec-05 371 365.4435 5.556507
Jun-03 320 330.8203 -10.82032 Jan-06 375 366.9307 8.069289
Jul-03 326 331.6628 -5.662804 Feb-06 380 368.4394 11.56058
Aug-03 332 332.5268 -0.52678 Mar-06 385 369.9696 15.03038
Sep-03 334 333.4122 0.587753 Apr-06 361 371.5213 -10.52131
Oct-03 335 334.3192 0.680795 May-06 354 373.0945 -19.09449
Nov-03 336 335.2477 0.752346 Jun-06 357 374.6892 -17.68917
Dec-03 335 336.1976 -1.197594 Jul-06 367 376.3053 -9.305332
Jan-04 338 337.169 0.830974 Aug-06 376 377.943 -1.942988
Feb-04 342 338.1619 3.838052 Sep-06 381 379.6021 1.397865
Mar-04 348 339.1764 8.823638 Oct-06 381 381.2828 -0.282773
Apr-04 330 340.2123 -10.21227 Nov-06 383 382.9849 0.015098
May-04 326 341.2697 -15.26966 Dec-06 384 384.7085 -0.708522
Jun-04 329 342.3485 -13.34855 Jan-07 387 386.4536 0.546367
Jul-04 337 343.4489 -6.448927 Feb-07 392 388.2202 3.779765
Aug-04 345 344.5708 0.429204 Mar-07 396 390.0083 5.991671
Sep-04 350 345.7142 4.285843
Oct-04 351 346.879 4.120992

Now how do we decompose?


Decomposition
Time Series Decomposition for Trade

Multiplicative Model

Data      Trade
Length    60
NMissing  0

Fitted Trend Equation: Yt = 316.58 + 1.08*t

Accuracy Measures: MAPE 0.8908, MAD 3.0351, MSD 16.5285

Seasonal Indices
  Period   Index
   1       0.97552
   2       0.96163
   3       0.96591
   4       0.98339
   5       1.00159
   6       1.00999
   7       1.00511
   8       1.00981
   9       1.00949
  10       1.01591
  11       1.02494
  12       1.03671
Decomposition - Seasonal Indices:

(Figure: Time Series Decomposition Plot for Trade, Multiplicative Model, showing
actual values, fits and trend; MAPE 0.8908, MAD 3.0351, MSD 16.5285.)
Decomposition - Detrending and Deseasonalising:

(Figure: Component Analysis for Trade, Multiplicative Model - four panels showing the
original data, the detrended data, the seasonally adjusted data, and the seasonally
adjusted and detrended data, each plotted against Month from Apr-02 to Mar-07.)
Decomposition - Seasonal Indices:

(Figure: Seasonal Analysis for Trade, Multiplicative Model - four panels showing the
seasonal indices, the detrended data by season, the percent variation by season, and
the residuals by season, for seasons 1-12.)
Diagnostic Checking of Error:

(Figure: residual diagnostic plots for Trade - normal probability plot of the
residuals, residuals versus fitted values, histogram of the residuals, and residuals
versus observation order.)
Decomposition- Interpretation:
Decomposition generates three sets of plots:

· A time series plot that shows the original series with the fitted trend line, predicted values,
and forecasts.

· A component analysis - in separate plots are the series, the detrended data, the seasonally
adjusted data, the seasonally adjusted and detrended data (the residuals).

· A seasonal analysis - charts of seasonal indices and percent variation within each season
relative to the sum of variation by season and boxplots of the data and of the residuals by
seasonal period.

In addition, the fitted trend line, the seasonal indices, the three accuracy measures - MAPE,
MAD, and MSD - and the forecasts are displayed in the Session window.

In the example, the first graph shows that the detrended residuals from trend analysis are fit
fairly well by decomposition, except that part of the first annual cycle is underpredicted and the
last annual cycle is overpredicted. This is also evident in the lower right plot of the second
graph; the residuals are highest in the beginning of the series and lowest at the end.
Forecasting from Decomposed Model:

Forecasts
  Period   Forecast
  61       372.964
  62       368.687
  63       371.370
  64       379.150
  65       387.248
  66       391.582
  67       390.775
  68       393.691
  69       394.655
  70       398.256
  71       402.901
  72       408.646

(Figure: Time Series Decomposition Plot for Trade, Multiplicative Model, showing
actual values, fits, trend and forecasts; MAPE 0.8908, MAD 3.0351, MSD 16.5285.)
Moving Average:
Averages calculated from artificial subgroups of consecutive observations. In
control charting, you can create a moving average chart for time weighted data. In
time series analysis, use moving average to smooth data and reduce random
fluctuations in a time series.

For example, an office products supply company monitors inventory levels every
day. They want to use moving averages of length 2 to track inventory levels to
smooth the data. Here are the data collected over 8 days for one of their products.

Day 1 2 3 4 5 6 7 8
Inventory Level 4310 4400 4000 3952 4011 4000 4110 4220
Moving average 4310 4355 4200 3976 3981.5 4005.5 4055 4165

The first moving average is 4310, which is the value of the first observation. (In
time series analysis, the first number in the moving average series is not
calculated; it is a missing value.) The next moving average is the average of the
first two observations, (4310 + 4400) / 2 = 4355. The third moving average is
the average of observation 2 and 3, (4400 + 4000) / 2 = 4200, and so on. If you
want to use a moving average of length 3, three values are averaged instead of
two.
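A minimal pandas sketch of the length-2 moving average described above.

import pandas as pd

inventory = pd.Series([4310, 4400, 4000, 3952, 4011, 4000, 4110, 4220])
ma2 = inventory.rolling(window=2).mean()
print(ma2)   # first value is missing (NaN), then 4355, 4200, 3976, 3981.5, ...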
Moving Average:
Moving Average smoothes your data by averaging
consecutive observations in a series and provides short-
term forecasts. This procedure can be a likely choice
when your data do not have a trend or seasonal
component. There are ways, however, to use moving
averages when your data possess trend and/or
seasonality.
Use for:
· Data with no trend, and
· Data with no seasonal pattern
· Short term forecasting
Forecast profile:
· Flat line
ARIMA equivalent: none
Single Exponential Smoothing:
Single exponential smoothing smoothes your data by computing exponentially weighted
averages and provides short-term forecasts.

Single Exponential Smoothing for Trade

Data     Trade
Length   60

Smoothing Constant: Alpha 1.26370

Accuracy Measures: MAPE 1.2303, MAD 4.2754, MSD 42.9460

Forecasts
  Period   Forecast   Lower      Upper
  61       396.760    386.285    407.235
  62       396.760    386.285    407.235
  63       396.760    386.285    407.235
  64       396.760    386.285    407.235
  65       396.760    386.285    407.235

(Figure: Smoothing Plot for Trade, Single Exponential Method, showing actual values,
fits, forecasts and the 95% PI.)

The output reports the smoothing constant (weight) used and three measures that help you
determine the accuracy of the fitted values: MAPE, MAD, and MSD. For the single exponential
smoothing model these were 1.23, 4.27, and 42.94, respectively.
Double Exponential Smoothing:
Double exponential smoothing smoothes your data by Holt (and Brown as a special case)
double exponential smoothing and provides short-term forecasts. This procedure can work
well when a trend is present, but it can also serve as a general smoothing method.
Dynamic estimates are calculated for two components: level and trend.

Double Exponential Smoothing for Trade

Data     Trade
Length   60

Smoothing Constants: Alpha (level) 1.25883, Gamma (trend) 0.01218

Accuracy Measures: MAPE 1.0968, MAD 3.7958, MSD 43.9140

Forecasts
  Period   Forecast   Lower      Upper
  61       398.084    388.785    407.384
  62       399.760    382.835    416.685
  63       401.436    376.691    426.180
  64       403.111    370.492    435.730
  65       404.787    364.270    445.303

(Figure: Smoothing Plot for Trade, Double Exponential Method, showing actual values,
fits, forecasts and the 95% PI.)
Winters' Method:
Winters' Method smoothes your data by Holt-Winters exponential smoothing and provides
short- to medium-range forecasting. You can use this procedure when both trend and
seasonality are present, with these two components being either additive or
multiplicative. Winters' Method calculates dynamic estimates for three components:
level, trend, and seasonal.

Winters' Method for Trade

Multiplicative Method

Data     Trade
Length   60

Smoothing Constants: Alpha (level) 0.2, Gamma (trend) 0.2, Delta (seasonal) 0.2

Accuracy Measures: MAPE 1.0046, MAD 3.4131, MSD 22.1571

Forecasts
  Period   Forecast   Lower      Upper
  61       374.621    366.259    382.983
  62       369.368    360.875    377.861
  63       372.484    363.845    381.123
  64       380.183    371.384    388.983
  65       387.819    378.846    396.793

(Figure: Winters' Method Plot for Trade, Multiplicative Method, showing actual values,
fits, forecasts and the 95% PI.)
Autocorrelation Method:

• Differences computes the differences between data values of a time series. If you
wish to fit an ARIMA model but there is trend or seasonality present in your data,
differencing data is a common step in assessing likely ARIMA models. Differencing is
used to simplify the correlation structure and to help reveal any underlying pattern.

• Lag computes lags of a column and stores them in a new column. To lag a time
series, Minitab moves the data down the column and inserts missing value symbols, *,
at the top of the column. The number of missing values inserted depends upon the
length of the lag.

• Autocorrelation computes and plots the autocorrelations of a time series.


Autocorrelation is the correlation between observations of a time series separated by k
time units. The plot of autocorrelations is called the autocorrelation function or ACF.
View the ACF to guide your choice of terms to include in an ARIMA model.

o Choose to use the default number of lags, which is n / 4 for a series with less
than or equal to 240 observations or sqrt(n) + 45 for a series with more than
240 observations, where n is the number of observations in the series.
Introduction to Dependent Observations

In autoregressive models, we consider observations taken over time.

To denote this, we will index the observations with the letter t rather
than the letter i.

Our data will be observations on Y1, Y2, ...Yt, ...where t indexes the day,
month, year, or any time interval.

Key new idea:

Exploit the dependence in the series

Time series analysis is about uncovering, modeling, and


exploiting dependence
Introduction to Dependent Observations

 We will NOT assume that Yt-1 is independent of Yt

 Example: Is tomorrow’s temperature independent of today’s?

Suppose y1 ...yT are the temperatures measured daily for several


years. Which of the following two predictors would work better:

i. the average of the temperatures from the previous year


ii. the temperature on the previous day?

 If the readings are iid N(μ, σ²), what would be your prediction for YT+1?

 This example demonstrates that we should handle dependent time


series quite differently from independent series.
The Lake Michigan Time Series

Storm off Promontory Point

The mean June level of lake Michigan in number of meters above sea
level (lmich_yr), 1918-2006

Use Minitab Time series Plot Command (under graph menu) to produce
this graph
Introduction to Dependent Observations
Monthly US Beer Production (millions of barrels)

(Figure: Time series plot of b_prod over roughly 72 months, with values between about
12 and 20 million barrels, showing strong seasonality.)
Introduction to Dependent Observations
What Does IID Data Look Like?

(Figure: Time series plot of 100 IID observations fluctuating around 0 - many (but
not too many) crossings of the mean.)
Checking for Independence

Knowing Yt does not help you in predicting Yt+1


It is not always easy just to look at the data and
decide whether a time series is independent.

So how can we tell?

Plot Yt vs. Yt-1 to check for a relationship


or
Plot Yt vs. Yt-s for s = 1, 2, …

Checking for Independence
How do we do this in Minitab? – Use the “lag”
MTB > lag c2 c3
MTB > lag c3 c4

C1 C2 C3 C4
t Y(t) Y(t-1) Y(t-2)
1 5 * *
2 8 5 *
3 1 8 5
Now each row has Y at
4 3 1 8 time t, Y one period
5 9 3 1 ago, and Y two periods
ago
6 4 9 3

Y Y lagged once Y lagged twice
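A minimal pandas equivalent of the Minitab lag step above, shifting the series down by one and two periods so each row holds Y(t), Y(t-1) and Y(t-2).

import pandas as pd

y = pd.Series([5, 8, 1, 3, 9, 4], name="Y(t)")
lagged = pd.DataFrame({"Y(t)": y, "Y(t-1)": y.shift(1), "Y(t-2)": y.shift(2)})
print(lagged)   # shifted rows start with NaN, like the * symbols in Minitab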
Checking for Independence
Now let’s return to the lake data…
Each point is a pair of adjacent years.
e.g. (Level1929, Level1930)

First, let’s plot Levelt vs. Levelt-1

Corr = .794

Now, let’s plot Levelt vs. Levelt-2

Corr = .531
Autocorrelation

Time series is about dependence. We use correlation as a measure of


dependence.

Although we have only one variable, we can compute the correlation


between Yt and Yt-1 or between Yt and Yt-2.

The correlations between Y’s at different times are called


autocorrelations.

However, we must assume that all the Y’s have:


 same mean (no upward or downward trends)

 same variances
Autocorrelation

We will assume what is known as stationarity.

Roughly speaking this means:


 The time series varies about a fixed mean and has constant

variance
 The dependence between successive observations does not

change over time

Let's define the autocorrelations for a stationary time series:

      ρ_s = cov(Y_t, Y_{t-s}) / sqrt( Var(Y_t) · Var(Y_{t-s}) ) = cov(Y_t, Y_{t-s}) / Var(Y_t)

 Note that the autocorrelation does not depend on t because we
   have assumed stationarity
Autocorrelation

 We estimate the theoretical quantities by using sample averages (as always).
 The estimated or sample autocorrelations are:

      r_s = Σ_{t=s+1}^{T} (Y_t - Ȳ)(Y_{t-s} - Ȳ)  /  Σ_{t=1}^{T} (Y_t - Ȳ)^2
Autocorrelation

The ACF command in Minitab computes the autocorrelations

There is a strong
dependence
between
observations spaced
close together in
time (e.g only one or
two years apart). As
time passes, the
dependence
diminishes in
strength.
Autocorrelation
Let's look at the autocorrelations for the IID series.

(Figure: Autocorrelation Function for the IID series, with 5% significance limits for
the autocorrelations, lags 2-24.)

In contrast to the ACF for the 'level' series, the sample autocorrelations are much smaller.
Autocorrelation

How do we know if the sample autocorrelations are good


estimates of the underlying theoretical autocorrelations?
and
How do we know if we have enough sample information to reach
definitive conclusions?

 If all the true autocorrelations are 0, then the standard
   deviation of the sample autocorrelations is about 1/sqrt(T):

      Std Err(r_s) ≈ 1 / sqrt(T)

   where T = total number of observations or time periods
Autocorrelation

For the IID series

All of the sample autocorrelations are within 2 standard deviations


of 0 -- no evidence of positive autocorrelation in the data.

For the level series

T=89 so the standard deviation is again about 0.1. The first


autocorrelation is many standard deviations away from 0,
suggesting strongly that the data are not iid.
Autocorrelation
Another Example: Stock Returns

Monthly returns on IBM…

(Figure: Time Series Plot of IBM-ret - monthly IBM returns over 360 months, roughly
between -0.2 and 0.2.)
Autocorrelation
Let's look at the ACF for the series.

(Figure: Autocorrelation Function for IBM-ret, with 5% significance limits for the
autocorrelations, lags 1-60.)

The series is independent.
The AR(1) Model
A simple way to model dependence over time is with the
"autoregressive model of order 1."

This is a SLR model of Yt regressed on lagged Yt-1:

      AR(1):  Y_t = β_0 + β_1 Y_{t-1} + ε_t

What does the model say for the (T+1)st observation?

      Y_{T+1} = β_0 + β_1 Y_T + ε_{T+1}

The AR(1) model expresses what we don't know in terms of
what we do know at time T.

The AR(1) Model

 How should we predict Y_{T+1}?

      E[Y_{T+1} | Y_T] = β_0 + β_1 Y_T + E[ε_{T+1} | Y_T] = β_0 + β_1 Y_T

 How do we use the AR(1) model? We simply regress Y on lagged Y.
 If our model successfully captures the dependence structure in
   the data then the residuals should look iid. There should be no
   dependence in the residuals!
 So to check the AR(1) model, we can check the residuals from the
   regression for any "left-over" dependence.
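A minimal sketch of fitting an AR(1) model by regressing Y(t) on lagged Y(t-1), here on a simulated series rather than the lake data, followed by a quick check of the lag-1 autocorrelation of the residuals.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, beta0, beta1 = 200, 0.0, 0.8
y = np.zeros(n)
for t in range(1, n):                          # simulate Y(t) = b0 + b1*Y(t-1) + e(t)
    y[t] = beta0 + beta1 * y[t - 1] + rng.normal(scale=0.5)

X = sm.add_constant(y[:-1])                    # lagged Y with an intercept column
fit = sm.OLS(y[1:], X).fit()                   # regress Y(t) on Y(t-1)
print(fit.params)                              # estimates of beta0 and beta1

# residuals should look iid: the lag-1 autocorrelation of the residuals should be small
r1 = np.corrcoef(fit.resid[1:], fit.resid[:-1])[0, 1]
print(round(r1, 3))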
The AR(1) Model
Let’s try it out on the lake water level data...
Regression Analysis: level versus level_t-1

The regression equation is


level = 36.8 + 0.792 level_t-1

88 cases used, 1 cases contain missing values

Predictor Coef SE Coef T P


Constant 36.79 11.55 3.18 0.002
level_t-1 0.79161 0.06543 12.10 0.000

S = 0.236208 R-Sq = 63.0% R-Sq(adj) = 62.6%

Analysis of Variance

Source DF SS MS F P
Regression 1 8.1675 8.1675 146.39 0.000
Residual Error 86 4.7983 0.0558
Total 87 12.9657
The AR(1) Model
Now let’s look at the ACF of the residuals…

Not much
autocorrelation
left!
The AR(1) Model
Now let’s try the beer data…

MTB > lag c1 c2


MTB > name c2 ‘bprod-1’

Regression Analysis

The regression equation is


b_prod = 4.78 + 0.704 bprod-1

71 cases used 1 cases contain missing values


Predictor Coef SE Coef T P
Constant 4.778 1.425 3.35 0.001
bprod-1 0.70429 0.08724 8.07 0.000

s = 1.386 R-sq = 48.6% R-sq(adj) = 47.8%


The AR(1) Model
Now let's look at the ACF of the residuals...

(Figure: Autocorrelation Function for RESI1, with 5% significance limits for the
autocorrelations, lags 2-18.)

There's a lot of autocorrelation left in.

Why at lags 6 and 12?
The AR(1) Model

 To gain a better feel for this model, let's simulate data
   series from the model with various parameter settings...

(Figure: Time series plot of 100 observations simulated from an AR(1) model with
β_0 = 0 and β_1 = 0.8.)

The series fluctuates around a mean level with fairly long "runs".
Now the ACF… AR(1) Model
Autocorrelation Function for AR(1)
(with 5% significance limits for the autocorrelations)

1.0
0.8
0.6
0.4
Autocorrelation

0.2
0.0
-0.2
-0.4
-0.6
-0.8
-1.0

2 4 6 8 10 12 14 16 18 20 22 24
Lag

The ACF reveals the strong dependence in the series!


Note the smooth decline from about .8
The AR(1) Model

 Now let's look at a series generated with a negative
   slope value...

(Figure: Time series plot of 100 observations simulated from an AR(1) model with
β_0 = 0 and β_1 = -0.8.)

Because β_1 is negative, an above-average Y tends to be followed
by a below-average Y (and vice versa), hence the jagged
appearance of the plot.
The AR(1) Model
...and the ACF:

(Figure: Autocorrelation Function for the AR(1) series with β_1 = -0.8, with 5%
significance limits for the autocorrelations, lags 2-24.)

This choppy behavior is reflected in the ACF.
ARIMA Method:

Use ARIMA to model time series behavior and to generate forecasts. ARIMA fits
a Box-Jenkins ARIMA model to a time series. ARIMA stands for Autoregressive
Integrated Moving Average, with each term representing steps taken in the
model construction until only random noise remains. ARIMA modeling differs
from the other time series methods discussed in this chapter in that
ARIMA modeling uses correlation techniques. ARIMA can be used to model
patterns that may not be visible in plotted data. The concepts used in this
procedure follow Box and Jenkins.

The ACF and PACF of the food employment data suggest an autoregressive model of order 1, or AR(1),
after taking a difference of order 12. You fit that model here, examine diagnostic plots, and examine the
goodness of fit. To take a seasonal difference of order 12, you specify the seasonal period to be 12 and the
order of the difference to be 1. In the subsequent example, you perform forecasting.
1 Open the worksheet EMPLOY.MTW.
2 Choose Stat > Time Series > ARIMA.
3 In Series, enter Food.
4 Check Fit seasonal model. In Period, enter 12. Under Nonseasonal, enter 1 in Autoregressive.
Under Seasonal, enter 1 in Difference.
5 Click Graphs. Check ACF of residuals and PACF of residuals.
6 Click OK in each dialog box.
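A minimal sketch of the same specification in Python using statsmodels' SARIMAX: an AR(1) term with one seasonal difference of period 12. The series below is synthetic and merely stands in for the Food column of EMPLOY.MTW.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
t = np.arange(60)
food = pd.Series(55 + 0.05 * t + 5 * np.sin(2 * np.pi * t / 12)
                 + rng.normal(scale=0.5, size=60),
                 index=pd.date_range("2002-04-01", periods=60, freq="MS"))

# AR(1) with one seasonal difference of order 12, analogous to the Minitab setup above
model = SARIMAX(food, order=(1, 0, 0), seasonal_order=(0, 1, 0, 12), trend="c")
fit = model.fit(disp=False)
print(fit.params)
print(fit.forecast(steps=12))                  # forecasts for the next 12 months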
ARIMA Method:
ARIMA Model: Food

Estimates at each iteration

  Iteration      SSE       Parameters
  0          95.2343       0.100  0.847
  1          77.5568       0.250  0.702
  2          64.5317       0.400  0.556
  3          56.1578       0.550  0.410
  4          52.4345       0.700  0.261
  5          52.2226       0.733  0.216
  6          52.2100       0.741  0.203
  7          52.2092       0.743  0.201
  8          52.2092       0.743  0.200
  9          52.2092       0.743  0.200

Relative change in each estimate less than 0.0010

Final Estimates of Parameters

  Type       Coef     SE Coef    T       P
  AR 1       0.7434   0.1001     7.42    0.000
  Constant   0.1996   0.1520     1.31    0.196

Differencing: 0 regular, 1 seasonal of order 12
Number of observations: Original series 60, after differencing 48
Residuals: SS = 51.0364 (backforecasts excluded)
           MS = 1.1095   DF = 46

Modified Box-Pierce (Ljung-Box) Chi-Square statistic

  Lag          12      24      36      48
  Chi-Square   11.3    19.1    27.7    *
  DF           10      22      34      *
  P-Value      0.338   0.641   0.768   *
ARIMA Method:

(Figure: PACF of Residuals for Food and ACF of Residuals for Food, each with 5%
significance limits, lags 1-12.)
ARIMA Method:

After you have identified one or more likely models, you need to specify the model in the main ARIMA dialog box.
· If you want to fit a seasonal model, check Fit seasonal model and enter a number to specify the period. The period
is the span of the seasonality or the interval at which the pattern is repeated. The default period is 12.
You must check Fit seasonal model before you can enter the seasonal autoregressive and moving average parameters
or the number of seasonal differences to take.

· To specify autoregressive and moving average parameters to include in nonseasonal or seasonal ARIMA models, enter
a value from 0 to 5. The maximum is 5. At least one of these parameters must be nonzero. The total for all parameters
must not exceed 10. For most data, no more than two autoregressive parameters or two moving average parameters are
required in ARIMA models.

Suppose you enter 2 in the box for Moving Average under Seasonal, the model will include first and second order
moving average terms.
· To specify the number of nonseasonal and/or seasonal differences to take, enter a number in the appropriate box. If
you request one seasonal difference with k as the seasonal period, the kth difference will be taken.

· To include the constant in the model, check Include constant term in model.

· You may want to specify starting values for the parameter estimates. You must first enter the starting values in a
worksheet column in the following order: AR's (autoregressive parameters), seasonal AR's, MA's (moving average
parameters), seasonal MA's, and if you checked Include constant term in model enter the starting value for the
constant in the last row of the column. This is the same order in which the parameters appear on the output. Check
Starting values for coefficients, and enter the column containing the starting values for each parameter included in
the model. Default starting values are 0.1 except for the constant.
ARIMA Method:
Box and Jenkins present an iterative approach for fitting ARIMA models to time series.
This iterative approach involves identifying the model, estimating the parameters, checking
model adequacy, and forecasting, if desired. The model identification step generally requires
judgment from the analyst.

1 First, decide if the data are stationary. That is, do the data possess constant
  mean and variance.

 Examine a time series plot to see if a transformation is required to give constant


variance.
 Examine the ACF to see if large autocorrelations do not die out, indicating that
differencing may be required to give a constant mean.
A seasonal pattern that repeats every kth time interval suggests taking the kth difference to
remove a portion of the pattern. Most series should not require more than two difference
operations or orders. Be careful not to overdifference. If spikes in the ACF die out rapidly,
there is no need for further differencing. A sign of an overdifferenced series is the first
autocorrelation close to -0.5 and small values elsewhere
Use Stat > Time Series > Differences to take and store differences. Then, to examine
the ACF and PACF of the differenced series, use Stat > Time Series > Autocorrelation
and Stat > Time Series > Partial Autocorrelation.
ARIMA Method:
2 Next, examine the ACF and PACF of your stationary data in order to identify
what autoregressive or moving average models terms are suggested.

· An ACF with large spikes at initial lags that decay to zero or a PACF with a large spike at
the first and possibly at the second lag indicates an autoregressive process.

· An ACF with a large spike at the first and possibly at the second lag and a PACF with
large spikes at initial lags that decay to zero indicates a moving average process.

· The ACF and the PACF both exhibiting large spikes that gradually die out indicates that
both autoregressive and moving averages processes are present.

For most data, no more than two autoregressive parameters or two moving average
parameters are required in ARIMA models.
ARIMA Method:
3 Once you have identified one or more likely models, you are ready to use the
ARIMA procedure.

· Fit the likely models and examine the significance of parameters and select
one model that gives the best fit.

· Check that the ACF and PACF of residuals indicate a random process,
signified when there are no large spikes. You can easily obtain an ACF and a
PACF of residual using ARIMA's Graphs subdialog box. If large spikes remain,
consider changing the model.

· You may perform several iterations in finding the best model. When you are
satisfied with the fit, go ahead and make forecasts.

The ARIMA algorithm will perform up to 25 iterations to fit a given model. If the
solution does not converge, store the estimated parameters and use them as
starting values for a second fit. You can store the estimated parameters and use
them as starting values for a subsequent fit as often as necessary.
ARIMA Method:
In the example of fitting an ARIMA model, you found that an AR(1) model with a
twelfth seasonal difference gave a good fit to the food sector employment data.
You now use this fit to predict employment for the next 12 months.
Step 1: Refit the ARIMA model without displaying the acf and pacf of the
residuals

1 Perform steps 1- 4 of Example of ARIMA.


Step 2: Display a time series plot
1 Click Graphs. Check Time series plot. Click OK.
Step 3: Generate the forecasts
1 Click Forecast. In Lead, enter 12. Click OK in each dialog box.
ARIMA Method:
Forecasts from period 60
95% Limits
Period Forecast Lower Upper Actual
61 56.4121 54.3472 58.4770
62 55.5981 53.0251 58.1711
63 55.8390 53.0243 58.6537
64 55.4207 52.4809 58.3605
65 55.8328 52.8261 58.8394
66 59.0674 56.0244 62.1104
67 69.0188 65.9559 72.0817
68 74.1827 71.1089 77.2565
69 76.3558 73.2760 79.4357
70 67.2359 64.1527 70.3191
71 61.3210 58.2360 64.4060
72 58.5100 55.4240 61.5960