Introduction:
o Introduction to data mining and business analytics – current status and examples
o Types of problems encountered, with examples; concepts of the data generation process
About the Business:
Stakeholders are:
Owners/ Investors/ Shareholders/ Management.
Customer
Government
Design Collaborator
Employee
Supplier
Supply Chain
Society
Intellectual community
[Diagram: routes to higher productivity – more output from less resources; more output from same resources; more output from more resources/investments; better sensitivity from less investment]
[Diagram: growth levers – the customer looks for more and more; reducing defects reduces opportunities for defects (waste); new markets open, enjoying customers come in, capacity expands, new products follow]
Any organisation will have a core and core processes. The core is the kind of products and services it offers to the customer to pursue a purpose for the benefit of society.
The core does not gel well with the concept of 'diversification', but it promotes the concept of 'expansion'.
[Process chain: P1 Confirm Order → P2 Develop New Product/Process → P3 Fulfill Order → P4 Deliver Order → P5 Collect Money → P6 Collect Customer Feedback]
Entry Strategy
Salesmanship – the way to win the customer
First Serving Strategy
Requires more toil
Requires strong commitment
Demonstration of honesty to the customer
"It costs 6–7 times more to acquire a new customer than retain an existing one." – Bain & Company
"The probability of selling to an existing customer is 60–70%. The probability of selling to a new prospect is 5–20%." – Marketing Metrics
"A 2% increase in customer retention has …"
Data and Extraction of Information - Current Scenario
• The growth of data availability is mind-boggling. According to Intel, the quantity of information generated from the dawn of human history till 2003 – some 5 exabytes – is now created every two days
• Data processing and storage costs have decreased by a factor of 1000 over the past decade
• Technologies like Hadoop and MapReduce eliminate the need to structure the data in rigidly defined formats – a costly, labour-intensive proposition
• Powerful techniques for analyzing data to extract various insights have been developed, and software is available to enable easy implementation
• Advanced statistical, optimization, machine-learning and data-mining techniques enable extraction of hitherto unavailable insights
Assumptions (Continued…)
In case we are interested in a specific outcome, its characteristics should be measurable
Status of a sales offer – a consumer may or may not buy
the product that she enquired about. Accordingly the
outcome variable may be binary – 0 or 1.
Time to failure for a television – the number of hours the
TV set has operated before failing. The outcome variable
will be a real number starting from 0.
The number of near misses or minor accidents a driver
had during a period (or for driving certain distance). The
outcome will be an integer count starting from 0
The perception score like outstanding/ very good/ good/
acceptable/ poor given by a customer regarding the quality
of service of a restaurant. Here the outcome is an ordered
categorical variable with 5 possible values.
Assumptions (Continued…)
It is assumed that the behaviour / usage / habits
as well as outcome are measured on the same
entity. The following must necessarily be true
The entity being studied should be well
defined and clearly identifiable
The outcome and the characteristics of the
entity to be measured must be known in
advance
The characteristics of the entity as well as the
outcome must be observable as well as
measurable
Assumptions (Continued…)
In some cases we may not have a clearly defined
outcome
We may like to assess the sentiments as positive, negative or neutral
Two Types of Problems of BA
Supervised Analytics
When the response is predefined and we are interested in modelling its relationship with the input variables
Examples of Supervised Analytics
Examples of Supervised Analytics (Continued…)
An engineer in a chemical plant wants to understand the relationship between the characteristics of a batch, e.g. the proportion of certain ingredients and the maximum temperature and pressure, and the batch quality. Here batch quality is the response and the individual batches are the units of analysis (sampling units).
Note that defining the unit of analysis may not
be easy in case of continuous production
Examples of Supervised Analytics (Continued…)
An investment consultant is engaged in
estimating the possible closing values of a
particular stock on a daily basis. She uses the
data on the past values to make the prediction
for future days. Here the particular stock being
studied is the entity. The past closing values (on
daily basis) are the input (X) variables. The
value for the next day (if you have values for days 1, 2, …, n, then the value for day n + 1) is the response (Y).
Examples of Unsupervised Analytics
A large retailer wants to open a new outlet. Noting that
a new outlet is expensive, the company is planning to
survey the proposed locations to assess potential.
However, a large number of locations have been
identified and the company finds that surveying all
locations is also a costly proposition. In order to reduce
costs, the company collects secondary data relevant to
the potential of the locations and develops a small
number of similar clusters such that only one location
from each cluster may be surveyed. Once the ‘best’
cluster is chosen, survey may be conducted for a few of
the locations within the cluster so that the company can
arrive at a ‘good’ location at a low cost.
Unsupervised Analytics (Continued…)
Software development is a skill intensive activity and it is
important to develop a holistic methodology to measure
the skills of individual software developers. Skill is
unobservable but it has many constitutive components
that may be observed and measured at least as expert
rating. It is important to group these constituent
components into a few broad dimensions so that scales
for measuring these dimensions may be constructed
using a subset of the proposed constitutive components.
It is also important to assess whether the proposed
constitutive components cover all the important
dimensions. It may be noted that the individual software
developers form the units of analysis and the
constitutive skill components are the input (X) variables.
It may further be noted that there are no responses (Y
variables).
Three Pillars of Business Analytics
Acquisition, storage and preliminary compilation of data.
Includes acquisition of unstructured data using
technology and preliminary compilation includes
visualization and descriptive analyses
In depth analyses of data using statistical and machine
learning techniques. At this stage we test hypotheses,
build models to uncover relationships and discover
patterns that are not easily visible
Understanding the business perspective such that the
problems may be appropriately formulated, interesting
hypotheses may be proposed, right variables may be
identified, and the results may be communicated to the
business users in their language
29
Components of Business Analytics
Data acquisition and processing (mostly compilation): operational databases, data warehouses, online processing and mining, enterprise information management systems, data acquisition and cleaning, and big data engineering technologies like Hadoop & MapReduce
Business Analytics Process
Problem Statement → Problem Formulation → Data Understanding
Operational Databases → Data Repository
Data Preparation
Most organizations maintain data to support regular operations.
These are referred to as operational data.
For instance, procurement department maintains data on
vendors, prices, and time to supply; manufacturing department
maintains data on defects, production, and manufacturability;
and engineering / design department maintains data on changes
made to drawings of parts procured from vendors. However, improvement of manufacturability requires data from all three departments. Often, getting data for the same entity is difficult. For example, which parts supplied by which vendors according to which drawing numbers were used in a particular assembly, and what the results were, may not be easy to gather.
Data preparation requires compiling different operational data to obtain an overall holistic view. This activity is often the same as developing a data warehouse.
Comparison of BA and BI
Business Intelligence (BI) involves
Developing warehouses from operational data
Providing elementary capabilities for data visualization and
descriptive analyses like
Reporting quantum of sales, showing trends, allowing putting up
business specific alerts, allowing users to get different views
Business Analytics (BA) involves
Identifying the possible cause rather than only answering ‘what’
and ‘where’ addressed by BI
Predicting possible outcomes and even automating decisions like
making suggestions to buyers for improving sales
Uncovering interesting patterns that can lead to valuable insights
What Have We Learnt?
What Have We Learnt? (Continued…)
The supervised analytics problem is often looked at as a
problem of fitting a function like y = f(x) on the basis of
the quantitative data
The measurements of the X and Y variables must be
carried out on the same entity. Thus successful
application of BA techniques requires identifying the
entities to be studied and ensuring that the identified
variables are measurable. Ensuring that all
observations are taken on the same entity is of
utmost importance
What Have We Learnt? (Continued…)
Identification of the variables and defining the
problem require substantive knowledge. This
does not come under the purview of the
statistical / machine learning techniques that we
will be covering in this course
In case of supervised learning, care must be
taken to ensure that the X and Y variables are
expected to be related from substantive
perspective
In case of unsupervised learning, the identified
patterns must make business sense
What Have We Learnt? (Continued…)
Business analytics have three important
constituents – data acquisition, storage,
compilation and preliminary analyses; in depth
analyses using statistical / machine learning
techniques; and managerial and business
understanding for problem formulation as well
as effective communication with business users
What Have We Learnt? (Continued…)
Usually businesses maintain data to support their
regular operations. These are called operational
data
Successful business analytics requires connecting
different operational data and developing a data
warehouse. This activity is referred to as data
preparation
Business analytics differ from BI as it offers in depth
insights not available from preliminary compilations
and visualization. BI is restricted to development of
data warehouses and providing preliminary
descriptive and visualization tools
Review Questions
What is meant by learning from data? How is it related to
business problems?
What are supervised and unsupervised learning (analytic)
problems? Give examples.
What are response and explanatory variables? Give examples.
What is a sampling unit (unit of analysis)? How and why is it
important in the context of business analytics?
What is the role of substantive knowledge in the context of
business analytics?
What are the constituents of business analytics?
What is operational data? What is a data warehouse?
What is meant by data preparation? How is it relevant to BA?
What are the key differences between BA and BI?
Coverage
In this course we will primarily look at formulation of
problems; understanding and preparing data; and
developing models to solve the identified problems
The data preparation will be covered partially as we
will look into the issues of treatment of missing data
but will not cover technical issues of data capture or
the maintenance of data warehouses
Concepts of modeling (we will refer to these as statistical / machine learning techniques; they form the core of data science) will be covered in detail
Some deployment and managerial issues will be
discussed from theoretical standpoint as well as
through case examples
About the Business:
Organisation is in the business to ensure sustainable growth in
profit.
Financial results do not mean profit alone. The organisation looks for:
Quantum of money – revenue (top line)
DATA
Lots of data is being collected and warehoused:
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Transactions at malls
…
Types of Attributes
Examples
Missing values
Outlier treatment: meaning, influence, remedy
Data integration and Transformation
Summary
Boxplots
Multivariate Data Analysis
Conventional Analysis :
Summary
Dispersion
Association (between attributes)
Correlation (between numeric variables)
Group comparison
Data Reduction
Principal component Analysis
Variables under study are transformed to a new set of uncorrelated variables (the principal components)
Basic assumptions
Use of regression
Subset selection
Data Reduction
Data Exploration
Data Visualization
Classification
Prediction
Association Rules or Affinity Analysis
Data Mining – Whose Domain?
Mathematician, Statistician
Supervised Learning
Prediction Methods
Use some variables to predict unknown or future values of other variables
Description Methods
Find human-interpretable patterns that describe the data
Data: Qualitative or Quantitative
Examples
Predictive Analytics
What is Data Mining?
Origins of Data Mining
[Diagram: data mining draws on statistics, machine learning and database systems; traditional techniques struggle with the enormity and distributed nature of the data]
Problem
Direct Marketing
Goal: Reduce cost of mailing by targeting the consumers most likely to respond
Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions.
Classification: Example
Approach:
Use credit card transactions and the information on the account holder as attributes.
When does the customer buy, what does he buy, how often does he pay on time, etc.
Label past transactions as fraud or fair transactions.
This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit
card transactions on an account.
Classification: Example
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to
be lost to a competitor.
Classification: Example
Approach:
Use detailed record of transactions with each of the
past and present customers, to find attributes.
How often the customer calls, where he calls, what time-
of-the day he calls most, his financial status, marital
status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Classification Techniques
Discriminant Analysis
Nearest Neighbour Rule
Logistic Regression
Bayesian Classifier
Decision Tree and CART
And many more, like ANN, SVM, etc.
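As a minimal illustration of one of the techniques listed above, here is a hedged sketch of a logistic regression classifier in R; the use of the built-in mtcars data and of vs as a binary response is an assumption for illustration only, not part of the slide.

# Sketch: logistic regression as a classifier (illustrative data choice)
fit <- glm(vs ~ wt + disp, data = mtcars, family = binomial)
summary(fit)                           # fitted coefficients and their significance
head(predict(fit, type = "response")) # estimated P(vs = 1) for each car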
Chapter-2: Big Data – Big Ticket Project in the World
• These data are inputs available in bits and pieces, usually not at regular intervals, in the context of tactical surveys, literature search and special-purpose investigations. Data sit in different pockets and in an apparently non-synchronisable manner.
Supply chain, technology collaborators, government, suppliers, financial institutes, media/newspapers and society all speak or carry out transactions differently, at different time intervals and in different forms, and their records are hardly in one-to-one correspondence, making them difficult to analyse.
EXPECTED QUESTIONS TO BIG DATA
Large-Scale Data Management
Big Data Analytics
Data Science and Analytics
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35 zettabytes (ZB)
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
Characteristics of Big Data:
2 – Complexity (Variety)
Defining Metadata
- Sorting
- Merging
- Aggregating
- Creating, recoding, renaming a variable
Handling missing values:
1) Numeric: delete observation(s), or replace by a global constant, the mean, the mode, etc.
2) Attribute: delete observation(s), or use experts' suggestion(s)
Data cleaning...
NA’s removed
Data cleaning...
An outlier in numeric data is a very large or very small value, far away from the natural flow of the data.
To find these outliers one can use in-built functions or write simple R code. Below is a simple sketch that removes outliers from your data; in general you need code for detecting and removing outliers from a file with several numeric features.
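A minimal sketch of such code, using the common 1.5 × IQR rule; the data frame name mydata and the cutoff k = 1.5 are assumptions, not prescribed by the slide.

# Drop rows whose value in any numeric column lies beyond 1.5 * IQR from the quartiles
remove_outliers <- function(df, k = 1.5) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      q <- quantile(df[[col]], c(0.25, 0.75), na.rm = TRUE)
      iqr <- q[2] - q[1]
      keep <- is.na(df[[col]]) | (df[[col]] >= q[1] - k * iqr & df[[col]] <= q[2] + k * iqr)
      df <- df[keep, ]    # keep only rows inside the fences for this column
    }
  }
  df
}
clean <- remove_outliers(mydata)   # mydata: assumed data frame with numeric features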
Tasks in data preparation
- Sorting
- Merging
- Aggregating
- Creating, recoding, renaming a variable
Example: mtcars ordered by mpg
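A one-line sketch of the sorting step shown above, using order(); mtcars and mpg match the example, the rest is standard R.

# Sort the data frame by mpg (ascending); use order(-mtcars$mpg) for descending
sorted <- mtcars[order(mtcars$mpg), ]
head(sorted)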
Merging
[Screenshot: a first data set and a second data set combined into a merged data set]
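A hedged sketch of the merge step shown above; the data frame names and the key column "id" are placeholders.

# Combine two data sets on a common key column
merged <- merge(first_data, second_data, by = "id")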
Creating, recoding and renaming variables
attach(mydata)
mydata$sum <- x1 + x2        # new variable: sum of x1 and x2
mydata$mean <- (x1 + x2)/2   # new variable: mean of x1 and x2
detach(mydata)
Renaming variables
You can rename variables programmatically or interactively.
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
# you can re-enter all the variable names in order, changing the ones you need
# to change; the limitation is that you need to enter all of them!
names(mydata) <- c("x1","age","y","ses")
# OR
names(mydata)[4] <- "New_name"
[Screenshot: renaming a variable by dataset name and assigned variable number]
Preparing data for time series
Data may be recorded yearly, quarterly, monthly or daily.
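A small sketch of how these frequencies are declared in R with ts(); the vector x and the start dates are placeholders.

# Tag a numeric vector with a time index; frequency gives observations per year
y_yearly    <- ts(x, start = 2010, frequency = 1)        # yearly
y_quarterly <- ts(x, start = c(2010, 1), frequency = 4)  # quarterly
y_monthly   <- ts(x, start = c(2010, 1), frequency = 12) # monthly
# daily data are often declared with frequency = 365, or handled via packages such as zoo/xts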
Chapter-5: Visual Analytics and Exploratory Data Analysis
◦ Data Types
◦ Basic Statistics
◦ Summary of Statistics
◦ Data Visualisation
◦ Distributions
◦ Test of Hypothesis
Linkages within Summary of Statistics, Data
Visualisation & Exploratory Data Analysis- Univariate
Linkages within Summary of Statistics, Data
Visualisation & Exploratory Data Analysis- Pairwise
Basic Concepts of Probability
Learning Objectives
Understand the concepts of events and
probability of events
Understand the notion of conditional
probabilities and independence of different kinds
Understand the concept of inverse probabilities
and Bayes’ theorem
Understand specific concepts of lift, support,
sensitivity and specificity
Develop ability to use these concepts for
formulation of business problems and providing
solutions to the same
Experiments
An experiment is a process – real or hypothetical
– that can be repeated many times and whose
possible outcomes are known in advance
Notes:
1. This is an intuitive definition but conveys the
meaning
2. We are discussing experiments where we agree about the possible outcomes at the outset
Examples of Experiments
Many activities in business, economics,
manufacturing and other areas may be considered
as experiments
A customer walks into a retail outlet. The total
value of goods bought in a single trip may be
considered to be an experiment. The possible
value could be any real number ≥ 0
The prepaid balance of the customer of a telecom
service provider might have become close to
zero. The customer may or may not buy further
talk time in a given period. The act of buying or
not buying may be looked at as an experiment
with two possible outcomes – 0 or 1.
More Examples
The fuel consumption of a car as it travels may
be considered to be an experiment. The fuel
consumed per kilometer travelled in any given
journey may be the outcome of the experiment
and it may assume any positive value.
A restaurant may approach its patrons and
request them to rate their service in a one to
five scale. The experiment has 5 possible
outcomes, assuming that no customer declines to provide feedback
Examples (Continued…)
A software development company tests the use cases as they
are developed. The testing may be considered to be the
experiment and the number of defects observed may be the
outcome. Observe that the outcome is an integer ≥ 0
Notes:
1. Notice that thinking in terms of experiments forces the analyst to concentrate on the entity defined in the previous section.
2. The experiment and its outcomes are necessarily idealized. Surely a use case cannot have 10^10 defects, nor can the fuel efficiency be 500000 km/lt. Also, defining a use case or a journey rigorously may not be possible. However, we need to keep in mind that any theory necessarily involves idealization.
Sample Space and Events
2. The sample space consists of all possible values of effort. In this example the
sample points are values like 1, 4, 5, 6…and so on.
4. Event that effort required > 50 hours; or 10 < effort < 100 hours are
examples of compound events
Suppose a telecom service provider has carried out a survey to find the level
of importance customers attach to various aspects of their experience of
using the service. Suppose the importance is given in a seven point scale (1
to 7) where 1 means least importance and 7 stands for the highest
importance. One of the aspects of customer experience is accuracy of bills
and suppose that the survey has yielded the following result
Value Frequency
1 1
2 3
3 6
4 13
5 72
6 135
7 130
Let A be the event that a randomly selected customer will consider the
importance of accurate billing to be 6 or more on a 7 point scale. What is
P(A)? How did you arrive at the value? Can you identify the experiment, the
entity involved and the sample space?
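One way to arrive at the value, treating the observed relative frequency as the probability estimate (a sketch; the frequencies come from the table above):

freq <- c(1, 3, 6, 13, 72, 135, 130)  # frequencies for ratings 1 to 7
sum(freq)                             # 360 customers surveyed in total
(135 + 130) / sum(freq)               # estimated P(A) = 265/360, about 0.736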
Example-cum-Exercise
Note
When a large amount of data are available, the probabilities of
different events may be estimated empirically from data
Conditional Probability
Conditional probability of event A given event B – written as P(A│B) – is the relative frequency of A given that B has happened.
Conditional probability: P(A│B) = P(A ∩ B) / P(B).
Empirically, P(A│B) = N_AB / N_B, where N_AB is the number of cases in which both A and B occur and N_B the number in which B occurs.
C: Efforts
<= 10: 181
11-20: 736
21-30: 2536
Rest: …
P{C(3) │ A(1) ∩ B(2)} = …
An Observation
It is important to note that the conditional
events A|B and B|A are very different
Let A be the event that the ticket being serviced is simple
Let B be the event that the effort required is between 11 and
20 units
A|B is the event that the ticket is simple given that the effort
required was between 11 and 20 units. On the other hand
B|A is the event that the effort required is between 11 and 20
units given that the ticket is simple
Notice that P(A | B) = 0.789 whereas P(B | A) = 0.144 only.
Another Example
An epidemiologist wants to assess the impact of smoking on the
incidence of lung cancer. From hospital records she collected data
on 100 patients of lung cancer and she also collected data on 300
persons not suffering from lung cancer. She has classified the 400
samples into smokers and non smokers and the observations are
summarized below
Smoker    Lung Cancer: Yes    Lung Cancer: No    Total
Yes       69                  137                206
No        31                  163                194
Total     100                 300                400
Let A be the event that a person has lung cancer and let B be the
event that the person is a smoker. Can you estimate P(A│B) from
the table given above?
Comment
Note that the demand depends on factors like technology category and
role. Thus we may define events as follows
Note
A sensitive instrument does not give false negative
results and a specific instrument does not give
false positive results
Events of Interest
Note that sensitivity and specificity do not give the
probabilities of the events of interest
We are actually interested in positive and negative predictive
values (abbreviated as PPV and NPV respectively) defined as
PPV = P(B│A) = P(A│B) P(B) / P(A) – by Bayes' theorem
NPV = P(Bᶜ│Aᶜ) = P(Aᶜ│Bᶜ) P(Bᶜ) / P(Aᶜ) – by Bayes' theorem
Notice that PPV and NPV cannot be found directly, whereas sensitivity and specificity can be.
Also P(A) = P(A ∩ B) + P(A ∩ Bᶜ)
= P(A│B) P(B) + P(A│Bᶜ) P(Bᶜ)
= Sensitivity · P(B) + (1 – Specificity) · (1 – P(B))
Thus we can find PPV and NPV provided we know the sensitivity, the specificity and the prevalence of the particular event of interest in the population (i.e. if we know P(B))
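A short numeric sketch of these formulas in R; the sensitivity, specificity and prevalence values below are purely illustrative.

# PPV and NPV from sensitivity, specificity and prevalence (illustrative values)
sens <- 0.95; spec <- 0.90; prev <- 0.02
pA  <- sens * prev + (1 - spec) * (1 - prev)  # P(A), probability of a positive result
ppv <- sens * prev / pA                       # P(B | A)
npv <- spec * (1 - prev) / (1 - pA)           # P(Bc | Ac)
c(PPV = ppv, NPV = npv)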
Why is this Important?
Data Table
Age Income Student Credit Rating Buys Computer
≤ 30 High No Fair No
≤ 30 High No Excellent No
31 – 40 High No Fair Yes
> 40 Medium No Fair Yes
> 40 Low Yes Fair Yes
> 40 Low Yes Excellent No
31 – 40 Low Yes Excellent Yes
≤ 30 Medium No Fair No
≤ 30 Low Yes Fair Yes
> 40 Medium Yes Fair Yes
≤ 30 Medium Yes Excellent Yes
31 – 40 Medium No Excellent Yes
31 – 40 High Yes Fair Yes
> 40 Medium No Excellent No
Classification Mechanism
The classifier aims at developing a method such that optimal allocation to one of the classes (buys computer / does not buy computer) is made for any customer with a given combination of age, income, status (student or not) and credit rating.
Let B be the response variable that takes two values. B = 0 means the customer does not buy a computer and B = 1 means s/he buys one.
Now P(B = 0 │ Age, Income, Status, Credit Rating) and P(B = 1 │ Age, Income, Status, Credit Rating) need to be found using the Naïve Bayes theory.
Rather than estimating these joint conditional probabilities directly, the Naïve Bayes approach assumes that the attributes are conditionally independent given the class.
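A hedged sketch of fitting such a classifier with the e1071 package; the data frame name customers and its column names are assumptions standing in for the table above.

# Naive Bayes classifier on the buys-computer style data
library(e1071)
model <- naiveBayes(BuysComputer ~ ., data = customers)  # customers: assumed data frame of factors
predict(model, newdata = data.frame(Age = "<=30", Income = "Medium",
                                    Student = "Yes", CreditRating = "Fair"))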
Quantitative or Numeric: The objects being studied are 'measured' based on some quantitative trait. The resulting data are a set of numbers.
Properties of Mass and Density Functions
Distribution   Parameter(s)   Mean   Variance
Poisson        λ              λ      λ
Geometric      p              1/p    (1 – p)/p²
Normal         μ, σ²          μ      σ²
Exponential    λ              1/λ    1/λ²
Notes
1. The frequency distribution is a technique to count the number of observations
between different intervals of the random variable being studied.
2. Usually the intervals are of equal length and are referred to as class intervals. The
number of class intervals is taken as sqrt(N) where N is the number of
observations. However, if N is very large, the number of classes is restricted to
about 25.
3. Frequency distribution and histograms require a large number of observations. You
should look at a sample size of 100 or more at the very least
4. Let fj be the frequency of the jth class and let N = Σ fj. The relative frequencies are
defined to be fj / N. Cumulative frequencies are defined to be Fj = Σ fi, i = 1,2,..j,
and cumulative relative frequencies are defined as Fj / N
Histogram
Notes
1. The above figure is an example of a histogram
2. In this case the heights of the bars are proportional to
the frequencies. Note that this is same as constructing
the bars with their heights proportional to the relative
frequency
Example (Continued…)
The histogram and frequency distribution makes
several points apparent
Shape of the distribution: The distribution in the
previous slide is roughly symmetric
The location is between 40 and 50 (the location
of maximum concentration)
We get an idea about the variation
Customer 1
• The histogram given above for invoice payment time is skewed to the right
• Although the average payment time is 78 days (well below the agreed 90
days limit) many invoices take much longer.
• It appears that there is a systemic issue (may be invoices contain errors,
may be they are sent without verification of completion of work, may be
they are not sent electronically) and focusing on the delinquent invoices may
not be of much help.
Customer 2
• The histogram given above shows a different pattern. In this case most invoices are paid within 90 days. However, a few take much longer, thereby increasing the average time.
• The average time in this case is 79 days – slightly greater than the
previous case. However, control is much easier as we probably have
to focus on a few specific cases of delinquency.
Distribution of Dimensions
Look at the distribution of dimensions of some component produced by
three machines. What are your comments?
Concept of Ogives
Rating   Store Experience (Freq / Prop / Cum Prop)   Consistency of Service Delivery (Freq / Prop / Cum Prop)
1        4 / 0.011 / 0.011                           1 / 0.003 / 0.003
2        6 / 0.017 / 0.028                           1 / 0.003 / 0.006
3        7 / 0.020 / 0.048                           5 / 0.014 / 0.020
4        35 / 0.100 / 0.148                          34 / 0.095 / 0.115
[Scatter plot: Price (4500 to 6000) versus Odometer (19000 to 49000)]
cov(x, y) = Σᵢ₌₁ⁿ (xᵢ – x̄)(yᵢ – ȳ) / (n – 1)
where cov stands for covariance and a positive (negative) value indicates a positive (negative) relationship. Zero indicates absence of a linear relationship.
Correlation coefficient: r = cov(x, y) / √(var x · var y)
Scatter Plots of Data with Various Correlation Coefficients
[Six scatter plots of Y against X illustrating r = –1, r = –0.6, r = 0, r = +1, r = +0.3 and r = 0]
Exercise
Portugal is a top wine exporting country holding about 3.2% of the world
market in 2005. The wine industry is investing in technology and is trying to
find the externally controllable parameters that could impact the wine taste. In
order to understand the relationship between the different parameters and
wine taste, a large experiment was conducted. The description of the data
collected as part of the exercise is given below:
[Line chart: the same data plotted in observation order, 0 to 16]
Can you see the pattern now? Can you compare this with the scatter
diagram?
Extension of Mean Functions
When a large volume of data are available, the mean function
may be extended to incorporate multiple input variables to
understand the behaviour of the output (target) variable
The essential idea is that when a number of explanatory
variables have similar values, the behaviour of the outcome is
expected to be similar
This method is a preliminary version of the nearest neighbour
algorithms we will study in greater detail later.
The method requires construction of tables and classifying
new data points into a unique cell of the table. Consequently,
the methods will be referred to as table lookup methods
The table lookup methods can often be implemented through
SQL if the right data are available
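A minimal sketch of a table lookup in R rather than SQL; train, newdata and the variable names x1, x2, y are placeholders.

# Build the lookup table: average outcome y within each (x1, x2) cell
lookup <- aggregate(y ~ x1 + x2, data = train, FUN = mean)
# Score new points by matching them to their cell; unmatched cells give NA
scored <- merge(newdata, lookup, by = c("x1", "x2"), all.x = TRUE)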
Structure
Questions
Exercise
Examine the manpower supply data. We want to
estimate the demand. Answer the following
What random variable are you studying?
Three aspects are important to summarise the data and they can be linked to understand the distribution of data.
Centering: average
Shape: skewness; peakedness (kurtosis)
Median – middle data value when the data are ranked from min. to max.
Standard deviation: s = √( Σ(xᵢ – x̄)² / (n – 1) )
Tabular Output of Summary Statistical Calculations
1. Double click on "C9"
2. Select "By variable" and double click on "C10"
3. Click "OK"
Does this output agree with the scatter plot points of interest?
Graphical Output of Summary Statistical Calculations
1. Double click on "C9"
2. Click on "By Variables" and select Ang to create the final graph
3. Click on the Graphs button to bring up the Graph dialogue box (see above)
4. Click "OK"
• Reliability Function
– The Probability that the variable is greater than some value
– It is 1-CDF
Continuous Distribution: Normal
Description
• The normal distribution (also called the Gaussian distribution) is the most commonly used distribution in statistics. Two parameters (μ (mu) and σ (sigma)) are required to specify the distribution.
The distribution: p(x; μ, σ²) = (1 / √(2πσ²)) e^(–(x – μ)² / (2σ²))
Notes
• The normal distribution closely matches the distribution of many random processes, especially measuring processes
Normal CDF & Reliability Function
[Plot: cumulative distribution function (CDF), P(x < X), rising from 0 to 1 over x = –4 to 4]
[Plot: reliability function, P(x > X), falling from 1 to 0 over x = –4 to 4]
Parameters of the Normal Distribution
Both μ and σ are specific values for any given population, and they change as the members of the population (the distribution) vary.
A Plot of the Normal Distribution
[Plot: normal curve with the mean or average (μ or x̄), the lower specification limit (LSL) and the upper specification limit (USL) marked]
Formal Definitions of Moments: Statistical Expectation
The mean
• also called Expected Value or First Moment
• the Mean is a measure of central tendency, i.e., "Where is the center of the distribution?"
μ = E(X) = Σᵢ xᵢ f(xᵢ) for discrete variables
μ = E(X) = ∫ x f(x) dx for continuous variables
Variance
• also called Second Moment
• Variance is a measure of spread in the distribution
σ² = Var(X) = E((X – μ)²)
= Σᵢ (xᵢ – μ)² f(xᵢ) for discrete variables
= ∫ (x – μ)² f(x) dx for continuous variables
Higher Order Moments
skewness = μ₃ / σ³, where μ₃ = E(X – μ)³
kurtosis = μ₄ / σ⁴, where μ₄ = E(X – μ)⁴
Moments for Distributions
Distribution   Mean                Variance               Skewness   Kurtosis
Poisson        λt                  λt                     1/√(λt)    3 + 1/(λt)
Normal         μ                   σ²                     0          3
Uniform        (x_USL + x_LSL)/2   (x_USL – x_LSL)²/12    0          1.8
Chi-squared    ν                   2ν                     √(8/ν)     3(ν + 4)/ν
Moment estimators from a sample x₁, …, xₙ:
μ̂ = x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
σ̂² = s² = Σᵢ₌₁ⁿ (xᵢ – x̄)² / (n – 1)
μ̂₃ = Σᵢ₌₁ⁿ (xᵢ – x̄)³ / (n – 1)
μ̂₄ = Σᵢ₌₁ⁿ (xᵢ – x̄)⁴ / (n – 1)
Degrees of Freedom Reading Material
σ̂ = s = √( Σᵢ₌₁ⁿ (Xᵢ – X̄)² / (n – 1) )
Why n-1?
The use of n–1 is a mathematical device used for the purpose of deriving an unbiased estimator of the
population variance. In the given context, n–1 is referred to as “degrees of freedom.” When the total
sums-of-squared deviations is given and the pair-wise deviation contrasts are made for n observations,
the last contrast is fixed; hence, there are n–1 degrees of freedom from which to accumulate the total.
More specifically, degrees of freedom can be defined as (n–1) independent contrasts out of n
observations. For example, in a sample with n = 5, measurements X1, X2, X3, X4, and X5 are made. The
additional contrast, X1–X5, is not independent because its value is known from
(X1–X2) + (X2–X3) + (X3–X4) + (X4–X5) = (X1–X5)
Therefore, for a sample of n = 5, there are four (n–1) independent contrasts of “degrees of freedom.” In
this instance, all but one of the contrasts are free to vary in magnitude, given that the total is fixed.
Thus, when n is large, the degree of bias is small; therefore, there is little need for such a corrective
device.
Review Exercise: Another Look at Means and Standard Deviations
Minitab File: Catapultnew2.mtw
1. Click on "Mean"
2. Click on "Input variables" and select Rep1-Rep3
3. Enter c12 in "Store result in:"
4. Click on "OK"
5. Repeat 1-4, selecting "Standard deviation" in step 1 and storing the result in "c13"
Review Exercise Results
[Table: sample means and sample standard deviations for Operator 1, Operator 2 and Operator 3]
Objectives of Module
[Diagram: what satisfies the customer versus what dissatisfies the customer]
Discrete Distribution: Binomial
Description
• assume n independent trials of a test are run, with each trial having a p chance of failure
• what chance is there of x failures occurring over the n trials? The binomial distribution describes this:
b(x; n, p) = C(n, x) pˣ (1 – p)ⁿ⁻ˣ for x = 0, 1, 2, …, n
where C(n, x) = n! / (x! (n – x)!)
Example
• We are installing 10 bolts in a system, each of which has a 20% chance of being installed incorrectly. What is the chance of 2 bolts being incorrectly installed?
b(2; 10, 0.2) = C(10, 2) (0.2)² (1 – 0.2)¹⁰⁻² = 0.3020
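The same probability can be checked in R with the built-in binomial functions:

dbinom(2, size = 10, prob = 0.2)   # P(exactly 2 incorrect) = 0.3020
pbinom(2, size = 10, prob = 0.2)   # P(at most 2 incorrect), if that is of interest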
Binomial PDF
[Bar chart of b(x; 10, 0.2) for x = 0 to 10: 0.1074, 0.2684, 0.3020, 0.2013, 0.0881, 0.0264, 0.0055, 0.0008, 0.0001, 0.0000, 0.0000]
Moments for Distributions
Distribution   Mean                Variance               Skewness   Kurtosis
Poisson        λt                  λt                     1/√(λt)    3 + 1/(λt)
Normal         μ                   σ²                     0          3
Uniform        (x_USL + x_LSL)/2   (x_USL – x_LSL)²/12    0          1.8
Chi-squared    ν                   2ν                     √(8/ν)     3(ν + 4)/ν
The moments for the binomial are obtained directly from the summation formulas for discrete distributions. The variance is usually written as npq, where q = 1 – p.
Estimation for the Binomial Distribution
Recall from Module 1 the moment estimators above (x̄ and s², with the third and fourth moments defined analogously).
It can be shown that the estimate for the mean reduces to np̂ and the variance to np̂q̂. Thus, for the 9 observations of the catapult data for angle 150, the estimate of the binomial mean is 3.0 and the estimate of the variance is 2.0.
Overview of the six-sigma approach
Based on the Catapult data at angle 150 the customer CTQ is
not met--the defect probability is too large. What could be
done to meet the given CTQ:
• Somehow reduce the variability of operator 2’s performance
As a first step Six-sigma replaces the binomial with the Poisson distribution
Discrete Distribution: Poisson
Description
• assume that there are so many opportunities for a defect that practically they cannot be counted, but that the number of defects over a given period of time can be readily counted
• what chance is there of x failures occurring over that given period of time? As indicated, the Poisson distribution describes this:
p(x; λt) = e^(–λt) (λt)ˣ / x! for x = 0, 1, 2, …
Example
• I-90 experiences 3.6 traffic accidents per day between Albany and Buffalo. What are the chances of zero accidents occurring on any given day?
λ = 3.6 accidents/day, t = 1 day
p(0; 3.6) = e^(–3.6) (3.6)⁰ / 0! = 0.0273
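The same probability in R:

dpois(0, lambda = 3.6)   # 0.0273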
Discrete Distribution: Poisson
[Bar chart of p(x; 3.6) for x = 0, 1, …, 10; bar heights include 0.0273 (x = 0), 0.0984, 0.1771, 0.2125, 0.1912, 0.1377, 0.0826, 0.0425, 0.0191 and 0.0028 (x = 10)]
Moments for Distributions
Distribution   Mean                Variance               Skewness   Kurtosis
Poisson        λt                  λt                     1/√(λt)    3 + 1/(λt)
Normal         μ                   σ²                     0          3
Uniform        (x_USL + x_LSL)/2   (x_USL – x_LSL)²/12    0          1.8
Chi-squared    ν                   2ν                     √(8/ν)     3(ν + 4)/ν
The moments for the Poisson are obtained directly from the summation formulas for discrete distributions. Note that when t = 1 the mean and variance are equal to λ.
Estimation for the Poisson Distribution
Recall from Module 1 the moment estimators above.
It can be shown that the estimate for the mean reduces to λ̂t and the variance to λ̂t, where λ̂ is given by x̄/t from the observed Poisson distribution.
The Poisson Approximation to the Binomial
Consider the binomial random variable describing the defect distribution for the Catapult data for the angle of 150 and a unit increment Poisson random variable. Define λ = np in the Poisson distribution. Then
(1 – λ/n)ⁿ → e^(–λ) as n grows,
from the limit formula defining the exponential function. Applying this formula to the example data (λ̂ = 3, n = 9) yields:
(1 – 3/9)⁹ = (2/3)⁹ = 0.02601, compared with e^(–3) = 0.04978
Example
• if μ = 2 and σ = 0.5, what are the chances of a number less than 2.3 occurring?
P(x < 2.3) = ∫ p(x; μ, σ) dx over x < 2.3 = 0.7257
[Plot: normal density over 0.5 to 3.5 with the area to the left of 2.3 shaded]
[Plot: normal density over 0 to 4.5 with ±3σ limits marked]
±3σ covers 99.7%
Notes
• A range of ±1.960σ covers exactly 95%
• A range of ±2.576σ covers exactly 99%
The Standard Normal Curve
Normalization
• when the following normalization is applied, a new random variable is generated that has mean zero and variance one:
z = (x – μ) / σ
For the earlier example, z = (2.3 – 2) / 0.5 = 0.6, and the area to the left of 0.6 is 0.7257. Beware: some tables contain the tail area beyond z rather than the area to the left.
[Plot: standard normal density, –4 to 4, with the area to the left of z = 0.6 shaded]
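The table lookup can be reproduced in R (a sketch of the same computation):

z <- (2.3 - 2) / 0.5   # 0.6
pnorm(z)               # area to the left of z: 0.7257
1 - pnorm(z)           # single (right) tail area: 0.2743, as in the z-table below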
Single-Tail z Table
(Values of z from 0.00 to 3.99)
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.00 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641
0.10 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
0.20 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
0.30 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
0.40 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
0.50 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
0.60 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
0.70 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
0.80 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
0.90 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611
1.00 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379
1.10 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170
1.20 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985
1.30 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823
1.40 .0808 .0793 .0778 .0764 .0749 .0735 .0721 .0708 .0694 .0681
1.50 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .0559
1.60 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455
1.70 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367
1.80 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .0294
1.90 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .0233
2.00 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183
2.10 .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .0143
2.20 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
2.30 .01072 .01044 .01017 .00990 .00964 .00939 .00914 .00889 .00866 .00842
2.40 .00820 .00798 .00776 .00755 .00734 .00714 .00695 .00676 .00657 .00639
2.50 .00621 .00604 .00587 .00570 .00554 .00539 .00523 .00509 .00494 .00480
2.60 .00466 .00453 .00440 .00427 .00415 .00402 .00391 .00379 .00368 .00357
2.70 .00347 .00336 .00326 .00317 .00307 .00298 .00289 .00280 .00272 .00264
2.80 .00256 .00248 .00240 .00233 .00226 .00219 .00212 .00205 .00199 .00193
2.90 .00187 .00181 .00175 .00169 .00164 .00159 .00154 .00149 .00144 .00139
3.00 .00135 .00131 .00126 .00122 .00118 .00114 .00111 .00107 .00104 .00100
3.10 .000968 .000936 .000904 .000874 .000845 .000816 .000789 .000762 .000736 .000711
3.20 .000687 .000664 .000641 .000619 .000598 .000577 .000557 .000538 .000519 .000501
3.30 .000483 .000467 .000450 .000434 .000419 .000404 .000390 .000376 .000362 .000350
3.40 .000337 .000325 .000313 .000302 .000291 .000280 .000270 .000260 .000251 .000242
3.50 .000233 .000224 .000216 .000208 .000200 .000193 .000185 .000179 .000172 .000165
3.60 .000159 .000153 .000147 .000142 .000136 .000131 .000126 .000121 .000117 .000112
3.70 1.08E-4 1.04E-4 9.96E-5 9.58E-5 9.20E-5 8.84E-5 8.50E-5 8.18E-5 7.84E-5 7.53E-5
3.80 7.24E-5 6.95E-5 6.67E-5 6.41E-5 6.15E-5 5.91E-5 5.67E-5 5.44E-5 5.22E-5 5.01E-5
3.90 4.81E-5 4.62E-5 4.43E-5 4.25E-5 4.08E-5 3.91E-5 3.76E-5 3.61E-5 3.45E-5 3.31E-5
Single-Tail z Table
(Values of z from 4.00 to 7.99)
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
4.00 3.17E-5 3.04E-5 2.91E-5 2.79E-5 2.67E-5 2.56E-5 2.45E-5 2.35E-5 2.25E-5 2.16E-5
4.10 2.07E-5 1.98E-5 1.90E-5 1.81E-5 1.74E-5 1.66E-5 1.59E-5 1.52E-5 1.46E-5 1.40E-5
4.20 1.34E-5 1.28E-5 1.22E-5 1.17E-5 1.12E-5 1.07E-5 1.02E-5 9.78E-6 9.35E-6 8.94E-6
4.30 8.55E-6 8.17E-6 7.81E-6 7.46E-6 7.13E-6 6.81E-6 6.51E-6 6.22E-6 5.94E-6 5.67E-6
4.40 5.42E-6 5.17E-6 4.94E-6 4.72E-6 4.50E-6 4.30E-6 4.10E-6 3.91E-6 3.74E-6 3.56E-6
4.50 3.40E-6 3.24E-6 3.09E-6 2.95E-6 2.82E-6 2.68E-6 2.56E-6 2.44E-6 2.33E-6 2.22E-6
4.60 2.11E-6 2.02E-6 1.92E-6 1.83E-6 1.74E-6 1.66E-6 1.58E-6 1.51E-6 1.44E-6 1.37E-6
4.70 1.30E-6 1.24E-6 1.18E-6 1.12E-6 1.07E-6 1.02E-6 9.69E-7 9.22E-7 8.78E-7 8.35E-7
4.80 7.94E-7 7.56E-7 7.19E-7 6.84E-7 6.50E-7 6.18E-7 5.88E-7 5.59E-7 5.31E-7 5.05E-7
4.90 4.80E-7 4.56E-7 4.33E-7 4.12E-7 3.91E-7 3.72E-7 3.53E-7 3.35E-7 3.18E-7 3.02E-7
5.00 2.87E-7 2.73E-7 2.59E-7 2.46E-7 2.33E-7 2.21E-7 2.10E-7 1.99E-7 1.89E-7 1.79E-7
5.10 1.70E-7 1.61E-7 1.53E-7 1.45E-7 1.38E-7 1.30E-7 1.24E-7 1.17E-7 1.11E-7 1.05E-7
5.20 9.98E-8 9.46E-8 8.96E-8 8.49E-8 8.04E-8 7.62E-8 7.22E-8 6.84E-8 6.47E-8 6.13E-8
5.30 5.80E-8 5.49E-8 5.20E-8 4.92E-8 4.66E-8 4.41E-8 4.17E-8 3.95E-8 3.73E-8 3.53E-8
5.40 3.34E-8 3.16E-8 2.99E-8 2.82E-8 2.67E-8 2.52E-8 2.39E-8 2.26E-8 2.13E-8 2.01E-8
5.50 1.90E-8 1.80E-8 1.70E-8 1.61E-8 1.52E-8 1.43E-8 1.35E-8 1.28E-8 1.21E-8 1.14E-8
5.60 1.07E-8 1.01E-8 9.57E-9 9.04E-9 8.53E-9 8.04E-9 7.59E-9 7.16E-9 6.75E-9 6.37E-9
5.70 6.01E-9 5.67E-9 5.34E-9 5.04E-9 4.75E-9 4.48E-9 4.22E-9 3.98E-9 3.75E-9 3.53E-9
5.80 3.33E-9 3.13E-9 2.95E-9 2.78E-9 2.62E-9 2.47E-9 2.32E-9 2.19E-9 2.06E-9 1.94E-9
5.90 1.82E-9 1.72E-9 1.62E-9 1.52E-9 1.43E-9 1.35E-9 1.27E-9 1.19E-9 1.12E-9 1.05E-9
6.00 9.90E-10 9.31E-10 8.75E-10 8.23E-10 7.73E-10 7.27E-10 6.83E-10 6.42E-10 6.03E-10 5.67E-10
6.10 5.32E-10 5.00E-10 4.70E-10 4.41E-10 4.14E-10 3.89E-10 3.65E-10 3.43E-10 3.22E-10 3.02E-10
6.20 2.83E-10 2.66E-10 2.50E-10 2.34E-10 2.20E-10 2.06E-10 1.93E-10 1.81E-10 1.70E-10 1.59E-10
6.30 1.49E-10 1.40E-10 1.31E-10 1.23E-10 1.15E-10 1.08E-10 1.01E-10 9.49E-11 8.89E-11 8.33E-11
6.40 7.80E-11 7.31E-11 6.85E-11 6.41E-11 6.00E-11 5.62E-11 5.26E-11 4.92E-11 4.61E-11 4.31E-11
6.50 4.04E-11 3.78E-11 3.53E-11 3.30E-11 3.09E-11 2.89E-11 2.70E-11 2.53E-11 2.36E-11 2.21E-11
6.60 2.07E-11 1.93E-11 1.81E-11 1.69E-11 1.58E-11 1.47E-11 1.38E-11 1.29E-11 1.20E-11 1.12E-11
6.70 1.05E-11 9.79E-12 9.14E-12 8.53E-12 7.96E-12 7.43E-12 6.94E-12 6.48E-12 6.04E-12 5.64E-12
6.80 5.26E-12 4.91E-12 4.58E-12 4.27E-12 3.98E-12 3.71E-12 3.46E-12 3.23E-12 3.01E-12 2.81E-12
6.90 2.62E-12 2.44E-12 2.27E-12 2.12E-12 1.97E-12 1.84E-12 1.71E-12 1.59E-12 1.49E-12 1.38E-12
7.00 1.29E-12 1.20E-12 1.12E-12 1.04E-12 9.68E-13 9.01E-13 8.38E-13 7.80E-13 7.62E-13 6.75E-13
7.10 6.28E-13 5.84E-13 5.43E-13 5.05E-13 4.70E-13 4.37E-13 4.06E-13 3.78E-13 3.51E-13 3.26E-13
7.20 3.03E-13 2.82E-13 2.62E-13 2.43E-13 2.26E-13 2.10E-13 1.95E-13 1.81E-13 1.68E-13 1.56E-13
7.30 1.45E-13 1.35E-13 1.25E-13 1.16E-13 1.08E-13 9.99E-14 9.27E-14 8.60E-14 7.98E-14 7.40E-14
7.40 6.86E-14 6.37E-14 5.90E-14 5.47E-14 5.07E-14 4.70E-14 4.36E-14 4.04E-14 3.75E-14 3.47E-14
7.50 3.22E-14 2.98E-14 2.76E-14 2.56E-14 2.37E-14 2.19E-14 2.03E-14 1.88E-14 1.74E-14 1.61E-14
7.60 1.49E-14 1.38E-14 1.28E-14 1.18E-14 1.10E-14 1.01E-14 9.38E-15 8.68E-15 8.03E-15 7.42E-15
7.70 6.86E-15 6.35E-15 5.87E-15 5.43E-15 5.02E-15 4.64E-15 4.29E-15 3.96E-15 3.66E-15 3.38E-15
7.80 3.12E-15 2.89E-15 2.67E-15 2.46E-15 2.27E-15 2.10E-15 1.94E-15 1.79E-15 1.65E-15 1.53E-15
7.90 1.41E-15 1.30E-15 1.20E-15 1.11E-15 1.02E-15 9.42E-16 8.69E-16 8.01E-16 7.39E-16 6.82E-16
z Transformation
Locate the z-value in the "Single-Tail z-Table" using this three-step process (example: locate 0.60):
1. Find the whole number and the first decimal place in the first column (titled z). For the example, find the value 0.6 in the first column; we will call this row 0.6.
1. For a measured characteristic, the mean is 18.61 and the standard deviation of the process is 1.00. Use the formula listed below to convert the measurement, 20.00, into a z-value.
z = (x – μ) / σ
2. Locate the probability for this z-value in the Single-Tail z-Table.
3. If z = 1.39, what is the probability that a randomly selected member of the population will be greater than or equal to z?
Answer: _____________
Exercise 1: z Transformation
[Sketch: standard normal curve, –3 to 3, for marking the answers]
Answer: _____________
Answers to Exercise 1: z Transformation
1. z = (x – μ) / σ = (20.00 – 18.61) / 1 = 1.39
2. Locate the probability using the instructions on page 3.23. (.0823)
3. .0823 or 8.23%
z-Values and Their Application
If x is the upper specification limit, you can use the +z-value to determine the probability of producing product above the upper specification:
Z_USL = (USL – μ) / σ
[Plot: right tail beyond the USL shaded as the defect probability]
If x is the lower specification limit, you can use the –z-value to determine the probability of producing product below the lower specification:
Z_LSL = (μ – LSL) / σ
[Plot: left tail below the LSL shaded as the defect probability]
For a two-sided distribution, the sum of the probabilities for upper and lower specification limits tells you the total probability of producing out-of-spec product.
[Plot: both tails beyond LSL and USL shaded, target marked]
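A small sketch of the two-sided defect probability in R; the specification limits here are hypothetical, only the mean and standard deviation echo the catapult example on the next slide.

mu <- 50.53; s <- 2.79   # catapult estimates at angle 150
LSL <- 45; USL <- 55     # hypothetical specification limits
pnorm(LSL, mu, s) + (1 - pnorm(USL, mu, s))   # total out-of-spec probability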
Process Capability Catapult Example
From the descriptive statistics graphs option in Minitab, grouping by Angle:
x̄ = 50.53
s = 2.79
for the nine observations at angle 150.
Z = (SL – μ) / σ
As variation decreases, capability increases: the standard deviation (σ) gets smaller which, in turn, decreases the probability of a defect.
[Diagram: two distributions against target T and USL, contrasting 3σ capability (Z = 3) with 6σ capability (Z = 6)]
Use of the Poisson Distribution in Six-sigma (con't)
[Histogram of Test Grades, 55 to 100, frequencies up to 12]
Histogram Example
Minitab File: Catapultnew2.mtw
Histogram Output
1. Double click anywhere on the "C9" line to select Dist as a variable
2. For each: Group
3. Group variables: Ang
4. Click "OK"
Dot plot:
1. Double click on "C9"
2. Check "By variable" and select Ang
3. Check "Same scale for all variables"
4. Click "OK"
Contrast the Dot plot results with those from your histogram analyses
Box and Whisker Plot
* Outlier
Maximum Observation
75th Percentile
25th Percentile
Minimum Observation
Box and Whisker Plot Example
Minitab File: Catapultnew2.mtw
1. Double clicking a variable under "Y" chooses that variable to graph
2. Double clicking a variable under "X" chooses the grouping variable
3. Click on "OK"
Box and Whisker Plot Output
The box plot shows much the same results as the scatter plot, but with more
detail with respect to the operator within cell variability. Thus I would amend the
above comments:
•Overall cell variability seems to increase in going from operator 1 to 3 to 2--a
test of this can be made. For the time being assume all within cell variability is
the same.
•Inspection of both plots indicates a change in the “relationship” to angle for
each operator as the angle increases. This is called interaction in statistical
jargon.
When all else fails, look at the data--always!
Basic Statistics and Visualization in R
Contents
>Bank$Amount
Function
Mean, median and range: mean(), median(), range()
Quartiles and percentiles: quantile()
Exploring individual variable
Attach any dataset and see the structure of data
Mean:
> length(mpg) ##number of observation
[1] 32
> sum(mpg)/32
[1] 20.09062
> mean(mpg)
[1] 20.09062
> median(mpg)
[1] 19.2
> range(mpg) ##max(mpg) - min(mpg)
[1] 10.4 33.9
Exploring individual variable
> var(mpg)
[1] 36.3241
> sqrt(var(mpg))
[1] 6.026948
>quantile(mpg)
0% 25% 50% 75% 100%
10.400 15.425 19.200 22.800 33.900
>Mode<-function(x){ux<-unique(x)              # unique values of x
ux[which.max(tabulate(match(x, ux)))] }       # return the most frequent value
>Mode(mpg)
[1] 21
Exploring multiple variables
The data contain more than one variable.
Function summary()
numeric variables: minimum, maximum, mean, median, and
the first (25%) and third (75%) quartiles
categorical variables (factors): frequency of every level
Exploring multiple variables
Aggregating Data
It is relatively easy to collapse data in R using one or more BY variables and
a defined function.
# aggregate data frame iris by Species returning means
# for numeric variables
> library(MASS)
> z<-aggregate(iris[,-5], by=list(Species), FUN=mean, na.rm=TRUE)
>z
Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Exploring multiple variables
[Screenshot: summary() output for a numeric variable and a categorical variable]
Exploring multiple variables
Correlation:
It is a measure of linear relationship between two or more
numeric variables.
cor() gives the correlation between 2 variables, and a correlation matrix for more than 2, as sketched below.
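A short sketch using the built-in mtcars data (the variable choice is illustrative):

# Correlation between two variables, and a correlation matrix for several
cor(mtcars$mpg, mtcars$wt)                   # two numeric variables
cor(mtcars[, c("mpg", "disp", "hp", "wt")])  # matrix for more than two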
Basic Statistics
Bar-plot:
Bar plots need not be based on counts or frequencies. You can create
bar plots that represent means, medians, standard deviations, etc.
>counts<-table(mtcars$gear)
>barplot(counts, main="Car Distribution", xlab="Number of Gears")
Visualization
By default, the categorical axis line is suppressed. Include the
option axis.lty=1 to draw it.
Visualization
# Simple Horizontal Bar Plot with Added Labels
>counts <- table(mtcars$gear)
>barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"), axis.lty=1)
Visualization
Stacked Bar Plot
# Stacked Bar Plot with Colors and Legend
> counts <- table(mtcars$vs, mtcars$gear)
>barplot(counts, main="Car Distribution by Gears and VS", xlab="Number
of Gears", col=c("darkblue","red"), legend = rownames(counts), axis.lty=1)
Visualization
Grouped Bar Plot
Get help on plot and see argument “type” which can be added in plot. Try
same for different plots.
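The grouped version of the earlier stacked bar plot only needs beside = TRUE (a sketch on the same mtcars counts):

counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main = "Car Distribution by Gears and VS",
        xlab = "Number of Gears", col = c("darkblue", "red"),
        legend = rownames(counts), beside = TRUE)   # side-by-side bars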
Visualization
>library(plotrix)
>slices <- c(10, 12, 4, 16, 8)
>lbls <- c("US", "UK", "Australia", "Germany", "France")
>pie3D(slices,labels=lbls, explode=0.1, main="Pie Chart of Countries ")
Visualization
Creating Annotated Pies from a data frame
# Pie Chart from data frame with Appended Sample Sizes
>mytable <- table(iris$Species)
>lbls <- paste(names(mytable), "\n", mytable, sep="")
>pie(mytable, labels = lbls, main="Pie Chart of Species\n (with sample
sizes)")
Visualization
Histogram
Take any data set from R and try all visualization techniques
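For instance, a basic histogram (mtcars again as an illustrative choice):

hist(mtcars$mpg, breaks = 8, main = "Histogram of mpg", xlab = "mpg")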
Test of Hypothesis
Handout
EXPLANATIONS OF TOH
Definition:
The Normal Curve is a probability distribution where
the most frequently occurring value is in the
middle and other probabilities tail off
symmetrically in both directions. This shape is
sometimes called a bell-shaped curve.
Z – STANDARD NORMAL VARIATE
mean = 0
st. dev. = 1
[Plot: standard normal curve, –3 to 3; a Z-value can fall anywhere on this scale]
Z-value: how many standard deviations the value-of-interest is away from the mean
Z = (value-of-interest – X̄) / S
AREA CALCULATION
Area under the curve = probability
[Plot: standard normal curve, –3 to 3, with the area to the left of zero shaded]
What is the probability a Z-value will fall below zero? Area = .5, so probability = .5 or 50%
P - VALUE
P-value = tail area beyond the value-of-interest
[Plots: one-sided P-value = area A under the curve beyond the value-of-interest; two-sided P-value = areas A + B in the two tails]
[Normal probability plot, ML estimates: Mean 40.1271, StDev 4.86721; percent scale 1 to 99, data 25 to 55]
[The same plot annotated with 10% bands on the percent scale]
[Two normal probability plots side by side: left, data 25 to 55 – conclusion: not a serious departure from Normality; right, data –2 to 4 – conclusion: there is a serious departure from Normality]
EXAMPLE OF NORMALITY TEST
Conclusion
– Not a serious departure from Normality
EXAMPLE OF NORMALITY TEST
Conclusion
– P > 0.05 indicates the data come from a Normal population.
WHAT IS TEST OF HYPOTHESIS?
Conclusion
– P=0.000 (to be interpreted as p < 0.001) indicates the
average run length is confidently above 1.2 lakh.
2 SAMPLE t - TEST
The t-test
Is a test of hypothesis for comparing two
averages.
The hypothesis is that the two group averages
are the same.
Their difference = 0
If P-value is low, reject the hypothesis.
By convention, a P-value is considered low if it is <
.05
Common notation
Null hypothesis H0: meanA = meanB
Alternative hypothesis Ha: meanA ≠ meanB
T - TEST
The t-distribution:
Has more variation than a Z distribution (and thus
different areas in the tails, meaning different P-values).
Its spread of variation depends on the “degrees of
freedom” (df).
We won’t go into details here about df, but think of it as
the amount of information left in the sample after
estimating the means and standard deviations of the two
groups.
T - TEST
Stat > Basic Statistics > 2 Sample t… > Graphs > (Select both plots)
[Plots of the two groups]
Conclusions
Since the P-value is small (< .05), conclude there is a statistically
significant difference in the average research time between the two
methods. (Or, “the average research time is not the same for the
two methods.”)
Note: The P-value is different than .0007 reported before because we more appropriately used the t-distribution instead of the Z distribution.
EXAMPLE OF 2-SAMPLE t - TEST
• Plates from the same batch are tested on two printers; data of 10 samples per printer are used to test whether the average run length differs significantly.
• Data:
Run Length_Pr-A Run Length_Pr-B
1.25 1.32
1.2 1.37
1.3 1.35
1.18 1.27
1.28 1.26
1.24 1.21
1.2 1.34
1.32 1.3
1.28 1.2
1.16 1.35
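The test itself is one call in R; the vectors below are exactly the run lengths tabulated above.

prA <- c(1.25, 1.2, 1.3, 1.18, 1.28, 1.24, 1.2, 1.32, 1.28, 1.16)
prB <- c(1.32, 1.37, 1.35, 1.27, 1.26, 1.21, 1.34, 1.3, 1.2, 1.35)
t.test(prA, prB)   # Welch two-sample t-test; compare the P-value with 0.05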
EXAMPLE OF 2-SAMPLE t - TEST
EXAMPLE OF 2-SAMPLE t - TEST
Analysis of Variance
Source DF SS MS F P
Factor 2 0.01592 0.00796 2.57 0.095
Error 27 0.08375 0.00310
Total 29 0.09967
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ------+---------+---------+---------+
Run Leng 10 1.2410 0.0543 (----------*---------)
Run Leng 10 1.2970 0.0600 (----------*---------)
Run Leng 10 1.2750 0.0525 (---------*----------)
------+---------+---------+---------+
Pooled StDev = 0.0557 1.225 1.260 1.295 1.330
Region   Column 1 (Count / Exp Freq)   Column 2 (Count / Exp Freq)   Total
East     6 / 5.72                      12 / 12.28                    18
North    3 / 3.81                      9 / 8.19                      12
South    7 / 8.26                      19 / 17.74                    26
West     11 / 9.21                     18 / 19.79                    29
All      27 / 27.00                    58 / 58.00                    85
Cell contents: Count, Expected Frequency
• Height and weight of an individual are thought to be related, but given the height of an individual it is difficult to predict exactly what weight the individual will have. This is a fit case for simple linear regression with y as weight and x as height of the individual.
• Data of 17 individuals with corresponding height and weight are available as under:
Name SM SV AK RP AD SB RK SR BM DK AV RM SP PB SM ST DV
Weight (Kg) 72 92 78 109 73 74 75 75 60 65 46 64 58 62 61 65 89
Height (cm.) 167 187 167 183 179 176 178 175 165 170 160 165 163 163 160 165 179
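A sketch of the fit in R using the data above:

weight <- c(72, 92, 78, 109, 73, 74, 75, 75, 60, 65, 46, 64, 58, 62, 61, 65, 89)
height <- c(167, 187, 167, 183, 179, 176, 178, 175, 165, 170, 160, 165, 163, 163, 160, 165, 179)
fit <- lm(weight ~ height)          # simple linear regression
summary(fit)                        # coefficients and R-squared
plot(height, weight); abline(fit)   # scatter plot with the fitted line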
SIMPLE LINEAR REGRESSION
Example: (contd…)
SIMPLE LINEAR REGRESSION
Example: (contd…)
Example: (contd…)
MULTIPLE LINEAR REGRESSION
Example: (contd…)
Regression Analysis: Weight versus Height, Pulse Rate
Exercise
• Data of square meter Plate sales and the cost of sales for
15 months are as under:
Month Sq. M. Sales (‘00000) Cost of Sales (Rs. In Lakh)
Apr-11 0.9 15
May-11 0.9 10
Jun-11 0.8 11
Jul-11 1.0 18
Aug-11 0.8 8
Sep-11 0.9 12
Oct-11 1.1 15
Nov-11 1.1 16
Dec-11 1.0 12
Jan-12 0.9 9
Feb-12 1.0 13
Mar-12 0.9 10
Apr-12 1.2 14
May-12 1.1 17
Jun-12 1.0 14

• Do the Regression Analysis for this phenomenon.
Prediction Interval:
Unusual Observations & Cook’s Distance:
Multicollinearity:
How Good the Regression Model Is, and Mallows’ Cp:
The predicted residual error sum of squares (PRESS) statistic is a form of cross-validation
used in regression analysis to provide a summary measure of the fit of a model to
observations that were not themselves used to estimate the model. It is calculated as
the sum of the squares of the prediction residuals for those observations.
A fitted model having been produced, each observation in turn is removed and the model is refitted
using the remaining observations. The out-of-sample predicted value ŷ(i) is calculated for the omitted
observation in each case, and the PRESS statistic is calculated as the sum of the squares of all
the resulting prediction errors:

    PRESS = Σ i=1..n ( yi − ŷ(i) )²
Given this procedure, the PRESS statistic can be calculated for a number of candidate model
structures for the same dataset, with the lowest values of PRESS indicating the best structures.
Models that are over-parameterised (over-fitted) would tend to give small residuals for observations
included in the model-fitting but large residuals for observations that are excluded.
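A minimal sketch of the leave-one-out computation of PRESS for a linear model (pure NumPy; the data are the first eight height/weight pairs from the earlier example):

    import numpy as np

    def press(X, y):
        # Leave-one-out PRESS: refit without observation i, predict it, sum squared errors
        n = len(y)
        total = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
            e = y[i] - X[i] @ b          # prediction residual for the omitted observation
            total += e ** 2
        return total

    height = np.array([167, 187, 167, 183, 179, 176, 178, 175], dtype=float)
    weight = np.array([72, 92, 78, 109, 73, 74, 75, 75], dtype=float)
    X = np.column_stack([np.ones_like(height), height])   # intercept + Height
    print(press(X, weight))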
POWER OF THE TEST
Using the Power of the Test for Good Hypothesis Testing

Reality                  Decision: Reject H0
H0 is true               Type I error (p = α, the significance level)
Ha is true (H0 false)    Good decision (p = 1 − β, the power of the test)

What should every good hypothesis test ensure? Ideally, it should make the
probabilities of both a Type I error and a Type II error very small. The probability of a
Type I error is denoted as α and the probability of a Type II error is denoted as β.
Understanding α
Recall that in every test, a significance level is set, normally α = 0.05. In other words, that
means one is willing to accept a probability of 0.05 of being wrong when rejecting the null
hypothesis. This is the α risk that one is willing to take, and setting α at 0.05, or 5 percent,
means one is willing to be wrong 5 out of 100 times when one rejects H0. Hence, once the
significance level is set, there is really nothing more that can be done about α.
Understanding β and 1 − β
Suppose the null hypothesis is false. One would want the hypothesis test to reject it
all the time. Unfortunately, no test is foolproof, and there will be cases where the null
hypothesis is in fact false but the test fails to reject it. In this case, a Type II error
would be made. β is the probability of making a Type II error, and β should be as
small as possible. Consequently, 1 − β is the probability of rejecting a null hypothesis
correctly (because in fact it is false), and this number should be as large as possible.
The critical t = ±2.0167 corresponds, in the hypothesized distribution (mean 20, standard
error 0.603), to 20 ± 0.603(2.0167), i.e. 21.216 and 18.784.
The next figure shows an alternative distribution with μ = 22 and σ = 4.
This is the original distribution shifted by two units to the right.
What is the probability of being less than 21.216 in this alternative
distribution? That probability is β: accepting H0 when in fact it is false. This is
because for any value within that region, in the original probability
distribution, one would have accepted H0. How does one find this β? What is
the t value of 21.216 in the alternative distribution?
In the alternative distribution, t = (21.216 − 22) / 0.603 = −1.3. What is the corresponding
probability of being less than t = −1.3? From the t-tables, using one-tailed,
DF = 43, t = 1.3, one finds 0.10026 (using spreadsheet software, TDIST(1.3, 43, 1) = 0.10026).
Hence β = 0.10026 and 1 − β ≈ 0.9, which is the power of the test in this example.
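The whole calculation can be checked in a few lines of Python with scipy.stats (the numbers are the ones used above: df = 43, standard error 0.603, hypothesized mean 20, alternative mean 22):

    from scipy import stats

    df, se = 43, 0.603
    t_crit = stats.t.ppf(0.975, df)        # +/-2.0167 for a two-sided test at alpha = 0.05
    cutoff = 20 + t_crit * se              # 21.216, the upper acceptance limit under H0

    t_alt = (cutoff - 22) / se             # where the cutoff falls under the alternative mean 22
    beta = stats.t.cdf(t_alt, df)          # P(accept H0 | mu = 22) = 0.10026
    print(t_crit, cutoff, beta, 1 - beta)  # power of the test, about 0.9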
Factor 1
The difference or effect size affects power. If the difference that one was trying to detect was
not 2 but 1, the overlap between the original distribution and the alternative distribution would have
been greater. Hence, b would increase and 1 -b or power would decrease.
Factor 2
The significance level affects power. If the significance level α is raised (say from 0.05
to 0.10), the critical t shifts from 2.01669 to about 1.68. This makes β smaller and
1 − β larger. Hence, as the significance level of the test increases, the power
of the test also increases. However, this comes at a high price because the α
risk also increases.
Factor 3
Sample size affects power. Why? The standard error of the mean is s/√n, so as n
increases the standard error shrinks, the hypothesized and alternative distributions
overlap less, and the power increases.
Example (Discriminant Analysis using MINITAB)
IRIS Data
MINITAB EXERCISE
Review and Discussions
Questions?
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction is the training set
Model usage: estimate the accuracy of the model on a test set that is
independent of the training set; otherwise over-fitting will occur
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
ZeroR Classifier
Weather data
Limitations?
TOPIC
Predictive Analytics:
◦ K Nearest Neighbour
Process (1): Model Construction
[Diagram: Training Data → Classification Algorithms → Classifier (model); the classifier is later applied to Testing Data and then to Unseen Data, e.g. (Jeff, Professor, 4) → Tenured?]

A typical training data set:

NAME     RANK             YEARS   TENURED
Tom      Assistant Prof   2       no
Merlisa  Associate Prof   7       no
George   Professor        5       yes
Joseph   Assistant Prof   7       yes
Output: A Decision Tree for “buys_computer”

[Tree diagram, built up over four slides: the root splits on age? into branches <=30, 31..40, and >40. The 31..40 branch leads to “yes”; the <=30 branch splits on student? (no → no, yes → yes); the >40 branch splits on credit rating (excellent → no, fair → yes).]
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in
advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
There are no samples left
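As a quick illustration of such top-down induction, a sketch using scikit-learn (rather than the basic algorithm above) on the Iris data mentioned later in these notes:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    # criterion="entropy" makes the splits information-gain based, as described above
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
    print(export_text(tree))   # the induced splits, printed as nested if-then rules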
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci,D| / |D|
Expected information (entropy) needed to classify a tuple in D:

    Info(D) = − Σ i=1..m  pi · log2(pi)

Information still needed after using attribute A to split D into v partitions:

    InfoA(D) = Σ j=1..v  (|Dj| / |D|) · Info(Dj)

Information gained by branching on A: Gain(A) = Info(D) − InfoA(D).
The gain ratio (C4.5) normalizes this gain, and the attribute with the maximum
gain ratio is selected as the splitting attribute
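A short numeric illustration of these formulas, using the standard 14-tuple buys_computer class counts (9 “yes”, 5 “no”) and the three age partitions:

    import math

    def entropy(counts):
        # Info(D) = -sum p_i log2 p_i over the class proportions
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

    print(entropy([9, 5]))                       # Info(D) = 0.940

    # Splitting on age gives partitions with class counts (2,3), (4,0), (3,2)
    parts = [[2, 3], [4, 0], [3, 2]]
    n = sum(sum(p) for p in parts)
    info_age = sum(sum(p) / n * entropy(p) for p in parts)
    print(entropy([9, 5]) - info_age)            # Gain(age) = 0.246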
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index, gini(D), is defined as

    gini(D) = 1 − Σ j=1..n  pj²

where pj is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D)
is defined as

    giniA(D) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)

Reduction in impurity:

    Δgini(A) = gini(D) − giniA(D)

The attribute that provides the smallest ginisplit(D) (or the largest reduction in
impurity) is chosen to split the node (one needs to enumerate all the possible
splitting points for each attribute)
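A matching sketch for the gini index; the two-way partition counts below are illustrative assumptions (a 10/4 split of the same 14 tuples), not data from the text:

    def gini(counts):
        # gini(D) = 1 - sum p_j^2 over the class proportions
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    print(gini([9, 5]))                                # gini(D) = 0.459

    d1, d2 = [7, 3], [2, 2]                            # class counts in the two subsets
    gini_split = 10 / 14 * gini(d1) + 4 / 14 * gini(d2)
    print(gini_split, gini([9, 5]) - gini_split)       # split gini, reduction in impurity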
Gini index (CART, IBM IntelligentMiner)
For example, when splitting on the attribute income, gini{medium,high} is 0.30 and is thus the best split, since it is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
Comparing Attribute Selection Measures
The three measures, in general, return good results but
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is much
smaller than the others
Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity in
both partitions
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Practical issues:
Missing values
Costs of classification
Example
DATA Iris
Multiple Linear Regression
CORRELATION
If two variables X and Y are related such that Y
increases or decreases as X increases, a
correlation is said to exist between them.
[Scatter plot: Mileage (km/Lit), 15 to 35, against Speed (km/h), 25 to 75]
SCATTER DIAGRAM
• A scatter diagram depicts the relationship
as a pattern that can be directly read.
• If Y increases with X, then X and Y are
positively correlated.
• If Y decreases as X increases, then the two
types of data are negatively correlated.
• If no significant relationship is apparent
between X and Y, then the two data types
are not correlated.
DIFFERENT SCATTER DIAGRAM PATTERNS
DATA ON CONVEYOR SPEED AND SEVERED LENGTH
Sl. No. Conveyor Severed Sl. No. Conveyor Severed
Speed Length Speed Length
(cm/sec) (mm) (cm/sec) (mm)
1 8.1 1046 16 6.7 1024
2 7.7 1030 17 8.2 1034
3 7.4 1039 18 8.1 1036
4 5.8 1027 19 6.6 1023
5 7.6 1028 20 6.5 1011
6 6.8 1025 21 8.5 1030
7 7.9 1035 22 7.4 1014
8 6.3 1015 23 7.2 1030
9 7.0 1038 24 5.6 1016
10 8.0 1036 25 6.3 1020
11 8.0 1026 26 8.0 1040
12 8.0 1041 27 5.5 1013
13 7.2 1029 28 6.9 1025
14 6.0 1010 29 7.0 1020
15 6.3 1020 30 7.5 1022
Scatter Diagram for Conveyor Speed and Severed Length
[Scatter plot: Severed Length (mm), 1000 to 1050, against Conveyor Speed (cm/sec), 5 to 9]
USES OF SCATTER DIAGRAM
    COVXY = Σ i=1..N (Xi − X̄)(Yi − Ȳ) / (N − 1)
CORRELATION COEFFICIENT
A measure of the relationship between variables.
The most commonly used coefficient is Pearson Product-
Moment Correlation Coefficient (measure of linear
relationship denoted by ‘r’).
‘r’ lies between -1 and +1. r = 0 means no correlation.
A positive value of ‘r’ implies positive correlation and
negative value implies negative correlation.
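As a sketch, Pearson’s r for the conveyor data above can be computed directly with NumPy:

    import numpy as np

    speed = [8.1, 7.7, 7.4, 5.8, 7.6, 6.8, 7.9, 6.3, 7.0, 8.0, 8.0, 8.0, 7.2, 6.0, 6.3,
             6.7, 8.2, 8.1, 6.6, 6.5, 8.5, 7.4, 7.2, 5.6, 6.3, 8.0, 5.5, 6.9, 7.0, 7.5]
    length = [1046, 1030, 1039, 1027, 1028, 1025, 1035, 1015, 1038, 1036, 1026, 1041, 1029, 1010, 1020,
              1024, 1034, 1036, 1023, 1011, 1030, 1014, 1030, 1016, 1020, 1040, 1013, 1025, 1020, 1022]

    r = np.corrcoef(speed, length)[0, 1]   # Pearson product-moment correlation coefficient
    print(round(r, 3))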
Logistic Regression Model
When is it used?
When the dependent (response) variable is a dichotomous
variable (i. e. it takes only two values, which usually represent
the occurrence or non-occurrence of some outcome event,
usually coded as 0 or 1) and the independent (input) variables
are continuous, categorical, or both.
For example, in a medical study, the patient survives or dies as
a response and age, suffering from disease or not as predictors.
LR as classifier
Logistic regression can be used for classifying a new observation, where the
class is unknown, into one of the classes, based on the values of its predictor
variables (called classification).
    log( p / (1 − p) ) = β0 + β1X1 + ... + βkXk

where p is the probability that Y = 1, i.e. P[Y=1], and X1, X2, ..., Xk are the
independent variables (predictors). β0, β1, ..., βk are known as the regression
coefficients, which have to be estimated from the data. Once the coefficients are
estimated from the training set, we can estimate the class membership
probabilities:

    P[Y=1] = p = 1 / (1 + e^(−X′β))   and   P[Y=0] = 1 − p

We use a cutoff value on these probabilities in order to classify each case in one of the classes.
Odds:
Logistic regression also produces Odds Ratios (O.R.) associated with each
predictor value. The "odds” of an event is defined as the probability of the
outcome event occurring divided by the probability of the event not occurring.
    i.e. odds = p / (1 − p) = e^(X′β)
The odds ratio for a predictor is defined as the relative amount by which the
odds of the outcome increase (O.R. greater than 1.0) or decrease (O.R. less than
1.0) when the value of the predictor variable is increased by 1.0 unit, keeping
the others constant (fixed). For regressor xi, with all other predictor variables
held constant:

    OR = Odds(xi + 1) / Odds(xi) = e^(βi)
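A minimal sketch of these two formulas in Python; the coefficient values are hypothetical placeholders, not the fitted Universal Bank model (compare the income odds ratio of 1.04 discussed below):

    import numpy as np

    beta0, beta1 = -6.0, 0.04    # hypothetical: logit(p) = beta0 + beta1*Income

    def prob_accept(income):
        # P[Y=1] = 1 / (1 + exp(-(b0 + b1*x)))
        z = beta0 + beta1 * income
        return 1.0 / (1.0 + np.exp(-z))

    print(prob_accept(100))      # estimated acceptance probability at income 100
    print(np.exp(beta1))         # odds ratio for a one-unit income increase, about 1.04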
Example: Acceptance of Personal Loan
Universal Bank. The bank’s dataset includes data on 5000 customers: demographic
information, the customer’s response to the last personal loan campaign (Personal Loan),
and the customer’s relationship with the bank (mortgage, securities account, etc.).
Among these 5000 customers, only 480 (9.6%) accepted the personal loan that was
offered to them in a previous campaign. The goal is to find characteristics of customers
who are most likely to accept the loan offer.
Data
Result:
We select 60% of the Universal Bank data randomly for training and keep the remaining
40% aside for validation. Using Minitab we fit a logistic regression model on the randomly
selected 60% of the data. After fitting, applying the model to the validation cases gives the
confusion matrix below.
Confusion Matrix

                Predicted
                0       1       % Error
Actual   0      1754    57      3.15
         1      127     62      67.2

The odds ratio is 1.04, which means a single-unit increase in income is associated with an
increase of about 4% in the odds of accepting the loan.
Confusion Matrix

                Predicted
                0       1       % Error
Actual   0      1811    0       0
         1      189     0       100

The odds ratio is 1, which means a single-unit increase in income is associated with
no change in the odds of accepting the loan.
Similarly, if we take only one regressor variable, Education, then the total error is as shown below:

Confusion Matrix

                Predicted
                0       1       % Error
Actual   0      1786    25      1.38
         1      74      115     39.15
The odds ratio corresponding to Age is 1.01, which means a single-unit increase in
age, holding income and education constant, is associated with an increase of about 1% in the
odds of acceptance.
Given the values of a set of predictors, we can predict the probability that
each observation belongs to class 1. The next step is to set a cutoff on these
probabilities so that each observation is classified into one of the two classes. This
is done by setting a cutoff value, c, such that observations with probabilities above c
are classified as belonging to class 1. For example, in the binary case, a cutoff of 0.5
means that cases with an estimated probability of P[Y=1] > 0.5 are classified as
belonging to class ‘1’, whereas cases with P[Y=1] < 0.5 are classified as belonging
to class ‘0’. Different cutoff values lead to different confusion matrices. A popular
cutoff value for a two-class case is 0.5.
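A small sketch of how a cutoff turns predicted probabilities into a confusion matrix (the probabilities and labels are toy values, not the bank data):

    import numpy as np

    p_hat = np.array([0.10, 0.80, 0.45, 0.65, 0.05, 0.55])  # predicted P[Y=1]
    y     = np.array([0,    1,    0,    1,    0,    0   ])  # actual classes

    c = 0.5                                   # cutoff
    y_pred = (p_hat > c).astype(int)          # above the cutoff -> class 1

    conf = np.zeros((2, 2), dtype=int)        # rows = actual, columns = predicted
    for a, p in zip(y, y_pred):
        conf[a, p] += 1
    print(conf)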
Chapter-6
Clustering
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering: Some Applications
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature
spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Clothing Industry
Measure the Quality of Clustering
Dissimilarity matrix (one mode):

    | 0
    | d(2,1)   0
    | d(3,1)   d(3,2)   0
    |   :        :      :
    | d(n,1)   d(n,2)   ...   ...   0
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:

    sf = (1/n) ( |x1f − mf| + |x2f − mf| + ... + |xnf − mf| )

where mf is the mean of variable f.
Partitioning approach:
Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Typical Alternatives to Calculate the Distance
between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
Diameter of a cluster: the square root of the average mean squared distance between
all pairs of points in the cluster:

    Dm = √( Σ i=1..N Σ q=1..N (tip − tiq)² / (N(N − 1)) )
Partitioning Algorithms: Basic Concept
The k-means criterion: minimize the total squared distance of objects to their cluster centers,

    E = Σ m=1..k  Σ tmi∈Km  (Cm − tmi)²
Example
[Illustration of K-means with K = 2: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; repeat until the assignments no longer change.]
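For comparison outside Minitab, a minimal k-means sketch with scikit-learn (the 2-D points are made-up illustration data):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                  [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster assignment of each object
    print(km.cluster_centers_)  # the updated cluster means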
A Typical K-Medoids Algorithm (PAM)
[Illustration of PAM with total cost = 20: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object Orandom; compute the total cost of swapping a medoid with Orandom; if the quality is improved, perform the swap; repeat (do loop) until no change.]
Hierarchical Clustering
Example
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions
Example of Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset
A collection of one or more items, e.g. {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
Fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics for {Milk, Diaper} → Beer
Support (s): the fraction of transactions that contain both X and Y; here
s = σ({Milk, Diaper, Beer}) / 5 = 2/5 = 0.4
Confidence (c): how often items in Y appear in transactions that contain X; here
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Illustrating Apriori Principle
Items (1-itemsets):

Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Minimum Support = 3

Triplets (3-itemsets)
Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
Generate length (k+1) candidate itemsets from the length-k frequent itemsets
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only the frequent itemsets
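A compact sketch of the first two Apriori levels on the market-basket transactions shown earlier (pure Python, minsup = 3):

    from itertools import combinations

    T = [{'Bread', 'Milk'},
         {'Bread', 'Diaper', 'Beer', 'Eggs'},
         {'Milk', 'Diaper', 'Beer', 'Coke'},
         {'Bread', 'Milk', 'Diaper', 'Beer'},
         {'Bread', 'Milk', 'Diaper', 'Coke'}]
    minsup = 3

    def support_count(itemset):
        return sum(itemset <= t for t in T)   # transactions containing the itemset

    items = sorted({i for t in T for i in t})
    L1 = [frozenset([i]) for i in items if support_count(frozenset([i])) >= minsup]

    C2 = [a | b for a, b in combinations(L1, 2)]           # candidate 2-itemsets
    L2 = [c for c in C2 if support_count(c) >= minsup]     # frequent 2-itemsets
    print(sorted(tuple(sorted(c)) for c in L2))            # matches the table above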
Association Mining
Decomposition model:
By default, Minitab uses a multiplicative model. Use the multiplicative model when the size of the
seasonal pattern in the data depends on the level of the data. This model assumes that as the data
increase, so does the seasonal pattern. Most time series exhibit such a pattern.
Method:
1. Smooth the data using a centered moving average with a length equal to the length of the
seasonal cycle.
2. Divide the data by the moving average to obtain what are often referred to as raw
seasonal values.
3. For corresponding time periods in the seasonal cycles, determine the median of the raw
seasonal values. For example, if you have 60 consecutive months of data (5 years),
determine the median of the 5 raw seasonal values corresponding to January, to February,
and so on.
4. Adjust the medians of the raw seasonal values so that their average is one. These adjusted
medians constitute the seasonal indices.
5. Use the seasonal indices to seasonally adjust the data.
6. Fit a trend line to the seasonally adjusted data using least squares regression.
7. The data can be detrended by either dividing the data by the trend component
(multiplicative model) or subtracting the trend component from the data (additive model).
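Outside Minitab, a similar multiplicative decomposition can be sketched with statsmodels (the routine is moving-average based like the steps above, though not identical to Minitab's; the file and column names are assumptions):

    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    trade = pd.read_csv("employ.csv")["Trade"]   # hypothetical file/column names
    result = seasonal_decompose(trade, model="multiplicative", period=12)

    print(result.seasonal[:12])    # seasonal factors, comparable to Minitab's seasonal indices
    print(result.trend.dropna().head())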
Trend Analysis:
You collect employment data in a trade business over 60 months and wish to
predict employment for the next 12 months.
Sales Performance Sales Performance Sales Performance
Month Trade Food Metals Month Trade Food Metals Month Trade Food Metals
Apr-02 322 53.5 44.2 Apr-04 330 52.3 42.5 Apr-06 361 54.8 49.6
May-02 317 53 44.3 May-04 326 51.5 42.6 May-06 354 54.2 49.9
Jun-02 319 53.2 44.4 Jun-04 329 51.7 42.3 Jun-06 357 54.6 49.6
Jul-02 323 52.5 43.4 Jul-04 337 51.5 42.9 Jul-06 367 54.3 50.7
Aug-02 327 53.4 42.8 Aug-04 345 52.2 43.6 Aug-06 376 54.8 50.7
Sep-02 328 56.5 44.3 Sep-04 350 57.1 44.7 Sep-06 381 58.1 50.9
Oct-02 325 65.3 44.4 Oct-04 351 63.6 44.5 Oct-06 381 68.1 50.5
Nov-02 326 70.7 44.8 Nov-04 354 68.8 45 Nov-06 383 73.3 51.2
Dec-02 330 66.9 44.4 Dec-04 355 68.9 44.8 Dec-06 384 75.5 50.7
Jan-03 334 58.2 43.1 Jan-05 357 60.1 44.9 Jan-07 387 66.4 50.3
Feb-03 337 55.3 42.6 Feb-05 362 55.6 45.2 Feb-07 392 60.5 49.2
Mar-03 341 53.4 42.4 Mar-05 368 53.9 45.2 Mar-07 396 57.7 48.1
Apr-03 322 52.1 42.2 Apr-05 348 53.3 45
May-03 318 51.5 41.8 May-05 345 53.1 45.5
Jun-03 320 51.5 40.1 Jun-05 349 53.5 46.2
Jul-03 326 52.4 42 Jul-05 355 53.5 46.8
Aug-03 332 53.3 42.4 Aug-05 362 53.9 47.5
Sep-03 334 55.5 43.1 Sep-05 367 57.1 48.3
Oct-03 335 64.2 42.4 Oct-05 366 64.7 48.3
Nov-03 336 69.6 43.1 Nov-05 370 69.4 49.1
Dec-03 335 69.3 43.2 Dec-05 371 70.3 48.9
Jan-04 338 58.5 42.8 Jan-06 375 62.6 49.4
Feb-04 342 55.3 43 Feb-06 380 57.9 50
Mar-04 348 53.6 42.8 Mar-06 385 55.8 50
[Time series plot of Trade against time index 1 to 60, values roughly 310 to 390]
Because there is an overall curvilinear pattern to the data, you use trend
analysis and fit a quadratic trend model.
Trend Analysis:
[Trend analysis plot for Trade, Apr-02 to Mar-07, with the fitted quadratic trend; MSD = 59.1305]
MAPE, or Mean Absolute Percentage Error, measures the accuracy of fitted time series values. It expresses
accuracy as a percentage:

    MAPE = ( Σ |(yt − ŷt) / yt| / n ) × 100

where yt equals the actual value, ŷt equals the fitted value, and n equals the number of observations.

MAD, which stands for Mean Absolute Deviation, measures the accuracy of fitted time series values. It
expresses accuracy in the same units as the data, which helps conceptualize the amount of error:

    MAD = Σ |yt − ŷt| / n

where yt equals the actual value, ŷt equals the fitted value, and n equals the number of observations.

MSD stands for Mean Squared Deviation. MSD is always computed using the same denominator, n,
regardless of the model, so you can compare MSD values across models. MSD is a more sensitive measure
of an unusually large forecast error than MAD:

    MSD = Σ (yt − ŷt)² / n

where yt equals the actual value, ŷt equals the forecast value, and n equals the number of forecasts.
Trend Analysis: How do we Forecast?
[Trend analysis forecast plot for Trade, Apr-02 onward: actual, fits, and forecasts; accuracy measures: MAPE 1.7076, MAD 5.9566, MSD 59.1305]
Multiplicative Model

Data: Trade
Length: 60
NMissing: 0

Fitted Trend Equation: Yt = 316.58 + 1.08*t

Accuracy Measures: MAPE 0.8908, MAD 3.0351, MSD 16.5285

Seasonal Indices:

Period   Index
1        0.97552
2        0.96163
3        0.96591
4        0.98339
5        1.00159
6        1.00999
7        1.00511
8        1.00981
9        1.00949
10       1.01591
11       1.02494
12       1.03671
Decomposition- Seasonal Indices:
[Decomposition fit for Trade, Apr-02 to Mar-07: actual versus fits; accuracy measures: MAPE 0.8908, MAD 3.0351, MSD 16.5285]
Decomposition- Detrending and
Deseasonalising:
Component Analysis for Trade
Multiplicative Model
[Four-panel plot, Apr-02 to Mar-07: original data, detrended data, seasonally adjusted data, and seasonally adjusted and detrended data]
Decomposition- Seasonal Indices:
[Seasonal analysis charts by seasonal period 1 to 12: seasonal indices, percent variation by period, and boxplots of the data and residuals]
Diagnostic Checking of Error:
[Residual diagnostic plots (response is Trade): normal probability plot of the residuals, residuals versus fitted values, histogram of the residuals, and residuals versus observation order]
Decomposition- Interpretation:
Decomposition generates three sets of plots:
· A time series plot that shows the original series with the fitted trend line, predicted values,
and forecasts.
· A component analysis - in separate plots are the series, the detrended data, the seasonally
adjusted data, the seasonally adjusted and detrended data (the residuals).
· A seasonal analysis - charts of seasonal indices and percent variation within each season
relative to the sum of variation by season and boxplots of the data and of the residuals by
seasonal period.
In addition, the fitted trend line, the seasonal indices, the three accuracy measures (MAPE,
MAD, and MSD), and the forecasts are printed in the Session window.
In the example, the first graph shows that the detrended residuals from trend analysis are fit
fairly well by decomposition, except that part of the first annual cycle is underpredicted and the
last annual cycle is overpredicted. This is also evident in the lower right plot of the second
graph; the residuals are highest in the beginning of the series and lowest at the end.
Forecasting from Decomposed Model:
Forecasts:

Period   Forecast
61       372.964
62       368.687
63       371.370
64       379.150
65       387.248
66       391.582

[Time Series Decomposition Plot for Trade (multiplicative model): actual, fits, trend, and forecasts; MAPE 0.8908]
For example, an office products supply company monitors inventory levels every
day. They want to use moving averages of length 2 to track inventory levels to
smooth the data. Here are the data collected over 8 days for one of their products.
Day 1 2 3 4 5 6 7 8
Inventory Level 4310 4400 4000 3952 4011 4000 4110 4220
Moving average 4310 4355 4200 3976 3981.5 4005.5 4055 4165
The first moving average is 4310, which is the value of the first observation. (In
time series analysis, the first number in the moving average series is not
calculated; it is a missing value.) The next moving average is the average of the
first two observations, (4310 + 4400) / 2 = 4355. The third moving average is
the average of observation 2 and 3, (4400 + 4000) / 2 = 4200, and so on. If you
want to use a moving average of length 3, three values are averaged instead of
two.
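The same moving average can be sketched with pandas (note that pandas leaves the first value missing, while the table above carries the first observation forward):

    import pandas as pd

    inventory = pd.Series([4310, 4400, 4000, 3952, 4011, 4000, 4110, 4220])
    ma2 = inventory.rolling(window=2).mean()
    print(ma2.tolist())
    # [nan, 4355.0, 4200.0, 3976.0, 3981.5, 4005.5, 4055.0, 4165.0]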
Moving Average:
Moving Average smoothes your data by averaging
consecutive observations in a series and provides short-
term forecasts. This procedure can be a likely choice
when your data do not have a trend or seasonal
component. There are ways, however, to use moving
averages when your data possess trend and/or
seasonality.
Use for:
· Data with no trend, and
· Data with no seasonal pattern
· Short term forecasting
Forecast profile:
· Flat line
ARIMA equivalent: none
Single Exponential Smoothing:
Single exponential smoothing smoothes your data by computing exponentially weighted
averages and provides short-term forecasts.
[Single exponential smoothing plot for Trade (length 60), Apr-02 to Mar-07: actual, fits, and forecasts; smoothing constant Alpha = 1.26370; accuracy measures: MAPE 1.2303, MAD 4.2754, MSD 42.9460]
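A minimal sketch of the smoothing recursion itself (the initialization and the alpha value here are simple assumptions; Minitab can instead choose an optimal alpha, such as the 1.26370 above, via an ARIMA(0,1,1) fit):

    def single_exp_smooth(y, alpha):
        # Fitted values: f[t] = alpha*y[t-1] + (1-alpha)*f[t-1], with f[0] = y[0]
        f = [y[0]]
        for t in range(1, len(y)):
            f.append(alpha * y[t - 1] + (1 - alpha) * f[t - 1])
        return f

    y = [4310, 4400, 4000, 3952, 4011, 4000, 4110, 4220]   # reusing the inventory data
    print(single_exp_smooth(y, alpha=0.2))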
Double Exponential Smoothing for Trade

Double exponential smoothing provides short-term forecasts. It works well when
a trend is present, but it can also serve as a general smoothing
method. Dynamic estimates are calculated for two components:
level and trend.

Data: Trade
Length: 60
Smoothing constants: Alpha (level) = 1.25883, Gamma (trend) = 0.01218
Accuracy measures: MAPE 1.0968, MAD 3.7958, MSD 43.9140

[Plot: actual, fits, forecasts, and 95.0% PI]
• Differences computes the differences between data values of a time series. If you
wish to fit an ARIMA model but there is trend or seasonality present in your data,
differencing data is a common step in assessing likely ARIMA models. Differencing is
used to simplify the correlation structure and to help reveal any underlying pattern.
• Lag computes lags of a column and stores them in a new column. To lag a time
series, Minitab moves the data down the column and inserts missing value symbols, *,
at the top of the column. The number of missing values inserted depends upon the
length of the lag.
o Choose to use the default number of lags, which is n / 4 for a series with less
than or equal to 240 observations or sqrt(n) + 45 for a series with more than
240 observations, where n is the number of observations in the series.
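In pandas the two operations look like this (toy series for illustration):

    import pandas as pd

    y = pd.Series([5, 8, 1, 3, 9, 4])
    print(y.diff())     # differences y[t] - y[t-1]; the first value is missing
    print(y.shift(1))   # lag 1: data moved down one row, missing value on top
    print(y.shift(2))   # lag 2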
Introduction to Dependent Observations
To denote this, we will index the observations with the letter t rather
than the letter i.
Our data will be observations on Y1, Y2, ...Yt, ...where t indexes the day,
month, year, or any time interval.
If the readings are iid N(μ, σ²), what would be your prediction for YT+1?
The mean June level of lake Michigan in number of meters above sea
level (lmich_yr), 1918-2006
Use Minitab Time series Plot Command (under graph menu) to produce
this graph
Introduction to Dependent Observations
Monthly US Beer Production (millions of barrels)
[Time series plot of monthly beer production (b_prod), roughly 12 to 19 million barrels: strong seasonality]
a. Introduction to Dependent Observations
What Does IID Data Look Like?
[Time series plot of 100 simulated IID observations]
t   Y(t)   Y(t-1)   Y(t-2)
1   5      *        *
2   8      5        *
3   1      8        5
4   3      1        8
5   9      3        1
6   4      9        3

Now each row has Y at time t, Y one period ago, and Y two periods ago.
Corr = .794
Corr = .531
Autocorrelation
Autocorrelation measures the dependence between observations of the same series
separated by s time periods, assuming the series has a common mean and variance.
The dependence between successive observations is summarized by the sample
autocorrelation

    rs = Σ t=s+1..T (Yt − Ȳ)(Yt−s − Ȳ) / Σ t=1..T (Yt − Ȳ)²
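A direct implementation of this formula (toy data for illustration):

    import numpy as np

    def acf(y, s):
        # Sample autocorrelation r_s as defined above
        y = np.asarray(y, dtype=float)
        ybar = y.mean()
        num = np.sum((y[s:] - ybar) * (y[:-s] - ybar))
        den = np.sum((y - ybar) ** 2)
        return num / den

    y = [5, 8, 1, 3, 9, 4, 7, 2, 6, 5]
    print([round(acf(y, s), 3) for s in (1, 2, 3)])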
Autocorrelation
There is a strong dependence between observations spaced close together in time
(e.g. only one or two years apart). As time passes, the dependence diminishes in strength.
c. Autocorrelation
Let’s look at the autocorrelations for the IID series.
[ACF of the IID series, lags 2 to 24] In contrast to the ACF for the ‘level’ series, the
autocorrelations of the IID series are all close to zero.
Autocorrelation
[Time series plot of IBM returns (IBM-ret), 360 observations, roughly −0.2 to 0.2]
Autocorrelation
Let’s look at the ACF for the series.
[Autocorrelation Function for IBM-ret (with 5% significance limits for the autocorrelations), lags 1 to 60]
    AR(1): Yt = β0 + β1·Yt−1 + εt

    YT+1 = β0 + β1·YT + εT+1
The AR(1) model expresses what we don’t know in terms of
what we do know at time T.
The AR(1) Model
So to check the AR(1) model, we can check the residuals from the
regression for any “left-over” dependence.
d. The AR(1) Model
Let’s try it out on the lake water level data...
Regression Analysis: level versus level_t-1
Analysis of Variance
Source DF SS MS F P
Regression 1 8.1675 8.1675 146.39 0.000
Residual Error 86 4.7983 0.0558
Total 87 12.9657
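The same lag-1 regression can be sketched in a few lines (the series here is a stand-in; the lake level data are not reproduced in the text):

    import numpy as np

    y = np.array([5.0, 8, 1, 3, 9, 4, 7, 2, 6, 5])   # stand-in series

    # Regress Y(t) on Y(t-1): the AR(1) model Y_t = b0 + b1*Y_(t-1) + e_t
    Y, Ylag = y[1:], y[:-1]
    b1, b0 = np.polyfit(Ylag, Y, 1)       # slope first, then intercept
    print(b0, b1)

    resid = Y - (b0 + b1 * Ylag)          # check these for left-over autocorrelation
    print(resid[:3])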
The AR(1) Model
Now let’s look at the ACF of the residuals…
Not much
autocorrelation
left!
The AR(1) Model
Now let’s try the beer data…
Regression Analysis
[ACF of the residuals from the beer-data regression, lags 2 to 18]
[Time series plot of a simulated AR(1) series with β0 = 0, β1 = 0.8, 100 observations]
The series fluctuates around a mean level with fairly long “runs”.
The AR(1) Model
Now the ACF…
[Autocorrelation Function for the simulated AR(1) series (with 5% significance limits for the autocorrelations), lags 2 to 24]
[Time series plot of a simulated AR(1) series with β0 = 0, β1 = −0.8 (AR(1)-.8), 100 observations]
Because β1 is negative, an above average Y tends to be followed
by a below average Y (and vice versa) - hence the jagged
appearance of the plot.
The AR(1) Model
and the ACF…
[Autocorrelation Function for AR(1)-.8 (with 5% significance limits for the autocorrelations), lags 2 to 24]
Use ARIMA to model time series behavior and to generate forecasts. ARIMA fits
a Box-Jenkins ARIMA model to a time series. ARIMA stands for Autoregressive
Integrated Moving Average with each term representing steps taken in the
model construction until only random noise remains. ARIMA modeling differs
from the other time series methods discussed in this chapter in that
ARIMA modeling uses correlation techniques. ARIMA can be used to model
patterns that may not be visible in plotted data. The concepts used in this
procedure follow Box and Jenkins; for an elementary introduction to time series,
see their work.
The ACF and PACF of the food employment data suggest an autoregressive model of order 1, or AR(1),
after taking a difference of order 12. You fit that model here, examine diagnostic plots, and examine the
goodness of fit. To take a seasonal difference of order 12, you specify the seasonal period to be 12, and the
order of the difference to be 1. In the subsequent example, you perform forecasting.
1 Open the worksheet EMPLOY.MTW.
2 Choose Stat > Time Series > ARIMA.
3 In Series, enter Food.
4 Check Fit seasonal model. In Period, enter 12. Under Nonseasonal, enter 1 in Autoregressive.
Under Seasonal, enter 1 in Difference.
5 Click Graphs. Check ACF of residuals and PACF of residuals.
6 Click OK in each dialog box.
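The equivalent fit can be sketched in Python with statsmodels' SARIMAX (the file and column names are assumptions; order=(1,0,0) with seasonal_order=(0,1,0,12) mirrors the AR(1) model with one seasonal difference of period 12):

    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    food = pd.read_csv("employ.csv")["Food"]   # hypothetical file/column names

    model = SARIMAX(food, order=(1, 0, 0), seasonal_order=(0, 1, 0, 12))
    fit = model.fit(disp=False)
    print(fit.summary())
    print(fit.forecast(12))   # forecast the next 12 months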
ARIMA Method:
ARIMA Model: Food
[PACF and ACF of the residuals from the ARIMA model for Food, lags 1 to 12]
ARIMA Method:
After you have identified one or more likely models, you need to specify the model in the main ARIMA dialog box.
· If you want to fit a seasonal model, check Fit seasonal model and enter a number to specify the period. The period
is the span of the seasonality or the interval at which the pattern is repeated. The default period is 12.
You must check Fit seasonal model before you can enter the seasonal autoregressive and moving average parameters
or the number of seasonal differences to take.
· To specify autoregressive and moving average parameters to include in nonseasonal or seasonal ARIMA models, enter
a value from 0 to 5. The maximum is 5. At least one of these parameters must be nonzero. The total for all parameters
must not exceed 10. For most data, no more than two autoregressive parameters or two moving average parameters are
required in ARIMA models.
For example, if you enter 2 in the box for Moving Average under Seasonal, the model will include
first- and second-order seasonal moving average terms.
· To specify the number of nonseasonal and/or seasonal differences to take, enter a number in the appropriate box. If
you request one seasonal difference with k as the seasonal period, the kth difference will be taken.
· To include the constant in the model, check Include constant term in model.
· You may want to specify starting values for the parameter estimates. You must first enter the starting values in a
worksheet column in the following order: AR's (autoregressive parameters), seasonal AR's, MA's (moving average
parameters), seasonal MA's, and if you checked Include constant term in model enter the starting value for the
constant in the last row of the column. This is the same order in which the parameters appear on the output. Check
Starting values for coefficients, and enter the column containing the starting values for each parameter included in
the model. Default starting values are 0.1 except for the constant.
ARIMA Method:
Box and Jenkins present an interactive approach for fitting ARIMA models to time series.
This iterative approach involves identifying the model, estimating the parameters, checking
model adequacy, and forecasting, if desired. The model identification step generally requires
judgment from the analyst.
1 First, decide if the data are stationary. That is, do the data possess a constant
mean and variance.
2 Next, examine the ACF and PACF of the (possibly differenced) series:
· An ACF with large spikes at initial lags that decay to zero, or a PACF with a large spike at
the first and possibly at the second lag, indicates an autoregressive process.
· An ACF with a large spike at the first and possibly at the second lag, and a PACF with
large spikes at initial lags that decay to zero, indicates a moving average process.
· The ACF and the PACF both exhibiting large spikes that gradually die out indicates that
both autoregressive and moving average processes are present.
For most data, no more than two autoregressive parameters or two moving average
parameters are required in ARIMA models.
ARIMA Method:
3 Once you have identified one or more likely models, you are ready to use the
ARIMA procedure.
· Fit the likely models and examine the significance of parameters and select
one model that gives the best fit.
· Check that the ACF and PACF of residuals indicate a random process,
signified when there are no large spikes. You can easily obtain an ACF and a
PACF of residual using ARIMA's Graphs subdialog box. If large spikes remain,
consider changing the model.
· You may perform several iterations in finding the best model. When you are
satisfied with the fit, go ahead and make forecasts.
The ARIMA algorithm will perform up to 25 iterations to fit a given model. If the
solution does not converge, store the estimated parameters and use them as
starting values for a second fit. You can store the estimated parameters and use
them as starting values for a subsequent fit as often as necessary.
ARIMA Method:
In the example of fitting an ARIMA model, you found that an AR(1) model with a
twelfth seasonal difference gave a good fit to the food sector employment data.
You now use this fit to predict employment for the next 12 months.
Step 1: Refit the ARIMA model without displaying the ACF and PACF of the
residuals