Analytics
Descriptive Analytics
Objectives
• Understand why variation in data matters
• Find the range, variance, standard deviation, and coefficient of variation and
know what these values mean
• Apply the empirical rule to describe the variation of population values
around the mean
Coverage
• Predictive analytics techniques such as regression attempt to explain variation
in the outcome variable (Y) using predictor variables (X)
• Variability in the data is measured using the following measures:
• Range
• Inter-Quartile Distance (IQD)
• Variance
• Standard Deviation
Measures of Variability
Variation
Measures of variation
give information on the
spread or variability of
the data values.
Same center,
different variation
Range
• Simplest measure of variation
• Range is the difference between maximum and minimum value of the data.
It captures the data spread:
Range = X_largest – X_smallest
Example:
(Dot plot on an axis from 0 to 14; the smallest observation is 1 and the largest is 14.)
Range = 14 - 1 = 13
Disadvantages of the Range
• Ignores the way in which data are distributed
(Two dot plots on the same 7 to 12 axis, with the values distributed differently.)
Range = 12 - 7 = 5 in both cases
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
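The two outlier examples above can be checked with a short sketch. The function name `data_range` is an illustrative choice, not something from the slides.

```python
def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

# The two data sets from the slide: eleven 1s, eight 2s, four 3s, a 4,
# and a final value of either 5 or 120.
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(data_range(base))          # 4
print(data_range(with_outlier))  # 119
```

Changing a single observation moves the range from 4 to 119, which is why the range alone is a fragile measure of spread.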
Quartile
• Quartiles divide the ranked data into 4 equal parts. The first quartile (Q1) marks
the first 25% of the data, Q2 marks 50% of the data and is also the median, and
the third quartile (Q3) marks 75% of the data
Quartiles
• Quartiles split the ranked data into 4 segments with an equal
number of values per segment
Q1 Q2 Q3
The first quartile, Q1, is the value for which 25% of the observations
are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile
Quartiles
(n = 9)
Q1 is in the 0.25(9+1) = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values:
Q1 = 12.5
Quartile Formulas
Five-number summary: minimum = 12, Q1 = 30, median = 45, Q3 = 57, maximum = 70
Interquartile range = Q3 – Q1 = 57 – 30 = 27
Percentile
• Percentile, decile and quartile are frequently used to identify the position of the
observation in the dataset.
• Percentile, denoted as Px, is the value of the data at which x percentage of the data
lie below that value
Position corresponding to $P_x \approx \frac{x(n+1)}{100}$
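A minimal sketch of the position rule x(n+1)/100, with linear interpolation when the position falls between two ranked values (the same idea as taking the value halfway between the 2nd and 3rd observations for Q1). The function name, the clamping of out-of-range positions, and the sample data are assumptions made for illustration.

```python
def percentile(values, x):
    """Value at position x*(n+1)/100 of the ranked data (1-based position)."""
    ranked = sorted(values)
    n = len(ranked)
    pos = x * (n + 1) / 100
    pos = min(max(pos, 1), n)   # assumption: clamp positions outside 1..n
    lower = int(pos)            # rank of the value at or just below pos
    frac = pos - lower          # fractional part used for interpolation
    if lower == n:
        return ranked[-1]
    return ranked[lower - 1] + frac * (ranked[lower] - ranked[lower - 1])

# Hypothetical sample with n = 9, so Q1 sits at position 0.25 * 10 = 2.5.
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(percentile(sample, 25))  # 25.0, halfway between the 2nd and 3rd values
print(percentile(sample, 50))  # 50.0, the median
print(percentile(sample, 75))  # 75.0
```

Other software may use a different position convention, so results can differ slightly from this rule.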
Variance
• The variance is essentially the average of the squared deviations from the mean.
• If $X_i$ is a typical observation, its squared deviation from the mean is $(X_i - \text{mean})^2$
• Variance for a population, $\sigma^2$, is calculated using
$\sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}$
Note
• If all observations are close to the mean, their squared deviations from the
mean—and the variance—will be relatively small
• If at least a few of the observations are far from the mean, their squared
deviations from the mean—and the variance—will be large
• In Excel, use the VAR function to obtain the sample variance and the VARP
function to obtain the population variance
Population Variance
• Average of squared deviations of values from the mean
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Where
μ = population mean
N = population size
x_i = ith value of the variable x
Sample Variance
• Average (approximately) of squared deviations of values from the mean
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Where
$\bar{x}$ = arithmetic mean
n = sample size
x_i = ith value of the variable x
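The difference between the population and sample formulas is only the divisor (N versus n - 1). A minimal sketch, using a hypothetical data set and an illustrative function name:

```python
def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1) if sample else squared_deviations / n

# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = variance(data, sample=False)   # 32 / 8 = 4.0
samp_var = variance(data, sample=True)   # 32 / 7, about 4.571
print(pop_var, round(samp_var, 3))
```

Python's standard library offers `statistics.pvariance` and `statistics.variance` for the same two quantities.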
Degrees of Freedom
• Degrees of freedom is equal to the number of independent variables in the
model (Trochim, 2005).
• For example, we can create a sample of size n with a given mean value $\bar{X}$ by
freely choosing (n − 1) values; the one remaining value is then fixed by the mean.
Thus the number of independent variables in this case is (n − 1)
Generalized Degrees of Freedom
• Degrees of freedom is defined as the difference between the number of
observations in the sample and number of parameters estimated (Walker
1940, Toothaker and Miller, 1996).
• Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• If the values of a variable are approximately normally distributed (symmetric and bell-
shaped), then the following rules hold:
• Approximately 68% of the observations are within one standard deviation of the mean.
• Approximately 95% of the observations are within two standard deviations of the mean.
• Approximately 99.7% of the observations are within three standard deviations of the mean
Chebychev’s Inequality
• For any population with mean μ and standard deviation σ, and k > 1,
the percentage of observations that fall within the interval [μ − kσ, μ + kσ]
is at least $100\left[1 - \frac{1}{k^2}\right]\%$
Chebychev’s Theorem
• Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will
fall within k standard deviations of the mean (for k > 1)
• Examples:
At least                      within
$(1 - 1/1.5^2)$ = 55.6%       k = 1.5 (μ ± 1.5σ)
$(1 - 1/2^2)$ = 75%           k = 2 (μ ± 2σ)
$(1 - 1/3^2)$ = 89%           k = 3 (μ ± 3σ)
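The three example bounds can be reproduced with a few lines. The function name is an illustrative choice.

```python
def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

for k in (1.5, 2, 3):
    print(k, round(100 * chebyshev_lower_bound(k), 1))
# 1.5 55.6
# 2 75.0
# 3 88.9
```

These are guaranteed minimums for any distribution. For bell-shaped data the empirical rule gives much higher shares (about 95% within two standard deviations rather than at least 75%).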
Problem
• You are the owner of a retail store that sells imported fashion goods for the
premium segment. Average footfall in the store is 500 per month.
• The mean value of a customer purchase is INR 60,000 (or USD 1,000) and the
standard deviation is INR 6,000 (or USD 100).
• You would like to devise a promotion campaign to target customers in the spending
range between INR 48,000 and INR 72,000.
• You have hired an agency from NY. Given that the campaign costs USD 2 per customer
per month, what should be your estimated budget for 3 months?
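One possible way to work this problem, sketched under stated assumptions: the target range INR 48,000 to 72,000 is the mean plus or minus two standard deviations, so the share of customers in range is about 95% if purchase values are roughly bell-shaped (empirical rule), or at least 75% with no assumption about the shape (Chebyshev). Treating all 500 monthly visitors as purchasing customers is also an assumption.

```python
mean, sd = 60_000, 6_000          # INR, from the problem statement
low, high = 48_000, 72_000        # INR, target spending range
footfall = 500                    # customers per month
cost_per_customer = 2             # USD per customer per month
months = 3

k = (high - mean) / sd            # number of standard deviations: 2.0

share_empirical = 0.95            # assumes roughly bell-shaped purchases
share_chebyshev = 1 - 1 / k**2    # distribution-free lower bound: 0.75

budget_empirical = share_empirical * footfall * cost_per_customer * months
budget_chebyshev = share_chebyshev * footfall * cost_per_customer * months

print(round(budget_empirical))    # 2850 (USD)
print(round(budget_chebyshev))    # 2250 (USD)
```

Under the bell-shape assumption the campaign reaches about 475 customers a month, giving roughly USD 2,850 for three months; the Chebyshev figure of USD 2,250 is a floor rather than an estimate.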
The Empirical Rule
• μ ± 1σ contains about 68% of the values
• μ ± 2σ contains about 95% of the values
• μ ± 3σ contains about 99.7% of the values
Practical Application
• For most datasets, three standard deviations around the mean are commonly used to
describe the variability of the dataset.
• E.g., you have ordered an item from Flipkart through the cheapest mode of
transport, and the average time for the item to reach the customer is 10 days. Given
that the standard deviation is 1 day, what is a reasonable delivery estimate
to show the customer when the order is being placed?
• During rainy seasons, the standard deviation is 2 days. How will your
estimate change?
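One reasonable reading of this example is to quote the upper end of the three-standard-deviation interval, so that almost all deliveries arrive by the promised date. The sketch below assumes delivery times are roughly bell-shaped; the function name is illustrative.

```python
def delivery_window(mean_days, sd_days, k=3):
    """Interval mean - k*sd to mean + k*sd, in days."""
    return mean_days - k * sd_days, mean_days + k * sd_days

normal_season = delivery_window(10, 1)
rainy_season = delivery_window(10, 2)

print(normal_season)  # (7, 13): promise delivery within 13 days
print(rainy_season)   # (4, 16): promise delivery within 16 days
```

Doubling the standard deviation pushes the safe promise date out by three days even though the average delivery time is unchanged.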
Process Capability Index
• To measure how well a process can achieve the specification given to it
• Using a sample of output, measure the dimension of interest, and compute the
total variation using the third empirical rule
• Eg. A part dimension in a manufacturing process is specified by 5.00 +/- 0.2 cm.
• How do you measure if the given manufacturing process achieves the specification
$C_p = \frac{\text{Upper Specification} - \text{Lower Specification}}{\text{Total variation}}$
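A minimal sketch of the calculation, assuming "total variation using the third empirical rule" means the width of the interval from three standard deviations below the mean to three above, that is 6s, with s estimated from a sample. The measurements are hypothetical; only the specification 5.00 +/- 0.2 cm comes from the slide.

```python
import statistics

def process_capability(measurements, lower_spec, upper_spec):
    """Cp = (upper spec - lower spec) / total variation, with total variation = 6 * s."""
    s = statistics.stdev(measurements)   # sample standard deviation
    total_variation = 6 * s              # assumption: width of mean +/- 3s
    return (upper_spec - lower_spec) / total_variation

# Hypothetical sample of part dimensions in cm.
sample = [4.95, 5.02, 5.01, 4.98, 5.05, 4.99, 5.00, 5.03]
cp = process_capability(sample, lower_spec=4.8, upper_spec=5.2)
print(round(cp, 2))
```

A value above 1 means the specification band is wider than the natural spread of the process; a value below 1 means some output is expected to fall outside the specification.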
Example Data
Advantages of Variance and Standard Deviation
• Each value in the data set is used in the calculation
• Values far from the mean are given extra weight (because deviations from
the mean are squared)
Comparing Standard Deviations
(Three dot plots on the same axis from 11 to 21.)
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
Coefficient of Variation
• Stock A:
• Average price last year = $50
• Standard deviation = $5
$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
• Stock B:
• Average price last year = $100
• Standard deviation = $5
$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable
relative to its price
Coefficient of Variation
$CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$
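The stock comparison can be reproduced directly from the definition. The function name is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B

print(cv_a)  # 10.0
print(cv_b)  # 5.0
```

Because it is unit-free, the coefficient of variation allows comparisons between variables measured on different scales, which the standard deviation alone does not.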
Practical Example
Return to Risk
• Return to risk = 1/CV, is often easier to interpret, especially in financial
risk analysis.
• The Sharpe ratio is a related measure in finance.
Standardized Values
• A standardized value, commonly called a z-score, provides a relative
measure of the distance an observation is from the mean, which is
independent of the units of measurement.
• If $x_1, x_2, \ldots, x_n$ is a sample of n observations, then
• $z_i = \frac{x_i - \bar{x}}{s}$
• $z_i$ = z-score for $x_i$
• $\bar{x}$ = sample mean
• s = sample standard deviation
Properties
• Helps to determine how far a particular value is from the mean relative to the data set’s
standard deviation.
• The numerator represents the distance that xi is from the sample mean; a negative value
indicates that xi lies to the left of the mean, and a positive value indicates that it lies to
the right of the mean. By dividing by the standard deviation, s, we scale the distance
from the mean to express it in units of standard deviations. Thus,
• a z-score of + 1.0 means that the observation is one standard deviation to the right of the mean;
• a z-score of - 1.5 means that the observation is 1.5 standard deviations to the left of the mean.
Example
Usages
• Standardized value
• Empirical rule:
• For data having a bell-shaped distribution:
• Within 1 standard deviation – approximately 68% of the data values.
• Within 2 standard deviations – approximately 95% of the data values.
• Within 3 standard deviations – almost all the data values.
• Identifying outliers:
• Outliers: Extreme values in a data set.
• They can be identified using standardized values (z-scores).
• Any data value with a z-score less than –3 or greater than +3 is an outlier.
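A short sketch of the z-score outlier rule, applied to the data set with the value 120 used earlier in the range example. The function names are illustrative choices.

```python
import statistics

def z_scores(values):
    """z_i = (x_i - sample mean) / sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

def outliers(values, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

data = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]
print(outliers(data))  # [120]
```

Note that an extreme value inflates both the mean and the standard deviation, so in very small samples even a gross outlier may not reach a z-score of 3.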
In Excel
Shape of the Distribution
Introduction
• A frequency distribution which is not symmetrical is said to be
asymmetrical or skewed
• Apart from central tendency and the dispersion, it is essential to know the
shape of a distribution to describe a data set
Definition
• Skewness is a measure of symmetry or lack of symmetry. A dataset is
symmetrical when the proportion of data at equal distance (measured in
terms of standard deviation) on either side of the mean (or median) is equal. That is, the
proportion of data between μ − kσ and μ is the same as that between μ and μ + kσ,
where k is some positive constant
Example
• In Excel, if R is a range containing the data elements of a sample S, then
SKEW(R) returns the skewness of S.
Shape and Measures of Location
• Comparing measures of location can sometimes reveal information about the shape
of the distribution of observations.
• For example, if the distribution were perfectly symmetrical and unimodal, the mean, median, and
mode would all be the same.
• Positive skew typically implies mean > median > mode
• Negative skew typically implies mean < median < mode
Measures of Skewness
• Absolute measures of skewness
• Karl Pearson Coefficient of Skewness
• Bowley’s Coefficient of skewness
• Kelly’s Coefficient of skewness
Measures of Tail Shape: Kurtosis
• Kurtosis refers to the peakedness (i.e., high, narrow) or flatness (i.e., short, flat-topped) of a
histogram.
• Kurtosis is aimed at measuring the shape of the tail, i.e. if the tail of the data distribution is
heavy or light
• The coefficient of kurtosis (CK) measures the degree of kurtosis of a population
• CK < 3 indicates the data is somewhat flat with a wide degree of dispersion (Platykurtic).
• CK > 3 indicates the data is somewhat peaked with less dispersion (Leptokurtic).
Excess Kurtosis
• The excess kurtosis is a measure that captures deviation from kurtosis of a
normal distribution and is given by:
• Excess Kurtosis = $\frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4} - 3$
Note
• Kurtosis has to do with the “fatness” of the tails of the
distribution relative to the tails of a normal distribution.
• A distribution with high kurtosis has many more extreme
observations.
• In Excel, kurtosis can be calculated with the KURT
function.
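A minimal sketch of the excess kurtosis calculation as the average fourth-power deviation divided by the fourth power of the standard deviation, minus 3. It assumes σ is the population standard deviation (divisor n); the data set is hypothetical and the function name is illustrative.

```python
def excess_kurtosis(values):
    """(average of (x - mean)^4) / sigma^4 - 3, with sigma^2 computed using divisor n."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n      # sigma^2
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / variance**2 - 3                   # variance^2 = sigma^4

# Hypothetical, evenly spread data: flatter than a normal distribution.
data = [1, 2, 3, 4, 5]
print(round(excess_kurtosis(data), 2))  # -1.3
```

Excel's KURT function applies a small-sample adjustment, so its result for the same data will generally differ from this unadjusted formula.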
The 5 number Summary
The five number summary is the summary of the data set using the five
measures, namely
• Smallest number
• First quartile
• Median
• Third quartile
• Largest number
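A sketch of the five-number summary using Python's standard library. `statistics.quantiles` with its default "exclusive" method places quartiles at the p(n+1) positions, matching the position rule used for quartiles earlier in these slides. The sample data and the function name are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # default method="exclusive"
    return min(values), q1, median, q3, max(values)

sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]
summary = five_number_summary(sample)
print(summary)                   # (10, 25.0, 50.0, 75.0, 90)
print(summary[3] - summary[1])   # interquartile range: 50.0
```

These five values are exactly what a box and whisker plot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers reach toward the smallest and largest values.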
Box & Whisker Plot
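Weighted Mean
The weighted mean of a set of data is
$\bar{x} = \frac{\sum_{i} f_i x_i}{\sum_{i} f_i} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}$
Where $f_i$ is the weight of the ith observation
Use when data is already grouped into n classes, with $f_i$ values in the ith class; the
denominator $\sum_i f_i$ is then the total number of observations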
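A minimal sketch of the weighted mean, dividing the weighted sum by the total weight. The values and weights are hypothetical and the function name is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(f * x for f, x in zip(weights, values)) / sum(weights)

# Hypothetical grouped data: value 1 occurs once, 2 once, 3 twice.
values = [1, 2, 3]
weights = [1, 1, 2]
print(weighted_mean(values, weights))  # (1 + 2 + 6) / 4 = 2.25
```

With all weights equal, this reduces to the ordinary arithmetic mean.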
Approximations for Grouped Data
Suppose data are grouped into K classes, with frequencies f1, f2, . . . fK, and the
midpoints of the classes are m1, m2, . . ., mK
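• The approximate variance is built from the frequency-weighted squared deviations of
the class midpoints from the mean $\bar{x}$: $\sum_{i=1}^{K} f_i (m_i - \bar{x})^2$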
• Calculate the 80th & 95th percentile
• Calculate the scores which might be considered as outliers
• Dean believes CGPA is right tailed – can you support?
CGPA data:
1.80 3.28 1.98 3.75
1.56 1.43 1.48 2.20
1.83 1.95 1.28 2.23
3.74 1.66 2.96 1.62
2.86 1.14 2.77 1.89
1.38 1.71 2.74 2.16
2.84 1.68 3.35 2.07
Thank You