
Introduction to Business Analytics
Descriptive Analytics
Objectives
• Understand the need to measure variation in data
• Find the range, variance, standard deviation, and coefficient of variation and
know what these values mean
• Apply the empirical rule to describe the variation of population values
around the mean
Coverage
• Predictive analytics techniques such as regression attempt to explain variation
in the outcome variable (Y) using predictor variables (X)
• Variability in the data is measured using the following measures:
• Range
• Inter-Quartile Distance (IQD)
• Variance
• Standard Deviation
Measures of Variability
Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

 Measures of variation
give information on the
spread or variability of
the data values.

Same center,
different variation
Range
• Simplest measure of variation
• Range is the difference between maximum and minimum value of the data.
It captures the data spread:
Range = Xlargest – Xsmallest
Example: data plotted on a number line running from 0 to 14, with smallest value 1 and largest value 14

Range = 14 - 1 = 13
Disadvantages of the Range
• Ignores the way in which data are distributed

Example: two data sets distributed differently over the same 7 to 12 scale both give Range = 12 - 7 = 5

• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Quartile
• Quartiles divide the ranked data into 4 equal parts. The first quartile (Q1) is the
value below which the first 25% of the data lie; Q2 is the value below which 50% of the
data lie and is also the median. The third quartile (Q3) is the value below which 75% of the data lie
Quartiles
• Quartiles split the ranked data into 4 segments with an equal
number of values per segment

25% 25% 25% 25%

Q1 Q2 Q3

 The first quartile, Q1, is the value for which 25% of the observations
are smaller and 75% are larger
 Q2 is the same as the median (50% are smaller, 50% are larger)
 Only 25% of the observations are greater than the third quartile
Quartiles

 Example: Find the first quartile


Sample Ranked Data: 11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the 0.25(9+1) = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values:

Q1 = 12.5
Quartile Formulas

Find a quartile by determining the value in the appropriate
position in the ranked data, where

First quartile position: Q1 = 0.25(n+1)

Second quartile position: Q2 = 0.50(n+1) (the median position)

Third quartile position: Q3 = 0.75(n+1)

where n is the number of observed values
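As a worked check of these position formulas, the ranked sample 11 12 13 16 16 17 18 21 22 (n = 9) from the first-quartile example gives the remaining two quartiles as follows. A half position is resolved by averaging the two neighbouring ranked values, the same convention used for Q1.

```latex
\[
\begin{aligned}
Q_2 &: \ 0.50(9+1) = 5 \ \Rightarrow\ Q_2 = 16 \\
Q_3 &: \ 0.75(9+1) = 7.5 \ \Rightarrow\ Q_3 = \tfrac{18 + 21}{2} = 19.5
\end{aligned}
\]
```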


Interquartile Range
• The outlier problem can be reduced by eliminating high- and low-valued
observations and calculating the range of the middle 50% of the data

• Inter-quartile distance (IQD), also called inter-quartile range (IQR), is a
measure of the distance between Quartile 1 (Q1) and Quartile 3 (Q3)

• Interquartile range = 3rd quartile – 1st quartile


IQR = Q3 – Q1
Interquartile Range
Example (box plot, 25% of the data in each segment):
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70

Interquartile range
= 57 – 30 = 27
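A minimal sketch of why the IQR is less sensitive to outliers than the range, reusing the two 25-value data sets from the "Sensitive to outliers" example. It assumes Python's standard `statistics.quantiles` with `method="exclusive"`, which interpolates at the p(n+1) position and so matches the quartile position formulas used here. The helper names `data_range` and `iqr` are illustrative.

```python
import statistics

# The two data sets from the "Sensitive to outliers" example; they differ
# only in the last value (5 versus 120).
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]


def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


def iqr(values):
    """IQR = Q3 - Q1, with quartiles at the 0.25(n+1) and 0.75(n+1) positions."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1


for name, values in [("base", base), ("with outlier", with_outlier)]:
    print(f"{name}: range = {data_range(values)}, IQR = {iqr(values)}")
```

The range jumps from 4 to 119 when the single value 120 replaces 5, while the IQR is 1.5 for both data sets.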
Percentile
• Percentile, decile and quartile are frequently used to identify the position of the
observation in the dataset.

• Percentile, denoted as Px, is the value of the data at which x percentage of the data
lie below that value
Position corresponding to Px ≈ x(n+1)/100

• The position of Px in the ranked data is calculated as above, where n is the number of
observations in the data.
Decile
• Decile corresponds to special values of percentile that divide the data into 10
equal parts. First decile contains first 10% of the data and second decile
contains first 20% of the data and so on.
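The percentile position formula can be sketched as a small function. The slide gives only the position x(n+1)/100; resolving fractional positions by linear interpolation and clamping positions outside 1..n are assumptions of this sketch, and the name `percentile` is illustrative. The data are the ranked sample from the first-quartile example.

```python
def percentile(values, x):
    """Value at the x-th percentile using the position x(n+1)/100.

    Fractional positions are resolved by linear interpolation between the
    two neighbouring ranked values (an assumption of this sketch).
    Positions outside 1..n are clamped to the smallest or largest value.
    """
    ranked = sorted(values)
    n = len(ranked)
    position = x * (n + 1) / 100          # 1-based position in the ranked data
    position = min(max(position, 1), n)   # clamp to the valid range
    lower = int(position)                 # rank at or just below the position
    fraction = position - lower
    if lower == n:                        # already at the largest value
        return ranked[-1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(percentile(sample, 25))  # 25th percentile, the same as Q1
print(percentile(sample, 50))  # 50th percentile, the same as the median
print(percentile(sample, 80))  # 80th percentile, the same as the 8th decile
```

Deciles and quartiles are then just percentiles at multiples of 10 and 25.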
Variance
• Variance is a measure of variability in the data from the mean value

• The variance is essentially the average of the squared deviations from the mean.
• If Xi is a typical observation, its squared deviation from the mean is (Xi – mean)²
• Variance for a population, σ², is calculated using
\[ \sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n} \]
Note
• If all observations are close to the mean, their squared deviations from the
mean—and the variance—will be relatively small

• If at least a few of the observations are far from the mean, their squared
deviations from the mean—and the variance—will be large

• In Excel, use the VAR function to obtain the sample variance and the VARP
function to obtain the population variance
Population Variance
• Average of squared deviations of values from the mean
• Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Where
μ = population mean
N = population size
xi = ith value of the variable x
Sample Variance
• Average (approximately) of squared deviations of values from the mean
• Sample variance:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Where
x̄ = arithmetic mean
n = sample size
xi = ith value of the variable x
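The only difference between the two variance formulas is the divisor (N versus n − 1). A minimal sketch, written from the formulas rather than a library call, using the eight-value sample 10 12 14 15 17 18 18 24 that appears in the standard deviation calculation example of this deck; the function names are illustrative.

```python
def population_variance(values):
    """Average of squared deviations from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [10, 12, 14, 15, 17, 18, 18, 24]
print(population_variance(data))  # treats the 8 values as the whole population
print(sample_variance(data))      # treats the 8 values as a sample
```

The sample variance is always the larger of the two for the same data, because the same sum of squared deviations is divided by a smaller number.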
Degrees of Freedom
• Degrees of freedom is equal to the number of independent variables in the
model (Trochim, 2005).

• For example, we can create any sample of size n with a given mean value by
randomly selecting (n − 1) values; the one remaining value is then fixed by the mean.
Thus the number of independent variables in this case is (n − 1)
Generalized Degrees of Freedom
• Degrees of freedom is defined as the difference between the number of
observations in the sample and number of parameters estimated (Walker
1940, Toothaker and Miller, 1996).

• If there are n observations in the sample and k parameters are estimated
from the sample, then the degrees of freedom is (n − k).
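The (n − 1) idea can be illustrated with a short sketch: once the mean is fixed, n − 1 values can be chosen freely and the last one is forced. The function name `complete_sample` and the choice of values are illustrative; the seven free values are the first seven observations of the sample used in the standard deviation example of this deck.

```python
def complete_sample(free_values, target_mean):
    """Return the one remaining value that forces the sample mean to target_mean."""
    n = len(free_values) + 1
    return n * target_mean - sum(free_values)


free_values = [10, 12, 14, 15, 17, 18, 18]   # n - 1 = 7 values chosen freely
last = complete_sample(free_values, target_mean=16)
sample = free_values + [last]
print(last, sum(sample) / len(sample))
```

Only seven of the eight values carry independent information about spread once the mean of 16 has been estimated, which is why the sample variance divides by n − 1.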
Square Root of the Variance
• A fundamental problem with variance is that it is in squared units (e.g., $ → $²).
• A more natural measure is the standard deviation, which is the square root of
variance.
• The sample standard deviation, denoted by s, is the square root of the sample
variance.
• The population standard deviation, denoted by σ, is the square root of the
population variance.
• In Excel, use the STDEV function to find the sample standard deviation or the
STDEVP function to find the population standard deviation.
Population Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Population standard deviation:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
Sample Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Sample standard deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Calculation Example:
Sample Standard Deviation
Sample Data (xi): 10 12 14 15 17 18 18 24
n = 8, Mean = x̄ = 16
\[
\begin{aligned}
s &= \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}} \\
  &= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} \\
  &= \sqrt{\frac{130}{7}} = 4.3095
\end{aligned}
\]
A measure of the “average” scatter around the mean
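Recomputing the example is a useful check on the hand arithmetic. The sketch below uses Python's standard `statistics` module (`stdev` divides by n − 1, `pstdev` by N) alongside a manual calculation. For these eight values the squared deviations are 36, 16, 4, 1, 1, 4, 4 and 64, which sum to 130, so s = √(130/7) ≈ 4.3095.

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]
sum_sq = sum(squared_deviations)

s_manual = math.sqrt(sum_sq / (len(data) - 1))  # sample formula, divisor n - 1
s = statistics.stdev(data)                      # same quantity from the library
sigma = statistics.pstdev(data)                 # population formula, divisor N

print(mean, sum_sq)
print(round(s_manual, 4), round(s, 4), round(sigma, 4))
```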
Measuring variation

Small standard deviation

Large standard deviation


Rules for Interpreting Standard Deviation
• The interpretation of the standard deviation can be stated as three empirical rules.

• If the values of a variable are approximately normally distributed (symmetric and bell-
shaped), then the following rules hold:
• Approximately 68% of the observations are within one standard deviation of the mean.
• Approximately 95% of the observations are within two standard deviations of the mean.
• Approximately 99.7% of the observations are within three standard deviations of the mean
Chebychev’s Inequality

• For any population with mean μ and standard deviation σ, and k > 1,
the percentage of observations that fall within the interval

[μ − kσ, μ + kσ]

is at least
\[ 100\left[1 - \frac{1}{k^2}\right]\% \]
Chebychev’s Theorem
• Regardless of how the data are distributed, at least (1 − 1/k²) of the values will
fall within k standard deviations of the mean (for k > 1)

• Examples:

At least within
(1 − 1/1.5²) = 55.6% ……... k = 1.5 (μ ± 1.5σ)
(1 − 1/2²) = 75% …........... k = 2 (μ ± 2σ)
(1 − 1/3²) = 89% …….…... k = 3 (μ ± 3σ)
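The examples above can be reproduced with a few lines. This sketch also prints the empirical-rule percentages for k = 2 and k = 3 next to the Chebychev bound, to show how much tighter the statement becomes once a bell shape can be assumed. The function name `chebyshev_lower_bound` is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


# Empirical-rule percentages, valid only for bell-shaped data.
EMPIRICAL_RULE = {2: 0.95, 3: 0.997}

for k in (1.5, 2, 3):
    line = f"k = {k}: at least {chebyshev_lower_bound(k):.1%} for any distribution"
    if k in EMPIRICAL_RULE:
        line += f", about {EMPIRICAL_RULE[k]:.1%} if bell-shaped"
    print(line)
```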
Problem
• You are the owner of a retail store that sells imported fashion goods for the
premium segment. Average footfall in the store is 500 per month.
• The mean value of the customer purchase is INR 60,000 (or USD 1000) and the
standard deviation is INR 6000 (or USD 100).
• You would like to devise a promotion campaign to target customers in the spending
range between INR 48,000 and INR 72,000.
• You have hired an agency from NY. Given that the campaign costs $2 per customer per month,
what should your estimated budget be for 3 months?
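One possible line of reasoning, assuming customer purchases are approximately bell-shaped so that the 95% empirical rule applies (the problem statement does not say this): the target range is exactly two standard deviations either side of the mean.

```latex
\[
\begin{aligned}
48{,}000 &= 60{,}000 - 2(6{,}000), \qquad 72{,}000 = 60{,}000 + 2(6{,}000) \\
\text{customers per month} &\approx 0.95 \times 500 = 475 \\
\text{budget for 3 months} &\approx 475 \times \$2 \times 3 = \$2{,}850
\end{aligned}
\]
```

Without the bell-shape assumption, Chebychev's bound for k = 2 guarantees only that at least 75% of customers (375 per month) fall in the range, which gives a budget of at least $2,250.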
The Empirical Rule

If the data distribution is bell-shaped, then the interval μ ± 1σ
contains about 68% of the values in the population or the sample

(Figure: bell curve with the central 68% between μ − 1σ and μ + 1σ)
The Empirical Rule

About 95% of the dataset lies within μ ± 2σ, and about 99.7% lies within μ ± 3σ

(Figure: two bell curves, one with 95% over μ ± 2σ and one with 99.7% over μ ± 3σ)
Practical Application
• For most datasets, three standard deviations around the mean are commonly used to
describe the variability of the dataset.
• E.g., you have ordered an item from Flipkart through the cheapest mode of
transport, and the average time for the item to reach the customer is 10 days. Given
that the standard deviation is 1 day, what would be a reasonable delivery estimate
to show the customer while the order is being placed?
• During rainy seasons, the standard deviation is 2 days. How will your
estimate change?
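One way to read this example, assuming delivery times are roughly bell-shaped and that the promised date should cover nearly all deliveries, is to quote the upper end of the three-standard-deviation interval:

```latex
\[
\begin{aligned}
\text{normal season:} \quad & \mu + 3\sigma = 10 + 3(1) = 13 \text{ days} \\
\text{rainy season:} \quad & \mu + 3\sigma = 10 + 3(2) = 16 \text{ days}
\end{aligned}
\]
```

A less conservative promise could use μ + 2σ (12 and 14 days), accepting that a few percent of orders would arrive late.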
Process Capability Index
• To measure how well a process can achieve the specification given to it
• Using a sample of output, measure the dimension of interest, and compute the
total variation using the third empirical rule
• E.g., a part dimension in a manufacturing process is specified as 5.00 ± 0.2 cm.
• How do you measure whether the given manufacturing process achieves the specification?
\[ C_p = \frac{\text{Upper Specification} - \text{Lower Specification}}{\text{total variation}} \]
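A minimal sketch of the index, taking "total variation" to be the full width of the three-standard-deviation interval, i.e. 6σ, as the third empirical rule suggests. The measurements below are hypothetical values invented for illustration, not data from the deck, and the function names are illustrative.

```python
import statistics


def cp_from_sigma(lower_spec, upper_spec, sigma):
    """Cp = (upper spec - lower spec) / total variation, with total variation = 6 sigma."""
    return (upper_spec - lower_spec) / (6 * sigma)


def process_capability(measurements, lower_spec, upper_spec):
    """Estimate sigma from a sample of output, then compute Cp."""
    sigma = statistics.stdev(measurements)
    return cp_from_sigma(lower_spec, upper_spec, sigma)


# Hypothetical sample of measured part dimensions in cm (illustrative only).
sample = [4.95, 5.02, 5.05, 4.98, 5.00, 5.03, 4.97, 5.00]

# Specification 5.00 +/- 0.2 cm gives limits 4.8 and 5.2.
cp = process_capability(sample, lower_spec=4.8, upper_spec=5.2)
print(round(cp, 2))
```

A value of Cp above 1 means the 6σ spread of the process is narrower than the specification width; a value below 1 means the process varies more than the specification allows.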
Example Data
Advantages of Variance and Standard Deviation
• Each value in the data set is used in the calculation

• Values far from the mean are given extra weight (because deviations from
the mean are squared)
Comparing Standard Deviations
Three data sets plotted on the same scale (11 to 21), each with Mean = 15.5:
Data A: s = 3.338
Data B: s = 0.926
Data C: s = 4.570
Coefficient of Variation
• Stock A:
• Average price last year = $50
• Standard deviation = $5
\[ CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]
• Stock B:
• Average price last year = $100
• Standard deviation = $5
\[ CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% \]
Both stocks have the same standard
deviation, but stock B is less variable
relative to its price
Coefficient of Variation

• Measures relative variation with respect to mean


• Generally expressed in percentage (%)
• Can be used to compare two or more sets of data measured in different units
• Smaller the CV, smaller is the risk

\[ CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\% \]
Practical Example
Return to Risk
• Return to risk = 1/CV, is often easier to interpret, especially in financial
risk analysis.
• The Sharpe ratio is a related measure in finance.
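The coefficient of variation and its reciprocal can be sketched for the two stocks from the Coefficient of Variation example. Return to risk is computed here as mean divided by standard deviation, i.e. 1/CV with CV taken as a ratio rather than a percentage; the function names are illustrative.

```python
def coefficient_of_variation(mean, std_dev):
    """CV as a percentage: (s / mean) * 100."""
    return std_dev / mean * 100


def return_to_risk(mean, std_dev):
    """Reciprocal of CV expressed as a ratio: mean / s."""
    return mean / std_dev


# (average price, standard deviation) for the two stocks in the example.
stocks = {"A": (50, 5), "B": (100, 5)}

for name, (mean, std_dev) in stocks.items():
    cv = coefficient_of_variation(mean, std_dev)
    rr = return_to_risk(mean, std_dev)
    print(f"Stock {name}: CV = {cv:.0f}%, return to risk = {rr:.0f}")
```

Stock B has the lower CV and therefore the higher return to risk, which is the "larger is better" reading that makes the reciprocal easier to interpret.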
Standardized Values
• A standardized value, commonly called a z-score, provides a relative
measure of the distance an observation is from the mean, which is
independent of the units of measurement.
• If x1, x2, . . . , xn is a sample of n observations, then
\[ z_i = \frac{x_i - \bar{x}}{s} \]
• zi = z-score for xi
• x̄ = sample mean
• s = sample standard deviation
Properties
• Helps to determine how far a particular value is from the mean relative to the data set’s
standard deviation.
• The numerator represents the distance that xi is from the sample mean; a negative value
indicates that xi lies to the left of the mean, and a positive value indicates that it lies to
the right of the mean. By dividing by the standard deviation, s, we scale the distance
from the mean to express it in units of standard deviations. Thus,
• a z-score of + 1.0 means that the observation is one standard deviation to the right of the mean;
• a z-score of - 1.5 means that the observation is 1.5 standard deviations to the left of the mean.
Example
Usages
• Standardized value
• Empirical rule:
• For data having a bell-shaped distribution:
• Within 1 standard deviation – approximately 68% of the data values.
• Within 2 standard deviations – approximately 95% of the data values.
• Within 3 standard deviations – almost all the data values.
• Identifying outliers:
• Outliers: Extreme values in a data set.
• They can be identified using standardized values (z-scores).
• Any data value with a z-score less than –3 or greater than +3 is treated as an outlier.
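The z-score rule for outliers can be sketched with the two 25-value data sets from the range example earlier in the deck, which differ only in their last value (5 versus 120). The function names `z_scores` and `outliers` are illustrative, and the sample standard deviation is used as in the z-score definition.

```python
import statistics


def z_scores(values):
    """Standardize each value: (x - sample mean) / sample standard deviation."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - x_bar) / s for x in values]


def outliers(values, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(outliers(base))          # no value lies beyond 3 standard deviations
print(outliers(with_outlier))  # only the value 120 is flagged
```

Note that an extreme value inflates the mean and standard deviation used to judge it, so in very small samples a single outlier can mask itself under this rule.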
In Excel
Shape of the Distribution
Introduction
• A frequency distribution which is not symmetrical is said to be
asymmetrical or skewed

• Apart from central tendency and the dispersion, it is essential to know the
shape of a distribution to describe a data set
Definition
• Skewness is a measure of symmetry or lack of symmetry. A dataset is
symmetrical when the proportion of data at equal distance (measured in
terms of standard deviation) from the mean (or median) is equal. That is, the
proportion of data between μ and μ − kσ is the same as that between μ and μ + kσ,
where k is some positive constant
Example

• A variable can be skewed to the right (or positively skewed)
because of some really large values (e.g., really large baseball
salaries).
• Or it can be skewed to the left (or negatively skewed) because
of some really small values (e.g., temperature lows in Antarctica).
Example

Left Skewed Right Skewed


Measures of Shape: Skewness

In Excel, a measure of skewness can be calculated with the SKEW function.

Positively skewed Symmetrical


Coefficient of Skewness

• Coefficient of Skewness (CS):
\[ CS = g_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3 / n}{\sigma^3} \]
• CS is negative for left-skewed data.
• CS is positive for right-skewed data.
• |CS| > 1 suggests high degree of skewness.
• 0.5 ≤ |CS| ≤ 1 suggests moderate skewness.
• CS in the range of -0.5 to +0.5 (|CS| close to 0) suggests relative symmetry.
Formulae in case of Sample
• Excel provides the SKEW function as a way to calculate the skewness of a sample S,
i.e., if R is a range in Excel containing the data elements in S, then SKEW(R) = the
skewness of S.
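A minimal sketch of the coefficient of skewness g1, computed as the average cubed deviation divided by σ³, with σ taken as the population standard deviation (divisor n), which is an assumption about the slide's notation. Excel's SKEW applies a sample-size adjustment, so its result differs somewhat from g1 on small samples. The data are the 25 values from the range example; the function name is illustrative.

```python
def coefficient_of_skewness(values):
    """g1 = (sum of cubed deviations / n) / sigma^3, with sigma using divisor n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n   # average cubed deviation
    return m3 / m2 ** 1.5


right_tailed = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
symmetric = [1, 2, 3, 4, 5]
left_tailed = [6 - x for x in right_tailed]   # mirror image of right_tailed

print(round(coefficient_of_skewness(right_tailed), 2))  # positive, above 1
print(coefficient_of_skewness(symmetric))               # zero for symmetric data
print(round(coefficient_of_skewness(left_tailed), 2))   # same size, negative sign
```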
Shape and Measures of Location
• Comparing measures of location can sometimes reveal information about the shape
of the distribution of observations.
• For example, if the distribution were perfectly symmetrical and unimodal, the mean, median, and
mode would all be the same.
• Positive skew typically implies mean > median > mode
• Negative skew typically implies mean < median < mode
Measures of Skewness
• Absolute measures of skewness
• Karl Pearson Coefficient of Skewness
• Bowley’s Coefficient of skewness
• Kelly’s Coefficient of skewness
Measures of Tail Shape: Kurtosis
• Kurtosis refers to the peakedness (i.e., high, narrow) or flatness (i.e., short, flat-topped) of a
histogram.
• Kurtosis is aimed at measuring the shape of the tail, i.e. if the tail of the data distribution is
heavy or light
• The coefficient of kurtosis (CK) measures the degree of kurtosis of a population

• CK < 3 indicates the data is somewhat flat with a wide degree of dispersion (Platykurtic).
• CK > 3 indicates the data is somewhat peaked with less dispersion (Leptokurtic).
Excess Kurtosis
• The excess kurtosis is a measure that captures deviation from kurtosis of a
normal distribution and is given by:
\[ \text{Excess Kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4 / n}{\sigma^4} - 3 \]
Note
• Kurtosis has to do with the “fatness” of the tails of the
distribution relative to the tails of a normal distribution.
• A distribution with high kurtosis has many more extreme
observations.
• In Excel, kurtosis can be calculated with the KURT
function.
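A minimal sketch of excess kurtosis as the average fourth-power deviation divided by σ⁴, minus 3, with σ taken as the population standard deviation (divisor n). Excel's KURT uses a sample-adjusted formula, so its value will not match this exactly on small samples. The function name is illustrative; the second data set is the outlier example from earlier in the deck.

```python
def excess_kurtosis(values):
    """(sum of fourth-power deviations / n) / sigma^4 - 3, with sigma using divisor n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # population variance
    m4 = sum((x - mean) ** 4 for x in values) / n   # average fourth-power deviation
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                                   # evenly spread, light tails
heavy_tail = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]     # one extreme observation

print(round(excess_kurtosis(flat), 2))        # negative: CK below 3
print(round(excess_kurtosis(heavy_tail), 2))  # strongly positive: CK above 3
```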
The 5 number Summary
The five number summary is the summary of the data set using the five
measures, namely
• Smallest number
• First quartile
• Median
• Third quartile
• Largest number
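The five measures can be collected in one call. This sketch assumes Python's standard `statistics.quantiles` with `method="exclusive"`, which uses the p(n+1) position convention from the quartile formulas, and applies it to the ranked sample from the first-quartile example. The function name is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, median, q3, max(values)


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(five_number_summary(sample))  # (11, 12.5, 16.0, 19.5, 22)
```

These five values are exactly what a box and whisker plot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers reach to the smallest and largest values.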
Box & Whisker Plot
Weighted Mean
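The weighted mean of a set of data is
\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{\sum_{i=1}^{n} f_i} \]
Where fi is the weight of the ith observation, so the denominator Σ fi is the total weight

Use when data is already grouped into n classes, with fi values in the ith class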
Approximations for Grouped Data
Suppose data are grouped into K classes, with frequencies f1, f2, . . . fK, and the
midpoints of the classes are m1, m2, . . ., mK
• For a sample of n observations, the mean is
\[ \bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}, \qquad \text{where } n = \sum_{i=1}^{K} f_i \]
Approximations for Grouped Data
Suppose data are grouped into K classes, with frequencies f1, f2, . . . fK, and the
midpoints of the classes are m1, m2, . . ., mK
• For a sample of n observations, the variance is
\[ s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1} \]
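The two grouped-data approximations can be sketched together. The class midpoints and frequencies below are hypothetical values chosen for illustration, not data from the deck, and the function name is illustrative.

```python
def grouped_mean_and_variance(midpoints, frequencies):
    """Approximate the sample mean and variance from class midpoints and frequencies."""
    n = sum(frequencies)  # total number of observations
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    variance = sum(
        f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)
    ) / (n - 1)
    return mean, variance


# Hypothetical grouped data: three classes with midpoints 5, 15 and 25.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

mean, variance = grouped_mean_and_variance(midpoints, frequencies)
print(mean, round(variance, 2))
```

Every observation in a class is treated as if it sat at the class midpoint, which is why these are approximations rather than exact values.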
Example
Solutions
Problem 1
• Average monthly wages and SD of two garment manufacturers are given
below:
• Factory A: Average monthly wage is $4600, SD of wages is $500, and number of workers is 100
• Factory B: Average monthly wage is $4900, SD of wages is $400, and number of workers is 80
• Which factory pays the larger sum of monthly wages?
• Which factory has the greater variability in its wage distribution?
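One way to approach both questions: total wages are the average wage times the number of workers, and since the two factories have different average wages, variability is compared through the coefficient of variation rather than the raw standard deviations.

```latex
\[
\begin{aligned}
\text{Total wages, A} &= 100 \times \$4600 = \$460{,}000, &
\text{Total wages, B} &= 80 \times \$4900 = \$392{,}000 \\
CV_A &= \frac{500}{4600} \cdot 100\% \approx 10.87\%, &
CV_B &= \frac{400}{4900} \cdot 100\% \approx 8.16\%
\end{aligned}
\]
```

On this reading Factory A pays the larger monthly wage bill and also has the greater relative variability in wages.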
Problem 2
• Given the CGPA of the students, calculate:
• Mean, Median & Mode
• The 80th & 95th percentiles
• The scores which might be considered outliers
• The Dean believes the CGPA distribution is right tailed. Can you support this?

CGPA data:
3.36 1.48 2.64 2.96
1.88 1.87 3.43 3.67
2.57 1.98 1.66 1.77
1.80 3.28 1.98 3.75
1.56 1.43 1.48 2.20
1.83 1.95 1.28 2.23
3.74 1.66 2.96 1.62
2.86 1.14 2.77 1.89
1.38 1.71 2.74 2.16
2.84 1.68 3.35 2.07
Thank You
