Descriptive Statistics (Part 1)

Numerical Description
Central Tendency
Dispersion
Chapter 4
Statistics are descriptive measures derived from a
sample (n items).
Parameters are descriptive measures derived from
a population (N items).
Numerical Description
Three key characteristics of numerical data:

Characteristic: Interpretation
Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Numerical Description
Numerical statistics can be used to summarize this
random sample of brands.
$\text{Defect rate} = \frac{\text{total no. defects}}{\text{no. inspected}} \times 100$
Must allow for sampling error since the analysis is
based on sampling.
Numerical Description
Example: Vehicle Quality
Consider the data set of vehicle defect rates from
J. D. Power and Associates.
Numerical Description
Number of defects per 100 vehicles, 1004 models.
To begin, sort the
data in Excel.
Sorted data provides insight into central tendency
and dispersion.
Numerical Description
The dot plot offers a visual impression of the data.
Visual Displays
Numerical Description
Histograms with 5 bins (suggested by Sturges' Rule) and 10 bins are shown below.
Both are symmetric with no extreme values and
show a modal class toward the low end.
Visual Displays
Numerical Description
Descriptive Statistics in Excel
Go to Tools | Data Analysis and select Descriptive Statistics.
Highlight the data range, specify a cell for the upper-left corner of the output range, check Summary Statistics, and click OK.
Here is the resulting analysis.
Descriptive Statistics in MegaStat
Here is the resulting MegaStat analysis:
Central tendency refers to the middle or typical values of a distribution.
Central tendency can be assessed using a dot
plot, histogram or more precisely with numerical
statistics.
Central Tendency
Six Measures of Central Tendency

Statistic: Mean
Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Excel Formula: =AVERAGE(Data)
Pro: Familiar and uses all the sample information.
Con: Influenced by extreme values.

Statistic: Median
Formula: Middle value in sorted array
Excel Formula: =MEDIAN(Data)
Pro: Robust when extreme data values exist.
Con: Ignores extremes and can be affected by gaps in data values.

Statistic: Mode
Formula: Most frequently occurring data value
Excel Formula: =MODE(Data)
Pro: Useful for attribute data or discrete data with a small range.
Con: May not be unique, and is not helpful for continuous data.

Statistic: Midrange
Formula: $\frac{x_{\min} + x_{\max}}{2}$
Excel Formula: =0.5*(MIN(Data)+MAX(Data))
Pro: Easy to understand and calculate.
Con: Influenced by extreme values and ignores most data values.

Statistic: Geometric mean (G)
Formula: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
Excel Formula: =GEOMEAN(Data)
Pro: Useful for growth rates and mitigates high extremes.
Con: Less familiar and requires positive data.

Statistic: Trimmed mean
Formula: Same as the mean except omit highest and lowest k% of data values (e.g., 5%)
Excel Formula: =TRIMMEAN(Data, %)
Pro: Mitigates effects of extreme values.
Con: Excludes some data values that could be relevant.
A familiar measure of central tendency.
In Excel, use function =AVERAGE(Data) where
Data is an array of data values.
Population Formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Central Tendency
Mean
For the sample of n = 37 car brands:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{87 + 93 + 98 + \cdots + 159 + 164 + 173}{37} = \frac{4639}{37} = 125.38$
Central Tendency
Mean
Arithmetic mean is the most familiar average.
Affected by every sample item.
The balancing point or fulcrum for the data.
Central Tendency
Characteristics of the Mean
Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero (absolute distances do not).
$\sum_{i=1}^{n}(x_i - \bar{x}) = 0$
Central Tendency
Characteristics of the Mean
Consider the following asymmetric distribution of quiz scores whose mean = 65.
$\sum_{i=1}^{n}(x_i - \bar{x}) = (42 - 65) + (60 - 65) + (70 - 65) + (75 - 65) + (78 - 65)$
$= (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0$
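The zero-sum property can be checked numerically. The following is a minimal Python sketch using the five quiz scores from this example; the variable names are illustrative only.

```python
# Quiz scores from the example (mean = 65).
scores = [42, 60, 70, 75, 78]

mean = sum(scores) / len(scores)

# Signed deviations: negative below the mean, positive above it.
deviations = [x - mean for x in scores]

print(mean)                             # 65.0
print(deviations)                       # [-23.0, -5.0, 5.0, 10.0, 13.0]
print(sum(deviations))                  # 0.0
# Absolute deviations do not cancel out.
print(sum(abs(d) for d in deviations))  # 56.0
```

The negative deviations (-28) exactly offset the positive ones (+28), which is why the mean acts as the balancing point of the data.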
The median (M) is the 50th percentile or midpoint of the sorted sample data.
M separates the upper and lower half of the sorted
observations.
If n is odd, the median is the middle observation in
the data array.
If n is even, the median is the average of the
middle two observations in the data array.
Central Tendency
Median
Central Tendency
Median
For n = 8, the median is between the fourth and
fifth observations in the data array.
Central Tendency
Median
For n = 9, the median is the fifth observation in the
data array.
Consider the following n = 6 data values:
11 12 15 17 21 32
What is the median?
For even n, Median = $\frac{x_{n/2} + x_{(n/2+1)}}{2}$
n/2 = 6/2 = 3 and n/2+1 = 6/2 + 1 = 4
M = $(x_3 + x_4)/2$ = (15+17)/2 = 16
The median (16) falls between the third and fourth values: 11 12 15 | 16 | 17 21 32
Central Tendency
Median
Consider the following n = 7 data values:
12 23 23 25 27 34 41
What is the median?
For odd n, Median = $x_{(n+1)/2}$
(n+1)/2 = (7+1)/2 = 8/2 = 4
M = $x_4$ = 25
Central Tendency
Median
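The odd and even rules can be combined into one small function. This is a minimal Python sketch, not a library implementation; the function name `median` is chosen here for illustration, and the two test lists are the n = 6 and n = 7 examples from the slides.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the middle observation, at 1-based position (n + 1) / 2.
        return data[mid]
    # Even n: average of the observations at 1-based positions n/2 and n/2 + 1.
    return (data[mid - 1] + data[mid]) / 2

print(median([11, 12, 15, 17, 21, 32]))      # 16.0
print(median([12, 23, 23, 25, 27, 34, 41]))  # 25
```

Python lists are 0-based, so the 1-based positions n/2 and n/2 + 1 become indexes `mid - 1` and `mid`.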
Use Excel's function =MEDIAN(Data) where Data is an array of data values.
For the 37 vehicle quality ratings (odd n) the position of the median is (n+1)/2 = (37+1)/2 = 19.
So, the median is $x_{19}$ = 121.
When there are several duplicate data values, the
median does not provide a clean 50-50 split in
the data.
Central Tendency
Median
The median is insensitive to extreme data values.
For example, consider the following quiz scores for
3 students:
Tom's scores:
20, 40, 70, 75, 80 Mean = 57, Median = 70, Total = 285
Jake's scores:
60, 65, 70, 90, 95 Mean = 76, Median = 70, Total = 380
Mary's scores:
50, 65, 70, 75, 90 Mean = 70, Median = 70, Total = 350
What does the median for each student tell you?
Central Tendency
Characteristics of the Median
The most frequently occurring data value.
Similar to mean and median if data values occur
often near the center of sorted data.
May have multiple modes or no mode.
Central Tendency
Mode
For example, consider the following quiz scores for 4 students:
Lee's scores:
60, 70, 70, 70, 80 Mean = 70, Median = 70, Mode = 70
Pat's scores:
45, 45, 70, 90, 100 Mean = 70, Median = 70, Mode = 45
Sam's scores:
50, 60, 70, 80, 90 Mean = 70, Median = 70, Mode = none
Xiao's scores:
50, 50, 70, 90, 90 Mean = 70, Median = 70, Modes = 50, 90
What does the mode for each student tell you?
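The four cases (one mode, a mode far from the center, no mode, two modes) can be reproduced with a short Python sketch. The helper `modes` and the dictionary `quiz_scores` are illustrative names; reporting "no mode" as an empty list when no value repeats is a convention assumed here to match the slide.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] if no value repeats.

    Assumes a non-empty list.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: report "no mode"
    return sorted(v for v, c in counts.items() if c == top)

quiz_scores = {
    "Lee": [60, 70, 70, 70, 80],
    "Pat": [45, 45, 70, 90, 100],
    "Sam": [50, 60, 70, 80, 90],
    "Xiao": [50, 50, 70, 90, 90],
}

for student, scores in quiz_scores.items():
    print(student, modes(scores))
# Lee [70]
# Pat [45]
# Sam []
# Xiao [50, 90]
```

All four students share the same mean and median (70), yet the mode ranges from typical (Lee) to unrepresentative (Pat) to undefined (Sam).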
Easy to define, not easy to calculate in large
samples.
Use Excel's function =MODE(Array)
- will return #N/A if there is no mode.
- will return first mode found if multimodal.
May be far from the middle of the distribution and
not at all typical.
Central Tendency
Mode
Generally isn't useful for continuous data since data values rarely repeat.
Best for attribute data or a discrete variable with a
small range (e.g., Likert scale).
Central Tendency
Mode
Consider the following P/E ratios for a random sample of 68 Standard & Poor's 500 stocks.
What is the mode?
Central Tendency
Example: Price/Earnings Ratios and Mode
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
Excel's descriptive statistics results are:
Mean 22.7206
Median 19
Mode 13
Range 84
Minimum 7
Maximum 91
Sum 1545
Count 68
The mode 13 occurs 7 times, but what does the dot plot show?
Central Tendency
Example: Price/Earnings Ratios and Mode
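The same summary can be reproduced outside Excel. Below is a minimal Python sketch using the standard `statistics` module on the 68 P/E ratios listed above; the names `pe_ratios` and `summary` are illustrative.

```python
import statistics

# The 68 P/E ratios from the slide, in sorted order.
pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

summary = {
    "mean": statistics.mean(pe_ratios),
    "median": statistics.median(pe_ratios),
    "mode": statistics.mode(pe_ratios),
    "range": max(pe_ratios) - min(pe_ratios),
    "minimum": min(pe_ratios),
    "maximum": max(pe_ratios),
    "sum": sum(pe_ratios),
    "count": len(pe_ratios),
}

for name, value in summary.items():
    print(f"{name:8s} {value}")
```

The results should agree with the Excel output: a mean near 22.72, median 19, mode 13, and range 84.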
The dot plot shows local modes (a peak with
valleys on either side) at 10, 13, 15, 19, 23, 26, 29.
These multiple modes suggest that the mode is
not a stable measure of central tendency.
Central Tendency
Example: Price/Earnings Ratios and Mode
Points scored by the winning NCAA football team tend to have modes in multiples of 7 because each touchdown typically yields 7 points (6 plus the extra point).
Central Tendency
Example: Rose Bowl Winners' Points
Consider the dot plot of the points scored by the
winning team in the first 87 Rose Bowl games.
What is the mode?
A bimodal distribution refers to the shape of the
histogram rather than the mode of the raw data.
Occurs when dissimilar populations are combined in one sample.
Central Tendency
Mode
Compare the mean and median, or look at the histogram, to determine the degree of skewness.
Central Tendency
Skewness
Central Tendency
Symptoms of Skewness

Distribution's Shape: Skewed left (negative skewness)
Histogram Appearance: Long tail of histogram points left (a few low values but most data on right)
Statistics: Mean < Median

Distribution's Shape: Symmetric
Histogram Appearance: Tails of histogram are balanced (low/high values offset)
Statistics: Mean ≈ Median

Distribution's Shape: Skewed right (positive skewness)
Histogram Appearance: Long tail of histogram points right (most data on left but a few high values)
Statistics: Mean > Median
For the sample of J.D. Power quality ratings, the
mean (125.38) exceeds the median (121). What
does this suggest?
Central Tendency
Skewness
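The mean-versus-median comparison can be written as a small rule-of-thumb check. This Python sketch is an illustration only: the function name `skew_direction` and its `tolerance` parameter are assumptions introduced here, and the rule is a symptom of skewness, not a formal test.

```python
def skew_direction(mean, median, tolerance=0.0):
    """Classify skewness by comparing mean and median (a rule of thumb).

    `tolerance` is a hypothetical cutoff for treating the two as equal.
    """
    if mean - median > tolerance:
        return "skewed right"
    if median - mean > tolerance:
        return "skewed left"
    return "approximately symmetric"

# J.D. Power quality ratings: mean 125.38, median 121.
print(skew_direction(125.38, 121))   # skewed right
# P/E ratios: mean 22.72, median 19.
print(skew_direction(22.72, 19))     # skewed right
# Quiz scores with mean = median = 70.
print(skew_direction(70, 70))        # approximately symmetric
```

For the J.D. Power data the mean exceeds the median, which suggests mild right skewness; a histogram should still be inspected before drawing conclusions.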
The geometric mean (G) is a
multiplicative average.
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
For the J. D. Power quality data (n=37):
$G = \sqrt[37]{(87)(93)(98)\cdots(164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$
In Excel use =GEOMEAN(Array)
The geometric mean tends to mitigate the effects
of high outliers.
Central Tendency
Geometric Mean
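The product inside the root grows very quickly (about 10^77 for only 37 values), so in code the geometric mean is often computed through logarithms instead. The following is a minimal Python sketch under that assumption; the helper name `geometric_mean` is illustrative, and Python 3.8+ also offers `statistics.geometric_mean`.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive numbers, computed via logarithms."""
    if not values or any(x <= 0 for x in values):
        raise ValueError("requires a non-empty list of positive values")
    # exp(mean(log x)) equals the n-th root of the product, without overflow.
    return math.exp(sum(math.log(x) for x in values) / len(values))

print(geometric_mean([2, 8]))  # approximately 4

# The slide's product form: the 37th root of 2.37667e77.
g_slide = 2.37667e77 ** (1 / 37)
print(round(g_slide, 2))       # 123.38
```

The geometric mean never exceeds the arithmetic mean for positive data, which is why it dampens the influence of high outliers.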
A variation on the geometric mean is used to find the average growth rate for a time series of n values:
$G = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$
For example, from 1998 to 2002, Spirit Airlines revenues are:
Year Revenue (mil)
1998 131
1999 227
2000 311
2001 354
2002 403
The average growth rate is given by taking the geometric mean of the ratios of each year's revenue to the preceding year. Five years of data give four such ratios, so the fourth root is used.
Due to cancellations, only the first and last years are relevant:
$G = \sqrt[4]{\left(\frac{227}{131}\right)\left(\frac{311}{227}\right)\left(\frac{354}{311}\right)\left(\frac{403}{354}\right)} - 1 = \sqrt[4]{\frac{403}{131}} - 1 = 1.3244 - 1 = 0.324$ or 32.4% per year
In Excel use =(403/131)^(1/4)-1
Central Tendency
Growth Rates
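The cancellation argument can be verified numerically. This minimal Python sketch uses the Spirit Airlines revenues from the table; the variable names are illustrative. Note that five annual figures yield four year-over-year ratios, so the sketch takes a fourth root.

```python
# Spirit Airlines revenue, 1998 to 2002 (in millions).
revenue = [131, 227, 311, 354, 403]

# Ratio of each year's revenue to the preceding year's.
ratios = [later / earlier for earlier, later in zip(revenue, revenue[1:])]
periods = len(ratios)  # 4 growth periods for 5 years of data

# Geometric mean of the ratios, minus 1.
product = 1.0
for r in ratios:
    product *= r
growth_from_ratios = product ** (1 / periods) - 1

# Shortcut: intermediate years cancel, leaving only the endpoints.
growth_from_endpoints = (revenue[-1] / revenue[0]) ** (1 / periods) - 1

print(round(growth_from_ratios, 4))     # 0.3244
print(round(growth_from_endpoints, 4))  # 0.3244
```

As a sanity check, growing 131 by this rate for four consecutive years returns approximately 403.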
The midrange is the point halfway between the
lowest and highest values of X.
Easy to use but sensitive to extreme data values.
$\text{Midrange} = \frac{x_{\min} + x_{\max}}{2}$
For the J. D. Power quality data (n=37):
$\text{Midrange} = \frac{x_1 + x_{37}}{2} = \frac{87 + 173}{2} = 130$
Here, the midrange (130) is higher than the mean
(125.38) or median (121).
Central Tendency
Midrange
To calculate the trimmed mean, first remove the
highest and lowest k percent of the observations.
For example, for the n = 68 P/E ratios, we want a 5
percent trimmed mean (i.e., k = .05).
To determine how many observations to trim, multiply k × n = 0.05 × 68 = 3.4, which is rounded down to 3 observations.
So, we would remove the three smallest and three
largest observations before averaging the
remaining values.
Central Tendency
Trimmed Mean
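The trimming steps can be sketched in a few lines of Python. The function name `trimmed_mean` is illustrative, k is the fraction trimmed from each end (assumed to be less than 0.5), and the data are the 68 P/E ratios from the earlier example.

```python
def trimmed_mean(values, k):
    """Mean after dropping the lowest and highest k fraction of the data."""
    data = sorted(values)
    trim = int(k * len(data))  # round down, e.g. 0.05 * 68 = 3.4 -> 3
    kept = data[trim:len(data) - trim]
    return sum(kept) / len(kept)

pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

print(round(trimmed_mean(pe_ratios, 0.05), 2))  # 21.1
```

Here k is the share removed from each tail. Excel's =TRIMMEAN takes the total share removed from both tails combined, which is why a 5 percent trimmed mean is entered with 0.1.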
Here is a summary of all the measures of central
tendency for the n = 68 P/E values.
The trimmed mean mitigates the effects of very
high values, but still exceeds the median.
Mean: 22.72 =AVERAGE(PERatio)
Median: 19.00 =MEDIAN(PERatio)
Mode: 13.00 =MODE(PERatio)
Geometric Mean: 19.85 =GEOMEAN(PERatio)
Midrange: 49.00 =(MIN(PERatio)+MAX(PERatio))/2
5% Trim Mean: 21.10 =TRIMMEAN(PERatio,0.1)
Central Tendency
Trimmed Mean
Central Tendency
Trimmed Mean
The Federal Reserve uses a 16% trimmed mean to mitigate the effects of extremes in its analysis of the Consumer Price Index.
Variation is the spread of data points about the
center of the distribution in a sample. Consider the
following measures of dispersion:
Dispersion
Measures of Variation

Statistic: Range
Formula: $x_{\max} - x_{\min}$
Excel: =MAX(Data)-MIN(Data)
Pro: Easy to calculate.
Con: Sensitive to extreme data values.

Statistic: Variance ($s^2$)
Formula: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
Excel: =VAR(Data)
Pro: Plays a key role in mathematical statistics.
Con: Non-intuitive meaning.

Statistic: Standard deviation (s)
Formula: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
Excel: =STDEV(Data)
Pro: Most common measure. Uses same units as the raw data ($, etc.).
Con: Non-intuitive meaning.

Statistic: Coefficient of variation (CV)
Formula: $CV = 100 \times \frac{s}{\bar{x}}$
Excel: None
Pro: Measures relative variation in percent so can compare data sets.
Con: Requires non-negative data.

Statistic: Mean absolute deviation (MAD)
Formula: $MAD = \frac{\sum_{i=1}^{n}|x_i - \bar{x}|}{n}$
Excel: =AVEDEV(Data)
Pro: Easy to understand.
Con: Lacks "nice" theoretical properties.
The difference between the largest and smallest observation.
Range = $x_{\max} - x_{\min}$
For example, for the n = 68 P/E ratios,
Range = 91 - 7 = 84
Dispersion
Range
The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean divided by the population size.
$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$
For the sample variance ($s^2$), we divide by n - 1 instead of n, otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$.
$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
Dispersion
Variance
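The only difference between the two formulas is the divisor. The minimal Python sketch below applies both to the five quiz scores used earlier (42, 60, 70, 75, 78); treating the same numbers first as a population and then as a sample is purely for illustration, and the function names are chosen here.

```python
def population_variance(values):
    """Sum of squared deviations around the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def sample_variance(values):
    """Sum of squared deviations around the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

scores = [42, 60, 70, 75, 78]  # mean = 65, squared deviations sum to 848

print(population_variance(scores))  # 169.6
print(sample_variance(scores))      # 212.0
```

Dividing by n - 1 always gives the larger value, which offsets the tendency of the sample to understate the population's spread.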
The square root of the variance.
Units of measure are the same as X.
Explains how individual values in a data set vary from the mean.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
Dispersion
Standard Deviation
Excel's built-in functions are:
Statistic            Excel population formula   Excel sample formula
Variance             =VARP(Array)               =VAR(Array)
Standard deviation   =STDEVP(Array)             =STDEV(Array)
Dispersion
Standard Deviation
Consider the following five quiz scores for
Stephanie.
Dispersion
Calculating a Standard Deviation
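Now, calculate the sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{2380}{5-1}} = \sqrt{595} = 24.39$
Somewhat easier, the two-sum formula can also be used:
$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}} = \sqrt{\frac{28300 - \frac{(360)^2}{5}}{5-1}} = \sqrt{\frac{28300 - 25920}{5-1}} = \sqrt{595} = 24.39$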
Dispersion
Calculating a Standard Deviation
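Both formulas can be compared in code. In the minimal Python sketch below, the function names are illustrative; because Stephanie's individual scores are not reproduced in the text, the agreement check uses the earlier quiz scores (42, 60, 70, 75, 78) as a stand-in, and the slide's result is recomputed only from its stated sums (n = 5, sum of x = 360, sum of x squared = 28300).

```python
import math

def std_definitional(values):
    """Sample standard deviation from squared deviations around the mean."""
    n = len(values)
    xbar = sum(values) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))

def std_two_sum(values):
    """Sample standard deviation from the sum of x and the sum of x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# Stand-in data (not Stephanie's scores): both formulas should agree.
scores = [42, 60, 70, 75, 78]
print(round(std_definitional(scores), 2))  # 14.56
print(round(std_two_sum(scores), 2))       # 14.56

# The slide's calculation, using only the sums it reports.
s_slide = math.sqrt((28300 - 360 ** 2 / 5) / (5 - 1))
print(round(s_slide, 2))                   # 24.39
```

The two-sum form needs only one pass over the data, but it subtracts two large numbers and can lose precision when the values are large relative to their spread.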
The standard deviation is nonnegative because
deviations around the mean are squared.
When every observation is exactly equal to the
mean, the standard deviation is zero.
Standard deviations can be large or small,
depending on the units of measure.
Compare standard deviations only for data sets
measured in the same units and only if the means
do not differ substantially.
Dispersion
Calculating a Standard Deviation
Useful for comparing variables measured in
different units or with different means.
A unit-free measure of dispersion
Expressed as a percent of the mean.
Only appropriate for nonnegative data. It is
undefined if the mean is zero or negative.
$CV = 100 \times \frac{s}{\bar{x}}$
Dispersion
Coefficient of Variation
For example:
Defect rates (n = 37): s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%
ATM deposits (n = 100): s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%
P/E ratios (n = 68): s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%
Dispersion
Coefficient of Variation
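The three comparisons can be reproduced with a one-line formula. This is a minimal Python sketch; the function name `coefficient_of_variation` is illustrative, and the inputs are the summary statistics quoted in the examples.

```python
def coefficient_of_variation(s, mean):
    """Standard deviation expressed as a percent of the mean."""
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * s / mean

examples = {
    "Defect rates": (22.89, 125.38),
    "ATM deposits": (280.80, 233.89),
    "P/E ratios": (14.08, 22.72),
}

for name, (s, mean) in examples.items():
    print(f"{name}: CV = {coefficient_of_variation(s, mean):.0f}%")
# Defect rates: CV = 18%
# ATM deposits: CV = 120%
# P/E ratios: CV = 62%
```

Although the three data sets use different units, the percentages are directly comparable: ATM deposits vary far more relative to their mean than defect rates do.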
The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution).
Uses absolute values of the deviations around the mean.
Excel's function is =AVEDEV(Array)
$MAD = \frac{\sum_{i=1}^{n}|x_i - \bar{x}|}{n}$
Dispersion
Mean Absolute Deviation
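A minimal Python sketch of the MAD calculation follows, applied to the quiz scores used earlier (42, 60, 70, 75, 78, with mean 65); the function name is illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the data points from their mean."""
    n = len(values)
    xbar = sum(values) / n
    return sum(abs(x - xbar) for x in values) / n

scores = [42, 60, 70, 75, 78]

# Absolute deviations are 23, 5, 5, 10, 13, which sum to 56.
print(mean_absolute_deviation(scores))  # 11.2
```

On average, a score lies 11.2 points from the mean of 65. Unlike the signed deviations, the absolute deviations do not cancel.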
Dispersion
Central Tendency vs. Dispersion: Manufacturing
Consider the histograms of hole diameters drilled in a steel plate during manufacturing.
The desired distribution is outlined in red.
Machine A: Desired mean (5 mm) but too much variation.
Machine B: Acceptable variation but mean is less than 5 mm.
Take frequent samples to monitor quality.
Dispersion
Central Tendency vs. Dispersion: Job Performance
Consider student ratings of four professors on eight teaching attributes (10-point scale).
Jones and Wu have identical means but different standard deviations.
Smith and Gopal have different means but identical standard deviations.
A high mean (better rating) and low standard deviation (more consistency) is preferred. Which professor do you think is best?
Applied Statistics in
Business and Economics
End of Part 1 of Chapter 4
