
Descriptive Statistics

The farthest most people ever get


What is data?
• Data is made up of variables
• A variable is something that can take different values between
individuals or in the same individual at different time points

– Gender can take the value “male” or “female”


– Age can take a minimum numeric value of zero, and a
maximum numeric value of many years
– Time to react to your name being called out is an example of
a variable that would vary if you measured it in the same
individual at several time points
• It is usual in Psychology to measure the value of a variable in
many separate individuals
What does statistics do to data?
– Different types of variables
• categorical, ordinal, continuous (interval and ratio)
– If you have measured the same variable in many
individuals you need a way of summarising the
data
– What’s the “average” value?
– How much variation is there in the data?
• Compare – ask if one group differs from
another on the value of a variable
• Relate – ask how one variable changes as a
function of another one
Variables are classified according to their level
of measurement

• Country of birth
– Example values are France, UK, Germany
– this is an unordered category because France is not
more or less than the UK
– We may assign numbers to category values for
convenience (e.g. 1 = UK, 2 = France), but you
cannot meaningfully add or subtract the numbers
– This severely restricts the type of statistics we can
use with categorical variables
Common Descriptive Statistics
• Count (frequencies)
• Percentage
• Mean
• Mode
• Median
• Range
• Standard deviation
• Variance
• Ranking
Descriptive Statistics
• Descriptive Statistics are Used by Researchers to Report
on Populations and Samples

• In Sociology:
Summary descriptions of measurements (variables) taken
about a group of people

• By Summarizing Information, Descriptive Statistics


Speed Up and Simplify Comprehension of a Group’s
Characteristics
Sample vs. Population

Population Sample
Descriptive Statistics
An Illustration:
Which Group is Smarter?
Class A: IQs of 13 Students    Class B: IQs of 13 Students
102  115                       127  162
128  109                       131  103
131   89                        96  111
 98  106                        80  109
140  119                        93   87
 93   97                       120  105
110                            109
Each individual may be different. If you try to understand a group by remembering the
qualities of each member, you become overwhelmed and fail to understand the group.
Descriptive Statistics
Which group is smarter now?

Class A--Average IQ Class B--Average IQ

110.54 110.23

They’re roughly the same!

With a summary descriptive statistic, it is much easier to


answer our question
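The two class averages can be checked with a few lines of Python. This is a minimal sketch: the assignment of scores to Class A and Class B assumes the first two columns of the IQ listing belong to Class A and the last two to Class B, which is the split that reproduces the averages quoted on the slide.

```python
from statistics import mean

# IQ scores transcribed from the listing above (column split is an assumption).
class_a = [102, 128, 131, 98, 140, 93, 110, 115, 109, 89, 106, 119, 97]
class_b = [127, 131, 96, 80, 93, 120, 109, 162, 103, 111, 109, 87, 105]

mean_a = mean(class_a)
mean_b = mean(class_b)
print(f"Class A: {mean_a:.2f}  Class B: {mean_b:.2f}")
```

One number per class replaces thirteen, which is the point of a summary statistic.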
Indicators of Central Tendency
• Mode
– Most Frequently Occurring Score

• Median
– Middle Score

• Mean
– Arithmetic Average, etc.
Indicators of Central Tendency
Mode = 15 k.y-1

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

Advantages
•Quick and easy to compute
•Unaffected by extreme scores
•Can be used at any level of measurement.
Indicators of Central Tendency
Mode = 15 k.y-1

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

Disadvantages
•Terminal Statistic
• A given sub-group could make
this measure unrepresentative.
Indicators of Central Tendency
Median

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

50th percentile position in the ranked scores: $\frac{n + 1}{2}$
Indicators of Central Tendency
Median = 19.5 k.y-1

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

For an even number of scores, take the mean of the middle two:
$\frac{19 + 20}{2} = 19.5$
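The middle-score rule can be sketched as a small function. The name `median_of_sorted` is a hypothetical helper for illustration, and it assumes the scores are already ranked.

```python
def median_of_sorted(scores):
    """Return the median of an already ranked list of numbers."""
    n = len(scores)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle score exists.
        return scores[mid]
    # Even count: average the two middle scores.
    return (scores[mid - 1] + scores[mid]) / 2

# Annual salaries in thousands, as listed on the slide.
salaries_k = [10, 11, 11, 15, 15, 15, 19, 20, 21, 21, 22, 22, 24, 25]
median_salary = median_of_sorted(salaries_k)
print(median_salary)
```

With 14 scores the position (n + 1) / 2 is 7.5, so the function averages the 7th and 8th scores.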
Indicators of Central Tendency
Median = 19.5 k.y-1

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

Advantages
•Unaffected by extreme scores
•Can be used at all levels above nominal.
Indicators of Central Tendency
Median = 19.5 k.y-1

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

Disadvantages
•Only considers order; the values themselves are ignored.
Indicators of Central Tendency
-Arithmetic Mean
-Harmonic Mean
-Geometric Mean
also:
-f-mean
-Truncated mean
-Power mean
-Weighted arithmetic mean
-Chisini mean
-Identric mean, etc.
Indicators of Central Tendency
Mean

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

$\bar{X} = \frac{\sum X}{n}$
Indicators of Central Tendency
Mean = 17.9 k.y-1

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

$\bar{X} = \frac{10+11+11+15+15+15+19+20+21+21+22+22+24+25}{14} = \frac{251}{14} \approx 17.9$
Indicators of Central Tendency
Mean = 17.9 k.y-1

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

Advantages
•Very sensitive measure
•Takes into account all the available information
•Can be combined with means of other groups to give the overall mean.
Indicators of Central Tendency
Mean = 17.9 k.y-1

Annual
Salary: 10k 11k 11k 15k 15k 15k 19k 20k 21k 21k 22k 22k 24k 25k

Disadvantages
•Very sensitive measure
•Can only be used on interval or ratio data
•Is only representative of the typical score when scores are roughly symmetrical above and below $\bar{X}$.
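The contrast between the three indicators is easiest to see by adding one extreme score. The sketch below uses the salary data from the slides; the extra salary of 250k is a hypothetical value added only for illustration.

```python
from statistics import mean, median, mode

# Annual salaries in thousands, as listed on the slides.
salaries_k = [10, 11, 11, 15, 15, 15, 19, 20, 21, 21, 22, 22, 24, 25]
# Hypothetical extreme score, not part of the original data.
with_outlier = salaries_k + [250]

def summarise(scores):
    """Return (mode, median, mean rounded to one decimal place)."""
    return mode(scores), median(scores), round(mean(scores), 1)

original_summary = summarise(salaries_k)
outlier_summary = summarise(with_outlier)
print("original:    ", original_summary)
print("with outlier:", outlier_summary)
```

The mode and median barely move, while the mean is pulled strongly towards the extreme score. That is why "very sensitive" appears in both the advantages and the disadvantages of the mean.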
Distribution

• Often displayed graphically, where:

– X axis = measured variable


– Y axis = frequency.
Normal Distribution
[Figure: frequency curve. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]


Para-Normal Distribution?
[Figure: frequency curve. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]
Paranormal events are phenomena described in non-scientific bodies of knowledge, whose existence within these contexts is described as lying beyond normal experience or scientific explanation.
Normal Distribution, AKA:
-Bell Shaped
-Gaussian
…but first described mathematically by Abraham De Moivre in 1733… (…published 1924!)
Carl Friedrich Gauss applied the ND in 1809 to establish the diameter of lunar features
[Figure: frequency curve. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]


Normal Distribution
Characteristics of ND Curve:

•Naturally Occurring
•Symmetrical
Normal Distribution
[Figure: normal curve with the Mode, Median and Mean marked at the centre. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]


Normal Distribution
Point of Inflection: A point of a curve at which a change in the direction of curvature occurs.
[Figure: normal curve with the points of inflection marked and 68.26% labelled on the central area. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]


Normal Distribution
Z = standard score, for comparison: Raw score versus Group
Therefore, Average = 3500, SD = 1000
[Figure: normal curve. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, 0 to 160. Areas marked on each side of the average, moving outwards: 34.13%, 13.59%, 2.15%.]


Normal Distribution
Average = 3500, SD = 1000
So, if Raw score = 4500, then Z = +1
Study of SD size = ‘Kurtosis’
[Figure: normal curve. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, 0 to 160. Areas marked on each side of the average, moving outwards: 34.13%, 13.59%, 2.15%.]


Normal Distribution
Average = 3500, SD = 500
So, if Raw score = 4500, then Z = +2
[Figure: narrower normal curve with 68.26% labelled on the central area. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]


Normal Distribution
Average = 3500, SD = 2000
So, if Raw score = 4500, then Z = +0.5
[Figure: wider normal curve with 68.26% labelled on the central area. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]
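The three standard scores above follow from one formula, Z = (raw score - average) / SD. The sketch below recomputes them; `z_score` is a hypothetical helper name. It also checks the share of a normal distribution lying within one SD of the mean, using the standard identity erf(1/sqrt(2)).

```python
from math import erf, sqrt

def z_score(raw, average, sd):
    """Express a raw score as a number of standard deviations from the average."""
    return (raw - average) / sd

# Same raw score and average, three different standard deviations.
z_values = {sd: z_score(4500, 3500, sd) for sd in (1000, 500, 2000)}
print(z_values)

# Proportion of a normal distribution within one SD either side of the mean.
within_one_sd = erf(1 / sqrt(2))
print(f"{within_one_sd:.4f}")
```

The computed proportion rounds to 68.27%. The 68.26% on the slides is what results from doubling the rounded 34.13%.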


Non-Normal Distribution: Negative Skew
[Figure: skewed frequency curve with the Mode, Median and Mean marked separately. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]


Non-Normal Distribution: Positive Skew
[Figure: skewed frequency curve with the Mode, Median and Mean marked separately. X axis: Energy Intake (calories per day), 1500 to 5500. Y axis: Number of People, up to 160.]


Coefficient of Variation (CV)
• Another Measure of Dispersion

• Histograms

• Skewness

• Kurtosis

• Other Descriptive Summary Measures


Measures of Dispersion – Coefficient of
Variation
• Coefficient of variation (CV) measures the spread of a set
of data as a proportion of its mean.
• It is the ratio of the sample standard deviation to the
sample mean
$CV = \frac{s}{\bar{x}} \times 100\%$
• It is sometimes expressed as a percentage
• There is an equivalent definition for the coefficient of
variation of a population
                               Chapel Hill (A)    Bend (B)
Mean                           1198.10            298.07
Standard Deviation             191.80             82.08
Coefficient of Variation (CV)  0.16 (16%)         0.28 (28%)
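The two coefficients of variation can be recomputed from the means and standard deviations shown. A minimal sketch, with `coefficient_of_variation` as a hypothetical helper name:

```python
def coefficient_of_variation(sd, mean):
    """Ratio of the standard deviation to the mean (dimensionless)."""
    return sd / mean

# Summary values for the two sites as given in the table.
cv_chapel_hill = coefficient_of_variation(191.80, 1198.10)
cv_bend = coefficient_of_variation(82.08, 298.07)
print(f"Chapel Hill: {cv_chapel_hill:.2f}  Bend: {cv_bend:.2f}")
```

Chapel Hill has the larger standard deviation in absolute terms, but Bend is more variable relative to its mean.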
Coefficient of Variation (CV)

• It is a dimensionless number that can be used to
compare the amount of variance between populations
with different means

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
$CV = \frac{s}{\bar{x}} \times 100\%$
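Starting from raw observations, the chain from sample variance to CV can be sketched as follows. The function names and the small data set are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
from math import sqrt

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def cv_percent(xs):
    """Sample standard deviation as a percentage of the sample mean."""
    x_bar = sum(xs) / len(xs)
    return sqrt(sample_variance(xs)) / x_bar * 100

# Hypothetical data: mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)
cv = cv_percent(data)
print(f"s^2 = {variance:.3f}, CV = {cv:.1f}%")
```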
Measures of Skewness and Kurtosis
• A fundamental task in many statistical analyses is to
characterize the location and variability of a data set
(Measures of central tendency vs. measures of
dispersion)
• Both measures tell us nothing about the shape of the
distribution
• A further characterization of the data includes
skewness and kurtosis
• The histogram is an effective graphical technique for
showing both the skewness and kurtosis of a data set
Histograms

Fig. 3. Histogram of crown width (m) measured for a random


sample (n = 63; mean = 9.3 m; SD = 4.64 m).
Frequency & Distribution

• A histogram is one way to depict a frequency


distribution
• Frequency is the number of times a variable takes on a
particular value
• Note that any variable has a frequency distribution
• e.g. roll a pair of dice several times and record the
resulting values (constrained to being between 2 and
12), counting the number of times any given value
occurs (the frequency of that value occurring), and take
these all together to form a frequency distribution
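The dice example can be simulated directly. This is a sketch only: the number of rolls and the fixed seed are arbitrary assumptions so that the run is repeatable.

```python
import random
from collections import Counter

def roll_two_dice(n_rolls, seed=0):
    """Simulate n_rolls throws of a pair of dice and count each total."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    totals = (rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))
    return Counter(totals)

freq = roll_two_dice(1000)
# The frequency distribution: each possible total with its count.
for total in range(2, 13):
    print(total, freq[total])
```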
Frequency & Distribution

• Frequencies can be absolute (when the frequency
provided is the actual count of the occurrences) or
relative (when they are normalized by dividing the
absolute frequency by the total number of observations,
giving values in [0, 1])
• Relative frequencies are particularly useful if you want
to compare distributions drawn from two different
sources (i.e. when the numbers of observations from each
source may be different)
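A short sketch of why relative frequencies make sources of different size comparable. The two samples below are hypothetical.

```python
def relative_frequencies(counts):
    """Divide each absolute frequency by the total number of observations."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical samples of very different size but the same shape.
sample_small = {"low": 3, "mid": 5, "high": 2}
sample_large = {"low": 30, "mid": 50, "high": 20}

rel_small = relative_frequencies(sample_small)
rel_large = relative_frequencies(sample_large)
print(rel_small)
print(rel_large)
```

The absolute counts differ by a factor of ten, yet the relative frequencies are identical and each set sums to 1.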
Histograms
• We may summarize our data by constructing
histograms, which are vertical bar graphs
• A histogram is used to graphically summarize the
distribution of a data set
• A histogram divides the range of values in a data set
into intervals
• Over each interval is placed a bar whose height
represents the frequency of data values in the interval.
Building a Histogram

• To construct a histogram, the data are first grouped


into categories
• The histogram contains one vertical bar for each
category
• The height of the bar represents the number of
observations in the category (i.e., frequency)
• It is common to note the midpoint of the category on
the horizontal axis
Building a Histogram – Example
• 1. Develop an ungrouped frequency table
– That is, we build a table that counts the number of
occurrences of each variable value from lowest to highest:
TMI Value Ungrouped Freq.
4.16 2
4.17 4
4.18 0
… …
13.71 1
• We could attempt to construct a bar chart from this table, but it
would have too many bars to really be useful
Building a Histogram – Example

• 2. Construct a grouped frequency table


– Select an appropriate number of classes

Class Frequency Percentage


4.00 - 4.99 120
5.00 - 5.99 807
6.00 - 6.99 1411
7.00 - 7.99 407
8.00 - 8.99 87
9.00 - 9.99 33
10.00 - 10.99 17
11.00 - 11.99 22
12.00 - 12.99 43
13.00 - 13.99 19
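The Percentage column can be derived from the class frequencies: each frequency is divided by the total count and multiplied by 100. A minimal sketch using the frequencies listed above:

```python
# Class frequencies transcribed from the grouped frequency table.
class_counts = {
    "4.00 - 4.99": 120,
    "5.00 - 5.99": 807,
    "6.00 - 6.99": 1411,
    "7.00 - 7.99": 407,
    "8.00 - 8.99": 87,
    "9.00 - 9.99": 33,
    "10.00 - 10.99": 17,
    "11.00 - 11.99": 22,
    "12.00 - 12.99": 43,
    "13.00 - 13.99": 19,
}

total = sum(class_counts.values())
percentages = {label: 100 * count / total for label, count in class_counts.items()}

for label, pct in percentages.items():
    print(f"{label:>13}  {class_counts[label]:>5}  {pct:5.1f}%")
```

The largest class (6.00 - 6.99) holds just under half of all values.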
Building a Histogram – Example

• 3. Plot the frequencies of each class


– All that remains is to create the bar graph

Pond Branch TMI Histogram
[Figure: histogram. X axis: Topographic Moisture Index (a proxy for soil moisture), 4 to 16. Y axis: Percent of cells in catchment, 0 to 48.]


Further Moments of the Distribution

• While measures of dispersion are useful for helping


us describe the width of the distribution, they tell us
nothing about the shape of the distribution

Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA:
Macmillan College Publishing Co., p. 91.
Further Moments of the Distribution

• There are further statistics that describe the shape of


the distribution, using formulae that are similar to
those of the mean and variance
• 1st moment - Mean (describes central value)
• 2nd moment - Variance (describes dispersion)
• 3rd moment - Skewness (describes asymmetry)
• 4th moment - Kurtosis (describes peakedness)
Further Moments – Skewness

• Skewness measures the degree of asymmetry exhibited


by the data
$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$
• If skewness equals zero, the histogram is symmetric
about the mean
• Positive skewness vs negative skewness
Further Moments – Skewness

Source: http://library.thinkquest.org/10030/3smodsas.htm
Further Moments – Skewness

• Positive skewness
– There are more observations below the mean than
above it
– When the mean is greater than the median

• Negative skewness
– There are a small number of low observations and
a large number of high ones
– When the median is greater than the mean
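The skewness statistic can be sketched in a few lines. This version assumes s is the sample standard deviation (n - 1 in the denominator), as in the CV slides; the data sets are hypothetical.

```python
from math import sqrt

def skewness(xs):
    """Sum of cubed deviations from the mean, divided by n * s**3."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return sum((x - x_bar) ** 3 for x in xs) / (n * s ** 3)

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 10]      # a few high values: positive skew
left_tail = [-x for x in right_tail]  # mirror image: negative skew

for name, data in [("symmetric", symmetric),
                   ("right tail", right_tail),
                   ("left tail", left_tail)]:
    print(f"{name}: {skewness(data):+.2f}")
```

In `right_tail` the mean (about 3.2) exceeds the median (2), which matches the rule of thumb for positive skewness.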
Further Moments – Kurtosis

• Kurtosis measures how peaked the histogram is


$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$
• The kurtosis of a normal distribution is 0
• Kurtosis characterizes the relative peakedness or
flatness of a distribution compared to the normal
distribution
Further Moments – Kurtosis
• Platykurtic – When the kurtosis < 0, the frequencies
throughout the curve are closer to being equal (i.e., the
curve is flatter and wider)
• Thus, negative kurtosis indicates a relatively flat
distribution
• Leptokurtic – When the kurtosis > 0, there are high
frequencies in only a small part of the curve (i.e., the
curve is more peaked)
• Thus, positive kurtosis indicates a relatively peaked
distribution
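A sketch of the kurtosis statistic with 3 subtracted, so that a normal distribution scores 0. It assumes s is the sample standard deviation, and the two data sets are hypothetical.

```python
from math import sqrt

def kurtosis(xs):
    """Sum of fourth-power deviations divided by n * s**4, minus 3."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return sum((x - x_bar) ** 4 for x in xs) / (n * s ** 4) - 3

flat = list(range(1, 11))     # every value equally frequent
peaked = [5] * 8 + [0, 10]    # most values at the centre, two in the tails

flat_kurtosis = kurtosis(flat)
peaked_kurtosis = kurtosis(peaked)
print(f"flat: {flat_kurtosis:+.2f}  peaked: {peaked_kurtosis:+.2f}")
```

The flat data set gives a negative value (platykurtic) and the peaked one a positive value (leptokurtic).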
Further Moments – Kurtosis

platykurtic leptokurtic

Source: http://www.riskglossary.com/link/kurtosis.htm

• Kurtosis is based on the size of a distribution's


tails.
• Negative kurtosis (platykurtic) – distributions with
short tails
• Positive kurtosis (leptokurtic) – distributions with
relatively long tails
Why Do We Need Kurtosis?

• These two distributions have the same variance,


approximately the same skew, but differ markedly in
kurtosis.
Source: http://davidmlane.com/hyperstat/A53638.html
How to Graphically Summarize Data?

• Histograms

• Box plots
Functions of a Histogram

• The function of a histogram is to graphically


summarize the distribution of a data set
• The histogram graphically shows the following:
1. Center (i.e., the location) of the data
2. Spread (i.e., the scale) of the data
3. Skewness of the data
4. Kurtosis of the data
5. Presence of outliers
6. Presence of multiple modes in the data.
Functions of a Histogram

• The histogram can be used to answer the following


questions:
1. What kind of population distribution do the data
come from?
2. Where are the data located?
3. How spread out are the data?
4. Are the data symmetric or skewed?
5. Are there outliers in the data?
Source: http://www.robertluttman.com/vms/Week5/page9.htm (First three)
http://office.geog.uvic.ca/geog226/frLab1.html (Last)
Box Plots
• We can also use a box plot to graphically summarize
a data set
• A box plot represents a graphical summary of what is
sometimes called a “five-number summary” of the
distribution
– Minimum
– Maximum
– 25th percentile
– 75th percentile
– Median
• Interquartile Range (IQR)
[Figure: box plot labelled, from bottom to top: min., 25th %-ile, median, 75th %-ile, max.]
Rogerson, p. 8.
Box Plots
• Example – Consider the first 9 Commodore prices (in
$000s)
6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0
• Arrange these in order of magnitude
3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0
• The median is Q2 = 6.7 (there are 4 values on either
side)
• Q1 = 5.9 (median of the 4 smallest values)
• Q3 = 10.2 (median of the 4 largest values)
• IQR = Q3 – Q1 = 10.2 - 5.9 = 4.3
• Example (ranked)
3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0
• The median is Q2 = 6.7
• Q1 = 5.9 Q3 = 10.2 IQR = Q3 – Q1 = 10.2 - 5.9 = 4.3
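The quartiles in this example come from the "median of each half" method: the median splits the ranked data, and Q1 and Q3 are the medians of the lower and upper halves. A minimal sketch of that method (other quartile conventions exist and can give slightly different values):

```python
from statistics import median

# First 9 Commodore prices, in thousands of dollars.
prices = [6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0]

ranked = sorted(prices)
half = len(ranked) // 2       # size of each half; the middle value is excluded when n is odd

q2 = median(ranked)
q1 = median(ranked[:half])    # median of the 4 smallest values
q3 = median(ranked[-half:])   # median of the 4 largest values
iqr = q3 - q1
print(q1, q2, q3, round(iqr, 1))
```

The unrounded quartiles are 5.895 and 10.2375, which the slide reports as 5.9 and 10.2.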
Box Plots

Example: Table 1.1 Commuting data (Rogerson, p5)

Ranked commuting times:

5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22,
23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47

25th percentile is represented by observation (30+1)/4=7.75


75th percentile is represented by observation 3(30+1)/4=23.25
25th percentile: 11.75
75th percentile: 26
Interquartile range: 26 – 11.75 = 14.25
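The percentile positions used here follow the rule position = p(n + 1), with linear interpolation when the position falls between two observations. A sketch of that rule, with `percentile_by_position` as a hypothetical helper name:

```python
def percentile_by_position(ranked, p):
    """Value at 1-based position p * (n + 1), interpolating between neighbours."""
    position = p * (len(ranked) + 1)
    lower = int(position)           # 1-based rank of the observation below
    fraction = position - lower
    if lower < 1:
        return ranked[0]
    if lower >= len(ranked):
        return ranked[-1]
    below, above = ranked[lower - 1], ranked[lower]
    return below + fraction * (above - below)

# Ranked commuting times from the example.
times = [5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21,
         22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47]

p25 = percentile_by_position(times, 0.25)
p75 = percentile_by_position(times, 0.75)
print(p25, p75, p75 - p25)
```

Position 7.75 lies three quarters of the way from the 7th observation (11) to the 8th (12), giving 11.75.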
Other Descriptive Summary Measures

• Descriptive statistics provide an organization and


summary of a dataset
• A small number of summary measures replaces the
entirety of a dataset
• We’ll briefly talk about other simple descriptive
summary measures
Other Descriptive Summary Measures

• You're likely already familiar with some simple


descriptive summary measures
– Ratios
– Proportions
– Percentages
– Rates of Change
– Location Quotients
Other Descriptive Summary Measures
• Ratios – $\text{ratio} = \frac{\text{\# of observations in A}}{\text{\# of observations in B}}$
e.g., A - 6 overcast days, B - 24 mostly cloudy days
• Proportions – Relates one part or category of data to
the entire set of observations, e.g., a box of marbles
that contains 4 yellow, 6 red, 5 blue, and 2 green gives
a yellow proportion of 4/17, or, with
colorcount = {yellow, red, blue, green} and acount = {4, 6, 5, 2}:
$\text{proportion} = \frac{a_i}{\sum a_i}$
Other Descriptive Summary Measures
• Proportions - Sum of all proportions = 1. These are
useful for comparing two sets of data w/different sizes
and category counts, e.g., a different box of marbles
gives a yellow proportion of 2/23, and in order for this
to be a reasonable comparison we need to know the
totals for both samples
• Percentages - Calculated by proportions x 100, e.g.,
2/23 x 100% = 8.696%; use of these should be
restricted to larger sample sizes, perhaps 20+
observations
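The ratio, proportion and percentage examples can be reproduced directly from the counts given. A minimal sketch:

```python
# Ratio of overcast days (A) to mostly cloudy days (B).
ratio = 6 / 24

# Marble counts from the first box.
marble_counts = {"yellow": 4, "red": 6, "blue": 5, "green": 2}
total = sum(marble_counts.values())
proportions = {colour: n / total for colour, n in marble_counts.items()}

# Yellow share in the second box (2 of 23 marbles), as a percentage.
yellow_pct_second_box = 2 / 23 * 100

print(ratio)
print(proportions)
print(f"{yellow_pct_second_box:.3f}%")
```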
Other Descriptive Summary Measures

• Location Quotients - An index of relative concentration


in space, a comparison of a region's share of something
to the total
• Example – Suppose we have a region of 1000 km²
which we subdivide into three smaller areas of 200, 300,
and 500 km² (labeled A, B, & C)
• The region has an influenza outbreak with 150 cases in
A, 100 in B, and 350 in C (a total of 600 flu cases):

     Proportion of Area    Proportion of Cases    Location Quotient
A    200/1000 = 0.2        150/600 = 0.25         0.25/0.2 = 1.25
B    300/1000 = 0.3        100/600 = 0.17         0.17/0.3 = 0.57
C    500/1000 = 0.5        350/600 = 0.58         0.58/0.5 = 1.17
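The location quotient is the area's share of cases divided by its share of land. The sketch below recomputes the table without intermediate rounding; `location_quotients` is a hypothetical helper name.

```python
def location_quotients(cases, areas):
    """Share of cases divided by share of area, for each sub-region."""
    total_cases = sum(cases.values())
    total_area = sum(areas.values())
    return {
        name: (cases[name] / total_cases) / (areas[name] / total_area)
        for name in cases
    }

areas_km2 = {"A": 200, "B": 300, "C": 500}
flu_cases = {"A": 150, "B": 100, "C": 350}

lq = location_quotients(flu_cases, areas_km2)
for name, value in lq.items():
    print(f"{name}: {value:.2f}")
```

Values above 1 (A and C) indicate more cases than the area's size alone would suggest. Without intermediate rounding B comes out at 0.56; the table's 0.57 results from dividing the already rounded proportion 0.17 by 0.3.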
