Sample mean: x̄ = (Σ xᵢ) / n (sum over i = 1, …, n)
Population mean: μ = (Σ xᵢ) / N (sum over i = 1, …, N)
Measures of Central Tendency – Mean
• Example I
– Data: 8, 4, 2, 6, 10
– x̄ = (8 + 4 + 2 + 6 + 10)/5 = 30/5 = 6
• Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
– x̄ = 143.8/10 = 14.38 inches
[Figure: long-term means for Chapel Hill, NC (1972-2001): precipitation 1198.10 mm, temperature 58.51 °F]
Measures of Central Tendency – Mean
• Advantage
– Sensitive to any change in the value of any observation
• Disadvantage
– Very sensitive to outliers
Measures of Central Tendency – Weighted Mean
• Weighted mean: x̄_w = (Σ wᵢ xᵢ) / (Σ wᵢ)

City   Avg. Income   Population
A      $23,000       100,000
B      $20,000       50,000
C      $25,000       150,000

Here, population is the weighting factor
and the average income is the variable
of interest
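The weighted mean for the table above can be verified with a short Python sketch (variable names are illustrative):

```python
# Weighted mean: each city's average income weighted by its population
incomes = [23_000, 20_000, 25_000]        # cities A, B, C
populations = [100_000, 50_000, 150_000]  # weights w_i

weighted_mean = (sum(w * x for w, x in zip(populations, incomes))
                 / sum(populations))
print(weighted_mean)  # 23500.0
```

Note that the unweighted mean of the three incomes would be about $22,667; weighting by population pulls the answer toward the largest city, C.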
Measures of Central Tendency – Median
• Median – the value of a variable such that half of
the observations lie above it and half below, i.e. this
value divides the distribution into two groups of
equal size
2, 4, 6, 8, 10 median: 6
• Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5 (mean: 14.38)
– Sorted: 7.8, 9.8, 10.1, 10.2, 13.9, 14.5, 15.5, 17.5, 20.0, 24.5
– Median = (13.9 + 14.5)/2 = 14.2
• Example III
– Tree heights (m), sorted:

# Tree   Height (m)
1        4.5
2        4.8
3        5.0
4        5.3
5        6.0
6        7.1
7        7.5
8        7.5
9        8.0
10       25.4

– Mean = 8.11 m; median = (6.0 + 7.1)/2 = 6.55 m; mode = 7.5 m
– The outlier (25.4 m) pulls the mean well above the median
Source: http://www.forestlearn.org/forests/refor.htm
• Example data (n = 25), sorted:
24, 25, 30, 39, 40, 45, 45, 45, 45, 45, 48, 50, 50,
55, 58, 60, 61, 65, 65, 65, 70, 72, 75, 200, 205
Source: http://www.shodor.org/interactivate/discussions/sd1.html
Which one is better: mean, median, or mode?
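A quick way to see why the choice matters is to compute all three for the data above with Python's standard statistics module; the two extreme values (200 and 205) inflate the mean but leave the median and mode untouched:

```python
import statistics

data = [24, 25, 30, 39, 40, 45, 45, 45, 45, 45, 48, 50, 50,
        55, 58, 60, 61, 65, 65, 65, 70, 72, 75, 200, 205]

mean = statistics.mean(data)      # pulled upward by the two outliers
median = statistics.median(data)  # middle (13th) value, outlier-resistant
mode = statistics.mode(data)      # most frequent value
print(mean, median, mode)  # 63.28 50 45
```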
Measures of Dispersion
• Range
– Percentile range
– Quartile deviation
– Mean deviation
– Variance and standard deviation
R(x₁, x₂, …, xₙ) = max(x₁, x₂, …, xₙ) − min(x₁, x₂, …, xₙ)
Percentile Range
P₉₀₋₁₀ = P₉₀ − P₁₀ (the spread between the 10th and 90th percentiles)
Quartile Deviation
Inter-quartile range: IQR = Q₃ − Q₁
• The inter-quartile range is frequently reduced to the measure
of semi-interquartile range, known as the quartile deviation
(QD), by dividing it by 2. Thus
QD = (Q₃ − Q₁) / 2
Mean Deviation
MD = Σ fᵢ |xᵢ − x̄| / n (sum over classes i = 1, …, k)
• k = number of classes
• xᵢ = midpoint of the i-th class
• fᵢ = frequency of the i-th class
Standard Deviation
Population: σ = √( Σ (xᵢ − μ)² / N )
Sample: s = √( Σ (xᵢ − x̄)² / (n − 1) )
SD = √variance
Standard Deviation for Group Data
• SD is:
s = √( Σ fᵢ (xᵢ − x̄)² / (n − 1) )
where x̄ = (Σ fᵢ xᵢ) / (Σ fᵢ) and N = Σ fᵢ
• Simplified formula
s = √( ( Σ fᵢ xᵢ² − (Σ fᵢ xᵢ)² / N ) / (n − 1) )
Example-1: Find Standard Deviation of
Ungrouped Data

Family No.   1  2  3  4  5  6  7  8  9  10
Size (xᵢ)    3  3  4  4  5  5  6  6  7  7

Here, x̄ = Σ xᵢ / n = 50/10 = 5

xᵢ           3  3  4  4  5  5  6  6  7  7    Σ = 50
xᵢ − x̄      −2 −2 −1 −1  0  0  1  1  2  2    Σ = 0
(xᵢ − x̄)²    4  4  1  1  0  0  1  1  4  4    Σ = 20
xᵢ²          9  9 16 16 25 25 36 36 49 49    Σ = 270

s² = Σ (xᵢ − x̄)² / (n − 1) = 20/9 ≈ 2.22,  s = √(20/9) ≈ 1.49
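The same result comes out of the standard statistics module, which uses the n − 1 (sample) denominator:

```python
import statistics

sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]  # family sizes from Example-1

var = statistics.variance(sizes)  # sample variance, sum((x - mean)^2) / (n - 1)
sd = statistics.stdev(sizes)      # its square root
```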
Example-2: Find Standard Deviation of
Grouped Data

xᵢ      fᵢ    fᵢxᵢ   fᵢxᵢ²   xᵢ − x̄   (xᵢ − x̄)²   fᵢ(xᵢ − x̄)²
3       2     6      18      −3        9            18
5       3     15     75      −1        1            3
7       2     14     98       1        1            2
8       2     16     128      2        4            8
9       1     9      81       3        9            9
Total   10    60     400      –        –            40

x̄ = Σ fᵢxᵢ / Σ fᵢ = 60/10 = 6
s² = Σ fᵢ (xᵢ − x̄)² / (n − 1) = 40/9 ≈ 4.44,  s ≈ 2.11
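Expanding the grouped frequencies back into raw observations reproduces the hand computation (a sketch using the standard statistics module):

```python
import statistics

values = [3, 5, 7, 8, 9]  # class values x_i
freqs = [2, 3, 2, 2, 1]   # frequencies f_i

# Repeat each value f_i times to recover the raw data
data = [x for x, f in zip(values, freqs) for _ in range(f)]

var = statistics.variance(data)  # sample variance with n - 1
sd = statistics.stdev(data)
```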
Relative Measures of Dispersion
• Coefficient of variation
• Coefficient of mean deviation
• Coefficient of range
• Coefficient of quartile deviation
Coefficient of Variation
• A coefficient of variation is computed as a
ratio of the standard deviation of the
distribution to the mean of the same
distribution.
CV = sₓ / x̄
Example-3: Comments on Children in a
community

        Height    Weight
Mean    40 inch   10 kg
SD      5 inch    2 kg
CV      0.125     0.20

• Since the CV for weight (0.20) exceeds the CV for height (0.125), weight is relatively more variable than height in this community
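Because CV is a unit-free ratio, it lets us compare variability across quantities measured in different units, as a two-line check shows:

```python
# Example-3 values: height in inches, weight in kg
mean_height, sd_height = 40, 5
mean_weight, sd_weight = 10, 2

cv_height = sd_height / mean_height  # 0.125
cv_weight = sd_weight / mean_weight  # 0.20 -> weight is relatively more variable
```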
Coefficient of Mean Deviation
Coefficient of mean deviation = (Mean deviation / Mean) × 100
Coefficient of Range
Coefficient of range = ((L − S) / (L + S)) × 100
where L = largest value and S = smallest value
Coefficient of Quartile Deviation
Coefficient of quartile deviation = ((Q₃ − Q₁) / (Q₃ + Q₁)) × 100
Measures of Asymmetry
Measures of Skewness and Kurtosis
• A fundamental task in many statistical analyses is to
characterize the location and variability of a data set
(Measures of central tendency vs. measures of
dispersion)
• Neither measure, however, tells us anything about the
shape of the distribution
• A further characterization of the data includes
skewness and kurtosis
• The histogram is an effective graphical technique for
showing both the skewness and kurtosis of a data set
Histograms
Fig. 3. Histogram of crown width (m) measured in situ for a random sample of
Quercus robur trees in Frame Wood (n = 63; mean = 9.3 m; SD = 4.64 m).
Source: Koukoulas & Blackburn, 2005. Journal of Vegetation Science: Vol. 16, No. 5, pp. 587–596
Frequency & Distribution
• A histogram is one way to depict a frequency
distribution
• Frequency is the number of times a variable takes on a
particular value
• Note that any variable has a frequency distribution
• e.g. roll a pair of dice several times and record the
resulting values (constrained to being between 2 and
12), count the number of times any given value occurs
(the frequency of that value), and take these all
together to form a frequency distribution
Frequency & Distribution
• Frequencies can be absolute (when the frequency
provided is the actual count of the occurrences) or
relative (when they are normalized by dividing the
absolute frequency by the total number of observations
[0, 1])
• Relative frequencies are particularly useful if you want
to compare distributions drawn from two different
sources (i.e. when the numbers of observations from
each source may differ)
Histograms
• We may summarize our data by constructing
histograms, which are vertical bar graphs
• A histogram is used to graphically summarize the
distribution of a data set
• A histogram divides the range of values in a data set
into intervals
• Over each interval is placed a bar whose height
represents the frequency of data values in the interval.
Building a Histogram
• To construct a histogram, the data are first grouped
into categories
• The histogram contains one vertical bar for each
category
• The height of the bar represents the number of
observations in the category (i.e., frequency)
• It is common to note the midpoint of the category on
the horizontal axis
Building a Histogram – Example
• 1. Develop an ungrouped frequency table
– That is, we build a table that counts the number of
occurrences of each variable value from lowest to highest:
TMI Value Ungrouped Freq.
4.16 2
4.17 4
4.18 0
… …
13.71 1
• We could attempt to construct a bar chart from this table, but
it would have too many bars to really be useful
Building a Histogram – Example
• 2. Construct a grouped frequency table
– Select an appropriate number of classes
[Figure: grouped-frequency histogram of the percent of cells in a catchment (a proxy for soil moisture), with classes spanning values 4 to 16]
Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA:
Macmillan College Publishing Co., p. 91.
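The grouping step can be sketched in plain Python (the data values here are hypothetical, for illustration only):

```python
# Step 2: group values into equal-width classes and count per class
data = [4.16, 4.17, 4.9, 5.3, 6.1, 6.8, 7.3, 8.1, 8.9, 9.4,
        10.0, 11.2, 12.5, 13.1, 13.71]  # hypothetical TMI values

num_classes = 5
lo, hi = min(data), max(data)
width = (hi - lo) / num_classes

counts = [0] * num_classes
for v in data:
    # Index of the class containing v; the maximum falls in the last class
    k = min(int((v - lo) / width), num_classes - 1)
    counts[k] += 1

# Class midpoints, commonly noted on the horizontal axis
midpoints = [lo + width * (k + 0.5) for k in range(num_classes)]
```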
Further Moments of the Distribution
skewness = Σ (xᵢ − x̄)³ / (n s³) (sum over i = 1, …, n)
• If skewness equals zero, the histogram is symmetric
about the mean
• Positive skewness vs negative skewness
Further Moments – Skewness
Source: http://library.thinkquest.org/10030/3smodsas.htm
Further Moments – Skewness
• Positive skewness
– There are more observations below the mean
than above it
– When the mean is greater than the median
• Negative skewness
– There are a small number of low observations and
a large number of high ones
– When the median is greater than the mean
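The skewness formula can be implemented directly; here s is taken as the population standard deviation, one of several conventions in use:

```python
import math

def skewness(xs):
    # Third standardized moment: sum((x - mean)^3) / (n * s^3)
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)  # population SD
    return sum((x - mean) ** 3 for x in xs) / (n * s ** 3)
```

A symmetric sample such as [2, 4, 6, 8, 10] scores 0, while a sample with one large value, such as [1, 1, 2, 2, 10], scores positive (right-skewed).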
Further Moments – Kurtosis
• Kurtosis measures how peaked the histogram is
kurtosis = Σ (xᵢ − x̄)⁴ / (n s⁴) − 3 (sum over i = 1, …, n)
• The kurtosis of a normal distribution is 0
• Kurtosis characterizes the relative peakedness or
flatness of a distribution compared to the normal
distribution
Further Moments – Kurtosis
• Platykurtic – when the kurtosis < 0, the frequencies
throughout the curve are closer to being equal (i.e., the
curve is flatter and wider)
• Thus, negative kurtosis indicates a relatively flat
distribution
• Leptokurtic – when the kurtosis > 0, there are high
frequencies in only a small part of the curve (i.e., the
curve is more peaked)
• Thus, positive kurtosis indicates a relatively peaked
distribution
Further Moments – Kurtosis
platykurtic leptokurtic
Source: http://www.riskglossary.com/link/kurtosis.htm
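The excess-kurtosis formula can be coded the same way (again using the population standard deviation; conventions vary):

```python
import math

def excess_kurtosis(xs):
    # Fourth standardized moment minus 3, so a normal distribution scores 0
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)  # population SD
    return sum((x - mean) ** 4 for x in xs) / (n * s ** 4) - 3
```

The flat sample [1, 2, 3, 4, 5] scores −1.3, i.e. platykurtic.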
• Histograms
• Box plots
Functions of a Histogram
• The function of a histogram is to graphically
summarize the distribution of a data set
• The histogram graphically shows the following:
1. Center (i.e., the location) of the data
2. Spread (i.e., the scale) of the data
3. Skewness of the data
4. Kurtosis of the data
5. Presence of outliers
6. Presence of multiple modes in the data.
Functions of a Histogram
5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22,
23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47
25th percentile: 11.75 75th percentile: 26
Interquartile range: 26 – 11.75 = 14.25
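The percentiles quoted above follow the (n + 1)-based rule, which is the 'exclusive' method of Python's statistics.quantiles; other conventions give slightly different quartiles:

```python
import statistics

data = [5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21,
        22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47]

# Cut points dividing the sorted data into four groups
q1, q2, q3 = statistics.quantiles(data, n=4, method='exclusive')
iqr = q3 - q1  # 26 - 11.75 = 14.25
```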
Measures of relationships
The relationship between x and y
• Correlation: is there a relationship between 2
variables?
• Regression: how well does a certain independent
variable predict the dependent variable?
• CORRELATION ≠ CAUSATION
– In order to infer causality: manipulate independent
variable and observe effect on dependent variable
Scattergrams
[Figure: six example scattergrams of Y against X, showing relationships of varying form and strength]
Covariance:
• Gives information on the degree to
which two variables vary together.
• Note how similar the covariance is
to variance: the equation simply
multiplies x's error scores by y's error
scores, as opposed to squaring x's
error scores.

Variance: sₓ² = Σ (xᵢ − x̄)² / (n − 1)
Covariance: cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Covariance
cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1) (sum over i = 1, …, n)
When X increases as Y increases: cov(x, y) is positive
When X increases as Y decreases: cov(x, y) is negative
When there is no consistent relationship: cov(x, y) ≈ 0
Example Covariance

x   y   xᵢ − x̄   yᵢ − ȳ   (xᵢ − x̄)(yᵢ − ȳ)
0   3   −3        0         0
2   2   −1       −1         1
3   4    0        1         0
4   0    1       −3        −3
6   6    3        3         9
x̄ = 3   ȳ = 3               Σ = 7

cov(x, y) = 7 / (5 − 1) = 1.75

What does this number tell us?
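The table can be reproduced in a few lines of Python:

```python
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

n = len(x)
mean_x = sum(x) / n  # 3.0
mean_y = sum(y) / n  # 3.0

# Sample covariance: sum of cross-products of deviations over n - 1
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
print(cov_xy)  # 1.75
```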
Problem with Covariance:
• The value obtained by covariance is dependent on the size of
the data’s standard deviations: if large, the value will be
greater than if small… even if the relationship between x and y
is exactly the same in the large versus small standard
deviation datasets.
Example of how covariance value
relies on variance
[Figure: two scatterplots with the same form of relationship but different variances, yielding different covariance values]
Pearson's R
r_xy = cov(x, y) / (sₓ sᵧ)
Pearson’s R continued
cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

r_xy = Σ (xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) sₓ sᵧ)

r_xy = Σ (Z_xᵢ · Z_yᵢ) / (n − 1)
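Dividing the covariance of the example data by the two standard deviations standardizes it into r, which always lies between −1 and 1:

```python
import math

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))

r = cov_xy / (s_x * s_y)  # roughly 0.35: a weak positive relationship
```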
Limitations of r
• When r = 1 or r = -1:
– We can predict y from x with certainty
– all data points are on a straight line: y = ax + b
• r is actually r̂
– the true r describes the whole population
– r̂ is an estimate of r based on the sample data
• r is very sensitive to extreme values:
[Figure: scatterplot in which a single extreme value dominates the fitted relationship]
Regression
• Correlation tells you if there is an association
between x and y but it doesn’t describe the
relationship or allow you to predict one
variable from the other.
ŷ = predicted value
yᵢ = true (observed) value
ε = residual error
Least Squares Regression
Residual (ε) = y − ŷ
Sum of squares of residuals = Σ (y − ŷ)²
• The least-squares line is the one that minimizes this sum
[Figure: regression line with residuals ε drawn as vertical distances from the data points to the line]
• We can calculate the regression line for any data, but the important
question is how well does this line fit the data, or how good is it at
predicting y from x
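For simple linear regression the least-squares estimates have a closed form: the slope is cov(x, y)/var(x), and the line passes through the point of means. A sketch, reusing the earlier example data:

```python
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Slope a = cov(x, y) / var(x); intercept b makes the line pass through the means
a = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
b = mean_y - a * mean_x

y_hat = [a * xi + b for xi in x]                        # predictions
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of squared residuals
```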
How good is our model?
• Total variance of y: s_y² = Σ (y − ȳ)² / (n − 1) = SS_y / df_y
• r² = s_ŷ² / s_y²
• F-statistic: after some complicated
rearranging,
F(df_ŷ, df_err) = s_ŷ² / s_err² = …… = r² (n − 2) / (1 − r²)
And it follows that:
t(n−2) = r √(n − 2) / √(1 − r²)
(because F = t²)
So all we need to
know are r and n
General Linear Model
• Linear regression is actually a form of the
General Linear Model where the parameters
are a, the slope of the line, and b, the intercept.
y = ax + b + ε
• A General Linear Model is just any model that
describes the data in terms of a straight line
Multiple regression