
Research Methodology

Lecture 7: Processing and Analysis of Data
Processing of Data
• The collected data in research is processed and
analyzed to come to some conclusions or to verify
the hypothesis made.
• Processing of data is important as it makes further
analysis of data easier and efficient. Processing of
data technically means
– Editing of the data
– Coding of data
– Classification of data
– Tabulation of data.
Editing
• Data editing is a process by which
collected data is examined to detect any
errors or omissions and further these are
corrected as much as possible before
proceeding further.
• Editing is of two types:
– Field Editing
– Central Editing.
Field and Central Editing
FIELD EDITING:
• This type of editing deals with abbreviated or illegible entries in the gathered data. Such editing is most effective when done on the same day or the very next day after the interview. The investigator must not jump to conclusions while doing field editing.
CENTRAL EDITING:
• This type of editing takes place after the entire data collection process has been completed. Here a single or common editor corrects errors such as an entry in the wrong place or in the wrong unit, etc. As a rule, wrong answers should be dropped from the final results.
Considerations for Editing
• Editor must be familiar with the interviewer’s
mind set, objectives and everything related to
the study.
• Editors should use a distinctive color when making entries in the collected data.
• They should initial all answers or changes they make to the data.
• The editor's name and the date of editing should be placed on the data sheet.
Coding
➢ Classification of responses may be done on the basis of one or
more common concepts.
➢ In coding a particular numeral or symbol is assigned to the answers
in order to put the responses in some definite categories or classes.
➢ The classes of responses determined by the researcher should be
appropriate and suitable to the study.
➢ Coding enables efficient and effective analysis as the responses are
categorized into meaningful classes.
➢ Coding decisions are considered while developing or designing the
questionnaire or any other data collection tool.
➢ Coding can be done manually or through computer.
Classification
➢ Classification of the data implies that the collected raw
data is categorized into common groups having
common features.
➢ Data having common characteristics are placed in a
common group.
➢ The entire data collected is categorized into various
groups or classes, which convey a meaning to the
researcher.
Classification is done in two ways:
- Classification according to attributes.
- Classification according to the class intervals.
Classification According To Attributes
➢ Here the data is classified on the basis of common
characteristics, which can be descriptive (literacy, sex,
honesty, marital status, etc.) or numerical (weight,
height, income, etc.).
➢ Descriptive features are qualitative in nature and
cannot be measured quantitatively, but they are still
considered while making an analysis.
➢ Analysis used for such classified data is known as
statistics of attributes and the classification is known as
the classification according to the attributes.
Classification on the Basis of the Class Interval
• The numerical feature of data can be measured
quantitatively and analyzed with the help of some
statistical unit like the data relating to income,
production, age, weight etc. come under this category.
This type of data is known as statistics of variables
and the data is classified by way of intervals.
• Classification According to the Class Interval usually
involves the following three main problems :
– Number of Classes.
– How to select class limits.
– How to determine the frequency of each class.
Tabulation
• The mass of data collected has to be
arranged in some kind of concise and
logical order.
• Tabulation summarizes the raw data and
displays data in form of some statistical
tables.
• Tabulation is an orderly arrangement of
data in rows and columns.
Objectives of Tabulation
• Conserves space & minimizes explanation
and descriptive statements.
• Facilitates process of comparison and
summarization.
• Facilitates detection of errors and
omissions.
• Establish the basis of various statistical
computations.
Basic Principles of Tabulation
1. Tables should be clear, concise & adequately titled.
2. Every table should be distinctly numbered for easy
reference.
3. Column headings & row headings of the table should
be clear & brief.
4. Units of measurement should be specified at
appropriate places.
5. Explanatory footnotes concerning the table should be
placed at appropriate places.
6. Source of information of data should be clearly
indicated.
Basic Principles of Tabulation
7. The columns & rows should be clearly separated with dark
lines
8. Demarcation should also be made between data of one
class and that of another.
9. Comparable data should be put side by side.
10. The figures in percentage should be approximated before
tabulation.
11. The alignment of the figures, symbols etc. should be
properly aligned and adequately spaced to enhance the
readability of the same.
12. Abbreviations should be avoided.
Analysis of Data
The important statistical measures that are used to
analyze the research or the survey are:
1. Measures of central tendency (mean, median & mode)
2. Measures of dispersion (standard deviation, range, mean deviation)
3. Measures of asymmetry (skewness)
4. Measures of relationship (correlation and regression)
5. Association in case of attributes
6. Time series Analysis
Measures of Central Tendency
Measures of Central Tendency – Mean
• Mean – the most commonly used measure of central tendency
• Average of all observations
• The sum of all the scores divided by the number of scores
• Note: this assumes that each observation is equally significant
Measures of Central Tendency – Mean

Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$   Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Measures of Central Tendency – Mean
• Example I
– Data: 8, 4, 2, 6, 10

$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{8 + 4 + 2 + 6 + 10}{5} = 6$

• Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches): 9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5

$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{9.8 + 10.2 + \cdots + 24.5}{10} = 14.38$
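The two examples above can be reproduced in a few lines of Python (the diameters are the Battle Park values from Example II):

```python
# Mean: the sum of all scores divided by the number of scores
diameters = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]

def mean(values):
    return sum(values) / len(values)

print(mean([8, 4, 2, 6, 10]))      # Example I: 6.0
print(round(mean(diameters), 2))   # Example II: 14.38
```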
Measures of Central Tendency – Mean
• Example III
– Monthly mean temperatures (°F) at Chapel Hill, NC (2001): annual mean $\bar{x} = 59.70$ °F
• Example IV
– Chapel Hill, NC (1972-2001): mean annual precipitation = 1198.10 mm; mean annual temperature = 58.51 °F
Measures of Central Tendency – Mean
• Advantage
– Sensitive to any change in the value of any observation
• Disadvantage
– Very sensitive to outliers

# Tree   Height (m)   # Tree   Height (m)
1        5.0          6        5.3
2        6.0          7        7.1
3        7.5          8        25.4
4        8.0          9        7.5
5        4.8          10       4.5

Source: http://www.forestlearn.org/forests/refor.htm
Mean = 8.10 m (all 10 trees); Mean = 6.19 m (excluding the outlier, tree 8)
Weighted Mean
• We can also calculate a weighted mean using some weighting factor:

$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

e.g. What is the average income of all people in cities A, B, and C?

City   Avg. Income   Population
A      $23,000       100,000
B      $20,000       50,000
C      $25,000       150,000

Here, population is the weighting factor and the average income is the variable of interest.
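A short Python sketch of this weighted mean, using the three-city table above:

```python
# Weighted mean: sum(w_i * x_i) / sum(w_i)
incomes = [23000, 20000, 25000]        # cities A, B, C
populations = [100000, 50000, 150000]  # the weighting factor

def weighted_mean(values, weights):
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

avg_income = weighted_mean(incomes, populations)
print(avg_income)  # 23500.0
```

Note that the unweighted mean of the three city averages would be about 22,667; weighting by population pulls the answer toward the largest city.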
Measures of Central Tendency – Median
• Median – the value of a variable such that half of the observations are above and half are below it, i.e. this value divides the distribution into two groups of equal size
• When the number of observations is odd, the median is simply the middle value
• When the number of observations is even, we take the median to be the average of the two values in the middle of the distribution
Measures of Central Tendency – Median
• Example I
– Data: 8, 4, 2, 6, 10 (mean: 6)
– Sorted: 2, 4, 6, 8, 10 → median: 6
• Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches): 9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5 (mean: 14.38)
– Sorted: 7.8, 9.8, 10.1, 10.2, 13.9, 14.5, 15.5, 17.5, 20.0, 24.5
– median: (13.9 + 14.5) / 2 = 14.2
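Both cases (odd and even n) can be handled by one small function; here it is applied to the two examples above:

```python
# Median: middle value (odd n) or average of the two middle values (even n)
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([8, 4, 2, 6, 10]))  # Example I: 6
diameters = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]
print(round(median(diameters), 1))  # Example II: 14.2
```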


# Tree   Height (m)   # Tree   Height (m)
1        4.5          6        7.1
2        4.8          7        7.5
3        5.0          8        7.5
4        5.3          9        8.0
5        6.0          10       25.4

Source: http://www.forestlearn.org/forests/refor.htm
Mean = 8.10 m (6.19 m without the outlier); median: (6.0 + 7.1) / 2 = 6.55 m

• Advantage: the median is NOT affected by extreme values at the end of a distribution (which are potentially outliers)
Measures of Central Tendency – Mode
• Mode – the most frequently occurring value in the distribution
• This is the only measure of central tendency that can be used with nominal data
• The mode allows the distribution's peak to be located quickly
# Tree   Height (m)   # Tree   Height (m)
1        4.5          6        7.1
2        4.8          7        7.5
3        5.0          8        7.5
4        5.3          9        8.0
5        6.0          10       25.4

Source: http://www.forestlearn.org/forests/refor.htm
Mean = 8.10 m (6.19 m without the outlier); median: (6.0 + 7.1) / 2 = 6.55 m; mode: 7.5 m
• Example: pixel values from a Landsat ETM+ image, Chapel Hill (2002-05-24, 7-4-1 band combination):

30  40  25  50  45
50  55  45  48  61
60  75  70  45  72
24  45 200 205  65
65  39  58  45  65

Sorted: 24, 25, 30, 39, 40, 45, 45, 45, 45, 45, 48, 50, 50,
55, 58, 60, 61, 65, 65, 65, 70, 72, 75, 200, 205

mean: 63.28   median: 50   mode: 45
mean (without the outliers 200 and 205): 51.17
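All three measures for this pixel grid can be checked with the standard library:

```python
from collections import Counter
from statistics import mean, median

pixels = [30, 40, 25, 50, 45, 50, 55, 45, 48, 61,
          60, 75, 70, 45, 72, 24, 45, 200, 205, 65,
          65, 39, 58, 45, 65]

# mode: the value with the highest frequency
mode_value = Counter(pixels).most_common(1)[0][0]
print(round(mean(pixels), 2), median(pixels), mode_value)  # 63.28 50 45

no_outliers = [p for p in pixels if p < 200]
print(round(mean(no_outliers), 2))  # 51.17
```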
Which one is better: mean, median, or mode?
• Most often, the mean is selected by default
• The mean's key advantage is that it is sensitive to
any change in the value of any observation
• The mean's disadvantage is that it is very sensitive to
outliers
• We really must consider the nature of the data, the
distribution, and our goals to choose properly
Which one is better: mean, median, or mode?

• The mean is valid only for interval data or ratio data.


• The median can be determined for ordinal data as well
as interval and ratio data.
• The mode can be used with nominal, ordinal, interval,
and ratio data
• Mode is the only measure of central tendency that can
be used with nominal data
Which one is better: mean, median, or mode?

• It also depends on the nature of the distribution

(Figure: example distributions – multi-modal, unimodal symmetric, and unimodal skewed.)

Which one is better: mean, median, or mode?

• It also depends on your goals


• Consider a company that has nine employees with
salaries of 35,000 a year, and their supervisor makes
150,000 a year.
• If you want to describe the typical salary in the
company, which statistic would you use?
• I would use the mode or the median (35,000), because it
tells us what salary most people get

Source: http://www.shodor.org/interactivate/discussions/sd1.html
Which one is better: mean, median, or mode?

• It also depends on your goals


• Consider a company that has nine employees with
salaries of 35,000 a year, and their supervisor makes
150,000 a year
• What if you are a recruiting officer for the company
that wants to make a good impression on a
prospective employee?
• The mean is (35,000 × 9 + 150,000) / 10 = 46,500. I would
probably say: "The average salary in our company is
46,500", using the mean
Source: http://www.shodor.org/interactivate/discussions/sd1.html
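The salary example makes the contrast concrete in code:

```python
from statistics import mean, median, mode

# Nine employees at 35,000 plus one supervisor at 150,000
salaries = [35000] * 9 + [150000]

print(mean(salaries))    # 46500 – the recruiter's "average salary"
print(median(salaries))  # 35000 – the typical salary
print(mode(salaries))    # 35000
```

A single large value shifts the mean by over 11,000 while the median and mode are untouched.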
Measures of Dispersion
Measures of Variability or Dispersion
• The dispersion of a distribution reveals how the observations are spread out or scattered on each side of the center.
• Measuring the dispersion, scatter, or variation of a distribution is as important as locating the central tendency.
• If the dispersion is small, it indicates high uniformity of the observations in the distribution.
• Absence of dispersion in the data indicates perfect uniformity. This situation arises when all observations in the distribution are identical.
• If this were the case, description of any single observation would suffice.
Purpose of Measuring Dispersion
• A measure of dispersion serves two purposes.
• First, it is one of the most important quantities used to characterize a frequency distribution.
• Second, it affords a basis of comparison between two or more frequency distributions.
• The study of dispersion is important because various distributions may have exactly the same averages but substantial differences in their variability.
Measures of Dispersion
• Absolute measures of dispersion
– Range
– Percentile range
– Quartile deviation
– Mean deviation
– Variance and standard deviation
• Relative measures of dispersion
– Coefficient of variation
– Coefficient of mean deviation
– Coefficient of range
– Coefficient of quartile deviation
Range
• The simplest and crudest measure of dispersion is the range, defined as the difference between the largest and the smallest values in the distribution. If $x_1, x_2, \ldots, x_n$ are the values of observations in a sample, then the range (R) of the variable X is given by:

$R(x_1, x_2, \ldots, x_n) = \max(x_1, x_2, \ldots, x_n) - \min(x_1, x_2, \ldots, x_n)$
Percentile Range
• The difference between the 10th and 90th percentiles.
• It is established by excluding the highest and the lowest 10 percent of the items, and is the difference between the largest and the smallest values of the remaining 80 percent of the items.

$P_{10-90} = P_{90} - P_{10}$
Quartile Deviation
• A measure similar to the range (Q) is the inter-quartile range, the difference between the third quartile (Q3) and the first quartile (Q1). Thus

$Q = Q_3 - Q_1$

• The inter-quartile range is frequently reduced to the semi-interquartile range, known as the quartile deviation (QD), by dividing it by 2. Thus

$QD = \frac{Q_3 - Q_1}{2}$
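A sketch of the quartile deviation in Python. Percentile conventions vary; this uses the (n + 1)·p positional rule with linear interpolation, the same convention the commuting-time example later in this lecture uses, applied here to that same 30-value data set:

```python
# Quartile deviation (semi-interquartile range) via the (n + 1) * p rule
def percentile(sorted_vals, p):
    pos = (len(sorted_vals) + 1) * p  # 1-indexed position
    lo = int(pos)
    frac = pos - lo
    if lo < 1:
        return sorted_vals[0]
    if lo >= len(sorted_vals):
        return sorted_vals[-1]
    return sorted_vals[lo - 1] + frac * (sorted_vals[lo] - sorted_vals[lo - 1])

data = sorted([5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21,
               21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47])
q1 = percentile(data, 0.25)  # 11.75
q3 = percentile(data, 0.75)  # 26.0
qd = (q3 - q1) / 2           # 7.125
print(q1, q3, qd)
```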
Mean Deviation
• The mean deviation is an average of the absolute deviations of individual observations from the central value of a series. Average deviation about the mean:

$MD_{\bar{x}} = \frac{\sum_{i=1}^{k} f_i \left| x_i - \bar{x} \right|}{n}$

• k = number of classes
• $x_i$ = mid-point of the i-th class
• $f_i$ = frequency of the i-th class
Standard Deviation
• Standard deviation is the positive square root of the mean-square deviation of the observations from their arithmetic mean.

Population: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$   Sample: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

$SD = \sqrt{\text{variance}}$
Standard Deviation for Grouped Data
• SD is:

$s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{N}}$  where  $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$

• Simplified formula:

$s = \sqrt{\frac{\sum f x^2}{N} - \left(\frac{\sum f x}{N}\right)^2}$
Example-1: Find the Standard Deviation of Ungrouped Data

Family No.  1  2  3  4  5  6  7  8  9  10
Size (xi)   3  3  4  4  5  5  6  6  7  7

Here, $\bar{x} = \frac{\sum x_i}{n} = \frac{50}{10} = 5$

Family No.           1   2   3   4   5   6   7   8   9  10   Total
$x_i$                3   3   4   4   5   5   6   6   7   7   50
$x_i - \bar{x}$     -2  -2  -1  -1   0   0   1   1   2   2   0
$(x_i - \bar{x})^2$  4   4   1   1   0   0   1   1   4   4   20
$x_i^2$              9   9  16  16  25  25  36  36  49  49   270

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} \approx 2.22, \quad s = \sqrt{2.22} \approx 1.49$
Example-2: Find the Standard Deviation of Grouped Data

$x_i$  $f_i$  $f_i x_i$  $f_i x_i^2$  $x_i - \bar{x}$  $(x_i - \bar{x})^2$  $f_i (x_i - \bar{x})^2$
3      2      6          18           -3               9                    18
5      3      15         75           -1               1                    3
7      2      14         98           1                1                    2
8      2      16         128          2                4                    8
9      1      9          81           3                9                    9
Total  10     60         400          -                -                    40

$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{60}{10} = 6, \quad s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1} = \frac{40}{9} \approx 4.44$
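The grouped computation from Example-2 in Python, weighting each class mid-point by its frequency:

```python
# Grouped-data variance: s^2 = sum(f_i * (x_i - xbar)^2) / (n - 1)
xs = [3, 5, 7, 8, 9]  # class mid-points from Example-2
fs = [2, 3, 2, 2, 1]  # class frequencies

n = sum(fs)                                    # 10
xbar = sum(f * x for f, x in zip(fs, xs)) / n  # 6.0
s2 = sum(f * (x - xbar) ** 2 for f, x in zip(fs, xs)) / (n - 1)
print(round(s2, 2))  # 4.44
```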
Relative Measures of Dispersion
• To compare the extent of variation of different distributions, whether having differing or identical units of measurement, it is necessary to consider some other measures that reduce the absolute deviation to some relative form.
• These measures are usually expressed in the form of coefficients and are pure numbers, independent of the unit of measurement.
Relative Measures of Dispersion

• Coefficient of variation
• Coefficient of mean deviation
• Coefficient of range
• Coefficient of quartile deviation
Coefficient of Variation
• A coefficient of variation is computed as the ratio of the standard deviation of the distribution to the mean of the same distribution:

$CV = \frac{s_x}{\bar{x}}$
Example-3: Comments on Children in a Community

        Height    Weight
Mean    40 inch   10 kg
SD      5 inch    2 kg
CV      0.125     0.20

• Since the coefficient of variation for weight is greater than that of height, we would tend to conclude that weight has more variability than height in the population.
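Because CV is a pure number, the inch/kg mismatch in Example-3 disappears:

```python
# Coefficient of variation: CV = SD / mean (unit-free)
cv_height = 5 / 40   # inches cancel: 0.125
cv_weight = 2 / 10   # kilograms cancel: 0.2

print(cv_height, cv_weight)
print(cv_weight > cv_height)  # True: weight is relatively more variable
```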
Coefficient of Mean Deviation
• The third relative measure is the coefficient of mean deviation. As the mean deviation can be computed from the mean, median, mode, or any arbitrary value, a general formula for computing the coefficient of mean deviation may be put as follows:

$\text{Coefficient of mean deviation} = \frac{\text{Mean deviation}}{\text{Mean}} \times 100$
Coefficient of Range
• The coefficient of range is a relative measure corresponding to the range and is obtained by the following formula:

$\text{Coefficient of range} = \frac{L - S}{L + S} \times 100$

• where L and S are respectively the largest and the smallest observations in the data set.
Coefficient of Quartile Deviation
• The coefficient of quartile deviation is computed from the first and the third quartiles using the following formula:

$\text{Coefficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100$
Measures of Asymmetry
Measures of Skewness and Kurtosis
• A fundamental task in many statistical analyses is to characterize the location and variability of a data set (measures of central tendency vs. measures of dispersion)
• Neither measure tells us anything about the shape of the distribution
• A further characterization of the data includes skewness and kurtosis
• The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set
Histograms

Fig. 3. Histogram of crown width (m) measured in situ for a random sample of
Quercus robur trees in Frame Wood (n = 63; mean = 9.3 m; SD = 4.64 m).
Source: Koukoulas & Blackburn, 2005. Journal of Vegetation Science: Vol. 16, No. 5, pp. 587–596
Frequency & Distribution
• A histogram is one way to depict a frequency
distribution
• Frequency is the number of times a variable takes on a
particular value
• Note that any variable has a frequency distribution
• e.g. roll a pair of dice several times and record the
resulting values (constrained to being between 2 and 12),
count the number of times any given value occurs (the
frequency of that value occurring), and take these all
together to form a frequency distribution
Frequency & Distribution
• Frequencies can be absolute (when the frequency
provided is the actual count of the occurrences) or
relative (when they are normalized by dividing the
absolute frequency by the total number of observations
[0, 1])
• Relative frequencies are particularly useful if you want
to compare distributions drawn from two different
sources (i.e. while the numbers of observations of each
source may be different)
Histograms
• We may summarize our data by constructing
histograms, which are vertical bar graphs
• A histogram is used to graphically summarize the
distribution of a data set
• A histogram divides the range of values in a data set
into intervals
• Over each interval is placed a bar whose height
represents the frequency of data values in the interval.
Building a Histogram
• To construct a histogram, the data are first grouped
into categories
• The histogram contains one vertical bar for each
category
• The height of the bar represents the number of
observations in the category (i.e., frequency)
• It is common to note the midpoint of the category on
the horizontal axis
Building a Histogram – Example
• 1. Develop an ungrouped frequency table
– That is, we build a table that counts the number of
occurrences of each variable value from lowest to highest:
TMI Value Ungrouped Freq.
4.16 2
4.17 4
4.18 0
… …
13.71 1
• We could attempt to construct a bar chart from this table, but
it would have too many bars to really be useful
Building a Histogram – Example
• 2. Construct a grouped frequency table
– Select an appropriate number of classes

Class          Frequency
4.00 - 4.99    120
5.00 - 5.99    807
6.00 - 6.99    1411
7.00 - 7.99    407
8.00 - 8.99    87
9.00 - 9.99    33
10.00 - 10.99  17
11.00 - 11.99  22
12.00 - 12.99  43
13.00 - 13.99  19
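Grouping raw values into unit-wide classes like the table above can be sketched with a counter (the values here are a small hypothetical sample, not the TMI data):

```python
# Grouped frequency table: count observations per unit-wide class
from collections import Counter

values = [4.2, 5.1, 5.9, 6.3, 6.8, 7.4, 6.1, 5.5]   # hypothetical data
classes = Counter(int(v) for v in values)            # class 4 = 4.00-4.99, etc.

for lo in sorted(classes):
    print(f"{lo}.00 - {lo}.99: {classes[lo]}")
```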
Building a Histogram – Example
• 3. Plot the frequencies of each class
– All that remains is to create the bar graph

(Figure: Pond Branch TMI histogram – percent of cells in the catchment against the Topographic Moisture Index (4-16), a proxy for soil moisture.)


Further Moments of the Distribution
• While measures of dispersion are useful for helping
us describe the width of the distribution, they tell us
nothing about the shape of the distribution

Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA:
Macmillan College Publishing Co., p. 91.
Further Moments of the Distribution

• There are further statistics that describe the shape of


the distribution, using formulae that are similar to
those of the mean and variance
• 1st moment - Mean (describes central value)
• 2nd moment - Variance (describes dispersion)
• 3rd moment - Skewness (describes asymmetry)
• 4th moment - Kurtosis (describes peakedness)
Further Moments – Skewness
• Skewness measures the degree of asymmetry exhibited by the data:

$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$

• If skewness equals zero, the histogram is symmetric about the mean
• Positive skewness vs negative skewness
Further Moments – Skewness

Source: http://library.thinkquest.org/10030/3smodsas.htm
Further Moments – Skewness
• Positive skewness
– There are more observations below the mean
than above it
– When the mean is greater than the median

• Negative skewness
– There are a small number of low observations and
a large number of high ones
– When the median is greater than the mean
Further Moments – Kurtosis
• Kurtosis measures how peaked the histogram is:

$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$

• With this definition (excess kurtosis, which subtracts 3), the kurtosis of a normal distribution is 0
• Kurtosis characterizes the relative peakedness or flatness of a distribution compared to the normal distribution
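Both shape measures follow directly from the formulas above; a small sketch using the sample standard deviation for s:

```python
# Third and fourth moments: skewness and (excess) kurtosis
def moments(data):
    n = len(data)
    xbar = sum(data) / n
    s = (sum((x - xbar) ** 2 for x in data) / (n - 1)) ** 0.5
    skew = sum((x - xbar) ** 3 for x in data) / (n * s ** 3)
    kurt = sum((x - xbar) ** 4 for x in data) / (n * s ** 4) - 3
    return skew, kurt

skew, kurt = moments([1, 2, 3, 4, 5])  # symmetric, flat data
print(skew)      # 0.0 – symmetric about the mean
print(kurt < 0)  # True – platykurtic (flatter than normal)
```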
Further Moments – Kurtosis
• Platykurtic– When the kurtosis < 0, the frequencies
throughout the curve are closer to be equal (i.e., the
curve is more flat and wide)
• Thus, negative kurtosis indicates a relatively flat
distribution
• Leptokurtic– When the kurtosis > 0, there are high
frequencies in only a small part of the curve (i.e, the
curve is more peaked)
• Thus, positive kurtosis indicates a relatively peaked
distribution
Further Moments – Kurtosis

(Figure: a platykurtic and a leptokurtic distribution.)
Source: http://www.riskglossary.com/link/kurtosis.htm

• Kurtosis is based on the size of a distribution's tails.
• Negative kurtosis (platykurtic) – distributions with short tails
• Positive kurtosis (leptokurtic) – distributions with relatively long tails
Why Do We Need Kurtosis?

• These two distributions have the same variance,


approximately the same skew, but differ markedly in
kurtosis.
Source: http://davidmlane.com/hyperstat/A53638.html
How to Graphically Summarize Data?

• Histograms

• Box plots
Functions of a Histogram
• The function of a histogram is to graphically
summarize the distribution of a data set
• The histogram graphically shows the following:
1. Center (i.e., the location) of the data
2. Spread (i.e., the scale) of the data
3. Skewness of the data
4. Kurtosis of the data
5. Presence of outliers
6. Presence of multiple modes in the data.
Functions of a Histogram

• The histogram can be used to answer the following


questions:
1. What kind of population distribution do the data
come from?
2. Where are the data located?
3. How spread out are the data?
4. Are the data symmetric or skewed?
5. Are there outliers in the data?
Source: http://www.robertluttman.com/vms/Week5/page9.htm (First three)
http://office.geog.uvic.ca/geog226/frLab1.html (Last)
Box Plots
• We can also use a box plot to graphically summarize a data set
• A box plot represents a graphical summary of what is sometimes called a "five-number summary" of the distribution:
– Minimum
– 25th percentile
– Median
– 75th percentile
– Maximum
• Interquartile Range (IQR) = 75th percentile − 25th percentile
Rogerson, p. 8.
Box Plots
• Example – Consider the first 9 Commodore prices (in $'000):
6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0
• Arrange these in order of magnitude
3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0
• The median is Q2 = 6.7 (there are 4 values on either
side)
• Q1 = 5.9 (median of the 4 smallest values)
• Q3 = 10.2 (median of the 4 largest values)
• IQR = Q3 – Q1 = 10.2 - 5.9 = 4.3
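The five-number summary for the Commodore prices can be computed with the same "median of each half" approach used above (the slide rounds Q1 and Q3 to 5.9 and 10.2):

```python
# Five-number summary for the Commodore prices (in $'000)
prices = sorted([6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0])

def mid(vals):
    n = len(vals)
    m = n // 2
    return vals[m] if n % 2 else (vals[m - 1] + vals[m]) / 2

q2 = mid(prices)       # median: 6.7
q1 = mid(prices[:4])   # median of the 4 smallest: 5.895
q3 = mid(prices[-4:])  # median of the 4 largest: 10.2375
iqr = q3 - q1
print(prices[0], q1, q2, q3, prices[-1])
```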
Box Plots

Example: Table 1.1 Commuting data (Rogerson, p5)

Ranked commuting times:

5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22,
23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47

25th percentile is represented by observation (30+1)/4=7.75


75th percentile is represented by observation 3(30+1)/4=23.25
25th percentile: 11.75
75th percentile: 26
Interquartile range: 26 – 11.75 = 14.25
Measures of Relationship
The relationship between x and y
• Correlation: is there a relationship between 2 variables?
• Regression: how well does a certain independent variable predict the dependent variable?
• CORRELATION ≠ CAUSATION
– In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable
Scattergrams

(Figure: three scatter plots of Y against X, illustrating positive correlation, negative correlation, and no correlation.)
Variance vs Covariance
• First, a note on your sample:
• If you wish to assume that your sample is representative of the general population (RANDOM EFFECTS MODEL), use the degrees of freedom (n − 1) in your calculations of variance or covariance.
• But if you simply want to assess your current sample (FIXED EFFECTS MODEL), substitute n for the degrees of freedom.
Variance vs Covariance
• Do two variables change together?
• Variance gives information on the variability of a single variable:

$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

• Covariance gives information on the degree to which two variables vary together.
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores, as opposed to squaring x's error scores:

$\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Covariance

$\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

• When X and Y move in the same direction: cov(x,y) is positive
• When X and Y move in opposite directions: cov(x,y) is negative
• When there is no constant relationship: cov(x,y) = 0
Example Covariance

x   y   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
0   3   -3                0                 0
2   2   -1                -1                1
3   4   0                 1                 0
4   0   1                 -3                -3
6   6   3                 3                 9

$\bar{x} = 3, \quad \bar{y} = 3, \quad \sum (x_i - \bar{x})(y_i - \bar{y}) = 7$

$\text{cov}(x, y) = \frac{7}{4} = 1.75$   What does this number tell us?
Problem with Covariance:
• The value obtained by covariance depends on the size of the data's standard deviations: if they are large, the value will be greater than if they are small, even if the relationship between x and y is exactly the same in the large versus small standard deviation datasets.
Example of how the covariance value relies on variance

High variance data                     Low variance data
Subject  x    y    x error * y error   x   y   x error * y error
1        101  100  2500                54  53  9
2        81   80   900                 53  52  4
3        61   60   100                 52  51  1
4        51   50   0                   51  50  0
5        41   40   100                 50  49  1
6        21   20   900                 49  48  4
7        1    0    2500                48  47  9
Mean     51   50                       51  50

Sum of x error * y error: 7000         Sum of x error * y error: 28
Covariance: 1166.67                    Covariance: 4.67
Solution: Pearson's r
• Covariance does not really tell us anything on its own
– Solution: standardise this measure
• Pearson's r standardises the covariance value.
• It divides the covariance by the multiplied standard deviations of X and Y:

$r_{xy} = \frac{\text{cov}(x, y)}{s_x s_y}$
Pearson's r continued

$\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \quad\Rightarrow\quad r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}$

$r_{xy} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1}$
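Standardising the earlier covariance example (cov = 1.75, with both standard deviations equal to √5) gives a correlation we can compute from scratch:

```python
# Pearson's r: covariance divided by the product of the standard deviations
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]  # same data as the covariance example

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    sx = (sum((a - xbar) ** 2 for a in x) / (n - 1)) ** 0.5
    sy = (sum((b - ybar) ** 2 for b in y) / (n - 1)) ** 0.5
    return cov / (sx * sy)

r = pearson_r(x, y)
print(round(r, 2))  # 0.35
```

Unlike the raw covariance, r is guaranteed to lie between −1 and 1.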
Limitations of r
• When r = 1 or r = -1:
– We can predict y from x with certainty
– All data points are on a straight line: y = ax + b
• r is actually r̂:
– r = true r of the whole population
– r̂ = estimate of r based on data
• r is very sensitive to extreme values:

(Figure: scatter plot showing how a single extreme value can distort r.)
Regression
• Correlation tells you if there is an association between x and y, but it doesn't describe the relationship or allow you to predict one variable from the other.
• To do this we need REGRESSION!
Best-fit Line
• The aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives the best prediction of y for any value of x
• This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals

ŷ = ax + b, where a = slope and b = intercept
ŷ = predicted value; yᵢ = true value; ε = residual error
Least Squares Regression
• To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line)
• Model line: ŷ = ax + b  (a = slope, b = intercept)
• Residual (ε) = y − ŷ
• Sum of squares of residuals = Σ (y − ŷ)²
• We must find values of a and b that minimise Σ (y − ŷ)²
Finding b
• First we find the value of b that gives the minimum sum of squares

(Figure: three candidate lines with different intercepts b and their residuals ε.)

• Trying different values of b is equivalent to shifting the line up and down the scatter plot
Finding a
• Now we find the value of a that gives the minimum sum of squares

(Figure: three candidate lines with different slopes a and the same intercept b.)

• Trying different values of a is equivalent to changing the slope of the line, while b stays constant
Minimising sums of squares
• Need to minimise Σ(y − ŷ)²
• ŷ = ax + b
• So we need to minimise: Σ(y − ax − b)²
• If we plot the sums of squares (S) for all different values of a and b we get a parabola, because it is a squared term
• So the minimum sum of squares is at the bottom of the curve, where the gradient is zero

(Figure: S plotted against values of a and b; the minimum is where the gradient = 0.)
The maths bit
• The min sum of squares is at the bottom of the curve
where the gradient = 0

• So we can find a and b that give min sum of squares


by taking partial derivatives of Σ(y - ax - b)2 with
respect to a and b separately

• Then we solve these for 0 to give us the values of a


and b that give the min sum of squares
The solution
• Doing this gives the following equation for a:

$a = \frac{r s_y}{s_x}$  where r = correlation coefficient of x and y, $s_y$ = standard deviation of y, $s_x$ = standard deviation of x

• From this you can see that:
– A low correlation coefficient gives a flatter slope (small value of a)
– A large spread of y, i.e. a high standard deviation, results in a steeper slope (high value of a)
– A large spread of x, i.e. a high standard deviation, results in a flatter slope (small value of a)
The solution cont.
• Our model equation is ŷ = ax + b
• This line must pass through the mean, so:

$\bar{y} = a\bar{x} + b \quad\Rightarrow\quad b = \bar{y} - a\bar{x}$

• We can put our equation for a into this, giving:

$b = \bar{y} - \frac{r s_y}{s_x}\bar{x}$  where r = correlation coefficient of x and y, $s_y$ = standard deviation of y, $s_x$ = standard deviation of x

• The smaller the correlation, the closer the intercept is to the mean of y
Back to the model

$\hat{y} = ax + b = \frac{r s_y}{s_x}x + \bar{y} - \frac{r s_y}{s_x}\bar{x}$

Rearranges to:  $\hat{y} = \frac{r s_y}{s_x}(x - \bar{x}) + \bar{y}$

• If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat straight line at height $\bar{y}$
• But this isn't very useful.
• We can calculate the regression line for any data, but the important question is how well this line fits the data, or how good it is at predicting y from x
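Putting the pieces together, a = r·sy/sx and b = ȳ − a·x̄ give the least-squares line; here applied to the small data set from the covariance example:

```python
# Least-squares line via a = r * sy / sx and b = ybar - a * xbar
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]  # same data as the covariance example

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sx = (sum((a - xbar) ** 2 for a in x) / (n - 1)) ** 0.5
sy = (sum((b - ybar) ** 2 for b in y) / (n - 1)) ** 0.5
cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
r = cov / (sx * sy)

a = r * sy / sx       # slope
b = ybar - a * xbar   # intercept: the line passes through (xbar, ybar)
print(round(a, 2), round(b, 2))  # 0.35 1.95
```

As a sanity check, the fitted line predicts exactly ȳ at x = x̄.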
How good is our model?
• Total variance of y:  $s_y^2 = \frac{\sum(y - \bar{y})^2}{n - 1} = \frac{SS_y}{df_y}$
• Variance of predicted y values (ŷ):  $s_{\hat{y}}^2 = \frac{\sum(\hat{y} - \bar{y})^2}{n - 1} = \frac{SS_{pred}}{df_{\hat{y}}}$ (this is the variance explained by our regression model)
• Error variance:  $s_{error}^2 = \frac{\sum(y - \hat{y})^2}{n - 2} = \frac{SS_{er}}{df_{er}}$ (this is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model)
How good is our model cont.
• Total variance = predicted variance + error variance:  $s_y^2 = s_{\hat{y}}^2 + s_{er}^2$
• Conveniently, via some complicated rearranging:  $s_{\hat{y}}^2 = r^2 s_y^2$, so $r^2 = s_{\hat{y}}^2 / s_y^2$
• So r² is the proportion of the variance in y that is explained by our regression model
How good is our model cont.
• Insert $r^2 s_y^2$ into $s_y^2 = s_{\hat{y}}^2 + s_{er}^2$ and rearrange to get:

$s_{er}^2 = s_y^2 - r^2 s_y^2 = s_y^2 (1 - r^2)$

• From this we can see that the greater the correlation, the smaller the error variance, so the better our prediction
Is the model significant?
• i.e. do we get a significantly better prediction of y from our regression equation than by just predicting the mean?
• F-statistic (after some complicated rearranging):

$F(df_{\hat{y}}, df_{er}) = \frac{s_{\hat{y}}^2}{s_{er}^2} = \cdots = \frac{r^2 (n - 2)}{1 - r^2}$

• And it follows that (because F = t²):

$t_{(n-2)} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$   So all we need to know are r and n
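The t-statistic really does need only r and n; a minimal sketch, evaluated with the hypothetical r = 0.35 and n = 5 from the small regression example (the p-value would then come from a t table with n − 2 degrees of freedom):

```python
# t with n-2 degrees of freedom: t = r * sqrt(n - 2) / sqrt(1 - r^2)
import math

def t_from_r(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_from_r(0.35, 5)  # hypothetical r and n
print(round(t, 3))     # 0.647
```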
General Linear Model
• Linear regression is actually a form of the General Linear Model, where the parameters are a, the slope of the line, and b, the intercept:

y = ax + b + ε

• A General Linear Model is just any model that describes the data in terms of a straight line
Multiple regression
• Multiple regression is used to determine the effect of a number of independent variables, x₁, x₂, x₃ etc., on a single dependent variable, y
• The different x variables are combined in a linear way and each has its own regression coefficient:

$y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + b + \varepsilon$

• The a parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y
• i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for