An Introduction to Econometrics
and Statistical Inference
Copyright 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
Learning Objectives
Understand the steps involved in
conducting an empirical research project
Understand the meaning of the term
econometrics
Understand the relationship between
populations, samples, and statistical
inference
Understand the important role that
sampling distributions play in statistical
inference
The 5 Steps in
Conducting an Empirical
Research Project
What is Econometrics?
Econometrics is the application of
statistical techniques to economic
data.
Populations, Samples,
and Statistical Inference
A population is the entire group of entities that we
are interested in learning about.
A sample is a subset or part of the population and
it is what is used to perform statistical inference.
Statistical inference is the process of drawing
conclusions from data that are subject to random
variation.
Populations, Samples,
and Statistical Inference
Continued
Sampling Distributions
A sampling distribution is the distribution of a
sample statistic (such as the sample mean) across
repeated samples drawn from the same population.
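One way to see a sampling distribution is to simulate it. The sketch below is illustrative only: the population (the integers 1 to 1000), the sample size, the number of samples, and the helper name `sample_means` are all assumptions, not taken from the slides.

```python
import random
import statistics

def sample_means(population, sample_size, num_samples, seed=0):
    """Draw repeated random samples and return the mean of each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical population: the integers 1..1000 (population mean = 500.5).
population = list(range(1, 1001))
means = sample_means(population, sample_size=30, num_samples=2000)

# The collection of sample means approximates the sampling distribution
# of the sample mean for samples of size 30.
print("mean of the sample means:", round(statistics.mean(means), 2))
print("std. dev. of the sample means:", round(statistics.stdev(means), 2))
```

The sample means cluster around the population mean and vary much less than the individual observations do, which is what makes them useful for inference.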
A Visual Example
Chapter 2
Collection and
Management of Data
Learning Objectives
Consider potential sources of data
Work through an example of the first
three steps in conducting an
empirical research project
Develop data management skills
Understand some useful Excel
commands
Types of Data
Cross-sectional data are data collected for
many different individuals, countries, firms,
etc. in a given time period.
Time-series data are data collected for a
given individual, country, firm, etc. over
many different time periods.
Panel data are data collected for a number
of individuals, countries, firms, etc. over
many different time periods.
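The three data types can be seen as different slices of the same kind of record. The sketch below uses made-up observations; the firm names, years, and values are hypothetical.

```python
# Hypothetical observations: (unit, year, value). All numbers are made up.
observations = [
    ("Firm A", 2012, 10.0), ("Firm A", 2013, 11.5),
    ("Firm B", 2012, 7.2), ("Firm B", 2013, 7.9),
]

# Cross-sectional: many units observed in one time period.
cross_section = [obs for obs in observations if obs[1] == 2012]

# Time series: one unit observed over many time periods.
time_series = [obs for obs in observations if obs[0] == "Firm A"]

# Panel: many units, each observed over many time periods.
panel = observations

print(cross_section)
print(time_series)
```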
Publicly available data can be
obtained through the internet or through a formal
Freedom of Information Act (FOIA) request.
Chapter 3
Summary Statistics
Learning Objectives
Construct a Relative
Frequency Histogram
A bar chart that shows how often
observations lie within specified classes
Allows a visual inspection of the data
Based on a Relative Frequency Table
The example dataset for constructing a
histogram is states.xls, a survey of
econometrics students that asked how
many states they have visited.
[Figure: relative frequency histogram; vertical axis from 0 to 0.2; classes 0-5.99, 6-11.99, 12-17.99, 18-24]
To create a frequency
distribution we must
1. Select the number of classes
2. Choose the class interval or width of
the classes
3. Select the class boundaries or the
values that form the interval for each
class
4. Count the number of values in the
dataset that fall in each class
Step 1: Example
We have 43 data points, so the rule gives:
Approximate number of classes = $43^{0.3333} \approx 3.503$
Round this up to the next integer value, which is 4.
The number of classes is 4.
Always round up!
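With 43 data points, the value 3.503 shown on this slide is the cube root of 43. The sketch below assumes that cube-root reading of the rule and then rounds up; the function name `number_of_classes` is hypothetical.

```python
import math

def number_of_classes(n):
    """Assumed rule: cube root of the number of observations, rounded up."""
    return math.ceil(n ** (1 / 3))

n = 43
approx = n ** (1 / 3)
print(round(approx, 3))      # 3.503
print(number_of_classes(n))  # 4
```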
Step 2: Example
Approximate interval width = (24-1)/4
= 5.75
Round up to 6.
Therefore the class width is 6.
Step 3: Example
Lowest data point is 1. We will start
our classes at 0.
Class 1 = 0
Class 2 = 6 (=0+6)
Class 3 = 12 (=6+6)
Class 4 = 18 (=12+6)
Step 3: Example
Continued
Class boundaries are then:
Class 1: 0-5.99
Class 2: 6-11.99
Class 3: 12-17.99
Class 4: 18-24
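Steps 2 through 4 can be sketched in a few lines. The width and boundaries reproduce the numbers on the slides; the data values are hypothetical (they are not the states.xls responses), and the helper names are made up for illustration.

```python
import math

def class_lower_bounds(start, width, num_classes):
    """Lower boundary of each class: start, start + width, ..."""
    return [start + k * width for k in range(num_classes)]

def class_counts(data, start, width, num_classes):
    """Count the values in each class; assumes every value is >= start.

    The last class is closed on the right so that it includes the maximum.
    """
    counts = [0] * num_classes
    for value in data:
        index = min(int((value - start) // width), num_classes - 1)
        counts[index] += 1
    return counts

width = math.ceil((24 - 1) / 4)           # 5.75 rounds up to 6
bounds = class_lower_bounds(0, width, 4)  # [0, 6, 12, 18]

# Hypothetical data, not the states.xls values.
data = [1, 3, 5, 6, 9, 12, 17, 18, 24]
print(width, bounds, class_counts(data, 0, width, 4))
```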
Creating relative
frequency and percent
frequency distributions
Recall that the relative frequency is
the proportion of the observations
belonging to a class. With n
observations,
$\text{Relative frequency of a class} = \dfrac{\text{Frequency of the class}}{n}$
The percent frequency is the relative
frequency multiplied by 100.
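A minimal sketch of both calculations follows. The class frequencies are hypothetical; they are chosen only so that they sum to 43.

```python
def relative_frequencies(counts):
    """Frequency of each class divided by the total number of observations."""
    n = sum(counts)
    return [count / n for count in counts]

def percent_frequencies(counts):
    """Relative frequency of each class multiplied by 100."""
    return [100 * rf for rf in relative_frequencies(counts)]

# Hypothetical class frequencies for 4 classes (n = 43 in total).
counts = [8, 15, 12, 8]
rel = relative_frequencies(counts)
pct = percent_frequencies(counts)
print([round(r, 3) for r in rel])
print([round(p, 1) for p in pct])
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on the table.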
[Figure: relative frequency histogram; vertical axis: Relative Frequency, from 0 to 0.4; classes 0-5.99, 6-11.99, 12-17.99, 18-24]
Calculate Measures of
Central Tendency
Central tendency is the middle value of a
dataset.
The measure of central tendency is
typically thought of as the number that best
describes the data.
Measures of central tendency are:
(1) Mean
(2) Median
Measure of Central
Tendency - Mean
The mean is the arithmetic average of the data. To
calculate the mean, sum all the observations and divide by
the number of observations.
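As a small illustration, the sketch below applies this definition to the five values used in the median example later in this deck (95, 85, 99, 92, 80). The function name is just for illustration.

```python
def mean(values):
    """Arithmetic average: sum of the observations divided by their count."""
    return sum(values) / len(values)

scores = [95, 85, 99, 92, 80]
print(mean(scores))  # 90.2
```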
Measure of Central
Tendency - Median
Median: the middle observation when the data are
arranged from smallest to largest, sometimes called
the 50th percentile. Half the observations lie below
the median and half the observations lie above the
median.
The median is the middle observation for an odd
number of ordered observations and the average of
the middle two ordered observations for an even
number of observations.
The median is an order statistic, so to calculate
it the data must be ordered from smallest to largest.
Measure of Central
Tendency - Median
Median: the central observation for an odd number of
observations and the average of the two middle data points
for an even number of observations.
For the following small data set:
95 85 99 92 80
(ordered data: 80 85 92 95 99)
Median = 92 (the 3rd data point)
If we had 75 80 85 92 95 99
median = (.5*85)+(.5*92) = 42.5+46 = (85+92)/2 = 88.5
In Excel: =MEDIAN(highlight data)
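The odd and even cases can be checked with a short function. This is a plain-Python sketch of the rule described above, using the two data sets from the slide.

```python
def median(values):
    """Middle ordered value (odd n) or average of the two middle values (even n)."""
    ordered = sorted(values)  # the median is an order statistic
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([95, 85, 99, 92, 80]))      # 92
print(median([75, 80, 85, 92, 95, 99]))  # 88.5
```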
Calculate Measures of
Dispersion
Dispersion is a measure of how the
data vary.
Measures of dispersion are:
(1) Variance
(2) Standard Deviation
(3) Percentiles
(4) Five Number Summary
Measure of Dispersion
Variance and Standard
Deviation
Standard deviation: roughly the average deviation away from the
mean. It is the square root of the variance.
The variance is calculated by subtracting the mean from
each observation, squaring that value, adding up all n values,
and then dividing by the number of observations less
one.
Sample variance formula is $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard deviation is $s = \sqrt{s^2}$
In Excel: =VAR(highlight data)
=STDEV(highlight data)
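The verbal recipe above translates directly into code. The sketch below uses the five values from the median example in this deck as illustrative data and cross-checks the result against Python's `statistics` module.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [95, 85, 99, 92, 80]
print(sample_variance(data), statistics.variance(data))
print(sample_std_dev(data), statistics.stdev(data))
```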
Measure of Dispersion
Variance and Standard
Deviation
Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Measure of Dispersion
Percentile
A percentile is a number such that p% of the ordered
observations lie below the percentile and (100-p)% of the
observations lie above the percentile.
The median is the 50th percentile: 50% of the ordered
data lies below that level and 50% of the ordered data
lies above that level.
A percentile is an order statistic.
There are many different ways to calculate percentiles.
The next slide shows one of the easiest ways to calculate
percentiles.
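Because several conventions exist, the sketch below should be read as one possible method rather than the one these slides have in mind. It relies on `statistics.quantiles` from the Python standard library with its default "exclusive" method; the wrapper name `percentile` and the data are illustrative.

```python
import statistics

def percentile(values, p):
    """p-th percentile for p in 1..99, using statistics.quantiles.

    This is one of several conventions; other software may give
    slightly different answers for the same data.
    """
    cut_points = statistics.quantiles(values, n=100)  # 99 cut points
    return cut_points[p - 1]

data = [75, 80, 85, 92, 95, 99]
print(percentile(data, 50))  # the 50th percentile is the median
print(percentile(data, 25), percentile(data, 75))
```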
Measure of Dispersion
Five Number Summary
The Five Number Summary is
(1) Minimum
(2) Q1 or 25th Percentile
(3) Q2 or Median (50th Percentile)
(4) Q3 or 75th Percentile
(5) Maximum
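A five number summary can be assembled from the minimum, the three quartiles, and the maximum. The sketch below takes the quartiles from `statistics.quantiles` (default "exclusive" method), which is an assumption: other quartile conventions give slightly different Q1 and Q3 values. The data are illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return min(values), q1, q2, q3, max(values)

data = [75, 80, 85, 92, 95, 99]
summary = five_number_summary(data)
print(summary)
```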
Shapes of Histograms
Symmetric
Skewed to the right or positively skewed
Skewed to the left or negatively skewed
Bimodal
[Figure: Symmetric Histogram]
[Figure: Positively Skewed Distribution]
[Figure: Negatively Skewed Distribution]
[Figure: Bimodal Distribution]
Positively Skewed
Distribution
Median = 2.77
Mean = 4.16
How to determine if
your data is skewed or
symmetric
Pearson's coefficient of skewness:
sk = 3*(mean - median)/(standard deviation)
Rule of thumb:
If sk < -.5 or sk > .5, then the distribution
is skewed.
Otherwise the distribution is symmetric.
Negatively skewed: sk < -.5
Symmetric: -.5 <= sk <= .5
Positively skewed: sk > .5
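The rule of thumb is easy to apply in code. The sketch below computes the coefficient with the sample standard deviation and classifies the result; the function names and both data sets are hypothetical.

```python
import statistics

def pearson_skewness(values):
    """sk = 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

def classify(sk, cutoff=0.5):
    """Apply the +/- 0.5 rule of thumb."""
    if sk > cutoff:
        return "positively skewed"
    if sk < -cutoff:
        return "negatively skewed"
    return "symmetric"

# Hypothetical data with a long right tail.
right_tailed = [1, 2, 2, 3, 3, 4, 5, 8, 20]
sk = pearson_skewness(right_tailed)
print(round(sk, 2), classify(sk))

# Hypothetical symmetric data: mean equals median, so sk is 0.
print(classify(pearson_skewness([1, 2, 3, 4, 5])))
```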
Symmetric Histogram
Mean = .5013
Standard Deviation =.019
Positively Skewed
Distribution
Median = 2.779
Five number summary (minimum, Q1, median, Q3, maximum): 0.008, 1.1578, 2.779, 5.643, 29.001
$\bar{x} \pm 2s$
$\bar{x} \pm 3s$
Scatter Diagram
Examples
[Figure: scatter diagrams of y against x showing a positive linear relationship, a negative linear relationship, and curvilinear relationships]
Scatter Diagram
Examples
[Figure: scatter diagrams of y against x showing strong relationships and weak relationships]
Scatter Diagrams
Examples
[Figure: scatter diagrams of y against x showing no relationship]
[Figure: scatter diagram of Salary (dollars, 0 to 140,000) against Experience (years, 10 to 22)]
Covariance
Covariance is a measure of the linear
relationship between two random variables
A positive covariance indicates a positive
linear relationship between x and y (if x is
below its mean then y tends to be below its
mean and if x is above its mean then y
tends to be above its mean)
A negative covariance indicates a negative
linear relationship between x and y (if x is
below its mean then y tends to be above its
mean and if x is above its mean then y
tends to be below its mean)
Covariance
A covariance near 0 indicates no linear
relationship between x and y
A problem with covariance is that it
depends on the units of measurement for x
and y: if we change from measuring in feet
to inches, the covariance will go up even
though the overall relationship hasn't
changed.
Covariance a Measure of
Linear Association
Between Two Variables
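Remember the formula for variance is
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{n - 1}$
The formula for covariance is
$\mathrm{Cov}(x, y) = s_{xy} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
and it measures how x varies with y in a linear
fashion.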
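The parallel between the two formulas shows up clearly in code: the covariance of a variable with itself is its variance. The sketch below uses small hypothetical data, and the function name is illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of observations")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: y tends to rise with x, so the covariance is positive.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(sample_covariance(x, y))  # 1.5
print(sample_covariance(x, x))  # 2.5, the sample variance of x
```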
Calculating Covariance in
Excel
In some versions of Excel, the covariance command does
not return the sample covariance: it divides by n rather
than by n-1.
The Excel command is
=COVAR(highlight x values, highlight
y values)
You should perform this command in Excel for
the data set above and see if it matches the
value 82,555.5556.
If you obtain 74,300 using the COVAR
command (which is likely), you must multiply
the value you obtain in Excel by n/(n-1) to
obtain the correct value for the sample covariance.
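The n/(n-1) adjustment can be checked numerically. The two values on this slide differ by exactly a factor of 10/9, which would correspond to n = 10 observations; that value of n is inferred here, not stated in the slides. The function names and the small data set are hypothetical.

```python
def population_covariance(xs, ys):
    """Covariance that divides by n (the quantity Excel's COVAR returns)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

def to_sample_covariance(pop_cov, n):
    """Rescale a divide-by-n covariance to the divide-by-(n - 1) version."""
    return pop_cov * n / (n - 1)

# Slide values, with n = 10 assumed (inferred from the ratio of the two numbers).
print(round(to_sample_covariance(74300, 10), 4))  # 82555.5556

# Hypothetical data: the rescaled value equals the sample covariance (1.5).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(to_sample_covariance(population_covariance(x, y), len(x)))
```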
Correlation Coefficient
The sample correlation coefficient,
$r_{xy}$, is an estimate of the population
correlation coefficient and is used to
measure the strength and direction of
the linear relationship between two random
variables.
The correlation is a unit-free measure
(unlike the covariance) and falls
between -1 and 1.
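The slides do not give the formula, so the sketch below assumes the standard definition: the sample covariance divided by the product of the two sample standard deviations (the n-1 terms cancel). It also illustrates the unit-free property by converting hypothetical x values from feet to inches.

```python
import math

def sample_correlation(xs, ys):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)  # the (n - 1) factors cancel

# Hypothetical data measured in feet, then converted to inches.
x_feet = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_inches = [12 * v for v in x_feet]

r_feet = sample_correlation(x_feet, y)
r_inches = sample_correlation(x_inches, y)
print(round(r_feet, 4), round(r_inches, 4))  # same value in both units
```

Changing the units multiplies the covariance by 12 but leaves the correlation unchanged, which is why the correlation is easier to interpret.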
Examples of Approximate
rxy Values
[Figure: scatter diagrams of y against x illustrating approximate values r = -1, r = -.6, r = 0, r = +.3, and r = +1]