Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Chapter 2 Outline
• Random Processes and Probability
o Random Process: A process whose outcome cannot be predicted
with certainty.
o Probability: The likelihood of a particular outcome of a random
process.
• Random Variable: A variable that is associated with an outcome of a
random process.
• Discrete Random Variables and Probability Distributions
o Probability Distribution: Describes the probability for all possible
values of a random variable.
o A Random Variable’s Bad News and Good News
o Relative Frequency Interpretation of Probability: When a random
process is repeated many, many times, the relative frequency of an
outcome equals its probability.
• Describing a Probability Distribution of a Random Variable
o Center of the Distribution: Mean
o Spread of the Distribution: Variance
• Continuous Random Variables and Probability Distributions
• Estimation Procedures: Populations and Samples
• Clint’s Dilemma: Assessing Clint’s Political Prospects
• Usefulness of Simulations
• Center of an Estimate’s Probability Distribution: Mean
• Spread of an Estimate’s Probability Distribution: Variance
• Mean, Variance, and Covariance: Data Variables and Random
Variables
3. Review the arithmetic of means and variances and then complete the following
equations:
a. Mean[cx] = _______________________
b. Mean[x + y] = _______________________
c. Var[cx] = _______________________
d. Var[x + y] = _______________________
e. When x and y are independent: Var[x + y] =
_______________________
• the probability of a Red Sox loss (and a Yankee win) is also one half.
We shall use a card draw as our first illustration of a random process. While we
could use a standard deck of fifty-two cards as the example, the arithmetic can
become cumbersome. Consequently, to keep the calculations manageable, we use
a deck of only four cards, the 2 of clubs, the 3 of hearts, the 3 of diamonds, and
the 4 of hearts:
2♣ 3♥ 3♦ 4♥
Similarly, the probability of drawing the 3 of diamonds is 1/4 and the probability
of drawing the 4 of hearts is 1/4. To summarize:
1 1 1 1
Prob[2♣] = Prob[3♥] = Prob[3♦] = Prob[4♥] =
4 4 4 4
4
Random Variables
To illustrate a discrete random variable consider our card draw experiment and
define v:
v = the value of the card drawn
That is, v equals 2, if the 2 of hearts were drawn; 3, if the 3 of hearts or the 3 of
diamonds were drawn; and 4, if the 4 of hearts were drawn.
Question: Why do we call v a discrete random variable?
• v is discrete because it can only take on a countable number of values; v
can take on three values: 2 or 3 or 4.
• v is a random variable because we cannot determine the value of v before
the experiment is conducted.
While we cannot determine v’s value beforehand, we can calculate the probability
of each possible value:
• v equals 2 whenever the 2 of clubs is drawn; since the probability of
drawing the 2 of clubs is 1/4, the probability that v will equal 2 is 1/4.
• v equals 3 whenever the 3 of hearts or the 3 of diamonds is drawn; since
the probability of drawing the 3 of hearts is 1/4 and the probability of
drawing the 3 of diamonds is 1/4, the probability that v will equal 3 is 1/2.
• v equals 4 whenever the 4 of hearts is drawn; since the probability of
drawing the 4 of hearts is 1/4, the probability that v will equal 4 is 1/4.
Card Drawn v Prob[v]
1
2♣ 2 = .25
4
1 1 1
3♥ or 3♦ 3 + = = .50
4 4 2
1
4♥ 4 = .25
4
Table 2.1: Probability Distribution of the Random Variable v
5
.50
.25
v
2 3 4
Figure 2.1: Probability Distribution of the Random Variable v
In general, a random variable brings both bad and good news. Before the
experiment is conducted:
• Bad news. What we do not know: We cannot determine the numerical
value of the random variable with certainty.
• Good news. What we do know: On the other hand, we can often calculate
the random variable’s probability distribution telling us how likely it is for
the random variable to equal each of its possible numerical values.
We could justify this interpretation of probability “by hand,” but doing so would
be a very time consuming and laborious process. Computers, however, allow us to
simulate the experiment quickly and easily. The Card Draw simulation in our
Econometrics Lab does so.
We first specify the cards to include in our deck. By default, the 2♣, 3♥, 3♦, and
4♥ are included. When we click Start the simulation randomly selects one of our
four cards. The card drawn and its value are reported. To randomly select a
second card, click Continue. A table reports on the relative frequency of each
possible value and a histogram visually illustrates the distribution of the
numerical values.
selected. It will now repeat the experiment very rapidly. What happens as the
number of repetitions becomes large? The relative frequency of a 2 is
approximately .25, the relative frequency of a 3 is approximately .50, and the
relative frequency of a 4 is approximately .25. After many, many repetitions, click
Stop. Recall the probabilities that we calculated for our random variable v:
Prob[v = 2] = .25
Prob[v = 3] = .50
Prob[v = 4] = .25
.50
.25
v
2 3 4
Figure 2.3: Histogram of the Numerical Values of v
We have already defined the mean of a data variable in Chapter 1: The mean is
the average of the numerical values. We can extend the notion of the mean to a
9
More formally, we can calculate the mean of a random variable using the
following two steps:
• Multiply each value by the value’s probability.
• Sum the products.
The mean equals the sum of these products. The following equation describes the
steps more concisely:2
Mean[v ] = ∑ v Prob[v] For each possible value, multiply the value
All v and its probability; then, sum the products.
Let us now “dissect” the right hand side of the equation:
• “v” represents the numerical value of the random variable.
• “Prob[v]” represents the probability of v.
• The upper case sigma, Σ, is the summation sign; it indicates that we
should sum the product of v and its probability, Prob[v].
• The “All v” beneath the summation sign indicates that we should sum over
all numerical values of v.
10
Next, let us turn our attention to the variance. Recall from Chapter 1 that the
variance of a data variable describes the spread of a data variable’s distribution.
The variance equals the average of the squared deviations of the values from the
mean. Just as we used the relative frequency interpretation of probability to
extend the notion of the mean to a random variable, we shall now use it to extend
the notion of the variance. The variance of a random variable equals the average
of the squared deviations of the values from its mean after the experiment is
repeated many, many times.
Begin by calculating the deviation from the mean and then the squared
deviation for each possible value of v:
Deviation From Squared
Card Drawn v Mean[v] Mean[v] Deviation Prob[v]
1
2♣ 2 3 2 − 3 = −1 1 = .25
4
1
3♥ or 3♦ 3 3 3−3=0 0 = .50
2
1
4♥ 4 3 4−3=1 1 = .25
4
If we repeated our experiment many, many times, what would the squared
deviations equal on average?
• About one fourth of the time v would equal 2, the deviation would equal
−1, and the squared deviation 1.
• About one half of the time v would equal 3, the deviation would equal 0,
and the squared deviation 0.
• About one fourth of the time v would equal 1, the deviation would equal 1,
and the squared deviation 1.
11
Half of the time the squared deviation would equal 1 and half of the time 0. On
average, the squared deviations from the mean would equal 1/2.
Econometrics Lab 2.2: Card Draw Simulation – Checking the Mean and Variance
Calculations
It is useful to use our simulation to check our mean and variance calculations. We
shall exploit the relative frequency interpretation of probability to do so. Recall
our experiment:
• Shuffle the 2♣, 3♥, 3♦, and 4♥ thoroughly.
• Draw one card and record its value.
• Replace the card.
The relative frequency interpretation of probability asserts that when an
experiment is repeated many, many times, the relative frequency of each outcome
equals its probability. After many, many repetitions, the distribution of the
numerical values from all the repetitions mirrors the probability distribution:
Distribution of the
Numerical Values
⏐ After many,
⏐ many
↓ repetitions
Probability Distribution
If our equations are correct, what should we expect when we repeat our
experiment many, many times?
• The mean of a random variable’s probability distribution should equal the
average of the numerical values of the variable obtained from each
repetition of the experiment after the experiment is repeated many, many
times. Consequently, after many, many repetitions, the mean should equal
about 3.
• The variance of a random variable’s probability distribution should equal
the average of the squared deviations from the mean obtained from each
repetition of the experiment after the experiment is repeated many, many
times. Consequently, after many, many repetitions, the variance should
equal about .5.
To access the simulation, click inside the red rectangle below:
As before, the 2♣, 3♥, 3♦, and 4♥ are selected by default; so just click Start.
Recall that the simulation now randomly selects one of the four cards. The
numerical value of the card selected is reported. Note that the mean and variance
of the numerical values are also reported. You should convince yourself that the
simulation is calculating the mean and variance correctly by clicking Continue
and calculating the mean and variance yourself. You will observe that the
simulation is indeed performing the calculations accurately. If you are still
skeptical, click Continue again and perform the calculations. Do so until you are
convinced that the mean and variance reported by the simulation are indeed
correct.
14
Next, uncheck the Pause checkbox and click Continue. The simulation no
longer pauses after each card is selected. It will now repeat the experiment very
rapidly. After many, many repetitions, click Stop. What happens as the number of
repetitions becomes large?
• The mean of the numerical values is about 3. This is consistent with our
equation for the mean of the random variable’s probability distribution:
Mean[v ] = ∑ v Prob[v]
All v
1 1 1
Mean[v ] = 2× + 3× + 4×
4 2 4
1 3
= + + 1 = 3
2 2
• The variance of the numerical values is about 0.5. This is consistent with
our equation for the variance of the random variable’s probability
distribution:
For each possible value,
Var[v ] = ∑ (v − Mean[v ]) Prob[v] multiply the squared
2
1 1 1
Var[v ] = 1× + 0× + 1×
4 2 4
1 1 1
= + 0 + = = .5
4 4 2
The simulation illustrates that the equations we use to compute the mean and
variance are indeed correct.
fairway is 32 yards wide, there are 16 yards of fairway to the left of Dan’s target
point and 16 yards of fairway to the right.
Eighteen Hole
Fairway
32 yards
Lake
Target
Left
Rough
200 yards
Right
Rough
Tee
Probability Distribution
.025
.020
.015
.010
.005
v
−40 −32 −24 −16 −8 0 8 16 24 32 40
Figure 2.5: A Continuous Random Variable
16
Eighteen Hole
Fairway
32 yards
Lake
Target
Left
Rough
200 yards
Right
Rough
Tee
Probability Distribution
Prob[v Between −16 and + 16]
.025
.015
v
−40 −32 −24 −16 −8 0 8 16 24 32 40
Figure 2.6: A Continuous Random Variable – Calculating Probabilities
18
• What is the probability that Dan’s drive will land in the lake? The shore of
the lake lies 16 yards to the right of the target point; hence, the probability
that his drive lands in the lake equals the probability that v will be greater
than 16:
Prob[Drive in Lake] = Prob[v Greater Than +16]
This just equals the area beneath the probability distribution that lies to the
right of 16. Applying the equation for the area of a triangle:
1
Prob[Drivein Lake] = × .015 × 24 = .18
2
• What is the probability that Dan’s drive will land in the left rough? The
left rough lies 16 yards to the left of the target point; hence, the probability
that his drive lands in the left rough equals the probability that v will be
less than or equal to −16:
Prob[Drive in Left Rough] = Prob[v Less Than −16]
This just equals the area beneath the probability distribution that lies to the
left of −16:
1
Prob[Drivein Lake] = × .015 × 24 = .18
2
• What is the probability that Dan’s drive will land in the fairway? The
probability that his drive lands in the fairway equals the probability that v
will be within 16 yards of the target point:
Prob[Drive in Fairway] = Prob[v Between −16 and +16]
This just equals the area beneath the probability distribution that lies
between −16 and 16. We can calculate this area by dividing the area into a
rectangle and triangle:
1
Prob[Drivein Fairway] = .015 × 32 + × .010 × 32
2
= (.015 + .005) × 32
= .020 × 32 = .64
We shall now apply what we have learned about random variables to gain insights
into statistics that are cited in the news every day. For example, when the Bureau
of Labor Statistics calculates the unemployment rate every month, it does not
interview every American, the entire American population; instead it gathers
information from a subset of the population, a sample. More specifically, data are
collected from interviews with about 60,000 households. Similarly, political
pollsters do not poll every American voter to forecast the outcome of an election,
but rather they query only a sample of the voters. In each case, a sample of the
population is used to draw inferences about the entire population. How reliable
are these inferences? To address this question, we consider an example.
By conducting the poll, Clint has adopted the philosophy of the econometrician:
Econometrician’s Philosophy: If you lack the information to determine the
value directly, estimate the value to the best of your ability using the
information you do have.
Clint uses the information collected from the 16 students, the sample to
draw inferences about the entire student body, the population. Seventy-five
percent, .75, of the sample support Clint.
Estimate of the Actual Population Fraction Supporting Clint = EstFrac =
12/16 = 3/4= .75
This suggests that Clint leads, does it not? But how confident should Clint be that
he is in fact ahead. Clint faces a dilemma:
Clint’s Dilemma: Should Clint be confident that he has the election in hand
and save his funds or should he finance the beer tap rally?
We shall now pursue the following project to help Clint resolve his dilemma:
Project: Use Clint’s opinion poll to assess his election prospects.
Usefulness of Simulations
In reality, Clint only conducts one poll. How, then, can a simulation of the polling
process be useful? The relative frequency interpretation of probability provides
the answer. We can use a simulation to conduct a poll many, many times. After
many, many repetitions, the simulation reveals the probability distribution of the
possible outcomes for the one poll that Clint conducts.
Distribution of the
Numerical Values
⏐ After many,
⏐ many
↓ repetitions
Probability Distribution
We shall now illustrate how the probability distribution might help Clint decide
whether or not to fund the beer tap rally.
Next, uncheck the Pause checkbox and click Continue. After many, many
repetitions, click Stop. The simulation histogram illustrates that sometimes 12 or
more of those polled support Clint even though only half the population actually
supports him. So, it is entirely possible that the election is a tossup even though
12 of the 16 individuals supported Clint in his poll. In other words, Clint cannot
be completely certain that he is leading despite the fact that 75 percent of the 16
individuals polled supported him. So, where does Clint stand? The poll results do
not allow him to conclude he is leading with certainty. What conclusions can
Clint justifiably draw from his poll results?
Write the name of each student in the college, the population, on a 3×5 card, then:
• Thoroughly shuffle the cards.
• Randomly draw one card.
• Ask that individual if he/she supports Clint and record the answer.
• Replace the card.
Population
Sample
For Clint?
To explain how we can do so, assume for the moment that the election is
actually a tossup as we did in our simulation; that is, assume that half the
population supports Clint and half does not. We make this hypothetical
assumption only temporarily because it will help us understand the polling
process. With this assumption, we can easily determine v’s probability
distribution. Since the individual is chosen at random, the chances that the
individual will support Clint equal the chances he/she will not:
Individual’s
Response v Prob[v]
1
For Clint 1
2
1
Not for Clint 0
2
v Prob
For Clint 1 1/2
1/2
Individual
1/2
Not for Clint 0 1/2
Figure 2.9: Probabilities for a Sample Size of One
This makes sense, does it not? In words, the mean of a random variable
equals the average of the values of the variable after the experiment is repeated
many, many times. Recall that we have assumed that the election is a tossup.
Consequently, after the experiment is repeated many, many times, we would
expect v to be
• 1, about half of the time.
• 0, about half of the time.
After many, many repetitions of the experiment, the numerical value of v should
1
average out to equal .
2
25
Recall the equation and the four steps we used to calculate the variance:
For each possible value, multiply
Var[v ] = ∑ (v − Mean[v ]) 2 Prob[v] the squared deviation and its
All v
probability; then, sum the products.
This equation summarizes the four steps we can use to calculate the variance:
• For each possible value, calculate the deviation from the mean;
• Square each value’s deviation;
• Multiply each value’s squared deviation by the value’s probability;
• Sum the products.
Individual’s Deviation From Squared
Response v Mean[v] Mean[v] Deviation Prob[v]
1 1 1 1 1
For Clint 1 1− =
2 2 2 4 2
1 1 1 1 1
Not for Clint 0 0− = −
2 2 2 4 2
Econometrics Lab 2.4: Polling – Checking the Mean and Variance Calculations
We shall now use our Opinion Poll simulation to check our mean and variance
calculations by specifying a sample size of 1. In this case, the estimated fraction,
EstFrac, and v are identical:
Number for Clint Number for Clint
EstFrac = = =v
SampleSize 1
26
Distribution of the
Numerical Values
⏐ After many,
⏐ many
↓ repetitions
Probability Distribution
If our calculations for the mean and variance of v’s probability distribution are
correct, the mean of the numerical values should equal about .50 and the variance
about .25 after many, many repetitions:
Mean of the Numerical
Variance of Numerical Values
Values
↓ After many, many repetitions ↓
Mean of Probability Variance of Probability
1 1
Distribution = = .50 Distribution = = .25
2 4
Indeed, after many, many repetitions the mean numerical value is about
.50 and the variance about .25, consistent with our calculations.
27
Generalization
Thus far, we have assumed that the portion of the population supporting Clint is
1/2. Let us now generalize our analysis by letting p equal the actual fraction of the
population supporting Clint’s candidacy: ActFrac = p. The probability that that
the individual selected will support Clint just equals p, the actual fraction of the
population supporting Clint:
Individual’s
Response v Prob(v)
For Clint 1 p
Not for Clint 0 1 − p
where p = ActFrac = Actual Fraction of the Population Supporting Clint
Number of Clint Supporters in theStudent Body
=
Total Number of Students in theStudent Body
v Prob
For Clint 1 p
p
Individual
1−p
Not for Clint 0 1−p
Figure 2.10: Probabilities for a Sample Size of One
We can derive an equation for the mean of v’s probability distribution and the
variance of v’s probability distribution.
28
The equation and the four steps we used to calculate the variance:
For each possible value, multiply
Var[v ] = ∑ (v − Mean[v ]) 2 Prob[v] the squared deviation and its
All v
probability; then, sum the products.
Four steps:
• For each possible value, calculate the deviation from the mean.
• Square each value’s deviation.
• Multiply each value’s squared deviation by the value’s probability.
• Sum the products.
Individual’s Deviation From Squared
Response v Mean[v] Mean[v] Deviation Prob[v]
For Clint 1 p 1−p (1−p)2 p
Not for Clint 0 p −p p 2
1−p
Experiment 2.3: Opinion Poll with Sample Size of 2 – Another Unrealistic but
Instructive Experiment
v1 + v2 1
Estimated Fraction of Population Supporting Clint = EstFrac = = (v1 + v2 )
2 2
30
Population
Sample
Individual 1
for Clint?
Individual 2
for Clint?
o Since the first card drawn is replaced, the second stage of the
experiment is also identical to the previous experiment:
Mean[v2] = Mean[v] = p
1
Focus on Mean[ (v1 + v2 )] and apply the arithmetic of means:
2
Mean[cx] = c Mean[x] Mean[x + y] = Mean[x] + Mean[y]
↓ ↓
1 1 1
Mean[ (v1 + v2 )] = Mean[(v1 + v2 )] = [Mean[v1 ] + Mean[v2 ]]
2 2 2
Mean[v1] = Mean[v2] = p
1
= [ p + p]
2
Simplifying
1
= [2 p ]
2
= p
32
Now focus on the covariance of v1 and v2, Cov[v1, v2]. The covariance tells
us whether the variables are correlated or independent. When two variables are
correlated their covariance is nonzero; knowing the value of one variable helps us
predict the value of the other. On the other hand, when two variables are
independent their covariance equals zero; knowing the value of one does not help
us predict the other.
In this case, v1 and v2 are independent and their covariance equals 0. Let us
explain why. Since the first card drawn is replaced, whether or not the first voter
polled supports Clint does not affect the probability that the second voter will
support Clint. Regardless of whether or not the first voter polled supported Clint,
the probability that the second voter will support Clint is p, the actual population
fraction:
p = ActFrac = Actual Fraction of the Population Supporting Clint
Number of Clint Supporters in theStudent Body
=
Total Number of Students in theStudent Body
More formally, the numerical value of v1 does not affect v2’s probability
distribution and vice versa. Consequently, the two random variables, v1 and v2, are
independent and their covariance equals 0:
Cov[v1, v2] = 0
33
1
Focus on Var[ (v1 + v2 )] and apply all this:
2
Var[cx] = c2Var[x] Var[x + y] = Var[x] + 2Cov[x, y] + Var[y]
↓ ↓
1 1 1
Mean[ (v1 + v2 )]
2
=
4
Var[(v1 + v2 )]
4
= [
Var[v1 ] + 2Cov[v1 + v2 ] + Var[v2 ]]
Cov[v1, v2] = 0
1
=
4
[
Var[v1 ] + Var[v2 ] ]
Var[v1] = Var[v2] = p(1 − p)
1
=
4
[
p (1- p ) + p (1- p ) ]
Simplifying
1
=
4
[
2 p (1- p ) ]
p (1- p )
=
2
Econometrics Lab 2.5: Polling – Checking the Mean and Variance Calculations
As before, we can use the simulation to check the equations we just derived by
exploiting the relative frequency interpretation of probability. When the
experiment is repeated many, many times, the relative frequency of each outcome
equals its probability. After many, many repetitions, the distribution of the
numerical values from all the repetitions mirrors the probability distribution:
Distribution of the
Numerical Values
⏐ After many,
⏐ many
↓ repetitions
Probability Distribution
Applying this to the mean and variance:
Mean of the Numerical
Variance of Numerical Values
Values
⏐ After many, ⏐
⏐ many ⏐
↓ repetitions ↓
Mean of Probability Variance of Probability
Distribution = p p (1- p )
Distribution =
2
34
Be certain the simulation’s Pause checkbox is cleared. Click Start and then, after
many, many repetitions, click Stop.
1
Actual Population Fraction = ActFrac = p = = .50
2
Equations: Simulation:
Mean of Variance of
Mean of Variance of Numerical Numerical
EstFrac’s EstFrac’s Values of Values of
Sample Probability Probability Simulation EstFrac from EstFrac from
Size Distribution Distribution Repetitions the Experiments the Experiments
1 1
2 = .50 = .125 >1,000,000 ≈.50 ≈.125
2 8
Table 2.3: Opinion Poll Simulation Results – Sample Size of Two
The simulation results suggest that our equations are correct. After many, many
repetitions, the mean (average) of the numerical values equals the mean of the
probability distribution, .50. Similarly, the variance of the numerical values equals
the variance of the probability distribution, .125.
35
In this chapter, we extended the notions of mean, variance, and covariance that we
introduced in Chapter 1 from data variables to random variables. The mean and
variance describe the distribution of a single variable. The mean depicts the center
of a variable’s distribution; the variance depicts the distribution spread. In the
case of a data variable, the distribution is illustrated by a histogram; consequently,
the mean and variance describe the center and spread of the data variable’s
histogram. In the case of a random variable, mean and variance describe the
center and spread of the random variable’s probability distribution.
Covariance quantifies the notion of how two variables are related. When
two data variables are uncorrelated, they are independent; the value of one
variable does not help us predict the value of the other. In the case of independent
random variables, the value of one variable does not affect the probability
distribution of the other variable and their covariance equals 0.
1
The mean and expected value of a random variable are synonyms. Throughout
this textbook we shall be consistent and always use the term “mean.” You should
note, however, that the term “expected value” is frequently used instead of
“mean.”
2
Note that the v in Mean[v] is in a bold font. This is done to emphasize the fact
that the mean refers to the entire probability distribution. Mean[v] refers to the
center of the entire probability distribution, not just a single value. When v does
not appear in a bold font, we are referring to a specific value that v can take on.