
Class 2
Hypothesis Testing

EE Example
Showroom not a problem if service time to non-customers < 5760 seconds
Our sample mean of 100 people was 4880 seconds
Why sample? Because it's not feasible to survey the entire population of 4,000
BUT that's just a sample mean - if we polled 100 other people, we'd probably get something different.
So what we want is some assurance that the average of the sample means (i.e., the population mean) does not exceed 5760 seconds.
What we did in Bootcamp was create a confidence interval that showed the range of sample means we are likely to obtain IF the population mean is 5760.
A 95% confidence interval demonstrates the range of sample means we're likely to obtain 95% of the time if the population mean is 5760.
So the 95% confidence interval = 5,760 ± t(alpha, df)(s_x̄)
Alpha = % of observations in the tails
Where t = the t-statistic that corresponds to alpha (what t-value corresponds to the number of sdev above or below the mean?)
df = number of degrees of freedom (# of values that are free to vary - just think number of observations − 1)
s_x̄ = standard error of the mean
= standard deviation of the sample / √(sample size)
The t-value associated with 5% of the observations lying in the tails with 99 degrees of freedom was 1.98
The standard error of the mean was 261
Hence, 95% confidence interval = 5,760 ± 1.98(261).
What does the 1.98 mean? It means that 95% of the observations lie within 1.98 standard deviations of the mean.
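(A quick sketch of that same interval, using Python/scipy in place of the Excel t-table; the 5,760, 261, and 99 degrees of freedom are the numbers from the notes.)

```python
# Sketch of the 95% confidence interval from the notes: 5,760 +/- t * 261.
from scipy import stats

hyp_mean = 5760   # hypothesized population mean (seconds)
se_mean = 261     # standard error of the mean
df = 99           # n - 1 for a sample of 100 people

t_crit = stats.t.ppf(0.975, df)    # ~1.98: leaves 2.5% in each tail
low = hyp_mean - t_crit * se_mean
high = hyp_mean + t_crit * se_mean
print(round(t_crit, 2), round(low), round(high))   # 1.98, ~5242, ~6278
```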

Hypothesis Testing
So as an example, take a murder trial. The jury is told to presume Not Guilty. It's up to the prosecutor to get the jury to reject that hypothesis with evidence.
The defense atty tries to get you to doubt the evidence. A Not Guilty verdict means the prosecution failed to sustain its burden of proof.
So now let's apply that logic to our hypothesis testing.
The hypothesis to be tested is the null hypothesis (not guilty)
The alternative hypothesis (H_A) is that the defendant IS guilty
So let's set up our hypothesis test:
H_0: defendant is not guilty
H_A: defendant is guilty
Only if the evidence exceeds the minimum burden of proof can the null
hypothesis be rejected!
BUT
Type I Error: null hypothesis is correct but the hypothesis is rejected
Type II Error: null hypothesis is incorrect but the hypothesis is not
rejected
Type I Error is more likely to occur in a civil case - the burden of proof
is lower
Type II Error is more likely to occur in a criminal case - burden of proof
is really stringent
So applying the courts to hypothesis testing
Null hypothesis is established
H_0: Service time to non-buyers ≤ 5760 seconds
If the null hypo isn't true, the alternative must be true: service time to non-buyers > 5760 seconds
Note: could you have reversed these hypotheses? Sure, but it's easier to have the null hypothesis be "if this is true, our response is status quo." If you reject the null hypothesis, we're going to have to change something.
So if we don't have evidence of a problem, we won't change anything - we'll only implement this policy of charging for help if our service time > 5760
The null hypothesis is only rejected if the evidence exceeds the
minimum burden of proof.
Test Example
Computer company thinks introducing different colored computers may increase profits. To be profitable, sales have to increase by at least 275 units/week.
New colors were test marketed over 36 weeks.

So let's set up our hypothesis test


H_0: average sales/week ≤ 275 units (not worth it)
H_0: μ ≤ 275
H_A: average sales/week > 275 units (let's do it!)
H_A: μ > 275
Hokay. So we did our little sample and got a mean of 290. Are we done?
No! We have to know the probability that we could draw a sample mean of 290 if the population mean were 275 or less
The answer is called the P-Value
To determine the P-Value, you need to know how many standard deviations 290 is from 275.
That's the t-statistic: (290 − hypothesized population mean)/standard error of the mean.
So we calculate our t-statistic: 1.76, when Excel isn't being weird.
We can find the p-value by using the Excel function T.DIST.RT (RT means Right Tail)
So X = 1.76; Deg. Freedom = 35
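(Same right-tail calculation in scipy, using the t and df from the notes.)

```python
# Right-tail p-value for the colored-computer test: the scipy equivalent of
# Excel's T.DIST.RT(1.76, 35).
from scipy import stats

t_stat, df = 1.76, 35
p_value = stats.t.sf(t_stat, df)   # area in the right tail
print(round(p_value, 3))           # roughly 0.04, i.e. about 4%
```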
What determines the p-value that implies strong evidence?
In other words, what's a reasonable doubt?
Significance level: alpha
Basically, you set it yourself at the outset: say you set it to 0.05. If your P-value is 5% or less, you reject the null hypothesis
The most popular alphas are: .10, .05, .01.
If the P-value is less than alpha, you are concluding that the likelihood of getting that sample mean if the hypothesized mean is true is small.
If the odds of drawing that sample mean under the null hypothesis are less than ten percent, you'll reject it
I will only reject the null hypothesis if that p-value is alpha or less
Note: as alpha shrinks, the test becomes more stringent and we are less likely to reject the null hypothesis. (you're moving from civil to criminal)
So going from .10 alpha to .01 alpha, you're less likely to make a Type I error (it's harder to reject the null)
Type I = you rejected the null, but you shouldn't have
BUT if you're less likely to reject the null, then by definition you're more likely to not reject it when you should have!
Now, had we used a .05 alpha, we would have rejected the null hypothesis and started producing colored computers.
BUT if we had used a .01 alpha, we would fail to reject the null hypothesis - we would not produce the colored computers. The evidence isn't strong enough to shift the status quo in favor of producing colored computers.

So what's the right answer? We don't know! It all depends on the alpha you choose - and choosing it is pretty arbitrary.
So... what's the best test? .1? .05? .01?
It depends on what the business is willing to stomach as a risk.
Going from 10% to 5% to 1% decreases the likelihood of a Type I error and increases the likelihood of a Type II error
How do you decide which one to go with? Well, it all comes down to the bottom line:
How costly is a Type I error relative to a Type II error?
So how costly would it be to manufacture colored computers that don't generate sufficient sales to cover the added costs? On the other hand, how much profit is foregone from failing to produce colored computers that would have been profitable?
Note: why do we say "fails to reject" the null hypothesis?
Think back to the courtroom analogy: innocent means you didn't do it; not guilty means the prosecution didn't sustain its burden of proof
Failing to reject the null hypothesis states that the available evidence isn't strong enough to cause one to believe the null hypothesis is incorrect
Now then, our test marketing example used a one-tailed test: we only cared about whether our number was big enough.
But there is also something called a two-tailed test
A two-tailed test is necessary when either a large value or a small value will cause the null hypothesis to be rejected.
Ex) An e-cig manufacturer believes the average age of an e-smoker is 25. If the average age is either significantly higher or lower than 25, the ad plan won't be effective - i.e., if the sample mean lies in either tail
So here is our test:
H_0: μ = 25
H_A: μ ≠ 25
Okay. Here's our data:
Sample mean = 34
Standard error of the mean = 5.1
35 degrees of freedom
t-statistic = (34 - 25)/5.1
=> 1.76
So should we reject it? Not necessarily. In the one-tailed test, everything to the right was 5%. Here, we have two tails, so we have to look for 2.5% in each tail
So for a two-tailed test, use T.DIST.2T
= 8.7%

So if alpha is 5%, then we would fail to reject the null hypothesis - the 8.7% p-value exceeds alpha, so our statistic falls short of the tail. Thus we have concluded that we don't have enough evidence to say our target market is not 25
Suppose the sample mean had been 18 and the standard error of the mean was 3.2 (with 35 df)
t = (18 - 25)/3.2 = -2.1875 (so 18 lies 2.1875 SDEV below 25)
P-value = ???
YOU CAN'T PLUG A NEGATIVE NUMBER INTO T.DIST.2T
No worries though, just plug in the absolute value (it doesn't matter because the graph is symmetrical)
So we plug that in, get a p-value of 3.5%, meaning we would reject the null hypothesis.
3.5% means that there is a 3.5% chance that we would get a sample mean of 18 if the true mean were 25. So - we don't think our target market is 25.
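(Both two-tailed p-values from this example, computed in scipy as a stand-in for T.DIST.2T.)

```python
# Two-tailed p-values for the e-cig example, equivalent to Excel's T.DIST.2T
# (which needs the absolute value of t). 35 degrees of freedom in both cases.
from scipy import stats

df = 35
for t_stat in (1.76, -2.1875):                  # first sample, then second
    p_two_tail = 2 * stats.t.sf(abs(t_stat), df)
    print(round(p_two_tail, 3))                 # ~0.087, then ~0.035
```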
Notes on HW
Write out an explanation of what you did
Consumer Packaging Example
Review
Hypothesis testing says if the mean is this, what are the odds that we
would get a sample mean of that
So the colored computer example
Our test market sample was 290, which makes us say wow, above and beyond break-even!
Buuuuuuut it's also just a sample mean.
So we calculated the P-Value: if the population mean is this, what's the probability we get this sample mean?
So that was a one-tailed test.
Our P-Value was in one tail (critical region being 5%), so we
rejected the null hypothesis.
But what about a two-tailed test? Then, the areas in the critical region
sum to 5% (each tail is 2.5%)
The example we gave was our e-cig market: if our hypothesized average was 25, we didn't care if it was 10 years older or 10 years younger - we want to know if we're advertising to the right people.
That foundation laid, let's go with a real example:
Company is considering two kinds of packaging: which will sell better?
Each type is sold in 36 demographically similar sales districts
Demographically similar is key because we need that to assume an unbiased sampling population.
If they're not demographically similar, your sample might not be representative, and so biased.
Test market lasts one month
So we plug that Data into Excel and find that our mean sales are
Pack 1: 290.54
Pack 2: 262.75
So we go with Pack 1, right? Well, not so fast - we don't know yet
Pack 1 sold better with the sample population, but would it sell better for the entire population?
Enter statistics
First things first, one or two-tailed test?
Two-tailed test: our null hypothesis would be that the sales of each package type are the same; the alternative hypothesis is they're not
H_0: μ1 = μ2
H_A: μ1 ≠ μ2
Or you could rewrite it as: H_0: μ1 - μ2 = 0
H_A: μ1 - μ2 ≠ 0
But... if you reject the null hypothesis, what do you do? All this tells us is that the two aren't the same; one is better than the other - but we don't know which one that is.
So we want to do a one-tailed test
A one-tailed test allows us to directly test whether one package type sells more than the other
Our sample mean shows higher average sales for package type 1; we want to see if the difference is statistically significant
So. Our null hypothesis is H_0: μ1 - μ2 ≤ 0; H_A: μ1 - μ2 > 0
We're saying that μ2 ≥ μ1 is our null hypothesis because if we fail to reject it, we stay with our current state of things
This test will be different though! We have two sample means and are making inferences about two population means.
The difference between the two population means is μ1 - μ2
The standard deviation of the difference is the square root of the sum of the squares of the two standard deviations (variance sum law)
Remember though: we're dealing with samples, so the standard error of the difference becomes the square root of the sum of (sample variance / sample size) for each sample
That's complicated, so we write the standard error of the difference as s_(x̄1 - x̄2)
We use the Excel function T.TEST to perform our test

Array 1: data set 1
Array 2: data set 2
Tails: how many tails do you want?
Type: 3 is usually what you want - you have two samples with different standard deviations.
So we get 1.1% as an answer
This means that if package 1 does not sell more than package 2, then we have a 1.1 percent chance of drawing sample means that show 1 selling more than 2 in the manner we did.
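(A scipy sketch of this test. The arrays below are made-up stand-ins so the code runs; the class data gave the 1.1% one-tailed p-value quoted above. Excel's T.TEST with Type = 3 corresponds to a Welch, unequal-variance t-test.)

```python
# Welch (unequal-variance) two-sample t-test, the analogue of T.TEST Type 3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pack1 = rng.normal(290, 40, 36)   # stand-in: 36 districts, package 1
pack2 = rng.normal(263, 40, 36)   # stand-in: 36 districts, package 2

t_stat, p_two = stats.ttest_ind(pack1, pack2, equal_var=False)
p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2   # one-tailed: pack 1 > pack 2
print(round(t_stat, 2), round(p_one, 4))
```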
So do we reject the null hypothesis? Well, that depends on the significance level
If our significance level was 5%, then we will reject the null hypothesis if the p-value is 5% or less. So if alpha is .05, we'd reject the null hypothesis and choose package 1
If our significance level was 1%, then we would fail to reject the null hypothesis - we would be saying that we're not persuaded that 1 is better than 2. In which case, we'd do a more extensive test market. The status quo is that you don't know.
Once again, our decision as to what significance level to use should be based on our assessment as to which type of error (Type 1 or 2) would be more costly to the firm
Type 1: Rejecting the null hypothesis when you shouldn't have.
Type 2: Failing to reject the null hypothesis when you should have.
The bar you set for evidence is always circumstance specific
If you're testing for differences in proportions, the standard error of the difference becomes √(p̄(1 - p̄)(1/n1 + 1/n2)), where p̄ is the pooled proportion
So your test statistic becomes z = (p̂1 - p̂2) / that standard error
Note that this is a z-score, not a t-statistic
When you're dealing with proportions, you need to have a good sample size (at least 30 observations)
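(A minimal sketch of that two-proportion z-test; the counts are hypothetical, just to make the calculation concrete.)

```python
# Two-proportion z-test with a pooled standard error (hence a z-score, not a
# t-statistic). The counts are hypothetical stand-ins.
import math
from scipy import stats

x1, n1 = 45, 100   # successes / sample size, group 1 (hypothetical)
x2, n2 = 30, 100   # successes / sample size, group 2 (hypothetical)

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_two_tail = 2 * stats.norm.sf(abs(z))
print(round(z, 2), round(p_two_tail, 4))
```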
Asset Returns
So the general idea is: you want a diversified portfolio so that the stocks that do well compensate for the stocks that tank.
So we've got our data in the spreadsheet; let's investigate the stability of each asset class by comparing the returns for the first 20 years with the second.
But first look at the averages: there seems to be a fair bit of difference
Since we want to know if the monthly returns have been stable in each 20-year subset, we will use the following two-tailed test:
H_0: μ1 - μ2 = 0
H_A: μ1 - μ2 ≠ 0
*Why not a one-tailed test? The two-tailed test is about showing whether or not they're the same; a one-tailed test would say which period was better. But that's not helpful for investing - we're trying to predict the future
T.TEST
The p-value is 0.628. We cannot reject the null hypothesis that returns
have been stable
Corporate Bond: really small P value - strong evidence to reject the idea
that the returns are the same
Government: 1.1% probability; depending on alpha, may or may not reject
null hypothesis
T-bills: infinitesimal number, so reject the idea that returns are the same.
Note: when do you use a one-tailed test versus a two-tailed test?
Use a one-tailed test when you want a value that is AT LEAST XXX.
Use a two-tailed test when you want a value that is EXACTLY XXX - bigger OR smaller is a problem.

Ch. 3: Intro to Regression Analysis


The Basic Basics
First off, dependent v. independent variable
A dependent variable's value depends on another variable
An independent variable is a given
Everything we've done thus far assumes that we're operating in a constant environment. If one factor changes, the distribution will shift accordingly.
We need tools to alter predictions for the dependent variable when factors that influence it change
=> Enter Least Squares Regression Analysis
Now let's think about equations a sec: they have dependent and independent variables
So in Y = 3x + 5, x is independent and y is dependent
So. Let's get this out of the theoretical.
Law of Demand: as the price of a good rises, the quantity demanded decreases.
Graphic representation is called the Demand Curve
Note: the Demand Curve is from the buyer's perspective, so the price is the independent variable and demand is the dependent.
The buyer doesn't set the price, it's a given - but based on the price, the buyer decides how much to purchase.
Now then, since the demand curve is a line, it must have an equation.
Quantity demanded is the dependent variable.
Price is the independent variable
Q = f(P), or Q = a + bP
What does a mean in the equation?
=> The amount we could give away for free: the value of Q when P = 0
What about b?
=> The change in quantity demanded resulting from a given change in price: ΔQ/ΔP
Note that the demand curve has a negative slope: when price goes up, demand goes down and vice versa. Hence we actually write the demand curve equation as Q = a - bP
a is the value of Q when P = 0 - it's the number we could just give away for free
b tells us how much Q decreases for each one-unit increase in P
Every demand curve has a line that fits the generic Q = a - bP
equation.
So suppose the demand curve equation is Q = 10 - 2P
If the goods were free, people would take 10
For each $1 increase in price, demand decreases by 2
Note that the graph is usually drawn with P on the vertical axis (blame Alfred Marshall, thanks)
So the graphed slope is given in run/rise
Also: note that quantity demanded may not be the same as quantity sold
If you stock out, all you know is what you sold, not what you could have sold (demand)
Now then, it'd be pretty great to actually have the demand equation
How do you do this?
Well, you keep track of your sales and price data and you can come up with a cluster of dots! Woo hoo! Dots!
And of course, you want to take those dots and figure out the regression line so that you can figure out your demand curve equation
Least Squares Regression: fits a line through a set of data and reports the equation
How does it work?
It comes up with a line and squares the distance of each data point from the line. Why square? Because that gets rid of the negative - we don't care if the point is above or below, we just want the size of the deviation
Then you sum the squares of the deviations, and the line with the least sum is the one that fits the data best
The lower the sum of the squares, the better the line fits the data
Hence the name least squares regression
How do you calculate the slope of the line? Through the (rather complicated) equation on slide 39 - the least squares slope is b = Σ(p_i - p̄)(q_i - q̄) / Σ(p_i - p̄)²
So what's our average price? 47.
What does p_i refer to? Each of the prices in the list. So first is p_1, then p_2, and so on. p_i just says: you have a bunch of observations and we're going to name them
So what does this do? We take each observation and determine how much it varies from p̄.
So let's do it in Excel
Known y's: all the dependent variables
Known x's: all the independent variables
Const.: TRUE
Stats: blank
Note: get the formula, plug in 50, and get 907. Okay, that's not the exact number you'd sell at that price (the line isn't a perfect fit), but it is what you'd expect to sell on average
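(A numpy sketch of the same fit-and-predict step. The price/units arrays are made-up stand-ins; the class data gave a point estimate of about 907 units at P = $50 and a slope near -14.9.)

```python
# Least squares fit in numpy, standing in for the Excel steps above.
import numpy as np

rng = np.random.default_rng(1)
price = rng.uniform(30, 70, 40)                        # stand-in prices
units = 1650 - 14.9 * price + rng.normal(0, 40, 40)    # stand-in unit sales

slope, intercept = np.polyfit(price, units, 1)         # fits Q = intercept + slope*P
point_estimate = intercept + slope * 50                # conditional mean at P = $50
print(round(slope, 1), round(intercept), round(point_estimate))
```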
Now, we're only going to focus on the dots that correspond to $50, and we're going to imagine they're 3D. We would see a normal distribution.
On the horizontal axis, we're focusing on unit sales ONLY when P = $50. When we charge 50, how much will we sell?
The distribution says, on average, we'll sell 907.
Sometimes you're going to sell more, sometimes fewer
Now our regression estimate (the point estimate) is the average of unit sales when the price is $50.
What if P = $40? Now we're going to get an average of 1056 sold.
The point estimate is 1056 - when the price is $40, you will sell an average of 1056 - the entire distribution has shifted
So. We also call the point estimate the conditional mean: the mean on the condition that the price is $40, or whatever you choose
The average value of the dependent variable given the independent variable
Whoa. More vocab
Standard error of regression: the equivalent of the standard error of the mean
Remember, standard deviation is a generic measure of spread. So lots of times we have subdivisions: the standard deviation of this thing in particular
So when we have the distribution of unit sales at a given price - say P = $50 - we want to know: 95% of the time, how much can we expect to sell at that price?
What do you need to know to make that calculation?
Basically you're constructing a confidence interval
So you'd need to know the standard deviation - and that's the standard error of regression.
Now then, remember there's the whole issue with our population of interest and the sample? Because it's rarely practical to sample the entire population.
And of course the problem there is that you're probably not going to get the same sample mean as the population mean
But what we do know is that if you took repeated samples from the same population and averaged THOSE means, then you would get the mean of the population
But... c'mon, we're not going to do that either - we're only going to take our one sample mean and draw inferences from it.
So. Again, we're going to be using samples to run our regressions most of the time.
What do we do? Well, suppose we believe the population slope is -10.
The 95% confidence interval would be -10 ± t(alpha/2, n-2)(s_b)
Now, if we want to get our confidence interval, we need to know the standard deviation and the t-value corresponding to a 95% confidence interval
The standard deviation here is the standard error of the coefficient (s_b)
The t-statistic means that when you have 29 degrees of freedom, 95% of observations will lie within 2.04 standard deviations of the mean.
So our confidence interval is -10 ± (2.04)(1.175)
That means that if the population slope were -10, 95% of the time a sample slope would be between -12.39 and -7.61
-12.39 ≤ b ≤ -7.61 is the correct way to write it
BUT we got -14.9 - that's a number outside our range
So we reject our null hypothesis
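(The same interval and test in scipy, using the -10, 1.175, and 29 degrees of freedom from the notes; small differences from the values above are just rounding.)

```python
# Recomputing the slope test: hypothesized slope -10, standard error of the
# coefficient 1.175, 29 degrees of freedom, observed sample slope -14.9.
from scipy import stats

hyp_slope, se_b, df = -10, 1.175, 29
t_crit = stats.t.ppf(0.975, df)                    # ~2.045
ci = (hyp_slope - t_crit * se_b, hyp_slope + t_crit * se_b)
print([round(x, 2) for x in ci])                   # about [-12.40, -7.60]

sample_slope = -14.9
t_stat = (sample_slope - hyp_slope) / se_b
print(round(t_stat, 2), sample_slope < ci[0])      # ~ -4.2, True -> reject
```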
t(alpha/2, n-2):
t tells us how many standard deviations from the mean a number is
So this is the t-value that corresponds to alpha over two
Alpha = 5%; that means the area in the tails = 5%
Alpha/2 means that we're splitting alpha between the tails - each tail gets 2.5%
So we're looking for the corresponding t-value for a 95% confidence interval
n-2 corresponds to the number of degrees of freedom
The t-value differs depending on the number of degrees of freedom
Why n-2?
If you had one observation, you could draw infinite lines through that point
If you had two observations, you could draw one
In order to draw the line, you need two points
If you have three, you could draw two lines
That's why degrees of freedom = n-2: you need at least 2 points; anything beyond that gives you the number of degrees of freedom
s_b = standard error of the coefficient.
KStat gives us the info we need to construct a confidence interval
Standard error of the coefficient? Check
It even gives us the t-value for the 95% confidence interval
95% of the sample slopes will be within 2.05 standard deviations of the population slope
Now, we could also do this same calculation directly with the t-distribution
The t-distribution with 29 degrees of freedom is a generic statistical fact - it's true in any situation with 29 degrees of freedom
So we could also compute a t-statistic to find how many standard deviations -14.9 is from -10
(our number − what we think the population slope is) / standard error of the coefficient
(-14.9 - (-10))/1.17 = -4.19
So this tells us that if the population slope is -10, then -14.9 lies 4.19 standard deviations below -10
But the t-distribution says we need it to be within 2.04 standard deviations
So again, we reject the null hypothesis
We could also set this up as a hypothesis test: (Note: β = population slope)
H_0: β = -10
H_A: β ≠ -10
Now, most statistical software incorporates a default hypothesis that the population slope is zero.
All the data conveyed to you is based on the assumption that your hypothesis is that the population slope is 0
A slope of 0 means there's no correlation between the variables: if X changes, Y doesn't change (b = Δy/Δx = 0 because Δy = 0)
So then that means the alternative hypothesis is that they ARE related
So, statistically, what you're really saying is:
H_0: β = 0
H_A: β ≠ 0
If the two were unrelated, what would it look like?
Dots would be scattered all over with no pattern
If you were to run a regression on that, the line would be completely horizontal
y = a + 0x
So really, y = a
But... how do you calculate a?
It's the average of the values for y
So in our GMAT score v. height example, no matter what the height, your y-intercept would just be the average GMAT score.
BUT we don't know the population regression!
We're reaching into the population, pulling out a sample, and making our observation
Our sample equation is going to be a little bit different
Perhaps our sample says GMAT = 500 + .08(height)
But... what if it says GMAT = 580 - 1.2(height)?
Which one is correct?
Well, knowing that the correlation is 0, we can say neither - they both came from a population with a correlation of zero
Those numbers mean nothing - each is just a number that's a little different from the true slope
Which brings us back to our original problem: we don't know the true slope
BLURGH!
This is why our default hypothesis is that the population slope = 0
In most applications, the purpose of the study is to see if X and Y are related
If the evidence is strong enough, we can conclude that the no-relationship theory is unlikely to be correct; that X and Y are related
When you run KStat, the regression shows a t-statistic (called a t-ratio) based on an implicit assumption that the population slope = 0
What does -12.68 mean? That tells us that a slope of -14.9 lies 12.68 standard deviations below zero
That's pretty good evidence that we should reject the hypothesis that Price and Quantity are unrelated.
Makes sense: if we rejected the notion that -14.9 was too far from -10, then it's REALLY too far from 0
Remember, 95% of sample slopes lie within 2.04 standard deviations of 0. If our number is 12 standard deviations away, then that's probably not a good assumption.
Significance is our P-value: the likelihood that you'd get a number that's 12.68 standard deviations from 0. And the answer is virtually 0.
What's that beta-weight number?
Similar to the slope coefficient, except it measures changes in terms of standard deviations.
Every time you increase price by one standard deviation, the demand falls by 0.92 standard deviations
We could ALSO set this up as a one-tailed hypothesis test.
KStat assumes a 2-tailed test with H_0: β = 0
In the case of the price coefficient, that null hypothesis will be rejected if the coefficient is far enough from zero in either direction, positive or negative
But c'mon, the secret to selling more is never to raise the price: then a one-tailed test would be better
So
H_0: β ≥ 0
H_A: β < 0
In other words, our null is that there's either no relationship or a positive relationship.
Because we're still relying on a hypothesized population slope of 0, the t-statistic is still -12.68
But we'd rely on a one-tailed test via T.DIST.RT
Recall that the number T.DIST.RT returns is the area in the right tail, hence the area to the left is 1 minus that. This says that the probability of getting a sample slope that lies 12.68 standard deviations below zero, if the population slope were greater than or equal to zero, is virtually zero.

Autorama Case

So we've got our data in the data page; we run univariate statistics - what do we see?
Means:
Income: 60,359 (range 18,900 to 101,300)
Price: 19,522 (range 5,100 to 19,650)
Now let's make a scatterplot, then run the regression
So what's our regression equation?
Constant: 5,788
Income: 0.2275
So P = 5,788 + 0.2275(Income)
That means each extra dollar in income is associated with a $0.2275 increase in the price of the car
Or, in terms of something that's actually useful, each $1,000 increase in income is associated with a $227.50 increase in the price of the car
So our sample regression suggests a positive relationship between income and car prices, but how confident are we that this is really the case?
H_0: income and car prices are unrelated
H_A: income and car prices ARE related
So,
H_0: income slope coefficient = 0
H_A: income slope coefficient ≠ 0
We'll use a 5% significance test to render a conclusion (hey, it's the default in KStat)
Now then, KStat tells us that with 98 degrees of freedom, 95% of the sample slopes should lie within 1.98 standard deviations of the hypothesized population slope of zero (that's the t-statistic)
The t-ratio is 9.0759 - so assuming that the population slope is zero, a slope of 0.2275 is 9.07 standard deviations above zero
Moreover, our p-value says that the likelihood of getting a sample slope of 0.2275 when the population slope is 0 is almost 0%.
Therefore, we conclude that the car price IS related to income.

Now, what if wed done a 1% significance test?


Kstat gave us the 5% test, but how would you go about doing a
1% test
So to find the critical value for a 1% test, you want to use T.INV.
Probability = 0.01; Deg_freedom = 98
And we get 2.63
That means 99% of the time, the sample slope will be within
2.63 standard deviations of the mean. And again, ours was still
way off the map.
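(Same critical value in scipy.)

```python
# Critical value for a two-tailed 1% test with 98 degrees of freedom,
# the analogue of Excel's T.INV.2T(0.01, 98).
from scipy import stats

t_crit_1pct = stats.t.ppf(1 - 0.01 / 2, 98)   # leaves 0.5% in each tail
print(round(t_crit_1pct, 2))                  # ~2.63
```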
BUT we know that 0.2275 is only our sample slope - for every $1,000 increase in income, people spend $227.50 more on cars
But that's only a sample slope; we don't know the population slope
So let's get a confidence interval
= sample slope ± (t-value corresponding to 98 deg freedom)(standard error of the coefficient)
= 0.1778 to 0.2772
So each $1 increase in income is associated with an increase in the price of the car the customer buys of between $0.1778 and $0.2772
So each $1,000 increase leads to an increase between $177.80 and $277.20
Now, we can use our tools to construct a confidence interval around the point estimate (the average number of units sold when P = $50 or whatever)
To figure out the plus-or-minus part of the confidence interval, it should be 907 ± something
That something requires that we know the t-statistic and the standard deviation of the distribution - we call that the standard error of the estimate
s_y
Aka standard error of the estimated mean
Introducing KStat Prediction: the yellow boxes are places you can plug in numbers
So Value for Prediction is where you can put in the price to get the point estimate of the quantity sold
The standard error of the estimate is reported to you
If you want to do a 95% confidence interval, you can do that. You can change it too - but 95 is the default.
So it gives us the t-statistic
Let's do our confidence interval:
907 ± (t-statistic)(standard error of the estimated mean)
What's the difference between confidence and prediction intervals??
Confidence interval: the range of values within which the population conditional mean lies
In contrast, the prediction interval contains the values within which an observation selected at random might fall.
If you were to reach into that population and pull a single observation out at random, what range would you expect to see?
How do you calculate it? Using the standard error of prediction
Now, the prediction interval is going to be wider - it makes sense if you look at the equation.
So... if the point estimate is the same, why does it matter if the dots are scattered or clustered?
Well, that's where goodness of fit comes in
Remember, the point estimate is a conditional mean: on average, this is the value of the dependent variable when the independent variable takes on a given value
In other words, when P = $50, the number of units demanded will, on average, be equal to 93
BUT in comparing the two scatterplots, which demand schedule will show a greater variation around the mean?
The one where the dots are more scattered.
Now, least squares regression will always find the best line - but simply looking at the equation doesn't tell you which equation is a better fit for the data.
Goodness of Fit
Most commonly used measure of goodness of fit is called the coefficient of determination
AKA R-Squared
Measures the percentage of the total variation in the dependent variable that can be explained by the variation in the independent variable.
Say wha?
So the total variation is the difference between the actual number and the average
The explained variation is the difference between the point estimate (the fitted value) and the average
So, going back to our definition, there is a variation of 156 between the number and the mean; we can explain 150 of that variation.
But... we were still off by six units
That is the unexplained variation.
The unexplained variation might reflect other independent variables that influence the quantity but for which we do not have the data
It could also be random error
For example, are your monthly cell phone minutes exactly the same every month?
Sometimes you can explain a decrease: yeah, it's low because I was in Europe for three weeks.
But sometimes it's just random.
BUT remember: the coefficient of determination is a percentage
So take the explained variation and divide by the total variation.
Here, we get 96%
The coefficient of determination performs that operation on squared deviations between each observation and the mean (total variation) and between the fitted value and the mean (explained variation)
R^2 = (explained variation of all observations)/(total variation)
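(A tiny sketch of that ratio; x and y are hypothetical data, and the fitted values come from an actual least-squares fit so the explained/total decomposition holds.)

```python
# R^2 from the definition above: explained variation over total variation.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])         # hypothetical independent variable
y = np.array([120.0, 104.0, 95.0, 71.0, 62.0])  # hypothetical dependent variable

b, a = np.polyfit(x, y, 1)                      # fit y = a + b*x
y_hat = a + b * x                               # point estimates (fitted values)

total_var = np.sum((y - y.mean()) ** 2)             # total variation
explained_var = np.sum((y_hat - y.mean()) ** 2)     # explained variation
print(round(explained_var / total_var, 3))          # R^2
```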
And hey look: KStat reports the coefficient of determination
So if it says 84.72%, what does that mean?
84.72% of the variation in quantity can be explained by the variation in price.
What does this NOT mean?
This does NOT mean that your prediction for the quantity will be correct 84.72% of the time.
Now then, R^2 ranges between 0 and 1. R^2 = 1 means the independent variable can explain 100% of the variation. In other words, every dot in the scatterplot is exactly on the regression line.
R^2 = 0 means a variation in the independent variable explains nothing in terms of the variation in the dependent variable.
So think about our GMAT and height relationship: changing the height explains none of the variance between the mean GMAT and the observed score.
Now, R^2 must be adjusted for degrees of freedom
Let's go back to our GMAT = a + b(Height)
Suppose we only had two observations: (470, 72) and (626, 61). Well, our R^2 would be 1! Duh. That's because, mathematically, if you have only two points you don't have any degrees of freedom, so you're going to get a perfect R^2 of 1 - both observations lie on the line.
Now if we added a third observation, it's less likely that all three are exactly on the line. By definition, two of them will be on the line. So we have one degree of freedom.
Now, as you add degrees of freedom, R^2 gets smaller and smaller.
That's why we rely on the adjusted R^2: plain R^2 is biased
We account for the number of degrees of freedom by using the adjusted coefficient of determination
It's generally going to be a little less than the regular R^2 (makes sense when you think about it)
Now. Don't be deceived by a high R^2!!!
It may not always be indicative of useful information.
Ex) You're a meteorologist and predict rainfall. Which model will give you the highest R^2?
1. Rainfall = a + b(Dew Point)
2. Rainfall = a + b(Relative Humidity)
3. Rainfall = a + b(# people with umbrellas)
Well obviously 3 is going to give you the highest R^2, but that's not really helpful for predicting rainfall, is it?
Back to our data set
What if you truncate the data by lopping off three zeros?
Sure, go ahead - just be sure to put those back on when you're interpreting your results, and when you're plugging values back into the equation, keep the zeroes off.

Newspaper Case

Startup costs are $2M to add a Sunday edition; fixed operating costs per annum = $1M
Company needs to forecast Sunday circulation to determine if the investment is worthwhile.
Break-even = 260,000
What does our adjusted coefficient of determination of 86.08% mean?
It means that 86.08% of the variation in Sunday circulation can be explained by the Daily circulation
Thus, in order to project our Sunday circulation, we need to know our daily circulation
So let's say it's 190,000
BUT watch out! Plug it into your equation in the same form you'd enter the data (so here, lop off those three zeroes)
Well, and so you plug it in and get a great number, but remember, that's just a sample.
Take a look at your confidence and prediction intervals
They show just how risky this is.
Adjusted coefficient of determination: 86.08% of the variation in Sunday edition circulation can be explained by knowing the Daily edition circulation.
Equation: Sunday = 24.76 + 1.35(Daily)
So... what do we project for Sunday when Daily is 190,000?
First of all, you have to plug in 190, because we divided all the data points by 1000 when we entered them.
So we plug in 190 and get 281.486, which is really 281,486.
Well, our point estimate is 281,486 and our break-even is 260,000. So that means we're good, right? Maybe
281,486 is our average sample circulation - based on our estimated sample slope. On average, if daily circulation is 190, then the Sunday circulation will be about 281
The reality is, the number could be different from that. And our 95% confidence interval says the true conditional mean is between 214 and 348.
Now then. A confidence interval is trying to predict the true conditional mean.
A prediction interval takes it one step further: it says, I'm not interested in what the average newspaper's circulation is, I want to know what MY newspaper is going to get. If I were to reach into this distribution, what range of values am I likely to see?
It'll be bigger, because you're looking for a single number, not an average, so the range will be greater.
And you plug that in - and get -18 and 581 (or rather, 0 to 581, since you can't sell negative newspapers)
Eesh. So that's ugly. What if you tried 90%? That's a narrower interval, and still pretty good confidence.
We can also set this up as a hypothesis test.
H_0: Sunday loses money
H_A: Sunday does not lose money
But let's refine that even further - if we need 260 and we sell 190 daily, let's make our hypotheses:
H_0: with a daily circulation of 190, the average newspaper will not sell enough Sunday papers to make a profit
H_A: with a daily circulation of 190, the average newspaper WILL sell enough Sunday papers to make a profit
OR
H_0: given x = 190, y ≤ 260
H_A: given x = 190, y > 260
Now we have our hypotheses. We want to know how far our point estimate of 281 is from 260. We have to measure this in terms of standard deviations, so we need a t-statistic
Take (your point estimate − 260 (break-even)) / (standard error of the estimated mean)
SO. (281 − 260)/33.156
t = 0.648
That means that our number is 0.648 standard deviations above 260.
How do we convert that to a probability?
Use T.DIST.RT.
Why a one-tailed test? Because our null hypothesis is that Sunday sales are less than or equal to break-even.
A two-tailed test would also reject when we sell way more than break-even, which isn't a problem.
So we plug it into the box: t-value = 0.648; Deg_Freedom = 33 (35 observations − 2 coefficients = 33)
And we get 26% - there's a one-in-four chance that you'd get a sample like this if Sunday papers lose money.
For us to feel really good about this, our p-value would have to be less than five percent - it's not even close to that.
But - this result refers to the average newspaper whose daily circulation is 190. What about OUR newspaper? What are the odds that one of those newspapers, drawn at random, will make a profit? That's going to change our hypothesis test a bit.
So. Given daily = 190, what is the probability that one observation drawn at random would be profitable?
H_0: with a daily circulation of 190, an individual newspaper will not sell enough Sunday papers to make a profit
H_A: with a daily circulation of 190, an individual newspaper will sell enough Sunday papers to make a profit
H_0: given x = 190, y_i ≤ 260
H_A: given x = 190, y_i > 260
Our t-value here is going to be different. Same numerator (predicted minus break-even), BUT because we're pulling out a single observation, the denominator is the standard error of prediction
So (281 − 260)/147 = 0.146
That means our number, 281, lies 0.146 standard deviations above 260.
Let's find out the odds of getting a number like that
We plug it into T.DIST.RT and get 44%: a 44% chance we'd draw a sample like this even if an individual paper doesn't make money.
We cannot reject the null hypothesis.
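(Both newspaper tests in one scipy sketch, using the numbers from the notes: point estimate 281.486, break-even 260, standard error of the estimated mean 33.156, standard error of prediction 147, 33 degrees of freedom.)

```python
# One-tailed tests for the average paper and for an individual paper.
from scipy import stats

point_est, break_even, df = 281.486, 260, 33
for label, se in [("average paper", 33.156), ("individual paper", 147)]:
    t_stat = (point_est - break_even) / se
    p_right = stats.t.sf(t_stat, df)         # like Excel's T.DIST.RT
    print(label, round(t_stat, 3), round(p_right, 2))
# ~0.648 / 26% and ~0.146 / 44% -- neither clears a 5% alpha, so we cannot
# reject the null hypothesis in either case.
```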
What have we done?
We need to sell 260 to profit
Based on a daily circulation of 190, our point estimate is 281 - on average, that's how many Sundays you'd sell.
90% of the time, a random paper would have a Sunday circulation of 31 to 531 when Daily = 190
The odds of drawing a sample like this if the average paper loses money: 26%
The odds of drawing a sample like this if an individual paper loses money: 44%

Multiple Regressions
Practice Set
Economic theory assumes that the quantity demanded depends on more than just price
When we focus on price, we're assuming that those other factors are constant
A demand curve focuses on the relationship between price and quantity demanded, if all other factors are held constant!
Now - suppose a major competitor raised his price - how will that affect the demand for your good?
It'll shift your curve to the right - you'll likely sell more at the same price.
So this is our demand curve - but is it a single demand curve, or two demand curves? Well, let's circle the ones sold given our price AND assuming our competitor was charging 50 (spoiler: it's the lower line)
In other words, when the competitor charged $50, our demand curve was the lower line, but when the competitor raised his price to $60, our demand curve shifted to the right, causing us to sell more. And look, their slopes are even roughly comparable.
BUT initially we didn't report the competitor's price - all regression sees is the price we charged and the quantities we sold. Regression found a single line to best fit the data: right down the center.
So the regression line we had splits the center. If you were to do a bunch of point estimates, you'd get points along the regression line.
Now, based upon our regression equation, if we charged 40, we could expect to sell 112.
BUT... if we account for the fact that our competitor is charging $50, our point estimate is going to be too high - we're overestimating what we'll actually sell.
Likewise, if our competitor is charging $60, we'll be underestimating what we'll actually sell.
What's the point? We didn't tell the computer that there's another variable involved, so the regression line gives us a number - and we can guarantee that it's not the best number.
SO. If we're systematically overestimating when the competitor charges $50 and underestimating when the competitor charges $60, we have a problem.
When we look at our data, we can see that when the competitor charges $60, we really did get too low an estimate, and too high an estimate when the other guy charges $50.
Blurgh it all? Not so fast.
Thus far, we've used a single independent variable in our equation: simple linear regression: Q = a - bP
But if there's more than one variable that's important, we need to include more than one variable in our regression.
Using only one variable when you should use two results in an unreliable demand curve. That just won't do!
Failure to include an important independent variable is referred to as omitted variable bias.
Enter multiple regression.
If a $10 increase in the competitor's price led to a 10-unit increase in unit sales, how much would unit sales increase if the competitor's price increased by $5?
Well, logic would say 5 units.
And a $1 increase would lead to a 1-unit increase.
Basic equation: Q = a - b1P + b2CP
b1: the amount our quantity changes when our price changes
b2: the amount our quantity changes when our competitor's price changes.
When we run the regression on our data set, we get
Q = 115 - 1.38P + 0.94CP
And what does our adjusted R^2 mean?
Given 17 degrees of freedom (17 because we now have three coefficients), 96.3% of the variance in quantity demanded can be explained by knowing the price we charge and the price the competitor charged.
What was the R^2 when we only included our own price? 78%!
Wow! We just added an additional 18.3% of explained variation.
What's the remaining 3.7% or so?
Might involve another variable
Might just be random variance.
Okay. How do we do this in Excel?
Data Analysis > Regression
The Y range is your dependent variable; the X range holds the independent variables
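(A numpy sketch of the same multiple regression. The data are stand-ins generated from the fitted equation in the notes, Q = 115 - 1.38P + 0.94CP, plus noise.)

```python
# Multiple regression via least squares: intercept plus two independent
# variables (our price and the competitor's price).
import numpy as np

rng = np.random.default_rng(2)
price = rng.uniform(30, 60, 20)
comp_price = rng.choice([50.0, 60.0], size=20)
qty = 115 - 1.38 * price + 0.94 * comp_price + rng.normal(0, 2, 20)  # stand-in data

X = np.column_stack([np.ones(len(price)), price, comp_price])  # intercept + 2 vars
coef, *_ = np.linalg.lstsq(X, qty, rcond=None)
print(np.round(coef, 2))   # roughly [115, -1.38, 0.94]
```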
Now, we get all these lovely slope coefficients that say when price goes up, quantity sold goes down. Well, one thing we might want to test is whether it's really true that we have a negative slope.
So Excel gives you a t-statistic that assumes that what you're trying to test is whether the two variables are related to each other - it shows how far the coefficient is from 0.
And if you look at ours, we see t-statistics of 17, -19, and 9.5 - so it is EXTREMELY unlikely that there is no relation.
And look at the P-values - almost 0% chance that there's no relation.
Here, this seems like common sense - but in other applications, it might not be so intuitive. That's why multiple regression, t-statistics, and p-values are so helpful - we can see if the variables are truly related to quantity or whatever it is you're analyzing.

Qualitative Variables
We know that regression calculates coefficients based on numerical data
Quantity
Price

BUT location isn't a number - it's one or the other
What if we were to ignore location and estimate Q = a - b1(Price)?
Not a good idea - the regression line would just split the difference between the two demand curves
So that would be omitted variable bias
No good
What should we do then? We will assign them a number (called dummy variables)
First: what NOT to do is give Gilbert a 1 and Scottsdale a 2 and run a regression
Nope, can't do that - regression is going to scale the numbers, meaning 2 x Gilbert = Scottsdale
Okay, here's what you DO do:
Select an attribute.
Ex) Store is in the Gilbert location.
So you go down and assign a 1 to every observation that has the attribute
Assign a 0 to every observation that does not have the attribute (Scottsdale)
Now then we get an equation: Q = 11,721 - 629PRICE - 1833LOCATION
How do you estimate the quantity demanded in Gilbert if P = $13?
Plug in a 1 for Gilbert (remember, that was the original coding)
So Q = 11,721 - 629(13) - 1833(1) = 1711
And for Scottsdale?
Plug in a 0 for location
Q = 11,721 - 629(13) - 1833(0) = 3544
So this is in accordance with our scatterplots: we see that we are in fact selling less in Gilbert than in Scottsdale
How much less? 1833 less.
Note that when we estimated sales in Scottsdale, we plugged in a 0 for location.
Because we plugged in 0 for the dummy variable, we got the same result as if we had used the formula Q = 11,721 - 629PRICE
So what must be true about Q = 11,721 - 629PRICE?
In essence, this is the regression line for Scottsdale!
-1833(LOCATION) is the adjustment you make for Gilbert. You're always subtracting 1833 from the Scottsdale point estimate
So this dummy coefficient means that, all else being equal, 1833 fewer units are sold at the Gilbert location than at the Scottsdale location.
More generally, it compares the observations that have the attribute with those that do not have the attribute.
So REALLY the equation is Q = 11,721 - 629PRICE - 1833GILBERT
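(A tiny sketch that just plugs the notes' fitted dummy-variable equation into a function, reproducing both point estimates.)

```python
# Point estimates from Q = 11,721 - 629PRICE - 1833GILBERT
# (GILBERT = 1 for Gilbert, 0 for the default location, Scottsdale).
def predict_units(price, gilbert):
    return 11_721 - 629 * price - 1_833 * gilbert

print(predict_units(13, gilbert=1))   # Gilbert:    1711
print(predict_units(13, gilbert=0))   # Scottsdale: 3544
```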
BUT are unit sales really lower at the Gilbert location?
After all, we're relying on sample data - we don't know about the population...
To test if unit sales at the Gilbert location are different from Scottsdale, we need to run a t-test
The t-statistic measures how far your sample slope is from the hypothesized population slope, measured in standard deviations
Now then, most statistical software assumes that the hypothesized population slope is zero. In other words, we're asking whether the LOCATION coefficient is really different from zero - whether location is related to unit sales at all.
So. We have 9 degrees of freedom (12 observations minus the 3 coefficients that must be estimated)
And if you look, we have a t-statistic for computing 95% confidence intervals.
t-statistic = 2.262
That means that, if the true population slope were zero (no difference between the locations), then 95% of the time the sample slopes would be within 2.262 standard deviations of 0
And look at the t-ratio reported under the regression: we have -11.2574. That means that if the population slope were zero, a slope of -1833 would lie 11.2574 standard deviations below zero.
And look at the significance (that's really the P-value) - it's 0.0001%. Those are our odds of drawing that sample slope if the population slope is zero. If, in fact, location does not matter, then the odds of getting a number that lies 11.26 standard deviations below zero are 0.0001%.
So this is very powerful evidence to reject the null hypothesis that Gilbert and Scottsdale are equivalent.

And what if we'd reversed how we coded the dummy variable?
Well, then the slope would be +1833 and it would lie 11.26 standard deviations above 0.
The default part of the equation, intercept - 629PRICE, is now the demand curve for Gilbert
And then you have to add the 1833 to get the Scottsdale number
BUT the intercept is also going to change. Think about it: you're now running the regression with Gilbert as the default (location = 0), and you adjust from Gilbert to get to Scottsdale
We know that Gilbert's intercept is 1833 less than Scottsdale's
So that means the intercept will be 9,888, not 11,721!
BUT does it make a difference? Nope - try calculating the Gilbert location at P = 13
L = 0: Q = 9,888 - 629(13) + 1833(0) = 1711
And Scottsdale? Q = 9,888 - 629(13) + 1833(1) = 3544
Amazing! Math! Numbers!
Reversing the coding doesn't change the point estimates
It only changes the identity of the default attribute.
Now then, what if we created a GILBERT variable (1 = Gilbert, 0 = not Gilbert) and a SCOTTSDALE variable (1 = Scottsdale, 0 = not Scottsdale)?
What equation would you get?
Q = a - b1P + b2GILBERT + b3SCOTTSDALE
And when you plug it in?
ERROR! Why?
"It is likely that one of the variables in your model is perfectly dependent on one or more of the others. Eliminate that variable from your model and try again."
So what does that mean? Basically you're adjusting Scottsdale for Gilbert and then adjusting right back.
Perfect multicollinearity: the two dummies are perfectly correlated
If you picked an observation at random from the data and were told the value for PRICE, you wouldn't know with certainty the corresponding value for GILBERT or SCOTTSDALE
But if you picked an observation at random and knew that GILBERT was 1, you'd know SCOTTSDALE was 0 - they have a perfect inverse correlation
Now then, originally we had a single variable that we used for the equation: all Gilberts got 1, all non-Gilberts got 0.
That gave us an equation with a demand for Scottsdale plus an adjustment for Gilbert
So you don't need a dummy variable for each location, just one - whichever one you don't choose is the default
In other words, the default portion of the equation picks up the observations that aren't coded as 1.
By trying to run a regression with a dummy for every location, who's the default group?
Q = 11,721 - 629PRICE: what does that refer to?
Well, it doesn't refer to anything - we got rid of the default group! It's everyone who's neither Scottsdale nor Gilbert - but there is nobody else!
This is the dummy variable trap: creating a separate dummy variable for each possibility leads to perfect multicollinearity.
You MUST leave out one of the options so it will be picked up in the default portion.

BUT now let's imagine we have THREE locations. Shoot.
In our previous example, if we had a GILBERT variable, then a 0 was by definition Scottsdale
But now what does "not Gilbert" mean? It means Scottsdale or Glendale.
So let's say we had a dummy variable for all three locations and attempted to estimate Q = a - b1PRICE + b2GILBERT + b3SCOTTSDALE + b4GLENDALE
Well, smart one, you'd have fallen into the dummy variable trap again - perfect multicollinearity!
You need to leave out one of the options - then it can get picked up in the default portion of the equation
So say we only have dummies for Gilbert and Scottsdale: Q = a - b1PRICE + b2GILBERT + b3SCOTTSDALE
Q = a - b1PRICE is our Glendale value
What does the Gilbert coefficient tell us?
Well, that's the adjustment you make to the default group (Glendale). It compares Gilbert to Glendale.
And this makes sense: you take the default (Glendale) and add the adjustment for Gilbert, plus the adjustment for Scottsdale times 0, or 0
And the Scottsdale coefficient?
That compares Scottsdale with Glendale.
So we go through and do our assignments: 1 for Gilbert, 1 for Scottsdale, both 0 for Glendale
So we get the equation
Q = 11,919 - 639PRICE - 1918GILBERT - 77SCOTTSDALE
And what does R^2 mean? 97.08% of the total variation in quantity can be explained by the price and store location
Is there a significant difference between unit sales in Glendale and Gilbert?
The t-statistic for Gilbert is -14.24, meaning the sample slope for Gilbert is 14.24 standard deviations below zero, so the significance/P-value is nearly zero - the odds you'd reach into the population and get a slope like that, if the two locations were about the same, are close to zero
So this is very, very powerful evidence to reject the null hypothesis that they're the same
Okay, now Glendale v. Scottsdale
The t-statistic is -0.55: 0.55 standard deviations below 0
Well, that's not too far off.
So if Scottsdale and Glendale were basically the same and you ran regression after regression, 95% of your sample slopes would be within 2.14 standard deviations of 0
And this number is within 2.14 - in fact, our odds of getting such a slope are 58.9%.
So here, we fail to reject the null hypothesis
BUT what if we omitted Scottsdale and added a Glendale variable?
Well, then Scottsdale would be the default
But our intercept would change: it goes from 11,919 to 11,919 - 77, or 11,842.
Q = 11,842 - 639PRICE
But what would the coefficient for Gilbert be?
Well, if Glendale to Gilbert is 1,919, then Scottsdale to Gilbert would be (Glendale - Gilbert) - (Glendale - Scottsdale), or about 1,841
Now we're comparing everything to Scottsdale
HW Part a on problem CE 2:
Standard error of the estimate: the standard deviation of the estimates, which in this case is 145.
Standard error of the coefficient: the standard deviation of the slope coefficient.
Now then, the difference between the actual value of the dependent variable and the predicted value is called the residual: basically, how far off your estimate is from the real number
Y_i - Ŷ_i = residual (Y_i is the actual value; Ŷ_i is the estimate)
Okay. So you're at the residuals tab; you sort from lowest to highest
We see that at the highest prices we overestimated, and at the lower prices we underestimated.
So you have a pattern in your residuals - and that's bad. You want your residuals to be random.
If you make Scottsdale your default, you get the opposite pattern: overestimate at low prices and underestimate at high prices.
Why? Because we forced parallel lines. All we did was tell the computer to generate a different intercept; we didn't tell it to look at the slopes of the lines.
So... our dummy variable for location dealt with the different intercepts - our equation was Q = 11,721 - 629P - 1833GILBERT.
But clearly we have a difference in slopes - how do we adjust for that??
So how do you adjust for differences in slopes? Well, we need a coefficient to adjust the slope.
We call these interaction terms (the book calls it a slope dummy variable): create a variable in which you multiply the value of the dummy variable by the variable whose slope differs
In this case, the variable is GILBERT × PRICE
So our equation is
Q = 11,721 - 550PRICE - 1833GILBERT - 70(GILBERT × PRICE)
So the Gilbert slope is -550 - 70 = -620
Gilbert demand is Q = 9,888 - 620PRICE
This isn't perfect, but it reduces bias - the systematic over/underestimates are gone now.
So what does this mean? The first part of the equation, the default part, is the same idea as before: that's Scottsdale - Scottsdale didn't get a 1 dummy variable, so it's a 0, so it's the default.
Now then, -1833GILBERT is how you adjust the intercept to get the Gilbert intercept
The interaction coefficient of -70 adjusts the slope.
Why do we do the dummy variable system? Why not just separate out Scottsdale and Gilbert and run two regressions?
Because we get better results with more observations: more efficient estimates - and the same equation.
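(A small sketch of how the dummy and the interaction term work together, using the fitted equation from the notes: the dummy shifts the intercept, the interaction term shifts the slope.)

```python
# Q = 11,721 - 550PRICE - 1833GILBERT - 70(GILBERT x PRICE)
def predict_units(price, gilbert):
    return 11_721 - 550 * price - 1_833 * gilbert - 70 * gilbert * price

# Scottsdale (default): intercept 11,721, slope -550
# Gilbert:              intercept 11,721 - 1,833 = 9,888, slope -550 - 70 = -620
print(predict_units(13, gilbert=0), predict_units(13, gilbert=1))
```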

Head-Hunting Agency Case Study

Facts
CEO Seek advertises it can find a suitable candidate within 15 days or the
service is free
Our job is to determine if the premise is feasible.
You suspect that it's harder to find a suitable CEO for a large firm than for a smaller firm.
Issues Presented
Is the size of the firm related to the number of days needed to find a suitable candidate?
What would you recommend re: the 15-day guarantee?
Does the search for a CEO differ from the search for lower-level managers in terms of the number of days?
Data
48 observations, for either a CEO search or lower level management
search.
Each shows # days required to find candidate
Each observation shows size of client firm.
So. Data set Headhunting.xls
Days: number of days to find candidate
Size: number of employees (measured in hundreds)
LOWConst: lower level manager = 1; CEO = 0
LOWSlope = interaction term (size of firm multiplied by the dummy
variable)
Regression: Days v. Size
The coefficient for SIZE is 0.00597
That's ΔDAYS/ΔSIZE. So for every 100-employee increase in firm size, you add 0.006 days to the job search
But we also have to look at the t-statistic and p-value
If there were no relation between firm size and days to fill, the slope would be 0.
So look at the t-ratio: 0.29 standard deviations from zero. And the P-value tells us that if there were no relation between the two, the odds of reaching into the population and pulling out a sample that shows this much of a relationship are 76%.
So we fail to reject the null hypothesis that there is no relation between the two
But do a scatterplot!
So that's why we got that tiny slope!
The regression line split the difference!
Better throw in a dummy variable, huh?
Regression with Dummy Variable for CEO and Manager
DAYS = 18.82 - 0.005(SIZE) - 8,935(LOW CONST)
So for every 100 more employees, the number of days required goes
down by 0.005
But what does R^2 mean again?
72,29% of the variance can be explained by knowing the firm size and
whether youre searching for the CEO or manager.
What about the t-statistic that corresponds to the SIZE coefficient (KStat
calls it the t-ratio): 0.4899
So pretty small

And our p-value is 62.53%, meaning that if there is no relation, then
62.53% of the time we'll get numbers that show this kind of
relationship. And so we would fail to reject the null.
And what about the dummy variable coefficient? What does that mean (-8.93)?
If you're searching for a lower level manager, then it takes about 8.9 fewer
days to do the search
And what about the t-ratio: -15.79?
What are the odds that we'd get a number like -8.93 if the true relation was
0?
It's 15.79 standard deviations below zero
There's virtually no chance you'd get a number like this if it was
the same for both groups
So reject the null hypothesis that there's no relation between length
and whom you're searching for.
Regression Using Interaction Terms
Now we get the equation:
DAYS = 13.17 + 0.0887(SIZE) - 1.02(LOW) - 0.17(LOW × SIZE)
So how many days does it take to fill a CEO position?
DAYS = 13.17 + 0.0887(SIZE)
Now then, eyeballing our scatterplot, can we see a difference between the
intercepts? Hard to say.
But what about the slopes? Yeah, clearly different.
Lets look at the statistical results
Is the intercept for lower level management different than for CEOs?
The LOWConst coefficient we got was -1.022 - taken at face value, it
would take about one less day to fill a lower-level position than a CEO position
Any significance to that?
The null hypothesis is that the true coefficient (the intercept difference) is 0
The t-ratio says this is 1.42 standard deviations from 0
And the P-value says a 15.88% chance of getting this number if the true difference were 0
And the slope adjustor for the CEO/Manager?
We have a t-statistic of 12.54 and P value of 0.0000%
So clearly there is a relationship - we reject the null hypothesis
And that makes sense, given the scatterplot - the relationship was
really obvious. We couldn't really tell the difference between the
intercepts just by eyeballing it, but this one was obvious
So what's the equation for lower level searches?
DAYS = 12.15 - 0.105(SIZE)
First, you adjust the intercept: 13.17 minus the LOWConst coefficient of 1.02, so we get
12.15
And then the slope? You take the CEO slope of 0.0887 and add the LOWSlope
coefficient, which comes out to about -0.105
Now go to Univariate Statistics
Now we're going to develop a point estimate and a 95% confidence and
prediction interval for firms of 2000, 5000, and 9000 employees
And our predicted values are 14.95, 17.61, and 21.16 days (for CEOs)
What about management?
10.62, 8.33, and 5.28, respectively.
Conclusions: the larger the size of the firm, the longer the search for CEO
Larger the firm, shorter search for lower level managers
15 day guarantee works for lower level managers, but not for CEOs
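A quick arithmetic sketch of those point estimates in Python, using the coefficients as transcribed above (small differences from the predicted values quoted in the notes are just rounding in the transcribed coefficients):

# DAYS = 13.17 + 0.0887*SIZE - 1.02*LOW - 0.17*(LOW*SIZE), SIZE in hundreds of employees
def predicted_days(size_hundreds, low):
    return 13.17 + 0.0887 * size_hundreds - 1.02 * low - 0.17 * low * size_hundreds

for employees in (2000, 5000, 9000):
    size = employees / 100                      # the data set measures size in hundreds
    ceo = predicted_days(size, low=0)           # CEO search
    mgr = predicted_days(size, low=1)           # lower-level search
    print(employees, round(ceo, 2), round(mgr, 2))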

Vocabulary
General Terms
Standard deviation: rarely used here because it describes the full population
Standard error of the coefficient: standard deviation within your sample
coefficient
AKA Sometimes called Std. Dev. of the Sample or Std. Error of the
Sample
t-ratio: distance that the coefficient is from a hypothetical slope of 0,
measured in standard deviations (Coefficient/Standard Error of the
Coefficient)
Specifically for where the coefficient lies from zero in terms of
standard errors of the coefficient.
Significance: exact same thing as the t-ratio, except measured in
probability
Coefficient of determination: doesnt take into account degrees of freedom
Adjusted coefficient of determination: explains how much of the variation in
the dependent variable can be explained by variation in the independent
variable(s)
AKA R^2 (i.e., KStat is weird, everyone else calls it R^2)
T-statistic: the number you would use to create the upper and lower bounds
for a confidence interval: (sample mean - hypothesized population mean)/standard
error of the mean
Say you want to test true pop mean = 1; sample mean = 0.33, standard error =
0.005, df = 129, 5% significance test
Ho: pop mean = 1
Ha: pop mean ≠ 1
If the t-statistic you generate is less than the one in the t-table, you fail to
reject the null hypothesis
SO t-statistic = |0.33 - 1|/0.005 = 134
The t-table at 5% (two-tailed), using the 100 df row since 129 isn't listed, says 1.984
SO. If your calculated t-statistic is less than the appropriate t-statistic
from the t-chart, you fail to reject the null; here 134 > 1.984, so you reject it.
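A sketch of that same test in Python, just to show where the 134 and the 1.98 come from (scipy's t distribution stands in for the t-table):

from scipy import stats

t_stat = abs(0.33 - 1) / 0.005                   # = 134
critical = stats.t.ppf(1 - 0.05 / 2, df=129)     # two-tailed 5% critical value, about 1.98
print(t_stat, critical)
print("reject H0" if t_stat > critical else "fail to reject H0")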
Standard error of the mean: the sample standard deviation divided by the square root of the sample size
Standard Errors...
Standard error of regression: roughly the standard deviation of an individual observation
around the regression equation
What you use to build a confidence interval
Standard error of prediction: incorporates the unknowns and makes the
interval a bit bigger
Standard error of the mean: when in doubt, use this.
How to build a confidence interval for a point estimate. Say you dont want a 95%
confidence interval, you want a 90% confidence interval.
Given coefficient: .235, st. error = 0.97, 100 df., 0.9 confidence interval
Between this and our t-chart, we have everything we need to build a 90%
confidence
We're hypothesizing that the center of our bell curve is 0.235; we want to
know the upper and lower bounds that embrace 90% of that bell curve
Look at the confidence level at the bottom of the t-table
So the t-stat = 1.66, and you use that to convert your standard error into
the distance in both directions:
Coefficient ± (t-statistic from chart)(standard error)
= 0.235 ± (1.66)(0.97)
Given coefficient of 3.5; std. error = 1.2; df = 30; 95% confidence
interval
= 3.5 ± (2.042)(1.2)
If your standard error gets smaller, your window will shrink; same
if you decrease your confidence level, because you're looking for
something quite narrow: "50% of the time it will fall between these
numbers" is more specific than "99% of the time it will fall between
these numbers"
Use the standard error of the coefficient.
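The same recipe in Python, reproducing the two examples above (scipy looks up the t-value the chart would give you):

from scipy import stats

def conf_interval(coef, std_err, df, level):
    t = stats.t.ppf(1 - (1 - level) / 2, df)    # two-tailed critical t for this confidence level
    return coef - t * std_err, coef + t * std_err

print(conf_interval(0.235, 0.97, df=100, level=0.90))   # roughly 0.235 +/- 1.66 * 0.97
print(conf_interval(3.5, 1.2, df=30, level=0.95))       # roughly 3.5 +/- 2.042 * 1.2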

Multicollinearity

Coefficient t-stat = 0.75; variance inflation factor = 25; 0.05 significance level; 30 df
Variance inflation speaks to the coefficient it's attached to. So go back
to that coefficient and apply the square root of the variance inflation factor:
here sqrt(25) = 5, so without the multicollinearity the t-statistic would be about 0.75 × 5 = 3.75
Do an analysis of variance
First column: base equation
Second: extended equation
Third: difference between the two. Significance less than 5% causes you
to reject the null hypothesis that the added variables have no explanatory value.

Time Series

Based on the assumption that historical trends continue


That means you have to ask if time is an element
Price and quantity: time is irrelevant
Sales over time: yeah, time matters
Convert all your time periods into consecutive periods
Choosing the best model: whichever model gives the smallest residuals in the
most recent periods

Spurious Correlations
So we have this weight/CPI regression: Weight = 70 + 0.54(CPI)
Significance = 0.000%
So inflation is CLEARLY responsible for weight.
These are type I errors.
So lets look at the refrigerator again. Demand Curves:
You may be trying to estimate a demand curve and then find that when price
goes up, demand goes up
But the cause could be because demand is just plain higher
Or maybe the competition raised its prices more
Beware reverse causation: fire trucks cause fires
Remember: regression doesnt measure causation, merely correlation.
So as an example, chief manager of a wine company believes that preferences
for brands of wine are related to household incomes
Sales A refers to Almaden. Run a regression in the form Sales A = a +
b(Income A)
So we run a regression on our data and come up with the formula Sales A =
3 + .50Income. BUT income and sales are in thousands of dollars
.50 means for every $1000 increase in income, sales increase by $500.
Each $1 increase in income increases Almaden sales by $.50.
And for Bianco wine
Sales B = a + b(Income B)
And look, we get the same equation: Sales B = a + .5(income B)
So it's about the same as A. But look at the scatter plot:
But... that's not a line, that's a curve
But we told regression to find a line, so that's what it did.
Regression doesn't care what the scatterplot looks like: it just tries
to find the best line to fit the data, regardless of whether the data
falls into that pattern!
So how do nonlinear relationships affect your estimates?
Click on the Kstat tab that says residuals
So we see that for the first 2 lower incomes, the model predicts
too high
For middle incomes, the model underpredicts
And then the final two incomes are overestimates again.
So our residual (the difference between estimated and
observed) has a pattern: low, HIGH, low
Now then, if the regression line were passing through the center of
the scatter plot, there would be no pattern to the residuals, but
here we have a clear, nonrandom pattern.
So whats responsible for this pattern?
Well, we need to add a quadratic term:
Sales = -5.999 + 2.78(Income) - 0.13(Income²)
As income rises, the relationship between income and
sales gets smaller.
Instead of a straight line, it goes up, then goes down: the
relationship between sales and income eventually
becomes negative.
And if you look at the residuals, we can see that our point
estimates are spot on
The inclusion of that quadratic term has fit a curve to
nonlinear data.
Phew. Now then, nonlinear data can usually be spotted with a scatterplot.
But it may not be so easily spotted if you're using multiple regression (if
you have more than two dimensions in your graph)
A more reliable means of spotting nonlinearity is through a residual plot.
So going back to our Sales B = 3 + .5(Income B) line.
We know that's no bueno.
BUT we can then plot the residuals:
So not only did we plot the residuals and see a pattern of negative,
positive, negative, but the size of the residuals follows a pattern:
so that's pretty clear evidence that we're trying to fit a line through
nonlinear data.
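A rough sketch of that residual check in Python (made-up curved data, not the wine numbers): force a straight line through curved data and the residual signs come out in runs; add the squared term and they scatter.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
income = np.arange(1.0, 13.0)
sales = -6 + 2.8 * income - 0.13 * income**2 + rng.normal(0, 0.2, income.size)

linear    = sm.OLS(sales, sm.add_constant(income)).fit()
quadratic = sm.OLS(sales, sm.add_constant(np.column_stack([income, income**2]))).fit()

print(np.sign(linear.resid))      # systematic run of -, then +, then -: a line forced through a curve
print(np.sign(quadratic.resid))   # signs look roughly random once the quadratic term is included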

And compare to Almaden wine: the dots seemed to follow a linear
pattern, and the line passed through the data. And look what happens
when you plot the residuals: they're all over the place! No pattern
to the residuals.
This is good: we want there to be no pattern to residuals - the
only way to get a better number is with a crystal ball.
And what about Casarosa wines? Lets run a regression:
Again, we get Sales C = 3 + 0.5(Income C)
So the line is the same as A and B... but there's probably something
funky going on with the scatterplot.
Well... less funky and more an almost perfect correlation with a single
outlier
And look: that outlier tilts the line up. At the really low levels, we'll
underestimate, and then we'll overestimate moving on, with the
exception of the outlier, which is an underestimate.
And look, we have exactly that result.
Now then, Kstat is set up to find outliers.
Go to Model Analysis then Residuals.
stdized refers to studentized residual
This column measures how far the residual lies from 0,
measured in standard deviations
So if your line passes through the center of the data, the
residuals should average to 0 (positives and negatives cancel
out)
The idea is that some of the residuals will be closer to 0
than others - one really far from 0 is an outlier.
And if you look, the outlier is highlighted in red: 2.8
standard deviations above 0.
The default number they give you is for a 95% confidence
interval. And our number here is 2.2622 - 95% of residuals
should lie within 2.2622 standard deviations.
So Kstat will highlight any residual that lies in the 5%
tail region.
So moving on to the Casarosa example from the email, you run a
regression of Sales v. Income and get Sales = 4.025 + 0.34(Income)
Lovely little R^2 too: 99.95%!
But that's with the right information: what if it's the wrong info?
To check, go to charts, fitted line plots, and then plot against income.
Almost perfect, as you can see

But what happens when there's bad info? Say you put in 8.41 as 841?
Totally tanked our R2! 7.01%. And our t-ratio is 1.32, with
significance of 21.8%
Sales = -192.78 + 30.62(Income)
So... every dollar increase in income translates to an extra
$30.62 in sales
And when you plot it on a fitted line plot:
So the regression line is way steeper than it should be.
Note: when you saw the correct data, it looked like the data
was upward sloping
Now it looks flat.
But it's not - it just looks that way because that outlier
up at 841 messes with the scale
Okay, that was a typo: always double check your data entry. But...
what if the outlier wasn't a typo? What if there really was that
oddball day?
One way to minimize the impact of an outlier is to increase
sample size
Well lookie there, the impact of the outlier has been
substantially reduced from when it was one of 14. Bias
is still there, but it's much smaller
Note: we just copied and pasted the data several
times - that's not what you'd really do, obviously
Here's what you really want to do though: unless it was the
result of a typo, an outlier probably has a cause
If you can determine the cause, you can add the
independent variable that was responsible for the outlier to
your equation.
Add an outlier dummy variable!!! In this case, we're
saying the cause is a huge Black Friday sale, so our
dummy variable is BF = 1 on that observation and BF = 0 otherwise.
Now we have the equation: Sales = 4.01 +
0.34(Income) + 832(BF)
And look at the residuals: as you'd expect with an R2 this
high, the fitted values are essentially equal to the actual values,
including for the BF observation.
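A small Python sketch of the dummy-variable fix (hypothetical numbers; BF marks the one made-up Black Friday observation):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "INCOME": [2, 4, 6, 8, 10, 12, 14],
    "SALES":  [4.7, 5.4, 6.1, 6.7, 7.4, 840.0, 8.8],   # one huge Black Friday spike
    "BF":     [0, 0, 0, 0, 0, 1, 0],
})

fit = smf.ols("SALES ~ INCOME + BF", data=df).fit()
print(fit.params)   # the INCOME slope recovers the everyday relationship; BF soaks up the spike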
Never, ever, ever omit the outliers from your data sets!
Why? Because when you're doing empirical studies, you're
testing a theory. So when you omit outliers, you're giving
the impression that you rigged your results.

And look at this crazy plot. Of course you'd never get
results like this in the real world, but look how this leverage
point has a disproportionately large influence on the
regression equation. It looks like there's really no relation
between income and sales, but the regression line, due to
the leverage point, says otherwise.
But Kstat thinks life is just dandy - this point fits the line
exactly, so A LEVERAGE POINT IS NOT AN OUTLIER
And yet clearly something ain't right: so what do you do?
Go to model analysis.
Now look at the leverage column - our leverage point is
highlighted in RED. So instead of the studentized column, you look
at leverage.
And Kstat has identified that this point has a
disproportionate effect on the regression line.
Now then, scatterplots like this are really unusual
Lets have another example: store near ASU sells higher-end goods aimed at
students. Plot income to Sales.
So here we have the fitted line. But this really doesn't make sense: the relationship
between what students earn and spend is biased by the inclusion of all
those students who make no money and don't shop at the store.
But... what if students are supported by their parents? Or earned
money over the summer to spend, so summer income supports
spending during the year? Income = $0; spending > $0
So now you can't just get rid of the zero incomes... but those who
have no income and still spend, spend at a different rate.
So maybe you need two regression equations: one for lower
incomes and one for higher; this matches two distinct market
segments
Now, this is an example of the threshold effect
For our purposes, know this issue is here and that issues arise
from it
Sometimes, it's just a matter of exercising good judgment.

Income vs. Spending

A prime example of heteroskedasticity: people who make very little spend
every bit they have (they have to), but people who make more money can
spend or save with more discretion.
But check out that residual plot:
Why do we care? If there's more than one independent variable, you have to
look at the residual plot to spot the error.
We use the Breusch-Pagan test
Run a regression of the general form
Y = a + b1X1 + b2X2 + ... + bnXn
Then you take the residuals (yi - ŷi), square them, and run the regression
ε² = a + c1X1 + c2X2 + ... + cnXn
Where ε² is the squared residual
If there's no heteroskedasticity present, the slopes in the second
equation will be zero - there's no correlation between the residuals
and the independent variables. If there is such a relationship, there
will be a nonzero slope
Here is the hypothesis test
Ho: c1 = c2 = ... = cn = 0
Ha: at least one ci ≠ 0
Using the R2 from the residual regression, the Breusch-Pagan test
statistic is calculated (by a formula we don't have to do by hand). We then
test it against a chi-square statistic.
But kstat does it for you as well as report the corresponding Pvalue
Look at it in Model Analysis, look for the statistic and P-value,
right next to each other
P-value: the odds of getting a BP statistic this large if there were no
heteroskedasticity; a small P-value is strong evidence
to reject the notion that heteroskedasticity does not exist
How do you fix heteroskedasticity?
Easiest (and only way we'll cover): convert the data into logarithms
(semi-log or log-log), run the regression, examine the residuals for
signs of heteroskedasticity, and perform a BP test.
If it fixed the problem, that's the equation to go with
If it didn't fix it by doing semi-log, try log-log
If THAT didn't fix it, then don't worry about it for our purposes
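A Python sketch of that workflow with made-up heteroskedastic data (statsmodels' het_breuschpagan reports the BP statistic and its p-value, like the output described above):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = np.linspace(1, 50, 200)
y = 5 + 2 * x + rng.normal(0, 0.3 * x)        # spread grows with x: heteroskedastic by construction

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
print(het_breuschpagan(fit.resid, X))          # (LM stat, LM p-value, F stat, F p-value): tiny p-value here

fit_log = sm.OLS(np.log(y), X).fit()           # semi-log version of the same regression
print(het_breuschpagan(fit_log.resid, X))      # hope for a larger p-value after the transform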

Ch.7: Dealing with Multicollinearity


So. Hotdog example.
Interpret Pdub: -0.0007642. Market share was in terms of percent; price was
in cents
Each one cent increase in the price of a Dubuque hotdog causes its
market share to fall by 0.076%
Logical enough: that's the demand curve
But let's evaluate this
T-statistic: this number is 9.6042 standard deviations below 0
So the odds of getting this number if there was no relationship
between price and market share are very, very low
So our market share is clearly related to the price of our own Dubuque
hotdog.
Ditto for the other price coefficients
POscar: 0.00026333
Every cent increase in Oscar Meyer's price causes Dubuque's market
share to rise by 0.026%
And look at the t-statistic: 3.13 standard deviations above 0, P value =
0.0%
PBPReg: 0.000459698. So for every 1 cent increase in the BP regular price, D gains market
share of 0.046%
T-stat = 5.88
P-val = 0.0%
But... look at R2: 51.27%
Now we run the regression, but swap out BPReg and put in all beef.
In terms of Dubuque's market share/price relationship, that stayed the
same. Ditto for the Oscar Meyer price/our market share
And PBPBeef has a slope of 0.00040, so for every 1 cent increase in price,
our market share increases by 0.04%.
T-ratio = 5.7
Significance = 0.0%
So our colleague who says OM is our only competitor is full of bologna (nyuck
nyuck nyuck)
Now run the regression one more time and include all the brands.
Well shoot. No impact on relation between our price and market share,
OM price and our market share, but look what it did to our BP data!
Well now it looks like we dont have sufficient information to say that
theres a relationship between BPs price and our market share
Our significance levels went to 29,7% and 72,8%. Useless!
Okay now this is just funky: either BP hotdog flying solo clearly had a
relation to our market share
But together, no impact?

Either our market share is influenced, or its not. But we have strong
evidence that on their own they are, but together theyre not.
So... whats the explanation?
Go to charts and plot Pdub (independent variable) against POscar
(independent variable):
Not much of a correlation - maybe a positive correlation? But its not
strong.
Now Oscar Meyer against BPreg
Again, there appears to be some positive correlation, but its not
strong
Now BPReg v. BPBeef:
Well theres a strong correlation there.
This is visual evidence of multicollinearity: strong correlation
between independent variables
In Ch. 5, we talked about perfect multicollinearity - when you
have perfect multicollinearity, you can't run a regression
equation at all.
But here, the two variables are strongly correlated, but not
perfectly
So we can still estimate a regression equation, but it has a
problem
How do you spot multicollinearity?
Look at the correlation coefficient: correlation coefficient of 1 indicates
a perfect, positive correlation between variables. -1 indicates perfect
negative correlation between variables. 0 means no correlation.
So how do you do this? Go to kstat and click on correlations
So we have the correlation coefficients for our hotdog price, and it's
positively correlated with all other brands: as other brands' prices go up,
so does ours
But look at the correlation coefficient relating the Ball Park regular
price to the all beef price. As you can see, it's almost 1
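A Python sketch of that check (hypothetical prices in cents, just to show the mechanics; the two Ball Park columns are built to move almost one-for-one):

import pandas as pd

prices = pd.DataFrame({
    "PDUB":    [55, 57, 60, 62, 65, 67, 70, 72],
    "POSCAR":  [58, 59, 63, 61, 66, 68, 69, 73],
    "PBPREG":  [50, 52, 55, 57, 60, 63, 64, 67],
    "PBPBEEF": [61, 64, 66, 68, 72, 74, 75, 79],
})
print(prices.corr())   # independent-variable pairs with correlations near +/-1 flag multicollinearity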
Why does multicollinearity create problems?
Our equation is: MKTDUB = a + b1Pdub + b2Poscar + b3Pbpreg +
b4Pbpbeef

Differentiation: if we were to increase the value of one variable
by 1 unit, what would happen to our market share?
Partial derivatives: identifies the change in the dependent
variable resulting from a given change in one independent
variable, holding all other values constant

So: the impact of a change in the price of a regular BP hotdog on
Dub's market share, holding constant the prices of other
types of hotdogs.
But you can't hold the all beef price constant while the regular price moves, because
the same company is in charge of both: so they always move
together.
And since BPReg is highly correlated with BPBeef, regression
can't see the independent effect that BPReg has on D's
market share - and the same goes for BPBeef.
Why do we care? This case is based on the concern that if BP changes its
price, it may affect D's market share. And our colleagues said that OM was
the only competitor.
So... does D's market share appear to be sensitive to the price of BP
hotdogs?
When the regression includes both types of BP hotdogs, the equation
misleads D into thinking BP doesn't have an effect; it doesn't appear
to, at least, when you look only at the full regression output.
But we know, from running individual regressions for each kind of BP
hotdog, that there IS an influence
We'll go back to the regression with just PBPReg.
The standard error of the coefficient is 0.000078
But when you run it with both, the standard error of the coefficient
increases to 0.00033: it's roughly four times bigger than in the other equation.
Multicollinearity inflates the size of the standard error of the coefficient.
Why do we care?
t = (b (the sample slope) - hypothesized population slope (default assumption is 0)) / standard error
of the coefficient (s_b)
And if you have multicollinearity, you make the denominator bigger,
decreasing the t-statistic
And if the t-statistic is artificially small, the variable looks less significant than it really is
You will be less likely to reject the null hypothesis when a relationship
between variables actually exists (Type II error)
Type I error = null hypothesis is true, but you reject it as not true
Type II error = null hypothesis is false, but you fail to reject it as
true.
And we've already seen this: when only one BP hotdog was in the
equation, its t-statistic showed it was significantly related to D's
market share. And when both were in the equation, neither was
shown to be related to D's market share.
So how do you know if the correlation is large enough to create a
problem?
We use variance inflation factors
Variance Inflation Factor: indicates the amount by which
multicollinearity inflates the standard error of the coefficient.
Suppose that the t-statistic is 0.5. So then the idea is that
there's no relation. But the t-statistic is (sample slope - 0)/standard error of the coefficient
And the inflated standard error causes your number to be artificially low.
So it's misleading
How do you know if it's low enough to make a difference in
your conclusion?
That's the variance inflation factor
So... going back to boot camp
Variance: Sum(each outcome - mean)^2/number of outcomes
Less popular because it's harder to interpret. It's the average
squared deviation
Makes sense conceptually, but not much use in terms of
application because it's a square
Standard Deviation: square root of the variance. Much easier to
wrap your head around intuitively
Why did we go back to that?
Suppose the t-statistic is 0.5. If the variance inflation factor = 36,
then standard error of the coefficient has been inflated six times
due to multicollinearity (b/c standard deviation = square root of
variance)
To find the variance inflation factors, click statistics, then model
analysis.
We get: Pdub: 1.36 (sqrt ≈ 1.2). So our t-statistic is lower than it
really is; it should be about 1.2 times what it is. But we already said it's
related, so now it's really related - in other words, since we already
concluded the variables are related to the dependent variable, that
just strengthens our conclusion
POscar: 1.66 (sqrt ≈ 1.3). We've already concluded there's a relation,
so the fact that the t-statistic is depressed by a factor of about 1.3 isn't too
worrisome.
PBPReg: 25.96 (sqrt ≈ 5.1)
So we concluded there's no relation, but the t-statistic should be
about 5 times HIGHER than what we got
We got a t-ratio of 1.04, but it should be around 5.2!
Well if that were the case, we'd say strong correlation!

BPBeef: 25.14 (sqrt ≈ 5)
The t-statistic is 0.35, but it should be about 1.75 - so likely significant at
the 10% level
So what do we see? Multicollinearity has led us into failing to reject
the null hypothesis when it was false!
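A Python sketch of pulling variance inflation factors (same hypothetical price columns as the correlation sketch above, rebuilt here so it runs on its own):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

prices = pd.DataFrame({
    "PDUB":    [55, 57, 60, 62, 65, 67, 70, 72],
    "POSCAR":  [58, 59, 63, 61, 66, 68, 69, 73],
    "PBPREG":  [50, 52, 55, 57, 60, 63, 64, 67],
    "PBPBEEF": [61, 64, 66, 68, 72, 74, 75, 79],
})
X = sm.add_constant(prices)
for i, name in enumerate(X.columns):
    if name == "const":
        continue                                  # the constant's VIF isn't meaningful
    vif = variance_inflation_factor(X.values, i)
    print(name, round(vif, 1), "-> t-ratio deflated by roughly a factor of", round(np.sqrt(vif), 1))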
So is multicollinearity always a problem? Well, it depends on what you're
trying to do.
What impact does multicollinearity have on the point estimate?
Essentially none - bias would mean the estimates average out to
something other than the population value, and multicollinearity
does not cause that
When you only included regular BP hotdogs, the prediction without
multicollinearity and the prediction with multicollinearity came out
the same
So the problems are thus:
T-statistics are depressed, which may mislead you into concluding
variables are not related when, in fact, they are.
Slope coefficients for the independent variables that are correlated
are not a reliable measure of how changes in that independent
variable affect the dependent variable.
But... if two independent variables are highly correlated and have low
t-statistics, how do you know if they're correlated with the dependent
variable?
We call it a Joint F-Test
Suppose PBPReg and PBPBeef have no explanatory value: BP prices
have nothing to do with D's market share.
Well if that's true, then running a regression with them and without
them will do equally well. That's what the F-test checks.
If this were true, the residual sum of squares (RSS) - the sum of
(yi (the actual observation for the dependent variable) minus ŷi (the predicted value)), squared -
would be the same for both equations
So if we had MKTDUB = a + b1PDub + b2POscar and MKTDUB = a +
b1PDub + b2POscar + b3PBPReg + b4PBPBeef
And if PBPReg and PBPBeef have no explanatory
value, then the RSS for both equations would be the same.
If the equation without PBPreg and PBPBeef has less
explanatory power than the full equation, the RSS without
b3PBPReg + b4PBPBeef will be larger than for the full
equation
So you set it up as a hypothesis test.

What does this mean? RSS_r = restricted, RSS_u = unrestricted
Normally, you have to consult an F-table to see if the F-statistic is
large enough to cause you to reject the null hypothesis - but Kstat
reports the p-value associated with the statistic
Analysis of variance
Left column is the restricted equation; right is the
unrestricted (check the boxes for the variables you're
adding)
The difference of the sum of squares shows whether adding
the new variables is significant. We get significance =
0.00003%, so that's very, very strong evidence that these
two variables add to the explanatory power of the equation
We still don't know, on the individual level, which
variable causes what, but we CAN see that,
collectively, the two variables make a difference. BP is
a competitor. It's not enough to just run the restricted
equation.
And if you had output that reported only the F-statistic?
Use the excel function F.DIST.RT:
F-statistic goes in the X box.
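Two Python sketches of the same idea. The first assumes you still have the data in a DataFrame with the column names used in these notes (so it is shown commented out); the second mirrors Excel's F.DIST.RT when all you have is the F-statistic and degrees of freedom (the numbers are placeholders, not the case output):

from scipy import stats
# import statsmodels.formula.api as smf

# (a) With the data: jointly test that both Ball Park coefficients are zero.
# full = smf.ols("MKTDUB ~ PDUB + POSCAR + PBPREG + PBPBEEF", data=df).fit()
# print(full.f_test("PBPREG = 0, PBPBEEF = 0"))     # F-statistic and p-value in one shot

# (b) With only the reported F-statistic: right-tail probability = p-value.
F, df_num, df_denom = 16.2, 2, 48                   # placeholder values
print(stats.f.sf(F, df_num, df_denom))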
When do we NOT worry about multicollinearity?
When youre looking for a point estimate
When you just want a control variable
So say you want to estimate a demand equation that controls for the
state of the economy
Your equation includes real GDP and real GSP
Now then, GDP and GSP may be highly correlated.
You can't do anything about them, but in a recession, you'll
overestimate sales if you throw those variables out
In other words, they have an impact, whether you can do
anything about it or not.
But youre not interested in the coefficients for each variable,
you just want to control for the state of the economy
NOTE: if youre running a residual plot, do it against the same dependent variable
Extrapolation
You dont want to extrapolate outside your sample data

You dont know if the pattern is going to continue - and so regression will give
you a wider confidence interval
Hidden Extrapolation
So you pick values within your data
But... even though you pick numbers within your range, you can still do
hidden extrapolation if your prediction values never appear at the same time -
hence it's as if you extrapolated outside the sample range
So. You've picked numbers that are all within your range: your standard error
of the estimated mean, used to calculate your confidence interval, is
0.002444.
But if you change one input to something that's outside your sample
range (and by that I mean moving one variable that's linked to another
more than the link would suggest), the standard error of the estimated
mean gets way bigger, making your confidence interval bigger

Ch. 9: Because our brains arent full enough


Time Series Analysis and Forecasting
Time series says that you assume your trend/pattern will continue.
When you do that, you're implicitly assuming that sales are a function of time:
whenever you assume that past trends will continue in the future, that's what
you're saying
Of course, time isn't the cause and effect variable
So what we've done thus far is to use the real predictive variables (e.g.,
price of goods, competitors' prices) so that we can plug them in for
analysis
If you want to show where the trend is going in the future? Create a time
variable and number each period.
Then you can plug it into kstat
Run a regression and then predict using the next periods, 13-16
But what if your data shows that you don't have a linear relationship?
You're going to get a bad estimate if you use a straight-line regression to predict.
Which model is better for fitting an equation to the data?
R2 is a good tool - but it depends
If you're doing sales forecasting, because you're forecasting into the near
future you want smaller residuals for the more recent periods
Don't worry too much about the ancient residuals: look and see how well
it's done lately
So our most recent residuals range from -14 to 41 for the quadratic
To compare to a semi-log, you have to convert your numbers out of
log format and back into units
And the semi-log residuals go from -4 to 4.1
How do you eliminate heteroskedasticity?
Try running it as a semi-log
Multicollinearity
When you have two variables that are highly correlated, regression can't figure
out who does what, so it splits the difference between the two variables - and
that split is arbitrary
You can ask whether your variables are unnecessary
Multicollinearity says there's no way to trust these individual numbers and there's no
way to fix it. You can test whether they belong in there at all by doing a joint F-
test: that tells you whether the variables add to the equation or add nothing.

EXAM REVIEW

Things you have to do


Plug variables into the equation - no predict tab
Calculate confidence intervals
Point estimate plus or minus t-statistic × the appropriate standard error
Calculate the t-statistic
(sample coefficient (slope) - hypothesized population slope)/standard
error of the coefficient
BUT the t-statistic that kstat provides assumes your hypothesized
slope is 0, so that t-statistic is really your sample coefficient/standard
error of the coefficient
But if you have a different hypothesis, then your t-statistic is going
to be different than what kstat reports
T-Table: if you're doing a 2-tailed test, 5% significance, 22 degrees of
freedom: go down to 22, go over to 5%, then that's your critical t-value,
above which you reject the null hypothesis, below which you fail to reject.
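A Python sketch of that lookup (scipy returns the same number as the 22-df, 5% two-tailed row of the t-table):

from scipy import stats

critical = stats.t.ppf(1 - 0.05 / 2, df=22)   # about 2.074
print(critical)                               # reject the null if |calculated t| exceeds this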

Ch. 8
So we plot advertising expense against sales and get this scatterplot:
Pretty ugly, n'est-ce pas?

So why don't we add a quadratic?
Okay, so now we add a column for a quadratic and then click charts >
fitted line plots
Plot predicted values against EXP.
So much prettier
Semi-log: another way to deal with this is to convert the dependent variable into
natural logs and estimate the equation in the form LN(Sales) = a + b(EXP)
How do you do that?
Create a column that converts dependent variable into logarithms
by using =LN(SALES)
Now run a regression on Ln(Sales) and Expense
And we get an equation
The Coefficient states that a $1 increase in advertising
expenditures leads to a 0.06 increase in the natural log of
sales.
But it ALSO means that a $1 increase in advertising
expenditures leads to a 6% increase in sales.
Same thing, but the second one sounds a lot better in
presentation
But when you have the dependent variable expressed as a
percentage, the relationship is nonlinear
And yet... it doesnt fit as well:
So lets try log-log: create a log column for both independent
AND dependent variables
And it does, in fact, look better
It finds a linear relation between the logs, although the
logs, by definition, are nonlinear.
And sure enough, this fits much, MUCH better than
linear or semi-log, and even slightly better than the
quadratic.
Now, suppose you wanted to predict sales when
advertising expenditure = $2000
Well, first of all, your input needs to be the log of $2000
But second, the data was input into the
spreadsheet in thousands
SO you'd actually plug in =LN(2)
And we GET 2.816
What does that mean?
That means that the natural log of sales is
2.816
So how do you get back to sales?
Use =EXP(Number)
And we get 16.71864
BUT you plugged it in in thousands,
so change the result: $16,718.64
A $2000 advertising expenditure
gives predicted sales of about
$16,718
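A Python sketch of just the unit handling above - the 2.816 is taken straight from the notes, so the only point here is the ln-in, exp-out bookkeeping:

import math

expenditure_dollars = 2000
x_input = math.log(expenditure_dollars / 1000)   # data are in thousands, so plug in ln(2)
ln_sales = 2.816                                 # predicted natural log of sales (from the notes)
sales = math.exp(ln_sales) * 1000                # back out of logs, then out of thousands
print(round(x_input, 4), round(sales, 2))        # about 0.6931 and roughly $16,711 (the notes' extra decimals give $16,718.64)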
Now, there's this thing called heteroskedasticity: when the variation in the
dependent variable is correlated with the value of the independent variable
Suppose we surveyed firms and regressed each worker's salary against the
number of years of experience
So we would expect a positive correlation
But what's the variation in salaries for people with little experience v.
those with lots?
Less variation for those with little experience
Entry level is entry level
But more variation as time goes on
So we'd expect to see something like this:
The variation changes with the value of the independent variable. Kinda
looks like you sneezed on the scatterplot.

And what does regression do with that?
The good news is that heteroskedasticity does not result in bias
But why is it a problem?
Because regression reports a single number as the standard error of
the coefficient, but there is no such single number: it's
always changing!
Why is this a problem? Because we rely on the standard
error of the coefficient to establish a confidence interval for
the population slope
We also rely on the t-statistic to determine if X and Y
are related.
It tells us whether variables are related or whether
the evidence is weak
But the standard error of the coefficient is the
denominator of the t-statistic
