Class 2
Hypothesis Testing
EE Example
Showroom not a problem if service time to non-customers < 5760
seconds
Our sample mean of 100 people was 4880 seconds
Why sample? Because it's not feasible to survey the entire population of 4,000.
BUT that's just a sample mean; if we polled 100 other people, we'd probably get something different.
So what we want is some semblance of assurance that the average of the sample means does not exceed 5,760 seconds.
What we did in Bootcamp was create a confidence interval that
showed the range of sample means we are likely to obtain IF the
population mean is 5760.
A 95% confidence interval demonstrates the range of sample means we're likely to obtain 95% of the time if the population mean is 5760.
So the 95% confidence interval = 5,760 ± t(alpha, df)(Sx)
Alpha = % observations in the tails
Where t = t-statistic that corresponds to alpha (what t-value
corresponds to the number of sdev above or below the mean?)
df = number of degrees of freedom (# variables that are free to vary; just think number of observations − 1)
Sx = standard error of the mean
= standard deviation of the sample/√(sample size)
t value associated with 5% of the observations lying in the tails
with 99 degrees of freedom was 1.98
Standard error of the mean was 261
Hence, 95% confidence interval = 5,760 ± 1.98(261).
What does the 1.98 mean? It means that 95% of the
observations lie within 1.98 standard deviations of the mean.
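A quick sketch of that interval in Python (scipy), using the notes' numbers (n = 100, SE = 261, hypothesized mean 5,760); Excel's T.INV.2T would give the same critical t:

```python
from scipy import stats

n = 100        # sample size from the notes
se = 261       # standard error of the mean (given)
center = 5760  # hypothesized population mean

# Critical t with 5% in the tails and n - 1 = 99 degrees of freedom
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)  # ~1.98

lower, upper = center - t_crit * se, center + t_crit * se
print(f"t = {t_crit:.2f}, 95% CI = ({lower:.0f}, {upper:.0f})")
# t = 1.98, 95% CI = (5242, 6278)
```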
Hypothesis Testing
So as an example, take a murder trial. The jury is told to presume Not Guilty. It's up to the prosecutor to get the jury to reject that hypothesis with evidence; the defense atty tries to get you to doubt the evidence. A Not Guilty verdict means the prosecutor failed to sustain his burden of proof.
So now let's apply that logic to our hypothesis testing.
The hypothesis to be tested is the null hypothesis (not guilty)
The alternative hypothesis (H_A) is that the defendant IS guilty
So let's set up our hypothesis test:
H_0: defendant is not guilty
H_A: defendant is guilty
Only if the evidence exceeds the minimum burden of proof can the null
hypothesis be rejected!
BUT
Type I Error: the null hypothesis is correct but it is rejected
Type II Error: the null hypothesis is incorrect but it is not rejected
Type I Error is more likely to occur in a civil case - the burden of proof
is lower
Type II Error is more likely to occur in a criminal case - burden of proof
is really stringent
So applying the courts to hypothesis testing
Null hypothesis is established
H_0: Service time to non-buyers ≤ 5760 seconds
If the null hypo isn't true, the alternative must be true: service time to non-buyers > 5760 seconds
Note: could you have reversed these hypotheses? Sure, but it's easier to have the null hypothesis be "if this is true, our response is status quo." If you reject the null hypothesis, we're going to have to change something.
So if we don't have evidence of a problem, we won't change anything; we'll only implement this policy of charging for help if our service time > 5,760.
The null hypothesis is only rejected if the evidence exceeds the
minimum burden of proof.
Test Example
Computer company thinks introducing different colored computers may increase profits. To be profitable, sales have to increase by at least 275 units/week.
New colors were test marketed over 36 weeks.
So if alpha is 5%, then we would fail to reject the null hypothesis; it falls short of our tail. Thus we have concluded that we don't have enough evidence to say our target market is not 25.
Suppose the sample mean had been 18 and the standard error of the mean was 3.2 (with 35 df)
t = (18 − 25)/3.2 = −2.1875 (so 18 lies 2.1875 SDEV below 25)
P-value = ???
YOU CAN'T PLUG A NEGATIVE NUMBER INTO T.DIST.2T
No worries though, just plug in the absolute value (it doesn't matter because the graph is symmetrical)
So we plug that in, get a p-value of 3.5%, meaning we would reject the null hypothesis.
3.5% means that there is a 3.5% chance that we would get a sample mean of 18 if the true mean were 25. So we don't think our target market is 25.
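If you want to check this outside Excel, here is the same calculation sketched in Python (scipy); doubling the right-tail area reproduces T.DIST.2T:

```python
from scipy import stats

t_stat = (18 - 25) / 3.2  # = -2.1875
df = 35

# T.DIST.2T needs |t|; same trick here, since the t-curve is symmetric
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(f"t = {t_stat:.4f}, p = {p_value:.3f}")  # p ~ 0.035 -> reject at 5%
```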
Notes on HW
Write out an explanation of what you did
Consumer Packaging Example
Review
Hypothesis testing says: if the mean is this, what are the odds that we would get a sample mean of that?
So the colored computer example
Our test market sample was 290, which makes us say wow, above and beyond break even!
Buuuuuuut it's also just the sample mean.
So we calculated the P-Value: if the population mean is this, what's the probability we get this sample mean?
So that was a one-tailed test.
Our P-Value was in one tail (critical region being 5%), so we
rejected the null hypothesis.
But what about a two-tailed test? Then, the areas in the critical region
sum to 5% (each tail is 2.5%)
The example we gave was our e-cig market: if our hypothesized average was 25, we didn't care if it was 10 years older or 10 years younger; we want to know if we're advertising to the right people.
That foundation laid, let's go with a real example:
Company is considering two kinds of packaging: which will sell better?
Each type is sold in 36 demographically similar sales districts
Demographically similar is key because we need to know that to
assume an unbiased sampling population.
But first look at the averages: there seems to be a fair bit of difference
Since we want to know if the monthly returns have been stable in each 20-period subset, we will use the following two-tailed test:
H0: μ1 − μ2 = 0
HA: μ1 − μ2 ≠ 0
*Why not a one-tailed test? The two-tailed test is about showing whether or not they're the same; a one-tailed test would say which period was better. But that's not helpful for investing; we're trying to predict the future.
T.TEST
The p-value is 0.628. We cannot reject the null hypothesis that returns
have been stable
Corporate Bond: really small P value - strong evidence to reject the idea
that the returns are the same
Government: 1.1% probability; depending on alpha, may or may not reject
null hypothesis
T-bills: infinitesimal number, so reject the idea that returns are the same.
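Here is a sketch of what T.TEST is doing under the hood, in Python (scipy); the return arrays below are invented stand-ins for the course data, not the actual series:

```python
import numpy as np
from scipy import stats

# Invented monthly returns for two periods (stand-ins for the course data)
period1 = np.array([1.2, 0.8, 1.5, 0.9, 1.1, 1.3])
period2 = np.array([1.0, 1.4, 0.7, 1.2, 0.9, 1.5])

# Two-tailed test of H0: mu1 - mu2 = 0, like Excel's T.TEST with tails=2
t_stat, p_value = stats.ttest_ind(period1, period2)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A large p means we cannot reject that the returns have been stable
```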
Note: when do you use a one-tailed test versus a two tailed test
Use a one-tailed test when you want a value that is AT LEAST XXX.
Use a two-tailed test when you want a value that is EXACTLY XXX: bigger OR smaller is a problem.
BLURGH!
This is why our default hypothesis is that the population slope = 0.
In most applications, the purpose of the study is to see if X and Y are related.
If the evidence is strong enough, we can conclude that the no-relationship theory is unlikely to be correct: that X and Y ARE related.
When you run KStat, the regression shows a t-statistic (called a t-ratio) based on an implicit assumption that the population slope = 0.
What does 12.6812 mean? That tells us that a slope of −14.9 lies 12.68 standard deviations below zero.
That's pretty good evidence that we should reject the hypothesis that Price and Quantity are unrelated.
Makes sense: if −14.9 was too far from −10 for us to believe the slope was −10, then it's REALLY too far from 0.
Remember, 95% of sample slopes lie within 2.04 standard deviations of 0. If our number is 12 standard deviations out, then that's probably not a good assumption.
Significance is our P-value: the likelihood that you'd get a number that's 12.68 standard deviations from 0. And the answer is virtually 0.
What's that beta-weight number?
Similar to the slope coefficient, except it measures
changes in terms of standard deviations.
Every time you increase price by one standard
deviation, the demand falls by 0.92 standard deviations
We could ALSO set this up as a one-tailed hypothesis test.
KStat assumes a 2-tailed test with H_0: β = 0.
In the case of the price coefficient, the null hypothesis will be rejected if the coefficient is either positive or negative.
But c'mon, the secret to selling more is never to raise the price: then a one-tailed test would be better.
So
H_0: β ≥ 0
H_A: β < 0
In other words, our null is that there's either no relationship or a positive relationship.
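A tiny helper sketching that conversion, assuming the software reports two-tailed p-values (as the notes say KStat does); the coefficient and p-value plugged in are illustrative, not from the case output:

```python
# A two-tailed p-value converted to a one-tailed p-value
# for H0: beta >= 0 vs. Ha: beta < 0.
def one_tailed_p(coef, two_tailed_p):
    # Halve the p-value only when the estimate points the way Ha predicts
    return two_tailed_p / 2 if coef < 0 else 1 - two_tailed_p / 2

# Hypothetical numbers in the spirit of the price example above
print(one_tailed_p(-14.9, 0.0002))  # 0.0001 -> reject H0: beta >= 0
```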
Autorama Case
So we've got our data in the data page; we run univariate statistics. What do we see?
Mean:
Income: 60,359 (range 18,900 to 101,300)
Price: 19,522 (range 5,100 to 19,650)
Now let's make a scatterplot, then run regression.
So what's our regression equation?
Constant: 5,788
Income: 0.2275
So, P = 5,788 + 0.2275(Income)
That means each extra dollar in income is associated with about a $0.23 increase in the price of a car.
Or, in terms of something that's actually useful, each $1,000 increase in income is associated with a $227.50 increase in the price of the car.
So our sample regression suggests a positive relationship between
income and car prices, but how confident are we that this is really the
case?
Ho: income and car prices are unrelated
Ha: income and car prices ARE related
So,
Ho: Income slope coefficient = 0
Ha: Income slope coefficient ≠ 0
We'll use a 5% significance test to render a conclusion (hey, it's the default in KStat).
Now then, KStat tells us that with 98 degrees of freedom, 95% of the sample slopes should lie within 1.98 standard deviations of the hypothesized population slope of zero (that's the t-statistic).
The t-ratio is 9.0759 - so assuming that the population slope is
zero, a slope of 0.2275 is 9.07 standard deviations from zero
Moreover, our p-value says that the likelihood of getting a sample
slope of 0.2275 when the population slope is 0 is almost 0%.
Therefore, we conclude that the car price IS related to income.
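The whole Autorama exercise, sketched in Python (statsmodels) on simulated data built to mimic the notes' numbers (the income range, intercept, and slope above); the real figures would come from the case spreadsheet:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-in for the Autorama data, mimicking the notes' numbers
income = rng.uniform(18_900, 101_300, size=100)
price = 5_788 + 0.2275 * income + rng.normal(0, 2_000, size=100)

model = sm.OLS(price, sm.add_constant(income)).fit()
print(model.params)   # intercept and slope, roughly 5,788 and 0.2275
print(model.tvalues)  # slope t-ratio: how many standard errors from 0
print(model.pvalues)  # p-value for H0: slope = 0
```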
Newspaper Case
Startup costs are $2M to add a Sunday edition; fixed operating costs per annum = $1M.
Company needs to forecast Sunday circulation to determine if the investment is worthwhile.
Break even = 260,000
What does our adjusted coefficient of determination of 86.08% mean?
That means that 86.08% of the variation in Sunday circulation can be explained by variation in daily circulation.
Thus, in order to know our projected Sunday circulation, we need to know our
daily circulation
Ho: with a daily circulation of 190, the average newspaper will not sell enough Sunday papers to make a profit
Ha: with a daily circulation of 190, the average newspaper WILL sell enough Sunday papers to make a profit
OR
Ho: given x = 190, y ≤ 260
Ha: given x = 190, y > 260
Now we have our hypotheses. We want to know how far our point estimate of 281 is from 260. We have to measure this in terms of standard deviations, so we need a t-statistic.
t = (point estimate − 260 (break even))/(standard error of the estimated mean)
SO: (281 − 260)/33.156
t = 0.648
That means that our number is 0.648 standard deviations above 260.
How do we convert that to probability?
Use T.DIST.RT.
Why a one-tailed test? Because our null hypothesis is that Sunday papers are less than or equal to break even. Using a two-tailed test would also lead to rejecting when we sell way more than our point estimate.
So we plug it into the box: t-value = 0.648; Deg_Freedom = 33 (35 observations − 2 coefficients = 33)
And we get 26%: there's a one-in-four chance that you'd get a sample like this if Sunday papers lose money.
For us to feel really good about this, our p-value would have to be less than five percent; it's not even close to that.
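The same calculation sketched in Python (scipy); t.sf is the T.DIST.RT equivalent. Note the t here comes out near 0.63 rather than the notes' 0.648, presumably rounding somewhere in the original inputs; either way the right-tail probability is about 26%:

```python
from scipy import stats

point_estimate = 281  # predicted Sunday circulation (thousands)
break_even = 260
se_mean = 33.156      # standard error of the estimated mean
df = 33               # 35 observations - 2 coefficients

t_stat = (point_estimate - break_even) / se_mean
p_value = stats.t.sf(t_stat, df)  # right-tail area, like T.DIST.RT
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p ~ 0.26
```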
But: this result refers to the average newspaper whose daily circulation is 190. What about OUR newspaper? What are the odds that one of those newspapers, drawn at random, will make a profit? That's going to change our hypothesis test a bit.
So: given daily = 190, what is the probability that one observation drawn at random would be profitable?
Ho: with a daily circulation of 190, an individual newspaper will not sell enough Sunday papers to make a profit
Ha: with a daily circulation of 190, an individual newspaper will sell enough Sunday papers to make a profit
Ho: given x = 190, y_i ≤ 260
Multiple Regressions
Practice Set
Economic theory assumes that the quantity demanded depends on more than just price.
When we focus on price, we're assuming that those other factors are constant.
A demand curve focuses on the relationship between price and quantity demanded, if all other factors are held constant!
Now, suppose a major competitor raised his price. How will that affect the demand for your good?
It'll shift your curve to the right: you'll likely sell more at the same price.
So this is our demand curve. But is it a single demand curve, or two demand curves? Well, let's circle the ones sold given our price AND assuming our competitor was charging $50 (spoiler: it's the lower line).
In other words, when the competitor charged $50, our demand curve was the lower line, but when the competitor raised his price to $60, our demand curve shifted to the right, causing us to sell more. And look, their slopes are even roughly comparable.
BUT initially we didn't report the competitor's price; all regression sees is the price we charged and the quantities we sold. Regression found a single line to best fit the data: right down the center.
So the regression line we had splits the center. If you were to do a bunch of point estimates, you'd get points along the regression line.
Now, based upon our regression equation, if we charged $40, we could expect to sell 112.
BUT... if we account for the fact that our competitor is charging $50, our point estimate is going to be too high; we're overestimating what we'll actually sell.
Likewise, if our competitors are charging $60, we'll be underestimating what we'll actually sell.
What's the point? We didn't tell the computer that there's another variable involved, and so the regression line says "this is the number" when I can guarantee that it's not the best number.
SO: if we're systematically overestimating when the competitor charges $50 and underestimating when the competitor charges $60, we have a problem.
When we look at our data, we can see that when the competitor charges $60, we really did get too low an estimate, and too high an estimate when the other guy charges $50.
Blurgh it all? Not so fast.
Thus far, we've used a single independent variable in our equation: simple linear regression: Q = a − bP
But if there's more than one variable that's important, we need to include more than one variable in our regression.
Using only one variable when you should use two results in an unreliable demand curve. Sacré bleu! C'est pas possible!
Failure to include an important independent variable is referred to as omitted variable bias.
Enter multiple regression.
If a $10 increase in the competitor's price led to a 10-unit increase in unit sales, how much would unit sales increase if the competitor's price increased by $5?
Well, logic would say 5 units.
And a $1 increase would lead to a 1-unit increase.
Basic Equation: Q = a − b1·P + b2·CP
b1: the amount our quantity changes when our price changes
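A minimal sketch of fitting that basic equation by least squares in Python (numpy); the six rows below are invented to match the "$10 in competitor price → 10 units" logic above, not the case data:

```python
import numpy as np

# Invented rows: a $10 rise in the competitor's price (CP) adds 10 units,
# and our own price (P) pushes sales down
P  = np.array([40.0, 40.0, 45.0, 45.0, 50.0, 50.0])  # our price
CP = np.array([50.0, 60.0, 50.0, 60.0, 50.0, 60.0])  # competitor's price
Q  = np.array([100.0, 110.0, 90.0, 100.0, 80.0, 90.0])

# Fit Q = a + b1*P + b2*CP by least squares
X = np.column_stack([np.ones_like(P), P, CP])
coef, *_ = np.linalg.lstsq(X, Q, rcond=None)
print(coef)  # expect roughly a = 130, b1 = -2, b2 = +1
```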
Qualitative Variables
We know that regression calculates coefficients based on numerical data
Quantity
Price
Why? Because we forced a parallel line. All we did was tell the computer to generate a different intercept; we didn't tell it to look at the slopes of the lines.
So... our dummy variable for location dealt with the different intercepts; our equation was Q = 11,721 − 629·P − 1,833·GILBERT.
But clearly we have a difference in slopes. How do we adjust for that??
So how do you adjust for differences in slopes? Well, we need a coefficient to adjust the slope.
We call these interaction terms (the book calls it a slope dummy variable): create a variable in which you multiply the value of the dummy variable by the variable whose slope differs.
In this case, the variable is GILBERT × PRICE
So our equation is
Q = 11,721 − 550·PRICE − 1,833·GILBERT − 70·(GILBERT × PRICE)
So the Gilbert slope is −550 − 70 = −620
Gilbert demand is Q = 9,888 − 620·PRICE
This isn't perfect, but it reduces bias: the over/underestimates are gone now.
So what does this mean? The first part of the equation, the default price, is the same as before: that's Scottsdale. Scottsdale didn't get a 1 dummy variable, so it's a 0, so it's the default.
Now then, −1,833·GILBERT is how you adjust the intercept to get the Gilbert intercept.
The interaction coefficient of −70 adjusts the slope.
Why do we do the dummy variable system?
Why not just separate out Scottsdale and Gilbert?
Because we get better results with more observations: more efficient estimates, and the same equation.
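A sketch of the interaction-term setup in Python (statsmodels' formula interface); the data frame rows are invented stand-ins, not the actual Scottsdale/Gilbert data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented stand-ins for the two-location data (GILBERT = 1, Scottsdale = 0)
df = pd.DataFrame({
    "Q":       [9050, 8480, 8010, 7690, 7120, 6480],
    "PRICE":   [4.0, 5.0, 6.0, 4.0, 5.0, 6.0],
    "GILBERT": [0, 0, 0, 1, 1, 1],
})

# GILBERT shifts the intercept; GILBERT:PRICE lets the slope differ too
model = smf.ols("Q ~ PRICE + GILBERT + GILBERT:PRICE", data=df).fit()
print(model.params)
# Scottsdale slope = PRICE coefficient;
# Gilbert slope = PRICE coefficient + GILBERT:PRICE coefficient
```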
Facts
CEO Seek advertises it can find a suitable candidate within 15 days or the service is free.
Our job is to determine if the premise is feasible.
You suspect that it's harder to find a suitable CEO for a large firm than for a smaller firm.
Issues Presented
Is the size of the firm related to the number of days needed to find a
suitable candidate?
What would you recommend regarding the 15-day guarantee?
Does the search for a CEO differ from the search for lower level
managers in terms of the number of days?
Data
48 observations, for either a CEO search or lower level management
search.
Each shows # days required to find candidate
Each observation shows size of client firm.
So. Data set Headhunting.xls
Days: number of days to find candidate
Size: number of employees (measured in hundreds)
LOWConst: lower level manager = 1; CEO = 0
LOWSlope = interaction term (size of firm multiplied by the dummy
variable)
Regression: Days v. Size
So the coefficient for size says 0.00597
ΔDAYS/ΔSIZE: so for every 100 more employees, you add 0.006 days to the job search.
But we also have to look at the t-statistic and p-value.
If there were no relation between firm size and the days it takes to fill the job, the slope would be 0.
So look at the t-ratio: 0.29 deviations from zero. And the P-value tells us that if there were no relation between the two, the chance of reaching into the population and pulling out a sample that shows a relationship this strong is 76%.
So we fail to reject the null hypothesis that there is NO relation between the two.
But do a scatterplot!
So that's why we got that tiny slope!
The regression line split the difference!
Better throw in a dummy variable, huh?
Regression with Dummy Variable for CEO and Manager
DAYS = 18.82 − 0.005(SIZE) − 8.935(LOWConst)
So for every 100 more employees, the number of days required goes down by 0.005.
But what does R^2 mean again?
72.29% of the variance can be explained by knowing the firm size and whether you're searching for a CEO or a manager.
What about the t-statistic that corresponds to the SIZE coefficient (KStat calls it a t-ratio)? 0.4899
So pretty small.
And the slope? In the regression with the interaction term, you take the SIZE slope of 0.0887 and add the LOWSlope coefficient, which comes to −0.105.
Now go to Univariate Statistics.
Now we're going to develop a point estimate and 95% confidence and prediction intervals for firms of 2,000, 5,000, and 9,000 employees.
And our predicted values are 14.95, 17.61, and 21.16 days (for CEOs).
What about management?
10.62, 8.33, and 5.28, respectively.
Conclusions: the larger the firm, the longer the search for a CEO.
The larger the firm, the shorter the search for lower-level managers.
The 15-day guarantee works for lower-level managers, but not for CEOs.
Vocabulary
General Terms
Standard deviation: rarely used because it's of a full population.
Standard error of the coefficient: standard deviation of your sample coefficient.
AKA sometimes called Std. Dev. of the Sample or Std. Error of the Sample.
t-ratio: distance that the coefficient is from a hypothetical slope of 0, measured in standard deviations (Coefficient/Standard Error of the Coefficient).
Specifically, where the coefficient lies relative to zero in terms of standard errors of the coefficient.
Significance: the same information as the t-ratio, except expressed as a probability.
Coefficient of determination: doesn't take into account degrees of freedom.
Adjusted coefficient of determination: explains how much of the variation in
the dependent variable can be explained by variation in the independent
variable(s)
AKA R^2 (i.e., KStat is weird, everyone else calls it R^2)
T-statistic: the number you would use to create the upper and lower bounds for a confidence interval. t = (sample mean − population mean)/standard error
Say you want to test true pop mean = 1; sample mean = 0.33, st. dev. = 0.005, df = 129, 5% significance test
Ho: pop mean = 1
Ha: pop mean ≠ 1
If the t-statistic you generate is less than the one in the t-table, you fail to reject the null hypothesis
SO: t-statistic = |0.33 − 1|/0.005 = 134
The t-table (5%, ~129 df) says 1.984
SO: if your calculated t-statistic is less than the appropriate t-statistic from the t-chart, then you fail to reject the null hypothesis; here 134 is way past 1.984, so we reject.
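That decision rule, sketched in Python (scipy) with the notes' numbers:

```python
from scipy import stats

t_calc = abs(0.33 - 1) / 0.005            # = 134
t_table = stats.t.ppf(1 - 0.05 / 2, 129)  # ~1.98 for a 5% two-tailed test

# Decision rule: reject H0 when the calculated t exceeds the table value
verdict = "reject H0" if t_calc > t_table else "fail to reject H0"
print(f"t_calc = {t_calc:.0f}, t_table = {t_table:.3f}: {verdict}")
```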
Standard error of the mean:
Standard Errors...
Standard error of regression: the standard deviation of an individual observation around the regression equation.
What you use to build confidence intervals.
Standard error of prediction: incorporates the unknowns and makes the interval a bit bigger.
Standard error of the mean: when in doubt, use this.
How to build a confidence interval for a point estimate. Say you don't want a 95% confidence interval; you want a 90% confidence interval.
Given coefficient = 0.235, st. error = 0.97, 100 df, 90% confidence interval.
Between this and our t-chart, we have everything we need to build a 90% confidence interval.
We're hypothesizing that the center of our bell curve is 0.235; we want to know the upper and lower bounds that embrace 90% of that bell curve.
Look at the confidence level at the bottom.
So t-stat = 1.66, so you use that to convert your standard error into the distance in both directions.
Coefficient ± (t-statistic from chart)(standard error)
= 0.235 ± (1.66)(0.97)
Given coefficient of 3.5; std. error = 1.2; df = 30; 95% confidence interval
= 3.5 ± (2.042)(1.2)
If your standard error gets smaller, your window will shrink; same if you decrease your confidence level, because you're looking for something quite narrow: "50% of the time it will fall between these numbers" is more specific than "99% of the time it will fall between these numbers."
Use the standard error of the coefficient.
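A little helper (Python/scipy) that builds both of the intervals above from the chart logic; it should reproduce the 1.66 and 2.042 table values:

```python
from scipy import stats

def coef_ci(coef, se, df, confidence=0.90):
    """Coefficient +/- t * standard error of the coefficient."""
    t = stats.t.ppf(1 - (1 - confidence) / 2, df)
    return coef - t * se, coef + t * se

print(coef_ci(0.235, 0.97, 100, 0.90))  # uses t ~ 1.66
print(coef_ci(3.5, 1.2, 30, 0.95))      # uses t ~ 2.042
```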
Multicollinearity
Second: extended
Third: difference between the two. Significance less than 5% causes you
to reject the null hypothesis that there is no explanatory value.
Time Series
Spurious Correlations
So we have this weight/CPI regression: Weight = 70 + 0.54(CPI)
Significance = 0.000%
So inflation is CLEARLY responsible for weight.
These are type I errors.
So let's look at the refrigerator again. Demand curves:
You may be trying to estimate a demand curve and then find that when price goes up, demand goes up.
But the cause could be that demand is just plain higher.
Or maybe the competition raised its prices more.
Beware reverse causation: "fire trucks cause fires."
Remember: regression doesn't measure causation, merely correlation.
So as an example, the chief manager of a wine company believes that preferences for brands of wine are related to household incomes.
Sales A refers to Almaden. Run a regression in the form Sales A = a + b(Income A)
So we run a regression on our data and come up with the formula Sales A = 3 + 0.50(Income). BUT income and sales are in thousands of dollars.
0.50 means for every $1,000 increase in income, sales increase by $500.
Each $1 increase in income increases Almaden sales by $0.50.
And for Bianco wine:
Sales B = a + b(Income B)
And look, we get the same equation: Sales B = a + 0.5(Income B)
So it's about the same as A. But look at the scatter plot:
But what happens when there's bad info? Say you put in 8.41 as 841?
Totally tanked our R2! 7.01%. And our t-ratio is 1.32, with significance of 21.8%.
Sales = −192.78 + 30.62(Income)
So... every dollar increase in income translates to an extra 30.62 in sales.
And when you plot it on a fitted line plot:
The regression line is way steeper than it should be.
Note: when you saw the correct data, it looked like the data was upward sloping.
Now it looks flat.
But it's not; it just looks that way because that outlier up at 841 messes with the scale.
Okay, that was a typo: always double-check your data entry. But... what if the outlier wasn't a typo? What if there really was that oddball day?
One way to minimize the impact of an outlier is to increase sample size.
Well lookie there, the impact of the outlier has been substantially reduced from when it was one of 14. Bias is still there, but it's much smaller.
Note: we just copied and pasted the data several times; that's not what you'd really do, obviously.
Here's what you really want to do though: unless it was the result of a typo, an outlier probably has a cause.
If you can determine the cause, you can add the independent variable that was responsible for the outlier to your equation.
Add an outlier dummy variable!!! In this case, we're saying the cause was a huge Black Friday sale, so our dummy variable is BF = 1 on that day, BF = 0 otherwise.
Now we have the equation: Sales = 4.01 + 0.34(Income) + 832(BF)
And look at the residuals: as you'd expect with an R2 this high, the fitted values are essentially equal to the actual values, including for the BF observation.
Never, ever, ever omit the outliers from your data sets!
Why? Because when you're doing empirical studies, you're testing a theory. So when you omit outliers, you're giving the impression that you rigged your results.
Each one-cent increase in the price of a Dubuque hotdog causes its market share to fall by 0.076%.
Logical enough: that's the demand curve.
But let's evaluate this.
T-statistic: this number is 9.6042 standard deviations below 0.
So the odds of getting this number if there were no relationship between price and market share are very, very low.
So our market share is clearly related to the price of an Oscar Meyer
hotdog.
Ditto for the PBReg regression.
POscar: 0.00026333
Every cent increase in Oscar Meyer's price causes Dubuque's market share to rise by 0.026%.
And look at the t-statistic: 3.13 standard deviations above 0, P-value = 0.0%.
PBPReg: 0.000459698. So for every 1-cent increase in BPReg's price, Dubuque gains market share of 0.046%.
T-stat = 5.88
P-val = 0.0%
But... look at R2: 51.27%
Now we run the regression, but swap out BPReg and put in all-beef.
In terms of Dubuque's market share/price relationship, that stayed the same. Ditto for Oscar Meyer's price and our market share.
And PBPBeef has a slope of 0.00040, so for every 1-cent increase in price, our market share increases by 0.04%.
T-ratio = 5.7
Significance = 0.0%
So our colleague who says OM is our only competitor is full of bologna (nyuck nyuck nyuck).
Now run the regression one more time and include all the brands.
Well, shoot. No impact on the relation between our price and market share, or between OM's price and our market share, but look what it did to our BP data!
Now it looks like we don't have sufficient information to say that there's a relationship between BP's price and our market share.
Our significance levels went to 29.7% and 72.8%. Useless!
Okay, now this is just funky: each BP hotdog, flying solo, clearly had a relation to our market share.
But together, no impact?
Either our market share is influenced, or it's not. But we have strong evidence that on their own they matter, yet together they don't.
So... what's the explanation?
Go to charts and plot PDub (independent variable) against POscar (independent variable):
Not much of a correlation; maybe a positive correlation? But it's not strong.
Now Oscar Meyer against BPReg:
Again, there appears to be some positive correlation, but it's not strong.
Now BPReg v. BPBeef:
Well, there's a strong correlation there.
This is visual evidence of multicollinearity: strong correlation between independent variables.
In Ch. 5, we talked about perfect multicollinearity: when you have perfect multicollinearity, you can't run a regression equation at all.
But here, the two variables are strongly correlated, but not perfectly.
So we can still estimate a regression equation, but it has a problem.
How do you spot multicollinearity?
Look at the correlation coefficient: a correlation coefficient of 1 indicates a perfect positive correlation between variables; −1 indicates perfect negative correlation; 0 means no correlation.
So how do you do this? Go to KStat and click on correlations.
So we have the correlation coefficient for our hotdog price, and it's positively correlated with all other brands: as other brands' prices go up, so do ours.
But look at the correlation coefficient relating the ball park regular price to the all-beef price. As you can see, it's almost 1.
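What "click on correlations" produces, sketched in Python (pandas); the four price columns below are invented stand-ins for the hotdog data set:

```python
import pandas as pd

# Invented price columns standing in for the hotdog data set
prices = pd.DataFrame({
    "PDub":    [1.50, 1.55, 1.60, 1.58, 1.62, 1.65],
    "POscar":  [1.80, 1.82, 1.85, 1.90, 1.88, 1.95],
    "PBPReg":  [1.20, 1.22, 1.25, 1.30, 1.28, 1.35],
    "PBPBeef": [1.26, 1.28, 1.30, 1.34, 1.33, 1.41],
})

# Pairwise correlation coefficients; values near +/-1 flag multicollinearity
print(prices.corr())  # expect PBPReg and PBPBeef to be close to 1
```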
Why does multicollinearity create problems?
Our equation is: MKTDUB = a + b1·PDub + b2·POscar + b3·PBPReg + b4·PBPBeef
You don't know if the pattern is going to continue, and so regression will give you a wider confidence interval.
Hidden Extrapolation
So you pick values within your data.
But... even though you pick numbers within your range, you can still do hidden extrapolation if your prediction values never appear at the same time; hence it's as if you extrapolated outside the sample range.
So: you've picked numbers that are all within your range; your standard error of the estimated mean, used to calculate your confidence interval, is 0.002444.
But if you change one input to something that's outside your sample range (and by that I mean moving one variable that's linked to another more than the link would suggest), the standard error of the estimated mean gets way bigger, making your confidence interval bigger.
EXAM REVIEW
Ch. 8
So we plot advertising expense against sales and get this scatterplot:
Pretty ugly, n'est-ce pas?
So much prettier
Semi-log: another way to deal with this is to convert the dependent variable into natural logs and estimate the equation in the form LN(Sales) = a + b·EXP
How do you do that?
Create a column that converts the dependent variable into logarithms by using =LN(SALES)
Now run a regression on LN(Sales) and Expense.
And we get an equation.
The coefficient states that a $1 increase in advertising expenditures leads to a 0.06 increase in the natural log of sales.
But it ALSO means that a $1 increase in advertising expenditures leads to a 6% increase in sales.
Same thing, but the second one sounds a lot better in a presentation.
But when you have the dependent variable expressed as a percentage, the relationship is nonlinear.
And yet... it doesn't fit as well:
So let's try log-log: create a log column for both the independent AND dependent variables.
And it does, in fact, look better.
It finds a linear relation between the logs, although the logs, by definition, are nonlinear.
And sure enough, this fits much, MUCH better than linear or semi-log, and even slightly better than the quadratic.
Now, suppose you wanted to predict sales when advertising expenditure = $2,000.
Well, first of all, your input needs to be the log of $2,000.
But second, the data was input into the spreadsheet in thousands.
SO you'd actually plug in =LN(2)
And we GET 2.816.
What does that mean?
That means that the natural log of sales is 2.816.
So how do you get the level of sales?
Use =EXP(Number)
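The round trip sketched in Python (numpy), mirroring the Excel steps (=LN(2) in, =EXP(2.816) out):

```python
import numpy as np

ln_sales = 2.816          # the regression's prediction for LN(Sales) at =LN(2)
sales = np.exp(ln_sales)  # Excel equivalent: =EXP(2.816)
print(round(sales, 1))    # ~16.7, in the units of the original Sales column
```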