Uniform Distribution
➔ Mean: E(X) = (a + b)/2
➔ Variance: Var(X) = (b − a)²/12

Normal Distribution
➔ governed by 2 parameters: μ (the mean) and σ (the standard deviation)
➔ X ~ N(μ, σ²)
➔ Standardize Normal Distribution: Z = (X − μ)/σ
➔ Z-score is the number of standard deviations the related X is from its mean
➔ ** Z < some value: the probability is read straight from the table
➔ ** Z > some value: the probability is (1 − table probability)

Normal Distribution Example
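A minimal sketch of these table rules with scipy; the N(60, 5²) population and X = 70 are made-up numbers:

```python
from scipy.stats import norm

# Z-score: how many standard deviations is X = 70 from N(60, 5^2)?
z = (70 - 60) / 5                 # Z = (X - mu) / sigma = 2.0

p_below = norm.cdf(z)             # P(Z < 2.0) ~ 0.977, read straight from the table
p_above = 1 - norm.cdf(z)         # P(Z > 2.0) = 1 - table probability ~ 0.023
print(z, p_below, p_above)
```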
Sums of Normals
➔ Cov(X, Y) = 0 because X and Y are independent

Central Limit Theorem
➔ as n increases, x̄ should get closer to μ (the population mean)
➔ mean(x̄) = μ
➔ variance(x̄) = σ²/n
➔ x̄ ~ N(μ, σ²/n)
  ◆ if the population is normally distributed, n can be any value
  ◆ for any population, n needs to be ≥ 30

Confidence Intervals
➔ tell us how good our estimate is
➔ ** Want high confidence and a narrow interval
➔ ** As confidence increases, the interval also widens

A. One Sample Proportion
➔ p̂ = x/n, where x = number of successes and n = sample size
➔ We are thus 95% confident that the true population proportion is in the interval…
➔ We are assuming that n is large (n·p̂ > 5) and that our sample size is less than 10% of the population size.

Example of Sample Proportion Problem
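A minimal sketch of the interval computation; the counts (120 successes out of n = 400) are hypothetical:

```python
import math

x, n = 120, 400                   # successes, sample size (made up)
p_hat = x / n                     # p-hat = x / n = 0.30

# stated assumptions: n is large and n * p-hat > 5
assert n * p_hat > 5

# 95% CI: p-hat +/- 1.96 * sqrt(p-hat * (1 - p-hat) / n)
se = math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - 1.96 * se, p_hat + 1.96 * se)   # ~ (0.255, 0.345)
```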
Standard Error and Margin of Error
➔ standard error of the mean: s/√n; margin of error: 1.96·(s/√n)

B. One Sample Mean
* Stata always uses the t-distribution when computing confidence intervals
➔ For samples n > 30, the 95% confidence interval is x̄ ± 1.96·(σ/√n)
➔ If n > 30, we can substitute s for σ, so that we get x̄ ± 1.96·(s/√n)
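A minimal sketch of the large-sample interval; n, x̄, and s are hypothetical:

```python
import math

n, x_bar, s = 64, 52.0, 8.0       # made-up sample summary, n > 30

margin = 1.96 * s / math.sqrt(n)  # margin of error, s substituted for sigma
print(x_bar - margin, x_bar + margin)   # 95% CI: (50.04, 53.96)
```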
Hypothesis Testing
➔ Null Hypothesis: H₀, a statement of no change, assumed true until evidence indicates otherwise.
➔ Alternative Hypothesis: Hₐ, a statement that we are trying to find evidence to support.
➔ Type I error: reject the null hypothesis when the null hypothesis is true. (considered the worst error)
➔ Type II error: do not reject the null hypothesis when the alternative hypothesis is true.

Example of Type I and Type II errors: in a trial with H₀ = "the defendant is innocent", convicting an innocent person is a Type I error, while acquitting a guilty person is a Type II error.
Determining Sample Size
➔ n = (1.96)²·p̂(1 − p̂) / e²
➔ If given a confidence interval, p̂ is the middle number of the interval
➔ If no confidence interval is given, use the worst-case scenario: p̂ = 0.5
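A minimal sketch of the formula; e = 0.03 is a hypothetical margin of error:

```python
import math

e = 0.03                # desired margin of error (made up)
p_hat = 0.5             # worst case, since no prior interval is given

n = (1.96 ** 2) * p_hat * (1 - p_hat) / e ** 2
print(math.ceil(n))     # always round up -> 1068
```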
T Distribution used when:
➔ σ is not known, n < 30, and the data is normally distributed
➔ For samples n < 30, the confidence interval is x̄ ± t(n−1)·(s/√n)
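A minimal sketch of a t-based interval; the eight data points are made up:

```python
import math
from statistics import mean, stdev
from scipy.stats import t

data = [9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0]   # hypothetical, n < 30
n, x_bar, s = len(data), mean(data), stdev(data)

t_crit = t.ppf(0.975, df=n - 1)          # t* replaces 1.96 for small n
margin = t_crit * s / math.sqrt(n)
print(x_bar - margin, x_bar + margin)
```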
Methods of Hypothesis Testing
1. Confidence Intervals **
2. Test statistic
3. P-values **
➔ C.I. and P-values are always safe to use because you don't need to worry about the size of n (it can be bigger or smaller than 30)
One Sample Hypothesis Tests
1. Confidence Interval (can be used only for two-sided tests)
2. Test Statistic Approach (Population Mean)
3. Test Statistic Approach (Population Proportion)
4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ ** If P is low (less than 0.05), H₀ must go: reject the null hypothesis
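A minimal sketch combining approaches 2 and 4 (test statistic plus p-value); the sample numbers and H₀: μ = 50 are hypothetical:

```python
import math
from scipy.stats import norm

n, x_bar, s, mu0 = 36, 52.1, 6.0, 50.0      # made-up sample, H0: mu = 50

z = (x_bar - mu0) / (s / math.sqrt(n))      # test statistic
p_value = 2 * (1 - norm.cdf(abs(z)))        # two-sided p-value

# "If P is low (less than 0.05), H0 must go"
print("reject H0" if p_value < 0.05 else "do not reject H0", z, p_value)
```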
Two Sample Hypothesis Tests
1. Comparing Two Proportions (Independent Groups)
➔ Calculate the Confidence Interval
➔ Test Statistic for Two Proportions (see the sketch after this section)
2. Comparing Two Means (large independent samples, n > 30)
➔ Calculate the Confidence Interval
➔ Test Statistic for Two Means

Matched Pairs
➔ Two samples are DEPENDENT
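A minimal sketch of the two-proportion test and interval from item 1; the group counts are hypothetical:

```python
import math
from scipy.stats import norm

x1, n1 = 45, 100                  # group 1 successes, size (made up)
x2, n2 = 30, 100                  # group 2 successes, size (made up)
p1, p2 = x1 / n1, x2 / n2

# pooled proportion under H0: p1 = p2
p_pool = (x1 + x2) / (n1 + n2)
se0 = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se0
p_value = 2 * (1 - norm.cdf(abs(z)))

# 95% CI for p1 - p2 uses the unpooled standard error
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(z, p_value, (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se))
```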
Simple Linear Regression
➔ used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables)
➔ Ŷ = b₀ + b₁X
➔ Residual: e = Y − Ŷ
➔ Fitting error: eᵢ = Yᵢ − Ŷᵢ = Yᵢ − b₀ − b₁Xᵢ
  ◆ e is the part of Y not related to X
➔ Values of b₀ and b₁ which minimize the residual sum of squares are:
  ◆ slope: b₁ = r·(s_y/s_x)
  ◆ intercept: b₀ = Ȳ − b₁·X̄
➔ Interpretation of slope: for each additional x value (e.g. each mile on the odometer), the y value decreases/increases by an average of the b₁ value
➔ Interpretation of y-intercept: plug in 0 for x, and the value you get for ŷ is the y-intercept (e.g. ŷ = 3.25 − 0.0614·SkippedClass: a student who skips no classes has a GPA of 3.25)
➔ ** Danger of extrapolation: if an x value is outside of our data set, we can't confidently predict the fitted y value

Properties of the Residuals and Fitted Values
1. Mean of the residuals = 0; sum of the residuals = 0
2. Mean of the original values is the same as the mean of the fitted values: Ȳ = mean(Ŷ)
3. corr(Ŷ, e) = 0
4. Correlation Matrix

A Measure of Fit: R²
➔ R²: the coefficient of determination
➔ R² = SSR/SST = 1 − SSE/SST
➔ R² is between 0 and 1; the closer R² is to 1, the better the fit
➔ Good fit: SSR is big, SSE is small
➔ SST = SSR: perfect fit
➔ Interpretation of R²: (e.g. 65% of the variation in the selling price is explained by the variation in odometer reading; the rest, 35%, remains unexplained by this model)
➔ ** R² doesn't indicate whether the model is adequate **
➔ As you add more X's to the model, R² goes up
➔ Guide to finding SSR, SSE, SST (see the sketch below)
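A minimal sketch of the b₀/b₁ formulas and the SSR/SSE/SST bookkeeping; the odometer-style data are made up:

```python
import numpy as np

x = np.array([37.4, 44.8, 45.8, 30.9, 31.7, 34.0, 45.9, 19.1, 40.1, 40.2])
y = np.array([14.6, 14.1, 14.0, 15.6, 15.6, 14.7, 14.5, 15.7, 14.9, 14.8])

# b1 = r * (sy / sx), b0 = y-bar - b1 * x-bar
r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x               # fitted values
e = y - y_hat                     # residuals (mean ~ 0, sum ~ 0)

SSE = (e ** 2).sum()
SST = ((y - y.mean()) ** 2).sum()
SSR = SST - SSE
print(b0, b1, SSR / SST, 1 - SSE / SST)   # the two R^2 forms agree
```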
Assumptions of Simple Linear Regression
1. We model the AVERAGE of something rather than the thing itself
2.

Estimating Se
➔ Se² = SSE/(n − 2)
➔ Se² is our estimate of σ²
➔ Se = √(Se²) is our estimate of σ
➔ 95% of the Y values should lie within the interval b₀ + b₁X ± 1.96·Se

Standard Errors for b1 and b0
➔ standard errors grow when there is noise
➔ s_b0: the amount of uncertainty in our estimate of β₀ (small s good, large s bad)
➔ s_b1: the amount of uncertainty in our estimate of β₁
  ◆ As ε (noise) gets bigger, it's harder to find the line
➔ n small → bad
➔ s_e big → bad
➔ s²_x small → bad (want the x's spread out for a better guess)

Confidence Intervals for b1 and b0
➔ b₁ ± 1.96·s_b1 and b₀ ± 1.96·s_b0

Regression Hypothesis Testing
* always a two-sided test
➔ want to test whether the slope (β₁) is needed in our model
➔ H₀: β₁ = 0 (don't need x)
  Hₐ: β₁ ≠ 0 (need x)
➔ Need X in the model if:
  a. 0 isn't in the confidence interval
  b. |t| > 1.96
  c. P-value < 0.05

Test Statistic for Slope/Y-intercept
➔ can only be used if n > 30
➔ if n < 30, use p-values

Example of Prediction Intervals
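A minimal sketch of Se, the slope test, and the rough ±1.96·Se band above (made-up data; this follows the cheat-sheet rule, not an exact prediction interval):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # hypothetical
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

b1, b0 = np.polyfit(x, y, 1)
n = len(x)
SSE = ((y - (b0 + b1 * x)) ** 2).sum()
se = np.sqrt(SSE / (n - 2))                  # Se, estimate of sigma

# slope test: H0: beta1 = 0; need x if |t| > 1.96
sb1 = se / np.sqrt(((x - x.mean()) ** 2).sum())
print("t =", b1 / sb1, "CI:", (b1 - 1.96 * sb1, b1 + 1.96 * sb1))

# rough 95% band at x0: b0 + b1*x0 +/- 1.96*Se
x0 = 4.5
y0 = b0 + b1 * x0
print(y0 - 1.96 * se, y0 + 1.96 * se)
```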
Multiple Regression
➔ Variable Importance:
  ◆ higher t-value, lower p-value = the variable is more important
  ◆ lower t-value, higher p-value = the variable is less important (or not needed)

Interaction Terms
➔ allow the slopes to change
➔ an interaction between 2 or more x variables that will affect the Y variable
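A minimal sketch of an interaction term as the product of two x columns; the data-generating numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n)            # e.g. a dummy variable
y = 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

# design matrix includes the x1*x2 product as its own column
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# slope of x1 is b[1] when x2 = 0 and b[1] + b[3] when x2 = 1:
# the interaction lets the slope change across groups
print(b)
```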
Adjusted R-squared
➔ k = # of X's
➔ Adj. R-squared will go down as you add junk x variables
➔ Adj. R-squared will go up only if the x you add in is very useful
➔ ** want Adj. R-squared high and Se low for a better model

Modeling Regression: Backward Stepwise Regression
1. Start with all variables in the model
2. At each step, delete the least important variable, based on the largest p-value above 0.05
3. Stop when you can't delete any more
➔ Watch the Adj. R-squared and Se change at each step
The Overall F Test
➔ Always want to reject the F test (reject the null hypothesis)
➔ Look at the p-value (if < 0.05, reject the null)
➔ H₀: β₁ = β₂ = β₃ = ... = βk = 0 (don't need any X's)
  Hₐ: at least one βⱼ ≠ 0 (need at least 1 X)
➔ If no x variables are needed, then SSR = 0 and SST = SSE

Dummy Variables
➔ An indicator variable that takes on a value of 0 or 1; dummy variables allow the intercepts to change

How to Create Dummy Variables (Nominal Variables)
➔ If C is the number of categories, create (C − 1) dummy variables for describing the variable
➔ One category is always the "baseline", which is included in the intercept

Recoding Dummy Variables
Example: how many hockey sticks are sold in the summer (original equation, summer as the baseline):
  hockey = 100 + 10·Wtr − 20·Spr + 30·Fall
Write the equation for how many hockey sticks are sold in the winter (winter as the baseline):
  hockey = 110 + 20·Fall − 30·Spr − 10·Summer
➔ ** You always need to get the same exact values as from the original equation (see the check below)
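A quick check of the hockey-stick recoding, confirming both equations give the same exact values for every season:

```python
def summer_baseline(wtr, spr, fall):
    return 100 + 10 * wtr - 20 * spr + 30 * fall

def winter_baseline(fall, spr, summer):
    return 110 + 20 * fall - 30 * spr - 10 * summer

# season -> (Wtr, Spr, Fall) coding and (Fall, Spr, Summer) coding
seasons = {
    "winter": ((1, 0, 0), (0, 0, 0)),
    "spring": ((0, 1, 0), (0, 1, 0)),
    "summer": ((0, 0, 0), (0, 0, 1)),
    "fall":   ((0, 0, 1), (1, 0, 0)),
}
for name, (a, b) in seasons.items():
    print(name, summer_baseline(*a), winter_baseline(*b))  # identical pairs
```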
Regression Diagnostics
Standardized Residuals

Check Model Assumptions
➔ Plot residuals versus Y-hat
  ◆ Homoskedastic: a band around the values
  ◆ Heteroskedastic: as x goes up, the noise goes up (no more band; fan-shaped)
  ◆ If heteroskedastic, fix it by logging the Y variable
  ◆ If heteroskedastic, fix it by making the standard errors robust
➔ Outliers
  ◆ Regression likes to move towards outliers (shows up as R² being really high)
  ◆ want to remove an outlier that is extreme in both x and y
➔ Nonlinearity (ovtest)
  ◆ Plotting residuals vs. fitted values will show a relationship if the data is nonlinear (R² also high)
  ◆ ovtest: a significant test statistic indicates that polynomial terms should be added
  ◆ H₀: data = no transformation
    Hₐ: data ≠ no transformation
  ◆ Log transformation accommodates nonlinearity, reduces right skewness in the Y, eliminates heteroskedasticity
  ◆ ** Only take the log of the X variable, so that we can compare models; can't compare models if you take the log of Y
  ◆ Transformations cheatsheet
➔ Normality (sktest)
  ◆ H₀: data = normality
    Hₐ: data ≠ normality
  ◆ don't want to reject the null hypothesis; the p-value should be big
➔ Homoskedasticity (hettest)
  ◆ H₀: data = homoskedasticity
  ◆ Hₐ: data ≠ homoskedasticity
➔ Multicollinearity
  ◆ when x variables are highly correlated with each other
  ◆ R² > 0.9
  ◆ pairwise correlation > 0.9
  ◆ correlate all x variables, include the y variable, and drop the x variable that is less correlated to y

Summary of Regression Output
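A rough Python sketch of these checks (analogues of Stata's hettest and sktest, not the commands themselves; the fan-shaped data are simulated):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(scale=x)        # noise grows with x: fan-shaped

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# hettest analogue (Breusch-Pagan): small p-value -> reject homoskedasticity
_, bp_pvalue, _, _ = het_breuschpagan(res.resid, X)

# sktest analogue: skewness/kurtosis normality test on the residuals
_, norm_pvalue = stats.normaltest(res.resid)
print(bp_pvalue, norm_pvalue)

# one fix: heteroskedasticity-robust standard errors
print(res.get_robustcov_results(cov_type="HC1").bse)
```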