Statistics Cheat Sheet-Harvard

Population - entire collection of objects or individuals about which information is desired.
➔ easier to take a sample
◆ Sample - part of the population that is selected for analysis
◆ Watch out for:
● Limited sample size that might not be representative of population
◆ Simple Random Sampling - every possible sample of a certain size has the same chance of being selected

Observational Study - there can always be lurking variables affecting results
➔ i.e., strong positive association between shoe size and intelligence for boys
➔ **should never show causation

Experimental Study - lurking variables can be controlled; can give good evidence for causation

Descriptive Statistics Part I
➔ Summary Measures

➔ Mean - arithmetic average of data values
◆ **Highly susceptible to extreme values (outliers). Goes towards extreme values
◆ Mean could never be larger or smaller than max/min value but could be the max/min value

➔ Median - in an ordered array, the median is the middle number
◆ **Not affected by extreme values

➔ Quartiles - split the ranked data into 4 equal groups
◆ Box and Whisker Plot

➔ Range = $X_{maximum} - X_{minimum}$
◆ Disadvantages: Ignores the way in which data are distributed; sensitive to outliers

➔ Interquartile Range (IQR) = 3rd quartile - 1st quartile
◆ Not used that much
◆ Not affected by outliers

➔ Variance - the average distance squared
$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
◆ $s_x^2$ gets rid of the negative values
◆ units are squared

➔ Standard Deviation - shows variation about the mean
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
◆ highly affected by outliers
◆ has same units as original data
◆ finance = horrible measure of risk (trampoline example)

Descriptive Statistics Part II
Linear Transformations
➔ Linear transformations change the center and spread of data
➔ $Var(a + bX) = b^2 Var(X)$
➔ $Average(a + bX) = a + b[Average(X)]$
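
As a rough illustration of the summary measures and the two linear-transformation rules, the sketch below uses Python's standard `statistics` module. The data set and the constants `a` and `b` are made up for the example and do not come from the sheet.

```python
import statistics

# Hypothetical sample; the value 40 plays the role of an outlier.
data = [2, 4, 4, 5, 7, 9, 40]

mean = statistics.mean(data)          # pulled toward the outlier
median = statistics.median(data)      # middle value of the ordered data
data_range = max(data) - min(data)    # X maximum - X minimum
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                         # 3rd quartile - 1st quartile
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

# Linear transformation Y = a + bX with hypothetical constants a and b.
a, b = 10, -3
transformed = [a + b * x for x in data]

print("mean", mean, "median", median, "range", data_range, "IQR", iqr)
print("Average(a+bX):", statistics.mean(transformed), "vs", a + b * mean)
print("Var(a+bX):", statistics.variance(transformed), "vs", b**2 * variance)
```

Note that quartile conventions differ between tools; `statistics.quantiles` defaults to the "exclusive" method, so other software may report slightly different quartiles for the same data.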

➔ Effects of Linear Transformations:
◆ $mean_{new} = a + b \cdot mean$
◆ $median_{new} = a + b \cdot median$
◆ $stdev_{new} = |b| \cdot stdev$
◆ $IQR_{new} = |b| \cdot IQR$
➔ Z-score - new data set will have mean 0 and variance 1
$z = \frac{X - \bar{X}}{S}$

Empirical Rule
➔ Only for mound-shaped data
Approx. 95% of data is in the interval:
$(\bar{x} - 2s_x, \bar{x} + 2s_x) = \bar{x} \pm 2s_x$
➔ only use if you just have mean and std. dev.

Chebyshev's Rule
➔ Use for any set of data and for any number k, greater than 1 (1.2, 1.3, etc.)
➔ At least $1 - \frac{1}{k^2}$ of the data falls within k standard deviations of the mean
➔ (Ex) for k=2 (2 standard deviations), at least 75% of data falls within 2 standard deviations

Detecting Outliers
➔ Classic Outlier Detection
◆ doesn't always work
◆ $|z| = \left| \frac{X - \bar{X}}{S} \right| \geq 2$
➔ The Boxplot Rule
◆ Value X is an outlier if:
$X < Q_1 - 1.5(Q_3 - Q_1)$
or
$X > Q_3 + 1.5(Q_3 - Q_1)$

Skewness
➔ measures the degree of asymmetry exhibited by data
◆ negative values = skewed left
◆ positive values = skewed right
◆ if |skewness| < 0.8 = don't need to transform data

Measurements of Association
➔ Covariance
◆ Covariance > 0 = larger x, larger y
◆ Covariance < 0 = larger x, smaller y
◆ $s_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
◆ Units = Units of x × Units of y
◆ Covariance only tells you the sign (+, -, or 0); its magnitude can be any number
➔ Correlation - measures strength of a linear relationship between two variables
◆ $r_{xy} = \frac{\text{covariance}_{xy}}{(\text{std. dev.}_x)(\text{std. dev.}_y)}$
◆ correlation is between -1 and 1
◆ Sign: direction of relationship
◆ Absolute value: strength of relationship (-0.6 is stronger relationship than +0.4)
◆ Correlation doesn't imply causation
◆ The correlation of a variable with itself is one

Combining Data Sets
➔ Mean (Z) = $\bar{Z} = a\bar{X} + b\bar{Y}$
➔ Var (Z) = $s_z^2 = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)$

Portfolios
➔ Return on a portfolio:
$R_p = w_A R_A + w_B R_B$
◆ weights add up to 1
◆ return = mean
◆ risk = std. deviation
➔ Variance of return of portfolio
$s_p^2 = w_A^2 s_A^2 + w_B^2 s_B^2 + 2 w_A w_B (s_{A,B})$
◆ Risk (variance) is reduced when stocks are negatively correlated (when there's a negative covariance).

Probability
➔ measure of uncertainty
➔ all outcomes have to be exhaustive (all options possible) and mutually exclusive (no 2 outcomes can occur at the same time)
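
Returning to the covariance, correlation, and portfolio-variance formulas, here is a minimal sketch in plain Python. The function names and the two return series are hypothetical and chosen only so that the covariance comes out negative.

```python
import math

def sample_cov(x, y):
    """Sample covariance s_xy with the n - 1 divisor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Correlation r_xy = s_xy / (s_x * s_y)."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

def portfolio_variance(w_a, s_a, s_b, s_ab):
    """s_p^2 = w_A^2 s_A^2 + w_B^2 s_B^2 + 2 w_A w_B s_AB, with w_B = 1 - w_A."""
    w_b = 1 - w_a
    return w_a**2 * s_a**2 + w_b**2 * s_b**2 + 2 * w_a * w_b * s_ab

# Hypothetical returns for two stocks A and B.
ret_a = [0.02, 0.05, -0.01, 0.03]
ret_b = [0.01, -0.01, 0.03, 0.01]

s_ab = sample_cov(ret_a, ret_b)
r_ab = sample_corr(ret_a, ret_b)
s_a = math.sqrt(sample_cov(ret_a, ret_a))
s_b = math.sqrt(sample_cov(ret_b, ret_b))

var_mixed = portfolio_variance(0.5, s_a, s_b, s_ab)
var_if_uncorrelated = portfolio_variance(0.5, s_a, s_b, 0.0)

print("covariance:", s_ab, "correlation:", r_ab)
print("portfolio variance:", var_mixed, "vs zero-covariance case:", var_if_uncorrelated)
```

With a negative covariance the last term of the variance formula is negative, which is why the mixed portfolio's variance falls below the zero-covariance case.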

Probability Rules
1. Probabilities range from $0 \leq Prob(A) \leq 1$
2. The probabilities of all outcomes must add up to 1
3. The complement rule = A happens or A doesn't happen
$P(\bar{A}) = 1 - P(A)$
$P(A) + P(\bar{A}) = 1$
4. Addition Rule:
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

Contingency/Joint Table
➔ To go from contingency to joint table, divide by total # of counts
➔ everything inside table adds up to 1

Conditional Probability
➔ $P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$
➔ Given event B has happened, what is the probability event A will happen?
➔ Look out for: "given", "if"

Independence
➔ Independent if:
$P(A|B) = P(A)$ or $P(B|A) = P(B)$
➔ If probabilities change, then A and B are dependent
➔ **hard to prove independence, need to check every value

Multiplication Rules
➔ If A and B are INDEPENDENT:
$P(A \text{ and } B) = P(A) P(B)$
➔ Another way to find joint probability:
$P(A \text{ and } B) = P(A|B) P(B)$
$P(A \text{ and } B) = P(B|A) P(A)$

2 x 2 Table [figure]

Decision Analysis
➔ Maximax solution = optimistic approach. Always think the best is going to happen
➔ Maximin solution = pessimistic approach.
➔ Expected Value Solution =
$EMV = X_1(P_1) + X_2(P_2) + ... + X_n(P_n)$

Decision Tree Analysis
➔ square = your choice
➔ circle = uncertain events

Discrete Random Variables
➔ $P_X(x) = P(X = x)$

Expectation
➔ $\mu_x = E(x) = \sum x_i P(X = x_i)$
➔ Example: (2)(0.1) + (3)(0.5) = 1.7

Variance
➔ $\sigma^2 = E(x^2) - \mu_x^2$
➔ Example: $(2)^2(0.1) + (3)^2(0.5) - (1.7)^2 = 2.01$

Rules for Expectation and Variance
➔ $\mu_s = E(s) = a + b\mu_x$
➔ $Var(s) = b^2 \sigma^2$

Jointly Distributed Discrete Random Variables
➔ Independent if:
$P_{x,y}(X = x \text{ and } Y = y) = P_x(x) P_y(y)$
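
The expectation and variance examples for a discrete random variable can be reproduced with a short sketch. The sheet's example only shows the terms for x = 2 and x = 3; the code assumes the remaining probability 0.4 sits on x = 0, which the source does not state. The constants `a` and `b` are also hypothetical.

```python
def expectation(dist):
    """E(x) = sum of x_i * P(X = x_i) for a {value: probability} mapping."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(x) = E(x^2) - mu^2."""
    mu = expectation(dist)
    return sum(x**2 * p for x, p in dist.items()) - mu**2

# Assumed distribution: the 0.4 on x = 0 is a guess that completes the example.
dist = {0: 0.4, 2: 0.1, 3: 0.5}

mu = expectation(dist)   # (0)(0.4) + (2)(0.1) + (3)(0.5)
var = variance(dist)     # (2)^2(0.1) + (3)^2(0.5) - mu^2

# Linear rule for s = a + b*x, with hypothetical a and b.
a, b = 5, 2
s_dist = {a + b * x: p for x, p in dist.items()}

print("E(x) =", round(mu, 4), " Var(x) =", round(var, 4))
print("E(s) =", round(expectation(s_dist), 4), "vs a + b*mu =", round(a + b * mu, 4))
print("Var(s) =", round(variance(s_dist), 4), "vs b^2 * var =", round(b**2 * var, 4))
```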

➔ Combining Random Variables
◆ If X and Y are independent:
$E(X + Y) = E(X) + E(Y)$
$Var(X + Y) = Var(X) + Var(Y)$
◆ If X and Y are dependent:
$E(X + Y) = E(X) + E(Y)$
$Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)$
➔ Covariance:
$Cov(X, Y) = E(XY) - E(X)E(Y)$
➔ If X and Y are independent, Cov(X,Y) = 0

Binomial Distribution
➔ doing something n times
➔ only 2 outcomes: success or failure
➔ trials are independent of each other
➔ probability remains constant

1.) All Failures
$P(\text{all failures}) = (1 - p)^n$
2.) All Successes
$P(\text{all successes}) = p^n$
3.) At least one success
$P(\text{at least 1 success}) = 1 - (1 - p)^n$
4.) At least one failure
$P(\text{at least 1 failure}) = 1 - p^n$
5.) Binomial Distribution Formula for x = exact value
$P(X = x) = \binom{n}{x} p^x q^{n - x}$
6.) Mean (Expectation)
$\mu = E(x) = np$
7.) Variance and Standard Dev.
$\sigma^2 = npq$
$\sigma = \sqrt{npq}$
$q = 1 - p$

Binomial Example [figure]

Continuous Probability Distributions
➔ the probability that a continuous random variable X will assume any particular value is 0
➔ Density Curves
◆ Area under the curve is the probability that any range of values will occur.
◆ Total area = 1

Uniform Distribution
◆ $X \sim Unif(a, b)$

Uniform Example [figure]
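
Returning to the binomial formulas (items 1 to 7), the following sketch evaluates them for one hypothetical setting, n = 10 trials with success probability p = 0.2. The exact-value formula used here is the standard binomial probability mass function, computed with `math.comb`.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.2   # hypothetical number of trials and success probability
q = 1 - p

all_failures = q**n
all_successes = p**n
at_least_one_success = 1 - q**n
at_least_one_failure = 1 - p**n

mean = n * p
variance = n * p * q
std_dev = math.sqrt(variance)

pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

print("P(all failures) =", all_failures)
print("P(at least 1 success) =", at_least_one_success)
print("P(X = 2) =", pmf[2])
print("mean =", mean, "variance =", variance, "std dev =", std_dev)
```

Summing `pmf` over all x from 0 to n gives 1, and the probability-weighted average of x matches np, which is a quick sanity check on the formulas.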

➔ Mean for uniform distribution:
$E(X) = \frac{a + b}{2}$
➔ Variance for unif. distribution:
$Var(X) = \frac{(b - a)^2}{12}$

Normal Distribution
➔ governed by 2 parameters: μ (the mean) and σ (the standard deviation)
➔ $X \sim N(\mu, \sigma^2)$

Standardize Normal Distribution:
$Z = \frac{X - \mu}{\sigma}$
➔ Z-score is the number of standard deviations the related X is from its mean
➔ **Z < some value, will just be the probability found on table
➔ **Z > some value, will be (1 - probability) found on table

Normal Distribution Example [figure]

Sums of Normals [figure]

Sums of Normals Example: [figure]
➔ Cov(X,Y) = 0 b/c they're independent

Central Limit Theorem
➔ as n increases, $\bar{x}$ should get closer to μ (population mean)
➔ $mean(\bar{x}) = \mu$
➔ $variance(\bar{x}) = \sigma^2 / n$
➔ $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$
◆ if population is normally distributed, n can be any value
◆ any population, n needs to be ≥ 30
➔ $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Confidence Intervals = tells us how good our estimate is
**Want high confidence, narrow interval
**As confidence increases, interval also increases

A. One Sample Proportion
➔ $\hat{p} = \frac{x}{n} = \frac{\text{number of successes in sample}}{\text{sample size}}$
➔ We are thus 95% confident that the true population proportion is in the interval…
➔ We are assuming that n is large, $n\hat{p} > 5$ and our sample size is less than 10% of the population size.
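
To make the standardizing step and the Central Limit Theorem concrete, the sketch below uses `statistics.NormalDist` from the Python standard library in place of a printed Z table. The population parameters (μ = 100, σ = 15), the cutoffs, and the sample size are hypothetical.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()        # standard normal: mean 0, standard deviation 1

# Hypothetical population X ~ N(mu, sigma^2).
mu, sigma = 100, 15

# Single observation: Z = (X - mu) / sigma
z = (130 - mu) / sigma
p_below = std_normal.cdf(z)      # P(Z < z), the value a Z table reports
p_above = 1 - p_below            # P(Z > z) = 1 - table value

# Sample mean of n observations: Z = (x_bar - mu) / (sigma / sqrt(n))
n = 36
standard_error = sigma / math.sqrt(n)
z_mean = (104 - mu) / standard_error
p_mean_above = 1 - std_normal.cdf(z_mean)

print("z =", z, " P(X < 130) =", round(p_below, 4), " P(X > 130) =", round(p_above, 4))
print("z for the sample mean =", z_mean, " P(x_bar > 104) =", round(p_mean_above, 4))
```

The same 4-point gap above the mean is far less likely for an average of 36 observations than for a single observation, because the spread of the sample mean shrinks by a factor of √n.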

Standard Error and Margin of Error [figure]

Example of Sample Proportion Problem [figure]

Determining Sample Size
$n = \frac{(1.96)^2 \hat{p}(1 - \hat{p})}{e^2}$
➔ If given a confidence interval, $\hat{p}$ is the middle number of the interval
➔ No confidence interval; use worst case scenario
◆ $\hat{p} = 0.5$

B. One Sample Mean
For samples n > 30
Confidence Interval: [formula not recovered]
➔ If n > 30, we can substitute s for σ so that we get: [formula not recovered]
For samples n < 30 [formula not recovered]
T Distribution used when:
➔ σ is not known, n < 30, and data is normally distributed
* Stata always uses the t-distribution when computing confidence intervals

Hypothesis Testing
➔ Null Hypothesis: $H_0$, a statement of no change and is assumed true until evidence indicates otherwise.
➔ Alternative Hypothesis: $H_a$ is a statement that we are trying to find evidence to support.
➔ Type I error: reject the null hypothesis when the null hypothesis is true. (considered the worst error)
➔ Type II error: do not reject the null hypothesis when the alternative hypothesis is true.

Example of Type I and Type II errors [figure]

Methods of Hypothesis Testing
1. Confidence Intervals **
2. Test statistic
3. P-values **
➔ C.I and P-values always safe to do because don't need to worry about size of n (can be bigger or smaller than 30)
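
The sample-size formula is the margin of error of a 95% interval for a proportion solved for n. The sketch below pairs it with the usual large-sample interval p̂ ± 1.96·√(p̂(1 - p̂)/n). That interval formula is not legible in this copy of the sheet, so treat it as an assumption that is consistent with the sample-size formula. The function names and the survey numbers are hypothetical.

```python
import math

Z_95 = 1.96  # critical value used throughout the sheet for 95% confidence

def proportion_ci(successes, n, z=Z_95):
    """Large-sample interval for a proportion (assumed form, see lead-in)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def sample_size(margin, p_hat=0.5, z=Z_95):
    """n = z^2 * p_hat * (1 - p_hat) / e^2, rounded up; p_hat = 0.5 is the worst case."""
    return math.ceil(z**2 * p_hat * (1 - p_hat) / margin**2)

# Hypothetical survey: 120 successes out of 400 respondents.
low, high = proportion_ci(120, 400)
print("95% interval:", round(low, 4), "to", round(high, 4))

# Respondents needed for a margin of error of 0.03 with no prior estimate.
n_needed = sample_size(0.03)
print("n needed:", n_needed)
```

The midpoint of the interval is p̂, which is why the sheet says to read p̂ off as the middle number when only an interval is given.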

One Sample Hypothesis Tests
1. Confidence Interval (can be used only for two-sided tests) [figure]
2. Test Statistic Approach (Population Mean) [figure]
3. Test Statistic Approach (Population Proportion) [figure]
4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ ** If P is low (less than 0.05), $H_0$ must go - reject the null hypothesis

Two Sample Hypothesis Tests
1. Comparing Two Proportions (Independent Groups)
➔ Calculate Confidence Interval [formula not recovered]
➔ Test Statistic for Two Proportions [formula not recovered]
2. Comparing Two Means (large independent samples n>30)
➔ Calculating Confidence Interval [formula not recovered]
➔ Test Statistic for Two Means [formula not recovered]

Matched Pairs
➔ Two samples are DEPENDENT
Example: [figure]
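
For the one-sample test on a population mean, the test-statistic and p-value approaches can be sketched as follows. The sheet's own worked formulas are not legible here, so this assumes the standard large-sample z statistic, (x̄ - μ0)/(s/√n), with s substituted for σ because n > 30, and a two-sided alternative. All numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_mean(x_bar, mu_0, s, n):
    """Two-sided large-sample z test for a mean (assumes n > 30)."""
    z = (x_bar - mu_0) / (s / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # both tails
    return z, p_value

# Hypothetical sample: mean 52 and standard deviation 6 from 36 observations,
# testing H0: mu = 50 against Ha: mu != 50.
z, p_value = z_test_mean(x_bar=52.0, mu_0=50.0, s=6.0, n=36)

reject_by_statistic = abs(z) > 1.96   # test-statistic approach
reject_by_p_value = p_value < 0.05    # p-value approach

print("z =", z, " p-value =", round(p_value, 4))
print("reject H0:", reject_by_statistic, reject_by_p_value)
```

At the 5% level the two approaches always agree for a two-sided test, since |z| > 1.96 exactly when the two-tailed p-value is below 0.05.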

Simple Linear Regression
➔ used to predict the value of one variable (dependent variable) on the basis of other variables (independent variables)
➔ $\hat{Y} = b_0 + b_1 X$
➔ Residual: $e = Y - \hat{Y}_{fitted}$
➔ Fitting error:
$e_i = Y_i - \hat{Y}_i = Y_i - b_0 - b_1 X_i$
◆ e is the part of Y not related to X
➔ Values of $b_0$ and $b_1$ which minimize the residual sum of squares are:
(slope) $b_1 = r \frac{s_y}{s_x}$
$b_0 = \bar{Y} - b_1 \bar{X}$
➔ Interpretation of slope - for each additional x value (e.g. mile on odometer), the y value decreases/increases by an average of $b_1$ value
➔ Interpretation of y-intercept - plug in 0 for x and the value you get for $\hat{y}$ is the y-intercept (e.g. $\hat{y} = 3.25 - 0.0614 \times SkippedClass$, a student who skips no classes has a gpa of 3.25.)
➔ ** danger of extrapolation - if an x value is outside of our data set, we can't confidently predict the fitted y value

Properties of the Residuals and Fitted Values
1. Mean of the residuals = 0; Sum of the residuals = 0
2. Mean of original values is the same as mean of fitted values: $\bar{Y} = \bar{\hat{Y}}$
3. [figure]
4. Correlation Matrix [figure]
➔ $corr(\hat{Y}, e) = 0$

A Measure of Fit: $R^2$ [figure]
➔ Good fit: if SSR is big, SSE is small
➔ SST = SSR, perfect fit
➔ $R^2$: coefficient of determination
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
➔ $R^2$ is between 0 and 1, the closer $R^2$ is to 1, the better the fit
➔ Interpretation of $R^2$: (e.g. 65% of the variation in the selling price is explained by the variation in odometer reading. The rest 35% remains unexplained by this model)
➔ ** $R^2$ doesn't indicate whether model is adequate**
➔ As you add more X's to model, $R^2$ goes up
➔ Guide to finding SSR, SSE, SST [figure]
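
The least-squares formulas, the residual properties, and the R² identity can be checked numerically with a small sketch. The five data points are invented, and `fit_line` is a hypothetical helper that follows the slope formula b1 = r·(s_y/s_x) literally.

```python
import math

def fit_line(x, y):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    r = s_xy / (s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

y_bar = sum(y) / len(y)
sst = sum((yi - y_bar) ** 2 for yi in y)      # total variation
ssr = sum((fi - y_bar) ** 2 for fi in fitted) # explained by the line
sse = sum(e ** 2 for e in residuals)          # left unexplained
r_squared = ssr / sst

print("b0 =", round(b0, 4), " b1 =", round(b1, 4))
print("sum of residuals =", round(sum(residuals), 10))
print("R^2 =", round(r_squared, 4), " 1 - SSE/SST =", round(1 - sse / sst, 4))
```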

Assumptions of Simple Linear Regression
1. We model the AVERAGE of something rather than something itself
2. [figure]

Example of Prediction Intervals: [figure]

Standard Errors for $b_1$ and $b_0$
➔ standard errors increase when noise increases
➔ $s_{b_0}$ - amount of uncertainty in our estimate of $\beta_0$ (small s good, large s bad)
➔ $s_{b_1}$ - amount of uncertainty in our estimate of $\beta_1$
◆ As ε (noise) gets bigger, it's harder to find the line

Estimating $S_e$
➔ $S_e^2 = \frac{SSE}{n - 2}$
➔ $S_e^2$ is our estimate of $\sigma^2$
➔ $S_e = \sqrt{S_e^2}$ is our estimate of σ
➔ 95% of the Y values should lie within the interval $b_0 + b_1 X \pm 1.96 S_e$

Confidence Intervals for $b_1$ and $b_0$ [formulas not recovered]
➔ n small → bad
➔ $s_e$ big → bad
➔ $s_x^2$ small → bad (wants x's spread out for better guess)

Regression Hypothesis Testing
*always a two-sided test
➔ want to test whether slope ($\beta_1$) is needed in our model
➔ $H_0: \beta_1 = 0$ (don't need x)
$H_a: \beta_1 \neq 0$ (need x)
➔ Need X in the model if:
a. 0 isn't in the confidence interval
b. |t| > 1.96
c. P-value < 0.05

Test Statistic for Slope/Y-intercept
➔ can only be used if n > 30
➔ if n < 30, use p-values
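
As a small illustration of estimating S_e and the approximate 95% band b0 + b1·X ± 1.96·S_e, the sketch below uses a made-up data set. The coefficients 0.05 and 1.99 are the least-squares fit for that particular data and are hard-coded to keep the example short.

```python
import math

def standard_error_of_estimate(y, fitted):
    """S_e = sqrt(SSE / (n - 2))."""
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return math.sqrt(sse / (len(y) - 2))

# Hypothetical data with its least-squares coefficients.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99

fitted = [b0 + b1 * xi for xi in x]
s_e = standard_error_of_estimate(y, fitted)

# Approximate 95% band around each fitted value.
band = [(f - 1.96 * s_e, f + 1.96 * s_e) for f in fitted]
inside = [lo <= yi <= hi for yi, (lo, hi) in zip(y, band)]

print("S_e =", round(s_e, 4))
print("observations inside the band:", sum(inside), "of", len(y))
```

The divisor is n - 2 rather than n - 1 because two coefficients, b0 and b1, are estimated from the data.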

Multiple Regression
➔ [formula not recovered]
➔ Variable Importance:
◆ higher t-value, lower p-value = variable is more important
◆ lower t-value, higher p-value = variable is less important (or not needed)

Adjusted R-squared
➔ [formula not recovered]
➔ k = # of X's
➔ Adj. R-squared will go down as you add junk x variables
➔ Adj. R-squared will only go up if the x you add in is very useful
➔ **want Adj. R-squared to go up and Se low for better model

The Overall F Test
➔ Always want to reject F test (reject null hypothesis)
➔ Look at p-value (if < 0.05, reject null)
➔ $H_0: \beta_1 = \beta_2 = \beta_3 = ... = \beta_k = 0$ (don't need any X's)
$H_a$: at least one $\beta_j \neq 0$ (need at least 1 X)
➔ If no x variables needed, then SSR=0 and SST=SSE

Interaction Terms
➔ allow the slopes to change
➔ interaction between 2 or more x variables that will affect the Y variable

Modeling Regression
Backward Stepwise Regression
1. Start with all variables in the model
2. at each step, delete the least important variable based on largest p-value above 0.05
3. stop when you can't delete anymore
➔ Will see Adj. R-squared go up and Se go down

Dummy Variables
➔ An indicator variable that takes on a value of 0 or 1, allow intercepts to change

How to Create Dummy Variables (Nominal Variables)
➔ If C is the number of categories, create (C-1) dummy variables for describing the variable
➔ One category is always the "baseline", which is included in the intercept

Recoding Dummy Variables
Example: How many hockey sticks sold in the summer (original equation)
hockey = 100 + 10Wtr - 20Spr + 30Fall
Write equation for how many hockey sticks sold in the winter
hockey = 110 + 20Fall - 30Spr - 10Summer
➔ **always need to get same exact values from the original equation
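
The hockey-stick recoding can be verified mechanically: changing the baseline category must leave every season's prediction unchanged. In the sketch below, `predict` and `rebaseline` are hypothetical helper names; the coefficients are the ones from the example, with Summer as the original baseline.

```python
SEASONS = ["Summer", "Wtr", "Spr", "Fall"]

# Original equation: hockey = 100 + 10*Wtr - 20*Spr + 30*Fall (baseline: Summer).
original = {"intercept": 100, "Wtr": 10, "Spr": -20, "Fall": 30}

def predict(model, season):
    """Intercept plus the season's dummy coefficient (0 for the baseline)."""
    return model["intercept"] + model.get(season, 0)

def rebaseline(model, seasons, new_base):
    """Rewrite the model so that new_base is absorbed into the intercept."""
    preds = {s: predict(model, s) for s in seasons}
    new_model = {"intercept": preds[new_base]}
    for s in seasons:
        if s != new_base:
            new_model[s] = preds[s] - preds[new_base]
    return new_model

winter_model = rebaseline(original, SEASONS, "Wtr")
print(winter_model)

# Every season must get the same prediction from both equations.
for s in SEASONS:
    print(s, predict(original, s), predict(winter_model, s))
```

The new intercept is the old prediction for winter (110), and each remaining coefficient is that season's prediction minus 110, which reproduces the winter equation given in the example.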

Regression Diagnostics
Standardize Residuals [figure]

Check Model Assumptions
➔ Plot residuals versus Yhat [figure]

➔ Outliers
◆ Regression likes to move towards outliers (shows up as $R^2$ being really high)
◆ want to remove outlier that is extreme in both x and y

➔ Nonlinearity (ovtest)
◆ Plotting residuals vs. fitted values will show a relationship if data is nonlinear ($R^2$ also high)
◆ ovtest: a significant test statistic indicates that polynomial terms should be added
◆ $H_0$: data = no transformation
$H_a$: data ≠ no transformation
◆ Log transformation accommodates nonlinearity, reduces right skewness in the Y, eliminates heteroskedasticity
◆ **Only take log of X variable so that we can compare models. Can't compare models if you take log of Y.
◆ Transformations cheatsheet [figure]

➔ Normality (sktest)
◆ $H_0$: data = normality
$H_a$: data ≠ normality
◆ don't want to reject the null hypothesis. P-value should be big

➔ Homoskedasticity (hettest)
◆ $H_0$: data = homoskedasticity
◆ $H_a$: data ≠ homoskedasticity
◆ Homoskedastic: band around the values
◆ Heteroskedastic: as x goes up, the noise goes up (no more band, fan-shaped)
◆ If heteroskedastic, fix it by logging the Y variable
◆ If heteroskedastic, fix it by making standard errors robust

➔ Multicollinearity
◆ when x variables are highly correlated with each other.
◆ $R^2$ > 0.9
◆ pairwise correlation > 0.9
◆ correlate all x variables, include y variable, drop the x variable that is less correlated to y

Summary of Regression Output [figure]
