
Formula and Concepts for 22S:008

Ziqian Zhou May 6, 2010

Six Steps of Inference


Question, Population, Sample, Variable, Summary, Inference

2

Summary Statistics

2.1

Statistics and Parameter

A parameter is a fixed number describing some feature of a population (e.g. mean $\mu$, median $M$, variance $\sigma^2$, standard deviation $\sigma$).

A statistic is a number that describes a sample and is used as an estimator for a population parameter (e.g. sample mean $\bar{x}$, sample median $m$, sample variance $s^2$, sample standard deviation $s$).

Formulas for calculating the sample mean $\bar{x}$ and the sample standard deviation $s$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}$$
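As a quick check of the two formulas, here is a minimal Python sketch. The data values and the function names `sample_mean` and `sample_sd` are made up for illustration and are not part of the course material.

```python
import math

def sample_mean(xs):
    """Sample mean: sum of the values divided by n."""
    return sum(xs) / len(xs)

def sample_sd(xs):
    """Sample standard deviation, using the n - 1 divisor."""
    n = len(xs)
    x_bar = sample_mean(xs)
    s2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)  # sample variance
    return math.sqrt(s2)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
print(sample_mean(data))
print(sample_sd(data))
```

Note that the divisor is n - 1, not n, which is what distinguishes the sample variance from the population variance.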

2.2

Empirical Rule

For a population with a normal distribution (bell-shaped curve):
About 68% of the values lie within 1 sd of the mean;
About 95% of the values lie within 2 sds of the mean;
Nearly all (99.7%) of the values lie within 3 sds of the mean.
Another thing we should notice for a normal distribution is that it is symmetric about the population mean $\mu$.
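The three percentages can be reproduced from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `within` is just an illustrative choice.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, sd 1

def within(k):
    """Probability that a normal value lies within k sds of its mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))
```

Because any normal distribution can be standardized, the same three probabilities apply whatever the mean and sd are.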

Probability
The probability of an event is represented by a number in the range 0 to 1.

Addition Rule: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

Conditional Probability: $P(B|A) = \dfrac{P(A \text{ and } B)}{P(A)}$

Multiplication Rule (comes from conditional probability): $P(A \text{ and } B) = P(A)\,P(B|A)$

We also have: $P(A)\,P(B|A) = P(B)\,P(A|B)$

Independence: To test whether events A and B are independent, we can check whether $P(A)\,P(B) = P(A \text{ and } B)$ or whether $P(A|B) = P(A)$.

Two additional formulas, where $\bar{A}$ denotes the complement of A ("not A"):
$P(\bar{A} \text{ and } \bar{B}) = 1 - P(A \text{ or } B)$
$P(A) = P(A \text{ and } B) + P(A \text{ and } \bar{B})$
(Not required, but useful if you can understand them.)
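To see these rules at work, here is a small sketch for one roll of a fair die. The events A ("even") and B ("greater than 3") are hypothetical examples chosen for illustration; exact fractions are used so the equalities can be checked without rounding.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one roll of a fair die
A = {2, 4, 6}                # hypothetical event: the roll is even
B = {4, 5, 6}                # hypothetical event: the roll is greater than 3

def P(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_a_or_b = P(A) + P(B) - P(A & B)      # addition rule
p_b_given_a = P(A & B) / P(A)          # conditional probability
independent = P(A) * P(B) == P(A & B)  # independence check

print(p_a_or_b, p_b_given_a, independent)
```

Here P(A)P(B) = 1/4 while P(A and B) = 1/3, so these two events are not independent.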

Probability Distribution
Population Mean:

$$\mu = \sum_{i=1}^{n} x_i \, p(x_i)$$

Population Variance:

$$\sigma^2 = \sum_{i=1}^{n} p(x_i)\,(x_i - \mu)^2$$
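A short sketch of both sums for a discrete distribution. The distribution used here (number of heads in two fair coin tosses) is a hypothetical example, not one taken from the notes.

```python
from fractions import Fraction

# Hypothetical distribution: number of heads in two fair coin tosses.
values = [0, 1, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

# Population mean: sum of x * p(x).
mu = sum(x * p for x, p in zip(values, probs))

# Population variance: sum of p(x) * (x - mu)^2.
sigma2 = sum(p * (x - mu) ** 2 for x, p in zip(values, probs))

print(mu, sigma2)
```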

Binomial Distribution: X is the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

An example: Roll a standard die five times and count the number of sixes. The distribution of the number of sixes is a binomial distribution with n = 5 and p = 1/6.

$$P(X = k) = C_k^n \, p^k (1-p)^{n-k}$$

$$C_k^n = \frac{n!}{k!\,(n-k)!}$$

$C_k^n$ denotes the number of ways that k things can be chosen from a set of n things, when order is irrelevant.

$$n! = n \times (n-1) \times (n-2) \times \dots \times 3 \times 2 \times 1$$

The factorial is the product of all positive integers less than or equal to n. For example, $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$.

For the binomial distribution: $\mu = np$, $\sigma^2 = np(1-p)$.
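The die example can be worked through numerically. This is a minimal sketch using `math.comb` (available from Python 3.8) for the number of combinations; `binom_pmf` is an illustrative name. It also checks the mean and variance against np and np(1 - p).

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, Fraction(1, 6)  # five rolls, success = rolling a six
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
variance = sum(q * (k - mean) ** 2 for k, q in enumerate(pmf))

print(pmf[0], mean, variance)
```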

Continuous Distribution

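Distinction between continuous and discrete random variables: a continuous random variable can take any value in an interval, so it can be measured to arbitrary accuracy; a discrete random variable takes only separate, countable values.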

5.1

Uniform Distribution

For a random variable $X$ with uniform distribution on the interval $(a, b)$ ($b > a$), the probability that $X$ lies inside an interval $(c, d)$ equals

$$\frac{\text{length of } (c, d)}{\text{length of } (a, b)} = \frac{d - c}{b - a}$$

if $(c, d)$ is inside $(a, b)$. The probability is 0 if $(c, d)$ is outside $(a, b)$. The mean of $X$ is $\dfrac{a + b}{2}$.
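A minimal sketch of this calculation follows. The function name `uniform_prob` is hypothetical. As an assumption that goes slightly beyond the two cases stated above, it uses the length of the overlap between (c, d) and (a, b), which gives the same answers when (c, d) is fully inside or fully outside.

```python
def uniform_prob(a, b, c, d):
    """P(c < X < d) for X uniform on (a, b), with b > a."""
    # Length of the part of (c, d) that falls inside (a, b); 0 if none does.
    overlap = max(0.0, min(b, d) - max(a, c))
    return overlap / (b - a)

print(uniform_prob(0, 10, 2, 5))    # (c, d) inside (a, b)
print(uniform_prob(0, 10, 12, 15))  # (c, d) outside (a, b)
```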

5.2

Normal Distribution

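If $X$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$, and $Z$ follows the standard normal distribution, then:

$$P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{b-\mu}{\sigma}\right) = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right)$$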
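The standardization step can be sketched in Python with the standard library's `NormalDist`, which plays the role of the Z table. The numbers (mean 100, sd 15, interval 85 to 130) are made up for illustration.

```python
from statistics import NormalDist

def normal_prob(a, b, mu, sigma):
    """P(a < X < b) for X normal(mu, sigma), computed via z-scores."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    Z = NormalDist()  # standard normal
    return Z.cdf(z_b) - Z.cdf(z_a)

# Hypothetical example: mean 100, sd 15, so the z-scores are -1 and 2.
p = normal_prob(85, 130, mu=100, sigma=15)
print(round(p, 4))
```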

6

Sampling Distribution

6.1

Distribution of Sample Mean

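The sample mean $\bar{x}$ is a random variable, so it has a probability distribution. If we know that the mean of $X$ is $\mu$ and the standard deviation is $\sigma$, then:

The mean of $\bar{x}$ is $\mu_{\bar{x}} = \mu$.
The standard deviation of $\bar{x}$ is $\sigma/\sqrt{n}$, where $n$ is the sample size.

Moreover, if the sample size is large ($n \ge 30$), then

$$P(a < \bar{x} < b) = P\left(\frac{a-\mu}{\sigma/\sqrt{n}} < \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} < \frac{b-\mu}{\sigma/\sqrt{n}}\right) = P\left(\frac{a-\mu}{\sigma/\sqrt{n}} < Z < \frac{b-\mu}{\sigma/\sqrt{n}}\right)$$

$Z$ is a standard normal random variable.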
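A sketch of the large-sample calculation for the sample mean, again using `NormalDist` in place of a Z table. The population values (mean 50, sd 10) and sample size 100 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_prob(a, b, mu, sigma, n):
    """Approximate P(a < sample mean < b) for a large sample of size n."""
    se = sigma / sqrt(n)  # standard deviation of the sample mean
    Z = NormalDist()
    return Z.cdf((b - mu) / se) - Z.cdf((a - mu) / se)

# Hypothetical population: mean 50, sd 10; sample size 100, so se = 1.
p = mean_prob(49, 51, mu=50, sigma=10, n=100)
print(round(p, 4))
```

With se = 1, the interval 49 to 51 is one standard deviation either side of the mean, which is why the result matches the 68% figure of the empirical rule.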

6.2

Distribution of Proportion

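The proportion $\hat{p}$ of a sample can be considered as the average of successes and failures (yes or no answers coded as 1 and 0). Therefore, if the sample size is large, we can use the Central Limit Theorem. If the true proportion of the population is $p$, then $\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$, and by the Central Limit Theorem

$$P(a < \hat{p} < b) = P\left(\frac{a-p}{\sigma_{\hat{p}}} < \frac{\hat{p}-p}{\sigma_{\hat{p}}} < \frac{b-p}{\sigma_{\hat{p}}}\right) = P\left(\frac{a-p}{\sqrt{p(1-p)/n}} < Z < \frac{b-p}{\sqrt{p(1-p)/n}}\right)$$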

Z is a standard normal random variable.
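The same kind of calculation for a sample proportion might look as follows. The true proportion 0.5 and the sample size 100 are made-up inputs, and `prop_prob` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def prop_prob(a, b, p, n):
    """Approximate P(a < sample proportion < b) for a large sample."""
    se = sqrt(p * (1 - p) / n)  # standard deviation of the sample proportion
    Z = NormalDist()
    return Z.cdf((b - p) / se) - Z.cdf((a - p) / se)

# Hypothetical example: true proportion 0.5, n = 100, so se = 0.05.
prob = prop_prob(0.45, 0.55, p=0.5, n=100)
print(round(prob, 4))
```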

Confidence Interval

Interpretation of an a% confidence interval: If the same sampling is repeated many times, and for each sample we construct a confidence interval using the same procedure, a% of the time the confidence interval will contain the true parameter we are trying to estimate.

Confidence interval for population mean $\mu$:

If the sample size is large ($n \ge 30$), regardless of the distribution of the random variable x:

$$\bar{x} \pm z \frac{s}{\sqrt{n}}$$

If the sample is small, but we can assume the distribution of the random variable x is normal:

$$\bar{x} \pm t_v \frac{s}{\sqrt{n}}$$

$v$ is called the degrees of freedom and $v = n - 1$.

Confidence interval for proportion (the sample size should be larger than 30):

$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

The margin of error decreases as the sample size increases, and it also decreases when the confidence level is lower.

How to calculate sample size? Given a required margin of error $e$, we can calculate the minimal sample size:

for an $\bar{x}$ problem, $n = \left(\dfrac{zs}{e}\right)^2$

for a proportion $\hat{p}$ problem, $n = \dfrac{z^2}{4e^2}$

and we should always round up to be conservative.
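The large-sample intervals and the sample-size formulas can be sketched as below. All function names and input numbers are hypothetical. The z value comes from `NormalDist().inv_cdf` rather than a table; the small-sample t interval is left out because the Python standard library has no t distribution, so a t table is still needed for that case.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_critical(level):
    """Two-sided critical value, e.g. level=0.95 gives about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def mean_ci(x_bar, s, n, level=0.95):
    """Large-sample confidence interval for the population mean."""
    margin = z_critical(level) * s / sqrt(n)
    return x_bar - margin, x_bar + margin

def prop_ci(p_hat, n, level=0.95):
    """Large-sample confidence interval for a proportion."""
    margin = z_critical(level) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def n_for_mean(s, e, level=0.95):
    """Minimal sample size for margin of error e (mean), rounded up."""
    return ceil((z_critical(level) * s / e) ** 2)

def n_for_prop(e, level=0.95):
    """Minimal sample size for margin of error e (proportion), rounded up."""
    return ceil(z_critical(level) ** 2 / (4 * e ** 2))

print(mean_ci(50, 10, 100))
print(n_for_mean(s=15, e=2), n_for_prop(e=0.03))
```

`ceil` implements the "always round up" advice: rounding down would give a margin of error slightly larger than the one required.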

8

Hypothesis Testing

8.1

Alternative Hypothesis and Null Hypothesis

In our class, usually: The alternative hypothesis is the one you want to test; it is the question of the problem. In business and scientific studies, the alternative is usually that the new idea or new product is different from the old one. The null hypothesis is the one that can be falsified with data. We base our distribution (rejection region) on the null hypothesis.

8.2

Type I and Type II Error

Type I Error: reject the null hypothesis when the null hypothesis is true.
Type II Error: do not reject the null hypothesis when the null hypothesis is false.
Which type of error do we control? The Type I Error rate is the significance level $\alpha$.

8.3

To Reject or Not to Reject?

We reject the null hypothesis when the test statistic is in the rejection region, or when the p-value is smaller than the significance level $\alpha$. The rejection region is different for different alternative hypotheses. (Drawing a picture may help you to understand.)

Test statistic:

Large-sample test for mean $\mu$:

$$Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

Small-sample test for mean $\mu$:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

To use this test, the assumption that each individual random variable follows a normal distribution must be satisfied.

Large-sample test for proportion p:

$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$$
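A minimal sketch of the two large-sample test statistics and the reject/do-not-reject decision. It assumes a two-sided alternative and a significance level of 0.05; both are choices made for this illustration, as are the function names and the input numbers.

```python
from math import sqrt
from statistics import NormalDist

def z_for_mean(x_bar, mu0, s, n):
    """Large-sample test statistic for a mean."""
    return (x_bar - mu0) / (s / sqrt(n))

def z_for_prop(p_hat, p0, n):
    """Large-sample test statistic for a proportion."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def two_sided_p_value(z):
    """p-value for a two-sided alternative (assumed here)."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: sample mean 52, null value 50, s = 10, n = 100.
z = z_for_mean(x_bar=52, mu0=50, s=10, n=100)
p_value = two_sided_p_value(z)
reject = p_value < 0.05  # assumed significance level

print(z, round(p_value, 4), reject)
```

For a one-sided alternative the rejection region sits in one tail only, so the p-value would be the area in that single tail rather than twice the tail area.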

The Linear Regression Model

In our class, we would like to model the relationship between X and Y, and we would like to predict the value of Y based on X. One way of doing this is to use linear regression: we fit a regression line and compute predicted values of Y given X.

$$Y = \beta_1 X + \beta_0$$

(In algebra, $\beta_1$ is called the slope and $\beta_0$ is called the intercept.)

Interpretation of the slope: a unit change in x will on average change y by the amount $\beta_1$ ($\beta_1$ can be positive or negative).

Is there a linear relationship between x and y? That is the same as asking whether $\beta_1 \neq 0$. You should do a hypothesis test to answer this question:

$H_A: \beta_1 \neq 0$
$H_0: \beta_1 = 0$

The p-value of this test is often given in the software output on the line for the $\beta_1$ term. Reject $H_0$ if the p-value is smaller than your significance level $\alpha$.

Prediction interval and confidence interval: the prediction interval is wider. (Do you know why?)

9.1

Residuals
$$e = Y - \hat{Y}$$

Residual = the difference between the observed Y and the predicted Y.

Residual sum of squares (SSE) = square the residual (error) for each observation and sum them all up.

$$SSE = \sum e^2 = \sum (Y - \hat{Y})^2$$

The linear regression method gives us the line that minimizes SSE.
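To make the "line that minimizes SSE" concrete, here is a small sketch. The closed-form slope and intercept formulas used in `fit_line` are the standard least-squares ones, which are not written out in these notes; the data and all names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)

print(b0, b1, sse(xs, ys, b0, b1))
# Nudging the fitted line in any direction makes SSE larger.
print(sse(xs, ys, b0 + 0.1, b1), sse(xs, ys, b0, b1 + 0.1))
```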

9.2

How Good is the Model Fit?

Total Sum of Squares and Sum of Squares of Regression:

$$SST = \sum (y - \bar{y})^2$$

(Looks familiar?)

$$SSR = SST - SSE$$

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

$R^2$ is the percentage of the variation of y that is explained by the regression model (x)!
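A short sketch of the R-squared calculation. The observed values and fitted values are made up; the fitted values are assumed to come from a least-squares line (here 2.2 + 0.6x for x = 1, ..., 5), which is the situation in which SSR = SST - SSE applies.

```python
def r_squared(ys, y_hats):
    """R^2 = SSR / SST = 1 - SSE / SST for observed ys and fitted y_hats."""
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total variation
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # unexplained
    ssr = sst - sse                                              # explained
    return ssr / sst

ys = [2, 4, 5, 4, 5]                # hypothetical observations
y_hats = [2.8, 3.4, 4.0, 4.6, 5.2]  # hypothetical least-squares fitted values
r2 = r_squared(ys, y_hats)
print(round(r2, 4))
```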

9.3

Multiple Regression

We have more than one predictor.

Model: $Y = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_0$

The ideas of the hypothesis test for slope terms and of $R^2$ are the same, but we have to choose the most appropriate model.

Criteria: Choose the model with the highest $R^2$ in which all the predictors are significant. (The Best Conservative Model)

Interpreting the slope in multiple regression (using $\beta_1$ as an example): holding the other predictors constant, a unit change in $x_1$ will on average change y by the amount $\beta_1$ ($\beta_1$ can be positive or negative).
