
Measures of Central Tendency

Mean (Average): $\bar{x} = \dfrac{\sum x_i}{n}$

Mean vs. Median
- Mean > Median: skewed to the right (long right tail)
- Mean < Median: skewed to the left (long left tail)

Boxplot shows: 1. Q1, Q2, Q3; 2. Fences (to flag outliers); 3. Max, Min
- Q1 First Quartile: 25th percentile
- Q2 Median: 50th percentile
- Q3 Third Quartile: 75th percentile
- Upper Fence: Q3 + 1.5·IQR; Lower Fence: Q1 − 1.5·IQR
- Range = Max − Min
- Interquartile Range (IQR) = Q3 − Q1: spread of the middle 50% of the distribution

Sensitivity to Outliers
- Not skewed (roughly symmetric): fine to use measures sensitive to outliers: mean and range, variance, SD
- Skewed (has outliers): use measures not sensitive to outliers: median and IQR

Population Standard Deviation:

$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$ or $\sigma = \sqrt{\dfrac{\sum x_i^2}{N} - \mu^2}$

Sample Standard Deviation:

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\dfrac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$
Linear Transformation
- Adding a constant (a): center (mean, median) shifts by the constant; spread does not change
- Multiplying by a constant (b): center and spread (SD, IQR) are multiplied by |b|; variance is multiplied by b²
- Linear Model: new = a + b·(original)
  - Center: plug the original center into the new formula: new center = a + b·(old center)
  - Spread: new SD = |b|·(old SD)
  - Variance: new variance = b²·(old variance)
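A minimal Python sketch (not from the original sheet; the data values and constants are invented) checking the three rules above numerically:

```python
import numpy as np

x = np.array([2.0, 4.0, 7.0, 11.0, 16.0])   # hypothetical data
a, b = 3.0, 2.0                              # added constant and multiplier
y = b * x + a                                # linear transformation

print(y.mean(), b * x.mean() + a)                # mean shifts and scales
print(y.std(ddof=1), abs(b) * x.std(ddof=1))     # SD scales by |b|, ignores a
print(y.var(ddof=1), b**2 * x.var(ddof=1))       # variance scales by b^2
```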

Normal Distribution
Empirical Rule (68-95-99.7): about 68% of the data fall within 1 SD of the mean, 95% within 2 SDs, and 99.7% within 3 SDs

Standard Normal Distribution: N(0, 1); general normal: N(μ, σ)
- For data: $z = \dfrac{x - \bar{x}}{s}$
- For a model: $z = \dfrac{x - \mu}{\sigma}$
- For a %: look up z* for the percentile in the table, then $x = \mu + z^{*}\sigma$

Sum of Random Variables (independent): $N(\mu_1, \sigma_1) + N(\mu_2, \sigma_2) = N\!\left(\mu_1 + \mu_2,\ \sqrt{\sigma_1^2 + \sigma_2^2}\right)$
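A quick simulation sketch (not from the original sheet; all parameter values invented) confirming that SDs add in quadrature, not directly:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 5.0, 2.0, 3.0, 1.5
total = rng.normal(mu1, s1, 100_000) + rng.normal(mu2, s2, 100_000)

print(total.mean())                          # ~ mu1 + mu2 = 8.0
print(total.std(), (s1**2 + s2**2) ** 0.5)   # ~ sqrt(4 + 2.25) = 2.5, not 3.5
```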

Sampling Distributions for Proportions: Categorical Data
- p: true proportion, center of the histogram
- p̂: sample proportion, varies from one sample to the next
- For CI: $\hat{p} \pm z^{*}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

$\hat{p} \sim N\!\left(p,\ \sqrt{\dfrac{pq}{n}}\right)$
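A minimal sketch of the one-proportion z-interval (not from the original sheet; counts are invented, scipy assumed available):

```python
import math
from scipy import stats

successes, n = 120, 400
p_hat = successes / n
se = math.sqrt(p_hat * (1 - p_hat) / n)   # SE uses p-hat for a CI
z_star = stats.norm.ppf(0.975)            # 95% critical value, ~1.96
print((p_hat - z_star * se, p_hat + z_star * se))
```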

Sampling Distributions for Means: Quantitative Data
- μ: population mean
- x̄: sample mean

For CI: $\bar{x} \pm t^{*}_{n-1}\dfrac{s}{\sqrt{n}}$

- s: sample SD, used when σ is unknown
- σ: population SD, used when known (then $\bar{x} \pm z^{*}\dfrac{\sigma}{\sqrt{n}}$)
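A minimal sketch of the t-interval for a mean (not from the original sheet; the data are invented, scipy assumed available):

```python
import numpy as np
from scipy import stats

x = np.array([9.8, 10.2, 10.4, 9.9, 10.1, 10.6, 9.7, 10.3])
mean, se = x.mean(), stats.sem(x)    # sem = s / sqrt(n)
lo, hi = stats.t.interval(0.95, len(x) - 1, loc=mean, scale=se)
print(lo, hi)                        # 95% CI for the population mean
```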

Statistical Inference
Properties of Sampling Distribution of Sample Means
1. $\mu_{\bar{x}} = \mu$: mean of the sample means = population mean
2. $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$: SD of the sample means = population SD divided by √n

Central Limit Theorem: the relationship between the sampling distribution of sample means and the population the samples are taken from
- If samples of n ≥ 25 are drawn from any population with mean μ and SD σ, then the sampling distribution of sample means approximates a normal distribution; the greater the sample size, the better the approximation
- If the population is normally distributed, the sampling distribution of sample means is normally distributed for any sample size n
- If asked about one individual, use: $z = \dfrac{x - \mu}{\sigma}$
- If asked about the mean of a sample of n individuals, use: $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
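A minimal sketch of the individual-vs-sample-mean distinction (not from the original sheet; all numbers invented):

```python
from scipy import stats

mu, sigma, n = 100.0, 15.0, 36
x = 105.0

z_individual = (x - mu) / sigma           # one individual
z_mean = (x - mu) / (sigma / n ** 0.5)    # mean of a sample of n
print(stats.norm.sf(z_individual))        # P(X > 105) for one individual
print(stats.norm.sf(z_mean))              # P(x-bar > 105) for n = 36, much smaller
```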

Hypothesis Testing Math for 1 Sample

Z-Test for a Pop Mean
- H0: μ = μ0; Ha: μ <, ≠, > μ0
- Known σ: process/population
- Test statistic: $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$
- P-value: P(z < z0) = %; P(z > z0) = 1 − P(z < z0) = %; two-sided: 2·P(z > |z0|) = %

One-Sample t-Test
- H0: μ = μ0; Ha: μ <, ≠, > μ0
- Unknown σ: data/sample
- Test statistic: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}$
- P-value: P(t_{n−1} < t0) or P(t_{n−1} > t0), bracketed from the table as # < p-value < #; two-sided: 2·# < p-value < 2·#

Z-Test for Pop Proportion
- H0: p = p0; Ha: p <, ≠, > p0
- Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}} \sim N(0, 1)$
- P-value: P(z < z0) = %; P(z > z0) = 1 − P(z < z0) = %; two-sided: 2·P(z > |z0|) = %

Conclusion: based on α / CI
- P-value < α: reject H0, accept Ha; sufficient evidence, significant, data unlikely under H0
- P-value > α: fail to reject H0; not sufficient evidence, not significant, data likely under H0

CI:
- Mean, σ known: $\bar{x} \pm z^{*}\dfrac{\sigma}{\sqrt{n}}$
- Mean, σ unknown: $\bar{x} \pm t^{*}_{n-1}\dfrac{s}{\sqrt{n}}$
- Proportion: $\hat{p} \pm z^{*}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

ME = z*·SE; to find n for a desired ME, solve for n using an old/estimated p̂

z* critical values:
  CI:  90%    95%    99%
  z*:  1.645  1.960  2.576
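The three one-sample tests above map onto short computations; a minimal Python sketch (not part of the original sheet; every number is invented, scipy assumed available):

```python
import numpy as np
from scipy import stats

# z-test for a mean (sigma known)
xbar, mu0, sigma, n = 52.1, 50.0, 6.0, 40
z = (xbar - mu0) / (sigma / np.sqrt(n))
print("z:", z, "two-sided p:", 2 * stats.norm.sf(abs(z)))

# one-sample t-test (sigma unknown)
data = np.array([48.0, 51.5, 53.2, 49.8, 52.7, 50.9])
print(stats.ttest_1samp(data, popmean=50.0))

# z-test for a proportion (p0, q0 in the SE, per the formula above)
p_hat, p0, n = 0.56, 0.50, 200
z_p = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
print("z:", z_p, "two-sided p:", 2 * stats.norm.sf(abs(z_p)))
```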

Hypothesis Testing Qualitative for 1 Sample


Null Hypothesis (H0): a value of the population model parameter
- Skeptical claim: nothing is different
- Assumed true by default
Alternative Hypothesis (Ha): the value of the population parameter we consider plausible if H0 is rejected
- Two-Sided: H0: p = p0; Ha: p ≠ p0
  - P-value: probability of deviating in either direction from H0
- One-Sided: H0: p = p0; Ha: p < p0 or p > p0
  - P-value: probability of deviating only in the direction of Ha away from H0
Alpha Levels: threshold for the P-value
- Statistically significant/insignificant depending on the alpha level, which depends on how large the sample is

Critical values by alpha level:
  α:          0.05 (95% CI)   0.01 (99% CI)   0.001 (99.9% CI)
  One-Sided:  1.645           2.33            3.09
  Two-Sided:  1.96            2.576           3.29

P-Value: the value on which we base our decision; it measures how likely data like ours are if H0 is true; the ultimate goal of the calculation. Conclusion: reject or fail to reject H0

Confidence Interval
CI: an estimated range of values, calculated from the sample data, that is likely to include the unknown population parameter
- Higher confidence level: must capture the true value more often, so make the interval wider
- Smaller interval (less variability): choose a larger sample
- Estimate ± ME, where ME = critical value × SE: the extent of the interval on either side of the middle value
  - ME < 5% is acceptable
- Level of Confidence: probability that the interval estimate contains the population parameter
  - 95% confidence level: one can be 95% confident that the population parameter is contained in the interval
- Critical Value: number of SEs the interval must stretch on either side of the middle value

Types of Errors and Level of Significance


  Decision      H0 True        H0 False
  Accept H0     Correct        Type II error
  Reject H0     Type I error   Correct

Type I Error (false positive): H0 is rejected when it is true (we drew an unusual sample)
- A healthy person is diagnosed with the disease
- The jury convicts an innocent person
- Money is invested in a project that turns out not to be profitable
Type II Error (false negative): H0 is not rejected (we fail to reject) when it is false
- An infected person is diagnosed healthy
- The jury fails to convict a guilty person
- Money won't be invested in a project that would have been profitable
Detecting a false hypothesis: Power (1 − β); see the simulation sketch below
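A simulation sketch (not from the original sheet; all parameters invented) showing that α is the long-run Type I error rate when H0 really is true:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, rejections, trials = 0.05, 0, 2000
for _ in range(trials):
    sample = rng.normal(0.0, 1.0, 30)     # H0 (mu = 0) really is true
    _, p = stats.ttest_1samp(sample, 0.0)
    rejections += p < alpha               # a rejection here is a Type I error
print(rejections / trials)                # ~0.05 = alpha
```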

Descriptive Statistics
Categorical Variable: descriptive responses
Quantitative Variable: measure of a quantity (has units)
Time-series: a variable measured at regular intervals over time; consistent time spacing (months, weeks)
Cross-Sectional Data: several variables measured at the same point in time; exact time (every Feb at Starbucks)
Stem-plot: shows the distribution of the data while keeping the specific data points; can calculate mean, quartiles, median, shape
Scatterplot: plots 2 quantitative variables
Histogram: shows the distribution of the data by breaking the range of values of a variable into intervals and displaying the count or proportion of observations that fall into each interval
Shapes of Distributions: Symmetric, Unimodal, Bimodal, Uniform, Skewed
Data Collection: from direct observation or produced through experiments
Survey (Response Rate): personal interview, telephone interview, self-administered questionnaire
Sampling Plans
- SRS: sample selected so that every possible sample with the same number of observations is equally likely to be selected
- Stratified RS: separate the population into mutually exclusive sets (strata) and draw an SRS from each
  - Strata are homogeneous within and different from one another (Black vs. White)
- Cluster S: SRS of groups or clusters of elements
  - Clusters are heterogeneous within and similar to one another (Vancouver vs. Toronto)
- Systematic S: sample every kth unit in a population
- Multi-stage S: randomly choose clusters and randomly sample individuals within each cluster
Errors
- Sampling Error: difference between sample and population that exists only because of the observations that happened to be selected for the sample
- Non-sampling Error: due to mistakes made in acquiring data or sample observations being selected improperly (more serious because it cannot be corrected by increasing the sample size)
  - Errors in acquiring data: recording incorrect responses
  - Nonresponse Error: error (or bias) introduced when responses are not obtained from some members of the sample
  - Response Bias: anything that influences responses, including answer phrasing
  - Voluntary Response Bias: a large group is invited to respond and all who do are counted
  - Selection Bias: the sampling plan is such that some members of the target population cannot be selected for inclusion in the sample
  - Convenience Sampling: include individuals who are convenient
  - Under-coverage: some portion of the population is not sampled at all or has smaller representation
  - Measurement Errors: inaccurate responses
  - Pilot Test: small trial run of the study to check that the method is okay

Probability and Random Variables


Probability: likelihood of a random phenomenon or chance behavior
- 0 ≤ P(A) ≤ 1
- P(S) = P(certain event) = 1
- P(A1 or A2 or …) = P(A1) + P(A2) + P(A3) + … if mutually exclusive
Interpreting Probability
- Relative Frequency Approach: P(A) = long-run proportion of times A occurs in repeated trials
- Classical Approach: P(A) = (number of outcomes in A) / (total number of equally likely outcomes)

Random Variable: a variable that assigns a numerical result to an outcome of an event associated with chance
- Discrete: can take only a finite or countably infinite number of values
- Continuous: not discrete
Discrete Conditions: 0 ≤ P(x) ≤ 1 and ΣP(x) = 1
Expected Value: mean of the probability distribution: $\mu = E(x) = \sum x \cdot P(x)$
Variance: $\sigma^2 = \sum (x - \mu)^2 P(x)$ or $\sigma^2 = \sum x^2 P(x) - \mu^2$
If Linear, h(x) = ax + b:
- E(h(x)) = E(ax + b) = aE(x) + b
- Var(h(x)) = Var(ax + b) = a²Var(x)
If Random Variables (independent):
- E(x + y) = E(x) + E(y)
- Var(x ± y) = Var(x) + Var(y)
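A minimal sketch of E(x), Var(x), and the linear rules (not from the original sheet; the distribution is made up):

```python
import numpy as np

x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.3, 0.4, 0.2])   # probabilities sum to 1

ex = np.sum(x * p)                   # E(x) = sum of x * P(x)
var = np.sum((x - ex) ** 2 * p)      # Var(x) = sum of (x - mu)^2 * P(x)
print(ex, var)

a, b = 2.0, 5.0                      # h(x) = a*x + b
print(a * ex + b, a**2 * var)        # E(ax+b) = aE(x)+b; Var(ax+b) = a^2 Var(x)
```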

Hypothesis Testing For Comparing Two Means

Two Independent Means, Case 1: equal variances (pooled t-test)
- H0: μ1 − μ2 = D0; Ha: μ1 − μ2 <, ≠, > D0
- Test statistic: $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
- df: v = n1 + n2 − 2
- CI: $(\bar{x}_1 - \bar{x}_2) \pm t^{*}\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

Two Independent Means, Case 2: unequal variances
- H0: μ1 − μ2 = D0; Ha: μ1 − μ2 <, ≠, > D0
- Test statistic: $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
- df: $v = \dfrac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$, or conservatively v = min(n1 − 1, n2 − 1)
- CI: $(\bar{x}_1 - \bar{x}_2) \pm t^{*}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Two Population Means: Matched Pairs Experiments
- H0: μD = D0; Ha: μD <, ≠, > D0
- Test statistic: $t = \dfrac{\bar{d} - D_0}{s_d/\sqrt{n}}$, df: v = n − 1
- CI: $\bar{d} \pm t^{*}_{n-1}\dfrac{s_d}{\sqrt{n}}$

P-Value: P(t_df ≤ t0) = % from top; P(t_df ≥ t0) = % from top; two-sided: 2·P(% from top, using |t0|)
Conclusion:
- P-value < α: reject H0, accept Ha; sufficient evidence, significant, data unlikely under H0
- P-value > α: fail to reject H0; not sufficient evidence, not significant, data likely under H0
CI interpretation:
- (#,#) contains the hypothesized value: fail to reject H0
- (#,#) doesn't contain the hypothesized value: reject H0
- (−#,#): the CI contains 0, so we cannot conclude a difference
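Minimal sketches of the three tests above via scipy (not from the original sheet; all samples invented):

```python
import numpy as np
from scipy import stats

g1 = np.array([12.1, 11.8, 13.0, 12.4, 12.9, 11.5])
g2 = np.array([10.9, 11.4, 12.0, 10.7, 11.8, 11.1])

print(stats.ttest_ind(g1, g2))                    # Case 1: pooled (equal variances)
print(stats.ttest_ind(g1, g2, equal_var=False))   # Case 2: unequal variances

before = np.array([8.2, 7.9, 8.5, 8.1, 8.4])
after = np.array([7.8, 7.7, 8.1, 8.0, 8.1])
print(stats.ttest_rel(before, after))             # matched pairs
```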

Hypothesis Testing For Comparing Two Means: Qualitative


Null Hypothesis (H0): there is no difference between the means or proportions (difference = 0)
Alternative Hypothesis (Ha): there is a difference between the means or proportions
P-Value: the value on which we base our decision; it measures how likely data like ours are if H0 is true; the ultimate goal of the calculation
Conclusion: reject or fail to reject H0
- If the P-value exceeds α: do not reject H0: we cannot conclude there has been a significant reduction/increase/difference in the mean studied in the experiment
- If −z-score < z-test value < z-score (test statistic falls inside): do not reject H0: there is not sufficient evidence at the 5% level of significance that there is a reduction/increase/difference between the two means

Confidence Interval
CI: higher confidence means the interval must capture the true value more often, so make the interval wider; for a smaller interval (less variability), choose a larger sample

Level of Confidence
- 90% CI of (−#,#): since the 90% CI contains 0, we cannot conclude at the 10% level of significance that there is a difference between the groups examined in the study
- 90% CI: we are 90% confident that the interval (#,#) contains the true mean/proportion difference for the study/experiment

Hypothesis Testing For Comparing Population Proportions

Two Separate Samples
- H0: p1 − p2 = 0; Ha: p1 − p2 <, ≠, > 0
- Test statistic (pooled): $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$
- P-Value: P(z ≤ z0) = %; P(z ≥ z0) = %; two-sided: 2·P(z > |z0|)
- Conclusion:
  - P-value < α: reject H0, accept Ha; sufficient evidence, significant, data unlikely under H0
  - P-value > α: fail to reject H0; not sufficient evidence, not significant, data likely under H0
- CI: $(\hat{p}_1 - \hat{p}_2) \pm z^{*}\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}$
  - (#,#) contains the hypothesized value: fail to reject H0; doesn't contain it: reject H0; (−#,#) contains 0: cannot conclude a difference
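A minimal sketch of the pooled two-proportion z-test (not from the original sheet; counts are hypothetical):

```python
import math
from scipy import stats

x1, n1, x2, n2 = 90, 200, 70, 200
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled p-hat under H0: p1 = p2
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print("z:", z, "two-sided p:", 2 * stats.norm.sf(abs(z)))
```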

Analysis of Two-Way Tables

Chi-Square Tests

Goodness of Fit Test
- Used to measure how well observed data fit what would be expected under specified conditions
- H0: p1 = p2 = … = pk (the specified proportions); Ha: not all proportions are equal
- $\chi^2 = \sum \dfrac{(f_0 - f_e)^2}{f_e}$, where f0 = the observed frequency, fe = the expected frequency
- df = k − 1, where k is the number of categories/cells specified under H0

Test of Independence
- Used to determine whether the row and column variables in a two-way contingency table are independent or related
- H0: no association between the row and column variables (the 2 categorical variables are independent); Ha: they aren't independent
- $\chi^2 = \sum \dfrac{(f_0 - f_e)^2}{f_e}$, with $f_e = \dfrac{(\text{row total})(\text{column total})}{n}$
- df = (r − 1)(c − 1); r: number of rows; c: number of columns; n: total number of observations in the table

Homogeneity Test
- Compares observed counts from 2 or more populations; examines the samples to see if they have the same proportion of some characteristic
- H0: the populations have the same proportion of the characteristic (p1 = p2 = … = pk); Ha: at least one of the populations has a different proportion
- Same statistic and df as the test of independence

Conclusion (all three tests): P-value = P(χ²_df > χ²0)
- P-value < α (equivalently χ²0 > χ²_α from the chart): reject H0, accept Ha; sufficient evidence, significant, data unlikely under H0
- P-value > α: fail to reject H0; not sufficient evidence, not significant, data likely under H0

Example: Does the gender of a survey interviewer have an effect on survey responses by men? If χ²0 > χ²_α, we reject H0 and conclude there is significant evidence that the proportion of responses by men differs between the two interviewer genders.
- Reject H0: we conclude that the variables are not independent; they are associated
- Fail to reject H0: there is no evidence of an association
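Minimal sketches of both chi-square tests via scipy (not from the original sheet; all counts invented):

```python
import numpy as np
from scipy import stats

# Goodness of fit: observed vs. expected counts, df = k - 1
observed = np.array([18, 22, 30, 30])
expected = np.array([25, 25, 25, 25])
print(stats.chisquare(observed, expected))

# Independence / homogeneity: two-way table, df = (r-1)(c-1);
# chi2_contingency computes fe = (row total)(column total)/n itself.
# correction=False matches the hand formula for a 2x2 table.
table = np.array([[20, 30], [35, 15]])
chi2, p, df, exp = stats.chi2_contingency(table, correction=False)
print(chi2, p, df)
```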

Linear Regression (Scatterplots)


Correlation: relationship between two variables
- Only measures the strength of a linear relationship/association
- x: independent/explanatory/predictor variable; y: dependent/response/predicted variable

$r = \dfrac{\sum z_x z_y}{n - 1}$ OR $r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
Properties of Correlation Coefficient
1. −1 ≤ r ≤ 1
2. rxy = ryx
3. Positive values = positive correlation; negative values = negative correlation
4. Strong correlation/linear relationship: closer to −1 or 1
5. Weak correlation/linear relationship: closer to 0
6. Only used on two quantitative variables
7. Calculated using means, SDs, z-scores
8. Not resistant to outliers
Describing Association
1. Form: is it linear, bell-shaped, curved, a cloud?
2. Direction: positive or negative?
3. Strength: spread
Linear Model: $\hat{y} = b_0 + b_1 x$
- Slope: $b_1 = r\dfrac{s_y}{s_x}$: predicted +/− change in y (y units) per 1 x (x units); gets its sign from the correlation; gets its units from the ratio of the two SDs, so the units of the slope are a ratio of the units of the variables
- Intercept: $b_0 = \bar{y} - b_1\bar{x}$: the predicted y when x = 0; the starting value for predictions

To find the slope and intercept you need: Correlation (r): tells us the strength of the linear association; Means: tell us where to locate the line; SDs: tell us the units. Predicting y when x is one SD above/below its mean: the predicted y is only r SDs above/below its mean, $\Delta\hat{y} = r \cdot SD_y$ per $SD_x$

Correlation to the Line: in a plot of standardized variables, slope = r and intercept = 0: for every SD above/below the mean we are in x, we predict y to be r SDs above/below the mean of y
Residual: e = observed (point) − predicted (line)
- Does the model make sense? How well does the line fit the data? How much variation in y does our model explain? See the coefficient of determination R²
- Negative e: prediction is too big (overestimate); positive e: prediction is too small (underestimate)
- Residuals vs. predicted values should show no patterns, no direction, no shape, mean = 0
Point Prediction: the value of ŷ obtained by plugging a value x* into the regression equation
- We can only make predictions within the range of our data, not beyond it, because that is EXTRAPOLATION (bad)
Coefficient of Determination: measures the proportion of variation in y that is explained by the variation in x
- R²: the fraction of the data's variation accounted for by the model: "about R²% of the variation in y is explained by the variation in x"
- 1 − R²: the fraction of variation left in the residuals
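A minimal sketch computing r, the slope, the intercept, and the residuals from the formulas above (not from the original sheet; the data are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope = r * SDy / SDx
b0 = y.mean() - b1 * x.mean()            # line passes through (x-bar, y-bar)
resid = y - (b0 + b1 * x)                # residual = observed - predicted
print(r, b1, b0, r**2)                   # r^2 = coefficient of determination
print(resid.mean())                      # residuals average to ~0
```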

Lurking Variable: a variable that is not among the explanatory or response variables but can influence the interpretation of relationships among them. Ex. lung cancer and stained fingernails: smoking. Also ask: what is the most realistic value of the SD of the residuals?

Simple Linear Regression (Inference) Qualitative


First-Order Model: $y = \beta_0 + \beta_1 x + \varepsilon$, where: y = dependent variable; x = independent variable; β0 = y-intercept; β1 = slope of the line (defined as rise/run); ε = error variable
Measures of Variation: (x, y) = an observed point; ȳ = sample mean; ŷ = predicted y-value
- Total Deviation: vertical distance $y - \bar{y}$
- Explained Deviation: vertical distance $\hat{y} - \bar{y}$
- Unexplained Deviation: vertical distance $y - \hat{y}$
Coefficient of Determination (R²): the amount of the variation in y that is explained by the regression line; the ratio of the explained variation to the total variation: $R^2 = \dfrac{SSR}{SST}$
Standard Error of Estimate (Se): a measure of the differences between the observed sample y-values and the predicted values ŷ obtained from the regression equation: $s_e = \sqrt{\dfrac{SSE}{n - 2}}$
Sum of Squares for X: $SS_{xx} = \sum (x - \bar{x})^2$
ASSESSING THE MODEL
- Standard deviation of the error variable (σε): if σε is large, some of the errors will be large, which implies the model's fit is poor; if σε is small, the errors tend to be close to the mean (which is 0) and the model fits well
- Sum of Squares for Regression (SSR): measures the amount of variation in y that is explained by the variation in the independent variable x
- SSE is the amount of variation in y that remains unexplained: $SST = SSR + SSE$

The greater the explained variation (the greater the SSR or R²), the better the model

Simple Linear Regression (Inference) Testing

Significance of Regression (Predictor Coefficient/Slope)
- H0: β1 = 0: y is not linearly related to x, so the regression line is horizontal (slope 0); Ha: β1 ≠ 0: there is a linear relationship
- Test statistic: $t = \dfrac{b_1 - \beta_1}{s_{b_1}}$, df: v = n − 2
- CI: $b_1 \pm t^{*}_{n-2}\, s_{b_1}$

Coefficient of Correlation
- For observational data where the two variables are bivariate normally distributed, we can test for a linear association between the 2 variables using a t-test; the estimate of the population coefficient of correlation ρ is the sample coefficient of correlation r
- H0: ρ = 0: there is no relationship between the two variables; Ha: ρ ≠ 0
- Test statistic: $t = r\sqrt{\dfrac{n - 2}{1 - r^2}}$, df: v = n − 2

Regression Equation Intervals
- Prediction Interval: determines how closely ŷ matches the true value of y; predicting an individual value from a single observation gives a wider interval
- Confidence Interval Estimator of E(y): will be narrower than the prediction interval because there is less error in estimating a mean value than in predicting an individual value

Conclusion: P-value:
- P-value < α: reject H0, accept Ha; sufficient evidence, significant, data unlikely under H0
- P-value > α: fail to reject H0; not sufficient evidence, not significant, data likely under H0
- If t0 falls outside ±t* from the chart: reject H0
- sε: estimated error
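A minimal sketch of the slope t-test via scipy.stats.linregress (not from the original sheet; the data are invented):

```python
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 3.1, 4.8, 5.2, 6.9, 7.4]

res = stats.linregress(x, y)
t = res.slope / res.stderr        # t = b1 / s_b1, df = n - 2
print(res.slope, res.stderr, t)
print(res.pvalue)                 # two-sided p-value for H0: beta1 = 0
```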

Multiple Regression

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$, where: k = number of independent variables potentially related to the dependent variable; y = dependent variable; x1, x2, …, xk = independent variables; β0, β1, …, βk = coefficients; ε = error variable
- Independent variables may be functions of other variables (e.g. x², x1·x2)
- Meaning of a Regression Coefficient βi: with all other variables held constant, if xi increases by 1, the expected y increases/decreases on average by the coefficient (state what it means in context)

Adjusted R²: takes into account the sample size and the number of independent variables; when k is large relative to n, the unadjusted R² may be unrealistically high

$R^2_{adj} = 1 - \dfrac{(1 - R^2)(n - 1)}{n - k - 1}$

CI / Test for Each Variable: if the P-value < α, we conclude that βi is greater/smaller than 0 (or ≠ 0)

Multiple Regression Tests

Significance of Regression: testing the validity of the model (F-test)
- The F-test combines the t-tests into a single test and is not affected by the problem of multicollinearity, which is when the independent variables are correlated with one another
- H0: β1 = β2 = … = βk = 0: if true, none of the independent variables x1, x2, …, xk is linearly related to y, so the model is invalid
- Ha: at least one βi ≠ 0: the model has some validity
- $F = \dfrac{SSR/k}{SSE/(n - k - 1)} = \dfrac{MSR}{MSE}$
- df (numerator): v = k; df (denominator): v = n − k − 1
- If F > F_{α, k, n−k−1}: reject H0 because the model is valid, the regression is significant

Significance of Each Variable: testing the coefficients (t-tests)
- t-tests of individual coefficients allow us to determine whether a linear relationship exists between xi and y (for i = 1, 2, …, k); using lots of t-tests instead of the F-test to test the validity of the model increases the probability of Type I error
- H0: βi = 0: no linear relationship; Ha: βi ≠ 0
- $t = \dfrac{b_i - \beta_i}{s_{b_i}}$, df: v = n − k − 1
- If |t0| > t_{α/2, n−k−1}: reject H0, the parameter is significant; otherwise it is insignificant for determining y

Conclusion: P-value:
- P-value < α: reject H0, accept Ha; sufficient evidence, significant, data unlikely under H0
- P-value > α: fail to reject H0; not sufficient evidence, not significant, data likely under H0

Confidence Interval Estimator of E(y): will be narrower than the prediction interval because there is less error in estimating a mean value than in predicting an individual value
- sε: estimated error
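A minimal sketch of the overall F-test and per-coefficient t-tests using statsmodels (an assumed dependency, not part of the original sheet; the data are simulated):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))    # two independent variables, invented
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)         # overall validity (F-test)
print(model.tvalues, model.pvalues)         # each coefficient (t-tests)
print(model.rsquared, model.rsquared_adj)   # R^2 and adjusted R^2
```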

Analysis of Variance (ANOVA)


One-Way ANOVA: one-way analysis of variance, because we use a single property, or characteristic, for categorizing the populations
ANOVA: a method of testing the equality of three or more population means by analyzing sample variances; we test the hypothesis by determining whether the variation between groups is larger than the variation within groups

H0: μ1 = μ2 = … = μk
Ha: at least one μj is different (not all the same)

F: the test statistic compares the variation among groups to the variation within groups:
$F = \dfrac{MS(\text{between})}{MS(\text{within})} = \dfrac{SST/(g - 1)}{SSE/(n - g)}$

If the differences among sample means are very large relative to the variation within groups, the numerator of the test statistic becomes larger than the denominator, so large values of the test statistic suggest unequal means

ANOVA Table (software output labels the rows Regression/Residual/Total):
  Source                df      SS          MS                  F
  Between (Treatments)  g − 1   SST         MST = SST/(g − 1)   F = MST/MSE
  Within (Error)        n − g   SSE         MSE = SSE/(n − g)
  Total                 n − 1   SS(Total)

P-Value = P(F_{g−1, n−g} > F0); P-value < α: reject H0
- g = number of groups (means being compared); n = total number of observations
- SST: sum of squares between groups (treatments); represents the variation between the means of the groups
- SSE: sum of squares within groups (error); represents the variation within a group due to random error
- SS(Total): total sum of squares; represents the total variation among all data points and equals the sum of squares between groups plus the sum of squares within groups
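A minimal sketch of one-way ANOVA via scipy (not from the original sheet; the three groups are invented):

```python
from scipy import stats

g1 = [23.0, 25.1, 24.3, 26.0, 24.8]
g2 = [27.2, 26.5, 28.0, 27.8, 26.9]
g3 = [24.9, 25.5, 26.1, 25.0, 25.7]

f_stat, p_value = stats.f_oneway(g1, g2, g3)   # F = MS(between) / MS(within)
print(f_stat, p_value)                          # small p: not all means equal
```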

Regression Statistics (software output)
- Summary block: Multiple R, R Square, Adjusted R Square, Standard Error, Observations
- ANOVA block: df (Regression = k, Residual = n − k − 1, Total = n − 1), SS, MS, F, Significance F, where F-stat = MSR/MSE
- Coefficients block (Intercept, X Variable): Coefficient, Standard Error, t-Stat, P-value = 2 × P(t_{n−2} > |t-stat|), Lower 95%, Upper 95%

Conditions + Assumptions
Regression Lines (Correlation)
- Quantitative Variables Condition
- Linearity Condition
- Outlier Condition
- Equal Spread Condition: check that the spread is about the same throughout
Model for Sampling Distribution of Proportions (68-95-99.7)
- Independence Assumption: Randomization Condition; 10% Condition: n < 10% of the population
- Sample Size Assumption: sample size n must be large enough; Success/Failure Condition: np > 10 and nq > 10
Model for Sampling Distribution of Means: z-score
- Independence Assumption: Randomization Condition
- Sample Size Assumption: 10% Condition; Large Enough Sample Condition: depends on the shape of the original data distribution
Confidence Intervals for one proportion (one-proportion z-interval, one-proportion z-test)
- Independence Assumption: Randomization Condition; 10% Condition
- Sample Size Assumption: inference relies on the CLT, so we need a large enough sampling model; Success/Failure Condition: np̂ ≥ 10 and nq̂ ≥ 10
Sampling Distribution for a Mean: t-score
- Independence Assumption: Randomization Condition; 10% Condition
- Normal Population Assumption: Student's t-model won't work for data that are badly skewed
  - Nearly Normal Condition: n < 15: data should follow a normal model; 15 < n < 40: t-methods work well as long as the data are unimodal and symmetric; n > 40: t-methods are safe to use unless the data are very skewed, and can be used even for very skewed data if n is large enough because the sampling distribution is close enough to Normal

ANOVA
1. All populations are normally distributed
2. The population variances are equal
3. The observations are independent of one another
Multiple Regression: required conditions for the error variable ε
1. The probability distribution of ε is normal
2. The mean of ε is 0
3. The standard deviation of ε is σε, which is constant for each value of x
4. The errors are independent
Simple Linear Regression (inference): required conditions for the error variable ε
1. The probability distribution of ε is normal
2. The mean of the distribution is 0; that is, E(ε) = 0
3. The standard deviation of ε is σε, which is constant regardless of the value of x
4. The value of ε associated with any particular value of y is independent of the ε associated with any other value of y
Chi-Square Test
1. Expected cell frequency condition: all expected cell counts are at least 5, so χ² is reliable
Comparing Two Means
1. Independence Assumption: Randomization, 10% Condition
2. Normal Population Assumption
   a. n < 15: do not use Student's t if skewed
   b. 15 ≤ n ≤ 40: okay if mildly skewed
   c. n > 40: CLT works unless the data are very skewed
3. Independent Groups Assumption: two independent samples
Paired t-Test
1. Paired data assumption
2. Independence assumption
