
Statistics Made Easy

Hani Tamim, MPH, PhD

Assistant Professor, Epidemiology and Biostatistics Research Center / College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh – Saudi Arabia

Objective of medical research

Is treatment A better than treatment B for patients with hypertension?

What is the survival rate among ICU patients?

What is the incidence of Down’s syndrome among a certain group of people?

Is the use of Oral Contraceptives associated with an increased risk of breast cancer?

Research Process?

Planning → Design → Data collection → Analysis (data entry, data cleaning, data management, data analysis) → Reporting

Statistics is used in ….


What is statistics?

Scientific methods for:

Collecting, organizing, summarizing, presenting, and interpreting data

Definition of some basic terms

Population: The largest collection of entities for which we have interest at a particular time

Sample: A part of a population

Simple random sample: a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected

Definition of some basic terms

Variable: A characteristic of the subjects under observation that takes on different values for different cases, for example: age, gender, diastolic blood pressure

Quantitative variables: Are variables that can convey information regarding amount

Qualitative variables: Are variables in which measurements consist of categorization

Types of variables

Categorical variables

Continuous variables

Categorical variables

Nominal: unordered data

Death

Gender

Country of birth

Ordinal: predetermined order among response classifications

Education

Satisfaction

Continuous variables

Continuous: Not restricted to integers

Age

Weight

Cholesterol

Blood pressure

Steps involved (data)

Data collection

Database structure

Data entry

Data cleaning

Data management

Data analyses

Data collection

Data collection:

Collection of information that will be used to answer the research question

Could be done through questionnaires, interviews, data abstraction, etc.


Database structure

Database structure:

Structure the database (using SPSS) into which the data will be entered

Data entry

Data entry:

Entering the information (data) into the computer

Most of the time it is done manually

Single data entry vs. double data entry

Data cleaning

Data cleaning:

Identify any data entry mistakes

Correct such mistakes
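The slides do not tie this step to a particular tool. As a minimal sketch only (not the author's procedure), a simple range check in Python/pandas could flag likely entry mistakes; the file name and the column names (patient_id, sbp) are hypothetical.

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    df = pd.read_csv("emergency_patients.csv")

    # Flag implausible systolic blood pressure values for review and correction.
    suspicious = df[(df["sbp"] < 40) | (df["sbp"] > 300)]
    print(suspicious[["patient_id", "sbp"]])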

Data management

Data management:

Create new variables based on different criteria

Such as:

Computing BMI, recoding, categorizing age (less than 50 years vs. 50 years and above), etc.
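The course uses SPSS for this step; purely as an illustration, the same derived variables could be created in Python/pandas along these lines (the column names and values are hypothetical).

    import pandas as pd
    import numpy as np

    # Hypothetical data: one row per patient.
    df = pd.DataFrame({"weight_kg": [70, 95], "height_m": [1.75, 1.60], "age": [45, 63]})

    # BMI = weight (kg) divided by height (m) squared
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

    # Categorize age: less than 50 years vs. 50 years and above
    df["age_group"] = np.where(df["age"] < 50, "<50", ">=50")
    print(df)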

Data analyses

Data analyses:

Descriptive statistics: are the techniques used to describe the main features of a sample

Inferential statistics: is the process of using the sample statistic to make informed guesses about the value of a population parameter

Data analyses

Data analyses:

Univariate analyses

Bivariate analyses

Multivariate analyses

Bottom line

There are different statistical methods for different types of variables

Descriptive statistics: categorical variables

Frequency distribution

Graphical representation

Descriptive statistics: categorical variables

Frequency distribution

A frequency distribution lists, for each value (or small range of values) of a variable, the number or proportion of times that observation occurs in the study population

Descriptive statistics: categorical variables

Frequency distribution:

How to describe a categorical variable (marital status)?

Descriptive statistics: categorical variables

Construct a frequency distribution

Title, values, frequency, relative frequency (percent), valid relative frequency (valid percent), cumulative relative frequency (cumulative percent)

Descriptive statistics: categorical variables

Marital status of the 291 patients admitted to the Emergency Department

     

Marital status   | Frequency | Percent | Valid Percent | Cumulative Percent
Valid    Married | 266       | 91.4    | 94.7          | 94.7
         Single  | 13        | 4.5     | 4.6           | 99.3
         Widow   | 2         | .7      | .7            | 100.0
         Total   | 281       | 96.6    | 100.0         |
Missing  System  | 10        | 3.4     |               |
Total            | 291       | 100.0   |               |
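The table above is SPSS output. As an illustration of the same idea (not the author's workflow), a frequency distribution with percent and valid percent could be computed in Python/pandas as sketched below, rebuilding the variable from the counts in the table.

    import pandas as pd

    # Marital status rebuilt from the counts above (10 values missing).
    marital = pd.Series(["Married"] * 266 + ["Single"] * 13 + ["Widow"] * 2 + [None] * 10)

    frequency = marital.value_counts(dropna=False)                      # counts, including missing
    percent = marital.value_counts(dropna=False, normalize=True) * 100  # percent of all 291
    valid_percent = marital.value_counts(normalize=True) * 100          # percent of the 281 valid
    print(frequency)
    print(percent.round(1))
    print(valid_percent.round(1))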

Example: summarizing data

Descriptive statistics: categorical variables

Graphical representation

A graph displays, for each value (or small range of values) of a variable, the number or proportion of times that observation occurs in the study population

Descriptive statistics: categorical variables

Graphical representation:

Two types

Bar chart and pie chart

Descriptive statistics: categorical variables

Construct a bar or pie chart

Title, values, frequency or relative frequency, properly labelled axes
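As a minimal sketch of such a chart outside SPSS (matplotlib, with the counts taken from the marital status table above):

    import matplotlib.pyplot as plt

    categories = ["Married", "Single", "Widow"]
    counts = [266, 13, 2]

    plt.bar(categories, counts)
    plt.title("Marital status of patients admitted to the Emergency Department")
    plt.xlabel("Marital status")   # properly labelled axes
    plt.ylabel("Frequency")
    plt.show()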


Descriptive statistics: continuous variables

Central tendency

Dispersion

Graphical representation

Descriptive statistics: continuous variables

How to describe a continuous variable (Systolic blood pressure)?

Central tendency:

Mean, median, mode

Descriptive statistics: continuous variables

Mean:

Add up the data, then divide by the sample size (n). The sample size n is the number of observations (pieces of data).

Example: n = 5 systolic blood pressures (mmHg)

X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

X̄ = (120 + 80 + 90 + 110 + 95) / 5 = 99 mmHg

Descriptive statistics: continuous variables

Formula:

X̄ = (Σ Xi) / n,  where the sum runs from i = 1 to n

Summation sign (Σ) is just mathematical shorthand for "add up all of the observations":

Σ Xi = X1 + X2 + X3 + … + Xn

Descriptive statistics: continuous variables

Also called the sample average or arithmetic mean, denoted X̄

Properties: uniqueness, simplicity; sensitive to extreme values (one data point can greatly change the sample mean)

Descriptive statistics: continuous variables

Median: is the middle number, or the number that cuts the data in half

Example: 80  90  95  110  120 → the sample median is 95 mmHg

The sample median is not sensitive to extreme values. For example, if 120 became 200, the median would remain the same, but the mean would change to 115.

Descriptive statistics: continuous variables

If the sample size is an even number

Example: 80  90  95  110  120  125

Median = (95 + 110) / 2 = 102.5 mmHg

Descriptive statistics: continuous variables

Median: Formula

n odd: Median = the middle value, i.e. the ((n + 1)/2)th observation; n even: Median = the mean of the two middle values, the (n/2)th and ((n + 2)/2)th observations

Properties:

Uniqueness, simplicity, not affected by extreme values

Descriptive statistics: continuous variables

Mode: Most frequently occurring number

Example: 80  90  95  95  120  125 → Mode = 95
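A quick check of these three measures with Python's standard library, using the values from the slides:

    import statistics

    sbp = [120, 80, 90, 110, 95]           # n = 5 systolic blood pressures
    print(statistics.mean(sbp))             # 99
    print(statistics.median(sbp))           # 95

    even = [80, 90, 95, 110, 120, 125]      # even sample size
    print(statistics.median(even))          # 102.5

    print(statistics.mode([80, 90, 95, 95, 120, 125]))  # 95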

Descriptive statistics: continuous variables

Example:

Statistics: Systolic blood pressure
N Valid: 286, Missing: 5
Mean: 144.13
Median: 144.50
Mode: 155

Descriptive statistics: continuous variables

Central tendency measures do not tell the whole story

Example:

Data set 1: 21  22  23  23  23  24  24  25  28 → Mean = 213/9 = 23.6, Median = 23

Data set 2: 15  18  21  21  23  25  25  32  33 → Mean = 213/9 = 23.6, Median = 23

Descriptive statistics: continuous variables

How to describe a continuous variable (Systolic blood pressure) in addition to central tendency?

Measures of dispersion:

Range, variance, standard deviation

Descriptive statistics: continuous variables

Range

Range = Maximum – Minimum

Example: X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

Range = 120 – 80 = 40 mmHg

Descriptive statistics: continuous variables

Sample variance (s², var, or σ²): the average of the squared deviations about the sample mean

s² = Σ (Xi − X̄)² / (n − 1),  summed over i = 1 to n

Sample standard deviation (s or SD or σ): the square root of the variance

s = √[ Σ (Xi − X̄)² / (n − 1) ]

Descriptive statistics: continuous variables

 

Example: n = 5 systolic blood pressures (mmHg); recall from earlier that the mean X̄ = 99 mmHg

X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

Σ (Xi − X̄)² = (120 − 99)² + (80 − 99)² + (90 − 99)² + (110 − 99)² + (95 − 99)² = 1020

Descriptive statistics: continuous variables

Sample Variance

s² = Σ (Xi − X̄)² / (n − 1) = 1020 / 4 = 255

Sample standard deviation: s = √s² = √255 = 15.97 mmHg
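The same calculation in Python's standard library (statistics.variance and statistics.stdev use the n − 1 denominator, as in the formula above):

    import statistics

    sbp = [120, 80, 90, 110, 95]
    print(statistics.variance(sbp))   # 255.0
    print(statistics.stdev(sbp))      # about 15.97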

Descriptive statistics: continuous variables

The bigger s, the more variability

s measures the spread about the mean

s can equal 0 only if there is no spread

All n observations have the same value

The units of s are the same as the units of the data (for example, mmHg)

Descriptive statistics: continuous variables

Example:

Statistics: Systolic blood pressure
N Valid: 286, Missing: 5
Mean: 144.13
Median: 144.50
Mode: 155
Std. Deviation: 35.312
Variance: 1246.916
Range: 202
Minimum: 55
Maximum: 257

Example: summarizing data


Descriptive statistics: continuous variables

Graphical representation:

Different types

Histogram

Descriptive statistics: continuous variables

Construct a chart

Title, values, frequency or relative frequency, properly labelled axes


Shapes of the Distribution

Three common shapes of frequency distributions:

(A) Symmetrical and bell shaped

(B) Positively skewed, or skewed to the right

(C) Negatively skewed, or skewed to the left

Shapes of Distributions

Symmetric (right and left sides are mirror images): the left tail looks like the right tail, and Mean = Median = Mode

Shapes of Distributions

Left skewed (negatively skewed)

Long left tail; Mean < Median

Shapes of Distributions

Right skewed (positively skewed)

Long right tail; Mean > Median

Shapes of the Distribution

Three less common shapes of frequency distributions:

(A) Bimodal

(B) Reverse J-shaped

(C) Uniform

Probability


Definition:

The likelihood that a given event will occur

It ranges between 0 and 1:

0 means the event cannot occur; 1 means the event will definitely occur

How do we calculate it?

Frequentist Approach:

Probability is the long-term relative frequency

Thus, it is an idealization based on imagining what would happen to the relative frequencies in an indefinitely long series of trials

Application in medicine

How does probability apply in medicine?

Probability is the most important theory behind biostatistics

It is used at different levels

Descriptive

Example: 4% chance of a patient dying after admission to emergency department (from the previous example)

What do we mean? Out of each 100 patients admitted to the emergency department, 4 will die, whereas 96 will be discharged alive

Example: 1 in 1000 babies are born with a certain abnormality!

Incidence and prevalence

Associations

Example: the association between cigarette smoking and death after admission to the emergency department with an MI

Current cigarette smoking in association with death at discharge (count)

Current cigarette smoking | Death | Discharged | Total
No                        | 5     | 123        | 128
Yes                       | 5     | 154        | 159
Total                     | 10    | 277        | 287

Probability of being a smoker = 159 / 287 = 55.4%

Probability of dying if a smoker = 5 / 159 = 3.1%
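These probabilities are simple proportions taken from the table above; a minimal arithmetic sketch:

    smokers = 159        # current smokers (row total)
    total = 287          # patients with both variables recorded
    smoker_deaths = 5

    print(smokers / total)           # probability of being a smoker, about 0.55
    print(smoker_deaths / smokers)   # probability of dying if a smoker, about 0.031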

Associations

Same is applied to:

Relative risk

Risk difference

Attributable risk

Odds ratio

Etc…

Bottom line

Probability is applied at all levels of statistical analyses

Probability distributions

Probability distributions list or describe probabilities for all possible occurrences of a random variable

There are two types of probability distributions:

Categorical distributions

Continuous distributions

Probability distributions: categorical variables

Categorical variables

Frequency distribution

Other distributions, such as binomial

Probability distributions: continuous variables

Continuous variables

Continuous distribution

Such as Z and t distributions

Normal Distribution

Properties of a Normal Distribution

Also called Gaussian distribution

A continuous, bell-shaped, symmetrical distribution; both tails extend to infinity

The mean, median, and mode are identical

The shape is completely determined by the mean and standard deviation

Normal Distribution

A normal distribution can have any µ and any σ. For example, Age: µ = 40, σ = 10.

The area under the curve represents 100% of all the observations.

The mean, median, and mode all lie at the centre of the curve.

Normal Distribution


Age distribution for a specific population

Mean = 40, SD = 10: 50% of the observations lie below the mean and 50% above it.

Normal Distribution

Age distribution for a specific population

What proportion of the population is below Age = 25? (Mean = 40, SD = 10)

Normal distribution

The formula used to calculate the area below a certain point in a normal distribution is the probability density function of the normal distribution with mean µ and variance σ²:

f(x) = (1 / (σ √(2π))) · exp( −(x − µ)² / (2σ²) )

Normal distribution

Thus, for any normal distribution, once we have the mean and sd, we can calculate the percentage of subjects:

Above a certain level, below a certain level, or between different levels

But the problem is:

Calculation is very complicated and time consuming, so:

Standardized Normal Distribution

We transform to a standardized normal distribution

What does this mean?

For a specific distribution, we calculate all possible probabilities, and record them in a table

A normal distribution with a µ = 0, σ = 1 is called a Standardized Normal Distribution

Standardized Normal Distribution

Mean = 0, SD = 1

Area under the Normal Curve from 0 to X

X    0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0  0.00000 0.00399 0.00798 0.01197 0.01595 0.01994 0.02392 0.02790 0.03188 0.03586
0.1  0.03983 0.04380 0.04776 0.05172 0.05567 0.05962 0.06356 0.06749 0.07142 0.07535
0.2  0.07926 0.08317 0.08706 0.09095 0.09483 0.09871 0.10257 0.10642 0.11026 0.11409
0.3  0.11791 0.12172 0.12552 0.12930 0.13307 0.13683 0.14058 0.14431 0.14803 0.15173
0.4  0.15542 0.15910 0.16276 0.16640 0.17003 0.17364 0.17724 0.18082 0.18439 0.18793
0.5  0.19146 0.19497 0.19847 0.20194 0.20540 0.20884 0.21226 0.21566 0.21904 0.22240
0.6  0.22575 0.22907 0.23237 0.23565 0.23891 0.24215 0.24537 0.24857 0.25175 0.25490
0.7  0.25804 0.26115 0.26424 0.26730 0.27035 0.27337 0.27637 0.27935 0.28230 0.28524
0.8  0.28814 0.29103 0.29389 0.29673 0.29955 0.30234 0.30511 0.30785 0.31057 0.31327
0.9  0.31594 0.31859 0.32121 0.32381 0.32639 0.32894 0.33147 0.33398 0.33646 0.33891
1.0  0.34134 0.34375 0.34614 0.34849 0.35083 0.35314 0.35543 0.35769 0.35993 0.36214
1.1  0.36433 0.36650 0.36864 0.37076 0.37286 0.37493 0.37698 0.37900 0.38100 0.38298
1.2  0.38493 0.38686 0.38877 0.39065 0.39251 0.39435 0.39617 0.39796 0.39973 0.40147
1.3  0.40320 0.40490 0.40658 0.40824 0.40988 0.41149 0.41308 0.41466 0.41621 0.41774
1.4  0.41924 0.42073 0.42220 0.42364 0.42507 0.42647 0.42785 0.42922 0.43056 0.43189
1.5  0.43319 0.43448 0.43574 0.43699 0.43822 0.43943 0.44062 0.44179 0.44295 0.44408
1.6  0.44520 0.44630 0.44738 0.44845 0.44950 0.45053 0.45154 0.45254 0.45352 0.45449
1.7  0.45543 0.45637 0.45728 0.45818 0.45907 0.45994 0.46080 0.46164 0.46246 0.46327
1.8  0.46407 0.46485 0.46562 0.46638 0.46712 0.46784 0.46856 0.46926 0.46995 0.47062
1.9  0.47128 0.47193 0.47257 0.47320 0.47381 0.47441 0.47500 0.47558 0.47615 0.47670
2.0  0.47725 0.47778 0.47831 0.47882 0.47932 0.47982 0.48030 0.48077 0.48124 0.48169
2.1  0.48214 0.48257 0.48300 0.48341 0.48382 0.48422 0.48461 0.48500 0.48537 0.48574
2.2  0.48610 0.48645 0.48679 0.48713 0.48745 0.48778 0.48809 0.48840 0.48870 0.48899
2.3  0.48928 0.48956 0.48983 0.49010 0.49036 0.49061 0.49086 0.49111 0.49134 0.49158
2.4  0.49180 0.49202 0.49224 0.49245 0.49266 0.49286 0.49305 0.49324 0.49343 0.49361
2.5  0.49379 0.49396 0.49413 0.49430 0.49446 0.49461 0.49477 0.49492 0.49506 0.49520
2.6  0.49534 0.49547 0.49560 0.49573 0.49585 0.49598 0.49609 0.49621 0.49632 0.49643
2.7  0.49653 0.49664 0.49674 0.49683 0.49693 0.49702 0.49711 0.49720 0.49728 0.49736
2.8  0.49744 0.49752 0.49760 0.49767 0.49774 0.49781 0.49788 0.49795 0.49801 0.49807
2.9  0.49813 0.49819 0.49825 0.49831 0.49836 0.49841 0.49846 0.49851 0.49856 0.49861
3.0  0.49865 0.49869 0.49874 0.49878 0.49882 0.49886 0.49889 0.49893 0.49896 0.49900
3.1  0.49903 0.49906 0.49910 0.49913 0.49916 0.49918 0.49921 0.49924 0.49926 0.49929
3.2  0.49931 0.49934 0.49936 0.49938 0.49940 0.49942 0.49944 0.49946 0.49948 0.49950
3.3  0.49952 0.49953 0.49955 0.49957 0.49958 0.49960 0.49961 0.49962 0.49964 0.49965
3.4  0.49966 0.49968 0.49969 0.49970 0.49971 0.49972 0.49973 0.49974 0.49975 0.49976
3.5  0.49977 0.49978 0.49978 0.49979 0.49980 0.49981 0.49981 0.49982 0.49983 0.49983
3.6  0.49984 0.49985 0.49985 0.49986 0.49986 0.49987 0.49987 0.49988 0.49988 0.49989
3.7  0.49989 0.49990 0.49990 0.49990 0.49991 0.49991 0.49992 0.49992 0.49992 0.49992
3.8  0.49993 0.49993 0.49993 0.49994 0.49994 0.49994 0.49994 0.49995 0.49995 0.49995
3.9  0.49995 0.49995 0.49996 0.49996 0.49996 0.49996 0.49996 0.49996 0.49997 0.49997

Standardized Normal Distribution

Normal distribution (Mean = µ, SD = σ)  →  TRANSFORM  →  Standardized normal distribution Z (Mean = 0, SD = 1)

Z = (x − µ) / σ

Standardized Normal Distribution

Normal distribution (Mean = 40, SD = 10)  →  TRANSFORM  →  Standardized normal distribution Z (Mean = 0, SD = 1)

Z(40) = (x − µ) / σ = (40 − 40) / 10 = 0

Standardized Normal Distribution

Normal distribution (Mean = 40, SD = 10)  →  TRANSFORM  →  Standardized normal distribution Z (Mean = 0, SD = 1)

Z(30) = (x − µ) / σ = (30 − 40) / 10 = −1

Standardized Normal Distribution: summary

For any normal distribution, we can transform the values to the standardized normal distribution (Z) and use the Z table to get the area above a certain level, below a certain level, or between different levels.
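A sketch of the transformation and the table lookup using scipy, with the age example from the slides (µ = 40, σ = 10); norm.cdf gives the area below a value directly, which is what the Z table tabulates relative to 0.

    from scipy.stats import norm

    mu, sigma = 40, 10

    z = (25 - mu) / sigma                  # standardize Age = 25  ->  z = -1.5
    print(norm.cdf(z))                     # area below Age 25, about 0.067

    # 68-95-99.7 rule: area within 1, 2 and 3 SDs of the mean
    for k in (1, 2, 3):
        print(norm.cdf(k) - norm.cdf(-k))  # about 0.683, 0.954, 0.997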

Normal Distribution

Age distribution for a specific population

Mean = 40, SD = 10: 68% of the observations fall between 30 (Mean − 1SD) and 50 (Mean + 1SD)

Normal Distribution

Age distribution for a specific population

Mean = 40, SD = 10: 95% of the observations fall between 20 (Mean − 2SD) and 60 (Mean + 2SD)

Normal Distribution

Age distribution for a specific population

Mean = 40, SD = 10: 99.7% of the observations fall between 10 (Mean − 3SD) and 70 (Mean + 3SD)

Practical example

The 68-95-99.7 Rule for the Normal Distribution

68% of the observations fall within one standard deviation of the mean

95% of the observations fall within two standard deviations of the mean

99.7% of the observations fall within three standard deviations of the mean

When applied to ‘real data’, these estimates are considered approximate!

Distributions of Blood Pressure

The 68-95-99.7 rule applied to the distribution of systolic blood pressure in men (Mean = 125 mmHg, s = 14 mmHg): about 68% of values lie between 111 and 139, 95% between 97 and 153, and 99.7% between 83 and 167.

Data analyses

Data analyses:

Descriptive statistics: are the techniques used to describe the main features of a sample

Inferential statistics: is the process of using the sample statistic to make informed guesses about the value of a population parameter

Why do we carry out research?

Population → sample

Inference: Drawing conclusions on certain questions about a population from sample data

Inferential statistics

Since we are not studying the whole population, we have to draw conclusions about the population based on the results we get from the sample

Simple example: Say we want to estimate the average systolic blood pressure for patients admitted to the emergency department after having an MI

Other more complicated measures might be quality of life, satisfaction with care, risk of outcome, etc.

Inferential statistics

What do we do?

Take a sample (n=291) of patients admitted to emergency department in a certain hospital

Calculate the mean and SD (descriptive statistics) of systolic blood pressure

Statistics: Systolic blood pressure
N Valid: 286, Missing: 5
Mean: 144.13
Std. Deviation: 35.312

Inferential statistics

The next step is to make a link between the estimates we observed from the sample and those of the underlying population (inferential statistics)

What can we say about these estimates as compared to the unknown true ones???

In other words, we are trying to estimate the average systolic blood pressure for ALL patients admitted to the emergency department after an MI

Inferential statistics

Sample data: N = 291, Mean = 144, SD = 35

Inference

In statistical inference we usually encounter TWO issues

Estimate the value of the population parameter. This is done through a point estimate and an interval estimate (confidence interval)

Evaluate a hypothesis about a population parameter rather than simply estimating it. This is done through tests of significance known as hypothesis testing (P-value)

1- Confidence Interval


Confidence Intervals

A point estimate:

A single numerical value used to estimate a population parameter.

Interval estimate:

Consists of two numerical values defining a range that, with a specified degree of confidence, includes the parameter being estimated (an interval with 95% confidence is the one most commonly used)

Example

What is the average systolic blood pressure for patients admitted to emergency departments after an MI?

Select a sample

Point estimate = mean = 144

Interval estimate = 95% CI = mean ± 1.96 × (35 / √291) = 144 ± 4 = (140 – 148)

95% Confidence Interval:

Upper limit = x̄ + z(1−α/2) × SE

Lower limit = x̄ − z(1−α/2) × SE
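A minimal sketch of the same interval in Python, using the summary figures from the slides (1.96 is the z value for 95% confidence):

    import math

    mean, sd, n = 144, 35, 291
    se = sd / math.sqrt(n)             # standard error of the mean, about 2.05

    lower = mean - 1.96 * se
    upper = mean + 1.96 * se
    print(round(lower), round(upper))  # roughly 140 and 148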

Sampling distribution of mean

For samples of size N = 291, 95% of sample means fall between µ − 2SE and µ + 2SE.

Standard error

Standard error

SE = SD / √n

As sample size increases the standard error decreases

The estimation, as measured by the confidence interval, will be better, i.e. a narrower confidence interval

Interpretation

95% Confidence Interval

There is 95% probability that the true parameter is within the calculated interval

Thus, if we repeat the sampling procedure 100 times, the above statement will be:

correct in 95 times (the true parameter is within the interval)

wrong in 5 times (the true parameter is outside the interval) (also called α error)

Notes on Confidence Intervals

Interpretation

It provides a range of plausible values, at a stated level of confidence, for the population average systolic blood pressure

Are all CIs 95%?

No

It is the most commonly used

A 99% CI is wider

A 90% CI is narrower

Notes on Confidence Intervals

To be “more confident” you need a bigger interval

For a 99% CI, you need ± 2.6 SEM; for a 95% CI, you need ± 2 SEM; for a 90% CI, you need ± 1.65 SEM

2- P-value

Inference

P-value

Is related to another type of inference

Hypothesis testing

Evaluate a hypothesis about a population parameter rather than simply estimating it

Hypothesis testing

Back to our previous example

We want to make inference about the average systolic blood pressure of patients admitted to emergency department after MI

Assume that the normal systolic blood pressure is 120

The question is whether the average systolic blood pressure for patients admitted to emergency departments is different than the normal, which is 120

Hypothesis testing

Two types of hypotheses:

Null hypothesis: is a statement consistent with “no difference”

Alternative hypothesis: is a statement that disagrees with the null hypothesis, and is consistent with presence of “difference”

The logic of hypothesis testing

To decide which of the hypothesis is true

Take a sample from the population

If the data are consistent with the null hypothesis, then we do not reject the null hypothesis (conclusion = “no difference”)

If the sample data are not consistent with the null hypothesis, then we reject the null (conclusion = “difference”)

Hypothesis testing

Example: is the systolic blood pressure for patients admitted to

emergency department after an MI normal (ie =120)?

Ho: µ = 120

Ha: µ ≠ 120

How do we answer this question?

We take a sample and find that the mean is 144 mmHg

Can we consider that 144 is consistent with the normal value (120 mmHg)?

Hypothesis testing

Sample of N = 291 with mean = 144, under Ho: µ = 120.

Does the sample mean look consistent with the null hypothesis? Is it still consistent?

Hypothesis testing

Under Ho: µ = 120, 95% of sample means (N = 291) fall between µ − 2SE and µ + 2SE, with 2.5% in each tail.

Test statistic

It is the statistic used for deciding whether the null hypothesis should be rejected or not

Used to calculate the probability of getting the observed results if the null hypothesis is true.

This probability is called the p-value.

How to decide

We calculate the probability of obtaining a sample with mean of 144 if the true mean is 120 due to chance alone (p-value)

Based on p-value we make our decision:

If the p-value is low then this is taken as evidence that it is unlikely that the null hypothesis is true, then we reject the null hypothesis (we accept alternative one)

If the p-value is high, it indicates that most probably the null hypothesis is true, and thus we do not reject the Ho

Problem!

We could be making the wrong decisions

Decision           | Ho True          | Ho False
Do not reject Ho   | Correct decision | Type II error
Reject Ho          | Type I error     | Correct decision

Type I error: is rejecting the null hypothesis when it is true

Type II error: is not rejecting the null hypothesis when it is false

Error

Type I error:

Referred to as α; the probability of rejecting a true null hypothesis

Type II error:

Referred to as β; the probability of accepting a false null hypothesis

Power:

Represented by 1 − β; the probability of correctly rejecting a false null hypothesis

Significance level

The significance level, α, of a hypothesis test is defined as the probability of making a type I error, that is the probability of rejecting a true null hypothesis

It could be set to any value, as:

0.05

0.01

0.1

Statistical significance

If the p-value is less than some pre-determined cutoff (e.g. 0.05), the result is called "statistically significant"

This cutoff is the α-level

The α-level is the probability of a type I error

It is the probability of falsely rejecting H0

Back to the example

To test whether the average systolic blood pressure for patients admitted to the emergency department after an MI is different than 120 (which is the normal blood pressure)

We carry out a test called a "one sample t-test", which provides a p-value based on which we reject or do not reject the null hypothesis.

Back to the example

One-Sample Statistics
Systolic blood pressure: N = 286, Mean = 144.13, Std. Deviation = 35.312, Std. Error Mean = 2.088

One-Sample Test (Test Value = 120)
Systolic blood pressure: t = 11.558, df = 285, Sig. (2-tailed) = .000, Mean Difference = 24.133, 95% CI of the Difference = (20.02, 28.24)

Since the p-value is less than 0.05, the conclusion is that the systolic blood pressure of patients admitted to the emergency department after an MI is significantly higher than the normal value of 120
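The SPSS output above can be reproduced from the summary statistics alone; as a sketch, the p-value can be obtained with scipy (figures taken from the slides):

    import math
    from scipy.stats import t

    n, mean, sd, test_value = 286, 144.13, 35.312, 120

    se = sd / math.sqrt(n)                     # standard error, about 2.088
    t_stat = (mean - test_value) / se          # about 11.56
    p_value = 2 * t.sf(abs(t_stat), df=n - 1)  # two-tailed p-value, far below 0.05
    print(t_stat, p_value)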

p-values

p-values are probabilities (numbers between 0 and 1)

Small p-values mean that the sample results are unlikely when the null is true

The p-value is the probability of obtaining a result as/or more extreme than you did by chance alone assuming the null hypothesis H 0 is true

t-distribution

The t-distribution looks like a standard normal curve

A t-distribution is determined by its degrees of freedom (n-1), the lower the degrees of freedom, the flatter and fatter it is

(Figure: the standard normal N(0,1) curve compared with t distributions with 35 and 15 degrees of freedom.)

t distribution critical values (cumulative probability, by degrees of freedom ν):

ν    75%    80%    85%    90%    95%    97.5%  99%    99.5%  99.75% 99.9%  99.95%
1    1.000  1.376  1.963  3.078  6.314  12.71  31.82  63.66  127.3  318.3  636.6
2    0.816  1.061  1.386  1.886  2.920  4.303  6.965  9.925  14.09  22.33  31.60
3    0.765  0.978  1.250  1.638  2.353  3.182  4.541  5.841  7.453  10.21  12.92
4    0.741  0.941  1.190  1.533  2.132  2.776  3.747  4.604  5.598  7.173  8.610
5    0.727  0.920  1.156  1.476  2.015  2.571  3.365  4.032  4.773  5.893  6.869
6    0.718  0.906  1.134  1.440  1.943  2.447  3.143  3.707  4.317  5.208  5.959
7    0.711  0.896  1.119  1.415  1.895  2.365  2.998  3.499  4.029  4.785  5.408
8    0.706  0.889  1.108  1.397  1.860  2.306  2.896  3.355  3.833  4.501  5.041
9    0.703  0.883  1.100  1.383  1.833  2.262  2.821  3.250  3.690  4.297  4.781
10   0.700  0.879  1.093  1.372  1.812  2.228  2.764  3.169  3.581  4.144  4.587
11   0.697  0.876  1.088  1.363  1.796  2.201  2.718  3.106  3.497  4.025  4.437
12   0.695  0.873  1.083  1.356  1.782  2.179  2.681  3.055  3.428  3.930  4.318
13   0.694  0.870  1.079  1.350  1.771  2.160  2.650  3.012  3.372  3.852  4.221
14   0.692  0.868  1.076  1.345  1.761  2.145  2.624  2.977  3.326  3.787  4.140
15   0.691  0.866  1.074  1.341  1.753  2.131  2.602  2.947  3.286  3.733  4.073
16   0.690  0.865  1.071  1.337  1.746  2.120  2.583  2.921  3.252  3.686  4.015
17   0.689  0.863  1.069  1.333  1.740  2.110  2.567  2.898  3.222  3.646  3.965
18   0.688  0.862  1.067  1.330  1.734  2.101  2.552  2.878  3.197  3.610  3.922
19   0.688  0.861  1.066  1.328  1.729  2.093  2.539  2.861  3.174  3.579  3.883
20   0.687  0.860  1.064  1.325  1.725  2.086  2.528  2.845  3.153  3.552  3.850
100  0.677  0.845  1.042  1.290  1.660  1.984  2.364  2.626  2.871  3.174  3.390
120  0.677  0.845  1.041  1.289  1.658  1.980  2.358  2.617  2.860  3.160  3.373
∞    0.674  0.842  1.036  1.282  1.645  1.960  2.326  2.576  2.807  3.090  3.291

Hypothesis Testing

Different types of hypothesis:

Mean (a) = Mean (b); Proportion (a) = Proportion (b); Variance (a) = Variance (b); OR = 1; RR = 1; RD = 0; test of homogeneity; etc.

Example


Comparing two means: paired testing

In the previous example, is the heart rate at admission different than the heart rate at discharge among the patients admitted to the emergency department after an MI?

Statistics

                | Heart Rate at admission | Heart Rate at discharge
N Valid         | 286                     | 77
N Missing       | 5                       | 214
Mean            | 82.64                   | 76.99
Std. Deviation  | 22.598                  | 17.900

Is this decrease in heart rate statistically significant?

Thus, we have to make inference….

Comparing two means: paired testing

What type of test to be used?

Since the measurements of the heart rate at admission and at discharge are dependent on each other (not independent), another type of test is used

Paired t-test

Comparing two means: paired testing

Paired Samples Statistics
Pair 1: Heart Rate at admission: Mean = 81.16, N = 75, Std. Deviation = 23.546, Std. Error Mean = 2.719
        Heart Rate at discharge: Mean = 76.72, N = 75, Std. Deviation = 17.973, Std. Error Mean = 2.075

Paired Samples Test (Pair 1: Heart Rate at admission − Heart Rate at discharge)
Paired differences: Mean = 4.440, Std. Deviation = 25.302, Std. Error Mean = 2.922, 95% CI of the Difference = (−1.381, 10.261)
t = 1.520, df = 74, Sig. (2-tailed) = .133

95% CI ≈ 4.4 ± 1.96 × 2.9

H0: µb − µa = 0

HA: µb − µa ≠ 0

P-value = 0.133, thus no significant difference
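The paired result can be checked in the same way from the summary of the within-patient differences (figures from the SPSS output above); with the raw paired data, scipy.stats.ttest_rel would give the same answer.

    import math
    from scipy.stats import t

    n_pairs, mean_diff, sd_diff = 75, 4.440, 25.302

    se_diff = sd_diff / math.sqrt(n_pairs)           # about 2.92
    t_stat = mean_diff / se_diff                     # about 1.52
    p_value = 2 * t.sf(abs(t_stat), df=n_pairs - 1)  # about 0.13
    print(t_stat, p_value)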

How Are p-values Calculated?

t = (sample mean difference − 0) / SEM = 4.4 / 2.9 = 1.52

The value t = 1.52 is called the test statistic

Then we can compare this t-value against the t table to get the p-value, or obtain it from the computer (0.13)

Interpreting the p-value

The p-value in the example is 0.133

Interpretation: If there is no difference in heart rate between admission and discharge from the emergency department, then the chance of finding a mean difference as extreme as or more extreme than 4.4 in a sample of 75 patients with both measurements is 0.133

Thus, this probability is large (bigger than 0.05), which leads us to conclude that the observed difference of 4.4 could be due to chance

Notes

How to decide on significance from the 95% CI?

3 scenarios

(Figure: three 95% confidence intervals plotted on a scale from −15 to 15 — one lying entirely above 0, one crossing 0, and one lying entirely below 0. A confidence interval for a difference that does not include 0 indicates a statistically significant difference.)

Comparing two means: Independent sample testing

In the previous example, is the systolic blood pressure different between males and females among the patients admitted to the emergency department after an MI?

Group Statistics (Systolic blood pressure)

Sex    | N   | Mean   | Std. Deviation | Std. Error Mean
Male   | 240 | 145.05 | 35.162         | 2.270
Female | 44  | 138.64 |