Statistics Made Easy
Hani Tamim, MPH, PhD
Assistant Professor Epidemiology and Biostatistics Research Center / College of Medicine King Saud bin Abdulaziz University for Health Sciences Riyadh – Saudi Arabia
Objective of medical research
Is treatment A better than treatment B for patients with hypertension?
What is the survival rate among ICU patients?
What is the incidence of Down’s syndrome among a certain group of people?
Is the use of Oral Contraceptives associated with an increased risk of breast cancer?
Research Process?
Planning Design Data collection Analysis
Data entry Data cleaning Data management Data analysis
Reporting
Statistics is used in ….
What is statistics?
Scientific methods for:
Collecting Organizing Summarizing Presenting Interpreting
data
Definition of some basic terms

Population: The largest collection of entities for which we have interest at a particular time 

Sample: A part of a population 

Simple random sample: a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected
Definition of some basic terms

Variable: A characteristic of the subjects under observation that takes on different values for different cases, for example: age, gender, diastolic blood pressure

Quantitative variables: Are variables that can convey information regarding amount 

Qualitative variables: Are variables in which measurements consist of categorization 
Types of variables

Categorical variables 

Continuous variables 
Categorical variables

Nominal: unordered data 


Death 


Gender 


Country of birth 


Ordinal: Predetermined order among response classification 


Education 


Satisfaction 
Continuous variables
Continuous: Not restricted to integers

Age 

Weight 

Cholesterol 

Blood pressure 
Steps involved (data)
Data collection
Database structure
Data entry
Data cleaning
Data management
Data analyses
Data collection
Data collection:
Collection of information that will be used to answer the research question
Could be done through questionnaires, interviews, data abstraction, etc.
Data collection
Database structure
Database structure:
Structure the database (using SPSS) into which the data will be entered
Data entry
Data entry:
Entering the information (data) into the computer
Most of the time, done manually
Single data entry Double data entry
Data cleaning
Data cleaning:
Identify any data entry mistakes
Correct such mistakes
Data management
Data management:
Create new variables based on different criteria
Such as:
BMI
Recoding
Categorizing age (less than 50 years, and 50 years and above)
Etc.
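A minimal sketch of two such derived variables. The patient records and field names are hypothetical; BMI is taken as weight in kilograms divided by height in metres squared, and the age cutoff is the 50-year one mentioned above.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def age_group(age_years: float) -> str:
    """Two-level age category using the 50-year cutoff."""
    return "less than 50" if age_years < 50 else "50 and above"


# Hypothetical records, for illustration only.
patients = [
    {"age": 43, "weight_kg": 70.0, "height_m": 1.75},
    {"age": 50, "weight_kg": 82.0, "height_m": 1.68},
]

# Data management step: add the new variables to each record.
for p in patients:
    p["bmi"] = round(bmi(p["weight_kg"], p["height_m"]), 1)
    p["age_group"] = age_group(p["age"])
```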
Data analyses
Data analyses:
Descriptive statistics: are the techniques used to describe the main features of a sample
Inferential statistics: is the process of using the sample statistic to make informed guesses about the value of a population parameter
Data analyses
Data analyses:
Univariate analyses
Bivariate analyses
Multivariate analyses
Bottom line
There are different statistical methods for different types of variables
Descriptive statistics: categorical variables
Frequency distribution
Graphical representation
Descriptive statistics: categorical variables
Frequency distribution
A frequency distribution lists, for each value (or small range of values) of a variable, the number or proportion of times that observation occurs in the study population
Descriptive statistics: categorical variables
Frequency distribution:
How to describe a categorical variable (marital status)?
Descriptive statistics: categorical variables
Construct a frequency distribution
Title
Values
Frequency
Relative frequency (percent)
Valid relative frequency (valid percent)
Cumulative relative frequency (cumulative percent)
Descriptive statistics: categorical variables
Marital status of the 291 patients admitted to the Emergency Department
                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Married         266      91.4            94.7                 94.7
          Single           13       4.5             4.6                 99.3
          Widow             2        .7              .7                100.0
          Total           281      96.6           100.0
Missing   System           10       3.4
Total                     291     100.0
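The table above can be reproduced with a short sketch. The raw column is rebuilt here from the counts on the slide (with None standing for a missing value); the function name and the dictionary keys are illustrative choices, not part of the original output.

```python
from collections import Counter

# Rebuild the raw column from the slide's counts; None marks a missing value.
marital_status = ["Married"] * 266 + ["Single"] * 13 + ["Widow"] * 2 + [None] * 10


def frequency_table(values):
    """Frequency, percent, valid percent and cumulative percent per category."""
    n_total = len(values)
    valid = [v for v in values if v is not None]
    n_valid = len(valid)
    rows, cumulative_count = [], 0
    for value, count in Counter(valid).most_common():
        cumulative_count += count
        rows.append({
            "value": value,
            "frequency": count,
            "percent": round(100 * count / n_total, 1),        # out of all cases
            "valid_percent": round(100 * count / n_valid, 1),  # missing excluded
            "cumulative_percent": round(100 * cumulative_count / n_valid, 1),
        })
    return rows


rows = frequency_table(marital_status)
for row in rows:
    print(row)
```

Percent uses all 291 patients as the denominator, while valid percent and cumulative percent use only the 281 patients with a recorded marital status.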
Example
Example: summarizing data
Descriptive statistics: categorical variables
Graphical representation
A graph lists, for each value (or small range of values) of a variable, the number or proportion of times that observation occurs in the study population
Descriptive statistics: categorical variables
Graphical representation:
Two types
Bar chart Pie chart
Descriptive statistics: categorical variables
Construct a bar or pie chart
Title Values Frequency or relative frequency Properly labelled axes
Descriptive statistics: categorical variables
Descriptive statistics: categorical variables
Descriptive statistics: continuous variables
Central tendency
Dispersion
Graphical representation
Descriptive statistics: continuous variables
How to describe a continuous variable (Systolic blood pressure)?
Central tendency:
Mean Median Mode
Descriptive statistics: continuous variables
Mean:
Add up the data, then divide by the sample size (n)
The sample size n is the number of observations (pieces of data)
Example: n = 5 systolic blood pressures (mmHg)
X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95
$\bar{X} = \frac{120 + 80 + 90 + 110 + 95}{5} = 99$ mmHg
Descriptive statistics: continuous variables
Formula
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Summation sign: the summation sign (∑) is just a mathematical shorthand for “add up all of the observations”
Descriptive statistics: continuous variables
Also called sample average or arithmetic mean, written $\bar{X}$
Sensitive to extreme values: one data point could make a great change in the sample mean
Uniqueness
Simplicity
Descriptive statistics: continuous variables
Median: is the middle number, or the number that cuts the data in half
80, 90, 95, 110, 120: median = 95
The sample median is not sensitive to extreme values
For example: if 120 became 200, the median would remain the same, but the mean would change to 115.
Descriptive statistics: continuous variables
If the sample size is an even number
80, 90, 95, 110, 120, 125
Median = (95 + 110) / 2 = 102.5 mmHg
Descriptive statistics: continuous variables
Median: Formula
n = odd: Median = middle value, the ((n+1)/2)th ordered observation
n = even: Median = mean of the middle 2 values, the (n/2)th and ((n+2)/2)th ordered observations
Properties:
Uniqueness
Simplicity
Not affected by extreme values
Descriptive statistics: continuous variables
Mode: Most frequently occurring number
80, 90, 95, 95, 120, 125
Mode = 95
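The three measures of central tendency can be checked with Python's standard `statistics` module. This is only a sketch on the small blood pressure examples from the slides; the variable names are illustrative.

```python
import statistics

systolic_odd = [120, 80, 90, 110, 95]        # n = 5 example
systolic_even = [80, 90, 95, 110, 120, 125]  # n = 6 example
with_repeat = [80, 90, 95, 95, 120, 125]     # mode example

mean_bp = statistics.mean(systolic_odd)
median_odd = statistics.median(systolic_odd)    # middle value
median_even = statistics.median(systolic_even)  # mean of the two middle values
mode_bp = statistics.mode(with_repeat)          # most frequent value

# Replace 120 by 200 to see which measure reacts to an extreme value.
with_outlier = [200, 80, 90, 110, 95]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(mean_bp, median_odd, median_even, mode_bp)
print(mean_outlier, median_outlier)
```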
Descriptive statistics: continuous variables
Example:
Statistics: Systolic blood pressure
N (Valid)      286
N (Missing)      5
Mean        144.13
Median      144.50
Mode           155
Descriptive statistics: continuous variables
Central tendency measures do not tell the whole story
Example:
Sample 1: 21, 22, 23, 23, 23, 24, 24, 25, 28
Mean = 213/9 = 23.7, Median = 23
Sample 2: 15, 18, 21, 21, 23, 25, 25, 32, 33
Mean = 213/9 = 23.7, Median = 23
Descriptive statistics: continuous variables
How to describe a continuous variable (Systolic blood pressure) in addition to central tendency?
Measures of dispersion:
Range Variance Standard Deviation
Descriptive statistics: continuous variables
Range
Range = Maximum – Minimum
Example: X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95
Range = 120 – 80 = 40
Descriptive statistics: continuous variables
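Sample variance (s^2 or var; σ^2 denotes the corresponding population variance)
The sample variance is the average of the squared deviations about the sample mean, with n − 1 in the denominator
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Sample standard deviation (s or SD; σ denotes the corresponding population value)
It is the square root of the variance
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$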
Descriptive statistics: continuous variables
Example: n = 5 systolic blood pressures (mm Hg)
X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95
Recall, from earlier: average = 99 mm Hg
$$\sum_{i=1}^{5} (X_i - \bar{X})^2 = (120 - 99)^2 + (80 - 99)^2 + (90 - 99)^2 + (110 - 99)^2 + (95 - 99)^2 = 1020$$
Descriptive statistics: continuous variables
Sample Variance
$$s^2 = \frac{1020}{5 - 1} = 255$$
Sample standard deviation (SD)
$$s = \sqrt{s^2} = \sqrt{255} = 15.97 \text{ mm Hg}$$
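The same arithmetic can be followed step by step in a short sketch, using the five systolic blood pressures from the example. The variable names are illustrative.

```python
import statistics

systolic = [120, 80, 90, 110, 95]  # mm Hg

value_range = max(systolic) - min(systolic)

mean_bp = statistics.mean(systolic)
# Sum of squared deviations about the sample mean.
sum_sq = sum((x - mean_bp) ** 2 for x in systolic)
# Sample variance divides by n - 1, not n.
variance = sum_sq / (len(systolic) - 1)
sd = variance ** 0.5

print(value_range, sum_sq, variance, round(sd, 2))
```

The standard library gives the same sample values through `statistics.variance` and `statistics.stdev`, which also use n − 1 in the denominator.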
Descriptive statistics: continuous variables

The bigger s, the more variability 


s measures the spread about the mean 


s can equal 0 only if there is no spread 


All n observations have the same value 


The units of s are the same as the units of the data (for example, mm Hg)
Descriptive statistics: continuous variables
Example:
Statistics: Systolic blood pressure
N (Valid)             286
N (Missing)             5
Mean               144.13
Median             144.50
Mode                  155
Std. Deviation     35.312
Variance         1246.916
Range                 202
Minimum                55
Maximum               257
Example: summarizing data
Descriptive statistics: continuous variables
Graphical representation:
Different types
Histogram
Descriptive statistics: continuous variables
Construct a chart
Title Values Frequency or relative frequency Properly labelled axes
Descriptive statistics: continuous variables
Shapes of the Distribution
Three common shapes of frequency distributions:
Symmetrical and bell shaped
Positively skewed or skewed to the right
Negatively skewed or skewed to the left
Shapes of Distributions
Symmetric (Right and left sides are mirror images)
Left tail looks like right tail
Mean = Median = Mode
Shapes of Distributions
Left skewed (negatively skewed)
Long left tail
Mean < Median
Shapes of Distributions
Right skewed (positively skewed)
Long right tail
Mean > Median
Shapes of the Distribution
Three less common shapes of frequency distributions:
Bimodal
Reverse J-shaped
Uniform
Probability
Probability
Definition:
The likelihood that a given event will occur
It ranges between 0 and 1:
0 means the event is impossible
1 means the event is certain to occur
How do we calculate it?
Frequentist Approach:
Probability: is the long term relative frequency
Thus, it is an idealization based on imagining what would happen to the relative frequencies in an indefinite long series of trials
Application in medicine

How does probability apply in medicine? 

Probability is the most important theory behind biostatistics 

It is used at different levels 
Descriptive
Example: 4% chance of a patient dying after admission to emergency department (from the previous example)
What do we mean? Out of each 100 patients admitted to the emergency department, 4 will die, whereas 96 will be discharged alive
Example: 1 in 1000 babies are born with a certain abnormality!
Incidence and prevalence
Associations
Example: the association between cigarette smoking and death after admission to the emergency department with an MI
Current cigarette smoking in association with death at discharge (count)
                        Current cigarette smoking
Death at discharge        No     Yes   Total
Death                      5       5      10
Discharged               123     154     277
Total                    128     159     287
Probability of being a smoker
= 159 / 287 = 55.4%
Probability of dying if a smoker
= 5 / 159 = 3.1%
Probability of dying if a nonsmoker
= 5 / 128 = 3.9%
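These probabilities follow directly from the cell counts. The sketch below stores the four cells of the smoking-by-death table in a dictionary; the key names and helper functions are illustrative.

```python
# Cell counts from the table: (smoking status, outcome) -> number of patients.
counts = {
    ("smoker", "death"): 5,
    ("smoker", "discharged"): 154,
    ("nonsmoker", "death"): 5,
    ("nonsmoker", "discharged"): 123,
}


def group_total(group: str) -> int:
    """Number of patients in one smoking group."""
    return counts[(group, "death")] + counts[(group, "discharged")]


def risk_of_death(group: str) -> float:
    """Conditional probability of death given the smoking group."""
    return counts[(group, "death")] / group_total(group)


total = sum(counts.values())
p_smoker = group_total("smoker") / total

print(f"P(smoker)            = {p_smoker:.1%}")
print(f"P(death | smoker)    = {risk_of_death('smoker'):.1%}")
print(f"P(death | nonsmoker) = {risk_of_death('nonsmoker'):.1%}")
```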
Associations
Same is applied to:
Relative risk
Risk difference
Attributable risk
Odds ratio
Etc…
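As a sketch of how some of these measures build on the same probabilities, the code below computes the relative risk, risk difference, and odds ratio from the smoking table, treating smoking as the exposure. Attributable risk is left out because its definition varies between textbooks. The variable names are illustrative.

```python
# Counts from the smoking-by-death table (smoking treated as the exposure).
deaths_smokers, discharged_smokers = 5, 154
deaths_nonsmokers, discharged_nonsmokers = 5, 123

risk_smokers = deaths_smokers / (deaths_smokers + discharged_smokers)
risk_nonsmokers = deaths_nonsmokers / (deaths_nonsmokers + discharged_nonsmokers)

# Relative risk: ratio of the two risks (1 means no association).
relative_risk = risk_smokers / risk_nonsmokers

# Risk difference: difference of the two risks (0 means no association).
risk_difference = risk_smokers - risk_nonsmokers

# Odds ratio: ratio of the odds of death in the two groups (1 means no association).
odds_ratio = (deaths_smokers / discharged_smokers) / (
    deaths_nonsmokers / discharged_nonsmokers
)

print(round(relative_risk, 2), round(risk_difference, 4), round(odds_ratio, 2))
```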
Bottom line
Probability is applied at all levels of statistical analyses
Probability distributions
Probability distributions list or describe probabilities for all possible occurrences of a random variable
There are two types of probability distributions:
Categorical distributions
Continuous distributions
Probability distributions: categorical variables
Categorical variables
Frequency distribution
Other distributions, such as binomial
Probability distributions: continuous variables
Continuous variables
Continuous distribution
Such as Z and t distributions
Normal Distribution
Properties of a Normal Distribution
Also called Gaussian distribution
A continuous, Bell shaped, symmetrical distribution; both tails extend to infinity
The mean, median, and mode are identical
The shape is completely determined by the mean and standard deviation
Normal Distribution
A normal distribution can have any µ and any σ:
e.g.: Age: µ = 40, σ = 10
The area under the curve represents 100% of all the observations
[Figure: normal curve; the mean, median, and mode coincide at the centre]
Normal Distribution
Normal Distribution
Age distribution for a specific population
Mean=40
SD=10
Normal Distribution
Age distribution for a specific population
Age = 25
Mean=40
SD=10
Normal distribution
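The area below a certain point in a normal distribution is calculated by integrating its probability density function up to that point
The probability density function of the normal distribution with mean µ and variance σ^2:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$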
Normal distribution
Thus, for any normal distribution, once we have the mean and sd, we can calculate the percentage of subjects:
Above a certain level Below a certain level Between different levels
But the problem is:
Calculation is very complicated and time consuming, so:
Standardized Normal Distribution
We standardize to a normal distribution
What does this mean?
For a specific distribution, we calculate all possible probabilities, and record them in a table
A normal distribution with a µ = 0, σ = 1 is called a Standardized Normal Distribution
Standardized Normal Distribution
Mean=0
SD=1
Area under the Normal Curve from 0 to X
X    0.00     0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09
0.0  0.00000  0.00399  0.00798  0.01197  0.01595  0.01994  0.02392  0.02790  0.03188  0.03586
0.1  0.03983  0.04380  0.04776  0.05172  0.05567  0.05962  0.06356  0.06749  0.07142  0.07535
0.2  0.07926  0.08317  0.08706  0.09095  0.09483  0.09871  0.10257  0.10642  0.11026  0.11409
0.3  0.11791  0.12172  0.12552  0.12930  0.13307  0.13683  0.14058  0.14431  0.14803  0.15173
0.4  0.15542  0.15910  0.16276  0.16640  0.17003  0.17364  0.17724  0.18082  0.18439  0.18793
0.5  0.19146  0.19497  0.19847  0.20194  0.20540  0.20884  0.21226  0.21566  0.21904  0.22240
0.6  0.22575  0.22907  0.23237  0.23565  0.23891  0.24215  0.24537  0.24857  0.25175  0.25490
0.7  0.25804  0.26115  0.26424  0.26730  0.27035  0.27337  0.27637  0.27935  0.28230  0.28524
0.8  0.28814  0.29103  0.29389  0.29673  0.29955  0.30234  0.30511  0.30785  0.31057  0.31327
0.9  0.31594  0.31859  0.32121  0.32381  0.32639  0.32894  0.33147  0.33398  0.33646  0.33891
1.0  0.34134  0.34375  0.34614  0.34849  0.35083  0.35314  0.35543  0.35769  0.35993  0.36214
1.1  0.36433  0.36650  0.36864  0.37076  0.37286  0.37493  0.37698  0.37900  0.38100  0.38298
1.2  0.38493  0.38686  0.38877  0.39065  0.39251  0.39435  0.39617  0.39796  0.39973  0.40147
1.3  0.40320  0.40490  0.40658  0.40824  0.40988  0.41149  0.41308  0.41466  0.41621  0.41774
1.4  0.41924  0.42073  0.42220  0.42364  0.42507  0.42647  0.42785  0.42922  0.43056  0.43189
1.5  0.43319  0.43448  0.43574  0.43699  0.43822  0.43943  0.44062  0.44179  0.44295  0.44408
1.6  0.44520  0.44630  0.44738  0.44845  0.44950  0.45053  0.45154  0.45254  0.45352  0.45449
1.7  0.45543  0.45637  0.45728  0.45818  0.45907  0.45994  0.46080  0.46164  0.46246  0.46327
1.8  0.46407  0.46485  0.46562  0.46638  0.46712  0.46784  0.46856  0.46926  0.46995  0.47062
1.9  0.47128  0.47193  0.47257  0.47320  0.47381  0.47441  0.47500  0.47558  0.47615  0.47670
2.0  0.47725  0.47778  0.47831  0.47882  0.47932  0.47982  0.48030  0.48077  0.48124  0.48169
2.1  0.48214  0.48257  0.48300  0.48341  0.48382  0.48422  0.48461  0.48500  0.48537  0.48574
2.2  0.48610  0.48645  0.48679  0.48713  0.48745  0.48778  0.48809  0.48840  0.48870  0.48899
2.3  0.48928  0.48956  0.48983  0.49010  0.49036  0.49061  0.49086  0.49111  0.49134  0.49158
2.4  0.49180  0.49202  0.49224  0.49245  0.49266  0.49286  0.49305  0.49324  0.49343  0.49361
2.5  0.49379  0.49396  0.49413  0.49430  0.49446  0.49461  0.49477  0.49492  0.49506  0.49520
2.6  0.49534  0.49547  0.49560  0.49573  0.49585  0.49598  0.49609  0.49621  0.49632  0.49643
2.7  0.49653  0.49664  0.49674  0.49683  0.49693  0.49702  0.49711  0.49720  0.49728  0.49736
2.8  0.49744  0.49752  0.49760  0.49767  0.49774  0.49781  0.49788  0.49795  0.49801  0.49807
2.9  0.49813  0.49819  0.49825  0.49831  0.49836  0.49841  0.49846  0.49851  0.49856  0.49861
3.0  0.49865  0.49869  0.49874  0.49878  0.49882  0.49886  0.49889  0.49893  0.49896  0.49900
3.1  0.49903  0.49906  0.49910  0.49913  0.49916  0.49918  0.49921  0.49924  0.49926  0.49929
3.2  0.49931  0.49934  0.49936  0.49938  0.49940  0.49942  0.49944  0.49946  0.49948  0.49950
3.3  0.49952  0.49953  0.49955  0.49957  0.49958  0.49960  0.49961  0.49962  0.49964  0.49965
3.4  0.49966  0.49968  0.49969  0.49970  0.49971  0.49972  0.49973  0.49974  0.49975  0.49976
3.5  0.49977  0.49978  0.49978  0.49979  0.49980  0.49981  0.49981  0.49982  0.49983  0.49983
3.6  0.49984  0.49985  0.49985  0.49986  0.49986  0.49987  0.49987  0.49988  0.49988  0.49989
3.7  0.49989  0.49990  0.49990  0.49990  0.49991  0.49991  0.49992  0.49992  0.49992  0.49992
3.8  0.49993  0.49993  0.49993  0.49994  0.49994  0.49994  0.49994  0.49995  0.49995  0.49995
3.9  0.49995  0.49995  0.49996  0.49996  0.49996  0.49996  0.49996  0.49996  0.49997  0.49997
4.0  0.49997  0.49997  0.49997  0.49997  0.49997  0.49997  0.49998  0.49998  0.49998  0.49998
Standardized Normal Distribution
Normal Distribution
Standardized Normal Distribution (Z)
Mean = µ, SD = σ
TRANSFORM
Mean = 0, SD = 1
$Z = \frac{x - \mu}{\sigma}$
Standardized Normal Distribution
Normal Distribution
Standardized Normal Distribution (Z)
Mean = 40, SD = 10
TRANSFORM
Mean = 0, SD = 1
$Z_{(40)} = \frac{x - \mu}{\sigma} = \frac{40 - 40}{10} = 0$
Standardized Normal Distribution
Normal Distribution
Standardized Normal Distribution (Z)
Mean = 40, SD = 10
TRANSFORM
Mean = 0, SD = 1
$Z_{(30)} = \frac{x - \mu}{\sigma} = \frac{30 - 40}{10} = -1$
Standardized Normal Distribution: summary
For any normal distribution, we can:
Transform the values to the standardized normal distribution (Z)
Use the Z table to get the following areas:
Above a certain level
Below a certain level
Between different levels
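In practice a computer replaces the table lookup. The sketch below uses `statistics.NormalDist` from the Python standard library on the age example (mean 40, SD 10); the helper `z_score` and the chosen cut points are illustrative.

```python
from statistics import NormalDist

age = NormalDist(mu=40, sigma=10)  # the age distribution from the example
standard = NormalDist()            # standardized normal: mean 0, SD 1


def z_score(x: float, mu: float, sigma: float) -> float:
    """Transform a value to the standardized normal scale."""
    return (x - mu) / sigma


z_25 = z_score(25, 40, 10)

# Area below a certain level: proportion younger than 25.
below_25 = standard.cdf(z_25)

# Area above a certain level: proportion older than 50.
above_50 = 1 - standard.cdf(z_score(50, 40, 10))

# Area between two levels: proportion aged between 30 and 50.
between_30_50 = standard.cdf(z_score(50, 40, 10)) - standard.cdf(z_score(30, 40, 10))

print(z_25, round(below_25, 4), round(above_50, 4), round(between_30_50, 4))
```

The same area below 25 can be read from the table of areas from 0 to X: for Z = 1.5 the table gives 0.43319, and 0.5 − 0.43319 = 0.06681. Calling `age.cdf(25)` directly gives the same result without an explicit transformation.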
Normal Distribution
Age distribution for a specific population
68% of ages lie between 30 (Mean – 1SD) and 50 (Mean + 1SD)
Normal Distribution
Age distribution for a specific population
95% of ages lie between 20 (Mean – 2SD) and 60 (Mean + 2SD)
Normal Distribution
Age distribution for a specific population
99.7% of ages lie between 10 (Mean – 3SD) and 70 (Mean + 3SD)
Practical example
Practical example
The 68-95-99.7 Rule for the Normal Distribution
68% of the observations fall within one standard deviation of the mean
95% of the observations fall within two standard deviations of the mean
99.7% of the observations fall within three standard deviations of the mean
When applied to ‘real data’, these estimates are considered approximate!
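The rule can be checked against the exact normal areas with a small sketch, assuming only the standard library's `statistics.NormalDist`. The dictionary name is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standardized normal: mean 0, SD 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in coverage.items():
    print(f"within {k} SD: {area:.1%}")
```

The exact area within two standard deviations is slightly above 95%; the round figure of 95% corresponds more precisely to about 1.96 standard deviations.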
Distributions of Blood Pressure
The 68-95-99.7 rule applied to the distribution of systolic blood pressure in men.
Data analyses
Data analyses:
Descriptive statistics: are the techniques used to describe the main features of a sample
Inferential statistics: is the process of using the sample statistic to make informed guesses about the value of a population parameter
Why do we carry out research?
Inference: Drawing conclusions on certain questions about a population from sample data
Inferential statistics
Since we are not taking the whole population, we have to draw conclusions on the population based on results we get from the sample
Simple example: Say we want to estimate the average systolic blood pressure for patients admitted to the emergency department after having an MI
Other more complicated measures might be quality of life, satisfaction with care, risk of outcome, etc.
Inferential statistics
What do we do?
Take a sample (n=291) of patients admitted to emergency department in a certain hospital
Calculate the mean and SD (descriptive statistics) of systolic blood pressure
Statistics: Systolic blood pressure
N (Valid)             286
N (Missing)             5
Mean               144.13
Std. Deviation     35.312
Inferential statistics
The next step is to make a link between the estimates we observed from the sample and those of the underlying population (inferential statistics)
What can we say about these estimates as compared to the unknown true ones???
In other words, we are trying to estimate the average systolic blood pressure for ALL patients admitted to the emergency department after an MI
Inferential statistics
Sample data
Mean=144
SD=35
Inference
In statistical inference we usually encounter TWO issues
Estimate value of the population parameter. This is done through point estimate and interval estimate (Confidence Interval)
Evaluate a hypothesis about a population parameter rather than simply estimating it. This is done through tests of significance known as hypothesis testing (P-value)
1 Confidence Interval
Confidence Intervals
A point estimate:
A single numerical value used to estimate a population parameter.
Interval estimate:
Consists of 2 numerical values defining a range of values that with a specified degree of confidence includes the parameter being estimated. (Usually interval estimate with a degree of 95% confidence is used)
Example
What is the average systolic blood pressure for patients admitted to emergency departments after an MI?
Select a sample
Point estimate = mean = 144
Interval estimate = 95% CI = mean ± 1.96 × SE = (140 – 148)
95% Confidence Interval:
 Upper limit = $\bar{x} + z_{(1-\alpha/2)} \, SE$
 Lower limit = $\bar{x} - z_{(1-\alpha/2)} \, SE$
Sampling distribution of mean
[Figure: sampling distribution of the mean; 95% of sample means fall between µ – 2SE and µ + 2SE]
Standard error
Standard error (SE) = SD / √n

As sample size increases the standard error decreases 

The estimation as measured by the confidence interval will be better, i.e. the confidence interval will be narrower
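A minimal sketch of the calculation for the systolic blood pressure example (n = 286, mean 144.13, SD 35.312), using 1.96 as the multiplier for a 95% interval. The function name is illustrative.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


n, mean, sd = 286, 144.13, 35.312  # values from the systolic blood pressure example

se = standard_error(sd, n)
z = 1.96  # multiplier for a 95% confidence interval
lower, upper = mean - z * se, mean + z * se

# Quadrupling the sample size halves the standard error.
se_four_times_n = standard_error(sd, 4 * n)

print(round(se, 3), round(lower, 1), round(upper, 1), round(se_four_times_n, 3))
```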
Interpretation
95% Confidence Interval

We are 95% confident that the true parameter is within the calculated interval 

Thus, if we repeat the sampling procedure 100 times, the calculated interval is expected to be: 

correct about 95 times (the true parameter is within the interval) 

wrong about 5 times (the true parameter is outside the interval) (also called α error) 
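This repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a hypothetical normal population with a known true mean (144) and SD (35), draws 1000 samples of size 100, and counts how often the 95% interval contains the true mean. All constants are illustrative choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical population values, chosen only for illustration.
TRUE_MEAN, TRUE_SD = 144, 35
SAMPLE_SIZE, REPEATS = 100, 1000

hits = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    # Does this sample's 95% interval contain the true mean?
    if mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se:
        hits += 1

coverage = hits / REPEATS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The proportion should come out close to 95%, with some run-to-run variation if the seed is changed.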
Notes on Confidence Intervals
Interpretation
It gives a range of plausible values for the population average systolic blood pressure, together with a stated level of confidence
Are all CIs 95%?
No
It is the most commonly used
A 99% CI is wider
A 90% CI is narrower
Notes on Confidence Intervals
To be “more confident” you need a bigger interval
For a 99% CI, you need ± 2.6 SEM
For a 95% CI, you need ± 2 SEM
For a 90% CI, you need ± 1.65 SEM
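The rounded multipliers above come from the standardized normal distribution. A short sketch, assuming the standard library's `NormalDist.inv_cdf`; the function name `z_multiplier` is illustrative.

```python
from statistics import NormalDist


def z_multiplier(confidence: float) -> float:
    """Number of standard errors needed for a two-sided interval."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} CI: ± {z_multiplier(level):.3f} SEM")
```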
2 P-value
Inference
P-value

Is related to another type of inference 

Hypothesis testing 

Evaluate a hypothesis about a population parameter rather than simply estimating it 
Hypothesis testing
Back to our previous example
We want to make inference about the average systolic blood pressure of patients admitted to emergency department after MI
Assume that the normal systolic blood pressure is 120
The question is whether the average systolic blood pressure for patients admitted to emergency departments is different than the normal, which is 120
Hypothesis testing
Two types of hypotheses:
Null hypothesis: is a statement consistent with “no difference”
Alternative hypothesis: is a statement that disagrees with the null hypothesis, and is consistent with presence of “difference”
The logic of hypothesis testing
To decide which of the hypotheses is true
Take a sample from the population
If the data are consistent with the null hypothesis, then we do not reject the null hypothesis (conclusion = “no difference”)
If the sample data are not consistent with the null hypothesis, then we reject the null (conclusion = “difference”)
Hypothesis testing

Example: is the systolic blood pressure for patients admitted to the emergency department after an MI normal (i.e. = 120)? 

 Ho: µ = 120 

 Ha: µ ≠ 120 


How do we answer this question? 

We take a sample and find that the mean is 144 mmHg 

Can we consider that 144 mmHg is consistent with the normal value (120 mmHg)?
Hypothesis testing
Ho: µ = 120
It looks like it is consistent with the null hypothesis
Is it still consistent with the null hypothesis?
Hypothesis testing
[Figure: sampling distribution of the mean under Ho: µ = 120; 95% of sample means fall between µ – 2SE and µ + 2SE]
Test statistic
It is the statistic used for deciding whether the null hypothesis should be rejected or not
Used to calculate the probability of getting the observed results if the null hypothesis is true.
This probability is called the p-value.
How to decide
We calculate the probability of obtaining, due to chance alone, a sample with a mean of 144 or one more extreme if the true mean is 120 (p-value)
Based on the p-value we make our decision:
If the p-value is low, this is taken as evidence that the null hypothesis is unlikely to be true, so we reject the null hypothesis (and accept the alternative one)
If the p-value is high, the data are consistent with the null hypothesis, and thus we do not reject Ho
Problem!
We could be making the wrong decisions
Decision            Ho True            Ho False
Do not reject Ho    Correct decision   Type II error
Reject Ho           Type I error       Correct decision
Type I error: is rejecting the null hypothesis when it is true
Type II error: is not rejecting the null hypothesis when it is false
Error
Type I error:
Referred to as α
Probability of rejecting a true null hypothesis
Type II error:
Referred to as β
Probability of not rejecting a false null hypothesis
Power:
Represented by 1 − β
Probability of correctly rejecting a false null hypothesis
Significance level
The significance level, α, of a hypothesis test is defined as the probability of making a type I error, that is the probability of rejecting a true null hypothesis
It could be set to any value, as:
0.05
0.01
0.1
Statistical significance
If the p-value is less than some predetermined cutoff (e.g. 0.05), the result is called “statistically significant”
This cutoff is the α-level
The α-level is the probability of a type I error
It is the probability of falsely rejecting H0
Back to the example
To test whether the average systolic blood pressure for patients admitted to the emergency department after an MI is different than 120 (which is the normal blood pressure)
We carry out a test called the “one-sample t-test”, which provides a p-value based on which we reject or do not reject the null hypothesis.
Back to the example
One-Sample Statistics
                            N     Mean   Std. Deviation   Std. Error Mean
Systolic blood pressure   286   144.13           35.312             2.088
One-Sample Test (Test Value = 120)
                                                                          95% Confidence Interval of the Difference
                               t    df   Sig. (2-tailed)   Mean Difference    Lower    Upper
Systolic blood pressure   11.558   285              .000            24.133    20.02    28.24
Since the p-value is less than 0.05, the conclusion is that the systolic blood pressure for patients admitted to the emergency department after an MI is significantly higher than the normal value, which is 120
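The t statistic in the output can be recomputed from the summary statistics. This is a sketch, not the software's own routine: it uses the rounded mean of 144.13, so the result differs slightly in the last decimals, and it compares against 1.980, the 97.5% value for 120 degrees of freedom in the t table, as a conservative critical value (an assumption made here because 285 degrees of freedom is not tabulated).

```python
import math

# Summary statistics from the one-sample output.
n, sample_mean, sd = 286, 144.13, 35.312
null_mean = 120  # value stated by the null hypothesis

se = sd / math.sqrt(n)
t_stat = (sample_mean - null_mean) / se
df = n - 1

# Conservative critical value (97.5% point for 120 degrees of freedom).
CRITICAL_T = 1.980
reject_null = abs(t_stat) > CRITICAL_T

print(round(t_stat, 2), df, reject_null)
```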
p-values
p-values are probabilities (numbers between 0 and 1)
Small p-values mean that the sample results are unlikely when the null is true
The p-value is the probability of obtaining, by chance alone, a result as extreme as or more extreme than the one observed, assuming the null hypothesis H0 is true
t-distribution
The t-distribution looks like a standard normal curve
A t-distribution is determined by its degrees of freedom (n − 1); the lower the degrees of freedom, the flatter and fatter it is
Normal (0,1)
ν      75%     80%     85%     90%     95%     97.5%   99%     99.5%   99.75%  99.9%   99.95%
1      1.000   1.376   1.963   3.078   6.314   12.71   31.82   63.66   127.3   318.3   636.6
2      0.816   1.061   1.386   1.886   2.920   4.303   6.965   9.925   14.09   22.33   31.60
3      0.765   0.978   1.250   1.638   2.353   3.182   4.541   5.841   7.453   10.21   12.92
4      0.741   0.941   1.190   1.533   2.132   2.776   3.747   4.604   5.598   7.173   8.610
5      0.727   0.920   1.156   1.476   2.015   2.571   3.365   4.032   4.773   5.893   6.869
6      0.718   0.906   1.134   1.440   1.943   2.447   3.143   3.707   4.317   5.208   5.959
7      0.711   0.896   1.119   1.415   1.895   2.365   2.998   3.499   4.029   4.785   5.408
8      0.706   0.889   1.108   1.397   1.860   2.306   2.896   3.355   3.833   4.501   5.041
9      0.703   0.883   1.100   1.383   1.833   2.262   2.821   3.250   3.690   4.297   4.781
10     0.700   0.879   1.093   1.372   1.812   2.228   2.764   3.169   3.581   4.144   4.587
11     0.697   0.876   1.088   1.363   1.796   2.201   2.718   3.106   3.497   4.025   4.437
12     0.695   0.873   1.083   1.356   1.782   2.179   2.681   3.055   3.428   3.930   4.318
13     0.694   0.870   1.079   1.350   1.771   2.160   2.650   3.012   3.372   3.852   4.221
14     0.692   0.868   1.076   1.345   1.761   2.145   2.624   2.977   3.326   3.787   4.140
15     0.691   0.866   1.074   1.341   1.753   2.131   2.602   2.947   3.286   3.733   4.073
16     0.690   0.865   1.071   1.337   1.746   2.120   2.583   2.921   3.252   3.686   4.015
17     0.689   0.863   1.069   1.333   1.740   2.110   2.567   2.898   3.222   3.646   3.965
18     0.688   0.862   1.067   1.330   1.734   2.101   2.552   2.878   3.197   3.610   3.922
19     0.688   0.861   1.066   1.328   1.729   2.093   2.539   2.861   3.174   3.579   3.883
20     0.687   0.860   1.064   1.325   1.725   2.086   2.528   2.845   3.153   3.552   3.850
100    0.677   0.845   1.042   1.290   1.660   1.984   2.364   2.626   2.871   3.174   3.390
120    0.677   0.845   1.041   1.289   1.658   1.980   2.358   2.617   2.860   3.160   3.373
∞      0.674   0.842   1.036   1.282   1.645   1.960   2.326   2.576   2.807   3.090   3.291
Hypothesis Testing
Different types of hypothesis:
Mean (a) = Mean (b)
Proportion (a) = Proportion (b)
Variance (a) = Variance (b)
OR = 1
RR = 1
RD = 0
Test of homogeneity
Etc
Example
Comparing two means: paired testing
In the previous example, is the heart rate at admission different than the heart rate at discharge among the patients admitted to the emergency department after an MI?
Statistics
Is this decrease in heart rate statistically significant?
Thus, we have to make inference….
Comparing two means: paired testing
What type of test to be used?
Since the measurements of the heart rate at admission and at discharge are dependent on each other (not independent), another type of test is used
Paired t-test
Comparing two means: paired testing
Paired Samples Statistics
                                    Mean    N   Std. Deviation   Std. Error Mean
Pair 1   Heart Rate at admission   81.16   75           23.546             2.719
         Heart Rate at discharge   76.72   75           17.973             2.075
Paired Samples Test
Paired Differences (Pair 1: Heart Rate at admission − Heart Rate at discharge)
                                             95% Confidence Interval of the Difference
 Mean   Std. Deviation   Std. Error Mean      Lower     Upper       t   df   Sig. (2-tailed)
4.440           25.302             2.922     −1.381    10.261   1.520   74              .133
95% CI ≈ 4.4 ± 1.96 × 2.9
H0: µb − µa = 0
HA: µb − µa ≠ 0
P-value = 0.133, thus no significant difference
How Are p-values Calculated?
$$t = \frac{\text{sample mean} - 0}{SEM} = \frac{4.4}{2.9} = 1.52$$
The value t = 1.52 is called the test statistic
Then we can look up the t-value in the table to get the p-value, or get it from the computer (0.13)
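The computer's p-value can be approximated with a short standard-library sketch. It recomputes the test statistic from the paired differences (mean 4.440, SD 25.302, 75 pairs) and obtains the two-sided p-value by numerically integrating the t density with Simpson's rule. The function names and the number of integration steps are illustrative choices; statistical packages use more refined routines.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided p-value: 1 minus the area between -|t| and |t|."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t  # the density is symmetric about 0


# Summary of the paired differences from the output.
mean_diff, sd_diff, n_pairs = 4.440, 25.302, 75

sem = sd_diff / math.sqrt(n_pairs)
t_stat = mean_diff / sem
p_value = two_sided_p(t_stat, n_pairs - 1)

print(round(t_stat, 2), round(p_value, 2))
```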
Interpreting the pvalue
The p-value in the example is 0.133
Interpretation: If there is no difference in heart rate between admission to and discharge from an emergency department, then the chance of finding a mean difference as extreme as, or more extreme than, 4.4 in a sample of 75 patients is 0.133
Thus, this probability is big (bigger than 0.05), which leads to saying that a difference of 4.4 could be due to chance
Notes
How to decide on significance from the 95% CI?
3 scenarios
Comparing two means: Independent sample testing
In the previous example, is the systolic blood pressure different between males and females among the patients admitted to the emergency department after an MI?
Group Statistics
                          Sex        N     Mean   Std. Deviation   Std. Error Mean
Systolic blood pressure   Male     240   145.05           35.162             2.270
                          Female    44   138.64