Sei sulla pagina 1di 60

# EPIDEMIOLOGY AND

BIOSTATISTICS
REVIEW, PART I
Tommy Byrd MSII

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

Nominal
Ordinal
Interval
Ratio

## Nominal scale data are divided into

qualitative categories or groups
Male

Female

Black

White

Suburban

Rural

## Ordinal scale data has an order

Class rankings data (1st / 2nd / 3rd)
Answers to these types of questions:

## **But it does not describe the size of the interval (eg. it

cannot tell by how many percentage points Tommy is
ranked 1st in his class)

interval

## Celsius (and Fahrenheit)

temperatures
Anno Domini years (1990,

## 1991, 1992, etc.)

**But ratios of this kind of data are not meaningful
100C is not twice as hot as 50C because 0C does not

## Ratio scale data has order, a set interval,

and is based on an absolute zero
Kelvin temperatures
MOST BIOMEDICAL VARIABLES
Weight (grams, pounds)
Time (seconds, days zero is the starting point of measurement)
Age (years)
Blood pressure (mmHg)
Pulse (beats per minute)
With these types of data ratios are valid:
300K is twice as hot as 150K
A pulse rate of 120 beats/min is twice as fast as a pulse rate of 60
beats/min

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

## Many naturally occurring phenomena are

distributed in the bell-shaped normal or
Gaussian distribution

Score
(Blood pressure, cholesterol, etc.)

## Skewed distributions are described by the

location of the tail of the curve, not the
location of the hump

Mode
Median
Mean

Score

## Mode is the value that occurs with the

greatest frequency
2 4 5 7 4 2 3 6 8 9 7 5 4 4 2 4 6 7 7 7

Bimodal
distribution!
2

## Median is the value that divides the

distribution in half
Odd # total elements: the median is the middle one
Even # total elements: the median is the average of the

## Mean (the average) is the sum of all

values divided by the total # of values

## Unlike median and mode, it is very sensitive to extreme

scores
Therefore NOT good for measuring skewed distributions

## tend to have very similar means

Therefore the mean is the measure of central tendency that BEST

## Match the mean, median, and mode each

with its corresponding hash mark
The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

## Glaser, Anthony N. High-yield Biostatistics, Epidemiology,

& Public Health. N.p.: n.p., n.d. 9. Print.

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

## Normal distributions with identical

measures of central tendency can have
different variabilities
Variability = the extent to which their scores are clustered

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it
again.

## Standard deviation () measures how far

away, on average, that values lay away
from the mean of the population
Remember the last infectious disease quiz?
Lets assume the mean (average) grade was a 70% with a normal
distribution
If the was really HIGH, there was probably a bunch of As and a bunch

So, since
gun hard,
how
we use standard deviation
of Fs we
to Bs and
Cscan
and Ds
Ifexactly
the washow
reallywe
LOW,
people probably to
goteverybody
a high D or low
C
to tell
didmost
in comparison
else?

## By MEMORIZING these numbers!

Approx. 68% of the distribution falls within 1 standard deviations
Approx. 95% of the distribution falls within 2 standard deviations
Approx. 99.7% of the distribution falls within 3 standard deviations
So, out of a
class of 100,
many people
got an A?
(assume
extra credit
was possible)

A) 9-11
B) 2-3
C) 14-16
D) 4-6
E) 19-21
Therefore,
assuming the
of the test
scores was 10
points, we can
assume the
following:

## The z score is simply how many standard

deviations the element lies above or
below the mean
A table of z scores
compares the z score to
the Area beyond Z

65
z = 0.5

85
z = + 1.5

## The z score is simply how many standard

deviations the element lies above or
below the mean
A table of z scores
compares the z score to
the Area beyond Z
6.7% got beyond an 85%
on our startlingly realistic,

~7
people
here

## Therefore the z score can be used to

specify probability
We know that 6.7% of the class
has a grade above 85%, so the
probability of one randomly
selected person from this
85% is 6.7%, or 0.067

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

## What if we dont know every single

persons score on the test?
But, through some stealthy looking-over-shoulders while

## people check their online test scores, we can get a

sample of random scores
How close to the actual class average will our sample be?

One sample
representing
one score

## The # of times that

the average of a
sample of 4 scores
is ~80%

n = the size of
each sample
0%

70%

100%

## The standard error of the mean (SEM)

is the standard deviation over the square
root of the sample size

SEM = /n
SEM = 10/1 = 10

standard
deviation () of
this test was 10
percentage
points

SEM = 10/4 = 5

0%

70%

100%

## Standard error (SEM) can be used in the

same way as standard deviation
But remember that SEM decreases as n
Now we have gathered a sample of 10 random scores

## from our classmates, so:

SEM = /n
SEM = 10/10 = 3.2
**Do you remember how much of the population falls within 2 standard
deviations (or SEMs) of the mean?

## 95% confidence limits are

approximately equal to the sample mean
plus or minus 2 standard errors
Practically, the 95%
confidence interval is
the range in which the
means 95% of
samples would be
expected to fall
In other words,
there is a 95%
chance that the
average of our
random sample
would be in this
range

## 95% confidence limits are

approximately equal to the sample mean
plus or minus 2 standard errors
Remember, the on our test was 10%, and the mean was

## a 70%. We are randomly sampling 10 scores (n=10)

So the standard error (SEM) = /n = 10/10 = 3.2%

## falling within 2 SEMs of the average

So our 95% confidence interval is 70% 2(SEM)
= 70% 2(3.2%) = 70% 6.4%
= 63.6% - 76.4%
A random sample of 10 peoples scores on this test has
a 95% chance of averaging between 63.6% and 76.4%
The width of the confidence interval reflects precision

## How would we double the precision of an

estimate?
Double the sample size?
We need to quadruple the sample

size!
SEM = /n

## If we do not know the of our population,

can we still calculate SEM?
Pretend we dont have any fancy ExamSoft statistics from

## our test, only our sample of 10 scores

We can calculate the standard deviation of the 10 scores
in our sample (S), and substitute it in for in the SEM
equation to come up with the estimated standard error of
the mean

## The t score is to the z score as the

estimated standard error is to the
Similar to P values !
For USMLE
purposes,
consider
degrees of
freedom
(df) to
equal n-1

So what do
we do with
all this?

## t = the number of estimated standard

errors away from the sample mean

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

## There are 7 steps in hypothesis testing

1) State the null and alternative hypothesis, H0 and HA
H0 = no difference
HA = there is a difference
2) Select the decision criterion (level of significance)
3) Establish the critical values of t
4) Draw a random sample, find its mean
5) Calculate the standard deviation of the sample (S) and

## find the estimated standard error of the sample

6) Calculate the value of the test statistic t that
corresponds to the mean of the sample (tcalc)
7) Compare the calculated value of t with the critical
values of t, then accept or reject the null hypothesis

## Step 1: State the null and alternative

hypotheses
We want to test Julia Silvas claim: Because of Tommy

## and Danielles amazing biostats presentation, the average

Step 1 score of our class will be 260
Null hypothesis = The mean score is 260
Alternative hypothesis = The mean score is not 260

## rather take a random representative sample so we can

save time
Again, our sample size will be 10 randomly selected students

## Step 2: Select the decision criteria

Random sampling error (this is normal) will always cause

## our sample mean to deviate slightly from the true mean

We have to decide what an acceptable level of this chance

deviation is

## is conventionally set at 0.05

If the probability of obtaining the sample mean is greater than 0.05,
H0 is accepted:
The class indeed scored an average of 260

rejected:

= 0.05
Sample size
(n) = 10
students, so
df = 9

So tcrit = 2.262

## Step 4: Draw a random sample and

calculate the mean of the sample
284 234 268 254 246 264 266 265 245 244
Average = 257

## Step 5: Calculate standard deviation and

estimated standard error of the sample
In our sample, standard deviation (S) = 15
(You dont have to know the equation for standard deviation on the
USMLE)

= 15 / 10
= 4.747

## Step 6: Calculate t from the data

Remember, similar to a z-value, the t-score represents the

## # of estimated standard means that the sample mean lays

away from the hypothesized mean
Our average score was 257, which is 3 points away from

## our hypothesized average of 260

Therefore, our t-value is the # of estimated standard errors

contained in 3 points
Our estimated standard error from the last slide is 4.747
This gives a t-score (tcalc) of:
3

/ 4.747 = 0.632

## Step 7: Compare t-values and be very

concerned that Julia Silva is a psychic
Our calculated t-value (same thing as t-score) is 0.632
Our critical t-value is 2.262
Clearly our calculated t lies between +2.2 and 2.2,

therefore:
H0 is accepted and reported as follows: The hypothesis that the

## mean Step 1 score of the medschool class is 260 was accepted,

t = 0.632, df = 9, p 0.5
+2.262

2.262

t=0

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

## Error types indicate that you accepted the

wrong hypothesis
Type I Error
False-positive error
You accept the alternative
hypothesis when there is no
difference
Also known as alpha ()

## error yes, this is

referring to the we just
The p-value is the
probability of making a
type I error

Type II Error
False-negative error
You fail to reject the null
hypothesis when there
actually is a difference
Also known as error
is the probability of

## A study with greater power has less

type II () error
The power of a statistical test = 1
The power represents the probability of rejecting the null

## hypothesis when it is in fact false (vs. accepting it in

error); we want this to happen!
Conventionally, a study is required to have a power of 0.8
(or a of 0.2) to be acceptable
Power increases as increases trade off
High-yield point: Increasing the sample size is the
most practical and important way of increasing the
power of a statistical test

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

## Nonexperimental (descriptive or analytic)

study designs Cohort studies
Group without disease are selected and followed for an

extended period
Some members may have already been exposed to risk
factor
Exception: Inception Cohorts follow those recently
diagnosed to track progression
Can estimate incidence
Not good for rare diseases
Historical cohort study = retrospective cohort study

## Nonexperimental (descriptive or analytic)

study designs Case-control studies
All are retrospective
Compare people who do have the disease (the cases) w/

## otherwise similar people who do not have the disease

Start w/ outcome then LOOK BACK into the past for
possible independent variables that may have caused the
disease
Cheap, good for rare or that take a long time to develop

## Nonexperimental (descriptive or analytic)

study designs Case-series studies
Essentially a series of case reports that may link disease

## to exposure, but NOT controlled, as in case-control (no

group w/o the disease compared to)
Eg. Kaposiss sarcoma

## Nonexperimental (descriptive or analytic)

study designs Prevalence survey
Survey (snap shot) of a whole population, also asks

Prevalence ratio = the prevalence of a disease in people
who have and have not been exposed to a risk factor
Likely to overrepresent chronic diseases and
underrepresent acute diseases

## Nonexperimental (descriptive or analytic)

study designs Ecological studies
Check non-individual info (eg. study of the rate of

## diabetes in countries with different levels of automobile

ownership)
May be experimental:
Community intervention trials
Experimental group consists of an entire community, while the control

## group is an otherwise similar community that is not subject to any kind

of intervention

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf

## Bias occurs from systemic (rather than

random) errors when one outcome is
systematically favored over another

e
What is th
difference
between
bias
n
o
i
t
c
e
l
e
s
ling
and samp
bias?

(Magazine
subscribers in
great
depression)
The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file
again. If the red x still appears, you may have to delete the image and then insert it again.

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and
then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

(Referral bias)

## Bias occurs from systemic (rather than

random) errors when one outcome is
systematically favored over another

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then
open the file again. If the red x still appears, you may have to delete the image and then insert it again.

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then
open the file again. If the red x still appears, you may have to delete the image and then insert it again.

## (Putting all whites in drug

group and blacks in
control group for treating a
racially selective disease)
Race = confounding
variable

## Bias occurs from systemic (rather than

random) errors when one outcome is
systematically favored over another

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then
open the file again. If the red x still appears, you may have to delete the image and then insert it again.

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the
file again. If the red x still appears, you may have to delete the image and then insert it again.

The image cannot be displayed. Your computer may not have enough memory to open the image,
or the image may have been corrupted. Restart your computer, and then open the file again. If the
red x still appears, you may have to delete the image and then insert it again.

http://www.usmle.org/pdfs/step-1/2013midMay2014_Step1.pdf