
By Dr. Aamir S.

Alrubaiee (MBCHB FIBMS CM)


Statistics is divided into:

Descriptive statistics: the field concerned with methods & procedures
for the collection, organization, classification, and summarization
of data, giving only descriptive data.

Analytic statistics: concerned with analysis of the data, drawing
inferences & reaching conclusions.

Biostatistics: statistics applied to medical & biological data.
Advantages of statistics:
1.Carry out your own research.
2.Evaluating published papers.
3.Ethical consideration (e.g. through statistics
you can compare old & new drugs and
choose the most appropriate).
4.Professional & personal satisfaction (to judge
whether the result of a particular work is good or not).
Purposes of statistics:
1. Data reduction: to summarize the results (e.g. when studying the
blood pressure measurements of 1000 persons, statistical methods
let us summarize them in a few simple figures).
2. Dealing with variable subject matter: to decide whether an observed
effect of a variable or factor is real (e.g. if we have 50 males &
50 females & find that the females are more obese than the males,
we cannot rely on this observation alone; statistical methods tell
us whether the result is real or due to chance).
3. Sampling & generalization: e.g. we cannot measure the blood
pressure of all Iraqi people, so we take a sample & use statistical
methods to generalize the results to the rest of the population.
Applications of statistics:
1. Are the differences between groups significant?
i.e. if we take 2 groups, the 1st composed of 250 males & the 2nd of
250 females, all aged 20 years, and we take the mean weight of each
group (i.e. gender) separately, we can decide whether the difference
between the 2 groups is significant or not.
2. Are these two measures related or associated?
i.e. if we have 2 groups, the 1st smoking patients with bronchogenic
carcinoma & the 2nd non-smoking patients with bronchogenic carcinoma,
we can find out whether there is any relation between smoking & this
malignancy.
3. Can one predict the value of one variable (outcome) from
knowledge of the values of other variables?
i.e. if we have a table of the increase in height with age, can we
predict the height from the age?
VARIABLE: a characteristic that takes different values in
different persons, places, times, or occasions, e.g. age, height,
blood urea, weight, etc. Types:
1. Quantitative: can be measured in the usual sense, e.g. age,
height, blood urea, etc. It is subdivided into:
A. Discrete: characterized by gaps or interruptions in the values
it can assume (i.e. it cannot take fractions such as 2.3 persons or
5.6 beds in a hospital).
B. Continuous: has no gaps or interruptions, e.g. S.cholesterol &
weight (i.e. we can say 25.8 kg in weight).
2. Qualitative: cannot be measured in the usual sense, but can be
described & counted in categories, e.g. eye colour &
socio-economic status.
SCALES used to measure variables:
1.Nominal scale: uses names, numbers or other symbols. Each
measurement is assigned to one of a limited number of unordered
categories & falls in only one category (i.e. the information about
an individual puts that individual in one category only, e.g. eye
colour & blood group).
2.Ordinal scale: each measurement is assigned to one of a limited
number of categories that are ranked in a graded order. Differences
among categories are not necessarily equal & are often not
measurable. This scale is used when the variable has degrees that
can be placed in order, e.g. if the variable is the damage caused by
cancer, we can put it in categories according to the degree of that
damage to the body systems.
3.Interval scale: each measurement is assigned to one of an unlimited
number of categories that are equally spaced, with no true zero
point, i.e. the zero is arbitrary and does not mean absence of the
quantity (values below zero are possible), e.g. temperature in °C.
4.Ratio scale: measurement begins at a true zero point & the scale
has equal intervals, as in height.
POPULATION: the largest collection of anything. If this collection
has limits it is a finite population, and if not it is an infinite
population. It can be a:
A. Population of entities: the largest collection of entities in
which we have an interest at a particular time (e.g. population of
humans); each population member has many variables.
B. Population of values: the largest collection of values of a random
variable in which we have an interest at a particular time, e.g.
blood urea.
SAMPLE: a part of a population. It is either a sample of entities or
a sample of values.
PRESENTATION OF DATA: can be:
1.Mathematical: a) Measures of central tendency
                b) Measures of dispersion.
2.Tabular: using tables.
3.Graphical: using graphs.
4.Pictorial: using pictures.
ORDERED ARRAY: the values of a collection listed in order of
magnitude (ascending or descending). It makes the smallest & largest
values easy to find and simplifies further calculations.
CLASS INTERVALS: a set of contiguous, non-overlapping intervals used
to group a set of observations in such a way that each value can be
placed in one interval only, e.g. if we want to put 1000 people in
class intervals according to their age, the class intervals will be
as follows: 0-9, 10-19, 20-29, etc.
CLASS INTERVALS (continued)
We notice from that:
1.The class intervals should be contiguous, i.e. after 0-9 must come
10-19 & so on.
2.The class intervals should not overlap, i.e. not 0-10, 10-20,
20-30, because an individual aged 10 years would then fall in two
class intervals (double reading).
3.The class intervals should include the smallest & largest values in
the study sample, e.g. if we have an age of 15 & the class intervals
start with 20-29, then where do we put 15?
STURGES' RULE: used to decide the number & width of class intervals:

K = 1 + 3.322 log10(n) & W = R / K

Where K = no. of intervals
n = no. of observations = total no. of measurements
W = width of intervals
R = the range of readings = largest value (L) - smallest value (S)
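The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `sturges` and the rounding of K to the nearest whole number are assumptions, not part of the slides.

```python
import math

def sturges(values):
    """Return (K, W): number of class intervals and their width."""
    n = len(values)
    k = round(1 + 3.322 * math.log10(n))   # K = 1 + 3.322 log10(n), rounded (assumption)
    r = max(values) - min(values)          # R = largest value - smallest value
    w = r / k                              # W = R / K
    return k, w

# Hypothetical data: 57 readings whose smallest value is 12 and largest is 79.
k, w = sturges([12, 79] + [40] * 55)
print(k, round(w, 1))
```

In practice W is usually rounded up to a convenient width (for example 10) so that the intervals have tidy limits.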
FREQUENCY DISTRIBUTION (F.D): the number of individuals falling into
each class interval.

Relative Frequency Distribution (R.F.D): the proportion of values in
each class interval, obtained by dividing the F.D by the total no. of
observations.

Cumulative Frequency Distribution (C.F.D) & Cumulative Relative
Frequency Distribution (C.R.F.D): running totals of the F.D and
R.F.D, used to make it easier to obtain information (e.g. how many
values lie below a given limit).

Example: weights of malignant tumors removed from the abdomen of 57
subjects: 68, 63, 42, 27, 30, 36, 28, 32,
79, 27, 22, 23, 24, 25, 44, 65, 43, 25, 74,
51, 36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
12, 32, 49, 38, 42, 27, 31, 50, 38, 21, 16,
24, 69, 47, 23, 22, 43, 27, 49, 28, 23, 19,
46, 30, 43, 49, 12.
Solution:
K = 1 + 3.322 log10(57) = 1 + 3.322 (1.7559) = 6.83 ≈ 7
W = R / K = (79 - 12) / 7 = 9.6 ≈ 10

Class interval   F.D   C.F.D   R.F.D %   C.R.F.D %
10 - 19            5      5      8.77       8.77
20 - 29           19     24     33.33      42.11
30 - 39           10     34     17.54      59.65
40 - 49           13     47     22.81      82.46
50 - 59            4     51      7.02      89.47
60 - 69            4     55      7.02      96.49
70 - 79            2     57      3.51     100.00
Totals            57    ---    100.00      ---
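As a cross-check, the table can be rebuilt from the raw weights with a short Python sketch. The function name `frequency_table` and its parameters are illustrative assumptions; the class limits (start 10, width 10, 7 intervals) are those of the solution above.

```python
weights = [68, 63, 42, 27, 30, 36, 28, 32,
           79, 27, 22, 23, 24, 25, 44, 65, 43, 25, 74,
           51, 36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
           12, 32, 49, 38, 42, 27, 31, 50, 38, 21, 16,
           24, 69, 47, 23, 22, 43, 27, 49, 28, 23, 19,
           46, 30, 43, 49, 12]

def frequency_table(values, start, width, k):
    """Rows of (interval, F.D, C.F.D, R.F.D %, C.R.F.D %) for integer data."""
    n = len(values)
    rows, cumulative = [], 0
    for i in range(k):
        low = start + i * width
        high = low + width - 1                 # inclusive upper limit, e.g. 10-19
        f = sum(low <= v <= high for v in values)
        cumulative += f
        rows.append((f"{low}-{high}", f, cumulative,
                     round(100 * f / n, 2), round(100 * cumulative / n, 2)))
    return rows

table = frequency_table(weights, start=10, width=10, k=7)
for row in table:
    print(*row, sep="\t")
```

The cumulative percentages are computed from the cumulative counts, not by adding the rounded percentages, which avoids small rounding drift.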
Thank you
for
listening
Biostatistics

Dr. Aamir S. Alrubaiee


MBCHB FIBMS CM
Student's t-test (t-distribution)
• When we do not know the population variance or standard deviation
(σ), we rely on the sample variance and standard deviation (S). In
this case the test statistic follows the t-distribution instead of
the Z-distribution.
• Z = (m - μ) / (σ/√n); when σ is unknown we use S instead:
t = (m - μ) / (S/√n)
• The t-distribution curve is characterized by:
1.It has a mean of zero.
2.It is symmetrical around its mean.
3.Its range lies between -∞ & +∞.
4.The quantity (n-1), which is called the degrees of freedom (d.f),
is used in computing the sample variance.
5.Compared with the normal distribution, it has a lower peak & higher
tails. This is because the statistic depends on S instead of σ; S is
itself an estimate that varies from sample to sample, and this extra
uncertainty (largest in small samples) widens the distribution.
6.It approaches the normal distribution as (n-1) approaches infinity.
• N.B) as the sample size increases, the t-distribution approaches
the normal distribution; for n > 200 we shift from the
t-distribution to the Z-distribution.
• Notes about the t-table:
1. The tabulated degrees of freedom (d.f) increase:
• From 1 to 30 by 1 (i.e. 1, 2, 3...).
• From 30 to 50 by 5 (i.e. 30, 35, 40...).
• From 50 to 100 by 10 (i.e. 50, 60, 70...).
• From 100 to 200 by 20 (i.e. 120, 140, 160...).
2.If we do not find the d.f that we want, we choose the nearest one,
e.g. 41 → 40, 148 → 140, 96 → 100, and for 150 either 140 or 160.
3.The table stops at 200 and then jumps to ∞. There we shift to the
Z table & Z-distribution.
4.To find the t-value we use the t-table.
5.The tabulated value of t depends on 2 factors: the d.f (i.e. the
sample size) & the probability of error (α). It is written
t(d.f, 1-α/2).

e.g. for n = 16 & 95% C.I: d.f = 16 - 1 = 15 and
1-α/2 = 1 - (0.05/2) = 1 - 0.025 = 0.975. We go to the t-table and
look where the row of d.f (15) crosses the column of 1-α/2 (0.975) to
get the tabulated value, so t(15, 0.975) = 2.1315.
• CHOOSING Z or t: the variability of the sample (S² & S) approaches
that of the population as the sample size increases, so the question
is: when can we use the Z-distribution instead of the t-distribution
(i.e. when can we state that S ≈ σ)?
– d.f > 200 → use Z test
– d.f 61-200 → use Z or t test
– d.f 31-60 → t test is preferred to Z
– d.f ≤ 30 → we have to use t test
• So whenever d.f > 30 we can use the Z test (taking σ = S).
• Calculated t & Confidence Intervals using the t-distribution:
• A) For the population mean:
• t = (m - μ) / (S/√n)
• %C.I for μ = m ± t(d.f, 1-α/2) * S/√n, where d.f = n-1
B) For the difference between two population means, assuming unequal
population variances:
• t = [(m1 - m2) - (μ1 - μ2)] / √[(S1²/n1) + (S2²/n2)]
• %C.I for (μ1 - μ2) =
(m1 - m2) ± t(d.f, 1-α/2) * √[(S1²/n1) + (S2²/n2)]
• Where d.f = (S1²/n1 + S2²/n2)² /
[(S1²/n1)²/(n1 - 1) + (S2²/n2)²/(n2 - 1)].
This value never exceeds n1 + n2 - 2 and is close to it when the two
samples have similar sizes & variances.
C) For the difference between two population means, assuming equal
population variances (important, as it is used with both hypothesis
testing & C.I):
• t = [(m1 - m2) - (μ1 - μ2)] / [Sp * √(1/n1 + 1/n2)]
• %C.I for (μ1 - μ2) =
(m1 - m2) ± t(d.f, 1-α/2) * Sp * √(1/n1 + 1/n2)
• Where d.f = n1 + n2 - 2
& Sp = √(pooled variance) =
√{[(n1-1)S1² + (n2-1)S2²] / (n1 + n2 - 2)}
D) For the mean difference (md) of paired observations (i.e. the mean
difference of 2 dependent samples): this is used when the same
volunteers or participants pass through 2 different situations (each
participant has 2 readings, e.g. 2 drugs, 2 different doses of the
same drug, drug and placebo, or rest & exhaustion...).
• t = (md - μd) / (Sd/√n)
• %C.I for μd = md ± t(d.f, 1-α/2) * Sd/√n
• Where md = ∑d / n, Sd = √[{n∑d² - (∑d)²} / n(n-1)] & d.f = n-1
(n is the number of pairs)
• N.B) the t-test is not used with proportions.
• N.B) from the rules of confidence intervals using the
t-distribution we notice that the interval is affected by the sample
size more than when using the Z-distribution, because with the
t-distribution the tabulated value is also affected by the sample
size, and not only the standard error.
• Example 1: using specimens obtained from 10 individuals for teeth
content of calcium, with m = 35.7 & S = 0.7, what is the expected
population mean?
• Estimation
• 95%C.I for μ = m ± t(9, 0.975) * S/√n = 35.7 ± (2.262 * 0.7/√10)
= 35.7 ± 0.50 = 35.20 to 36.20
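The arithmetic of Example 1 can be reproduced with a minimal Python sketch. The helper name `ci_mean` is an assumption; the tabulated value 2.262 is taken from the t-table, not computed.

```python
import math

def ci_mean(m, s, n, t_tab):
    """Confidence interval for a population mean: m ± t_tab * S/√n."""
    margin = t_tab * s / math.sqrt(n)
    return m - margin, m + margin

# Example 1: n = 10, m = 35.7, S = 0.7, t(9, 0.975) = 2.262 from the t-table
low, high = ci_mean(m=35.7, s=0.7, n=10, t_tab=2.262)
print(f"{low:.2f} to {high:.2f}")
```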
• Example 2: a breast cancer researcher collected the following data
on tumor size: type A (n1 = 21, m1 = 3.85 cm & S1 = 1.95 cm) & type B
(n2 = 16, m2 = 2.8 cm & S2 = 1.70 cm). What is the 95%C.I for the
difference in the population means?
• Here the two groups are of two different cancer types, so the
population variances are assumed to be different (unequal).
• d.f = (S1²/n1 + S2²/n2)² /
[(S1²/n1)²/(n1 - 1) + (S2²/n2)²/(n2 - 1)] = 34.3 (while
d.f = n1 + n2 - 2 = 35; the nearest tabulated d.f is 35 in both
cases, so either points to the same tabulated value)
• 95%C.I for (μ1 - μ2) =
(m1 - m2) ± t(d.f, 1-α/2) * √[(S1²/n1) + (S2²/n2)]
= (3.85 - 2.8) ± 2.0301 * √[(1.95)²/21 + (1.7)²/16]
= 1.05 ± 1.22 = -0.17 to 2.27 (not significant, as the interval
includes zero)
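A short Python sketch of Example 2 follows. It uses the standard Welch-Satterthwaite degrees of freedom, with (n - 1) in the denominators, and takes the tabulated t for d.f = 35 (2.0301) as an input from the t-table. The function names are illustrative assumptions.

```python
import math

def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom for unequal variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

def ci_diff_unequal(m1, s1, n1, m2, s2, n2, t_tab):
    """C.I for (μ1 - μ2) without assuming equal population variances."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
    diff = m1 - m2
    return diff - t_tab * se, diff + t_tab * se

df = welch_df(1.95, 21, 1.70, 16)
low, high = ci_diff_unequal(3.85, 1.95, 21, 2.8, 1.70, 16, t_tab=2.0301)
print(round(df, 1), f"{low:.2f} to {high:.2f}")
```

Because the interval runs from a negative to a positive value, zero is a plausible difference and the result is not significant.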
• Example 3: in a cancer institute, research was done on a drug to
prolong life in patients with throat cancer. 200 patients were
randomly divided into 2 equal groups: on the drug (m1 = 4.6 months,
S1 = 2.5 months) & on placebo (m2 = 3.4 months & S2 = 1.3 months).
Assuming normality & equal population variances, find the 95%C.I for
(μ1 - μ2) and interpret the result.
• Sp = √{[(n1-1)S1² + (n2-1)S2²] / (n1 + n2 - 2)}
= √{[99(6.25) + 99(1.69)] / 198} = √3.97 = 1.992
• d.f = 100 + 100 - 2 = 198
• t(198, 0.975) = 1.9719 (from the t-table)
• %C.I for (μ1 - μ2) =
(m1 - m2) ± t(d.f, 1-α/2) * Sp * √(1/n1 + 1/n2)
• 95%C.I for (μ1 - μ2) = (4.6 - 3.4) ± [1.9719 * 1.992 *
√(1/100 + 1/100)] = 1.2 ± 0.556 = 0.644 to 1.756
• Since zero is not included in the interval, the drug under trial
significantly prolongs life.
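The pooled-variance calculation of Example 3 can be checked with the sketch below. The function names are assumptions, and the tabulated value 1.9719 is taken from the t-table as given in the example.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Sp = sqrt of the pooled variance, weighting each S² by its d.f."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def ci_diff_equal(m1, s1, n1, m2, s2, n2, t_tab):
    """C.I for (μ1 - μ2) assuming equal population variances."""
    sp = pooled_sd(s1, n1, s2, n2)
    margin = t_tab * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = m1 - m2
    return diff - margin, diff + margin

sp = pooled_sd(2.5, 100, 1.3, 100)
low, high = ci_diff_equal(4.6, 2.5, 100, 3.4, 1.3, 100, t_tab=1.9719)
print(round(sp, 3), f"{low:.3f} to {high:.3f}")
```

With equal group sizes the pooled variance is simply the average of the two sample variances, (6.25 + 1.69) / 2 = 3.97.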
• Example 4: in a pediatric clinic a study was carried out to see the
effectiveness of a certain antipyretic on (12) 4-year-old girls
suffering from flu. Their temperature was taken immediately before
and 1 hr after drug administration. For the following results, find
the 95%C.I for the mean difference.

i.d no.   Before   After   d      d²
1         39.1     37.6    1.5    2.25
2         39.6     37.8    1.8    3.24
3         38.8     37.9    0.9    0.81
4         39.4     38.4    1.0    1.0
5         38.4     37.7    0.7    0.49
6         38.2     37.9    0.3    0.09
7         39.2     38.3    0.9    0.81
8         39.5     38.8    1.7    2.89
9         39.3     38.2    1.1    1.21
10        39.1     38.4    0.7    0.49
11        38.3     38.5    0.3    0.09
12        38.6     37.9    0.7    0.49
Total                      ∑d = 11.6   ∑d² = 13.86
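The slides leave this example unsolved. A possible worked sketch in Python is shown below, under two stated assumptions: the d column is taken exactly as tabulated (in rows 8 and 11 it does not equal Before minus After, but it does match the stated totals), and t(11, 0.975) = 2.201 is read from a t-table.

```python
import math

# d column exactly as tabulated (assumption: the totals 11.6 and 13.86 are correct)
d = [1.5, 1.8, 0.9, 1.0, 0.7, 0.3, 0.9, 1.7, 1.1, 0.7, 0.3, 0.7]

n = len(d)
sum_d = sum(d)
sum_d2 = sum(x * x for x in d)

md = sum_d / n                                            # md = ∑d / n
sd = math.sqrt((n * sum_d2 - sum_d**2) / (n * (n - 1)))   # Sd
t_tab = 2.201                                             # t(11, 0.975), from a t-table
margin = t_tab * sd / math.sqrt(n)
low, high = md - margin, md + margin
print(f"md = {md:.3f}, Sd = {sd:.3f}, 95% C.I = {low:.3f} to {high:.3f}")
```

Under these assumptions the interval lies entirely above zero, which would indicate a significant drop in temperature after the drug.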


Hypothesis Testing

• The purpose of hypothesis testing is to help the clinician,
researcher and administrator reach a decision concerning a population
by examining a sample drawn from that population.
• Definition: a hypothesis is a statement about one or more
populations, usually concerned with a parameter of the population
about which the statement is made.
• Types of hypothesis: research type and statistical type.
• A) Research hypothesis: arises from long observation, e.g. an idea
formed about a relation between oral contraceptives, estrogen levels
& DVT.
• B) Statistical hypothesis: stated in a way that can be evaluated by
appropriate statistical techniques (is it true or not, significant or
not). Types:
• Null hypothesis (H0): the hypothesis under test; it is the
hypothesis of no difference, e.g. if a difference between two means
is suspected, the null hypothesis says that the difference is not
significant.
• Alternative hypothesis (HA): it disagrees with H0 and is the
hypothesis of difference.
• Test statistic: a value computed from the sample data that provides
the basis for testing the statistical hypothesis. Its value
determines whether we accept (fail to reject) H0, in which case HA is
rejected, or vice versa.
• Types of errors (wrong decisions): there are 2 types:
• α error (Type I error): incorrectly rejecting a true H0.
• β error (Type II error): accepting a false H0, i.e. failing to
reject the null hypothesis when a difference actually does exist.
This is the case if we cannot detect a real difference.

                      Condition of null hypothesis
Possible action       H0 true                  H0 false
Fail to reject H0     Correct action (1-α)     Type II error (β)
Reject H0             Type I error (α)         Correct action (1-β)

• So 1 - α = the probability of not rejecting a true H0 (i.e. true
negative decision)
• 1 - β = power of the study = probability of rejecting a false H0
= probability of detecting a real difference = true positive decision
• 1 - Power = β = probability of accepting a false H0.
• Definitions:
• Non-rejection region: the set of values of the test statistic that
lead to non-rejection of H0.
• Rejection region: the set of values of the test statistic that lead
to rejection of H0.
• Critical value: the value of the test statistic that separates the
rejection region from the non-rejection (acceptance) region. It is
the tabulated value that lies at the extremes of the acceptance area,
so for the null hypothesis to be accepted, the absolute calculated
value must be less than the critical (tabulated) value, i.e. it must
fall in the acceptance area.
• P value: the smallest value of alpha for which H0 can be rejected.
It is more precise than the alpha level: instead of only saying that
the test is significant or not, we report the exact probability of
obtaining a result at least this extreme when H0 is true.
• Hypothesis testing is a 9-step procedure:
1.Data: the nature of the data & their frequency, to determine what
type of test is to be used.
2.Assumptions: normality, equality of variances, independence of
samples.
3.Hypothesis: state the null and the alternative hypothesis. If we do
not reject the null hypothesis, we say that the tested data do not
provide sufficient evidence to cause rejection. If we reject it, we
say that the data are not compatible with the null hypothesis but
support the alternative hypothesis.
4.Test statistic: choose the test statistic, which uses the sample
data to reach a decision to reject or accept the null hypothesis.
5.Distribution of the test statistic: drawing the curve.
6.Decision rule: to accept or to reject H0, from
• |Calculated value| < tabulated value → we accept H0
• |Calculated value| ≥ tabulated value → we reject H0
7.Calculate the test statistic.
8.Statistical decision: whether the calculated value falls in the
acceptance area or not.
9.Conclusion: according to the test statistic, we reject H0 & accept
HA or vice versa.
• Q/ which method is superior, confidence interval or hypothesis
testing (P value)? Why?
• EXAMPLE 1: below are the heights of 24 randomly selected males aged
2 years who have sickle cell anemia. The mean height for that age and
gender in the population is 86.5 cm. Does the sample provide
sufficient evidence of an effect of this disease on the heights of
children?
• Heights (cm): 84.4, 89.9, 89, 81.4, 87, 78.5, 84.1, 86.3, 80.6, 80,
81.3, 86.8, 83.4, 89.8, 85.9, 80.6, 85, 82.5, 80.7, 84.3, 85.4,
85,58 & 81.9.
• Solution:
• Mean = m = (∑x) / n = 84.1 cm &
• SD = √[{n∑x² - (∑x)²} / n(n-1)] = 3.11 cm.
1.Data: the heights of 24 male children suffering from sickle cell
anemia, with mean height 84.1 cm and standard deviation 3.11 cm; the
mean height of children of that age and gender is 86.5 cm.
2.Assumption: we assume that the sample is randomly selected from a
normally distributed population.
3.Hypothesis:
• H0: there is no significant difference between the mean height of
the sample and that of the population.
• HA: there is a significant difference between the mean height of
the sample and that of the population.
4.Level of significance:
• α = 0.05, i.e. we accept a 5% probability that a difference this
large is due to chance alone (95% confidence)
• d.f = 24 - 1 = 23
• Critical point = tabulated t = t(23, 0.975) = 2.069
7.Testing significance: we use the t-test (why?)
8.Calculated t = (m - μ) / (S/√n)
= (84.1 - 86.5) / (3.11/√24) = -3.78
9.Conclusion: |calculated t| = 3.78 > 2.069, so it falls in the
rejection area; we reject H0 & accept HA, i.e. there is a significant
difference in height between the sickle cell sample and the normal
population. In other words, there is a significant effect of S.C.A on
height for that age and gender.
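The decision in Example 1 can be reproduced from the summary figures with a small Python sketch. The function name `one_sample_t` is an assumption, and the critical value 2.069 is the tabulated t(23, 0.975) quoted in the solution.

```python
import math

def one_sample_t(m, mu, s, n):
    """Calculated t for a sample mean against a known population mean."""
    return (m - mu) / (s / math.sqrt(n))

t_calc = one_sample_t(m=84.1, mu=86.5, s=3.11, n=24)
t_tab = 2.069                        # t(23, 0.975), from the t-table
reject_h0 = abs(t_calc) >= t_tab     # decision rule, two-sided test
print(round(t_calc, 2), "reject H0" if reject_h0 else "fail to reject H0")
```

The absolute value is compared because the test is two-sided: a mean far below or far above 86.5 cm would both count as evidence against H0.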
• EXAMPLE 2: two samples, each of 24 males aged 2 years: in those
with S.C.A the mean height is 84.1 cm with SD 3.11 cm, and in those
without S.C.A the mean is 87.3 cm with SD 2.13 cm. Is there a
significant effect of S.C.A on height?
• EXAMPLE 3: in a large sample, the mean RBC count in men is
5.5 * 10¹²/L. In a sample of 100 men a mean count of 5.35 * 10¹²/L
was found, with SD of 1.1 * 10¹²/L. Is there any significant
difference between the two means?
• TABULAR PRESENTATION OF DATA:
1.Single variable frequencies:
a. For qualitative variables: we put them into groups & within each
group we give the frequency.
b. For quantitative variables: a large data set requires grouping of
the data into classes.
2.Cross tabulation:
a. Two dimensional tables.
b. Three dimensional tables, as in age, gender & smoking status.
Table 2. Gender distribution of suspected pertussis cases in
Al-Muthanna Gov., 2013

Gender    Frequency   Percent
Female    62          50.82 %
Male      60          49.18 %
Total     122         100 %

Table 1. Age group distribution of suspected cases of pertussis in
Al-Muthanna Gov., 2013

Age group (years)   Frequency   Percent
<1                  46          37.70 %
1-4                 52          42.62 %
5-14                23          18.85 %
Total               121         100 %
• Principles of making tables:
1. Simple: the table must be simple; it should contain the least no.
of variables to make things easy for the reader. It is better to have
2-3 tables than one complicated table.
2. Understandable & self-explanatory: the table should be easy to
understand without the need to return to the text, & this is done by:
a. Explaining any symbols, codes & abbreviations in a footnote at the
bottom of the table.
b. Clear, concise labeling of rows & columns.
c. Defining the units of the data, e.g. when we measure height we
must state cm or meter.
d. Using a clear, concise title that answers the questions what,
where & when, e.g. table of infant mortality rate for 1999 in
Baghdad.
e. Giving the total, placed at the end as in a frequency
distribution.
3. Separate the title from the body of the table, i.e. put a space
between the title & the body of the table.
4. State the source of the data, i.e. from where or from which book
the table is taken.
5. Avoid excessive ruling (too many dividing lines).
• Types of tables:
• Simple tables: consist of 1 row & 1
column.
• Compound tables: consists of 2 or more
rows & columns.
• Contingency tables: show the relationship between two or more
variables.
Table 1: Normal B.Wt total birth (what), England (where), 1969-1970
(when).

                      Mother's age (years)
Parity                <30          ≥30         Total
0                     514,108      49,895      564,003
1-3                   583,889      234,084     817,973
≥4                    22,216       64,894      87,110
Total (all parity)    1,120,213    348,873     1,469,086

Source: Osborn J.F., (1975). J.R statistic. 24, 75-84


GRAPHICAL PRESENTATION OF DATA:
• A pictorial display of quantitative data using coordinate measures.
The X-axis is for the independent variable (method of
classification), & the Y-axis is for the dependent variable
(frequency or relative frequency).
• General principles for making a graph:
a. Simple: it should not contain more lines & symbols than the eye
can follow (so it is easier for the reader to understand).
b. Self-explanatory.
c. Title: placed at the top or at the bottom of the graph & answers
the questions what, where & when.
d. Keys: if there is more than one variable, we must indicate what
each one represents.
e. Scale & units: we must state the units used in the graph.
Types of graphs:
1. Arithmetic scale line graph
This is particularly useful for presenting the trend of one or more
sets of data.
• In general the Y-axis is about 2/3 the length of the X-axis.
• An equal distance represents an equal quantity anywhere on an axis.
• The slope of the line (or curve) indicates the rate of increase or
decrease.
• Two or more lines following a parallel path indicate identical
rates of increase or decrease.
[Figure: line graph, "Cardiovascular Disease Mortality Trends for
Males and Females, United States: 1979-2002". Y-axis: deaths in
thousands; X-axis: years; one line each for males and females.
Source: CDC/NCHS.]
Histogram
A graphical display of the frequency distribution of a quantitative
variable. The values of the quantitative variable (as class
intervals) are placed on the X-axis (representing the width of the
rectangles), and the corresponding frequency (or relative frequency)
is placed on the Y-axis (representing the height of the rectangles).
Frequency Polygon
Another form of graphical presentation of the frequency distribution
of quantitative variables. It is similar to the histogram, but
instead of using rectangles to present the data, the midpoints of the
tops of the rectangles are plotted and connected together by straight
lines.
• More than one set of data can be shown on the same graph, to
facilitate direct comparison.
• It is only appropriate when the variable on the horizontal axis is
continuous.
• It provides information about the underlying characteristics of the
data.
• The area under the frequency polygon is equal to the area under the
equivalent histogram.
Scatter diagram
A pair of measurements is plotted as a single point on a graph. The
value of one variable of each pair is plotted on the X-axis and the
value of the other variable is plotted on the Y-axis.
Showing the relation
• A simple and effective way to examine the relation between two
quantitative variables is a scatter diagram.
• Each point corresponds to one subject.
A scatter diagram could suggest:
• No relationship: when one variable changes with no change in the
other variable, or when the pattern is irregular.
• Linear relationship: an increase in the 1st variable is associated
with an increase (positive) or decrease (negative) in the 2nd
variable, and the pattern follows a straight line.
• Curvilinear (positive or negative) relationship: the pattern of
increase or decrease does not follow a straight line.
Bar chart
• Used to present discrete or qualitative data (categorical data).
• It consists of separated bars of equal width.
• The method of classification of the variable is usually placed on
the X-axis, and the Y-axis usually represents the corresponding
frequency or relative frequency.
[Figure: bar chart, "Leading Causes of Death for All Males and
Females, United States: 2002". Y-axis: deaths in thousands.
Categories: A Total CVD (preliminary), B Cancer, C Accidents,
D Chronic Lower Respiratory Diseases, E Diabetes Mellitus,
F Alzheimer's Disease. Source: CDC/NCHS.]
Component bar chart
• It is a type of chart based on proportion.
• It uses bars that are either shaded or colored to show the relative
contribution of each of its components.
Pictogram (picto-chart)
• A graphical representation of the (relative) frequencies using
symbols (drawings or pictures) relevant to the subject matter.
• Symbols of different sizes should not be used.
• A unit value of the data should be represented by a standard
symbol, which may be repeated to represent magnitude; each symbol
represents a fixed number of units.
Pie chart
• It is a type of chart based on proportion.
• It uses wedge-shaped portions of a circle to illustrate the
relative contribution of each part to the total (division of the
whole into segments).
• To determine the angle of each wedge, we multiply the relative
frequency of each division by 360 degrees.
• Start at 12 o'clock. It is preferable to arrange the segments in
order of magnitude (starting with the largest) and proceed clockwise
around the chart.
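The wedge-angle rule can be illustrated with a short Python sketch. The function name `pie_angles` is an assumption; the counts are the TB figures from the exercises section of these slides (360, 240 and 200 out of 800).

```python
def pie_angles(counts):
    """Angle of each wedge = relative frequency * 360 degrees, largest first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {name: 360 * value / total for name, value in ordered}

tb_cases = {"Smear +ve PTB": 360, "Smear -ve PTB": 240, "Extra PTB": 200}
angles = pie_angles(tb_cases)
for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
```

Sorting from largest to smallest follows the advice to start at 12 o'clock with the largest segment and proceed clockwise.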
[Figure: "Most Myocardial Infarctions Are Caused by Low-Grade
Stenoses". Source: Falk E et al, Circulation, 1995.]

Map charts
• These are used to present the geographical distribution of one or
more sets of data.
Flow chart
• It is used to illustrate the sequence of a series of events.
• It is characterized by multiple arrows.
[Figure: flow chart, "Development of Atherosclerotic Plaques".
Labels: Normal, Fatty streak, Lipid-rich plaque, Foam cells, Fibrous
cap, Lipid core, Thrombus. Source: Ross R. Nature. 1993;362:801-809.]

Suggestions for the design and use of tables, graphs, and charts
• Choose the method most effective for the data and purpose.
• Point out one idea at a time.
• Limit the amount of data and include one kind of data in each
presentation.
• Black and white are better for exhibits that are to be reproduced.
• Use adequate, properly located titles and labels.
• Mention the source, if it is not yours.
• Use care and caution in proposing conclusions.
Exercises
• The following are the DBP measurements (mmHg) of 60 individuals.
Make a suitable graphical or pictorial presentation.

DBP (mmHg)   No.
65-69        3
70-74        5
75-79        9
80-84        18
85-89        13
90-94        9
95-99        3
Total        60
• The following are the proportions of the commonest ten cancers in
Iraq, 1995. Make a suitable graphical or pictorial presentation.

Primary site             % of total CA
Breast                   14.3
Bronchus & lung          11.2
Urinary Bladder          7.4
Non-Hodgkin Lymphoma     6.2
Larynx                   5.9
Leukemia                 5.2
Brain & other CNS        4.8
Skin                     4.3
Stomach                  3.6
Hodgkin Lymphoma         3.0
• The following is the distribution of TB cases registered in City X.
Make a suitable graphical or pictorial presentation.

Type of TB       No.
Smear +ve PTB    360
Smear -ve PTB    240
Extra PTB        200
Total            800
• The following is the distribution of meningitis cases, Ibn
Al-Khateeb Hospital, 1999. Make a suitable graphical or pictorial
presentation.

Agent       Male No.   Female No.   Total
Viral       168        84           252
Bacterial   84         42           126
TB          21         21           42
Total       273        147          420
SAMPLING METHODS

• The main task of biostatistics is to generalize results. It is
difficult to include the whole population, so we take a sample from
each population in order to get results.
• Statistical inference: the procedure by which we reach conclusions
about a population on the basis of a sample of observations drawn
from it.
• Population of entities: the largest collection of entities in which
we have an interest at a particular time (e.g. population of humans);
each population member has many variables. The unit here is the
entity (a group of variables).
• Population of values: the largest collection of values of a random
variable in which we have an interest at a particular time, e.g.
blood urea. The unit here is the value of the variable.
• Sample of entities: a finite no. of entities from a population of
entities.
• Sample of observations (of values): a finite no. of values from a
population of observations.
• Sample frame: a numbered list of all the units composing the study
population, e.g. for the population of 3rd grade medical students we
have 263 students, so we give every member of this population an
identification no. from 1 to 263.
• Sampling error: the difference between the sample measure & its
corresponding population measure. It is not an error or a mistake,
but is due to the chance involved in selecting the sample
individuals.
• Types of sampling methods:
They are probability methods and non-probability methods. The problem
with the second type is that its results cannot be generalized.
• PROBABILITY SAMPLING: a sample drawn from a population in such a
way that every member of the population has the same probability
(chance) of being included in the sample, and this must be done
without bias, e.g. if I have 100 students (60 males & 40 females) & I
want to choose 10 students, and I ask one of the students to choose
for me, the chooser may favour friends or one gender, & so the chance
will not be equal because of bias.
• So we need appropriate techniques that give an equal chance to
every member of the population.
Types of probability sampling:
1.Simple random sample (not haphazard): it requires the following:
A. Sample frame: we should have a list of every subject in the
population.
B. Sampling fraction: the ratio of the sample size to the total
population, e.g. 10 out of 100 means choosing 10 from 100.
• The selection of a sample from the population can be done by:
• Lottery method: to give an equal chance to everyone.
• Computer generated random sampling.
• Using the random number table:
– Assign a unique identification no. to each individual in the
population.
– Decide the direction of movement (horizontally or vertically) on
the random digit table.
– Locate the starting point randomly with the tip of a pencil, with
closed eyes.
– Read numbers with as many digits as the population size has, e.g. 2
digits for a population of 47, 3 digits for a population of 175.
2. Systematic sampling: here we choose units or observations from the
sample frame at regular intervals, e.g. if the population is composed
of 1200 units & we want to choose 100, then we use the interval
1200/100 = 12 and choose the starting point randomly as mentioned
before, so if we choose 2 as the starting point we will then select
14, 26, 38, ...
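The 1200/100 example can be sketched in Python as follows. The function name `systematic_sample` and its parameters are illustrative assumptions; the starting point is drawn at random within the first interval unless one is supplied.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Identification numbers chosen at a regular interval from the frame."""
    interval = population_size // sample_size          # e.g. 1200 // 100 = 12
    if start is None:
        # random starting point within the first interval (1..interval)
        start = random.Random(seed).randint(1, interval)
    return [start + i * interval for i in range(sample_size)]

chosen = systematic_sample(1200, 100, start=2)
print(chosen[:4], "...", chosen[-1])
```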
• The disadvantage of simple random & systematic sampling is that
they do not ensure that the sample has a structure similar to that of
the population, such as the male to female ratio. To achieve this, we
rely on other types of sampling.
e.g. the following are the weights in grams of
both kidneys of 50 normal men aged 30-40:
No.  Wt.    No.  Wt.    No.  Wt.    No.  Wt.    No.  Wt.
 1   379    11   363    21   252    31   305    41   323
 2   309    12   358    22   332    32   387    42   329
 3   323    13   358    23   404    33   349    43   327
 4   288    14   361    24   277    34   303    44   311
 5   301    15   265    25   208    35   293    45   256
 6   345    16   311    26   322    36   356    46   310
 7   358    17   388    27   307    37   350    47   342
 8   340    18   240    28   379    38   470    48   274
 9   329    19   260    29   319    39   362    49   329
10   309    20   288    30   369    40   288    50   358

• Now for those 50, if we want 5
observations (that means the chance for
each one is 5/50 = 1/10), we can use the lottery
method or the random digit table.
Each man has an identity no. from 1 to 50,
written with the same no. of digits as the total, i.e.
01 to 50. Let the pencil randomly locate
the starting point on the random digit
table, then move in a fixed direction
from that point until 5 observations
are obtained.
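The same draw can be done with computer generated random sampling. The sketch below assumes Python's standard `random` module and an arbitrary seed; the names `weights`, `chosen_ids` and `sample` are made up for the illustration.

```python
import random

# Kidney weights (g) from the table above; position 0 holds man no. 1.
weights = [
    379, 309, 323, 288, 301, 345, 358, 340, 329, 309,
    363, 358, 358, 361, 265, 311, 388, 240, 260, 288,
    252, 332, 404, 277, 208, 322, 307, 379, 319, 369,
    305, 387, 349, 303, 293, 356, 350, 470, 362, 288,
    323, 329, 327, 311, 256, 310, 342, 274, 329, 358,
]

rng = random.Random(2024)                           # arbitrary seed, only for repeatability
chosen_ids = sorted(rng.sample(range(1, 51), 5))    # 5 distinct identity numbers, 1 to 50
sample = {man: weights[man - 1] for man in chosen_ids}
print(sample)                                       # identity no. -> kidney weight
```

`random.sample` draws without replacement, so no man can be selected twice, which matches the lottery method.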
3.Stratified sampling: the sample frame of the original
population is divided from the beginning into
strata (groups) according to certain
characteristics (age, gender…), then random or
systematic sampling is performed within each
stratum. With proportional allocation, the no. of
units selected from each stratum follows the
proportion of that stratum in the original
population, thus stratified sampling is only
feasible when we know the size, & so the
proportion, of each sub-population. Although
proportional allocation is not 100% random, it
is preferable when the sample must reflect certain
characteristics, e.g. gender.
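Proportional allocation can be sketched as follows, reusing the earlier class of 100 students (60 males & 40 females) with a sample of 10. The helper `proportional_allocation` and the numbering of students are assumptions made only for this example.

```python
import random

def proportional_allocation(strata_sizes, sample_size):
    """Units to draw per stratum, in proportion to the stratum size."""
    total = sum(strata_sizes.values())
    # round() can leave the parts one unit off sample_size for awkward proportions
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

# Hypothetical frame: students 1-60 are male, 61-100 are female.
strata = {"male": list(range(1, 61)), "female": list(range(61, 101))}
allocation = proportional_allocation({k: len(v) for k, v in strata.items()}, 10)

rng = random.Random(7)                              # arbitrary seed
stratified = {name: rng.sample(members, allocation[name])
              for name, members in strata.items()}  # simple random sample per stratum
print(allocation)
```

With these numbers the allocation is 6 males and 4 females, so the sample keeps the 60:40 structure of the class.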

4.Cluster sampling: it is the most commonly
used. The selection here is of groups of
units rather than individual units. A sample
frame of groups of study units (clusters)
should be available, and then a random
sample of these clusters will be included.
The clusters could be schools, districts,
hospitals, villages, factories or clinics.
• e.g. studies made by UNICEF sample
groups (schools) rather than individual
students, because we don't have a list of
all the students to choose from. So we
make a list of primary schools, choose
schools randomly from it, then include every
student in each school (cluster) that was
chosen.
• Be advised that it is better to have a
large no. of small clusters rather than a
small no. of large clusters, because the
people in the same cluster are usually
similar to each other.
5.Multistage sampling: this procedure is
carried out in phases (stages), and can
involve more than one of the above
sampling methods. It is used for a very
large population and when a
sample frame is not available for the whole
population.
• Nonprobability Sampling: divided
into:
1.Convenience sampling (also known as
accidental sampling): the study
units that happen to be available at the
time of data collection are included
in the sample, but such a sample is not
representative of the population we
want to study. It is used, for example, in
studies on astronauts or nuclear
reactor workers.
2. Quota sampling:
The composition of the sample
regarding certain characteristics
is decided from the beginning and
the only requirement is to find the
right no. of people to fill these
quotas.
3.Purposive Samples
• Subjects selected for a good reason tied
to purposes of research
• Small samples < 30, not large enough for
power of probability sampling.
– Nature of research requires small sample
– Choose subjects with appropriate variability in
what you are studying
• Hard-to-get populations that cannot be
found through screening general
population
PROBABILITY

• In general, medicine is an inexact subject
because we must pass through many
steps to reach a proper diagnosis, e.g. to
diagnose a certain disease we take the
history, put a differential diagnosis, then do the
physical examination and investigations, so with
each step the probability of reaching a proper
diagnosis will increase.
• Examples of probability: the probability of
delivering a boy (or a girl) is one in two,
or 50%; the probability of rolling a die &
getting a 2 is 1/6; and the probability of
getting a head (or a tail) on tossing a coin is 50%.
• If E = event:
Probability of Event = P (E) = no. of times
E occurred / no. of times E can occur
• The value of a probability must be between
0 & 1 (or 0 & 100 in percent expression), &
the sum of the probabilities of mutually exclusive
events must not exceed 1 (or 100 in %). A
probability of zero means the event can never
occur & a probability of 1 (or of 100 in %)
means the event will definitely occur.
• Probability can help the clinician in:
1.Quantifying uncertainty inherent in
decision making process.
2.Obtaining conclusions concerning
population of patients based on known
information about a sample of patients
drawn from the population.
3.Generalizing results, especially
when the sample size is large.
Example: we have the following table for serum
cholesterol of 1047 male patients aged 40-59 years
(F = frequency, r.f = relative frequency in %,
c.r.f = cumulative relative frequency in %).
S.choles.    F       r.f     c.r.f
<160         31      3       3
160-199      134     12.8    15.8
200-239      358     34.2    50
240-279      326     31.1    81.1
280-319      145     13.7    94.8
320-359      43      4.1     98.9
360 +        12      1.2     100
Totals       1047    100%    ---


• The probability to get individuals with
serum cholesterol of 280-319 is
145 / 1047 = 13.7%
• Probability to get those below 200 is
(31 + 134) / 1047 = 15.8%
• So we can express probability in
terms of relative frequency or
cumulative relative frequency.
• Probability & 2x2 tables:
                Disease Status
                +              -              Totals
Test    +       Cell a (TP)    Cell b (FP)    a+b
        -       Cell c (FN)    Cell d (TN)    c+d
Totals          a+c            b+d            N
Where TP = true positive, FP = false positive, FN = false negative & TN = true negative.
• Gold Standard Test:
is the test that never makes a mistake, i.e. if
there is disease the test is positive, and
if there is no disease the test is
negative. But it is usually difficult &/or
costly, so we use simpler tests that have a
probability of giving false readings (FP
& FN), with results as in the table
above.
Example on the diagnosis of T.B:
                Bacilli in sputum
                +       -       Totals
CXR     +       7       4       11
        -       3       86      89
Totals          10      90      100
So there are 4 persons diagnosed as TB by CXR while they are not diseased, & 3
cases of TB labeled as not diseased while they are diseased.
• Marginal Probability: it is called so
because it deals with the probabilities at the margins of the
table (the cells of row totals and column totals).
From the example above:
• Probability of testing positive = P (T+) = 11/100 = 0.11
• Probability of testing negative = P (T-) = 89/100 = 0.89
• Probability of having the disease = P (D+) = 10/100 = 0.10
• Probability of not having the disease = P (D-) = 90/100 = 0.90
• Joint Probability: the probability of
2 or more events occurring simultaneously,
i.e. the probability that a subject picked
randomly from a group has both events
simultaneously. From the previous example:
• Probability of being diseased & testing positive =
P (D+&T+) = 7/100 = 0.07
• Probability of being diseased & testing negative =
P (D+&T-) = 3/100 = 0.03
• Conditional Probability: is the probability of an
event occurring given that another event has already occurred.
• Conditional probability = P (B/A) = P (A&B) / P (A) = joint probability /
marginal probability
• From the previous example:
• P (T+/D+) = P (T+&D+) / P (D+) = 0.07 / 0.10 = 0.70
• P (D+/T+) = P (D+&T+) / P (T+) = 0.07 / 0.11 = 0.64
• P (T-/D+) = P (T-&D+) / P (D+) = 0.03 / 0.10 = 0.30
• Summary:
• Cell / marginal total = conditional probability
• Cell / grand total = joint probability
• Marginal total / grand total = marginal probability
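The three kinds of probability can be read off the TB table with a few divisions. The sketch below uses the cell letters a, b, c, d of the 2x2 layout; the variable names are chosen only for this illustration.

```python
# CXR (test) versus bacilli in sputum (disease status), cells as in the 2x2 layout
a, b, c, d = 7, 4, 3, 86            # TP, FP, FN, TN
n = a + b + c + d                   # grand total = 100

p_test_pos = (a + b) / n            # marginal:    P(T+)     = 11/100
p_dis_pos = (a + c) / n             # marginal:    P(D+)     = 10/100
p_joint = a / n                     # joint:       P(D+&T+)  = 7/100
p_tpos_given_dpos = p_joint / p_dis_pos    # conditional: P(T+/D+) = 7/10
p_dpos_given_tpos = p_joint / p_test_pos   # conditional: P(D+/T+) = 7/11

print(round(p_tpos_given_dpos, 2), round(p_dpos_given_tpos, 2))
```

Dividing the joint probability by a marginal probability gives the same result as dividing the cell count by the marginal total (7/10 and 7/11), which is the shortcut stated in the summary.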
• Rules Of Probability:
• Multiplicative Rule: used with the word "and";
it has 2 cases:
1.Independent events: used when there are 2
or more events & none of them is affected by
the others.
• P (A&B) = P (A∩B) = P (A) * P (B)
• e.g. for a pregnant woman, the probability
of having a boy is 50% & of having a girl is 50%. If
she got a boy, the probability of having a girl
later is still 50%.
• The probability of having 2 successive boys is 0.50 * 0.50 = 0.25
2.Non-independent events: used when the 2
events are related to each other.
• P (A&B) = P (A / B) * P (B) = joint probability
• e.g. from the table of using CXR in screening
TB:
• P (D-&T-) = P (D-/T-) * P (T-) = [P (D-&T-) / P (T-)] * P (T-)
= (86/89) * (89/100) = 0.86
• Additive Rule: used with the word "or";
it has 2 cases:
1.Mutually exclusive events: these 2 events cannot
occur at the same time, e.g. life or death, male or
female; so the occurrence of one event
excludes the occurrence of the other.
• P (A or B) = P (AUB) = P (A) + P (B)
• e.g. the probability to deliver a male or a female is
50% + 50% = 100%
• e.g. from the table of S. cholesterol, if we want to
choose a person randomly with a level <160 or
≥360, the probability will be .03 + .012 = .042
= 4.2%
2.Non-mutually exclusive events: used
when there are 2 events which can
occur together.
• P (A or B) = P (AUB) = P (A) + P (B) – P
(A∩B)
• From the e.g. of TB, what is the
probability to have a person that is
disease-free (D-) or CXR negative (T-)?
• P (D- U T-) = P (D-) + P (T-) – P (D-∩T-)
= .90 + .89 - .86 = .93 = 93%
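As a check on the additive rule for non-mutually exclusive events, the TB example can be computed in two ways: with the formula, and by directly counting the cells that are D- or T-. The variable names below are assumptions for this sketch only.

```python
# TB table cells: a = D+T+, b = D-T+, c = D+T-, d = D-T-
a, b, c, d = 7, 4, 3, 86
n = a + b + c + d

p_d_neg = (b + d) / n               # P(D-) = 0.90
p_t_neg = (c + d) / n               # P(T-) = 0.89
p_both_neg = d / n                  # P(D- and T-) = 0.86

# Additive rule: subtract the overlap so that cell d is not counted twice
p_either = p_d_neg + p_t_neg - p_both_neg

# Direct count: every cell except a (D+T+) is D- or T-
p_either_by_count = (b + c + d) / n

# Multiplicative rule for independent events: two successive boys
p_two_boys = 0.5 * 0.5

print(round(p_either, 2), round(p_either_by_count, 2), p_two_boys)
```

Both routes give 0.93, which shows why the joint probability has to be subtracted once.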
• EXAMPLE 1: the table below summarizes results of
a study to evaluate the gonodectin (Gd) test as a
diagnostic test for gonorrhea in men. The study
involved 240 men with symptoms of exudative
urethritis who were seen at a medical facility for
the diagnosis and treatment of sexually
transmitted diseases. Urethral discharge
specimens obtained from each of the men were
examined by culture and by the Gd test. Use
the results obtained from the 240-man sample to
answer the questions about the population of
men visiting the clinic.
                         Culture results (Disease Status)
                         +       -       total
Gd test          +       175     9       184
(Test Result)    -       8       48      56
total                    183     57      240

A. What is the probability that a man has gonorrhea or P(D+)?
B. What is the probability that a man has a positive Gd test or P(T+)?
C. What is the probability that a man has a positive Gd test and
gonorrhea, or P(T+ and D+)?
D. What is the probability that a man has a negative Gd test and does
not have gonorrhea, or P(T- and D-)?
E. What is the probability that a man with gonorrhea has a positive Gd
test, or P(T+/D+)?
F. What is the probability that a man who does not have gonorrhea
has a negative Gd test, or P(T-/D-)?
G. What is the probability that a man who does not have gonorrhea
has a positive Gd test, or P(T+/D-)?
H. What is the probability that a man with gonorrhea has a negative
Gd test, or P(T-/D+)?
I. What is the probability that a man with a positive Gd test has
gonorrhea, or P(D+/T+)?
J. Based on data in the above table, what is the probability that a man
who visits the clinic has a positive Gd test or gonorrhea, or P(T+ or
D+)?

• EXAMPLE 2: an outbreak of food
poisoning occurred in a group of students who
attended a back-to-school party. The
following table summarizes data obtained
from 200 students who were at the party.
                          Illness state
                          +       -       Total
The state of      +       90      30      120
eating Barbecue   -       20      60      80
total                     110     90      200
PROBABILITY DISTRIBUTION

• There are two types:


• Probability distribution of continuous variables.
• Probability distribution of discrete variables.
• If we have a set of observations of a continuous variable grouped
in certain class intervals, we can represent them by a histogram and
a frequency polygon. But suppose the no. of observations
is huge (see figure below) and the class
interval is very small; then the frequency polygon will take the
shape of a very smooth curve, and that curve is called the
"normal distribution curve"
• NORMAL DISTRIBUTION (of continuous
variables): it is the most important and
most widely used distribution in statistics.
Its parameters are the population mean (μ),
which is the measure of central tendency,
and the standard deviation (δ), which is the
measure of dispersion.
• Characteristics of normal distribution curve include:
1.It is used for continuous variables only.
2.Symmetrical around its mean i.e. the right side is equal to
the left.
3.The mean, mode and median are equal.
4.Total probability under the curve (area under the curve -
AUC ) equals to one.
5.50% of AUC lies to the right of the mean & 50% to the
left.
6.Probability limits around the mean: if you move one
standard deviation (δ) away from the mean on each side,
the AUC limited by ± 1δ equals 68% of the total AUC, &
so: μ ± 1δ → 68%, μ ± 2δ → 95%, μ ± 3δ
→ 99.7%. As 99.7% ≈ 1 (or 100%), almost the whole AUC
lies within a span of 6δ (3 on each side of μ).
Characteristics of normal
distribution curve (continued)
7.Different values of μ and δ change the position
and the shape of the curve. If we
change μ while keeping δ constant, the
curve will shift to the right on increasing μ
& to the left on decreasing μ. On changing
δ and keeping μ constant, the curve will
become flatter on increasing δ and
narrower on decreasing δ, without any
shift of the curve to either side.
• Example: If population mean of systolic blood
pressure is 120 mmHg with population standard
deviation of 10 mmHg. What is the probability of
getting a patient with systolic BP
• a) Between 120 and 130 mmHg
• b) < 120mmHg
• c) < 100 mmHg
• d) > 140mmHg
• e) Between 120 and 125 mmHg?
• f) > 135mmHg
Answers
A. From 120 to 130 we move one standard
deviation, so the probability is 34% (0.34)
(i.e. half the 68%).
B. Probability of less than 120 mmHg is
50%.
C. 100 mmHg is 2δ below the mean, so the probability
of less than 100 mmHg is 2.5% (half of the 5%
lying outside μ ± 2δ).
D. 140 mmHg is 2δ above the mean, so the probability
of more than 140 mmHg is also 2.5%.
• For (e) & (f), the probability of SBP between 120 and
125 mmHg, and of more than 135 mmHg, we must
use the Z scale (solution continues below).
• Z - Scale: any normal distribution can be
converted to the Z scale provided that the mean
and the standard deviation are known, and for
each value of z there is a specified probability
given in the z table (below). So
any continuous variable that follows the normal
distribution can be placed on the Z scale, e.g. weight,
height, s. cholesterol, etc.
• Z = no. of standard deviations from μ to x = (x - μ) / δ,
where x is the value whose probability you want to know.
• If we go back to the previous example of SBP:
• Z = no. of δ from μ to x:
(120-120)/10 = 0 for 120
(130-120)/10 = 1 for 130
(140-120)/10 = 2 for 140
(110-120)/10 = -1 for 110
(100-120)/10 = -2 for 100
(125-120)/10 = 0.5 for 125
(135-120)/10 = 1.5 for 135
• For the proportion of the population with
SBP lying between 120 and 125 mmHg: z
= (125-120)/10 = 0.5, then from the Z-table we
take the probability listed for a z value
of 0.5, which is 0.6915.
• Z-table: contains numbers (probabilities)
that represent the AUC over specific
intervals, keeping in mind that the whole
AUC = 1.
N.B: the probability listed for a z value represents the area from that
point to negative infinity (see figure below).
N.B: Z score is also called Critical Ratio.
E. The probability for the area limited
between z = 0 (for 120, where the z table gives
a probability of 0.5) and
z = 0.5 (for 125, where the z table gives
a probability of 0.6915)
= 0.6915 – 0.5 = 0.1915 = 19.15% =
proportion of people having SBP between
120 & 125.
F. For more than 135 mmHg, z = 1.5, where the z table
gives a probability of 0.9332 (area to - ∞), so the
probability = 1 – 0.9332 = 0.0668 = 6.68%.
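The z-table lookups for the SBP example can be reproduced with the error function from Python's standard library. This is a minimal sketch: the helper names `phi` and `z_of` are assumptions, and `phi(z)` plays the role of the table (area from - ∞ to z).

```python
import math

def phi(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sd = 120.0, 10.0                     # population mean and SD of SBP (mmHg)

def z_of(x):
    """Number of standard deviations from the mean to x."""
    return (x - mu) / sd

answers = {
    "a": phi(z_of(130)) - phi(z_of(120)),   # between 120 and 130
    "b": phi(z_of(120)),                    # below 120
    "c": phi(z_of(100)),                    # below 100
    "d": 1 - phi(z_of(140)),                # above 140
    "e": phi(z_of(125)) - phi(z_of(120)),   # between 120 and 125
    "f": 1 - phi(z_of(135)),                # above 135
}
for label, p in answers.items():
    print(label, round(p, 4))
```

The exact areas for (c) and (d) are about 0.0228 rather than 2.5%, because the 68/95/99.7 limits are rounded rules of thumb (95% actually corresponds to ± 1.96δ).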
• Distribution of the sample mean:
• When sampling is from a normally distributed
population, the mean of the sample will
follow the normal distribution. For other populations,
the distribution of the sample mean approximates
normality as the sample size increases, i.e. its curve
becomes similar to the normal distribution curve (NDC),
e.g. the mean of a sample of 1000 persons will follow
the normal distribution more closely than that of a
sample of 5 persons.
• Z = (x - μ)/δ is the original rule (usually at population
level). The variance of the sample mean is δ²/n, so its
standard deviation is δm = δ/√n, and at sample level the
rule will be
• Z = (m – μ) / (δ/√n)
• Example: the cranial length of a certain
large population is normally
distributed with μ of 185.6mm and δ of
12.7mm. What is the probability that a
random sample of n = 10 from this
population will have m > 190mm?
• Z = (m – μ) / (δ/√n) = (190-185.6) /
(12.7/√10) ≈ 1.09
• From the z-table, a z value of 1.09 gives a
probability of 0.8621 (area to - ∞)
• Then 1 - 0.8621 = 0.1379 = 13.79%
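The cranial length example can be checked without a table. In this sketch `phi` is a stand-in for the z-table, built from `math.erf`; the variable names are assumptions.

```python
import math

def phi(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sd = 185.6, 12.7        # population mean and standard deviation (mm)
n, m = 10, 190.0            # sample size and the sample mean of interest

standard_error = sd / math.sqrt(n)     # SD of the sample mean
z = (m - mu) / standard_error
p_greater = 1 - phi(z)                 # P(sample mean > 190)

print(round(z, 3), round(p_greater, 4))
```

The unrounded z is about 1.096, so the computed probability (about 0.137) differs slightly from the table answer of 0.1379, which was read at z = 1.09.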
• Distribution of difference between two
sample means:
• Take two normally distributed populations with
means of (μ₁) & (μ₂) and variances of (δ₁²)
& (δ₂²) respectively. The sampling
distribution of the difference (m₁-m₂)
between the means of independent
samples of size n₁ & n₂ drawn from these
populations is normally distributed with
mean of (μ₁-μ₂) and variance of [(δ₁²/n₁) + (δ₂²/n₂)].
• Mean of (m₁-m₂) = μ₁-μ₂
• Variance of (m₁-m₂) = (δ₁²/n₁) + (δ₂²/n₂) →
standard deviation of (m₁-m₂) = √[(δ₁²/n₁) + (δ₂²/n₂)]
Then
Z = [(m₁-m₂) – (μ₁-μ₂)] / √[(δ₁²/n₁) + (δ₂²/n₂)]
• Example: the level of vitamin A in the
liver of two human populations is normally
distributed. The variance of population 1
(δ₁²) is 19600 and the variance of population 2
(δ₂²) is 8100. What is the probability that a
random sample of size 15 from the 1st
population (n₁) and of size 10 from the 2nd
population (n₂) will give a value of (m₁-m₂)
≥ 50 if there is no difference between
the two population means (μ₁-μ₂ = 0)?
• Z = [(m₁-m₂) – (μ₁-μ₂)] / √[(δ₁²/n₁) + (δ₂²/n₂)]
• = [50 - 0] / √[(19600/15) + (8100/10)] =
1.09
• 1.09 → using the z table, probability towards
- ∞ = 0.8621
• 1 - 0.8621 = 0.1379
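The vitamin A example follows the same pattern with the standard error of a difference. A minimal sketch, again using an `erf`-based `phi` as an assumed replacement for the z-table:

```python
import math

def phi(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

var1, var2 = 19600.0, 8100.0      # population variances
n1, n2 = 15, 10                   # sample sizes
observed_diff = 50.0              # value of m1 - m2 of interest
assumed_diff = 0.0                # mu1 - mu2 under "no difference"

se_diff = math.sqrt(var1 / n1 + var2 / n2)    # SD of (m1 - m2)
z = (observed_diff - assumed_diff) / se_diff
p_at_least = 1 - phi(z)                       # P(m1 - m2 >= 50)

print(round(z, 2), round(p_at_least, 4))
```

The computed probability is about 0.139; the table answer of 0.1379 comes from rounding z to 1.09 before the lookup.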
• Distribution of sample proportion:
• The distribution is binomial, but for larger
samples (≥ 30) the distribution will be
approximately normal and will
have the following characteristics: mean of p = P,
variance of p = {P(1-P)}/n & standard deviation of p = √[{P(1-P)}/n], so
• Z = (p-P) / √[{P(1-P)}/n]
• Example: suppose in a certain human
population the proportion of color
blindness is 8% (P=0.08). If we randomly
select 150 individuals from this population,
what is the probability that the proportion
of color blindness in the sample will be
greater than 15% (p=0.15)?
• Z = (p-P) / √[{P(1-P)}/n]
= (0.15-0.08) / √[{0.08(1-0.08)}/150]
≈ 3.16 → 0.9992 from z-table
• Probability = 1 – 0.9992 = 0.0008 [this probability is
negligible, so such a sample result would be very
significant].
• Distribution of differences between two
sample proportions:
• Characterized by:
• Mean of (p₁-p₂) = P₁-P₂
• Variance of (p₁-p₂) = {P₁(1-P₁)}/n₁ + {P₂(1-P₂)}/n₂ →
standard deviation of (p₁-p₂) = √[{P₁(1-P₁)}/n₁ + {P₂(1-P₂)}/n₂]
• Z = [(p₁-p₂)-(P₁-P₂)] / √[{P₁(1-P₁)}/n₁ + {P₂(1-P₂)}/n₂]
• Example: a population of teenagers has
proportions of 10% for obese boys (P₁)
& of 10% for obese girls (P₂). What is the
probability that a random sample of 250
boys (n₁) and of 200 girls (n₂) will yield a
value of (p₁-p₂) ≥ 0.06?
• Z = [(p₁-p₂)-(P₁-P₂)] / √[{P₁(1-P₁)}/n₁ + {P₂(1-P₂)}/n₂]
• = [0.06-(0.1-0.1)] / √[{0.1(1-0.1)}/250 +
{0.1(1-0.1)}/200] = 2.11
• 0.9826 from z-table → 1 – 0.9826 =
0.0174
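Both proportion examples (color blindness and teenage obesity) can be verified with two small helpers. The function names `z_one_proportion` and `z_two_proportions` are assumptions for this sketch, and `phi` stands in for the z-table.

```python
import math

def phi(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_one_proportion(p, P, n):
    """Z for a sample proportion p when the population proportion is P."""
    return (p - P) / math.sqrt(P * (1 - P) / n)

def z_two_proportions(diff, P1, n1, P2, n2):
    """Z for a difference p1 - p2 when the population proportions are P1, P2."""
    se = math.sqrt(P1 * (1 - P1) / n1 + P2 * (1 - P2) / n2)
    return (diff - (P1 - P2)) / se

z_color = z_one_proportion(0.15, 0.08, 150)            # color blindness example
p_color = 1 - phi(z_color)

z_obese = z_two_proportions(0.06, 0.10, 250, 0.10, 200)  # obesity example
p_obese = 1 - phi(z_obese)

print(round(z_color, 2), round(p_color, 4))
print(round(z_obese, 2), round(p_obese, 4))
```

The normal approximation is used here exactly as in the notes, so it carries the same large-sample assumption (n ≥ 30).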
• Q1. Suppose it is known that the response
time of healthy subjects to a particular
stimulus is normally distributed with a
mean of 15 seconds and a variance of 16.
What is the probability that a random
sample of 16 subjects will have a mean
response time of 12 seconds or more?
• Q2. It is known that 35 percent of the members
of a certain population suffer from one or more
chronic diseases. What is the probability that in
a sample of 200 subjects drawn at random from
this population 80 or more will have at least one
chronic disease?
• Values from the z-table (area from - ∞ to z):
• z = 1.5 → 0.9332
• z = 1.8 → 0.9641
• z = 1.47 → 0.9292
• z = -3.3 → 0.0005
• z = -3 → 0.0013
• z = -2.9 → 0.0019
Estimation

• Statistical Inference: it is an inference
about the population made on the basis of
results obtained from a sample drawn from
that population. Ways to make an
inference are either hypothesis testing or
estimation. Estimation is either point
estimation or interval estimation.
• Point Estimation:
• It is a single numerical value obtained
from a random sample, used to estimate
the corresponding population value
(parameter).
• The sample statistics, namely: mean (m),
variance (S²) and standard deviation (S),
are the best point estimates for the population
parameters, namely: mean (μ), variance
(δ²) and standard deviation (δ)
respectively.
• The point estimate alone is not sufficient because
these statistics (m, S² & S) cannot be
expected to be exactly equal to the corresponding
parameters (μ, δ² & δ). This difference
can be estimated by the standard error (S.E),
which measures the dispersion of the
sample mean around the population mean
and thus the variability (&
representativeness) of the sample.
• Interval Estimation (Confidence Interval [C.I.]):
• It consists of 2 numerical values
(upper & lower values) defining the interval
within which the unknown parameter lies
with a certain degree of confidence.
• These values (upper & lower values)
depend upon the confidence level, which is
equal to 1 - α, where α is the probability of
error.
• Total AUC = 1 → from the figure below: if α
(probability of error) is 10% (or 0.1), then
α/2 = 0.05 will be on each side of the
confidence interval, and the confidence
level 1 - α will be 0.9 or 90%. Similarly, for
α = 0.05, α/2 = 0.025 & the confidence level is
95%, and for α = 0.01, α/2 = 0.005 & the
confidence level is 99%.
• To estimate the upper and lower limits of
the confidence interval:
• Estimator ± (reliability coefficient *
standard error)
• Where:
Parameter    Estimator    Standard error
μ            m            δ/√n
μ₁-μ₂        m₁-m₂        √[(δ₁²/n₁) + (δ₂²/n₂)]
P            p            √[{p(1-p)}/n]
P₁-P₂        p₁-p₂        √[{p₁(1-p₁)}/n₁ + {p₂(1-p₂)}/n₂]
• The reliability coefficient for estimates drawn
from the population or from large samples is the
value of Z corresponding to the confidence
level: 1.645 for 90%CI, 1.96 for 95%CI,
and 2.58 for 99%CI.

• The confidence interval (CI) is central and
symmetrical around (m), so that there is an
α/2 chance that the parameter is more
than the upper limit & an α/2 chance that it is
lower than the lower limit.
• Example 1: the mean of indirect serum
bilirubin level for 16 four-day old infants
was found to be 5.98mg/dl; the population
standard deviation is 3.5mg/dl. Assuming
normality, find the 90%, 95% & 99% C.I for μ.
• C.I for μ = m ± (Z * δ/√n)
• 90%C.I for μ = 5.98 ± (1.645 * 3.5/√16) =
4.54 to 7.42
• 95%C.I for μ = 5.98 ± (1.96 * 3.5/√16) =
4.265 to 7.695
• 99%C.I for μ = 5.98 ± (2.58 * 3.5/√16) =
3.72 to 8.24
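The three intervals of Example 1 can be generated with one loop over the reliability coefficients. A minimal sketch; the dictionary names `z_values` and `intervals` are assumptions for the illustration.

```python
import math

m, sd, n = 5.98, 3.5, 16                      # sample mean, population SD, sample size
standard_error = sd / math.sqrt(n)            # 3.5 / 4 = 0.875

# Reliability coefficients (Z) quoted in the notes for each confidence level
z_values = {90: 1.645, 95: 1.96, 99: 2.58}

intervals = {level: (m - z * standard_error, m + z * standard_error)
             for level, z in z_values.items()}

for level, (lower, upper) in intervals.items():
    width = upper - lower
    print(f"{level}% C.I: {lower:.3f} to {upper:.3f} (width {width:.3f})")
```

Printing the width next to each interval makes it easy to see how the interval widens as the confidence level rises.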
• From this example we can notice that the
width of the confidence interval is directly
related to the level of confidence: narrowest
with 90% (but a less reliable level of
confidence, i.e. higher probability of error)
and widest with 99% (a highly confident
estimation but a very wide range), and that's
why the 95% level of confidence is the most
practical to use.
• Another factor affects the width of the
confidence interval. Although it is not
illustrated in this example, it is obvious
from the rule of the confidence interval that
the width of the interval is inversely related
to the square root of the sample size, so we
can decrease the width of the interval by
taking larger samples whenever it is
feasible.
• N.B) a small sample size means a large
standard error and consequently a
wide confidence interval, so the
explanation of a wide confidence interval
is a small sample size &/or a high
confidence level (99%).
• Example 2: a sample of 10 12-year old
boys of mean height 59.8 inches &
population standard deviation of 2 inches
and a sample of 10 12-year old girls of
mean height 58.5 inches & population
standard deviation of 3 inches. Assuming
normality; find 90%C.I for the difference in
means of height between girls & boys at
this age.
• Example 3: in a survey, 300 adults were
interviewed, 123 said that they had yearly
medical check up. Find the 95%C.I for the
proportion of adults having yearly check
up.
• Example 4: 200 patients suffering from a certain
disease were randomly divided into two equal
groups (i.e. n1 =n2 =100), the 1st group received
new treatment, 90 patients recovered within 3
days, out of 2nd group who received the standard
treatment 78 patients recovered within 3 days.
Find the 95%C.I for the difference in proportions
between the two treatments' groups.

Chi Square (X²) Distribution & Chi Square
Test
• The most widely used X² test is the test of
independence.
• Properties:
1. It is a non-parametric test (deals with
frequencies).
2. It is one of the most widely used tests in
statistical application.
3. Derived from the normal distribution.
4. X² assumes values between zero and +
∞, i.e. no negative values, and has a one
tailed curve.
5. X² relates to the frequencies of
occurrence of individuals or events in the
categories of one or more variables.
6. X² is used to test the agreement between
the observed frequencies with certain
characteristics and the frequencies expected
under a certain hypothesis.
• GENERALLY: the hypothesis tested is that two
criteria of classification, when applied to the same
set of entities, are independent (no association). In other
words, if a sample of size n is drawn from a
population, the entities are cross classified
according to their frequency of occurrence on the basis of
the two variables of interest (X & Y). The
corresponding cells are formed by the
intersection of the rows and columns, & the
constructed table is a contingency table, as
the adjacent cells are interrelated.
                Smoking (Y1)    No (Y2)    Total
M.I (X1)        a               b          a+b
Not (X2)        c               d          c+d
totals          a+c             b+d        a+b+c+d
• Hypothesis and conclusion are stated
in terms of association or lack of
association of the two variables (H0: no
association & HA: there is an association).
• Critical value = tabulated X² = the X² value for the
given d.f. at (1-α), read from the X² table.
• N.B: the X² distribution curve is a single tail
curve, so α is not divided by 2.
• Tabulated X² = 3.841 for a 2X2 table with α
= 0.05 (X² for d.f. = 1 at 0.95)
• d.f = (r-1)(c-1), where r = no. of rows & c
= no. of columns, so d.f. always equals
1 in a 2X2 table.
• Calculated X² = ∑ [(O - E)² / E], where for each
cell:
• O = observed frequency in the table
• E = expected frequency = (row marginal total *
column marginal total) / grand total
• The calculation of expected frequencies is based
on probability theory.
• e.g. E for cell a = (a + b)(a + c) / (a + b + c + d)
• Or calculated X² = [n(ad - bc)²] / [(a + b)(c + d)
(a + c)(b + d)], but this rule is limited to the 2X2
table only
• We have many types of Chi-square
test, but the most commonly used one
is:
• Test for independence: to test the null
hypothesis that two criteria of
classification, when applied to the
same set of entities, are independent.
Here both rows and columns are
randomly selected.
• Example:
• For a relation between smoking and M.I, 2
groups were taken as M.I and non-M.I and were
asked about smoking exposure:
                Smoking    No    Total
M.I
Non-M.I
totals
H0: M.I & smoking are independent.


• A group of 350 adults who participated in
a health survey were asked whether or not
they were on a certain diet. The
response by gender was as follows (expected
frequencies in brackets):
                Male            Female          Total
On diet         14 (19.3)       25 (19.7)       39
Not on diet     159 (153.7)     152 (157.3)     311
totals          173             177             350
H0: diet & gender are independent. HA: they are dependent.
• E(a) = 173*39/350 = 19.3
• E(b) = 177*39/350 = 19.7
• E(c) = 173*311/350 = 153.7
• E(d) = 177*311/350 = 157.3
• Tabulated X² (d.f. = 1, at 0.95) = 3.841 = critical
point
• Calculated X² = ∑ [(O - E)²/E]
= (14-19.3)²/19.3 + (25-19.7)²/19.7 + (159-
153.7)²/153.7 + (152-157.3)²/157.3 =
3.243
• Or calculated X² = n(ad - bc)² / [(a + b)(c +
d)(a + c)(b + d)]
= 350[(14*152)-(25*159)]²
/ (173*177*39*311) ≈ 3.21 (the small difference from
3.243 is due to rounding of the expected frequencies)
• As calculated X² < tabulated X², we accept
H0 & reject HA, so there is no significant
association between gender and diet.
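The expected frequencies and the calculated X² for the diet table can be reproduced as follows. This is a sketch in plain Python; the function name `chi_square` is an assumption, and the critical value 3.841 is the tabulated one quoted above.

```python
def chi_square(table):
    """Return (calculated X2, expected frequencies) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E = row marginal total * column marginal total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(table, expected)
                    for o, e in zip(obs_row, exp_row))
    return statistic, expected

diet = [[14, 25],      # on diet:     male, female
        [159, 152]]    # not on diet: male, female

statistic, expected = chi_square(diet)
critical = 3.841                       # tabulated X2, d.f. = 1, alpha = 0.05
significant = statistic > critical

print(round(statistic, 3), [[round(e, 1) for e in row] for row in expected])
print("reject H0" if significant else "do not reject H0")
```

Because the function keeps unrounded expected frequencies, its result (about 3.21) agrees with the shortcut formula rather than with the hand calculation that used rounded values. The same function works for any r x c table of counts.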
• EXAMPLE: two randomly selected
samples: 50 children with leukemia (30 M and
20 F), and another 50 healthy children (24
M and 26 F). Is the occurrence of leukemia
affected by gender?
• Solution: here we apply the chi square test for
independence. 1st we must arrange a 2x2 table
(expected frequencies in brackets):
                M          F          Total
Leukemia        30 (27)    20 (23)    50
Healthy         24 (27)    26 (23)    50
Totals          54         46         100
• Expected value for each cell =
product of its marginal totals / grand total
• So for cell a, E = 50*54/100 = 27, & so
23 for cell b, 27 for c & 23 for d.
• Data: the two randomly selected samples,
1st of 50 leukemic children consisting of 30
M & 20 F, and 2nd of 50 healthy
children consisting of 24 M & 26 F.
• Assumption: the two samples represent 2
independent groups taken from 2
independent populations.
• Hypotheses:
• HO: no significant difference in M & F
frequencies with and without leukemia, i.e. there
is no association between leukemia and gender
type.
• HA: there is a significant difference, i.e. there is
an association between leukemia and gender type.
• Level of significance: α = 0.05 → a 5% probability
is allowed for the effect of chance, leaving 95% for
the effect of the studied factor.
• d.f. = (r-1)(c-1) = (2-1)(2-1) = 1*1 = 1
• Critical point = tabulated X² (d.f. = 1, at 1-α = 0.95) =
3.841
• Testing for significance:
• Calculated X² = ∑ [(O - E)²/E]
• = (30-27)²/27 + (20-23)²/23
+ (24-27)²/27 + (26-23)²/23 = 1.449
• As calculated X² < tabulated X² → p > 0.05,
so we accept Ho and reject HA
• Conclusion: there is no association
between leukemia & gender type.
• Exercise 1: A study included 90
normotensive patients and 10
hypertensive patients. 33% of
the normotensive and 30% of
the hypertensive died. Can we say that
hypertension is a bad prognostic
sign? Or: test the significance of
blood pressure as a prognostic sign.
(d.f. = 1, tabulated value 6.6)
• Exercise 2: in a clinical trial involving two drugs
A & B, used in reducing the level of anxiety in a
certain type of emotionally disturbed persons, a
random sample of 100 emotionally disturbed
persons were given drug A and 85
experienced a reduction in anxiety level. Drug B
was effective with 105 of a sample of 150
emotionally disturbed subjects. Using an alpha level
of 0.01, test the hypothesis of differential
response to the two drugs. (Tabulated value =
6.635) key: this tabulated value is for chi square
with d.f. of one & alpha of 0.01.
• N.B) you can use either the chi square test or the test for
difference (for the proportion with reduced anxiety, but the
tabulated value here will be for Z, which is 2.58).
• Exercise 3: a study was conducted on
100 persons, 20% of them were with
scabies. 90% of those with scabies
experienced itching in a warm climate, while
only 10% of the healthy persons had this
complaint. Do these data provide sufficient
evidence of an association between scabies and
itching in a warm place? Use α = 0.01 & tabulated
value 6.63
Linear Correlation & Regression Models
Correlation Model:
• The objective is to find a measure of the relationship
between two random variables (X & Y).
• The form of the relationship between two
variables can usually be presented in a scatter
diagram, which is a graph that summarizes the
relationship between the two variables. The X
axis is traditionally the horizontal axis and
represents the independent variable. The
vertical axis normally represents Y, the
dependent variable.
• From the above graphs, we find that there
could be a relation between the two
variables. This relation must be confirmed
mathematically (usually using the correlation
coefficient) and measured with the
regression model.
• Criteria for correlation: the two variables
(characteristics) must be:
– Quantitative
– Changeable
– From the same population or sample
– With some association (relation)
Data representation and organization
1. Data used in correlation and regression
analysis consist of pairs of measurements
made on the same unit of observation (most
often the same study subject), e.g. SBP & S.Ch.
2. The pairs are denoted symbolically (x, y):
x typically represents the independent var.
y represents the dependent var.
3. In epidemiological studies the indep. var. (x) is
often a suspected risk factor (e.g. low fiber diet)
and the dep. var. (y) is the occurrence of
disease or other outcome (colorectal CA).
4. In experimental studies values of the
independent var. are fixed by the
investigator, e.g. in a study of the
efficacy of a new antiviral drug the
investigator selects the dosages (the
independent var.) to be administered,
thus dosage is a fixed, not a random,
variable.
Pearson Correlation Coefficient (r):
A measure of the linear (straight-line)
relationship between two interval-level
variables. This coefficient defines both the
strength and the direction of the linear
relationship between the two variables.
• r = [n∑xy - (∑x)(∑y)] / [√{n∑x² - (∑x)²} √{n∑y² - (∑y)²}]
• Its value lies between -1 and +1. The size
(absolute value) of the correlation coefficient
indicates the strength of the relationship:
1 means perfect linear correlation.
> 0.9 very strong
0.7 – 0.9 strong
0.4 – 0.7 moderate
0.2 – 0.4 weak
< 0.2 very weak
0 means no linear correlation.
e.g. +1 (perfect direct or positive linear correlation)
-0.7 (moderate negative or inverse linear correlation)
• The sign of the correlation coefficient
indicates the direction of the
relationship. A positive correlation
indicates that high scores of one variable
are associated with high scores of the
other variable. A negative correlation
indicates that high scores of one variable
tend to be associated with low scores
of the 2nd variable.
• Testing the significance of (r) is done by one
of:
1. Using the t-test:
Calculated t = r √[(n - 2) / (1 - r²)], compared with the
tabulated t with (n - 2) degrees of freedom at 1 - α/2
(two-sided test), e.g. 23 d.f. and 0.975 for n = 25 & α = 0.05.
2. Comparing the calculated (r) with the tabulated
one (for n pairs at level α).
• Where n = number of pairs for both testing
methods.
• N.B.) a significant correlation does not in any way
mean a cause-effect relationship, but
merely a statistical linear relationship.
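As a minimal sketch of the two formulas above, the following Python computes r and the calculated t for the 11 (x, y) pairs of Example 2 in this lecture. The function names are hypothetical, and the tabulated value 2.262 (9 d.f., two-sided α = 0.05) is taken from a standard t table rather than from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from the raw sums, following the formula on the slide."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

def t_statistic(r, n):
    """Calculated t for testing r, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Data of Example 2: new test (x) and standardized test (y).
x = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
y = [61, 61, 59, 71, 80, 76, 90, 106, 98, 100, 114]

r = pearson_r(x, y)
t = t_statistic(r, len(x))
T_TABULATED = 2.262  # assumption: t table, 9 d.f., 0.975
print(round(r, 3), round(t, 2), t > T_TABULATED)
```

Here r comes out at about 0.956, so the relationship would be described as very strong and direct on the scale above.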
The Regression Model (For Simple Linear
Regression):
• It aims at estimating or predicting the
value of one variable corresponding to a
given value of another variable.
• y = a + bx
• Where:
• “x” is the independent variable, i.e.
mathematically it is pre-selected (non-random).
• “y” is the dependent variable.
• “a” = the intercept, i.e. the point where the
regression line crosses the Y axis
= value of y when x is zero
= (∑y - b∑x) / n = my - b·mx (m = mean)
• “b” = average change in y for each unit
change in x (slope)
= [n∑xy - (∑x)(∑y)] / [n∑x² - (∑x)²]
Simple linear regression helps in:
• a) Ascertaining the probable form of
the relationship between variables.
• b) Predicting or estimating the value
of one variable corresponding to a
given value of another variable.
• Example 2: patients' scores on a standardized test
(y) and a new test (x).
           x      y      x²      y²      xy
          50     61    2500    3721    3050
          55     61    3025    3721    3355
          60     59    3600    3481    3540
          65     71    4225    5041    4615
          70     80    4900    6400    5600
          75     76    5625    5776    5700
          80     90    6400    8100    7200
          85    106    7225   11236    9010
          90     98    8100    9604    8820
          95    100    9025   10000    9500
         100    114   10000   12996   11400
Total    825    916   64625   80076   71790
• b = [n∑xy - (∑x)(∑y)] / [n∑x² - (∑x)²]
= [(11)(71790) - (825)(916)] / [(11)(64625) - (825)²] = 1.1236
• a = (∑y - b∑x) / n = [(916) - (1.1236)(825)] / 11 = -0.9973
• y = a + bx = -0.9973 + 1.1236 (x)
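The calculation of Example 2 can be checked with a short Python sketch. The function name `fit_line` is hypothetical, and the prediction at x = 80 is only an illustration, not part of the original example. Note that with the unrounded slope the intercept comes out at -1.0; the slide's -0.9973 results from rounding b to 1.1236 before computing a.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Data of Example 2: new test (x) and standardized test (y).
x = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
y = [61, 61, 59, 71, 80, 76, 90, 106, 98, 100, 114]

a, b = fit_line(x, y)
predicted_at_80 = a + b * 80  # illustrative x value inside the observed range
print(round(b, 4), round(a, 4), round(predicted_at_80, 2))
```

Predictions are most trustworthy for x values inside the observed range (here 50 to 100).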


Q1: an experiment was conducted to study the effect of a certain
drug in lowering heart rate in adults. The following data were
obtained for a sample of 10 subjects:

Subject   Dose (mg)   Reduction in HR (b/m)
1         0.50        10
2         0.75         8
3         1.00        12
4         1.25        12
5         1.50        14
6         1.75        12
7         2.00        16
8         2.25        18
9         2.50        17
10        2.75        20

Predict the reduction in heart rate obtained by giving the drug at a dose of 3 mg.
Q2: the following data represent a sample of 7
patients with pneumonia with respect to the
duration of illness (in days) and body
temperature (in °C).

Body temperature (°C)        38.1  38.7  38.6  39.1  38.9  40.1  40.0
Duration of illness (days)   1     2     3     4     5     6     7

- Calculate the correlation coefficient and explain it.
- Determine the regression equation.
- If a patient came after 15 days of pneumonia, can you
predict his body temperature?
• MEASURES OF CENTRAL TENDENCY:
If they are computed from the data of a
sample, they are called statistics, e.g.
sample mean & standard deviation; if
computed from the population, they are
called parameters, e.g. population mean
and standard deviation.
• MEAN: the sum of all observations divided by their
number.
• Advantages:
1. Simple to calculate & to understand.
2. It is unique (single value) & covers a lot of data in a single
number, so we achieve the aim of statistics in summarization.
3. Takes all the values into consideration (i.e. does not
skip any single value).
• Disadvantage: it is affected by extreme values (largest &
smallest values), e.g. if 9 persons each have a yearly income of
1000$, their mean will be strongly affected by a 10th person
with an income of 100,000$.
• Population mean: μ = (∑x) / N,
where μ = population mean, ∑ = summation, & N = no.
of values in the population.
• Sample mean: m = (∑x) / n,
where n = no. of values in the sample.
• MEDIAN: the value that divides the set of data into
two equal parts (i.e. the no. of values above the median
equals the no. of values below it). To find the position of
the median we must arrange the values in an ordered array,
then:
• Position of the median = (n + 1) / 2, if the no. of
observations is odd.
• Positions of the median = n / 2 & (n / 2) + 1 (i.e. 2 positions),
if the no. of observations is even. The median here is
the average of the readings that lie in these two positions.
• e.g. give the median of these values:
• 1st set of data: 5, 15, -7, 20, 25, 3, -1, 0 & -3.
• 2nd set of data: 7, 9, 16, -5, -9, 3, -4, 6.
• Solution: ordered array:
-7, -3, -1, 0, 3, 5, 15, 20, 25 (1st set)
-9, -5, -4, 3, 6, 7, 9, 16 (2nd set)
• 1st set: median position = (9 + 1) / 2 = 5, i.e. the 5th reading,
so median = 3.
• 2nd set: median positions = 8 / 2 = 4 & (8 / 2) + 1 = 5 → readings
3 & 6, so we take their average to have one median for the set:
(3 + 6) / 2 = 4.5.
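The position rule above can be sketched in Python as follows. The function name `median_by_position` is hypothetical; it applies the odd and even rules to the two sets of the example.

```python
def median_by_position(values):
    """Median using the (n + 1) / 2 rule for odd n and the
    n / 2 and (n / 2) + 1 rule for even n."""
    ordered = sorted(values)  # the ordered array
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2          # 1-based position
        return ordered[position - 1]     # lists are 0-based
    lower = ordered[n // 2 - 1]          # reading at position n / 2
    upper = ordered[n // 2]              # reading at position (n / 2) + 1
    return (lower + upper) / 2

first_set = [5, 15, -7, 20, 25, 3, -1, 0, -3]
second_set = [7, 9, 16, -5, -9, 3, -4, 6]

median_first = median_by_position(first_set)    # 3
median_second = median_by_position(second_set)  # 4.5
print(median_first, median_second)
```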
• Advantages of the median: simple to calculate & to
understand, unique & most importantly not
affected by extreme values.
• Disadvantage: it ignores the actual magnitude of all
the values except the middle one(s).
• Both mean & median are used only for quantitative
variables & not for qualitative variables.
• Q/ when should you prefer to use the median over the mean?
• If there are extreme values in a set of data we use the
median, so as not to be misled by the extreme values as we
would be with the mean, e.g. CDH is usually diagnosed in the
early months of life but some cases are diagnosed in the
fifties, so here we use the median age at diagnosis.
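The income illustration given earlier for the mean (9 persons with 1000$ and a 10th with 100,000$) shows this effect numerically. The sketch below uses Python's standard `statistics` module.

```python
from statistics import mean, median

# Nine persons with a yearly income of 1000$ and one with 100,000$.
incomes = [1000] * 9 + [100_000]

income_mean = mean(incomes)      # pulled upward by the single extreme value
income_median = median(incomes)  # stays at the typical income

print(income_mean, income_median)
```

The mean (10,900$) describes none of the ten persons well, while the median (1000$) still reflects the typical income.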

• MODE: it is the most frequent value in a set of data. It is
the only measure that can be used for both qualitative &
quantitative variables. The mode is not unique; there can be no
mode, 1 mode (unimodal), 2 modes (bimodal) or more.
• QUANTILE: a value below which a certain
proportion of observations occur in an ordered
set of data.
• PERCENTILES: the values in an ordered array that
divide the distribution into a hundred equal
parts. The 10th percentile, for example, is a quantile
& is the value below which 10% of
observations lie.
• QUARTILES: values in an ordered array that divide the
distribution into 4 equal parts.
• 1st (lower) quartile: the value below which 25% of
observations lie in an ordered array (25th percentile).
• 2nd quartile: it is the median = 50th percentile
• Upper quartile = 75th percentile
• Inter-quartile range: covers the middle 50% of all
observations (i.e. the observations lying between the 25th &
75th percentiles); its size is the upper quartile minus the
lower quartile.
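As a rough illustration, the quartiles of the 15 travel distances used in the worked example of this lecture can be obtained with `statistics.quantiles` (Python 3.8+). Several conventions exist for locating quartiles; the "exclusive" method used here places the p-th quantile at position p(n + 1), which agrees with the (n + 1) / 2 rule for the median, but it is an assumption that this is the convention intended by the lecture.

```python
from statistics import quantiles

# Travel distances (miles) of the 15 patients in the worked example.
distances = [5, 9, 11, 3, 12, 13, 12, 6, 13, 7, 3, 15, 12, 15, 5]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(distances, n=4, method="exclusive")
iqr = q3 - q1  # inter-quartile range

print(q1, q2, q3, iqr)
```

With n = 15 the positions are the 4th, 8th and 12th readings of the ordered array, giving 5, 11 and 13 miles and an inter-quartile range of 8 miles.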
• MEASURES OF DISPERSION:
• Dispersion is the variation that the values of
observations can have (i.e. how much the data
differ from each other). If all the values are the
same, as in 30, 30 & 30, then the dispersion is
zero. If the values are close together, as in 28, 32,
30 & 29, then the dispersion is small. If the
values are not close, as in 70, 90, 29 & 18, then
the dispersion is large.
• 1- RANGE: the difference between the largest &
smallest values in a set.
R = XL - XS
• Properties of the range:
1. Simple to calculate & easy to understand.
2. It is not based on all observations. It neglects all the
values in the center & depends on the extreme values
only.
3. It is not amenable to further mathematical treatment.
4. It should be used in conjunction with other measures
of variability.
• 2- VARIANCE (V): the average of the squared differences of
each value from the mean (it is the most widely used measure).
• A) Variance of the population
(σ²) = ∑(x - μ)² / N = [N∑x² - (∑x)²] / N²
• B) Variance of the sample
(S²) = ∑(x - m)² / (n - 1) = [n∑x² - (∑x)²] / [n(n - 1)]
• Where: x = value of each unit in the set of observations,
μ = population mean
N = no. of observations in the population
m = sample mean
n = no. of observations in the sample

• Q/ what is the mean & variance for this set of data: 1, 2, 3, 4
and 5?
• Mean = m = 15 / 5 = 3, V = S² = ∑(x - m)² / (n - 1) =
10 / 4 = 2.5
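The two forms of the sample variance formula can be compared on the data 1, 2, 3, 4, 5 with a short Python sketch. The function names are hypothetical; the population formula is applied to the same five values purely for illustration, as if they formed a whole population.

```python
def sample_variance_definition(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computational form: [n * sum(x^2) - (sum x)^2] / [n(n - 1)]."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))

def population_variance(xs):
    """Computational form: [N * sum(x^2) - (sum x)^2] / N^2."""
    big_n = len(xs)
    return (big_n * sum(x * x for x in xs) - sum(xs) ** 2) / big_n ** 2

data = [1, 2, 3, 4, 5]
s2_definition = sample_variance_definition(data)  # 10 / 4
s2_shortcut = sample_variance_shortcut(data)      # 50 / 20
sigma2 = population_variance(data)                # 50 / 25

print(s2_definition, s2_shortcut, sigma2)
```

Both sample forms give 2.5, while dividing by N instead of n - 1 gives the smaller value 2.0.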
• 3- STANDARD DEVIATION (SD): it is the positive square
root of the variance (commonly reported as mean ± SD).
• SD = √S² = S for the sample
• SD = √σ² = σ for the population
• The concept of the SD is to give an idea of the
variability of the values around the mean of the sample
or population. It also has the benefit of removing the
square from the unit, so the SD is expressed in the same
unit as the data.
• Sample variance measures the variability of values
from the sample mean, while population variance
measures it from the population mean.
• 4- COEFFICIENT OF VARIATION (CV):
used to compare the dispersion in two sets of
data even when the units are different. CV has no
unit (it is expressed as a percentage).

CV = (SD / m) * 100

• 5- STANDARD ERROR OF THE SAMPLE
MEAN (SE): it measures the variability of the
sample mean around the population mean and
indicates the degree to which the sample mean
reflects the population mean.

SE = SD / √n
• SE determines the dispersion of the sample mean from
the population mean (i.e. the sampling error), that is,
the representativeness of the sample to the population,
so it gives us an idea of how far the sample mean is
from the population mean.
• From the rule above we can say that the sampling error
(standard error of the sample mean) is inversely related
to the square root of the sample size: the larger the
sample, the smaller the SE.
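The effect of sample size on the SE can be seen in a small Python sketch. The SD of 4.2 miles is borrowed from the travel-distance example of this lecture and is held fixed here, which is a simplifying assumption (in practice the SD changes from sample to sample); the function name is hypothetical.

```python
from math import sqrt

def standard_error(sd, n):
    """SE of the sample mean: SD divided by the square root of n."""
    return sd / sqrt(n)

sd = 4.2  # assumed fixed SD (miles)
se_15 = standard_error(sd, 15)
se_60 = standard_error(sd, 60)    # 4 times the sample size
se_240 = standard_error(sd, 240)  # 16 times the sample size

print(round(se_15, 2), round(se_60, 2), round(se_240, 2))
```

Quadrupling the sample size only halves the SE, because n enters the formula through its square root.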
• Properties of Variance, SD & SE:
1. Are based on all observations.
2. Deviations are taken from the mean.
3. Most widely used measures of
variability.
4. For SD & SE, the unit is the same as for
the mean.
• Properties of CV:
1. Used for the comparison of relative variability of 2
distributions.
2. It measures the level of variability in the data relative to
the average value.
3. It is independent of any unit of measurement, so it is
useful for comparison of variability in 2 distributions
having variables expressed in different units.
4. Takes into account each value of the distribution.
5. Expresses the SD as a percentage of the mean.
• Example: a sample of 15 patients
making visits to a health center
had traveled these distances
(miles). Calculate the measures of
central tendency and dispersion.

Patient   Distance (x)   x²
1          5              25
2          9              81
3         11             121
4          3               9
5         12             144
6         13             169
7         12             144
8          6              36
9         13             169
10         7              49
11         3               9
12        15             225
13        12             144
14        15             225
15         5              25
Total     ∑x = 141       ∑x² = 1575

• Mean = m = ∑x / n = 141 / 15 = 9.4 miles
• Ordered array (3, 3, 5, 5, 6, 7, 9, 11, 12, 12, 12, 13, 13, 15, 15)
• The position of the median = (n + 1) / 2 = 8th reading &
median = 11 miles
• Mode = 12 miles
• Range = 15 - 3 = 12 miles
• Variance = S² = [n∑x² - (∑x)²] / [n(n - 1)]
= [(15)(1575) - (141)²] / [(15)(14)] = 17.8 miles²
• SD = √S² = 4.22 miles
• CV = (SD / m) * 100 = (4.22 / 9.4) * 100 = 44.9%
• SE = SD / √n = 4.22 / √15 = 1.09 miles
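The whole worked example can be re-checked with Python's standard `statistics` module. This is only a verification sketch; `multimode` requires Python 3.8 or later, and the variable names are arbitrary.

```python
from math import sqrt
from statistics import mean, median, multimode, variance

# Travel distances (miles) of the 15 patients.
distances = [5, 9, 11, 3, 12, 13, 12, 6, 13, 7, 3, 15, 12, 15, 5]
n = len(distances)

m = mean(distances)                      # central tendency
med = median(distances)
modes = multimode(distances)             # list of the most frequent values
rng = max(distances) - min(distances)    # dispersion
s2 = variance(distances)                 # sample variance, divides by n - 1
sd = sqrt(s2)
cv = sd / m * 100                        # percent, unitless
se = sd / sqrt(n)

print(m, med, modes, rng)
print(round(s2, 2), round(sd, 2), round(cv, 1), round(se, 2))
```

Carrying the unrounded SD through gives a CV of about 44.9% and an SE of about 1.09 miles; rounding the SD to 4.2 first gives slightly smaller figures.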
Linear Correlation & Regression Models
Example 1: systolic blood pressure readings in mmHg by 2 methods in 25
patients with essential hypertension, as in the table below:

No. Method I (X) Method II (Y) X² Y² XY
1 132 130 17424 16900 17160
2 138 134 19044 17956 18492
3 144 132 20736 17424 19008
4 146 140 21316 19600 20440
5 148 150 21904 22500 22200
6 152 144 23104 20736 21888
7 158 150 24964 22500 23700
8 130 122 16900 14884 15860
9 162 160 26244 25600 25920
10 168 150 28224 22500 25200
11 172 160 29584 25600 27520
12 174 178 30276 31684 30972
13 180 168 32400 28224 30240
14 180 174 32400 30276 31320
15 188 186 35344 34596 34968
16 194 172 37636 29584 33368
17 194 182 37636 33124 35308
18 200 178 40000 31684 35600
19 200 196 40000 38416 39200
20 204 188 41616 35344 38352
21 210 180 44100 32400 37800
22 210 196 44100 38416 41160
23 216 210 46656 44100 45360
24 220 190 48400 36100 41800
25 220 202 48400 40804 44440
totals 4440 4172 808408 710952 757276

H0: population correlation coefficient (ρ) = 0 & HA: ρ ≠ 0

r = [n∑xy - (∑x)(∑y)] / [√{n∑x² - (∑x)²} √{n∑y² - (∑y)²}]
= [(25)(757276) - (4440)(4172)] / [√{(25)(808408) - (4440)²} √{(25)(710952) - (4172)²}]
= +0.955 → very strong direct relationship

t = r √[(n - 2) / (1 - r²)] = 0.955 √[(25 - 2) / (1 - (0.955)²)] = 15.44

Calculated t (15.44) > tabulated t (23 d.f. at 1 - α/2 = 0.975, i.e. 2.0687) → reject H0

b = [n∑xy - (∑x)(∑y)] / [n∑x² - (∑x)²]
= [(25)(757276) - (4440)(4172)] / [(25)(808408) - (4440)²] = 0.822

a = (∑y - b∑x) / n = [(4172) - (0.822)(4440)] / 25 = 20.89

y = a + bx = 20.89 + 0.822 (x)
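The figures of Example 1 can be re-derived directly from the column totals of the table. The sketch below is only a check of the arithmetic under the formulas of this lecture; the variable names are arbitrary.

```python
from math import sqrt

# Column totals from the table of Example 1.
n = 25
sum_x, sum_y = 4440, 4172
sum_x2, sum_y2, sum_xy = 808408, 710952, 757276

sxy = n * sum_xy - sum_x * sum_y   # shared numerator of r and b
sxx = n * sum_x2 - sum_x ** 2
syy = n * sum_y2 - sum_y ** 2

r = sxy / (sqrt(sxx) * sqrt(syy))      # correlation coefficient
t = r * sqrt((n - 2) / (1 - r ** 2))   # calculated t, 23 d.f.
b = sxy / sxx                          # slope
a = (sum_y - b * sum_x) / n            # intercept

print(round(r, 3), round(t, 1), round(b, 3), round(a, 2))
```

From these totals r is about 0.955, t about 15.4, b about 0.822 and a about 20.89, so the calculated t is far above the tabulated 2.0687.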
