NOTES ON
MEDICAL &
ALLIED HEALTH
PROFESSION
EDUCATION:
STATISTICS &
RESEARCH
METHODOLOGY
RESEARCH IN MEDICAL EDUCATION
• Types of research
o Basic research and applied research
o Quantitative research and qualitative research
• Qualitative research:
o Ethnography, cognitive anthropology, etc
o Synthetic rather than analytic
o Generally hypothesis generating
o Investigative methods are non-intrusive
o Data are more impressionistic.
o Research in such a situation is a function of researcher’s insights
and impressions.
[Figure: types of educational research (after Thomas K. Crowl): ethnographic, survey, experimental, quasi-experimental, ex post facto or causal-comparative]
• Descriptive research
o Include quantitative and qualitative researches.
o Methodologies include observations, surveys, self-report and tests.
o May operate on the basis of hypotheses.
o Deals with naturally occurring phenomena.
• Ethnographic research
o Descriptive and qualitative research.
o Report is detailed verbal description.
o Carried out in natural setting.
o Researcher as participant and observer.
• Survey
o Descriptive
o Quantitative study
• Correlational research
o Investigate the relationship between two or more variables.
o Searching the relationship of variables in natural setting.
• Group comparison research
o Comparing the values of two or more groups of population.
• Experimental research
o Random selection of the individuals forming the groups
Experimental group
Control group
• Quasi-experimental research
o A type of group comparison research.
o Groups are not randomly assigned (existing, intact groups are used).
• Ex Post Facto or Causal-comparative study
o Ex post facto is Latin for “after the fact”.
o Values of the independent variable of the two groups are preset (already
present).
2. Quantitative/Qualitative Research:
• Deductive
o Begin with a theory and collect data to test.
• Inductive:
o Begin with observations and attempt to explain by generalizing.
• Deductive reasoning
o A type of logic in which one goes from a general statement to a
specific instance.
• Inductive reasoning:
o Involves going from a series of specific cases to a general
statement.
o The conclusion in an inductive argument is never guaranteed.
• Confirmatory
o Experimental
o Quasi-experimental
o Correlational (non-experimental)
• Exploratory
o Qualitative
3. Qualitative research methods for data collection
• Interviews
• Focus groups
• Survey: open ended questions
• Observations: recorded in field notes
• Document analysis
• What is qualitative data?
o Data in the form of words, rather than numbers, based on:
Asking open ended questions in:
• Interviews
• Group
• Surveys
Examination of documents
• Observation of situations and actions, recorded in field
notes
• Uses of qualitative data
o Some social sciences e.g
Anthropology
History
Psychology
Sociology
Public health
Policy analysis
Health care evaluation
Static group comparison X O1 O2 no control
o True experimental design
Pretest-post test control group design
Exp. Group O1 X O2
R.A
(Random Allocation)
Control O3 O4
Post test control group design
Exp. Group X O1
R.A
Control O2
o Quasi-experimental design
Time series O1 X O2 X O3 X O4
Non-equivalent control group
• Exp group O1 X O2
• Non-equivalent control group O3 O4
Separate sample pretest post test design
• R.A – Pretest group O1 X
• R.A – Post test group X O2
5. Purpose of Medical Education Research:
• To improve the functioning of educational programmes by providing
information for:
o Decision making
o Evaluating outcomes
o Supporting advocacy for change
o Contributing to the body of knowledge related to concepts and
methods.
Research is like a plant that grows and grows and grows and grows…
When it is grown, it throws off seeds of all types (basic, applied and
practical) which in turn sprout and create more research projects…
The process continues with all of the new research ‘plants’ throwing off
seeds, creating additional, related research projects of various types…
STEPS IN RESEARCH
1. Preliminary steps:
• Clarifying the purpose
• Formulating the topic
o State your topic idea as a question
o Identify the main concepts or keywords
General objective – what is the purpose of the study
Specific objective – what are the things you want to find in the
study
o Planning of methods
Study population
• Selection and definition
• Sampling
• Sample size
Variables
• Selection
• Definition
• Scales of measurement
Method of data collection
Method of recording and processing
• Preparing for data collection
o Construction of research instrument
o Pretesting the instrument
• Collecting the data
• Processing the data
• Interpreting the data
• Writing a report
• Who else is concerned about the problem? Are top government
officials concerned? Are medical doctors or other professionals
concerned?
• Are the resources available?
• Are measures available to solve the problem?
Review your answers to these questions, then rank the problems and
arrange them according to the ranking.
[Figure: the research cycle, with stages including problem identification, research question/hypothesis formulation, planning research, data analysis, drawing inference, confirmation or rejection of hypothesis, report writing and dissemination of findings]
SAMPLING METHOD
[Figure: sampling methods divide into non-probability sampling and probability sampling]
3. Sampling:
• Sampling is the process of selecting a sufficient number of elements
from the population, so that a study of the properties or
characteristics of the sample makes it possible to generalize such
properties or characteristics to the population.
5. Probability Sampling:
• Unrestricted sampling:
o More commonly known as simple random sampling.
o Every element in the population has a known and equal chance of
being selected as a subject.
o Advantage:
This kind of sampling method has the least bias.
o Disadvantages:
Cumbersome (difficult) and expensive.
An entirely updated listing (population frame) of the population
may not always be available.
• Restricted (complex) random sampling:
o Offer a viable, and sometimes more efficient alternative to the
unrestricted design.
o Five most common complex probability sampling methods
Systematic sampling
• Drawing every nth element in the population, starting with a
randomly chosen element between 1 and n.
• For example, if we want a sample of 60 students from a total
population of 300 students, we could sample every 5th
student (300/60 = 5), e.g. 3, 8, 13, …, until 60 students are selected.
• The starting number must be selected randomly; for example, we can
take out a one-ringgit note and use the last digit of its serial
number.
Stratified random sampling
• When sub-populations vary considerably, it is advantageous
to sample each sub-population (stratum) independently.
• Stratification is the process of grouping members of the
population into relatively homogenous subgroups before
sampling.
• The strata should be mutually exclusive: every element in the
population must be assigned to only one stratum.
• The strata should also be collectively exhaustive: no population
element can be excluded.
• The random sampling is applied within each stratum.
Cluster sampling
• Cluster sampling is used when natural groupings are evident
in the population.
• The total population is divided into groups or clusters.
• Elements within a cluster should be as heterogeneous as
possible.
• But there should be homogeneity between clusters.
• Each cluster must be mutually exclusive and collectively
exhaustive.
• A random sampling technique is then used on any relevant
clusters to choose which clusters to include in the study.
Area sampling
• One version of cluster sampling is area sampling, or
geographical cluster sampling.
• Clusters consist of geographical areas.
• A geographically dispersed population can be expensive to
survey.
• Greater economy than simple random sampling can be
achieved by treating several respondents within a local area
as a cluster.
Double sampling
• A sampling design where initially a sample is used in a study
to collect some preliminary information of interest, and later
a sub-sample of this primary sample is used to examine the
matter in more detail
• It resembles a pilot study in reverse: double sampling first studies
the whole primary sample, then proceeds to draw the sub-sample of
interest from it.
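The systematic and stratified designs described above can be sketched in a few lines of Python. This is only an illustrative sketch: the student IDs, the two strata, the 20% sampling fraction and the function names are assumptions made for the example, not part of the notes.

```python
import random

def systematic_sample(population, n, rng):
    """Take every k-th element after a random start within the first k."""
    k = len(population) // n          # sampling interval, e.g. 300 // 60 = 5
    start = rng.randrange(k)          # random start index in 0..k-1
    return population[start::k][:n]

def stratified_sample(strata, fraction, rng):
    """Simple random sample of the same fraction within each stratum."""
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        sample.extend(rng.sample(members, size))
    return sample

rng = random.Random(0)                # fixed seed so the sketch is repeatable
students = list(range(1, 301))        # hypothetical student IDs 1..300
sys_sample = systematic_sample(students, 60, rng)

# Hypothetical strata: 200 first-year and 100 final-year students
strata = {"first_year": students[:200], "final_year": students[200:]}
strat_sample = stratified_sample(strata, 0.2, rng)
```

Because the strata here are mutually exclusive and collectively exhaustive, each student can be drawn at most once and every student has a chance of selection.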
6. Non-probability Sampling:
• The elements in the population do not have any probabilities attached
to their being chosen as sample subjects.
• The findings from the study of the sample cannot be confidently
generalized to the population.
• This method is chosen when generalisability is not critical; focus may
be on obtaining preliminary information in a quick and inexpensive
way.
• 2 broad categories:
o Convenience sampling
Collection of information from members of the population who
are conveniently available to provide it.
o Purposive sampling
The sampling is confined to specific types of people who can
provide the desired information, either because they are the only
ones who have it, or conform to some criteria set by the
researcher.
2 types of purposive sampling:
• Judgment sampling
o Involves choice of subjects who are most advantageously
placed or in the best position to provide the information
required.
o Judgment sampling may curtail the generalisability of the
findings because we are using a sample of experts who
are conveniently available to us.
o Judgment sampling calls for special efforts to locate and
gain access to the individuals who do have the
requisite information.
• Quota sampling
o This method ensures that certain groups are adequately
represented in the study through the assignment of a
quota.
o The quota fixed for each subgroup is based on the total
numbers of each group in the population.
o Considered as a form of proportionate stratified
sampling, in which a predetermined proportion of people
are sampled from different groups, but on a convenience
basis.
SAMPLE SIZE
1. Introduction:
• Questions:
o How large should my sample be?
• Answer:
o It depends…
…large enough to be an accurate representation of the
population.
…large enough to achieve statistically significant results.
2. Determining sample size:
• What sample size would be required to make reasonably
precise generalizations with confidence?
• A reliable and valid sample should enable us to generalize the findings
from the sample to the population under investigation.
• The sample statistics (sample findings) should be reliable estimates and
reflect the population parameters (actual values) as closely as possible
within a narrow margin of error.
• Precision:
o Precision refers to how close our estimate is to the true population
characteristic.
o Normally, the greater the precision required, the larger is the
sample size needed.
• Confidence:
o Confidence denotes how certain we are that our estimate will really
hold true for the population.
o Confidence reflects the level of certainty with which we can state
that our estimates of the population parameters, based on our
sample statistics, hold true.
o Level of confidence can range from 0 to 100%.
o A level of confidence of 95% is conventionally acceptable.
• Sample size is a function of…
o Variability (heterogeneity) in the population
The more variance we find, the bigger the sample should be
o Precision or accuracy needed
The more precise or accurate we want, the bigger the sample
size should be
o Confidence level desired
The higher the confidence level we want, the bigger the sample
size should be
o Type of sampling plan used
Different sampling approaches will require different sample sizes
• Trade-off between confidence and precision
o If there is little variability in the population, a small sample size
will be sufficient to obtain a high confidence and precision level.
o For a fixed sample size, the higher the precision, the lower our
confidence level will be.
o Likewise, the higher the confidence level, the lower our precision
will be. → That is why we need a bigger sample size to
increase both the precision and the confidence.
• Roscoe proposes the following rules of thumb for determining sample
size
o Sample sizes larger than 30 and less than 500 are appropriate for
most research
o Where samples are to be broken into sub samples, a minimum
sample size of 30 for each category is necessary.
o In multivariate research (including regression analyses) the sample
size should be several times (preferably 10 times or more) as large
as the number of variables in the study.
o For simple experimental research with tight experimental controls,
successful research is possible with samples as small as 10 to 20
in size.
3. The term statistically significant (p < .05) indicates that, if there were
truly no effect in the population, findings at least as extreme as those
obtained from the sample would be expected fewer than 5 times out of 100.
It is therefore taken as evidence that the sample findings reflect what would
be found if one were actually able to carry out the study with the entire
population.
4. Sample size for single mean
n = (A + B)^2 × 2σ^2 / ∆^2
n = sample size
σ = population standard deviation
∆ = expected difference of mean
A = significance level (usually 95%, equals 1.96)
B = power (usually 80%, equals 0.84)
Significance level   A
5%                   1.96
1%                   2.58
Power                B
80%                  0.84
90%                  1.28
95%                  1.64
6. Sample size for single proportion:
n = (Z/∆)^2 p (1-p)
n = sample size
∆ = precision
Z = Z-score at significance level
p = population proportion
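A minimal sketch of the two sample size formulas above, n = (A + B)^2 × 2σ^2 / ∆^2 and n = (Z/∆)^2 p (1-p). The input values (a standard deviation of 10, an expected difference of 5, a proportion of 0.5 and a precision of 0.05) and the function names are hypothetical and chosen only to show the arithmetic; results are rounded up to whole subjects.

```python
import math

def n_for_means(sd, delta, a=1.96, b=0.84):
    """n = (A + B)^2 * 2 * sd^2 / delta^2, rounded up."""
    return math.ceil((a + b) ** 2 * 2 * sd ** 2 / delta ** 2)

def n_for_proportion(p, precision, z=1.96):
    """n = (Z / precision)^2 * p * (1 - p), rounded up."""
    return math.ceil((z / precision) ** 2 * p * (1 - p))

# Hypothetical inputs: SD of 10 units, expected difference of 5 units,
# 5% significance level (A = 1.96) and 80% power (B = 0.84)
n_mean = n_for_means(sd=10, delta=5)

# Hypothetical inputs: expected proportion 0.5, precision of 0.05
n_prop = n_for_proportion(p=0.5, precision=0.05)
```

With these inputs the first formula gives 63 and the second gives 385, which shows how a tighter precision or a smaller expected difference quickly increases the required sample size.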
OVERVIEW ON MEDICAL STATISTICS
1. Some introduction:
• I’m interested in research…
• I’m forced to do research…
• Whatever the reason may be…
• Modern viewpoint of statistics
o Aid for making scientific decision in the face of uncertainty
o A valuable tool in decision making whenever one is uncertain about
the state of nature
• Statistics in medicine
o Increasingly prevalent in medical practice
Hospital utility statistics, auditing, vaccination uptake,
incidence/prevalence of AIDS and so on…
• Statistics is about common sense and good design
• Statistics is only the guide to make decisions
• Judgment should be made based on both biological and statistical
plausibility
• Concept and applications of statistics in medical sciences
o Let us discuss briefly
o People say “stat is boring”
o Let us make it interesting
6. Classification of statistics
• It consists of two parts
o Descriptive statistics
Concerned with collection, organization, enumeration of the
frequency of characteristics, summarization and presentation of
data.
o Inferential statistics
Statistical inference
Analytical in nature
Consists of a collection of principles or theorems
Allows researcher to generalize characteristics of a “population”
from the observed characteristics of a “sample”
• Statistical jargon
o Population parameter
A fixed numerical value which describes a particular
characteristic of a population
E.g. 1 – the mean value in the population of a particular
characteristic of interest (mean systolic blood pressure of
Australian adults)
E.g. 2 – the proportion of individuals in the population with a
particular characteristic of interest (the proportion of low birth
weight babies born in Indonesia)
o Sample statistics
Varies in value from sample to sample
Other terms – statistics, summary statistics, point estimate,
effect size, point estimate of the effect size
o The relationship between sample statistics and population
parameters is the basis of statistical inference.
7. Statistical inference
• 2 broad categories
o Hypothesis formulation and testing
o Estimation
Point estimation
Interval estimation
(Confidence interval)
STUDY DESIGN
[Figure: study designs are divided into descriptive and analytical]
5. Case report
• Strength
o Hypothesis (question) generation
o Clinical observation
• Weaknesses
o May be one off
o Nothing to compare
6. Case series
• Strengths
o Strengthens the hypothesis
o Able to establish temporal relationship
• Weakness
o Nothing to compare
• End point blinding → e.g. the pathologist is not given any
information about the study sample slide, so he/she does not
know whose slide it is and decides based on his/her
independent interpretation of the slide.
• There are a few types of blinding in RCTs
o Single blind
o Double blind
o Triple blind
o Multiple blind
o End point blinding
• There are 2 designs of RCT
o Parallel RCT
o Cross-over RCT
[Figure: parallel RCT: population → eligible subjects → randomization → test or control → outcome assessment. Cross-over RCT: population → eligible subjects → randomization → test then control, or control then test, with a post-treatment assessment after each period → outcome assessment]
• On entry, people are classified according to characteristics that might
be related to outcome
• Other names: longitudinal, prospective, incidence studies.
[Figure: cohort study. The direction of inquiry runs from exposure status (exposed, unexposed) to outcome (disease, no disease)]
o Looking back in nature
o We were not there to measure risk directly
o Associate outcome (disease) with prior exposure
• Calculate indirect estimate of risk: odds ratio (OR)
• Compare the frequency of a risk factor in a group of cases and a
group of controls
• There must be a comparison group that does not have the disease
• There must be enough people in the study so that chance does not
play a large part in the observed results
• Groups must be comparable except for the factor of interest
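As a small illustration of the odds ratio mentioned above, the sketch below computes it from a 2×2 table of cases and controls. The counts and the function name are hypothetical and serve only to show the calculation.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# Hypothetical counts: 100 cases (40 exposed) and 100 controls (20 exposed)
or_value = odds_ratio(exposed_cases=40, unexposed_cases=60,
                      exposed_controls=20, unexposed_controls=80)
```

An odds ratio above 1 suggests that exposure is more common among cases than among controls; a value near 1 suggests no association.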
[Figure: case-control study. The direction of inquiry runs back from outcome status (cases, controls) to prior exposure (exposed, unexposed)]
• More potential for biases
[Figure: cross-sectional study. Samples are drawn from the population and observed for exposure (exposed, unexposed) and outcome (disease, no disease)]
• Classical cross-sectional
• Comparative cross-sectional study
o A comparative way of conducting a cross-sectional study
o Samples are drawn from two or more defined different
populations
o Measure exposure and outcome factors
o Investigate the association between exposure and
outcome
o Strengths of cross sectional studies
Very quick and inexpensive to implement
Useful for determining prevalence
Appropriate for diagnostic test validity
o Weaknesses
Difficulty in establishing links of causal effect (temporal
relationship)
Impractical for rare outcomes
DESCRIPTIVE STATISTICS
1. Definition:
• Statistics
o A field of study concerned with the collection, organization and
summarization of data.
• Statistical Methods
o A scientific technique employed for collection, presentation,
analysis and interpretation of data.
• Biostatistics
o Biological field and medicine
2. Uses of statistical methods
• To collect data in the best possible way
o Designing form
o Organizing
o Conducting survey
• To describe the characteristics of a group or a situation
o Data summary
o Data presentation
• To analyse data and to draw conclusions
o Scientific, logic
o Decision making
3. Classification of statistics
• Descriptive statistics
o Concerned with collection, organization, enumeration of frequency
of characteristics, summarization and presentation of data.
o Describe characteristics of the observed data
Type of variable
Summary statistics
Distribution
Graphical presentation
• Inferential statistics
o Analytical in nature
o Involve hypothesis testing and confidence interval
o Allows the researcher to infer/generalize characteristics from the
sample (statistic) to the population (parameter)
4. Terms:
• Population:
o Full sets of individuals
o Collection of items: objects, things, people
o Parameter – descriptive measure from population data
• Sample
o Subset of population
o Selected to represent the population by sampling technique
o Statistics – descriptive measure from sample data
• Variable
o Any characteristic of an event/object/person
o The characteristic being observed/measured
o E.g. age, sex, race, height, weight, etc
• Data
o The raw or original measurement of statistics
o Values taken by the characteristics
o E.g. Malay, female, 155cm
5. Classification of variables
• Discrete
o Characterized by gaps or interruptions in the values
o Values can assume only whole numbers
o Mainly count
o E.g. no of students, no of teeth extracted
• Continuous
o No gap or interruption
o Any value within specified interval
o Mainly measurement
o E.g. height, weight, BP, age, etc
• Nominal
o Unordered categories
o No implied order among the categories
o E.g. race, sex, medical diagnosis, etc.
• Ordinal
o Ordered categories
o Ranked according to some criteria
o E.g. BP – high, normal, low.
6. Categorical Variables:
• Data presentation
o Statistics
Frequency
Percentage (%)
o Graphical
Pie chart
Bar chart
7. Numerical variables
• Measures of central tendency
o A measure of centrality
Mean
• Arithmetic average
• Adding all the values in a population/sample and dividing by
the number of values that are added
• Affected by extreme values
Median
• The middle value of data ordered from the lowest to the
highest → arrange all values in order
• If n = odd number → median is the middle value
• If n = even number → median is the mean of the 2 middle
observations
• 50th percentile of a set of observations
• Useful for data with non-normal distribution or skewed data
• Less sensitive to extreme values than the mean
• Median (IQR)
Mode
• The most frequent observation
• Point of maximum concentration
• Measures of dispersion/variability
o Range = largest value – smallest value
Difference between the largest and smallest value in a set of
observations
Give idea about the variability of data
Simplest to compute
Sensitive to outliers
Least useful
R = Xmax - Xmin
o Variance = s^2
Sum of squared deviations of observations from the
mean, divided by the number of degrees of freedom
Average squared deviation of observations from the
sample mean
Measures the amount of variability or spread about/from the
mean of a sample
s^2 = Σ(xi – x̄)^2 / (n – 1)
o Standard deviation (SD)
A square root of variance
The root mean square of the distances (or differences) from
mean of sample
A better measure of variability of a set of data
A smaller SD indicates that values lie closer to the mean
Mean (SD)
s = √[Σ(xi – x̄)^2 / (n – 1)]
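The range, variance and standard deviation formulas above can be checked with a short Python sketch. The eight observations are hypothetical; the variance uses n – 1 degrees of freedom, as in the notes.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical observations, mean = 5
value_range = max(data) - min(data)     # R = Xmax - Xmin
variance = sample_variance(data)        # s^2
sd = math.sqrt(variance)                # s, the square root of the variance
```

Here the squared deviations sum to 32, so the variance is 32/7 (about 4.57) and the standard deviation is about 2.14, expressed in the same units as the data.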
o 95% → within 2 SD of the mean
o 99.7% → within 3 SD of the mean
• Mean (SD)
Histogram
10. Summary
• Categorical data
o Statistics
Frequency (%)
o Graphs
Bar chart
Pie chart
• Numerical data
o Statistics
Mean (SD)
Median (IQR)
o Graphs
Histogram
Box Plot
HYPOTHESIS FORMULATION AND TESTING
1. Hypothesis:
• A statement about one or more population
• Research question
o Statement
o Research hypothesis
• Postulating the existence of:
o A difference between groups
o An association among factors
• Usually derived from a hunch, an educated guess based on published
results or preliminary observations.
• There are 2 type:
o Null hypothesis (HO)
Hypothesis of no difference
Hypothesis to be tested
o Alternative hypothesis (HA)
The hypothesis that postulates that there is a treatment effect,
an association between factors or a difference between groups.
• Inferential statistics – estimating the probability that a given outcome
is due to chance
• If the sample data provide sufficient evidence to discredit HO Æ reject
HO in favor of HA.
2. Hypothesis Testing:
• To aid the researcher in reaching a decision concerning a population by
examining the sample.
• Observed differences or associations may have occurred by chance.
• HO : the proportion of patients with disease who die after treatment
with the new drug is not different from the proportion of similar
patients who die after treatment with placebo.
• HA : the proportion of patients with disease who die after treatment
with the new drug is lower than the proportion of similar patients who
die after treatment with placebo.
E.g. α = 0.05, p value = 0.05 (2-tailed test taken). Significance level
= 0.05/2 = 0.025. Conclusion: do not reject the HO.
• The p value
o The probability of obtaining a value as extreme or more extreme
than that observed in the sample given that the null hypothesis is
true is called p value
o The smallest value of α for which the null hypothesis can be
rejected
o The p value is compared to the predetermined significance level
(usually 0.05) to decide whether the null hypothesis should be
rejected
o If the p value is less than α, reject the HO.
o If the p value is greater than α, do not reject the HO.
• HO: the mean blood pressure of patients in the new
treatment group is not different from the mean blood
pressure of patients in the old treatment (µ1 = µ2)
• HA: the mean blood pressure of patients in the new treatment
group is different from the mean blood pressure of patients
in the old treatment (µ1 ≠ µ2)
Notes:
1. In population: µ = mean, σ = standard deviation
2. In sample: x̄ = mean, SD = standard deviation
• Step 2:
o Set the significance level
Usually set at 0.05, 0.01, 0.1
• Step 3:
o Decide which statistical test to use and check the assumption of the
test
Population is approximately normally distributed
Data values are obtained by independent random sampling
Adequate sample size
o To decide which statistical test should be used
E.g. mean, proportion
o Assumption must be adequately met
o If not met alternative procedures can be used
E.g. a non-parametric test would be used when the data are seriously
non-normal
• Step 4:
o Compute the test statistic and associated p value
Calculate appropriate test statistics
• Step 5:
o Interpretation
Compare p value with the level of significance
Decide whether or not to reject the null hypothesis
p value < α – reject the null hypothesis
p value > α – do not reject the null hypothesis
• Step 6:
o Draw conclusions
Conclude accordingly based on rejecting/not rejecting null
hypothesis
o Decision rule
Rejection region
• Reject the null hypothesis if the value of the test statistic
computed from the sample is one of the values in the rejection
region
Acceptance region
• Do not reject the null hypothesis if the computed value falls in the
acceptance region
E.g. conclusion:
• The mean blood pressure of patients in new treatment group is
different from the mean blood pressure of patient in old
treatment
CONFIDENCE INTERVAL
[Figure: 0.95 confidence interval]
2. Confidence Interval
• Standard Deviation (SD) vs. Standard Error (SE)
[Figure: population with mean (µ) and SD (σ); samples with mean (x̄) and SD (s); standard error]
o Sample mean varies from sample to sample (as measured by SE)
How close?
[Figure: sample estimate compared with the population mean (µ) and SD (σ); the interval in which the population parameter is likely to fall]
3. General Comments on Confidence Interval
• As a measure of an estimate of a population parameter (a measure of
the precision of a sample statistic)
• Confidence interval = estimate ± k × (standard error)
• 90% CI, 95% CI, 99% CI
o 95% CI interpretation – 95% certain that the population parameter
lies within its limits.
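The general formula above, estimate ± k × (standard error), can be sketched for a sample mean as follows. This is a sketch under stated assumptions: the blood pressure readings are hypothetical, the standard error is taken as s/√n, and k = 1.96 is the usual normal multiplier for a 95% interval (for small samples a t-distribution multiplier would normally be used instead).

```python
import math
import statistics

def confidence_interval(values, k=1.96):
    """Return (lower, upper) = mean -/+ k * SE, with SE = s / sqrt(n)."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean - k * se, mean + k * se

# Hypothetical systolic blood pressure readings (mmHg), sample mean = 124
readings = [118, 122, 125, 130, 121, 127, 119, 124, 126, 128]
lower, upper = confidence_interval(readings)
wider_lower, wider_upper = confidence_interval(readings, k=2.58)  # 99% multiplier
```

Raising the confidence level from 95% to 99% widens the interval, which is the trade-off between confidence and precision for a fixed sample size.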
EXPLORATORY DATA ANALYSIS (NUMERICAL DATA)
2. Single Mean:
• Step 1: Null and Alternative Hypothesis
o H0: The mean serum amylase level in the population from which
the sample was drawn is 120 units/100ml.
o HA: The mean serum amylase level in the population from which
the sample was drawn is different from 120 units/ 100ml. (two-
sided test)
• Step 2: Level of significance
o Alpha = 0.05 (alpha/2 = 0.025)
• Step 3: Check the assumption
o Population is approximately normally distributed
o Random sampling
o Independent variable/sample
• Step 4: Statistical test (one sample t test)
o t = (x̄ - µ0)/ (s/√n)
o where
x̄ = sample mean
s = sample standard deviation
n = sample size
µ0 = the hypothesized mean
t stat has n-1 degrees of freedom
• Step 5: Interpretation
o p-value < 0.025
o Reject H0
• Step 6: Conclusion
o The population mean serum amylase is statistically significantly
different from 120 units/ 100ml.
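The one sample t statistic in Step 4 can be computed directly from its formula. The serum amylase values below are hypothetical (the notes do not give the raw data), and the function name is an assumption; only the hypothesized mean of 120 units/100 ml comes from the example.

```python
import math
import statistics

def one_sample_t(values, mu0):
    """Return the t statistic (mean - mu0) / (s / sqrt(n)) and its degrees of freedom."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)          # sample SD with n - 1 in the denominator
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical serum amylase values (units/100 ml), sample mean = 98
amylase = [96, 102, 88, 110, 94, 99, 105, 91, 100, 95]
t_stat, df = one_sample_t(amylase, mu0=120)

# Two-sided critical value from a t table for 9 degrees of freedom at alpha = 0.05
T_CRITICAL = 2.262
reject_h0 = abs(t_stat) > T_CRITICAL
```

For these made-up values the statistic falls far into the rejection region, so the null hypothesis of a population mean of 120 would be rejected.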
o At the 5% level of significance, we conclude that the mean serum
amylase levels are different in healthy and hospitalized subjects.
UNIVARIATE ANALYSIS OF NUMERICAL DATA
Age
Count     900    Average    35.25    St Dev                     11.20
Min       19     Median     33       Variance                   125.37
Max       75     Mode       27       Coefficient of variation   32%
Range     55     Skewness   1.09
Missing   0      Kurtosis   0.88
Univariate Analysis – Challenges
Variable type   Challenges
Categorical     Missing values, invalid values, numerization
Numeric         Missing values, outliers, binning
4. Missing data
• Data entry error
• Data processing error
• Certain data may not be available at the time of entry
• How to handle missing data
o Fill in the missing values manually
o Ignore the records with missing data
o Fill them in automatically
A global constant (e.g. “?”)
The variable mean
5. Outliers
• Data points inconsistent with the majority of data
• Different outliers
o Valid: CEO’s salary
o Noisy: one’s age = 200, widely deviated points
• Removal methods
o Box plot
o Clustering
o Curve-fitting
6. Binning
• Binning is a process of transforming continuous variables into
categorical counterparts
• Binning methods
o Equal-width
o Equal-frequency
o Entropy-based methods
• Variable values (e.g. age)
o 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equal-width binning
o Bin 1: 0, 4 [-, 10] bin
o Bin 2: 12, 16, 16, 18 [10, 20] bin
o Bin 3: 24, 26, 28 [20, +] bin
• Equal-frequency
o Bin 1: 0, 4, 12 [-, 14] bin
o Bin 2: 16, 16, 18 [14, 21] bin
o Bin 3: 24, 26, 28 [21, +] bin
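The two binning methods can be sketched on the same age values. This is one possible implementation, not the only one: the function names are assumptions, and the equal-width version uses exact cut points of one third of the range (about 9.3 and 18.7) rather than the rounded boundaries of 10 and 20 shown above, which gives the same bin membership for these values.

```python
def equal_width_bins(values, n_bins):
    """Split the range of the data into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split the sorted data into n_bins groups holding the same number of values."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])        # last bin takes any remainder
    return bins

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
width_bins = equal_width_bins(ages, 3)
freq_bins = equal_frequency_bins(ages, 3)
```

Equal-width bins may hold very different numbers of values (2, 4 and 3 here), whereas equal-frequency bins hold 3 values each.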
7. Numerization
• Numerization is the process of transforming a categorical variable into
a numerical counterpart.
• Numerization methods
o Binary method
o Ordinal method
• Variable values (e.g. housing)
o For free, own, rent
• Binary method
o For free: 1, 0, 0
o Own: 0, 1, 0
o Rent: 0, 0, 1
• Ordinal method
o Own: 5
o For free: 3
o Rent: 1
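Both numerization methods can be sketched for the housing variable. The column order and the ordinal codes are taken from the example above; the list of responses and the function names are hypothetical.

```python
CATEGORIES = ["for free", "own", "rent"]               # column order for the binary method
ORDINAL_CODES = {"own": 5, "for free": 3, "rent": 1}   # codes from the example above

def binary_encode(value):
    """One indicator column per category: 1 for the matching category, 0 otherwise."""
    return [1 if value == category else 0 for category in CATEGORIES]

def ordinal_encode(value):
    """Single numeric code that preserves the assumed ordering of the categories."""
    return ORDINAL_CODES[value]

housing = ["own", "rent", "for free"]                  # hypothetical responses
binary = [binary_encode(h) for h in housing]
ordinal = [ordinal_encode(h) for h in housing]
```

The binary method makes no assumption about order, while the ordinal method is only appropriate when the categories can reasonably be ranked.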
8. Quantification
• Introduction
o To conduct quantitative analysis, responses to open-ended questions
in survey research and the raw data collected using qualitative
methods must be coded numerically.
o Most responses to survey research questions already are recorded in
numerical format
In mailed and face-to-face surveys, responses are keypunched
into a data file.
In telephone and internet surveys, responses are automatically
recorded in numerical format.
• Developing code categories
o Coding qualitative data can use an existing scheme or one developed
by examining the data.
o Coding qualitative data into numerical categories sometimes can be
a straightforward process
Coding occupation, for example, can rely upon numerical
categories defined by the Bureau of the Census.
o Coding most forms of qualitative data, however, requires much effort
o This coding typically requires using an iterative procedure of trial
and error
o Consider, for example, coding responses to the question, “What is
the biggest problem in attending college today?”
o The researcher must develop a set of codes that are:
Exhaustive of the full range of responses
Mutually exclusive (mostly) of one another.
o In coding responses to the question, “What is the biggest problem in
attending college today?” the researcher might begin, for example,
with a list of 5 categories, then realize that 8 would be better, then
realize that it would be better to combine and use a total of 7
categories
o Each time the researcher makes a change in the coding scheme, it is
necessary to restart the coding process to code all responses using
the same scheme
9. Distribution
• Data analysis begins by examining distributions
• One might begin, for example, by examining the distribution of
responses to a question about formal education, where responses are
recorded within six categories
• A frequency distribution will show the number and percent of
responses in each category of a variable
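A frequency distribution of this kind can be produced with a few lines of Python. The responses and category labels below are hypothetical (only four categories are used, for brevity), and the function name is an assumption.

```python
from collections import Counter

def frequency_table(responses):
    """Map each category to (count, percent of all responses)."""
    counts = Counter(responses)
    total = len(responses)
    return {category: (n, round(100 * n / total, 1)) for category, n in counts.items()}

# Hypothetical responses to a question about formal education
responses = ["secondary", "degree", "secondary", "diploma", "degree",
             "secondary", "primary", "degree", "diploma", "secondary"]
table = frequency_table(responses)
```

For these ten responses the table holds, for example, a count of 4 and a percentage of 40.0 for "secondary".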
10. Central tendency
• A common measure of central tendency is the average or mean of the
responses
• The median is the value of the middle case when all responses are
rank-ordered
• The mode is the most common response
• When data are highly skewed, meaning heavily weighted toward one
end of the distribution, the median or mode might better represent
the most common or centered response.
• Consider this distribution of respondent ages:
o 18, 19, 19, 19, 20, 20, 21, 22, 85
• The mean equals 27. But this number does not adequately represent
the common respondent because the one person who is 85 skews the
distribution toward the high end.
• The median equals 20
• This measure of central tendency gives a more accurate portrayal of the
middle of the distribution
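The figures in this example can be reproduced with Python's standard library, using the respondent ages listed above.

```python
import statistics

ages = [18, 19, 19, 19, 20, 20, 21, 22, 85]

mean_age = statistics.mean(ages)        # pulled upward by the single age of 85
median_age = statistics.median(ages)    # middle (5th) value of the 9 ordered ages
mode_age = statistics.mode(ages)        # most frequent age

# Dropping the extreme value shows how strongly it affects the mean
mean_without_85 = statistics.mean(ages[:-1])
```

The mean is 27 and the median is 20, as stated above; without the 85-year-old respondent the mean falls to 19.75, close to the median.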
11. Dispersion
• Dispersion refers to the way the values are distributed around some
central value, typically the mean.
• The range is the distance separating the lowest and highest values (e.g.
the range of the ages listed previously is 85 – 18 = 67)
• The standard deviation is an index of the amount of variability in a set
of data
• The standard deviation represents dispersion with respect to the normal
(bell-shaped) curve
• Assuming a set of numbers is normally distributed, then each standard
deviation equals a certain distance from the mean.
• Each standard deviation (+1, +2, etc) is the same distance from each
other on the bell-shaped curve, but represents a declining percentage of
responses because of the shape of the curve.
• For example, the first standard deviation accounts for 34.1% of the
values on each side of the mean
o The figure 34.1% is derived from probability theory and the shape of
the curve.
• Thus approximately 68% of all responses fall within one standard
deviation of the mean
• The second standard deviation accounts for the next 13.6% of the
responses from the mean (27.2% of all responses) and so on.
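The percentages quoted above can be reproduced from the standard normal cumulative distribution. This is a minimal sketch using only `math.erf`; the helper name `normal_cdf` is an assumption made for illustration.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

first_band = normal_cdf(1) - normal_cdf(0)   # mean to +1 SD, about 0.341
second_band = normal_cdf(2) - normal_cdf(1)  # +1 SD to +2 SD, about 0.136

within_one_sd = 2 * first_band                  # about 0.68
within_two_sd = 2 * (first_band + second_band)  # about 0.95

print(round(first_band, 4), round(second_band, 4))
print(round(within_one_sd, 4), round(within_two_sd, 4))
```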
• Dispersion measures
o Spread around the mean
Variance – too abstract, a step towards standard deviation
Standard deviation (from mean) – more intuitive
o Standard deviation
Roughly the typical distance between the mean and each value in the data set
Translates variance into the same scale as the mean and all the values
High values indicate widely scattered, less consistent data
• If the responses are approximately normally distributed and the range of
responses is low – meaning that most responses fall close to the mean
– then the standard deviation will be small
o The standard deviation of professional golfers’ scores on a golf course
will be low
o The standard deviation of amateur golfers’ scores on a golf course
will be high
13. Continuous and Discrete Variables
• Continuous variables have responses that form a steady progression
(e.g. age, income)
• Discrete (i.e. categorical) variables have responses that are considered
to be separate from one another (e.g. sex, religion)
• Sometimes, it is a matter of debate within the community of scholars
whether a measured variable is continuous or discrete
• This issue is important because the statistical procedures appropriate
for continuous-level data differ from those appropriate for discrete
data, especially as related to the measurement of the dependent variable.
• Example: suppose one measures amount of formal education within
five categories (less than hs, hs, 2 years vocational/college, college, post
college)
• Is this measure continuous or discrete?
• In practice, five categories seem to be the cut-off point for considering a
variable as continuous
• Using a seven-point response scale will give the researcher greater
chance of deeming a variable to be continuous.
14. Subgroup comparison
• Collapsing response categories
o Sometimes the researcher might want to analyse a variable by using
fewer response categories than were used to measure it
o In these instances, the researcher might want to collapse one or
more categories into a single category
o The researcher might want to collapse categories to simplify the
presentation of the results or because few observations exist within
some categories
• Collapsing response example
Response Frequency
Strongly disagree 2
Disagree 22
Neither agree nor disagree 45
Agree 31
Strongly agree 1
One might want to collapse the extreme responses and work with just three
categories
Response Frequency
Disagree 24
Neither agree nor disagree 45
Agree 32
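The collapsing step above can be sketched as a simple recode. The dictionary names are illustrative; the frequencies are those from the five-category table.

```python
# Original five-category frequencies from the example above
frequencies = {
    "Strongly disagree": 2,
    "Disagree": 22,
    "Neither agree nor disagree": 45,
    "Agree": 31,
    "Strongly agree": 1,
}

# Extreme responses are folded into their neighbouring category
collapse_map = {"Strongly disagree": "Disagree", "Strongly agree": "Agree"}

collapsed = {}
for response, count in frequencies.items():
    target = collapse_map.get(response, response)
    collapsed[target] = collapsed.get(target, 0) + count

print(collapsed)
```

The total number of responses is unchanged by collapsing; only the number of categories is reduced.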
UNIVARIATE ANALYSIS OF CATEGORICAL DATA
• Step 4: statistical test
o Chi-square test or
o Fisher exact test
o χ² = Σ (O – E)² / E
o Chi-square value:
As the difference between observed (O) and expected (E) counts increases →
the chi-square value increases → the p-value decreases → the result is
more likely to be significant
• Step 5: Interpretation
o p value = 0.123
do not reject H0
o There is no significant association between gender and IHD status.
• Step 6: conclusion
o There is no significant association between gender and IHD status
using the Pearson Chi-square test (p-value = 0.123)
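As a rough illustration of the chi-square formula, the sketch below computes the statistic for a 2 × 2 table. The counts are hypothetical (the notes do not give the gender-by-IHD table), the function name is an assumption, no continuity correction is applied, and the p-value shortcut via `erfc` is valid only for 1 degree of freedom.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square for a 2 x 2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 df
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: rows = two groups, columns = outcome present / absent
observed = [[20, 30], [30, 20]]
chi2, p = chi_square_2x2(observed)
print(round(chi2, 3), round(p, 4))
```

With these made-up counts every expected cell is 25, so the statistic is 4.0 and the p-value falls just below 0.05.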
• Data presentation
* Discordant
** Concordant
o Matched for age
o Undergone
Simple Mastectomy (SM)
Radical Mastectomy (RM)
o Difference in 5-year survival proportion between the two groups
• Step 1: state the null and alternative hypothesis
o H0: there is no association between type of
mastectomy and 5-year survival proportion in patients with breast
cancer
o HA: there is an association between type of mastectomy and 5-year
survival proportion in patients with breast cancer.
• Step 2: set the significance level
o α = 0.05
• Step 3: check the assumption
o Categorical data
o Dependent or matched sample
• Step 4: statistical test
o McNemar’s test
• Step 5: interpretation
o p-value = 0.021
reject H0
o There is a significant association between type of mastectomy and
5-year survival proportion in patients with breast cancer.
• Step 6: conclusion
o There is a significant association between type of mastectomy and
5-year survival proportion in patients with breast cancer using
McNemar’s test (p-value = 0.021)
• Data presentation
Table 2: Association between type of mastectomy and 5-year survival
proportion in patients with breast cancer
                      Simple mastectomy
Radical mastectomy    Live, n (%)    Die, n (%)    p-value
Live                  13 (%)         1 (%)         0.021*
Die                   9 (%)          2 (%)
* McNemar’s test
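McNemar's test uses only the discordant pairs (here 1 and 9). The sketch below assumes the exact (binomial) form of the test, which is not stated in the notes; under that assumption it reproduces the reported p-value. The function name is illustrative.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    k = min(b, c)
    # Probability of a split at least this uneven when each discordant
    # pair is equally likely to fall in either cell (p = 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Discordant pairs from Table 2: radical live / simple die = 1,
# radical die / simple live = 9
p_value = mcnemar_exact(1, 9)
print(round(p_value, 3))
```

The concordant pairs (13 and 2) do not enter the calculation at all.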
CORRELATION & REGRESSION
Correlation Regression
CORRELATION
3. Correlation coefficient (r)
- X increases, Y increases: r = 1 (perfect positive)
- X increases, Y decreases: r = -1 (perfect negative)
- No linear relationship: r = 0
- r
o < 0.25 → poor
o 0.26 – 0.50 → fair
o 0.51 – 0.75 → good
o 0.76 – 1.00 → excellent
- r does not imply a cause and effect relationship
- Correlation should be assessed mathematically, not visually.
- r for statistical sample, ρ (rho) for parameter of population.
- Correlation coefficient:
• α = 0.05
- Step 3: check the assumption
• Both variables are numerical
• One of the variables has a normal distribution
Histogram
Box and Whisker plot
- Step 4: statistical test
• Pearson correlation (if assumption is met)
• Spearman’s correlation (if assumption is not met)
- Step 5: Interpretation
Correlations
                                 Height      Weight
Height   Pearson Correlation     1           .878(**)
         Sig. (2-tailed)         .           .000
         N                       100         100
Weight   Pearson Correlation     .878(**)    1
         Sig. (2-tailed)         .000        .
         N                       100         100
** Correlation is significant at the 0.01 level (2-tailed).
• p-value < 0.001
reject H0
- Step 6: conclusion
• There is a significant, positive and excellent correlation between
height and weight (r = 0.88, p < 0.001)
- Checklist for reporting correlation (Figure 1)
• Correlation coefficient – Pearson’s correlation coefficient/
Spearman’s Ranked correlation coefficient
• Actual p-value of correlation coefficient
• Sample size
• Scatter plot
Figure 1: A scatter plot showing high positive correlation between
height and weight (y-axis: Weight)
1. Regression Analysis
• Regression analysis is a statistical tool that utilizes the relation
between variables so that one variable can be predicted from the other
or others
• Linear regression
o Simple (one independent variable (factor) and one outcome)
o Multiple ( more than one factor and one outcome)
• Logistic Regression (dichotomous dependent variables)
2. Simple Linear Regression
• Example of research questions
o Does a relationship exist between oral contraceptive and the
incidence of thromboembolism?
o What is the relationship of a mother’s weight to her baby’s birth
weight?
o What is the relationship between an animal’s pulse rate and the amount
of a particular drug administered?
• Simple because only one independent variable
• Linear means the relationship between y (dependent/outcome) and x
(independent/factor) variables can be represented by a straight line
• Analyses the linear relationship between two quantitative (numerical)
variables
• Involves estimating the equation of a straight line that defines the
relationship between a dependent variable and an independent variable
using a given data set
• The method involved is called the method of least squares
• We choose a line such that the sum of squares of vertical distances of
all points from the line is minimized (Q = Σ ei²)
• These vertical distances between y values and their corresponding
estimated values on the line are called residuals (ei = yi – ŷi)
• The line thus obtained is called the regression line or the least-squares
line of best fit
3. Regression line (least squares line of best fit)
• Yi = β0 + β1Xi + εi
o Yi is the value of the dependent variable when the value of the
independent variable is Xi
o β0 is the Y-intercept and is constant
o β1 is the slope of the regression line. It is the change in Yi when Xi is
increased by one unit
o β0 and β1 are called regression coefficients
o εi is the random error term: normally distributed, independent, with
zero mean and constant variance σ²
4. Linear Regression Model
• Relationship: Yi = β0 + β1Xi + εi
o Yi: dependent (outcome/response) variable
o β0: population Y-intercept
o β1: population slope
o Xi: independent (factor/explanatory) variable
o εi: random error
• LS (least squares) minimizes the Sum of the Squared Differences (SSE):
Σ (Yi – Ŷi)² = Σ ε̂i², summed over i = 1 to n
8. Interpretation of Coefficient
• Slope (β^1)
o The change in the estimated mean value of Y when X is increased by
1 unit
If β^1 = 0.05, then the estimated mean cholesterol level (Y)
changes by 0.05 mmol/dl when the age (X) is increased by 1 year.
• Y-intercept (β^0)
o Average value of Y when X = 0
If β^0 = 3.3, then the mean cholesterol level (Y) is expected to be 3.3
when the age (X) is 0 (questionable in practice, since age 0 may lie
outside the range of the observed data)
8. Measures of variation in Regression
• Total variation (Total Sum of Squares (SSTOT))
o Measures variation of observed Yi around the mean Ymean
• Explained variation (Regression Sum of Squares (SSR))
o Variation due to the relationship between X & Y
• Unexplained variation (Error Sum of Squares (SSE))
o Variation due to other factors
9. Sum of squares
• Total sum of squares (SSTOT)
o Measures the total variation in the dependent variable Y
o SSTOT = Σ (Yi – Ymean)² = SSR + SSE
• Regression sum of squares (SSR)
o Measures the variation ‘explained’ by the regression line
o SSR = Σ (Ŷi – Ymean)²
• Error sum of squares (SSE)
o Measures the ‘unexplained’ variation in Y or the scatter around
the regression line
o SSE = Σ (Yi – Ŷi)²
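The three sums of squares can be verified numerically. The sketch below fits a least-squares line to a small hypothetical data set (the x and y values are made up for illustration, as is the helper name `fit_line`) and confirms that SSTOT = SSR + SSE.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
y_mean = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

ss_tot = sum((yi - y_mean) ** 2 for yi in y)               # total variation
ss_r = sum((yh - y_mean) ** 2 for yh in y_hat)             # explained
ss_e = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # unexplained
r_squared = ss_r / ss_tot

print(round(ss_tot, 3), round(ss_r, 3), round(ss_e, 3), round(r_squared, 3))
```

For these values the fitted line is Ŷ = 2.2 + 0.6X, and the explained share of the variation (SSR / SSTOT) is 0.6.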
Figure: Measures of variation. For a point (Xi, Yi), the plot shows Yi, the
fitted line Ŷi = β^0 + β^1Xi and Ymean against X; the unexplained sum of
squares is SSE = Σ (Yi – Ŷi)²
Notes:
X, Y and slope:
• Positive slope, Y increases with increase in X
• Negative slope, Y decreases with increase in X
o Rejection rule:
Reject H0 if the p-value for the F-test is less than 0.05 (assumed α)
• Assumption
o The errors are normally distributed
o They are independent
o The mean of the random error term is equal to zero
o The variance of the random error, σ² (sigma squared), is constant.
11. How to analyse
• Exploration of the data
o Descriptive
o Scatter plot between two variables
Check for distribution, relationship and outliers
• Fit the least-squares line (regression line)
o Using the least squares method
o It is the best fitting straight line through the data points in a scatter
plot
o It represents the least squares equation and estimates the constant
(a) and slope (b) for the population intercept and slope → Ŷ = a + bx
o It is constructed by using the method of least squares, which minimizes the
sum of squared deviations of each point from the regression line
• Evaluation of model by R2 (R square)
Model Summary
Model   R         R2      Adjusted R2   Std Error of the estimate
1       .592(a)   .350    .338          .9043
a. Predictors: (constant), Time
o R2 = 0.35, meaning that 35% of the total variation in GPA is
explained by the study time
o R2 measures the closeness of fit of the sample regression equation to
the observed values of Y
o It ranges from 0 to 1
o It is called the coefficient of determination
• Evaluation of b
o Evaluation of β using t-statistics
Coefficients
Model          Unstandardized Coefficient                  95% CI for B
               B        Std error   t       Sig.    Lower    Upper
1 (constant)   1.461    .315        4.639   .000    .829     2.093
  Time         .389     .073        5.342   .000    .243     .534
• Histogram of unstandardized residuals
Linearity
• Plot of unstandardised residuals against unstandardised
predicted values
• Creating residuals: go to analyse → regression bivariate → save
→ unstandardised residual and predicted values
Let us say the assumptions are met.
• Interpretation and conclusion
o 35% of the variation in GPA is explained by study time
o There is a significant linear association between GPA and study time
o For each 1 hour increase in study time, the GPA of a student increases by
0.39
o We are 95% confident that for each 1 hour increase in study time,
the GPA increase will lie between 0.24 and 0.53
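The fitted equation from the coefficients table can be used for prediction. The sketch below plugs in the reported constant and slope; the function name and the 3-hour input are illustrative assumptions.

```python
# Coefficients as reported in the table above
INTERCEPT = 1.461   # constant
SLOPE = 0.389       # change in GPA per 1 hour of study time
SLOPE_SE = 0.073    # standard error of the slope

def predict_gpa(study_hours):
    """Predicted GPA from the fitted line: GPA = 1.461 + 0.389 * Time."""
    return INTERCEPT + SLOPE * study_hours

# t-statistic for the slope is the coefficient divided by its standard error
# (the table reports 5.342, presumably computed from unrounded values)
t_slope = SLOPE / SLOPE_SE

print(round(predict_gpa(3), 3), round(t_slope, 2))
```

Predictions of this kind are only reasonable within the range of study times actually observed in the sample.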
CORRELATION
1. Introduction
• Correlation is used to measure and describe a relationship between two
variables
• Correlation measures three characteristics of a relationship
o The direction of the relationship
Positive
• It means that when the value of one variable increases, the
corresponding value of the related variable also increases.
Negative
• It means that when the value of one variable increases, the
corresponding value of the related variable decreases.
Zero correlation (no correlation)
• It means that the values of one variable increase or
decrease independently of the values of the other variable
o The form of the relationship
When the value of one variable increases, the corresponding value of
the related variable increases or decreases up to a certain value, but
beyond that value the trend may change or there may be no
further change at all.
o The degree of the relationship
It measures how strong the relationship between the values of the two
variables is
2. Application of correlation
• Prediction
o If two variables are positively or negatively related to each other,
then by knowing the value of one of these variables it is possible to
predict the corresponding unknown value of the other variable
• Validity
o Validity is the measure of whether a test truly measures what it
claims to measure.
• Reliability
o Reliability is the measure of whether the test instrument produces
stable, consistent measurements when it is used again and again in the
same group of students or people.
• Theory verification
o Theory is a statement that makes a specific prediction about the
relationship between two variables
o This predicted relation can be verified by correlation test
3. Measures of correlation
• Pearson correlation (Pearson product-moment correlation)
o The Pearson correlation measures the degree and the direction of the
linear relationship and is denoted by the letter r (correlation
coefficient)
o r = (degree to which x and y vary together)/(degree to which x and y
vary separately)
o r = (covariability of x and y)/(variability of x and y separately)
o r = SP / √(SSx · SSy)
o SP = sum of products of deviations = Σxy – (Σx)(Σy)/n
o SSx = sum of squared deviations of x = Σx² – (Σx)²/n
o SSy = sum of squared deviations of y = Σy² – (Σy)²/n
• Spearman correlation (Spearman rank-order correlation)
o It is used when the data are ordinal.
o If they are not, the data must first be ranked
o Rank-order the scores separately for each variable, with 1 for the
smallest score
4 6 15 4 4
5 7 16 5 5
o If more than one respondent has the same score, the final rank
for each of those respondents will be the average of the tied ranks
o Correlation is affected by the outliers
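The computational formulas for SP, SSx and SSy, together with the ranking rule for ties, can be sketched as below. The data are hypothetical and the function names are illustrative. Spearman's coefficient is computed here as the Pearson correlation of the average ranks, which is one common way to handle ties; the notes do not state which formula they intend.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from the computational formulas r = SP / sqrt(SSx * SSy)."""
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    return sp / sqrt(ss_x * ss_y)

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    positions = {}
    for pos, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_r(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(average_ranks(y))
print(round(pearson_r(x, y), 4), round(spearman_r(x, y), 4))
```

Here the two 4s in y share rank 2.5 and the two 5s share rank 4.5, following the averaging rule described above.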
4. Hypothesis tests with the Pearson correlation
• Two-tailed
o H0: ρ = 0 (no correlation)
o HA: ρ ≠ 0 (there is correlation)
• One-tailed
o H0: ρ ≤ 0 (there is no positive correlation)
o HA: ρ > 0 (there is positive correlation)
• Reporting correlation
o r = 0.65, n = 30, p-value < 0.01, one tail or two tail,
o r2 = coefficient of determination
5. Summary
• Correlation is a statistical test to assess the relation between two
variables
• The relation can be positive or negative
• Two methods of testing are the Pearson and Spearman methods
• The test is used in prediction, in testing validity and reliability,
and in verifying theories
• It can be calculated manually using different formulas or using a computer
statistical package like SPSS
• Correlation does not establish a cause-and-effect relationship
• The correlation coefficient is influenced by outliers and/or the range of
data under analysis
NONPARAMETRIC STATISTICS
• Hypothesis
o H0: ∆d = 0 (the median of differences is zero)
o HA: ∆d ≠ 0
• T.S.: smaller of n+ and n-
• RR: Reject H0 if p-value is less than α (assumed alpha)
• Procedure
o Exclude the observations for which the difference (di) is zero
o For di > 0 assign a (+) sign and for di < 0 assign a (-) sign
o A difference of zero is not ranked, it is eliminated from the analysis
and the sample size is reduced by one
o Tied observations are assigned an average rank (suppose the two smallest
differences are 4 and 4; each one will get the average rank (1+2)/2 = 1.5)
o Assign each rank either a (+) or (-) sign corresponding to the sign of
the difference
o Compute the sum of +ve ranks (T+) and the sum of –ve ranks (T-)
o Choose the test statistic (the smaller of T+ and |T-|)
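The ranking procedure above can be sketched as follows. The paired differences are hypothetical and the function name is an assumption; only the rank sums and the test statistic are computed, not the p-value.

```python
def signed_rank_sums(differences):
    """Return (T+, T-) for the signed-rank procedure described above."""
    # Zero differences are dropped, which reduces the sample size
    nonzero = [d for d in differences if d != 0]

    # Rank absolute differences from 1 (smallest); ties get the average rank
    positions = {}
    for pos, value in enumerate(sorted(abs(d) for d in nonzero), start=1):
        positions.setdefault(value, []).append(pos)
    rank_of = {v: sum(p) / len(p) for v, p in positions.items()}

    t_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    return t_plus, t_minus

# Hypothetical paired differences, for illustration only
diffs = [4, -4, 6, 0, -2, 9, 7]
t_plus, t_minus = signed_rank_sums(diffs)
test_statistic = min(t_plus, t_minus)

print(t_plus, t_minus, test_statistic)
```

As a check, T+ and T- together must equal n(n + 1)/2 for the n non-zero differences (here 6 × 7 / 2 = 21).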
5. Kruskal-Wallis Test
• Counterpart of One Way Analysis of Variance (ANOVA: comparing
means of more than two groups) if:
o Normality assumption of ANOVA is not justified
o Or the data available are ordinal (consist of ranks)
• Assumption:
o The samples are independent and random
o The measurement scale is at least ordinal
o The distributions of the values in the sampled populations are identical
except for the possibility that one or more of the populations are
composed of values that tend to be larger than those of other
populations.
• Hypothesis
o H0: the populations are all identical
o HA: At least one of the populations tends to exhibit larger values than
the others
• Procedure
o If no ties or moderate number of ties the formula simplifies to:
• Rejection region
o When the sample sizes are large (ni ≥ 5) the test statistic T is
distributed approximately as χ² with (t – 1) degrees of freedom, where t
is the number of groups
o Reject H0 if T > the critical value of χ² (t – 1)
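The formula itself did not survive in these notes. The sketch below uses the standard no-ties form of the Kruskal-Wallis statistic, T = 12 / (N(N + 1)) · Σ (Ri² / ni) – 3(N + 1), where Ri is the rank sum of group i, ni its size and N the total sample size. That this is the formula the notes intended is an assumption. The data and function name are hypothetical.

```python
def kruskal_wallis(groups):
    """Kruskal-Wallis statistic for data with no tied values."""
    pooled = sorted(value for group in groups for value in group)
    # Rank from 1 (smallest) across all groups; assumes no ties
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    n_total = len(pooled)
    weighted = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

# Hypothetical measurements from three independent groups
groups = [[27, 31, 35], [40, 42, 45], [50, 52, 58]]
t_stat = kruskal_wallis(groups)
degrees_of_freedom = len(groups) - 1

print(round(t_stat, 3), degrees_of_freedom)
```

These groups are deliberately small to keep the arithmetic easy to follow; with fewer than 5 observations per group the chi-square approximation described above would not be appropriate.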
NON-PARAMETRIC TESTS
1. Introduction
• Inferential statistics where population parameters are not a requirement
to calculate its value
• A process which is carried out in order to find out whether or not a
particular statistical hypothesis is likely to be true
• A statistical test in which no assumptions are made about any statistical
parameter. This is similar to a test in which we do not assume that the
data have any particular distribution
2. The χ² test for goodness of fit
• Test hypothesis about the proportions of a population distribution
• Test how well the sample proportions fit the population proportion
specified by the null hypothesis
• Example:
o Ho : there is no difference in proportion of people in different
categories
o Observed data (fo)
o Square the differences (fo – fe)²
o Expected data (fe)
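A goodness-of-fit calculation with observed (fo) and expected (fe) frequencies can be sketched as below. The counts are hypothetical, the function name is illustrative, and the null hypothesis is assumed to specify equal proportions across three categories.

```python
def chi_square_goodness_of_fit(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Hypothetical observed counts in three categories
observed = [30, 20, 10]

# Expected counts under H0: equal proportions in every category
n = sum(observed)
expected = [n / len(observed)] * len(observed)

chi2 = chi_square_goodness_of_fit(observed, expected)
degrees_of_freedom = len(observed) - 1

print(chi2, degrees_of_freedom)
```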
• There is no relationship between preference of teaching method and
gender of the students
4. Chi-squared test for variance
• This is a test of the null hypothesis that the population variance is σ².
• We have a sample of size n and we compute an unbiased estimate of the
population variance, s², using divisor n-1.
• The test statistic is χ² = (n-1) s²/σ², with n-1 degrees of freedom.
• We assume that the population is normally distributed
• For a 95% level of confidence, the critical probability levels are 97.5% and 2.5%
• Find the critical region in MS Excel using CHIINV (p, df)
STATISTICAL ANALYSIS: WHICH TO CHOOSE?
• What is/are the research question (s)?
o Common in medical research:
Difference between/ among means
Difference between/ among proportions
Associations between/ among factors
Difference between/ among treatment effects
• Hypothesis
o This is a testable statement that describes the nature of the
proposed relationship between two or more variables of interest
o E.g. there is an association between smoking and coronary heart
disease
4. What is the research design applied and expected result?
• Randomized controlled trial (RCT)
• Observational studies
o Cross-sectional
o Case-control
o Prospective cohort
o Retrospective cohort
• Case report/ series
• Diagnostic test
• E.g. 1
o Research question: effectiveness of new anti-hypertensive drug
o Research design: randomized controlled trial
• E.g. 2
o Research question: Risk factor for enteric fever
o Research design: Case control
[Figures: study design diagrams showing the time direction, starting from a
population with no disease]
o Discrete (e.g. number of patients admitted)
• Categorical
o Nominal (e.g. occupation, gender)
o Ordinal (e.g. disease severity, socioeconomic status)
• Statistical tests applied are different based on the type of variables
(must consider both independent and dependent variables)
9. Number of groups
• Two groups (two levels) (e.g. diabetic and non-diabetic group)
• More than two groups (more than two levels) (e.g. race – Malay, Chinese,
Indian, Others)
10. Sample distribution
• Normal distribution → parametric test
• Non-normal distribution → non-parametric test
• Suggested procedure for assessing normality
o Compare the mean & median (for normal distribution mean =
median)
o Construct a histogram overlaid with normal curve
o Construct a box and whisker plot
o Statistical test
Kolmogorov-Smirnov test
Shapiro-Wilk test
• Non-parametric tests are appropriate when:
o Data are ordinal
o Data are not normally distributed and cannot be easily transformed
o Data may contain outliers
• Non-parametric methods have two general limitations
o Not as powerful as parametric counterparts
o Tests for complex designs are not readily available in standard
computer packages
11. Sample type
• Independent sample (e.g. disease and non-disease groups, male and
female)
• Dependent/ paired/ matched sample (e.g. difference of blood pressure
measurements before and after treatment, age and sex matched
samples)
12. What to be asked before choosing a statistical test?
• What is the research question/ hypothesis?
• What is the outcome factor and what are the study factors?
• How many variables?
• How many groups?
• What is the distribution like?
• Are the samples independent?
• Is the data numerical/ categorical?
13. Data exploration and cleaning
• Compulsory to do
• Do not rush to analyze data
• Clean and explore first
• Get acquainted with the data
• Check duplications
• Out-of-range values and location of error
• Distribution of variables
• Missing data
• Checking consistency errors
• Exploring the relationship between variables
• Transformations
• To get acquainted with the data set before the major analysis is carried
out
o Read the protocol again
o Recall the objectives
o Identify major outcome, exposure and potential confounders/ effect
modifiers
o Check records with duplicate ID numbers (to prevent repeated
data entry)
• Error checking
o Respondent’s mis-marking answers
o Coder’s miscoding response
o Marking errors by data personnel
• Out-of-range values and location errors
o Measurement error
o Recording error
o Genuine observation
• What to do?
o Check again original measurements where possible
o If original measurements suspicious → repeat the measurement
o If not possible to check → common sense
o If the value is impossible/implausible → justifiable to set as
“missing”
14. Distribution of the variables
o Examine each variable
Continuous
• Normal distribution
• If not
o ? transformation
o ? categorization
Categorical
• Frequency distribution
15. Missing data
o Occur when respondent would/could not answer
o Too much missing data
Threatens the study
Indicate a problem with a question
o Should not be entered as a blank as some statistical packages
interpret blanks as zeroes
o Common practice – coded as 9, 99 or 999
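Missing-value codes such as 9, 99 or 999 must be excluded before analysis, otherwise they are treated as real measurements. The sketch below shows the effect on a mean; the ages, the chosen code (999) and the function name are hypothetical.

```python
def recode_missing(values, missing_codes):
    """Replace agreed missing-value codes with None."""
    return [None if value in missing_codes else value for value in values]

# Hypothetical ages, with 999 entered where the respondent did not answer
ages = [34, 45, 999, 29, 999, 52]

cleaned = recode_missing(ages, {999})
valid = [value for value in cleaned if value is not None]

naive_mean = sum(ages) / len(ages)     # wrongly treats 999 as a real age
mean_valid = sum(valid) / len(valid)   # uses only the genuine answers

print(round(naive_mean, 1), mean_valid, len(ages) - len(valid))
```

The code chosen for "missing" should be a value that cannot occur as a genuine observation for that variable.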
16. Consistency errors
o Situations where respondents answered a question for which they
were ineligible or when codes were entered incorrectly
o Countercheck with questionnaire/data collection form
o Can be prevented by proper programming in some statistical
software
17. Exploring the relationship between variables
• Cross tabulation useful for categorical variables (sometimes better to
categorize)
• Should consider confounding & interaction
• Graphs – mostly for continuous variables
• Relationship between the outcome variable and other variables
o E.g. scatter plot
18. Transformation
• Severely skewed data – two approaches
o Use nonparametric methods
o Apply transformation
• Many distributions in medicine – skewed to the right
• Involve performing a mathematical operation on every value of the
variable
• Improves the symmetry of the distribution
• May be the most difficult part for those who are not familiar with
statistical applications
• Interpret results only once they are considered to be the results of the
final analysis stage
o E.g. in multivariate analysis, final model should be interpreted for
writing regardless of the prior more-favorable results towards the
hypothesis
• Recall statistical theory and concepts whenever applicable
• May need help from a medical statistician
20. Univariate analysis
• Test hypothesis between one independent and one dependent variable
21. Multivariate analysis
• Why do we need multivariate analysis?
• Purpose of using multivariate analysis
• Common multivariate analysis methods in health sciences research.
Variables Variables
Independent Dependent
Predictor Outcome
Explanatory Response
Covariates
Confounders Not the primary interest
Controls Must be recognized
Effect modifiers
• Confounding
o Risk factor → Disease (?), with a confounder related to both
o Example: Physical activity level → Systolic BP (?), with Age as the
confounder
• Interaction
o Risk factor → Disease (?), with an interaction factor (effect modifier)
o Example: Employment in an industry → Lung cancer (?), with cigarette
smoking as the interaction factor; the risk of lung cancer is shown
separately for smokers and non-smokers
[Figures: Surgery vs. Radiation → compare outcome; Diet, Smoking, Age → CHD;
Clinical, Pathology, Demographic, Socio-economic → Cancer prognosis]
• Modeling strategies
MTV                            Independent variable   Dependent variable
Multiple linear regression     >1                     1
Multiple logistic regression   >1                     1
Log-linear regression          >1                     1
Survival analysis              >1                     1
GLM                Independent variable   Dependent variable
Univariate GLM     ≥1                     1
Multivariate GLM   ≥1                     >1
Binary
Ordinal Logistic regression
Multiple (xt logic)
Count Loglinear regression
(xt poisson)
WRITING A RESEARCH PROPOSAL
1. Introduction:
• Clear statement of the problems or issue to be analysed and the overall
objective of the proposed research.
• Brief summary of relevant studies and literature describing what has
previously been done and what is currently known about the pattern.
• Concise statement of the rationale behind the proposed approach to the
problem
3. Study methodology
• Selection of study population
o Size of study population or sample
o Sampling procedure, if any
o Specification of control population, if any
• Description of the experiment or data collection procedure
o Description of research design
o Description of method and intended research tools
o Description of “interfering” (confounding) variables and how they will
be controlled, or how their effects will be evaluated
o If appropriate, a discussion of pitfalls that might be encountered and
of limitations of the procedure proposed
• Diagram of research design (optional): a diagram is useful for clarifying
points of research strategy
• Analysis plan
o Specify the kinds of data expected to be obtained
o Specify the means by which the data will be analysed and
interpreted
• Data processing plan
o Hand tabulation or computer
o Analysis technique: statistical measures
o Use of dummy tables
o Test hypotheses or derive hypotheses to meet the objectives of the
study
6. Personnel
• Principal investigator
• Assistants
• Supporting persons
7. Facilities available
• Office space
• Resources in field area
• Data analysis equipment
• Other assistance
8. Collaboration arrangement
• Describe the collaboration
9. Detailed budget
• Personnel
• Consultant fees
• Supplies
• Travel expenses
• Data processing
• Other expenses
VARIABLES
1. Types of variables
• Continuous or quantitative variables
• Discrete or qualitative variables
3. Qualitative or Discrete Variables
• Discrete variables are also called categorical variables
o Nominal variables
o Ordinal variables
• Nominal variables
o Nominal variables allow for only qualitative classification.
o That is, they can be measured only in terms of whether the
individual items belong to certain distinct categories, but we cannot
quantify or even rank order the categories
o Nominal data has no order, and the assignment of numbers to
categories is purely arbitrary.
o Because of lack of order or equal intervals, one cannot perform
arithmetic (+, -, / or *) or logical operation (<, >, =) on the nominal
data.
o E.g. male and female; unmarried, married, divorced or widowed.
• Ordinal variables
o A discrete ordinal variable is like a nominal variable, but its different
states are ordered in a meaningful sequence
o Ordinal data has order, but the intervals between scale points may
be uneven.
o Because of lack of equal distances, arithmetic operations are
impossible, but logical operations can be performed on the ordinal
data.
o A typical example of an ordinal variable is socio-economic status of
families.
o We know upper middle is higher than middle but we cannot say how
much higher.
o Ordinal variables are quite useful for subjective assessment of
quality; importance or relevance.
o Ordinal scale data are very frequently used in social and behavioral
research.
o Almost all opinion surveys today request answers on a three-, five- or
seven-point scale.
o Such data are not appropriate for analysis by classical techniques,
because the numbers are comparable only in terms of relative
magnitude, not actual magnitude.
o Consider for example a questionnaire item on the time involvement
by selecting one of the following codes:
1 = very low or nil
2 = low
3 = medium
4 = great
5 = very great
6. Confounding variable
• A confounding variable (also confounding factor, lurking variable, a
confound, or confounder) is an extraneous variable in a statistical
model that correlates (positively or negatively) with both the
dependent variable and the independent variable.
• Extraneous variables are undesirable variables that influence the
relationship between the variables that an experimenter is examining.
• In other words, a confounder is a variable that is associated with the
predictor variable and is a cause of the outcome variable.
DATA PRESENTATION
2. Tables
• One-way table (Univariate)
o Table 1: Number of respondents by gender
Gender    Primary   Secondary   Higher   Total
          n (%)     n (%)       n (%)    n (%)
Male      15 ()     20 ()       16 ()    51 ()
Female    14 ()     20 ()       12 ()    49 ()
Total     29 ()     40 ()       38 ()    100 ()
3. Charts
• A chart is a graphical way to organize data
• Types
o Pie chart
A pie chart is a graphical way to organize data
All pie charts compare parts of a whole
A pie chart uses percentages or fractions to compare data
A type of graph in which percentage values are represented as
proportionally-sized slices of a pie
Pie charts are especially useful in representing proportions,
percentages and fractions.
o Bar chart and Histogram
A histogram is a bar graph that shows frequency data
The first step is to collect data and sort it into categories
Label the data as the independent set or the dependent set
The data group would be the independent variable and the frequency
of that group would be the dependent variable
The horizontal axis should be labeled with the independent variable
The vertical axis should be labeled with the dependent variable
Marks on each axis should be at equal increments, such as 2,
4, 6, 8, etc
Think of a histogram as a set of “sorting bins”
You have one variable, and you sort data by this variable by
placing them into “bins”
Then you count how many pieces of data are in each bin
The height of the rectangle you draw on top of each bin is
proportional to the number of pieces in that bin
On the other hand, in a bar graph you have several measurements
of different items, and compare them
The main question a histogram answers is “how many measurements are
there in each of the classes of measurement?”
The main question a bar graph answers is “what is the measurement
for each item?”
o Line graph
Line graphs are more popular than all other graphs combined because their
visual characteristics reveal data trends clearly and these graphs
are easy to create
A line graph is a visual comparison of how two variables – shown
on the x- and y-axis – are related or vary with each other.
It shows related information by drawing a continuous line
between all the points on a grid.
Line graphs compare two variables: one is plotted along the x-axis
(horizontal) and the other along the y-axis (vertical)
The y-axis in a line graph usually indicates quantity (e.g. dollars,
liters) or percentage, while the horizontal x-axis often measures
units of time.
o Scatter plot
The pattern of the data points on the scatter plot reveals the
relationship between the variables.
Scatter plots can illustrate various patterns and relationships,
such as:
• Data correlation
• Positive or direct relationships between variables
• Negative or inverse relationships between variables
• Scattered data points
• Non-linear patterns
• Spread of data
• Outliers
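The direction of the relationship seen in a scatter plot can also be summarized numerically. The sketch below computes the Pearson correlation coefficient, which these notes do not define at this point, so treat it as a swapped-in illustration; the function name `pearson_r` and the data are hypothetical. A value near +1 suggests a positive (direct) linear relationship, and a value near -1 a negative (inverse) one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, used only for illustration.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]     # tends to increase with hours
falling = [10, 8, 6, 4, 2]   # decreases steadily with hours

print(f"rising:  r = {pearson_r(hours, rising):.2f}")
print(f"falling: r = {pearson_r(hours, falling):.2f}")
```

The coefficient only captures linear patterns, so non-linear patterns and outliers still need to be checked on the plot itself.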
o Pictograph
Z-SCORE & ITS USES
z = (x - μ) / σ
x = the value that is being standardized
μ = the mean of the distribution
σ = the standard deviation of the distribution
In every normal distribution, 0.3413 of the total area lies between the
mean and z = 1.
For a sample mean:
z = (x̄ - μ) / SE, where SE = σ / √n
x̄ = sample mean
SE = standard error
μ = population mean
σ = standard deviation
n = sample size
2. Value of z-score
• The sign tells whether the score is located above (+) or below (-) the
mean
• The number tells the distance between the score and the mean in terms
of the number of standard deviations.
• The z-score for an item indicates how far, and in what direction, that
item deviates from its distribution’s mean, expressed in units of its
distribution’s standard deviation.
• The mathematics of the z score transformation are such that if every
item in a distribution is converted to its z score, the transformed scores
will necessarily have a mean of zero and a standard deviation of one.
• Z scores are sometimes called “standard scores”.
• The z score transformation is especially useful when seeking to
compare the relative standings of items from distributions with
different standard deviations.
• Z scores are especially informative when the distribution to which they
refer is normal.
• In every normal distribution, the distance between the mean and a
given z score cuts off a fixed proportion of the total area under the
curve.
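The properties listed above can be checked with a short Python sketch. The function names `z_scores` and `area_mean_to_z` and the marks are hypothetical, and the standard deviation is computed as a population standard deviation (dividing by n), which is an assumption. The first function applies the z score transformation; the second gives the proportion of the area under a normal curve between the mean and a given z score.

```python
import math


def z_scores(values):
    """Convert each value to its z score (population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("z scores are undefined when all values are equal")
    return [(v - mean) / sd for v in values]


def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (z = 0) and z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))


# Hypothetical marks, used only for illustration.
marks = [50, 55, 60, 65, 70]
zs = z_scores(marks)

# The transformed scores have a mean of zero and a standard deviation of one.
z_mean = sum(zs) / len(zs)
z_sd = math.sqrt(sum((z - z_mean) ** 2 for z in zs) / len(zs))
print(f"mean of z scores = {z_mean:.4f}, SD of z scores = {z_sd:.4f}")

# Fixed proportion of the area between the mean and z = 1.
print(f"area between mean and z = 1: {area_mean_to_z(1.0):.4f}")
```

The area function uses the error function from the standard library instead of a printed z table; both give the same proportions.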
o A sample of 20 students attended PBL, and the average score of this
group of students is 65
o Is this increase of 5 marks in the average due to chance or to the
effect of PBL?
o The answer can be obtained with a z test
z = (x̄ - μ) / SE, where SE = σ / √n
x̄ = sample mean
SE = standard error
σ = standard deviation
μ = population mean
n = sample size
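A minimal Python sketch of the z test for the PBL example follows. The sample mean (65) and sample size (20) come from the example, and the population mean of 60 is implied by the stated increase of 5 marks. The population standard deviation is not given in these notes, so the value of 10 used here is purely an assumption for illustration, as is the function name `z_test`.

```python
import math


def z_test(sample_mean, population_mean, population_sd, n):
    """Return (z, two-tailed p-value) for a one-sample z test."""
    standard_error = population_sd / math.sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Two-tailed p-value from the standard normal distribution.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed


# population_sd=10 is an ASSUMED value; the notes do not state it.
z, p = z_test(sample_mean=65, population_mean=60, population_sd=10, n=20)
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")
```

Under that assumed standard deviation, z is about 2.24. A p-value below the chosen significance level (commonly 0.05) would suggest the increase is unlikely to be due to chance alone.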
t-test
SENSITIVITY & SPECIFICITY
1. Definition
• Sensitivity
o Proportion of subjects with a target condition who are identified by a
positive test finding.
o Test’s ability to correctly identify individuals with the condition
o Test’s capacity to detect the condition when it is truly present
o Probability of a test being positive given that the condition is present
o Also called true positive rate or hit rate
o The test will actually classify a person (with the condition) as likely
to have the condition
• Specificity
o Proportion of subjects free of the condition who are correctly
identified by a negative test result
o Test’s ability to correctly identify individuals without the condition
o Test’s capacity to exclude condition when it is truly absent
o Also called true negative rate or correct rejection rate
o The test will actually classify a person (without the condition) as
unlikely to have the condition
2. Validity of the Test
Sensitivity: a/(a+c)
= 36/(36+4)
= 0.90 = 90%
Interpretation → screening by physical exam and mammography will
identify 90% of all true breast cancer cases
Specificity: d/(b+d)
= 864/(96+864)
= 0.90 = 90%
Interpretation → screening by physical exam and mammography will
correctly classify 90% of all patients without breast cancer as being free of
disease.
PPP = a/(a+b)
= 36/(36+96)
= 0.27 = 27%
NPP = d/(c+d)
= 864/(864+4)
= 0.995 = 99.5%
Validity – the extent to which the test distinguishes between persons with
and without the condition
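The four calculations above can be reproduced with a short Python sketch. The function name `screening_measures` is hypothetical. The cell labels follow the formulas in this section, which imply a = true positives, b = false positives, c = false negatives and d = true negatives; the counts are those used in the worked example.

```python
def screening_measures(a, b, c, d):
    """Validity measures from a 2x2 screening table.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "PPP": a / (a + b),
        "NPP": d / (c + d),
    }


# Counts from the breast cancer screening example in this section.
measures = screening_measures(a=36, b=96, c=4, d=864)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts, sensitivity and specificity are both high, yet only about a quarter of positive results are true cases, because false positives (96) far outnumber true positives (36).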