IL2-Describing Variation in Data

Describing Variation in Data

A/P Koh Woon Puay
MBBS, PhD
Email: ephkwp@nus.edu.sg
EPH office, MD3, Level 3
Tel: 6516 4975

Measurement variables (continuous variables)

• Variables with an infinite number of values that are equally spaced
• Can be measured along a numerical continuum
• E.g. height, weight, temperature, blood pressure
Long ordinal data

• Ordinal data which are graded on a long scale, especially if numerically represented, may sometimes be treated as continuous data
• E.g. depression or anxiety on a scale of 1 to 10
• But they are not true continuous data because:
  • They have a finite number of distinct values
  • There are gaps in the continuum
  • Spacing between categories is not numerically equivalent (this may limit the interpretation of the results in analysis)

Presenting continuous data

• Graphically: histogram
• Descriptive:
  • Summarize data with a single value (measure of central tendency)
  • Measure absolute spread (dispersion)
Distribution of age among diabetic patients in the polyclinic

[Figure: histogram of age among diabetic patients in the polyclinic]

Measures of central tendency

• Summarize the data set with a single value
• Mean, median and mode
• The mean is the average value of all the data in the set.
• The median is the value that has exactly half the data above it and half below it.
• The mode is the value that occurs most frequently in the set (rarely used).
Example: systolic blood pressure

130, 145, 150, 160, 165
Mean: (130+145+150+160+165)/5 = 150
Median: 150

130, 145, 150, 160, 165, 170
Mean: (130+145+150+160+165+170)/6 = 153.3
Median: (150+160)/2 = 155

Advantages and disadvantages

• Mean
  • Widely used, easy to understand, measures central location
  • Overly sensitive to extreme values
• Median
  • Insensitive to very large or very small values
  • Determined by the middle points and less sensitive to the actual numerical values of the other data points
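The arithmetic in the blood pressure example can be checked with a short Python sketch using the standard `statistics` module. The extra reading of 300 is a made-up extreme value, not part of the slides, added only to illustrate that the mean reacts to it while the median barely moves.

```python
from statistics import mean, median

# Systolic blood pressure readings from the example
bp_five = [130, 145, 150, 160, 165]
bp_six = bp_five + [170]

mean_five = mean(bp_five)      # (130+145+150+160+165)/5
median_five = median(bp_five)  # middle value of an odd-sized set
mean_six = mean(bp_six)        # (130+...+170)/6
median_six = median(bp_six)    # average of the two middle values

# Hypothetical extreme reading (an assumption, not from the slides)
bp_outlier = bp_five + [300]
mean_outlier = mean(bp_outlier)
median_outlier = median(bp_outlier)

print(mean_five, median_five)
print(round(mean_six, 1), median_six)
print(mean_outlier, median_outlier)
```

Replacing 170 with 300 moves the mean from about 153.3 to 175, while the median stays at 155.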
[Figure: three histograms with the mean and median marked. Panel 1: mean = median. Panel 2: mean > median (median to the left of the mean). Panel 3: mean < median (mean to the left of the median).]

A normal distribution

• Bell curve, or bell-shaped histogram.
• Most of the values accumulate around the middle. The mean, median and mode are all equal, and the scores at either end of the distribution occur less often.
Skewness

• Skewed to right: the scores tend to cluster toward the lower end of the scale
• Skewed to left: most of the scores tend to occur toward the upper end of the scale, while increasingly fewer scores occur toward the lower end.

[Figure: two histograms, each with the x-axis labelled "Sex-partners"]

Measures of spread of continuous or measurement data

• Range
• Quartiles
• Variance and standard deviation
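To connect the skewness descriptions with the mean and median comparison, here is a minimal sketch. The counts are hypothetical, not data from the slides: most values are small and a few are large, which gives a right skew. The left-skewed list is simply the mirror image of the same numbers.

```python
from statistics import mean, median

# Hypothetical right-skewed counts (an assumption for illustration):
# scores cluster at the lower end, with a long tail to the right.
right_skewed = [0, 1, 1, 1, 2, 2, 3, 4, 10, 25]

# Mirror image of the same data, which is skewed to the left.
left_skewed = [25 - x for x in right_skewed]

print(mean(right_skewed), median(right_skewed))  # mean lies above the median
print(mean(left_skewed), median(left_skewed))    # mean lies below the median
```

In both cases the mean is pulled toward the long tail, while the median stays near the bulk of the data.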
Range

• Range = difference between highest and lowest observed values
• Greatly influenced by the presence of just one unusually large or small value (outlier).
• Can be expressed as an interval, such as 3-8, or as an interval width, such as a range of 5.

Median and quartiles

• The median (Q2) divides the data into two equal sets.
• The lower quartile (Q1) is the value where 25% of the values are smaller than Q1 and 75% are larger.
• The upper quartile (Q3) is the value where 75% of the values are smaller than Q3 and 25% are larger.
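The two ways of expressing a range, and its sensitivity to a single outlier, can be sketched as follows. The helper name `data_range` and both data sets are hypothetical, chosen so that the first one matches the "3-8, range of 5" wording.

```python
def data_range(values):
    """Return the range as (lowest, highest, width)."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest

# Hypothetical data: interval 3-8, width 5
print(data_range([3, 4, 6, 8]))       # (3, 8, 5)

# One unusually large value stretches the range considerably
print(data_range([3, 4, 6, 8, 40]))   # (3, 40, 37)
```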
Example 1 – Upper and lower quartiles

• Data: 8, 49, 51, 17, 45, 43, 9, 41, 45, 43, 38
• Ordered data: 8, 9, 17, 38, 41, 43, 43, 45, 45, 49, 51
• Lower quartile: 17
• Median: 43
• Upper quartile: 45

Interquartile Range

• Interquartile range = difference between upper quartile (Q3) and lower quartile (Q1)
• Interquartile range spans the middle 50% of a data set, and eliminates the influence of outliers
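The quartiles in Example 1 can be reproduced with Python's standard `statistics.quantiles`. This is a sketch under one assumption: the slides do not say which quartile convention they use, and the `"exclusive"` method is chosen here because it happens to give the same values. Other conventions can give slightly different quartiles for the same data.

```python
from statistics import quantiles

data = [8, 49, 51, 17, 45, 43, 9, 41, 45, 43, 38]

# method="exclusive" places the quartiles at positions (n+1)/4, 2(n+1)/4
# and 3(n+1)/4 of the ordered data; for n = 11 these are the 3rd, 6th
# and 9th values.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1   # interquartile range = Q3 - Q1

print(sorted(data))
print(q1, q2, q3, iqr)
```

For this data set the interquartile range is 45 - 17 = 28.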
Graphic illustrations

• Box-plots
• Error-bars

[Figure: box plot with the upper quartile and lower quartile labelled]

Percentile rank

• Divide all values into 100 parts (percentiles)
• The percentile rank of a specific score is the proportion of values in a distribution that the score is greater than or equal to.
• E.g. if you received a score of 75 on a math test and this score was greater than or equal to the scores of 85% of the students taking the test, then your percentile rank would be 85 (85th percentile).
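The percentile rank definition translates directly into a few lines of Python. The function name `percentile_rank` and the list of 20 test scores are hypothetical, constructed so that a score of 75 is greater than or equal to 17 of the 20 scores (85%), as in the math test example.

```python
def percentile_rank(scores, score):
    """Percentage of values in `scores` that `score` is greater than or equal to."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical test scores for 20 students (an assumption for illustration)
scores = [40, 45, 50, 52, 55, 58, 60, 62, 63, 65,
          66, 68, 70, 71, 72, 74, 75, 80, 88, 95]

print(percentile_rank(scores, 75))   # 85.0
```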
Advantages and disadvantages

• Range
  • Very easy to compute
  • Very sensitive to extreme observations
  • Poor indication of distribution of points in between
• Quartiles
  • Less sensitive to outliers
  • Some of the observations are not used

Variance

• Variance combines all the values in a data set to produce a measure of spread.
• The variance (symbolized by $s^2$) is the sum of the squared deviations from the mean, divided by the number of observations minus 1 (degrees of freedom):

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
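As a sketch of the variance definition, the function below computes the sum of squared deviations from the mean divided by n - 1, and compares it with `statistics.variance` from the standard library. The function name `sample_variance` is an illustrative choice; the data are the five systolic blood pressure readings from the earlier example.

```python
from statistics import mean, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

bp = [130, 145, 150, 160, 165]   # readings from the earlier example

# Deviations from the mean of 150 are -20, -5, 0, 10, 15;
# their squares sum to 750, and 750 / 4 = 187.5.
print(sample_variance(bp))
print(variance(bp))              # library result for comparison
```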
Degree of freedom

• The number of values that are free to vary once the mean is known.
• E.g. 80, 85, 90, 105, X. If the mean is 95, X = 115. Hence only 4 out of 5 values can be changed freely while still getting back mean = 95 (n - 1).

Standard Deviation

• Standard deviation (s) = square root of the variance (gives back the original scale)
• Properties of standard deviation
  • measures spread or dispersion around the mean of a data set.
  • never negative.
  • sensitive to outliers.
  • for data with approximately the same mean, the greater the spread, the greater the standard deviation.
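The degree of freedom example can be worked through in code: once the mean is fixed, the last value is forced by the other four. The sketch then takes the square root of the variance of the completed data set to get the standard deviation. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

known = [80, 85, 90, 105]   # the four values that are free to vary
target_mean = 95
n = len(known) + 1

# Once the mean is fixed, the last value is forced:
# X = n * mean - sum(known) = 5 * 95 - 360
x = n * target_mean - sum(known)

data = known + [x]
sd = sqrt(variance(data))   # standard deviation = square root of the variance

print(x, mean(data), round(sd, 2))
```

Because the standard deviation is a square root, it is in the same units as the data and is never negative.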
Normal distribution

• Many kinds of physiological data are approximated well by the normal distribution.
• Many statistical tests assume a normal distribution.
• Most of these tests work well even if the distribution is only approximately normal, and in many cases as long as it does not deviate greatly from normality.

More about a normal distribution

• If the mean and standard deviation of a normal distribution are known, it is relatively easy to figure out the percentile rank.
• In a normal distribution, about 68% of the scores are within one standard deviation of the mean, about 95% of the scores are within two standard deviations, and about 99.7% of the scores are within three standard deviations.
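The coverage figures for one, two and three standard deviations can be checked with `statistics.NormalDist` from the Python standard library. The second part sketches how a percentile rank follows from a known mean and standard deviation; the values used there (mean 100, SD 15, score 115) are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * within(k), 1))

# Percentile rank of a score from a known mean and SD
# (hypothetical values: mean 100, SD 15, score 115, i.e. one SD above the mean)
rank = 100 * NormalDist(mu=100, sigma=15).cdf(115)
print(round(rank, 1))
```

The three proportions come out at roughly 68.3%, 95.4% and 99.7%.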
• E.g. 47,000 babies born in a hospital
• 1,000 babies sampled, 1,000 weights obtained
• Mean = 3.25 kg, SD = 0.3 kg
• 95% of all the 1,000 babies lie within 3.25 +/- (2 x 0.3) kg.
• 95% of all the 1,000 babies lie within 2.65 and 3.85 kg.
• 2.5% weigh less than 2.65 kg and 2.5% weigh more than 3.85 kg.

[Figure: histogram of the sampled birth weights; y-axis: Counts; x-axis ticks from 2.0 to 4.0]
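The limits in the birth weight example can be recomputed as mean ± 2 SD. The sketch below also asks a normal model for the tail proportions, which assumes the weights are normally distributed. With exactly 2 SD each tail is about 2.3%; the rounded 2.5% figure corresponds to 1.96 SD.

```python
from statistics import NormalDist

mean_kg, sd_kg = 3.25, 0.3          # sample mean and SD from the example
low, high = mean_kg - 2 * sd_kg, mean_kg + 2 * sd_kg

weights = NormalDist(mu=mean_kg, sigma=sd_kg)   # assumes normality
below = weights.cdf(low)            # proportion lighter than the lower limit
above = 1 - weights.cdf(high)       # proportion heavier than the upper limit

print(round(low, 2), round(high, 2))
print(round(100 * below, 1), round(100 * above, 1))
```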
