
RECAP OF THE PREVIOUS LECTURE
- Discussed the significance of Statistics for physicians
- Suggested study strategies for learning Statistics
- Presented the role of Statistics in the scientific process
- Reviewed basic concepts of Statistics, including:
  o Sample Selection & Data Collection
  o Initial Data Manipulation
  o Tabular & Numerical Methods of Exploratory Data Analysis

OBJECTIVES OF THIS LECTURE
- Discuss Graphical Methods of Exploratory Data Analysis
- Present basic methods of Statistical Inference
- Introduce the most commonly used statistical tests
- Provide directions for further study of Statistics

Graphical methods make it possible to discover trends & patterns in data

COMMON GRAPHICAL METHODS


- PIE chart
- LINE chart
- BAR chart
- HISTOGRAM
- DOT plot
- STEM-&-LEAF plot
- BOX-&-WHISKERS plot
- SCATTER plot
- FREQUENCY POLYGON
- Q-Q plot

PIE CHART
- A circular chart divided into sectors
- It illustrates numerical proportion
- A simple but flawed method:
  o unsuitable for large data sets
  o inconvenient for data comparisons
- Commonly used in business; avoided in research

LINE CHART
- A series of data points (markers) connected by straight line segments
- It shows how data change over time & in response to interventions
- It reveals trends very clearly
- Error bars added to markers provide additional information about the data
- Commonly used in research
[Figure: example line chart of perfusion over time]

BAR CHART
- Useful for presenting DISCRETE data
- Uses horizontal or vertical bars to show comparisons among categories
- Axes represent categories vs discrete data (e.g. radioactivity in dpm)
- Stacked bar graphs: bars divided into subparts
- Grouped bar graphs: bars clustered in groups
- Commonly used in research & business

HISTOGRAM
- Useful for presenting CONTINUOUS data
- Used to show the data distribution
- Uses RECTANGLES whose:
  o widths represent class intervals (BINS), i.e. equal-width groups into which data are put
  o heights are proportional to the frequencies of the corresponding bins
- Histogram a.k.a. Frequency Plot (FP)
- A TRUE HISTOGRAM (TH) differs from an FP; a TH uses RECTANGLES whose:
  o widths represent class intervals (BINS), like an FP, but
  o areas (not heights) are proportional to the corresponding frequencies
  o heights equal the Frequency Density of the interval, i.e. the frequency divided by the width of the interval
- An FP is identical to a TH only when relative frequencies and bins of equal length are used for the FP

BINS (class intervals):
- equal-width groups into which data are put
- a series of ranges of numerical values into which data are sorted
- There is no "best" number of bins
- Different bin sizes can reveal different features of the data

FREQUENCY: the number of observations in each class interval (BIN)

ETYMOLOGY
- The term histogram may sound awkward to physicians: it evokes the Greek word hystera (uterus), but this is just a coincidence
- The origin of the term is uncertain; it was likely coined by the statistician Karl Pearson, either as histos (mast) + gramma (writing), or as a contraction of "historical diagram"
- HISTOGRAM: "representation of a frequency distribution by means of rectangles, whose widths represent class intervals (BINS) and whose areas are proportional to frequencies"

EXAMPLE
- Data set: {1,2,2,3,3,3,3,4,4,5,6}
- Divide it into BINS and note the frequencies: {[1] [2,2] [3,3,3,3] [4,4] [5] [6]}
- The assignment pattern here is:
  o Bin 1 contains 1: its frequency is 1
  o Bin 2 contains 2,2: its frequency is 2
  o Bin 3 contains 3,3,3,3: its frequency is 4, etc.

In real data sets most numbers will be unique
- Data set: {3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 38, 45, 49}
- Let's use ranges as BINS; with ranges of width 10, the data can be organized as follows:

  Data Range (BIN)   Data                       Frequency
  0-10               3                          1
  10-20              11, 12, 19                 3
  20-30              22, 23, 24, 25, 27, 29     6
  30-40              35, 36, 37, 38             4
  40-50              45, 49                     2
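The range-binning above takes only a few lines of code. A minimal sketch using Python's standard library, with the data set from this slide (the bin label format is my own choice):

```python
from collections import Counter

data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 38, 45, 49]
width = 10

# Assign each value to a bin by its lower bound (integer division by the width)
counts = Counter((x // width) * width for x in data)
freq = {f"{lo}-{lo + width}": counts[lo] for lo in sorted(counts)}
print(freq)  # {'0-10': 1, '10-20': 3, '20-30': 6, '30-40': 4, '40-50': 2}
```

The resulting frequencies (1, 3, 6, 4, 2) match the table above; plotting them as rectangles of equal width gives the histogram.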

A HISTOGRAM makes it possible to analyze a dataset by reducing it to a single graph showing the significance of primary & secondary peaks

THIS HISTOGRAM tells us that in the dataset:
- The peak is:
  o well-defined
  o close to the Median & Mean
- Outliers are not frequent
- Thus: deviations from the mean are not frequent

THIS HISTOGRAM tells us that in the dataset:
- The peak is:
  o not well-defined
  o fairly close to the Median & Mean
- Outliers are frequent
- Thus: deviations from the mean are frequent

THIS HISTOGRAM tells us that in the dataset:
- There are two peaks: a taller primary peak & a shorter secondary peak
- The Median & Mean are hard to localize
- Outliers are hard to define as well
- Thus: it is poorly defined whether the data contain one signal or two

FREQUENCY POLYGON
- Drawn by joining the midpoints of the tops of the bars of a histogram
- Unlike histograms, frequency polygons can be superimposed
- Used to compare the frequency distributions of multiple datasets on one diagram

DOT PLOT
- Data plotted on a simple scale, using circles (dots)
[Figure: dot plot of favorite movie genres — SciFi, Comedy, Action, Horror, Drama]
- WILKINSON dot plot: used for univariate data; a simple representation of a distribution; useful for highlighting clusters, gaps & outliers in small datasets
- CLEVELAND dot plot: used for multivariate data; plots points belonging to several categories; an alternative to bar charts, with the bars replaced by dots at the values associated with each category

STEM-&-LEAF PLOT
- Two columns separated by a line:
  o Left: the stem contains all but the last digit of each number
  o Right: the leaves contain the last digit
- A visualization of the distribution of a small data set
- Data set: 44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106

     4 | 4 6 7 9
     5 |
     6 | 3 4 6 8 8
     7 | 2 2 5 6
     8 | 1 4 8
     9 |
    10 | 6

  Key: 6|3 = 63 (leaf unit: 1.0, stem unit: 10.0)

- Useful for spotting outliers and finding the mode
- Popular in the 1980s (easy to make with typewriters); became less common after computer graphics became available
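Stem-and-leaf plots are as easy to generate by program as they once were by typewriter. A small sketch over the data set above (the formatting is a plain-text approximation):

```python
from collections import defaultdict

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)   # stem: all digits but the last; leaf: last digit

# Print every stem in range, including empty ones (e.g. 5 and 9)
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem:3d} | {leaves}")
```

The repeated leaf 8 under stem 6 makes the mode (68) visible at a glance, and the lone leaf under stem 10 flags 106 as a possible outlier.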

BOX-&-WHISKERS PLOT
[Figure: box plot labeled with upper/lower whiskers, Q1, Q2 (Median), Q3, Mean, IQR, and outliers]
- Conventions used for the box are uniform:
  o Bottom: first quartile (Q1)
  o Line inside: Median (Q2)
  o Symbol inside (e.g. +): Mean
  o Top: third quartile (Q3)
  o Height: Interquartile Range (IQR)
- The width of the box is arbitrary, as there is no x-axis
- Spacings between the parts of the box help indicate the degree of dispersion (spread) and skewness (asymmetry)
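The box components above can be computed directly with Python's statistics module. A sketch with made-up data, using Tukey's 1.5×IQR fences (one common whisker convention among the several listed below):

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 9, 10, 11, 12, 25]

# Q1, Q2 (median), Q3 via the "inclusive" quartile method
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's convention: points beyond 1.5*IQR from the box are outliers
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q2, q3, iqr, outliers)  # 5.0 8.0 10.0 5.0 [25]
```

Here the value 25 falls above the upper fence (17.5), so it would be drawn as a separate dot beyond the upper whisker.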

- Conventions used for whiskers & outliers vary
- Whiskers: their ends can represent:
  o the minimum & maximum of the data
  o one Standard Deviation above & below the Mean
  o the 9th and 91st percentiles
  o the 2nd and 98th percentiles
- Outliers: data not included between the whiskers are plotted with dots, circles, stars, or lines
- All cases (but the exceptions) fit between the upper & lower marks
- If the distribution is not symmetric, the median line is NOT in the middle of the box

EXAMPLE OF MULTIVARIATE BOX-PLOTS
Box plots of the heights of 240 students by gender (M vs F) show:
- A difference in the distributions of height between M & F
- Similar spread in M & F: similar heights of the boxes on either side of the medians
- More outliers among M

SCATTER PLOT
- Uses Cartesian coordinates to display values for two numerical variables
- Useful when comparing discrete variables vs numeric variables
- Can suggest various kinds of correlations between variables

Q-Q PLOT
[Figure: Q-Q plots with OBSERVED quantiles on the x-axis and THEORETICAL quantiles on the y-axis]
- Used for comparing two probability distributions by plotting their quantiles against each other:
  o x-coordinate: a quantile of the 1st distribution
  o y-coordinate: the corresponding quantile of the 2nd distribution
- Q-Q plot on the line y=x -> the distributions are similar
- Q-Q plot along some other straight line -> the distributions are linearly related
- Commonly used to compare an observed data set to a theoretical model
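Constructing the Q-Q coordinates is mechanical: sort the observed data, then pair each order statistic with the theoretical quantile at the matching plotting position. A sketch for normal data (sample size, seed, and the (i+0.5)/n plotting position are my choices; other positions are also in use):

```python
import statistics

nd = statistics.NormalDist(mu=0, sigma=1)
observed = sorted(nd.samples(200, seed=1))   # simulated "observed" data
n = len(observed)

# Theoretical quantiles at plotting positions (i + 0.5) / n
theoretical = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
points = list(zip(theoretical, observed))    # the Q-Q plot coordinates

# For truly normal data the points hug the line y = x: correlation near 1
mx, my = statistics.fmean(theoretical), statistics.fmean(observed)
num = sum((x - mx) * (y - my) for x, y in points)
den = (sum((x - mx) ** 2 for x in theoretical)
       * sum((y - my) ** 2 for y in observed)) ** 0.5
r = num / den
print(round(r, 3))
```

A markedly curved point pattern (low correlation) would instead suggest the observed data do not follow the theoretical model.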

STATISTICAL INFERENCE
- Drawing conclusions from data subject to random variation, such as observational errors and random sampling/experiments
- It makes conclusions about a population by analyzing a sample of it
- Includes three interrelated areas: ESTIMATION, CONFIDENCE INTERVALS, HYPOTHESIS TESTS

- Statistics reflect acts of interpretation, not absolute facts
- Conclusions of a statistical inference are statistical PROPOSITIONS
- The final inference is obtained by using the following interrelated propositions:
  o ESTIMATION: calculation of an attribute of the sample (a statistic) representing the "best estimate" of the corresponding attribute of the population (a parameter)
  o CONFIDENCE INTERVAL: a calculated region likely to contain the true values of the attributes of interest; it indicates the reliability of an estimate
  o HYPOTHESIS TESTING: consideration of whether chance is a plausible explanation of the findings; it uses the data to decide if the hypothesis that there is no relationship between the measured phenomena (the Null Hypothesis) can be rejected

- Statistical ASSUMPTIONS: suppositions about the mechanisms & features of the Population & Sample; Statistical Inference relies on them
- Statistical MODEL: a set of statistical assumptions
- Assumptions can be divided into:
  o Non-Modeling (re: Population, Sample)
  o Modeling (re: Distribution, Structure, Cross-Variation)

Based upon its Modeling Assumptions, INFERENCE can be divided into:
- PARAMETRIC: assumes that the population fits a certain ideal distribution described by parameters
- NON-PARAMETRIC: does not depend on the population fitting any parameterized distribution
- SEMI-PARAMETRIC: has both parametric and nonparametric components

PARAMETRIC INFERENCE (PI)
- Assumes the existence of an idealized distribution for the population from which the sample is drawn
- Uses the known shape & parameters of that ideal distribution for the inference
- PI is less robust but simpler than NPI:
  o It makes more assumptions than NPI; if those assumptions are correct it gives better estimates than NPI, but if they are wrong it is misleading — thus it is less robust than NPI
  o Its models are simpler than NPI's — thus it is more convenient than NPI
- PI has been the most commonly used approach, but has become the subject of recent criticism*

(*) Nassim Taleb. The Black Swan: The Impact of the Highly Improbable. 2nd Ed, 2010

NON-PARAMETRIC INFERENCE (NPI)
- Relies on no or few assumptions about the shape or parameters of the population distribution from which the sample is drawn
- NPI is useful for the study of populations that take on a ranked order
- Compared to PI:
  o NPI is frequently less convenient, but always more robust
  o NPI has less power (a larger sample size is required to draw conclusions with the same degree of confidence)
  o NPI can occasionally be simpler than PI
- NPI is seen by some as leaving less room for misuse & misunderstanding
- CAVEAT: the term non-parametric has additional meanings in Statistics, e.g. it also denotes techniques that do not assume that the structure of a model is fixed

ESTIMATION
- Using the data to provide a suitable guess at the population attributes
- STATISTIC: any mathematical function of the data in a sample
- ESTIMATOR: a statistic for calculating an estimate based on data
- ESTIMATE: the result of an estimator (e.g. Mean, Variance)
- POINT estimator: yields a single-valued result
- INTERVAL estimator: yields a range of plausible values
- ERROR of the estimator: reflects the degree of its precision and reliability; it depends on sample size

SAMPLING DISTRIBUTION (of a statistic)
- A probability distribution that describes the probabilities of the possible values of a specific statistic
- The form of the sampling distribution will depend on the population distribution
- It is necessary for constructing confidence intervals & for hypothesis testing

EXAMPLE: draw 2 pool balls (numbered 1-3) out of 3, with replacement, and record the mean:

  Outcome:  1    2    3    4    5    6    7    8    9
  Ball 1:   1    1    1    2    2    2    3    3    3
  Ball 2:   1    2    3    1    2    3    1    2    3
  Mean:     1.0  1.5  2.0  1.5  2.0  2.5  2.0  2.5  3.0

  Mean   Frequency   Relative Frequency (= Probability)
  1.0    1           0.111
  1.5    2           0.222
  2.0    3           0.333
  2.5    2           0.222
  3.0    1           0.111

SAMPLING DISTRIBUTION OF THE MEAN
- A probability distribution that describes the probabilities of the possible values of the Mean
- Consider a Normal population; take samples of a given size from it repeatedly
- Calculate the Mean of each sample (the Sample Mean); each sample has its own Mean value
- The distribution of these Means is called the sampling distribution of the mean

STANDARD ERROR
- STANDARD ERROR of a statistic: the Standard Deviation (SD) of the sampling distribution of that statistic
- STANDARD ERROR OF THE MEAN (SEM): the SD of the sampling distribution of the Mean
- SEM vs SD:
  o SEM: an estimate of how far the sample mean is likely to be from the population mean; it is an inferential statistic
  o SD: the degree to which individuals within the sample differ from the sample mean; it is a descriptive statistic
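The SEM definition can be checked by simulation: draw many samples, take the SD of their means, and compare with the theoretical value σ/√n. A sketch with a made-up Normal population (mean 100, SD 15):

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible
population = statistics.NormalDist(mu=100, sigma=15)
n = 25

# Draw many samples of size n; the SD of their means approximates the SEM
sample_means = [statistics.fmean(population.samples(n)) for _ in range(2000)]
empirical_sem = statistics.stdev(sample_means)
theoretical_sem = 15 / n ** 0.5   # sigma / sqrt(n) = 3.0
print(round(empirical_sem, 2), theoretical_sem)
```

The two values agree closely, illustrating why the SEM shrinks as the sample size grows.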

BIAS of an estimator
- The difference between the expected value of an estimator & the population parameter it is designed to estimate
- An estimator is unbiased for a parameter A if its expected value is precisely A
- An estimator is biased for a parameter A if its expected value differs from A
- Unbiasedness is usually a desirable property

CENTRAL LIMIT THEOREM (CLT)
- The CLT states that the sampling distribution of the mean (and of many other statistics) will be Normal or nearly Normal, if the sample size is large enough
- The Normal distribution is useful for modeling
- The CLT allows us to make assumptions about the population
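The CLT can be seen in a simulation: start from a population that looks nothing like a bell curve, and the means of repeated samples still cluster normally. A sketch with a skewed exponential population (all parameters are illustrative choices):

```python
import random
import statistics

random.seed(0)

# Strongly skewed population: exponential with mean 1 (nothing like a bell curve)
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(50) for _ in range(5000)]

# The sampling distribution of the mean is centred on the population mean (1.0)
center = statistics.fmean(means)
# ...with spread close to sigma / sqrt(n) = 1 / sqrt(50) ~ 0.141
spread = statistics.stdev(means)
print(round(center, 3), round(spread, 3))
```

Plotting a histogram of `means` would show a nearly symmetric bell shape despite the heavily skewed population.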

STANDARD NORMAL DISTRIBUTION & Z-SCORES
- Standardized normal distribution: an idealized Normal distribution with Mean 0 & Standard Deviation 1
- It allows a single compact table to serve all normal distributions (widely used before computers became available)
- Score (raw score, datum): an original observation that has not been transformed
- Z-score (standard score): the number of standard deviations a score is from the mean of the population
  o Positive z-score: a datum above the mean
  o Negative z-score: a datum below the mean
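The z-score transformation is a one-liner. A sketch using an IQ-style scale (mean 100, SD 15) as a made-up example:

```python
def z_score(x, mean, sd):
    """Number of standard deviations a raw score x lies from the mean."""
    return (x - mean) / sd

print(z_score(130, 100, 15))  # 2.0  -> two SDs above the mean
print(z_score(85, 100, 15))   # -1.0 -> one SD below the mean
```

Once scores are converted to z-scores, a single standard-normal table (or function) gives the probabilities for any normal distribution.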

t-DISTRIBUTION
- A probability distribution used to estimate normal population parameters when the sample size is small and/or the population Standard Deviation is unknown
- Per the CLT, the sampling distribution of a statistic will follow a Normal distribution (ND) as long as the sample size is sufficiently large
- Thus, when we know the Standard Deviation (SD) of the population, we can compute a z-score and use the ND to evaluate probabilities involving the sample mean
- In reality, sample sizes are sometimes small and the population SD is unknown; when either occurs, we have to rely on the distribution of the t statistic (also known as the t score)

CONFIDENCE INTERVAL (CI)
- Indicates the precision & accuracy of an estimator
- Margin of Error: reflects Observational (Measurement) Error — the difference between a measured value of a quantity & its true value
- A CI gives an estimated range of values which is likely to include an unknown population parameter
- It is calculated from a sample dataset
- If independent samples are taken repeatedly from the same population, & a CI is calculated for each sample, then a certain percentage (the Confidence Level) of the intervals will include the unknown population parameter
- CIs are usually calculated so that this percentage (CL) is 95%

[Figure: CI visualized — 300 experiments with sample size 10 and population mean 50; yellow: 95% CIs containing the mean, red: those that do not; blue: 99% CIs containing the mean, white: those that do not]

- The width of a CI indicates how uncertain we are about the unknown parameter:
  o Wide CI (small n): less confident
  o Narrow CI (large n): more confident
- A wide CI suggests that more data should be collected before anything definite can be said about the parameter
- CIs are more informative than hypothesis tests: they provide a range of plausible values for the unknown parameter
- CIs are underrated and underused in research; most medical journals prefer to rely on the p-value instead
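A 95% CI for a mean is the sample mean plus/minus a critical value times the SEM. A sketch with made-up data; for simplicity it uses the large-sample z critical value, though at n = 10 a t critical value (~2.26 for 9 degrees of freedom) would properly be used:

```python
import statistics

# Made-up sample of 10 measurements
sample = [102, 98, 110, 95, 104, 99, 107, 101, 96, 103]
n = len(sample)
mean = statistics.fmean(sample)
sem = statistics.stdev(sample) / n ** 0.5   # standard error of the mean

# Large-sample (z-based) 95% interval
z = statistics.NormalDist().inv_cdf(0.975)  # ~1.96
ci = (mean - z * sem, mean + z * sem)
print(round(mean, 1), round(ci[0], 1), round(ci[1], 1))  # 101.5 98.6 104.4
```

Reading the result: values of the population mean between roughly 98.6 and 104.4 are plausible given this sample; a wider interval would signal that more data are needed.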

HYPOTHESIS TESTING (HT)
- Consideration of whether chance is a plausible explanation of the findings
- It uses the data to decide if the Null Hypothesis can be rejected
- Null Hypothesis (H0): there is no relationship between the measured phenomena
- Alternative Hypothesis (H1): there is a relationship between the measured phenomena; this is typically the hypothesis of interest to the researcher

- The decision in HT may be correct or may be in error; there are two types of errors, depending on which of the hypotheses is actually true:
  o TYPE I error: rejecting the Null Hypothesis H0 when H0 is true
  o TYPE II error: failing to reject H0 when the Alternative Hypothesis H1 is true

TYPE I ERROR (TIE)
- Rejecting the Null Hypothesis H0 when it is true
- It asserts something that is absent: a false positive
- The rate of TIE is called the size of a test (α); it usually equals the Significance Level of the test (α)
  o If H0 is simple, α is the probability of a TIE
  o If H0 is composite, α is the maximum of the possible probabilities of a TIE
- The rate of Type II Errors (TIIE) is called the false negative rate (β)

- Significance testing involves calculating the probability that a statistic would differ from the parameter specified in H0 as much as, or more than, the statistic obtained in the experiment does
- One-tailed probability: computed considering differences in only one direction (e.g. the statistic is larger than the parameter)
- Two-tailed probability: computed considering differences in both directions (the statistic is either larger or smaller than the parameter)

STEPS OF SIGNIFICANCE TESTING
1. Specify the H0 hypothesis
2. Select the Significance Level (α)
3. Compute the p-value
4. Compare the p-value with α

- In significance testing, H0 is typically the hypothesis that a parameter is zero or that a difference between parameters is zero
  o E.g.: the null hypothesis might be that the difference between population means is 0
- The Significance Level (α) is the highest probability value for which H0 is rejected
- Common Significance Levels are 0.05 & 0.01
- For α = 0.05: H0 is rejected if the probability value p ≤ 0.05

- The probability value (p-value) is the probability of obtaining a statistic as different, or more different, from the parameter specified in H0 as the statistic obtained in the experiment
- The p-value is computed assuming H0 is true
- The lower the p-value, the stronger the evidence that H0 is false
- Traditionally, H0 is rejected if p ≤ 0.05
- Final step: compare the p-value with the Significance Level α:
  o If p < α, then H0 is rejected
  o Rejecting H0 is not an all-or-none decision: the lower the p-value, the more confidence that H0 is false
  o If p > α, the findings are inconclusive
- Failure to reject H0 does not constitute support for H0; it just means there are not sufficiently strong data to reject it
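The four steps of significance testing can be traced in a few lines. A sketch of a one-sample, two-tailed z-test with made-up numbers (the z-test assumes a known population SD; with an unknown SD, a t-test would be used instead):

```python
import statistics

# Step 1 (H0: population mean = 100, known sigma = 15) and step 2 (alpha = 0.05).
# Made-up data: a sample of n = 36 gave a mean of 106.
sample_mean, mu0, sigma, n, alpha = 106.0, 100.0, 15.0, 36, 0.05

# Step 3: test statistic and two-tailed p-value
z = (sample_mean - mu0) / (sigma / n ** 0.5)
p = 2 * (1 - statistics.NormalDist().cdf(abs(z)))

# Step 4: compare p with alpha
decision = "reject H0" if p < alpha else "fail to reject H0"
print(round(z, 2), round(p, 4), decision)  # 2.4 0.0164 reject H0
```

Since p ≈ 0.016 < 0.05, H0 is rejected here; had p exceeded α, the result would be inconclusive rather than evidence for H0.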

STATISTICAL POWER (SP)
- The probability that the test will reject H0 when H1 is true, i.e. the probability of not committing a Type II Error
- Power of a test = 1 − β (β is the rate of TIIE)
- SP (a.k.a. sensitivity) is a function of the possible distributions, determined by a parameter, under H1
- As SP increases, the chance of a TIIE decreases
- Power analysis can be used to calculate the minimum:
  o sample size required to detect an effect of a given size
  o effect size that can be detected using a given sample size
- SP is used to make comparisons between different statistical tests (e.g. between a parametric and a nonparametric test of the same hypothesis)
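A minimal power-analysis sketch for the common case of comparing two group means. This uses the normal-approximation formula n = 2((z_α/2 + z_β)σ/Δ)² per group; real study planning often applies t-based corrections, so treat the numbers as ballpark figures:

```python
import math
import statistics

nd = statistics.NormalDist()

def sample_size(effect, sigma, alpha=0.05, power=0.80):
    """Minimum n per group for a two-sided z-test to detect a difference
    `effect` between two group means with common SD `sigma`."""
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # e.g. ~1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)            # e.g. ~0.84 for power = 0.80
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / effect) ** 2)

print(sample_size(effect=5, sigma=10))             # 63 per group at 80% power
print(sample_size(effect=5, sigma=10, power=0.90)) # 85 per group at 90% power
```

Note the trade-offs the formula makes explicit: halving the detectable effect quadruples the required sample size, and demanding higher power also raises it.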

STATISTICS IN RESEARCH PAPERS
- In a research paper the focus is on the Results; statistical tests are mentioned only briefly in Materials & Methods
- This brief mention may indicate whether the paper contains valid research:
  o Have the tests been properly chosen?
  o Have the results been interpreted in the context of the tests' capabilities?
  o When nonstandard, complex tests are used: is their use justified?

COMMON STATISTICAL TESTS (parametric / nonparametric)
- Compare means between 2 groups: Two-sample t-test / Wilcoxon rank-sum test
  o Example: Is the mean systolic blood pressure (at baseline) for patients assigned to placebo different from the mean for patients assigned to the Tx group?
- Compare 2 quantitative measurements from one subject: Paired t-test / Wilcoxon signed-rank test
  o Example: Was there a significant change in systolic blood pressure between baseline and the 6-month follow-up in the Tx group?
- Compare means between 3 or more groups: Analysis of variance (ANOVA) / Kruskal-Wallis test
  o Example: If our experiment had 3 groups, did the mean systolic blood pressure at baseline differ among them?
- Estimate the degree of association between 2 quantitative variables: Pearson coefficient of correlation / Spearman's rank correlation
  o Example: Is systolic blood pressure associated with the patient's age?

Source: Hoskin T. Parametric and Nonparametric: Demystifying the Terms. Mayo Clinic

THE t-TEST
- TYPE: parametric
- APPLICATION: evaluates if the means of two groups are statistically different from each other
- CORRECT USE: especially appropriate for the posttest-only two-group randomized experimental design
- ASSUMPTIONS:
  o The population from which the sample is taken has a normal distribution
  o The variances of the populations to be compared are equal
- PROCEDURE: examines the difference between the means relative to the variability of the scores
- FORMULA:
  o t = signal-to-noise ratio
  o Top part: the difference between the two means (the signal)
  o Bottom part: a measure of the variability of the scores (the noise)
  o The calculated t is determined to be significant or not by using tables

Source: Trochim WV. The Research Methods Knowledge Base, 2nd Edition. 2006
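The signal-to-noise formula above can be computed directly. A sketch of the pooled-variance two-sample t statistic with made-up blood-pressure data (the significance lookup against t tables is omitted):

```python
import statistics

def two_sample_t(a, b):
    """Pooled-variance two-sample t: mean difference ("signal")
    divided by the standard error of that difference ("noise")."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = (pooled * (1 / na + 1 / nb)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se

# Made-up systolic blood pressures for two groups of 6 patients
placebo = [128, 131, 125, 136, 128, 130]
treated = [122, 124, 119, 127, 121, 123]
print(round(two_sample_t(placebo, treated), 2))  # 3.71
```

In practice the resulting t (here ~3.71, with 10 degrees of freedom) would then be compared against a t table or evaluated by software to obtain a p-value.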

ANOVA (ANALYSIS OF VARIANCE)
- TYPE: parametric & non-parametric
- APPLICATION: evaluates if the means of more than 2 groups are statistically different
- CORRECT USE of ANOVA types:
  o One-way: difference between two or more groups with one grouping method
  o One-way repeated measures: when repeated measures are done in one group
  o Two-way: difference between groups with complex grouping
  o Two-way repeated measures: for a repeated-measures structure with an interaction effect
- ASSUMPTIONS:
  o The expected values of the errors are zero
  o The variances of all errors are equal to each other
  o The errors are independent of one another & normally distributed
- PROCEDURE:
  o The mean is calculated for each group
  o The overall mean is then calculated for all of the groups combined
  o Within each group, the total deviation of each individual's score from the group mean is calculated: the within-group variation
  o Next, the deviation of each group mean from the overall mean is calculated: the between-group variation
  o Finally, the F statistic is calculated
- FORMULA:
  o F statistic: the ratio of between-group variation to within-group variation
  o If the between-group variation is significantly greater than the within-group variation, it is likely that there is a statistically significant difference between the groups; statistical software determines whether the F statistic is significant
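The procedure above maps directly onto code. A one-way ANOVA F-statistic sketch with three made-up groups (the final significance lookup against the F distribution is left to software, as the slide notes):

```python
import statistics

def one_way_f(groups):
    """F = between-group mean square / within-group mean square."""
    all_data = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_data)          # overall mean
    k, n = len(groups), len(all_data)

    # Between-group variation: deviation of each group mean from the grand mean
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group variation: deviation of each score from its group mean
    ss_within = sum(sum((x - statistics.fmean(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[120, 125, 122], [130, 135, 132], [140, 141, 143]]
print(round(one_way_f(groups), 1))  # 54.2
```

A large F like this one (between-group variation dwarfing within-group variation) would then be compared against the F distribution with (k−1, n−k) degrees of freedom for significance.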

PRESENTATION OF CONCLUSIONS
- In general, this step entails presenting the graphical and numerical results & the conclusions inferred from the other steps of Data Analysis in an accurate and concise form
- Specifically, it involves presentation of an Abstract (in the form of a Poster or Oral Presentation), followed by preparation & publication of the Manuscript


FURTHER STUDY
As discussed previously:
- There is no single best statistical manual; a set of personalized references has to be assembled
- Electronic texts that contain hyperlinks or allow instant Web searches for terms are best suited for studying Statistics
- The following suggestions are examples taken from the large pool of study materials

- Harvey Motulsky. Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 3rd Ed, 2013
  o A non-mathematical approach to Statistics by a physician-statistician
- Betty Kirkwood. Essentials of Medical Statistics. 2nd Ed, 2003
  o A classic statistical manual; a new edition is being prepared
- Yosef Dlugacz. Measuring Health Care: Using Quality Data for Operational, Financial, and Clinical Improvement. 1st Ed, 2006
  o Deals with the application of Statistics in the business of Medicine
- Yasar A. Ozcan. Quantitative Methods in Health Care Management: Techniques and Applications. 1st Ed, 2009
  o Deals with the application of Statistics in the business of Medicine

MOBILE APPS
- Android: Statistics Quick Reference (Nuzzed, 2013); Statistics Tutor (Statistics Research, 2014)
- iPhone/iPad: Learn Statistics (Miaoshuang Dong, 2013); Statistics Video Lectures (Khan Academy, 2013)
- Windows 8: Statistics Formulas (Hexxa, 2013); Statistics and Probability (SimpleNEasy, 2013)

ONLINE RESOURCES
- Online Statistics Education: A Multimedia Course of Study. Rice University
  o http://www.onlinestatbook.com/2/index.html
- University of Oxford. Introduction to Statistics for Medical Students
  o http://www.well.ox.ac.uk/~kanishka/Lectures/MSTC_researchers/Notes/notes%20for%20medical%20students.pdf

REFERENCES
- Singh G. Medical Science without Statistics. The Internet Journal of Healthcare Administration. 2006; 4:2
- Altman DG. The scandal of poor medical research. BMJ 1994; 308:283
- Ghami N. Good Clinical Care Requires Understanding of Statistics. Psychiatric Times. March 6, 2009
- Bennette C, Vickers A. Against quantiles: categorization of continuous variables in epidemiologic research, and its discontents. BMC Med Res Methodol. 2012; 12:21
- Taleb N. The Black Swan: The Impact of the Highly Improbable & On Robustness and Fragility. 2nd Ed, 2010

SUMMARY
- Statistics plays a pivotal role in the Science & Business of Medicine
- Statistics can be abused & misused
- Statistical illiteracy & innumeracy can be extremely detrimental for physicians
- The study of Statistics is challenging but unavoidable
- Statistics is based on several, sometimes counterintuitive, principles & conventions; those axioms have to be mastered first
- Statistics reflects acts of interpretation, not absolute facts; therefore, it is based on numerous assumptions
- Understanding of statistical assumptive models is critical for the appraisal of statistical analyses
- Statistics continues to evolve as a methodology; reliance on Parametric Statistics & the validity of p-value-based Hypothesis Testing have recently been questioned

ACKNOWLEDGMENTS & DISCLOSURES
- The author wishes to thank Stephen DeCherney, MD, MPH, for his valuable comments.
- Nothing to disclose: there are no known conflicts of interest associated with this presentation. Specifically, neither the author nor his family has any potential conflicts of interest, financial or otherwise, regarding any of the products and/or services discussed here.

GLOSSARY: POSTTEST-ONLY TWO-GROUP RANDOMIZED EXPERIMENTAL DESIGN
- A type of experimental design in which the experimental and control groups are measured and compared after implementation of an intervention
- Comparisons are made only after the intervention, since this design assumes that the two groups are equivalent apart from the randomly assigned intervention
- Between-group differences are used to determine treatment effects
