
CHAPTER 7.
USING DATA AND STATISTICAL TOOLS FOR OPERATIONS IMPROVEMENT
Healthcare Operations Management: An Integrated Approach to Improving Quality and Efficiency
Daniel B. McLaughlin
Julie M. Hays
Copyright 2008 Health Administration Press. All rights reserved.
Chapter 7. Using Data and Statistical Tools for Operations Management
Data Collection
Graphical Tools
Mathematical Descriptions
Probability and Probability Distributions
Confidence Intervals, Hypothesis Tests
ANOVA/MANOVA/MANCOVA
Regression
Data Collection
Validity: A valid study has no logic,
sampling, or measurement errors.
- Logic
- Selection or sampling
- Measurement
Data Collection
[Figure: diagram created in Inspiration by Inspiration Software, Inc.]
Data Collection
Logic
Why are the data needed?
What will the data be used for?
What questions are going to be asked of the
data?
Are the patterns of the past going to be
repeated in the future?
Data Collection
Selection or Sampling
Census versus sample
Nonrandom methods
Simple random sampling
Stratified sampling
Systematic or sequential sampling
Cluster or area sampling
Sample size
Data Collection
Measurement
Reliability
- Would the measurement be the same if we repeated it?
Accuracy
- Does the measurement measure what we want it to measure (i.e., say = do)?
Precision
- How precise should the measurements be?
[Figure: three targets labeled "Reliable, but not accurate," "Reliable and accurate," and "Not reliable, but accurate"]
Graphical Tools
Mapping
Visual representations of data
Histograms and Pareto charts
Stem plots, dot plots
Box (and whisker) plots
Normal probability plots
Graphical Tools
Histograms and Pareto Charts
[Figure: histogram "Length of Hospital Stay," frequency versus length of hospital stay in days (bins 1-2 through 17-18)]
[Figure: Pareto chart "Diagnosis Category," frequency versus diagnosis (heart disease, delivery, pneumonia, malignant neoplasms, psychoses, fractures)]
Microsoft Excel screen shots reprinted with permission from Microsoft Corporation.
Graphical Tools
Dot Plots
[Figure: dot plot of length of hospital stay in days; produced with Minitab Statistical Software]
Graphical Tools
Turnip Graph
Percentage of diabetic Medicare enrollees receiving eye
exams among 306 hospital referral regions (2001)
Source: Wennberg, J. E. 2005. Data from the Dartmouth Atlas Project. Figure copyrighted by the Trustees of
Dartmouth College. Used with permission.
Graphical Tools
Normal Probability Plots
[Figure: normal probability plot of length of hospital stay, expected cumulative probability versus observed cumulative probability; produced with SPSS for Windows]
Graphical Tools
Scatter Plots
[Figure: four scatter plots of Y versus X: strong negative correlation (r = -0.86), strong positive correlation (r = 0.91), positive correlation (r = 0.70), and no correlation (r = 0.06)]
Microsoft Excel screen shots reprinted with permission from Microsoft Corporation.
Mathematical Descriptions
Mean
The mean is the arithmetic average of the population:
$$\text{Population mean} = \mu = \frac{\sum x}{N}$$
where x = individual values and N = number of values in the population.
The population mean can be estimated from a sample:
$$\text{Sample mean} = \bar{x} = \frac{\sum x}{n}$$
where n = number of values in the sample.
For our simple data set,
$$\bar{x} = \frac{3 + 6 + 8 + 5 + 3}{5} = 5$$
Mathematical Descriptions
Median and Mode
The median is the middle value of the sample or
population. If the data are arranged into an array
(an ordered data set):
3, 3, 5, 6, 8

5 would be the middle value or median.
The mode is the most frequently occurring value.
In the above example, the value 3 occurs more
often (two times) than any other value, so 3 would
be the mode.
Mathematical Descriptions
Range and Mean Absolute Deviation
The range is the difference between the high and low values in a data set.
$$\text{Range} = x_{\text{high}} - x_{\text{low}} = 8 - 3 = 5$$
The mean absolute deviation (MAD) is the average of the absolute value of the differences from the mean.
$$\text{MAD} = \frac{\sum |x - \bar{x}|}{n} = \frac{2 + 2 + 0 + 1 + 3}{5} = \frac{8}{5} = 1.6$$
Mathematical Descriptions
Variance, Standard Deviation
The variance is the average squared difference from the mean.
$$\text{Population variance} = \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{4 + 4 + 0 + 1 + 9}{5} = \frac{18}{5} = 3.6$$
$$\text{Sample variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{4 + 4 + 0 + 1 + 9}{5 - 1} = \frac{18}{4} = 4.5$$
The standard deviation is the square root of the variance.
$$\text{Population standard deviation} = \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\frac{18}{5}} = \sqrt{3.6} = 1.9$$
$$\text{Sample standard deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{18}{4}} = \sqrt{4.5} = 2.1$$


Mathematical Descriptions
Coefficient of Variation
The coefficient of variation (CV) is a measure of the relative variation in the data. It is the standard deviation divided by the mean.
$$\text{CV} = \frac{\sigma}{\mu} \text{ or } \frac{s}{\bar{x}} = \frac{1.9}{5} = 0.4$$
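The descriptive measures on the preceding slides can be checked with a short Python sketch. It uses the slides' simple data set (3, 6, 8, 5, 3) and the standard-library `statistics` module; the variable names are illustrative and not part of the original material.

```python
import statistics

data = [3, 6, 8, 5, 3]  # the simple data set used on the slides

mean = statistics.mean(data)        # arithmetic average
median = statistics.median(data)    # middle value of the ordered array
mode = statistics.mode(data)        # most frequently occurring value
data_range = max(data) - min(data)  # high value minus low value

# Mean absolute deviation: average absolute difference from the mean
mad = sum(abs(x - mean) for x in data) / len(data)

pop_var = statistics.pvariance(data)    # divides by N
sample_var = statistics.variance(data)  # divides by n - 1
pop_sd = statistics.pstdev(data)
sample_sd = statistics.stdev(data)

# Coefficient of variation, using the population standard deviation
cv = pop_sd / mean

print(mean, median, mode, data_range, mad)
print(pop_var, sample_var, round(pop_sd, 1), round(sample_sd, 1), round(cv, 1))
```

The rounded standard deviations and CV agree with the slide values (1.9, 2.1, and 0.4).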


Probability and Probability
Distributions
Determination of probabilities
Properties of probabilities
Probability distributions
Discrete probability distributions
Continuous probability distributions
Determination of Probabilities
Observed Probability
Observed probability is the relative frequency of an event: the number of times the event occurred divided by the total number of trials.
$$P(A) = \frac{r}{n} = \frac{\text{Number of times A occurred}}{\text{Total number of observations, trials, or experiments}}$$
$$P(\text{drug is effective}) = \frac{r}{n} = \frac{\text{Number of times patients are cured}}{\text{Total number of patients given the drug}}$$
Determination of Probabilities
Theoretical Probability
Theoretical probability is the theoretical relative frequency of an event; the theoretical number of times an event will occur divided by the total number of possible outcomes.
$$P(A) = \frac{r}{n} = \frac{\text{Number of times A could occur}}{\text{Total number of possible outcomes}}$$
$$P(\text{card is a spade}) = \frac{\text{Number of spades in the deck}}{\text{Total number of cards in the deck}} = \frac{13}{52} = 0.25$$
Determination of Probabilities
Opinion Probability
Opinion probability is a subjective determination of the number of times an event will occur divided by the imaginary total number of possible outcomes or trials.
$$P(A) = \frac{r}{n} = \frac{\text{Opinion of number of times an event will occur}}{\text{Theoretical total}}$$
$$P(\text{Secretariat winning the Belmont Stakes}) = \frac{r}{n} = \frac{\text{Opinion on the number of times Secretariat would win the Belmont}}{\text{Imaginary total number of times the Belmont would be run}}$$
Properties of Probabilities
Bounds on Probability
Probabilities must always be ≥ 0, and an event that cannot occur has a probability of 0.
$$P(A) = \frac{\text{Least number of times A could occur}}{\text{Total number of possible outcomes}} = \frac{0}{\text{Any number}} = 0$$
Probabilities must always be ≤ 1.
$$P(A) = \frac{\text{Greatest number of times A could occur}}{\text{Total number of possible outcomes}} = \frac{n}{n} = 1$$
$$0 \le P(A) \le 1$$
P(A) + P(A') = 1 and 1 - P(A') = P(A), where A' is not A.
Properties of Probabilities
Multiplicative Property
For two independent events, the probability of both A and B occurring, or the intersection (∩) of A and B, is the probability of A occurring times the probability of B occurring.
P(A and B occurring) = P(A ∩ B) = P(A) × P(B)
Properties of Probabilities
Multiplicative Property
Coin Toss   Die Toss   Probability
H           1          1/12
H           2          1/12
H           3          1/12          P(H ∩ 3) = 1/12
H           4          1/12
H           5          1/12
H           6          1/12
T           1          1/12
T           2          1/12
T           3          1/12
T           4          1/12
T           5          1/12
T           6          1/12
P(H) = 1/2, P(3) = 1/6, P(H) × P(3) = 1/2 × 1/6 = 1/12
Properties of Probabilities
Additive Property
For two events, the probability of A or B occurring, or the union (∪) of A with B, is the probability of A occurring plus the probability of B occurring, minus the probability of both A and B occurring.
P(A or B occurring) = P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Properties of Probabilities
Additive Property
Coin Toss   Die Toss   Probability
H           1          1/12
H           2          1/12
H           3          1/12
H           4          1/12
H           5          1/12
H           6          1/12
T           1          1/12
T           2          1/12
T           3          1/12
T           4          1/12
T           5          1/12
T           6          1/12
P(H ∪ 3) = 7/12
P(H) = 1/2, P(3) = 1/6, P(H) + P(3) - P(H ∩ 3) = 1/2 + 1/6 - 1/12 = 7/12
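The coin-and-die tree on the two preceding slides can be verified by listing all 12 outcomes. The following is a minimal Python sketch; the helper `prob` and the variable names are illustrative only, and exact fractions are used so the results match the slides without rounding.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely (coin, die) outcomes from the tree diagram
outcomes = list(product("HT", range(1, 7)))

def prob(event):
    """Probability of an event, given as a true/false test on (coin, die)."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

p_h = prob(lambda o: o[0] == "H")                    # 1/2
p_3 = prob(lambda o: o[1] == 3)                      # 1/6
p_h_and_3 = prob(lambda o: o[0] == "H" and o[1] == 3)
p_h_or_3 = prob(lambda o: o[0] == "H" or o[1] == 3)

# Multiplicative property (independent events)
print(p_h_and_3, p_h * p_3)
# Additive property
print(p_h_or_3, p_h + p_3 - p_h_and_3)
```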
Properties of Probabilities
Conditional Probability
The probability of an event occurring if more information is obtained:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Contingency Table for ER Wait Times
               ≤30 minute wait   >30 minute wait   Total
Friday night   20                30                50
Other times    40                10                50
Total          60                40                100
Properties of Probabilities
Conditional Probability
Note that:
$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
and if one event has no effect on the other event (the events are independent), then
$$P(A \mid B) = P(A) \quad \text{and} \quad P(A \cap B) = P(A)\,P(B)$$
Bayes' theorem
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A')\,P(A')}$$
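The conditional probability and Bayes' theorem formulas can be applied to the ER wait-time contingency table. This is a minimal Python sketch; the event labels ("friday", "long") are illustrative names for the table's rows and columns, not terms from the slides.

```python
from fractions import Fraction

# Counts from the ER wait-time contingency table
counts = {
    ("friday", "short"): 20, ("friday", "long"): 30,
    ("other", "short"): 40, ("other", "long"): 10,
}
total = sum(counts.values())  # 100

p_friday = Fraction(counts[("friday", "short")] + counts[("friday", "long")], total)
p_other = 1 - p_friday
p_long = Fraction(counts[("friday", "long")] + counts[("other", "long")], total)

# Conditional probability: P(long | Friday) = P(long and Friday) / P(Friday)
p_long_given_friday = Fraction(counts[("friday", "long")], total) / p_friday
p_long_given_other = Fraction(counts[("other", "long")], total) / p_other

# Bayes' theorem: P(Friday | long wait)
p_friday_given_long = (p_long_given_friday * p_friday) / (
    p_long_given_friday * p_friday + p_long_given_other * p_other
)

print(p_long_given_friday)   # 3/5
print(p_friday_given_long)   # 3/4
```

The Bayes result equals the direct reading of the table (30 of the 40 long waits occurred on Friday night).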
Probability Distributions
Discrete Probability Distributions
The binomial distribution describes the number of times a binary event will occur in a sequence of events.
$$P(x) = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x}$$
[Figure: binomial distribution, probability versus number of heads in 3 tosses]
The Poisson distribution is used to model the number of events in a specific period.
$$P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
[Figure: Poisson distribution, probability versus number of patient arrivals in 1 hour]
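Both discrete distributions can be evaluated directly from their formulas. The sketch below uses only the Python standard library. The coin example (3 tosses, p = 0.5) follows the slide's chart; the Poisson rate of 5 arrivals per hour is a hypothetical value chosen for illustration, since the slide does not state one.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(x successes in n trials), each trial succeeding with probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(x events in a period) when the mean number of events is lam."""
    return lam**x * exp(-lam) / factorial(x)

# Number of heads in 3 tosses of a fair coin
heads = [binomial_pmf(x, 3, 0.5) for x in range(4)]
print(heads)  # [0.125, 0.375, 0.375, 0.125]

# Patient arrivals in 1 hour, ASSUMING a mean of 5 arrivals per hour
arrivals = [poisson_pmf(x, 5) for x in range(12)]
print([round(p, 3) for p in arrivals])
```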
Probability Distributions
Continuous Probability Distributions
In the uniform distribution, the probability of occurrence is the same for all outcomes.
$$P(x) = \frac{1}{b - a} \quad \text{for } a \le x \le b$$
[Figure: uniform distribution, P(X) versus X between a and b]
The triangular distribution is described by the mode, minimum, and maximum values.
$$P(x) = \begin{cases} \dfrac{2(x - a)}{(b - a)(c - a)} & \text{for } a \le x \le c \\[2ex] \dfrac{2(b - x)}{(b - a)(b - c)} & \text{for } c \le x \le b \end{cases}$$
where a is the minimum, b is the maximum, and c is the mode.
[Figure: triangular distribution, P(X) versus X, with Min = 0.0, Mode = 0.5, Max = 2.0]
Probability Distributions
Exponential Distribution
The exponential distribution is used to model arrival rate, the rate of occurrence of an event.
$$P(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0$$
μ = mean = 1/λ, median = ln(2)/λ, mode = 0, and σ = 1/λ
[Figure: exponential distribution, P(X) versus X, for lambda = 2]
Probability Distributions
Normal Distribution
The normal distribution, x ~ N(μ, σ²), is commonly observed in the world and provides a reasonable approximation for many randomly distributed variables.
$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}$$
[Figure: normal distributions, P(X) versus X, for μ = 0, σ = 1.0; μ = 0, σ = 2.5; and μ = 2, σ = 0.7]
Probability Distributions
Standard Normal Distribution
The standard normal distribution, or z distribution, is the normal distribution with μ = 0 and σ = 1.0. Any normal distribution can be transformed to a standard normal distribution by:
$$z = \frac{x - \mu}{\sigma}$$
[Figure: standard normal distribution, P(X) versus X, for μ = 0, σ = 1.0]
z-score limits   Proportion within the limits (if normally distributed)
±1 z             0.680
±2 z             0.950
±3 z             0.997
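The z transformation and the proportions in the table can be reproduced with the error function from Python's `math` module. This is a minimal sketch; the function names and the example values passed to `z_score` are illustrative.

```python
from math import erf, sqrt

def z_score(x, mu, sigma):
    """Transform a value from N(mu, sigma^2) to the standard normal scale."""
    return (x - mu) / sigma

def proportion_within(z):
    """Proportion of a normal distribution lying within +/- z standard deviations."""
    return erf(z / sqrt(2))

print(z_score(7.0, 5.0, 2.0))  # 1.0: the value 7 is one standard deviation above a mean of 5

for z in (1, 2, 3):
    print(z, round(proportion_within(z), 4))
```

The computed proportions are about 0.683, 0.954, and 0.997. The table's 0.950 for ±2 z is the usual rounding; exactly 95% lies within about ±1.96 z.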
Confidence Intervals, Hypothesis Testing
Central Limit Theorem
Hypothesis testing
Type I (α) and Type II (β) errors
T-tests
Proportions
Practical significance versus statistical
significance
Confidence Intervals, Hypothesis Testing
Central Limit Theorem
As the sample size becomes large, the sampling distribution of the mean approaches normality, no matter what the distribution of the original variable, and
$$\mu_{\bar{x}} = \mu \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Sampling Distribution Simulation
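A small simulation illustrates the central limit theorem. The sketch below is not the simulation referenced on the slide; it is a hypothetical stand-in that draws repeated samples from an exponential distribution (clearly non-normal, with μ = 1 and σ = 1) and checks the mean and standard deviation of the sample means. The sample size, repetition count, and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30       # size of each sample (assumed)
reps = 5000  # number of samples drawn (assumed)

# Each sample comes from an exponential distribution with mean 1 and sd 1
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
expected_sd = 1.0 / n ** 0.5  # sigma / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_sd, 3))
```

The mean of the sample means should land close to μ = 1, and their standard deviation close to σ/√n ≈ 0.183; a histogram of `sample_means` would look approximately normal.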
Confidence Intervals
Confidence interval for the true value of the population mean:
$$\bar{x} - z^*_{\alpha/2}\,\sigma_{\bar{x}} \le \mu \le \bar{x} + z^*_{\alpha/2}\,\sigma_{\bar{x}}$$
$$\bar{x} - z^*_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z^*_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
[Figure: standard normal curve, P(X) versus Z, with the central 95% marked and 2.5% in each tail]
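The interval formula can be sketched in Python with `statistics.NormalDist` supplying z*. The inputs below are assumptions for illustration: the sample mean of 5 and σ = 1.9 are borrowed from the earlier simple data set and σ is treated as known, which the slides do not claim.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Two-sided z interval for the population mean with known sigma."""
    alpha = 1 - confidence
    z_star = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z_star * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# ASSUMED inputs: x_bar = 5, sigma = 1.9 (treated as known), n = 5
lower, upper = mean_confidence_interval(x_bar=5, sigma=1.9, n=5)
print(round(lower, 2), round(upper, 2))
```

With only five observations the interval is wide; the same σ with a larger n would give a tighter interval because σ/√n shrinks.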
Hypothesis Testing
Belief or null hypothesis, Ho: μ = b
Alternate belief or hypothesis, Ha: μ ≠ b
Decision rule: If |z| > z*, reject the null hypothesis, where
$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$$
[Figure: standard normal curve, P(X) versus Z, with rejection regions Z < -Z* and Z > Z*; -Z* < Z < Z* (95% confidence)]
Hypothesis Testing
Type I (α) and Type II (β) Errors
Ho: μ1 = μ2
Ha: μ1 ≠ μ2
Type I and Type II Error: Clinic Wait Time Example
                                                 Reality
Assessment or guess                              Wait times at the two clinics   Wait times at the two clinics
                                                 are the same (μ1 = μ2)          are NOT the same (μ1 ≠ μ2)
Wait times at the two clinics are the same       -                               Type II or β error
(μ1 = μ2)
Wait times at the two clinics are NOT the same   Type I or α error               -
(μ1 ≠ μ2)
Equal Variance t-Test
t-tests are used to test hypotheses about two means.
Ho: μ1 = μ2
Ha: μ1 ≠ μ2
Decision rule: If |t| > t*, reject Ho
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \quad \text{where} \quad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Confidence interval
$$(\bar{x}_1 - \bar{x}_2) - t^* s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t^* s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
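The equal variance t-test can be sketched in a few lines of Python. The two samples below are hypothetical wait-time data invented for illustration (the first reuses the simple data set from earlier slides); the critical value t* = 2.306 is the two-sided 95% value for 8 degrees of freedom from a standard t table.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Equal variance t statistic for Ho: mu1 = mu2, with pooled sd and standard error."""
    n1, n2 = len(sample1), len(sample2)
    sp = sqrt(((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2))
              / (n1 + n2 - 2))
    se = sp * sqrt(1 / n1 + 1 / n2)
    t = (mean(sample1) - mean(sample2)) / se  # (mu1 - mu2) = 0 under Ho
    return t, sp, se

# HYPOTHETICAL wait times at two clinics
clinic_1 = [3, 6, 8, 5, 3]
clinic_2 = [1, 2, 3, 4, 5]

t, sp, se = pooled_t(clinic_1, clinic_2)
t_star = 2.306  # two-sided 95% critical value, 5 + 5 - 2 = 8 degrees of freedom

diff = mean(clinic_1) - mean(clinic_2)
ci = (diff - t_star * se, diff + t_star * se)

print(round(t, 3), abs(t) > t_star)
print(round(ci[0], 2), round(ci[1], 2))
```

For these made-up samples |t| is below t*, so Ho is not rejected, and the confidence interval for μ1 - μ2 contains 0, which is the same conclusion stated two ways.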
Proportions
Ho: π1 = π2
Ha: π1 ≠ π2
Decision rule: If |z| > z*, reject Ho
$$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\bar{p}(1 - \bar{p})}{n_1} + \dfrac{\bar{p}(1 - \bar{p})}{n_2}}} \quad \text{where} \quad \bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$
Confidence interval
$$(p_1 - p_2) - z^* \sqrt{\frac{\bar{p}(1 - \bar{p})}{n_1} + \frac{\bar{p}(1 - \bar{p})}{n_2}} \le \pi_1 - \pi_2 \le (p_1 - p_2) + z^* \sqrt{\frac{\bar{p}(1 - \bar{p})}{n_1} + \frac{\bar{p}(1 - \bar{p})}{n_2}}$$
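As an illustration of the proportions test, the sketch below reuses the counts from the ER wait-time contingency table, treating Friday night and other times as two independent samples of 50 visits each. That treatment is an assumption made here for the example, not something the slides state.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for Ho: pi1 = pi2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)
    se = sqrt(p_bar * (1 - p_bar) / n1 + p_bar * (1 - p_bar) / n2)
    return (p1 - p2) / se  # (pi1 - pi2) = 0 under Ho

# Waits longer than 30 minutes: 30 of 50 on Friday night, 10 of 50 at other times
z = two_proportion_z(30, 50, 10, 50)
z_star = 1.96  # two-sided 95% critical value

print(round(z, 3), abs(z) > z_star)
```

Here z is about 4.08, well beyond z* = 1.96, so under these assumptions Ho would be rejected.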
Practical Significance Versus
Statistical Significance
Basic confidence interval
statistic - [(z*) × (s.e. statistic)] ≤ parameter ≤ statistic + [(z*) × (s.e. statistic)]
As n increases, s.e. decreases and the confidence interval gets narrower.
Large samples may give statistically significant results that are not practically significant.
ANOVA/MANOVA/MANCOVA
One-way ANalysis Of VAriance (ANOVA) is used
to test hypotheses about three or more levels of
treatment. A t-test will give the same information
as an ANOVA when there are only two treatment
levels of interest.
Two-way and higher ANOVAs are used when
there is more than one type of treatment variable
of interest.
MANOVA/MANCOVA are used when there is
more than one outcome or dependent variable of
interest.
Regression
Simple linear regression: used to describe the relationship between two variables
Multiple regression: used to describe the relationship between multiple predictor variables and a single dependent variable
General linear model
Artificial neural networks
Design of experiments
What Is the Equation of a Line?
Algebra
$$y = mx + b$$
Statistics
$$\hat{Y} = bX + a$$
Where
$$b = \text{slope} = \frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x}$$
$$a = y \text{ intercept} = y \text{ when } x = 0$$
Problem
Student A owns a health insurance firm and
wants us to determine the cost (price would
be a more difficult problem) of providing
healthcare to insured individuals.
Seeing the Future
Experiences
are irrelevant
Experiences
are relevant
Judgment: To what degree are
these experiences still
relevant?
Data
Deductive reasoning versus inductive reasoning
What Is the Cost of Healthcare
Related To?
Quantitative
______________
______________
______________
______________
______________
______________
Qualitative
_____________
_____________
_____________
_____________
_____________
_____________
Selection
Define population
Census or sample
Type of sample
Measurement: accurate, reliable, precise?
X = number of dependents; Y = annual
healthcare expense ($1,000)
Is the study valid?
How do we create knowledge from data?
Data
Number of Dependents (X)   Annual Healthcare Expense, $1,000 (Y)
0                          3
1                          2
2                          6
3                          7
4                          7
Scatterplot
[Figure: scatterplot of Y, annual healthcare cost ($1,000), versus X, number of dependents, with four candidate lines: y = x + 3, y = 1.2x + 2, y = 5, and y = 1.3x + 2.4]
Scatterplot Questions
Which is the best line on the scatterplot?
How would you define best (e.g., must be
quantifiable)?
Professor's Model
$$\hat{Y} = bX + a$$
Yhat = cost estimate ($1,000)
a = Y intercept = 3
b = slope = ΔY/ΔX = 1
Yhat = 1X + 3 = knowledge


Model Comparison

X       Y    Prof's Yhat = X + 3   Prof's e = Y - Yhat   Student 1 e (Yhat = 1.2X + 2)   Student 2 e (Yhat = 1.3X + 2.4)
0       3    3                      0                     1                                0.6
1       2    4                     -2                    -1.2                             -1.7
2       6    5                      1                     1.6                              1
3       7    6                      1                     1.4                              0.7
4       7    7                      0                     0.2                             -0.6
(sum)                               0                     3                                0
Good Model
A good model must be unbiased.
Σe = 0
Is that enough? What else? Does this remind you of σ²?
How do we get rid of signs?
Model Comparison

X       Y    Yhat = X + 3   e = Y - Yhat   e²   Student 1 e²
0       3    3               0             0    1
1       2    4              -2             4    1.44
2       6    5               1             1    2.56
3       7    6               1             1    1.96
4       7    7               0             0    0.04
(sum)   25   25              0             6    7
Least Squares Technique
Gauss proved that if you use:
$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} \quad \text{and} \quad a = \bar{Y} - b\bar{X}$$
you are guaranteed that Σe = 0 and Σe² is a minimum.
Yhat = 1.3X + 2.4, Σe = 0, and Σe² = 5.1.
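The least squares formulas can be applied to the five data points from the Data slide. The following Python sketch is illustrative (the function and variable names are not from the slides) and also compares the sum of squared errors against the professor's line (Yhat = X + 3) and Student 1's line (Yhat = 1.2X + 2).

```python
xs = [0, 1, 2, 3, 4]  # number of dependents
ys = [3, 2, 6, 7, 7]  # annual healthcare expense ($1,000)

def least_squares(xs, ys):
    """Return intercept a and slope b of the least squares line Yhat = bX + a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

def sum_squared_errors(a, b):
    """Sum of squared residuals for the line Yhat = bX + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

a, b = least_squares(xs, ys)
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]

print(round(a, 2), round(b, 2))           # 2.4 1.3
print(round(sum(residuals), 10))          # errors sum to zero
print(round(sum_squared_errors(a, b), 2)) # 5.1
print(sum_squared_errors(3, 1), round(sum_squared_errors(2, 1.2), 2))  # 6 and 7.0
```

The fitted line has a smaller sum of squared errors (5.1) than either the professor's model (6) or Student 1's model (7).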
Coefficient of Determination
Are we better off making estimates by using information (X = number of dependents) and having created knowledge (Yhat = 1.3X + 2.4) than using no information or knowledge (i.e., is the model better)?

How would you estimate without using our knowledge (our model)?
Sum of Squares Total

X       Y    Yhat = Ybar   e = Y - Ybar   SSTO: (Y - Ybar)²
0       3    5             -2             4
1       2    5             -3             9
2       6    5              1             1
3       7    5              2             4
4       7    5              2             4
(sum)   25   25             0             22
Note that this method is unbiased.
Graph
[Figure: scatterplot of Y, annual healthcare cost ($1,000), versus X, number of dependents, with the horizontal line y = 5]
Errors
[Figure: errors plotted on the scatterplot of Y, annual healthcare costs ($1,000), versus X, number of dependents]
Sum of Squares Error
X       Y    Yhat = 1.3X + 2.4   e = Y - Yhat   SSE: e² = (Y - Yhat)²   Ybar   Y - Ybar   SSTO: (Y - Ybar)²
0       3    2.4                  0.6           0.36                    5      -2         4
1       2    3.7                 -1.7           2.89                    5      -3         9
2       6    5                    1.0           1.00                    5       1         1
3       7    6.3                  0.7           0.49                    5       2         4
4       7    7.6                 -0.6           0.36                    5       2         4
(sum)   25   25                   0             5.1                     25      0         22
Coefficient of Determination
What is the percentage of improvement when we use knowledge gained from our model?
$$\%\ \text{improvement} = \frac{\text{Old error level} - \text{New error level}}{\text{Old error level}} = \frac{22 - 5.1}{22} = \frac{16.9}{22} \times 100 = 77\%$$
r² = coefficient of determination = 77%
r² = 0.77
Another Viewpoint
Variation in healthcare cost is either explained by knowledge (the model) or not explained.
Explained and Unexplained Error
[Figure: scatterplot of Y, annual healthcare costs ($1,000), versus X, number of dependents; dashed segments mark explained error and solid segments mark unexplained error]
Sum of Squares Regression
X       Y    Yhat = 1.3X + 2.4   e = Y - Yhat   SSE: e² = (Y - Yhat)²   Ybar   Y - Ybar   SSTO: (Y - Ybar)²   Yhat - Ybar   SSR: (Yhat - Ybar)²
0       3    2.4                  0.6           0.36                    5      -2         4                   -2.6          6.76
1       2    3.7                 -1.7           2.89                    5      -3         9                   -1.3          1.69
2       6    5                    1.0           1.00                    5       1         1                    0            0
3       7    6.3                  0.7           0.49                    5       2         4                    1.3          1.69
4       7    7.6                 -0.6           0.36                    5       2         4                    2.6          6.76
(sum)   25   25                   0             5.1                     25      0         22                   0            16.9
Coefficient of Determination
$$r^2 = \frac{\text{Explained}}{\text{Total}} = \frac{SSR}{SSTO} = \frac{16.9}{22.0} = 0.77$$
Note: r² is not based on statistics or probability; it is just a percentage.
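The three sums of squares and r² can be recomputed from the data and the fitted line Yhat = 1.3X + 2.4. This is a minimal Python sketch with illustrative variable names.

```python
xs = [0, 1, 2, 3, 4]  # number of dependents
ys = [3, 2, 6, 7, 7]  # annual healthcare expense ($1,000)
a, b = 2.4, 1.3       # least squares intercept and slope from the slides

y_bar = sum(ys) / len(ys)
y_hat = [b * x + a for x in xs]

ssto = sum((y - y_bar) ** 2 for y in ys)               # total variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation

r_squared = ssr / ssto

print(round(ssto, 2), round(sse, 2), round(ssr, 2))  # 22.0 5.1 16.9
print(round(r_squared, 2))                           # 0.77
```

For a least squares line, SSTO = SSR + SSE, so r² can equivalently be written as (SSTO - SSE)/SSTO, which is the "percentage improvement" form used earlier.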
Correlation Coefficient
$$r = \pm\sqrt{r^2}$$
r = Correlation coefficient
= Measure of the strength of the linear relationship between two variables
$$-1 \le r \le 1$$
[Figure: scatter plots illustrating r = +1 and r = -1]
Correlation Coefficient Examples
[Figure: scatter plots illustrating r = 0.9, r = 0.0, and r = 0.5]
Coefficient of Determination
Questions:
If r² is low, does that mean there is no relationship between your variables?

If r² is high (close to 1), does that mean you always get useful predictions from your model?

If r² is high, does that mean your model has a good fit?
r² and Curves
Can we fit a straight line to this?
Yes, and we are guaranteed that the errors sum to zero and that the sum of squared errors is a minimum.
However, a curve would be better.
[Figure: scatter plot of Y versus X]
Excel Output
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8765
R Square            0.7682
Adjusted R Square   0.6909
Standard Error      0.8790
Observations        5

ANOVA
             df   SS       MS       F        Significance F
Regression   1    7.6818   7.6818   9.9412   0.0511
Residual     3    2.3182   0.7727
Total        4    10

                                         Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%   Lower 90.0%   Upper 90.0%
Intercept                                -0.9545        1.0162           -0.9393   0.4169    -4.1885     2.2794      -3.3460       1.4369
Y - $1,000 Annual Health Care Expense     0.5909        0.1874            3.1530   0.0511    -0.0055     1.1873       0.1499       1.0320

RESIDUAL OUTPUT
Observation   Predicted X - Number of Dependents   Residuals   Standard Residuals
1             0.8182                               -0.8182     -1.0747
2             0.2273                                0.7727      1.0150
3             2.5909                               -0.5909     -0.7762
4             3.1818                               -0.1818     -0.2388
5             3.1818                                0.8182      1.0747

PROBABILITY OUTPUT
Percentile   X - Number of Dependents
10           0
30           1
50           2
70           3
90           4
To get this sheet, go to Tools -> Data Analysis -> Regression. If you don't have Data Analysis listed in your tools, see Excel help "Install and Use the Analysis ToolPak."
[Figure: residual plot, residuals versus Y, $1,000 annual healthcare expense]
[Figure: line fit plot, X (number of dependents) and predicted X versus Y, $1,000 annual healthcare expense]
[Figure: normal probability plot, X (number of dependents) versus sample percentile]
F Test
If F* > F(1-α; 1; n-2), reject H0: β = 0 (in this case)
MSR/MSE ≈ 1 → β = 0
MSR/MSE big → β ≠ 0
$$F^* = \frac{MSR}{MSE} = \frac{SSR / 1}{SSE / (n - 2)}$$
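As a worked check, F* can be computed from the sums of squares found earlier for the dependents example (SSR = 16.9, SSE = 5.1, n = 5). The critical value 10.13 is F(0.95; 1; 3) taken from a standard F table; the variable names are illustrative.

```python
ssr = 16.9  # sum of squares regression (explained)
sse = 5.1   # sum of squares error (unexplained)
n = 5       # number of observations

msr = ssr / 1        # one predictor, so 1 degree of freedom
mse = sse / (n - 2)  # n - 2 degrees of freedom for the residual
f_star = msr / mse

f_critical = 10.13   # F(0.95; 1; 3) from an F table
reject_h0 = f_star > f_critical

print(round(f_star, 4), reject_h0)  # 9.9412 False
```

F* matches the F value in the Excel output (9.9412), and falling just short of the critical value is consistent with the reported Significance F of 0.0511, slightly above 0.05.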
Assumptions of Linear
Regression
Linear regression is based on several
assumptions. If these assumptions are
violated, the resulting model will be
misleading. The principal assumptions are:
- The dependent and independent variables are
linearly related.
- The errors associated with the model are not
serially correlated.
- The errors are normally distributed and have
constant variance.
Transformations
If the variables are not linearly related or the
assumptions of regression are violated, the variables
can be transformed to produce a possibly better
model.

X    Y   Transform X -> X²
-3   9   9
-2   4   4
-1   1   1
 0   0   0
 1   1   1
 2   4   4
 3   9   9
[Figure: plot of Y versus X²]
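The effect of the transformation can be shown by comparing the correlation of Y with X against the correlation of Y with X². The sketch below assumes the X values run from -3 to 3, as the symmetric Y values imply; the `pearson_r` helper is written out for illustration.

```python
from math import sqrt

xs = [-3, -2, -1, 0, 1, 2, 3]  # ASSUMED signs: symmetric around zero
ys = [9, 4, 1, 0, 1, 4, 9]

def pearson_r(us, vs):
    """Correlation coefficient between two equal-length lists."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    svv = sum((v - v_bar) ** 2 for v in vs)
    return suv / sqrt(suu * svv)

x_squared = [x ** 2 for x in xs]

r_raw = pearson_r(xs, ys)                 # linear fit on X explains nothing
r_transformed = pearson_r(x_squared, ys)  # Y is exactly linear in X squared

print(round(r_raw, 4), round(r_transformed, 4))  # 0.0 1.0
```

A straight line in X has r = 0 here even though Y depends entirely on X, while the transformed variable X² gives a perfect linear relationship.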
Multiple Regression
Multiple independent variables are used to
predict a single dependent variable to
improve the model.
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$
Multicollinearity can be a problem.
General Linear Model
The most general of all linear models
Multiple predictor variables:
- Metric
- Categorical
- Both
Multiple dependent variables:
- Metric
- Categorical
- Both
Can be used to build complex models
Artificial Neural Networks
Neural Networks
Large amounts of data
No explanation of
how/why
Used to predict
outcomes
Traditional Models
Limited amount of data
Model explains
how/why
Used to predict
outcomes
Outline for Analyses
1. Define the problem/question.
2. Determine what data will be needed to address the problem/question.
3. Collect the data.
4. Graph the data.
5. Analyze the data using the appropriate tool.
6. Fix the problem.
7. Evaluate the effectiveness of the fix.
8. Start again.
Choice of Statistical Technique
Independent Variable   Dependent Variable   Mathematical    Graphical
Categorical, One       Categorical, One     χ²
                       Categorical, Many    χ² (layered)
                       Metric, One          t-Test          Histogram type
                       Metric, Many         MANOVA          Box plot
Categorical, Many      Categorical, One     χ²
                       Categorical, Many    χ² (layered)
                       Metric, One          ANOVA           Box plots
                       Metric, Many         MANOVA
                       Both                 GLM
Choice of Statistical Technique
Independent Variable   Dependent Variable   Mathematical          Graphical
Metric, One            Categorical, One     Logit
                       Categorical, Many    GLM
                       Metric, One          Simple regression     Scatterplot
                       Metric, Many         GLM
                       Both                 MANCOVA
Metric, Many           Categorical, One     Logit
                       Categorical, Many    GLM
                       Metric, One          Multiple regression
                       Metric, Many         GLM
                       Both                 GLM; neural net
Choice of Statistical Technique
Independent Variable   Dependent Variable   Mathematical          Graphical
Both                   Categorical, One     ANCOVA
                       Categorical, Many    MANCOVA
                       Metric, One          Simple regression
                       Metric, Many         Multiple regression
                       Both                 GLM; neural net
