Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Nathaniel Schenker is Senior Scientist for Research and Methodology, National Center
for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, MD 20782 .
Trivellore E. Raghunathan is Professor of Biostatistics, School of Public Health and
Research Professor, Institute for Social Research, University of Michigan, Ann Arbor,
MI 48106 . Pei-Lu Chiu is Survey Statistician, Division of Health Interview Statistics,
National Center for Health Statistics, Centers for Disease Control and Prevention,
Hyattsville, MD 20782 . Diane M. Makuc is Associate Director for Science, Office of
Analysis and Epidemiology, National Center for Health Statistics, Centers for Disease
Control and Prevention, Hyattsville, MD 20782 . Guangyu Zhang is a doctoral student,
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48106 . Alan J.
Cohen is Statistician, Office of Analysis and Epidemiology, National Center for Health
Statistics, Centers for Disease Control and Prevention, Hyattsville, MD 20782 . The work
of Raghunathan and Zhang was supported by a cooperative agreement with the National
Center for Health Statistics (DHHS/CDC/NCHS/UR6/CCU51748) and a grant from the
National Science Foundation (SES-0106914).
Published online: 01 Jan 2012.
To cite this article: Nathaniel Schenker, Trivellore E Raghunathan, Pei-Lu Chiu, Diane M Makuc, Guangyu Zhang & Alan J
Cohen (2006) Multiple Imputation of Missing Income Data in the National Health Interview Survey, Journal of the American
Statistical Association, 101:475, 924-933, DOI: 10.1198/016214505000001375
To link to this article: http://dx.doi.org/10.1198/016214505000001375
KEY WORDS: Health insurance; Health status; Missing data; Poverty; Public-use data; Sequential regression multivariate imputation.
1. INTRODUCTION
The National Health Interview Survey (NHIS) is a multipurpose health survey that is the principal source of information on the health of the civilian, noninstitutionalized household
population of the United States (National Center for Health
Statistics 2005). It is conducted by the Bureau of the Census
for the National Center for Health Statistics of the Centers for
Disease Control and Prevention, and it includes approximately
40,000 households containing approximately 100,000 people
each year. The survey provides a rich source of data for studying relationships between income and health and for monitoring health and health care for persons at different income levels.
There is particular interest in the health of vulnerable populations, such as those with low income, as well as these persons
access to and use of health care. However, the nonresponse rates
are high for two key itemstotal family income and personal
earnings from employmentboth referring to the previous calendar year.
To handle the missing data on family income and personal
earnings in the NHIS, multiple imputation of these items has
been performed. This article describes the approach used to create the multiple imputations and evaluates the methods through
analyses of the multiply imputed data. The remainder of Section 1 provides further information on the questions on income
in the NHIS, the missing data, and the products of the imputation project. Section 2 discusses the imputation procedure used.
Nathaniel Schenker is Senior Scientist for Research and Methodology,
National Center for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, MD 20782 (E-mail: nschenker@cdc.gov). Trivellore E.
Raghunathan is Professor of Biostatistics, School of Public Health and Research Professor, Institute for Social Research, University of Michigan,
Ann Arbor, MI 48106 (E-mail: teraghu@umich.edu). Pei-Lu Chiu is Survey Statistician, Division of Health Interview Statistics, National Center for
Health Statistics, Centers for Disease Control and Prevention, Hyattsville,
MD 20782 (E-mail: pchiu@cdc.gov). Diane M. Makuc is Associate Director for Science, Office of Analysis and Epidemiology, National Center for
Health Statistics, Centers for Disease Control and Prevention, Hyattsville,
MD 20782 (E-mail: dmakuc@cdc.gov). Guangyu Zhang is a doctoral student, Department of Biostatistics, University of Michigan, Ann Arbor, MI
48106 (E-mail: guangyuz@umich.edu). Alan J. Cohen is Statistician, Office
of Analysis and Epidemiology, National Center for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, MD 20782 (E-mail:
acohen@cdc.gov). The work of Raghunathan and Zhang was supported
by a cooperative agreement with the National Center for Health Statistics
(DHHS/CDC/NCHS/UR6/CCU51748) and a grant from the National Science
Foundation (SES-0106914).
924
925
Odds ratio
95% confidence
interval
1.54
(1.43, 1.66)
.77
(.72, .83)
.64
.69
.58
.70
.82
(.60, .69)
(.63, .76)
(.53, .62)
(.64, .76)
(.76, .88)
.98
(.96, 1.01)
1.01
1.21
.92
(.91, 1.12)
(1.09, 1.34)
(.78, 1.08)
1.14
(1.05, 1.25)
1.05
.79
.94
(.90, 1.23)
(.71, .88)
(.84, 1.06)
.91
(.79, 1.04)
926
the multiply imputed data, family income and personal earnings are given in 11 categories, and the poverty ratio is given
in 14 categories. Datasets containing the imputed values, along
with documentation, can be obtained from the NHIS website
(http://www.cdc.gov/nchs/nhis.htm).
927
928
The
combined multiple-imputation point estimate is Qm =
l . The estimated variance of this point estimate
m1 m
Q
l=1
m + [(m + 1)/m]Bm , where U
m = m1 m
is Tm = U
l=1 Ul
m
1
2
and Bm = (m 1)
l=1 (Ql Qm ) . The quantity m =
[(m + 1)/m]Bm /Tm is approximately the fraction of information about Q that is missing due to nonresponse (Rubin 1987,
sec. 3.3), also known as the fraction of missing information.
3.2 Analyses of the Relationship Between Poverty
Ratio and Personal Health Status
Table 2 displays point estimates and estimated standard errors for the percentage of persons age 4564 in fair or poor
health, by category of poverty ratio, based on the NHIS with
no imputation, with single imputation, and with multiple imputation. The single-imputation estimates were computed using
the first set of imputations from the multiply imputed data, and
1 and U 1/2 in the notation of Section 3.1.
thus correspond to Q
1
Table 2 also displays the approximate fractions of missing information (m ) and the ratios of the estimated standard errors
based on either no imputation or single imputation to those
1/2
(Tm ) based on multiple imputation.
Table 2. Point Estimates, Estimated Standard Errors (SEs), and Approximate Fractions of Missing Information (m ) for the Percentage of Persons
Age 4564 in Fair or Poor Health, by the Ratio of Family Income to the Federal Poverty Threshold, 2001 NHIS
Poverty
ratio
<1
[1, 2)
[2, 4)
4
No imputation
(NI)
Point
Estimated
estimate
SE
45.6
32.7
16.1
5.9
1.68
1.32
.63
.34
Single imputation
(SI)
Point
Estimated
estimate
SE
39.4
29.8
16.0
6.1
1.34
1.03
.51
.27
Multiple imputation
(MI)
Point
Estimated
estimate
SE
39.9
29.3
15.9
6.2
1.54
1.11
.55
.30
Ratio of
estimated
SEs:
NI/MI
Ratio of
estimated
SEs:
SI/MI
.27
.19
.05
.19
1.09
1.19
1.15
1.11
.87
.93
.94
.90
929
<$20,000
$20,000
41.6
10.6
33.5
11.0
30.9
11.5
32.1
15.3
<$20,000
$20,000
Although all three analyses (no imputation, single imputation, and multiple imputation) indicate a negative association
between poverty ratio and the percentage in fair or poor health,
the point estimates based on single imputation and multiple
imputation are very close to each other, whereas the point estimates based on no imputation (i.e., based on complete-case
analysis) are quite different from those based on data with imputation for the two categories of poverty ratio indicating lower
incomes. To help understand the differences between the point
estimates with and without imputation, Table 3 displays estimates of the percentage of persons age 4564 in fair or poor
health, by two-category income (<$20,000 vs. $20,000), both
for those with an exact value of family income reported (the
respondents) and those without an exact value of family income reported (the nonrespondents). (The estimates for the
nonrespondents were calculated from the large proportion with
categorical values reported.) For those with family incomes
<$20,000, the estimated percentage in fair or poor health is
substantially lower for nonrespondents than for respondents. It
follows that compared with complete-case analysis, imputation
should lower the estimated percentage in fair or poor health
among those with low poverty ratios. The results in Tables
2 and 3 illustrate the possible biases associated with completecase analysis that were mentioned in Section 1.2. They also
illustrate the utility of incorporating information supplied by
categorical reports of income into the imputation procedure.
In Table 2, the ratios of the estimated standard errors with no
imputation to those with multiple imputation are all >1, suggesting that multiple imputation results in more precise point
estimates than does complete-case analysis. These results need
to be interpreted with caution, because the estimated standard
errors for complete-case analysis can be biased when the missing data are not missing completely at random. (See Sec. 3.3
for a likely example of this phenomenon.) Nevertheless, the
results conform with theory, in the sense that if the multipleimputation analyses and complete-case analyses are both unbiased, and if there is additional information supplied through
the imputation procedure, then the multiple-imputation analyses should be more precise than the complete-case analyses.
The ratios of the estimated standard errors with single imputation to those with multiple imputation are all <1, reflecting
the fact that the single-imputation analyses erroneously do not
account for the extra uncertainty due to imputation.
The approximate fractions of missing information in Table 2, which range from 5% to 27%, are all smaller than the
(weighted) nonresponse rate for poverty ratio, which is 34%
for those age 4564 with reported health statuses. In general,
the fraction of missing information tends to be smaller than the
nonresponse rate in practice, because the fraction of missing
information accounts not only for the amount of nonresponse,
but also for the information incorporated into the imputation
model that is predictive of the variable subject to nonresponse.
Moreover, the fraction of missing information is specific to the
quantity being estimated. In Table 2, for example, the differing fractions of missing information across the categories of
poverty ratio reflect the differing amounts of imputation in these
categories and how well the imputation model predicts values
in the different categories.
3.3 Analyses of the Relationship Between Poverty
Ratio and Having Health Insurance
Table 4 displays results for the percentage of persons age
<65 without health insurance, analogous to Table 2 for health
status. As with health status, all three analyses indicate the same
direction of association (negative) between poverty ratio and
the percentage uninsured, although the point estimates from
complete-case analysis are quite different than those based on
the data with imputation. The largest differences occur for the
two higher categories of poverty ratio. These differences are
again explained, at least in part, by an analysis involving twocategory income. Table 5 displays estimates of the percentage of persons age <65 who are uninsured, by two-category
income, for both respondents and nonrespondents. The greatest difference between nonrespondents and respondents occurs
for those with incomes of at least $20,000, with the percentage
uninsured being larger among nonrespondents than among respondents. Once again, imputation appears to be correcting for
Table 4. Point Estimates, Estimated SEs, and Approximate Fractions of Missing Information (m ) for the Percentage of Persons Under Age 65
With no Health Insurance Coverage, by the Ratio of Family Income to the Federal Poverty Threshold, 2001 NHIS
Poverty
ratio
<1
[1, 2)
[2, 4)
4
No imputation
(NI)
Point
Estimated
estimate
SE
31.4
28.7
13.0
4.6
1.06
.70
.35
.22
Single imputation
(SI)
Point
Estimated
estimate
SE
32.6
28.8
14.6
6.2
.85
.57
.33
.23
Multiple imputation
(MI)
Point
Estimated
estimate
SE
32.5
28.7
14.7
6.1
.85
.61
.36
.23
Ratio of
estimated
SEs:
NI/MI
Ratio of
estimated
SEs:
SI/MI
.03
.04
.12
.15
1.25
1.15
.97
.92
1.00
.93
.90
.98
930
biases that would occur if only the complete cases were analyzed, and the observed bounds on income appear to be contributing a substantial amount of information to the imputation
model.
The ratios of estimated standard errors in Table 4 show the
same patterns as they did for the analyses of health status (in
Table 2), with two exceptions. For the two highest poverty
ratio categories, the estimated standard errors from completecase analysis are actually lower than those based on the multiply imputed data. This likely reflects the differences in the
point estimates and the strong relationship of point estimates
of percentages near 0 or 100 to their estimated standard errors. Because the point estimates from complete-case analysis are closer to 0 than those based on the multiply imputed
data (which was shown earlier to be likely due to biases in the
complete-case estimates), the estimated standard errors from
complete-case analysis might be expected to be smaller, even
after the extra information in the imputations is taken into account.
As was the case with health status, the approximate fractions
of missing information for the analysis involving health insurance, which range from 3% to 15%, are much smaller than the
simple (weighted) nonresponse rate for poverty ratio, which is
29% for those under age 65 with reported health insurance status.
Table 6. Point Estimates, Estimated SEs, and Approximate Fractions of Missing Information (m ) for Coefficients of the Logistic Regression
of Being in Fair or Poor Health on Selected Variables, 2001 NHIS
Variable
Constant
Poverty ratio
<1.00
[1, 2)
[2, 4)
4 (reference)
Age (years)
<18
1824
2534
3544
4554
5564
65 (reference)
Race/ethnicity
Hispanic
Non-Hispanic black
Non-Hispanic other
Non-Hispanic white (reference)
Born in the U.S.?
No
Yes (reference)
Gender
Male
Female (reference)
Region of residence
Northeast
South
West
Midwest (reference)
Resides in metropolitan area?
Yes
No (reference)
No imputation
(NI)
Point
Estimated
estimate
SE
Single imputation
(SI)
Point
Estimated
estimate
SE
Multiple imputation
(MI)
Point
Estimated
estimate
SE
Ratio of
estimated
SEs:
SI/MI
Ratio of
estimated
SEs:
NI/MI
2.16
.080
1.97
.061
1.98
.064
.100
1.26
.96
2.17
1.65
.94
.067
.059
.056
1.90
1.42
.85
.055
.046
.044
1.92
1.44
.85
.060
.051
.054
.179
.154
.339
1.12
1.16
1.03
.92
.92
.82
3.23
2.67
2.03
1.33
.65
.10
.074
.096
.071
.059
.055
.055
3.26
2.62
2.11
1.43
.76
.23
.062
.078
.057
.049
.047
.045
3.26
2.63
2.11
1.43
.76
.23
.062
.078
.057
.049
.046
.045
.008
.007
.003
.002
.007
.008
1.19
1.23
1.25
1.21
1.19
1.23
1.00
1.00
1.01
1.01
1.01
1.00
.24
.45
.21
.066
.059
.105
.30
.51
.21
.052
.048
.088
.30
.50
.21
.052
.049
.087
.002
.013
.007
1.27
1.21
1.20
1.00
1.00
1.01
.30
.061
.25
.049
.25
.049
.003
1.24
.99
.02
.030
.03
.022
.03
.022
.006
1.32
.99
.12
.10
.04
.062
.056
.066
.17
.09
.05
.054
.047
.050
.16
.08
.04
.053
.047
.050
.009
.002
.004
1.17
1.19
1.32
1.01
1.00
1.00
.08
.049
.10
.039
.10
.040
.006
1.24
.99
931
Table 7. Point Estimates, Estimated SEs, and Approximate Fractions of Missing Information (m ) for Coefficients of the Logistic Regression
of Being Uninsured on Selected Variables for Persons Under Age 65, 2001 NHIS
Variable
Constant
Poverty ratio
<1.00
[1, 2)
[2, 4)
4 (reference)
Age (years)
<18
1824
2534
3544
4554
5564 (reference)
Race/ethnicity
Hispanic
Non-Hispanic black
Non-Hispanic other
Non-Hispanic white (reference)
Born in the U.S.?
No
Yes (reference)
Gender
Male
Female (reference)
Region of residence
Northeast
South
West
Midwest (reference)
Resides in metropolitan area?
Yes
No (reference)
No imputation
(NI)
Point
Estimated
estimate
SE
Single imputation
(SI)
Point
Estimated
estimate
SE
Multiple imputation
(MI)
Point
Estimated
estimate
SE
Ratio of
estimated
SEs:
NI/MI
Ratio of
estimated
SEs:
SI/MI
3.67
.090
3.35
.079
3.37
.078
.027
1.15
1.02
2.09
1.98
1.05
.074
.059
.060
1.80
1.64
.86
.059
.049
.048
1.82
1.66
.89
.060
.052
.051
.094
.156
.166
1.23
1.14
1.17
.98
.94
.93
.40
.97
.71
.42
.34
.065
.073
.062
.060
.059
.29
.96
.71
.43
.29
.055
.058
.050
.049
.050
.30
.96
.70
.42
.28
.055
.058
.050
.049
.050
.009
.006
.004
.004
.012
1.19
1.25
1.23
1.23
1.17
1.00
1.00
1.00
1.00
1.00
.65
.08
.11
.066
.058
.125
.64
.12
.08
.054
.049
.106
.64
.11
.08
.054
.049
.108
.011
.011
.017
1.22
1.19
1.16
1.00
1.00
.98
.67
.054
.77
.050
.77
.050
.010
1.09
1.00
.25
.025
.23
.021
.23
.021
.001
1.20
1.00
.08
.43
.20
.070
.056
.076
.11
.44
.19
.059
.047
.063
.10
.45
.19
.059
.047
.063
.010
.001
.010
1.18
1.20
1.21
.99
1.01
1.00
.19
.058
.19
.057
.19
.057
.003
1.03
1.00
932
Table 8. Point Estimates and Estimated SEs for the Percentage of Persons Age 4564 in Fair or
Poor Health and the Percentage of Persons Under Age 65 With no Health Insurance Coverage, by
the Ratio of Family Income to the Federal Poverty Threshold, 2001 NHIS, With a Poststratification
Weighting Adjustment Using Control Totals From the 2001 CPS
Poverty
ratio
<1
[1, 2)
[2, 4)
4
1.78
1.34
.62
.34
1.04
.69
.35
.22
933
REFERENCES
Barnard, J., and Rubin, D. B. (1999), Small-Sample Degrees of Freedom With
Multiple Imputation, Biometrika, 86, 948955.
Box, G. E. P., and Cox, D. R. (1964), An Analysis of Transformations (with
discussion), Journal of the Royal Statistical Society, Ser. B, 26, 211252.
Kennickell, A. B. (1991), Imputation of the 1989 Survey of Consumer Finances: Stochastic Relaxation and Multiple Imputation, in Proceedings
of the Survey Research Methods Section, American Statistical Association,
pp. 112121.
Li, K.-H., Meng, X.-L., Raghunathan, T. E., and Rubin, D. B. (1991), Significance Levels From Repeated p-Values With Multiply-Imputed Data, Statistica Sinica, 1, 6592.
Li, K. H., Raghunathan, T. E., and Rubin, D. B. (1991), Large-Sample Significance Levels From Multiply-Imputed Data Using Moment-Based Statistics
and an F Reference Distribution, Journal of the American Statistical Association, 86, 10651073.
Little, R. J. A., and Rubin, D. B. (2002), Statistical Analysis With Missing Data
(2nd ed.), New York: Wiley.
Meng, X.-L. (1994), Multiple-Imputation Inferences With Uncongenial
Sources of Input (with discussion), Statistical Science, 9, 538573.
Meng, X.-L., and Rubin, D. B. (1992), Performing Likelihood Ratio Tests
With Multiply-Imputed Data Sets, Biometrika, 79, 103111.
National Center for Health Statistics (2005), 2004 National Health Interview Survey (NHIS) Public Use Data Release: NHIS Survey Description, Division of Health Interview Statistics, National Center for
Health Statistics, Centers for Disease Control and Prevention, available at
http://www.cdc.gov/nchs/nhis.htm.
Paulin, G. D., and Sweet, E. M. (1996), Modeling Income in the U.S. Consumer Expenditure Survey, Journal of Official Statistics, 12, 403419.
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J., and Solenberger, P.
(2001), A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models, Survey Methodology, 27, 8595.
Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, New
York: Wiley.
(1996), Multiple Imputation After 18+ Years (with discussion),
Journal of the American Statistical Association, 91, 473489.
Rubin, D. B., and Schenker, N. (1986), Multiple Imputation for Interval Estimation From Simple Random Samples With Ignorable Nonresponse, Journal of the American Statistical Association, 81, 366374.
Stata Corporation (2003), Stata Statistical Software: Release 8.0, College Station, TX: Author.