
Paper 75011

Fitting latency models in epidemiological studies


Bryan Langholz, University of Southern California, Los Angeles, CA David B. Richardson, University of North Carolina, Chapel Hill, NC

ABSTRACT
Latency models address important questions about the timing of exposure and subsequent disease risk in epidemiologic research. However, the non-standard form of such models and the complexity of the time-varying exposure histories that characterize many epidemiologic studies make such models difficult to fit using standard software packages. SAS offers software tools that can be used to fit non-standard latency models. In particular, SAS procedures NLP and NLMIXED provide the flexibility to specify quite general latency models and maximize the likelihood for latency parameters. While the methods require more sophistication than using a package or procedure that is specialized to the usual log-linear form, the additional programming complexity is not great. The methods are illustrated and compared by fitting latency models for radon exposure and lung cancer mortality rates from a cohort study of Colorado Plateau uranium miners. It was found that the conditional logistic likelihood is more computationally efficient than the unconditional and that PROC NLP is much faster than PROC NLMIXED.

KEYWORDS: cohort studies, case-control studies, conditional logistic likelihood, epidemiology, general relative risk models, partial likelihood, PROC NLP, PROC NLMIXED

INTRODUCTION
Latency models are used to describe the change in risk of disease due to a given exposure as a function of time since that exposure. Characterization of the evolution of the relative risk of disease following exposure to an agent is important for understanding the long-term consequences of exposure. If exposure occurs at a single point in time, then risk as a function of time-since-exposure is typically estimated directly as the change in disease risk or rate as a function of time since that exposure. The situation can be complex when exposure is protracted, as is the case with many occupational and environmental exposures. For instance, in a study of Colorado Plateau uranium miners, radon exposure in the course of underground mining of uranium continued throughout a miner's work life and varied over time, depending on the mine location and conditions in the mine at a given time [Archer et al., 1973, Hornung and Meinhardt, 1987, Stram et al., 1999]. Ad-hoc methods for investigation of latency for extended exposures include estimation of risk as a function of time since first or last exposure. These approaches are relatively easy to implement but do not characterize the evolution of risk over time. While more sophisticated approaches based on informative models of latency have been described [e.g., Thomas, 1982, 1988, Breslow and Day, 1987, Langholz et al., 1999, Hauptmann et al., 2001, Berhane et al., 2008], they have rarely been used, in large part because they are not accommodated by standard modeling software. Recently, methods for fitting a quite general class of latency models, exploiting the very general modeling facilities available in SAS procedure NLMIXED, were described [Richardson, 2009]. This approach was based on an unconditional logistic likelihood with nuisance strata parameters for age intervals.
Even more recently, methods have been described to fit conditional logistic likelihoods, including partial likelihoods for proportional hazards models, using PROC NLP [Langholz and Richardson, 2010]. In this paper, we combine the insights about fitting latency models given in Richardson [2009] with those about fitting conditional logistic likelihoods described in Langholz and Richardson [2010]. We show how latency models can be fitted using procedure NLMIXED or NLP with the conditional logistic likelihood, and compare the performance of the unconditional likelihood fitting approach to the conditional by fitting latency models for radon exposure and lung cancer mortality rates to data from a cohort of uranium miners from the Colorado Plateau.

METHODS
LATENCY FUNCTIONS Let t be the current age and u index the ages at exposure, with d(u) the dose at age u and w(t − u; θ) the latency weight for an exposure t − u years in the past, which depends on the parameter vector θ. The effective dose at age t from an exposure incurred at age u is then d(u) w(t − u; θ), and the total effective dose at age t, D(t; θ), is given by

    D(t; θ) = Σ_u d(u) w(t − u; θ).
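To make the bookkeeping concrete, the total effective dose computation can be sketched in a few lines of Python; the function names and example doses here are hypothetical illustrations, not part of the paper's code:

```python
def effective_dose(t, doses, weight, theta):
    """Total effective dose D(t; theta): sum over exposure ages u < t of
    d(u) * w(t - u; theta), where `doses` maps age u to dose d(u) and
    `weight(v, theta)` is the latency weighting function w(v; theta)."""
    return sum(d * weight(t - u, theta) for u, d in doses.items() if u < t)

# With a simple lag weight w(v) = I(v > lag), only exposures more than
# `lag` years in the past contribute:
lag_weight = lambda v, lag: 1.0 if v > lag else 0.0
doses = {40: 2.0, 45: 3.0, 52: 1.0}  # hypothetical annual doses by age
print(effective_dose(60, doses, lag_weight, 5))  # prints 6.0
```

Any of the weighting functions in Table 1 could be substituted for the lag weight without changing the summation.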

The latency function and latency parameters for a number of latency models are shown in Table 1 [e.g., Thomas, 1988, Langholz et al., 1999, Hauptmann et al., 2001, Richardson, 2009].

Table 1: Some latency functions.


Latency model         Weighting function w(v)                                Parameters
Lag                   I(v > θ)                                               θ
Piecewise constant^a  Σ_{k=1}^{K} I(C_{k−1} < v ≤ C_k) θ_k                   θ_1, . . . , θ_K
Spline^b              Σ_{k=1}^{K} g(v, k) θ_k                                θ_1, . . . , θ_K
Bilinear              (v − θ_0)/(θ_1 − θ_0) if θ_0 < v ≤ θ_1;                θ_0, θ_1, θ_2
                      (θ_2 − v)/(θ_2 − θ_1) if θ_1 < v ≤ θ_2; 0 otherwise
Exponential decay     (v − θ_0)/(θ_1 − θ_0) if θ_0 < v ≤ θ_1;                θ_0, θ_1, θ_2
                      exp(−(v − θ_1) log(2)/θ_2) if θ_1 < v; 0 otherwise
Log-normal            pdf(Lognormal, v, μ, σ)                                μ, σ
Gamma                 pdf(Gamma, v, a, λ)                                    a, λ

^a (C_{k−1}, C_k] are time intervals.
^b The g(v, k) are basis functions and the θ_k may be vectors.
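As an illustrative sketch (not part of the paper's SAS code), the bilinear weighting function in Table 1 can be written directly from its piecewise definition, with the latency parameters passed as ordinary arguments:

```python
def bilinear_weight(v, t0, t1, t2):
    """Bilinear latency weight: rises linearly from 0 at t0 to 1 at the
    peak t1, falls linearly back to 0 at t2, and is 0 elsewhere."""
    if t0 < v <= t1:
        return (v - t0) / (t1 - t0)
    if t1 < v <= t2:
        return (t2 - v) / (t2 - t1)
    return 0.0

# Hypothetical values t0 = 0, t1 = 10, t2 = 35 (years since exposure):
# the weight peaks at v = 10 and declines to 0 at v = 35.
print(bilinear_weight(5, 0, 10, 35))   # prints 0.5
```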

The total effective dose summarizes the exposure history into a single value that, at each age, can be related to disease risk. A form for this relationship must be chosen or determined from the data. For various reasons, both theoretical and empirical, radiation effects have been modeled with the excess relative risk as a linear function of total effective dose. This is expressed as

    λ(t, D(t; θ); β) = λ0(t) (1 + β D(t; θ))    (1)

where λ0(t) is the rate of disease at age t in a (comparable) unexposed population and β is the dose-response parameter: the change in relative risk per unit effective dose. Combining this dose-response model with the total effective dose formula, we note that

    β D(t; θ) = β Σ_u d(u) w(t − u; θ) = Σ_u d(u) [β w(t − u; θ)]    (2)

so that β w(t − u; θ) gives the excess relative risk per unit dose ascribed to exposure t − u years in the past. Other forms for the rates may be more appropriate in other situations, such as the log-linear form (Cox model) λ(t, D(t; θ); β) = λ0(t) exp(β D(t; θ)), or more complex relationships. FITTING LATENCY MODELS A major impediment to the investigation of latency in characterizing exposure-disease relationships has been computational. Given that the standard software used for modeling data from epidemiologic studies accommodates only dose-response models of the log-linear form, there are two aspects of model (1) that are non-standard. The first is that the excess relative risk form is not log-linear. Second, to estimate the latency parameters θ, the effective dose D(t; θ) defined above needs to be updated at each iteration of the fitting algorithm, which again is not accommodated by standard fitting software. In order to fit latency models to the Colorado Plateau uranium miners data, Langholz et al. [1999] organized the cohort data into time- and strata-determined risk sets and then sampled nested case-control sets from the risk sets to reduce the computational burden. Lengthy scripts were written for the specialized package EPICURE that used a grid search to fit the models (http://hydra.usc.edu/timefactors/examples/exampl.html, topic 5). Recently, Richardson [2009] described how SAS PROC NLMIXED can be used to fit latency models. He used the same nested case-control sampling approach to reduce computational burden and analyzed the data as stratified binary data using an unconditional logistic likelihood, with a separate stratum parameter for each risk set. When disease is rare, the unconditional logistic likelihood approach yields parameter estimates that are close to the partial likelihood.
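The two rate-ratio forms just described, excess relative risk and log-linear, can be contrasted in a short Python sketch; the helper names and example values are hypothetical:

```python
import math

def rate_ratio_err(D, beta):
    """Excess relative risk model (1): RR = 1 + beta * D,
    linear in the total effective dose D."""
    return 1.0 + beta * D

def rate_ratio_loglinear(D, beta):
    """Log-linear (Cox model) form: RR = exp(beta * D)."""
    return math.exp(beta * D)

# The forms nearly agree when beta * D is small (exp(x) ~ 1 + x),
# but diverge as the effective dose grows:
print(rate_ratio_err(2.0, 0.3))        # prints 1.6
print(rate_ratio_loglinear(2.0, 0.3))  # exp(0.6), about 1.82
```

The point of the paper is that only the second form is handled by standard fitting software, which is why general optimizers such as NLP and NLMIXED are needed for model (1).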
Langholz and Richardson [2010] described how to use PROC NLP (or NLMIXED) to accommodate conditional logistic likelihoods, including how to perform partial likelihood analyses from cohort risk set or nested case-control data. This involves organizing the data as a single record per risk set and computing the conditional logistic likelihood contributions as a general model. This approach has the advantage over the unconditional logistic likelihood that risk set strata parameters are avoided, so that fitting is faster and more reliable. For simplicity, the nested case-control data should have only one case per set. We discuss the analysis of tied failure times and grouped time cohort data in the Discussion. UNCONDITIONAL LOGISTIC LIKELIHOOD Figure 1 shows the SAS macro latency described in Richardson [2009]. Briefly, the annual exposures d(u) are given as a variable list in macro variable exphx, while the latency weighting is entered into the macro variable wt and the regression

Figure 1: Unconditional logistic analysis of nested case-control data.


%MACRO latency(data=, case=, exphx=, age=, parms=, lag=, wt=, regmodel=);
PROC NLMIXED DATA=&data TECH=congra;
  PARMS &parms;
  lag = &lag;
  edose = 0;
  ARRAY annexp &exphx;
  endage = FLOOR(&age) - lag;
  endage = MIN(endage, HBOUND(annexp));
  DO y = 1 TO endage;
    t = (&age - y - lag + (182/365));
    edose = edose + ((&wt) * annexp{y});
  END;
  odds = &regmodel;
  p = odds / (1 + odds);
  MODEL &case ~ BINARY(p);
RUN;
%MEND latency;

Figure 2: Conditional logistic likelihood analysis of nested case-control data.


%MACRO latency(data=, nv=, maxsetsize=, age=, parms=, lag=, wt=, regmodel=);
PROC NLMIXED DATA=&data TECH=congra;
  PARMS &parms;
  lag = &lag;
  * note: maxsetsize is set globally by the make_case_control macro;
  ARRAY z[&nv,&maxsetsize] _z1-_z%EVAL(&nv*&maxsetsize);
  endage = FLOOR(&age) - lag;
  endage = MIN(endage, 80);
  sum = 0;
  DO imem = _ntot TO 1 BY -1;
    edose = 0;
    DO y = 1 TO endage;
      t = (&age - y - lag - (182/365));
      edose = edose + ((&wt) * z[y,imem]);
    END;
    phi = &regmodel;
    sum = sum + phi;
  END;
  dum = 1;
  * last rr is for the case;
  L = phi / sum;
  MODEL dum ~ GENERAL(log(L));
RUN;
%MEND latency;

model model in regmodel. Further details and examples are given in Richardson [2009]. CONDITIONAL (PARTIAL) LIKELIHOOD To fit models via the conditional logistic likelihood, we first organize the data into a case-control set structure with one line per case-control set, as has been described previously [Langholz and Richardson, 2010]. Briefly, covariate information from all case-control set members is put into a covariate array z, arranged in blocks of maxsetsize, the maximum number of subjects in any case-control set, for each of the nv covariates. In each covariate block, the case's covariates come first, followed by those of the controls, with z values beyond the number in the case-control set, _ntot, set to missing (and ignored in the analysis). The macro make_case_control to generate the case-control set structured data is available at http://hydra.usc.edu/timefactors/examples/exampl.html, topic 12. A macro to fit latency models via the conditional logistic likelihood using the case-control set organized data is shown in Figure 2 and is similar to the unconditional logistic likelihood fitting macro in Figure 1. The macro variables nv and maxsetsize are the number of covariates and the size of the largest case-control (or risk) set. Dose is computed as a latency weighted function wt of the annual exposures, and then a rate ratio is computed via the regression model regmodel. However, this is done for all members of the case-control set, so there is a double loop: first over members imem of the set, and second over ages Y for a given member. The denominator of the likelihood is computed as the sum of the member rate ratios phi, and the numerator is the rate ratio of the case, which is last in the case-control member loop.
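The per-set likelihood computation the macro performs can be sketched in Python (an illustrative reimplementation, not the paper's code); here, by the convention of this sketch, the case's rate ratio is the first element:

```python
import math

def clogit_loglik_contribution(rate_ratios):
    """Conditional logistic log-likelihood contribution for one
    case-control set: log(phi_case / sum of all members' phi).
    `rate_ratios[0]` is the case; the rest are the controls."""
    return math.log(rate_ratios[0] / sum(rate_ratios))

# A set where the case has twice the rate ratio of each of 3 controls:
print(clogit_loglik_contribution([2.0, 1.0, 1.0, 1.0]))  # log(0.4)
```

The full conditional log-likelihood is the sum of these contributions over the risk sets, which is what the GENERAL log-likelihood specification in the macro accumulates record by record.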

Figure 3: Conditional logistic likelihood analysis of nested case-control data using PROC NLP.
%MACRO latency(data=, nv=, maxsetsize=, age=, parms=, lag=, wt=, regmodel=, profile=);
PROC NLP DATA=&data GRADCHECK=none COV=2;
  PARMS &parms;
  PROFILE &profile / ALPHA=.05;
  lag = &lag;
  ARRAY z[&nv,&maxsetsize] _z1-_z%EVAL(&nv*&maxsetsize);
  endage = FLOOR(&age) - lag;
  endage = MIN(endage, 80);
  sum = 0;
  DO i = _ntot TO 1 BY -1;
    edose = 0;
    DO y = 1 TO endage;
      t = (&age - y - lag);
      edose = edose + ((&wt) * z[y,i]);
    END;
    phi = &regmodel;
    sum = sum + phi;
  END;
  * last rr is for the case;
  logL = log(phi / sum);
  MAX logL;
RUN;
%MEND latency;

PROFILE CONFIDENCE INTERVALS USING PROC NLP Wald confidence intervals are symmetric around the estimate and are obtained by subtracting and adding to the estimate 1.96 times the standard error of the estimate. While these are appropriate in many standard model settings, asymmetrical intervals that take the constraints on the parameters and the skewed distribution of the estimates into account are often preferred. The profile likelihood interval may be asymmetrical and is obtained by finding the values of the parameter such that twice the log-likelihood differs from its maximum by 3.84 (= 1.96²). This requires maximizing the likelihood over the other parameters at each fixed value of the parameter of interest, and can be quite computationally intensive. Langholz and Richardson [2010] suggest that PROC NLP may be preferred over PROC NLMIXED because PROC NLP can compute profile likelihood confidence intervals via the PROFILE command. Figure 3 shows the macro for fitting the conditional logistic likelihood using PROC NLP. All the basic programming statements are the same as those used with PROC NLMIXED, with the model statement replaced by the NLP command MAX logL to maximize the log likelihood. Profile confidence intervals are computed for the variables listed in the macro variable profile. EXAMPLE To illustrate and compare the approaches to fitting latency models, we used the nested case-control sample from the Colorado Plateau uranium miners data that was used previously to illustrate the use of latency methods [Langholz et al., 1999, Richardson, 2009]. Briefly, the cohort data consisted of 2704 miners who started employment in the Colorado Plateau after 1950 and were followed for lung cancer death through December, 1990. Tied ages of death were broken randomly and a risk set for each lung cancer mortality case was defined as all cohort members who were alive and on-study at the age of the case's death and who attained that age during the five-year calendar period in which the case died; i.e.
risk sets with age as the time scale, stratified by five-year calendar period; a data set that consists of 55,964 miner/time records. Forty controls were sampled from the risk sets to form the nested case-control data set, with 10,322 miner/time records, used for the analysis. As in Richardson [2009], three models were fitted: simple cumulative dose (time constant), bilinear with θ0 fixed at 0, and log-normal. Each model was fitted using the unconditional logistic likelihood, with a separate stratum parameter for each risk set, and using the conditional logistic (partial) likelihood, in the latter case with procedures NLMIXED and NLP to do the fitting. For each model and method, we tabulated the dose-response and latency parameter estimates, standard errors, and model deviance, as well as the time required to fit the model on a Lenovo x300 Thinkpad laptop computer. Finally, we estimated Wald and profile likelihood confidence intervals for the dose-response parameter β, the latter using the PROFILE command in PROC NLP. The data set and SAS code used to do these analyses have been posted at http://hydra.usc.edu/timefactors/examples/exampl.html, topic 12.
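To show what the profile-likelihood search is doing, here is a minimal Python sketch for a one-parameter model (so there are no nuisance parameters to re-maximize over at each step); the function names are hypothetical, and PROC NLP's PROFILE command performs the general multi-parameter version:

```python
import math

def profile_upper_limit(loglik, mle, hi, tol=1e-9):
    """Upper 95% profile-likelihood limit for a one-parameter model:
    the value above the MLE where the log-likelihood has dropped by
    3.84 / 2 = 1.92, located by bisection. `hi` must lie beyond the limit."""
    target = loglik(mle) - 3.84 / 2.0
    lo = mle
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if loglik(mid) > target:
            lo = mid   # still inside the confidence interval
        else:
            hi = mid   # past the limit
    return (lo + hi) / 2.0

# For a quadratic (normal) log-likelihood with unit standard error the
# profile limit coincides with the Wald limit, mle + sqrt(3.84):
ql = lambda b: -0.5 * (b - 1.0) ** 2
print(profile_upper_limit(ql, 1.0, 10.0))  # about 2.96
```

With several parameters, each bisection step would itself require a constrained maximization over the remaining parameters, which is why the profile intervals in the example below take minutes of additional CPU time.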

RESULTS
Table 2 gives the parameter estimates, standard errors, and deviances from each of the fitted models. The latency and dose-response parameters are required in the model specification for both unconditional and conditional logistic likelihood methods

Table 2: Latency model and radon dose-response parameter estimates, standard errors, and model deviances from unconditional and conditional logistic likelihood fitting.

                                 Unconditional (NLMIXED)        Conditional (NLMIXED or NLP)
Latency function   Parameter   Estimate   SE      Deviance    Estimate   SE      Deviance
Time constant      β           0.32       0.096   2329.2      0.28       0.083   1815.3
Bilinear           θ1          8.6        3.8     2316.7      9.8        4.2     1803.5
                   θ2          34.1       2.9                 35.0       2.6
                   β           0.83       0.30                0.78       0.30
Log-normal         μ           2.7        0.16    2318.2      2.7        0.14    1805.2
                   σ           0.59       0.13                0.53       0.11
                   β           15.2       5.3                 13.3       4.7

Table 3: CPU time to fit latency models.

Model           Unconditional logistic   Conditional logistic   Conditional logistic
                (NLMIXED)                (NLMIXED)              (NLP)
Time constant   4 min 15 sec             8 sec                  1 sec
Bilinear        9 min 3 sec              1 min 21 sec           25 sec
Log-normal      17 min 21 sec            3 min 29 sec           29 sec

Table 4: Wald and profile likelihood 95% confidence intervals for the dose-response parameter β.

Model           β estimate   Wald 95% CI   Profile 95% CI
Time constant   0.28         0.12-0.44     0.16-0.52
Bilinear        0.78         0.19-1.37     --^a
Log-normal      13.3         1.6-25.0      6.9-28.7

^a No convergence.
but the unconditional also included parameters for each of the 263 risk set-defined strata. There are differences between the unconditional logistic likelihood estimates and standard errors and those from the conditional logistic, but these are relatively small. Also, while the different likelihoods yield different absolute deviances, the between-model differences in the deviances are similar. Thus, at least for our example, results from the two likelihoods are consistent with each other. The results using the conditional logistic likelihood fitted with PROC NLP were virtually identical to those using PROC NLMIXED, with differences in the third decimal place. Table 3 shows the reported CPU times to fit each of the models. The time required to fit the conditional logistic likelihood is a number of orders of magnitude less than that needed to fit the unconditional, probably because fitting a large number of strata parameters is avoided. The last column shows the computing times for the conditional logistic likelihood fitting using PROC NLP. These times are, in turn, substantially smaller than for the conditional logistic likelihood fitted using PROC NLMIXED. Table 4 shows Wald and profile likelihood 95% confidence intervals from each of the latency models. The intervals are somewhat different, with the profile likelihood interval shifted toward larger values relative to the Wald interval. The profile likelihood intervals were computationally challenging. For the log-normal model, computation of the interval took about 8 additional minutes of CPU time compared to just 29 seconds to estimate the parameters and standard errors. Profile likelihood limits for the bilinear model parameter could not be computed using the latency macro because some parameter combinations resulted in negative likelihood contributions. Even after constraining parameters to assure positive rate ratios (phi in the code), the search failed.
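Because it is the between-model deviance differences that carry the information, they can be read as likelihood-ratio statistics. As a sketch using only the closed form for 2 degrees of freedom (survival function exp(-x/2); the helper name and the nesting assumption of 2 added parameters are ours):

```python
import math

def lrt_pvalue_2df(deviance_reduced, deviance_full):
    """Likelihood-ratio test p-value for nested models differing by 2
    free parameters: for a chi-square variable with 2 df, the upper
    tail probability is exactly exp(-x / 2)."""
    x = deviance_reduced - deviance_full
    return math.exp(-x / 2.0)

# Illustrative arithmetic with the conditional-likelihood deviances in
# Table 2, time constant (1815.3) vs. bilinear (1803.5): x = 11.8
p = lrt_pvalue_2df(1815.3, 1803.5)  # p about 0.003
```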

DISCUSSION
We have shown that the strategy for fitting latency models using an unconditional logistic likelihood given in Richardson [2009] is easily adapted to the conditional logistic setting. Although the effective dose value D(t; θ) is a function of the latency function parameters θ, the code required is fairly straightforward and easy to implement. When appropriate, the conditional logistic likelihood will be preferable because it is the natural partial likelihood for the cohort (or nested case-control) data and for the practical reason that fitting is much faster than with the unconditional likelihood. Further, we found that PROC NLP was much faster at fitting the conditional likelihood than PROC NLMIXED, probably because of factors related to the mixed modeling components of NLMIXED, which we do not use in these analyses. Finally, we found that profile likelihood confidence

intervals could be computed using PROC NLP for parameters in some of the latency models but not all, and that computing these intervals takes some time. In situations when profile limits do not converge, it may be possible to identify the confidence interval by plotting the log-likelihood over a grid of parameter values. We have focused on risk set or nested case-control data with no tied failure times. When there are ties, these can be randomly broken, as was done for the miners cohort data. When breaking the ties does not seem appropriate because, for instance, failure status is determined only over larger time intervals, the unconditional logistic likelihood provides a valid approach to analyzing the grouped time data. An alternative may be estimation via the conditional logistic likelihood for multiple failures [Langholz and Richardson, 2010]. Whenever one is fitting complex models, it is important to be sure that the estimated parameters are reasonable and represent the data well. For latency models, the latency parameters are only estimable when there is evidence of dose-response, so it is important to first establish an exposure-disease association. Then, descriptive latency models, such as the piecewise constant model, can provide an idea of the general shape of the latency function and often suggest good starting values for the latency model parameters.

REFERENCES

V.E. Archer, J.K. Wagoner, and F.E. Lundin. Uranium mining and cigarette smoking effects on man. Journal of Occupational Medicine, 15:204-211, 1973.

K. Berhane, M. Hauptmann, and B. Langholz. Using tensor product splines in modeling exposure-time-response relationships: application to the Colorado Plateau uranium miners cohort. Stat Med, 27(26):5484-5496, 2008.

N. E. Breslow and N. E. Day. Statistical Methods in Cancer Research. Volume II: The Design and Analysis of Cohort Studies, volume 82 of IARC Scientific Publications. International Agency for Research on Cancer, Lyon, 1987.

M. Hauptmann, K. Berhane, B. Langholz, and J.H. Lubin. Using splines to analyze latency in the Colorado Plateau uranium miners cohort. Journal of Epidemiology and Biostatistics, 6:417-424, 2001.

R.W. Hornung and T.J. Meinhardt. Quantitative risk assessment of lung cancer in U.S. uranium miners. Health Physics, 52:417-430, 1987.

B. Langholz and D. B. Richardson. Fitting general relative risk models for survival time and matched case-control analysis. American Journal of Epidemiology, 171:377-383, 2010.

B. Langholz, N. Rothman, S. Wacholder, and D.C. Thomas. Cohort studies for characterizing measured genes. Journal of the National Cancer Institute Monographs, 26:39-42, 1999.

D. B. Richardson. Latency models for analyses of protracted exposures. Epidemiology, 20:395-399, 2009.

D.O. Stram, B. Langholz, M. Huberman, and D.C. Thomas. Correcting for dosimetry error in a reanalysis of lung cancer mortality for the Colorado Plateau uranium miners cohort. Health Physics, 77:265-275, 1999.

D.C. Thomas. Temporal effects and interactions in cancer: Implications of carcinogenic models. In R.L. Prentice and A.S. Whittemore, editors, Environmental Epidemiology: Risk Assessment, pages 107-121. Society for Industrial and Applied Mathematics, Philadelphia, 1982.

D.C. Thomas. Exposure-time-response relationships with applications to cancer epidemiology. Annual Review of Public Health, 9:451-482, 1988.

CONTACT INFORMATION
Comments and questions are valued and encouraged. Contact the authors:

Bryan Langholz
Department of Preventive Medicine
USC Keck School of Medicine
2001 N Soto Street, Second Floor, MC 9237
Los Angeles, CA 90089
langholz@usc.edu
http://hydra.usc.edu/langholz

David B. Richardson
Department of Epidemiology
School of Public Health
University of North Carolina
Chapel Hill, NC 27599-7435
david.richardson@unc.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
