Correlation and Linear Regression
Introduction
In this section we discuss correlation analysis, which is a technique used to quantify the association between two continuous variables. For example, we might want to quantify the association between body mass index and systolic blood pressure, or between hours of exercise per week and percent body fat. Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables (confounding is discussed later). The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted by "X".
[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations, as they do not strongly imply cause and effect.]
Learning Objectives
After completing this module, the student will be able to:
1. Define and provide examples of dependent and independent variables in a study of a public health problem
2. Compute and interpret a correlation coefficient
3. Compute and interpret coefficients in a linear regression analysis
Correlation Analysis
In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson Product Moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).
The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.
For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.
It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables.
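The point about non-linear associations can be illustrated with a small sketch. The data below are hypothetical and chosen only for illustration, and `pearson_r` is a helper name introduced here, not something from the module. In the first case y is completely determined by x through a U-shaped curve, yet the linear correlation is essentially zero.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the sample means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (an assumption for illustration, not from the module).
x = [-3, -2, -1, 0, 1, 2, 3]
y_curved = [v ** 2 for v in x]      # perfect U-shaped (non-linear) relationship
y_linear = [2 * v + 1 for v in x]   # perfect linear relationship

r_curved = pearson_r(x, y_curved)
r_linear = pearson_r(x, y_linear)
print(f"U-shaped relationship: r = {r_curved:.2f}")
print(f"Linear relationship:   r = {r_linear:.2f}")
```

A scatter plot of the first data set would reveal the curve immediately, which is why plotting before computing r is recommended.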
The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.
http://sphweb.bumc.bu.edu/otlt/MPHModules/BS/BS704_CorrelationRegression/BS704_CorrelationRegression_print.html
1/10
2/28/2015
Scenario 1 depicts a strong positive association (r = 0.9), similar to what we might see for the correlation between infant birth weight and birth length.
Scenario 2 depicts a weaker association (r = 0.2) that we might expect to see between age and body mass index (which tends to increase with age).
Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and the age at which adolescents initiate sexual activity.
Scenario 4 might depict the strong negative association (r = -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.
Example: Correlation of Gestational Age and Birth Weight
A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.
We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y = birth weight and x = gestational age. The data are displayed in a scatter diagram in the figure below.
Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable is on the horizontal axis (or X-axis), and the dependent variable is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.
The formula for the sample correlation coefficient is
$$r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}}$$
where Cov(x,y) is the covariance of x and y, defined as
$$\mathrm{Cov}(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$
$s_x^2$ and $s_y^2$ are the sample variances of x and y, defined as
$$s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad \text{and} \quad s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
The variances of x and y measure the variability of the x scores and y scores around their respective sample means ($\bar{x}$ and $\bar{y}$, considered separately). The covariance measures the variability of the (x,y) pairs around the mean of x and mean of y, considered simultaneously.
To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.
We first summarize the gestational age data. The mean gestational age is $\bar{x} = \sum x / n$, where n = 17 infants.
To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.
The variance of gestational age is $s_x^2 = \sum (x - \bar{x})^2 / (n - 1)$.
Next, we summarize the birth weight data. The mean birth weight is $\bar{y} = \sum y / n$.
The variance of birth weight is computed just as we did for gestational age, as shown in the table below.
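The variance of birth weight is $s_y^2 = \sum (y - \bar{y})^2 / (n - 1)$.
Next we compute the covariance, $\mathrm{Cov}(x,y) = \sum (x - \bar{x})(y - \bar{y}) / (n - 1)$.
To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant (i.e., $(x - \bar{x})(y - \bar{y})$).
The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.
The covariance of gestational age and birth weight is the sum of these products divided by n - 1.
We now compute the sample correlation coefficient as $r = \mathrm{Cov}(x,y) / \sqrt{s_x^2 \, s_y^2}$.
Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.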
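The hand computation described above (means, deviations, variances, covariance, then r) can be sketched in a few lines. The five (x, y) pairs below are hypothetical values made up for illustration; they are not the 17 infants from the study, whose data table is not reproduced here.

```python
import math

# Hypothetical (x, y) pairs for illustration only; NOT the study data.
gest_age_weeks = [34.0, 36.0, 38.0, 40.0, 42.0]
birth_weight_g = [2100.0, 2600.0, 2900.0, 3300.0, 3400.0]

n = len(gest_age_weeks)

# Step 1: sample means.
mean_x = sum(gest_age_weeks) / n
mean_y = sum(birth_weight_g) / n

# Step 2: deviations of each observation from its mean.
dev_x = [x - mean_x for x in gest_age_weeks]
dev_y = [y - mean_y for y in birth_weight_g]

# Step 3: sample variances (sum of squared deviations over n - 1).
var_x = sum(d ** 2 for d in dev_x) / (n - 1)
var_y = sum(d ** 2 for d in dev_y) / (n - 1)

# Step 4: sample covariance (sum of products of deviations over n - 1).
cov_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / (n - 1)

# Step 5: sample correlation coefficient.
r = cov_xy / math.sqrt(var_x * var_y)

print(f"var_x = {var_x:.1f}, var_y = {var_y:.1f}, cov_xy = {cov_xy:.1f}")
print(f"r = {r:.2f}")
```

Each step mirrors one of the tables in the worked example: the deviations feed both the variance and the covariance, so they only need to be computed once.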
As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller [1].
Regression Analysis
Regression analysis is a widely used technique which is useful for many applications. We introduce the technique here and expand on its uses in subsequent modules.
Simple Linear Regression
Simple linear regression is a technique that is appropriate to understand the association between one independent (or predictor) variable and one continuous dependent (or outcome) variable. For example, suppose we want to assess the association between total cholesterol (in milligrams per deciliter, mg/dL) and body mass index (BMI, measured as the ratio of weight in kilograms to height in meters squared), where total cholesterol is the dependent variable and BMI is the independent variable. In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X. So, in this case, Y = total cholesterol and X = BMI.
When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis. This analysis assumes that there is a linear association between the two variables. (If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.)
The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x,y) pair, in this case, BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable is on the horizontal axis and the dependent variable on the vertical axis.
BMI and Total Cholesterol
The graph shows that there is a positive or direct association between BMI and total cholesterol: participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels.
In contrast, the graph below depicts the relationship between BMI and HDL cholesterol in the same sample of n = 20 participants.
BMI and HDL Cholesterol
This graph shows a negative or inverse association between BMI and HDL cholesterol, i.e., those with lower BMI are more likely to have higher HDL cholesterol levels and those with higher BMI are more likely to have lower HDL cholesterol levels.
For either of these relationships we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is as follows:
$$\hat{Y} = b_0 + b_1 X$$
where $\hat{Y}$ is the predicted or expected value of the outcome, X is the predictor, $b_0$ is the estimated Y-intercept, and $b_1$ is the estimated slope. The Y-intercept and slope are estimated from the sample data so as to minimize the sum of the squared differences between the observed and the predicted values of the outcome, i.e., the estimates minimize
$$\sum (Y - \hat{Y})^2$$
These differences between observed and predicted values of the outcome are called residuals. The estimates of the Y-intercept and slope minimize the sum of the squared residuals, and are called the least squares estimates [1].
Residuals
Conceptually, if the values of X provided a perfect prediction of Y, then the sum of the squared differences between observed and predicted values of Y would be 0. That would mean that variability in Y could be completely explained by differences in X. However, if the differences between observed and predicted values are not 0, then we are unable to entirely account for differences in Y based on X, and there are residual errors in the prediction. The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y.
Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the differences between observed and predicted values of the outcome. The Y-intercept of this line is the value of the dependent variable (Y) when the independent variable (X) is zero. The slope of the line is the change in the dependent variable (Y) relative to a one-unit change in the independent variable (X). The least squares estimates of the Y-intercept and slope are computed as follows:
$$b_1 = r \, \frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{Y} - b_1 \bar{X}$$
where r is the sample correlation coefficient, the sample means are $\bar{X}$ and $\bar{Y}$, and $s_x$ and $s_y$ are the standard deviations of the independent variable x and the dependent variable y, respectively.
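As a rough sketch of how the least squares estimates follow from r, the means, and the standard deviations, the snippet below computes $b_1 = r \, s_y / s_x$ and $b_0 = \bar{Y} - b_1 \bar{X}$ and then checks that nudging either estimate makes the sum of squared residuals larger. The BMI and cholesterol values are hypothetical, and the function names are introduced here for illustration only.

```python
import math

def least_squares(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = mean_y - b1 * mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (s_x * s_y)
    b1 = r * s_y / s_x
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted outcomes."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data for illustration only (not the module's sample).
bmi = [20.0, 23.0, 25.0, 28.0, 32.0]
total_chol = [160.0, 175.0, 195.0, 205.0, 240.0]

b0, b1 = least_squares(bmi, total_chol)
best = sum_squared_residuals(bmi, total_chol, b0, b1)

# Any other line should fit worse in the least squares sense.
perturbed = [
    sum_squared_residuals(bmi, total_chol, b0 + db0, b1 + db1)
    for db0, db1 in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.1), (0.0, -0.1)]
]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSR = {best:.2f}")
print("All perturbed lines fit worse:", all(p > best for p in perturbed))
```

Trying a handful of perturbations is only a spot check, not a proof, but it shows what "minimize the sum of squared residuals" means in practice.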
BMI and Total Cholesterol
The least squares estimates of the regression coefficients, $b_0$ and $b_1$, describing the relationship between BMI and total cholesterol are $b_0$ = 28.07 and $b_1$ = 6.49. These are computed with the least squares formulas, $b_1 = r \, (s_y / s_x)$ and $b_0 = \bar{Y} - b_1 \bar{X}$.
The estimate of the Y-intercept ($b_0$ = 28.07) represents the estimated total cholesterol level when BMI is zero. Because a BMI of zero is meaningless, the Y-intercept is not informative. The estimate of the slope ($b_1$ = 6.49) represents the change in total cholesterol relative to a one-unit change in BMI. For example, if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by approximately 6.49 units (with the person with the higher BMI having the higher total cholesterol).
The equation of the regression line is as follows:
$$\hat{Y} = 28.07 + 6.49 \, (\text{BMI})$$
The graph below shows the estimated regression line superimposed on the scatter diagram.
The regression equation can be used to estimate a participant's total cholesterol as a function of his/her BMI. For example, suppose a participant has a BMI of 25. We would estimate their total cholesterol to be 28.07 + 6.49(25) = 190.32. The equation can also be used to estimate total cholesterol for other values of BMI. However, the equation should only be used to estimate cholesterol levels for persons whose BMIs are in the range of the data used to generate the regression equation. In our sample, BMI ranges from 20 to 32, thus the equation should only be used to generate estimates of total cholesterol for persons with BMI in that range.
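One way to make the range restriction concrete is to build it into the estimating function. The sketch below uses the coefficients and the BMI range quoted in the text; the function name and the decision to raise an error outside the range are illustrative choices, not part of the module.

```python
# Coefficients and observed BMI range as stated in the text.
B0 = 28.07   # estimated Y-intercept
B1 = 6.49    # estimated slope (change in total cholesterol per unit of BMI)
BMI_MIN, BMI_MAX = 20.0, 32.0

def estimate_total_cholesterol(bmi):
    """Estimate total cholesterol from BMI, refusing to extrapolate."""
    if not BMI_MIN <= bmi <= BMI_MAX:
        raise ValueError(
            f"BMI {bmi} is outside the range of the data ({BMI_MIN} to {BMI_MAX})"
        )
    return B0 + B1 * bmi

print(round(estimate_total_cholesterol(25), 2))  # 190.32, matching the text
```

Calling the function with a BMI of, say, 40 raises a `ValueError` rather than returning an estimate the data cannot support.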
There are statistical tests that can be performed to assess whether the estimated regression coefficients ($b_0$ and $b_1$) are statistically significantly different from zero. The test of most interest is usually $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$, where $\beta_1$ is the population slope. If the test indicates that the population slope is significantly different from zero, we conclude that there is a statistically significant association between the independent and dependent variables.
BMI and HDL Cholesterol
The least squares estimates of the regression coefficients, $b_0$ and $b_1$, describing the relationship between BMI and HDL cholesterol are as follows: $b_0$ = 111.77 and $b_1$ = -2.35. These are computed with the same least squares formulas used for total cholesterol.
Again, the Y-intercept is uninformative because a BMI of zero is meaningless. The estimate of the slope ($b_1$ = -2.35) represents the change in HDL cholesterol relative to a one-unit change in BMI. If we compare two participants whose BMIs differ by 1 unit, we would expect their HDL cholesterols to differ by approximately 2.35 units (with the person with the higher BMI having the lower HDL cholesterol). The figure below shows the regression line superimposed on the scatter diagram for BMI and HDL cholesterol.
Linear regression analysis rests on the assumption that the dependent variable is continuous and that the distribution of the dependent variable (Y) at each value of the independent variable (X) is approximately normal. Note, however, that the independent variable can be continuous (e.g., BMI) or can be dichotomous (see below).
Comparing Mean HDL Levels With Regression Analysis
Consider a clinical trial to evaluate the efficacy of a new drug to increase HDL cholesterol. We could compare the mean HDL levels between treatment groups statistically using a two independent samples t-test. Here we consider an alternate approach. Summary data for the trial are shown below:
Treatment    Sample Size    Mean HDL    Standard Deviation of HDL
New Drug     50             40.16       4.46
Placebo      50             39.21       3.91
HDL cholesterol is the continuous dependent variable and treatment assignment (new drug versus placebo) is the independent variable. Suppose the data on n = 100 participants are entered into a statistical computing package. The outcome (Y) is HDL cholesterol in mg/dL and the independent variable (X) is treatment assignment. For this analysis, X is coded as 1 for participants who received the new drug and as 0 for participants who received the placebo. A simple linear regression equation is estimated as follows:
$$\hat{Y} = 39.21 + 0.95 X$$
where $\hat{Y}$ is the estimated HDL level and X is a dichotomous variable (also called an indicator variable, in this case indicating whether the participant was assigned to the new drug or to placebo). The estimate of the Y-intercept is $b_0$ = 39.21. The Y-intercept is the value of Y (HDL cholesterol) when X is zero. In this example, X = 0 indicates assignment to the placebo group. Thus, the Y-intercept is exactly equal to the mean HDL level in the placebo group. The slope is estimated as $b_1$ = 0.95. The slope represents the estimated change in Y (HDL cholesterol) relative to a one-unit change in X. A one-unit change in X represents a difference in treatment assignment (placebo versus new drug). The slope represents the difference in mean HDL levels between the treatment groups. Thus, the mean HDL for participants receiving the new drug is:
$$\hat{Y} = 39.21 + 0.95(1) = 40.16$$
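The claim that, with a 0/1 predictor, the intercept equals the mean of the group coded 0 and the slope equals the difference in group means can be checked numerically. The individual HDL values below are hypothetical (the trial's participant-level data are not given), and `fit_line` is a helper introduced here for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical HDL values (mg/dL) for illustration only.
placebo_hdl = [38.0, 41.0, 39.0, 40.0]
new_drug_hdl = [42.0, 39.0, 41.0, 44.0]

# Indicator coding as in the text: X = 0 for placebo, X = 1 for new drug.
x = [0] * len(placebo_hdl) + [1] * len(new_drug_hdl)
y = placebo_hdl + new_drug_hdl

b0, b1 = fit_line(x, y)
mean_placebo = sum(placebo_hdl) / len(placebo_hdl)
mean_new_drug = sum(new_drug_hdl) / len(new_drug_hdl)

print(f"b0 = {b0:.2f}  (placebo mean = {mean_placebo:.2f})")
print(f"b1 = {b1:.2f}  (difference in means = {mean_new_drug - mean_placebo:.2f})")
```

The same relationship is what lets the trial's summary table be read directly as a regression: 39.21 is the placebo mean and 0.95 = 40.16 - 39.21 is the difference in means.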
A study was conducted to assess the association between a person's intelligence and the size of their brain. Participants completed a standardized IQ test and researchers used Magnetic Resonance Imaging (MRI) to determine brain size. Demographic information, including the patient's gender, was also recorded.
The Controversy Over Environmental Tobacco Smoke Exposure
There is convincing evidence that active smoking is a cause of lung cancer and heart disease. Many studies done in a wide variety of circumstances have consistently demonstrated a strong association and also indicate that the risk of lung cancer and cardiovascular disease (e.g., heart attacks) increases in a dose-related way. These studies have led to the conclusion that active smoking is causally related to lung cancer and cardiovascular disease. Studies in active smokers have had the advantage that the lifetime exposure to tobacco smoke can be quantified with reasonable accuracy, since the unit dose is consistent (one cigarette) and the habitual nature of tobacco smoking makes it possible for most smokers to provide a reasonable estimate of their total lifetime exposure, quantified in terms of cigarettes per day or packs per day. Frequently, average daily exposure (cigarettes or packs) is combined with duration of use in years in order to quantify exposure as "pack-years".
It has been much more difficult to establish whether environmental tobacco smoke (ETS) exposure is causally related to chronic diseases like heart disease and lung cancer, because the total lifetime exposure dosage is lower, and it is much more difficult to accurately estimate total lifetime exposure. In addition, quantifying these risks is complicated by confounding factors. For example, ETS exposure is usually classified based on parental or spousal smoking, but these studies are unable to quantify other environmental exposures to tobacco smoke, and the inability to quantify and adjust for other environmental exposures such as air pollution makes it difficult to demonstrate an association even if one existed. As a result, there continues to be controversy over the risk imposed by environmental tobacco smoke (ETS). Some have gone so far as to claim that even very brief exposure to ETS can cause a myocardial infarction (heart attack), but a very large prospective cohort study by Enstrom and Kabat was unable to demonstrate significant associations between exposure to spousal ETS and coronary heart disease, chronic obstructive pulmonary disease, or lung cancer. (It should be noted, however, that the report by Enstrom and Kabat has been widely criticized for methodological problems, and these authors also had financial ties to the tobacco industry.)
Correlation analysis provides a useful tool for thinking about this controversy. Consider data from the British Doctors Cohort. They reported the annual mortality for a variety of diseases at four levels of cigarette smoking per day: never smoked, 1-14/day, 15-24/day, and 25+/day. In order to perform a correlation analysis, I rounded the exposure levels to 0, 10, 20, and 30, respectively.
Cigarettes Smoked Per Day    CVD Mortality/100,000 men/yr    Lung Cancer Mortality/100,000 men/yr
0                            572                             14
10 (actually 1-14)           802                             105
20 (actually 15-24)          892                             208
30 (actually >24)            1025                            355
The figures below show the two estimated regression lines superimposed on the scatter diagram. The correlation with amount of smoking was strong for both CVD mortality (r = 0.98) and for lung cancer (r = 0.99). Note also that the Y-intercept is a meaningful number here; it represents the predicted annual death rate from these diseases in individuals who never smoked. The Y-intercept for prediction of CVD is slightly higher than the observed rate in never-smokers, while the Y-intercept for lung cancer is lower than the observed rate in never-smokers.
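These correlations and intercepts can be reproduced from the four data points, with exposure rounded to 0, 10, 20, and 30 cigarettes per day as described. The sketch below is a minimal check using only the standard library; the function name is introduced here for illustration, and the slope is computed as $S_{xy}/S_{xx}$, which is algebraically the same as $r \, s_y / s_x$.

```python
import math

def correlation_and_line(xs, ys):
    """Return (r, b0, b1) for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    b1 = sxy / sxx              # equivalent to r * s_y / s_x
    b0 = mean_y - b1 * mean_x
    return r, b0, b1

# Mortality per 100,000 men per year at the rounded exposure levels.
cigarettes_per_day = [0, 10, 20, 30]
cvd_mortality = [572, 802, 892, 1025]
lung_cancer_mortality = [14, 105, 208, 355]

r_cvd, b0_cvd, b1_cvd = correlation_and_line(cigarettes_per_day, cvd_mortality)
r_lung, b0_lung, b1_lung = correlation_and_line(cigarettes_per_day, lung_cancer_mortality)

print(f"CVD:         r = {r_cvd:.2f}, intercept = {b0_cvd:.1f}, slope = {b1_cvd:.2f}")
print(f"Lung cancer: r = {r_lung:.2f}, intercept = {b0_lung:.1f}, slope = {b1_lung:.2f}")
```

With only four aggregated points, the very high correlations describe how closely the group-level rates follow a line; they say nothing about variability among individual smokers.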
The linearity of these relationships suggests that there is an incremental risk with each additional cigarette smoked per day, and the additional risk is estimated by the slopes. This perhaps helps us think about the consequences of ETS exposure. For example, the risk of lung cancer in never-smokers is quite low, but there is a finite risk; various reports suggest a risk of 10-15 lung cancers/100,000 per year. If an individual who never smoked actively was exposed to the equivalent of one cigarette's smoke per day in the form of ETS, then the regression suggests that their risk would increase by 11.26 lung cancer deaths per 100,000 per year. However, the risk is clearly dose-related. Therefore, if a non-smoker was employed by a tavern with heavy levels of ETS, the risk might be substantially greater.
Finally, it should be noted that some findings suggest that the association between smoking and heart disease is non-linear at the very lowest exposure levels, meaning that non-smokers have a disproportionate increase in risk when exposed to ETS due to an increase in platelet aggregation.
Summary
Correlation and linear regression analysis are statistical techniques to quantify associations between an independent, sometimes called a predictor, variable (X) and a continuous dependent outcome variable (Y). For correlation analysis, the independent variable (X) can be continuous (e.g., gestational age) or ordinal (e.g., increasing categories of cigarettes per day). Regression analysis can also accommodate dichotomous independent variables.
The procedures described here assume that the association between the independent and dependent variables is linear. With some adjustments, regression analysis can also be used to estimate associations that follow another functional form (e.g., curvilinear, quadratic). Here we consider associations between one independent variable and one continuous dependent variable. The regression analysis is called simple linear regression; simple in this case refers to the fact that there is a single independent variable. In the next module, we consider regression analysis with several independent variables, or predictors, considered simultaneously.