
1. Develop a suitable simple linear regression model to check if
there is any relationship between Total Cost to Hospital and
AGE. For the fitted model, interpret the regression coefficient
corresponding to AGE.
> library("ISLR", lib.loc="~/R/win-library/3.3")
> d<-read.csv('E:/KOZHI official/4. Term 4/DA-R/Assignment 2 mission hospital/Mission_2.csv',header=T)
> names(d)
 [1] "SL."                  "AGE"                  "GENDER"
 [4] "MALE"                 "MARITAL.STATUS"       "UNMARRIED"
 [7] "KEY.COMPLAINTS..CODE" "ACHD"                 "CAD.DVD"
[10] "CAD.SVD"              "CAD.TVD"              "CAD.VSD"
.
.
.
> attach(d)
> mod_1<-lm(TOTAL.COST.TO.HOSPITAL~AGE)
> summary(mod_1)
Call:
lm(formula = TOTAL.COST.TO.HOSPITAL ~ AGE)
Residuals:
    Min      1Q  Median      3Q     Max
-232683  -61888  -19440   28238  600773
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 141216.6    10610.7  13.309  < 2e-16 ***
AGE           1991.2      273.8   7.273 4.67e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 111400 on 246 degrees of freedom
Multiple R-squared: 0.177, Adjusted R-squared: 0.1736
F-statistic: 52.9 on 1 and 246 DF, p-value: 4.672e-12
> plot(mod_1,which=c(1,2))

The graphs above show that the assumptions of normality and homoscedasticity
are not satisfied. The residuals vs fitted graph shows a pattern: the residuals
are clustered at lower fitted values and spread far apart at higher fitted
values. This shows that the variance of the residuals is not constant; it
depends on the fitted values.
Similarly, in the Normal QQ plot the points deviate from the normal line. Hence
the underlying assumptions of the linear regression model are not satisfied.
So we try the log-linear model.
Log model
> mod_2<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE)
> summary(mod_2)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE)
Residuals:
     Min       1Q   Median       3Q      Max
-1.51748 -0.24402 -0.00536  0.25388  1.39912
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.814724   0.043326 272.693  < 2e-16 ***
AGE          0.008565   0.001118   7.662 4.21e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.455 on 246 degrees of freedom
Multiple R-squared: 0.1927, Adjusted R-squared: 0.1894
F-statistic: 58.7 on 1 and 246 DF, p-value: 4.212e-13
> plot(mod_2, which=c(1,2))

In the residual vs fitted graph the residuals now scatter randomly, and the
pattern that was visible in the previous graph is gone. The Normal QQ plot also
shows a better fit to normality than the previous plot.
The coefficient beta 1 = 0.008565 is on the log scale. A one-year increase in
age multiplies the predicted total cost to hospital by a factor of
exp(0.008565) = 1.0086, i.e. an increase of about 0.86% per year of age.
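
The back-transformation behind this interpretation can be checked directly. The following is a minimal sketch that uses only the AGE estimate printed by summary(mod_2); the names b_age, per_year and per_decade are illustrative and were not part of the original session.

```r
# AGE coefficient from summary(mod_2), on the log-cost scale
b_age <- 0.008565

# Multiplicative effect on the predicted cost of one extra year of age
per_year <- exp(b_age)

# The same effect accumulated over ten years of age
per_decade <- exp(10 * b_age)

# Both effects expressed as percentage changes in predicted cost
round(100 * (c(per_year = per_year, per_decade = per_decade) - 1), 2)
```

Because the model is linear in log cost, effects over several years compound multiplicatively rather than add.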

2. At the time of admission, suppose a patient's age is 50 years.
Based on the fitted model in (1), what will be the minimum cost of
treatment for this patient at 95% confidence level?
> p<-predict(mod_2,data.frame(AGE=50),interval="prediction")
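> p
       fit      lwr      upr
1 12.24298 11.34373 13.14223
> exp(p[2])
[1] 84434.41
The lower limit of the 95% prediction interval, back-transformed from the log
scale, gives a minimum cost of about Rs. 84,434.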
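
As a cross-check, the fitted value can be rebuilt from the printed coefficients and the interval limits moved back to the rupee scale. This is a sketch that hard-codes the numbers shown above instead of refitting the model; the variable names are illustrative.

```r
# Coefficients from summary(mod_2)
b0 <- 11.814724
b1 <- 0.008565

# Fitted log cost for a 50-year-old patient
fit_log <- b0 + b1 * 50

# Limits of the 95% prediction interval as printed by predict()
lwr_log <- 11.34373
upr_log <- 13.14223

# Back-transform from log cost to cost
exp(c(fit = fit_log, lwr = lwr_log, upr = upr_log))
```

The interval is symmetric on the log scale but not on the cost scale, which is why the upper limit lies much further from the fitted cost than the lower limit.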

3. Suppose Mission Hospital is planning to introduce a package
price for the treatment and has decided to charge INR 250,000 for
patients of age 50 years. What is the probability that the
treatment cost will exceed the package price? Do you think that
the Mission Hospital should revise the package price?
The residual standard error acts as a proxy for sigma, the standard deviation
of the errors on the log scale (not sigma squared).
Residual standard error: 0.455 on 246 degrees of freedom
> 1-pnorm((log(250000)-p[1])/.455)
[1] 0.3411574
The probability that the treatment cost exceeds the package price is about 34%.
The hospital need not revise the package price, as it is greater than the
predicted cost for a 50-year-old patient, exp(12.24298), about INR 207,500.
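
The probability calculation can be laid out step by step. This sketch assumes, as the answer above does, that log cost for a 50-year-old is normal with mean equal to the fitted value and standard deviation equal to the residual standard error; it ignores the uncertainty in the fitted value itself. The variable names are illustrative.

```r
fit_log   <- 12.24298   # fitted log cost at AGE = 50, from predict()
sigma_hat <- 0.455      # residual standard error of mod_2
price     <- 250000     # proposed package price in INR

# Standardised distance of the package price from the fitted log cost
z <- (log(price) - fit_log) / sigma_hat

# Upper-tail probability that the cost exceeds the package price
p_exceed <- pnorm(z, lower.tail = FALSE)
p_exceed
```

Using lower.tail = FALSE is equivalent to 1 - pnorm(z) and avoids the subtraction.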

4. Build a simple linear regression model between Total Cost to
Hospital and GENDER. Interpret the results.
> mod_3<-lm(TOTAL.COST.TO.HOSPITAL~GENDER)
> plot(mod_3, which=c(1,2))

> mod_4<-lm(log(TOTAL.COST.TO.HOSPITAL)~GENDER)
> plot(mod_4, which=c(1,2))

> summary(mod_4)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ GENDER)
Residuals:
     Min       1Q   Median       3Q      Max
-1.31142 -0.28273 -0.08258  0.26109  1.57082
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.93436    0.05503 216.865  < 2e-16 ***
GENDERM      0.19082    0.06726   2.837  0.00493 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4983 on 246 degrees of freedom
Multiple R-squared: 0.03168, Adjusted R-squared: 0.02774
F-statistic: 8.048 on 1 and 246 DF, p-value: 0.004934
> contrasts(GENDER)
  M
F 0
M 1
> exp(0.19082)
[1] 1.210242
Gender, being a qualitative variable, enters the regression model as a dummy
variable. The contrasts command shows that it is coded as 1 for male and 0 for
female. The dummy variable formed is GENDERM. The model shows that for males
the total cost to hospital is multiplied by a factor of exp(0.19082) = 1.210242,
i.e. about 21% higher than for females, and the p-value (0.00493) shows the
effect to be significant.
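
An interval for the male-to-female cost ratio can be obtained from the same output. This is a sketch that assumes the usual t-based interval on the log scale, using the estimate, standard error and degrees of freedom printed by summary(mod_4); the variable names are illustrative.

```r
est <- 0.19082   # GENDERM estimate (log scale)
se  <- 0.06726   # its standard error
df  <- 246       # residual degrees of freedom

# Point estimate of the cost ratio, male relative to female
ratio <- exp(est)

# 95% interval for the ratio: build it on the log scale, then exponentiate
ci <- exp(est + c(-1, 1) * qt(0.975, df) * se)

round(c(ratio = ratio, lower = ci[1], upper = ci[2]), 3)
```

An interval that excludes 1 corresponds to a coefficient that is significant at the 5% level.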

5. Build a simple linear regression model between Total Cost to
Hospital and MARITAL STATUS. Interpret the results.

> contrasts(MARITAL.STATUS)
          UNMARRIED
MARRIED           0
UNMARRIED         1
> mod_6<-lm(log(TOTAL.COST.TO.HOSPITAL)~MARITAL.STATUS)
> summary(mod_6)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ MARITAL.STATUS)
Residuals:
    Min      1Q  Median      3Q     Max
-1.3608 -0.2360 -0.0334  0.2396  1.4042
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             12.29182    0.04466 275.229   <2e-16 ***
MARITAL.STATUSUNMARRIED -0.40697    0.05944  -6.847    6e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4641 on 246 degrees of freedom
Multiple R-squared: 0.1601, Adjusted R-squared: 0.1566
F-statistic: 46.88 on 1 and 246 DF, p-value: 5.998e-11
> exp(-0.40697)
[1] 0.6656642
Marital status, being a qualitative variable, enters the regression model as a
dummy variable. The contrasts command shows that it is coded as 1 for unmarried
and 0 for married. The dummy variable formed is MARITAL.STATUSUNMARRIED. The
model shows that for unmarried patients the total cost to hospital is lower: it
is multiplied by a factor of exp(-0.40697) = 0.6656642, i.e. about 33% lower
than for married patients, and the p-value (6e-11) shows the effect to be
significant.

6. Build a multiple linear regression model with Total Cost to
Hospital as dependent variable, and AGE, GENDER and
MARITAL STATUS as predictors. Compare the results with that of
(4) and (5).
> mod_11<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE+GENDER+MARITAL.STATUS)
> summary(mod_11)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + GENDER + MARITAL.STATUS)
Residuals:
    Min      1Q  Median      3Q     Max
-1.5285 -0.2603 -0.0104  0.2470  1.3529
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)             11.790187   0.151136  78.011  < 2e-16 ***
AGE                      0.007637   0.002555   2.989  0.00308 **
GENDERM                  0.104211   0.062490   1.668  0.09667 .
MARITAL.STATUSUNMARRIED -0.032630   0.132570  -0.246  0.80578
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4543 on 244 degrees of freedom
Multiple R-squared: 0.2019, Adjusted R-squared: 0.1921
F-statistic: 20.58 on 3 and 244 DF, p-value: 6.394e-12

Only AGE is significant at the 5% level (p = 0.00308). GENDER (p = 0.09667)
and MARITAL STATUS (p = 0.80578) are not, although in (4) and (5) both were
significant. Considered one at a time, gender and marital status each appear
to have a significant effect on the total cost to hospital; once AGE is
included in the combined model, their effects are no longer significant.
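
One common reason for this behaviour is correlation between predictors. The sketch below uses simulated data, not the Mission Hospital data, and assumes that marital status is strongly tied to age (here, every simulated patient under 25 is unmarried) while cost depends on age only. All names and numbers in it are hypothetical.

```r
set.seed(1)
n <- 248

# Simulated predictors: unmarried is defined purely by age (an assumption)
age       <- runif(n, min = 0, max = 80)
unmarried <- as.integer(age < 25)

# Simulated log cost depends on age only, not on marital status
log_cost <- 11.8 + 0.0085 * age + rnorm(n, sd = 0.45)

# Marital status alone, then together with age
single <- lm(log_cost ~ unmarried)
joint  <- lm(log_cost ~ age + unmarried)

# Standard error of the unmarried coefficient in each model
se_single <- summary(single)$coefficients["unmarried", "Std. Error"]
se_joint  <- summary(joint)$coefficients["unmarried", "Std. Error"]
c(single = se_single, joint = se_joint)
```

In the simulation the dummy variable picks up the age effect when it is used alone, and its standard error grows once age enters the model. Whether the same mechanism explains the results above would have to be checked on the actual data, for example by looking at how AGE differs between married and unmarried patients.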

7. Build a multiple linear regression model with appropriate set of
predictors. Identify the statistically significant predictors that the
Mission Hospital can use in predicting Total Cost to Hospital.
Comment on the performance of the fitted model.
> mod_9<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE+MALE+UNMARRIED+ACHD+CAD.DVD+CAD.SVD+
    CAD.TVD+CAD.VSD+OS.ASD+other..heart+other..respiratory+other.general+
    other.nervous+other.tertalogy+PM.VSD+RHD+BODY.WEIGHT+BODY.HEIGHT+HR.PULSE+
    BP..HIGH+BP.LOW+RR+Diabetes1+Diabetes2+hypertension1+hypertension2+
    hypertension3+other+HB+UREA+CREATININE+AMBULANCE+TRANSFERRED+ELECTIVE)
> summary(mod_9)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + MALE + UNMARRIED +
ACHD + CAD.DVD + CAD.SVD + CAD.TVD + CAD.VSD + OS.ASD + other..heart +
other..respiratory + other.general + other.nervous + other.tertalogy +
PM.VSD + RHD + BODY.WEIGHT + BODY.HEIGHT + HR.PULSE + BP..HIGH +
BP.LOW + RR + Diabetes1 + Diabetes2 + hypertension1 + hypertension2 +
hypertension3 + other + HB + UREA + CREATININE + AMBULANCE +
TRANSFERRED + ELECTIVE)
Residuals:
     Min       1Q   Median       3Q      Max
-0.96533 -0.18093 -0.01659  0.19462  1.19165
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        10.4195765  0.4989676  20.882  < 2e-16 ***
AGE                 0.0085850  0.0030825   2.785 0.006015 **
MALE               -0.0410937  0.0716926  -0.573 0.567339
UNMARRIED           0.0964430  0.1444585   0.668 0.505364
ACHD                0.0606913  0.1454933   0.417 0.677148
CAD.DVD             0.4675391  0.1300201   3.596 0.000433 ***
CAD.SVD             0.3492459  0.3141862   1.112 0.268025
CAD.TVD             0.3441462  0.1408546   2.443 0.015670 *
CAD.VSD             0.3220618  0.4186867   0.769 0.442926
OS.ASD              0.2303903  0.1517427   1.518 0.130964
other..heart        0.2947377  0.1152326   2.558 0.011488 *
other..respiratory  0.0736222  0.2061631   0.357 0.721494
other.general      -1.6289222  0.4634972  -3.514 0.000577 ***
other.nervous       0.6509382  0.4193210   1.552 0.122602
other.tertalogy     0.3684828  0.1693884   2.175 0.031108 *
PM.VSD              0.2809374  0.2406915   1.167 0.244907
RHD                 0.5645466  0.1333216   4.234  3.9e-05 ***
BODY.WEIGHT         0.0022855  0.0037020   0.617 0.537890
BODY.HEIGHT         0.0005591  0.0016910   0.331 0.741381
HR.PULSE            0.0050994  0.0019315   2.640 0.009129 **
BP..HIGH           -0.0021987  0.0023049  -0.954 0.341603
BP.LOW             -0.0005388  0.0032198  -0.167 0.867311
RR                  0.0173013  0.0090719   1.907 0.058343 .
Diabetes1          -0.0931856  0.1643344  -0.567 0.571496
Diabetes2           0.2090071  0.1756235   1.190 0.235820
hypertension1      -0.0623585  0.1217057  -0.512 0.609116
hypertension2      -0.2203463  0.1496889  -1.472 0.143028
hypertension3       0.1137384  0.1999772   0.569 0.570339
other              -0.0703775  0.1239298  -0.568 0.570932
HB                  0.0027892  0.0118002   0.236 0.813456
UREA                0.0008210  0.0026521   0.310 0.757307
CREATININE          0.2667857  0.1271125   2.099 0.037444 *
AMBULANCE           0.1048268  0.3199244   0.328 0.743607
TRANSFERRED        -0.2662347  0.2261663  -1.177 0.240923
ELECTIVE            0.0878894  0.3115261   0.282 0.778221
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3965 on 156 degrees of freedom
(57 observations deleted due to missingness)
Multiple R-squared: 0.5307, Adjusted R-squared: 0.4285
F-statistic: 5.19 on 34 and 156 DF, p-value: 5.174e-13
The predictors significant at the 5% level in the table above are AGE, CAD.DVD,
CAD.TVD, other..heart, other.general, other.tertalogy, RHD, HR.PULSE and
CREATININE.

> mod_10<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE+CAD.DVD+CAD.TVD+other..heart+
    other.general+other.tertalogy+RHD+HR.PULSE+CREATININE)
> summary(mod_10)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + CAD.DVD + CAD.TVD +
other..heart + other.general + other.tertalogy + RHD + HR.PULSE +
CREATININE)
Residuals:
     Min       1Q   Median       3Q      Max
-1.06605 -0.20151 -0.02119  0.19485  1.26342
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     10.974447   0.176893  62.040  < 2e-16 ***
AGE              0.006630   0.001672   3.965 0.000101 ***
CAD.DVD          0.401122   0.105391   3.806 0.000186 ***
CAD.TVD          0.388842   0.109755   3.543 0.000490 ***
other..heart     0.221803   0.074259   2.987 0.003162 **
other.general   -1.544496   0.419724  -3.680 0.000298 ***
other.tertalogy  0.288918   0.114124   2.532 0.012103 *
RHD              0.490360   0.100450   4.882 2.11e-06 ***
HR.PULSE         0.005739   0.001594   3.600 0.000399 ***
CREATININE       0.223745   0.064466   3.471 0.000633 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4061 on 205 degrees of freedom
(33 observations deleted due to missingness)
Multiple R-squared: 0.4232, Adjusted R-squared: 0.3979
F-statistic: 16.71 on 9 and 205 DF, p-value: < 2.2e-16

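The model refitted with only the significant predictors has a multiple
R-squared of 0.4232 and an adjusted R-squared of 0.3979, so it explains
about 42% of the variation in log total cost. All nine predictors remain
significant at the 5% level. The full model has a slightly higher adjusted
R-squared (0.4285), but it uses 34 predictors and was fitted on fewer
observations because of missing values.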
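
The adjusted R-squared values of the two models can be reproduced from the printed output. This sketch assumes the standard formula for a model with an intercept, takes the number of observations as residual degrees of freedom plus the number of estimated coefficients, and uses a helper named adj_r2 that is illustrative, not part of the original session.

```r
# Adjusted R-squared for a model with an intercept and p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# mod_9: 156 residual df + 35 coefficients = 191 observations, 34 predictors
full <- adj_r2(0.5307, n = 156 + 35, p = 34)

# mod_10: 205 residual df + 10 coefficients = 215 observations, 9 predictors
reduced <- adj_r2(0.4232, n = 205 + 10, p = 9)

round(c(full = full, reduced = reduced), 4)
```

The penalty term (n - 1) / (n - p - 1) is much larger for the 34-predictor model, which is why its adjusted R-squared drops so far below its multiple R-squared.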
