SYNOPSIS
A multiple regression model is built on data for baseball players to determine whether
the salary paid is based on player performance. The data has seven independent
variables and one dependent variable. Salary was taken as the dependent variable and
the influence of each independent variable was checked. The insignificant
independent variables were removed and the model was rerun in SPSS to obtain
the multiple regression equation Y = 58.68 + 28.19 Runs, which implies that salary
depended only on the runs scored. The model is not suitable for prediction because the
coefficient of variation of the regression is higher than desired.
[SID - 309270901] ARUN RAJAN
Table of Contents
INTRODUCTION
Multicollinearity
PREDICTION
CONCLUSIONS
REFERENCES
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5
APPENDIX 6
Introduction
The dataset consists of salary and performance statistics for baseball players. A
multiple regression model is built to find the influence of performance on the salary
of the players. First, the data is checked with scatter plots to see whether it is
appropriate. Then the data is cleaned and an initial model is built to find the correlation
coefficients. A linear regression model is run to check whether the explanatory
variables are significant in explaining the dependent variable. The residual plots
are checked for heteroskedasticity, and the model is also checked for multicollinearity.
Finally, a multiple regression line is established and interpreted.
Scatter plots are used to determine the relationship between two variables and also to
find out whether the data needs cleaning. Here they are used to examine the relationship
between the dependent variable y and each of the independent variables: whether a
relationship exists, whether it is linear, whether it is positive or negative, whether it is
weak or strong, and whether there are any outliers. Even if no relationship is visible, the
linear regression model can still be created, as scatter plots do not take into consideration
the effects of the other variables.
Correlation coefficient analysis is used to measure the relationship between two variables.
The results are shown in the correlation table (Appendix 2). Two variables have a
strong linear relationship if the correlation coefficient is close to 1 or -1, and the
relationship is statistically significant if the significance value is below 0.05.
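The correlation check described above can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure used in the report: it uses the 32 rows listed in Appendix 1, whereas the correlation table in Appendix 2 is based on N = 50, so the printed values are not expected to match that table exactly. The function name `pearson` is an assumption made for this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Salary, Runs and Hits columns of the 32 rows listed in Appendix 1.
salary = [3300, 2600, 2500, 2475, 2313, 2175, 600, 460, 240, 200, 177,
          140, 117, 115, 2600, 1907, 1190, 990, 925, 365, 302, 300,
          129, 111, 4450, 4125, 3213, 2319, 2000, 1600, 1394, 935]
runs = [69, 58, 54, 59, 87, 104, 34, 16, 40, 39, 7,
        21, 4, 1, 69, 60, 39, 59, 22, 12, 83, 73,
        16, 4, 87, 69, 45, 108, 43, 44, 58, 40]
hits = [153, 111, 115, 128, 169, 170, 86, 38, 61, 64, 38,
        45, 5, 6, 141, 130, 108, 141, 81, 35, 134, 149,
        48, 13, 130, 150, 71, 146, 84, 96, 108, 70]

r_salary_runs = pearson(salary, runs)
r_runs_hits = pearson(runs, hits)
print(f"salary vs runs: {r_salary_runs:.3f}")
print(f"runs vs hits:   {r_runs_hits:.3f}")
```

A value near 1 for runs against hits would point to the same overlap between these two predictors that the VIF check picks up later.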
H0 : βj = 0 for all j
H1 : At least one βj ≠ 0
If the significance value p is below 0.05, we reject the null hypothesis at the 5% level
and accept the alternative hypothesis. This means that at least one β value differs from 0
and that there is a statistically significant relationship between the dependent and the
independent variables.
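As an illustration of the 5% significance rule, the sketch below tests two correlations from the Salary row of the correlation table (N = 50) with the standard t statistic for a correlation coefficient, t = r·sqrt(n − 2)/sqrt(1 − r²). The critical value of about 2.01 for 48 degrees of freedom is taken from standard t tables and is an assumption of this sketch, as are the function and constant names; SPSS reports the exact p-value instead.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: no linear correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

N = 50             # sample size reported in the correlation table
T_CRITICAL = 2.01  # approx. two-sided 5% critical value for 48 df (assumed)

# Correlations with Salary, read from the correlation table.
correlations = {"runs": 0.657, "battingavg": 0.099}

for name, r in correlations.items():
    t = correlation_t_statistic(r, N)
    verdict = "reject H0" if abs(t) > T_CRITICAL else "fail to reject H0"
    print(f"{name}: t = {t:.2f} -> {verdict}")
```

Under these assumptions the runs correlation is clearly significant while the batting average correlation is not, which agrees with the significance flags in the table.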
A linear regression model is built with salary as the dependent variable and batting
average, on-base percentage, number of runs, number of hits, triples scored, home runs
and errors as the independent variables. The result is shown in Appendix 4.
Each unstandardized coefficient βi gives the average change in y for a unit change in xi,
provided all the other xj remain constant. For example, each additional run scored is
associated with an increase of $4.13 in salary, provided that the other independent
variables are held constant.
The adjusted R2 value of 0.466 tells us that about 46.6% of the variability of the dependent
variable is explained by the independent variables.
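For reference, adjusted R2 penalises R2 for the number of predictors: adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1). The sketch below applies this standard formula and its inverse. It assumes n = 50 observations (the N in the correlation table) and k = 7 predictors (the full model described above); the unadjusted R2 it prints is implied by those assumptions and is not a figure reported in the source.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def r_squared_from_adjusted(adj_r2: float, n: int, k: int) -> float:
    """Invert the adjustment to recover the unadjusted R^2."""
    return 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)

ADJ_R2 = 0.466  # reported in the text
N = 50          # assumed from the correlation table
K = 7           # assumed: seven predictors in the full model

implied_r2 = r_squared_from_adjusted(ADJ_R2, N, K)
print(f"implied unadjusted R^2: {implied_r2:.3f}")
print(f"round trip adjusted R^2: {adjusted_r_squared(implied_r2, N, K):.3f}")
```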
The VIF values of "runs" and "hits" are both above 5, which indicates
multicollinearity.
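The variance inflation factor for a predictor is VIF = 1/(1 − R2), where R2 comes from regressing that predictor on all the other predictors. The sketch below is a simplification: it uses only the pairwise correlation between runs and hits from the correlation table (0.936), which gives a lower bound on the VIF each of them would have in the full model. The function name is an assumption made for this illustration.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """Variance inflation factor given the R^2 of a predictor on the others."""
    return 1.0 / (1.0 - r_squared)

R_RUNS_HITS = 0.936   # Pearson correlation of runs and hits (Appendix 2)
VIF_THRESHOLD = 5.0   # rule of thumb used in the text

# With a single other predictor, R^2 is just the squared correlation.
vif_lower_bound = vif_from_r_squared(R_RUNS_HITS ** 2)
print(f"VIF lower bound for runs/hits: {vif_lower_bound:.2f}")
print("multicollinearity flagged:", vif_lower_bound > VIF_THRESHOLD)
```

Because adding more predictors can only raise the R2 of that auxiliary regression, the pairwise figure alone is already enough to exceed the threshold of 5.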
Multicollinearity
The rule for fixing multicollinearity is to remove one of the variables that has a VIF
above 5. In this model we can remove either "runs" or "hits". Since runs are likely to
determine salary more than hits, the variable "hits" is
removed and the model is run again.
The results for the new model are shown in Appendix 5. They show that the
multicollinearity has been removed from the model, since the VIF values of all variables
are below 5. Subsequently, the residual plots are analysed to examine whether there is a
non-linear relationship, heteroskedasticity or autocorrelation.
The Sig. values for batting average, OBP, errors, triples and runs are all above 0.05, so we
first remove the variable with the largest Sig. value above 0.05 and run the
model again. This process is repeated until a model is obtained in which
every variable has a Sig. value below 0.05. The final model is given in
Appendix 6.
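The elimination procedure described above can be sketched as a small loop. This is a minimal illustration, not the SPSS workflow used in the report: `fit_p_values` is a hypothetical callback that refits the model on the given predictors and returns their Sig. values, and the p-values in the toy stub are made-up placeholders, not the SPSS output.

```python
from typing import Callable, Dict, List

def backward_eliminate(predictors: List[str],
                       fit_p_values: Callable[[List[str]], Dict[str, float]],
                       alpha: float = 0.05) -> List[str]:
    """Drop the least significant predictor and refit until all p < alpha."""
    remaining = list(predictors)
    while remaining:
        p_values = fit_p_values(remaining)          # refit on current set
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] < alpha:                 # everything is significant
            break
        remaining.remove(worst)                     # remove one variable per pass
    return remaining

# Toy stub with placeholder p-values (hypothetical, for illustration only).
# A real fit would return different p-values after each refit.
PLACEHOLDER_P = {"battingavg": 0.81, "obp": 0.64, "errors": 0.35,
                 "triples": 0.22, "homeruns": 0.09, "runs": 0.001}

def toy_fit(names: List[str]) -> Dict[str, float]:
    return {name: PLACEHOLDER_P[name] for name in names}

final_predictors = backward_eliminate(list(PLACEHOLDER_P), toy_fit)
print(final_predictors)
```

Removing only one variable per pass matters because the Sig. values of the remaining variables change every time the model is refitted.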
The final model shows no problems: the significance value in the ANOVA table is 0.000
and the VIFs are all less than 5.
After removing all the explanatory variables that caused problems in the model, the
final regression model was created. The equation for the best regression model is:
Y = 58.68 + 28.19 Runs
Each additional run scored is thus associated with an average increase of 28.19 in salary.
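As a worked example, the final equation Y = 58.68 + 28.19 Runs can be evaluated for any number of runs. The sketch below does this for the first player listed in Appendix 1 (69 runs, salary 3300) and shows the resulting residual. The function and constant names are assumptions made for this illustration, and salary is in the same units as the Salary column.

```python
INTERCEPT = 58.68         # constant term of the final model
RUNS_COEFFICIENT = 28.19  # change in salary per additional run

def predicted_salary(runs: float) -> float:
    """Fitted salary from the final regression equation."""
    return INTERCEPT + RUNS_COEFFICIENT * runs

# First row of Appendix 1: salary 3300 with 69 runs.
actual = 3300
fitted = predicted_salary(69)
residual = actual - fitted
print(f"fitted salary:  {fitted:.2f}")
print(f"residual:       {residual:.2f}")
```

A residual of this size for a single player gives a first impression of how far individual salaries can lie from the fitted line.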
PREDICTION
The standard error of estimate for the model is 882.04 and the mean of Y is 1286.5.
The coefficient of variation for the regression = standard error / mean of Y = 882.04 / 1286.5,
which is about 68.6% and well above the desired 10%.
The prediction interval of the model will therefore be very wide, so this model
cannot be used for prediction.
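The arithmetic behind this conclusion can be checked directly. The sketch below computes the coefficient of variation from the two figures given above and, as a rough rule of thumb only, an approximate 95% prediction half-width of two standard errors. The exact interval depends on the t quantile and on the runs value, so the half-width is an approximation assumed for this sketch.

```python
STANDARD_ERROR = 882.04  # standard error of estimate of the final model
MEAN_SALARY = 1286.5     # mean of the dependent variable Y
CV_LIMIT = 0.10          # desired upper limit used in the text

coefficient_of_variation = STANDARD_ERROR / MEAN_SALARY
approx_half_width = 2 * STANDARD_ERROR  # rough 95% prediction half-width

print(f"coefficient of variation: {coefficient_of_variation:.1%}")
print(f"approx. prediction half-width: +/- {approx_half_width:.2f}")
print("usable for prediction:", coefficient_of_variation <= CV_LIMIT)
```

The approximate half-width is larger than the mean salary itself, which is why the interval is too wide to be useful.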
CONCLUSIONS
In summary, all variables that were not significant were removed from the model, and the
final regression model obtained is Y = 58.68 + 28.19 Runs. This implies that, among the
variables examined, the salary paid to the players depends only on the runs scored. Since the
coefficient of variation is so high, the model cannot be used for prediction.
REFERENCES
Hoaglin, D., and Velleman, P. (1995), "A Critical Look at Some Analyses of
Major League Baseball Salaries," The American Statistician, 49, 277-285.
APPENDICES
Appendix 1
Salary BattingAvg OBP Runs Hits Triples Homeruns Errors
3300 0.272 0.302 69 153 4 31 3
2600 0.269 0.335 58 111 2 18 3
2500 0.249 0.337 54 115 1 17 5
2475 0.26 0.292 59 128 7 12 21
2313 0.273 0.346 87 169 5 8 8
2175 0.291 0.379 104 170 2 26 4
600 0.258 0.37 34 86 1 14 10
460 0.228 0.279 16 38 2 3 3
240 0.25 0.327 40 61 0 1 2
200 0.203 0.24 39 64 1 10 6
177 0.262 0.283 7 38 0 0 7
140 0.222 0.307 21 45 0 6 3
117 0.227 0.28 4 5 0 1 0
115 0.261 0.37 1 6 0 0 0
2600 0.3 0.368 69 141 3 19 7
1907 0.225 0.292 60 130 1 13 14
1190 0.255 0.321 39 108 8 3 8
990 0.29 0.349 59 141 2 16 6
925 0.246 0.323 22 81 0 6 5
365 0.208 0.265 12 35 1 0 6
302 0.238 0.347 83 134 4 10 27
300 0.267 0.31 73 149 9 6 6
129 0.353 0.435 16 48 2 2 5
111 0.213 0.222 4 13 1 1 0
4450 0.265 0.355 87 130 7 17 10
4125 0.26 0.321 69 150 1 19 7
3213 0.255 0.347 45 71 5 1 3
2319 0.259 0.349 108 146 4 38 31
2000 0.223 0.307 43 84 4 10 4
1600 0.225 0.31 44 96 3 0 15
1394 0.258 0.381 58 108 0 4 5
935 0.275 0.351 40 70 4 4 3
Appendix 2
Descriptive Statistics
Correlations
Salary Battingavg obp runs triples hits homeruns errors
Salary Pearson Correlation 1 .099 .142 .657** .399** .630** .631** .183
N 50 50 50 50 50 50 50 50
Battingavg Pearson Correlation .099 1 .807** .234 .199 .311* .102 .009
N 50 50 50 50 50 50 50 50
obp Pearson Correlation .142 .807** 1 .284* .087 .282* .068 .119
N 50 50 50 50 50 50 50 50
runs Pearson Correlation .657** .234 .284* 1 .533** .936** .749** .506**
N 50 50 50 50 50 50 50 50
triples Pearson Correlation .399** .199 .087 .533** 1 .522** .225 .330*
N 50 50 50 50 50 50 50 50
hits Pearson Correlation .630** .311* .282* .936** .522** 1 .707** .466**
N 50 50 50 50 50 50 50 50
homeruns Pearson Correlation .631** .102 .068 .749** .225 .707** 1 .328*
N 50 50 50 50 50 50 50 50
errors Pearson Correlation .183 .009 .119 .506** .330* .466** .328* 1
N 50 50 50 50 50 50 50 50
Appendix 4
Coefficients table (columns: Unstandardized Coefficients, Standardized Coefficients, Collinearity Statistics)
Appendix 5
Coefficients table (columns: Unstandardized Coefficients, Standardized Coefficients, Collinearity Statistics)
Appendix 6
Coefficients table (columns: Unstandardized Coefficients, Standardized Coefficients, Collinearity Statistics)