
[SID - 309270901] ARUN RAJAN

SYNOPSIS

A multiple regression model is built on data for baseball players to find out whether the
salary paid is based on the performance of the players. The data has seven independent
variables and one dependent variable. Salary was taken as the dependent variable and
the influence of each independent variable was checked. The insignificant independent
variables were removed and the model was rerun in SPSS until the regression equation
Y = 58.68 + 28.19 Runs was obtained, which implies that, among the variables examined,
salary depended only on the runs scored. The model is not fit for prediction, as the
coefficient of variation of the regression is higher than desired.


Table of Contents

INTRODUCTION
SCATTER PLOT ANALYSIS
CORRELATION COEFFICIENT ANALYSIS
ANALYSING THE MODEL USING SPSS
    Running the Model for the First Time
        Analysing the Model
        Examining the Model
        Multicollinearity
    Running the Model for the Second Time
    Running the Model till the Final Result
FINAL REGRESSION MODEL
PREDICTION
CONCLUSIONS
REFERENCES
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5
APPENDIX 6

Introduction


The dataset consists of salary and performance statistics for baseball players. A
multiple regression model is built to find out the influence of performance on the salary
of the players. First the data is checked with scatter plots to see whether it is
appropriate. Then the data is cleaned and the correlation coefficients are examined.
A linear regression model is run to check whether the explanatory variables are
significant in explaining the dependent variable. The residual plots are checked for
heteroskedasticity, and the model is also checked for multicollinearity.
Finally a regression line is established and interpreted.

Scatter Plot Analysis

Scatter plots are used to determine the relationship between two variables and to find
out whether the data needs cleaning. Here they are used to examine the relation between
the dependent variable y and each of the independent variables: whether a relationship
exists, whether it is linear, whether it is positive or negative, whether it is weak or
strong, and whether there are any outliers. Even if no relationship is visible, the
linear regression model can still be built, because scatter plots do not take into
account the effects of the other variables.


CORRELATION COEFFICIENT ANALYSIS

Correlation coefficient analysis is used to find the relation between two variables.
The results are shown in the correlation table (Appendix 2). Two variables have a
strong linear relationship if the correlation coefficient is close to 1 or -1, and the
relationship is statistically significant if the significance value is below 0.05.
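The significance values that SPSS prints next to each correlation can be approximated by hand with the usual t test for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below applies it to the Salary row of the correlation table in Appendix 2. The critical value 2.011 (two-tailed, 5%, 48 degrees of freedom) is taken from a standard t table and is an assumption of this sketch, not part of the SPSS output.

```python
import math

# Pearson correlations of each predictor with Salary, copied from the
# "Salary" row of the Correlations table in Appendix 2.
R_WITH_SALARY = {
    "battingavg": 0.099,
    "obp": 0.142,
    "runs": 0.657,
    "triples": 0.399,
    "hits": 0.630,
    "homeruns": 0.631,
    "errors": 0.183,
}
N = 50  # number of players in the sample

# Two-tailed 5% critical value of Student's t with N - 2 = 48 degrees of
# freedom, read from a standard t table (assumption, not SPSS output).
T_CRIT_48 = 2.011


def t_statistic(r, n):
    """t statistic for testing H0: the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Keep the predictors whose correlation with Salary is significant at 5%.
significant = [
    name for name, r in R_WITH_SALARY.items()
    if abs(t_statistic(r, N)) > T_CRIT_48
]
print(significant)
```

Under these assumptions the list contains runs, triples, hits and homeruns, the same four predictors that carry significance stars in the Salary row of Appendix 2.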

Testing the hypothesis that:

H0: β1 = β2 = ... = βk = 0 (there is no linear relationship between the dependent
variable and the explanatory variables)

H1: at least one βj ≠ 0

Noting that F = 7.12 and p = 0.00, we reject the null hypothesis.

At the 5% significance level, because p < 0.05, we reject the null hypothesis and accept
the alternative hypothesis: at least one β value differs from 0, so there is a
statistically significant linear relationship between the dependent variable and at
least one of the independent variables.
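As a rough cross-check of the F value, the overall F statistic can be rebuilt from R², the sample size n and the number of predictors k as F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below assumes n = 50 and k = 7 for the first model and backs R² out of the adjusted R² of 0.466 reported for that model. The ANOVA table itself is not reproduced in this report, so this is a consistency check under those assumptions rather than a copy of the SPSS output.

```python
def r2_from_adjusted(adj_r2, n, k):
    """Invert adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - adj_r2) * (n - k - 1) / (n - 1)


def f_statistic(r2, n, k):
    """Overall F statistic of a regression with k predictors and n rows."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


n, k = 50, 7      # 50 players, 7 explanatory variables in the first model
adj_r2 = 0.466    # adjusted R^2 reported for the first model

r2 = r2_from_adjusted(adj_r2, n, k)
f_value = f_statistic(r2, n, k)
print(round(r2, 3), round(f_value, 2))
```

The result lands close to the reported F = 7.12; the small gap comes from the adjusted R² being rounded to three decimals.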

Running the Model for the First Time

Analysing the model

A linear regression model is built with salary as the dependent variable and batting
average, on-base percentage, number of runs, number of hits, triples, home runs
and errors as the independent variables. The result is shown in Appendix 4.


Here, interpreting the unstandardized coefficients, each βi gives the average change
in y for a unit change in xi, provided all the other xj remain constant. For example,
each additional run scored is associated with an increase of 4.14 in salary (B = 4.139
in Appendix 4), provided that there is no change in the values of the other variables.
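The "all other variables held constant" reading can be illustrated with the coefficients of the first model in Appendix 4. The sketch below predicts salary for a baseline player and for the same player with one more run. The baseline profile uses the sample means from Appendix 2, which is a choice made here purely for illustration.

```python
# Unstandardized coefficients (column B) of the first model, from Appendix 4.
INTERCEPT = 559.616
COEF = {
    "battingavg": -10523.168,
    "obp": 6847.192,
    "runs": 4.139,
    "hits": 6.098,
    "triples": 119.380,
    "homeruns": 53.452,
    "errors": -39.914,
}

# Baseline profile: the sample means from Appendix 2 (illustrative choice).
MEANS = {
    "battingavg": 0.25750,
    "obp": 0.32682,
    "runs": 43.56,
    "hits": 87.68,
    "triples": 2.40,
    "homeruns": 8.80,
    "errors": 6.84,
}


def predict_salary(player):
    """Predicted salary from the first model for a dict of predictor values."""
    return INTERCEPT + sum(COEF[name] * player[name] for name in COEF)


baseline = predict_salary(MEANS)

# Same player, one more run, every other predictor unchanged.
one_more_run = dict(MEANS, runs=MEANS["runs"] + 1)
effect_of_one_run = predict_salary(one_more_run) - baseline

print(round(baseline, 1), round(effect_of_one_run, 3))
```

The difference between the two predictions equals the runs coefficient, and the prediction at the sample means sits close to the mean salary of 1286.50, as expected for a least-squares fit with a constant.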

Examining the Model

The adjusted R² value of 0.466 tells us that about 47% of the variability of the
dependent variable is explained by the independent variables.
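For reference, the adjusted R² penalises the ordinary R² for the number of predictors. A standard textbook form, with n observations and k explanatory variables (symbols introduced here, not taken from the SPSS output), is:

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
```

Assuming n = 50 and k = 7 for the first model, an adjusted value of 0.466 corresponds to an unadjusted R² of roughly 0.54.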

The VIF values of "runs" and "hits" are both above 5, which indicates
multicollinearity.

Multicollinearity

The rule used here to fix multicollinearity is to remove one of the variables with a VIF
above 5. In this model we can remove either "runs" or "hits"; since runs are expected
to determine salary more than hits do, the variable "hits" is removed and the model is
run again.
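The VIF is the reciprocal of the tolerance, so the VIF screen can be reproduced from the Tolerance column of Appendix 4. The sketch below is a minimal illustration of that rule; the small differences from the printed VIF values arise because the tolerances are rounded to three decimals.

```python
# Tolerance values of the first model, from Appendix 4.
TOLERANCE = {
    "battingavg": 0.257,
    "obp": 0.256,
    "runs": 0.078,
    "hits": 0.104,
    "triples": 0.561,
    "homeruns": 0.350,
    "errors": 0.705,
}
VIF_LIMIT = 5.0  # threshold used in this report


def vif(tolerance):
    """Variance inflation factor: VIF = 1 / tolerance = 1 / (1 - R_j^2)."""
    return 1.0 / tolerance


# Variables whose VIF exceeds the limit are candidates for removal.
flagged = [name for name, tol in TOLERANCE.items() if vif(tol) > VIF_LIMIT]

for name, tol in TOLERANCE.items():
    print(f"{name:<11} VIF = {vif(tol):6.2f}")
print("above limit:", flagged)
```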

Running the Model for the Second Time

The results for the new model are shown in Appendix 5. The multicollinearity has been
removed from the model, since the VIF values of all variables are below 5.
Subsequently, the residual plots are analysed to examine whether there is a
non-linear relationship, heteroskedasticity or autocorrelation.

Running the Model till the Final Result

The Sig. values for batting average, obp, errors, triples and runs are all above 0.05.
We therefore remove the variable with the largest Sig. value above 0.05 and run the
model again. The process is repeated until every remaining variable has a Sig. value
below 0.05. The final model is given in Appendix 6.
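One step of this elimination rule can be written down directly. The sketch below picks the next variable to drop from a table of Sig. values; `next_to_drop` is a hypothetical helper name introduced here. Only the Sig. columns of Appendix 5 and Appendix 6 are used, since the intermediate models are not reproduced in this report, and the constant is left out because it is not a candidate for removal.

```python
def next_to_drop(sig_values, alpha=0.05):
    """Return the variable with the largest Sig. value above alpha, or None."""
    candidates = {name: p for name, p in sig_values.items() if p > alpha}
    if not candidates:
        return None  # every remaining variable is significant: stop
    return max(candidates, key=candidates.get)


# Sig. values of the second model (Appendix 5), constant excluded.
SIG_SECOND_MODEL = {
    "battingavg": 0.201,
    "obp": 0.256,
    "triples": 0.099,
    "homeruns": 0.022,
    "errors": 0.083,
    "runs": 0.148,
}

# Sig. value of the final model (Appendix 6), constant excluded.
SIG_FINAL_MODEL = {"runs": 0.000}

print(next_to_drop(SIG_SECOND_MODEL))
print(next_to_drop(SIG_FINAL_MODEL))
```

Applied to Appendix 5, the rule selects obp as the first variable to remove; applied to Appendix 6, it returns `None`, which is the stopping condition. In practice the model has to be refitted after each removal, because the Sig. values of the remaining variables change.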


The final model shows no problems: the significance value in the ANOVA table is 0.000
and all VIFs are less than 5.

FINAL REGRESSION MODEL

After removing all the explanatory variables that caused problems in the model, the
final regression model was created. The equation for the best regression model is:

Y = 58.68 + 28.19 Runs

Each additional run scored is thus associated with an increase of 28.19 in salary.
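The final equation can be checked against the raw data with an ordinary least-squares fit of Salary on Runs. The sketch below uses only the Salary and Runs columns of Appendix 1 and the textbook formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). It is a plain-Python illustration, not the SPSS procedure itself.

```python
# Salary and Runs columns from Appendix 1, in the order listed there.
SALARY = [
    3300, 2600, 2500, 2475, 2313, 2175, 600, 460, 240, 200,
    177, 140, 117, 115, 2600, 1907, 1190, 990, 925, 365,
    302, 300, 129, 111, 4450, 4125, 3213, 2319, 2000, 1600,
    1394, 935, 850, 775, 760, 629, 275, 120, 2567, 2500,
    2350, 2317, 2000, 715, 660, 650, 260, 250, 200, 180,
]
RUNS = [
    69, 58, 54, 59, 87, 104, 34, 16, 40, 39,
    7, 21, 4, 1, 69, 60, 39, 59, 22, 12,
    83, 73, 16, 4, 87, 69, 45, 108, 43, 44,
    58, 40, 60, 18, 33, 32, 51, 7, 36, 66,
    84, 48, 46, 21, 38, 20, 13, 12, 51, 18,
]


def fit_line(x, y):
    """Least-squares intercept and slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(RUNS, SALARY)
print(f"Y = {intercept:.2f} + {slope:.2f} Runs")  # Y = 58.68 + 28.19 Runs
```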

PREDICTION

The standard error of the estimate for the model is 882.04 and the mean of Y is 1286.5.
The coefficient of variation of the regression = standard error / mean of Y =
882.04 / 1286.5 ≈ 0.69 (69%), which is far greater than 10%.
The prediction interval of the model will therefore be very wide, so this model
cannot be used for prediction.
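To see how wide such an interval is, the sketch below computes an approximate 95% prediction interval for a hypothetical player with 50 runs, using the usual simple-regression formula y_hat ± t * s_e * sqrt(1 + 1/n + (x0 - mean_x)² / Sxx). The inputs are the figures reported in Appendix 2 and Appendix 6 and the standard error quoted above. The critical value 2.011 comes from a standard t table, and the 50-run player is made up for illustration.

```python
import math

N = 50                               # number of players
MEAN_RUNS = 43.56                    # Appendix 2
SD_RUNS = 26.977                     # Appendix 2
STD_ERROR = 882.04                   # standard error of the estimate
INTERCEPT, SLOPE = 58.677, 28.187    # Appendix 6
T_CRIT_48 = 2.011                    # t table, two-tailed 5%, 48 df (assumption)


def prediction_interval(runs):
    """Approximate 95% prediction interval for the salary of one player."""
    sxx = (N - 1) * SD_RUNS ** 2     # sum of squared deviations of runs
    spread = math.sqrt(1.0 + 1.0 / N + (runs - MEAN_RUNS) ** 2 / sxx)
    centre = INTERCEPT + SLOPE * runs
    half_width = T_CRIT_48 * STD_ERROR * spread
    return centre - half_width, centre + half_width


low, high = prediction_interval(50)  # hypothetical player with 50 runs
print(round(low), round(high))
```

Under these assumptions the interval is roughly 3,600 salary units wide and its lower end is negative, which illustrates why the model is of little use for predicting an individual salary.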

CONCLUSIONS

In summary, we removed all variables that were not significant to the regression model,
and the final regression model obtained is Y = 58.68 + 28.19 Runs. This implies that,
among the variables examined, the salary paid to the players depends only on the runs
scored during the games. Since the coefficient of variation is so high, the model
cannot be used for prediction.


REFERENCES

Denby, L. (1988), Dataset from Poster Session sponsored by the Section on
Statistical Graphics of the American Statistical Association, on Statlib, ed.
Michael Myers. (http://stat.lib.cmu.edu/datasets)

Hoaglin, D., and Velleman, P. (1995), "A Critical Look at Some Analyses of
Major League Baseball Salaries," The American Statistician, 49, 277-285.


APPENDICES

Appendix 1

Salary  BattingAvg  OBP    Runs  Hits  Triples  Homeruns  Errors
3300    0.272       0.302  69    153   4        31        3
2600    0.269       0.335  58    111   2        18        3
2500    0.249       0.337  54    115   1        17        5
2475    0.260       0.292  59    128   7        12        21
2313    0.273       0.346  87    169   5        8         8
2175    0.291       0.379  104   170   2        26        4
600     0.258       0.370  34    86    1        14        10
460     0.228       0.279  16    38    2        3         3
240     0.250       0.327  40    61    0        1         2
200     0.203       0.240  39    64    1        10        6
177     0.262       0.283  7     38    0        0         7
140     0.222       0.307  21    45    0        6         3
117     0.227       0.280  4     5     0        1         0
115     0.261       0.370  1     6     0        0         0
2600    0.300       0.368  69    141   3        19        7
1907    0.225       0.292  60    130   1        13        14
1190    0.255       0.321  39    108   8        3         8
990     0.290       0.349  59    141   2        16        6
925     0.246       0.323  22    81    0        6         5
365     0.208       0.265  12    35    1        0         6
302     0.238       0.347  83    134   4        10        27
300     0.267       0.310  73    149   9        6         6
129     0.353       0.435  16    48    2        2         5
111     0.213       0.222  4     13    1        1         0
4450    0.265       0.355  87    130   7        17        10
4125    0.260       0.321  69    150   1        19        7
3213    0.255       0.347  45    71    5        1         3
2319    0.259       0.349  108   146   4        38        31
2000    0.223       0.307  43    84    4        10        4
1600    0.225       0.310  44    96    3        0         15
1394    0.258       0.381  58    108   0        4         5
935     0.275       0.351  40    70    4        4         3
850     0.327       0.424  60    141   3        0         20
775     0.272       0.300  18    62    2        5         3
760     0.241       0.320  33    84    2        6         14
629     0.293       0.355  32    79    0        1         0
275     0.257       0.315  51    96    1        6         8
120     0.225       0.330  7     20    0        0         2
2567    0.196       0.297  36    56    0        12        8
2500    0.252       0.309  66    137   1        28        5
2350    0.294       0.367  84    158   6        21        3
2317    0.297       0.391  48    73    5        3         4
2000    0.258       0.288  46    86    4        12        9
715     0.205       0.268  21    36    3        1         3
660     0.272       0.304  38    82    3        9         9
650     0.243       0.344  20    45    0        0         4
260     0.337       0.413  13    32    0        0         0
250     0.228       0.238  12    36    1        1         2
200     0.240       0.300  51    92    3        13        3
180     0.298       0.378  18    45    2        6         8

Appendix 2
Descriptive Statistics

Variable    Mean      Std. Deviation  N
Salary      1286.50   1157.728        50
Battingavg  .25750    .033580         50
obp         .32682    .045682         50
runs        43.56     26.977          50
triples     2.40      2.295           50
hits        87.68     46.295          50
homeruns    8.80      9.076           50
errors      6.84      6.504           50


Correlations

                                 Salary  Battingavg  obp     runs    triples  hits    homeruns  errors
Salary      Pearson Correlation  1       .099        .142    .657**  .399**   .630**  .631**    .183
            Sig. (2-tailed)              .493        .325    .000    .004     .000    .000      .203
Battingavg  Pearson Correlation  .099    1           .807**  .234    .199     .311*   .102      .009
            Sig. (2-tailed)      .493                .000    .102    .166     .028    .481      .953
obp         Pearson Correlation  .142    .807**      1       .284*   .087     .282*   .068      .119
            Sig. (2-tailed)      .325    .000                .045    .547     .047    .640      .410
runs        Pearson Correlation  .657**  .234        .284*   1       .533**   .936**  .749**    .506**
            Sig. (2-tailed)      .000    .102        .045            .000     .000    .000      .000
triples     Pearson Correlation  .399**  .199        .087    .533**  1        .522**  .225      .330*
            Sig. (2-tailed)      .004    .166        .547    .000             .000    .116      .019
hits        Pearson Correlation  .630**  .311*       .282*   .936**  .522**   1       .707**    .466**
            Sig. (2-tailed)      .000    .028        .047    .000    .000             .000      .001
homeruns    Pearson Correlation  .631**  .102        .068    .749**  .225     .707**  1         .328*
            Sig. (2-tailed)      .000    .481        .640    .000    .116     .000              .020
errors      Pearson Correlation  .183    .009        .119    .506**  .330*    .466**  .328*     1
            Sig. (2-tailed)      .203    .953        .410    .000    .019     .001    .020

N = 50 for every pair of variables.

**. Correlation is significant at the 0.01 level (2-tailed).

*. Correlation is significant at the 0.05 level (2-tailed).

Appendix 4


Coefficients (a)

Model 1     B           Std. Error  Beta   t       Sig.  Tolerance  VIF
(Constant)  559.616     974.900            .574    .569
Battingavg  -10523.168  7101.302    -.305  -1.482  .146  .257       3.897
obp         6847.192    5223.871    .270   1.311   .197  .256       3.902
runs        4.139       16.063      .096   .258    .798  .078       12.867
hits        6.098       8.074       .244   .755    .454  .104       9.573
triples     119.380     70.313      .237   1.698   .097  .561       1.784
homeruns    53.452      22.508      .419   2.375   .022  .350       2.859
errors      -39.914     22.117      -.224  -1.805  .078  .705       1.418

B and Std. Error are the unstandardized coefficients, Beta is the standardized
coefficient, and Tolerance and VIF are the collinearity statistics.

a. Dependent Variable: Salary

Appendix 5

Coefficients (a)

Model 1     B           Std. Error  Beta   t       Sig.  Tolerance  VIF
(Constant)  509.935     967.807            .527    .601
Battingavg  -8517.687   6553.196    -.247  -1.300  .201  .298       3.352
obp         5752.682    4993.706    .227   1.152   .256  .278       3.602
triples     117.809     69.930      .233   1.685   .099  .561       1.782
homeruns    53.155      22.392      .417   2.374   .022  .350       2.858
errors      -39.046     21.977      -.219  -1.777  .083  .707       1.414
runs        13.920      9.455       .324   1.472   .148  .222       4.503

B and Std. Error are the unstandardized coefficients, Beta is the standardized
coefficient, and Tolerance and VIF are the collinearity statistics.

a. Dependent Variable: Salary

Appendix 6


Coefficients (a)

Model 1     B           Std. Error  Beta   t       Sig.  Tolerance  VIF
(Constant)  58.677      238.658            .246    .807
runs        28.187      4.671       .657   6.035   .000  1.000      1.000

B and Std. Error are the unstandardized coefficients, Beta is the standardized
coefficient, and Tolerance and VIF are the collinearity statistics.

a. Dependent Variable: Salary
