Regression Using SPSS

[SID - 309270901] ARUN RAJAN
SYNOPSIS
A multiple regression model is built on data for baseball players to find out whether the
salary paid depends on the performance of the players. The data has seven independent
variables and one dependent variable. Salary was taken as the dependent variable and
the influence of each independent variable was checked. The insignificant
independent variables were removed and the model was rerun in SPSS until reaching
the regression equation Y = 58.68 + 28.19 Runs, which implies that, in the final model,
salary depends only on the runs scored. The model is not fit for prediction because the
coefficient of variation of the regression is higher than desired.
Table of Contents
INTRODUCTION
SCATTER PLOT ANALYSIS
CORRELATION COEFFICIENT ANALYSIS
ANALYSING THE MODEL USING SPSS
Running the Model for the First Time
Analysing the Model
Examining the Model
Multicollinearity
Running the Model for the Second Time
Running the Model till the Final Result
FINAL REGRESSION MODEL
PREDICTION
CONCLUSION
REFERENCE
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5
APPENDIX 6
Introduction
The dataset consists of salary and performance statistics for baseball players. A
multiple regression model is built to find out the influence of performance on the salary
of the players. First, scatter plots are used to check whether the data is appropriate.
Then the data is cleaned and the correlation coefficients are computed.
A linear regression model is run to check whether the explanatory
variables are significant in explaining the dependent variable. The residual plots
are checked for heteroskedasticity, and the model is checked for multicollinearity.
Finally, a regression equation is established and interpreted.
Scatter Plot Analysis
Scatter plots are used to examine the relationship between two variables and also to
find out whether the data needs cleaning. Here they are used to examine the relationship
between the dependent variable y and each of the independent variables: whether there is a
relationship at all, whether it is linear, whether it is positive or negative, whether it is
weak or strong, and whether there are any outliers. Even if no relationship is visible, the
linear regression model can still be built, because a scatter plot does not take into account
the effects of the other variables.
CORRELATION COEFFICIENT ANALYSIS
Correlation coefficient analysis is used to measure the linear relationship between two variables.
The results are shown in the correlation table (Appendix 2). Two variables have a
strong linear relationship if the correlation coefficient is close to 1 or -1, and the
relationship is statistically significant if the significance value is below .05. For example,
salary and runs have a correlation of .657 with a significance value of .000.
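As a rough check on the correlation table, the salary and runs correlation can be recomputed by hand. The sketch below uses only the Python standard library and the Salary and Runs columns transcribed from Appendix 1; the function name `pearson_r` is an illustrative choice, not something taken from SPSS. The result should land close to the .657 reported in Appendix 2.

```python
import math

# Salary and Runs columns transcribed from Appendix 1 (50 players, same row order).
salary = [3300, 2600, 2500, 2475, 2313, 2175, 600, 460, 240, 200,
          177, 140, 117, 115, 2600, 1907, 1190, 990, 925, 365,
          302, 300, 129, 111, 4450, 4125, 3213, 2319, 2000, 1600,
          1394, 935, 850, 775, 760, 629, 275, 120, 2567, 2500,
          2350, 2317, 2000, 715, 660, 650, 260, 250, 200, 180]
runs = [69, 58, 54, 59, 87, 104, 34, 16, 40, 39,
        7, 21, 4, 1, 69, 60, 39, 59, 22, 12,
        83, 73, 16, 4, 87, 69, 45, 108, 43, 44,
        58, 40, 60, 18, 33, 32, 51, 7, 36, 66,
        84, 48, 46, 21, 38, 20, 13, 12, 51, 18]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(runs, salary)
# t statistic for H0: no linear correlation, with n - 2 degrees of freedom.
t = r * math.sqrt(len(runs) - 2) / math.sqrt(1 - r ** 2)
print(f"r = {r:.3f}, t = {t:.2f}")
```

A correlation near 1 or -1 gives a large absolute t value, which is what drives the small significance values in the table.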
Testing the hypothesis that:
H0: β1 = β2 = ... = βi = 0 (there is no linear relationship between the dependent
variable and the explanatory variables)
H1: at least one βj ≠ 0
Noting that F = 7.12 and p = 0.00, we reject the null hypothesis.
At the 5% significance level, p < 0.05, so the null hypothesis is rejected: there is
evidence of a linear relationship between the dependent variable and the explanatory
variables taken together. We therefore accept the alternative
hypothesis, which means that at least one β value differs from 0.
Running the Model for the First Time
Analysing the model
A linear regression model is built with salary as the dependent variable and batting
average, on-base percentage, number of runs, number of hits, triples, home runs
and errors as the independent variables. The results are shown in Appendix 4.
Each unstandardized coefficient βi gives the average change
in y for a unit change in xi, provided all the other xj remain constant. For example, each
additional run scored is associated with an average increase of $4.14 in salary
(B = 4.139 in Appendix 4), provided that the values of the other explanatory variables do not change.
Examining the Model
The adjusted R² value of 0.466 tells us that about 47% of the variability of the dependent
variable is explained by the independent variables, after adjusting for the number of predictors.
The VIF values of “runs” (12.867) and “hits” (9.573) are both above 5, which indicates
multicollinearity.
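The adjusted R² and the F statistic are linked, so one can be used as a sanity check on the other. The sketch below assumes that both the adjusted R² of 0.466 and F = 7.12 refer to the first model, with n = 50 players and k = 7 explanatory variables; that pairing is an assumption, since the report does not state it explicitly.

```python
n = 50          # players in Appendix 1
k = 7           # explanatory variables in the first model (Appendix 4)
adj_r2 = 0.466  # adjusted R^2 reported for the first model

# Undo the adjustment: adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
r2 = 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)

# Overall F statistic for H0: all slope coefficients are zero.
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

print(f"R^2 = {r2:.3f}, F = {f_stat:.2f}")
```

Under these assumptions the implied F is about 7.11, which agrees with the reported 7.12 up to rounding of the adjusted R².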
Multicollinearity
The rule used here to fix multicollinearity is to remove one of the variables that has a VIF
above 5. In this model either “runs” or “hits” could be removed. Since
runs are expected to determine salary more than hits, the variable “hits” is
removed and the model is run again.
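The VIF is the reciprocal of the tolerance, so the VIF column can be reproduced from the Tolerance column of Appendix 4. The sketch below applies the report's VIF > 5 rule to those tolerances; because the tolerances are rounded to three decimals, the recomputed VIFs differ slightly from the SPSS values (for example about 12.8 for runs against the reported 12.867).

```python
# Tolerance column of the first model, transcribed from Appendix 4.
tolerance = {
    "Battingavg": 0.257,
    "obp": 0.256,
    "runs": 0.078,
    "hits": 0.104,
    "triples": 0.561,
    "homeruns": 0.350,
    "errors": 0.705,
}

VIF_LIMIT = 5  # threshold used in this report

# VIF_j = 1 / tolerance_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors.
vif = {name: 1 / tol for name, tol in tolerance.items()}
flagged = [name for name, value in vif.items() if value > VIF_LIMIT]

for name, value in vif.items():
    print(f"{name:<11} VIF = {value:6.3f}")
print("Above the limit:", flagged)
```

Only runs and hits exceed the limit, which matches their pairwise correlation of .936 in the correlation table.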
Running the Model for the Second Time
The results for the new model are shown in Appendix 5. The multicollinearity has been
removed from the model, since the VIF values of all variables
are below 5. Subsequently, the residual plots are analysed to check for a
non-linear relationship, heteroskedasticity or autocorrelation.
Running the Model till the Final result
The Sig. values for batting average, obp, errors, triples and runs are all above 0.05. We
remove the variable with the largest Sig. value above 0.05, run the
model again, and repeat the process until every remaining variable
has a Sig. value below 0.05. The final model is given in
Appendix 6.
In the final model the significance value in the ANOVA table is 0.000, so the model as a
whole is significant, and the VIFs are all less than 5.
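The first step of this elimination can be illustrated with the Sig. column of Appendix 5. The sketch below only picks the next variable to drop; it does not refit the model, so the later steps (whose intermediate Sig. values are not reported here) are not reproduced. The helper name `next_to_remove` is hypothetical.

```python
ALPHA = 0.05  # significance level used in this report

# Sig. column of the second model, transcribed from Appendix 5.
sig = {
    "Battingavg": 0.201,
    "obp": 0.256,
    "triples": 0.099,
    "homeruns": 0.022,
    "errors": 0.083,
    "runs": 0.148,
}


def next_to_remove(sig_values, alpha=ALPHA):
    """Return the variable with the largest Sig. value above alpha, or None."""
    name, value = max(sig_values.items(), key=lambda item: item[1])
    return name if value > alpha else None


candidate = next_to_remove(sig)
print("Remove next:", candidate)
```

After each removal the model has to be rerun in SPSS, because the Sig. values of the remaining variables change once a predictor is dropped.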
FINAL REGRESSION MODEL
After removing all the explanatory variables that caused problems in the model, the
final regression model was created. The equation for the best regression model is:
Y = 58.68 + 28.19 Runs
Each additional run scored is thus associated with an average increase of 28.19 in salary.
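The final equation can be cross-checked with an ordinary least squares fit of salary on runs. The sketch below uses only the Python standard library and the Salary and Runs columns transcribed from Appendix 1; `fit_line` is an illustrative helper, and the 50-run player is a made-up input used only to show how the equation is applied. The fitted values should land close to the 58.68 and 28.19 reported by SPSS.

```python
# Salary and Runs columns transcribed from Appendix 1 (50 players, same row order).
salary = [3300, 2600, 2500, 2475, 2313, 2175, 600, 460, 240, 200,
          177, 140, 117, 115, 2600, 1907, 1190, 990, 925, 365,
          302, 300, 129, 111, 4450, 4125, 3213, 2319, 2000, 1600,
          1394, 935, 850, 775, 760, 629, 275, 120, 2567, 2500,
          2350, 2317, 2000, 715, 660, 650, 260, 250, 200, 180]
runs = [69, 58, 54, 59, 87, 104, 34, 16, 40, 39,
        7, 21, 4, 1, 69, 60, 39, 59, 22, 12,
        83, 73, 16, 4, 87, 69, 45, 108, 43, 44,
        58, 40, 60, 18, 33, 32, 51, 7, 36, 66,
        84, 48, 46, 21, 38, 20, 13, 12, 51, 18]


def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(runs, salary)
print(f"Y = {intercept:.2f} + {slope:.2f} Runs")

# Fitted salary for a hypothetical player with 50 runs.
predicted = intercept + slope * 50
print(f"Fitted salary at 50 runs: {predicted:.1f}")
```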
PREDICTION
The standard error of the estimate for the model is 882.04 and the mean of Y is 1286.5.
The coefficient of variation for the regression = standard error / mean of Y
= 882.04 / 1286.5, or about 69%, which is far greater than 10%.
The prediction intervals of the model will therefore be very wide, so this model
cannot be used for prediction.
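The arithmetic behind this conclusion is short enough to spell out. The sketch below uses the two figures quoted in this section and the 10% rule of thumb applied in the report; the variable names are illustrative.

```python
std_error_of_estimate = 882.04  # from the final model
mean_salary = 1286.5            # mean of Y (Appendix 2)
CV_LIMIT = 0.10                 # rule of thumb used in this report

# Coefficient of variation of the regression: typical prediction error
# expressed as a fraction of the average salary.
cv = std_error_of_estimate / mean_salary
usable_for_prediction = cv <= CV_LIMIT

print(f"Coefficient of variation = {cv:.1%}")
print("Usable for prediction:", usable_for_prediction)
```

A typical error of roughly two thirds of the mean salary is what makes the prediction intervals too wide to be useful.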
CONCLUSIONS
In summary, we removed all variables that were not significant in the regression model, and the
final regression model obtained is Y = 58.68 + 28.19 Runs. This implies that, within this
model, the salary paid to the players depends solely on the runs scored. Since the
coefficient of variation is so high, the model cannot be used for prediction.
REFERENCES
Denby, L. (1988), Dataset from Poster Session sponsored by the Section on
Statistical Graphics of the American Statistical Association, on Statlib, ed.
Michael Myers. (http://stat.lib.cmu.edu/datasets)
Hoaglin, D., and Velleman, P. (1995), "A Critical Look at Some Analyses of
Major League Baseball Salaries," The American Statistician, 49, 277-285.
APPENDICES
Appendix 1
Salary BattingAvg OBP Runs Hits Triples Homeruns Errors
3300 0.272 0.302 69 153 4 31 3
2600 0.269 0.335 58 111 2 18 3
2500 0.249 0.337 54 115 1 17 5
2475 0.26 0.292 59 128 7 12 21
2313 0.273 0.346 87 169 5 8 8
2175 0.291 0.379 104 170 2 26 4
600 0.258 0.37 34 86 1 14 10
460 0.228 0.279 16 38 2 3 3
240 0.25 0.327 40 61 0 1 2
200 0.203 0.24 39 64 1 10 6
177 0.262 0.283 7 38 0 0 7
140 0.222 0.307 21 45 0 6 3
117 0.227 0.28 4 5 0 1 0
115 0.261 0.37 1 6 0 0 0
2600 0.3 0.368 69 141 3 19 7
1907 0.225 0.292 60 130 1 13 14
1190 0.255 0.321 39 108 8 3 8
990 0.29 0.349 59 141 2 16 6
925 0.246 0.323 22 81 0 6 5
365 0.208 0.265 12 35 1 0 6
302 0.238 0.347 83 134 4 10 27
300 0.267 0.31 73 149 9 6 6
129 0.353 0.435 16 48 2 2 5
111 0.213 0.222 4 13 1 1 0
4450 0.265 0.355 87 130 7 17 10
4125 0.26 0.321 69 150 1 19 7
3213 0.255 0.347 45 71 5 1 3
2319 0.259 0.349 108 146 4 38 31
2000 0.223 0.307 43 84 4 10 4
1600 0.225 0.31 44 96 3 0 15
1394 0.258 0.381 58 108 0 4 5
935 0.275 0.351 40 70 4 4 3
850 0.327 0.424 60 141 3 0 20
775 0.272 0.3 18 62 2 5 3
760 0.241 0.32 33 84 2 6 14
629 0.293 0.355 32 79 0 1 0
275 0.257 0.315 51 96 1 6 8
120 0.225 0.33 7 20 0 0 2
2567 0.196 0.297 36 56 0 12 8
2500 0.252 0.309 66 137 1 28 5
2350 0.294 0.367 84 158 6 21 3
2317 0.297 0.391 48 73 5 3 4
2000 0.258 0.288 46 86 4 12 9
715 0.205 0.268 21 36 3 1 3
660 0.272 0.304 38 82 3 9 9
650 0.243 0.344 20 45 0 0 4
260 0.337 0.413 13 32 0 0 0
250 0.228 0.238 12 36 1 1 2
200 0.24 0.3 51 92 3 13 3
180 0.298 0.378 18 45 2 6 8
Appendix 2
Descriptive Statistics
Mean Std. Deviation N
Salary 1286.50 1157.728 50
Battingavg .25750 .033580 50
obp .32682 .045682 50
runs 43.56 26.977 50
triples 2.40 2.295 50
hits 87.68 46.295 50
homeruns 8.80 9.076 50
errors 6.84 6.504 50
Correlations
Salary Battingavg obp runs triples hits homeruns errors
Salary Pearson Correlation 1 .099 .142 .657** .399** .630** .631** .183
Sig. (2-tailed) .493 .325 .000 .004 .000 .000 .203
N 50 50 50 50 50 50 50 50
Battingavg Pearson Correlation .099 1 .807** .234 .199 .311* .102 .009
Sig. (2-tailed) .493 .000 .102 .166 .028 .481 .953
N 50 50 50 50 50 50 50 50
obp Pearson Correlation .142 .807** 1 .284* .087 .282* .068 .119
Sig. (2-tailed) .325 .000 .045 .547 .047 .640 .410
N 50 50 50 50 50 50 50 50
runs Pearson Correlation .657** .234 .284* 1 .533** .936** .749** .506**
Sig. (2-tailed) .000 .102 .045 .000 .000 .000 .000
N 50 50 50 50 50 50 50 50
triples Pearson Correlation .399** .199 .087 .533** 1 .522** .225 .330*
Sig. (2-tailed) .004 .166 .547 .000 .000 .116 .019
N 50 50 50 50 50 50 50 50
hits Pearson Correlation .630** .311* .282* .936** .522** 1 .707** .466**
Sig. (2-tailed) .000 .028 .047 .000 .000 .000 .001
N 50 50 50 50 50 50 50 50
homeruns Pearson Correlation .631** .102 .068 .749** .225 .707** 1 .328*
Sig. (2-tailed) .000 .481 .640 .000 .116 .000 .020
N 50 50 50 50 50 50 50 50
errors Pearson Correlation .183 .009 .119 .506** .330* .466** .328* 1
Sig. (2-tailed) .203 .953 .410 .000 .019 .001 .020
N 50 50 50 50 50 50 50 50
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Appendix 4
Coefficients(a)
Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); Collinearity Statistics (Tolerance, VIF)
Model B Std. Error Beta t Sig. Tolerance VIF
1 (Constant) 559.616 974.900 .574 .569
Battingavg -10523.168 7101.302 -.305 -1.482 .146 .257 3.897
obp 6847.192 5223.871 .270 1.311 .197 .256 3.902
runs 4.139 16.063 .096 .258 .798 .078 12.867
hits 6.098 8.074 .244 .755 .454 .104 9.573
triples 119.380 70.313 .237 1.698 .097 .561 1.784
homeruns 53.452 22.508 .419 2.375 .022 .350 2.859
errors -39.914 22.117 -.224 -1.805 .078 .705 1.418
a. Dependent Variable: Salary
Appendix 5
Coefficients(a)
Model B Std. Error Beta t Sig. Tolerance VIF
1 (Constant) 509.935 967.807 .527 .601
Battingavg -8517.687 6553.196 -.247 -1.300 .201 .298 3.352
obp 5752.682 4993.706 .227 1.152 .256 .278 3.602
triples 117.809 69.930 .233 1.685 .099 .561 1.782
homeruns 53.155 22.392 .417 2.374 .022 .350 2.858
errors -39.046 21.977 -.219 -1.777 .083 .707 1.414
runs 13.920 9.455 .324 1.472 .148 .222 4.503
a. Dependent Variable: Salary
Appendix 6
Coefficients(a)
Model B Std. Error Beta t Sig. Tolerance VIF
1 (Constant) 58.677 238.658 .246 .807
runs 28.187 4.671 .657 6.035 .000 1.000 1.000
a. Dependent Variable: Salary
