Excel provides useful statistical tools to carry out regression analysis, but you first have to
check that the Data Analysis add-in is enabled. (Click on Data.)
1- Click File, then Options, and select Add-Ins.
2- In the Manage box (on the right-hand side), check that Excel Add-ins is selected and click on GO.
3- Tick Analysis ToolPak and click OK.
NB: if you have a different version of Excel, the steps are quite similar: search for the
add-ins and select Analysis ToolPak; related tutorial videos may also be helpful.
If Data Analysis is enabled, it will appear within the Data tab. For regression analysis under
Excel, you need the following tools within the Data Analysis add-in:
- Correlation
- Regression
II. Main steps of regression analysis:
The dependent variable Y is the variable you want to explain/predict; it is also called the
explained variable. Examples: total cost, overhead cost.
The independent variables X are the potential cost drivers of the dependent variable and
are also called explanatory variables. Regression analysis aims to find the independent
variable(s) which explain(s) the dependent variable. In other words, the best estimators of
the explained variable Y are those cost drivers which have the strongest correlation with it.
Potential cost drivers could be, for instance, direct labor hours, physical output produced,
or number of machine hours.
NB: The objective of regression analysis is to find the factors which best explain and help
to predict the item of cost which constitutes the dependent variable. In other words, we look
for the cost drivers whose change explains the change in the cost. That is why it is essential
to adjust cost data for inflation. Inflation causes monetary amounts to change, but that
change is due to the variation of the price index and is not caused by cost drivers.
Therefore, in order to isolate the change in the cost which may be due to cost drivers, data
expressed in monetary units must be adjusted for inflation before regression analysis is
undertaken.
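The inflation adjustment described above can be sketched in Python. The cost figures, the price index, and the base period below are hypothetical, invented purely for illustration:

```python
# Hypothetical example: deflating nominal monthly costs with a price index
# (the data and the base period are made up for illustration).
nominal_costs = [10_000, 10_500, 11_200]   # monetary units, months 1-3
price_index   = [100.0, 104.0, 108.0]      # base-period index = 100

# Real (inflation-adjusted) cost = nominal cost / (index / base index)
base = price_index[0]
real_costs = [c / (i / base) for c, i in zip(nominal_costs, price_index)]
print([round(c, 2) for c in real_costs])   # [10000.0, 10096.15, 10370.37]
```

Once costs are expressed in constant (base-period) monetary units, any remaining variation can be attributed to candidate cost drivers rather than to prices.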
The correlation matrix aims to determine the correlation coefficients between any pair
of variables. All the variables are included in the matrix: dependent and independent
variables.
The coefficient of correlation between the dependent variable and an independent variable
measures the strength of their association: the higher it is (in absolute value), the better
that X variable is as a candidate cost driver.
The study of the correlation coefficients between independent variables aims to detect
multicollinearity, which matters for multiple regression. As a matter of fact,
multicollinearity exists when the correlation coefficient between two X (independent)
variables is high (roughly 70% or more). Such a high correlation between two independent
variables means that, if they are included in the same (multiple) regression equation, it is
impossible to separate their impact on the cost.
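The correlation matrix and the multicollinearity check can be sketched in Python. The series Y, X1, and X2 below are hypothetical data, not taken from the text:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: Y = total cost, X1 = direct labor hours, X2 = machine hours
Y  = [120, 150, 170, 200, 230]
X1 = [10, 13, 15, 18, 21]
X2 = [5, 6, 8, 9, 11]

# Full correlation matrix: dependent and independent variables together
series = {"Y": Y, "X1": X1, "X2": X2}
for name_a, a in series.items():
    for name_b, b in series.items():
        print(f"r({name_a}, {name_b}) = {pearson(a, b):.3f}")

# Flag multicollinearity between the independent variables (70% rule of thumb)
if abs(pearson(X1, X2)) >= 0.70:
    print("Warning: X1 and X2 are highly correlated (multicollinearity).")
```

Excel's Correlation tool in the Data Analysis add-in produces the same matrix directly from a range of columns.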
For each independent variable X, write down the simple regression equation:
Y = a + bX
The analysis of the simple regression involves two steps: the study of the overall
significance of the model and the study of the specific significance of the coefficients of the
model.
R square is expressed as a percentage; it measures the percentage of the change in the
dependent variable which is explained by the change in the independent variable. Its
maximum value is 1 (100%) and its minimum value is 0. If R square is equal to 80%, for
instance, then the change in X explains 80% of the change in Y; the remaining 20% pertains
to omitted variables and to the error term.
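The quantities above (intercept, slope, and R square) can be computed by hand, as a minimal sketch of what Excel's Regression tool reports. The data below are hypothetical and happen to be perfectly linear, so R square comes out at 1:

```python
# Minimal OLS sketch: fit Y = a + bX and compute R square by hand.
def simple_ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: covariation of X and Y over the variation of X
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    # R square = 1 - residual variation / total variation
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, 1 - ss_res / ss_tot

X = [10, 13, 15, 18, 21]          # e.g. direct labor hours (hypothetical)
Y = [120, 150, 170, 200, 230]     # e.g. total cost (hypothetical)
a, b, r2 = simple_ols(X, Y)
print(f"Y = {a:.2f} + {b:.2f}X, R square = {r2:.3f}")  # Y = 20.00 + 10.00X, R square = 1.000
```

With real data the points scatter around the fitted line, so R square falls below 1 and the remainder is attributed to omitted variables and the error term.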
The Fisher statistic: the F statistic compares the variability of Y explained by the model
with the variability left in the error term. Rule of thumb: F must be higher than 2 to say
that R square is high.
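As a sketch, the F statistic can be derived from R square, the number of observations n, and the number of regressors k. The values below (R square of 80%, n = 12) are illustrative, not taken from the text:

```python
# Sketch: F statistic from R square for a regression with k regressors
# and n observations (k = 1 for a simple regression).
def f_statistic(r2, n, k=1):
    # Explained variation per regressor over residual variation per degree of freedom
    return (r2 / k) / ((1 - r2) / (n - k - 1))

print(f_statistic(0.80, n=12))  # 40.0
```

A high R square on a reasonable sample size yields an F far above the rule-of-thumb threshold of 2, which is why the two criteria usually agree.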
The p-value approach: we set out a null hypothesis that there is no association between X
and Y (i.e., the same results could be observed by chance). If the p-value is less than 5%,
we reject the null hypothesis: an overall significant association between X and Y therefore
exists.
Thus:
- If R square is low and the significance of F is higher than 5% (or F is lower than 2): the
overall association between the variability of X and Y is poor. There is no need to carry on
with the investigation of the significance of the coefficients.
- If R square is high and the significance of F is less than 5% (or F is higher than 2): the
variability of X explains a high percentage of the variability of Y. In other terms, the
probability of obtaining the same observations by chance is negligible.
In this case, check the significance of the coefficients and interpret the expected
response of Y to a change in X.
For coefficients with p-value less than 5%, the coefficient measures the expected
impact on the dependent variable caused by a change of one unit in the independent
variable.
Example: total cost = 1000 + 50 × DLH
Interpretation: The OLS coefficient estimate is 50. So, if DLH changes by one unit, total
cost changes by 50.
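The fitted equation from the example can be used directly for prediction. This sketch simply evaluates the equation from the text (intercept 1000, slope 50 per direct labor hour):

```python
# Fitted cost equation from the example: total cost = 1000 + 50 * DLH
def total_cost(dlh):
    return 1000 + 50 * dlh

print(total_cost(20))                    # 2000: predicted cost at 20 DLH
print(total_cost(21) - total_cost(20))   # 50: effect of one extra DLH
```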
Note that, at a 95% confidence level, Excel displays a range of possible expected values for
the coefficient. Suppose for instance that we obtain the following results: a 95% confidence
interval with lower limit 40 and upper limit 70, and p-value << 5%. Since the interval does
not contain zero (and the p-value is below 5%), the coefficient is significant: the true
impact on Y of a one-unit change in X is expected to lie between 40 and 70.
Choose the best simple regression with respect to the following criteria:
- Highest R square
- Confidence in the parameters of the model: coefficient estimates with p-value less
than 5%, and absence of statistical problems such as an unexpected sign of a coefficient
(for example, a negative intercept in a cost function)
Multiple regression aims to explain the change in the dependent variable by taking into
account more than one variable. The cost would therefore be determined by several
factors which, together, cause it to change.
The analysis steps are similar to simple regression steps: study of the overall significance
of the regression and then study of the significance of the coefficients of the regression.
BUT, pay attention to the following:
If R square is high (the significance of F is less than 5%) and the different coefficients
are significant, but multicollinearity exists: the overall association measured by the
coefficient of determination is significant, but the coefficient estimates are
biased/unreliable.
In other terms, the model as a whole is reliable and can be used to predict the
dependent variable. However, due to multicollinearity, it is impossible to separate
the relative impact (measured by the coefficients) of each X variable on Y.
Example: Y = 100 + 2X1 + 5X2
For a change of one unit in X1, the expected value of Y will change by 2 units, if X2 is
held constant.
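The "holding the other variable constant" reading of the coefficients can be checked numerically. This sketch evaluates the equation from the example above at nearby points:

```python
# Multiple regression equation from the example: Y = 100 + 2*X1 + 5*X2.
# Each coefficient is the expected change in Y for a one-unit change
# in its own variable, with the other variable held constant.
def y_hat(x1, x2):
    return 100 + 2 * x1 + 5 * x2

print(y_hat(11, 4) - y_hat(10, 4))   # 2: X1 up by one unit, X2 fixed
print(y_hat(10, 5) - y_hat(10, 4))   # 5: X2 up by one unit, X1 fixed
```

When X1 and X2 are multicollinear, they rarely move independently of each other in the data, which is why these separate per-unit effects cannot be estimated reliably.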