
Quantitative Business Methods

Unit 5 and 6 – Inferential Statistics and


Introduction to Regression Analysis
Lecturer: Hashem Zarafat
Hypothesis Testing and Inferential Statistics
• Population parameters: mean μ, variance σ², proportion p.
• Sampling yields sample statistics: mean x̄, variance s², proportion p̂.
• Descriptive analysis summarizes the sample; using statistical theory, inference reasons from the sample statistics back to the population parameters.
Hypotheses Testing
• Once the data are ready for analysis, i.e., out-of-range and missing responses have been cleaned up and the goodness of the measures has been established, the researcher is ready to test the hypotheses already developed for the study with appropriate statistical techniques.
Hypotheses Testing
• Hypothesis testing allows a researcher to judge how safely results can be generalized beyond a specific sample of data.
– Null hypothesis
– Alternative hypothesis
• Confidence level (known or given)
• Significance level, alpha (given)
• P-value: the probability of obtaining a statistic at least as extreme as the one calculated from the sample, assuming the null hypothesis is true.
• Decision rule:
– If the p-value is less than alpha, reject the null hypothesis.
– If the p-value is greater than alpha, retain the null hypothesis.
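The decision rule above is mechanical enough to express as a tiny helper function; a minimal Python sketch (the function name and example p-values are illustrative, not from the slides):

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule: reject H0 when p < alpha."""
    return "reject H0" if p_value < alpha else "retain H0"

print(decide(0.03))  # p below alpha -> reject H0
print(decide(0.20))  # p above alpha -> retain H0
```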
Hypotheses Testing

• Assessing Association (or relationship)


– H0: There is no relationship between X and Y
• Assessing Differences
– H0: There is no difference between X and Y
Examples of Hypothesis Testing

• The average time (in minutes) that people spend online is 158 in Germany. A researcher wants to check whether people in the US spend less time online than people in Germany. He draws a sample of 15 people in the US, observes their online habits, and measures the total time each person spends online.
a) State the hypotheses for this study.
b) Using a random sample of 15 people in the US, investigate whether the researcher's claim about the time spent online in the US is correct.
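One way to work part (b) is to compute the one-sample t statistic by hand and compare it with a tabulated critical value. The sample values below are invented for illustration; only μ0 = 158, n = 15, and the one-tailed setup (H0: μ = 158 vs. Ha: μ < 158) come from the problem:

```python
import math
import statistics

mu0 = 158  # German average (minutes online), from the problem
# Hypothetical sample of 15 US respondents (minutes online)
sample = [120, 135, 150, 140, 155, 160, 130, 145,
          150, 125, 138, 142, 148, 152, 136]

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)           # sample standard deviation
t = (xbar - mu0) / (s / math.sqrt(n))  # one-sample t statistic

t_crit = 1.761  # tabulated t for alpha = 0.05, df = 14, one-tailed
print("reject H0" if t < -t_crit else "retain H0")
```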
Examples of Hypothesis Testing

• A shop manager paid a marketing company to advertise the shop's products online for a year. Since the marketing company charged different fees, the number of times it advertised for the shop varied from month to month. After the contract ended, the shop manager wants to know whether there is any association between the number of times the sales advertisement was published on various websites and the sales figures in that year.
• State the hypotheses.
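For the association hypothesis (H0: no relationship between the number of ads and sales), a sketch of the usual correlation test: compute the sample Pearson correlation r and the statistic t = r·sqrt((n − 2)/(1 − r²)). All the monthly figures below are hypothetical:

```python
import math

# Hypothetical monthly data: ads published and sales (thousands of dollars)
ads   = [5, 8, 10, 12, 6, 15, 9, 11, 14, 7, 13, 10]
sales = [22, 30, 33, 40, 25, 49, 31, 38, 45, 27, 44, 34]

n = len(ads)
abar, sbar = sum(ads) / n, sum(sales) / n
sxy = sum((a - abar) * (s - sbar) for a, s in zip(ads, sales))
sxx = sum((a - abar) ** 2 for a in ads)
syy = sum((s - sbar) ** 2 for s in sales)
r = sxy / math.sqrt(sxx * syy)            # sample Pearson correlation
t = r * math.sqrt((n - 2) / (1 - r * r))  # t statistic for H0: no association

t_crit = 2.228  # tabulated two-tailed t, alpha = 0.05, df = 10
print("reject H0" if abs(t) > t_crit else "retain H0")
```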
The Nature of Econometrics
and Economic Data
• Economic model of crime (Becker (1968))
– Derives an equation for criminal activity based on utility maximization:
  y = f(x1, x2, x3, x4, x5, x6, x7)
  with y = hours spent in criminal activities, x1 = "wage" of criminal activities, x2 = wage for legal employment, x3 = other income, x4 = probability of getting caught, x5 = probability of conviction if caught, x6 = expected sentence, x7 = age.
– The functional form of the relationship is not specified.
– The equation could have been postulated without economic modeling.


The Nature of Econometrics
and Economic Data
• Econometric model of criminal activity
– The functional form has to be specified.
– Variables may have to be approximated by other quantities.
– Most of econometrics deals with the specification of the error term (u or e):
  crime = β0 + β1·wage + β2·othinc + β3·freqarr + β4·freqconv + β5·avgsen + β6·age + u
  where crime is a measure of criminal activity, wage is the wage for legal employment, othinc is other income, freqarr is the frequency of prior arrests, freqconv is the frequency of conviction, and avgsen is the average sentence length after conviction; u collects the unobserved determinants of criminal activity (e.g., moral character, wage in criminal activity, family background, ...).
The Simple
Regression Model
Introduction to Regression Analysis
The Simple
Regression Model
• Definition of the simple linear regression model: "Explains variable y in terms of variable x"
  y = β0 + β1·x + u
– β0: intercept; β1: slope parameter
– y: dependent variable, explained variable, response variable, ...
– x: independent variable, explanatory variable, regressor, ...
– u: error term, disturbance, unobservables, ...
The Simple
Regression Model – Changes in y
• Interpretation of the simple linear regression model: "Studies how y varies with changes in x":
  Δy = β1·Δx as long as Δu = 0
– β1 answers: by how much does the dependent variable change if the independent variable is increased by one unit? This interpretation is only correct if all other things remain equal when the independent variable is increased by one unit.
The Simple
Regression Model
• Example 1: Soybean yield and fertilizer
  yield = β0 + β1·fertilizer + u
  where u contains rainfall, land quality, presence of parasites, ...; β1 measures the effect of fertilizer on yield, holding all other factors fixed.
• Example 2: A simple wage equation
  wage = β0 + β1·educ + u
  where u contains labor force experience, tenure with current employer, work ethic, intelligence, ...; β1 measures the change in hourly wage given another year of education, holding all other factors fixed.
The Simple
Regression Model
• When is there a causal interpretation?
• Conditional mean independence assumption: E(u|x) = E(u)
– The independent (explanatory) variable must not contain information about the mean of the unobserved factors.
• Example: wage equation, where u contains, e.g., intelligence
– Here the conditional mean independence assumption is unlikely to hold, because individuals with more education will also be more intelligent on average.
The Simple
Regression Model
• Population regression function (PRF)
– The conditional mean independence assumption implies that
  E(y|x) = β0 + β1·x
– This means that the average (expected) value of the dependent variable (y) can be expressed as a linear function of the independent (or explanatory) variable (x).
The Simple
Regression Model
• In order to estimate the regression model one needs data.
• A random sample of n observations: {(xi, yi): i = 1, ..., n}, where yi is the value of the dependent variable and xi the value of the explanatory variable for the i-th observation.
The Simple
Regression Model
• Fit a regression line through the data points as well as possible:
  ŷ = β̂0 + β̂1·x (the fitted regression line)
Thinking Challenge

• How would you draw a line through the points?
• How do you determine which line 'fits best'?
[Scatterplot of y against x, with both axes running from 0 to 60]
The Simple
Regression Model
• What does "as good as possible" mean?
• Regression residuals: ûi = yi − ŷi = yi − β̂0 − β̂1·xi
• Minimize the sum of squared regression residuals: min Σ ûi²
• The solution gives the Ordinary Least Squares (OLS) estimates:
  β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², β̂0 = ȳ − β̂1·x̄
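The two OLS formulas can be checked with a few lines of Python; the data here lie exactly on the line y = 2 + 3x, so the estimates recover the intercept and slope exactly (the function name is illustrative):

```python
import statistics

def ols_simple(x, y):
    """OLS slope and intercept for the simple regression y = b0 + b1*x."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

# Points on the exact line y = 2 + 3x
x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]
b0, b1 = ols_simple(x, y)
print(b0, b1)  # -> 2.0 3.0
```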


The Simple
Regression Model
• CEO salary and return on equity
  salary = β0 + β1·roe + u
  where salary is measured in thousands of dollars and roe is the return on equity of the CEO's firm.
• Fitted regression: the estimated slope on roe is 18.501, so if the return on equity increases by 1 percentage point, salary is predicted to increase by $18,501.
• Causal interpretation?
The Simple
Regression Model

The fitted regression line ŷ = β̂0 + β̂1·x depends on the sample and estimates the unknown population regression line E(y|x) = β0 + β1·x.
The Simple
Regression Model
• Properties of OLS on any sample of data
• Fitted (predicted) values and residuals: ŷi = β̂0 + β̂1·xi and ûi = yi − ŷi (deviations from the regression line)
• Algebraic properties of OLS regression:
– The residuals sum up to zero: Σ ûi = 0
– The correlation between residuals and regressors is zero: Σ xi·ûi = 0
– The sample averages of y and x lie on the regression line: ȳ = β̂0 + β̂1·x̄
The Simple Regression Model –
Sum of Squares
• Goodness-of-Fit: "How well does the explanatory variable explain the dependent variable?"
• Measures of variation:
– Total sum of squares, SST = Σ(yi − ȳ)²: the total variation in the dependent variable
– Explained (regression) sum of squares, SSR = Σ(ŷi − ȳ)²: the variation explained by the regression
– Residual or error sum of squares, SSE = Σ(yi − ŷi)²: the variation not explained by the regression
The Simple
Regression Model
• Decomposition of total variation:
  SST = SSR + SSE
  (total variation = explained part + unexplained part)
• Goodness-of-fit measure (R-squared):
  R² = SSR / SST = 1 − SSE / SST
– R-squared measures the fraction of the total variation that is explained by the regression.
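The decomposition and both forms of R² can be verified numerically: fit a simple OLS line to small invented data, then check SST = SSR + SSE. The dataset below is illustrative only:

```python
import statistics

# Small invented dataset
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 5, 7, 6, 9, 11, 10, 13]

xbar, ybar = statistics.mean(x), statistics.mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
y_hat = [b0 + b1 * xi for xi in x]                     # fitted values

sst = sum((yi - ybar) ** 2 for yi in y)                # total variation
ssr = sum((fi - ybar) ** 2 for fi in y_hat)            # explained
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual

r2 = ssr / sst
print(round(r2, 3))  # identical to 1 - sse/sst
```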
The Simple
Regression Model
• CEO salary and return on equity: the regression explains only 1.3% of the total variation in salaries.
• Voting outcomes and campaign expenditures: the regression explains 85.6% of the total variation in election outcomes.
• Caution: A high R-squared does not necessarily mean that the regression has a causal interpretation!
The Simple
Regression Model
• Incorporating nonlinearities: semi-logarithmic form
• Regression of log wages on years of education:
  log(wage) = β0 + β1·educ + u
  where log(wage) is the natural logarithm of the wage.
• This changes the interpretation of the regression coefficient: 100·β1 is the percentage change of the wage if years of education are increased by one year.
The Simple
Regression Model
• Fitted regression: the estimated coefficient on education is 0.083.
– The wage increases by 8.3% for every additional year of education (the return to education).
– For example: the growth rate of the wage is 8.3% per year of education.
The Simple
Regression Model
• Incorporating nonlinearities: log-logarithmic form
• CEO salary and firm sales:
  log(salary) = β0 + β1·log(sales) + u
  where log(salary) is the natural logarithm of CEO salary and log(sales) is the natural logarithm of his/her firm's sales.
• This changes the interpretation of the regression coefficient: β1 is the percentage change of salary if sales increase by 1%.
• Logarithmic changes are always percentage changes.
The Simple
Regression Model
• CEO salary and firm sales: in the fitted regression, the estimated coefficient on log(sales) is 0.257.
• For example: +1% sales → +0.257% salary.
• The log-log form postulates a constant-elasticity model, whereas the semi-log form assumes a semi-elasticity model.
The Simple
Regression Model
• Expected values and variances of the OLS estimators
• The estimated regression coefficients are random variables because they are calculated from a random sample: the data are random and depend on the particular sample that has been drawn.
• The question is whether our coefficient estimates are unbiased.
• To answer this, we have to establish clear assumptions for the simple linear regression (SLR) model.

The Simple
Regression Model
• Standard assumptions for the linear regression model
• Assumption SLR.1 (Linear in parameters): in the population, the relationship between y and x is linear:
  y = β0 + β1·x + u
• Assumption SLR.2 (Random sampling): the data {(xi, yi): i = 1, ..., n} are a random sample drawn from the population. Each data point therefore follows the population equation:
  yi = β0 + β1·xi + ui
The Simple
Regression Model
• Assumptions for the linear regression model (cont.)
• Assumption SLR.3 (Sample variation in the explanatory variable): the values of the explanatory variable are not all the same (otherwise it would be impossible to study how different values of the explanatory variable lead to different values of the dependent variable).
• Assumption SLR.4 (Zero conditional mean): E(u|x) = 0; the value of the explanatory variable must contain no information about the mean of the unobserved factors.
The Simple
Regression Model
• Theorem 2.1 (Unbiasedness of OLS): under SLR.1–SLR.4, E(β̂0) = β0 and E(β̂1) = β1.
• Interpretation of unbiasedness
– The estimated coefficients may be smaller or larger, depending on the sample that is the result of a random draw.
– However, on average, they will be equal to the values that characterize the true relationship between y and x in the population.
– "On average" means: if drawing the random sample and doing the estimation were repeated many times.
– In a given sample, estimates may differ considerably from the true values.
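Unbiasedness can be illustrated by exactly the repeated-sampling thought experiment described above: draw many random samples from a known population model and average the estimated slopes. All parameter values here are assumptions made for the simulation:

```python
import random

random.seed(42)
beta0, beta1 = 1.0, 2.0   # assumed true population parameters
n, reps = 50, 2000        # sample size and number of repeated samples

slopes = []
for _ in range(reps):
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    slopes.append(b1)

avg_b1 = sum(slopes) / reps
print(round(avg_b1, 2))   # close to the true beta1 = 2.0
```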
The Simple
Regression Model
• Estimating the error variance
– Var(u|x) = σ²: the variance of u does not depend on x, i.e., it is equal to the unconditional variance.
– One could estimate the variance of the errors by calculating the variance of the residuals in the sample; unfortunately, this estimate would be biased.
– An unbiased estimate of the error variance is obtained by dividing the sum of squared residuals by the number of observations minus the number of estimated regression coefficients: σ̂² = SSE / (n − 2).
Homoscedasticity:
homogeneity of variance
• Constant variance of error terms
The Simple
Regression Model
• Theorem 2.3 (Unbiasedness of the error variance): E(σ̂²) = σ².
• Calculation of standard errors for the regression coefficients: plug in σ̂ for the unknown σ.
• The estimated standard deviations of the regression coefficients are called "standard errors". They measure how precisely the regression coefficients are estimated.
The Multiple
Regression Model
Introduction to Regression Analysis
Multiple Regression
Analysis: Estimation
• Definition of the multiple linear regression model: "Explains variable y in terms of variables x1, x2, ..., xk"
  y = β0 + β1·x1 + β2·x2 + ... + βk·xk + u
– β0: intercept; β1, ..., βk: slope parameters
– y: dependent variable, explained variable, response variable, ...
– x1, ..., xk: independent variables, explanatory variables, regressors, ...
– u: error term, disturbance, unobservables, ...
Multiple Regression
Analysis: Estimation
• Motivation for multiple regression
– Incorporate more explanatory factors into the model.
– Explicitly hold fixed other factors that would otherwise end up in the error term.
– Allow for more flexible functional forms.
• Example: Wage equation
  wage = β0 + β1·educ + β2·exper + u
  where wage is the hourly wage, educ is years of education, exper is labor market experience, and u contains all other factors. β1 now measures the effect of education explicitly holding experience fixed.
Multiple Regression
Analysis: Estimation
• Example: Determinants of college GPA
  colGPA (grade point average at college) is explained by hsGPA (high school grade point average) and ACT (achievement test score).
• Interpretation
– Holding ACT fixed ("controlling for" ACT), another point of high school grade point average is associated with .453 additional points of college grade point average.
– Or: if we compare two students with the same ACT, but the hsGPA of student A is one point higher, we predict student A to have a colGPA that is .453 higher than that of student B.
– Holding high school grade point average fixed, another 10 points on the ACT are associated with less than one additional point of college GPA.
Multiple Regression
Analysis: Estimation
• Example: Explaining arrest records
  The number of times a man was arrested in 1986 is explained by the proportion of prior arrests that led to conviction, the months spent in prison in 1986, and the quarters employed in 1986.
• Interpretation of the estimated coefficients (−.150, −.034, −.104):
– Proportion of prior arrests up by 0.5: −.150(0.5) = −.075, i.e., 7.5 fewer arrests per 100 men.
– Months in prison up by 12: −.034(12) = −0.408 fewer arrests for a given man.
– Quarters employed up by 1: −.104(1) = −.104, i.e., 10.4 fewer arrests per 100 men.
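The interpretation lines above are plain arithmetic on the slide's coefficients, which a short sketch makes explicit (variable names are illustrative):

```python
# Estimated coefficients from the slide's arrest equation
b_pcnv, b_ptime, b_qemp = -0.150, -0.034, -0.104

effect_pcnv = b_pcnv * 0.5     # conviction proportion up by 0.5
effect_ptime = b_ptime * 12    # twelve more months in prison
effect_qemp = b_qemp * 1       # one more quarter employed

print(round(effect_pcnv, 3))   # -0.075, i.e. 7.5 fewer arrests per 100 men
print(round(effect_ptime, 3))  # -0.408 arrests for a given man
print(round(effect_qemp, 3))   # -0.104, i.e. 10.4 fewer arrests per 100 men
```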
Multiple Regression
Analysis: Estimation
• OLS estimation of the multiple regression model
• Random sample: {(xi1, xi2, ..., xik, yi): i = 1, ..., n}
• Regression residuals: ûi = yi − β̂0 − β̂1·xi1 − ... − β̂k·xik
• Minimize the sum of squared residuals: min Σ ûi²
– The minimization will be carried out by computer.

Multiple Regression
Analysis: Estimation
• Standard assumptions for the multiple regression model
• Assumption MLR.1 (Linear in parameters): in the population, the relationship between y and the explanatory variables is linear:
  y = β0 + β1·x1 + ... + βk·xk + u
• Assumption MLR.2 (Random sampling): the data are a random sample drawn from the population. Each data point therefore follows the population equation.
Multiple Regression
Analysis: Estimation
• Standard assumptions for the multiple regression model (cont.)
• Assumption MLR.3 (No perfect collinearity): "In the sample (and therefore in the population), none of the independent variables is constant and there are no exact relationships among the independent variables."
• Remarks on MLR.3
– The assumption only rules out perfect collinearity/correlation between explanatory variables; imperfect correlation is allowed.
– If an explanatory variable is a perfect linear combination of other explanatory variables, it is superfluous and may be eliminated.
– Constant variables are also ruled out (they are collinear with the intercept).
Detecting Multicollinearity

1. Significant correlations between pairs of independent variables.
2. Nonsignificant t-tests for all of the individual β parameters when the F-test for overall model adequacy is significant.
3. Signs opposite from what is expected in the estimated β parameters.
Multiple Regression
Analysis: Estimation
• Standard assumptions for the multiple regression model (cont.)
• Assumption MLR.4 (Zero conditional mean): E(u|x1, ..., xk) = 0; the values of the explanatory variables must contain no information about the mean of the unobserved factors.
– In a multiple regression model, the zero conditional mean assumption is much more likely to hold because fewer things end up in the error.
• Theorem 3.1 (Unbiasedness of OLS): E(β̂j) = βj for j = 0, 1, ..., k.
– Unbiasedness is an average property in repeated samples; in a given sample, the estimates may still be far away from the true values.
Multiple Regression
Analysis: Estimation
• Including irrelevant variables in a regression model
– If a regressor's true coefficient is 0 in the population, including it is no problem for unbiasedness: its estimated coefficient is zero on average.
– However, including irrelevant variables may increase the sampling variance.
• Omitting relevant variables: the simple case
– True model (contains x1 and x2): y = β0 + β1·x1 + β2·x2 + u
– Estimated model (x2 is omitted): the omitted x2 ends up in the error term.

Multiple Regression
Analysis: Estimation
• Omitted variable bias
– If x1 and x2 are correlated, assume a linear regression relationship between them: x2 = δ0 + δ1·x1 + v, with error term v.
– If y is only regressed on x1, the estimated intercept and the estimated slope on x1 absorb part of the effect of the omitted x2.
• Conclusion: All estimated coefficients will be biased.

Multiple Regression
Analysis: Estimation
• Example: Omitting ability in a wage equation
– Both the effect of ability on wages (β2) and the slope of ability on education (δ1) will be positive, so the return to education will be overestimated.
– It will look as if people with many years of education earn very high wages, but this is partly due to the fact that people with more education are also more able on average.
• When is there no omitted variable bias?
– If the omitted variable is irrelevant (β2 = 0) or uncorrelated with the included regressor (δ1 = 0).
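The overestimation story can be reproduced in a small simulation: generate data where ability raises both education and wages, omit ability, and watch the estimated return to education exceed its true value. Every number here (the true coefficients 0.5 and 0.8, the education equation) is an assumption of the sketch, not from the slides:

```python
import random

random.seed(1)
n = 5000
# Assumed true model: y = 1 + 0.5*educ + 0.8*abil + u
abil = [random.gauss(0, 1) for _ in range(n)]
educ = [12 + 1.5 * a + random.gauss(0, 1) for a in abil]  # abil raises educ
y = [1 + 0.5 * e + 0.8 * a + random.gauss(0, 1) for e, a in zip(educ, abil)]

def slope(x, z):
    """OLS slope from a simple regression of z on x."""
    xbar = sum(x) / len(x)
    zbar = sum(z) / len(z)
    num = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

b_short = slope(educ, y)  # ability omitted from the regression
print(round(b_short, 2))  # noticeably above the true value 0.5
```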
Nonlinear Transformations

• Equation for General Linear Regression:

Predicted Y = a + b1X1 + b2X2 + … + bkXk

• General linear regression does not require that any of the


variables be the original variables in the dataset.
• Often, the variables being used are transformed variables.
• Nonlinear transformations are used whenever curvature is
detected in scatterplots.
• Either the dependent, or the independent, or all of the
variables can be transformed.
• Typical nonlinear transformations are: logarithm, square
root, the reciprocal (1/x), and the square.
Multiple Regression
Analysis: Estimation
• Using quadratic functional forms
• Example: Family income and family consumption
  consumption = β0 + β1·income + β2·income² + u
  where u contains other factors.
– The model has two explanatory variables: income and income squared.
– Consumption is explained as a quadratic function of income.
– One has to be very careful when interpreting the coefficients: by how much consumption increases if income is increased by one unit depends on how much income is already there.
Multiple Regression Analysis:
Further Issues
• Models with interaction terms: an interaction term is the product of two regressors, e.g., square footage × number of bedrooms, so that the effect of the number of bedrooms depends on the level of square footage.
• Interaction effects complicate the interpretation of parameters: the coefficient on the number of bedrooms alone gives its effect for a square footage of zero.
Multiple Regression Analysis:
Further Issues
• Adjusted R-squared
– A better goodness-of-fit estimate uses the correct degrees of freedom of the numerator and the denominator:
  adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
– The adjusted R-squared imposes a penalty for adding new regressors.
– The adjusted R-squared increases if, and only if, the t-statistic of a newly added regressor is greater than one in absolute value.
• Relationship between R-squared and adjusted R-squared: the adjusted R-squared may even become negative.
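The penalty and the possibility of a negative value are easy to see in a small sketch; all input values are invented:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.30, 50, 3), 3))  # modest penalty
print(round(adjusted_r2(0.05, 20, 6), 3))  # negative adjusted R-squared
```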
Multiple Regression Analysis:
Further Issues
• Comparing nested models: F-test (ANOVA)
• Using adjusted R-squared to choose between nonnested models
– Models are nonnested if neither model is a special case of the other.
– A comparison of the plain R-squared of both models would be unfair to the model with fewer parameters.
– In the given example, even after adjusting for the difference in degrees of freedom, the quadratic model is preferred.
Variable Selection:
Stepwise Regression

• The user first identifies the response, y, and the set of potentially important independent variables, x1, x2, ..., xk, where k is generally large. The response and independent variables are then entered into the computer software, and the stepwise procedure (aka mixed selection) begins.
• The result of the stepwise procedure is a model containing only those terms with t-values that are significant at the specified α level.
• Thus, in most practical situations, only several of the large number of independent variables remain.
• Parsimony guides you to select the regression model with the fewest independent variables that can predict the dependent variable adequately.
Multiple Regression Analysis:
Qualitative Information
• Qualitative Information
– Examples: gender, race, industry, region, rating grade, ...
– A way to incorporate qualitative information is to use dummy variables.
– They may appear as the dependent or as independent variables.
• A single dummy independent variable
– Dummy variable: = 1 if the person is a woman, = 0 if the person is a man (an arbitrary assignment).
– Its coefficient is the wage gain/loss if the person is a woman rather than a man (holding other things fixed).
Multiple Regression Analysis:
Qualitative Information
• Dummy variable trap: a model that includes a dummy for every category plus an intercept cannot be estimated (perfect collinearity).
• When using dummy variables, one category always has to be omitted; the omitted category (e.g., men, or women) becomes the base category.

Multiple Regression Analysis:
Qualitative Information
• Using dummy variables for multiple categories
– 1) Define membership in each category by a dummy variable.
– 2) Leave out one category (which becomes the base category).
– Example: holding other things fixed, married women earn 19.8% less than single men (the base category).
Multiple Regression Analysis:
Qualitative Information
• A binary dependent variable: the linear probability model (LPM)
• Linear regression when the dependent variable is binary: if the dependent variable only takes on the values 1 and 0, the coefficients describe the effect of the explanatory variables on the probability that y = 1.
Extrapolation

Extrapolation is predicting y for values of the independent variables that lie outside the region in which the model was developed. This is a dangerous practice.
Regression Analysis: Statistical Inference
Hypothesis Tests for the
Regression Coefficients

• An important piece of information in regression


outputs: The t-values for the individual regression
coefficients.
• Each t-value is the ratio of the estimated coefficient to
its standard error and indicates how many standard
errors the regression coefficient is from zero.
• A t-value can be used in a hypothesis test for the corresponding regression coefficient: if a variable's coefficient is zero, there is no point in including this variable in the equation.
• To run this test, compare the t-value in the regression output with a tabulated t-value and reject the null hypothesis if the output t-value is greater in absolute value.
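A sketch of this comparison, with an invented coefficient, standard error, and a tabulated critical value (two-tailed, α = 0.05, df = 28):

```python
def t_value(coef, std_err):
    """t statistic for H0: the regression coefficient equals zero."""
    return coef / std_err

t = t_value(1.8, 0.6)   # hypothetical coefficient and standard error
t_crit = 2.048          # tabulated two-tailed t, alpha = 0.05, df = 28
print(abs(t) > t_crit)  # True: reject H0, keep the variable
```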
A Test for the Overall Fit: The ANOVA
Table

• It is conceivable that none of the variables in the regression


equation explains the dependent variable.
• A first indication of this problem is a very small R² value.
• Another way to say this is that the same value of Y will be predicted regardless of the values of the Xs.
• Hypotheses for ANOVA test: The null hypothesis is that all
coefficients of the explanatory variables are zero. The alternative
is that at least one of these coefficients is not zero.
• Two ways to test the hypotheses:
– Individual t-values (small, or statistically insignificant).
– F test (ANOVA test): A formal procedure for testing
whether the explained variation is large compared to
the unexplained variation.
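In this document's notation (SSR = explained, SSE = residual), the overall F statistic compares explained variation per regressor with unexplained variation per remaining degree of freedom. A sketch with invented ANOVA numbers:

```python
def f_statistic(ssr, sse, n, k):
    """Overall F: explained variation per regressor over residual per df."""
    return (ssr / k) / (sse / (n - k - 1))

# Hypothetical ANOVA numbers: SSR = 120 (explained), SSE = 80 (residual)
F = f_statistic(120, 80, 30, 3)
print(round(F, 2))  # compare with the tabulated F(3, 26) critical value
```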
ANOVA Table Elements

• F-value has an associated p-value that allows you to run the test easily;
it is reported in most regression outputs.
Example: Estimation and
Prediction: CI vs. PI
Homework Assignment (see the R script!)

• Reading: "Basic Business Statistics: Concepts and Applications"
– Chapter 13: Sections 13.1, 13.2, 13.5
– Chapter 14: Sections 14.2, 14.6, 14.7
– Chapter 15: all sections
