
Quantitative Business Methods

Unit 5 and 6 – Inferential Statistics and


Introduction to Regression Analysis
Lecturer: Hashem Zarafat
Hypothesis Testing and Inferential Statistics
• Population parameters: mean μ, variance σ², proportion p.
• Sampling yields sample statistics: mean x̄, variance s², proportion p̂.
• Descriptive analysis summarizes the sample; using statistical theory, inference reasons from the sample statistics back to the population parameters.
Hypotheses Testing
• Once the data are ready for analysis, i.e., out-of-range and missing responses have been cleaned up and the goodness of the measures has been established, the researcher is ready to test the hypotheses already developed for the study with appropriate statistical techniques.
Hypotheses Testing
• Hypothesis testing allows a researcher to judge how safely results can be generalized beyond a specific sample of data.
– Null hypothesis
– Alternative hypothesis
• Confidence level (known or given)
• Significance level, alpha (given)
• P-value: the probability of obtaining a statistic at least as extreme as the one calculated from the sample, assuming the null hypothesis is true.
• Decision rule:
– If the p-value is less than alpha, reject the null hypothesis.
– If the p-value is greater than alpha, retain the null hypothesis.
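The decision rule above is mechanical enough to express as a tiny helper function; a minimal Python sketch (the function name and example p-values are illustrative, not from the slides):

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule: reject H0 when p < alpha."""
    return "reject H0" if p_value < alpha else "retain H0"

print(decide(0.03))  # p below alpha -> reject H0
print(decide(0.20))  # p above alpha -> retain H0
```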
Hypotheses Testing

• Assessing Association (or relationship)


– H0: There is no relationship between X and Y
• Assessing Differences
– H0: There is no difference between X and Y
Examples of Hypothesis Testing

• The average time (in minutes) that people spend online is 158 in Germany. A researcher wants to check whether people in the US spend less time online than people in Germany. He draws a sample of 15 people in the US, observes their online habits, and measures the total time each person spends online.
a) State the hypotheses for this study.
b) Using a random sample of 15 people in the US, investigate whether the researcher's claim about the time spent online in the US is correct.
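One way to work part (b) is to compute the one-sample t statistic by hand and compare it with a tabulated critical value. The sample values below are invented for illustration; only μ0 = 158, n = 15, and the one-tailed setup (H0: μ = 158 vs. Ha: μ < 158) come from the problem:

```python
import math
import statistics

mu0 = 158  # German average (minutes online), from the problem
# Hypothetical sample of 15 US respondents (minutes online)
sample = [120, 135, 150, 140, 155, 160, 130, 145,
          150, 125, 138, 142, 148, 152, 136]

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)           # sample standard deviation
t = (xbar - mu0) / (s / math.sqrt(n))  # one-sample t statistic

t_crit = 1.761  # tabulated t for alpha = 0.05, df = 14, one-tailed
print("reject H0" if t < -t_crit else "retain H0")
```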
Examples of Hypothesis Testing

• A shop manager paid a marketing company to advertise the shop's products online for a year. Since the marketing company charged different fees, the number of times it advertised for the shop varied from month to month. After the contract ended, the shop manager wants to know whether there is any association between the number of times the sales advertisement was published on various websites and the sales figures in that year.
• State the hypotheses.
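For the association hypothesis (H0: no relationship between the number of ads and sales), a sketch of the usual correlation test: compute the sample Pearson correlation r and the statistic t = r·sqrt((n − 2)/(1 − r²)). All the monthly figures below are hypothetical:

```python
import math

# Hypothetical monthly data: ads published and sales (thousands of dollars)
ads   = [5, 8, 10, 12, 6, 15, 9, 11, 14, 7, 13, 10]
sales = [22, 30, 33, 40, 25, 49, 31, 38, 45, 27, 44, 34]

n = len(ads)
abar, sbar = sum(ads) / n, sum(sales) / n
sxy = sum((a - abar) * (s - sbar) for a, s in zip(ads, sales))
sxx = sum((a - abar) ** 2 for a in ads)
syy = sum((s - sbar) ** 2 for s in sales)
r = sxy / math.sqrt(sxx * syy)            # sample Pearson correlation
t = r * math.sqrt((n - 2) / (1 - r * r))  # t statistic for H0: no association

t_crit = 2.228  # tabulated two-tailed t, alpha = 0.05, df = 10
print("reject H0" if abs(t) > t_crit else "retain H0")
```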
The Nature of Econometrics
and Economic Data
• Economic model of crime (Becker (1968))
– Derives an equation for criminal activity based on utility maximization:
  y = f(x1, x2, x3, x4, x5, x6, x7)
  with y = hours spent in criminal activities, x1 = "wage" of criminal activities, x2 = wage for legal employment, x3 = other income, x4 = probability of getting caught, x5 = probability of conviction if caught, x6 = expected sentence, x7 = age.
– The functional form of the relationship is not specified.
– The equation could have been postulated without economic modeling.


The Nature of Econometrics
and Economic Data
• Econometric model of criminal activity
– The functional form has to be specified.
– Variables may have to be approximated by other quantities.
– Most of econometrics deals with the specification of the error term (u or e):
  crime = β0 + β1·wage + β2·othinc + β3·freqarr + β4·freqconv + β5·avgsen + β6·age + u
  where crime is a measure of criminal activity, wage is the wage for legal employment, othinc is other income, freqarr is the frequency of prior arrests, freqconv is the frequency of conviction, and avgsen is the average sentence length after conviction; u collects the unobserved determinants of criminal activity (e.g., moral character, wage in criminal activity, family background, ...).
The Simple
Regression Model
Introduction to Regression Analysis
The Simple
Regression Model
• Definition of the simple linear regression model: "Explains variable y in terms of variable x"
  y = β0 + β1·x + u
– β0: intercept; β1: slope parameter
– y: dependent variable, explained variable, response variable, ...
– x: independent variable, explanatory variable, regressor, ...
– u: error term, disturbance, unobservables, ...
The Simple
Regression Model – Changes in y
• Interpretation of the simple linear regression model: "Studies how y varies with changes in x":
  Δy = β1·Δx as long as Δu = 0
– β1 answers: by how much does the dependent variable change if the independent variable is increased by one unit? This interpretation is only correct if all other things remain equal when the independent variable is increased by one unit.
The Simple
Regression Model
• Example 1: Soybean yield and fertilizer
  yield = β0 + β1·fertilizer + u
  where u contains rainfall, land quality, presence of parasites, ...; β1 measures the effect of fertilizer on yield, holding all other factors fixed.
• Example 2: A simple wage equation
  wage = β0 + β1·educ + u
  where u contains labor force experience, tenure with current employer, work ethic, intelligence, ...; β1 measures the change in hourly wage given another year of education, holding all other factors fixed.
The Simple
Regression Model
• When is there a causal interpretation?
• Conditional mean independence assumption: E(u|x) = E(u)
– The independent (explanatory) variable must not contain information about the mean of the unobserved factors.
• Example: wage equation, where u contains, e.g., intelligence
– Here the conditional mean independence assumption is unlikely to hold, because individuals with more education will also be more intelligent on average.
The Simple
Regression Model
• Population regression function (PRF)
– The conditional mean independence assumption implies that
  E(y|x) = β0 + β1·x
– This means that the average (expected) value of the dependent variable (y) can be expressed as a linear function of the independent (or explanatory) variable (x).
The Simple
Regression Model
• In order to estimate the regression model one needs data.
• A random sample of n observations: {(xi, yi): i = 1, ..., n}, where yi is the value of the dependent variable and xi the value of the explanatory variable for the i-th observation.
The Simple
Regression Model
• Fit a regression line through the data points as well as possible:
  ŷ = β̂0 + β̂1·x (the fitted regression line)
Thinking Challenge

• How would you draw a line through the points?
• How do you determine which line 'fits best'?
[Scatterplot of y against x, with both axes running from 0 to 60]
The Simple
Regression Model
• What does "as good as possible" mean?
• Regression residuals: ûi = yi − ŷi = yi − β̂0 − β̂1·xi
• Minimize the sum of squared regression residuals: min Σ ûi²
• The solution gives the Ordinary Least Squares (OLS) estimates:
  β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², β̂0 = ȳ − β̂1·x̄
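The two OLS formulas can be checked with a few lines of Python; the data here lie exactly on the line y = 2 + 3x, so the estimates recover the intercept and slope exactly (the function name is illustrative):

```python
import statistics

def ols_simple(x, y):
    """OLS slope and intercept for the simple regression y = b0 + b1*x."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

# Points on the exact line y = 2 + 3x
x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]
b0, b1 = ols_simple(x, y)
print(b0, b1)  # -> 2.0 3.0
```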


The Simple
Regression Model
• CEO salary and return on equity
  salary = β0 + β1·roe + u
  where salary is measured in thousands of dollars and roe is the return on equity of the CEO's firm.
• Fitted regression: the estimated slope on roe is 18.501, so if the return on equity increases by 1 percentage point, salary is predicted to increase by $18,501.
• Causal interpretation?
The Simple
Regression Model

The fitted regression line ŷ = β̂0 + β̂1·x depends on the sample and estimates the unknown population regression line E(y|x) = β0 + β1·x.
The Simple
Regression Model
• Properties of OLS on any sample of data
• Fitted (predicted) values and residuals: ŷi = β̂0 + β̂1·xi and ûi = yi − ŷi (deviations from the regression line)
• Algebraic properties of OLS regression:
– The residuals sum up to zero: Σ ûi = 0
– The correlation between residuals and regressors is zero: Σ xi·ûi = 0
– The sample averages of y and x lie on the regression line: ȳ = β̂0 + β̂1·x̄
The Simple Regression Model –
Sum of Squares
• Goodness-of-Fit: "How well does the explanatory variable explain the dependent variable?"
• Measures of variation:
– Total sum of squares, SST = Σ(yi − ȳ)²: the total variation in the dependent variable
– Explained (regression) sum of squares, SSR = Σ(ŷi − ȳ)²: the variation explained by the regression
– Residual or error sum of squares, SSE = Σ(yi − ŷi)²: the variation not explained by the regression
The Simple
Regression Model
• Decomposition of total variation:
  SST = SSR + SSE
  (total variation = explained part + unexplained part)
• Goodness-of-fit measure (R-squared):
  R² = SSR / SST = 1 − SSE / SST
– R-squared measures the fraction of the total variation that is explained by the regression.
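The decomposition and both forms of R² can be verified numerically: fit a simple OLS line to small invented data, then check SST = SSR + SSE. The dataset below is illustrative only:

```python
import statistics

# Small invented dataset
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 5, 7, 6, 9, 11, 10, 13]

xbar, ybar = statistics.mean(x), statistics.mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
y_hat = [b0 + b1 * xi for xi in x]                     # fitted values

sst = sum((yi - ybar) ** 2 for yi in y)                # total variation
ssr = sum((fi - ybar) ** 2 for fi in y_hat)            # explained
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual

r2 = ssr / sst
print(round(r2, 3))  # identical to 1 - sse/sst
```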
The Simple
Regression Model
• CEO salary and return on equity: the regression explains only 1.3% of the total variation in salaries.
• Voting outcomes and campaign expenditures: the regression explains 85.6% of the total variation in election outcomes.
• Caution: A high R-squared does not necessarily mean that the regression has a causal interpretation!
The Simple
Regression Model
• Incorporating nonlinearities: semi-logarithmic form
• Regression of log wages on years of education:
  log(wage) = β0 + β1·educ + u
  where log(wage) is the natural logarithm of the wage.
• This changes the interpretation of the regression coefficient: 100·β1 is the percentage change of the wage if years of education are increased by one year.
The Simple
Regression Model
• Fitted regression: the estimated coefficient on education is 0.083.
– The wage increases by 8.3% for every additional year of education (the return to education).
– For example: the growth rate of the wage is 8.3% per year of education.
The Simple
Regression Model
• Incorporating nonlinearities: log-logarithmic form
• CEO salary and firm sales:
  log(salary) = β0 + β1·log(sales) + u
  where log(salary) is the natural logarithm of CEO salary and log(sales) is the natural logarithm of his/her firm's sales.
• This changes the interpretation of the regression coefficient: β1 is the percentage change of salary if sales increase by 1%.
• Logarithmic changes are always percentage changes.
The Simple
Regression Model
• CEO salary and firm sales: in the fitted regression, the estimated coefficient on log(sales) is 0.257.
• For example: +1% sales → +0.257% salary.
• The log-log form postulates a constant-elasticity model, whereas the semi-log form assumes a semi-elasticity model.
The Simple
Regression Model
• Expected values and variances of the OLS estimators
• The estimated regression coefficients are random variables because they are calculated from a random sample: the data are random and depend on the particular sample that has been drawn.
• The question is whether our coefficient estimates are unbiased.
• To answer this, we have to establish clear assumptions for the simple linear regression (SLR) model.

The Simple
Regression Model
• Standard assumptions for the linear regression model
• Assumption SLR.1 (Linear in parameters): in the population, the relationship between y and x is linear:
  y = β0 + β1·x + u
• Assumption SLR.2 (Random sampling): the data {(xi, yi): i = 1, ..., n} are a random sample drawn from the population. Each data point therefore follows the population equation:
  yi = β0 + β1·xi + ui
The Simple
Regression Model
• Assumptions for the linear regression model (cont.)
• Assumption SLR.3 (Sample variation in the explanatory variable): the values of the explanatory variable are not all the same (otherwise it would be impossible to study how different values of the explanatory variable lead to different values of the dependent variable).
• Assumption SLR.4 (Zero conditional mean): E(u|x) = 0; the value of the explanatory variable must contain no information about the mean of the unobserved factors.
The Simple
Regression Model
• Theorem 2.1 (Unbiasedness of OLS): under SLR.1–SLR.4, E(β̂0) = β0 and E(β̂1) = β1.
• Interpretation of unbiasedness
– The estimated coefficients may be smaller or larger, depending on the sample that is the result of a random draw.
– However, on average, they will be equal to the values that characterize the true relationship between y and x in the population.
– "On average" means: if drawing the random sample and doing the estimation were repeated many times.
– In a given sample, estimates may differ considerably from the true values.
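Unbiasedness can be illustrated by exactly the repeated-sampling thought experiment described above: draw many random samples from a known population model and average the estimated slopes. All parameter values here are assumptions made for the simulation:

```python
import random

random.seed(42)
beta0, beta1 = 1.0, 2.0   # assumed true population parameters
n, reps = 50, 2000        # sample size and number of repeated samples

slopes = []
for _ in range(reps):
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    slopes.append(b1)

avg_b1 = sum(slopes) / reps
print(round(avg_b1, 2))   # close to the true beta1 = 2.0
```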
The Simple
Regression Model
• Estimating the error variance
– Var(u|x) = σ²: the variance of u does not depend on x, i.e., it is equal to the unconditional variance.
– One could estimate the variance of the errors by calculating the variance of the residuals in the sample; unfortunately, this estimate would be biased.
– An unbiased estimate of the error variance is obtained by dividing the sum of squared residuals by the number of observations minus the number of estimated regression coefficients: σ̂² = SSE / (n − 2).
Homoscedasticity:
homogeneity of variance
• Constant variance of error terms
The Simple
Regression Model
• Theorem 2.3 (Unbiasedness of the error variance): E(σ̂²) = σ².
• Calculation of standard errors for the regression coefficients: plug in σ̂ for the unknown σ.
• The estimated standard deviations of the regression coefficients are called "standard errors". They measure how precisely the regression coefficients are estimated.
The Multiple
Regression Model
Introduction to Regression Analysis
Multiple Regression
Analysis: Estimation
• Definition of the multiple linear regression model: "Explains variable y in terms of variables x1, x2, ..., xk"
  y = β0 + β1·x1 + β2·x2 + ... + βk·xk + u
– β0: intercept; β1, ..., βk: slope parameters
– y: dependent variable, explained variable, response variable, ...
– x1, ..., xk: independent variables, explanatory variables, regressors, ...
– u: error term, disturbance, unobservables, ...
Multiple Regression
Analysis: Estimation
• Motivation for multiple regression
– Incorporate more explanatory factors into the model.
– Explicitly hold fixed other factors that would otherwise end up in the error term.
– Allow for more flexible functional forms.
• Example: Wage equation
  wage = β0 + β1·educ + β2·exper + u
  where wage is the hourly wage, educ is years of education, exper is labor market experience, and u contains all other factors. β1 now measures the effect of education explicitly holding experience fixed.
Multiple Regression
Analysis: Estimation
• Example: Determinants of college GPA
  colGPA (grade point average at college) is explained by hsGPA (high school grade point average) and ACT (achievement test score).
• Interpretation
– Holding ACT fixed ("controlling for" ACT), another point of high school grade point average is associated with .453 additional points of college grade point average.
– Or: if we compare two students with the same ACT, but the hsGPA of student A is one point higher, we predict student A to have a colGPA that is .453 higher than that of student B.
– Holding high school grade point average fixed, another 10 points on the ACT are associated with less than one additional point of college GPA.
Multiple Regression
Analysis: Estimation
• Example: Explaining arrest records
  The number of times a man was arrested in 1986 is explained by the proportion of prior arrests that led to conviction, the months spent in prison in 1986, and the quarters employed in 1986.
• Interpretation of the estimated coefficients (−.150, −.034, −.104):
– Proportion of prior arrests up by 0.5: −.150(0.5) = −.075, i.e., 7.5 fewer arrests per 100 men.
– Months in prison up by 12: −.034(12) = −0.408 fewer arrests for a given man.
– Quarters employed up by 1: −.104(1) = −.104, i.e., 10.4 fewer arrests per 100 men.
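The interpretation lines above are plain arithmetic on the slide's coefficients, which a short sketch makes explicit (variable names are illustrative):

```python
# Estimated coefficients from the slide's arrest equation
b_pcnv, b_ptime, b_qemp = -0.150, -0.034, -0.104

effect_pcnv = b_pcnv * 0.5     # conviction proportion up by 0.5
effect_ptime = b_ptime * 12    # twelve more months in prison
effect_qemp = b_qemp * 1       # one more quarter employed

print(round(effect_pcnv, 3))   # -0.075, i.e. 7.5 fewer arrests per 100 men
print(round(effect_ptime, 3))  # -0.408 arrests for a given man
print(round(effect_qemp, 3))   # -0.104, i.e. 10.4 fewer arrests per 100 men
```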
Multiple Regression
Analysis: Estimation
• OLS estimation of the multiple regression model
• Random sample: {(xi1, xi2, ..., xik, yi): i = 1, ..., n}
• Regression residuals: ûi = yi − β̂0 − β̂1·xi1 − ... − β̂k·xik
• Minimize the sum of squared residuals: min Σ ûi²
– The minimization will be carried out by computer.

Multiple Regression
Analysis: Estimation
• Standard assumptions for the multiple regression model
• Assumption MLR.1 (Linear in parameters): in the population, the relationship between y and the explanatory variables is linear:
  y = β0 + β1·x1 + ... + βk·xk + u
• Assumption MLR.2 (Random sampling): the data are a random sample drawn from the population. Each data point therefore follows the population equation.
Multiple Regression
Analysis: Estimation
• Standard assumptions for the multiple regression model (cont.)
• Assumption MLR.3 (No perfect collinearity): "In the sample (and therefore in the population), none of the independent variables is constant and there are no exact relationships among the independent variables."
• Remarks on MLR.3
– The assumption only rules out perfect collinearity/correlation between explanatory variables; imperfect correlation is allowed.
– If an explanatory variable is a perfect linear combination of other explanatory variables, it is superfluous and may be eliminated.
– Constant variables are also ruled out (they are collinear with the intercept).
Detecting Multicollinearity

1. Significant correlations between pairs of independent variables.
2. Nonsignificant t-tests for all of the individual β parameters when the F-test for overall model adequacy is significant.
3. Signs opposite from what is expected in the estimated β parameters.
Multiple Regression
Analysis: Estimation
• Standard assumptions for the multiple regression model (cont.)
• Assumption MLR.4 (Zero conditional mean): E(u|x1, ..., xk) = 0; the values of the explanatory variables must contain no information about the mean of the unobserved factors.
– In a multiple regression model, the zero conditional mean assumption is much more likely to hold because fewer things end up in the error.
• Theorem 3.1 (Unbiasedness of OLS): E(β̂j) = βj for j = 0, 1, ..., k.
– Unbiasedness is an average property in repeated samples; in a given sample, the estimates may still be far away from the true values.
Multiple Regression
Analysis: Estimation
• Including irrelevant variables in a regression model
– If a regressor's true coefficient is 0 in the population, including it is no problem for unbiasedness: its estimated coefficient is zero on average.
– However, including irrelevant variables may increase the sampling variance.
• Omitting relevant variables: the simple case
– True model (contains x1 and x2): y = β0 + β1·x1 + β2·x2 + u
– Estimated model (x2 is omitted): the omitted x2 ends up in the error term.

Multiple Regression
Analysis: Estimation
• Omitted variable bias
– If x1 and x2 are correlated, assume a linear regression relationship between them: x2 = δ0 + δ1·x1 + v, with error term v.
– If y is only regressed on x1, the estimated intercept and the estimated slope on x1 absorb part of the effect of the omitted x2.
• Conclusion: All estimated coefficients will be biased.

Multiple Regression
Analysis: Estimation
• Example: Omitting ability in a wage equation
– Both the effect of ability on wages (β2) and the slope of ability on education (δ1) will be positive, so the return to education will be overestimated.
– It will look as if people with many years of education earn very high wages, but this is partly due to the fact that people with more education are also more able on average.
• When is there no omitted variable bias?
– If the omitted variable is irrelevant (β2 = 0) or uncorrelated with the included regressor (δ1 = 0).
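The overestimation story can be reproduced in a small simulation: generate data where ability raises both education and wages, omit ability, and watch the estimated return to education exceed its true value. Every number here (the true coefficients 0.5 and 0.8, the education equation) is an assumption of the sketch, not from the slides:

```python
import random

random.seed(1)
n = 5000
# Assumed true model: y = 1 + 0.5*educ + 0.8*abil + u
abil = [random.gauss(0, 1) for _ in range(n)]
educ = [12 + 1.5 * a + random.gauss(0, 1) for a in abil]  # abil raises educ
y = [1 + 0.5 * e + 0.8 * a + random.gauss(0, 1) for e, a in zip(educ, abil)]

def slope(x, z):
    """OLS slope from a simple regression of z on x."""
    xbar = sum(x) / len(x)
    zbar = sum(z) / len(z)
    num = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

b_short = slope(educ, y)  # ability omitted from the regression
print(round(b_short, 2))  # noticeably above the true value 0.5
```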
Nonlinear Transformations

• Equation for General Linear Regression:

Predicted Y = a + b1X1 + b2X2 + … + bkXk

• General linear regression does not require that any of the


variables be the original variables in the dataset.
• Often, the variables being used are transformed variables.
• Nonlinear transformations are used whenever curvature is
detected in scatterplots.
• Either the dependent, or the independent, or all of the
variables can be transformed.
• Typical nonlinear transformations are: logarithm, square
root, the reciprocal (1/x), and the square.
Multiple Regression
Analysis: Estimation
• Using quadratic functional forms
• Example: Family income and family consumption
  consumption = β0 + β1·income + β2·income² + u
  where u contains other factors.
– The model has two explanatory variables: income and income squared.
– Consumption is explained as a quadratic function of income.
– One has to be very careful when interpreting the coefficients: by how much consumption increases if income is increased by one unit depends on how much income is already there.
Multiple Regression Analysis:
Further Issues
• Models with interaction terms: an interaction term is the product of two regressors, e.g., square footage × number of bedrooms, so that the effect of the number of bedrooms depends on the level of square footage.
• Interaction effects complicate the interpretation of parameters: the coefficient on the number of bedrooms alone gives its effect for a square footage of zero.
Multiple Regression Analysis:
Further Issues
• Adjusted R-squared
– A better goodness-of-fit estimate uses the correct degrees of freedom of the numerator and the denominator:
  adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
– The adjusted R-squared imposes a penalty for adding new regressors.
– The adjusted R-squared increases if, and only if, the t-statistic of a newly added regressor is greater than one in absolute value.
• Relationship between R-squared and adjusted R-squared: the adjusted R-squared may even become negative.
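The penalty and the possibility of a negative value are easy to see in a small sketch; all input values are invented:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.30, 50, 3), 3))  # modest penalty
print(round(adjusted_r2(0.05, 20, 6), 3))  # negative adjusted R-squared
```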
Multiple Regression Analysis:
Further Issues
• Comparing nested models: F-test (ANOVA)
• Using adjusted R-squared to choose between nonnested models
– Models are nonnested if neither model is a special case of the other.
– A comparison of the plain R-squared of both models would be unfair to the model with fewer parameters.
– In the given example, even after adjusting for the difference in degrees of freedom, the quadratic model is preferred.
Variable Selection:
Stepwise Regression

• The user first identifies the response, y, and the set of potentially important independent variables, x1, x2, ..., xk, where k is generally large. The response and independent variables are then entered into the computer software, and the stepwise procedure (aka mixed selection) begins.
• The result of the stepwise procedure is a model containing only those terms with t-values that are significant at the specified α level.
• Thus, in most practical situations, only several of the large number of independent variables remain.
• Parsimony guides you to select the regression model with the fewest independent variables that can predict the dependent variable adequately.
Multiple Regression Analysis:
Qualitative Information
• Qualitative Information
– Examples: gender, race, industry, region, rating grade, ...
– A way to incorporate qualitative information is to use dummy variables.
– They may appear as the dependent or as independent variables.
• A single dummy independent variable
– Dummy variable: = 1 if the person is a woman, = 0 if the person is a man (an arbitrary assignment).
– Its coefficient is the wage gain/loss if the person is a woman rather than a man (holding other things fixed).
Multiple Regression Analysis:
Qualitative Information
• Dummy variable trap: a model that includes a dummy for every category plus an intercept cannot be estimated (perfect collinearity).
• When using dummy variables, one category always has to be omitted; the omitted category (e.g., men, or women) becomes the base category.

Multiple Regression Analysis:
Qualitative Information
• Using dummy variables for multiple categories
– 1) Define membership in each category by a dummy variable.
– 2) Leave out one category (which becomes the base category).
– Example: holding other things fixed, married women earn 19.8% less than single men (the base category).
Multiple Regression Analysis:
Qualitative Information
• A binary dependent variable: the linear probability model (LPM)
• Linear regression when the dependent variable is binary: if the dependent variable only takes on the values 1 and 0, the coefficients describe the effect of the explanatory variables on the probability that y = 1.
Extrapolation

Extrapolation is predicting y for values of the independent variables that lie outside the region in which the model was developed. This is a dangerous practice.
Regression Analysis: Statistical Inference
Hypothesis Tests for the
Regression Coefficients

• An important piece of information in regression


outputs: The t-values for the individual regression
coefficients.
• Each t-value is the ratio of the estimated coefficient to
its standard error and indicates how many standard
errors the regression coefficient is from zero.
• A t-value can be used in a hypothesis test for the corresponding regression coefficient: if a variable's coefficient is zero, there is no point in including this variable in the equation.
• To run this test, compare the t-value in the regression output with a tabulated t-value and reject the null hypothesis if the output t-value is greater in absolute value.
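A sketch of this comparison, with an invented coefficient, standard error, and a tabulated critical value (two-tailed, α = 0.05, df = 28):

```python
def t_value(coef, std_err):
    """t statistic for H0: the regression coefficient equals zero."""
    return coef / std_err

t = t_value(1.8, 0.6)   # hypothetical coefficient and standard error
t_crit = 2.048          # tabulated two-tailed t, alpha = 0.05, df = 28
print(abs(t) > t_crit)  # True: reject H0, keep the variable
```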
A Test for the Overall Fit: The ANOVA
Table

• It is conceivable that none of the variables in the regression


equation explains the dependent variable.
• A first indication of this problem is a very small R² value.
• Another way to say this is that the same value of Y will be predicted regardless of the values of the Xs.
• Hypotheses for ANOVA test: The null hypothesis is that all
coefficients of the explanatory variables are zero. The alternative
is that at least one of these coefficients is not zero.
• Two ways to test the hypotheses:
– Individual t-values (small, or statistically insignificant).
– F test (ANOVA test): A formal procedure for testing
whether the explained variation is large compared to
the unexplained variation.
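In this document's notation (SSR = explained, SSE = residual), the overall F statistic compares explained variation per regressor with unexplained variation per remaining degree of freedom. A sketch with invented ANOVA numbers:

```python
def f_statistic(ssr, sse, n, k):
    """Overall F: explained variation per regressor over residual per df."""
    return (ssr / k) / (sse / (n - k - 1))

# Hypothetical ANOVA numbers: SSR = 120 (explained), SSE = 80 (residual)
F = f_statistic(120, 80, 30, 3)
print(round(F, 2))  # compare with the tabulated F(3, 26) critical value
```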
ANOVA Table Elements

• F-value has an associated p-value that allows you to run the test easily;
it is reported in most regression outputs.
Example: Estimation and
Prediction: CI vs. PI
Homework Assignment (see the R script!)

• Reading: "Basic Business Statistics: Concepts and Applications"
– Chapter 13: Sections 13.1, 13.2, 13.5
– Chapter 14: Sections 14.2, 14.6, 14.7
– Chapter 15: all sections
