Linear regression
From Wikipedia, the free encyclopedia
"Line of best fit" redirects here. For the song "Line of Best Fit" by Death Cab for Cutie, see You
Can Play these Songs with Chords.
In statistics, linear regression is a regression method of modeling the conditional expected value
of one variable y given the values of some other variable or variables x.
Linear regression is called "linear" because the relation of the response to the explanatory
variables is assumed to be a linear function of some parameters. It is often erroneously thought
that the reason the technique is called "linear regression" is that the graph of $y = \alpha + \beta x$ is a line.
But in fact, if the model is (for example)
$$y = \alpha + \beta x + \gamma x^2$$
(in which case we have put the vector $(x, x^2)$ in the role formerly played by $x$, and the vector
$(\beta, \gamma)$ in the role formerly played by $\beta$), then the problem is still one of linear regression, even
though the graph is not a straight line.
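As a minimal sketch of this point, the quadratic model can be fitted with an ordinary linear least-squares solver by treating x and x^2 as two explanatory variables. The data values and variable names below are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data generated exactly from y = 1 + 2x + 3x^2 (no noise).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with columns 1, x, x^2: the model is linear in the
# parameters (alpha, beta, gamma) even though it is not linear in x.
X = np.column_stack([np.ones_like(x), x, x**2])

# An ordinary linear least-squares solve recovers the three parameters.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta, gamma = coef
```

Because the data are noise-free, the fitted coefficients agree with the generating values up to rounding error.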
Regression models which are not a linear function of the parameters are called nonlinear
regression models (for example, a multi-layer artificial neural network).
Still more generally, regression may be viewed as a special case of density estimation. The joint
file:///C|/Documents%20and%20Settings/user/My%20Do...n%20-%20Wikipedia,%20the%20free%20encyclopedia.htm (1 of 16)1/25/2007 10:17:47 AM
distribution of the response and explanatory variables can be constructed from the conditional
distribution of the response variable and the marginal distribution of the explanatory variables. In
some problems, it is convenient to work in the other direction: from the joint distribution, the
conditional distribution of the response variable can be derived. Regression lines can be
extrapolated, where the line is extended to fit the model for values of the explanatory variables
outside their original range. However, extrapolation may be very inaccurate and can be relied on
only in certain instances.
Contents
1 Historical remarks
2 Naming conventions
3 Statement of the simple linear regression model
4 Parameter estimation
4.1 Robust regression
4.2 Summarizing the data
4.3 Estimating beta (the slope)
4.4 Estimating alpha (the intercept)
4.5 Displaying the residuals
4.6 Ancillary statistics
5 Multiple linear regression
5.1 Polynomial fitting
5.2 Correlation coefficient
6 Examples
6.1 Medicine
6.2 Finance
7 See also
8 References
8.1 Historical
8.2 Modern theory
9 References
10 External links
Historical remarks
The earliest form of linear regression was the method of least squares, which was published by
Legendre in 1805, and by Gauss in 1809. The term "least squares" is from Legendre's term,
moindres carrés. However, Gauss claimed that he had known the method since 1795.
Legendre and Gauss both applied the method to the problem of determining, from astronomical
observations, the orbits of bodies about the sun. Euler had worked on the same problem (1748)
without success. Gauss published a further development of the theory of least squares in 1821,
including a version of the Gauss-Markov theorem.
The term "reversion" was used in the nineteenth century to describe a biological phenomenon,
namely that the progeny of exceptional individuals tend on average to be less exceptional than
their parents, and more like their more distant ancestors. Francis Galton studied this phenomenon,
and applied the slightly misleading term "regression towards mediocrity" to it (parents of
exceptional individuals also tend on average to be less exceptional than their children). For
Galton, regression had only this biological meaning, but his work (1877, 1885) was extended by
Karl Pearson and Udny Yule to a more general statistical context (1897, 1903). In the work of
Pearson and Yule, the joint distribution of the response and explanatory variables is assumed to be
Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925. Fisher
assumed that the conditional distribution of the response variable is Gaussian, but the joint
distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of
1821.
Naming conventions
The measured variable, y, is conventionally called the "response variable". The terms "endogenous
variable," "output variable," "criterion variable," and "dependent variable" are also used. The
controlled or manipulated variables, x, are called the explanatory variables. The terms "exogenous
variables," "input variables," "predictor variables" and "independent variables" are also used.
The terms independent or explanatory variable suggest that the variables are statistically
independent, which is not what the terms describe. These terms may also convey that the value of
the independent/explanatory variable can be chosen at will. The response variable is then seen as
an effect, that is, causally dependent on the independent/explanatory variable, as in a
stimulus-response model. Although many linear regression models are formulated as models of cause and
effect, the direction of causation may just as well go the other way, or indeed there need not be
any causal relation at all. For that reason, one may prefer the terms "predictor / response" or
"endogenous / exogenous," which do not imply causality. (See correlation implies causation for
more on the topic of cause and effect relationships in correlative designs.)
The explanatory and response variables may be scalars or vectors. Multiple linear regression
includes cases with more than one explanatory variable.
Statement of the simple linear regression model
The simple linear regression model is typically stated in the form
$$y = \alpha + \beta x + \varepsilon.$$
The right-hand side may take more general forms, but generally comprises a linear combination of
the parameters, here denoted $\alpha$ and $\beta$. The term $\varepsilon$ represents the unpredicted or unexplained
variation in the response variable; it is conventionally called the "error" whether it is really a
measurement error or not, and is assumed to be independent of x. The error term is conventionally
assumed to have expected value equal to zero (a nonzero expected value could be absorbed into
$\alpha$). See also errors and residuals in statistics; the difference between an error and a residual is also
dealt with below.
An equivalent formulation which explicitly shows the linear regression as a model of conditional
expectation is
$$\operatorname{E}(y \mid x) = \alpha + \beta x,$$
with the conditional distribution of y given x essentially the same as the distribution of the error
term (shifted by the conditional mean $\alpha + \beta x$).
A linear regression model need not be affine, let alone linear, in the explanatory variables x. For
example,
$$y = \alpha + \beta x + \gamma x^2 + \varepsilon$$
is a linear regression model, for the right-hand side is a linear combination of the parameters $\alpha$, $\beta$,
and $\gamma$. In this case it is useful to think of $x^2$ as a new explanatory variable, formed by modifying
the original variable x. Indeed, any linear combination of functions f(x), g(x), h(x), ..., is a linear
regression model, so long as these functions do not have any free parameters (otherwise the model
is generally a nonlinear regression model). The least-squares estimates of $\alpha$, $\beta$, and $\gamma$ are linear in
the response variable y, and nonlinear in x (they are nonlinear in x even if the $\gamma$ and $\alpha$ terms are
absent; if only $\beta$ were present then doubling all observed x values would multiply the least-squares
estimate of $\beta$ by 1/2).
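The last remark can be checked numerically. The sketch below assumes the slope-only model y = beta * x + error, for which the least-squares estimate is sum(x*y) / sum(x*x); the function name and the data are hypothetical.

```python
def slope_through_origin(xs, ys):
    """Least-squares estimate of beta in the model y = beta * x + error."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

b = slope_through_origin(xs, ys)

# Linear in y: doubling every y value doubles the estimate.
b_double_y = slope_through_origin(xs, [2 * y for y in ys])

# Nonlinear in x: doubling every x value halves the estimate.
b_double_x = slope_through_origin([2 * x for x in xs], ys)
```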
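Often in linear regression problems statisticians rely on the Gauss-Markov assumptions:
The random errors $\varepsilon_i$ have expected value 0.
The random errors $\varepsilon_i$ are uncorrelated (this is weaker than an assumption of probabilistic
independence).
The random errors $\varepsilon_i$ are "homoskedastic", i.e., they all have the same variance.
(See also Gauss-Markov theorem. That result says that under the assumptions above, least-squares
estimators are optimal in the sense of having the smallest variance among linear unbiased estimators.)
Sometimes a stronger set of assumptions is relied on: the random errors $\varepsilon_i$ are independent and
normally distributed, with expected value 0 and a common variance $\sigma^2$.
If $x_i$ is a vector, we can take the product $\beta x_i$ to be a scalar product (see "dot product").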
A statistician will usually estimate the unobservable values of the parameters $\alpha$ and $\beta$ by the
method of least squares, which consists of finding the values of a and b that minimize the sum of
squares of the residuals
$$e_i = y_i - (a + b x_i).$$
Those values of a and b are the least-squares estimates of $\alpha$ and $\beta$, denoted $\hat{\alpha}$ and $\hat{\beta}$. The residuals
may be regarded as estimates of the errors; see also errors and residuals in statistics.
Notice that, whereas the errors are independent, the residuals cannot be independent because the
use of least-squares estimates implies that the sum of the residuals must be 0, and the scalar
product of the vector of residuals with the vector of x-values must be 0, i.e., we must have
$$e_1 + \cdots + e_n = 0$$
and
$$e_1 x_1 + \cdots + e_n x_n = 0.$$
These two linear constraints imply that the vector of residuals must lie within a certain
$(n - 2)$-dimensional subspace of $\mathbb{R}^n$; hence we say that there are "$n - 2$ degrees of freedom for error". If
one assumes the errors are normally distributed and independent, then it can be shown to follow
that the sum of squares of residuals
$$e_1^2 + \cdots + e_n^2$$
is distributed as
$$\sigma^2 \chi^2_{n-2}.$$
So we have:
the sum of squares of residuals, divided by the error variance $\sigma^2$, has a chi-square distribution
with $n - 2$ degrees of freedom,
the sum of squares of residuals is actually probabilistically independent of the estimates
$\hat{\alpha}$ and $\hat{\beta}$ of the parameters $\alpha$ and $\beta$.
These facts make it possible to use Student's t-distribution with $n - 2$ degrees of freedom (so
named in honor of the pseudonymous "Student") to find confidence intervals for $\alpha$ and $\beta$.
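The two constraints on the residuals, and the variance estimate with n - 2 degrees of freedom that feeds the t-based intervals, can be illustrated with a short sketch. The data and the helper name fit_line are hypothetical; obtaining the actual t quantile would need a statistics library and is not shown.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the model y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
n = len(xs)

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# The two linear constraints satisfied by least-squares residuals.
sum_resid = sum(residuals)                               # should be ~0
dot_resid_x = sum(e * x for e, x in zip(residuals, xs))  # should be ~0

# Sum of squared residuals, and the error-variance estimate that divides
# by the n - 2 degrees of freedom left after estimating two parameters.
ssr = sum(e * e for e in residuals)
sigma2_hat = ssr / (n - 2)
```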
Parameter estimation
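By recognizing that the regression model $y_i = \alpha + \beta x_i + \varepsilon_i$ ($i = 1, \ldots, n$) is a system of linear
equations we can express the model using data matrix X, target vector Y and parameter vector $\delta$.
The ith row of X and Y will contain the x and y value for the ith data sample. Then the model can
be written as
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \text{i.e.} \quad Y = X\delta + \varepsilon,$$
where $\varepsilon$ is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance
$\sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix. The vector $X\hat{\delta}$ (where (remember) $\hat{\delta}$ is the vector of
estimates of the components of $\delta$) is then the orthogonal projection of Y onto the column space of
X.
Basic matrix algebra yields
$$\hat{\delta} = (X^\top X)^{-1} X^\top Y$$
(where $X^\top$ is the transpose of X), so that the vector of residuals is
$$Y - X\hat{\delta} = \left(I_n - X(X^\top X)^{-1}X^\top\right) Y.$$
The fact that the matrix $X(X^\top X)^{-1}X^\top$ is a symmetric idempotent matrix is incessantly relied on in
proofs of theorems. The linearity of $\hat{\delta}$ as a function of the vector Y
is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of
estimation.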
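A small numerical sketch of the matrix formulation follows. Here delta_hat stands for the vector of estimated parameters (intercept and slope) and H for the projection matrix X (X^T X)^{-1} X^T; these names and the data are illustrative assumptions, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.2, 2.8, 4.1, 4.9, 6.0])
n = len(x)

# Data matrix: a column of ones (intercept) and the x values (slope).
X = np.column_stack([np.ones(n), x])

# Estimate obtained by solving the normal equations (X^T X) delta = X^T Y.
delta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Projection ("hat") matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)

fitted = H @ Y                      # orthogonal projection of Y onto col(X)
residuals = (np.eye(n) - H) @ Y     # component of Y orthogonal to col(X)
```

Solving the normal equations directly is fine for a small, well-conditioned example like this one; it is not meant as a recommendation for ill-conditioned problems.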
The matrix $I_n - X(X^\top X)^{-1}X^\top$ that appears above is a symmetric idempotent matrix of rank $n - 2$.
Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional
spectral theorem of linear algebra says that any real symmetric matrix M can be diagonalized by
an orthogonal matrix G, i.e., the matrix $G^\top M G$ is a diagonal matrix. If the matrix M is also
idempotent, then the diagonal entries in $G^\top M G$ must be idempotent numbers. Only two real
numbers are idempotent: 0 and 1. So $I_n - X(X^\top X)^{-1}X^\top$, after diagonalization, has $n - 2$ 1s and two 0s
on the diagonal (the number of 1s equals the rank). That is most of the work in showing that the sum
of squares of residuals has a chi-square distribution with $n - 2$ degrees of freedom.
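The eigenvalue argument can be checked numerically. The sketch below builds the residual-forming matrix I - X (X^T X)^{-1} X^T for a hypothetical design with an intercept and one explanatory variable, and counts its eigenvalues; the variable names are illustrative and NumPy is assumed.

```python
import numpy as np

n = 8
x = np.arange(n, dtype=float)           # hypothetical x values 0, 1, ..., 7
X = np.column_stack([np.ones(n), x])    # intercept column and x column

H = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the column space of X
M = np.eye(n) - H                       # symmetric idempotent matrix

# eigvalsh is the eigenvalue routine for symmetric matrices.
eigvals = np.linalg.eigvalsh(M)
num_ones = int(np.sum(np.isclose(eigvals, 1.0)))
num_zeros = int(np.sum(np.isclose(eigvals, 0.0)))
```

For an idempotent matrix the trace equals the rank, so the trace of M is another quick way to read off the n - 2 degrees of freedom.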
Regression parameters can also be estimated by Bayesian methods. This has the advantages that
confidence intervals can be produced for parameter estimates without the use of asymptotic
approximations,
prior information can be incorporated into the analysis.
Suppose that in the linear regression $y = \alpha + \beta x + \varepsilon$
we know from domain knowledge that alpha can only take one of the values $\{-1, +1\}$ but we do
not know which. We can build this information into the analysis by choosing a prior for alpha
which is a discrete distribution with a probability of 0.5 on $-1$ and 0.5 on $+1$. The posterior for
alpha will also be a discrete distribution on $\{-1, +1\}$, but the probability weights will change to
reflect the evidence from the data.
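A minimal sketch of this discrete-prior update is given below. To keep it short it assumes, beyond what the text states, that the slope beta and the error standard deviation sigma are known and that the errors are independent and normal; the function name and the data are hypothetical.

```python
import math


def posterior_alpha(xs, ys, beta, sigma, prior):
    """Posterior probabilities for alpha under a discrete prior.

    prior maps each candidate alpha to its prior probability. beta and
    sigma are treated as known, which is an assumption of this sketch.
    """
    log_post = {}
    for alpha, p in prior.items():
        sse = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
        # Log prior plus normal log-likelihood (constants cancel on normalizing).
        log_post[alpha] = math.log(p) - sse / (2.0 * sigma**2)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post.values())
    weights = {a: math.exp(lp - m) for a, lp in log_post.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.3, 2.8, 5.4, 6.9]

prior = {-1.0: 0.5, 1.0: 0.5}
post = posterior_alpha(xs, ys, beta=2.0, sigma=1.0, prior=prior)
```

With these data the posterior weight moves almost entirely onto alpha = +1, while still summing to one over the two candidate values.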
In modern computer applications, the least-squares estimate is usually calculated using the QR
decomposition, or somewhat more elaborate methods when $X^\top X$ is nearly singular. The MATLAB
backslash (\) operator is a good example of a robust implementation.
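As a rough sketch of the QR approach (using NumPy rather than MATLAB, with made-up data), the estimate can be obtained without ever forming X^T X:

```python
import numpy as np

# Hypothetical observations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
X = np.column_stack([np.ones_like(x), x])

# Reduced QR decomposition: X = Q R, where Q has orthonormal columns
# and R is upper triangular.
Q, R = np.linalg.qr(X)

# The least-squares problem reduces to the small system R delta = Q^T Y.
# (np.linalg.solve is a general solver; a dedicated triangular solver
# would exploit the structure of R.)
delta_qr = np.linalg.solve(R, Q.T @ Y)

# For comparison: NumPy's SVD-based least-squares routine, which also
# copes with rank-deficient or nearly singular X.
delta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```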
Robust regression
A host of alternative approaches to the computation of regression parameters are included in the
category known as robust regression. One technique minimizes the mean absolute error, or some
other function of the residuals, instead of the mean squared error as in ordinary linear regression.
Robust regression is much more computationally intensive than linear regression and is somewhat more
difficult to implement as well. Least-squares estimates are not very sensitive to departures from
normality of the errors, but they can be badly affected when the variance or mean of the error
distribution is not bounded, or when no analyst is available to identify outliers.
In the Stata culture, robust regression means linear regression with Huber-White standard error
estimates. This relaxes the assumption of homoskedasticity for the variance estimates only; the
coefficient estimates are still ordinary least squares (OLS) estimates.
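The Huber-White idea can be sketched as follows. This uses the basic "HC0" sandwich form, (X^T X)^{-1} X^T diag(e_i^2) X (X^T X)^{-1}; statistical packages often apply additional finite-sample corrections, so their reported numbers may differ. The data and variable names are hypothetical, and NumPy is assumed.

```python
import numpy as np

# Hypothetical observations whose scatter grows with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.1, 2.3, 2.8, 4.5, 4.6, 6.9, 6.2, 9.4])
n = len(x)
X = np.column_stack([np.ones(n), x])

# The coefficient estimates are plain OLS.
coef = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ coef

XtX_inv = np.linalg.inv(X.T @ X)

# Classical covariance, assuming a common error variance.
sigma2_hat = resid @ resid / (n - X.shape[1])
cov_classical = sigma2_hat * XtX_inv

# HC0 sandwich covariance: each observation contributes its own
# squared residual instead of a shared variance estimate.
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc0 = XtX_inv @ meat @ XtX_inv

se_classical = np.sqrt(np.diag(cov_classical))
se_hc0 = np.sqrt(np.diag(cov_hc0))
```

Only the standard errors change between the two calculations; the coefficient vector coef is the same in both.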
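Summarizing the data
We sum the observations, the squares of the x values and the products xy to obtain the following
quantities:
$$S_X = x_1 + x_2 + \cdots + x_n$$
and $S_Y$ similarly;
$$S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2, \qquad S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n.$$
Estimating beta (the slope)
We use the summary statistics above to calculate $\hat{\beta}$, the estimate of $\beta$:
$$\hat{\beta} = \frac{n S_{XY} - S_X S_Y}{n S_{XX} - S_X^2}.$$
Estimating alpha (the intercept)
We use the estimate of $\beta$ and the other statistics to estimate $\alpha$ by
$$\hat{\alpha} = \frac{S_Y - \hat{\beta} S_X}{n}.$$
A consequence of this estimate is that the regression line will always pass through the "center"
$(\bar{x}, \bar{y}) = (S_X / n, S_Y / n)$.
Displaying the residuals
We plot the residuals $e_i = y_i - \hat{\alpha} - \hat{\beta} x_i$
against the explanatory variable, x. There should be no discernible trend or pattern if the model is
satisfactory for these data. Some of the possible problems are: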
Residuals that increase (or decrease) as the explanatory variable increases indicate mistakes
in the calculations, because least-squares residuals are by construction uncorrelated with x. Find
the mistakes and correct them.
Residuals that first rise and then fall (or first fall and then rise) indicate that the appropriate
model is (at least) quadratic. Adding a quadratic term (and then possibly higher-order terms) to the
model may be appropriate. See nonlinear regression and multiple linear regression.
One residual that is much larger than the others suggests that there is one unusual
observation which is distorting the fit. Either verify its value before publishing, or
eliminate it, document your decision to do so, and recalculate the statistics.
Studentized residuals can be used in outlier detection.
A vertical spread of the residuals that increases as the explanatory variable increases (a
funnel-shaped plot) indicates that the homoskedasticity assumption is violated (i.e. there is
heteroskedasticity: the variability of the response depends on the value of x). Either
transform the data (for example, the logarithm or logit transformations are often
useful), or use a more general modeling approach that can account for non-constant variance,
for example a general linear model or a generalized linear model.
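The closed-form slope and intercept estimates, and the "fall then rise" residual pattern of a mis-specified straight-line fit, can be illustrated together. The sketch uses the standard summary-sum formulas for simple linear regression, beta_hat = (n S_XY - S_X S_Y) / (n S_XX - S_X^2) and alpha_hat = (S_Y - beta_hat S_X) / n; the function name and the deliberately quadratic data are hypothetical.

```python
def fit_from_sums(xs, ys):
    """Slope and intercept of the least-squares line from summary sums."""
    n = len(xs)
    s_x = sum(xs)
    s_y = sum(ys)
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    beta_hat = (n * s_xy - s_x * s_y) / (n * s_xx - s_x * s_x)
    alpha_hat = (s_y - beta_hat * s_x) / n
    return alpha_hat, beta_hat


# Hypothetical data that actually follow y = x^2, not a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x * x for x in xs]

alpha_hat, beta_hat = fit_from_sums(xs, ys)   # -2.0 and 4.0 for these data

# Residuals of the straight-line fit: [2, -1, -2, -1, 2], i.e. they
# fall and then rise, hinting that a quadratic term is missing.
residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(xs, ys)]

# The fitted line passes through the "center" (mean of x, mean of y).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
value_at_center = alpha_hat + beta_hat * x_bar
```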
Ancillary statistics
The sum of squared deviations can be partitioned as in ANOVA to indicate what part of the
dispersion of the response variable is explained by the explanatory variable.
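The correlation coefficient, r, can be calculated by
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n \sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n \sum y_i^2 - \left(\sum y_i\right)^2\right)}}.$$
This statistic is a measure of how well a straight line describes the data. Values near zero suggest
that the model is ineffective. $r^2$ is frequently interpreted as the fraction of the variability explained
by the explanatory variable, X.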
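A short sketch ties these two statistics together: it computes r from raw sums and checks that r^2 equals the explained share of the total sum of squared deviations. The data and names are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r computed from raw sums."""
    n = len(xs)
    s_x, s_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs)
    s_yy = sum(y * y for y in ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * s_xy - s_x * s_y) / math.sqrt(
        (n * s_xx - s_x**2) * (n * s_yy - s_y**2)
    )


# Hypothetical, nearly linear observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(xs)

r = correlation(xs, ys)

# Least-squares line, needed for the partition of the sum of squares.
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

# ANOVA-style partition: total = explained + residual.
ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_explained = sum((f - y_bar) ** 2 for f in fitted)
ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
```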
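Multiple linear regression
Linear regression can be extended to functions of two or more variables, for example
$$Z = \alpha + \beta X + \gamma Y + \varepsilon.$$
Here X and Y are explanatory variables. The values of the parameters $\alpha$, $\beta$ and $\gamma$ are estimated by
the method of least squares, which minimizes the sum of squares of the residuals
$$S_r = \sum_{i=1}^{n} (Z_i - a - b X_i - c Y_i)^2.$$
Taking the derivatives of $S_r$ with respect to a, b and c and setting them equal to zero leads to a system of
three linear equations, i.e. the normal equations, to find estimates of the parameters $\alpha$, $\beta$ and $\gamma$:
$$\begin{bmatrix} n & \sum X_i & \sum Y_i \\ \sum X_i & \sum X_i^2 & \sum X_i Y_i \\ \sum Y_i & \sum X_i Y_i & \sum Y_i^2 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} \sum Z_i \\ \sum X_i Z_i \\ \sum Y_i Z_i \end{bmatrix}.$$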
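The three normal equations can be set up and solved directly. The sketch assumes the two-variable model is written Z = alpha + beta*X + gamma*Y + error (this assignment of symbols is an assumption of the example), uses made-up noise-free data, and relies on NumPy.

```python
import numpy as np

# Hypothetical explanatory variables (not collinear) and a noise-free response.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Z = 1.0 + 2.0 * X - 3.0 * Y

n = len(Z)

# Coefficient matrix and right-hand side of the normal equations, obtained
# by setting the derivatives of the sum of squared residuals to zero.
A = np.array([
    [n,       X.sum(),       Y.sum()],
    [X.sum(), (X * X).sum(), (X * Y).sum()],
    [Y.sum(), (X * Y).sum(), (Y * Y).sum()],
])
rhs = np.array([Z.sum(), (X * Z).sum(), (Y * Z).sum()])

alpha_hat, beta_hat, gamma_hat = np.linalg.solve(A, rhs)
```

Because the response was generated without noise, the solution reproduces the generating coefficients up to rounding error.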
Polynomial fitting
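A polynomial fit is a specific type of multiple regression. The simple regression model (a
first-order polynomial) can be trivially extended to higher orders. The regression model
$$y_i = \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 + \cdots + \alpha_m x_i^m + \varepsilon_i$$
is a system of polynomial equations of
order m with polynomial coefficients $\{\alpha_0, \ldots, \alpha_m\}$. As before, we can express the model using
data matrix X, target vector Y and parameter vector $\delta = (\alpha_0, \ldots, \alpha_m)^\top$. The ith row of X and Y will contain the x
and y value for the ith data sample. Then the model can be written as a system of linear equations:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{bmatrix} \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_m \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
The simple regression model is a special case of a polynomial fit, where the polynomial order
m = 1.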
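For a general order m, the data matrix is a Vandermonde matrix, which NumPy can build directly. The following sketch fits a hypothetical noise-free cubic and cross-checks the result against numpy.polyfit; the data and names are illustrative.

```python
import numpy as np

# Hypothetical data from the cubic y = 0.5 - x + 0.25 x^3 (no noise).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 0.5 - 1.0 * x + 0.25 * x**3
m = 3  # polynomial order

# Data matrix with columns 1, x, x^2, ..., x^m.
X = np.vander(x, m + 1, increasing=True)

# Linear least squares for the coefficients alpha_0, ..., alpha_m.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# numpy.polyfit returns the highest-order coefficient first, so reverse it
# before comparing.
coef_polyfit = np.polyfit(x, y, m)[::-1]
```

Setting m = 1 in the same code gives the simple regression line.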
Correlation coefficient
The squared multiple correlation coefficient, $R^2$, is a measure of the proportion of variability
explained by, or due to, the regression (linear relationship) in a sample of paired data. It is a
number between zero and one, and a value close to zero suggests a poor model. [1]
Examples
Linear regression is widely used in biological, behavioural and social sciences to describe
relationships between variables. It ranks as one of the most important tools used in these
disciplines.
Medicine
As one example, early evidence relating tobacco smoking to mortality and morbidity came from
studies employing regression. Researchers usually include several variables in their regression
analysis in an effort to remove factors that might produce spurious correlations. For the cigarette
smoking example, researchers might include socio-economic status in addition to smoking to
ensure that any observed effect of smoking on mortality is not due to some effect of education or
income. However, it is never possible to include all possible confounding variables in a study
employing regression. For the smoking example, a hypothetical gene might increase mortality and
also cause people to smoke more. For this reason, randomized controlled trials are considered to
be more trustworthy than a regression analysis.
Finance
Linear regression underlies the capital asset pricing model, and the concept of using Beta for
analyzing and quantifying the systematic risk of an investment comes directly from the Beta
coefficient of the linear regression model that relates the return on the investment to the return on
all risky assets.
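As an illustration only, the Beta coefficient is the slope of the least-squares line of the investment's returns on the market's returns. The return series below are invented, the function name is hypothetical, and the returns are assumed to be already expressed in excess of the risk-free rate.

```python
def regression_beta(market_returns, asset_returns):
    """Slope of the least-squares line of asset returns on market returns."""
    n = len(market_returns)
    m_bar = sum(market_returns) / n
    a_bar = sum(asset_returns) / n
    cov = sum(
        (m - m_bar) * (a - a_bar)
        for m, a in zip(market_returns, asset_returns)
    )
    var = sum((m - m_bar) ** 2 for m in market_returns)
    return cov / var


# Hypothetical per-period excess returns on the market portfolio.
market = [0.01, -0.02, 0.03, 0.015, -0.01]

# Hypothetical asset that moves 1.5 times as much as the market,
# plus a small constant offset.
asset = [0.001 + 1.5 * m for m in market]

beta = regression_beta(market, asset)
```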
See also
Econometrics
Regression analysis
Robust regression
Least squares
Median-median line
Instrumental variable
Hierarchical linear modeling
Empirical Bayes methods
References
Historical
A.M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes (1805).
"Sur la Méthode des moindres quarrés" appears as an appendix.
C.F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem
Ambientum. (1809)
C.F. Gauss. Theoria combinationis observationum erroribus minimis obnoxiae.
(1821/1823)
Charles Darwin. The Variation of Animals and Plants under Domestication. (1869)
(Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the
term "reversion".)
Francis Galton. "Typical laws of heredity", Nature 15 (1877), 492-495, 512-514, 532-533.
(Galton uses the term "reversion" in this paper, which discusses the size of peas.)
Francis Galton. Presidential address, Section H, Anthropology. (1885) (Galton uses the
term "regression" in this paper, which discusses the height of humans.)
Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the
Anthropological Institute, 15:246-263 (1886). (Facsimile at: [2])
G. Udny Yule. "On the Theory of Correlation", J. Royal Statist. Soc., 1897, p. 812-54.
Karl Pearson, G. U. Yule, Norman Blanchard, and Alice Lee. "The Law of Ancestral
Heredity", Biometrika (1903)
R.A. Fisher. "The goodness of fit of regression formulae, and the distribution of regression
coefficients", J. Royal Statist. Soc., 85, 597-612 (1922)
R.A. Fisher. Statistical Methods for Research Workers (1925)
Modern theory
References
External links