Field data is often accompanied by noise. Even though all control parameters (independent
variables) remain constant, the resultant outcomes (dependent variables) vary. A process of
quantitatively estimating the trend of the outcomes, also known as regression or curve fitting,
therefore becomes necessary.
The curve fitting process fits equations of approximating curves to the raw field data.
However, for a given set of data, the fitting curves of a given type are generally not unique.
Thus, a curve with a minimal deviation from all data points is desired. This best-fitting curve can
be obtained by the method of least squares.
The method of least squares assumes that the best-fit curve of a given type is the curve that has
the minimal sum of the deviations squared (least square error) from a given set of data.
Suppose that the data points are (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x is the
independent variable and y is the dependent variable. The fitting curve f(x) has the deviation
(error) d_i from each data point, i.e., d_1 = y_1 - f(x_1), d_2 = y_2 - f(x_2), ...,
d_n = y_n - f(x_n). According to the method of least squares, the best fitting curve has the
property that

    Π = d_1² + d_2² + ... + d_n² = Σ [y_i - f(x_i)]² = a minimum

For the straight line f(x) = a + b x, please note that a and b are unknown coefficients while all
x_i and y_i are given. To obtain the least square error, the unknown coefficients a and b must
yield zero first derivatives:

    ∂Π/∂a = -2 Σ (y_i - a - b x_i) = 0

    ∂Π/∂b = -2 Σ (y_i - a - b x_i) x_i = 0

Expanding the above equations, we have:

    n a + b Σ x_i = Σ y_i

    a Σ x_i + b Σ x_i² = Σ x_i y_i
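As a concrete sketch (with invented data and our own variable names), the two expanded normal
equations for the straight line can be solved directly in pure Python:

```python
# Sketch: solving the two expanded normal equations for the best-fit line
# y = a + b*x in pure Python (the data and names are invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x with noise

n   = len(xs)
Sx  = sum(xs)
Sy  = sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations:  n*a + Sx*b = Sy   and   Sx*a + Sxx*b = Sxy.
det = n * Sxx - Sx * Sx
a = (Sy * Sxx - Sx * Sxy) / det
b = (n * Sxy - Sx * Sy) / det
```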
For a parabola f(x) = a + b x + c x², please note that a, b, and c are unknown coefficients
while all x_i and y_i are given. To obtain the least square error, the unknown coefficients a, b,
and c must yield zero first derivatives:

    ∂Π/∂a = -2 Σ (y_i - a - b x_i - c x_i²) = 0

    ∂Π/∂b = -2 Σ (y_i - a - b x_i - c x_i²) x_i = 0

    ∂Π/∂c = -2 Σ (y_i - a - b x_i - c x_i²) x_i² = 0

Expanding the above equations, we have

    n a + b Σ x_i + c Σ x_i² = Σ y_i

    a Σ x_i + b Σ x_i² + c Σ x_i³ = Σ x_i y_i

    a Σ x_i² + b Σ x_i³ + c Σ x_i⁴ = Σ x_i² y_i

The unknown coefficients a, b, and c can hence be obtained by solving the above linear
equations.
For a polynomial of degree k, f(x) = a_0 + a_1 x + ... + a_k x^k, please note that a_0, a_1, ...,
and a_k are unknown coefficients while all x_i and y_i are given. To obtain the least square
error, the unknown coefficients a_0, a_1, ..., and a_k must yield zero first derivatives.
Expanding the resulting equations, we have

    n a_0 + a_1 Σ x_i + ... + a_k Σ x_i^k = Σ y_i

    a_0 Σ x_i + a_1 Σ x_i² + ... + a_k Σ x_i^(k+1) = Σ x_i y_i

    ...

    a_0 Σ x_i^k + a_1 Σ x_i^(k+1) + ... + a_k Σ x_i^(2k) = Σ x_i^k y_i

The unknown coefficients a_0, a_1, ..., and a_k can hence be obtained by solving the above
linear equations.
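The same recipe mechanizes for higher degrees: build the matrix of power sums and solve the
linear system. Below is a pure-Python sketch for the parabola case; the data and helper names
are invented for illustration:

```python
# Sketch: least-squares parabola f(x) = a + b*x + c*x**2 via the 3x3 normal
# equations, solved with a tiny Gaussian elimination (invented data and names).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.2, 5.1, 9.9, 17.2]   # roughly 1 + x**2 with noise

def power_sum(p, with_y=False):
    # Sum of x^p, or of x^p * y when with_y is True.
    return sum(x ** p * (y if with_y else 1.0) for x, y in zip(xs, ys))

# Normal equations: M[i][j] = sum x^(i+j), v[i] = sum x^i * y.
M = [[power_sum(i + j) for j in range(3)] for i in range(3)]
v = [power_sum(i, with_y=True) for i in range(3)]

# Forward elimination, then back-substitution.
for k in range(3):
    for r in range(k + 1, 3):
        f = M[r][k] / M[k][k]
        M[r] = [mr - f * mk for mr, mk in zip(M[r], M[k])]
        v[r] -= f * v[k]
coef = [0.0, 0.0, 0.0]
for k in (2, 1, 0):
    coef[k] = (v[k] - sum(M[k][j] * coef[j] for j in range(k + 1, 3))) / M[k][k]
a, b, c = coef
```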
Multiple Regression
Multiple regression estimates outcomes (dependent variables) that may be affected by more than
one control parameter (independent variable), or cases where more than one control parameter is
changed at the same time.

An example is the case of two independent variables x1 and x2 and one dependent variable y in
the linear relationship:

    y = a + b x1 + c x2

For a given data set (x1_1, x2_1, y_1), ..., (x1_n, x2_n, y_n), where n >= 3, the best fitting
curve f(x1, x2) has the least square error, i.e.,

    Π = Σ [y_i - (a + b x1_i + c x2_i)]² = a minimum

Please note that a, b, and c are unknown coefficients while all x1_i, x2_i, and y_i are given.
To obtain the least square error, the unknown coefficients a, b, and c must yield zero first
derivatives:

    ∂Π/∂a = -2 Σ (y_i - a - b x1_i - c x2_i) = 0

    ∂Π/∂b = -2 Σ (y_i - a - b x1_i - c x2_i) x1_i = 0

    ∂Π/∂c = -2 Σ (y_i - a - b x1_i - c x2_i) x2_i = 0

Expanding the above equations, we have the linear system

    n a + b Σ x1_i + c Σ x2_i = Σ y_i

    a Σ x1_i + b Σ x1_i² + c Σ x1_i x2_i = Σ x1_i y_i

    a Σ x2_i + b Σ x1_i x2_i + c Σ x2_i² = Σ x2_i y_i

The unknown coefficients a, b, and c can hence be obtained by solving the above linear
equations.
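A minimal sketch of this two-variable case, solving the three normal equations by Cramer's rule
(u and v stand in for the two independent variables; the data and names are invented):

```python
# Sketch: multiple regression y = a + b*u + c*v via the 3x3 normal equations,
# solved with Cramer's rule (u, v stand in for the two independent variables;
# the data are invented so that y = 1 + 2u + 0.5v exactly).
us = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
vs = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
ys = [1.0 + 2.0 * u + 0.5 * v for u, v in zip(us, vs)]

n   = len(ys)
Su  = sum(us); Sv = sum(vs); Sy = sum(ys)
Suu = sum(u * u for u in us)
Svv = sum(v * v for v in vs)
Suv = sum(u * v for u, v in zip(us, vs))
Suy = sum(u * y for u, y in zip(us, ys))
Svy = sum(v * y for v, y in zip(vs, ys))

def det3(m):
    # Determinant of a 3x3 matrix, expanded along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

M   = [[n, Su, Sv], [Su, Suu, Suv], [Sv, Suv, Svv]]
rhs = [Sy, Suy, Svy]
D   = det3(M)

solution = []
for j in range(3):                       # Cramer: replace column j by rhs
    Mj = [row[:] for row in M]
    for i in range(3):
        Mj[i][j] = rhs[i]
    solution.append(det3(Mj) / D)
a, b, c = solution
```

Since the sample data lie exactly on a plane, the recovered coefficients should match the
generating values.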
Least Squares Fitting
A mathematical procedure for finding the best-fitting curve to a given set of points by
minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve.
The sum of the squares of the offsets is used instead of the offset absolute values because this
allows the residuals to be treated as a continuous differentiable quantity. However, because
squares of the offsets are used, outlying points can have a disproportionate effect on the fit, a
property which may or may not be desirable depending on the problem at hand.
In practice, the vertical offsets from a line (polynomial, surface, hyperplane, etc.) are almost
always minimized instead of the perpendicular offsets. This provides a fitting function for the
independent variable x that estimates y for a given x (most often what an experimenter wants),
allows uncertainties of the data points along the x- and y-axes to be incorporated simply, and
also provides a much simpler analytic form for the fitting parameters than would be obtained
using a fit based on perpendicular offsets. In addition, the fitting technique can be easily
generalized from a best-fit line to a best-fit polynomial when sums of vertical distances are used.
In any case, for a reasonable number of noisy data points, the difference between vertical and
perpendicular fits is quite small.
The linear least squares fitting technique is the simplest and most commonly applied form of
linear regression and provides a solution to the problem of finding the best fitting straight line
through a set of points. In fact, if the functional relationship between the two quantities being
graphed is known to within additive or multiplicative constants, it is common practice to
transform the data in such a way that the resulting line is a straight line, say by plotting T vs.
√ℓ instead of T vs. ℓ when analyzing the period T of a pendulum as a function of its length ℓ.
For this reason, standard forms for exponential, logarithmic, and power laws are often explicitly
computed. The formulas for linear least squares fitting were independently derived by Gauss and
Legendre.
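As a small illustration of such a transformation (using the pendulum relation
T = 2π√(ℓ/g); the data and variable names are our own):

```python
# Sketch: linearizing a power-law relationship and fitting a straight line to
# the transformed data (pendulum data generated from T = 2*pi*sqrt(L/g);
# all names here are illustrative).
import math

ls = [0.25, 0.5, 1.0, 2.0]                            # pendulum lengths (m)
ts = [2 * math.pi * math.sqrt(l / 9.81) for l in ls]  # periods (s)

# Plot ln T vs. ln L: the slope of the resulting straight line is the exponent.
X = [math.log(l) for l in ls]
Y = [math.log(t) for t in ts]

n = len(X)
b = ((n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y))
     / (n * sum(x * x for x in X) - sum(X) ** 2))
a = (sum(Y) - b * sum(X)) / n
# b should recover the exponent 1/2 in T proportional to sqrt(L).
```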
For nonlinear least squares fitting to a number of unknown parameters, linear least squares fitting
may be applied iteratively to a linearized form of the function until convergence is achieved.
However, it is often also possible to linearize a nonlinear function at the outset and still use linear
methods for determining fit parameters without resorting to iterative procedures. This approach
does commonly violate the implicit assumption that the distribution of errors is normal, but often
still gives acceptable results using normal equations, a pseudoinverse, etc. Depending on the type
of fit and initial parameters chosen, the nonlinear fit may have good or poor convergence
properties. If uncertainties (in the most general case, error ellipses) are given for the points,
points can be weighted differently in order to give the high-quality points more weight.
Vertical least squares fitting proceeds by finding the sum of the squares of the vertical
deviations R² of a set of n data points

    R² = Σ [y_i - f(x_i, a_1, a_2, ..., a_n)]²    (1)

from a function f. Note that this procedure does not minimize the actual deviations from the line
(which would be measured perpendicular to the given function). In addition, although the
unsquared sum of distances might seem a more appropriate quantity to minimize, use of the
absolute value results in discontinuous derivatives which cannot be treated analytically. The
square deviations from each point are therefore summed, and the resulting residual is then
minimized to find the best fit line. This procedure results in outlying points being given
disproportionately large weighting.

The condition for R² to be a minimum is that

    ∂(R²)/∂a_i = 0    (2)

for i = 1, ..., n. For a linear fit

    f(a, b) = a + b x    (3)

so

    R²(a, b) = Σ [y_i - (a + b x_i)]²    (4)

    ∂(R²)/∂a = -2 Σ [y_i - (a + b x_i)] = 0    (5)

    ∂(R²)/∂b = -2 Σ [y_i - (a + b x_i)] x_i = 0    (6)

These lead to the equations

    n a + b Σ x_i = Σ y_i    (7)

    a Σ x_i + b Σ x_i² = Σ x_i y_i    (8)

In matrix form,

    [n, Σ x_i; Σ x_i, Σ x_i²] [a; b] = [Σ y_i; Σ x_i y_i]    (9)

so

    [a; b] = [n, Σ x_i; Σ x_i, Σ x_i²]⁻¹ [Σ y_i; Σ x_i y_i]    (10)

    = (1 / (n Σ x_i² - (Σ x_i)²)) [Σ x_i², -Σ x_i; -Σ x_i, n] [Σ y_i; Σ x_i y_i]    (11)

so

    a = (Σ y_i Σ x_i² - Σ x_i Σ x_i y_i) / (n Σ x_i² - (Σ x_i)²)    (12)

    = (ȳ Σ x_i² - x̄ Σ x_i y_i) / (Σ x_i² - n x̄²)    (13)

    b = (n Σ x_i y_i - Σ x_i Σ y_i) / (n Σ x_i² - (Σ x_i)²)    (14)

    = (Σ x_i y_i - n x̄ ȳ) / (Σ x_i² - n x̄²)    (15)

(Kenney and Keeping 1962). These can be rewritten in a simpler form by defining the sums of
squares

    ss_xx = Σ (x_i - x̄)² = (Σ x_i²) - n x̄²    (16), (17)

    ss_yy = Σ (y_i - ȳ)² = (Σ y_i²) - n ȳ²    (18), (19)

    ss_xy = Σ (x_i - x̄)(y_i - ȳ) = (Σ x_i y_i) - n x̄ ȳ    (20), (21)

which are related to the variances and covariance by

    σ_x² = ss_xx / n    (22)

    σ_y² = ss_yy / n    (23)

    cov(x, y) = ss_xy / n    (24)

Here, cov(x, y) is the covariance and σ_x² and σ_y² are variances. Note that the quantities
Σ x_i y_i and Σ x_i² can also be interpreted as the dot products

    Σ x_i² = x · x    (25)

    Σ x_i y_i = x · y    (26)

In terms of the sums of squares, the regression coefficient b is given by

    b = cov(x, y) / σ_x² = ss_xy / ss_xx    (27)

and a is given in terms of b by

    a = ȳ - b x̄    (28)

The overall quality of the fit is then parameterized in terms of a quantity known as the
correlation coefficient, defined by

    r² = ss_xy² / (ss_xx ss_yy)    (29)

which gives the proportion of ss_yy which is accounted for by the regression.

Let ŷ_i denote the vertical coordinate of the best-fit line at x_i,

    ŷ_i = a + b x_i    (30)

then the error between the actual vertical point y_i and the fitted point is given by

    e_i = y_i - ŷ_i    (31)

Now define s² as an estimator for the variance in e_i,

    s² = Σ e_i² / (n - 2)    (32)

Then s can be given by

    s = √[(ss_yy - b ss_xy) / (n - 2)]    (33)

(Acton 1966, pp. 32-35; Gonick and Smith 1993, pp. 202-204). The standard errors for a and b
are

    SE(a) = s √(1/n + x̄² / ss_xx)    (34)

    SE(b) = s / √ss_xx    (35)
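The sums-of-squares shortcut, b = ss_xy/ss_xx and a = ȳ - b x̄, can be sketched in a few lines
of pure Python (invented data; illustrative names):

```python
# Sketch: the sums-of-squares form of the best-fit line, b = ss_xy/ss_xx and
# a = ybar - b*xbar, plus the correlation coefficient r**2 (invented data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
ss_xx = sum((x - xbar) ** 2 for x in xs)
ss_yy = sum((y - ybar) ** 2 for y in ys)
ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b  = ss_xy / ss_xx
a  = ybar - b * xbar
r2 = ss_xy ** 2 / (ss_xx * ss_yy)   # proportion of ss_yy explained by the fit
```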
ANOVA
"Analysis of Variance." A statistical test for heterogeneity of means by analysis of group
variances. ANOVA is implemented as ANOVA[data] in the Mathematica package ANOVA` .
To apply the test, assume random sampling of a variate y with equal variances, independent
errors, and a normal distribution. Let n be the number of replicates (sets of identical
observations) within each of K factor levels (treatment groups), and y_ij be the jth observation
within factor level i. Also assume that the ANOVA is "balanced" by restricting n to be the same
for each factor level.
    ȳ_i = (1/n) Σ_j y_ij    (1)

    ȳ = (1/K) Σ_i ȳ_i    (2)

    SS_T = Σ_i Σ_j (y_ij - ȳ)²    (3)

    SS_A = n Σ_i (ȳ_i - ȳ)²    (4)

    SS_E = Σ_i Σ_j (y_ij - ȳ_i)²    (5)

which are the total, treatment, and error sums of squares (these satisfy SS_T = SS_A + SS_E).
Here, ȳ_i is the mean of observations within factor level i, and ȳ is the "group" mean (i.e.,
mean of means). Compute the entries in the following table, obtaining the P-value corresponding
to the calculated F-ratio of the mean squared values

    F = MS_A / MS_E    (6)
    category   freedom     SS     mean squared             F-ratio
    model      K - 1       SS_A   MS_A = SS_A / (K - 1)    MS_A / MS_E
    error      K(n - 1)    SS_E   MS_E = SS_E / [K(n - 1)]
    total      Kn - 1      SS_T
If the P-value is small, reject the null hypothesis that all means are the same for the different
groups.
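A minimal worked sketch of a balanced one-way ANOVA table (two groups of three replicates;
the data are invented):

```python
# Sketch: a balanced one-way ANOVA table computed by hand, with K = 2 factor
# levels and n = 3 replicates per level (the data are invented).
groups = [[4.0, 5.0, 6.0],
          [8.0, 9.0, 10.0]]
K = len(groups)                 # factor levels
n = len(groups[0])              # replicates per level (balanced)

group_means = [sum(g) / n for g in groups]
grand_mean  = sum(group_means) / K

SSA = n * sum((m - grand_mean) ** 2 for m in group_means)    # treatment
SSE = sum((y - m) ** 2
          for g, m in zip(groups, group_means) for y in g)   # error
SST = sum((y - grand_mean) ** 2 for g in groups for y in g)  # total

MSA = SSA / (K - 1)
MSE = SSE / (K * (n - 1))
F = MSA / MSE   # compare to an F(K-1, K*(n-1)) distribution for the P-value
```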
Correlation Coefficient
The correlation coefficient, sometimes also called the cross-correlation coefficient, is a quantity
that gives the quality of a least squares fitting to the original data. To define the correlation
coefficient, first consider the sums of squared values ss_xx, ss_xy, and ss_yy of a set of n data
points (x_i, y_i) about their respective means:

    ss_xx = Σ (x_i - x̄)²    (1)

    = Σ (x_i² - 2 x̄ x_i + x̄²)    (2)

    = Σ x_i² - 2 x̄ Σ x_i + n x̄²    (3)

    = Σ x_i² - n x̄²    (4)

    ss_yy = Σ (y_i - ȳ)²    (5)

    = Σ (y_i² - 2 ȳ y_i + ȳ²)    (6)

    = Σ y_i² - 2 ȳ Σ y_i + n ȳ²    (7)

    = Σ y_i² - n ȳ²    (8)

    ss_xy = Σ (x_i - x̄)(y_i - ȳ)    (9)

    = Σ (x_i y_i - x̄ y_i - x_i ȳ + x̄ ȳ)    (10)

    = Σ x_i y_i - x̄ Σ y_i - ȳ Σ x_i + n x̄ ȳ    (11)

    = Σ x_i y_i - n x̄ ȳ    (12)
These quantities are simply unnormalized forms of the variances and covariance of x and y given
by

    σ_x² = ss_xx / n    (13)

    σ_y² = ss_yy / n    (14)

    cov(x, y) = ss_xy / n    (15)
For linear least squares fitting, the coefficient b in

    y = a + b x    (16)

is given by

    b = (n Σ x_i y_i - Σ x_i Σ y_i) / (n Σ x_i² - (Σ x_i)²)    (17)

    = ss_xy / ss_xx    (18)

and the coefficient b′ in

    x = a′ + b′ y    (19)

is given by

    b′ = ss_xy / ss_yy    (20)

The correlation coefficient r (sometimes also denoted R) is then defined by

    r = √(b b′)    (21)

    = ss_xy / √(ss_xx ss_yy)    (22)
The correlation coefficient is also known as the product-moment coefficient of correlation or
Pearson's correlation. The correlation coefficients for linear fits to increasingly noisy data are
shown above.
The correlation coefficient has an important physical interpretation. To see this, define

    ŷ = a + b x    (23)

and denote the "expected" value for y_i as ŷ_i. Sums of ŷ_i are then

    Σ ŷ_i = Σ (a + b x_i)    (24)

    = n a + b Σ x_i    (25)

    = n (ȳ - b x̄) + b n x̄    (26)

    = n ȳ    (27)

    = Σ y_i    (28)

and, using Σ x_i² = ss_xx + n x̄²,

    Σ ŷ_i² = Σ (a + b x_i)²    (29)

    = n a² + 2 a b Σ x_i + b² Σ x_i²    (30)

    = n (ȳ - b x̄)² + 2 (ȳ - b x̄) b n x̄ + b² (ss_xx + n x̄²)    (31)

    = n ȳ² - 2 n b x̄ ȳ + n b² x̄² + 2 n b x̄ ȳ - 2 n b² x̄² + b² ss_xx + n b² x̄²    (32)

    = n ȳ² + b² ss_xx    (33)

The sum of squared errors is then

    SSE = Σ (y_i - ŷ_i)²    (34)

    = Σ y_i² - 2 Σ y_i ŷ_i + Σ ŷ_i²    (35)

where, similarly,

    Σ y_i ŷ_i = Σ y_i (a + b x_i)    (36)

    = n a ȳ + b Σ x_i y_i    (37)

    = n (ȳ - b x̄) ȳ + b (ss_xy + n x̄ ȳ)    (38)

    = n ȳ² + b ss_xy    (39)

so that

    SSE = (ss_yy + n ȳ²) - 2 (n ȳ² + b ss_xy) + (n ȳ² + b² ss_xx)    (40)

    = ss_yy - 2 b ss_xy + b² ss_xx    (41)

and the sum of squared residuals is

    SSR = Σ (ŷ_i - ȳ)²    (42)

    = Σ ŷ_i² - 2 ȳ Σ ŷ_i + n ȳ²    (43)

    = (n ȳ² + b² ss_xx) - 2 n ȳ² + n ȳ²    (44)

    = b² ss_xx    (45)

    = ss_xy² / ss_xx    (46)

But

    b ss_xx = ss_xy    (47)

    b² ss_xx = b ss_xy    (48)

so

    SSE = ss_yy - b ss_xy    (49)

    SSR = b ss_xy    (50)

    = ss_xy² / ss_xx    (51)

    = r² ss_yy    (52)

and

    ss_yy = SSR + SSE    (53)

The square of the correlation coefficient is therefore given by

    r² = SSR / ss_yy    (54)

    = 1 - SSE / ss_yy    (55)

    = ss_xy² / (ss_xx ss_yy)    (56)
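The decomposition of ss_yy into regression plus error sums of squares, and hence
r² = SSR/ss_yy, can be checked numerically with a short sketch (invented data; illustrative
names):

```python
# Sketch: numerical check that ss_yy splits into the regression and error
# sums of squares, SSR + SSE = ss_yy, so r**2 = SSR/ss_yy (invented data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 2.9, 5.3, 6.8, 9.1]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
ss_xx = sum((x - xbar) ** 2 for x in xs)
ss_yy = sum((y - ybar) ** 2 for y in ys)
ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b = ss_xy / ss_xx
a = ybar - b * xbar
yhat = [a + b * x for x in xs]                        # fitted values

SSR = sum((yh - ybar) ** 2 for yh in yhat)           # squared residuals
SSE = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))  # squared errors
r2  = SSR / ss_yy
```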
Least Squares Fitting--Exponential

For a function of the form

    y = A e^(B x)    (1)

the fit can be linearized by taking the logarithm of both sides,

    ln y = ln A + B x    (2)

and applying linear least squares to the points (x_i, ln y_i):

    a = (Σ (ln y_i) Σ x_i² - Σ x_i Σ (x_i ln y_i)) / (n Σ x_i² - (Σ x_i)²)    (3)

    b = (n Σ (x_i ln y_i) - Σ x_i Σ (ln y_i)) / (n Σ x_i² - (Σ x_i)²)    (4)

where A = e^a and B = b.

This fit gives greater weights to small y values so, in order to weight the points equally, it is
often better to minimize the function

    Σ y_i (ln y_i - a - b x_i)²    (5)

Setting the derivatives with respect to a and b to zero leads to the equations

    a Σ y_i + b Σ x_i y_i = Σ y_i ln y_i    (6)

    a Σ x_i y_i + b Σ x_i² y_i = Σ x_i y_i ln y_i    (7)

with determinant

    Δ = Σ y_i Σ x_i² y_i - (Σ x_i y_i)²    (8)

so that

    a = (Σ x_i² y_i Σ y_i ln y_i - Σ x_i y_i Σ x_i y_i ln y_i) / Δ    (9)

    b = (Σ y_i Σ x_i y_i ln y_i - Σ x_i y_i Σ y_i ln y_i) / Δ    (10)

In the plot above, the short-dashed curve is the fit computed from (3) and (4) and the long-
dashed curve is the fit computed from (9) and (10).
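A sketch contrasting the plain log-transformed fit with the y-weighted version (noise-free
invented data, so the two fits should agree; all names are illustrative):

```python
# Sketch: exponential fit y = A*exp(B*x) by the log transform, unweighted and
# then weighted by y (the data are invented and noise-free, so the two fits
# should agree; all names are illustrative).
import math

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.7 * x) for x in xs]
n  = len(xs)
L  = [math.log(y) for y in ys]

# Unweighted: straight-line fit to (x, ln y).
den = n * sum(x * x for x in xs) - sum(xs) ** 2
b1  = (n * sum(x * l for x, l in zip(xs, L)) - sum(xs) * sum(L)) / den
a1  = (sum(L) - b1 * sum(xs)) / n

# Weighted by y, which counteracts the transform's bias toward small y values.
Sy   = sum(ys)
Sxy  = sum(x * y for x, y in zip(xs, ys))
Sx2y = sum(x * x * y for x, y in zip(xs, ys))
Syl  = sum(y * l for y, l in zip(ys, L))
Sxyl = sum(x * y * l for x, y, l in zip(xs, ys, L))

dw = Sy * Sx2y - Sxy ** 2
a2 = (Sx2y * Syl - Sxy * Sxyl) / dw
b2 = (Sy * Sxyl - Sxy * Syl) / dw

A, B = math.exp(a2), b2
```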
Least Squares Fitting--Perpendicular Offsets
The residuals of the best-fit line for a set of n points using unsquared perpendicular distances
d_i of points (x_i, y_i) are given by

    R_⊥ = Σ d_i    (1)

    = Σ |y_i - (a + b x_i)| / √(1 + b²)    (2)

since the perpendicular distance from the line y = a + b x to the point (x_i, y_i) is

    d_i = |y_i - (a + b x_i)| / √(1 + b²)    (3)

Unfortunately, because the absolute value function does not have continuous derivatives,
minimizing R_⊥ is not amenable to analytic solution. However, if the square of the perpendicular
distances

    R_⊥² = Σ [y_i - (a + b x_i)]² / (1 + b²)    (4)

is minimized instead, the problem can be solved in closed form. R_⊥² is a minimum when

    ∂(R_⊥²)/∂a = (-2 / (1 + b²)) Σ [y_i - (a + b x_i)] = 0    (5)

and

    ∂(R_⊥²)/∂b = 0    (6)

The first condition gives

    a = (Σ y_i - b Σ x_i) / n    (7)

    = ȳ - b x̄    (8)

and substituting this back into (4) yields

    R_⊥² = Σ [(y_i - ȳ) - b (x_i - x̄)]² / (1 + b²)    (9)

But, in terms of the sums of squares defined above,

    Σ [(y_i - ȳ) - b (x_i - x̄)]² = ss_yy - 2 b ss_xy + b² ss_xx    (10)

so the condition d(R_⊥²)/db = 0 becomes

    (1 + b²)(2 b ss_xx - 2 ss_xy) - 2 b (ss_yy - 2 b ss_xy + b² ss_xx) = 0    (11)

    2 [b ss_xx - ss_xy + b² ss_xy - b ss_yy] = 0    (12)

    b² ss_xy + b (ss_xx - ss_yy) - ss_xy = 0    (13)

Dividing through by ss_xy gives a quadratic in b,

    b² + ((ss_xx - ss_yy) / ss_xy) b - 1 = 0    (14)

    b = [(ss_yy - ss_xx)/ss_xy ± √((ss_xx - ss_yy)²/ss_xy² + 4)] / 2    (16)

So define

    B = ½ (ss_xx - ss_yy) / ss_xy    (17)

in terms of which

    b² + 2 B b - 1 = 0    (18)

    b = -B ± √(B² + 1)    (19)

with a found using (8), taking the root that minimizes R_⊥². Note the rather unwieldy form of the
best-fit parameters in this formulation. In addition, minimizing R_⊥² for a second- or higher-order
polynomial leads to polynomial equations having higher order, so this formulation cannot be
extended.
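A short sketch of the perpendicular-offset slope computed through B (invented data; for this
data set ss_xy > 0, and the "+" root is the one that minimizes the perpendicular residual):

```python
# Sketch: the perpendicular-offset (orthogonal) best-fit line, with the slope
# taken from the quadratic b**2 + 2*B*b - 1 = 0 (invented data; for this data
# ss_xy > 0 and the "+" root is the minimizing slope).
import math

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
ss_xx = sum((x - xbar) ** 2 for x in xs)
ss_yy = sum((y - ybar) ** 2 for y in ys)
ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

B = 0.5 * (ss_xx - ss_yy) / ss_xy
b = -B + math.sqrt(B * B + 1.0)
a = ybar - b * xbar

def perp_sq(slope, intercept):
    # Sum of squared perpendicular distances to the line.
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(xs, ys)) / (1.0 + slope ** 2)
```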
Least Squares Fitting--Polynomial

Generalizing from a straight line to a polynomial of degree k, the fitting function is

    y = a_0 + a_1 x + ... + a_k x^k    (1)

and the residual is

    R² = Σ [y_i - (a_0 + a_1 x_i + ... + a_k x_i^k)]²    (2)

Setting the partial derivatives to zero,

    ∂(R²)/∂a_0 = -2 Σ [y_i - (a_0 + a_1 x_i + ... + a_k x_i^k)] = 0    (3)

    ∂(R²)/∂a_k = -2 Σ [y_i - (a_0 + a_1 x_i + ... + a_k x_i^k)] x_i^k = 0    (4)

leads to the normal equations

    n a_0 + a_1 Σ x_i + ... + a_k Σ x_i^k = Σ y_i    (5)

    a_0 Σ x_i + a_1 Σ x_i² + ... + a_k Σ x_i^(k+1) = Σ x_i y_i    (6)

    a_0 Σ x_i^k + a_1 Σ x_i^(k+1) + ... + a_k Σ x_i^(2k) = Σ x_i^k y_i    (7)

or, in matrix form,

    [n, Σ x_i, ..., Σ x_i^k; Σ x_i, Σ x_i², ..., Σ x_i^(k+1); ...; Σ x_i^k, Σ x_i^(k+1), ..., Σ x_i^(2k)]
    × [a_0; a_1; ...; a_k] = [Σ y_i; Σ x_i y_i; ...; Σ x_i^k y_i]    (8), (9)

This is a Vandermonde matrix. We can also obtain the matrix for a least squares fit by writing

    [1, x_1, ..., x_1^k; 1, x_2, ..., x_2^k; ...; 1, x_n, ..., x_n^k] [a_0; a_1; ...; a_k] = [y_1; y_2; ...; y_n]    (10)

Premultiplying both sides by the transpose of the first matrix then gives

    Xᵀ X a = Xᵀ y    (11)

so

    a = (Xᵀ X)⁻¹ Xᵀ y    (12)

As before, given n points (x_i, y_i) and fitting with polynomial coefficients a_0, ..., a_k gives

    y = X a    (13)

where y = [y_1; ...; y_n] is the vector of observations, X is the n × (k+1) design matrix with
rows [1, x_i, ..., x_i^k], and a = [a_0; ...; a_k] is the coefficient vector, so that

    Xᵀ y = Xᵀ X a    (14), (15)

This matrix equation can be solved numerically, or can be inverted directly if it is well formed,
to yield the solution vector

    a = (Xᵀ X)⁻¹ Xᵀ y    (16)
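As a sketch of the Vandermonde viewpoint: with exactly k + 1 points the system X a = y is
square and can be solved directly, which interpolates the points (invented data; illustrative
names):

```python
# Sketch: the Vandermonde viewpoint. With exactly k + 1 = 3 points the system
# X a = y is square and interpolates; elimination solves it directly
# (invented data chosen to lie on y = 1 + x**2).
xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 5.0]

X = [[x ** j for j in range(3)] for x in xs]   # rows [1, x, x**2]

# Augmented matrix [X | y], forward elimination, back-substitution.
m = [row[:] + [y] for row, y in zip(X, ys)]
for col in range(3):
    for row in range(col + 1, 3):
        f = m[row][col] / m[col][col]
        m[row] = [r - f * c for r, c in zip(m[row], m[col])]
a = [0.0, 0.0, 0.0]
for row in (2, 1, 0):
    a[row] = (m[row][3] - sum(m[row][j] * a[j] for j in range(row + 1, 3))) / m[row][row]
# With more than k + 1 points, one instead solves X^T X a = X^T y as above.
```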
Least Squares Fitting--Power Law

Given a function of the form

    y = A x^B    (1)

least squares fitting of the linearized form ln y = ln A + B ln x gives the coefficients as

    b = (n Σ (ln x_i)(ln y_i) - Σ (ln x_i) Σ (ln y_i)) / (n Σ (ln x_i)² - (Σ ln x_i)²)    (2)

    a = (Σ (ln y_i) - b Σ (ln x_i)) / n    (3)

where B = b and A = e^a.
Nonlinear Least Squares Fitting

Given a function f(x; λ_1, ..., λ_n) of a variable x, of known analytic form depending on n
parameters λ_1, ..., λ_n, tabulated at m values, consider the overdetermined set of m equations

    y_1 = f(x_1; λ_1, ..., λ_n)    (1)

    y_m = f(x_m; λ_1, ..., λ_n)    (2)

We desire to solve these equations to obtain the values λ_1, ..., λ_n which best satisfy this
system of equations. Pick an initial guess for the λ_i and then define

    dβ_i = y_i - f(x_i; λ_1, ..., λ_n)    (3)

Now obtain a linearized estimate for the changes dλ_j needed to reduce dβ_i to zero,

    dβ_i = Σ_j (∂f/∂λ_j) dλ_j    (4)

for i = 1, ..., m, where the partial derivatives are evaluated at x_i with the current parameter
values. This can be written in component form as

    dβ_i = A_ij dλ_j    (5)

where A is the m × n matrix

    A_ij = ∂f(x_i)/∂λ_j    (6)

In more concise matrix form,

    dβ = A dλ    (7)

    Aᵀ dβ = (Aᵀ A) dλ    (8)

Defining

    a = Aᵀ A    (9)

    b = Aᵀ dβ    (10)

in terms of the known quantities A and dβ then gives the matrix equation

    a dλ = b    (11)

which can be solved for dλ using standard matrix techniques such as Gaussian elimination. This
offset is then applied to λ and a new dβ is calculated. By iteratively applying this procedure
until the elements of dλ become smaller than some prescribed limit, a solution is obtained. Note
that the procedure may not converge very well for some functions and also that convergence is
often greatly improved by picking initial values close to the best-fit value. The sum of square
residuals is given by R² = dβ · dβ after the final iteration.
An example of a nonlinear least squares fit to a noisy Gaussian function

    f(x) = A e^(-(x - x_0)² / (2 σ²))    (12)

is shown above, where the thin solid curve is the initial guess, the dotted curves are intermediate
iterations, and the heavy solid curve is the fit to which the solution converges. The actual
parameters are (A, x_0, σ) = (1, 20, 5), the initial guess was (0.8, 15, 4), and the converged
values are (1.03105, 20.1369, 4.86022). The partial derivatives used to construct the
matrix A are

    ∂f/∂A = e^(-(x - x_0)² / (2 σ²))    (13)

    ∂f/∂x_0 = (A (x - x_0) / σ²) e^(-(x - x_0)² / (2 σ²))    (14)

    ∂f/∂σ = (A (x - x_0)² / σ³) e^(-(x - x_0)² / (2 σ²))    (15)
The technique could obviously be generalized to multiple Gaussians, to include slopes, etc.,
although the convergence properties generally worsen as the number of free parameters is
increased.
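A sketch of the iteration for a two-parameter model f(x) = p e^(qx) (noise-free invented data
and a deliberately close initial guess, since, as noted above, convergence improves with good
starting values):

```python
# Sketch: Gauss-Newton iteration for the two-parameter model f(x) = p*exp(q*x),
# solving the 2x2 normal equations (J^T J) dlam = J^T dbeta at each step.
# The data are invented and noise-free; the initial guess is deliberately
# close to the true (2, 0.8), since convergence depends on a good start.
import math

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.8 * x) for x in xs]

p, q = 1.8, 0.7                     # initial guess
for _ in range(50):
    # Residuals dbeta and Jacobian rows [df/dp, df/dq] at the current guess.
    fv = [p * math.exp(q * x) for x in xs]
    db = [y - f for y, f in zip(ys, fv)]
    J  = [[math.exp(q * x), p * x * math.exp(q * x)] for x in xs]

    # Normal equations, solved in closed form for the 2x2 case.
    g11 = sum(r[0] * r[0] for r in J)
    g12 = sum(r[0] * r[1] for r in J)
    g22 = sum(r[1] * r[1] for r in J)
    h1  = sum(r[0] * d for r, d in zip(J, db))
    h2  = sum(r[1] * d for r, d in zip(J, db))
    det = g11 * g22 - g12 * g12
    dp  = (h1 * g22 - h2 * g12) / det
    dq  = (g11 * h2 - g12 * h1) / det

    p, q = p + dp, q + dq
    if abs(dp) + abs(dq) < 1e-13:   # stop when the offsets are tiny
        break
```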
An analogous technique can be used to solve an overdetermined set of equations. This problem
might, for example, arise when solving for the best-fit Euler angles corresponding to a noisy
rotation matrix, in which case there are three unknown angles, but nine correlated matrix
elements. In such a case, write the m different functions as f_i(λ_1, ..., λ_n) for i = 1, ..., m,
call their actual values y_i, and define

    dβ_i = y_i - f_i(λ_1^(k), ..., λ_n^(k))    (16)

and

    A_ij = ∂f_i/∂λ_j    (17)

where λ_j^(k) are the numerical values obtained after the kth iteration. Again, set up the
equations as

    A dλ = dβ    (18)