| @cindex fitting |
| @cindex least squares fit |
| @cindex regression, least squares |
| @cindex weighted linear fits |
| @cindex unweighted linear fits |
| This chapter describes routines for performing least squares fits to |
| experimental data using linear combinations of functions. The data |
| may be weighted or unweighted, i.e. with known or unknown errors. For |
| weighted data the functions compute the best fit parameters and their |
| associated covariance matrix. For unweighted data the covariance |
| matrix is estimated from the scatter of the points, giving a |
| variance-covariance matrix. |
| |
| The functions are divided into separate versions for simple one- or |
| two-parameter regression and multiple-parameter fits. The functions |
| are declared in the header file @file{gsl_fit.h}. |
| |
| @menu |
| * Fitting Overview:: |
| * Linear regression:: |
| * Linear fitting without a constant term:: |
| * Multi-parameter fitting:: |
| * Fitting Examples:: |
| * Fitting References and Further Reading:: |
| @end menu |
| |
| @node Fitting Overview |
| @section Overview |
| |
| Least-squares fits are found by minimizing @math{\chi^2} |
| (chi-squared), the weighted sum of squared residuals over @math{n} |
| experimental datapoints @math{(x_i, y_i)} for the model @math{Y(c,x)}, |
| @tex |
| \beforedisplay |
| $$ |
| \chi^2 = \sum_i w_i (y_i - Y(c, x_i))^2 |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| \chi^2 = \sum_i w_i (y_i - Y(c, x_i))^2 |
| @end example |
| |
| @end ifinfo |
| @noindent |
| The @math{p} parameters of the model are @c{$c = \{c_0, c_1, \dots\}$} |
| @math{c = @{c_0, c_1, @dots{}@}}. The |
| weight factors @math{w_i} are given by @math{w_i = 1/\sigma_i^2}, |
| where @math{\sigma_i} is the experimental error on the data-point |
| @math{y_i}. The errors are assumed to be |
| gaussian and uncorrelated. |
| For unweighted data the chi-squared sum is computed without any weight factors. |
| |
| The fitting routines return the best-fit parameters @math{c} and their |
| @math{p \times p} covariance matrix. The covariance matrix measures the |
| statistical errors on the best-fit parameters resulting from the |
| errors on the data, @math{\sigma_i}, and is defined |
| @cindex covariance matrix, linear fits |
| as @c{$C_{ab} = \langle \delta c_a \delta c_b \rangle$} |
| @math{C_@{ab@} = <\delta c_a \delta c_b>} where @c{$\langle \, \rangle$} |
| @math{< >} denotes an average over the gaussian error distributions of the underlying datapoints. |
| |
| The covariance matrix is calculated by error propagation from the data |
| errors @math{\sigma_i}. The change in a fitted parameter @math{\delta |
| c_a} caused by a small change in the data @math{\delta y_i} is given |
| by |
| @tex |
| \beforedisplay |
| $$ |
| \delta c_a = \sum_i {\partial c_a \over \partial y_i} \delta y_i |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| \delta c_a = \sum_i (dc_a/dy_i) \delta y_i |
| @end example |
| |
| @end ifinfo |
| @noindent |
| allowing the covariance matrix to be written in terms of the errors on the data, |
| @tex |
| \beforedisplay |
| $$ |
| C_{ab} = \sum_{i,j} {\partial c_a \over \partial y_i} |
| {\partial c_b \over \partial y_j} |
| \langle \delta y_i \delta y_j \rangle |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| C_@{ab@} = \sum_@{i,j@} (dc_a/dy_i) (dc_b/dy_j) <\delta y_i \delta y_j> |
| @end example |
| |
| @end ifinfo |
| @noindent |
| For uncorrelated data the fluctuations of the underlying datapoints satisfy |
| @c{$\langle \delta y_i \delta y_j \rangle = \sigma_i^2 \delta_{ij}$} |
| @math{<\delta y_i \delta y_j> = \sigma_i^2 \delta_@{ij@}}, giving a |
| corresponding parameter covariance matrix of |
| @tex |
| \beforedisplay |
| $$ |
| C_{ab} = \sum_{i} {1 \over w_i} {\partial c_a \over \partial y_i} {\partial c_b \over \partial y_i} |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| C_@{ab@} = \sum_i (1/w_i) (dc_a/dy_i) (dc_b/dy_i) |
| @end example |
| |
| @end ifinfo |
| @noindent |
| When computing the covariance matrix for unweighted data, i.e. data with unknown errors, |
| the weight factors @math{w_i} in this sum are replaced by the single estimate @math{w = |
| 1/\sigma^2}, where @math{\sigma^2} is the computed variance of the |
| residuals about the best-fit model, @math{\sigma^2 = \sum (y_i - Y(c,x_i))^2 / (n-p)}. |
| This is referred to as the @dfn{variance-covariance matrix}. |
| @cindex variance-covariance matrix, linear fits |
| |
| The standard deviations of the best-fit parameters are given by the |
| square root of the corresponding diagonal elements of |
| the covariance matrix, @c{$\sigma_{c_a} = \sqrt{C_{aa}}$} |
| @math{\sigma_@{c_a@} = \sqrt@{C_@{aa@}@}}. |
| |
| @node Linear regression |
| @section Linear regression |
| @cindex linear regression |
| The functions described in this section can be used to perform |
| least-squares fits to a straight line model, @math{Y(c,x) = c_0 + c_1 x}. |
| |
| @cindex covariance matrix, from linear regression |
| @deftypefun int gsl_fit_linear (const double * @var{x}, const size_t @var{xstride}, const double * @var{y}, const size_t @var{ystride}, size_t @var{n}, double * @var{c0}, double * @var{c1}, double * @var{cov00}, double * @var{cov01}, double * @var{cov11}, double * @var{sumsq}) |
| This function computes the best-fit linear regression coefficients |
| (@var{c0},@var{c1}) of the model @math{Y = c_0 + c_1 X} for the dataset |
| (@var{x}, @var{y}), two vectors of length @var{n} with strides |
| @var{xstride} and @var{ystride}. The errors on @var{y} are assumed unknown so |
| the variance-covariance matrix for the |
| parameters (@var{c0}, @var{c1}) is estimated from the scatter of the |
| points around the best-fit line and returned via the parameters |
| (@var{cov00}, @var{cov01}, @var{cov11}). |
| The sum of squares of the residuals from the best-fit line is returned |
| in @var{sumsq}. |
| @end deftypefun |
| |
| @deftypefun int gsl_fit_wlinear (const double * @var{x}, const size_t @var{xstride}, const double * @var{w}, const size_t @var{wstride}, const double * @var{y}, const size_t @var{ystride}, size_t @var{n}, double * @var{c0}, double * @var{c1}, double * @var{cov00}, double * @var{cov01}, double * @var{cov11}, double * @var{chisq}) |
| This function computes the best-fit linear regression coefficients |
| (@var{c0},@var{c1}) of the model @math{Y = c_0 + c_1 X} for the weighted |
| dataset (@var{x}, @var{y}), two vectors of length @var{n} with strides |
| @var{xstride} and @var{ystride}. The vector @var{w}, of length @var{n} |
| and stride @var{wstride}, specifies the weight of each datapoint. The |
| weight is the reciprocal of the variance for each datapoint in @var{y}. |
| |
| The covariance matrix for the parameters (@var{c0}, @var{c1}) is |
| computed using the weights and returned via the parameters |
| (@var{cov00}, @var{cov01}, @var{cov11}). The weighted sum of squares |
| of the residuals from the best-fit line, @math{\chi^2}, is returned in |
| @var{chisq}. |
| @end deftypefun |
| |
| @deftypefun int gsl_fit_linear_est (double @var{x}, double @var{c0}, double @var{c1}, double @var{c00}, double @var{c01}, double @var{c11}, double * @var{y}, double * @var{y_err}) |
| This function uses the best-fit linear regression coefficients |
| @var{c0},@var{c1} and their covariance |
| @var{cov00},@var{cov01},@var{cov11} to compute the fitted function |
| @var{y} and its standard deviation @var{y_err} for the model @math{Y = |
| c_0 + c_1 X} at the point @var{x}. |
| @end deftypefun |
| |
| @node Linear fitting without a constant term |
| @section Linear fitting without a constant term |
| |
| The functions described in this section can be used to perform |
| least-squares fits to a straight line model without a constant term, |
| @math{Y = c_1 X}. |
| |
| @deftypefun int gsl_fit_mul (const double * @var{x}, const size_t @var{xstride}, const double * @var{y}, const size_t @var{ystride}, size_t @var{n}, double * @var{c1}, double * @var{cov11}, double * @var{sumsq}) |
| This function computes the best-fit linear regression coefficient |
| @var{c1} of the model @math{Y = c_1 X} for the datasets (@var{x}, |
| @var{y}), two vectors of length @var{n} with strides @var{xstride} and |
| @var{ystride}. The errors on @var{y} are assumed unknown so the |
| variance of the parameter @var{c1} is estimated from |
| the scatter of the points around the best-fit line and returned via the |
| parameter @var{cov11}. The sum of squares of the residuals from the |
| best-fit line is returned in @var{sumsq}. |
| @end deftypefun |
| |
| @deftypefun int gsl_fit_wmul (const double * @var{x}, const size_t @var{xstride}, const double * @var{w}, const size_t @var{wstride}, const double * @var{y}, const size_t @var{ystride}, size_t @var{n}, double * @var{c1}, double * @var{cov11}, double * @var{sumsq}) |
| This function computes the best-fit linear regression coefficient |
| @var{c1} of the model @math{Y = c_1 X} for the weighted datasets |
| (@var{x}, @var{y}), two vectors of length @var{n} with strides |
| @var{xstride} and @var{ystride}. The vector @var{w}, of length @var{n} |
| and stride @var{wstride}, specifies the weight of each datapoint. The |
| weight is the reciprocal of the variance for each datapoint in @var{y}. |
| |
| The variance of the parameter @var{c1} is computed using the weights |
| and returned via the parameter @var{cov11}. The weighted sum of |
| squares of the residuals from the best-fit line, @math{\chi^2}, is |
| returned in @var{chisq}. |
| @end deftypefun |
| |
| @deftypefun int gsl_fit_mul_est (double @var{x}, double @var{c1}, double @var{c11}, double * @var{y}, double * @var{y_err}) |
| This function uses the best-fit linear regression coefficient @var{c1} |
| and its covariance @var{cov11} to compute the fitted function |
| @var{y} and its standard deviation @var{y_err} for the model @math{Y = |
| c_1 X} at the point @var{x}. |
| @end deftypefun |
| |
| @node Multi-parameter fitting |
| @section Multi-parameter fitting |
| @cindex multi-parameter regression |
| @cindex fits, multi-parameter linear |
| The functions described in this section perform least-squares fits to a |
| general linear model, @math{y = X c} where @math{y} is a vector of |
| @math{n} observations, @math{X} is an @math{n} by @math{p} matrix of |
| predictor variables, and the elements of the vector @math{c} are the @math{p} unknown best-fit parameters which are to be estimated. The chi-squared value is given by @c{$\chi^2 = \sum_i w_i (y_i - \sum_j X_{ij} c_j)^2$} |
| @math{\chi^2 = \sum_i w_i (y_i - \sum_j X_@{ij@} c_j)^2}. |
| |
| This formulation can be used for fits to any number of functions and/or |
| variables by preparing the @math{n}-by-@math{p} matrix @math{X} |
| appropriately. For example, to fit to a @math{p}-th order polynomial in |
| @var{x}, use the following matrix, |
| @tex |
| \beforedisplay |
| $$ |
| X_{ij} = x_i^j |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| X_@{ij@} = x_i^j |
| @end example |
| |
| @end ifinfo |
| @noindent |
| where the index @math{i} runs over the observations and the index |
| @math{j} runs from 0 to @math{p-1}. |
| |
| To fit to a set of @math{p} sinusoidal functions with fixed frequencies |
| @math{\omega_1}, @math{\omega_2}, @dots{}, @math{\omega_p}, use, |
| @tex |
| \beforedisplay |
| $$ |
| X_{ij} = \sin(\omega_j x_i) |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| X_@{ij@} = sin(\omega_j x_i) |
| @end example |
| |
| @end ifinfo |
| @noindent |
| To fit to @math{p} independent variables @math{x_1}, @math{x_2}, @dots{}, |
| @math{x_p}, use, |
| @tex |
| \beforedisplay |
| $$ |
| X_{ij} = x_j(i) |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| X_@{ij@} = x_j(i) |
| @end example |
| |
| @end ifinfo |
| @noindent |
| where @math{x_j(i)} is the @math{i}-th value of the predictor variable |
| @math{x_j}. |
| |
| The functions described in this section are declared in the header file |
| @file{gsl_multifit.h}. |
| |
| The solution of the general linear least-squares system requires an |
| additional working space for intermediate results, such as the singular |
| value decomposition of the matrix @math{X}. |
| |
| @deftypefun {gsl_multifit_linear_workspace *} gsl_multifit_linear_alloc (size_t @var{n}, size_t @var{p}) |
| This function allocates a workspace for fitting a model to @var{n} |
| observations using @var{p} parameters. |
| @end deftypefun |
| |
| @deftypefun void gsl_multifit_linear_free (gsl_multifit_linear_workspace * @var{work}) |
| This function frees the memory associated with the workspace @var{w}. |
| @end deftypefun |
| |
| @deftypefun int gsl_multifit_linear (const gsl_matrix * @var{X}, const gsl_vector * @var{y}, gsl_vector * @var{c}, gsl_matrix * @var{cov}, double * @var{chisq}, gsl_multifit_linear_workspace * @var{work}) |
| @deftypefunx int gsl_multifit_linear_svd (const gsl_matrix * @var{X}, const gsl_vector * @var{y}, double @var{tol}, size_t * @var{rank}, gsl_vector * @var{c}, gsl_matrix * @var{cov}, double * @var{chisq}, gsl_multifit_linear_workspace * @var{work}) |
| These functions compute the best-fit parameters @var{c} of the model |
| @math{y = X c} for the observations @var{y} and the matrix of predictor |
| variables @var{X}. The variance-covariance matrix of the model |
| parameters @var{cov} is estimated from the scatter of the observations |
| about the best-fit. The sum of squares of the residuals from the |
| best-fit, @math{\chi^2}, is returned in @var{chisq}. |
| |
| The best-fit is found by singular value decomposition of the matrix |
| @var{X} using the preallocated workspace provided in @var{work}. The |
| modified Golub-Reinsch SVD algorithm is used, with column scaling to |
| improve the accuracy of the singular values. Any components which have |
| zero singular value (to machine precision) are discarded from the fit. |
| In the second form of the function the components are discarded if the |
| ratio of singular values @math{s_i/s_0} falls below the user-specified |
| tolerance @var{tol}, and the effective rank is returned in @var{rank}. |
| @end deftypefun |
| |
| @deftypefun int gsl_multifit_wlinear (const gsl_matrix * @var{X}, const gsl_vector * @var{w}, const gsl_vector * @var{y}, gsl_vector * @var{c}, gsl_matrix * @var{cov}, double * @var{chisq}, gsl_multifit_linear_workspace * @var{work}) |
| @deftypefunx int gsl_multifit_wlinear_svd (const gsl_matrix * @var{X}, const gsl_vector * @var{w}, const gsl_vector * @var{y}, double @var{tol}, size_t * @var{rank}, gsl_vector * @var{c}, gsl_matrix * @var{cov}, double * @var{chisq}, gsl_multifit_linear_workspace * @var{work}) |
| |
| This function computes the best-fit parameters @var{c} of the weighted |
| model @math{y = X c} for the observations @var{y} with weights @var{w} |
| and the matrix of predictor variables @var{X}. The covariance matrix of |
| the model parameters @var{cov} is computed with the given weights. The |
| weighted sum of squares of the residuals from the best-fit, |
| @math{\chi^2}, is returned in @var{chisq}. |
| |
| The best-fit is found by singular value decomposition of the matrix |
| @var{X} using the preallocated workspace provided in @var{work}. Any |
| components which have zero singular value (to machine precision) are |
| discarded from the fit. In the second form of the function the |
| components are discarded if the ratio of singular values @math{s_i/s_0} |
| falls below the user-specified tolerance @var{tol}, and the effective |
| rank is returned in @var{rank}. |
| @end deftypefun |
| |
| @deftypefun int gsl_multifit_linear_est (const gsl_vector * @var{x}, const gsl_vector * @var{c}, const gsl_matrix * @var{cov}, double * @var{y}, double * @var{y_err}) |
| This function uses the best-fit multilinear regression coefficients |
| @var{c} and their covariance matrix |
| @var{cov} to compute the fitted function value |
| @var{y} and its standard deviation @var{y_err} for the model @math{y = x.c} |
| at the point @var{x}. |
| @end deftypefun |
| |
| @node Fitting Examples |
| @section Examples |
| |
| The following program computes a least squares straight-line fit to a |
| simple dataset, and outputs the best-fit line and its |
| associated one standard-deviation error bars. |
| |
| @example |
| @verbatiminclude examples/fitting.c |
| @end example |
| |
| @noindent |
| The following commands extract the data from the output of the program |
| and display it using the @sc{gnu} plotutils @code{graph} utility, |
| |
| @example |
| $ ./demo > tmp |
| $ more tmp |
| # best fit: Y = -106.6 + 0.06 X |
| # covariance matrix: |
| # [ 39602, -19.9 |
| # -19.9, 0.01] |
| # chisq = 0.8 |
| |
| $ for n in data fit hi lo ; |
| do |
| grep "^$n" tmp | cut -d: -f2 > $n ; |
| done |
| $ graph -T X -X x -Y y -y 0 20 -m 0 -S 2 -Ie data |
| -S 0 -I a -m 1 fit -m 2 hi -m 2 lo |
| @end example |
| |
| @iftex |
| @sp 1 |
| @center @image{fit-wlinear,3.0in} |
| @end iftex |
| |
| The next program performs a quadratic fit @math{y = c_0 + c_1 x + c_2 |
| x^2} to a weighted dataset using the generalised linear fitting function |
| @code{gsl_multifit_wlinear}. The model matrix @math{X} for a quadratic |
| fit is given by, |
| @tex |
| \beforedisplay |
| $$ |
| X=\pmatrix{1&x_0&x_0^2\cr |
| 1&x_1&x_1^2\cr |
| 1&x_2&x_2^2\cr |
| \dots&\dots&\dots\cr} |
| $$ |
| \afterdisplay |
| @end tex |
| @ifinfo |
| |
| @example |
| X = [ 1 , x_0 , x_0^2 ; |
| 1 , x_1 , x_1^2 ; |
| 1 , x_2 , x_2^2 ; |
| ... , ... , ... ] |
| @end example |
| |
| @end ifinfo |
| @noindent |
| where the column of ones corresponds to the constant term @math{c_0}. |
| The two remaining columns corresponds to the terms @math{c_1 x} and |
| @math{c_2 x^2}. |
| |
| The program reads @var{n} lines of data in the format (@var{x}, @var{y}, |
| @var{err}) where @var{err} is the error (standard deviation) in the |
| value @var{y}. |
| |
| @example |
| @verbatiminclude examples/fitting2.c |
| @end example |
| |
| @noindent |
| A suitable set of data for fitting can be generated using the following |
| program. It outputs a set of points with gaussian errors from the curve |
| @math{y = e^x} in the region @math{0 < x < 2}. |
| |
| @example |
| @verbatiminclude examples/fitting3.c |
| @end example |
| |
| @noindent |
| The data can be prepared by running the resulting executable program, |
| |
| @example |
| $ ./generate > exp.dat |
| $ more exp.dat |
| 0.1 0.97935 0.110517 |
| 0.2 1.3359 0.12214 |
| 0.3 1.52573 0.134986 |
| 0.4 1.60318 0.149182 |
| 0.5 1.81731 0.164872 |
| 0.6 1.92475 0.182212 |
| .... |
| @end example |
| |
| @noindent |
| To fit the data use the previous program, with the number of data points |
| given as the first argument. In this case there are 19 data points. |
| |
| @example |
| $ ./fit 19 < exp.dat |
| 0.1 0.97935 +/- 0.110517 |
| 0.2 1.3359 +/- 0.12214 |
| ... |
| # best fit: Y = 1.02318 + 0.956201 X + 0.876796 X^2 |
| # covariance matrix: |
| [ +1.25612e-02, -3.64387e-02, +1.94389e-02 |
| -3.64387e-02, +1.42339e-01, -8.48761e-02 |
| +1.94389e-02, -8.48761e-02, +5.60243e-02 ] |
| # chisq = 23.0987 |
| @end example |
| |
| @noindent |
| The parameters of the quadratic fit match the coefficients of the |
| expansion of @math{e^x}, taking into account the errors on the |
| parameters and the @math{O(x^3)} difference between the exponential and |
| quadratic functions for the larger values of @math{x}. The errors on |
| the parameters are given by the square-root of the corresponding |
| diagonal elements of the covariance matrix. The chi-squared per degree |
| of freedom is 1.4, indicating a reasonable fit to the data. |
| |
| @iftex |
| @sp 1 |
| @center @image{fit-wlinear2,3.0in} |
| @end iftex |
| |
| @node Fitting References and Further Reading |
| @section References and Further Reading |
| |
| A summary of formulas and techniques for least squares fitting can be |
| found in the ``Statistics'' chapter of the Annual Review of Particle |
| Physics prepared by the Particle Data Group, |
| |
| @itemize @asis |
| @item |
| @cite{Review of Particle Properties}, |
| R.M. Barnett et al., Physical Review D54, 1 (1996) |
| @uref{http://pdg.lbl.gov/} |
| @end itemize |
| |
| @noindent |
| The Review of Particle Physics is available online at the website given |
| above. |
| |
| @cindex NIST Statistical Reference Datasets |
| @cindex Statistical Reference Datasets (StRD) |
| The tests used to prepare these routines are based on the NIST |
| Statistical Reference Datasets. The datasets and their documentation are |
| available from NIST at the following website, |
| |
| @center @uref{http://www.nist.gov/itl/div898/strd/index.html}. |
| |
| |