Computational Physics Lab 10
Regression and Nonparametric Modeling
The goal of this lab is to develop techniques and some intuition for
fitting linear regression models and for using nonparametric density
estimation.
Part 1. Parametric Regression
Fit 1-d polynomial regression model to the following synthetic dataset: lab10_data.txt.
The first column in the file is the predictor value, the second column
is the response, and the third is the error of the response
measurement. Assume that the error probability density is Gaussian with
the standard deviation given by the third column, and that all
measurements are independent from each other (which means that the
covariance matrix of the measurements is diagonal). Represent the
regression curve by a sum of Legendre polynomials (the polynomials
should be translated appropriately so that they are orthogonal on the
range which covers the predictor values), and fit this linear model
using the method of least squares discussed during the lecture on regression.
Try different expansions with maximum polynomial degrees of 2, 3,
and 4. Obtain the least squares estimate of the expansion coefficients
and calculate the covariance matrix of the estimate. Use numpy arrays
and matrices in your calculations.
Visualize
your fits.
In addition, plot the normalized residuals vs. the predictor values.
Normalized residual is the difference between the observed response and
the fitted value divided by the error of the response. Do you see any
remaining structure in the residual plots?
Calculate the fit confidence levels using the chi-square statistic (you
can use the same code for calculating the chi-square tail as in Lab 9).
What maximum polynomial degree should be used to fit this data? You can
quickly see whether a coefficient in the polynomial expansion can be
set to 0 if its fitted value divided by its standard deviation (the
square root of the corresponding diagonal element in the covariance
matrix) is not far from zero, e.g.,
within the [-2, 2] range. More quantitative statements can be made by
comparing the fit confidence levels and by examining the plots of residuals. If the confidence levels of two
fits are similar and the resuduals look structureless in both cases, it is better to use a simpler model.
Part 2. Kernel Density Estimation (KDE)
Build a 1-d kernel density estimator for the mass measurement dataset from Lab 9. Try two different functions as kernels: the Gaussian

and the symmetric beta density

where the normalization term N(m) ensures
that the kernel integrates to 1 on [-1, 1]. This particular functional
shape is used quite often in the KDE practice, and in this context the
functions for several lowest m values have their own names: m = 0 is the boxcar kernel, m = 1 is Epanechnikov, m = 2 is biweight, m = 3 is triweight, and m = 4 is quadweight. Use the quadweight kernel for which N(4) = 315/256.
The
density estimate "inherits" the properties of the kernel. For example,
the estimate obtained with the Gaussuan kernel will have an infinite
number of continuous derivatives, and two separate data points can show
up as distinct peaks if they are at least the distance 2 h apart, where h is the bandwidth of the estimate.
Plot
the density estimate using several different bandwidth values. Note
that both kernel functions result in very similar density estimates
when the bandwidth is scaled appropriately: the bandwidth h used with the Gaussian kernel corresponds to the bandwidth of about 3.3 h used with the quadweight. What bandwidth value results in the estimated density shape closest to the parametric fit?
The
density estimates obtained with the symmetric kernels described above
will exhibit notable distortions near the dataset boundaries at 20 and
40. This is because kernels placed at the data points near the
boundaries do not integrate to 1. To fix this problem, apply a
normalization factor N(xi, h) to each data point xi so that the condition

is
always satisfied. It is sufficient to do this for the Gaussian kernel
only: results obtained with quadweight would be similar. You might find
the following formula useful for calculating the normalization term:

The normalization for the [20, 40] interval and bandwidth h
can be obtained from this formula with an appropriate linear variable
transformation. Modify your density estimation code so that you
use N(xi, h) K((x - xi)/h) instead of just K((x - xi)/h). To use the "erf" function in python, issue the following statement in your code
from scipy.special import erf
To
minimize the amount of coding, KDE estimators can be built as callables
which take both the kernel function and the dataset as inputs during
initialization. When possible, use numpy arrays and array functions to
speed up the code execution.
Submit your lab report by email
before 2 pm 04/10/2008.
Include the code, the plots, and a brief discussion of results.