Computational Physics Lab 10

Regression and Nonparametric Modeling

The goal of this lab is to develop techniques and some intuition for fitting linear regression models and for using nonparametric density estimation.

Part 1. Parametric Regression

Fit 1-d polynomial regression model to the following synthetic dataset: lab10_data.txt. The first column in the file is the predictor value, the second column is the response, and the third is the error of the response measurement. Assume that the error probability density is Gaussian with the standard deviation given by the third column, and that all measurements are independent from each other (which means that the covariance matrix of the measurements is diagonal). Represent the regression curve by a sum of Legendre polynomials (the polynomials should be translated appropriately so that they are orthogonal on the range which covers the predictor values), and fit this linear model using the method of least squares discussed during the lecture on regression. Try different expansions with maximum polynomial degrees of 2, 3, and 4. Obtain the least squares estimate of the expansion coefficients and calculate the covariance matrix of the estimate. Use numpy arrays and matrices in your calculations.

Visualize your fits. In addition, plot the normalized residuals vs. the predictor values. Normalized residual is the difference between the observed response and the fitted value divided by the error of the response. Do you see any remaining structure in the residual plots?

Calculate the fit confidence levels using the chi-square statistic (you can use the same code for calculating the chi-square tail as in Lab 9). What maximum polynomial degree should be used to fit this data? You can quickly see whether a coefficient in the polynomial expansion can be set to 0 if its fitted value divided by its standard deviation (the square root of the corresponding diagonal element in the covariance matrix) is not far from zero, e.g., within the [-2, 2] range. More quantitative statements can be made by comparing the fit confidence levels and by examining the plots of residuals. If the confidence levels of two fits are similar and the resuduals look structureless in both cases, it is better to use a simpler model.

Part 2. Kernel Density Estimation (KDE)

Build a 1-d kernel density estimator for the mass measurement dataset from Lab 9. Try two different functions as kernels: the Gaussian

and the symmetric beta density


where the normalization term N(m) ensures that the kernel integrates to 1 on [-1, 1]. This particular functional shape is used quite often in the KDE practice, and in this context the functions for several lowest m values have their own names: m = 0 is the boxcar kernel, m = 1 is Epanechnikov, m = 2 is biweight, m = 3 is triweight, and m = 4 is quadweight. Use the quadweight kernel for which N(4) = 315/256.

The density estimate "inherits" the properties of the kernel. For example, the estimate obtained with the Gaussuan kernel will have an infinite number of continuous derivatives, and two separate data points can show up as distinct peaks if they are at least the distance 2 h apart, where h is the bandwidth of the estimate.

Plot the density estimate using several different bandwidth values. Note that both kernel functions result in very similar density estimates when the bandwidth is scaled appropriately: the bandwidth h used with the Gaussian kernel corresponds to the bandwidth of about 3.3 h used with the quadweight. What bandwidth value results in the estimated density shape closest to the parametric fit?

The density estimates obtained with the symmetric kernels described above will exhibit notable distortions near the dataset boundaries at 20 and 40. This is because kernels placed at the data points near the boundaries do not integrate to 1. To fix this problem, apply a normalization factor N(xi, h) to each data point xi so that the condition

is always satisfied. It is sufficient to do this for the Gaussian kernel only: results obtained with quadweight would be similar. You might find the following formula useful for calculating the normalization term:

The normalization for the [20, 40] interval and bandwidth h can be obtained from this formula with an appropriate linear variable transformation. Modify your density estimation code so that you use N(xi, h) K((x - xi)/h) instead of just K((x - xi)/h). To use the "erf" function in python, issue the following statement in your code

from scipy.special import erf

To minimize the amount of coding, KDE estimators can be built as callables which take both the kernel function and the dataset as inputs during initialization. When possible, use numpy arrays and array functions to speed up the code execution.


Submit your lab report by email before 2 pm 04/10/2008. Include the code, the plots, and a brief discussion of results.