Computational Physics Lab 9
Fitting a Parametric
Density Model
The goal of this lab is to learn how to fit a parametric density model
to observed data and how to build frequentist interval estimators of
the model parameters.
The file lab9_data.txt
contains a set
of points sampled from a one-dimensional probability distribution which
has two components: a slowly varying background and a signal peak which
can be reasonably well represented by the Cauchy density.
Such distributions are typical in the studies of mass
spectra of
elementary particles (similar problems are encountered in atomic
spectrography, observational astronomy, etc.). In such a study,
the peak in the spectrum is
produced by the invariant mass of decay products of the particle we are
interested in, while the background is due to events with
similar
properties but produced in a different, unrelated particle
reaction. We
want to know the mass of the particle under study (which corresponds to
the peak
position), its lifetime (which is inversely proportional to the width
of the signal peak), and the fraction of the signal events in the
sample.
The sample space for the data is continuous, and it is between xmin
= 20 and
xmax
= 40 mass units (these units can be, for example, GeV/c2).
1) Get some
idea about the shape of the distribution
by histogramming the data. Here is a code snippet which can be
used to read the data into a python list and then make a histogram out
of it (it assumes that you are running "ipython -pylab"):
def
read_data(filename):
data = []
f = open(filename, "r")
try:
for line in f:
if not line.isspace():
l = line.strip()
if not l.startswith("#"):
data.append(float(l))
finally:
f.close()
return data
data
= read_data("lab9_data.txt")
hist(data,
20)
The complete description of the "hist" command can be obtained here
or by typing "help(hist)" at the ipython prompt.
2) Build the
likelihood function, L(θ),
where θ
is the vector of
model parameters. It is
probably best to implement the likelihood function in a class which has
the __call__ method accepting
the vector of model parameters as its input. The density
model used by your likelihood function should look like this:

Here, the full set of parameters θ includes
the signal fraction q,
the signal parameters θs,
and the backround parameters θb.
The signal function s(x|θs)
is the Cauchy distribution with location parameter μ, scale
parameter Γ, and appropriate normalization factor discussed below. The
slowly varying background shape b(x|θb)
should be modeled by a polynomial (it is useful to try linear,
quadratic, and cubic approximations). It also has to be normalized
properly.
For the maximum likelihood method to work, it is absolutely
critical that f(x|θ) is, indeed, a
density for any value of θ.
It means that the probability density normalization condition

must be satisfied for any θ.
This condition will be automatically satisfied for any value of q if both s(x|θs)
and b(x|θb) are
normalized. For example, s(x|θs)
should look like

where C(x|θs)
is the usual Cauchy density defined on the infinite interval:

It is easy to evaluate the normalization integrals for the
Cauchy distribution and the polynomial function algebraically. Numerical
integration can be used for more complicated models.
I suggest using Legendre
polynomials orthogonal on the interval (xmin,
xmax)
as building blocks for your polynomial approximations instead of using
monomials xn. This trivializes the
calculation of the normalization integral (only the 0th
order polynomial contributes) and, more importantly, improves the
convergence of subsequent likelihood maximization procedure.
To make Legendre polynomials orthogonal on (xmin,
xmax), use a linear mapping in which xmin →
−1 and xmax → 1.
In order not to run out of the dynamic range of the floating point numbers, your code should calculate log(L(θ)) by summing logs of f(xi|θ) instead of directly multiplying f(xi|θ).
3) Obtain the point estimator of θ by finding that value of θ which maximizes the likelihood function L(θ). In your code, you should minimize −log(L(θ)) which leads to the same result. Use one of the methods from the pylab optimize package, for example, fmin_bfgs.
It is very important to provide a reasonable initial guess to the minimizer. To do this, plot f(x|θ) using various values of θ and choose a value for which the density looks similar to the histogram from section 1).
The following code snippet illustrates usage of a callable with the fmin_bfgs function. Of course, your code should calculate −log(L(θ)) inside the __call__ method rather than emulate this simple example.
from numpy import array
from scipy.optimize.optimize import fmin_bfgs
class my_functor(object):
def __init__(self, data):
self.data = array(data)
def __call__(self, parameters):
delta = parameters - self.data
return sum(delta*delta)
my_data = (1, 2)
fcn = my_functor(my_data)
my_initial_parameter_guess = array((5, 5))
parameter_estimator = fmin_bfgs(fcn, my_initial_parameter_guess)
print parameter_estimator
4) Check
the quality of your fit. This can be done in several different ways.
The easiest one (although not the most powerful) consists in comparing
the height of the histogram bins with the fitted density. Do it
visually first: multiply the density for the optimal value of θ by
the number of data points (400 in the data file given) and by the bin
width of the histogram, and plot it together with the histogram --
the curve and the histogram should now look quite similar. These
factors (the number of data points and the bin width) are selected in
such a way that the curve and the top edges of the histogram bins
should appear on your plot at the same level.
Then perform a quantitative analysis by executing the Pearson's chi-square test. Using results of this test, decide which polynomial model is best for the background: linear, quadratic, or cubic.
To get the function which evaluates the integral of the chi-square distribution from some value x to infinity, download the following files: chtail.pyf, chtail.f, and gerfc.c. Compile a Python extension with
f2py -c chtail.pyf chtail.f gerfc.c
You can now calculate the integral from x to infinity of the chi-square distribution with n degrees of freedom as follows:
from chtail import chtail
fit_confidence_level = chtail(x, n)
Note that x must be non-negative, and n must be positive for the returned result to make sense.
5) Build the frequentist interval estimator of the location of the signal peak treating all other parameters in the model as nuisance parameters. Assume that the distribution of θ
estimator is sufficiently close to multivariate Gaussian (it
becomes multivariate Gaussian asymptotically, when the number of data
points is very large). Under this assumption, the interval
estimator for μ can be easily obtained by analyzing the profile likelihood for μ which
is going to look like univariate Gaussian (and, therefore, its
logarithm should look like a parabola). The profile likelihood
function, Lprof(μ), is built by rerunning the optimization step 3) in which the value of μ is fixed. That is, L(θ) is maximized over all parameters in θ except μ, and Lprof(μ) is defined as that optimal L(θ) value. Make a plot which shows the value of −log(Lprof(μ)) as a function of μ. You only need to do this in a narrow range near the optimal μ value. The interval estimator for μ is obtained by finding those μ values for which −log(Lprof(μ)) is increased from its minimum value by 0.5. If the Gaussian approximation for Lprof(μ) works reasonably well, with this procedure you will find the points one standard deviation away from the mean.
Submit your lab report by email
before 2 pm 04/03/2008.
Include the code, the plots, and a brief discussion of results.