Penalized splines for smooth representation of high-dimensional Monte Carlo datasets

Nathan Whitehorn, Jakob Van Santen, Sven Lafebre

doi:10.1016/j.cpc.2013.04.008

Open Access

https://doi.org/10.1016/j.cpc.2013.04.008

Abstract

Detector response to a high-energy physics process is often estimated by Monte Carlo simulation. For purposes of data analysis, the results of this simulation are typically stored in large multi-dimensional histograms, which can quickly become both too large to easily store and manipulate and numerically problematic due to unfilled bins or interpolation artifacts. We describe here an application of the penalized spline technique (Marx and Eilers, 1996) [1] to efficiently compute B-spline representations of such tables and discuss aspects of the resulting B-spline fits that simplify many common tasks in handling tabulated Monte Carlo data in high-energy physics analysis, in particular their use in maximum-likelihood fitting.

Program summary

Program title: Photospline
Catalogue identifier: AEPK_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEPK_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: 2-clause BSD
No. of lines in distributed program, including test data, etc.: 9723
No. of bytes in distributed program, including test data, etc.: 156138
Distribution format: tar.gz
Programming language: C, Python
Computer: 32- and 64-bit x86, 32- and 64-bit PowerPC
Operating system: Linux, Mac OS X, FreeBSD
Has the code been vectorized or parallelized?: Both
RAM: Approximately proportional to number of knots used in fitting, depends on problem condition
Classification: 4.9
External routines: SuiteSparse (http://www.cise.ufl.edu/research/sparse/SuiteSparse/), Python (http://www.python.org/), BLAS (http://www.netlib.org/blas/), Numpy (http://www.numpy.org/)

Nature of problem: An algorithm to smoothly represent histograms, including mathematical operations and convolutions. Using histograms of Monte Carlo simulation for likelihood fitting can be unstable due to binning artifacts from statistical fluctuations and hard bin-to-bin transitions. This package provides a toolkit for using penalized spline fits on extremely large multi-dimensional datasets to reduce or eliminate such issues.

Solution method: Uses sparse matrix operations, non-negative least-squares fitting, and generalized linear array models in conjunction with a number of other algorithms to allow fits to be made, manipulated, and saved with very low computational requirements. This enables even very large problems to be solved on commercially available machines.

Restrictions: Required computation time and memory increase very rapidly with the number of dimensions. Fits without stacking involving more than 5 dimensions and 20 knots on each are usually not practical on 2012-era hardware.

Running time: Roughly proportional to the cube of the number of knots used, depends strongly on conditioning of the problem.
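The penalized spline idea named in the abstract can be illustrated in one dimension with a minimal sketch. It is not the Photospline API: the function names `bspline_basis` and `pspline_fit`, the knot layout, and the smoothing strength are assumptions chosen for illustration. It uses dense NumPy algebra and ordinary penalized least squares, and leaves out the sparse matrices, non-negativity constraint, and generalized linear array models listed under the solution method.

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """Evaluate every B-spline basis function at x (Cox-de Boor recursion)."""
    x = np.asarray(x, dtype=float)
    # Degree 0: indicator of each half-open knot interval.
    B = np.zeros((x.size, len(knots) - 1))
    for i in range(len(knots) - 1):
        B[:, i] = (knots[i] <= x) & (x < knots[i + 1])
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((x.size, len(knots) - 1 - d))
        for i in range(len(knots) - 1 - d):
            left = (x - knots[i]) / (knots[i + d] - knots[i])
            right = (knots[i + d + 1] - x) / (knots[i + d + 1] - knots[i + 1])
            B_new[:, i] = left * B[:, i] + right * B[:, i + 1]
        B = B_new
    return B

def pspline_fit(x, y, knots, degree=3, penalty_order=2, smoothness=1.0):
    """Least-squares B-spline fit with a difference penalty on the coefficients."""
    B = bspline_basis(x, knots, degree)
    # Finite-difference operator acting on adjacent coefficients.
    D = np.diff(np.eye(B.shape[1]), n=penalty_order, axis=0)
    A = B.T @ B + smoothness * (D.T @ D)
    return np.linalg.solve(A, B.T @ y)

# Hypothetical example: smooth a noisy 50-bin histogram on [0, 1).
rng = np.random.default_rng(0)
samples = rng.normal(0.5, 0.15, size=2000)
counts, edges = np.histogram(samples, bins=50, range=(0.0, 1.0))
centers = 0.5 * (edges[:-1] + edges[1:])

# Uniform knots with spacing 0.1, extended three intervals past each end
# so that the cubic basis is complete over the whole data range.
knots = np.arange(-3, 14) * 0.1
coeffs = pspline_fit(centers, counts.astype(float), knots, smoothness=1.0)

B = bspline_basis(centers, knots, 3)
smooth = B @ coeffs  # smoothed bin contents
```

Here the table of 50 bin contents is replaced by 13 spline coefficients, and the `smoothness` parameter trades fidelity to the bin contents against roughness of the coefficient sequence. The dense solve is only reasonable for small one-dimensional problems like this one.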
