SymbolFit: Automatic Parametric Modeling with Symbolic Regression
We introduce SymbolFit (API: https://github.com/hftsoi/symbolfit), a framework that automates parametric modeling by using symbolic regression to search for functions that fit the data while providing uncertainty estimates in the same run. Traditionally, constructing a parametric model that accurately describes binned data has been a manual, iterative process in which an adequate functional form must be determined before the fit can be performed. The main challenge arises when an appropriate functional form cannot be derived from first principles, especially when the distribution has no underlying closed-form function. In this work, we develop a framework that automates and streamlines this process using symbolic regression, a machine learning technique that explores a vast space of candidate functions without requiring a predefined functional form: the functional form itself is treated as a trainable parameter, which makes the process far more efficient and effortless than traditional regression methods. We demonstrate the framework on high-energy physics experiments at the CERN Large Hadron Collider (LHC) using five real proton-proton collision datasets from new physics searches, covering background modeling in resonance searches for high-mass dijet, trijet, paired-dijet, diphoton, and dimuon events. We show that the framework can flexibly and efficiently generate a wide range of candidate functions that fit a nontrivial distribution well using a simple fit configuration that varies only by random seed. The same fit configuration, which defines a vast function space, can also be applied to distributions of different shapes, whereas achieving a comparable result with traditional methods would require extensive manual effort.
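As a rough illustration of the idea, and not of SymbolFit's algorithm or API, the toy sketch below scans a small hand-written set of candidate functional forms, fits each one to binned data by weighted least squares, and reads parameter uncertainties off the covariance matrix. The candidate list, the synthetic data, and every function name are assumptions made for this example; a real symbolic regression search generates its candidates rather than reading them from a fixed list.

```python
import numpy as np

def weighted_linear_fit(basis, x, y, sigma):
    """Weighted least squares for y ~ sum_k c_k * basis[k](x)."""
    # Scale each row by 1/sigma so ordinary least squares minimizes chi2.
    A = np.column_stack([f(x) for f in basis]) / sigma[:, None]
    b = y / sigma
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    cov = np.linalg.inv(A.T @ A)            # parameter covariance matrix
    chi2 = float(np.sum((A @ coef - b) ** 2))
    ndf = len(x) - len(basis)               # degrees of freedom
    return coef, cov, chi2, ndf

# Hypothetical candidate forms, each linear in its parameters.
candidates = {
    "a + b*x": [np.ones_like, lambda x: x],
    "a + b*x + c*x^2": [np.ones_like, lambda x: x, lambda x: x**2],
    "a*exp(-x) + b": [lambda x: np.exp(-x), np.ones_like],
}

def scan(x, y, sigma):
    """Fit every candidate and collect parameters, errors, and chi2/ndf."""
    results = {}
    for name, basis in candidates.items():
        coef, cov, chi2, ndf = weighted_linear_fit(basis, x, y, sigma)
        results[name] = {
            "coef": coef,
            "err": np.sqrt(np.diag(cov)),
            "chi2": chi2,
            "ndf": ndf,
        }
    return results

# Synthetic, noise-free binned spectrum (an assumption for reproducibility).
x = np.linspace(0.5, 5.0, 20)               # bin centers
truth = 100.0 * np.exp(-x) + 5.0            # expected bin contents
sigma = np.sqrt(truth)                      # Poisson-like uncertainties
y = truth.copy()

results = scan(x, y, sigma)
best = min(results, key=lambda k: results[k]["chi2"] / results[k]["ndf"])
print(best, results[best]["coef"], results[best]["err"])
```

Restricting the candidates to forms that are linear in their parameters keeps the fit in closed form; forms that are nonlinear in their parameters would need an iterative minimizer instead.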
- Research Article
16
- 10.2307/1349434
- May 1, 1996
- Applied Economic Perspectives and Policy
Economic theory is useful in many aspects of model specification, such as identifying relevant variables for a supply equation or suggesting homogeneity restrictions, curvature requirements, or symmetry restrictions across equations. Theory is seldom sufficient, though, to determine functional form, either a hypothesized true form or a reasonable approximation. Because the validity of statistical tests and inferences is conditional on model specification, the functional form should be appropriate for the specific research use or hypotheses to be tested, capture applicable theoretical concerns, and also allow the data to speak. In addition to theoretical considerations, empirical priors (e.g., knowledge about the technology or industry characteristics) and/or model pretesting are often considered by the analyst in choosing a functional form. The choice of functional form is not a trivial matter. Empirical estimates, including own-price elasticities, elasticities of substitution, returns to scale, and model specification test conclusions, are often sensitive to choice of functional form (e.g., Berndt and Khaled; Chalfant; Swamy and Binswanger; Shumway and Lim). Perhaps of greatest importance is the fact that predicted responses of policy analyses using an inferior functional form may be biased and inaccurate, thus posing serious problems for policy impact analysis. Identifying suitable functional forms before estimating parameters of concern is clearly important. The most frequently used functional forms in production (and consumption) analyses are second-order Taylor-series expansions, also termed locally or just flexible functional forms.¹ These forms have a sufficient number of parameters to represent comparative statics at a point without imposing any restrictions across effects (Fuss, McFadden, and Mundlak). Three flexible functional forms dominate the recent empirical production economics literature: translog, generalized Leontief, and quadratic. An examination of the 113 published articles cited by Fox and Kivanda and Shumway that estimated static dual models of agricultural production between 1972 and 1993 revealed that one-half used the translog (TL) functional form, one-fourth used the normalized quadratic (NQ), one-eighth used the generalized Leontief (GL), and one-eighth used a variety of other functional forms. Empirical priors for specifying functional form for an agricultural production model are often limited, both because of the small number of functional form tests conducted and because of differences in findings among them. For example, using different data sets for U.S. agriculture, Gottret failed to reject any of these three functional forms for a restricted profit function, Ornelas, Shumway, and Ozuna failed to reject only the NQ for a restricted profit function, and Chalfant rejected both functional
David P. Anderson is an Agricultural Economist, Livestock Marketing Information Center; Andrew Tan Khee Guan is a lecturer, Universiti Sains Malaysia; and the remaining authors are, respectively, Research Assistant, Graduate Student, Research Assistant, and Professor of Agricultural Economics, Texas A&M University. The authors are listed alphabetically. Senior authorship is shared equally. The constructive comments of anonymous RAE reviewers on earlier drafts of this manuscript are gratefully acknowledged. The manuscript reports research conducted by the Texas Agricultural Experiment Station, Texas A&M University System.
¹ They can be expansions of a monotonic transformation of the underlying function, not necessarily of the function itself.
- Conference Article
3
- 10.1115/imece2003-41977
- Jan 1, 2003
In the present study we propose the application of evolutionary algorithms to find correlations that can predict the performance of a compact heat exchanger. Genetic programming (GP) is a search technique in which computer codes, representing functions as parse trees, evolve as the search proceeds. As a symbolic regression approach, GP looks for both the functional form and the coefficients that enable the closest fit to experimental data. Two different data sets are used to test the symbolic regression capability of genetic programming: the first is artificial data from a one-dimensional function, and the second is data generated by previously determined correlations from experimental measurements of a single-phase air-water heat exchanger. The results demonstrate that the correlations found by symbolic regression predict the data from which they were determined well, and that the GP technique may be suitable for modeling the nonlinear behavior of heat exchangers. It is also shown that this procedure does not yield a unique best-fit correlation. The advantage of using genetic programming for symbolic regression is that no initial assumptions about the functional form are needed, in contrast to the traditional approach.
- Research Article
2
- 10.1371/journal.pone.0287546
- Jun 23, 2023
- PLOS ONE
Given that an excellent performance of any parametric functional form for the Lorenz curve that is based on a single country case study and a limited range of distribution must be treated with great caution, this study investigates the performance of a single-parameter functional form proposed by Paul and Shankar (2020) who use income data of Australia to show that their functional form is superior to the other existing widely used functional forms considered in their study. By using both mathematical proof and empirical data of 40 countries around the world, this study demonstrates that Paul and Shankar (2020)'s functional form not only fails to fit the actual observations well but also is generally outperformed by the other popular functional forms considered in their study. Moreover, to overcome the limitation of the performance of a single-parameter functional form on the criterion of the estimated Gini index, this study employs a functional form that has more than one parameter in order to show that, by and large, it performs better than all popular single-parameter functional forms considered in Paul and Shankar (2020)'s study. Thus, before applying any functional form to estimate the Lorenz curve, policymakers should check if it could describe the shape of income distributions of different countries through the changes in parameter values and yield the values of the estimated Gini index that are close to their observed data. Using a functional form that does not fit the actual observations could adversely affect inequality measures and income distribution policies.
- Conference Article
- 10.1145/3373376.3380612
- Mar 9, 2020
The High Energy Physics (HEP) experiments at particle colliders need complex computing infrastructures in order to extract knowledge from the large datasets collected, with over 1 Exabyte of data stored by the experiments to date. The computing needs of the world's leading machine, the Large Hadron Collider (LHC) at CERN/Geneva, seeded the large-scale GRID R&D and deployment efforts of the first decade of the 2000s, which in hindsight proved adequate for LHC data processing. The upcoming upgrade of the LHC collider, called High Luminosity LHC (HL-LHC), is foreseen to require an increase in computing resources by a factor between 10x and 100x, currently expected to be beyond the scalability of the existing distributed infrastructure. Current lines of R&D are presented and discussed. With the start of big scientific endeavours with a computing complexity similar to HL-LHC (SKA, CTA, Dune, ...), these lines of R&D are expected to be valid for science fields outside HEP.
- Research Article
44
- 10.1016/j.amar.2013.11.001
- Dec 9, 2013
- Analytic Methods in Accident Research
Analyzing different functional forms of the varying weight parameter for finite mixture of negative binomial regression models
- Research Article
2
- 10.1371/journal.pone.0270222
- Dec 22, 2022
- PLOS ONE
In the estimation of demand functions for energy resources, parametric econometric models of energy demand are commonly used to predict future energy needs. The functional forms commonly assumed in parametric energy demand models include linear functional forms, log-linear forms, trans-log models, and an almost ideal demand system. It is frequently debated which is the “best” functional form to employ in order to accurately represent the underlying relationships between the demand for various energy resources and explanatory variables such as energy prices, income, and other demographic factors. This study focuses on developing proper non-nested tests to compare two demand systems, the double log model and the LA-AIDS model. The C-test is used to test the validity of using the two parametric functional forms in models of residential energy demand. Cross-sectional household-level data of the Pakistan Social and Living Standards Measurement (PSLM) 2013–14 and Asian Development Bank (ADB) Asia and Pacific 2018 are used. Results indicate that the LA-AIDS model is better than the double log model. The estimated elasticities of own prices, cross-prices, and income in terms of spending, family size, and equipment are particularly important to producers and policymakers in making investment and incentive choices. A significant part of the family budget goes to electricity, natural gas, firewood, and other fuels; smaller budget shares go to other items such as kerosene oil, cylinder gas, and diesel. Household per capita demand for energy resources will rise over the next decade; therefore, the government needs to make progress on developing energy-saving strategies. If the issue is not addressed properly, we may face energy shortages and high energy import bills.
- Research Article
4
- 10.1088/2632-2153/adaad8
- Jan 29, 2025
- Machine Learning: Science and Technology
Compact symbolic expressions have been shown to be more efficient than neural network (NN) models in terms of resource consumption and inference speed when implemented on custom hardware such as field-programmable gate arrays (FPGAs), while maintaining comparable accuracy (Tsoi et al 2024 EPJ Web Conf. 295 09036). These capabilities are highly valuable in environments with stringent computational resource constraints, such as high-energy physics experiments at the CERN Large Hadron Collider. However, finding compact expressions for high-dimensional datasets remains challenging due to the inherent limitations of genetic programming (GP), the search algorithm of most symbolic regression (SR) methods. Contrary to GP, the NN approach to SR offers scalability to high-dimensional inputs and leverages gradient methods for faster equation searching. Common ways of constraining expression complexity often involve multistage pruning with fine-tuning, which can result in significant performance loss. In this work, we propose SymbolNet, a NN approach to SR specifically designed as a model compression technique, aimed at enabling low-latency inference for high-dimensional inputs on custom hardware such as FPGAs. This framework allows dynamic pruning of model weights, input features, and mathematical operators in a single training process, where both training loss and expression complexity are optimized simultaneously. We introduce a sparsity regularization term for each pruning type, which can adaptively adjust its strength, leading to convergence at a target sparsity ratio. Unlike most existing SR methods that struggle with datasets containing more than O(10) inputs, we demonstrate the effectiveness of our model on the LHC jet tagging task (16 inputs), MNIST (784 inputs), and SVHN (3072 inputs).
- Discussion
255
- 10.1016/0094-1190(85)90012-9
- Sep 1, 1985
- Journal of Urban Economics
The choice of functional forms for hedonic price equations: Comment
- Research Article
5
- 10.1007/s40819-015-0113-z
- Oct 30, 2015
- International Journal of Applied and Computational Mathematics
The objective of this paper is to represent the interval number in different functional forms. We have represented the interval number by parametric product functional form, symmetric functional form, asymmetric functional form and convex combination functional form. We represent positive interval number in parametric product functional form, symmetric and asymmetric functional forms. However any interval number can be represented by a convex combination functional form. We also study the arithmetic operations of interval numbers based on the different forms of functional representation. Numerical examples are given to illustrate our proposed approach for arithmetic operations of the interval number in different functional form. Finally some open problems are mentioned at the end of the paper.
- Research Article
42
- 10.1088/1361-6471/ab7ff7
- May 19, 2020
- Journal of Physics G: Nuclear and Particle Physics
This document summarises proposed searches for new physics accessible in the heavy-ion mode at the CERN Large Hadron Collider (LHC), both through hadronic and ultraperipheral γγ interactions, and that have a competitive or, even, unique discovery potential compared to standard proton–proton collision studies. Illustrative examples include searches for new particles (such as axion-like pseudoscalars, radions, magnetic monopoles, new long-lived particles, dark photons, and sexaquarks as dark matter candidates) as well as new interactions, such as nonlinear or non-commutative QED extensions. We argue that such interesting possibilities constitute a well-justified scientific motivation, complementing standard quark-gluon-plasma physics studies, to continue running with ions at the LHC after the Run-4, i.e. beyond 2030, including light and intermediate-mass ion species, accumulating nucleon–nucleon integrated luminosities in the accessible fb⁻¹ range per month.
- Research Article
45
- 10.1007/s10107-018-1289-x
- May 11, 2018
- Mathematical Programming
Symbolic regression methods generate expression trees that simultaneously define the functional form of a regression model and the regression parameter values. As a result, the regression problem can search many nonlinear functional forms using only the specification of simple mathematical operators such as addition, subtraction, multiplication, and division, among others. Currently, state-of-the-art symbolic regression methods leverage genetic algorithms and adaptive programming techniques. Genetic algorithms lack optimality certifications and are typically stochastic in nature. In contrast, we propose an optimization formulation for the rigorous deterministic optimization of the symbolic regression problem. We present a mixed-integer nonlinear programming (MINLP) formulation to solve the symbolic regression problem as well as several alternative models to eliminate redundancies and symmetries. We demonstrate this symbolic regression technique using an array of experiments based upon literature instances. We then use a set of 24 MINLPs from symbolic regression to compare the performance of five local and five global MINLP solvers. Finally, we use larger instances to demonstrate that a portfolio of models provides an effective solution mechanism for problems of the size typically addressed in the symbolic regression literature.
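To make the expression-tree idea concrete, here is a minimal sketch of a tree whose shape fixes the functional form and whose constant leaves hold the parameter values. The class names `Const`, `Var`, and `BinOp` and the helper functions are hypothetical choices for this illustration and are not taken from the paper or from any symbolic regression library.

```python
import operator
from dataclasses import dataclass
from typing import Union

# The four simple operators mentioned in the abstract.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

@dataclass
class Const:
    value: float          # a regression parameter stored in a leaf

@dataclass
class Var:
    name: str             # an input variable stored in a leaf

@dataclass
class BinOp:
    op: str               # one of the keys of OPS
    left: "Node"
    right: "Node"

Node = Union[Const, Var, BinOp]

def evaluate(node, env):
    """Recursively evaluate the tree for the variable values in env."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    return OPS[node.op](evaluate(node.left, env), evaluate(node.right, env))

def to_string(node):
    """Render the tree as a fully parenthesized formula."""
    if isinstance(node, Const):
        return str(node.value)
    if isinstance(node, Var):
        return node.name
    return f"({to_string(node.left)} {node.op} {to_string(node.right)})"

# Functional form f(x) = c1 * x + c2 / x with parameters c1 = 2.0, c2 = 1.0.
tree = BinOp("+",
             BinOp("*", Const(2.0), Var("x")),
             BinOp("/", Const(1.0), Var("x")))

print(to_string(tree))              # ((2.0 * x) + (1.0 / x))
print(evaluate(tree, {"x": 4.0}))   # 8.25
```

A search method, whether genetic or MINLP-based, then amounts to choosing both the tree shape and the values in the `Const` leaves.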
- Research Article
11
- 10.17221/4501-jfs
- Apr 30, 2006
- Journal of Forest Science
Forest modellers have long faced the problem of selecting an appropriate mathematical model to describe tree ontogenetic or size-shape empirical relationships for tree species. A common practice is to develop many models (or a model pool) that include different functional forms, and then to select the most appropriate one for a given data set. However, this process may impose subjective restrictions on the functional form. In this process, little attention is paid to the features (e.g. asymptote and inflection point rather than asymptote and nonasymptote) of different functional forms, and to the intrinsic curve of a given data set. In order to find a better way of comparing and selecting the growth models, this paper describes and analyses the characteristics of the Schnute model. This model has both flexibility and versatility that have not been used in forestry. In this study, the Schnute model was applied to different data sets of selected forest species to determine their functional forms. The results indicate that the model shows some desirable properties for the examined data sets, and allows for discerning the different intrinsic curve shapes such as sigmoid, concave and other curve shapes. Since no suitable functional form for a given data set is usually known prior to the comparison of candidate models, it is recommended that the Schnute model be used as the first step to determine an appropriate functional form of the data set under investigation in order to avoid using a functional form a priori.
- Research Article
1
- 10.1002/acs.2578
- May 25, 2015
- International Journal of Adaptive Control and Signal Processing
Several works have demonstrated detection of changes of state equations (called structural changes) based on statistical measures but have given no suggestions regarding the functional forms of the state equations after changes. This paper deals with the estimation of structural changes in nonlinear time series models by using particle filters, genetic programming (GP), and its applications. We consider the problems of state estimation from the observed time series that are generated based on nonlinear state equations. It is assumed that structural changes can be detected by some measure of likelihood and that the state equation after changes is modified from its current functional form. Individuals corresponding to functional forms in the GP pool are generated at random, and we apply the crossover operation between the current functional form and the individuals by giving possible multiple functional forms. Then, we have the optimal functional form among the possible functional forms generated by GP from the current form. As an application, we show the estimation of structural change for an artificially generated time series and also discuss the estimation of functional forms for a real economic time series before and after structural changes. Copyright © 2015 John Wiley & Sons, Ltd.
- Research Article
13
- 10.1007/s10950-017-9698-5
- Sep 18, 2017
- Journal of Seismology
Advancement in the seismic networks results in formulation of different functional forms for developing any new ground motion prediction equation (GMPE) for a region. To date, various guidelines and tools are available for selecting a suitable GMPE for any seismic study area. However, these methods are efficient in quantifying the GMPE but not for determining a proper functional form and capturing the epistemic uncertainty associated with selection of GMPE. In this study, the compatibility of the recent available functional forms for the active region is tested for distance and magnitude scaling. Analysis is carried out by determining the residuals using the recorded and the predicted spectral acceleration values at different periods. Mixed effect regressions are performed on the calculated residuals for determining the intra- and interevent residuals. Additionally, spatial correlation is used in mixed effect regression by changing its likelihood function. Distance scaling and magnitude scaling are respectively examined by studying the trends of intraevent residuals with distance and the trend of the event term with magnitude. Further, these trends are statistically studied for a respective functional form of a ground motion. Additionally, genetic algorithm and Monte Carlo method are used respectively for calculating the hinge point and standard error for magnitude and distance scaling for a newly determined functional form. The whole procedure is applied and tested for the available strong motion data for the Himalayan region. The functional forms used for testing are five Himalayan GMPEs, five GMPEs developed under the NGA-West 2 project, two from the Pan-European region, and one from the Japan region. It is observed that a bilinear functional form with magnitude and distance hinged at Mw 6.5 and 300 km respectively is suitable for the Himalayan region. Finally, a new regression coefficient for peak ground acceleration for a suitable functional form that governs the attenuation characteristic of the Himalayan region is derived.
- Research Article
20
- 10.1016/j.soildyn.2021.107024
- Oct 23, 2021
- Soil Dynamics and Earthquake Engineering
A comparison of artificial neural network and classical regression models for earthquake-induced slope displacements