Abstract

Subset selection in multiple linear regression aims to identify the explanatory variables that are most relevant to the response variable and to discard the rest. Traditionally, a regression model is obtained by solving an optimization problem, after which statistical tests and evaluations are performed to confirm its validity and to verify that it meets the qualifying requirements. In practice, this stage requires human intervention. To remove the need for manual inspection, we propose a new mixed integer programming formulation for the subset selection problem in multiple linear regression. Our model enforces statistical significance of the coefficients and controls multicollinearity through explicit constraints, along with several other well-known criteria. In contrast to existing approaches that incorporate these criteria gradually through a cutting-plane structure, our formulation includes them explicitly. The effectiveness of the approach is evaluated on several real datasets and compared with existing methods from the literature. The results demonstrate that significant coefficients can be obtained without any considerable change in the mean squared error.
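
To make the selection criteria concrete, the following is a minimal sketch that enumerates subsets by brute force instead of solving a mixed integer program, so it is not the formulation proposed here. The function names (`fit_ols`, `select_subset`), the thresholds `t_min` and `corr_max`, the use of pairwise correlation as a stand-in for a multicollinearity check, and the synthetic data are all assumptions made for illustration.

```python
import itertools
import numpy as np


def fit_ols(X, y):
    """Fit OLS with an intercept; return coefficients, t-statistics, and MSE."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                          # residual degrees of freedom
    mse = resid @ resid / dof                     # unbiased residual variance
    cov = mse * np.linalg.inv(A.T @ A)            # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    return beta, t_stats, mse


def select_subset(X, y, max_size, t_min=2.0, corr_max=0.8):
    """Return (mse, subset) with the lowest MSE among admissible subsets.

    A subset is admissible (illustrative rules, not the paper's constraints) if
    no pair of its variables has |correlation| > corr_max and every slope
    has |t| >= t_min. Returns None if no subset qualifies.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    best = None
    for k in range(1, max_size + 1):
        for subset in itertools.combinations(range(X.shape[1]), k):
            # Multicollinearity proxy: reject highly correlated pairs.
            if any(corr[i, j] > corr_max
                   for i, j in itertools.combinations(subset, 2)):
                continue
            _, t_stats, mse = fit_ols(X[:, list(subset)], y)
            # Significance: every slope (intercept excluded) must pass.
            if np.any(np.abs(t_stats[1:]) < t_min):
                continue
            if best is None or mse < best[0]:
                best = (mse, subset)
    return best


# Synthetic data: y depends on x0 and x2; x1 is a noisy copy of x0; x3 is noise.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = x0 + 0.3 * rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])
y = 3.0 * x0 - 2.0 * x2 + 0.5 * rng.normal(size=n)

mse, subset = select_subset(X, y, max_size=3)
print("selected variables:", subset, "MSE:", round(float(mse), 3))
```

Enumeration is only feasible for a handful of candidate variables. A mixed integer programming formulation expresses the same kind of admissibility rules as constraints so that a solver can search the subsets implicitly.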
