rFSA: An R Package for Finding Best Subsets and Interactions

Joshua Lambert, Arnold Stromberg, Katherine Thompson, Liyu Gong, Corrine F. Elliott

doi:10.32614/rj-2018-059

Abstract

Herein we present the R package rFSA, which implements an algorithm for improved variable selection. The algorithm searches a data space for models of a user-specified form that are statistically optimal under a measure of model quality. Many iterations afford a set of feasible solutions (or candidate models) that the researcher can evaluate for relevance to his or her questions of interest. The algorithm can be used to formulate new or to improve upon existing models in bioinformatics, health care, and myriad other fields in which the volume of available data has outstripped researchers' practical and computational ability to explore larger subsets or higher-order interaction terms. The package accommodates linear and generalized linear models, as well as a variety of criterion functions such as Allen's PRESS and AIC. New modeling strategies and criterion functions can be adapted easily to work with rFSA.
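The search described in the abstract can be illustrated with a minimal sketch. It assumes a swap-based scheme: start from a random subset of m predictors, repeatedly make the single exchange that most improves the criterion, and stop when no exchange helps. Repeating this from several random starts yields a set of feasible solutions. The code is Python rather than R and is not the package's implementation. The function names (rss, make_criterion, feasible_solution, fsa), the simulated data, and the use of residual sum of squares as the criterion are all assumptions made for illustration.

```python
import random


def rss(columns, y):
    """Residual sum of squares from least squares of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    M = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gaussian elimination with partial pivoting.
    for j in range(k):
        piv = max(range(j, k), key=lambda r: abs(M[r][j]))
        M[j], M[piv] = M[piv], M[j]
        for r in range(j + 1, k):
            f = M[r][j] / M[j][j]
            for s in range(j, k + 1):
                M[r][s] -= f * M[j][s]
    # Back substitution for the coefficient vector b.
    b = [0.0] * k
    for j in reversed(range(k)):
        b[j] = (M[j][k] - sum(M[j][s] * b[s] for s in range(j + 1, k))) / M[j][j]
    return sum((y[i] - sum(X[i][s] * b[s] for s in range(k))) ** 2 for i in range(n))


def make_criterion(data, y):
    """Assumed model form: main effects of the subset plus their single product term."""
    def criterion(subset):
        cols = [data[j] for j in subset]
        product = [1.0] * len(y)
        for col in cols:
            product = [u * v for u, v in zip(product, col)]
        return rss(cols + [product], y)
    return criterion


def feasible_solution(p, m, criterion, rng):
    """From a random start, make the best single swap until none lowers the criterion."""
    current = tuple(sorted(rng.sample(range(p), m)))
    best = criterion(current)
    while True:
        candidates = []
        for pos in range(m):
            for new in range(p):
                if new in current:
                    continue
                swapped = tuple(sorted(current[:pos] + (new,) + current[pos + 1:]))
                candidates.append((criterion(swapped), swapped))
        score, subset = min(candidates, default=(best, current))
        if score >= best:
            return current, best
        current, best = subset, score


def fsa(p, m, criterion, numrs, seed=0):
    """Collect the distinct feasible solutions found over numrs random starts."""
    rng = random.Random(seed)
    found = {}
    for _ in range(numrs):
        subset, score = feasible_solution(p, m, criterion, rng)
        found[subset] = score
    return sorted(found.items(), key=lambda item: item[1])


# Simulated data (hypothetical): predictors 2 and 5 interact in the response.
rng = random.Random(1)
N, P, M_SIZE = 100, 8, 2
data = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(P)]
y = [data[2][i] + data[5][i] + 2 * data[2][i] * data[5][i] + rng.gauss(0, 1)
     for i in range(N)]
criterion = make_criterion(data, y)
solutions = fsa(P, M_SIZE, criterion, numrs=5, seed=42)
for subset, score in solutions:
    print(subset, round(score, 2))
```

Each printed row is one feasible solution with its criterion value. Different starts may converge to different subsets, which is why more than one solution can be returned. Using another measure of model quality only requires passing a different criterion function.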

Highlights

In recent years, novel statistical modeling techniques have become more computationally intensive in an effort to accommodate the massive datasets afforded by advances in fields such as data mining and genetic sequencing.
Manually checking candidate interactions is tedious and time-consuming, and usually results in interactions being ignored or overlooked due to the sheer number of possibilities. Together, these factors lead to a widespread neglect of interactions, which undermines the predictive power of models attempting to capture complex relationships (Foster and Stine, 2004). We address these limitations by implementing a Feasible Solutions Algorithm (FSA) with the capacity to explore higher-order terms, combined with the accessibility and ease of use associated with an R package.
Timing comparisons: We present the results of a simulation conducted to compare the performance of rFSA against that of exhaustive search with leaps.
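The scale of the problem can be made concrete with a little arithmetic. The sketch below counts the models an exhaustive search must fit to examine every m-way combination among p predictors, which is the binomial coefficient C(p, m). The helper name and the chosen values of p and m are illustrative assumptions, not figures from the simulation reported in the paper.

```python
from math import comb


def exhaustive_model_count(p, m):
    """Number of distinct m-variable subsets (one candidate model each) among p predictors."""
    return comb(p, m)


counts = {(p, m): exhaustive_model_count(p, m)
          for p in (10, 100, 1000) for m in (2, 3, 4)}
for (p, m), count in counts.items():
    print(f"p={p:>4}, m={m}: {count:,} candidate models")
```

With p = 100, for example, there are 4,950 pairs and 161,700 triples to consider, and the count grows rapidly with both p and m.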

Summary

Introduction

Novel statistical modeling techniques have become more computationally intensive in an effort to accommodate the massive datasets afforded by advances in fields such as data mining and genetic sequencing. Given an FSA object as its argument, the print command displays a table containing the original user-specified model and all feasible solutions that the algorithm found over numrs random starts. The bestglm (McLeod and Xu, 2017) package seeks to extend best-subset model selection to generalized linear models but is not natively capable of looking for higher-order interactions, using external criterion functions, or accommodating other statistical methods. It also makes no special provision for large problems, rendering it unsuitable for datasets with more than 100 predictors. For large p, we argue that rFSA is a practical solution for researchers who wish to consider high-dimensional data, higher-order terms, generalized linear or mixed models, or other non-traditional statistical methods and criterion functions. The package is easy to use (as demonstrated by the brevity of the code provided in this example) as well as highly efficient, and generally returns multiple subsets of variables to permit flexible exploration and validation.

Conclusion
