Abstract

In binary classification problems with a rare class of interest, relatively little information is available about the rare class for building a model. At the same time, the set of potentially useful variables for classification can be high-dimensional. For example, in drug discovery there are usually very few bioactive compounds in a large chemical library, whereas thousands of potentially useful explanatory variables characterize a compound's chemical structure. The sparsity of information for the rare class makes it difficult for standard classification models to exploit the richness of the feature variables. The objective of this paper is therefore to develop an R package that clusters the feature variables into diverse subsets, which are then aggregated into a powerful ensemble for detecting objects of the rare class. The ensemble of phalanxes (EPX) builds a classifier by exploiting the richness of the feature variables through several diverse subsets of variables, called phalanxes, and outperforms many competitive state-of-the-art classification methods in terms of predictive ranking of the rare class of interest. We present the R package EPX, which implements the algorithm for forming the ensemble of phalanxes together with its associated functions. We further show how the ensemble of phalanxes can be constructed using parallel computing to lower the computational burden for high-dimensional data. The package provides a flexible way of clustering the feature variable space into smaller, diverse subsets of variables to build an ensemble of phalanxes that better ranks rare-class objects in highly unbalanced two-class classification problems. EPX will be useful for detecting rare drug-like active biomolecules for development in drug discovery (Tomal et al., Mar. 2016) [1] and homologous proteins using similarity scores of amino acid sequences in protein homology (Tomal et al., 2019) [2].
The package EPX is freely available to download from CRAN (https://CRAN.R-project.org/package=EPX).
