Abstract

Over the last decades, advances in technology have made the collection and storage of data massive, and variable selection has become a fundamental tool for large-dimensional statistical modelling problems. In this study we apply data mining techniques, metaheuristics and experimental designs to databases in order to determine the most relevant variables for classification in regression problems, in cases where the observations and labels of a large database are available. We propose a database-driven scheme for the encryption of specific fields of a database in order to select an optimal supersaturated design consisting of those variables of a large database that have been found to significantly influence the response outcome. The proposed design selection approach is promising, since we are able to retrieve an optimal supersaturated plan using a very small percentage of the available runs, which makes the statistical analysis of a large database computationally feasible and affordable.

Highlights

  • The advent of new technologies has enabled scientists to measure the class label of hundreds of variables simultaneously, and large-dimensional problems are becoming more and more common as large amounts of data are increasingly produced and stored

  • Stepwise deletion and subset selection [21] are traditional variable selection techniques that are useful for exploratory investigations but become very time-consuming, or even infeasible, when the number of predictor variables of interest is large

  • The proposed data-driven scheme is a combination of metaheuristics and data mining techniques, and enables the experimenter to identify the optimal supersaturated plan retrieved from a database for variable selection purposes

Summary

Introduction

The advent of new technologies has enabled scientists to measure the class label of hundreds of variables simultaneously, and large-dimensional problems are becoming more and more common as large amounts of data are increasingly produced and stored. Variable selection procedures via penalized likelihood (see, for example, [5] and [16]) are quickly implemented even in a large-dimensional problem, but they remain very time-consuming when they are applied during a large-dimensional statistical analysis. This computational difficulty prevents these methods from being widely used in real-life problems with a large number of predictors. The proposed data-driven scheme is a combination of metaheuristics and data mining techniques, and enables the experimenter to identify the optimal supersaturated plan retrieved from a database for variable selection purposes.

The use of SSDs for variable selection
The employed methods
Simple genetic algorithm
L1-norm support vector machine
The proposed method
Medical data
Performance criteria
The optimal supersaturated plan
Comparative results
Subsequent Analysis
Findings
Concluding Remarks
