Supersaturated plans for variable selection in large databases

Christina Parpoula, Stella Stylianou, Christos Koukouvinos, Dimitrios Simos

doi:10.19139/75

Abstract

Over the last decades, the collection and storage of data have become massive with the advance of technology, and variable selection has become a fundamental tool for large-dimensional statistical modelling problems. In this study we apply data mining techniques, metaheuristics, and experimental designs to databases in order to determine the most relevant variables for classification in regression problems, in cases where the observations and labels of a large database are available. We propose a database-driven scheme for the encryption of specific fields of a database in order to select an optimal supersaturated design consisting of the variables of a large database that have been found to significantly influence the response outcome. The proposed design selection approach is quite promising: we are able to retrieve an optimal supersaturated plan using a very small percentage of the available runs, which makes the statistical analysis of a large database computationally feasible and affordable.

Highlights

The advent of new technologies has enabled scientists to measure the class label of hundreds of variables simultaneously, and large-dimensional problems are becoming increasingly common as ever larger amounts of data are produced and stored.
Stepwise deletion and subset selection [21] are traditional variable selection techniques that are useful for exploratory investigations but become very time-consuming, or even infeasible, when the number of predictor variables of interest is large.
The proposed data-driven scheme combines metaheuristics and data mining techniques, and enables the experimenter to identify the optimal supersaturated plan retrieved from a database for variable selection purposes.

Summary

Introduction

The advent of new technologies has enabled scientists to measure the class label of hundreds of variables simultaneously, and large-dimensional problems are becoming increasingly common as ever larger amounts of data are produced and stored. Variable selection procedures via penalized likelihood (see, for example, [5] and [16]) are quickly implemented even in a large-dimensional problem, but they remain very time-consuming when applied during a large-dimensional statistical analysis. This computational difficulty prevents these methods from being widely used in real-life problems with a large number of predictors. The proposed data-driven scheme combines metaheuristics and data mining techniques, and enables the experimenter to identify the optimal supersaturated plan retrieved from a database for variable selection purposes.
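To make the idea of penalized variable selection mentioned above more concrete, the following is a minimal sketch of L1-penalized least squares (the lasso), one common instance of a penalized criterion, solved by cyclic coordinate descent. It is not the authors' method, which relies on a genetic algorithm and an L1-norm support vector machine. The function names, the penalty value `lam`, and the toy data (12 runs, 20 two-level factors, two active factors) are all assumptions chosen for illustration only.

```python
import random


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows with p entries each; y is a list of n responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column carries no information
            # Covariance of column j with the residual that ignores factor j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def toy_supersaturated_example(seed=0, n_runs=12, n_factors=20, lam=0.5):
    """Hypothetical toy data: more two-level factors than runs, two active factors."""
    rng = random.Random(seed)
    X = [[rng.choice([-1.0, 1.0]) for _ in range(n_factors)] for _ in range(n_runs)]
    # Assumed true model: only factors 2 and 7 influence the response.
    y = [3.0 * row[2] - 2.0 * row[7] + rng.gauss(0.0, 0.1) for row in X]
    beta = lasso_coordinate_descent(X, y, lam)
    selected = [j for j, b in enumerate(beta) if abs(b) > 1e-8]
    return selected, beta


if __name__ == "__main__":
    selected, _ = toy_supersaturated_example()
    print("Factors with non-zero estimated effect:", selected)
```

In this sketch the L1 penalty sets many coefficients exactly to zero, which is what turns estimation into selection. With randomly generated columns and more factors than runs, the columns are correlated, so the selected set need not coincide with the truly active factors; that sensitivity to the choice of runs is one reason the choice of design matters.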

The use of supersaturated designs (SSDs) for variable selection

The employed methods

Simple genetic algorithm

L1-norm support vector machine

The proposed method

Medical data

Performance criteria

The optimal supersaturated plan

Comparative results

Subsequent Analysis

Findings

Concluding Remarks


Journal: Statistics, Optimization & Information Computing, Vol. 2
Publication Date: Jun 1, 2014
License type: cc-by

Similar Papers

Genetic Algorithm and Data Mining Techniques for Design Selection in Databases
Christos Koukouvinos ... Dimitris E Simos
01 Sep 2013

Database research in transfusion medicine: The power of large numbers.
Steven Kleinman ... Simone A Glynn
Transfusion | VOL. 55
01 Jul 2015

Analysis of large databases in vascular surgery
Louis L Nguyen ... Neal R Barshes
Journal of Vascular Surgery | VOL. 52
01 Sep 2010
