Performance comparison of linear and non-linear feature selection methods for the analysis of large survey datasets.

Olga Krakovska, Martin Ester, Sylvain Moreno, Gregory Christie, Andrew Sixsmith

doi:10.1371/journal.pone.0213584

Abstract

Large survey databases for aging-related analysis are often examined to discover key factors that affect a dependent variable of interest. Typically, this analysis is performed with methods that assume linear dependencies between variables. Such assumptions, however, do not hold in many cases, in which the data are linked by non-linear dependencies. This in turn calls for analytic methods that are more accurate in identifying potentially non-linear dependencies. Here, we objectively compared the feature selection performance of several frequently used linear selection methods and three non-linear selection methods in the context of large survey data. These methods were assessed using both synthetic and real-world datasets, wherein relationships between the features and dependent variables were known in advance. We found that the non-linear methods offered better overall feature selection performance than the linear methods in all usage conditions. Moreover, the performance of the non-linear methods was more stable, being unaffected by the inclusion or exclusion of variables from the datasets. These properties make non-linear feature selection methods a potentially preferable tool for both hypothesis-driven and exploratory analyses of aging-related datasets.

Highlights

Within the field of statistical gerontology, there has been increasing use of large databases to explore relationships between key factors and some outcome variable(s) of interest (dependent variable(s)).
A systematic review of 893 papers illustrated that 92% of the included papers using linear methods were unclear about the assumptions of the methods used [10]. The purpose of this paper is to provide a systematic evaluation of different approaches to feature selection. We will do this firstly by reviewing and discussing in more detail some of the key problems and limitations in the analysis of large survey databases, including variable selection when dealing with non-linear relationships.
The researcher could select features that are relevant to the question of interest, along with other features that they believe may confound the results, and ignore all other, presumably irrelevant features.

Summary

Introduction

Within the field of statistical gerontology, there has been increasing use of large databases to explore relationships between key factors and some outcome variable(s) of interest (dependent variable(s)). It is not uncommon for large survey databases to store dozens or hundreds of different measurements for each person (we refer to these measurements as features). Researchers will often select a handful of features and assess the predictive ability of these features using a variant of regression such as linear regression. Both of these operations (feature selection and prediction) are potentially problematic for the analysis of many large survey databases. In most analyses of aging-related datasets, experimenters must identify and select features that are relevant to the dependent variables of interest and reject all other, irrelevant features. This is typically done using one of two non-exclusive approaches. The researcher could select features that are relevant to the question of interest (e.g. number of alcoholic units consumed per week), along with other features that they believe may confound the results (e.g. education level), and ignore all other, presumably irrelevant features.
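The concern that linear methods can miss features linked to the outcome by a non-linear dependency can be illustrated with a small synthetic sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the quadratic outcome, the feature names `relevant` and `irrelevant`, the sample size, and the use of a histogram-based mutual information estimate as a stand-in for a non-linear relevance score (it is not claimed to be one of the three non-linear methods the authors evaluated).

```python
import math
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation: measures linear association only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def binned(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=8):
    """Histogram estimate of mutual information, in nats."""
    bx, by = binned(x, bins), binned(y, bins)
    n = len(x)
    cx, cy, cxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    # Sum of p(i, j) * log(p(i, j) / (p(i) * p(j))) over occupied cells.
    return sum(
        (c / n) * math.log(c * n / (cx[i] * cy[j]))
        for (i, j), c in cxy.items()
    )


# Hypothetical synthetic survey: the outcome depends on `relevant`
# through a symmetric, purely non-linear (quadratic) relationship.
rng = random.Random(0)
n_people = 2000
features = {
    "relevant": [rng.uniform(-1, 1) for _ in range(n_people)],
    "irrelevant": [rng.uniform(-1, 1) for _ in range(n_people)],
}
outcome = [v ** 2 + rng.gauss(0, 0.05) for v in features["relevant"]]

# For each feature: (absolute linear score, non-linear score).
scores = {
    name: (abs(pearson(col, outcome)), mutual_information(col, outcome))
    for name, col in features.items()
}

for name, (linear, nonlinear) in scores.items():
    print(f"{name}: |pearson| = {linear:.3f}, mutual information = {nonlinear:.3f}")
```

In this setup the linear score treats both features as roughly equally uninformative, while the mutual information estimate separates them clearly. The example only shows why a linearity assumption can hide a relevant feature; it says nothing about how the methods compared in the paper behave on real survey data.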



Journal: PLOS ONE
Publication Date: Mar 21, 2019
Citations: 26
License type: CC BY 4.0


