Abstract
Large survey databases for aging-related analysis are often examined to discover key factors that affect a dependent variable of interest. Typically, this analysis is performed with methods that assume linear dependencies between variables. Such assumptions, however, do not hold in many cases, in which the data are linked by non-linear dependencies. This in turn requires analytic methods that are more accurate in identifying potentially non-linear dependencies. Here, we objectively compared the feature selection performance of several frequently used linear selection methods and three non-linear selection methods in the context of large survey data. These methods were assessed using both synthetic and real-world datasets, in which the relationships between the features and dependent variables were known in advance. We found that the non-linear methods offered better overall feature selection performance than the linear methods in all usage conditions. Moreover, the performance of the non-linear methods was more stable, being unaffected by the inclusion or exclusion of variables from the datasets. These properties make non-linear feature selection methods a potentially preferable tool for both hypothesis-driven and exploratory analyses of aging-related datasets.
Highlights
Within the field of statistical gerontology, there has been increasing use of large databases to explore relationships between key factors and some outcome variable(s) of interest (dependent variable(s)).
A systematic review of 893 papers showed that 92% of the included papers using linear methods were unclear about the assumptions of the methods used [10]. The purpose of this paper is to provide a systematic evaluation of different approaches to feature selection. We will do this firstly by reviewing and discussing in more detail some of the key problems and limitations in the analysis of large survey databases, including variable selection when dealing with non-linear relationships.
The researcher could select features that are relevant to the question of interest, along with other features that they believe may confound the results, and ignore all other, presumably irrelevant features.
Summary
Within the field of statistical gerontology, there has been increasing use of large databases to explore relationships between key factors and some outcome variable(s) of interest (dependent variable(s)). It is not uncommon for large survey databases to store dozens or hundreds of different measurements for each person (we refer to these measurements as features). Researchers will often select a handful of features and assess the predictive ability of these features using a variant of regression, such as linear regression. Both of these operations, feature selection and prediction, are potentially problematic for the analysis of many large survey databases. In most analyses of aging-related datasets, experimenters must identify and select features that are relevant to the dependent variables of interest and reject all other, irrelevant features. This is typically done using one of two non-exclusive approaches. The researcher could select features that are relevant to the question of interest (e.g. number of alcoholic units consumed per week), along with other features that they believe may confound the results (e.g. education level), and ignore all other, presumably irrelevant features.
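As a minimal sketch of why a linearity assumption can be problematic for feature selection, the Python example below scores synthetic features with a linear measure (Pearson correlation) and with a simple non-linear measure (a histogram estimate of mutual information). Everything in it is an illustrative assumption rather than a method evaluated in the paper: the quadratic relationship, the sample size, the bin count, and the function names `pearson` and `binned_mutual_information` are hypothetical, and the binned estimator merely stands in for non-linear selection methods in general.

```python
import math
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient: a purely linear dependence measure."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def binned_mutual_information(x, y, bins=10):
    """Histogram estimate of mutual information (in nats), equal-width bins."""

    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins
        if width == 0:
            width = 1.0  # constant input: every value falls into bin 0
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint = Counter(zip(bx, by))
    count_x, count_y = Counter(bx), Counter(by)
    # Sum p(i, j) * log(p(i, j) / (p(i) * p(j))) over the occupied cells.
    return sum(
        (c / n) * math.log(c * n / (count_x[i] * count_y[j]))
        for (i, j), c in joint.items()
    )


# Assumed synthetic data: the outcome depends on `signal` only through
# signal**2, while `noise` is unrelated to the outcome.
random.seed(0)
n_samples = 2000
signal = [random.uniform(-1.0, 1.0) for _ in range(n_samples)]
noise = [random.uniform(-1.0, 1.0) for _ in range(n_samples)]
outcome = [s ** 2 + random.gauss(0.0, 0.05) for s in signal]

r_signal = pearson(signal, outcome)
mi_signal = binned_mutual_information(signal, outcome)
mi_noise = binned_mutual_information(noise, outcome)

print(f"Pearson r (signal vs outcome): {r_signal:.3f}")
print(f"Binned MI (signal vs outcome): {mi_signal:.3f}")
print(f"Binned MI (noise vs outcome):  {mi_noise:.3f}")
```

Under these assumptions, the correlation between the informative feature and the outcome is close to zero even though the outcome depends strongly on it, whereas the mutual information estimate is clearly larger for the informative feature than for the unrelated one. A selection rule based on correlation alone would therefore tend to discard the informative feature.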