Abstract

Background: Metabolomics datasets are often high-dimensional though only a limited number of variables are expected to be informative given a specific research question. The important task of selecting informative variables can therefore become complex. In this paper we look at discriminating between two groups. Two tasks need to be performed: (i) finding variables which differ between the two groups; and (ii) determining how the selected variables can be used to classify new subjects. We introduce an approach using minimum classification error rates as test statistics to find discriminatory and therefore informative variables. The thresholds resulting in the minimum error rates can be used to classify new subjects. This approach transforms error rates into p-values and is referred to as ERp.

Results: We show that non-parametric hypothesis testing, based on minimum classification error rates as test statistics, can find statistically significantly shifted variables. The discriminatory ability of variables becomes more apparent when error rates are evaluated based on their corresponding p-values, as relatively high error rates can still be statistically significant. ERp can handle unequal and small group sizes, as well as account for the cost of misclassification. ERp retains (if known) or reveals (if unknown) the shift direction, aiding in biological interpretation. The threshold resulting in the minimum error rate can immediately be used to classify new subjects. We use NMR generated metabolomics data to illustrate how ERp is able to discriminate subjects diagnosed with Mycobacterium tuberculosis infected meningitis from a control group. The list of discriminatory variables produced by ERp contains all biologically relevant variables with appropriate shift directions discussed in the original paper from which this data is taken.

Conclusions: ERp performs variable selection and classification, is non-parametric and aids biological interpretation while handling unequal group sizes and misclassification costs. All this is achieved by a single approach which is easy to perform and interpret. ERp has the potential to address many other characteristics of metabolomics data. Future research aims to extend ERp to account for a large proportion of observations below the detection limit, as well as expand on interactions between variables.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0867-7) contains supplementary material, which is available to authorized users.
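The idea described in the abstract, using the minimum classification error rate of a single variable as a test statistic and converting it to a p-value, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the null distribution is obtained, so the sketch assumes a label-permutation test as one possible non-parametric choice. The function names (`min_error_rate`, `erp_p_value`, `classify`) and the toy data are hypothetical, and misclassification costs are left out for brevity.

```python
import random


def min_error_rate(values, labels):
    """Find the single cut-off that minimises the classification error rate.

    labels: 0 = control, 1 = case.
    Returns (error_rate, threshold, direction). Direction "up" predicts a
    case when x > threshold, "down" predicts a case when x <= threshold.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    n_case = sum(labels)
    n_ctrl = n - n_case
    # Baseline: no cut-off at all, everyone assigned to the larger group.
    best = (min(n_case, n_ctrl) / n, None, None)
    case_below = ctrl_below = 0
    for i in range(n - 1):
        value, label = pairs[i]
        if label == 1:
            case_below += 1
        else:
            ctrl_below += 1
        next_value = pairs[i + 1][0]
        if next_value == value:
            continue  # no cut-off fits between tied values
        threshold = (value + next_value) / 2
        # "up": errors are cases at or below the cut plus controls above it.
        err_up = case_below + (n_ctrl - ctrl_below)
        # "down": errors are controls at or below the cut plus cases above it.
        err_down = ctrl_below + (n_case - case_below)
        for err, direction in ((err_up, "up"), (err_down, "down")):
            if err / n < best[0]:
                best = (err / n, threshold, direction)
    return best


def erp_p_value(values, labels, n_perm=2000, seed=0):
    """Permutation p-value for the minimum error rate (assumed null model)."""
    rng = random.Random(seed)
    observed, threshold, direction = min_error_rate(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if min_error_rate(values, shuffled)[0] <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, threshold, direction, (hits + 1) / (n_perm + 1)


def classify(x, threshold, direction):
    """Assign a new subject to case (1) or control (0) using the cut-off."""
    if direction == "up":
        return int(x > threshold)
    return int(x <= threshold)


# Toy data (invented for illustration): one variable, shifted upward in cases.
controls = [0.9, 1.0, 1.1, 1.2, 1.3, 1.4]
cases = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5]
values = controls + cases
labels = [0] * len(controls) + [1] * len(cases)

error_rate, threshold, direction, p_value = erp_p_value(values, labels)
```

The returned direction records whether the variable is shifted up or down in the case group, and `classify(x, threshold, direction)` reuses the same cut-off for a new subject, in line with the two tasks named in the abstract.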

Highlights

  • Metabolomics datasets are often high-dimensional though only a limited number of variables are expected to be informative given a specific research question

  • Traditional statistical methods often make assumptions which make the validity of results questionable in the case of metabolomics

  • Projection-based statistical methods are generally rather sophisticated, limiting their practicality [4, 8]. This is likely why metabolomics researchers often combine the results of a variety of statistical and even machine learning methods to select a subset of variables


Summary

Introduction

Metabolomics datasets are often high-dimensional, though only a limited number of variables are expected to be informative given a specific research question. The thresholds resulting in the minimum error rates can be used to classify new subjects. This approach transforms error rates into p-values and is referred to as ERp. A major aim of metabolomics studies is to find metabolites that distinguish a control group of reference or “normal” subjects from a group of experimental or “abnormal” subjects, which differ from the control group subjects as a result of disease, treatment with drugs, toxicity, or environmental, genetic or physiological effects [1,2,3]. Inference after variable selection is not advisable without correcting for the uncertainty associated with the selection step [7]. These methods require model specifications such as linearity in the variables of the regression function. Combining the results of several selection methods can become cumbersome if the methods do not reach the same decision, and again requires estimation of “post-selection error” [10] when the selected variables are used in further model development.
