Identifying indicator species in ecological habitats using Deep Optimal Feature Learning.

Yiting Tsai, Susan A Baldwin, Bhushan Gopaluni, Zaher Mundher Yaseen

doi:10.1371/journal.pone.0256782


Open Access

https://doi.org/10.1371/journal.pone.0256782


Journal: PLOS ONE
Publication Date: Sep 10, 2021
License type: CC BY 4.0

Affiliation: University of British Columbia

Abstract

Much of the current research on supervised modelling is focused on maximizing outcome prediction accuracy. However, in engineering disciplines, an arguably more important goal is that of feature extraction, the identification of relevant features associated with the various outcomes. For instance, in microbial communities, the identification of keystone species can often lead to improved prediction of future behavioral shifts. This paper proposes a novel feature extractor based on Deep Learning, which is largely agnostic to underlying assumptions regarding the training data. Starting from a collection of microbial species abundance counts, the Deep Learning model first trains itself to classify the selected distinct habitats. It then identifies indicator species associated with the habitats. The results are then compared and contrasted with those obtained by traditional statistical techniques. The indicator species are similar when compared at top taxonomic levels such as Domain and Phylum, despite visible differences in lower levels such as Class and Order. More importantly, when our estimated indicators are used to predict final habitat labels using simpler models (such as Support Vector Machines and traditional Artificial Neural Networks), the prediction accuracy is improved. Overall, this study serves as a preliminary step that bridges modern, black-box Machine Learning models with traditional, domain expertise-rich techniques.

Highlights

The main motivation of this work is to propose a Machine Learning-based feature extractor that works generally for non-linear, high-dimensional datasets, to assess its efficacy on a real biological case with pre-determined features, and thereby to determine whether it is sufficiently reliable for use in other similar studies
Most of the analysis focuses on predictive classifiers using Deep Learning (DL) architectures, as well as their optimally-separating variant, which we propose as an improved feature extractor
The results show that the optimally-separating model produces much larger separating margins between the binary classes, whilst still achieving perfect test accuracy when compared to the traditional

Summary

Introduction

The main motivation of this work is to propose a Machine Learning-based feature extractor that works generally for non-linear, high-dimensional datasets, to assess its efficacy on a real biological case with pre-determined features, and thereby to determine whether it is sufficiently reliable for use in other similar studies. Biological disciplines often face the task of modelling high-dimensional datasets (i.e. datasets with many raw input variables). In areas such as disease diagnosis and prevention, much of the current research focuses on maximizing the prediction accuracy of outcome labels.
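
The workflow outlined in the abstract (train a classifier on species abundance counts, read off the features it relies on, then reuse those features in a simpler model) can be illustrated with a minimal sketch. This is not the paper's Deep Optimal Feature Learning method: a plain logistic regression stands in for the Deep Learning classifier, ranking by absolute weight stands in for the indicator-identification step, and a nearest-centroid rule stands in for the simpler downstream models. The data are synthetic, and all names (make_counts, PLANTED, top_features, and so on) are hypothetical.

```python
import math
import random

def make_counts(n_samples, n_species, indicators, rng):
    """Synthetic abundance counts for two hypothetical habitats (labels 0 and 1)."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.randint(0, 20) for _ in range(n_species)]
        if label == 1:
            for j in indicators:
                row[j] += 30  # planted indicator species, enriched in habitat 1
        X.append(row)
        y.append(label)
    return X, y

def standardize(X):
    """Z-score each species column so the learned weights are comparable."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def train_logistic(X, y, epochs=200, lr=0.5):
    """Full-batch gradient descent on the logistic loss (stand-in classifier)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def top_features(w, k):
    """Indices of the k species with the largest absolute weight."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Simple downstream model restricted to the selected species."""
    centroids = {}
    for label in (0, 1):
        rows = [[r[j] for j in features]
                for r, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(c) / len(rows) for c in zip(*rows)]
    correct = 0
    for r, t in zip(X_test, y_test):
        point = [r[j] for j in features]
        pred = min(centroids, key=lambda lab: sum(
            (a - b) ** 2 for a, b in zip(point, centroids[lab])))
        correct += pred == t
    return correct / len(X_test)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
PLANTED = [2, 5, 9]             # hypothetical indicator species indices
X_train, y_train = make_counts(200, 12, PLANTED, rng)
X_test, y_test = make_counts(100, 12, PLANTED, rng)

weights, _ = train_logistic(standardize(X_train), y_train)
selected = sorted(top_features(weights, k=3))
accuracy = nearest_centroid_accuracy(X_train, y_train, X_test, y_test, selected)
print("selected species:", selected, "held-out accuracy:", accuracy)
```

On this toy data the planted species are easy to recover because they separate the two habitats cleanly. Real abundance tables are sparse, compositional and non-linear, which is the setting the paper's Deep Learning extractor is aimed at.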
