Abstract
Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short- and long-term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data that combines convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, at the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training a Read2Pheno model encodes sequences (reads) into dense, meaningful representations: learned embedding vectors output by an intermediate layer of the network, which can provide biological insight when visualized. The attention layer of a Read2Pheno model can also automatically identify nucleotide regions in reads/sequences that are particularly informative for classification. As such, this approach can avoid the pre/post-processing and manual interpretation required by conventional approaches to microbiome sequence classification. We further show, as a proof of concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a Python package) and https://github.com/EESI/seq2att (a command line tool).
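To make the role of the attention layer concrete, the following is a minimal NumPy sketch of attention pooling over per-position features of a single read. It is an illustration only, not the Read2Pheno implementation: the additive (tanh) scoring form, the function name `attention_pool`, and the parameter names `w`, `b`, and `v` are all assumptions introduced here. The sketch shows how one weight per nucleotide position can both summarize a read into a single embedding vector and indicate which positions contributed most.

```python
import numpy as np


def attention_pool(hidden, w, b, v):
    """Pool per-position features into one read embedding (hypothetical helper).

    hidden : (L, d) array, one feature vector per sequence position,
             e.g. the output of convolutional and recurrent layers.
    w, b, v: assumed attention parameters with shapes (d, k), (k,), (k,).
    Returns (embedding of shape (d,), attention weights of shape (L,)).
    """
    # One unnormalized score per position (additive attention, an assumption).
    scores = np.tanh(hidden @ w + b) @ v
    # Numerically stable softmax so the weights are positive and sum to 1.
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    # Weighted sum of position features gives the read-level embedding.
    embedding = weights @ hidden
    return embedding, weights


# Toy usage with random values standing in for learned features and parameters.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(100, 8))   # a 100-position read with 8 features each
w = rng.normal(size=(8, 4))
b = np.zeros(4)
v = rng.normal(size=4)
embedding, weights = attention_pool(hidden, w, b, v)
top_positions = np.argsort(weights)[::-1][:5]  # most heavily weighted positions
```

In a trained model, positions with large weights would be the candidates for the informative nucleotide regions described above; here the weights are meaningless because the parameters are random.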
Highlights
Advances in DNA sequencing are rapidly producing complex microbiome data sets in fields ranging from human health to environmental studies [1].
As described in detail below, we analyzed three distinct 16S ribosomal RNA (rRNA) amplicon sequence data sets: 1) data provided by the American Gut Project (AGP), in which samples are labeled by body site of origin and thereby reflect microbiome phenotype; 2) data published by Gevers et al. (Gevers), which are labeled by disease diagnosis, i.e., host phenotype; and 3) the SILVA rRNA database, a large corpus of comprehensive, quality-checked 16S rRNA sequences with taxonomic labels.
We have shown that our proposed attention-based deep neural network models for read-level classification, namely Read2Pheno models, achieve prediction accuracy comparable to conventional approaches while offering automated model interpretation on three distinct kinds of tasks: (1) prediction of microbiome phenotype, (2) prediction of host phenotype, and (3) taxonomic classification of full-length 16S rRNA sequences.
Summary
Advances in DNA sequencing are rapidly producing complex microbiome data sets in fields ranging from human health to environmental studies [1]. Large-scale microbial projects provide rich information, enabling prediction of sample-level traits (i.e., phenotypes), aiding biological discovery, and supporting medical diagnosis. A typical microbiome study may contain hundreds to thousands of samples. Each sample, in turn, contains thousands of reads, depending on the sequencing depth. These reads are fragments of DNA/RNA material extracted from microbes residing in the environment where the sample was collected. An environmental sample can be sequenced via 16S ribosomal RNA amplicon technology to provide a comprehensive taxonomic survey of an environment’s or subject’s microbial community [2, 3].
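Because each sample consists of many reads, read-level predictions must be combined to obtain a sample-level prediction. The sketch below shows one simple way to do this: averaging the class probabilities of all reads that belong to the same sample. The source states only that read-level information is aggregated, so the averaging rule, the function name `aggregate_reads`, and the toy labels are assumptions made for illustration.

```python
import numpy as np


def aggregate_reads(read_probs, sample_ids):
    """Average read-level class probabilities per sample (hypothetical helper).

    read_probs : (n_reads, n_classes) array of per-read class probabilities.
    sample_ids : length-n_reads sequence naming the sample each read came from.
    Returns a dict mapping sample id -> (predicted class index, mean probabilities).
    """
    read_probs = np.asarray(read_probs, dtype=float)
    sample_ids = np.asarray(sample_ids)
    result = {}
    for sample in np.unique(sample_ids).tolist():
        # Mean over all reads of this sample; an assumed aggregation rule.
        mean_probs = read_probs[sample_ids == sample].mean(axis=0)
        result[sample] = (int(mean_probs.argmax()), mean_probs)
    return result


# Toy usage: three reads from two samples, two candidate phenotype classes.
read_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
sample_ids = ["A", "A", "B"]
predictions = aggregate_reads(read_probs, sample_ids)
```

Other aggregation rules, such as a majority vote over per-read predicted classes, would fit the same interface.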