Abstract

Fungi play essential roles in many ecological processes, and taxonomic classification is fundamental to microbial community characterization and vital for the study and preservation of fungal biodiversity. To cope with massive fungal barcode data, tools that can process large volumes of barcode sequences, especially those from the internal transcribed spacer (ITS) region, are necessary. However, the high variation in the ITS region and the computational cost of processing high-dimensional features remain challenging for existing predictors. In this study, we developed Its2vec, a bioinformatics tool for classifying fungal ITS barcodes to the species level. An ITS database covering more than 25,000 species across a broad range of fungal taxa was assembled. For dimensionality reduction, a word embedding algorithm was used to represent each ITS sequence as a dense low-dimensional vector. A random forest-based classifier was then built for species identification. Benchmarking showed that our model achieved accuracy comparable to that of several state-of-the-art predictors and, more importantly, that it could handle large datasets while greatly reducing dimensionality. We expect the Its2vec model to be helpful for fungal species identification and, thus, for revealing microbial community structures and deepening our understanding of their functional mechanisms.
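The embedding step described in the abstract can be illustrated with a minimal sketch. It is not the Its2vec implementation: the abstract does not state the word length, the vector dimension, or how word vectors are pooled, so the k-mer size of 3, the dimension of 8, and the mean pooling below are assumptions. The function `toy_embedding_table` is a hypothetical stand-in that assigns random vectors; in a real pipeline those vectors would come from a trained word embedding model, and the resulting sequence vectors would be passed to a random forest classifier.

```python
import random


def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, treated as 'words'."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_embedding_table(vocab, dim=8, seed=0):
    """Hypothetical stand-in for a trained embedding model.

    Assigns one random vector per k-mer; a trained model would instead
    place k-mers that share contexts close together.
    """
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for w in sorted(vocab)}


def embed_sequence(seq, table, k=3):
    """Average the vectors of all known k-mers into one fixed-length vector."""
    dim = len(next(iter(table.values())))
    vectors = [table[w] for w in kmers(seq, k) if w in table]
    if not vectors:
        # No known k-mers (e.g. sequence shorter than k): return a zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up sequences of different lengths, for illustration only.
seqs = ["ACGTTGCAAC", "TTGACCGTAGGCTA"]
vocab = {w for s in seqs for w in kmers(s)}
table = toy_embedding_table(vocab)
embedded = [embed_sequence(s, table) for s in seqs]
```

Whatever the input length, each sequence maps to a vector of the same small dimension, which is the property that keeps the downstream classifier's feature space compact.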

Highlights

  • Metabarcoding is among the most promising approaches in the study of microbial communities [1,2,3] and has provided new insights into microbial impacts on crop yields [4], human health [5], and ecology [6]

  • Several genetic markers have been successfully employed for fungal species identification, including the small ribosomal subunit, the large ribosomal subunit, the RNA polymerase II binding protein, and the internal transcribed spacer (ITS)

  • Because some species hypotheses (SHs) were represented by hundreds of sequences, 10 sequences were randomly selected from each SH containing more than 10 sequences to reduce the heterogeneity in sequence numbers
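The per-SH downsampling in the last highlight can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `downsample_per_sh`, the `(sh_id, sequence)` record layout, the fixed seed, and the SH identifiers are all assumptions made for the example.

```python
import random
from collections import defaultdict


def downsample_per_sh(records, max_per_sh=10, seed=42):
    """Keep at most max_per_sh randomly chosen sequences per species hypothesis.

    records: iterable of (sh_id, sequence) pairs (hypothetical layout).
    Returns a dict mapping each sh_id to its retained sequences.
    """
    by_sh = defaultdict(list)
    for sh_id, seq in records:
        by_sh[sh_id].append(seq)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sampled = {}
    for sh_id, seqs in by_sh.items():
        if len(seqs) > max_per_sh:
            # Sample without replacement from over-represented SHs.
            sampled[sh_id] = rng.sample(seqs, max_per_sh)
        else:
            # SHs at or below the cap are kept unchanged.
            sampled[sh_id] = list(seqs)
    return sampled


# Made-up records: one over-represented SH and one small SH.
records = [("SH_A", f"seqA{i}") for i in range(25)]
records += [("SH_B", f"seqB{i}") for i in range(4)]
sampled = downsample_per_sh(records)
```

Only SHs above the cap are subsampled, so rare SHs keep every sequence they have.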

Summary

Introduction

Metabarcoding is among the most promising approaches in the study of microbial communities [1,2,3] and has provided new insights into microbial impacts on crop yields [4], human health [5], and ecology [6]. Fungi are immensely diverse; the most recent estimate puts the total number of species in this kingdom between 2.2 and 2.8 million [7]. Several genetic markers have been successfully employed for fungal species identification, including the small ribosomal subunit, the large ribosomal subunit, the RNA polymerase II binding protein, and the internal transcribed spacer (ITS). The ITS (comprising ITS1 and ITS2, separated by the 5.8S genic region) has been widely adopted as a marker for fungal identification and diversity exploration [15,16,17,18,19] because this region is ubiquitous and shows great variation in sequence and length [9].
