Abstract

Accurate gene prediction in metagenomics fragments is computationally challenging because the reads are short and the data are incomplete and fragmented. Most gene-prediction programs extract a large number of features and then apply statistical or supervised classification approaches to predict genes. In this study, we introduce CNN-MGP, a convolutional neural network for metagenomics gene prediction, which predicts genes in metagenomics fragments directly from raw DNA sequences, without manual feature extraction and feature selection stages. CNN-MGP learns the characteristics of coding and non-coding regions and distinguishes coding from non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets defined by pre-defined GC content ranges. We extract the ORFs from each fragment; each ORF is then encoded numerically and fed into the CNN model that matches the GC content of its fragment. The output of the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm selects the final gene list. Overall, CNN-MGP is effective and achieves 91% accuracy on the testing dataset. CNN-MGP demonstrates the ability of deep learning to predict genes in metagenomics fragments, and its accuracy is higher than or comparable to that of state-of-the-art gene-prediction programs that use pre-defined features.
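
The pre-processing and post-processing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the CNN-MGP implementation, and several details are assumptions made only for illustration: the equal-width GC binning in `model_index`, the start codon set, the restriction to complete forward-strand ORFs, the one-hot encoding, and the `min_len`, `threshold`, and `max_overlap` values. The CNN itself is omitted; `greedy_select` simply takes ORFs that have already been scored with a probability.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed start codon set, for illustration only


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def model_index(gc: float, n_models: int = 10) -> int:
    """Map GC content in [0, 1] to one of n_models equal-width bins (hypothetical binning)."""
    return min(int(gc * n_models), n_models - 1)


def find_orfs(seq: str, min_len: int = 60) -> List[Tuple[int, int]]:
    """Return (start, end) of complete forward-strand ORFs; end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


def one_hot(orf: str) -> List[List[int]]:
    """Encode each base as a 4-element vector; unknown bases become all zeros (assumed encoding)."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table.get(base, [0, 0, 0, 0]) for base in orf.upper()]


def greedy_select(scored: List[Tuple[int, int, float]],
                  threshold: float = 0.5,
                  max_overlap: int = 60) -> List[Tuple[int, int, float]]:
    """Pick ORFs in order of decreasing probability, skipping any that overlap
    an already chosen ORF by more than max_overlap bases (assumed rule)."""
    chosen: List[Tuple[int, int, float]] = []
    for start, end, prob in sorted(scored, key=lambda t: t[2], reverse=True):
        if prob < threshold:
            break  # remaining candidates all score below the threshold
        if all(min(end, e) - max(start, s) <= max_overlap for s, e, _ in chosen):
            chosen.append((start, end, prob))
    return sorted(chosen)
```

In use, a fragment would be passed to `gc_content` and `model_index` to pick a model, each ORF from `find_orfs` would be encoded with `one_hot` and scored by that model, and the scored ORFs would then go through `greedy_select`.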

Highlights

  • Metagenomics is the analysis of genomes contained in environmental samples, such as soil, seawater, and human gut samples [1,2,3]

  • We explore the possibility of using a CNN-based approach for gene prediction in metagenomics fragments

  • If a predicted open reading frame (ORF) is incorrectly identified as a gene, it is considered a false positive (FP)

Introduction

Metagenomics is the analysis of genomes contained in environmental samples, such as soil, seawater, and human gut samples [1,2,3]. Metagenomics analysis uses modern techniques to study microbial organisms directly in their natural environments, without the need to isolate and cultivate individual species in the lab [4]. Metagenomics has many useful applications in medicine, engineering, agriculture, and ecology [5, 6]. Gene prediction, the process of locating coding regions in genomic sequences [7, 8], is an important step in the metagenomics pipeline. Studies identified genes through experiments on living cells and organisms [9], a reliable but expensive
