Abstract

Background

Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences.

Results

In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probabilities of overlapping candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriate model to classify each candidate gene according to the GC content of its read.

Conclusion

Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction: it improves both classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction.
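The greedy selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each candidate gene is a tuple of inclusive start and end coordinates plus a coding probability, and that candidates are visited from most to least probable. The abstract does not state the overlap rule, so the `max_overlap` tolerance, the names `Candidate` and `select_genes`, and the example values are all hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    start: int          # inclusive start position on the read
    end: int            # inclusive end position on the read
    probability: float  # posterior probability of the coding class


def overlap_length(a: Candidate, b: Candidate) -> int:
    """Number of positions shared by two candidates (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def select_genes(candidates: List[Candidate], max_overlap: int = 0) -> List[Candidate]:
    """Greedily keep the most probable candidates that do not conflict.

    max_overlap is an assumed tolerance, not a value taken from the paper.
    """
    selected: List[Candidate] = []
    # Visit candidates from most to least probable.
    for cand in sorted(candidates, key=lambda c: c.probability, reverse=True):
        # Keep a candidate only if it is compatible with everything kept so far.
        if all(overlap_length(cand, kept) <= max_overlap for kept in selected):
            selected.append(cand)
    # Report the final genes in positional order.
    return sorted(selected, key=lambda c: c.start)


# Made-up illustration data: the second candidate overlaps the first.
candidates = [
    Candidate(1, 300, 0.91),
    Candidate(250, 600, 0.75),
    Candidate(650, 900, 0.80),
]
final_genes = select_genes(candidates)
```

With these illustrative inputs, the lower-probability candidate that overlaps the first one is dropped, while the two disjoint candidates are kept.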

Highlights

  • Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction

  • Performance measures: gene prediction performance is measured by comparing the model's predictions with the true gene annotations of fragments obtained from GenBank [25]

  • We count the number of true positives, false positives, and false negatives
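Counts like these are usually combined into summary scores. The sketch below computes sensitivity, precision, and their harmonic mean from true-positive, false-positive, and false-negative counts. This summary does not state which formulas the authors report, so these standard definitions, the function name `prediction_scores`, and the example counts are assumptions for illustration only.

```python
from typing import Dict


def prediction_scores(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard scores from confusion counts (assumed, not taken from the paper)."""
    # Sensitivity: fraction of annotated genes that were predicted.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: fraction of predicted genes that match the annotation.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Harmonic mean of the two, guarded against division by zero.
    total = sensitivity + precision
    harmonic_mean = 2 * sensitivity * precision / total if total else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "harmonic_mean": harmonic_mean,
    }


# Made-up counts for illustration.
scores = prediction_scores(tp=80, fp=20, fn=20)
```

The zero-denominator guards matter for short fragments, where a read may contain no annotated or no predicted genes at all.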


Summary

Introduction

Al-Ajlan and El Allali, BioData Mining (2018) 11:9

Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. [...] studies identified genes through reliable experiments on living cells and organisms, which is usually an expensive and time-consuming task [8]. Content-based methods try to overcome these limitations by using statistical approaches to detect variations between coding and non-coding regions [1, 8]. While these approaches are very successful on genomic sequences, there is still work to be done for metagenomics due to the nature of the data [6, 12]. The greatest challenges for gene prediction algorithms in metagenomics are the short read length and the incomplete, fragmented nature of the data [1, 13].

