Abstract

Background: Microbes play a fundamental economic, social, and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (the complex communities) and their interactions. The way they act is usually estimated by looking at the functions they perform in those environments, and their responsibility is measured by their genes. Advances in next-generation sequencing technology have facilitated metagenomics research; however, they also create a heavy computational burden. Large and complex biological datasets are available as never before. Many gene predictors are available to aid the gene annotation process, though they do not appropriately handle the complexities of metagenomic data. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates.

Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across the benchmark we provide by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed in percentage terms by 54% and 64%, respectively, against Prodigal and FragGeneScan while handling high-complexity metagenomes. The specificity of geneRFinder showed its largest gap against FragGeneScan, 79 percentage points, and was 66 percentage points higher than that of Prodigal. According to McNemar's test, all percentage differences between predictor performances are statistically significant for all datasets at a 99% confidence level.

Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at gitlab.com/r.lorenna/generfinder and https://osf.io/w2yd6/. We also provide a novel, comprehensive benchmark dataset for gene prediction, based on the Critical Assessment of Metagenome Interpretation (CAMI) challenge and containing labeled data from gene regions, available at https://sourceforge.net/p/generfinder-benchmark.
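
To make the significance claim concrete, here is a minimal sketch of how McNemar's test compares two predictors on the same labeled sequences. It is not the authors' code: the function name `mcnemar`, the 0/1 label encoding, and the use of the continuity-corrected chi-square form are assumptions made for illustration.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar's test on paired predictions.

    Returns (statistic, p_value). Only sequences on which the two
    predictors disagree in correctness contribute to the statistic.
    """
    b = 0  # predictor A correct, predictor B wrong
    c = 0  # predictor A wrong, predictor B correct
    for truth, a, p in zip(y_true, pred_a, pred_b):
        a_ok = a == truth
        b_ok = p == truth
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1

    if b + c == 0:
        # No discordant pairs: no evidence of a difference.
        return 0.0, 1.0

    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data (hypothetical): 1 = coding sequence, 0 = intergenic region.
labels = [1] * 40
tool_a = [1] * 40              # correct on every sequence
tool_b = [0] * 30 + [1] * 10   # wrong on 30 of the 40 sequences
statistic, p = mcnemar(labels, tool_a, tool_b)
significant_at_99 = p < 0.01
```

Under this reading, a difference is reported as significant at the 99% level when the p-value falls below 0.01.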

Highlights

  • Microbes perform a fundamental economic, social, and environmental role in our society

  • Although gene annotation has grown in recent years, countless genes remain unannotated, so predictions based solely on known reference genomes are quite limited and will not always be sufficient to describe the main role of these microorganisms

  • We propose geneRFinder, an ab initio gene prediction tool capable of identifying coding sequences (CDS) and intergenic regions in sequences with distinct metagenomic complexities

Summary

Results

Benchmark data: The impact of metagenomic sample complexity on gene prediction was not fully explored by prediction tools until now. Each tool considers different datasets for its analysis. These previous predictors produced similar results. The databases used to evaluate two well-known gene prediction tools, FragGeneScan and Prodigal, for example, share less than 25% of their organisms (Fig. 1). When evaluating predictor performance on sequences from a low-complexity metagenome (test2low; Fig. 7), geneRFinder obtained the best accuracy, with a percentage variation of 53% compared to Prodigal and 63% against FragGeneScan. When analyzing the balance between sensitivity and specificity represented by the AUC, geneRFinder achieved better performance in the 4 datasets. This measure, expressed as a percentage by the area under the ROC curve, was at least 24 percentage points higher than in other tools. FragGeneScan, for example, has the best average sensitivity rate. This means that, of all sequences that corresponded to CDS, this tool classified approximately 99% correctly on all datasets. geneRFinder achieved superior specificity in addition to equivalent sensitivity, as can be seen in Fig. 9, demonstrating its better performance.
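
The summary mixes two kinds of comparison, "percentage points" and "percentage variation", alongside sensitivity and specificity. The sketch below shows one common way to compute each. The function names are hypothetical, labels are assumed to be 1 for CDS and 0 for intergenic regions, and "percentage variation" is assumed to mean relative change with respect to the other tool; the summary does not state the exact formula used.

```python
def rates(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = CDS)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        # Share of true CDS sequences that were predicted as CDS.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Share of true intergenic sequences predicted as intergenic.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


def percentage_points(rate_a, rate_b):
    """Absolute gap between two rates given as fractions in [0, 1]."""
    return (rate_a - rate_b) * 100


def percent_variation(rate_a, rate_b):
    """Relative change of rate_a with respect to rate_b (assumed formula)."""
    return (rate_a - rate_b) / rate_b * 100


# Toy example (hypothetical data, not taken from the benchmark).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 1]
toy_rates = rates(truth, predicted)
```

The distinction matters when reading the figures: two tools with rates of 0.9 and 0.6 differ by 30 percentage points but by 50% in relative terms.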

