Correcting the Estimation of Viral Taxa Distributions in Next-Generation Sequencing Data after Applying Artificial Neural Networks.

Moritz Kohls, Jessica Krepel, Klaus Jung, Pamela Liebig, Magdalena Kircher

doi:10.3390/genes12111755

Open Access

https://doi.org/10.3390/genes12111755

Abstract

Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step in comparative metagenomics. Mapping sequencing reads against a database of known viral reference genomes, however, fails to classify reads from novel viruses whose reference sequences are not yet available in public databases. As an alternative to mapping, and in order to classify sequencing reads at least to a taxonomic level, the performance of artificial neural networks and other machine learning models was studied. Taxonomic and genomic data from the NCBI database were used to sample labelled sequencing reads as training data. The fitted neural network was applied to classify unlabelled reads of simulated and real-world test sets. Additional auxiliary test sets of labelled reads were used to estimate the conditional class probabilities and to correct the prior estimation of the taxonomic distribution in the actual test set. Among the taxonomic levels, the biological order of viruses provided the most comprehensive database for generating training data. The prediction accuracy of the artificial neural network in classifying test reads to their viral order was considerably higher than that of a random classification. Posterior estimation of taxa frequencies could correct the primary classification results.
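The correction step described in the abstract can be illustrated with a small sketch. This excerpt does not give the authors' exact procedure, so the code below is only one plausible reading: it estimates the conditional probabilities P(predicted order | true order) from a labelled auxiliary set, then uses a simple expectation-maximisation update to find the order frequencies that best explain the predicted labels in the actual test set. The function names, the EM formulation, and the toy counts are all assumptions made for illustration; only the viral order names are real.

```python
def estimate_conditionals(true_labels, predicted_labels, classes):
    """Row-normalised confusion matrix from a labelled auxiliary set.

    cond[t][p] approximates P(predicted = p | true = t).
    """
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    cond = {}
    for t in classes:
        total = sum(counts[t].values())
        if total == 0:
            raise ValueError(f"no auxiliary reads for class {t!r}")
        cond[t] = {p: counts[t][p] / total for p in classes}
    return cond


def correct_distribution(predicted_labels, cond, classes, n_iter=500):
    """EM-style estimate of the true class frequencies (an assumed method)."""
    n = len(predicted_labels)
    observed = {p: predicted_labels.count(p) / n for p in classes}
    # Start from a uniform guess so that no class is locked at zero.
    freq = {t: 1.0 / len(classes) for t in classes}
    for _ in range(n_iter):
        updated = {t: 0.0 for t in classes}
        for p in classes:
            denom = sum(freq[k] * cond[k][p] for k in classes)
            if observed[p] == 0 or denom == 0:
                continue
            for t in classes:
                # Share of reads predicted as p that is attributed to true class t.
                updated[t] += observed[p] * freq[t] * cond[t][p] / denom
        total = sum(updated.values())
        if total == 0:
            raise ValueError("conditional probabilities cannot explain the predictions")
        freq = {t: v / total for t, v in updated.items()}
    return freq


# Toy data (hypothetical counts, not taken from the paper).
classes = ["Mononegavirales", "Picornavirales"]
aux_true = ["Mononegavirales"] * 10 + ["Picornavirales"] * 10
aux_pred = (["Mononegavirales"] * 8 + ["Picornavirales"] * 2
            + ["Mononegavirales"] * 1 + ["Picornavirales"] * 9)
cond = estimate_conditionals(aux_true, aux_pred, classes)

# Classifier output on the actual (unlabelled) test set: 31 vs. 69 reads.
test_pred = ["Mononegavirales"] * 31 + ["Picornavirales"] * 69
naive = {c: test_pred.count(c) / len(test_pred) for c in classes}
corrected = correct_distribution(test_pred, cond, classes)
print("naive:", naive)
print("corrected:", corrected)
```

In this toy setting the naive frequencies (0.31 and 0.69) are biased by misclassification, and the corrected estimate moves towards 0.30 and 0.70, the mixture that reproduces the observed predictions under the estimated conditional probabilities. An iterative update is used here rather than a direct matrix inversion because it keeps the estimates non-negative; that is a design choice of this sketch, not a claim about the paper.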

Highlights

Next-generation sequencing (NGS) is regularly used to identify viral sequences in the biological sample of an infected host in order to relate the presence of a virus to disease symptoms of the host [1,2,3].
The taxonomy database containing the data needed for taxonomic classification was downloaded from the FTP server.
We found that the artificial neural network (ANN) models and a support vector machine (SVM) with a polynomial kernel performed clearly better than a mapping approach, which cannot cope with unknown viruses, and better than linear discriminant analysis (LDA) and SVMs with other kernels.

Summary

Introduction

Next-generation sequencing (NGS) is regularly used to identify viral sequences in the biological sample of an infected host in order to relate the presence of a virus to disease symptoms of the host [1,2,3]. Most computational pipelines for detecting viruses or determining the taxonomic composition map sequencing reads or assembled contigs against viral reference sequences available in public or in-house curated databases [6,7,8,9,10,11,12,13]. Although these mapping approaches have proven successful in a large number of examples, they mostly fail to classify reads from newly emerging viruses whose sequences have not yet been deposited in a database. Despite improvements to the metagenomic data analysis algorithm, Kraken-2 [17] still shows low sensitivity on novel viruses.

Journal: Genes
Publication Date: Oct 31, 2021
License type: CC BY 4.0
