Exploring the Consistency of the Quality Scores with Machine Learning for Next-Generation Sequencing Experiments.

Erdal Cosgun, Min Oh

doi:10.1155/2020/8531502

Abstract

Background. Next-generation sequencing (NGS) enables massively parallel processing at a lower cost than other sequencing technologies. In the subsequent analysis of NGS data, one of the major concerns is the reliability of variant calls. Although researchers can use the raw quality scores produced by variant calling, they are forced to begin further analysis without any pre-evaluation of those scores.

Method. We present a machine learning approach for estimating the quality scores of variant calls derived from BWA+GATK. We analyzed correlations between the quality score and the annotations produced by the GATK Annotation Modules, identifying informative annotations that were then used as features to predict variant quality scores. To test the predictive models, we simulated paired-end Illumina sequencing reads for 24 human genomes at 30x coverage. In addition, 24 human genome read sets from Illumina paired-end sequencing with at least 30x coverage were obtained from the Sequence Read Archive.

Results. Using BWA+GATK, VCFs were derived from the simulated and real sequencing reads. The prediction models learned by RFR outperformed the other algorithms on both simulated and real data. The quality scores of variant calls were highly predictable from informative features of the GATK Annotation Modules in the simulated human genome VCF data (R2: 96.7%, 94.4%, and 89.8% for RFR, MLR, and NNR, respectively). The robustness of the proposed data-driven models was maintained in the real human genome VCF data (R2: 97.8% and 96.5% for RFR and MLR, respectively).

Highlights

With the development of next-generation sequencing (NGS) technology, progress has been made in the field of bioinformatics.
We simulated next-generation sequencing reads based on synthetic human genomes using ART, which was used as a primary tool for the simulation study of the 1000 Genomes Project [18].
We built data-driven predictive models for estimating the quality scores of variant calls in Variant Call Format (VCF) data derived from 24 simulated human genome read sets and 24 real human genome read sets using supervised machine learning techniques.
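The predictive models described above take their target (the QUAL column) and their features (the INFO annotations) from VCF records. The following is a minimal sketch of that extraction step, not the authors' code: the function name `parse_vcf_record` and the example record are hypothetical, and the parser only assumes the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

```python
def parse_vcf_record(line):
    """Split one VCF data line into its QUAL score and numeric INFO annotations.

    Illustrative helper (hypothetical name); assumes the standard column order
    CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, qual, info = fields[0], int(fields[1]), fields[5], fields[7]

    features = {}
    for entry in info.split(";"):
        if "=" not in entry:
            continue  # flag-style INFO keys carry no value
        key, value = entry.split("=", 1)
        try:
            features[key] = float(value)
        except ValueError:
            pass  # skip non-numeric or multi-valued annotations

    return {
        "chrom": chrom,
        "pos": pos,
        "qual": None if qual == "." else float(qual),  # "." marks a missing QUAL
        "features": features,
    }


# Made-up record for illustration only; the values are not from the paper.
example = "chr1\t12345\t.\tA\tG\t812.77\tPASS\tAC=1;AF=0.5;DP=34;FS=1.2;MQ=60.0;QD=23.9;DB"
record = parse_vcf_record(example)
print(record["qual"], sorted(record["features"]))
```

Each parsed record yields one training row: `record["features"]` supplies candidate predictors and `record["qual"]` the value to be predicted. Which annotations are actually informative is a separate question that the correlation analysis addresses.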

Summary

Introduction

With the development of next-generation sequencing (NGS) technology, progress has been made in the field of bioinformatics. The most important stage of this process is alignment and variant calling, known as secondary analysis. One of the most important outputs of this pipeline is the Variant Call Format (VCF) file. To test the predictive models, we simulated paired-end Illumina sequencing reads for 24 human genomes at 30x coverage. Using BWA+GATK, VCFs were derived from the simulated and real sequencing reads. The prediction models learned by RFR outperformed the other algorithms on both simulated and real data. The quality scores of variant calls were highly predictable from informative features of the GATK Annotation Modules in the simulated human genome VCF data (R2: 96.7%, 94.4%, and 89.8% for RFR, MLR, and NNR, respectively). The robustness of the proposed data-driven models was maintained in the real human genome VCF data (R2: 97.8% and 96.5% for RFR and MLR, respectively).
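The two quantities behind these results, the correlation between an annotation and the quality score and the R2 of a model's predictions, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function names `pearson` and `r_squared` and all numbers in the toy data are hypothetical. Note that the R2 here is a fraction, whereas the paper reports it as a percentage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between one annotation and the quality score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only (not from the paper).
qual = [100.0, 200.0, 300.0, 400.0]        # observed quality scores
depth = [10.0, 20.0, 30.0, 40.0]           # one candidate annotation
predicted = [110.0, 190.0, 310.0, 390.0]   # output of some fitted regressor

corr = pearson(depth, qual)
score = r_squared(qual, predicted)
print(f"correlation={corr:.3f}  R2={score:.3f}")
```

An annotation with a correlation near zero is a weak candidate feature, while an R2 close to 1 means the model explains most of the variance in the quality score.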



Journal: BioMed Research International
Publication Date: Feb 25, 2020
Citations: 8
License type: CC BY 4.0


