A hybrid correcting method considering heterozygous variations by a comprehensive probabilistic model

Jiaqi Liu, Xiaoyan Zhu, Xin Lai, Juan Wang, Xuanping Zhang, Zhimin Li, Zhongmeng Zhao, Jiayin Wang, Daocheng Dai, Xiao Xiao

doi:10.1186/s12864-020-07008-9

Abstract

Background: The emergence of third-generation sequencing technology, featuring longer read lengths, represents a great advance over next-generation sequencing technology and has greatly promoted biological research. However, third-generation sequencing data have a high sequencing error rate, which inevitably affects downstream analysis. Although sequencing error rates have improved in recent years, large amounts of data were produced at high error rates, and discarding them would be a huge waste. Error correction for third-generation sequencing data is therefore especially important. Existing error correction methods perform poorly at heterozygous sites, which are ubiquitous in diploid and polyploid organisms. There is thus a lack of error correction algorithms for heterozygous loci, especially at low coverage.

Results: In this article, we propose an error correction method named QIHC. QIHC is a hybrid correction method, which requires both next-generation and third-generation sequencing data. QIHC greatly enhances the sensitivity of distinguishing heterozygous sites from sequencing errors, which leads to high accuracy in error correction. To achieve this, QIHC establishes a set of probabilistic models based on a Bayesian classifier to estimate the heterozygosity of a site, and makes a judgment by calculating the posterior probabilities. The proposed method consists of three modules, which respectively generate a pseudo reference sequence, obtain the read alignments, and estimate the heterozygosity of the sites and correct the reads harboring them. The last module is the core module of QIHC and is designed to handle the calculations for the multiple cases that arise at a heterozygous site. The other two modules enable the reads to be mapped to the pseudo reference sequence, which to some extent overcomes the inefficiency of the multiple mappings adopted by existing error correction methods.

Conclusions: To verify the performance of our method, we selected Canu and Jabba for comparison with QIHC in several aspects. Because QIHC is a hybrid correction method, we first conducted a group of experiments under different coverages of the next-generation sequencing data, in which QIHC is far ahead of Jabba in accuracy. We then varied the coverage of the third-generation sequencing data and compared the performance of Canu, Jabba and QIHC again. QIHC outperforms the other two methods in the accuracy of both correcting sequencing errors and identifying heterozygous sites, especially at low coverage. We also carried out a comparison between Canu and QIHC at different error rates of the third-generation sequencing data, where QIHC still performs better. Therefore, QIHC is superior to the existing error correction methods when heterozygous sites exist.
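The posterior-probability judgment described in the Results can be illustrated with a minimal sketch. This is not QIHC's actual model, which the abstract does not specify. It assumes a simple two-hypothesis binomial setting (homozygous site with sequencing errors versus heterozygous site), and the function names, the default error rate, and the prior are all hypothetical values chosen for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_het(alt_count, depth, error_rate=0.15, prior_het=0.001):
    """Posterior probability that a site is heterozygous (illustrative only).

    alt_count: reads at the site showing the alternative base
    depth: total reads covering the site
    error_rate, prior_het: assumed values, not taken from the paper
    """
    # Homozygous site: the alternative base appears only through an error,
    # and an error lands on one specific base with probability error_rate / 3.
    p_alt_hom = error_rate / 3
    # Heterozygous site: half of the reads come from the alternative haplotype
    # and are read correctly; the other half show it only through an error.
    p_alt_het = 0.5 * (1 - error_rate) + 0.5 * (error_rate / 3)

    like_hom = binom_pmf(alt_count, depth, p_alt_hom)
    like_het = binom_pmf(alt_count, depth, p_alt_het)

    # Bayes' rule over the two hypotheses.
    evidence = like_het * prior_het + like_hom * (1 - prior_het)
    return like_het * prior_het / evidence


def classify_site(alt_count, depth, threshold=0.5, **kwargs):
    """Label a site by comparing the posterior with a hypothetical threshold."""
    if posterior_het(alt_count, depth, **kwargs) > threshold:
        return "heterozygous"
    return "sequencing error"


if __name__ == "__main__":
    for alt, depth in [(0, 10), (1, 10), (5, 10)]:
        print(alt, depth, classify_site(alt, depth))
```

Under these assumptions, a site where about half of the covering reads carry the alternative base is labeled heterozygous, whereas an isolated mismatch is attributed to sequencing error. A classifier for real data would need additional cases, for example insertions and deletions or more than two alleles.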

Highlights

The emergence of third-generation sequencing technology, featuring longer read lengths, represents a great advance over next-generation sequencing technology and has greatly promoted biological research
Experimental protocol: let L denote a set of third-generation sequencing (TGS) reads and S denote a set of next-generation sequencing (NGS) data
Existing error correction methods have fairly complete strategies for correcting normal sites, but they often do not take heterozygous variant positions into account, an aspect that cannot be ignored

Summary

Introduction

The emergence of third-generation sequencing (TGS) technology, featuring longer read lengths, represents a great advance over next-generation sequencing (NGS) technology and has greatly promoted biological research. Existing error correction methods perform poorly at heterozygous sites, which are ubiquitous in diploid and polyploid organisms. There is a lack of error correction algorithms for heterozygous loci, especially at low coverage. TGS technology inherits the high throughput of NGS and produces longer reads, with lengths greater than 10 kbp, whereas NGS reads are generally limited to 100 bp [1,2,3,4,5,6,7,8]. It is considered that sequencing errors can be corrected by algorithmic methods.

Methods

Results

Conclusion

Journal: BMC Genomics
Publication Date: Nov 1, 2020
Citations: 2
License type: open-access

Similar Papers

Estimating DNA polymorphism from next generation sequencing data with high error rate by dual sequencing applications
Ziwen He ... Suhua Shi
BMC Genomics | VOL. 14
07 Aug 2013

Defind: Detecting Genomic Deletions by Integrating Read Depth, GC Content, Mapping Quality and Paired-end Mapping Signatures of Next Generation Sequencing Data
Xin Wang ... Xiaojing Liu
Current Bioinformatics | VOL. 14
07 Jan 2019

HLAreporter: a tool for HLA typing from next generation sequencing data.
Yazhi Huang ... Jing Yang
Genome Medicine | VOL. 7
16 Mar 2015

A random forest classifier for detecting rare variants in NGS data from viral populations
Raunaq Malhotra ... Raj Acharya
Computational and Structural Biotechnology Journal | VOL. 15
01 Jan 2017
