Abstract

The utility of next generation sequencing (NGS) technology rests on the assumption that the DNA or cDNA cleavage required to generate short sequence reads is random. Several previous reports suggest that NGS reads are subject to sequencing bias. To examine this question in greater detail, we analyze NGS data from four organisms with different GC content: Plasmodium falciparum (19.39%), Arabidopsis thaliana (36.03%), Homo sapiens (40.91%) and Streptomyces coelicolor (72.00%). Using machine learning techniques, we find that NGS read starts tend to be positioned in local regions whose nucleotide distribution differs from the global nucleotide distribution. We also demonstrate that the mono-nucleotide distribution underestimates sequencing bias, and that the recognized pattern is explained largely by the distribution of multi-nucleotides (di-, tri-, and tetra-nucleotides) rather than mono-nucleotides. This implies that correction of sequencing bias needs to be based on the multi-nucleotide distribution. We provide companion software to quantify the effect of the recognized pattern on read positioning, and use it to show that bias correction based on the mono-nucleotide distribution may not be sufficient to remove sequencing bias.
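To illustrate why a mono-nucleotide comparison can miss a difference that a di-nucleotide comparison reveals, the following minimal sketch compares the k-mer distribution in a window around a read start with the distribution over the whole sequence. The function names, the window size `flank`, the toy sequence, and the use of total variation distance are all assumptions made for illustration; the abstract does not state which dissimilarity measure the companion software uses.

```python
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)


def local_vs_global(genome, read_start, k, flank=10):
    """Distance between the k-mer distribution near read_start and the global one.

    The window [read_start - flank, read_start + flank) is a hypothetical
    choice, clipped to the sequence boundaries.
    """
    lo = max(0, read_start - flank)
    hi = min(len(genome), read_start + flank)
    return total_variation(kmer_freqs(genome[lo:hi], k), kmer_freqs(genome, k))


# Toy sequence: every part is half A and half C, but the middle part
# alternates strictly (ACAC...) while the flanks repeat AACC.
genome = "AACC" * 20 + "AC" * 10 + "AACC" * 20
mono = local_vs_global(genome, read_start=90, k=1)  # mono-nucleotide view
di = local_vs_global(genome, read_start=90, k=2)    # di-nucleotide view
print(f"mono-nucleotide distance: {mono:.3f}")
print(f"di-nucleotide distance:   {di:.3f}")
```

In this toy case the local and global mono-nucleotide compositions are identical, so the mono-nucleotide distance is zero, while the di-nucleotide distance is clearly positive because the window contains only AC and CA.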

Highlights

  • Next generation sequencing (NGS) is the most popular high-throughput sequencing technology in biological and medical research

  • The existence of a pattern in read positioning was examined using C4.5 decision tree and Bayesian Network (BN) classifiers

  • Each classifier was trained on a training set and used to predict the class labels of the corresponding test set; the percentage of test instances classified correctly was reported as the classification accuracy
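The accuracy measure described in the last highlight can be written down directly. The sketch below is only an illustration: the function name and the example labels are hypothetical, and it does not reproduce the C4.5 or BN classifiers themselves.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Percentage of test instances whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("test set must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical test set of four instances; three are predicted correctly.
true_labels = ["read_start", "background", "read_start", "background"]
predicted = ["read_start", "background", "background", "background"]
print(classification_accuracy(true_labels, predicted))  # 75.0
```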

Summary

Introduction

NGS is the most popular high-throughput sequencing technology in biological and medical research. DNA or cDNA fragment reads are mapped to a reference genome, and read enrichment is measured in experiments such as genome sequencing and transcriptome profiling [1,2]. The technology assumes that DNA or cDNA cleavage is random, so that the read start position is independent of the genomic sequence [3]. This assumption allows the number of reads mapping to a given region of the genome to be used as a quantitative measurement. A number of different techniques are used to reduce the size of DNA or cDNA molecules so that they can be sequenced as short reads. However, the pattern of DNA shearing is biased in a way that depends on the DNA sequence context, on the shearing method, or on the combination of the two, and this bias may alter the quantification of results generated by NGS.
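The role of the randomness assumption in quantification can be made concrete with a small sketch. It assumes the simplest possible setting, namely uniformly sampled DNA in which read starts are equally likely at every position, so that the expected read count of a region is proportional to its length; the region names, lengths, and counts are hypothetical. An observed-to-expected ratio far from 1 then indicates either a genuine difference in abundance or a positional bias of the kind discussed above.

```python
def expected_read_counts(region_lengths, total_reads):
    """Expected reads per region if read starts are uniform over all positions."""
    total_length = sum(region_lengths.values())
    return {
        name: total_reads * length / total_length
        for name, length in region_lengths.items()
    }


def observed_to_expected(observed_counts, region_lengths):
    """Ratio of observed to expected read counts for each region."""
    total_reads = sum(observed_counts.values())
    expected = expected_read_counts(region_lengths, total_reads)
    return {name: observed_counts[name] / expected[name] for name in region_lengths}


# Hypothetical regions: region_b is three times longer than region_a,
# yet both receive the same number of reads.
region_lengths = {"region_a": 1000, "region_b": 3000}
observed_counts = {"region_a": 50, "region_b": 50}
ratios = observed_to_expected(observed_counts, region_lengths)
print(ratios)
```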
