Abstract

Background: A number of simulators have been developed for emulating next-generation sequencing data by incorporating known errors such as base substitutions and indels. However, their practicality may be degraded by functional and runtime limitations. In particular, positional and genomic contextual information is not effectively used to characterize base substitution patterns, and the positional and contextual differences in Phred quality scores have not been fully investigated. A more effective and efficient bioinformatics tool is therefore needed.

Results: Here, we introduce a novel tool, SimuSCoP, to reliably emulate complex DNA sequencing data. The base substitution patterns and the statistical behavior of quality scores in Illumina sequencing data are explored and integrated into the simulation model to emulate datasets for different applications. In addition, SimuSCoP employs an integrated, easy-to-use pipeline to facilitate end-to-end simulation of complex samples, and achieves high runtime efficiency through multithreading with low memory consumption. These features give SimuSCoP substantial improvements in reliability, functionality, practicality and runtime efficiency. The tool is evaluated in multiple aspects, including consistency of profiles and simulation of genomic variations and complex tumor samples, and the results demonstrate the advantages of SimuSCoP over existing tools.

Conclusions: SimuSCoP, a new bioinformatics tool, has been developed to learn informative profiles from real sequencing data and to reliably mimic complex data by introducing various genomic variations. We believe that the presented work will catalyse new development of downstream bioinformatics methods for analyzing sequencing data.

Highlights

  • Real sequencing data: To investigate the profiles of samples generated from different sequencing platforms, the FASTQ files of 8 samples (Table S1 in Additional file 1) are downloaded from the Sequence Read Archive (SRA) of NCBI using the SRA Toolkit.

  • The reads are aligned to the hg19 human reference genome using the BWA tool [29], and germline SNPs are then inferred from the BAM files using GATK HaplotypeCaller with default parameters.

Summary

Introduction

A number of simulators have been developed for emulating next-generation sequencing (NGS) data by incorporating known errors such as base substitutions and indels. However, their practicality may be degraded by functional and runtime limitations. Existing studies demonstrate that Illumina sequencing platforms exhibit specific patterns of substitution error and specific distributions of quality scores [5, 6]. Investigating these statistical differences in NGS reads is essential for obtaining knowledge that can be used to improve read alignment quality and to emulate reliable sequencing data.
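To make the idea of position-dependent quality scores concrete, the following is a minimal sketch of how a per-position mean Phred quality profile could be computed from FASTQ quality strings. It is not SimuSCoP's implementation. The function names, the example reads, and the Phred+33 ASCII offset are assumptions introduced here for illustration. The conversion between a Phred score Q and an error probability p uses the standard definition Q = -10 log10(p).

```python
from typing import List

# Assumption: qualities are encoded as Phred+33 ASCII (Sanger / Illumina 1.8+).
PHRED_OFFSET = 33


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(c) - offset for c in qual]


def mean_quality_by_position(quals: List[str]) -> List[float]:
    """Average the Phred score at each read position across all reads.

    Reads may differ in length; each position is averaged only over
    the reads that are long enough to cover it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for qual in quals:
        for i, q in enumerate(decode_quality(qual)):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # Hypothetical quality strings from three reads of different lengths.
    reads = ["IIII5", "II5+", "I5+"]
    for pos, mean_q in enumerate(mean_quality_by_position(reads)):
        print(pos, round(mean_q, 2), phred_to_error_prob(round(mean_q)))
```

A profile like this captures only the positional component. A context-aware model would also condition the tally on the neighbouring reference bases, which this sketch does not attempt.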
