Abstract
Background
High-throughput next-generation sequencing (NGS) technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms, whereas in silico samples provide exact truth. To provide the most value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible.
Results
FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim can convert target sequences into in silico reads with specific error profiles obtained in the characterization step.
Conclusions
FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQSim tool hold several advantages over natural datasets: they are sequencing platform-independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge datasets for detecting genetic engineering toolmarks.
Electronic supplementary material
The online version of this article (doi:10.1186/1756-0500-7-533) contains supplementary material, which is available to authorized users.
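The characterization step described in the Results can be illustrated with a minimal Python sketch. This is not FASTQSim's own code: the function name `characterize_fastq`, the four-line record layout, and the Phred+33 quality encoding are assumptions made for illustration. It computes only two of the statistics named above, the read-length distribution and the mean quality score at each read position.

```python
from collections import Counter
import io


def characterize_fastq(handle, phred_offset=33):
    """Return (read-length histogram, per-position mean quality) for a FASTQ stream.

    Assumes four-line FASTQ records and Phred+33 quality encoding (hypothetical
    defaults for this sketch, not taken from FASTQSim).
    """
    length_counts = Counter()
    qual_sums = []    # running sum of quality scores at each read position
    qual_counts = []  # number of reads that reach each position
    while True:
        header = handle.readline()
        if not header:
            break  # end of stream
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record")
        length_counts[len(seq)] += 1
        for i, ch in enumerate(qual):
            if i == len(qual_sums):
                qual_sums.append(0)
                qual_counts.append(0)
            qual_sums[i] += ord(ch) - phred_offset
            qual_counts[i] += 1
    mean_quality = [s / n for s, n in zip(qual_sums, qual_counts)]
    return length_counts, mean_quality


# Two toy reads: 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nII5\n")
lengths, mean_q = characterize_fastq(example)
```

For the two toy reads, `lengths` records one read of length 4 and one of length 3, and `mean_q` holds the average score at each of the four positions. Indel and substitution rates would additionally require aligning reads to a reference, which this sketch does not attempt.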
Highlights
High-throughput next-generation sequencing technologies have enabled rapid characterization of clinical and environmental samples
The dataset was spiked with E. coli str
Gigabases of data can be generated in a few hours, demanding rapid and accurate analysis algorithms and software
Summary
High-throughput next-generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. With the decreasing cost of sequencing technology, sample processing and bioinformatics analysis now pose the largest bottleneck to actionable data for critical medical and defense applications [1], creating a need for accurate and rapid algorithms to process genetic data. Characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. To provide the most value for that evaluation, in silico data should mimic actual sequencer data as closely as possible. The ability to generate customized read length, quality, and error distribution profiles enables a platform-independent approach to in silico data simulation. The indel and single point mutation rate profiles refer to instrument-specific values.
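To make the idea of simulating reads from an error profile concrete, the following Python sketch draws one read from a reference sequence and applies per-base substitution, insertion, and deletion rates. It is an illustration under simplifying assumptions, not FASTQSim's algorithm: the function `simulate_read` and its parameters are hypothetical, the rates are flat rather than position-dependent, and every indel is a single base, whereas the tool described here also models indel size distributions.

```python
import random

BASES = "ACGT"


def simulate_read(reference, start, length, sub_rate, ins_rate, del_rate, rng):
    """Copy reference[start:start+length], applying per-base errors.

    Hypothetical helper for illustration; rates are flat probabilities per
    reference base and all indels are one base long.
    """
    read = []
    for base in reference[start:start + length]:
        roll = rng.random()  # uniform in [0, 1)
        if roll < del_rate:
            continue  # deletion: the reference base is dropped
        if roll < del_rate + ins_rate:
            read.append(rng.choice(BASES))  # insertion before the reference base
            read.append(base)
        elif roll < del_rate + ins_rate + sub_rate:
            # substitution: pick any base other than the original
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            read.append(base)  # error-free copy
    return "".join(read)


rng = random.Random(0)  # fixed seed so runs are repeatable
ref = "ACGTACGTAC"
clean = simulate_read(ref, 2, 5, 0.0, 0.0, 0.0, rng)
noisy = simulate_read(ref, 0, 10, 0.05, 0.01, 0.01, rng)
```

With all rates set to zero the read is an exact copy of the reference slice. In practice, the rates would come from the instrument-specific values measured during characterization.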