Read Datasets Research Articles

Because a vast majority (99%) of microbes in a given community is likely to be non-cultivable, metagenomics has gradually entered the mainstream of microbial research methods. With the development of high-throughput sequencing techniques, an increasing number of sequencing read data sets of metagenomes from various microbial communities have become available. For these data sets, metagenomic analysis based on mapping reads to microbial genomes has been hampered by the limited number of microbial genomes that are available. Further, this type of analysis is computationally intensive. Thus alignment-free methods, which characterize the sequencing reads with a genomic signature instead of with genomic alignments, can be applied. However, the main requirement of these alignment-free methods is a stable genomic signature that performs reliably.

Here, we propose a novel genomic signature of microbial genomes called the intrinsic correlation of oligonucleotides (ICOs). This signature represents the quantification of an intrinsic relationship between any two oligonucleotides. We analyzed microbial genomes at different taxonomic levels using ICO profiles and confirmed the wide availability of useful ICOs. We used intra-genomic and inter-genomic distances and relational grades to evaluate the performance of ICOs as a genomic signature. The results of these experiments showed that ICOs can characterize microbial genomes well, and ICOs were better at distinguishing species than tetranucleotide composition, not only in terms of whole genomes but also in terms of sequence fragments. In addition, we evaluated the performance of a hybrid feature that combined ICOs and tetranucleotide composition. The experimental results showed that the hybrid feature performed better than ICOs or tetranucleotide composition alone.

ICOs can characterize microbial genomes successfully and are capable of distinguishing organisms at different taxonomic levels. ICOs perform better than tetranucleotide composition in characterizing microbial genomes. The hybrid feature that used a combination of the two kinds of sequence features had advantages over a single sequence feature.
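
The abstract compares ICOs against tetranucleotide composition, the baseline alignment-free signature. As a rough illustration of that baseline only, the following sketch computes tetranucleotide frequencies and a distance between two profiles. The abstract does not give the ICO formula, so ICOs are not implemented here; the function names and the choice of Euclidean distance are assumptions made for this example, not the authors' method.

```python
from collections import Counter
from itertools import product


def tetranucleotide_composition(seq, k=4):
    """Relative frequency of every k-mer over A/C/G/T (k=4 gives 256 values).

    Windows containing other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def euclidean_distance(p, q):
    """Distance between two profiles that share the same k-mer keys.

    Euclidean distance is an assumption for this sketch; the abstract
    does not state which distance measure the authors used.
    """
    return sum((p[m] - q[m]) ** 2 for m in p) ** 0.5


if __name__ == "__main__":
    a = tetranucleotide_composition("ACGTACGTACGTTTGACA")
    b = tetranucleotide_composition("GGGCCCGGGCCCATATAT")
    print(round(euclidean_distance(a, b), 4))
```

Comparing the distance between fragments of one genome with the distance between fragments of different genomes is one way to probe the intra-genomic versus inter-genomic separation that the abstract describes.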

Background

High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms, whereas in silico samples provide exact truth. To create the best value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible.

Results

FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim can convert target sequences into in silico reads with specific error profiles obtained in the characterization step.

Conclusions

FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQSim tool hold several advantages over natural datasets: they are sequencing platform independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge datasets for detecting genetic engineering toolmarks.

Electronic supplementary material

The online version of this article (doi:10.1186/1756-0500-7-533) contains supplementary material, which is available to authorized users.
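
The characterization step described above (distributions of read length and quality scores) can be sketched in a few lines. This is a minimal illustration, not FASTQSim's implementation or interface: the function name `characterize_fastq` is hypothetical, and the code assumes four-line FASTQ records with Phred+33 quality encoding.

```python
import io
from collections import Counter


def characterize_fastq(handle, phred_offset=33):
    """Return (read-length histogram, mean base quality) for a FASTQ stream.

    Assumes strict four-line records and Phred+33 encoding; both are
    assumptions of this sketch, not facts about FASTQSim.
    """
    lengths = Counter()
    qual_sum = 0
    qual_n = 0
    while True:
        header = handle.readline()
        if not header:            # end of file
            break
        if not header.strip():    # tolerate blank lines between records
            continue
        seq = handle.readline().strip()
        handle.readline()         # '+' separator line
        qual = handle.readline().strip()
        lengths[len(seq)] += 1
        qual_sum += sum(ord(c) - phred_offset for c in qual)
        qual_n += len(qual)
    mean_quality = qual_sum / qual_n if qual_n else 0.0
    return lengths, mean_quality


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n")
    hist, mean_q = characterize_fastq(demo)
    print(dict(hist), round(mean_q, 2))
```

A simulator in the spirit of the abstract would sample from histograms like these when generating in silico reads; the indel and substitution profiles mentioned above would need an alignment against a reference and are outside this sketch.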

Articles published on Read Datasets

Important biological information uncovered in previously unaligned reads from chromatin immunoprecipitation experiments (ChIP-Seq).

Nonintrusive Load Monitoring: A Temporal Multilabel Classification Approach

A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads.

Identifying Long-Memory Trends in Pre-Seismic MHz Disturbances through Support Vector Machines

Reference-free detection of isolated SNPs.

A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer.

Separation of Text Components from Complex Colored Images

Portable Camera-Based Assistive Text and Product Label Reading From Hand-Held Objects for Blind Persons

Genomic characterization of large heterochromatic gaps in the human genome assembly.

AUTOMATIC TEXT EXTRACTION FROM COMPLEX COLORED IMAGES USING GAMMA CORRECTION METHOD

HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data

Intrinsic correlation of oligonucleotides: A novel genomic signature for metagenome analysis

Benchmarking of Methods for Genomic Taxonomy

Water Demand Pattern Classification from Smart Meter Data

FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets.

Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification.

An efficient and scalable graph modeling approach for capturing information at different levels in next generation sequencing reads.

Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes

Reading subskill differences between students in Shanghai-China and the US: evidence from PISA 2009

Using false discovery rates to benchmark SNP-callers in next-generation sequencing projects.
