K-mer Frequency Research Articles

Background: A rapidly increasing flow of genomic data requires the development of efficient methods for obtaining its compact representation. Feature extraction facilitates classification, clustering and model analysis for testing and refining biological hypotheses. The "shotgun" metagenome is an analytically challenging type of genomic data, containing sequences of all genes from the totality of a complex microbial community. Recently, researchers started to analyze metagenomes using reference-free methods based on the analysis of oligonucleotide (k-mer) frequency spectra, previously applied to isolated genomes. However, little is known about their correlation with the existing approaches for metagenomic feature extraction, or about the limits of their applicability. Here we evaluated a metagenomic pairwise dissimilarity measure based on short k-mer spectra using the example of human gut microbiota, a biomedically significant object of study.

Results: We developed a method for calculating pairwise dissimilarity (beta-diversity) of "shotgun" metagenomes based on short k-mer spectra (5≤k≤11). The method was validated on simulated metagenomes and further applied to a large collection of human gut metagenomes from the populations of the world (n=281). The k-mer spectrum-based measure was found to behave similarly to one based on mapping to a reference gene catalog, but differently from one using a genome catalog. This difference turned out to be associated with a significant presence of viral reads in a number of metagenomes. Simulations showed limited impact of bacterial genetic variability as well as sequencing errors on k-mer spectra. Specific differences between the datasets from individual populations were identified.

Conclusions: Our approach allows rapid estimation of pairwise dissimilarity between metagenomes. Though we applied this technique to gut microbiota, it should be useful for arbitrary metagenomes, even metagenomes with novel microbiota. A dissimilarity measure based on the k-mer spectrum provides a wider perspective than measures based on alignment against reference sequence sets. It helps to avoid missing outstanding features of metagenomic composition, particularly those related to the presence of unknown bacteria, viruses or eukaryotes, as well as technical artifacts (sample contamination, reads of non-biological origin, etc.) at the early stages of bioinformatic analysis. Our method is complementary to reference-based approaches and can be easily integrated into metagenomic analysis pipelines.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0875-7) contains supplementary material, which is available to authorized users.
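To make the idea of a k-mer spectrum-based pairwise dissimilarity concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not state which distance is applied to the spectra, so Bray-Curtis is used here purely as an assumption, and the function names (`kmer_spectrum`, `bray_curtis`, `dissimilarity_matrix`) are hypothetical. The sketch also does not merge reverse-complement k-mers, which a real pipeline might choose to do.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return relative k-mer frequencies over all reads of one sample.

    K-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    counts = Counter()
    alphabet = set("ACGT")
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= alphabet:
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two spectra (assumed measure)."""
    keys = set(p) | set(q)
    num = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    den = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return num / den if den else 0.0


def dissimilarity_matrix(samples, k):
    """Pairwise dissimilarity for a list of samples (each a list of reads)."""
    spectra = [kmer_spectrum(reads, k) for reads in samples]
    n = len(spectra)
    return [[bray_curtis(spectra[i], spectra[j]) for j in range(n)]
            for i in range(n)]
```

Calling `dissimilarity_matrix(samples, k)` with k in the 5 to 11 range mentioned in the abstract yields a symmetric matrix with zeros on the diagonal, which could then be passed to a clustering or ordination step.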

Background: Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence independent single primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases.

Results: Here we introduce NeatFreq, a software tool that reduces a data set to more uniform coverage by clustering and selecting from reads binned by their median k-mer frequency (RMKF) and uniqueness. Previous algorithms normalize read coverage based on RMKF, but do not include methods for the preferred selection of (1) extremely low coverage regions produced by extremely variable sequencing of random-primed products and (2) 2-sided paired-end sequences. The algorithm increases the incorporation of the most unique, lowest coverage segments of a genome using an error-corrected data set. NeatFreq was applied to bacterial, viral plaque, and single-cell sequencing data. The algorithm showed an increase in the rate at which the most unique reads in a genome were included in the assembled consensus while also reducing the count of duplicative and erroneous contigs (strings of high confidence overlaps) in the deliverable consensus. The results obtained from conventional Overlap-Layout-Consensus (OLC) were compared to simulated multi-de Bruijn graph assembly alternatives trained for variable coverage input using sequence before and after normalization of coverage. Coverage reduction was shown to increase processing speed and reduce memory requirements when using conventional bacterial assembly algorithms.

Conclusions: The normalization of deep coverage spikes, which would otherwise inhibit consensus resolution, enables High Throughput Sequencing (HTS) assembly projects to consistently run to completion with existing assembly software. The NeatFreq software package is free, open source and available at https://github.com/bioh4x/NeatFreq.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-014-0357-3) contains supplementary material, which is available to authorized users.
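The notion of binning reads by their median k-mer frequency (RMKF) can be illustrated with a short Python sketch. This is not NeatFreq's algorithm: the abstract does not describe how reads are selected from each bin, so the proportional down-sampling rule below, the `target` parameter, and all function names are assumptions made only for illustration. Error correction, the uniqueness criterion, and paired-end handling are left out.

```python
import math
from collections import Counter, defaultdict
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def read_median_kmer_frequency(read, counts, k):
    """Median data-set-wide frequency of the k-mers in one read (0 if too short)."""
    freqs = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
    return median(freqs) if freqs else 0


def bin_reads_by_rmkf(reads, k):
    """Group reads into bins keyed by their RMKF."""
    counts = count_kmers(reads, k)
    bins = defaultdict(list)
    for read in reads:
        bins[read_median_kmer_frequency(read, counts, k)].append(read)
    return dict(bins)


def select_reads(bins, target):
    """Assumed selection rule: keep every read from bins at or below the
    target frequency, and roughly a target/RMKF fraction of reads from
    bins above it (taking the first reads in input order)."""
    selected = []
    for rmkf, bin_reads in bins.items():
        if rmkf <= target:
            keep = len(bin_reads)
        else:
            keep = math.ceil(len(bin_reads) * target / rmkf)
        selected.extend(bin_reads[:keep])
    return selected
```

Under this assumed rule, reads from low-frequency bins are always retained, which mirrors the abstract's stated goal of preserving the lowest coverage segments while thinning out deep coverage spikes.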

Articles published on K-mer Frequency

Prediction of fine-tuned promoter activity from DNA sequence.

Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis.

HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing.

Reference-free inference of tumor phylogenies from single-cell sequencing data.

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.

Recovering full-length viral genomes from metagenomes.

Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae

Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data.

LAF: Logic Alignment Free and its application to bacterial genomes classification.

Probabilistic topic modeling for the analysis and classification of genomic sequences.

MBBC: an efficient approach for metagenomic binning based on clustering.

Alignment-free clustering of transcription factor binding motifs using a genetic-k-medoids approach.

Assembly of viral genomes from metagenomes.

Determining the quality and complexity of next-generation sequencing data without a reference genome.

NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly.

Ecological roles of dominant and rare prokaryotes in acid mine drainage revealed by metagenomics and metatranscriptomics.

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.

Enhanced regulatory sequence prediction using gapped k-mer features.

Metavir 2: new tools for viral metagenome comparison and assembled virome analysis

Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome
