K-mer Size Research Articles

Motivation: Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions.

Results: We propose a new rank-flexible machine learning-based compositional approach for taxonomic assignment of metagenomics reads and show that it benefits from increasing the number of fragments sampled from reference genomes to tune its parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning the method involves training machine learning models on about 10^8 samples in 10^7 dimensions, which is out of reach of standard software but can be done efficiently with modern implementations for large-scale machine learning. The resulting method is competitive in terms of accuracy with well-established alignment and composition-based tools for problems involving a small to moderate number of candidate species and for reasonable amounts of sequencing errors. We show, however, that machine learning-based compositional approaches are still limited in their ability to deal with problems involving a greater number of species, and are more sensitive to sequencing errors. We finally show that the new method outperforms the state-of-the-art in its ability to classify reads from species of lineage absent from the reference database and confirm that compositional approaches achieve faster prediction times, with a gain of 2–17 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise.

Availability and implementation: Data and code are available at http://cbio.ensmp.fr/largescalemetagenomics.

Contact: pierre.mahe@biomerieux.com

Supplementary information: Supplementary data are available at Bioinformatics online.
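As a rough illustration of the compositional representation described in this abstract, the sketch below counts the overlapping k-mers in a read and maps each k-mer to one coordinate of a 4^k-dimensional vector. It is not the authors' implementation: the function names `kmer_counts` and `kmer_index` are hypothetical, and skipping k-mers that contain ambiguous bases is an assumption made here for simplicity.

```python
from collections import Counter

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_counts(read: str, k: int) -> Counter:
    """Count every overlapping k-mer in a read.

    Assumption: k-mers containing characters other than A, C, G, T
    (for example N) are skipped rather than imputed.
    """
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if all(base in BASE_CODE for base in kmer):
            counts[kmer] += 1
    return counts


def kmer_index(kmer: str) -> int:
    """Map a k-mer to an integer in [0, 4**k) by reading it as a base-4 number."""
    index = 0
    for base in kmer:
        index = index * 4 + BASE_CODE[base]
    return index


# Toy read with k = 4; a real profile would use a larger k.
profile = kmer_counts("ACGTACGTAC", k=4)
sparse_vector = {kmer_index(kmer): n for kmer, n in profile.items()}
```

With k = 12 such a vector has 4^12 = 16,777,216 coordinates, which is on the order of 10^7, so a sparse representation like the dictionary above is the natural choice for a single read.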

Background: RNA-seq has shown huge potential for phylogenomic inferences in non-model organisms. However, error, incompleteness, and redundant assembled transcripts for each gene in de novo assembly of short reads cause noise in analyses and a large amount of missing data in the aligned matrix. To address these problems, we compare de novo assemblies of paired-end 90 bp RNA-seq reads using Oases, Trinity, Trans-ABySS and SOAPdenovo-Trans to transcripts from genome annotation of the model plant Ricinus communis. By doing so we evaluate strategies for optimizing total gene coverage and minimizing assembly chimeras and redundancy.

Results: We found that the frequency and structure of chimeras vary dramatically among different software packages. The differences were largely due to the number of trans-self chimeras that contain repeats in the opposite direction. More than half of the total chimeras in Oases and Trinity were trans-self chimeras. Within each package, we found a trade-off between maximizing reference coverage and minimizing redundancy and chimera rate. In order to reduce redundancy, we investigated three methods: 1) using cap3 and CD-HIT-EST to combine highly similar transcripts, 2) only retaining the transcript with the highest read coverage, or removing the transcript with the lowest read coverage for each subcomponent in Trinity, and 3) filtering Oases single k-mer assemblies by number of transcripts per locus and relative transcript length, and then finding the transcript with the highest read coverage. We then utilized results from blastx against model protein sequences to effectively remove trans chimeras. After optimization, seven assembly strategies among all four packages successfully assembled 42.9–47.1% of reference genes to more than 200 bp, with a chimera rate of 0.92–2.21%, and on average 1.8–3.1 transcripts per reference gene assembled.

Conclusions: With rapidly improving sequencing and assembly tools, our study provides a framework to benchmark and optimize performance before choosing tools or parameter combinations for analyzing short-read RNA-seq data. Our study demonstrates that choice of assembly package, k-mer sizes, post-assembly redundancy-reduction and chimera cleanup, and strand-specific RNA-seq library preparation and assembly dramatically improve gene coverage by non-redundant and non-chimeric transcripts that are optimized for downstream phylogenomic analyses.
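One of the redundancy-reduction strategies listed in this abstract, retaining only the transcript with the highest read coverage in each group (a Trinity subcomponent or an Oases locus), can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `keep_highest_coverage`, the `(group, transcript_id, coverage)` tuple layout, and the example identifiers are all hypothetical, and the other filters mentioned in the abstract (transcripts per locus, relative length, blastx-based chimera removal) are not modelled.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, float]  # (group id, transcript id, read coverage)


def keep_highest_coverage(records: Iterable[Record]) -> Dict[str, str]:
    """Return, for each group, the id of its highest-coverage transcript.

    Ties are resolved in favour of the transcript seen first.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for group, transcript_id, coverage in records:
        if group not in best or coverage > best[group][1]:
            best[group] = (transcript_id, coverage)
    return {group: transcript_id for group, (transcript_id, _) in best.items()}


# Hypothetical input: two transcripts in one group, one in another.
records = [
    ("comp1", "t1", 5.0),
    ("comp1", "t2", 9.5),
    ("comp2", "t3", 3.0),
]
kept = keep_highest_coverage(records)
```

Keeping a single representative per group trades some isoform information for a less redundant matrix, which matches the trade-off between reference coverage and redundancy that the abstract reports.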

Articles published on K-mer Size

Large-scale machine learning for metagenomics sequence classification

Random sequential adsorption of straight rigid rods on a simple cubic lattice

Optimizing Transcriptome Assemblies for Eleusine indica Leaf and Seedling by Combining Multiple Assemblies from Three De Novo Assemblers.

Advantages of distributed and parallel algorithms that leverage Cloud Computing platforms for large-scale genome assembly.

HyDA-Vista: towards optimal guided selection of k-mer size for sequence assembly

The complex task of choosing a de novo assembly: Lessons from fungal genomes

Employing whole genome mapping for optimal de novo assembly of bacterial genomes.

Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads

KAnalyze: a fast versatile pipelined K-mer toolkit

Combining Transcriptome Assemblies from Multiple De Novo Assemblers in the Allo-Tetraploid Plant Nicotiana benthamiana

Construction of a Public CHO Cell Line Transcript Database Using Versatile Bioinformatics Analysis Pipelines

Percolation of polyatomic species on a simple cubic lattice

Separating homeologs by phasing in the tetraploid wheat transcriptome

Fine de novo sequencing of a fungal genome using only SOLiD short read data: verification on Aspergillus oryzae RIB40.

Optimizing de novo assembly of short-read RNA-seq data for phylogenomics

Additive multiple k-mer transcriptome of the keelworm Pomatoceros lamarckii (Annelida; Serpulidae) reveals annelid trochophore transcription factor cassette

Isotropic-nematic phase diagram for interacting rigid rods on two-dimensional lattices

Cutoffs and k-mers: implications from a transcriptome study in allopolyploid plants

Evaluation of short read metagenomic assembly

MiR-192 Mediates TGF-β/Smad3-Driven Renal Fibrosis

