A framework for space-efficient read clustering in metagenomic samples

Jarno Alanko, Veli Mäkinen, Djamal Belazzougui, Fabio Cunial

doi:10.1186/s12859-017-1466-6

Abstract

Background: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed.

Results: We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length ℓ each, on an alphabet of total size σ, our algorithms take O(n(t + log σ)) time and just 2n + o(n) + O(max{ℓσ log n, K log m}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure.

Conclusions: Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the art.
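To make the role of the union-find structure on the set of reads more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified, hypothetical merging criterion (two reads are joined when they share at least one substring of length k), and it swaps the bidirectional Burrows-Wheeler index for a plain hash table, so it illustrates the clustering primitive but none of the space bounds stated in the abstract. The names `UnionFind` and `cluster_reads` are illustrative only.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest over read indices (path halving, union by size)."""

    def __init__(self, m):
        self.parent = list(range(m))
        self.size = [1] * m

    def find(self, x):
        # Walk to the root, shortening the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cluster_reads(reads, k):
    """Group reads sharing at least one k-mer (assumed criterion, for illustration)."""
    uf = UnionFind(len(reads))
    first_seen = {}  # k-mer -> index of the first read that contains it
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in first_seen:
                uf.union(first_seen[kmer], i)
            else:
                first_seen[kmer] = i
    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[uf.find(i)].append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    print(cluster_reads(["ACGTAC", "GTACGG", "TTTTTT", "CTTTTA"], 4))
```

The hash table here stores every distinct k-mer explicitly, which is exactly the kind of memory cost that an index-based traversal is meant to avoid; the sketch is only meant to show how shared substrings drive union operations on read identifiers.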

Introduction

A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. A fundamental problem in metagenomics is to cluster the reads produced by a high-throughput experiment, according to which species (or, more generally, taxonomic unit) they originate from. This can be done in a supervised manner, by mapping the reads to a database of known genomes, or in an unsupervised way, by performing extensive comparisons of all reads against each other without relying on any reference database. Unsupervised methods are attractive, and in most practical cases the only option available. Having accurate clusters for reads that come from unknown taxonomic units allows one to estimate key measures of environmental biodiversity, and to assemble the corresponding genomes more accurately and using less memory [1,2,3]. A cluster corresponding to an unknown taxonomic unit could be positioned inside a taxonomy of known genomes by comparing their substring composition, and two metagenomic samples with annotated clusters could be compared in time proportional to the number of clusters, for example using the measures described in [6], rather than in time proportional to the number of distinct substrings of a specific

Journal: BMC Bioinformatics	Publication Date: Mar 1, 2017
Citations: 8	License type: open-access
