Binning Problem Research Articles

Background: With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are gradually being replaced by metagenomics, also known as environmental genomics. The first step of metagenomics, which remains a major bottleneck, is the taxonomic characterization of DNA fragments (reads) obtained by sequencing a sample of mixed species. This step is usually referred to as "binning". Existing binning methods are based on supervised or semi-supervised approaches that rely heavily on reference genomes of known microorganisms and on phylogenetic marker genes. Because reference genomes are of limited availability and marker genes are biased and unstable, existing binning methods may not be applicable in many cases.

Results: In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). Our experiments show that the method can accurately bin DNA fragments of various lengths and relative species abundance ratios without using any reference or training datasets. Another feature of our method is its error robustness: the binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%.

Conclusions: We provide a new and effective tool for solving the metagenome binning problem without using any reference datasets or marker information from known reference genomes (species). The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/.
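To make the notion of an l-mer distribution concrete, the following is a minimal sketch of how the relative frequency of each l-mer in a single DNA fragment could be computed. It is an illustration only: the function name `lmer_distribution` is hypothetical, and the abstract's "carefully selected set" of l-mers and its clustering step are not specified there and are not reproduced here.

```python
from collections import Counter


def lmer_distribution(fragment: str, l: int) -> dict[str, float]:
    """Return the relative frequency of every l-mer found in `fragment`.

    Hypothetical helper for illustration; windows containing characters
    other than A, C, G, T (for example N) are skipped.
    """
    fragment = fragment.upper()
    valid = set("ACGT")
    counts = Counter()
    # Slide a window of length l across the fragment, one base at a time.
    for i in range(len(fragment) - l + 1):
        lmer = fragment[i:i + l]
        if set(lmer) <= valid:
            counts[lmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lmer: n / total for lmer, n in counts.items()}


# Example: the 2-mer distribution of a short toy fragment.
dist = lmer_distribution("ACGTACGT", 2)
```

Fragments from the same genome would be expected to have similar distributions, which is the kind of signal an unsupervised method can cluster on without reference genomes.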

Background: The characterisation, or binning, of metagenome fragments is an important first step toward downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences. The one-dimensional signal is essentially a compact representation of higher-dimensional feature spaces of greater complexity and is intended to improve on the tetranucleotide frequency feature space preferred by current compositional binning methods.

Results: We compare the fidelity of OFDEG against tetranucleotide frequency in both an unsupervised and a semi-supervised setting on simulated metagenome benchmark data. Four tests were conducted using the assembler output of Arachne and phrap, and for each, performance was evaluated on contigs greater than or equal to 8 kbp in length and on contigs composed of at least 10 reads. Using G-C content in conjunction with OFDEG gave an average accuracy of 96.75% (semi-supervised) and 95.19% (unsupervised), versus 94.25% (semi-supervised) and 82.35% (unsupervised) for tetranucleotide frequency.

Conclusion: We have presented an observation of an alternative characteristic of DNA sequences. The proposed feature representation has proven to be more beneficial than the existing tetranucleotide frequency space for the metagenome binning problem. We do note, however, that our observation of OFDEG deserves further analysis and investigation. Unsupervised clustering revealed that OFDEG-related features performed better than standard tetranucleotide frequency in representing a relevant organism-specific signal. Further improvement in binning accuracy is given by semi-supervised classification using OFDEG.
The emphasis on a feature-driven, bottom-up approach to the problem of binning reveals promising avenues for future development of techniques to characterise short environmental sequences without bias toward cultivable organisms.
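For readers unfamiliar with the baseline features named above, here is a minimal sketch of G-C content and a 256-dimensional tetranucleotide frequency vector for one sequence. This is not OFDEG, whose computation the abstract does not describe; the function names are hypothetical, and the sketch ignores refinements such as merging reverse complements.

```python
from itertools import product

# All 4^4 = 256 tetranucleotides in a fixed (lexicographic) order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
TETRAMER_INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def tetranucleotide_frequency(seq: str) -> list[float]:
    """Normalised counts of each tetranucleotide, as a 256-long vector."""
    seq = seq.upper()
    counts = [0] * len(TETRAMERS)
    for i in range(len(seq) - 3):
        j = TETRAMER_INDEX.get(seq[i:i + 4])  # None if the window has e.g. N
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Example: a compositional feature vector for one toy contig.
contig = "ACGTACGT"
features = [gc_content(contig)] + tetranucleotide_frequency(contig)
```

A vector of this kind, computed per contig, is what a compositional binning method would pass to a clustering or classification step.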

Articles published on Binning Problem

Addressing the Binning Problem in Calibration Assessment through Scalar Annotations

RepBin: Constraint-Based Graph Representation Learning for Metagenomic Binning

Genomic style: yet another deep-learning approach to characterize bacterial genome sequences.

A Concentration Inequality for Random Polytopes, Dirichlet–Voronoi Tiling Numbers and the Geometric Balls and Bins Problem

Optimal Binning for Genomics

A binning formula of bi-histogram for joint entropy estimation using mean square error minimization

Efficient Algorithms for Sorting k-Sets in Bins

Adaptive Coordinating Construction of Truss Structures Using Distributed Equal-Mass Partitioning

EVALUATING MIXTURE MODELS FOR BUILDING RNA KNOWLEDGE-BASED POTENTIALS

Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers

The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments

Barcodes for genomes and applications

Approximate Equilibria and Ball Fusion
