Abstract

Binning reads is a crucial step in metagenomic data analysis. Unsupervised methods based on composition features are effective only for long reads, so genome abundance-based methods are often used to bin short reads. Previous abundance-based binning approaches usually use fixed-length \(l\)-mer frequencies to separate reads into groups such that the reads in each group belong to genomes (or species) of very similar abundance. However, their classification performance is very sensitive to the length of the \(l\)-mers, and they have difficulty separating reads from low-abundance genomes because short \(l\)-mers are repeated within the genomes. In this paper, a new variable-length \(l\)-mer counting method is proposed to handle the repetition of short \(l\)-mers and thereby improve the accuracy of abundance-based binning approaches. Computational experiments demonstrate that a version of AbundanceBin (a commonly used binning method) to which the proposed method is applied achieves higher accuracy than the original. The software implementing the approach can be downloaded at http://fit.hcmute.edu.vn/bioinfo/MetaSeqBin/index.htm.
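To make the fixed-length \(l\)-mer counting that earlier abundance-based approaches rely on concrete, the following is a minimal sketch and not the variable-length method proposed in this paper. The function names `count_lmers` and `mean_lmer_count`, the toy reads, and the choice \(l = 4\) are assumptions made for illustration only. The sketch counts every \(l\)-mer across a read set and then averages those counts over a single read as a rough proxy for the abundance of the genome the read comes from.

```python
from collections import Counter


def count_lmers(reads, l):
    """Count every length-l substring (l-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts


def mean_lmer_count(read, counts, l):
    """Average dataset-wide count of the l-mers in one read.

    Illustrative abundance proxy only (an assumption of this sketch).
    """
    lmers = [read[i:i + l] for i in range(len(read) - l + 1)]
    if not lmers:
        return 0.0
    return sum(counts[m] for m in lmers) / len(lmers)


# Toy reads (hypothetical data): the first two overlap heavily,
# the third shares no 4-mer with them.
reads = ["ACGTACGT", "ACGTACGA", "TTGACCAT"]
counts = count_lmers(reads, 4)
scores = [mean_lmer_count(r, counts, 4) for r in reads]
```

In this toy input the first two reads receive higher scores than the third because their \(l\)-mers recur across reads. An \(l\)-mer that is repeated inside one genome inflates its count in the same way, which is the kind of distortion the abstract attributes to short \(l\)-mers.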
