Abstract

In this paper we investigate linear combinations of oligonucleotide (k-mer) frequencies for binning metagenomic DNA fragments of short to moderate length. k-mer frequencies have been widely used for gene prediction, phylogenetic tree construction, and metagenomic binning. However, because there are 4^k possible k-mers, they lead to a high-dimensional feature space even for modest values of k. Existing methods for reducing the dimensionality of the feature space focus on particular oligonucleotide patterns or on rather small values of k. We apply principal component analysis (PCA) to the oligonucleotide frequencies, which not only reduces the feature dimensionality by a factor of more than five but also retains the most informative features. Our experiments on simulated metagenomic data sets with four types of classifiers show that (i) the PCA-based linear combinations of k-mer frequencies capture the intrinsic characteristics of DNA fragments and can therefore adequately serve as binning features; (ii) the PCA-based linear combinations of k-mer frequencies tend to be more effective and stable as the DNA fragment length increases; and (iii) rather simple linear classifiers can achieve high accuracy for metagenomic DNA fragment binning at various taxonomic levels, even at a level as specific as species.
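The feature construction summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`kmer_frequencies`, `pca_fit`, `pca_transform`), the choice of k = 4, the 50 retained components, and the random toy fragments are all assumptions made for illustration. It computes normalized k-mer frequencies and projects them onto the leading principal components, so that each reduced feature is a linear combination of k-mer frequencies.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k):
    """Return the normalized k-mer frequency vector (length 4**k) of a DNA string."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows with ambiguous bases (e.g. N) are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def pca_fit(X, n_components):
    """Fit PCA via SVD of the centered data; rows of the result are principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project frequency vectors onto the principal directions."""
    return (X - mean) @ components.T


# Toy data (assumption): 200 random fragments of 1000 bp, not real metagenomic reads.
rng = np.random.default_rng(0)
fragments = ["".join(rng.choice(list("ACGT"), size=1000)) for _ in range(200)]

k = 4  # 4**4 = 256 raw features
X = np.array([kmer_frequencies(f, k) for f in fragments])

# Keeping 50 of 256 dimensions is an illustrative choice giving a ratio above five.
mean, components = pca_fit(X, n_components=50)
Z = pca_transform(X, mean, components)
print(X.shape, "->", Z.shape)
```

The reduced matrix `Z` could then be passed to any classifier; the number of components to keep would normally be chosen from the data (for example by explained variance) rather than fixed in advance.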
