Abstract

Metagenomic sequencing data obtained from host-associated microbial communities are usually contaminated with host genome sequence fragments. These contaminating fragments must be identified and removed before any downstream analysis. Available host-contamination detection techniques have very high time and memory requirements, which makes processing large metagenomic datasets challenging. This study presents CS-SCORE, a novel algorithm that can rapidly identify host sequences contaminating metagenomic datasets. Validation results indicate that CS-SCORE is 2–6 times faster than the current state-of-the-art methods. Furthermore, the memory footprint of CS-SCORE is in the range of 2–2.5 GB, which is significantly lower than that of other available tools. CS-SCORE achieves this efficiency by incorporating (1) a heuristic pre-filtering mechanism and (2) a directed-mapping approach that utilizes a novel sequence composition metric (cs-score). CS-SCORE is expected to be a handy ‘pre-processing’ utility for researchers analyzing metagenomic datasets.

Availability: For academic users, an implementation of CS-SCORE is freely available at http://metagenomics.atc.tcs.com/cs-score or https://metagenomics.atc.tcs.com/preprocessing/cs-score.

