CS-SCORE: Rapid identification and removal of human genome contaminants from metagenomic datasets

Mohammed Monzoorul Haque, Tungadri Bose, Anirban Dutta, Chennareddy Venkata Siva Kumar Reddy, Sharmila S Mande

doi:10.1016/j.ygeno.2015.04.005

Open Access

https://doi.org/10.1016/j.ygeno.2015.04.005

Journal: Genomics
Publication Date: May 2, 2015
Citations: 17
License type: Elsevier-specific OA user license

Abstract

Metagenomic sequencing data obtained from host-associated microbial communities are usually contaminated with host genome sequence fragments. Prior to performing any downstream analyses, it is necessary to identify and remove such contaminating sequence fragments. The time and memory requirements of available host-contamination detection techniques are enormous. Thus, processing of large metagenomic datasets is a challenging task. This study presents CS-SCORE, a novel algorithm that can rapidly identify host sequences contaminating metagenomic datasets. Validation results indicate that CS-SCORE is 2–6 times faster than the current state-of-the-art methods. Furthermore, the memory footprint of CS-SCORE is in the range of 2–2.5 GB, which is significantly lower than that of other available tools. CS-SCORE achieves this efficiency by incorporating (1) a heuristic pre-filtering mechanism and (2) a directed-mapping approach that utilizes a novel sequence composition metric (cs-score). CS-SCORE is expected to be a handy ‘pre-processing’ utility for researchers analyzing metagenomic datasets.

Availability

For academic users, an implementation of CS-SCORE is freely available at: http://metagenomics.atc.tcs.com/cs-score or https://metagenomics.atc.tcs.com/preprocessing/cs-score.
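
The abstract does not specify how the pre-filtering step or the cs-score metric is computed. As a rough illustration of the general idea of cheaply flagging likely host reads before any expensive mapping, the following minimal sketch uses a plain shared k-mer fraction. This is a stand-in technique, not the CS-SCORE algorithm or its metric; the function names, the value of `k`, and the threshold are all hypothetical choices made for the example.

```python
from typing import Iterable, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-length substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_index(host_seqs: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer seen in the host reference sequences."""
    index: Set[str] = set()
    for s in host_seqs:
        index |= kmers(s, k)
    return index


def host_kmer_fraction(read: str, host_index: Set[str], k: int) -> float:
    """Fraction of the read's distinct k-mers that also occur in the host index."""
    read_kmers = kmers(read, k)
    if not read_kmers:  # read shorter than k
        return 0.0
    return len(read_kmers & host_index) / len(read_kmers)


def prefilter(
    reads: Iterable[str],
    host_index: Set[str],
    k: int = 8,             # hypothetical k-mer size
    threshold: float = 0.5  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Split reads into (host candidates, retained reads).

    Only the candidates would be passed on to a slower, more precise
    mapping step; the retained reads skip it entirely.
    """
    candidates: List[str] = []
    retained: List[str] = []
    for read in reads:
        if host_kmer_fraction(read, host_index, k) >= threshold:
            candidates.append(read)
        else:
            retained.append(read)
    return candidates, retained


# Toy data, made up for illustration only.
host = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
host_read = host[5:25]
microbial_read = "TTTTTTTTTTGGGGGGGGGGCCCCCCCCCC"

index = build_host_index([host], k=8)
candidates, retained = prefilter([host_read, microbial_read], index, k=8)
```

In this toy setup, only reads flagged as candidates would go through a costlier alignment, which is the kind of saving a pre-filter is meant to provide. A real tool would need a compact index and a calibrated threshold.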

Full Text