Abstract
Cancer health disparities exist among different ethnicities and races. Various factors such as lifestyle, environmental exposure, genetics, and epigenetics are thought to play a role in the existence of these disparities. Viruses, which outnumber cells in the human body by 100-fold, have a major impact on human health. Viruses have been estimated to cause 15-20% of human cancers, and we expect that some of these oncogenic viruses may have the potential to impact health disparities. Next-generation sequencing (NGS) technologies are being used to investigate novel virus-cancer associations and interactions, and several bioinformatics tools for the detection and analysis of virus sequences in human NGS data have recently become available. These tools have not been validated, however, for use in human cancers with extremely low levels of viral sequences. In this study, we compared the sensitivity and specificity of READSCAN to those of a manually constructed analysis workflow in a common, commercial NGS web application. READSCAN is a freely available software application that uses automated “digital subtraction” to eliminate host reads and identify specific virus sequences in NGS data. The functionality of this tool was compared to a workflow constructed within the web interface for the commercially available NGS software Partek. This constructed workflow first subtracted human sequences and then aligned the remaining sequences against select viral genomes. The sensitivity of the two tools was compared using NGS data obtained from human cytomegalovirus (HCMV) infected cells. Specificity of the tools was determined by analyzing the same data set against the Epstein-Barr virus (EBV) and human papillomavirus type 16 (HPV16) genomes. Partek and READSCAN detected 69.66% and 60.54% of the input reads, respectively, as HCMV sequences. Under these conditions, neither bioinformatics tool detected EBV- or HPV16-specific reads in the HCMV infected cell data. 
This indicated that both Partek and READSCAN were capable of readily and specifically detecting large amounts of virus reads in NGS data from infected cells. To test the ability of these tools to detect virus-specific reads in NGS data from tumors, data obtained from the HPV16-positive cervical squamous cell carcinoma cell line SiHa were analyzed. Both Partek and READSCAN detected six reads out of 969,798 total reads as HPV16 sequences; EBV or HCMV sequences were not detected. Our results are in agreement with previously published observations for this SiHa cell line, in which five HPV16-specific reads were detected. While we are still in the process of testing additional bioinformatics tools and configurations, our results indicate that the open-source bioinformatics tool READSCAN and the commercially available Partek are comparable in performance, but differ in utility. Partek is faster, has a user-friendly interface, and does not require knowledge of Linux commands. READSCAN lacks these features, but it is free and its digital subtraction algorithm is more clearly described in the literature. Next steps include utilizing a wider range of digital subtraction software, evaluating a broader range of NGS data from demographically diverse origins, and investigating discordance as it is found between alternative tools and configuration parameters. Ultimately, an ensemble of tools and configuration contexts is expected to yield critical information on the role of viruses in cancer and associated health disparities.
Citation Format: Barbara Swanson, Scott Harrison, Dukka KC, Perpetua Muganda. Comparative analysis of bioinformatic tools for the detection of viral DNA sequences in tumor cells. [abstract]. In: Proceedings of the Sixth AACR Conference: The Science of Cancer Health Disparities; Dec 6–9, 2013; Atlanta, GA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2014;23(11 Suppl):Abstract nr C04.
doi:10.1158/1538-7755.DISP13-C04
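
The two-step idea behind the workflows compared in the abstract (first subtract host reads, then match the remainder against selected viral genomes) can be illustrated with a minimal sketch. This is a toy model under stated assumptions, not the algorithm used by READSCAN or Partek: it replaces real read alignment with exact shared k-mer matching, and the function names `kmers`, `digital_subtraction`, and `percent_detected` are hypothetical names chosen for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def digital_subtraction(reads, host_ref, viral_refs, k=8):
    """Toy digital subtraction: drop host-like reads, then classify the rest.

    Assumption for illustration only: a read "matches" a reference if it
    shares at least one exact k-mer with it. Real tools use full alignment
    with mismatch tolerance and quality filtering.
    """
    host_index = kmers(host_ref, k)
    viral_index = {name: kmers(seq, k) for name, seq in viral_refs.items()}

    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if read_kmers & host_index:
            continue  # step 1: subtract reads that look like host sequence
        for name, index in viral_index.items():
            if read_kmers & index:
                counts[name] += 1  # step 2: assign to the first matching virus
                break
    return counts


def percent_detected(viral_reads, total_reads):
    """Express a viral read count as a percentage of all input reads."""
    return 100.0 * viral_reads / total_reads
```

Applied to the SiHa figures reported above, `percent_detected(6, 969798)` gives roughly 0.0006%, which shows how small the viral signal is compared with the 60-70% of reads detected in the HCMV infected cell data.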