Abstract
Understanding tumors as evolutionary systems is an important area of study with far-reaching implications for diagnostic and treatment paradigms. Computational phylogenetics is a valuable method for inferring tumor evolution in terms of evolutionary trees (phylogenies), where paths in a tree correspond to possible tumor progression pathways. The locations of specific cell types and patient samples in the tree provide information on tumor sub-types and the development of heterogeneity. We previously developed a tumor phylogeny inference pipeline for array comparative genomic hybridization (aCGH)-based tumor copy number profiles. Steps in the pipeline included extracting robust progression markers from the data, which could differentiate stages of tumor evolution or different paths in the tree, and assigning amplification states to the inferred markers in those stages. We introduced a novel multi-sample model for amplicon identification and calling, HMMCNA, which jointly extracted markers from, and assigned amplification states to, small sets of tumor aCGH profiles. HMMCNA employs a Hidden Markov Model (HMM), a probabilistic model, to classify data into normal and amplified states based on an underlying distribution for the two copy number states and a hidden state space of possible amplification states. We assumed two possible amplification states per sample: normal (0) or amplified (1). Joint segmentation and calling are performed by identifying a most likely sequence of amplification states across all genomic probes and samples. This approach limits the number of samples the HMM can handle, since the number of possible hidden amplification states increases exponentially with the number of samples. Here, we present an extension of the approach to handle large datasets.
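To make the scaling problem concrete, the following is a minimal sketch of joint calling over binary per-sample states, not the HMMCNA implementation itself. It assumes Gaussian emissions with hypothetical means and standard deviation for the normal and amplified states, a uniform start distribution, and a simple "stay or move" transition model; the names `joint_states`, `viterbi_joint`, `means`, `sd`, and `stay` are illustrative only.

```python
import itertools
import math


def joint_states(n_samples):
    """All joint amplification states: one 0/1 call per sample."""
    return list(itertools.product((0, 1), repeat=n_samples))


def log_gauss(x, mean, sd):
    """Log density of a normal distribution (assumed emission model)."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_joint(profiles, means=(0.0, 1.0), sd=0.3, stay=0.9):
    """Most likely sequence of joint states, one per probe.

    profiles[s][p] is the log-ratio of sample s at probe p.
    means, sd and stay are hypothetical parameters, not values from the abstract.
    """
    n = len(profiles)
    n_probes = len(profiles[0])
    states = joint_states(n)
    k = len(states)  # k = 2**n, the source of the exponential cost
    log_stay = math.log(stay)
    log_move = math.log((1 - stay) / (k - 1))

    def emit(state, p):
        # Samples are treated as independent given the joint state.
        return sum(log_gauss(profiles[s][p], means[state[s]], sd) for s in range(n))

    score = [emit(st, 0) - math.log(k) for st in states]
    back = []
    for p in range(1, n_probes):
        new_score, ptr = [], []
        for j, st in enumerate(states):
            best_i = max(
                range(k),
                key=lambda i: score[i] + (log_stay if i == j else log_move),
            )
            trans = log_stay if best_i == j else log_move
            new_score.append(score[best_i] + trans + emit(st, p))
            ptr.append(best_i)
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    j = max(range(k), key=lambda i: score[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [states[i] for i in path]
```

Each Viterbi step in this sketch compares every pair of joint states, so the work per probe grows as the square of 2^n: two samples give 4 joint states, while twenty samples already give 1,048,576.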
We add a heuristic step before the HMM classification that reduces the hidden state space by first screening out amplification states not strongly supported at any individual genome coordinate. This heuristic reduces the state space by 99% on average. We further reduce the set of possible amplification states based on their frequency of occurrence, allowing only those states that occur at multiple aCGH probes (array genome coordinates). This step accounts for the presence of random noise in the data and gives a further reduction of 80%. We demonstrate the method on a breast tumor aCGH dataset comprising copy number profiles derived from sectioned biopsy samples (NCBI GEO GSE16672, Navin et al., 2010). Our method was able to quickly segment the data into sets of robust normal and amplified segments suitable for downstream phylogeny building. The inferred amplicons carried several known markers of tumor progression. Further steps include tuning the parameters of the HMM to handle noise levels across different datasets.
Citation Format: Ayshwarya Subramanian, Stanley Shackney, Russell Schwartz. Inference of tumor phylogenetic markers from large copy number datasets. [abstract]. In: Proceedings of the 104th Annual Meeting of the American Association for Cancer Research; 2013 Apr 6-10; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2013;73(8 Suppl):Abstract nr 5133. doi:10.1158/1538-7445.AM2013-5133
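The two state-space reductions described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: "strongly supported at a coordinate" is approximated here by thresholding each sample's log-ratio at a probe, and the names `candidate_states`, `threshold`, and `min_probes` are hypothetical.

```python
from collections import Counter


def candidate_states(profiles, threshold=0.5, min_probes=2):
    """Return (observed, frequent) sets of joint amplification states.

    profiles[s][p] is the log-ratio of sample s at probe p.
    threshold and min_probes are assumed values chosen for illustration.
    """
    n_probes = len(profiles[0])
    # Step 1: the joint 0/1 state suggested by the data at each probe.
    counts = Counter(
        tuple(int(sample[p] > threshold) for sample in profiles)
        for p in range(n_probes)
    )
    observed = set(counts)
    # Step 2: drop states seen at too few probes, treating them as noise.
    frequent = {state for state, c in counts.items() if c >= min_probes}
    return observed, frequent


profiles = [
    [0.0, 0.1, 1.0, 0.9, 0.0, 0.9],
    [0.0, 0.0, 1.1, 1.0, 1.0, 0.1],
]
observed, frequent = candidate_states(profiles)
print(sorted(observed))
print(sorted(frequent))
```

Because at most one joint state is recorded per probe, the screened set can never exceed the number of probes, however many samples are profiled. The retained states would then serve as the hidden state space of the joint HMM in place of all 2^n combinations.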