Abstract

Cancer progression is often driven by an accumulation of genetic changes but is also accompanied by increasing genomic instability. These processes lead to a complicated landscape of copy number alterations (CNAs) within individual tumors and great diversity across tumor samples. High-resolution array-based comparative genomic hybridization (aCGH) is being used to profile CNAs of ever larger tumor collections, and better computational methods for processing these data sets and identifying potential driver CNAs are needed. Typical studies of aCGH data sets take a pipeline approach, starting with segmentation of profiles, followed by calls of gains and losses, and finally determination of frequent CNAs across samples. A drawback of pipelines is that choices at each step may produce different results, and biases are propagated forward. We present a mathematically robust new method that exploits probe-level correlations in aCGH data to discover subsets of samples that display common CNAs. Our algorithm is related to recent work on maximum-margin clustering. It does not require pre-segmentation of the data and also provides grouping of recurrent CNAs into clusters. We tested our approach on a large cohort of glioblastoma aCGH samples from The Cancer Genome Atlas and recovered almost all CNAs reported in the initial study. We also found additional significant CNAs missed by the original analysis but supported by earlier studies, and we identified significant correlations between CNAs.

Highlights

  • Cancers are a complex set of proliferative diseases whose progression, in most cases, is driven in part by an accumulation of genetic changes, including copy number aberrations (CNAs) of large or small genomic regions [1,2,3], which may, for example, lead to amplification of oncogenes or loss of tumor suppressor genes

  • Array-based comparative genomic hybridization [4,5] and single nucleotide polymorphism (SNP) arrays [6] have been used to analyze the CNAs of tumor samples at a genomic scale and at progressively higher resolutions

  • Our approach builds on a recently proposed maximum-margin clustering algorithm [21,22], which brings ideas from large-margin supervised learning techniques like support vector machine classification and support vector regression to the unsupervised clustering problem; the choice of constraints was motivated by recent work on fused lasso regression [28]

Summary

Introduction

Cancers are a complex set of proliferative diseases whose progression, in most cases, is driven in part by an accumulation of genetic changes, including copy number aberrations (CNAs) of large or small genomic regions [1,2,3], which may, for example, lead to amplification of oncogenes or loss of tumor suppressor genes. Cancer progression is often characterized by increasing genomic instability, potentially generating many “passenger” CNAs that do not confer a clonal growth advantage. These processes give rise to a complicated landscape of genomic alterations within an individual tumor and great diversity of these CNAs across tumor samples, making it difficult to identify driver mutations associated with cancer progression. Numerous large-scale tumor profiling studies have generated copy number data sets for large cohorts of tumors [7,8]. These large and complex “cancer genome” data sets present difficult statistical challenges [9]. Individual CNAs may be as small as a few adjacent probes or as large as a whole chromosome and may be difficult to detect above probe-level noise; it is unclear how to make sense of diverse CNAs from hundreds of tumors.

