Abstract

The karyotypes of human tumors are often aneuploid. Besides these numerical deviations, there are often structural rearrangements within individual chromosomes, such as amplifications, deletions and translocations. The resulting genome carries profound and complex alterations in gene dosage and in the underlying gene network, giving rise to the observed malignant phenotype. To understand how these events contribute to the biology of a tumor cell, a crucial first step is to detect variation in chromosomal structure and, in particular, in copy number. Recent technologies such as array comparative genomic hybridization (aCGH), SNP microarrays and, more recently, high-throughput sequencing have made it feasible to detect copy number aberrations (CNA). These techniques yield, for thousands of regions across the genome, a numerical value proportional to the chromosome copy number of each assessed region. Comparing tumor DNA with DNA from normal samples then identifies copy number aberrations across the genome.

However, the methodology used to calculate CNA from the raw data usually makes one or both of the following assumptions:

• The starting material consists of genetically homogeneous cells.
• The overall size of a tumor genome is very similar to the size of the normal genome.

The first assumption is reasonable for cell lines, but not when the DNA analyzed is isolated from patients' tumors, where infiltration by stromal or endothelial cells is largely inevitable. The second assumption, which has a strong impact on the normalization step, is often incorrect. Given the scale, extent and severity of chromosomal structural changes, the overall amount of genetic material in a cancer cell might differ significantly from (and is usually larger than) that of a normal cell.
Therefore, assuming that the total size of a cancer genome is comparable to that of the normal genome might lead to artifacts and misleading results. Some methods do not strictly require assumptions about the size of the genome, but they rely on the ability to detect SNP variants and to distinguish between the two alleles of a heterozygous region. They can then infer whether two, three, four or more copies are present. Although high-throughput sequencing could also detect SNP variants, it would require very high read coverage, making the technology far too expensive at present.

Here, we propose a method to obtain CNA from high-throughput sequencing that avoids the two assumptions mentioned above. The method counts the number of reads mapped to regions of fixed length in both tumor and normal DNA from the same patient. For each region, the ratio between the number of reads from the tumor sample and from the normal sample depends mainly on three factors:

• The average number of copies (including stromal contamination) of the given chromosomal region in the two samples.
• The overall depth of coverage (total number of reads) of the two samples.
• Sampling error.

The goal is to detect, for each chromosomal region, the underlying ratio between the number of copies in the tumor genome and in the normal genome, despite the noise due to differences in depth of coverage and to sampling error. First, for each chromosomal region, we calculate the ratio of read counts between tumor and normal. This balances out various biases due to biological and technical issues (e.g. GC content, alignment artifacts). Second, to reduce the sampling error with minimal effect on resolution, we apply a segmentation algorithm to the resulting ratios. Finally, we look at the distribution of the ratios across all chromosomal regions and fit a model of several normal distributions with equally spaced means.
The ratio from each region can thus be assigned to one of these distributions and the underlying CNA estimated. We are testing the algorithm on a series of simulated and real samples to assess the strength of the proposed method.
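
The ratio and model-fitting steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `read_ratios` and `fit_equally_spaced_mixture`, the shared variance, the starting values and the parameter `n_levels` are all assumptions made for illustration, and the segmentation step is omitted because the abstract does not name the algorithm used. The fit is a plain expectation-maximization loop for a mixture of normal distributions whose means are constrained to `offset + k * spacing`.

```python
import math


def read_ratios(tumor_counts, normal_counts):
    """Tumor/normal read-count ratio per fixed-length region.

    Regions with no reads in the normal sample are skipped (an assumption).
    """
    return [t / n for t, n in zip(tumor_counts, normal_counts) if n > 0]


def fit_equally_spaced_mixture(ratios, n_levels, n_iter=50, min_var=1e-4):
    """EM fit of a 1-D Gaussian mixture with means offset + k * spacing.

    All components share one variance. n_levels (>= 2) must be chosen by
    the caller. Returns (offset, spacing, level index per ratio).
    """
    lo, hi = min(ratios), max(ratios)
    # Starting values (assumption): levels evenly span the observed range.
    offset = lo
    spacing = (hi - lo) / (n_levels - 1)
    var = max((spacing / 4) ** 2, min_var)
    weights = [1.0 / n_levels] * n_levels
    resp = []
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space to avoid underflow.
        resp = []
        for r in ratios:
            logs = [
                math.log(w) - (r - (offset + k * spacing)) ** 2 / (2 * var)
                for k, w in enumerate(weights)
            ]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: weighted least squares of ratio on level index k.
        sw = sk = skk = sr = skr = 0.0
        for r, row in zip(ratios, resp):
            for k, g in enumerate(row):
                sw += g
                sk += g * k
                skk += g * k * k
                sr += g * r
                skr += g * k * r
        denom = sw * skk - sk * sk
        if denom > 1e-12:
            spacing = (sw * skr - sk * sr) / denom
            offset = (sr - spacing * sk) / sw
        sq = sum(
            g * (r - offset - k * spacing) ** 2
            for r, row in zip(ratios, resp)
            for k, g in enumerate(row)
        )
        var = max(sq / sw, min_var)  # floor keeps the variance positive
        weights = [
            max(sum(row[k] for row in resp) / len(ratios), 1e-9)
            for k in range(n_levels)
        ]
    # Assign each ratio to its most responsible level (last E-step).
    levels = [max(range(n_levels), key=lambda k: row[k]) for row in resp]
    return offset, spacing, levels
```

The returned level index is only relative: mapping it to an absolute copy number would need extra information, since shifting `offset` by one `spacing` relabels every level. In this sketch the fitted `spacing` absorbs both the tumor fraction and the difference in sequencing depth, so neither has to be assumed beforehand.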