Using the R Package crlmm for Genotyping and Copy Number Estimation

Robert B Scharpf, Ingo Ruczinski, Rafael A Irizarry, Benilton Carvalho, Matthew E Ritchie

doi:10.18637/jss.v040.i12

Abstract

Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number and integration of the marker-level estimates with complementary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R.
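
The claim that unadjusted batch effects can produce spurious association can be illustrated with a small toy calculation. The sketch below is not the multilevel model implemented in crlmm; it assumes hypothetical log-scale intensities for a single marker, two made-up batches ("A" and "B") that differ only by a constant shift, and it substitutes simple median-centering within each batch for a real batch adjustment. All sample values, names, and functions are illustrative assumptions.

```python
from statistics import mean, median

# Hypothetical toy data for one marker. Every sample has the same true copy
# number; batch "A" carries a constant +0.5 shift and is enriched for cases.
NOISE = [-0.10, 0.10, -0.05, 0.05, -0.10, 0.10, -0.05, 0.05]


def make_samples():
    """Return a list of (batch, is_case, intensity) tuples."""
    samples = []
    for i, e in enumerate(NOISE):
        samples.append(("A", i < 6, 1.0 + 0.5 + e))  # 6 cases, 2 controls
    for i, e in enumerate(NOISE):
        samples.append(("B", i < 2, 1.0 + e))        # 2 cases, 6 controls
    return samples


def case_control_difference(samples):
    """Mean intensity in cases minus mean intensity in controls."""
    cases = [x for _, is_case, x in samples if is_case]
    controls = [x for _, is_case, x in samples if not is_case]
    return mean(cases) - mean(controls)


def center_by_batch(samples):
    """Subtract each batch's median intensity (a simple stand-in for a
    model-based batch adjustment, chosen only for illustration)."""
    centers = {}
    for batch in {b for b, _, _ in samples}:
        centers[batch] = median([x for b, _, x in samples if b == batch])
    return [(b, is_case, x - centers[b]) for b, is_case, x in samples]


samples = make_samples()
naive = case_control_difference(samples)
adjusted = case_control_difference(center_by_batch(samples))
print(f"unadjusted case-control difference: {naive:.3f}")
print(f"batch-centered case-control difference: {adjusted:.3f}")
```

In this constructed example the unadjusted difference between cases and controls is driven entirely by the imbalance of cases across batches, and it vanishes once each batch is centered. Real data would also require handling batch-specific variances and genotype-dependent intensities, which this sketch ignores.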

Highlights

Duplications and deletions spanning kilobases of the genome contribute to a substantial proportion of the genetic variation between individuals.
We have applied the crlmm software to the HapMap phase 3 data, illustrating the steps of preprocessing, the genotyping of polymorphic markers, and the estimation of allele-specific copy number.
We organize the normalized intensities, statistical summaries from the genotyping and copy number estimation steps, and meta-data on the features and samples in a single container. This container extends the eSet class defined in Biobase, with additional slots to accommodate batch-specific statistical summaries relevant for copy number analyses.

Summary

Introduction

Duplications and deletions spanning kilobases of the genome contribute to a substantial proportion of the genetic variation between individuals. Current estimates of the frequency and size of segmental duplications and deletions in the human genome are largely based on high-throughput arrays that quantify copy number on a genomic scale. Two such technologies are array comparative genomic hybridization (aCGH) and genotyping platforms such as the Affymetrix oligonucleotide arrays and the Illumina BeadArrays. This paper describes software for the first stage of a two-stage approach for identifying copy number variants (CNV) from high-throughput genotyping arrays.

Preprocessing and genotyping

Locus-level copy number estimation

Downstream tools

Discussion

Session information