One limitation of genome-wide linkage screens aiming to identify disease-susceptibility genes is the resolution at which they define individual candidates. Fine mapping of linked loci is often laborious, protracted, and poorly directed. We reasoned that an integrative genomics approach could help to expedite candidate gene identification. We tested this hypothesis by prioritizing genes within three linkage regions (chromosomes 2q33-36, 8pter-22, and 12p13-12) previously identified in the Boston Early-Onset COPD cohort (1), using two independent microarray data sets. Genes within these linkage regions were identified using Golden Path (University of California Santa Cruz, http://genome.ucsc.edu), and associated probes were assigned using NetAffx (Affymetrix, Santa Clara, CA). Probe sets were filtered by sequence verification using The Lung Transcriptome (http://lungtranscriptome.bwh.harvard.edu). The first microarray data set consists of whole lung tissue samples derived from 20 patients with severe emphysema who had undergone lung volume reduction surgery and from 14 control subjects with mild to moderate obstruction (2). The second data set, which has not been previously described, consists of whole lung tissue samples (uninvolved margin) derived from 31 patients undergoing surgery for solitary pulmonary nodules, with varying degrees of airflow obstruction (FEV1% predicted: mean, 72; range, 10–133). Signal intensities were derived using both the MAS5 and RMA algorithms. Unsupervised clustering with the nonparametric bootstrap was applied to check for undesirable or unanticipated structure or associations among the samples. Pearson and Spearman correlations were used to test for significant associations between gene expression and continuous phenotypic variables (e.g., TLC, FEV1, FVC, FEV1/FVC, DLCO, FEF25–75, smoking history). All analyses were repeated for each gene/probe set in each data set.
For each gene/probe set, analysis results were summarized wherever the data indicated an association between gene expression and the disease variable. Finally, genes were rank-prioritized based upon their frequency of significant association. A number of genes within each locus were repeatedly associated with multiple phenotypic variables in each data set. As has been previously noted, there was limited consistency between data sets, which in this case may be due to differences in patient populations and/or technical limitations. However, this approach consistently identified a small number of genes, including the recently implicated candidate susceptibility gene SERPINE2, as targets for further study. We believe this provides a rational approach that is applicable to many diseases and model systems for which data may already exist. However, further investigation is necessary to establish causation between prioritized candidates and the phenotype being investigated.
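The prioritization step described above (correlate each gene's expression with each continuous phenotype, then rank genes by how often the association is significant) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not say how significance was assessed, so the permutation test, the 0.05 threshold, the function names, and the toy data (`GENE_A`, `GENE_B`, and the phenotype values) are all assumptions made purely for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p(stat, x, y, n_perm=500, seed=0):
    """Two-sided permutation p-value (an assumed stand-in for the paper's test)."""
    rng = random.Random(seed)
    observed = abs(stat(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(stat(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def prioritize(expression, phenotypes, alpha=0.05):
    """Rank genes by their number of significant gene-phenotype associations."""
    counts = {gene: 0 for gene in expression}
    for gene, values in expression.items():
        for phenotype_values in phenotypes.values():
            for stat in (pearson, spearman):
                if permutation_p(stat, values, phenotype_values) < alpha:
                    counts[gene] += 1
    # Most frequently associated genes first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: eight samples, two genes, two phenotypes.
expression = {
    "GENE_A": [1, 2, 3, 4, 5, 6, 7, 8],
    "GENE_B": [5, 1, 8, 3, 6, 2, 7, 4],
}
phenotypes = {
    "fev1_pct_predicted": [90, 80, 75, 60, 50, 45, 30, 20],
    "smoking_pack_years": [5, 10, 20, 25, 40, 50, 60, 80],
}
ranking = prioritize(expression, phenotypes)
```

In this toy example `GENE_A` tracks both phenotypes closely and is ranked first, while `GENE_B` is essentially uncorrelated with either. A real analysis would repeat the tally per probe set and per data set, and would need to account for multiple testing, which this sketch omits.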