Abstract

The Cancer Genome Atlas (TCGA) has made available comprehensive molecular profiles, including RNA-seq gene expression, for many human tumor types and has thereby provided a great opportunity to use pan-cancer gene expression patterns to identify unique features that can classify tumor types. Those features may serve as biomarkers for tumor diagnosis and as potential targets for drug development. Because gender differences in cancer susceptibility are among the most consistent findings in cancer epidemiology, knowing whether the distinguishing features differ between males and females for the same tumor types might enhance their utility as biomarkers. Our goals are: 1) to identify a set of genes whose expression levels can classify pan-cancer tumor types when gender is ignored; and 2) to identify analogous sets of genes for pan-cancer classification in sex non-specific tumors from men and from women separately. Here we analyzed the RNA-seq expression data for 9,096 TCGA tumor samples from 31 tumor types; we selected features for sample classification using GA/KNN, a genetic algorithm combined with k-nearest-neighbors classification that we developed. We randomly assigned half of all samples to a training set and half to a test set, allocating samples from each tumor type proportionally. We used the training set to obtain 2,000 near-optimal classifiers (each a set of 20 genes) from repeated GA/KNN runs. We then applied each resulting near-optimal classifier to the test set. The predicted class was compared with the true class to calculate both training-set and testing-set prediction accuracies. To see whether sex non-specific tumors (those not related to reproductive organs) differ between males and females, we repeated the analysis for 24 sex non-specific tumor types separately for males and for females. In the analysis ignoring gender, we could accurately classify 88.7% of the training samples and 88.1% of the testing samples on average.
The top 20 genes that separated the 31 tumor types were TCF21, TBX5, EMX2OS, EMX2, PA2G4P4, HNF1B, NACA2, SFTPA1, ATP5EP2, PTTG3P, FTHL3, ANXA2P3, GATA3, NAPSA, SFTA3, HSPB1P1, HOXA9, IGBP1P1, RPL19P12, and SFTPB. Not surprisingly, the analyses done separately for males and for females achieved similar average classification accuracies for testing samples (88.2% for males and 88.4% for females). Seventy-four of the 100 top-ranked genes were common to males and females. The main reasons for the less-than-total overlap between males and females are the stochastic nature of the algorithm, the training/testing partitioning (done only once), and some intrinsic differences between genders. For example, BNC1, FAT2, and MSX2 ranked 36, 84, and 94 in males but 461, 603, and 1085 in females. Conversely, HPN, POU3F3, and FOXA1 ranked 75, 78, and 91 in females but 421, 355, and 349 in males, respectively. It remains unclear whether genes like those play any role in differential cancer susceptibility between genders. In conclusion, we were able to select many sets of 20 genes that can correctly classify ~88% of the tumor samples from 31 different types in a validation set using RNA-seq gene expression alone. This accuracy is remarkable given the number of tumor types involved. This result was largely replicated when tumors from males and from females were analyzed separately. We also identified a few genes that might play a role in gender differences among sex non-specific tumors.

Citation Format: YuanYuan Li, David M. Umbach, Leping Li. A comprehensive genomic pan-cancer analysis comparing males and females using The Cancer Genome Atlas gene expression data [abstract]. In: Proceedings of the AACR Precision Medicine Series: Targeting the Vulnerabilities of Cancer; May 16-19, 2016; Miami, FL. Philadelphia (PA): AACR; Clin Cancer Res 2017;23(1_Suppl):Abstract nr A46.
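
The workflow described in the abstract (a proportional half/half split by tumor type, a search for small gene sets on the training half, then scoring on the held-out half) can be illustrated with a minimal Python sketch. This is not the authors' GA/KNN implementation: the function names, the synthetic data, the majority-vote rule, the leave-one-out fitness, and the simplified search (truncation selection plus single-gene mutation, no crossover) are all assumptions made for illustration, and the toy uses 3-gene sets rather than the 20-gene sets reported above.

```python
import random
from collections import Counter, defaultdict

def stratified_half_split(labels, seed=0):
    """Put half of each class's sample indices in training, the rest in testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        half = len(idx) // 2
        train += idx[:half]
        test += idx[half:]
    return train, test

def knn_predict(x, ref_X, ref_y, genes, k=3):
    """Majority vote among the k nearest reference samples, using only `genes`."""
    ranked = sorted(
        ((sum((x[g] - r[g]) ** 2 for g in genes), y) for r, y in zip(ref_X, ref_y)),
        key=lambda pair: pair[0],
    )
    return Counter(y for _, y in ranked[:k]).most_common(1)[0][0]

def loo_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy within one sample set (assumed fitness function)."""
    hits = 0
    for i in range(len(X)):
        ref_X, ref_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += knn_predict(X[i], ref_X, ref_y, genes, k) == y[i]
    return hits / len(X)

def holdout_accuracy(train_X, train_y, test_X, test_y, genes, k=3):
    """Fraction of test samples classified correctly from the training samples."""
    hits = sum(knn_predict(x, train_X, train_y, genes, k) == t
               for x, t in zip(test_X, test_y))
    return hits / len(test_X)

def evolve_gene_set(X, y, n_genes, d=3, pop=12, generations=15, seed=0):
    """Simplified evolutionary search for a d-gene subset (mutation only)."""
    rng = random.Random(seed)
    population = [rng.sample(range(n_genes), d) for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=lambda g: loo_accuracy(X, y, g), reverse=True)
        survivors = population[: pop // 2]       # keep the fitter half
        children = []
        for parent in survivors:
            child = list(parent)
            unused = [g for g in range(n_genes) if g not in child]
            child[rng.randrange(d)] = rng.choice(unused)  # swap in one new gene
            children.append(child)
        population = survivors + children
    return max(population, key=lambda g: loo_accuracy(X, y, g))

# Synthetic toy data (not TCGA): class c has elevated expression of gene c only.
data_rng = random.Random(1)
n_genes, n_classes, per_class = 10, 3, 20
labels = [c for c in range(n_classes) for _ in range(per_class)]
data = [[data_rng.gauss(8.0 if g == c else 0.0, 1.0) for g in range(n_genes)]
        for c in labels]

train_idx, test_idx = stratified_half_split(labels)
train_X, train_y = [data[i] for i in train_idx], [labels[i] for i in train_idx]
test_X, test_y = [data[i] for i in test_idx], [labels[i] for i in test_idx]

best_genes = evolve_gene_set(train_X, train_y, n_genes)
test_acc = holdout_accuracy(train_X, train_y, test_X, test_y, best_genes)
print("selected genes:", sorted(best_genes), "test accuracy:", test_acc)
```

Repeating the search with different seeds would give many near-optimal gene sets, and counting how often each gene is selected is one plausible way to rank genes, in the spirit of the 2,000 classifiers mentioned above.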