Abstract
Background: Correctly distinguishing diseased from normal tissue and identifying disease subtypes is a common challenge in clinical practice. The task is especially difficult in complex diseases such as cancer, where patterns are heterogeneous, highly complex interactions among pathways are involved, and continuous multi-level genomic changes occur. Using genomic data analyzed with unsupervised and supervised machine learning tools, we demonstrate the ability to describe a patient's tumor quantitatively and accurately.
Design: We used Level III RNASeq gene expression data for 20531 genes in 889 renal cell carcinoma (RCC) tumor samples across three subtypes, clear cell renal cell carcinoma (CCRCC), papillary renal cell carcinoma (PRCC), and chromophobe carcinoma (ChRCC), together with 129 normal samples, all from The Cancer Genome Atlas (TCGA). We developed a computational framework for feature (gene/transcript) selection and construction of subtype-predictive models. This framework relies on the well-known “random forest” (RF) method with iterative feature selection and 10-fold cross-validation. We performed a series of experiments: (1) tumor versus normal tissue for each subtype; (2) pairwise subtype comparisons; and (3) a comparison of all three subtypes with identification of predictive genes.
Results: In each computational run, our method selected the 2054 (10%) top varying genes and estimated the predictive power of each selected gene using RF. On average, the method achieved 93-97% accuracy. We identified genes known to play a role in renal cell carcinoma, for example CA9, LOX, SFRP1, SLC4A1, CDKN2A, KISS1R, and EGF, among others. In addition, our analysis uncovered genes not previously associated with renal cell carcinoma that may represent characteristic patterns for subtyping and for differentiation from normal renal tissue, for example TCF21, IRX1, STC2, UMOD, AQP2, ANGPTL4, BSND, and FABP7.
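The kind of pipeline described above (keep the top 10% most variable genes, train a tree ensemble, score it with 10-fold cross-validation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: "top varying" is assumed to mean highest per-gene variance, a bagged ensemble of one-split trees stands in for a full random forest, the iterative feature selection step is not reproduced, and every function name and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import pvariance


def top_varying_genes(X, fraction=0.10):
    """Column indices of the `fraction` of genes with the highest variance."""
    n_genes = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_genes)]
    k = max(1, math.ceil(fraction * n_genes))  # rounding up is an assumption
    ranked = sorted(range(n_genes), key=lambda j: variances[j], reverse=True)
    return ranked[:k]


def majority(labels, default=None):
    """Most common label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y, genes, rng):
    """One-split tree fit on a bootstrap sample, trying a random subset of genes."""
    boot = [rng.randrange(len(X)) for _ in range(len(X))]
    fallback = majority([y[i] for i in boot])
    best = None
    for g in rng.sample(genes, max(1, math.isqrt(len(genes)))):
        threshold = sum(X[i][g] for i in boot) / len(boot)  # split at the mean
        left = majority([y[i] for i in boot if X[i][g] < threshold], fallback)
        right = majority([y[i] for i in boot if X[i][g] >= threshold], fallback)
        errors = sum((left if X[i][g] < threshold else right) != y[i] for i in boot)
        if best is None or errors < best[0]:
            best = (errors, g, threshold, left, right)
    return best[1:]  # (gene index, threshold, left label, right label)


def forest_predict(stumps, row):
    """Majority vote over all one-split trees."""
    return majority([left if row[g] < t else right for g, t, left, right in stumps])


def cross_validate(X, y, n_folds=10, n_trees=50, seed=0):
    """k-fold CV; genes are re-selected on each training split to avoid leakage."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    accuracies = []
    for f in range(n_folds):
        test_idx = order[f::n_folds]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        genes = top_varying_genes(X_train)
        stumps = [fit_stump(X_train, y_train, genes, rng) for _ in range(n_trees)]
        hits = sum(forest_predict(stumps, X[i]) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies


def make_toy_data(n_per_class=30, n_genes=50, seed=1):
    """Synthetic expression matrix: only the first five genes differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for label, shift in (("normal", 0.0), ("tumor", 8.0)):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for j in range(5):
                row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    scores = cross_validate(X, y)
    print(f"mean 10-fold accuracy: {sum(scores) / len(scores):.2f}")
```

With real data, `X` would be a samples-by-genes expression matrix and `y` the tumor/normal or subtype labels. Rounding 10% of 20531 genes up gives 2054, which matches the count reported above, although the abstract does not state the rounding rule. Counting how often each gene is chosen as a split would give a crude analogue of the per-gene predictive power mentioned in the Results; a production analysis would use a full random forest implementation instead of this toy ensemble.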
In three separate experiments, we differentiated each of the three subtypes from normal tissue and performed enrichment analysis on the most significant genes in each case. We observed that both CCRCC and PRCC have genes involved in the “glycosaminoglycan biosynthesis - heparan sulfate” (HS6ST2, HS6ST3) and “riboflavin metabolism” (ACPP) pathways. In contrast, ChRCC is more strongly associated with the “glycosphingolipid biosynthesis - lacto and neolacto series” pathway (B3GNT3, FUT6) and has five genes (B3GNT3, CYP2B6, CYP2J2, FUT6, UGT2A3) involved in other metabolic pathways. Changes in glycosaminoglycans and glycolipids have previously been reported in association with RCC.
Conclusions: We demonstrate the effectiveness of a computational framework and the predictive power of gene expression data for tumor subtyping in RCC. Our framework is generic and can be applied in combination with other types of data, such as other modalities of genomic data (copy number variations, methylation) and clinical data.
Citation Format: Konstantin Volyanskyy, Yong Mao, Yee Him Cheung, Balaji Santhanam, Vlado Menkovski, Zharko Aleksovski, Minghao Zhong, John T. Fallon, Nevenka Dimitrova. Normal versus tumor and subtype prediction in renal cell carcinoma TCGA data sets. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 3651.