Abstract

Background: RNA-seq data from tumor samples can be used to identify novel cancer subtypes using cluster analysis. The number of features is often large compared to the number of samples, and different clusters can appear in different subsets of the feature space. Feature selection techniques are therefore commonly used to reduce the dimension and remove redundant and irrelevant features before performing cluster analysis. Many feature selection methods have been proposed in the literature, but it is unclear how the ability to identify novel subtypes of cancer using cluster analysis is affected by the choice of feature selection method.
Method: We evaluated 13 feature selection methods on four publicly available cancer data sets from The Cancer Genome Atlas. RNA-seq data and associated clinical data were retrieved from the Broad Institute GDAC Firehose. Overlap and characterization of the highest ranked features (i.e. genes) were studied for the top 100, 1000 and 3000 genes. Performance was measured by comparing known cancer subtypes to partitions obtained using hierarchical clustering with Euclidean distance and Ward's linkage, based on the selected features. The results were compared to both a random selection (negative control) and a supervised approach (positive control) using the adjusted Rand index.
Results: The relative performance of the feature selection methods varied heavily depending on the number of included genes and the data set used. Based on all data, gene selection using the Dip-test statistic and the Bimodality index generated the overall highest clustering performance. The overlap of selected genes between the Bimodality index, the Dip-test and the supervised approach was relatively low. The Dip-test and the Bimodality index tended to favor genes with relatively low expression values.
Conclusions: The choice of feature selection method can have a huge impact on the ability to identify cancer subtypes using cluster analysis, and the relative performance is highly dependent on the data. Low overlap of selected genes between the highest ranked methods suggests that the methods identify genes that contain complementary information, and that it might be beneficial to combine two or more feature selection methods.
Citation Format: Linda Vidman, David Källberg, Patrik Rydén. Evaluation of feature selection methods used for cluster analysis in identification of novel cancer subtypes [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 5475.
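
The evaluation loop described in the Method paragraph (select genes, cluster with Euclidean distance and Ward's linkage, score against known subtypes with the adjusted Rand index, and compare to a random-selection negative control) can be illustrated with a minimal sketch. This is not the authors' code. The gene selector used here, ranking by variance, is a stand-in chosen for illustration and is not claimed to be one of the 13 evaluated methods; the synthetic data, the function names and all parameter values are likewise assumptions, and the supervised positive control is omitted.

```python
import numpy as np
from math import comb
from scipy.cluster.hierarchy import linkage, fcluster


def make_toy_data(n_per_subtype=30, n_genes=200, n_informative=20, seed=0):
    """Synthetic samples-by-genes matrix with two subtypes (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_subtype, n_genes))
    subtypes = np.repeat([0, 1], n_per_subtype)
    # Only the first n_informative genes differ between the two subtypes.
    X[subtypes == 1, :n_informative] += 5.0
    return X, subtypes


def select_top_variance(X, n_genes):
    """Stand-in selector: column indices of the n_genes highest-variance genes."""
    return np.argsort(X.var(axis=0))[::-1][:n_genes]


def ward_partition(X, n_clusters):
    """Hierarchical clustering with Euclidean distance and Ward's linkage."""
    Z = linkage(X, method="ward", metric="euclidean")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)  # contingency table of label pairs
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(labels_a), 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def evaluate(X, subtypes, n_genes, n_clusters, seed=0):
    """Score the stand-in selector and a random gene subset (negative control)."""
    rng = np.random.default_rng(seed)
    chosen = select_top_variance(X, n_genes)
    random_genes = rng.choice(X.shape[1], size=n_genes, replace=False)
    return {
        "selected": adjusted_rand_index(
            subtypes, ward_partition(X[:, chosen], n_clusters)),
        "random": adjusted_rand_index(
            subtypes, ward_partition(X[:, random_genes], n_clusters)),
    }


if __name__ == "__main__":
    X, subtypes = make_toy_data()
    print(evaluate(X, subtypes, n_genes=20, n_clusters=2))
```

Swapping `select_top_variance` for another ranking function with the same signature would let the same loop compare several selectors at different gene counts, which is the kind of comparison the abstract reports.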
