Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes.

David Källberg, Linda Vidman, Patrik Rydén

doi:10.3389/fgene.2021.632620

Open Access


Journal: Frontiers in Genetics
Publication Date: Feb 24, 2021
Citations: 11
License type: CC BY 4.0

Affiliation: Umeå University

Abstract

Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods is still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by considering the mean expression and standard deviation (SD) of the selected genes, the overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control, and for two data sets the gain was substantial, with ARI increasing from (−0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.
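To make the evaluation measure concrete, the following is a minimal sketch of the adjusted Rand index computed with the standard pair-counting formula. It is an illustration only, not the authors' implementation: the function name `adjusted_rand_index` and the toy label vectors are assumptions introduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples.

    Hypothetical helper for illustration; it does not reproduce the
    study's code, only the usual pair-counting definition of the ARI.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two samples are required")

    # Contingency counts: how many samples share a gold-standard label
    # and a cluster label at the same time.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together in both, or in each, partition.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    # Expected agreement under random labelling, and the maximum index.
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions consist of a single group.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (assumed data): the clustering matches the gold standard
# up to a renaming of the cluster labels.
gold = [0, 0, 1, 1]
clusters = [1, 1, 0, 0]
print(adjusted_rand_index(gold, clusters))
```

The index is 1 when the two partitions agree perfectly, close to 0 when the agreement is what random labelling would give, and can be negative when the agreement is worse than that, which is consistent with the value of −0.01 reported for one negative control.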

Highlights

The human genome consists of around 21,000 protein-coding genes (Pertea et al., 2018).
A cluster analysis aimed at detecting novel disease subtypes should only utilize genes that are informative for the task, i.e., genes whose expression is mainly governed by which disease subtype the patient has.
We tested the performance of 13 feature selection methods when identifying subgroups using cluster analysis on four human cancer data sets.

Summary

Introduction

The human genome consists of around 21,000 protein-coding genes (Pertea et al., 2018). It is of interest to apply some sort of gene selection procedure prior to the cluster analysis. This task would be relatively easy if it were known which subtypes (i.e., labels) the patients have, but for unsupervised classification problems the labels are unknown, which makes gene selection a true challenge. If instead it is assumed that informative genes are likely to be expressed at a relatively high level, it makes sense to select highly expressed genes. Another class of measures is based on quantifying the extent to which the gene expression distribution can be described by two or more relatively distinct peaks, or modes, which represent different subtypes. We assume that the tumor samples can be divided into two subtypes. Given that this assumption is true, the gene expression of an informative gene may have a bimodal distribution. The dip-test suggested by Hartigan and Hartigan (1985) addresses this problem.
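The idea of ranking genes by a per-gene score and keeping only the top ones can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: the function `select_top_genes`, the dictionary layout, and the toy expression values are all hypothetical, and only two simple scores (mean expression and SD) are shown.

```python
from statistics import mean, stdev


def select_top_genes(expression, k, score="mean"):
    """Return the names of the k genes with the highest score.

    expression: dict mapping a gene name to its expression values
                across samples (assumed layout, for illustration only).
    score: "mean" ranks genes by mean expression,
           "sd" ranks genes by sample standard deviation.
    """
    scorers = {"mean": mean, "sd": stdev}
    if score not in scorers:
        raise ValueError("score must be 'mean' or 'sd'")
    scorer = scorers[score]
    # Sort gene names from the highest score to the lowest.
    ranked = sorted(expression, key=lambda gene: scorer(expression[gene]),
                    reverse=True)
    return ranked[:k]


# Toy data (assumed values): geneA is highly expressed but nearly constant,
# geneB switches between a low and a high level across samples.
toy = {
    "geneA": [9.0, 9.2, 8.8, 9.1],
    "geneB": [1.0, 6.0, 1.2, 5.8],
    "geneC": [0.1, 0.2, 0.1, 0.2],
}
print(select_top_genes(toy, 1, score="mean"))
print(select_top_genes(toy, 1, score="sd"))
```

In this toy example the two scores pick different genes, which illustrates why the choice of selection criterion matters. A bimodality-based score such as the dip statistic would replace the scoring function and is not implemented here.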
