Abstract

Genomic data pose serious challenges for analysis: tens of thousands of dimensions, few examples, and unbalanced classes. To tackle these challenges, this paper proposes an unsupervised feature selection technique based on standard deviation and cosine similarity, referred to as SCFS (Standard deviation and Cosine similarity based Feature Selection). SCFS defines the discernibility of a feature to measure its ability to distinguish between classes, and its independence to measure its redundancy with respect to other features. All features are represented in a 2-dimensional space with discernibility as the x-axis and independence as the y-axis, so that features in the upper right corner have both comparatively high discernibility and high independence. The importance of a feature is defined as the product of its discernibility and its independence, that is, the area of the rectangle enclosed by the feature's coordinate lines and the axes. The upper-right-corner features are by far the most important and comprise the optimal feature subset. From different definitions of independence using cosine similarity, three feature selection algorithms are derived from SCFS: SCEFS (Standard deviation and Exponent Cosine similarity based Feature Selection), SCRFS (Standard deviation and Reciprocal Cosine similarity based Feature Selection) and SCAFS (Standard deviation and Anti-Cosine similarity based Feature Selection). KNN and SVM classifiers are built on the optimal feature subsets detected by each of these algorithms. Experimental results on 18 genomic cancer datasets demonstrate that the proposed unsupervised feature selection algorithms SCEFS, SCRFS and SCAFS detect stable biomarkers with strong classification capability, showing that the idea proposed in this paper is powerful.
Functional analysis of these biomarkers shows that the occurrence of cancer is closely related to the regulation level of the biomarker genes. This finding will benefit cancer pathology research, drug development, early diagnosis, treatment and prevention.

Highlights

  • The rapid development of high-throughput sequencing technology has produced a large amount of genomic data related to protein, gene and life metabolism

  • Finding a stable algorithm with good generalization for analyzing this kind of genomic data is very difficult; to tackle this task, the paper addresses feature selection for genomic data analysis in an unsupervised learning scenario and proposes a technique based on the standard deviation and the cosine similarity of variables

  • The power of the unsupervised feature selection algorithms SCEFS, SCRFS, and SCAFS is tested on high-dimensional gene expression datasets of cancers



INTRODUCTION

The rapid development of high-throughput sequencing technology has produced a large amount of genomic data related to proteins, genes and life metabolism. He et al. (2017) proposed the unsupervised feature selection algorithm DGFS (Decision graph-based feature selection), which defines the local density and the discriminant distance of a feature, and a decision score to evaluate it. To tackle this challenging task, this paper focuses on the feature selection problem for genomic data analysis in an unsupervised learning scenario, and proposes an unsupervised feature selection technique based on the standard deviation and the cosine similarity of variables. Features with higher scores have strong discernibility and low redundancy; these features comprise the selected subset, which coincides with the original goal of feature selection (Fu et al., 1970; Ding and Peng, 2005; Peng et al., 2005).
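The scoring idea described above, importance as the product of a feature's discernibility (standard deviation) and its independence (a decreasing transform of its cosine similarity to other features), can be sketched as follows. This is a minimal illustration only: the exact exponent, reciprocal and anti-cosine transforms, and the use of the mean absolute cosine similarity, are assumptions made for this sketch, not the paper's published formulas, and the function names are hypothetical.

```python
import numpy as np

def scfs_scores(X, variant="anti"):
    """Sketch of an SCFS-style feature score: discernibility * independence.

    X: (n_samples, n_features) data matrix.
    variant: "exp", "recip" or "anti" (assumed forms of independence).
    """
    # Discernibility: per-feature standard deviation.
    disc = X.std(axis=0)

    # Pairwise |cosine similarity| between feature columns.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0            # guard against zero columns
    Xn = X / norms
    cos = np.abs(Xn.T @ Xn)            # values in [0, 1]
    np.fill_diagonal(cos, 0.0)         # ignore self-similarity

    # Independence: a decreasing transform of mean similarity to others.
    mean_cos = cos.mean(axis=1)
    if variant == "exp":               # exponent cosine (assumed form)
        indep = np.exp(-mean_cos)
    elif variant == "recip":           # reciprocal cosine (assumed form)
        indep = 1.0 / (1.0 + mean_cos)
    else:                              # anti-cosine, i.e. arccos (assumed form)
        indep = np.arccos(np.clip(mean_cos, 0.0, 1.0))

    return disc * indep

def select_top_k(X, k, variant="anti"):
    """Return indices of the k highest-scoring (upper-right-corner) features."""
    return np.argsort(scfs_scores(X, variant))[::-1][:k]
```

A constant feature gets zero standard deviation and thus zero importance, and a feature highly similar to the rest is down-weighted by the independence term, matching the intuition that the best features sit in the upper right corner of the discernibility-independence plane.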

