Abstract

Background: Many variable selection techniques have been proposed for the clustering of gene expression data. While these methods tend to filter out irrelevant genes and identify informative genes that contribute to a clustering solution, they are based on criteria that do not consider the potential interactive influence among individual genes. Motivated by ensemble clustering, there is a strong interest in leveraging the structure of gene networks for gene selection, so that the relationship information between genes can be effectively utilized, while the selected genes are expected to preserve all the possible clustering structures in the data.

Results: We present a new filter method that uses gene connectivity in the gene co-expression network as the evaluation criterion for variable selection. Gene connectivity measures the importance of a gene in terms of its expression similarity with the other genes in the co-expression network. Hard threshold and soft threshold transformations are employed to construct the gene co-expression networks. Both simulation studies and real data analysis have shown that the network based on soft thresholding is more effective in selecting relevant variables and provides better clustering results than the hard thresholding transformation and two other canonical filter methods for variable selection. Furthermore, a new module analysis approach is proposed to reveal the higher-order organization of the gene space, where the genes of a module share significant topological similarity and are associated with a consensus partition of the sample space. We demonstrate that the identified modules can lead to biologically meaningful sample partitions that might be missed by other methods.

Conclusions: By leveraging the structure of the gene co-expression network, we first propose a variable selection method that selects individual genes with top connectivity. Both simulation studies and real data application have demonstrated that our method has better performance in terms of the reliability of the selected genes and sample clustering results. In addition, we propose a module recovery method that can help discover novel sample partitions that might be hidden when performing clustering analyses using all available genes. The source code of our program is available at http://nba.uth.tmc.edu/homepage/liu/netVar/.
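To make the idea of connectivity-based selection concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the similarity between two genes is the absolute Pearson correlation, that the hard threshold keeps an edge when the similarity is at least τ, and that the soft threshold uses the power adjacency s^β common in weighted co-expression analysis. The abstract does not state these formulas, and the function names, the parameter values, and the synthetic data are all hypothetical.

```python
import numpy as np

def coexpression_similarity(X):
    """Absolute Pearson correlation between genes (X is samples x genes)."""
    S = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(S, 0.0)  # ignore self-similarity
    return S

def hard_connectivity(S, tau):
    """Unweighted network: an edge exists when similarity >= tau (assumed rule)."""
    A = (S >= tau).astype(float)
    return A.sum(axis=1)  # number of neighbours per gene

def soft_connectivity(S, beta):
    """Weighted network: adjacency is similarity raised to beta (assumed form)."""
    A = S ** beta
    return A.sum(axis=1)  # sum of edge weights per gene

def select_top_genes(connectivity, fraction):
    """Indices of the given fraction of genes with the highest connectivity."""
    k = max(1, int(round(fraction * connectivity.size)))
    return np.argsort(connectivity)[::-1][:k]

# Synthetic example: 60 samples, 50 genes, the first 10 share a latent signal.
rng = np.random.default_rng(0)
n_samples, n_genes, n_informative = 60, 50, 10
latent = rng.normal(size=(n_samples, 1))
X = rng.normal(size=(n_samples, n_genes))
X[:, :n_informative] = latent + 0.5 * X[:, :n_informative]

S = coexpression_similarity(X)
selected_hard = select_top_genes(hard_connectivity(S, tau=0.5), fraction=0.2)
selected_soft = select_top_genes(soft_connectivity(S, beta=6), fraction=0.2)
```

In this toy setting both transformations rank the correlated genes first because the signal is strong. The difference reported in the abstract would appear in harder cases, where a hard cutoff discards weak but consistent similarities that the soft transformation retains as small edge weights.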

Highlights

  • Many variable selection techniques have been proposed for the clustering of gene expression data

  • In the hard thresholding transformation, we considered the effects of two parameters on the performance of variable selection: the hard threshold τ, which determines the number of genes and edges included in the unweighted co-expression network, and the percentage of genes to be selected based on their connectivity in the resulting network, determined by equation (3)

  • It is not surprising to observe that selecting all of the 500 genes in the dataset can only lead to a low F-score (0.15) and a high classification error rate (CER) (0.29), as shown in Figure 1a and b, because too many noninformative genes were included without the variable selection step
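The two evaluation measures mentioned above can be illustrated with a short sketch. The page does not define them, so this assumes the F-score is the harmonic mean of precision and recall of the selected genes against the truly informative ones, and that the CER is the fraction of sample pairs on which two partitions disagree about co-membership (a common definition in the clustering literature). The function names are hypothetical.

```python
from itertools import combinations

def selection_f_score(selected, informative):
    """Harmonic mean of precision and recall for a set of selected genes."""
    selected, informative = set(selected), set(informative)
    true_positives = len(selected & informative)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(informative)
    return 2 * precision * recall / (precision + recall)

def classification_error_rate(labels_a, labels_b):
    """Fraction of sample pairs grouped together in one partition but not the other."""
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 0.0  # fewer than two samples: nothing to compare
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)
```

Under these definitions, selecting every gene gives perfect recall but low precision, which is why including many noninformative genes drives the F-score down.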

Summary

Introduction

Many variable selection techniques have been proposed for the clustering of gene expression data. While these methods tend to filter out irrelevant genes and identify informative genes that contribute to a clustering solution, they are based on criteria that do not consider the potential interactive influence among individual genes. Variable selection algorithms can be categorized as either wrappers or filters. In the context of clustering, the wrapper approach searches for the variables best suited to a specific clustering algorithm, aiming to improve clustering performance [1,2]. Filter-based algorithms, by contrast, do not involve a clustering algorithm in the evaluation of variable subsets. The filter approach is considered faster and more efficient than the wrapper method in high-dimensional data analysis.
