Abstract

Selection of informative genes is an important problem in gene expression studies. The small sample size and the large number of genes in gene expression data make the selection process complex. Further, the selected informative genes may act as a vital input for gene co-expression network analysis. Moreover, the identification of hub genes and module interactions in gene co-expression networks is yet to be fully explored. This paper presents a statistically sound gene selection technique based on the support vector machine algorithm for selecting informative genes from high-dimensional gene expression data. A statistical approach has also been developed for the identification of hub genes in the gene co-expression network. In addition, a differential hub gene analysis approach has been developed to group the identified hub genes based on their gene connectivity in a case vs. control study. Based on this proposed approach, an R package, dhga (https://cran.r-project.org/web/packages/dhga), has been developed. The comparative performance of the proposed gene selection technique and of the hub gene identification approach was evaluated on three different crop microarray datasets. The proposed gene selection technique outperformed most of the existing techniques in selecting a robust set of informative genes. The proposed hub gene identification approach identified fewer hub genes than the existing approach, which is in accordance with the scale-free property of real networks. In this study, some key genes along with their Arabidopsis orthologs have been reported, which can be used for Aluminum toxic stress response engineering in soybean. The functional analysis of various selected key genes revealed the underlying molecular mechanisms of Aluminum toxic stress response in soybean.

Highlights

  • With the advent of faster and cheaper genome sequencing technologies, huge volumes of genomic data are being generated and deposited in public domain databases by research organizations across the globe [1, 2]

  • To study the performance of the proposed Boot-Support Vector Machine-Recursive Feature Elimination (Boot-SVM-RFE) technique, the top 1000 ranked genes from each gene selection technique were used to classify crop microarray samples into control and stress classes using an SVM classifier

  • The performance of the proposed Boot-SVM-RFE is consistently better than that of other contemporary techniques across different datasets related to abiotic stresses


Summary

Introduction

With the advent of faster and cheaper genome sequencing technologies, huge volumes of genomic data are being generated and deposited in public domain databases by research organizations across the globe [1, 2]. Most of these datasets relate to the expression of genes in experiments conducted to understand the behavior of biological mechanisms of species under biotic and abiotic stresses. T-score, F-score, Information Gain (IG) measure, Random Forest (RF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) [7, 9, 10, 11, 12, 13] have been used for gene selection. In these methods, genes are selected by considering only their relevance to the classes. There is a possibility that genes which are spuriously associated with the classes may get selected.
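To make the idea of relevance-only ranking concrete, the following is a minimal Python sketch of a t-score filter, one of the existing techniques named above. It is not the proposed Boot-SVM-RFE method. The function names (`t_score`, `rank_genes`), the gene identifiers and the toy expression values are hypothetical and chosen purely for illustration, and the use of Welch's unequal-variance form of the t-statistic is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def t_score(control, stress):
    """Welch t-statistic for one gene (assumed form; unequal variances)."""
    n1, n2 = len(control), len(stress)
    se = sqrt(variance(control) / n1 + variance(stress) / n2)
    if se == 0:
        # No variation in either class: treat the gene as uninformative.
        return 0.0
    return (mean(stress) - mean(control)) / se


def rank_genes(expr, labels, top_k):
    """Rank genes by |t| and return the names of the top_k genes.

    expr   : dict mapping gene name -> list of expression values (one per sample)
    labels : list of class labels per sample (0 = control, 1 = stress)
    """
    scores = {}
    for gene, values in expr.items():
        control = [v for v, y in zip(values, labels) if y == 0]
        stress = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(t_score(control, stress))
    # Each gene is scored independently; only its relevance to the class matters.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three genes, three control and three stress samples.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],  # clear shift between classes
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no shift between classes
    "geneC": [5.0, 4.0, 6.0, 6.5, 5.5, 7.0],  # weak, noisy shift
}
top_genes = rank_genes(expr, labels, top_k=2)
print(top_genes)
```

Because each gene is scored on a single small sample, a gene can rank highly by chance, which is the spurious-association risk described above.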

