Abstract

Gene set analysis is usually carried out using Gene Ontology (GO) terms and known biological pathways. These approaches may not establish any formal relation between genotype and trait-specific phenotype. In plant biology and breeding, analysing gene sets together with trait-specific Quantitative Trait Loci (QTL) data is considered a rich source for biological knowledge discovery. We therefore propose an innovative statistical approach, Gene Set Analysis with QTLs (GSAQ), for interpreting gene expression data in the context of trait-associated gene sets. The utility of GSAQ was studied on five complex abiotic and biotic stress scenarios in rice, which yielded gene sets enriched for specific traits/stresses. Further, the GSAQ approach was more innovative and effective than the existing approach at performing gene set analysis with underlying QTLs and at identifying QTL candidate genes. The GSAQ approach also provides two biologically relevant criteria for analysing the performance of gene selection methods. Based on the proposed approach, an R package, GSAQ (https://cran.r-project.org/web/packages/GSAQ), has been developed. The GSAQ approach provides a valuable platform for integrating gene expression data with genetically rich QTL data.

Highlights

  • The focus of Gene Expression (GE) data analysis has shifted from the single-gene level to the gene-set level, because a gene does not work alone; rather, it works within an intricate network of genes[7]

  • The value of the number of QTL hits (NQhits) statistic depends on factors such as the length of the Quantitative Trait Loci (QTL) and the number of genes in a gene set that are linked to QTLs for a given stress

  • For larger gene sets, performance based on the NQhits statistic is better for t-score, F-score, Maximum Relevance Minimum Redundancy (MRMR), Symmetrical Uncertainty (SU), Pearson’s Correlation Filter (PCF), Spearman’s Rank Correlation (SRC) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) than for Information Gain (IG), Gain Ratio (GR) and Random Forest (RF)
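The NQhits statistic in the highlights above can be illustrated with a minimal Python sketch. It assumes that a "QTL hit" means a gene whose genomic interval overlaps at least one QTL interval on the same chromosome; the exact definition used by GSAQ may differ. The names `Interval`, `overlaps` and `count_qtl_hits`, and all coordinates, are hypothetical and are not part of the GSAQ R package.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval; coordinates are base pairs, both ends inclusive."""
    chrom: str
    start: int
    end: int


def overlaps(gene: Interval, qtl: Interval) -> bool:
    """True when gene and QTL share at least one base on the same chromosome."""
    return gene.chrom == qtl.chrom and gene.start <= qtl.end and qtl.start <= gene.end


def count_qtl_hits(gene_set: dict[str, Interval], qtls: list[Interval]) -> int:
    """Count genes in the set that overlap at least one QTL (assumed NQhits definition)."""
    return sum(
        1
        for gene in gene_set.values()
        if any(overlaps(gene, qtl) for qtl in qtls)
    )


# Hypothetical coordinates, invented for illustration only.
genes = {
    "geneA": Interval("chr1", 1_000, 2_000),
    "geneB": Interval("chr1", 50_000, 51_000),
    "geneC": Interval("chr2", 1_500, 2_500),
}
qtls = [Interval("chr1", 1_500, 10_000), Interval("chr2", 100, 1_000)]

nq_hits = count_qtl_hits(genes, qtls)
print(nq_hits)
```

Under this assumed definition, widening a QTL interval can only keep or increase the count, which is consistent with the stated dependence of NQhits on QTL length.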


Summary

Introduction

The focus of Gene Expression (GE) data analysis has shifted from the single-gene level to the gene-set level, because a gene does not work alone; rather, it works within an intricate network of genes[7]. Enrichment analysis of gene sets is well developed in human disease genetics, where GO terms and known biological pathways are taken into account[16,17,18]. These approaches may not be useful for establishing any formal relation between genotype and trait-specific phenotype in plants. We illustrated the application of the developed GSAQ approach as a source of biologically relevant criteria for evaluating the performance of gene selection methods on high-dimensional GE data. For this purpose, we used ten gene selection methods, viz. Support Vector Machine-Recursive Feature Elimination (SVM-RFE), t-score, F-score, Maximum Relevance Minimum Redundancy (MRMR) technique, Random Forest (RF), Information Gain (IG), Gain Ratio (GR), Symmetrical Uncertainty (SU), Pearson’s Correlation Filter (PCF) and Spearman’s Rank Correlation (SRC)[19,20,21,22,23,24,25,26,27,28]. The GSAQ approach provided two biologically relevant criteria for evaluating the performance of gene selection methods on GE data.
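To make the idea of a gene selection method concrete, the following Python sketch ranks genes by a t-score and keeps the top k. It assumes a two-class design (control versus stress) and uses Welch's t-statistic; the paper's t-score method may be defined differently. The function names and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control: list[float], stress: list[float]) -> float:
    """Welch's t-statistic for one gene (does not assume equal variances).

    Raises ZeroDivisionError if both groups have zero variance.
    """
    se = sqrt(variance(control) / len(control) + variance(stress) / len(stress))
    return (mean(stress) - mean(control)) / se


def select_top_genes(
    expr: dict[str, tuple[list[float], list[float]]], k: int
) -> list[str]:
    """Rank genes by absolute t-score and return the k highest-scoring names."""
    scores = {gene: abs(welch_t(ctrl, strs)) for gene, (ctrl, strs) in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical expression values: (control replicates, stress replicates).
expression = {
    "geneA": ([5.0, 5.2, 4.9], [8.1, 7.9, 8.3]),
    "geneB": ([6.0, 6.1, 5.9], [6.0, 6.2, 5.8]),
    "geneC": ([3.0, 3.5, 2.5], [4.0, 4.6, 3.4]),
}

selected = select_top_genes(expression, k=2)
print(selected)
```

A gene set selected this way could then be scored against QTL data, which is the kind of biologically relevant evaluation that the GSAQ approach is described as providing.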
