Abstract
BackgroundDespite advances in the gene-set enrichment analysis methods; inadequate definitions of gene-sets cause a major limitation in the discovery of novel biological processes from the transcriptomic datasets. Typically, gene-sets are obtained from publicly available pathway databases, which contain generalized definitions frequently derived by manual curation. Recently unsupervised clustering algorithms have been proposed to identify gene-sets from transcriptomics datasets deposited in public domain. These data-driven definitions of the gene-sets can be context-specific revealing novel biological mechanisms. However, the previously proposed algorithms for identification of data-driven gene-sets are based on hard clustering which do not allow overlap across clusters, a characteristic that is predominantly observed across biological pathways.ResultsWe developed a pipeline using fuzzy-C-means (FCM) soft clustering approach to identify gene-sets which recapitulates topological characteristics of biological pathways. Specifically, we apply our pipeline to derive gene-sets from transcriptomic data measuring response of monocyte derived dendritic cells and A549 epithelial cells to influenza infections. Our approach apply Ward’s method for the selection of initial conditions, optimize parameters of FCM algorithm for human cell-specific transcriptomic data and identify robust gene-sets along with versatile viral responsive genes.ConclusionWe validate our gene-sets and demonstrate that by identifying genes associated with multiple gene-sets, FCM clustering algorithm significantly improves interpretation of transcriptomic data facilitating investigation of novel biological processes by leveraging on transcriptomic data available in the public domain. We develop an interactive ‘Fuzzy Inference of Gene-sets (FIGS)’ package (GitHub: https://github.com/Thakar-Lab/FIGS) to facilitate use of of pipeline. Future extension of FIGS across different immune cell-types will improve mechanistic investigation followed by high-throughput omics studies.
Highlights
Despite advances in the gene-set enrichment analysis methods; inadequate definitions of gene-sets cause a major limitation in the discovery of novel biological processes from the transcriptomic datasets
Despite advances in the methods for gene-set enrichment analysis [2, 6,7,8]; inadequate definitions of gene-sets cause a major limitation in the discovery of novel biological processes
Increasing use of high-throughput assays in the biomedical field allows identification of context-specific set of functionally related genes, which can be loosely defined to include genes regulated by a same set of transcription factors or sets of genes involved in same pathways
Summary
Despite advances in the gene-set enrichment analysis methods; inadequate definitions of gene-sets cause a major limitation in the discovery of novel biological processes from the transcriptomic datasets. Unsupervised clustering algorithms have been proposed to identify gene-sets from transcriptomics datasets deposited in public domain. These data-driven definitions of the gene-sets can be context-specific revealing novel biological mechanisms. Recent advances have led to development of data-driven approaches to identify gene-sets [9,10,11,12,13] These are powerful approaches that expand search for biological mechanisms based on datasets in public domain leading path towards discovery
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.