Abstract
MotivationDetecting differentially expressed (DE) genes between disease and normal control group is one of the most common analyses in genome-wide transcriptomic data. Since most studies don’t have a lot of samples, researchers have used meta-analysis to group different datasets for the same disease. Even then, in many cases the statistical power is still not enough. Taking into account the fact that many diseases share the same disease genes, it is desirable to design a statistical framework that can identify diseases’ common and specific DE genes simultaneously to improve the identification power.ResultsWe developed a novel empirical Bayes based mixture model to identify DE genes in specific study by leveraging the shared information across multiple different disease expression data sets. The effectiveness of joint analysis was demonstrated through comprehensive simulation studies and two real data applications. The simulation results showed that our method consistently outperformed single data set analysis and two other meta-analysis methods in identification power. In real data analysis, overall our method demonstrated better identification power in detecting DE genes and prioritized more disease related genes and disease related pathways than single data set analysis. Over 150% more disease related genes are identified by our method in application to Huntington’s disease. We expect that our method would provide researchers a new way of utilizing available data sets from different diseases when sample size of the focused disease is limited.
Highlights
High-throughput technology like microarray and next-generation sequencing (NGS) allows researchers measure thousands of gene or microRNA expression in one sample simultaneously
In this paper, we present a novel statistical framework which aims at addressing a problem often met by biological researchers: when only a limited number of sample for a specific disease is available, the identification power could be improved by jointly analyzing multiple similar disease data sets because differentially expressed (DE) genes might be shared among similar diseases
By implementing a two-component mixture model, we demonstrate the framework could improve the identification power through comprehensive simulation studies and two real data applications
Summary
High-throughput technology like microarray and next-generation sequencing (NGS) allows researchers measure thousands of gene or microRNA expression in one sample simultaneously. Qin and Lu BioData Mining (2018) 11:3 clinical diagnosis tools and investigating potential drug targets This approach has been successfully applied in many complex diseases like cancers [10, 13] and diabetes [5, 33]. With the cost of microarray and generation sequencing technique decreasing and stabilization of the experiment protocol, there are over 1,000,000+ samples deposited in public databases such as Gene Expression Ominus (GEO) [6]. With this huge amount of public data available, it is possible for researchers to perform cross disease transcriptomics comparison analysis. The cross disease transcriptomic analysis has provided researchers with new opportunities of understanding of mechanisms of complex disease and discovery of new biomarkers
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.