Combining gene expression, demographic and clinical data in modeling disease: a case study of bipolar disorder and schizophrenia

Jan Struyf, David Page, Seth Dobrin

doi:10.1186/1471-2164-9-531

Abstract

Background: This paper presents a retrospective statistical study of the data set newly released by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia. This data set contains gene expression data as well as limited demographic and clinical data for each subject. Previous studies using statistical classification or machine learning algorithms have focused on gene expression data only. The present paper investigates whether such techniques can benefit from including demographic and clinical data.

Results: We compare six classification algorithms: support vector machines (SVMs), nearest shrunken centroids, decision trees, ensemble of voters, naïve Bayes, and nearest neighbor. SVMs outperform the other algorithms. Using expression data only, they yield an area under the ROC curve of 0.92 for bipolar disorder versus control, and 0.91 for schizophrenia versus control. By including demographic and clinical data, classification performance improves to 0.97 and 0.94, respectively.

Conclusion: This paper demonstrates that SVMs can distinguish bipolar disorder and schizophrenia from normal control with high accuracy. Moreover, it shows that classification performance improves when demographic and clinical data are included. We also found that some variables in this data set, such as alcohol and drug use, are strongly associated with the diseases. These variables may affect gene expression and make it more difficult to identify genes that are directly associated with the diseases. Stratification can correct for such variables, but we show that this reduces the power of the statistical methods.

Highlights

This paper presents a retrospective statistical study of the data set newly released by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia.
We show that bipolar disorder and schizophrenia can each be distinguished from control, based on gene expression alone, significantly better than chance, with areas under the Receiver Operating Characteristic (ROC) curve (AUC) of 0.92 and 0.91, respectively.
Given that variables such as alcohol and drug use may affect gene expression, they may make it more difficult to identify genes that are directly associated with the diseases. (We discuss this point in detail later in the text.) We have investigated whether post-stratification can correct for such variables, but we found that it significantly reduces the predictive accuracy of the statistical methods.

Summary

Introduction

This paper presents a retrospective statistical study of the data set newly released by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia. This data set contains gene expression data as well as limited demographic and clinical data for each subject. The Stanley Neuropathology Consortium [1] recently made a large data set (over 300 samples) publicly available on gene expression in the brains of deceased individuals with bipolar disorder or schizophrenia, as well as controls. Does the addition of the demographic and clinical history data further improve the ability to distinguish bipolar disorder or schizophrenia from control? Is there a significant difference between the abilities of different widely used data analysis algorithms to make these distinctions?
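The results on this page are reported as areas under the ROC curve (AUC). As a minimal sketch of what that number measures, and not the authors' evaluation code, the AUC can be computed as the fraction of case/control pairs in which the classifier scores the case higher, with ties counted as half. The function name `auc` and the scores below are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from itertools import product


def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (case)
    receives a higher score than a randomly chosen negative (control);
    tied pairs count as half a win.
    """
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores, NOT data from the paper.
cases = [0.9, 0.8, 0.7, 0.4]
controls = [0.6, 0.3, 0.2, 0.1]

# 15 of the 16 case/control pairs are ranked correctly.
print(auc(cases, controls))  # 0.9375
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.91 to 0.97 should be read.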

Methods

Results

Conclusion

Journal: BMC Genomics	Publication Date: Jan 1, 2008
Citations: 74	License type: cc-by
