Abstract

BackgroundThis paper presents a retrospective statistical study on the newly-released data set by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia. This data set contains gene expression data as well as limited demographic and clinical data for each subject. Previous studies using statistical classification or machine learning algorithms have focused on gene expression data only. The present paper investigates if such techniques can benefit from including demographic and clinical data.ResultsWe compare six classification algorithms: support vector machines (SVMs), nearest shrunken centroids, decision trees, ensemble of voters, naïve Bayes, and nearest neighbor. SVMs outperform the other algorithms. Using expression data only, they yield an area under the ROC curve of 0.92 for bipolar disorder versus control, and 0.91 for schizophrenia versus control. By including demographic and clinical data, classification performance improves to 0.97 and 0.94 respectively.ConclusionThis paper demonstrates that SVMs can distinguish bipolar disorder and schizophrenia from normal control at a very high rate. Moreover, it shows that classification performance improves by including demographic and clinical data. We also found that some variables in this data set, such as alcohol and drug use, are strongly associated to the diseases. These variables may affect gene expression and make it more difficult to identify genes that are directly associated to the diseases. Stratification can correct for such variables, but we show that this reduces the power of the statistical methods.

Highlights

  • This paper presents a retrospective statistical study on the newly-released data set by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia

  • We show that bipolar disorder and schizophrenia each can be distinguished from control, based on gene expression alone, significantly better than chance – with areas under the Receiver Operating Characteristic (ROC) curve (AUC) of 0.91 and 0.92

  • Given that these variables may affect gene expression, they may make it more difficult to identify genes that are directly associated to the diseases. (We discuss this point in detail later in the text.) We have investigated if post-stratification can correct for such variables, but we found that it significantly reduces the predictive accuracy of the statistical methods

Read more

Summary

Introduction

This paper presents a retrospective statistical study on the newly-released data set by the Stanley Neuropathology Consortium on gene expression in bipolar disorder and schizophrenia. This data set contains gene expression data as well as limited demographic and clinical data for each subject. The Stanley Neuropathology Consortium [1] recently made a large (over 300 sample) data set publicly available on gene expression in the brains of deceased individuals with bipolar disorder or schizophrenia, as well as controls. Does addition of the demographic and clinical history data further improve the ability to distinguish bipolar disorder or schizophrenia from control?. Is there a significant difference between the abilities of different widely-used data analysis algorithms to make these distinctions?

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call