Abstract

A genetic risk score could assist clinical diagnosis for complex diseases with high heritability. Using large-scale genome-wide association (GWA) data, the current study constructed a genetic risk model for bipolar disorder (BPD) with a machine learning approach. The GWA dataset of BPD from the Genetic Association Information Network (GAIN) was used as the training data for model construction, and the Systematic Treatment Enhancement Program (STEP) GWA data were used as the validation dataset. A random forest algorithm was applied to pre-filtered markers, and variable importance indices were assessed. The random forest procedure selected 289 candidate markers with good discriminability; the area under the receiver operating characteristic curve was 0.944 (0.935–0.953) in the training set and 0.702 (0.681–0.723) in the STEP dataset. At a risk-score cutoff of 184, the sensitivity and specificity for BPD were 0.777 and 0.854, respectively. Pathway analyses revealed important biological pathways for the identified genes. In conclusion, the present study identified informative genetic markers that differentiate BPD patients from healthy controls with acceptable discriminability in the validation dataset. In the future, diagnostic classification can be further improved by assessing more comprehensive clinical risk factors and jointly analysing them with genetic data in large samples.
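To make the cutoff-based figures concrete, the following minimal sketch shows how sensitivity and specificity follow from a risk score and a threshold. The scores and labels are made-up illustrative values, not study data, and treating a score at or above the cutoff as a predicted case is an assumption; the abstract does not state how ties at the cutoff are handled.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Return (sensitivity, specificity) for a score-based rule.

    Assumption: a participant is called a case when score >= cutoff.
    labels: 1 = BPD case, 0 = healthy control.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical risk scores, for illustration only.
toy_scores = [170, 180, 185, 190, 200, 175]
toy_labels = [0, 0, 0, 1, 1, 1]
sens, spec = sensitivity_specificity(toy_scores, toy_labels, cutoff=184)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Raising the cutoff trades sensitivity for specificity, which is why a single reported pair such as 0.777 and 0.854 is tied to one specific threshold.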

Highlights

  • Random forest (RF) is an ensemble-based machine learning method that uses multiple classification and regression trees as classifiers

  • The literature lacks a risk score model based on genetic information for the diagnosis of bipolar disorder (BPD)

  • Informative genetic markers selected by machine learning methods have been used to classify outcomes or to predict the risk of developing disease, for example in early detection of prostate cancer[26], treatment response in attention deficit hyperactivity disorder[27], and identification of idiopathic autism spectrum disorder (ASD) patients[28]
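The ensemble idea in the first highlight can be sketched in a few lines. This is a toy illustration, not the authors' pipeline: it uses one-split trees (stumps) instead of full classification and regression trees, and the genotype coding (0/1/2), the function names, and the data are all hypothetical. Each tree is fitted on a bootstrap sample with a random subset of markers, and the forest predicts by majority vote.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature_ids):
    """Fit a one-split tree: the (feature, threshold) with fewest training errors."""
    best = None
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # threshold does not separate anything
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:
        # No usable split: always predict the majority class.
        m = majority(y)
        return (feature_ids[0], float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, t, l_lab, r_lab = stump
    return l_lab if row[j] <= t else r_lab


def fit_forest(X, y, n_trees=25, n_features=2, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random marker subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(n), k=n)          # bootstrap sample of participants
        feats = rng.sample(range(p), n_features)  # random subset of markers
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_stump(s, row) for s in forest])


# Hypothetical genotypes: markers 0 and 1 are informative, marker 2 is noise.
X = [[2, 2, i % 3] for i in range(10)] + [[0, 0, i % 3] for i in range(10)]
y = [1] * 10 + [0] * 10
forest = fit_forest(X, y)
print(predict_forest(forest, [2, 2, 1]), predict_forest(forest, [0, 0, 1]))
```

A production analysis would more likely rely on an established random forest implementation, which also provides the variable importance indices mentioned in the abstract.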


Summary

Result

The accuracy of the RF procedure was evaluated in the GAIN training set created during the forest growing process. Risk scores among all participants in the GAIN dataset ranged from 143.8 to 228.4, with a mean of 175.4 in the controls and 191.3 in the BPD patients. Using the 289 candidate risk predictors, discrimination in the STEP data was acceptable, with an AUROC of 0.702 (95% CI, 0.681–0.723) and good calibration (Hosmer-Lemeshow test, p-value = 0.681). A total of 354 candidate risk markers were identified for the STEP dataset; when the GAIN dataset then served as external validation, decreased but acceptable discrimination was again observed, with an AUROC of 0.732 (95% CI, 0.711–0.754) (Supplementary Table 1). Mapping all candidate risk markers from the two datasets to genes yielded 233 gene regions in total, including 98 genes in the GAIN dataset and 144 genes in the STEP dataset. Important biological pathways were reported, including cation ion channel activity (such as voltage-gated calcium channel activity and complex, regulation of action potential and cation transport), membrane structure (such as plasma membrane, transmembrane receptor activity and establishment of location), neuron function (such as brain development, axon guidance and GABA receptor activity) and cytoskeleton (such as cytoskeletal protein binding and actin filament).
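As a reading aid for the AUROC values above, the sketch below computes the statistic from its pairwise definition: the probability that a randomly chosen case has a higher risk score than a randomly chosen control, with ties counted as one half. The scores are invented for illustration and the function name is hypothetical; this is not the software used in the study.

```python
def auroc(scores, labels):
    """AUROC as the fraction of case-control pairs ranked correctly.

    labels: 1 = case, 0 = control. Ties contribute 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical risk scores, for illustration only.
example_scores = [190, 200, 175, 170, 180, 185]
example_labels = [1, 1, 1, 0, 0, 0]
auc = auroc(example_scores, example_labels)
print(f"AUROC = {auc:.3f}")
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a drop from the training-set value to the validation-set value is the expected signature of some overfitting to the training data.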

