Abstract

Summary: Gene-based supervised machine learning classification models have been widely used to differentiate disease states, predict disease progression and determine effective treatment options. However, many of these classifiers are sensitive to noise and frequently do not replicate in external validation sets. For complex, heterogeneous diseases, these classifiers are further limited by being unable to capture varying combinations of genes that lead to the same phenotype. Pathway-based classification can overcome these challenges by using robust, aggregate features to represent biological mechanisms. In this work, we developed a novel pathway-based approach, PRObabilistic Pathway Score (PROPS), which uses genes to calculate individualized pathway scores for classification. Unlike previous individualized pathway-based classification methods that use gene sets, we incorporate gene interactions using probabilistic graphical models to more accurately represent the underlying biology and achieve better performance. We apply our method to differentiate two similar complex diseases, ulcerative colitis (UC) and Crohn’s disease (CD), which are the two main types of inflammatory bowel disease (IBD). Using five IBD datasets, we compare our method against four gene-based and four alternative pathway-based classifiers in distinguishing CD from UC. We demonstrate superior classification performance and provide biological insight into the top pathways separating CD from UC.

Availability and Implementation: PROPS is available as an R package, which can be downloaded at http://simtk.org/home/props or on Bioconductor.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Advancements in statistical modeling combined with the ease of obtaining and generating gene expression data have led to multiple approaches to build regression and classification models to aid in diagnosis, prognosis, disease prediction, patient stratification and treatment selection

  • For differentiating Crohn’s disease (CD) from ulcerative colitis (UC), we trained on 83 samples with four validation sets containing 16, 24, 12 and 28 samples

  • Our results suggest that sphingolipid metabolism may play different roles in UC versus CD, with UC being more dysregulated than CD

Introduction

Advancements in statistical modeling, combined with the ease of obtaining and generating gene expression data, have led to multiple approaches to build regression and classification models to aid in diagnosis, prognosis, disease prediction, patient stratification and treatment selection. The most common approaches entail using a subset of genes to derive a signature for the phenotypes of interest (Dorman et al, 2016; Huang et al, 2007; Ramaswamy et al, 2003). These gene signatures have been challenging to reproduce, especially in heterogeneous diseases such as cancer and when adequate validation data are lacking (Koscielny, 2010). Pathway-based methods may overcome these challenges: combining genes to produce pathway-based feature scores has been shown to be more robust (Guo et al, 2005) and can result in fewer features, which can reduce overfitting and improve generalizability while maintaining biological interpretability.

