Abstract

Most gene expression datasets generated by individual researchers are too small to fully benefit from unsupervised machine-learning methods. In the case of rare diseases, there may be too few cases available, even when multiple studies are combined. To address this challenge, we utilize transfer learning to extract coordinated expression patterns and use learned patterns to analyze small rare disease datasets. We trained a pathway-level information extractor (PLIER) model on a large public data compendium comprising multiple experiments, tissues, and biological conditions and then transferred the model to small datasets in an approach we call MultiPLIER. Models constructed from the public data compendium included features that aligned well to known biological factors and were more comprehensive than those constructed from individual datasets or conditions. When transferred to rare disease datasets, the models describe biological processes related to disease severity more effectively than models trained only on a given dataset.

Highlights

  • The rapid expansion of the amount of publicly available gene expression data presents opportunities for discovery-driven research into rare diseases with poorly understood etiologies

  • We demonstrate that pathway-level information extractor (PLIER) learns cell-type-specific signatures and reduces technical batch effects when trained on a multidataset microarray compendium of systemic lupus erythematosus (SLE) whole blood (WB)

  • Three datasets from the same complex multisystem disease and distinct tissues are each used as a training set for a PLIER model

Introduction

The rapid expansion of the amount of publicly available gene expression data presents opportunities for discovery-driven research into rare diseases with poorly understood etiologies. Measuring combinations of genes, rather than individual genes, also has technical advantages: it reduces the multiple hypothesis testing burden, can aid feature engineering, and is more likely to yield robust results than analyses of individual gene measurements (Cleary et al, 2017). Approaches that learn such combinations from the data have advantages over two-group gene set-based comparisons because they provide more context for genes, are better fit to the underlying data, and remove the difficulty of identifying the most useful comparisons a priori (Stein-O’Brien et al, 2018). Unsupervised machine-learning (ML) methods, including matrix factorization- and autoencoder-based approaches, have successfully extracted biologically meaningful low-dimensional representations of gene expression data that can distinguish disease types, predict drug response, and identify new pathway regulators (Dincer et al, 2018; Stein-O’Brien et al, 2017b; Tan et al, 2017; Way and Greene, 2017).
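The transfer idea summarized in the abstract (learn gene-level patterns on a large compendium, then reuse them on a small dataset) can be illustrated with a minimal sketch. This is not PLIER itself: a truncated SVD stands in for the factorization, the data are synthetic, and the ridge-regularized projection, the function names `fit_loadings` and `project`, and the `ridge` parameter are all assumptions made for illustration.

```python
import numpy as np

def fit_loadings(Y_train, k):
    """Learn k gene loadings Z (genes x k) from a genes x samples matrix.

    Assumption: truncated SVD is used here as a simple stand-in for PLIER.
    """
    U, _, _ = np.linalg.svd(Y_train, full_matrices=False)
    return U[:, :k]

def project(Z, Y_new, ridge=0.1):
    """Project new samples onto fixed loadings with ridge-regularized
    least squares: B = (Z^T Z + ridge * I)^-1 Z^T Y_new (an assumed form)."""
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + ridge * np.eye(k), Z.T @ Y_new)

# Synthetic data (hypothetical): 200 genes driven by 5 shared latent patterns.
rng = np.random.default_rng(0)
n_genes, k = 200, 5
true_Z = rng.normal(size=(n_genes, k))
compendium = true_Z @ rng.normal(size=(k, 500)) + 0.1 * rng.normal(size=(n_genes, 500))
small = true_Z @ rng.normal(size=(k, 12)) + 0.1 * rng.normal(size=(n_genes, 12))

# Center each gene using compendium statistics, then learn and transfer.
gene_means = compendium.mean(axis=1, keepdims=True)
Z = fit_loadings(compendium - gene_means, k)   # learned on the large compendium
B_small = project(Z, small - gene_means)       # latent values for the 12 small-dataset samples

print(Z.shape, B_small.shape)
```

The point of the design is that the 12-sample dataset never has to support fitting a factorization on its own; it only needs enough signal to be expressed in terms of patterns learned elsewhere.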
