Abstract

Cancer is one of the leading cause of death, worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high-dimensionality of such gene expression data, many projects use some dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional gene expression data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model—e.g., to accommodate the real-valued expression values—leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using this dLDA to learn these cancer-topics, we can then express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional “distribution vector” as input to a learning algorithm—here, we ran the recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using the r = 49,576 gene expression values, from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure. We then validate this “dLDA+MTLR” approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values—here using the mRNAseq modality—and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent “D-calibrated” measure. These successes, in two different cancer types and expression modalities, demonstrates the generality, and the effectiveness, of this approach. The dLDA+MTLR source code is available at https://github.com/nitsanluke/GE-LDA-Survival.

Highlights

  • The World Health Organization reports that cancer has become the second leading cause of death globally, as approximately 1 in 6 deaths are caused by some form of cancer [1]

  • Our experiments found that the discretization t = Enc_B, along with K = 30 c_topics, produced the best discretized Latent Dirichlet Allocation (dLDA) algorithm for survival prediction in METABRIC; after fixing the encoding scheme as Enc_B, we used the same technique on the KIPAN dataset and found K = 50 c_topics to be the best

  • That table shows that adding GE features improves survival prediction and that including both dLDA c_topics and supervised principal component analysis (SuperPC)+ principle components gives the most improvements across held-out datasets

Read more

Summary

Introduction

The World Health Organization reports that cancer has become the second leading cause of death globally, as approximately 1 in 6 deaths are caused by some form of cancer [1]. Cancers are very heterogeneous, in that the outcomes can vary widely for patients with similar diagnoses, who receive the same treatment regimen This has motivated researchers to seek other features to help predict individual outcomes. Many such analyses use just clinical features Features such as lymph node status and histological grade, while predictive of metastases, do not appear to be sufficient to reliably categorize clinical outcome [2]. This has led to many efforts to improve the prognosis for cancer, based on genomics data (e.g., gene expression (GE) or copy number variation (CNV)), possibly along with the clinical data [2,3,4,5,6]. There are many other systems that use such expression information to divide the patients into two categories: high- vs low-risk; cf., [2, 6]

Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call