Gene expression based survival prediction for cancer patients-A topic modeling approach.

Luke Kumar,Russell Greiner,Paweł Pławiak

doi:10.1371/journal.pone.0224446

Abstract

Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of cancer patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high dimensionality of gene expression data, many projects use dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from high-dimensional gene expression data. In topic modeling, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over words; here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model (e.g., to accommodate the real-valued expression values), leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using dLDA to learn these cancer-topics, we express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional “distribution vector” as input to a learning algorithm; here, we ran the recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using r = 49,576 gene expression values from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure.
We then validate this “dLDA+MTLR” approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values (here using the mRNAseq modality), and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent “D-calibrated” measure. These successes, in two different cancer types and expression modalities, demonstrate the generality and the effectiveness of this approach. The dLDA+MTLR source code is available at https://github.com/nitsanluke/GE-LDA-Survival.
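To make the "patient ≈ document, expression value ≈ word" analogy concrete, the following is a minimal sketch of how real-valued expression values might be discretized into word counts before a topic model is fit. The scheme shown (truncated, capped z-scores with separate "up" and "down" words per gene) is an illustrative assumption and is not claimed to be the Enc_B encoding used in the paper; the function names `gene_stats` and `discretize_patient` and the `max_level` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def gene_stats(cohort, gene_names):
    """Per-gene (mean, population std) over a cohort given as a list of rows."""
    stats = {}
    for j, gene in enumerate(gene_names):
        column = [row[j] for row in cohort]
        stats[gene] = (mean(column), pstdev(column))
    return stats


def discretize_patient(expr, gene_names, stats, max_level=3):
    """Turn one patient's expression vector into a bag of pseudo-words.

    ASSUMPTION: this z-score binning is only illustrative; the paper's
    own encodings may differ.
    """
    doc = Counter()
    for gene, value in zip(gene_names, expr):
        mu, sigma = stats[gene]
        z = 0.0 if sigma == 0 else (value - mu) / sigma
        # Truncate |z| to an integer level and cap it at max_level.
        level = min(max_level, int(abs(z)))
        if level == 0:
            continue  # near-average expression contributes no words
        direction = "up" if z > 0 else "down"
        # The word "<gene>_<direction>" occurs `level` times in the document.
        doc[f"{gene}_{direction}"] += level
    return doc


if __name__ == "__main__":
    genes = ["G1", "G2", "G3"]
    stats = {"G1": (0.0, 1.0), "G2": (0.0, 1.0), "G3": (5.0, 0.0)}
    print(discretize_patient([2.7, -1.2, 5.0], genes, stats))
```

The resulting per-patient word counts form a document-term matrix that a standard LDA implementation could consume; the per-patient topic proportions it produces would then play the role of the low-dimensional "distribution vector" described above.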

Highlights

The World Health Organization reports that cancer has become the second leading cause of death globally, as approximately 1 in 6 deaths are caused by some form of cancer [1]
Our experiments found that the discretization t = Enc_B, along with K = 30 c_topics, produced the best discretized Latent Dirichlet Allocation (dLDA) algorithm for survival prediction in METABRIC; after fixing the encoding scheme as Enc_B, we used the same technique on the KIPAN dataset and found K = 50 c_topics to be the best
That table shows that adding GE features improves survival prediction, and that including both dLDA c_topics and supervised principal component analysis (SuperPC)+ principal components gives the largest improvement across held-out datasets
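The abstract reports accuracy in terms of Concordance. As a rough illustration of what that measure computes, here is a minimal sketch of Harrell's concordance index for right-censored data. This is one common definition; the excerpt does not say which variant the authors used, so treat it as an assumption. The function name `concordance_index` and its argument layout are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times:       observed time for each patient (event or censoring time)
    events:      1 if the event (death) was observed, 0 if censored
    risk_scores: higher score means the model predicts earlier death

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's observed time. Pairs with tied times are
    skipped in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risk = [0.9, 0.7, 0.8, 0.1]
    print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of patients by risk, which is why higher Concordance is read as better survival prediction.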

Summary

Introduction

The World Health Organization reports that cancer has become the second leading cause of death globally, as approximately 1 in 6 deaths are caused by some form of cancer [1]. Cancers are very heterogeneous, in that the outcomes can vary widely for patients with similar diagnoses who receive the same treatment regimen. This has motivated researchers to seek other features to help predict individual outcomes. Many such analyses use just clinical features. Features such as lymph node status and histological grade, while predictive of metastases, do not appear to be sufficient to reliably categorize clinical outcome [2]. This has led to many efforts to improve cancer prognosis based on genomics data (e.g., gene expression (GE) or copy number variation (CNV)), possibly along with the clinical data [2,3,4,5,6]. There are many other systems that use such expression information to divide the patients into two categories: high- vs low-risk; cf. [2, 6].



Journal: PloS one
Publication Date: Nov 15, 2019
Citations: 46
License type: CC BY 4.0
