Abstract

Understanding the role of differential gene expression in the development of, and molecular response to, cancer remains a challenging problem, in part because of the sheer number of genes, gene products, and metabolites involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to explore patterns of gene expression in healthy and cancerous tissues. An important advantage of LDA over alternative statistical and machine learning methods is its proven ability to handle sparse inputs over an extremely large number of features in an unsupervised manner. LDA has recently been applied to clustering and exploring genomic data, but not to classification and prediction. We therefore seek to optimize the protocol and parameters for an efficient implementation of LDA, using messenger RNA (mRNA) sequence data from breast cancer and healthy tissue to determine an effective approach for applying LDA to the classification of cancer versus healthy tissue. We describe our study in two phases. First, parameters such as the number of topics, bins, and passes were optimized for LDA. Next, we developed a novel LDA-based approach that classifies unknown samples based on the similarity of their co-expression patterns. Our evaluation shows that LDA can achieve high accuracy compared to alternative approaches. Overall, our results indicate that LDA is a promising approach for classifying tissue types from gene expression data in cancer studies.

