Mut2Vec: distributed representation of cancerous mutations

Sunkyu Kim, Jaewoo Kang, Keonwoo Kim, Heewon Lee

doi:10.1186/s12920-018-0349-7

Abstract

Background: Embedding techniques for converting high-dimensional sparse data into low-dimensional distributed representations have been gaining popularity in various fields of research. In deep learning models, embedding is commonly used and has proven more effective than naive binary representation. However, no attempt has yet been made to embed highly sparse mutation profiles into dense distributed representations. Since binary representation does not capture biological context, its use is limited in many applications such as discovering novel driver mutations. Additionally, training distributed representations of mutations is challenging because relatively little biological data is available compared with the large text corpora used in text mining.

Methods: We introduce Mut2Vec, a novel computational pipeline that can be used to create a distributed representation of cancerous mutations. Mut2Vec is trained on cancer profiles using Skip-Gram, since cancer can be characterized by a series of co-occurring mutations. We also augmented our pipeline with existing information from the biomedical literature and protein-protein interaction networks to compensate for the data insufficiency.

Results: To evaluate our models, we conducted two experiments that involved the following tasks: a) visualizing driver and passenger mutations, and b) identifying novel driver mutations using a clustering method. Our visualization showed a clear distinction between passenger mutations and driver mutations. We also found driver mutation candidates and confirmed, based on our literature survey, that these were true driver mutations. The pre-trained mutation vectors and the candidate driver mutations are publicly available at http://infos.korea.ac.kr/mut2vec.

Conclusions: We introduce Mut2Vec, which can be used to generate distributed representations of mutations, and experimentally validate the efficacy of the generated mutation representations. Mut2Vec can be used in various deep learning applications such as cancer classification and drug sensitivity prediction.
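As a rough illustration of why co-occurring mutations suit Skip-Gram, the sketch below turns toy cancer profiles into (target, context) training pairs. It is not the authors' pipeline: the patient identifiers, the gene symbols, and the helper `skipgram_pairs` are hypothetical, and it assumes that, because mutations in a profile have no natural order, every other mutation in the same profile counts as context.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: each profile lists the mutated genes of one patient.
profiles = {
    "patient_1": ["TP53", "KRAS", "APC"],
    "patient_2": ["TP53", "KRAS"],
    "patient_3": ["BRAF", "TTN"],
}


def skipgram_pairs(profile):
    """Return every ordered (target, context) pair of distinct mutations.

    Assumption: the whole profile acts as the context window, since
    mutations within a profile are unordered.
    """
    unique = sorted(set(profile))
    return list(permutations(unique, 2))


# Count how often each pair would be presented to a Skip-Gram model.
pair_counts = Counter()
for genes in profiles.values():
    pair_counts.update(skipgram_pairs(genes))

for (target, context), count in sorted(pair_counts.items()):
    print(f"{target} -> {context}: {count}")
```

Pairs that appear in many profiles are seen more often during training, which is what pulls frequently co-occurring mutations toward each other in the embedding space.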

Highlights

Embedding techniques for converting high-dimensional sparse data into low-dimensional distributed representations have been gaining popularity in various fields of research
Mutation vectors trained with Skip-Gram (Mut2Vec) can be used in various deep learning applications such as cancer classification and drug sensitivity prediction
The driver mutation directly affects the progression of the cancer, while the passenger mutation does not play any particular role

Summary

Introduction

Kim et al. BMC Medical Genomics 2018, 11(Suppl 2)

Embedding techniques for converting high-dimensional sparse data into low-dimensional distributed representations have been gaining popularity in various fields of research. Since binary representation does not capture biological context, its use is limited in many applications such as discovering novel driver mutations. If a representation of mutations captures the characteristics of driver mutations, it is possible to discover novel driver mutations by calculating the similarity between a candidate mutation and each of the driver mutations. Based on this motivation, we aim to address the problem by developing continuous and distributed representations of mutations using deep learning techniques. Because distributed representations of words encode semantic relationships, such as the semantic similarity between two words, they carry more information than a binary representation, which records only whether a word is present.
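The similarity idea can be sketched as follows. This is only an illustration under stated assumptions, not the clustering method used in the paper: the vectors, the names `driver_A`, `driver_B`, `cand_X`, `cand_Y`, and the scoring rule (mean cosine similarity to the drivers) are all hypothetical. The last lines show why a binary one-hot encoding cannot support this kind of comparison: any two distinct mutations have a cosine similarity of zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 3-dimensional embeddings; real vectors would be much larger.
driver_vectors = {
    "driver_A": [0.9, 0.1, 0.0],
    "driver_B": [0.8, 0.3, 0.1],
}
candidate_vectors = {
    "cand_X": [0.85, 0.2, 0.05],
    "cand_Y": [0.0, 0.1, 0.9],
}


def mean_driver_similarity(vec, drivers):
    """Average cosine similarity between one vector and every driver vector."""
    return sum(cosine(vec, d) for d in drivers.values()) / len(drivers)


# Rank candidates from most to least driver-like under this assumed score.
ranked = sorted(
    candidate_vectors,
    key=lambda name: mean_driver_similarity(candidate_vectors[name], driver_vectors),
    reverse=True,
)
print(ranked)

# With one-hot (binary) encoding, distinct mutations are always orthogonal.
one_hot_a = [1, 0, 0, 0]
one_hot_b = [0, 1, 0, 0]
print(cosine(one_hot_a, one_hot_b))
```

In this toy setup `cand_X` points in nearly the same direction as both drivers and ranks first, while the one-hot vectors carry no notion of closeness at all.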

Objectives

Results

Discussion

Conclusion