Abstract

Background: Existing functional descriptions of genes are categorical, discrete, and mostly produced through manual processes. In this work, we explore the idea of gene embedding, a distributed representation of genes, in the spirit of word embedding.

Results: In a purely data-driven fashion, we trained a 200-dimensional vector representation of all human genes, using gene co-expression patterns in 984 data sets from the GEO databases. These vectors capture the functional relatedness of genes in terms of recovering known pathways: the average inner product (similarity) of genes within a pathway is 1.52 times greater than that of random genes. Using t-SNE, we produced a gene co-expression map that shows local concentrations of tissue-specific genes. We also illustrated the usefulness of the embedded gene vectors, which carry rich information on gene co-expression patterns, in tasks such as gene-gene interaction prediction.

Conclusions: We proposed a machine learning method that uses transcriptome-wide gene co-expression to generate a distributed representation of genes. We further demonstrated the utility of this representation by predicting gene-gene interactions based solely on gene names. The distributed representation of genes could be useful for further bioinformatics applications.

Highlights

  • Genes, discrete segments of the genome that are transcribed, are basic building blocks of molecular biological systems

  • Since our goal is to obtain a gene embedding that reflects the functional relationships among genes, we selected the set of hyper-parameters that maximizes the clusteredness of genes within functional pathways

  • Using the first and second components from the t-Distributed Stochastic Neighbor Embedding (t-SNE) representation, we produced a gene co-expression map, based on which we explored the distribution of all human genes from our results (Figure 3)
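The "clusteredness" criterion in the highlights, together with the abstract's comparison of within-pathway and random-gene inner products, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the unweighted averaging over pathways, and the random-pair sampling scheme are all assumptions made for illustration.

```python
import numpy as np


def mean_pairwise_inner_product(vectors):
    """Average inner product over all distinct pairs of rows in `vectors`."""
    gram = vectors @ vectors.T                       # all pairwise inner products
    upper = np.triu_indices(vectors.shape[0], k=1)   # distinct pairs only (i < j)
    return gram[upper].mean()


def clusteredness_ratio(embedding, pathways, n_random=1000, seed=0):
    """Ratio of mean within-pathway similarity to mean random-pair similarity.

    embedding : (n_genes, dim) array of gene vectors (hypothetical input)
    pathways  : list of integer index arrays, one per pathway (hypothetical input)
    Assumption: each pathway contributes equally to the within-pathway mean.
    """
    rng = np.random.default_rng(seed)
    within = np.mean(
        [mean_pairwise_inner_product(embedding[idx]) for idx in pathways]
    )
    # Sample random gene pairs and discard pairs of a gene with itself.
    n_genes = embedding.shape[0]
    i = rng.integers(0, n_genes, size=n_random)
    j = rng.integers(0, n_genes, size=n_random)
    keep = i != j
    random_mean = np.einsum("ij,ij->i", embedding[i[keep]], embedding[j[keep]]).mean()
    return within / random_mean
```

A ratio above 1 means that genes sharing a pathway are, on average, more similar under the inner product than arbitrary gene pairs. Under this sketch, hyper-parameter selection would amount to computing the ratio for each candidate embedding and keeping the largest.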

Introduction

Genes, discrete segments of the genome that are transcribed, are basic building blocks of molecular biological systems. Although almost all transcripts in the human genome have been identified, functional annotation of genes is still a challenging task. Most existing annotation efforts organize genes into functional categories, e.g., pathways, or represent their relationships as networks. Pathways and networks crystallize biological knowledge and are convenient qualitative conceptualizations of gene functions.

The challenge of creating a quantitative semantic representation of the discrete units of a complex system is not unique to gene systems. For a long time, creating a quantitative representation of words was a challenge for linguistic modeling. Hinton proposed the pioneering idea of ‘learning distributed representations of words’ [1], i.e., representing the semantics of a word by mapping it to a vector in a high-dimensional space. Existing functional descriptions of genes are categorical, discrete, and mostly produced through manual processes. We explore the idea of gene embedding, a distributed representation of genes, in the spirit of word embedding.

