Abstract

Low-dimensional representation is a convenient method of obtaining a synthetic view of complex datasets and has been used in various domains for a long time. When the representation is related to words in a document, this kind of representation is also called a semantic map. The two most popular methods are self-organizing maps and generative topographic mapping. The second approach is statistically well-founded but far less computationally efficient than the first. On the other hand, a drawback of self-organizing maps is that they do not project all points, but only map nodes. This paper presents a method of obtaining the projections for all data points complementary to the self-organizing map nodes. The idea is to project points so that their initial distances to some cluster centers are as conserved as possible. The method is tested on an oil flow dataset and then applied to a large protein sequence dataset described by keywords. It has been integrated into an interactive data browser for biological databases.
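The idea of placing a point so that its distances to some cluster centers are conserved can be illustrated with a small sketch. This is not the paper's algorithm, whose objective and optimizer are not given in this excerpt. It assumes a plain least-squares mismatch between map distances and original distances, minimized by gradient descent. The function name `project_point`, its parameters, and the toy data are hypothetical, and any rescaling between the original space and the map is ignored.

```python
import math

def project_point(center_xy, target_dists, steps=300, lr=0.1):
    """Place one point on the 2-D map by least-squares distance matching.

    center_xy:    2-D map positions of the reference cluster centers
    target_dists: distances from the point to those centers in the original space
    """
    # Start from the centroid of the reference centers.
    px = sum(c[0] for c in center_xy) / len(center_xy)
    py = sum(c[1] for c in center_xy) / len(center_xy)
    for _ in range(steps):
        gx = gy = 0.0
        for (cx, cy), d in zip(center_xy, target_dists):
            r = math.hypot(px - cx, py - cy)
            if r == 0.0:
                continue  # gradient undefined exactly on a center; skip this term
            # Gradient of (r - d)^2 with respect to (px, py).
            gx += 2.0 * (r - d) * (px - cx) / r
            gy += 2.0 * (r - d) * (py - cy) / r
        px -= lr * gx
        py -= lr * gy
    return px, py

# Toy check: target distances are generated from a known 2-D location,
# so the sketch should recover that location.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
true_xy = (0.3, 0.4)
dists = [math.dist(true_xy, c) for c in centers]
px, py = project_point(centers, dists)
```

In this toy case the distances can be matched exactly. With real high-dimensional data an exact match is generally impossible, which is why distances can only be conserved as far as possible.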

Highlights

  • Thanks to the availability of the human and other genomes and the rapid progress of biotechnologies and information technologies, numerous large biomedical datasets have been generated

  • To quote a few examples, semantic maps have already been used in fluid mechanics [1], astronomy [2], internet data mining [3,4], scientific literature mining [5] and biology [6]

  • To validate the new points projection method, a previously established oil flow dataset [14] was used as a benchmark. This training dataset is available at http://www.ncrg.aston.ac.uk/GTM/ and contains 1000


Summary

INTRODUCTION

Thanks to the availability of the human and other genomes and the rapid progress of biotechnologies and information technologies, numerous large biomedical datasets have been generated.

In a self-organizing map, for each data point $y$, the closest node $w_i$ in $\mathbb{R}^p$ is selected and each node $w_j$ moves towards $y$ according to the equation

$$w_j(t+1) = w_j(t) + \alpha(t)\, h_{ij}(t)\, \left[ y - w_j(t) \right]$$

where $\alpha(t)$ is the learning rate, decreasing in time, and $h_{ij}(t)$ is a neighborhood function in the two-dimensional grid. These steps are iterated for all data points.

The generative topographic map (GTM) [1] is a statistical method which is provably (locally) convergent and which does not require a shrinking neighborhood or a decreasing step size. It is a generative model: the data is assumed to arise by probabilistically picking points in a low-dimensional space and mapping them to the observed high-dimensional input space. It is used in the results section to generate a semantic map in the context of a new integrative navigator for biological databases.
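The self-organizing map update described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes a Gaussian neighborhood function on the map grid, since the text does not specify one, and the names `som_update`, `alpha` and `sigma` are hypothetical.

```python
import math

def som_update(weights, grid, y, alpha, sigma):
    """Return node vectors after one SOM update for the data point y.

    weights: list of node vectors w_j in data space (each of length p)
    grid:    list of (row, col) node positions on the 2-D map
    alpha:   learning rate at this step (assumed name)
    sigma:   width of the Gaussian neighborhood (assumed form of h_ij)
    """
    # Best-matching node i: the node closest to y in data space.
    i = min(range(len(weights)), key=lambda j: math.dist(weights[j], y))
    updated = []
    for w_j, g_j in zip(weights, grid):
        # Neighborhood h_ij depends on the distance between nodes on the map grid.
        h_ij = math.exp(-math.dist(g_j, grid[i]) ** 2 / (2.0 * sigma ** 2))
        # w_j(t+1) = w_j(t) + alpha(t) * h_ij(t) * (y - w_j(t))
        updated.append([w + alpha * h_ij * (yk - w) for w, yk in zip(w_j, y)])
    return updated

# Toy example: a 2x2 map in a 3-dimensional data space.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
y = [0.9, 0.1, 0.2]
new_weights = som_update(weights, grid, y, alpha=0.5, sigma=1.0)
```

In a full training loop, `alpha` and `sigma` would both shrink over time and the update would be repeated for every data point. Only the node vectors are moved, which is why data points themselves receive no map coordinates from this procedure.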

METHODS
Validation Using the Oil Flow Dataset
Semantic Map Generation for Biological Database
CONCLUSIONS