Abstract

A self-organizing map (SOM) is an artificial neural network algorithm that can learn from training data consisting of objects expressed as vectors and perform non-hierarchical clustering, grouping input vectors into discrete clusters so that vectors assigned to the same cluster share similar numeric or alphanumeric features. SOM has been used widely in transcriptomics to identify co-expressed genes as candidates for co-regulated genes. I envision SOM to have great potential in characterizing heterogeneous sequence motifs, and I aim to illustrate this potential through a parallel presentation of SOM applied to a set of numerical vectors and a set of equal-length sequence motifs. While there are numerous biological applications of SOM involving numerical vectors, few studies have used SOM for heterogeneous sequence motif characterization. This paper is intended to encourage (1) researchers to study SOM in this new domain and (2) computer programmers to develop user-friendly motif-characterization SOM tools for biologists.

Highlights

  • A self-organizing map or SOM [1] is a grid of artificial neurons that learn patterns from training data and use the learned patterns to perform non-hierarchical clustering, representing input vectors as discrete clusters, with vectors in the same cluster sharing similar features

  • While SOM has almost always been presented as a non-hierarchical clustering method for numerical vectors (e.g., [1] and pp. 231–250 of [8]), it can in principle be adapted to any set of objects for which a pairwise distance between two objects can be computed (see the sketch after this list)

  • Several studies have demonstrated the value of using SOM to characterize sequence motifs [17,18,19,20,21,22], but their efforts do not seem to be sufficiently appreciated by biologists
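
A minimal sketch (not from the paper) of the idea in the second highlight: SOM training only needs a pairwise distance, so the same procedure can be driven by a Euclidean distance for numerical vectors or, for example, by a mismatch count for equal-length sequence motifs. The function names below are illustrative and not taken from any cited software.

```python
import math

def euclidean_distance(x, y):
    """Distance between two numerical vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mismatch_distance(motif_a, motif_b):
    """Number of mismatched positions between two equal-length sequence motifs."""
    if len(motif_a) != len(motif_b):
        raise ValueError("motifs must be of equal length")
    return sum(a != b for a, b in zip(motif_a, motif_b))

print(euclidean_distance([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # ~1.118
print(mismatch_distance("TATAAT", "TATGAT"))                 # 1
```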

Summary

Introduction

A self-organizing map or SOM [1] is a grid of artificial neurons that learn patterns from training data and use the learned patterns to perform non-hierarchical clustering, representing input vectors as discrete clusters, with vectors in the same cluster sharing similar features. SOM involves setting up a grid of artificial neurons; initializing them either with random values or with values from routine multidimensional scaling methods such as PCA; computing a distance (or similarity) between an input vector and each neuron to identify the winning neuron (the one with the shortest distance or greatest similarity to the input vector); revising the features of the winning neuron and its neighbors as a learning process; and continuing with other input vectors until the process converges (i.e., when the vector values of the neurons no longer change). Such a trained SOM can be used to classify input vectors that are not in the training data. Readers will find it easy to understand SOM with sequence motifs as input.
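
The sketch below illustrates the training procedure just described for numerical input vectors: random initialization of a grid of neurons, identification of the winning node by Euclidean distance, and a Gaussian-neighborhood update of the winner and its neighbors with a decaying learning rate and radius. It is a generic SOM sketch under assumed parameters (grid size, decay schedules, function names), not the paper's specific procedure.

```python
import numpy as np

def train_som(data, rows=5, cols=5, epochs=100, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rows x cols SOM on `data` (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    # Random initialization; PCA-based initialization is another option.
    weights = rng.random((rows, cols, dim))
    # Grid coordinates, used to measure neighborhood distance between nodes.
    grid = np.array([[(r, c) for c in range(cols)] for r in range(rows)], dtype=float)

    for epoch in range(epochs):
        # Learning rate and neighborhood radius decay over time.
        lr = lr0 * np.exp(-epoch / epochs)
        sigma = sigma0 * np.exp(-epoch / epochs)
        for x in data:
            # Winning node: smallest Euclidean distance to the input vector.
            dists = np.linalg.norm(weights - x, axis=2)
            win = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood around the winning node on the grid.
            grid_dist2 = np.sum((grid - np.array(win, dtype=float)) ** 2, axis=2)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Move the winner and its neighbors toward the input vector.
            weights += lr * h[..., None] * (x - weights)
    return weights

# Example: train on 200 random 4-dimensional vectors, then classify a new
# vector (not in the training data) by its nearest node in the trained grid.
som = train_som(np.random.default_rng(1).random((200, 4)))
new_vec = np.array([0.2, 0.8, 0.5, 0.1])
winner = np.unravel_index(np.argmin(np.linalg.norm(som - new_vec, axis=2)), som.shape[:2])
print(winner)
```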

Distance or Similarity between Two Vectors
Distance for Homologous Input Sequences
Distance for Non-Homologous Sequences
Training Data
SOM Grid Size and Initialization
Update SOM
Identify the Winning Node
Learning by Revising the Winning Node and Its Neighbors
The Fit of SOM to Input Data
Software Implementing SOM with PWM
Conclusions