Abstract

Speech recognition (SR) has been improved significantly by artificial neural networks (ANNs), but ANNs have the drawbacks of biological implausibility and excessive power consumption because of the nonlocal transfer of real-valued errors and weights. Spiking neural networks (SNNs) have the potential to overcome these drawbacks thanks to their efficient spike communication and their natural fit with the kinds of synaptic plasticity rules found in the brain for weight modification. However, existing SNN models for SR either performed poorly or were trained in biologically implausible ways. In this paper, we present a biologically inspired convolutional SNN model for SR. The network adopts the time-to-first-spike coding scheme for fast and efficient information processing. A biological learning rule, spike-timing-dependent plasticity (STDP), is used to adjust the synaptic weights of convolutional neurons so that they form receptive fields in an unsupervised way. In the convolutional structure, a strategy of local weight sharing is introduced, which can extract features of speech signals better than global weight sharing. We first evaluated the SNN model with a linear support vector machine (SVM) on the TIDIGITS dataset, where it achieved an accuracy of 97.5%, comparable to the best results of ANNs. In-depth analysis of the network outputs showed that they are not only more linearly separable but also lower-dimensional and sparse. To further confirm the validity of our model, we trained it on a more difficult recognition task based on the TIMIT dataset, where it achieved a high accuracy of 93.8%. Moreover, a linear spike-based classifier, the tempotron, achieves accuracies very close to those of the SVM on both tasks. These results demonstrate that an STDP-based convolutional SNN model equipped with local weight sharing and temporal coding can solve the SR task accurately and efficiently.
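The abstract above refers to the time-to-first-spike coding scheme, in which stronger inputs fire earlier so that information is carried by spike latencies rather than firing rates. As a rough illustration only, the following Python sketch maps normalized input intensities to first-spike latencies; the linear intensity-to-latency mapping and the 100 ms window `t_max` are illustrative assumptions, not the paper's exact encoder.

```python
import numpy as np

def time_to_first_spike(intensities, t_max=100.0):
    """Encode input intensities as first-spike latencies (ms).

    Stronger inputs spike earlier; a zero-intensity input never
    spikes (latency = infinity). The linear intensity-to-latency
    mapping is an illustrative assumption, not the paper's encoder.
    """
    intensities = np.asarray(intensities, dtype=float)
    # Normalize to [0, 1] so the strongest input fires at t = 0.
    scaled = intensities / max(intensities.max(), 1e-12)
    latencies = (1.0 - scaled) * t_max
    latencies[scaled <= 0.0] = np.inf  # silent inputs never spike
    return latencies

# Example: three inputs of increasing strength fire in reverse order.
print(time_to_first_spike([0.2, 0.5, 1.0]))  # strongest first: [80. 50. 0.]
```

Because each input emits at most one spike, downstream neurons can respond as soon as the earliest spikes arrive, which is what makes this coding scheme fast and efficient.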

Highlights

  • Automatic speech recognition is the ability of a machine to recognize and translate spoken language into text

  • It seems quite implausible that this process of nonlocal information propagation would occur in the cortex [5], where neurons communicate with each other only via spikes over direct connections, and synaptic strengths are generally modified by the activities of the corresponding pre- and post-synaptic neurons, e.g. through spike-timing-dependent plasticity (STDP) [6,7,8,9,10]; see the sketch after this list

  • First, we show the performance of our spiking neural network (SNN) model using a support vector machine (SVM) as a classifier, and compare it with the performance of other SNN and artificial neural network (ANN) models
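
To make the STDP reference in the highlights concrete, here is a minimal sketch of a generic pair-based STDP rule: a presynaptic spike that precedes a postsynaptic spike potentiates the synapse, while the reverse order depresses it, using only spike times local to that synapse. The exponential windows, time constants, and learning rates below are textbook defaults, not the exact parameters used in the paper.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012,
                tau_plus=20.0, tau_minus=20.0, w_min=0.0, w_max=1.0):
    """Pair-based STDP: update weight w from one pre- and one
    post-synaptic spike time (ms). Pre-before-post potentiates,
    post-before-pre depresses. All parameters are illustrative.
    """
    dt = t_post - t_pre
    if dt > 0:    # causal pairing: potentiation
        w += a_plus * np.exp(-dt / tau_plus)
    elif dt < 0:  # anti-causal pairing: depression
        w -= a_minus * np.exp(dt / tau_minus)
    return float(np.clip(w, w_min, w_max))

# Pre fires 5 ms before post: the synapse strengthens.
print(stdp_update(0.5, t_pre=10.0, t_post=15.0))  # ~0.508
```

Because the update depends only on the spike timing of the two neurons the synapse connects, it avoids the nonlocal transport of errors and weights criticized in the highlight above.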



Introduction

Automatic speech recognition is the ability of a machine to recognize and translate spoken language into text. It is a challenging task, since the speech signal is highly variable due to different speaker characteristics, varying speaking speed, and background noise. ANNs are inspired by features found in the brain. They consist of multiple layers of artificial neurons that learn data representations from the input data via gradient descent algorithms [2, 3]. Despite their biological inspiration and high performance, ANN models differ fundamentally from what is observed in biology in two main aspects: they rely on the nonlocal transfer of real-valued errors and weights, and, compared to the brain's energy efficiency, both training and execution of large-scale ANNs require massive amounts of computational power to perform single tasks.

