Abstract

Spiking neural networks perform reasonably well in recognition tasks for a single modality (e.g., images, audio, or text). In this paper, we propose a multimodal spiking neural network that combines two modalities (image and audio). The two unimodal ensembles are connected by cross-modal connections, and the entire network is trained with unsupervised learning. The network receives inputs in both modalities for the same class and predicts the class label. The excitatory connections within each unimodal ensemble and the cross-modal connections are trained with a power-law weight-dependent spike-timing-dependent plasticity (STDP) learning rule. The cross-modal connections capture the correlations between neurons of different modalities. The multimodal network learns features of both modalities and improves the classification accuracy over a unimodal topology, even when one of the modalities is distorted by noise; the cross-modal connections suppress the effect of noise on classification accuracy. Well-learned cross-modal connections evoke additional spiking activity in neurons of the correct label, and because the cross-modal connections are purely excitatory, they do not inhibit the normal activity of the unimodal ensembles. We evaluated our multimodal network on images from the MNIST dataset and spoken digits from the TI46 speech corpus. The multimodal network achieved a classification accuracy of 98% on the combined MNIST and TI46 dataset.
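
The abstract names a power-law weight-dependent STDP rule but does not spell out its form. The sketch below shows a commonly used formulation of such a rule (e.g., as in Diehl and Cook, 2015), in which the weight change on a postsynaptic spike is scaled by (w_max - w)^mu and driven by a presynaptic trace; all parameter names and values are illustrative assumptions, not taken from this paper.

```python
import numpy as np

# Illustrative parameters -- the abstract does not specify values.
ETA = 0.01      # learning rate
W_MAX = 1.0     # maximum synaptic weight (weights assumed in [0, W_MAX])
MU = 0.9        # power-law exponent controlling weight dependence
X_TAR = 0.4     # target value of the presynaptic trace
TAU_PRE = 20.0  # presynaptic trace time constant (ms)

def update_trace(x_pre, dt, pre_spiked):
    """Exponentially decay the presynaptic trace; bump it on a presynaptic spike."""
    x_pre = x_pre * np.exp(-dt / TAU_PRE)
    return x_pre + pre_spiked  # pre_spiked is 0/1 per synapse

def stdp_update(w, x_pre, post_spiked):
    """Power-law weight-dependent STDP, applied when the postsynaptic neuron fires.

    Synapses whose presynaptic trace exceeds X_TAR are potentiated; the
    (W_MAX - w)**MU factor slows growth as weights approach W_MAX.
    """
    if post_spiked:
        w = w + ETA * (x_pre - X_TAR) * (W_MAX - w) ** MU
    return np.clip(w, 0.0, W_MAX)
```

Under this formulation, the same update could in principle be applied to both the within-ensemble excitatory connections and the cross-modal connections, with the clipping to [0, W_MAX] keeping the cross-modal weights non-negative (purely excitatory), consistent with the abstract.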
