Abstract

Spiking neural networks (SNNs), the third generation of neural networks, have been shown to perform well in pattern recognition tasks involving temporal information, such as speech recognition and motion detection. However, most neural networks used for speech recognition, including SNNs, rely on short-time frequency analysis, such as mel-frequency cepstral coefficients (MFCC), for low-level feature extraction. MFCC feature extraction analyzes a window of the time-domain signal across multiple frequency bands, one window at a time, in a synchronous fashion. This contrasts with the event-based principle of SNNs, in which electrical impulses are emitted and processed asynchronously. Just as speech signals arrive at the human cochlear filterbank concurrently while the spikes encoding the power in each frequency band are emitted asynchronously, we propose an event-based cochlear filter encoding scheme in which the power in each frequency band is extracted directly in the time domain and spikes encoded with a latency code are emitted asynchronously to represent that power. This replaces the traditional MFCC frontend used in most speech recognition models and makes possible an end-to-end event-based SNN implementation for speech recognition. The proposed event-based neural encoding is not only biologically plausible but also outperforms MFCC as an encoding frontend for an SNN classifier in a speech recognition task, achieving higher classification accuracy and lower latency. Such an end-to-end SNN model could be implemented on a neuromorphic chip to fully realize the advantages of event-based processing.
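
To make the latency-code idea concrete, the following is a minimal sketch, not the authors' implementation: each frequency band's power is mapped to a single spike time within an encoding window, with stronger power firing earlier. The window length `t_max`, reference power `p_ref`, and the log compression are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def latency_encode(band_powers, t_max=20e-3, p_ref=1.0):
    """Latency code: stronger band power -> earlier spike within the window.

    band_powers: non-negative power per frequency band.
    t_max, p_ref: hypothetical window length and reference power (assumptions).
    Returns one spike time per band in [0, t_max]; NaN means no spike.
    """
    p = np.asarray(band_powers, dtype=float)
    strength = np.log1p(p / p_ref)                # compress dynamic range
    peak = strength.max() if strength.size and strength.max() > 0 else 1.0
    t_spike = t_max * (1.0 - strength / peak)     # loudest band fires first (t = 0)
    t_spike = np.where(p > 0, t_spike, np.nan)    # silent bands emit no spike
    return t_spike

# Example: four bands; the loudest (last) fires earliest, the silent one not at all
print(latency_encode([0.0, 0.2, 1.0, 5.0]))
```

Because each band is encoded independently as soon as its power is available, the resulting spike stream is asynchronous across bands, in contrast to frame-synchronous MFCC features.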
