Front-end of Wake-Up-Word Speech Recognition System Design on FPGA

Mohamed M Eljhani, Brian H Hight

doi:10.4172/2167-0919.1000108

Abstract

A typical speech recognition system is push-button operated (push-to-talk), which requires hand movement and hence a mixed multi-modal interface. However, for disabled patients and for users of hands-busy applications (e.g., where the user has objects to manipulate or a device to control while asking for assistance from another device), movement may be restricted or impossible. One alternative is a speech-only interface. The proposed method is called Wake-Up-Word Speech Recognition (WUW-SR). A WUW-SR system allows the user to activate many systems (cell phone, computer, elevator, etc.) with speech commands instead of hand movements. This paper introduces a new front-end paradigm for WUW-SR. The state-of-the-art WUW-SR system is based on three different sets of features: (1) Mel-frequency Cepstral Coefficients (MFCC), (2) Linear Predictive Coding Coefficients (LPC), and (3) Enhanced Mel-frequency Cepstral Coefficients (ENH-MFCC). These features are decoded with corresponding Hidden Markov Models (HMMs) in the back-end stage of the WUW-SR system. We present an experimental FPGA design and implementation of a novel architecture for a real-time feature extraction processor that generates MFCC, LPC, and ENH-MFCC features simultaneously. In the WUW-SR system, the recognizer front-end is located at the terminal, which is typically connected over a data network to a remote back-end recognizer (e.g., a server). All three feature sets (MFCC, LPC, and ENH-MFCC) are extracted at the front-end. The extracted features are then compressed and transmitted to the server via a dedicated channel, where they are subsequently decoded. Our front-end can be added to any hand-held electronic device compatible with WUW-SR, which can then be activated by voice alone (with no push-to-talk, as is presently required).
Our front-end is designed, simulated, and implemented on an Altera DSP development kit with a Cyclone III FPGA as a portable system. It acts as a processor that computes three different sets of features at a much faster rate than software. It is cost-effective, consumes very little power, and does not depend on a general-purpose computer, so it can be used on any portable device.
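To make the MFCC feature set concrete, the following is a minimal software sketch of a textbook MFCC computation for a single frame. It is not the FPGA architecture described here, and it does not cover ENH-MFCC, whose definition is not given in this text. The 8 kHz sample rate, 256-sample frame, 20 mel filters, and 13 coefficients are illustrative assumptions, and all function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (common 2595/700 formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 / N for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive DFT for clarity; real designs would use an FFT.
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2 / n)
    return spectrum

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale up to Nyquist."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(mel_max * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):   # rising slope
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):   # falling slope
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank

def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs outputs."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]

def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    """MFCCs of one frame: power spectrum, mel filterbank, log, DCT."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Floor the filter energies (assumed 1e-12) so the log of silence is defined.
    log_energies = [math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-12))
                    for row in bank]
    return dct2(log_energies, num_coeffs)

# Example: a 440 Hz tone sampled at an assumed 8 kHz, one 256-sample frame.
tone = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(256)]
coefficients = mfcc(tone, 8000)
```

A hardware front-end would typically replace the naive DFT with an FFT core and use fixed-point arithmetic; the floating-point version above is only meant to show the order of the processing stages.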

Highlights

A typical speech recognition system is push-button operated (push-to-talk), which requires hand movement and a mixed multi-modal interface
This paper presents a feature extraction solution based on Linear Predictive Coding Coefficients (LPC), Mel-frequency Cepstral Coefficients (MFCC), and a new set of features named Enhanced Mel-frequency Cepstral Coefficients (ENH-MFCC), with an architecture specially optimized for implementation in Field-Programmable Gate Array (FPGA) structures
For a fair analysis, we tested the system by comparing the spectrograms of its three feature sets (MFCC, LPC, and Enhanced Spectrum (ENH)-MFCC) with those produced by the software (C, C++) implementation of the WUW front-end algorithm and by a MATLAB front-end model implemented specifically for this purpose

Summary

Introduction

A typical speech recognition system is push-button operated (push-to-talk), which requires hand movement and a mixed multi-modal interface. This paper addresses the problem by designing an efficient hardware front-end for the state-of-the-art WUW-SR system [1] on an FPGA, using an Altera DSP-based system that acts as a processor responsible for extracting three different sets of features from the input audio signal. It presents a feature extraction solution based on LPC, MFCC, and a new set of features named Enhanced Mel-frequency Cepstral Coefficients (ENH-MFCC), with an architecture specially optimized for implementation in FPGA structures.
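As an illustration of the LPC feature set, the sketch below computes linear prediction coefficients for one frame using the autocorrelation method and the Levinson-Durbin recursion. This is the standard textbook procedure, assumed here for illustration; the text does not state which LPC algorithm the hardware uses. The function names, the prediction order, and the test signal are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations by the Levinson-Durbin recursion.

    Returns (a, error), where a[0] == 1 and the prediction filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, so that
    x[n] is predicted as -(a[1] x[n-1] + ... + a[order] x[n-order]).
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error == 0.0:
            break  # silent frame: nothing left to predict
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient for this stage
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k  # residual prediction error energy
    return a, error

# Example: a decaying signal x[n] = 0.9 * x[n-1], analysed with order 2.
frame = [0.9 ** n for n in range(200)]
r = autocorrelation(frame, 2)
lpc, residual = levinson_durbin(r, 2)
```

For this first-order test signal, `lpc[1]` comes out close to -0.9 and `lpc[2]` close to 0, which matches the recurrence that generated the frame.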

Results

Conclusion