Abstract

This work compares the performance of several acoustic feature extraction methods in a Bangla speech recognition system built on a Long Short-Term Memory (LSTM) neural network. The acoustic features are a series of vectors that represent the speech signal. They can be classified into either words or sub-word units such as phonemes. Linear predictive coding (LPC) is used first as the acoustic vector extraction technique, chosen for its widespread popularity. Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), two methods that closely resemble the behavior of the human auditory system, are then used as well. The LSTM neural network is trained on these feature vectors. The resulting models of different phonemes are then compared using two statistical tools, the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of the acoustic features.
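To make the two statistical tools named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes each phoneme model can be summarized as a Gaussian with a diagonal covariance, that is, one mean and one variance per feature dimension. That is a simplification made for this illustration and is not stated in the source. The function names `mahalanobis_diag` and `bhattacharyya_diag` are hypothetical.

```python
import math
from typing import Sequence


def mahalanobis_diag(x: Sequence[float],
                     mean: Sequence[float],
                     var: Sequence[float]) -> float:
    """Mahalanobis distance from a feature vector x to a Gaussian.

    Assumption: the covariance is diagonal, so `var` holds one variance
    per feature dimension and the inverse covariance is just 1 / var.
    """
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))


def bhattacharyya_diag(mean1: Sequence[float], var1: Sequence[float],
                       mean2: Sequence[float], var2: Sequence[float]) -> float:
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Per dimension, with avg_var = (v1 + v2) / 2:
      (m1 - m2)^2 / (8 * avg_var) + 0.5 * ln(avg_var / sqrt(v1 * v2))
    """
    total = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        avg_var = 0.5 * (v1 + v2)
        total += (m1 - m2) ** 2 / (8.0 * avg_var)           # mean separation
        total += 0.5 * math.log(avg_var / math.sqrt(v1 * v2))  # variance mismatch
    return total


if __name__ == "__main__":
    # Two made-up 2-dimensional "phoneme models" (illustrative numbers only).
    model_a = ([0.0, 0.0], [1.0, 1.0])
    model_b = ([2.0, 0.0], [1.0, 1.0])
    print(bhattacharyya_diag(*model_a, *model_b))
    print(mahalanobis_diag([3.0, 4.0], *model_a))
```

Under this reading, a larger Bhattacharyya distance between two phoneme models means the feature type separates those phonemes more clearly, while the Mahalanobis distance measures how far a single feature vector lies from one model in units scaled by that model's variance.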

Highlights

  • The objective is to emulate the human ability to communicate by speech, so that computers can carry out simple tasks through human-machine interaction and convert speech to text through Automatic Speech Recognition (ASR) systems

  • The primary target of this study is to examine the efficiency of various acoustic vectors for Bangla speech detection using a Long Short-Term Memory (LSTM) neural network and to assess their performance based on different statistical parameters

  • ASR systems based on LSTM and, more generally, recurrent neural networks (RNN) [13,14,15] trained with connectionist temporal classification (CTC) [16] have recently been shown to work extremely well when training data is abundant, matching and exceeding the performance of hybrid DNN systems [15]


Summary

INTRODUCTION

Speech is the most effective and most natural way of conveying information among people. Only a handful of works have been carried out for Bangla, which is among the most widely spoken languages in the world in terms of number of speakers. Some of these efforts can be found in [8]. The majority of these studies focused on simple word-level detection and worked with very small databases. They did not account for the various dialects spoken in different parts of the country. The primary target of this study is to examine the efficiency of various acoustic vectors for Bangla speech detection using an LSTM neural network and to assess their performance based on different statistical parameters.

Bangla Speech Database
Acoustic Feature Vectors
LSTM Neural Network
Bhattacharyya Distance
Mahalanobis Distance
PERFORMANCE ANALYSIS
DISCUSSION
CONCLUSION
