Abstract

Frequency analysis is one of the important functions of the peripheral auditory system. Psychoacoustically, this gives rise to the concept of the critical band, which represents the frequency resolution of the auditory system. Mel-scale warping is one of the common techniques used for analysis in speech recognition. The Bark and ERB (Equivalent Rectangular Bandwidth) rate scales are two other auditory scales whose performance is comparable to that of the Mel scale. In this paper, the acoustic features generated using filter banks with the Mel, Bark and ERB scales are investigated and analyzed with respect to the phonemes of the MISING language.

General Terms
Speech Processing and Analysis, Auditory Scale, Psycho-acoustic, Speech Signal

Keywords
Mel-Scale, Bark-Scale, ERB-scale, Filter Bank, Formant

1. INTRODUCTION
Speech analysis tackles the problem of deriving representations from recordings of real speech signals. With proper speech analysis, the key properties of real speech can be captured and then used to generate new speech signals. The nature of the speech signal and its acoustic properties can be studied by analyzing and presenting the signal in the frequency domain [1]. To maintain the naturalness of oral communication between humans and machines, all aspects of speech must be involved [2].

Speech analysis is needed because the waveform does not usually give us directly the type of information we are interested in. The first stage of speech analysis is filtering, which is performed to reduce ambiguities in the vocal message. Filtering is applied to discrete-time, quantized speech signals, after which the significant features of the speech signal are extracted. The key issues handled by speech analysis include:
a) Source/filter separation, to study the spectral envelope of sounds independently of the source with which they are spoken.
b) Transformation of these spectral envelopes and source signals into representations that can be coded efficiently and that show the linguistic information more clearly.

A sampled speech waveform needs at most 100,000 bits/s to retain all the conveyed information, which is much higher than the underlying average phoneme information rate. In general, a speaker is able to produce at most 45-50 different phonemes. Each phoneme can therefore be represented by 6 bits, as 50 < 2^6.
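
The 6-bit figure quoted above can be checked with a short calculation: a fixed-length code for N distinct symbols needs the smallest b such that 2^b >= N. The following is only a minimal sketch of that arithmetic; the function name bits_per_symbol is a hypothetical helper introduced here for illustration and is not part of the paper.

```python
def bits_per_symbol(num_symbols: int) -> int:
    """Return the smallest b such that 2**b >= num_symbols.

    Hypothetical helper for illustration: the length of a fixed-length
    binary code that can distinguish num_symbols different symbols.
    """
    if num_symbols < 1:
        raise ValueError("num_symbols must be at least 1")
    # (n - 1).bit_length() is an exact integer form of ceil(log2(n)).
    return (num_symbols - 1).bit_length()


if __name__ == "__main__":
    # Phoneme inventory sizes taken from the text (45-50 phonemes).
    for inventory_size in (45, 50):
        print(inventory_size, "phonemes ->", bits_per_symbol(inventory_size), "bits")
```

Both ends of the 45-50 range give 6 bits, since 2^5 = 32 is too small and 2^6 = 64 is sufficient.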
