Abstract

The overall success of automatic speech recognition (ASR) depends on accurate phoneme recognition and on the quality of the speech signal the recognizer receives. Differences between speakers, however, degrade recognition performance, and inter-speaker variability is one of the main causes. Vocal tract length normalization (VTLN) compensates for inter-speaker variation in the speech signal by applying a speaker-specific warping to the frequency scale of a filter bank. Instead of measuring performance at the word level with speaker-specific warping, this research works directly at the phoneme level and applies VTLN to all speakers' speech signals to find the setting that gives the highest recognition performance. It compares phoneme recognition results for warping factors between 0.74 and 1.54, in increments of 0.02, over nine different frequency warping boundary ranges. The warp factor and frequency warping range that give the highest phoneme recognition performance are then applied to word recognition. Compared with the baseline, the results show an improvement of 0.7% in phoneme recognition and 0.5% in spoken word recognition using a warp factor of 1.40 over the frequency range 300–5000 Hz.
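The search described in the abstract can be sketched in a few lines of Python. The abstract does not give the warping function, so the piecewise-linear form below is an assumption: the function names, the `knee` parameter, and the convention that frequencies outside the warping boundaries are left unchanged are all hypothetical choices for illustration, not the authors' implementation. Only the grid itself (0.74 to 1.54 in steps of 0.02, which gives 41 candidate warp factors) and the 300–5000 Hz range come from the text.

```python
def warp_factor_grid(start=0.74, stop=1.54, step=0.02):
    """Candidate warp factors, as stated in the abstract."""
    count = int(round((stop - start) / step)) + 1
    # Round each value to avoid floating-point drift such as 1.4000000000000001.
    return [round(start + i * step, 2) for i in range(count)]


def warp_frequency(freq_hz, alpha, f_low=300.0, f_high=5000.0, knee=0.85):
    """Hypothetical piecewise-linear warp of one frequency (in Hz).

    Assumption: frequencies outside [f_low, f_high] are unchanged. Inside
    the range, the offset from f_low is scaled by alpha up to a knee point,
    then a second linear segment brings the curve back to f_high so that
    both boundaries stay fixed.
    """
    if freq_hz <= f_low or freq_hz >= f_high:
        return freq_hz
    span = f_high - f_low
    # Place the knee so that the scaled segment never overshoots f_high.
    knee_in = f_low + knee * min(1.0, 1.0 / alpha) * span
    knee_out = f_low + alpha * (knee_in - f_low)
    if freq_hz <= knee_in:
        return f_low + alpha * (freq_hz - f_low)
    slope = (f_high - knee_out) / (f_high - knee_in)
    return knee_out + slope * (freq_hz - knee_in)


if __name__ == "__main__":
    grid = warp_factor_grid()
    print(len(grid), grid[0], grid[-1])
    # Warp a few filter-bank centre frequencies with the best factor reported.
    for f in (200.0, 1000.0, 3000.0, 4800.0):
        print(f, warp_frequency(f, alpha=1.40))
```

In a full experiment, each value from `warp_factor_grid()` would be combined with each of the nine boundary ranges, the filter bank rebuilt with the warped frequencies, and phoneme recognition accuracy recorded for every combination. Whether a factor above 1 stretches or compresses the spectrum depends on the toolkit's convention, which the abstract does not specify.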
