Abstract
This paper presents a comparative study of the Ideal Binary Mask (IBM) and the Ideal Ratio Mask (IRM) as training targets for supervised Malay speech separation. Motivated by advances in computing power, a Deep Neural Network (DNN) is used as the supervised algorithm to predict the target mask from a noisy mixture signal, that is, speech degraded by background noise. Although previous work has reported that the IRM outperforms the IBM as a DNN training target, those results are not directly comparable because they were obtained on different databases. To validate the DNN model with both target masks, 600 Malay utterances from a male and a female speaker were used in the training session, while the remaining 120 Malay utterances were used in the prediction session. A combination of acoustic features, namely the amplitude modulation spectrogram (AMS), mel-frequency cepstral coefficients (MFCC), relative spectral transformed perceptual linear prediction coefficients (RASTA-PLP) and Gammatone filter bank power spectra (GF), was used as the input to estimate the target mask. Intelligibility enhancement was evaluated using the Short-Time Objective Intelligibility (STOI) score. With the IRM target, the average STOI score reached up to 0.83 for seen speakers and 0.76 for unseen speakers in babble noise at -5 dB, which is higher than the scores obtained with the IBM target.
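The abstract does not define the two training targets, so the following is a minimal sketch of their commonly used definitions, not necessarily the exact formulation used in this study. It assumes the premixed speech and noise energies are available per time-frequency unit (as nested lists), that the IBM uses a local SNR criterion `lc_db` (0 dB here, an assumed value), and that the IRM uses an exponent `beta` (0.5 here, also an assumed value). All function and parameter names are hypothetical.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """IBM sketch: 1 where the local SNR in dB exceeds lc_db, otherwise 0.

    lc_db is an assumed local criterion; the source does not state one.
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                row.append(0)          # no speech energy: noise-dominated
            elif n <= 0.0:
                row.append(1)          # no noise energy: speech-dominated
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def ideal_ratio_mask(speech_energy, noise_energy, beta=0.5):
    """IRM sketch: (S / (S + N)) ** beta, a soft mask in [0, 1].

    beta = 0.5 is an assumed, commonly used exponent.
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            row.append((s / total) ** beta if total > 0.0 else 0.0)
        mask.append(row)
    return mask


# Toy 2x2 time-frequency grid of energies (illustrative values only).
speech = [[4.0, 1.0], [0.5, 9.0]]
noise = [[1.0, 1.0], [2.0, 1.0]]

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
print(ibm)
print([[round(v, 3) for v in row] for row in irm])
```

The contrast is that the IBM makes a hard keep-or-discard decision per unit, whereas the IRM retains a fraction of each unit in proportion to how much of its energy is speech.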