Abstract

This paper reports a deep-learning-based binaural speech separation method that uses gammatone-frequency cepstral coefficients (GFCC) and the multi-resolution cochleagram (MRCG) as spectral features, and the interaural time difference (ITD) and interaural level difference (ILD) as spatial features. A binary mask was estimated by a deep neural network (DNN) binary classifier that used these features as training data and the ideal ratio mask (IRM) as the training target. In the experiments, a male target speaker was placed at 0° azimuth and a female masker speaker at 30°, 20°, 10°, and 5° azimuth, in rooms with reverberation times (RT60) of 0.32, 0.47, 0.68, and 0.89 s. For training, 50 mixtures were used for each experimental condition. The classifier contained two hidden layers of 200 binary neurons and was trained for 50 epochs, with a Restricted Boltzmann Machine (RBM) used for pre-training. The learning rate of both the RBM and the classifier decreased from 1 to 0.001 over epochs 1 to 50. The sound-quality results showed 88% intelligibility on the STOI measure, meaning the separated speech is easily understood by a listener. The MOS value of 2.8 indicates that the sentences are clear but the spectral distortion is slightly annoying.
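The spatial features (ITD, ILD) and the IRM training target mentioned in the abstract can be illustrated with a minimal sketch. The helper names `itd_ild` and `ideal_ratio_mask` are hypothetical, and this per-frame computation is only an assumed simplification, not the paper's full gammatone-filterbank pipeline:

```python
import numpy as np

def itd_ild(left, right, fs):
    """Per-frame ITD (seconds) and ILD (dB) from left/right ear signals."""
    # ITD: lag of the cross-correlation peak between the two ears;
    # a negative lag means the right-ear signal arrives later.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    itd = lag / fs
    # ILD: energy ratio between the two ears, in decibels.
    eps = 1e-12
    ild = 10.0 * np.log10((np.sum(left ** 2) + eps) /
                          (np.sum(right ** 2) + eps))
    return itd, ild

def ideal_ratio_mask(speech_mag, noise_mag):
    """IRM per time-frequency unit: sqrt(S^2 / (S^2 + N^2))."""
    s2 = speech_mag ** 2
    return np.sqrt(s2 / (s2 + noise_mag ** 2 + 1e-12))

# Demo: broadband noise, delayed 5 samples and attenuated ~6 dB at the right ear
fs = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(512)
right = 0.5 * np.roll(left, 5)
itd, ild = itd_ild(left, right, fs)  # ITD ≈ -5/fs s, ILD ≈ 6.0 dB
```

In a full system these values would be computed per frame and per frequency channel and concatenated with the GFCC/MRCG features before being fed to the DNN classifier.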
