Abstract

Throat microphone (TM) speech can be used for communication in noisy environments because the signal is captured directly from the skin, but the severe loss of high-frequency components degrades its clarity and intelligibility. Because direct recovery by neural networks does not achieve satisfactory performance, we propose a dictionary-representation-based neural network to address this issue. Specifically, a magnitude spectrum dictionary of air-conducted speech is computed via sparse non-negative matrix factorization (SNMF) and is then used to represent the transformed speech in a hidden layer of the network. A compensating dictionary is also adopted to improve the representation accuracy. A memory-efficient Semi-sparse Residual Recurrent Neural Network (SResRNN) with an interactive mechanism and a special ResNet is employed to generate the coefficients on the dictionaries. Finally, a three-layer neural network using a special initialization scheme is constructed as the recovery model. In the experiments, the model is compared with five other recovery models under several evaluation criteria; the objective and subjective results demonstrate the superiority of the proposed model.
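The dictionary-learning step mentioned above can be illustrated with a minimal sketch. The abstract does not state which SNMF formulation is used, so the code below assumes a common one: a Euclidean reconstruction cost with an L1 penalty on the activations, minimized by multiplicative updates. The function name `snmf_dictionary`, its parameters, and the random matrix `V_demo` standing in for an air-conducted magnitude spectrogram are all hypothetical and chosen only for illustration.

```python
import numpy as np


def snmf_dictionary(V, n_atoms, sparsity=0.1, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V (bins x frames) as V ~ W @ H.

    W holds the dictionary atoms (one per column) and H the activations.
    Assumed cost: 0.5 * ||V - W H||_F^2 + sparsity * sum(H), which may
    differ from the variant used in the paper.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_atoms)) + eps
    H = rng.random((n_atoms, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative update for H; the L1 penalty enters the denominator.
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)
        # Multiplicative update for W (no penalty on the dictionary).
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
        # Rescale atoms to unit L2 norm and move the scale into H,
        # which leaves the product W @ H unchanged.
        norms = np.linalg.norm(W, axis=0, keepdims=True) + eps
        W /= norms
        H *= norms.T
    return W, H


# Synthetic stand-in for a magnitude spectrogram (64 bins, 100 frames).
_rng = np.random.default_rng(1)
V_demo = _rng.random((64, 8)) @ _rng.random((8, 100))

W_demo, H_demo = snmf_dictionary(V_demo, n_atoms=8)
rel_err = np.linalg.norm(V_demo - W_demo @ H_demo) / np.linalg.norm(V_demo)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In a pipeline like the one described, the learned `W` would stay fixed and a network would predict the coefficients that play the role of `H`; how the compensating dictionary is built is not specified in the abstract and is not modeled here.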
