Abstract

Turning attention to a particular speaker when many people talk simultaneously is known as the cocktail party problem. It remains a challenging task, especially for single-channel speech separation. Inspired by the physiological observation that humans tend to pick out attractive sounds from mixed signals, we propose the multi-head self-attention deep clustering network (ADCNet) for this problem. We combine the widely used deep clustering network with the multi-head self-attention mechanism and explore how the number of heads in multi-head self-attention affects separation performance. We also adopt the density-based canopy K-means algorithm to further improve performance. We trained and evaluated our system on the Wall Street Journal dataset (WSJ0) with two- and three-talker mixtures. Experimental results show that the new approach achieves better performance than many advanced models.

Highlights

  • The cocktail party problem, which is of great significance for automatic speech recognition and voiceprint recognition [1], was first proposed by Cherry [2]

  • For comparison with other models, we primarily evaluated our system with the source-to-distortion ratio (SDR), following the BSS-EVAL metrics [29] (a minimal evaluation sketch follows this list)

  • When h is 5, the SDR of the attention deep clustering network (ADCNet) is lower than that of the baseline deep clustering (DPCL) model in Table 1, which indicates that multi-head self-attention does not perform well in all cases and that a larger h does not always yield better performance
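As a minimal sketch of the SDR evaluation mentioned above, the open-source mir_eval package implements the BSS-EVAL metrics [29]; the arrays below are hypothetical stand-ins for the clean references and separated estimates, and this is not necessarily the authors' exact evaluation pipeline.

```python
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
# Hypothetical placeholders: each row is one speaker's time-domain waveform.
reference_sources = rng.standard_normal((2, 16000))
estimated_sources = reference_sources + 0.1 * rng.standard_normal((2, 16000))

# bss_eval_sources searches over speaker permutations and returns
# per-source SDR, SIR, SAR plus the chosen permutation.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources)
print("per-speaker SDR (dB):", sdr, "best permutation:", perm)
```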


Summary

INTRODUCTION

The cocktail party problem, which is of great significance for automatic speech recognition and voiceprint recognition [1], was first proposed by Cherry [2]. A main contribution of this paper is that human beings can separate target speech from mixed signals owing to the specialty of the auditory system, and we utilize this physiological characteristic for the cocktail party problem.

DPCL MODEL

DPCL for single-channel speech separation was first put forward by Hershey et al. [11]. Its essence is to train a neural network to learn a high-dimensional embedding for each time-frequency unit, so that embeddings belonging to the same speaker have the minimum distance in the embedding space (a loss sketch follows below).

MODEL

The overall architecture of our proposed model is shown in Figure 2. It mainly consists of four parts: a pretreatment layer, an embedding network, a multi-head self-attention layer, and an improved K-means layer. The last two parts of our proposed model are presented in detail (see the sketches below).
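To make the DPCL objective concrete, here is a minimal PyTorch sketch of the standard deep-clustering loss from Hershey et al. [11], which pulls same-speaker embeddings together; the shapes and names are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def dpcl_loss(V, Y):
    """Deep-clustering objective ||V V^T - Y Y^T||_F^2.
    V: (TF, D) embeddings, one per time-frequency bin (typically unit-norm).
    Y: (TF, C) one-hot assignment of each bin to its dominant speaker.
    Expanded so only small D x D, D x C, C x C matrices are formed."""
    VtV = V.t() @ V  # (D, D)
    YtY = Y.t() @ Y  # (C, C)
    VtY = V.t() @ Y  # (D, C)
    return (VtV ** 2).sum() - 2 * (VtY ** 2).sum() + (YtY ** 2).sum()
```

The expansion avoids ever materializing the TF x TF affinity matrices, which is the usual trick for training DPCL at spectrogram scale.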

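For the multi-head self-attention layer, a hedged sketch using PyTorch's standard torch.nn.MultiheadAttention follows; the head count h, feature size, and sequence length are assumptions for illustration (the paper varies h, and d_model must be divisible by h), not the authors' implementation.

```python
import torch

h = 4        # number of attention heads (the paper studies how h affects SDR)
d_model = 40 # per-frame feature size; must be divisible by h

attn = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=h,
                                   batch_first=True)

x = torch.randn(8, 100, d_model)  # (batch, frames, features)
y, _ = attn(x, x, x)              # self-attention: queries = keys = values
```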
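For the improved K-means layer, the paper names a density-based canopy K-means algorithm; its exact formulation is not given in this summary, so the following is a hedged sketch of one common density-seeded canopy pass used to choose the number of clusters and initial centers before K-means. The threshold T and the density measure are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def canopy_centers(X, T):
    """Greedy canopy pass: repeatedly take the densest remaining point as a
    center, then discard all points within distance T of it."""
    remaining = X.copy()
    centers = []
    while len(remaining) > 0:
        d = np.linalg.norm(remaining[:, None] - remaining[None, :], axis=-1)
        density = (d < T).sum(axis=1)  # number of neighbours within T
        c = remaining[density.argmax()]
        centers.append(c)
        remaining = remaining[np.linalg.norm(remaining - c, axis=1) >= T]
    return np.stack(centers)

X = np.random.default_rng(0).standard_normal((500, 40))  # dummy embeddings
centers = canopy_centers(X, T=8.0)
labels = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit_predict(X)
```

Seeding K-means this way removes the need to fix the cluster count and starting centers by hand, which is the usual motivation for a canopy pre-pass over the learned embeddings.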
