Abstract
In speech separation, the identities of the speakers can be an important cue for discriminating the speech signals in a mixture and separating them more accurately. A few recent studies have used speaker embeddings as additional information, but they often require prior information about the target speaker or rely on noisy speaker embeddings extracted from the mixture signal. In this article, we propose a monaural speech separation method that utilizes, in the later separator blocks, speaker embeddings extracted from the intermediate separation results produced by the early stages of the separator network. The later blocks of separator networks that consist of repeated blocks, such as the fully-convolutional time-domain audio separation network (Conv-TasNet) or the network based on successive downsampling and resampling of multi-resolution features (SuDoRM-RF), are modified to take the speaker information in the form of an affine transformation of, or an addition to, the original input tensor. Experimental results show that the proposed methods significantly improve the performance of existing separation systems with a moderate number of additional parameters.
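To illustrate the conditioning mechanism described above, the following is a minimal sketch (not the authors' implementation) of how a speaker embedding could modulate a separator block's intermediate features, either as a FiLM-style affine transformation or as a simple addition. The module name, tensor shapes, and embedding dimension are assumptions made for the example; PyTorch is used as the framework.

```python
# Hypothetical sketch of speaker-conditioned separator features.
# Shapes and dimensions are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn


class SpeakerConditioning(nn.Module):
    """Injects a speaker embedding into a separator block's input tensor,
    either via a per-channel affine transformation or via addition."""

    def __init__(self, emb_dim: int, feat_channels: int, mode: str = "affine"):
        super().__init__()
        self.mode = mode
        if mode == "affine":
            # Predict per-channel scale (gamma) and shift (beta) from the embedding.
            self.to_gamma = nn.Linear(emb_dim, feat_channels)
            self.to_beta = nn.Linear(emb_dim, feat_channels)
        elif mode == "add":
            # Project the embedding to the feature dimension and add it.
            self.proj = nn.Linear(emb_dim, feat_channels)
        else:
            raise ValueError(f"unknown mode: {mode}")

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, time); spk_emb: (batch, emb_dim)
        if self.mode == "affine":
            gamma = self.to_gamma(spk_emb).unsqueeze(-1)  # (batch, channels, 1)
            beta = self.to_beta(spk_emb).unsqueeze(-1)
            return gamma * feats + beta
        return feats + self.proj(spk_emb).unsqueeze(-1)


if __name__ == "__main__":
    cond = SpeakerConditioning(emb_dim=256, feat_channels=512, mode="affine")
    feats = torch.randn(4, 512, 1000)   # intermediate separator features
    spk_emb = torch.randn(4, 256)       # embedding from an early separated output
    print(cond(feats, spk_emb).shape)   # torch.Size([4, 512, 1000])
```

In a repeated-block separator such as Conv-TasNet or SuDoRM-RF, a module like this would be applied at the input of each later block, with the speaker embedding obtained by running a speaker encoder on the intermediate separated signals from the earlier blocks.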