Abstract
Highlights
• A novel mWSSL auxiliary loss for improving the channel invariance of the LID network.
• Re-implementation of the original WSSL to achieve better performance.
• Extensive experimentation on the proposed approach.

State-of-the-art spoken language identification (LID) systems use sophisticated training strategies to improve robustness to the unseen channel conditions found in real-world test samples. However, these approaches require training samples from multiple channels with corresponding channel labels, which are often unavailable. Recent research has shown the possibility of learning a channel-invariant representation of speech using an auxiliary loss function called the within-sample similarity loss (WSSL), which does not require samples from multiple channels. Specifically, WSSL encourages the LID network to ignore channel-specific content in the speech by minimizing the similarity between two utterance-level embeddings of the same sample. However, because the WSSL approach operates at the sample level, it ignores channel variations that may be present across different training samples within the same dataset. In this work, we propose a modification to the WSSL approach to address this limitation. Along with the WSSL, the proposed modified WSSL (mWSSL) approach additionally considers the similarities with two global-level embeddings, which represent the average channel-specific content in a given mini-batch of training samples. The proposed modification gives the network a better view of the channel-specific content in the training dataset, leading to improved performance in unseen channel conditions.
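To make the description above concrete, here is a minimal, hypothetical sketch of how the mWSSL objective could be computed for a mini-batch. It assumes each sample yields two embeddings (as in WSSL), uses cosine similarity, and sums the sample-level term with a global-level term computed against the mini-batch mean embeddings; the exact pairing and weighting in the paper may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mwssl_loss(emb_a, emb_b):
    """Hypothetical sketch of the modified WSSL (mWSSL).

    emb_a, emb_b: lists of embedding vectors, the two utterance-level
    embeddings produced for each sample in the mini-batch.
    Minimizing this loss discourages channel-specific content.
    """
    n = len(emb_a)
    dim = len(emb_a[0])

    # Sample-level WSSL term: similarity between the two embeddings
    # of the same sample, averaged over the mini-batch.
    sample_term = sum(cosine(a, b) for a, b in zip(emb_a, emb_b)) / n

    # Global-level embeddings: per-branch mini-batch averages, which
    # stand in for the average channel-specific content of the batch.
    g_a = [sum(e[i] for e in emb_a) / n for i in range(dim)]
    g_b = [sum(e[i] for e in emb_b) / n for i in range(dim)]

    # Global-level term (an assumption of this sketch): similarity of
    # each embedding with the opposite branch's global embedding.
    global_term = (
        sum(cosine(a, g_b) for a in emb_a)
        + sum(cosine(b, g_a) for b in emb_b)
    ) / (2 * n)

    return sample_term + global_term
```

With identical embeddings in both branches the loss reaches its maximum of 2.0 (all similarities equal 1), while mutually orthogonal branch embeddings drive it to 0, which is the direction the minimization pushes the network.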