Abstract
Self-attention networks are widely used in sequence classification and sequence summarization tasks. State-of-the-art models use sequential models to capture high-level information, but these models are sensitive to utterance length and do not generalize equally well across utterances of variable length. This work studies the effectiveness of recent advances in self-attentive networks for improving the performance of a language identification (LID) system. In a self-attentive network, a variable-length input sequence is converted to a fixed-dimensional vector that represents the whole sequence. The weighted mean of the input sequence is taken as the utterance-level representation. Along with the mean, a standard deviation is used to represent the whole input sequence. Experiments are performed on the AP17-OLR database. Using the mean together with the standard deviation reduces the equal error rate (EER), an 8% relative improvement. A multi-head attention mechanism is then introduced into the self-attention network, under the assumption that each head captures distinct information for discriminating languages. Multi-head self-attention further reduces the EER, a 13% relative improvement. The best performance is achieved by a multi-head self-attention network with residual connections. Shifted delta cepstral (SDC) features and stacked SDC features are used to develop the LID systems.
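The pooling step described above, which maps a variable-length sequence to a fixed-dimensional vector through an attention-weighted mean and standard deviation, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the tanh scoring function, the parameter names `W` and `v`, and all dimensions are assumptions, since the abstract does not specify them. Each column of `v` acts as one attention head, so a single column gives the single-head case.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_stats_pooling(H, W, v, eps=1e-8):
    """Pool frame-level features into one utterance-level vector.

    H: (T, D) frame-level features for an utterance of T frames.
    W: (D, A) projection, v: (A, heads) scoring vectors (both hypothetical
       parameters; the scoring form is an assumption, not from the paper).
    Returns a vector of length heads * 2 * D, independent of T.
    """
    scores = np.tanh(H @ W) @ v            # (T, heads) unnormalized scores
    alpha = softmax(scores, axis=0)        # per-head weights over frames, sum to 1
    mean = alpha.T @ H                     # (heads, D) weighted mean
    second = alpha.T @ (H ** 2)            # (heads, D) weighted second moment
    var = np.clip(second - mean ** 2, 0.0, None)
    std = np.sqrt(var + eps)               # (heads, D) weighted standard deviation
    # Per head: [mean, std]; heads are concatenated into one vector.
    return np.concatenate([mean, std], axis=1).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, A, heads = 5, 4, 3                  # illustrative sizes only
    W = rng.normal(size=(D, A))
    v = rng.normal(size=(A, heads))
    for T in (7, 50):                      # two utterances of different length
        u = attentive_stats_pooling(rng.normal(size=(T, D)), W, v)
        print(T, u.shape)
```

Both utterances yield a vector of the same length, which is the property that lets a fixed-size classifier sit on top of variable-length input. When all scores are equal, the weights become uniform and the output reduces to the ordinary mean and standard deviation over frames.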