Abstract

This letter proposes a lightweight model for speaker recognition that leverages hyperbolic space. Speaker recognition performance depends heavily on the distinctiveness of the speaker embeddings induced by metric learning. However, most state-of-the-art embedding methods operate in a Euclidean metric space, which does not account for the inherent hierarchical structure of voice characteristics. Recent developments in neural hyperbolic geometry have demonstrated its effectiveness in modeling continuous hierarchical structures, which are typically cumbersome to capture with standard deep neural networks; a useful by-product of this geometry is a more compact representation. Inspired by these favorable properties, we develop a hyperbolic ResNet for speaker recognition. We find that in lower-dimensional regimes than are typical, the learned speaker embeddings are more discriminative; in other words, they are more compact at the same level of performance. Our experiments on the large-scale VoxCeleb datasets show that, given limited channel dimensions in the neural network, our method consistently performs favorably against the standard ResNet on both speaker recognition and verification tasks.
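To make the idea of hyperbolic speaker embeddings concrete, the following is a minimal sketch of the standard Poincaré-ball operations commonly used to build hyperbolic embedding layers: Möbius addition, the exponential map at the origin, and the geodesic distance that replaces Euclidean or cosine distance in metric learning. The function names, the curvature value c = 1.0, and the 64-dimensional example are illustrative assumptions and do not reproduce the paper's exact architecture or training setup.

```python
import torch

def mobius_add(x, y, c=1.0, eps=1e-5):
    """Mobius addition on the Poincare ball of curvature c (standard formula; not paper-specific)."""
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + (c ** 2) * x2 * y2
    return num / den.clamp_min(eps)

def expmap0(v, c=1.0, eps=1e-5):
    """Exponential map at the origin: lifts Euclidean network features onto the ball."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x, y, c=1.0, eps=1e-5):
    """Geodesic distance on the Poincare ball, used in place of Euclidean distance for metric learning."""
    sqrt_c = c ** 0.5
    diff = mobius_add(-x, y, c)
    norm = diff.norm(dim=-1).clamp(max=(1 - eps) / sqrt_c)
    return (2.0 / sqrt_c) * torch.atanh(sqrt_c * norm)

# Hypothetical usage: map Euclidean ResNet embeddings onto the ball and compare speakers.
emb_a = expmap0(torch.randn(4, 64) * 0.1)  # assumed 64-dim embeddings for two utterance batches
emb_b = expmap0(torch.randn(4, 64) * 0.1)
print(poincare_dist(emb_a, emb_b))  # smaller distance suggests the same speaker
```

In such a setup, the backbone stays Euclidean and only the final embeddings are projected onto the ball via `expmap0`, so hierarchical relations can be encoded with fewer dimensions; this is one plausible reading of the compact-representation claim, not a description of the authors' exact implementation.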
