Abstract
In this paper, we present our efforts towards developing a robust automatic speaker verification (ASV) system for children when domain-specific data is limited. To that end, we study the effect of in-domain and out-of-domain data augmentation, exploring several combinations of the two. For in-domain augmentation, speed and pitch perturbation of children’s speech are used to create synthetic data. For out-of-domain augmentation, adults’ speech is pooled with children’s speech. In addition, voice conversion (VC) is applied to adults’ speech to alter its acoustic attributes so that it becomes perceptually similar to children’s speech, and the converted adults’ data is then used for augmentation. The ASV systems developed in this study employ x-vectors derived using a time-delay deep neural network, with probabilistic linear discriminant analysis used for scoring. The explored data augmentation methods are noted to reduce the equal error rate as well as the minimum decision cost function by a large margin.
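As an illustration of the in-domain augmentation idea, speed perturbation can be sketched as resampling a waveform by a constant factor, which changes both its duration and its pitch. The snippet below is a minimal sketch under stated assumptions and not the authors' pipeline: the function name `speed_perturb`, the use of linear interpolation, and the factors 0.9 and 1.1 are all hypothetical choices made for this example.

```python
import numpy as np


def speed_perturb(signal, factor):
    """Resample `signal` so that it plays `factor` times faster.

    Hypothetical helper for illustration only. A factor above 1 shortens
    the signal and raises its pitch; a factor below 1 does the opposite.
    Linear interpolation is assumed here for simplicity.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Number of output samples after changing the playback speed.
    n_out = int(round(len(signal) / factor))
    # Fractional positions in the original signal to read from.
    src_positions = np.arange(n_out) * factor
    # Positions past the last sample are clamped to the final value by np.interp.
    return np.interp(src_positions, np.arange(len(signal)), signal)


# Example: a one-second 220 Hz tone at an assumed 16 kHz sampling rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 220.0 * t)

# Assumed perturbation factors; each perturbed copy would be added to the training pool.
augmented = {factor: speed_perturb(tone, factor) for factor in (0.9, 1.0, 1.1)}
```

With a factor of 1.1 the tone becomes shorter and its frequency rises proportionally, which is why this kind of perturbation yields additional, acoustically distinct training material from the same recordings.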