Abstract

In this paper, we present our efforts towards developing a robust automatic speaker verification (ASV) system for children when domain-specific data is limited. To that end, we study the effect of in-domain and out-of-domain data augmentation, exploring several combinations of the two. For in-domain augmentation, speed and pitch perturbation of children’s speech are used to create synthetic data. For out-of-domain augmentation, adults’ speech is pooled with children’s speech. In addition, voice conversion (VC) is applied to adults’ speech to alter its acoustic attributes so that it becomes perceptually similar to children’s speech, and the converted adults’ data is then used for augmentation. The ASV systems developed in this study employ x-vectors derived using a time-delay deep neural network, with probabilistic linear discriminant analysis (PLDA) used to score the verification trials. The explored data augmentation methods are found to reduce both the equal error rate and the minimum decision cost function by a large margin.
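The two evaluation metrics named above, equal error rate (EER) and minimum decision cost function (minDCF), can be illustrated with a minimal sketch. It assumes that a higher score means a more likely target trial, and the function names, the toy scores, and the cost parameters (`p_target=0.01`, `c_miss=1.0`, `c_fa=1.0`) are illustrative assumptions, not values taken from the paper.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (false rejection rate, false acceptance rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    """
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return frr, far


def candidate_thresholds(target_scores, nontarget_scores):
    """Every observed score, plus +inf so that 'reject everything' is covered."""
    return sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where FRR and FAR are closest."""
    rates = [
        error_rates(target_scores, nontarget_scores, th)
        for th in candidate_thresholds(target_scores, nontarget_scores)
    ]
    frr, far = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (frr + far) / 2


def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost over all candidate thresholds.

    p_target, c_miss and c_fa are assumed values chosen for illustration.
    """
    costs = []
    for th in candidate_thresholds(target_scores, nontarget_scores):
        frr, far = error_rates(target_scores, nontarget_scores, th)
        costs.append(c_miss * frr * p_target + c_fa * far * (1 - p_target))
    # Normalize by the cost of the best trivial system (always accept or always reject).
    return min(costs) / min(c_miss * p_target, c_fa * (1 - p_target))


# Toy scores (hypothetical), e.g. PLDA log-likelihood ratios per trial.
targets = [2.0, 1.5, 0.5, 3.0]
nontargets = [-1.0, 0.0, 1.0, -2.0]
eer = equal_error_rate(targets, nontargets)
dcf = min_dcf(targets, nontargets)
```

Sweeping only the observed scores keeps the sketch dependency-free. A production evaluation would typically interpolate along the detection error trade-off curve rather than pick the nearest discrete operating point.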
