Abstract

Representation learning methods, such as deep autoencoders, have received sustained attention due to their ability to learn meaningful representations for a variety of applications. While these approaches can derive representations from any source signal (e.g., images, language, or voice) and encourage separation along the dominant factors of variation, they broadly treat nuisance factors (e.g., recording conditions, speaker gender, or accent) no differently from often subtler, more interesting factors, such as paralinguistic target variables (e.g., voice quality and phonetic vowels). In paralinguistic speech analysis, nuisance variables (e.g., the gender or accent of speakers) often dominate acoustic subtleties that pertain, for example, to the affect or well-being of the speaker. In this work, we seek to obtain nuisance-free embeddings by learning two separate, mutually orthogonal representations: one specialized to capture nuisance factors and one that improves the representation of the target. We propose unsupervised and (semi-)supervised orthogonal autoencoders that learn informative representations of paralinguistic and phonetic targets while removing the effect of the nuisance variable gender. Overall, our proposed model outperforms state-of-the-art approaches and yields improved target representations.
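The core idea of learning two mutually orthogonal representations can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: it uses two hypothetical linear encoders and a squared-Frobenius-norm penalty on the cross-correlation between the target and nuisance embeddings, which is one common way to encourage the two subspaces to carry disjoint information; in the paper this idea is realized inside an autoencoder.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))    # toy batch: 8 inputs, 16 features
W_t = rng.normal(size=(16, 4))  # hypothetical target-encoder weights
W_n = rng.normal(size=(16, 4))  # hypothetical nuisance-encoder weights

def orthogonality_penalty(X, W_t, W_n):
    """Squared Frobenius norm of the cross-correlation between the
    target embedding z_t and the nuisance embedding z_n.

    Driving this term to zero (alongside the reconstruction and
    prediction losses) pushes the two representations toward
    mutually orthogonal subspaces.
    """
    z_t = X @ W_t  # target embedding, shape (batch, 4)
    z_n = X @ W_n  # nuisance embedding, shape (batch, 4)
    return np.sum((z_t.T @ z_n) ** 2)

penalty = orthogonality_penalty(X, W_t, W_n)
```

In a full model, this penalty would be added as a regularizer to the autoencoder's reconstruction loss and to the (semi-)supervised prediction losses on the target and nuisance labels.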
