Abstract
With advances in machine learning techniques, several speech-related applications deploy end-to-end models to learn relevant features from the raw speech signal. In this work, we focus on the speech rate estimation task, using an end-to-end model to learn representations from raw speech in a data-driven manner. We propose an end-to-end model that comprises a 1-d convolutional layer to extract representations from raw speech and a convolutional dense neural network (CDNN) to predict speech rate from these representations. The primary aim of the work is to understand the nature of the representations learned by the end-to-end model for the speech rate estimation task. Experiments are performed using the TIMIT corpus, in seen and unseen subject conditions. Experimental results reveal that the frequency responses of the learned 1-d CNN filters are low-pass in nature, and that the center frequencies of the majority of the filters lie below 1000 Hz. Comparing the proposed end-to-end system with the baseline MFCC-based approach, we find that the performance of the features learned with the CNN is on par with that of MFCC.
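The filter analysis mentioned above can be illustrated with a minimal sketch. It assumes that a filter's center frequency is taken as the peak of its magnitude response; the abstract does not state the definition used, so this is only one plausible choice. The function name `center_frequencies`, the FFT size, and the toy filters are hypothetical stand-ins for learned weights, and the 16 kHz rate is the TIMIT sampling rate.

```python
import numpy as np

def center_frequencies(filters, sample_rate, n_fft=1024):
    """Return the peak-magnitude frequency (Hz) of each 1-d filter.

    filters: array of shape (n_filters, kernel_size).
    Assumption: "center frequency" means the frequency at which the
    magnitude response is largest.
    """
    filters = np.asarray(filters, dtype=float)
    # Zero-padded FFT along the time axis gives the frequency response.
    magnitude = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return freqs[np.argmax(magnitude, axis=1)]

# Toy filters standing in for learned kernels (not results from the paper):
# Hann-windowed cosines at 500 Hz and 3000 Hz.
sample_rate = 16000
kernel_size = 401
t = np.arange(kernel_size) / sample_rate
window = np.hanning(kernel_size)
toy_filters = np.stack([
    window * np.cos(2 * np.pi * 500 * t),
    window * np.cos(2 * np.pi * 3000 * t),
])

cf = center_frequencies(toy_filters, sample_rate)
# Share of filters whose center frequency lies below 1000 Hz.
fraction_below_1k = np.mean(cf < 1000.0)
```

With a trained model, the first-layer convolution weights, reshaped to `(n_filters, kernel_size)`, would take the place of `toy_filters`.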