Abstract

In this work, we propose a raw-waveform-based multi-scale convolutional neural network (CNN) approach for language-independent gender identification. Instead of handcrafted features, our approach feeds the raw audio waveform directly into a 1-dimensional multi-scale CNN for speaker gender classification. The advantage of the multi-scale CNN is that it applies filters of different sizes to the raw waveform, so features are extracted at several scales. The network has three streams, each containing multiple residual blocks; the features from all streams are combined after the last convolution layer to predict the gender label. Our gender identification dataset contains 176 hours of audio in 6 Indian languages (Hindi, English, Kannada, Telugu, Tamil, and Gujarati). Our experiments show that learning the gender identification task from the raw waveform gives better performance and speeds up training, and that the multi-scale CNN on the raw waveform outperforms the spectrogram-based model by an absolute improvement of 2.24%.
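The multi-stream idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the kernel sizes (3, 9, 27), the number of filters per stream, and the random filter weights are all hypothetical stand-ins, and the residual blocks and the final classifier are omitted. It only shows how three streams with different filter sizes process the same waveform and how their pooled features are concatenated.

```python
import math
import random


def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation, the operation a CNN layer computes."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(xs):
    return [max(0.0, x) for x in xs]


def global_avg_pool(xs):
    return sum(xs) / len(xs)


def stream_features(signal, kernels):
    """One stream: a bank of same-size filters, ReLU, global average pooling."""
    return [global_avg_pool(relu(conv1d(signal, k))) for k in kernels]


def multiscale_features(signal, kernel_sizes=(3, 9, 27),
                        filters_per_stream=4, seed=0):
    """Concatenate pooled features from one stream per kernel size.

    Assumption: weights are random placeholders; in a trained model
    they would be learned, and each stream would be much deeper.
    """
    rng = random.Random(seed)
    features = []
    for size in kernel_sizes:
        kernels = [
            [rng.uniform(-1.0, 1.0) / size for _ in range(size)]
            for _ in range(filters_per_stream)
        ]
        features.extend(stream_features(signal, kernels))
    return features


# Toy "waveform": 100 samples of a sine wave.
wave = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
features = multiscale_features(wave)
print(len(features))  # 3 streams x 4 filters = 12 features
```

In a full model, the concatenated feature vector would be passed to a classification layer that outputs the gender label.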
