Abstract

We propose end-to-end deep neural network (DNN) architectures that operate on the raw speech waveform to estimate the age and gender of children aged 4–14 years. To this end, we design single-task and multi-task learning DNN configurations. In the multi-task learning DNN, we use age and gender as separate labels in two output layers and jointly optimize the total objective loss. We take a data-driven approach in which features are learned from the raw waveform within the DNN, giving the training process the freedom to learn gender- and age-discriminative features. Interleaved time-delay neural network and long short-term memory (TDNN-LSTM) layers with a time-restricted self-attention mechanism are used to model the temporal dynamics of speech. Experimental results provide a comparative analysis of single-task and multi-task learning for age and gender recognition from children’s speech.
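
The joint objective of the multi-task configuration can be illustrated with a minimal sketch. The abstract does not specify the loss functions, the task weights, or whether age is treated as a class label or a continuous value, so everything below is an assumption made for illustration: age is treated as an 11-way classification (one class per year from 4 to 14), gender as a 2-way classification, each output layer is scored with cross-entropy, and the total loss is a weighted sum. The names `multitask_loss`, `age_weight`, and `gender_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw output-layer scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


def multitask_loss(age_logits, gender_logits, age_target, gender_target,
                   age_weight=1.0, gender_weight=1.0):
    """Weighted sum of the two per-task losses (weights are assumed, not from the paper)."""
    age_loss = cross_entropy(age_logits, age_target)
    gender_loss = cross_entropy(gender_logits, gender_target)
    return age_weight * age_loss + gender_weight * gender_loss


# Hypothetical outputs of the two output layers for one utterance.
age_logits = [0.0] * 11      # assumed: 11 age classes covering 4 to 14 years
gender_logits = [0.0] * 2    # assumed: 2 gender classes
loss = multitask_loss(age_logits, gender_logits, age_target=3, gender_target=1)
```

In a single-task configuration under the same assumptions, only one of the two terms would be optimized; the multi-task variant shares the lower layers and backpropagates the summed loss through both output layers.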
