Abstract

We propose a progressive learning approach to separating child speech from mixtures containing adult speech in a speaker-independent manner. The approach is based on a densely connected long short-term memory (LSTM) architecture designed to cope with the limited training data available for child speech. First, by measuring the dissimilarity between child and adult speech using i-vectors, we demonstrate that the distances between the two are large enough to warrant separation by establishing child and adult speech groups. Accordingly, we present a novel LSTM design with densely connected hidden layers and stacked inputs containing progressively obtained intermediate targets, which are learnt via multiple-target learning, for speech separation between child and adult groups. Experimental results on a simulated corpus show that the proposed framework yields consistent and significant gains in objective measures over the LSTM baseline for child speech separation. Furthermore, our preliminary results on the SeedLing corpus, which contains realistic recordings for child language acquisition, show that our approach achieves better overall separation performance than the LSTM baseline when comparing spectrograms of the separated speech, implying potential for speaker diarization involving child speech.
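
The group-level dissimilarity check described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the abstract: cosine distance is used as the metric, the i-vectors have already been extracted by some front end, and the vectors below are random placeholders rather than real i-vectors or results. The helper names (`cosine_distance`, `mean_between_distance`, `mean_within_distance`) are hypothetical.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b); assumed metric, not confirmed by the abstract."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mean_between_distance(group_a, group_b):
    """Average cosine distance over all cross-group pairs."""
    return float(np.mean([cosine_distance(a, b) for a in group_a for b in group_b]))


def mean_within_distance(group):
    """Average cosine distance over all distinct pairs inside one group."""
    n = len(group)
    dists = [cosine_distance(group[i], group[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))


# Placeholder vectors standing in for per-speaker i-vectors (synthetic, illustrative only).
rng = np.random.default_rng(0)
dim = 8
child_vecs = rng.normal(loc=1.0, scale=0.3, size=(5, dim))
adult_vecs = rng.normal(loc=-1.0, scale=0.3, size=(5, dim))

between = mean_between_distance(child_vecs, adult_vecs)
within_child = mean_within_distance(child_vecs)
within_adult = mean_within_distance(adult_vecs)

print(f"between-group: {between:.3f}")
print(f"within child:  {within_child:.3f}")
print(f"within adult:  {within_adult:.3f}")
```

If the between-group distance is clearly larger than both within-group distances, treating child and adult speech as two separable groups is plausible under this metric; with real i-vectors the margin would have to be checked empirically.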
