Abstract

Language cues are an important component of spoken language identification (LID), but aligning them to speech segments through manual annotation by professional linguists is time-consuming. Instead of annotating linguistic phonemes, we exploit co-occurrence statistics in speech utterances to discover the underlying phoneme-like speech units in an unsupervised manner. We then model phonotactic constraints on the set of phoneme-like units to find larger speech segments, called suprasegmental phonemes, and extract multi-level language cues from them, including phonetic, phonotactic, and prosodic cues. Furthermore, we propose a novel LID system based on a TDNN architecture followed by an LSTM-RNN. The proposed system is built and compared with acoustic-feature-based and phonetic-feature-based methods on the NIST LRE07 and Arabic dialect identification tasks. The experimental results show that our system captures robust discriminative information for short-duration language identification and achieves high accuracy in dialect identification.
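To make the TDNN-followed-by-LSTM-RNN architecture concrete, the sketch below shows one plausible realization in PyTorch. The abstract does not give layer sizes or hyperparameters, so everything here is an assumption: the 40-dimensional frame features, the three dilated 1-D convolutions standing in for TDNN layers, the single LSTM layer, and the `n_languages=14` output (matching the number of LRE07 target languages) are illustrative choices, not the authors' configuration.

```python
# A minimal sketch of a TDNN -> LSTM-RNN utterance classifier for LID.
# Assumptions (not from the paper): 40-dim input features, three TDNN
# layers realized as dilated Conv1d, one LSTM layer, 14 output classes.
import torch
import torch.nn as nn

class TDNNLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=40, tdnn_dim=512, lstm_dim=256, n_languages=14):
        super().__init__()
        # TDNN layers: 1-D convolutions over time with growing temporal context.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, tdnn_dim, kernel_size=5, dilation=1),
            nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=3, dilation=2),
            nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=3, dilation=3),
            nn.ReLU(),
        )
        # LSTM summarizes the TDNN outputs across the whole utterance.
        self.lstm = nn.LSTM(tdnn_dim, lstm_dim, batch_first=True)
        self.classifier = nn.Linear(lstm_dim, n_languages)

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level features
        h = self.tdnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, time', tdnn_dim)
        _, (h_n, _) = self.lstm(h)                        # final hidden state
        return self.classifier(h_n[-1])                   # (batch, n_languages)

# Usage: a 3-second utterance at 100 frames/sec -> 300 frames.
model = TDNNLSTMClassifier()
logits = model(torch.randn(2, 300, 40))  # shape (2, 14)
```

Feeding the TDNN outputs into an LSTM is a common pattern for short-duration LID: the convolutions capture local phonetic context while the recurrent layer accumulates evidence over the full utterance.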
