Abstract

Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. We previously proposed a statistical-model-based VAD that uses long-term temporal information in speech and is robust to noise in an automobile environment. To improve it further, this paper describes a new method that exploits harmonic structure information within statistical models. In our approach, local peaks regarded as harmonic structure are extracted without explicit pitch detection or voiced-unvoiced classification. The proposed method, which combines long-term temporal features with static harmonic features, led to considerable improvements under low-SNR conditions in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system.
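
The abstract does not specify how the local peaks are extracted, so the following is only a minimal sketch of the general idea: marking local maxima in a log-power spectrum without estimating pitch or making a voiced-unvoiced decision. The function names, the `min_prominence` parameter, and the synthetic test spectrum are all hypothetical and are not taken from the paper.

```python
import math


def local_peak_mask(log_spectrum, min_prominence=0.0):
    """Return a 0/1 mask marking bins that are strict local maxima.

    Hypothetical illustration: a bin counts as a peak if it is larger than
    both neighbours by at least `min_prominence`. No pitch estimate is used.
    Edge bins are never marked because they have only one neighbour.
    """
    n = len(log_spectrum)
    mask = [0] * n
    for k in range(1, n - 1):
        left, mid, right = log_spectrum[k - 1], log_spectrum[k], log_spectrum[k + 1]
        if mid > left and mid > right and mid - max(left, right) >= min_prominence:
            mask[k] = 1
    return mask


def synthetic_harmonic_spectrum(n_bins=64, spacing=8):
    """Toy log-spectrum (assumption): a cosine ripple peaking every `spacing` bins."""
    return [math.cos(2 * math.pi * k / spacing) for k in range(n_bins)]


spectrum = synthetic_harmonic_spectrum()
mask = local_peak_mask(spectrum)
peak_bins = [k for k, m in enumerate(mask) if m]
print(peak_bins)
```

On the toy spectrum, the mask marks the evenly spaced ripple maxima, which is the kind of regular peak pattern that voiced speech tends to produce. How such a mask would be turned into a feature for the statistical models is not stated in the abstract and is left open here.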
