Abstract

In recent years, DNN-based systems have become extremely popular for automatic speech recognition (ASR) and have outperformed other methods. For ASR to work efficiently, the input data must also be segmented accurately. Many voice activity detection (VAD) methods exist, including power-based statistical methods, but a DNN-based VAD is better suited to integration with an ASR system. In this paper, we investigate the advantage of DNN-based VAD over power-based statistical methods. We also investigate the effect of data augmentation on VAD performance under various conditions. VAD performance is evaluated on the CENSREC-1-C data, which was developed specifically for evaluating VAD. The VAD trained with multi-condition data outperforms the baseline and other popular power-based statistical VAD tools. We also evaluate performance on the eval1 test set of the CSJ corpus and on its telephone variants.

