Abstract

Voice Activity Detection (VAD), the process of identifying speech regions in an audio recording, is used as an initial step in the majority of applications in the Digital Speech Processing (DSP) area. It is mostly used in automatic speech recognition, speaker identification/verification, speech enhancement, speaker diarization, etc., in order to reduce output errors and increase the overall effectiveness of these systems. In this study, a bag-level MNIST modelling of VAD is proposed using the Deep Multiple Instance Learning (Deep MIL) approach. Because, to the best of our knowledge, this is the first attempt in the literature to model VAD as a MIL problem, we named it “milVAD”. The MNIST dataset was modified to obtain a bag-level classifier model for the VAD framework, while the MIL algorithm was implemented inside a Convolutional Neural Network (CNN) as an embedded layer using the Noisy-And pooling method. The proposed modelling scenario achieved a surprisingly high training accuracy of approximately 99.91% after only nine epochs via Deep MIL at bag level. These results show that MIL can be used efficiently for VAD systems framed as binary classification.
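The abstract names Noisy-And pooling but does not give its formula. As a rough illustration only, the following is a minimal sketch of the commonly used Noisy-And formulation, in which a bag-level probability is computed from the mean of the instance-level probabilities and rescaled so that it spans the range from 0 to 1. The function names and the values of the slope `a` and threshold `b` are hypothetical choices for illustration and are not taken from the study; in a Deep MIL network, `b` would typically be a learned parameter rather than a constant.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def noisy_and_pool(instance_probs, a: float = 10.0, b: float = 0.5) -> float:
    """Pool instance-level probabilities into one bag-level probability.

    a: slope of the activation (assumed fixed hyperparameter, illustrative value).
    b: soft threshold on the fraction of positive instances
       (illustrative value; usually learned during training).
    """
    if not instance_probs:
        raise ValueError("a bag must contain at least one instance")

    # Mean instance probability over the bag.
    mean_p = sum(instance_probs) / len(instance_probs)

    # Rescale so that mean_p = 0 maps to 0 and mean_p = 1 maps to 1.
    low = sigmoid(-a * b)
    high = sigmoid(a * (1.0 - b))
    return (sigmoid(a * (mean_p - b)) - low) / (high - low)


if __name__ == "__main__":
    mostly_negative_bag = [0.05, 0.10, 0.02, 0.20]
    mostly_positive_bag = [0.90, 0.95, 0.80, 0.99]
    print(noisy_and_pool(mostly_negative_bag))
    print(noisy_and_pool(mostly_positive_bag))
```

With this form, the bag-level output rises sharply once the mean instance probability exceeds the threshold `b`, which is what makes the pooled value usable as a binary bag-level decision.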
