Abstract

To detect social signals such as laughter and filler events in an audio recording, the most straightforward approach is to use a Hidden Markov Model, or nowadays a Hidden Markov Model / Deep Neural Network (HMM/DNN) hybrid. HMM/DNNs, however, perform best if the DNN outputs are first scaled by dividing them by the a priori class probabilities before applying a dynamic search such as a Viterbi beam search. These a priori class probabilities (or priors for short) are usually estimated by counting the frame occurrences of each class in the training set and dividing these counts by the total number of frames. Such estimates, however, may be suboptimal for a number of reasons, ranging from imprecise labeling to the overconfidence of DNNs. In this study we show empirically that more reliable scaling factors can be obtained by optimization. Using this approach, we achieved a 6–9% relative error reduction at both the frame level and the segment level on a public database containing spontaneous English mobile phone conversations.
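As a rough illustration of the prior-scaling step (the standard scaled-likelihood trick in hybrid HMM/DNN decoding), the sketch below divides framewise softmax posteriors by count-based priors before they would be passed to a Viterbi beam search. All function names, shapes, and the toy data are illustrative assumptions, not taken from the paper; the paper's proposed variant would replace the count-based priors with scaling factors obtained by optimization, which the code only indicates in a comment (the abstract does not detail how that optimization is carried out).

```python
import numpy as np

def count_based_priors(frame_labels, num_classes, floor=1e-10):
    """Estimate class priors by counting how often each class label
    occurs among the training frames (the conventional estimate)."""
    counts = np.bincount(frame_labels, minlength=num_classes).astype(float)
    return np.maximum(counts / counts.sum(), floor)

def scale_posteriors(posteriors, priors):
    """Turn DNN posteriors P(class | frame) into scaled likelihoods
    proportional to p(frame | class) by dividing by the class priors.
    These scaled scores are what the HMM / Viterbi beam search consumes."""
    return posteriors / priors  # broadcasts over the time axis

# Toy example: 3 classes (e.g. laughter, filler, other) and 5 frames.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=5)   # (T, C) softmax outputs
train_labels = rng.integers(0, 3, size=1000)                 # framewise training labels

priors = count_based_priors(train_labels, num_classes=3)
scaled = scale_posteriors(posteriors, priors)
# In the approach studied here, `priors` would instead be replaced by
# scaling factors found via optimization rather than by frame counting.
```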
