In this paper, we present a new approach to high-performance speech/music discrimination on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, an artificial neural network (ANN) trained on clean speech only (as used in a standard large-vocabulary speech recognition system) is used as a channel model, at the output of which the entropy and “dynamism” are measured every 10 ms. These features are then integrated over time through an ergodic 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints on each HMM state. In the case of entropy, for instance, it is clear (and observed in practice) that, on average, the entropy at the output of the ANN is larger for non-speech segments than for speech segments presented at its input. In our case, the ANN acoustic model was a multi-layer perceptron (MLP, as often used in hybrid HMM/ANN systems) whose outputs estimate the phonetic posterior probabilities given the acoustic vectors at its input. It is from these outputs, and thus from genuine probabilities, that the entropy and dynamism are estimated. The 2-state speech/non-speech HMM takes these two-dimensional features (entropy and dynamism) as input, with their distributions modeled by multi-Gaussian densities or a secondary MLP. The parameters of this HMM are trained in a supervised manner using the Viterbi algorithm. Although the proposed method can easily be adapted to other speech/non-speech discrimination applications, the present paper focuses only on speech/music segmentation. Different experiments, covering different speech and music styles as well as different temporal distributions of the speech and music signals (real data distribution, mostly speech, or mostly music), illustrate the robustness of the approach, which always achieves a correct segmentation rate above 90%. Finally, we will show how a confidence measure can be used to further improve the segmentation results, and also discuss how this may be used to extend the technique to the case of speech/music mixtures.
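As an illustration of the two features described above, the sketch below computes frame-wise entropy and dynamism from a matrix of per-frame phone posterior probabilities such as the MLP would produce. This is a minimal example under stated assumptions, not the authors' implementation: the exact dynamism definition (mean squared change of the posteriors between consecutive frames), the `posteriors` array, and the helper name are assumptions made for illustration only.

```python
import numpy as np

def entropy_dynamism(posteriors, eps=1e-12):
    """Frame-wise entropy and dynamism from phone posterior probabilities.

    posteriors: array of shape (num_frames, num_phones); each row is the
    MLP output for one 10 ms frame and is assumed to sum to 1.
    Returns two arrays of length num_frames.
    """
    p = np.clip(posteriors, eps, 1.0)

    # Entropy of the posterior distribution at each frame: low when the
    # network is confident (typically speech), high when it is not
    # (typically non-speech, since the MLP was trained on speech only).
    entropy = -np.sum(p * np.log(p), axis=1)

    # "Dynamism" (assumed definition): mean squared change of the posteriors
    # between consecutive frames; the intuition is that speech produces
    # larger frame-to-frame posterior variation than music.
    diff = np.diff(posteriors, axis=0, prepend=posteriors[:1])
    dynamism = np.mean(diff ** 2, axis=1)

    return entropy, dynamism

# Example: random stand-in for MLP outputs over 100 frames and 40 phones.
rng = np.random.default_rng(0)
fake_posteriors = rng.dirichlet(np.ones(40), size=100)
H, D = entropy_dynamism(fake_posteriors)
```

The resulting two-dimensional (entropy, dynamism) vectors would then serve as the observation sequence for the 2-state speech/non-speech HMM described above.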