Abstract

Speaker segmentation is the problem of finding the boundaries at which a speaker begins and stops speaking in an audio stream. This segmentation of audio data is of interest to a broad class of applications such as surveillance, meeting summarization, and indexing of broadcast news. Unsupervised speaker segmentation approaches assume that no prior information about the speakers or their number is available. These approaches can be classified into three categories:

• Energy-based segmentation: silence in the input audio stream is detected either by a decoder or directly by measuring and thresholding the audio energy. The segments are then generated by cutting the input at silence locations.
• Metric-based segmentation: the audio stream is segmented at maxima of the distances between neighboring windows placed at evenly spaced time intervals.
• Model-selection-based segmentation [4]: assuming that the data are generated by a Gaussian process, speaker changes are detected by applying a statistical decision criterion within a sliding window moved through the audio stream.

A widely used technique for speaker segmentation is based on the Bayesian Information Criterion (BIC). BIC segmentation offers the advantages of robustness and independence from a tuned threshold. However, the method is extremely computationally expensive and can introduce estimation errors due to insufficient data when speaker turns are close to each other.
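To make the BIC criterion concrete, the following is a minimal sketch (not the authors' implementation) of the standard ΔBIC test between two adjacent feature windows, each modeled by a single full-covariance Gaussian: a positive ΔBIC favors the two-model hypothesis, i.e. a speaker change at the boundary. The function name, the feature shapes, and the penalty weight `lam` are illustrative assumptions.

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """Sketch of the ΔBIC speaker-change test between windows X and Y.

    X, Y: (n_frames, n_dims) arrays of acoustic features (e.g. MFCCs).
    lam: penalty weight (lambda), commonly tuned near 1.0 (assumption).
    Returns a score; a positive value suggests a speaker change.
    """
    Z = np.vstack([X, Y])          # pooled window (single-speaker hypothesis)
    n, d = Z.shape
    # Log-determinant of the sample covariance of a window.
    logdet = lambda M: np.linalg.slogdet(np.cov(M, rowvar=False))[1]
    # Model-complexity penalty: d mean parameters + d(d+1)/2 covariance
    # parameters for the extra Gaussian, scaled by log of the data size.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(Z)
            - 0.5 * len(X) * logdet(X)
            - 0.5 * len(Y) * logdet(Y)
            - lam * penalty)

# Illustrative use: windows drawn from clearly different Gaussians should
# score positive; windows from the same distribution should score negative.
rng = np.random.default_rng(0)
same = delta_bic(rng.normal(0, 1, (200, 5)), rng.normal(0, 1, (200, 5)))
diff = delta_bic(rng.normal(0, 1, (200, 5)), rng.normal(8, 1, (200, 5)))
```

In a full segmenter this test is evaluated at every candidate boundary inside a sliding window, which is the source of the computational cost noted above; the estimation-error concern arises because `np.cov` becomes unreliable when a window between close speaker turns contains few frames.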
