Abstract
Automatic phoneme segmentation of a speech sequence is a basic problem in speech engineering. This study investigates unsupervised phoneme segmentation, which uses no prior information on the linguistic content or the acoustic models of the input sequence. The authors formulate unsupervised segmentation as a maximum-likelihood optimisation problem, and show that the optimal segmentation corresponds to minimising the coding length of the input sequence. Under different assumptions, five objective functions are developed: log determinant, rate distortion (RD), Bayesian log determinant, Mahalanobis distance and Euclidean distance. The authors prove that the optimal segmentations have transformation-invariant properties, introduce a time-constrained agglomerative clustering algorithm to find the optimal segmentations, and propose an efficient implementation of the algorithm using integration functions. Experiments on the TIMIT database compare the five objective functions. The results show that RD achieves the best performance, and that the proposed method outperforms previous unsupervised segmentation methods.
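The time-constrained agglomerative clustering idea can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: it covers only the Euclidean distance objective (within-segment sum of squared deviations from the segment mean), it interprets "integration functions" as cumulative sums that give each segment cost in constant time, and it stops at a target segment count supplied by the caller. The names `segment_cost` and `agglomerative_segment` are hypothetical.

```python
import numpy as np

def segment_cost(cum_x, cum_sq, start, end):
    """Sum of squared distances to the segment mean for frames [start, end).

    Uses cumulative sums so the cost is obtained without revisiting frames
    (assumed reading of the abstract's "integration functions").
    """
    n = end - start
    s = cum_x[end] - cum_x[start]      # sum of feature vectors in the segment
    sq = cum_sq[end] - cum_sq[start]   # sum of squared norms in the segment
    return sq - np.dot(s, s) / n

def agglomerative_segment(frames, num_segments):
    """Greedy bottom-up merging of adjacent segments (Euclidean objective only).

    The stopping rule (a known number of segments) is an assumption made
    for illustration.
    """
    frames = np.asarray(frames, dtype=float)
    num_frames, dim = frames.shape
    cum_x = np.vstack([np.zeros((1, dim)), np.cumsum(frames, axis=0)])
    cum_sq = np.concatenate([[0.0], np.cumsum(np.sum(frames ** 2, axis=1))])

    # Start with every frame as its own segment; bounds holds the boundaries.
    bounds = list(range(num_frames + 1))
    while len(bounds) - 1 > num_segments:
        best_i, best_delta = None, np.inf
        # Time constraint: only temporally adjacent segments may be merged.
        for i in range(1, len(bounds) - 1):
            a, b, c = bounds[i - 1], bounds[i], bounds[i + 1]
            delta = (segment_cost(cum_x, cum_sq, a, c)
                     - segment_cost(cum_x, cum_sq, a, b)
                     - segment_cost(cum_x, cum_sq, b, c))
            if delta < best_delta:
                best_i, best_delta = i, delta
        del bounds[best_i]  # remove the boundary whose merge costs least
    return bounds

# Toy input: three constant stretches of five 2-D frames each.
toy_frames = np.array([[0, 0]] * 5 + [[10, 10]] * 5 + [[-10, 5]] * 5)
toy_bounds = agglomerative_segment(toy_frames, num_segments=3)
```

On this toy input the returned boundaries separate the three constant stretches. Swapping in one of the other objectives named in the abstract would only require replacing `segment_cost`; the merging loop stays the same.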