An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery

Michael R Brent

doi:10.1023/a:1007541817488

Abstract

This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
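
To make the task concrete, the following is a minimal sketch of boundary recovery, not the algorithm the paper proposes. It assumes a small lexicon with known word probabilities and uses a standard dynamic-programming (Viterbi-style) search for the most probable split, whereas the abstract describes a model in which the lexicon is unknown and the whole corpus is treated as a single event. The names `segment` and `TOY_LEXICON`, and all the probabilities, are hypothetical and chosen only for illustration.

```python
import math


def segment(text, word_probs, max_len=None):
    """Return the most probable split of `text` into words from `word_probs`.

    Assumes words are drawn independently (a unigram model) from a KNOWN
    lexicon. This is a simplification for illustration only.
    Returns None when no split into lexicon words exists.
    """
    if max_len is None:
        max_len = max(map(len, word_probs))
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i],
    #            start index of the last word in that split)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical toy lexicon; the probabilities are made up for this example.
TOY_LEXICON = {
    "the": 0.3, "dog": 0.2, "is": 0.2, "do": 0.1,
    "gone": 0.1, "one": 0.09, "g": 0.01,
}

print(segment("thedogisgone", TOY_LEXICON))
```

With this toy lexicon the search prefers "the dog is gone" over splits such as "the do g is gone", because each extra low-probability word lowers the product of word probabilities. The harder problem addressed in the abstract is that the lexicon itself must be discovered from the unsegmented text.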

Journal: Machine Learning
Publication Date: Jan 1, 1999
Citations: 313

Similar Papers

Predicting pauses in L1 and L2 speech: the effects of utterance boundaries and word frequency
Nivja H De Jong
International Review of Applied Linguistics in Language Teaching | VOL. 54
01 Jan 2015

Author response: An oscillating computational model can track pseudo-rhythmic speech by using linguistic predictions
Sanne ten Oever ... Andrea E Martin
21 Jun 2021

Text document clustering based on frequent word meaning sequences
Yanjun Li ... Soon M Chung
Data & Knowledge Engineering | VOL. 64
30 Aug 2007

Hydrological Uncertainty Processor (HUP) with Estimation of the Marginal Distribution by a Gaussian Mixture Model
Kuaile Feng ... Zhongzheng He
Water Resources Management | VOL. 33
07 Jun 2019
