Abstract

This paper describes a joint model of word segmentation and phonological alternations, which takes unsegmented utterances as input and infers word segmentations and underlying phonological representations. The model is a Maximum Entropy or log-linear model, which can express a probabilistic version of Optimality Theory (OT; Prince and Smolensky (2004)), a standard phonological framework. The features in our model are inspired by OT's Markedness and Faithfulness constraints. Following the OT principle that such features indicate violations, we require their weights to be non-positive. We apply our model to a modified version of the Buckeye corpus (Pitt et al., 2007) in which the only phonological alternations are deletions of word-final /d/ and /t/ segments. The model sets a new state-of-the-art for this corpus for word segmentation, identification of underlying forms, and identification of /d/ and /t/ deletions. We also show that the OT-inspired sign constraints on feature weights are crucial for accurate identification of deleted /d/s; without them our model posits approximately 10 times more deleted underlying /d/s than appear in the manually annotated data.
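
The sign constraints on feature weights amount to projecting the weight vector onto the non-positive orthant during optimisation. The following is a minimal sketch of that projection step, not the authors' implementation; the weights, gradient, and learning rate are made-up toy values.

```python
import numpy as np

def projected_gradient_step(weights, grad, lr=0.1):
    """One gradient step on the negative log-likelihood, followed by
    projection onto the non-positive orthant (every weight <= 0)."""
    weights = weights - lr * grad
    return np.minimum(weights, 0.0)  # clamp any positive weight back to zero

# Toy example: three OT-style violation features (values are hypothetical).
w = np.array([-0.5, -1.2, 0.0])
g = np.array([-0.3, 0.8, -0.4])
print(projected_gradient_step(w, g))  # -> [-0.47 -1.28  0.  ], all non-positive
```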

Highlights

  • This paper unifies two different strands of research on word segmentation and phonological rule induction

  • Elman (1990) introduced the word segmentation task as a simplified form of lexical acquisition, and Brent and Cartwright (1996) and Brent (1999) introduced the unigram model of word segmentation, which forms the basis of the model used here

  • If a word's underlying form ends in /d/ or /t/ but the Buckeye surface form does not end in an allophonic variant of that segment, our surface form consists of the Buckeye underlying form with that final segment deleted (see the sketch after this list)

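As a concrete illustration of the last highlight, the sketch below applies the corpus-construction rule to a single word token. The phone inventory and allophone sets are assumptions for illustration, not the authors' exact Buckeye preprocessing.

```python
# Allophone sets are hypothetical stand-ins (e.g. flap "dx", glottalised "tq").
T_ALLOPHONES = {"t", "tq", "dx"}
D_ALLOPHONES = {"d", "dx"}

def make_surface_form(underlying, buckeye_surface):
    """Return our surface form: delete a word-final /t/ or /d/ from the
    underlying form when the Buckeye surface token shows no allophonic
    trace of that segment; otherwise keep the underlying form."""
    final = underlying[-1]
    if final == "t" and buckeye_surface[-1] not in T_ALLOPHONES:
        return underlying[:-1]
    if final == "d" and buckeye_surface[-1] not in D_ALLOPHONES:
        return underlying[:-1]
    return underlying

# "went" pronounced without its final /t/:
print(make_surface_form(["w", "eh", "n", "t"], ["w", "eh", "n"]))
# -> ['w', 'eh', 'n']
```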

Summary

Introduction

This paper unifies two different strands of research on word segmentation and phonological rule induction. The word segmentation task is the task of segmenting utterances, represented as sequences of phones, into sequences of words. This is an idealisation of the lexicon induction problem, since the resulting words are phonological forms for lexical entries. In its simplest form, the data for a word segmentation task is obtained by looking up the words of an orthographic transcript (of, say, child-directed speech) in a pronouncing dictionary and concatenating the results. This formulation significantly oversimplifies the problem because it assumes that each token of a word type is pronounced identically, in the form specified by the pronouncing dictionary (usually its citation form). MaxEnt features can capture phonotactic generalisations about possible word shapes, and the model achieves a state-of-the-art word segmentation f-score.
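
For concreteness, here is a hedged sketch of that idealised data construction: each orthographic word is looked up in a pronouncing dictionary and the citation-form pronunciations are concatenated, discarding the word boundaries that the segmentation model must then recover. The tiny dictionary and phone symbols are hypothetical.

```python
# Hypothetical citation-form dictionary (real systems would use e.g. CMUdict).
PRON_DICT = {
    "you": ["y", "uw"],
    "want": ["w", "aa", "n", "t"],
    "to": ["t", "uw"],
}

def utterance_to_phones(words):
    """Concatenate citation-form pronunciations, dropping word boundaries."""
    phones = []
    for word in words:
        phones.extend(PRON_DICT[word])
    return phones

# Input to the segmenter is the unsegmented phone string:
print(utterance_to_phones(["you", "want", "to"]))
# -> ['y', 'uw', 'w', 'aa', 'n', 't', 't', 'uw']
```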
