Abstract
We present a probabilistic model of phonotactics, the set of well-formed phoneme sequences in a language. Unlike most computational models of phonotactics (Hayes and Wilson, 2008; Goldsmith and Riggle, 2012), we take a fully generative approach, modeling a process where forms are built up out of subparts by phonologically-informed structure building operations. We learn an inventory of subparts by applying stochastic memoization (Johnson et al., 2007; Goodman et al., 2008) to a generative process for phonemes structured as an and-or graph, based on concepts of feature hierarchy from generative phonology (Clements, 1985; Dresher, 2009). Subparts are combined in a way that allows tier-based feature interactions. We evaluate our models’ ability to capture phonotactic distributions in the lexicons of 14 languages drawn from the WOLEX corpus (Graff, 2012). Our full model robustly assigns higher probabilities to held-out forms than a sophisticated N-gram model for all languages. We also present novel analyses that probe model behavior in more detail.
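As a rough illustration of the stochastic memoization idea mentioned in the abstract, the sketch below wraps a random generator so that earlier outputs are reused with probability proportional to how often they have been used before (a Chinese restaurant process). It is not the paper's model: the base generator `fresh_syllable`, the concentration value `alpha=1.0`, and the helper name `crp_memoize` are all assumptions made up for this example, and the paper's and-or graph over phonological features is not reproduced here.

```python
import random

def crp_memoize(sample_fresh, alpha, rng):
    """Wrap sample_fresh so earlier draws are reused with CRP probabilities.

    With probability count / (total + alpha) an existing value is reused;
    with probability alpha / (total + alpha) a fresh value is generated.
    """
    tables = []  # each entry is [value, reuse_count]

    def sample():
        total = sum(count for _, count in tables)
        r = rng.uniform(0.0, total + alpha)
        for entry in tables:
            r -= entry[1]
            if r < 0:
                entry[1] += 1
                return entry[0]
        value = sample_fresh()  # fall through: generate and store a new subpart
        tables.append([value, 1])
        return value

    return sample, tables

rng = random.Random(0)

def fresh_syllable():
    # Hypothetical base generator: a random consonant-vowel pair.
    return rng.choice("ptkbdg") + rng.choice("aiu")

sample_syllable, tables = crp_memoize(fresh_syllable, alpha=1.0, rng=rng)
draws = [sample_syllable() for _ in range(200)]
print(len(tables), "stored subparts for", len(draws), "draws")
```

The stored `tables` play the role of a learned inventory of reusable subparts: frequently reused values become increasingly likely to be drawn again, while `alpha` controls how readily new ones are created.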
Highlights
People have systematic intuitions about which sequences of sounds would constitute likely or unlikely words in their language: although blick is not an English word, it sounds like it could be, while bnick does not (Chomsky and Halle, 1965).
It is widely accepted that phonotactic judgments may be gradient: the nonsense word blick is better as a hypothetical English word than bwick, which is better than bnick (Hayes and Wilson, 2008; Albright, 2009; Daland et al., 2011).
To evaluate the contribution of feature dependency graphs, we compare our models with a baseline N-gram model, which represents phonemes as atomic units.
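To make the notion of an N-gram baseline over atomic phonemes concrete, here is a minimal sketch of a bigram model scored on unseen forms. It uses simple add-alpha smoothing, which is a stand-in chosen for brevity and is much cruder than the "sophisticated N-gram model" the abstract refers to. The toy lexicon, the `#` boundary symbol, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter

BOUNDARY = "#"  # hypothetical word-boundary symbol

def train_bigram(words, alpha=1.0):
    """Return an add-alpha smoothed bigram probability function.

    Each phoneme is treated as an unanalyzed symbol, with no feature structure.
    """
    vocab = {BOUNDARY}
    bigrams = Counter()
    contexts = Counter()
    for word in words:
        symbols = [BOUNDARY] + list(word) + [BOUNDARY]
        vocab.update(symbols)
        for prev, cur in zip(symbols, symbols[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    size = len(vocab)

    def prob(prev, cur):
        return (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * size)

    return prob

def log_prob(prob, word):
    """Log probability of a whole form, including both word boundaries."""
    symbols = [BOUNDARY] + list(word) + [BOUNDARY]
    return sum(math.log(prob(p, c)) for p, c in zip(symbols, symbols[1:]))

# Toy lexicon in an ad hoc one-character-per-phoneme spelling.
lexicon = ["blik", "brik", "blak", "klik", "nik"]
prob = train_bigram(lexicon)
print(log_prob(prob, "blik") > log_prob(prob, "bnik"))
```

Because the symbols are atomic, this model cannot generalize from one consonant cluster to a featurally similar one; it can only count the specific pairs it has seen, which is the limitation the comparison is meant to expose.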
Summary
People have systematic intuitions about which sequences of sounds would constitute likely or unlikely words in their language: although blick is not an English word, it sounds like it could be, while bnick does not (Chomsky and Halle, 1965). Such intuitions reveal that speakers are aware of the restrictions on sound sequences which can make up possible morphemes in their language, that is, the phonotactics of the language. It is widely accepted that phonotactic judgments may be gradient: the nonsense word blick is better as a hypothetical English word than bwick, which is better than bnick (Hayes and Wilson, 2008; Albright, 2009; Daland et al., 2011). Inspired by optimality-theoretic approaches to phonology, the most linguistically informed and successful models of such gradient judgments have been constraint-based, formulating the problem of phonotactic generalization in terms of restrictions that penalize illicit combinations of sounds (e.g., ruling out *bn-).
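The constraint-based view described above can be sketched as a weighted-penalty score in the maximum entropy style, where a form's probability falls off exponentially with its summed constraint violations. The two constraints, their weights, and the three-form candidate set below are invented for illustration; they are not taken from any cited model, and normalizing over a tiny candidate set is a simplification of normalizing over all possible forms.

```python
import math

# Hypothetical constraints: each maps a form to a violation count.
def no_bn_onset(form):
    return 1 if form.startswith("bn") else 0

def no_bw_onset(form):
    return 1 if form.startswith("bw") else 0

# Hypothetical weights: a larger weight means a more severe restriction.
CONSTRAINTS = [(no_bn_onset, 4.0), (no_bw_onset, 1.5)]

def penalty(form):
    """Weighted sum of constraint violations for a form."""
    return sum(weight * constraint(form) for constraint, weight in CONSTRAINTS)

def maxent_probs(candidates):
    """Probabilities proportional to exp(-penalty), normalized over candidates."""
    scores = {form: math.exp(-penalty(form)) for form in candidates}
    z = sum(scores.values())
    return {form: score / z for form, score in scores.items()}

probs = maxent_probs(["blik", "bwik", "bnik"])
for form, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(form, round(p, 3))
```

With these made-up weights the ordering comes out as blik, then bwik, then bnik, mirroring the gradient judgments described in the text; the ordering follows from the chosen weights rather than from any fitted data.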
More From: Transactions of the Association for Computational Linguistics