Abstract
Inferring formal grammars with a nonparametric Bayesian approach is one of the most powerful approaches for achieving high accuracy from unsupervised data. In this paper, mildly context-sensitive probabilities, called (k, l)-context-sensitive probabilities, are defined on context-free grammars (CFGs). Inferring CFGs in which the probabilities of rules are identified from contexts can be seen as a kind of dual approach to distributional learning, in which the contexts characterize the substrings. We can handle the data sparsity of the context-sensitive probabilities through the smoothing effect of hierarchical nonparametric Bayesian models such as Pitman–Yor processes (PYPs). We define the hierarchy of PYPs naturally by augmenting the infinite PCFGs. Blocked Gibbs sampling is known to be effective for inferring PCFGs. We show that, by modifying the inside probabilities, blocked Gibbs sampling can be applied to the (k, l)-context-sensitive probabilistic grammars. At the same time, we show that the time complexity for (k, l)-context-sensitive probabilities of a CFG is \(O(|V|^{l+3}|w|^3)\) for each sentence w, where V is the set of nonterminals. Since it is computationally too expensive to iterate a sufficient number of times, especially when |V| is not small, an alternative sampling algorithm is required. Therefore, we propose a new sampling method called composite sampling, in which the sampling procedure is separated into sub-procedures for nonterminals and for derivation trees. Finally, we demonstrate that the inferred (k, 0)-context-sensitive probabilistic grammars can achieve lower perplexities than other probabilistic language models such as PCFGs, n-grams, and HMMs.
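The smoothing effect attributed to PYPs above can be illustrated with the standard Chinese-restaurant predictive rule of a single Pitman–Yor process. The following is only a minimal sketch under stated assumptions, not the model defined in the paper: the function name `pyp_predictive`, the table layout, the rule labels, and the uniform base distribution are all hypothetical, and `base_prob` merely stands in for the parent distribution that a hierarchical model would back off to.

```python
def pyp_predictive(dish, tables, discount, strength, base_prob):
    """Predictive probability of `dish` under one Pitman-Yor restaurant.

    tables    : list of (dish, customer_count) pairs, one per occupied table
    discount  : PYP discount d, with 0 <= d < 1
    strength  : PYP strength theta, with theta > -d
    base_prob : function giving the base (back-off) probability of a dish
    """
    n = sum(count for _, count in tables)  # total number of customers
    t = len(tables)                        # total number of occupied tables
    # Mass from existing tables serving this dish, each discounted by d.
    seated = sum(count - discount for d, count in tables if d == dish)
    # Mass reserved for opening a new table; it is spread by the base distribution.
    new_table_mass = strength + discount * t
    return (seated + new_table_mass * base_prob(dish)) / (strength + n)


# Hypothetical rule right-hand sides with a uniform base distribution.
RULES = ["NP VP", "VP", "V NP", "N"]
uniform = lambda rule: 1.0 / len(RULES)

# Two tables serve "NP VP" and one serves "VP"; "V NP" and "N" are unseen.
tables = [("NP VP", 3), ("NP VP", 1), ("VP", 2)]
probs = {r: pyp_predictive(r, tables, 0.5, 1.0, uniform) for r in RULES}
```

In this sketch the unseen rules still receive nonzero probability, because the discounted mass is redistributed through the base distribution. This is the sense in which the back-off smooths sparse counts.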