Abstract

This article uses semi-supervised Expectation Maximization (EM) to learn lexico-syntactic dependencies, i.e. associations between words and the structures that occur with them. Because of the Zipfian distribution of words in language, such dependencies are extremely sparse in labelled data, and unlabelled data are the only practical source for learning them. Specifically, we learn sparse lexical parameters of a generative parsing model (a Probabilistic Context-Free Grammar, PCFG) that is initially estimated over the Penn Treebank. Our lexical parameters are similar to supertags: they are fine-grained and encode complex structural information at the pre-terminal level. Our goal is to use unlabelled data to learn these parameters for words that are rare or unseen in the labelled data. We obtain large error reductions (up to 17.5%) in parsing ambiguous structures associated with unseen verbs, the most important case of learning lexico-structural dependencies, resulting in a statistically significant improvement in the labelled bracketing score of the treebank PCFG. Our semi-supervised method incorporates structural and lexical priors from the labelled data to guide estimation from unlabelled data, and is the first successful use of semi-supervised EM to improve a generative structured model already trained on large labelled data. The method scales well to larger amounts of unlabelled data, and it also gives substantial error reductions (up to 11.5%) for models trained on smaller amounts of labelled data, making it relevant to low-resource languages with small treebanks as well.
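As a rough illustration of the estimation step described above, the sketch below shows a MAP-style M-step in which counts from the labelled treebank act as a prior (pseudo-counts) added to expected rule counts gathered from unlabelled data. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the lexical parameters are multinomial PCFG rule probabilities P(rhs | lhs), and it abstracts the E-step (expected counts over unlabelled text, typically computed with the inside-outside algorithm) into the expected_counts argument. All names, rules, and counts here are hypothetical.

    from collections import defaultdict

    def m_step(labelled_counts, expected_counts, prior_weight=1.0):
        """MAP re-estimation of P(rhs | lhs): labelled-data counts act as
        Dirichlet-style pseudo-counts added to the EM expected counts."""
        combined = defaultdict(float)
        for rule, count in labelled_counts.items():
            combined[rule] += prior_weight * count   # prior from the treebank
        for rule, count in expected_counts.items():
            combined[rule] += count                  # E-step expectations
        totals = defaultdict(float)
        for (lhs, _rhs), count in combined.items():
            totals[lhs] += count                     # normalizer per left-hand side
        return {rule: count / totals[rule[0]] for rule, count in combined.items()}

    # Toy usage: rules are (lhs, rhs) pairs; counts are illustrative only.
    labelled = {("VP", ("V-trans", "NP")): 50.0, ("VP", ("V-intrans",)): 10.0}
    expected = {("VP", ("V-trans", "NP")): 2.0, ("VP", ("V-intrans",)): 18.0}
    print(m_step(labelled, expected))

Raising prior_weight keeps the re-estimated model close to the treebank estimates, while lowering it lets the unlabelled data dominate; for the sparse lexical parameters of rare and unseen words, the prior contributes little and the unlabelled counts do most of the work.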
