Abstract
This article uses semi-supervised Expectation Maximization (EM) to learn lexico-syntactic dependencies, i.e., associations between words and the structures that occur with them. Because word frequencies in language follow Zipfian distributions, such dependencies are extremely sparse in labelled data, and unlabelled data are the only source for learning them. Specifically, we learn sparse lexical parameters of a generative parsing model (a Probabilistic Context-Free Grammar, PCFG) that is initially estimated over the Penn Treebank. Our lexical parameters are similar to supertags: they are fine-grained and encode complex structural information at the pre-terminal level. Our goal is to use unlabelled data to learn these parameters for words that are rare or unseen in the labelled data. We obtain large error reductions (up to 17.5%) in parsing ambiguous structures associated with unseen verbs, the most important case of learning lexico-structural dependencies, and this yields a statistically significant improvement in the labelled bracketing score of the treebank PCFG. Our semi-supervised method incorporates structural and lexical priors from the labelled data to guide estimation from unlabelled data, and is the first successful use of semi-supervised EM to improve a generative structured model already trained over large labelled data. The method scales well to larger amounts of unlabelled data, and also gives substantial error reductions (up to 11.5%) for models trained on smaller amounts of labelled data, which makes it relevant to low-resource languages with small treebanks as well.
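
The general idea of letting labelled-data priors guide EM over unlabelled data can be illustrated with a deliberately small sketch. It is not the PCFG model described above: it estimates a flat distribution P(frame | word) over two hypothetical frames, and every name in it (`semi_supervised_em`, `frame_prior`, `prior_weight`, the nonsense verb "blorf", the likelihood values) is an assumption made for illustration. Labelled counts and a weighted frame prior act as fixed pseudo-counts, and each unlabelled occurrence of a word contributes fractional counts in the E-step.

```python
def semi_supervised_em(labelled_counts, frame_prior, observations,
                       prior_weight=1.0, iterations=20):
    """Toy semi-supervised EM for P(frame | word).

    labelled_counts: {word: {frame: count}} taken from labelled data.
    frame_prior:     {frame: probability}, a word-independent prior
                     (assumed here to come from the labelled data).
    observations:    list of (word, {frame: likelihood}) pairs, where the
                     likelihood stands in for how well the rest of the
                     sentence fits that frame (hypothetical values).
    """
    frames = list(frame_prior)
    words = set(labelled_counts) | {word for word, _ in observations}

    def normalise(counts):
        total = sum(counts.values())
        return {f: c / total for f, c in counts.items()}

    # Fixed pseudo-counts: labelled counts plus the weighted prior.
    base = {
        w: {f: labelled_counts.get(w, {}).get(f, 0.0)
               + prior_weight * frame_prior[f] for f in frames}
        for w in words
    }
    probs = {w: normalise(base[w]) for w in words}

    for _ in range(iterations):
        expected = {w: dict(base[w]) for w in words}
        # E-step: distribute each unlabelled occurrence over the frames.
        for word, likelihoods in observations:
            joint = {f: probs[word][f] * likelihoods.get(f, 0.0)
                     for f in frames}
            z = sum(joint.values())
            if z == 0.0:
                continue  # no frame is compatible with this occurrence
            for f in frames:
                expected[word][f] += joint[f] / z
        # M-step: re-estimate from pseudo-counts plus expected counts.
        probs = {w: normalise(expected[w]) for w in words}
    return probs


# "eat" is seen in the labelled data; "blorf" occurs only in unlabelled data.
labelled = {"eat": {"transitive": 9, "intransitive": 1}}
prior = {"intransitive": 0.6, "transitive": 0.4}
unlabelled = (
    [("blorf", {"transitive": 0.9, "intransitive": 0.1})] * 8
    + [("blorf", {"transitive": 0.2, "intransitive": 0.8})] * 2
)
probs = semi_supervised_em(labelled, prior, unlabelled)
```

In this toy run the unseen verb starts from the prior, which favours the intransitive frame, and the unlabelled occurrences shift its estimate towards the transitive frame, while the estimate for "eat" stays anchored to its labelled counts because no unlabelled evidence is supplied for it.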