Abstract

We examine the impact of domain on parse selection accuracy, in the context of precision HPSG parsing using the English Resource Grammar, using two training corpora and four test corpora and evaluating using exact tree matches as well as dependency F-scores. In addition to determining the relative impact of in- vs. cross-domain parse selection training on parser performance, we propose strategies to avoid cross-domain performance penalty when limited in-domain data is available. Our work supports previous research showing that in-domain training data significantly improves parse selection accuracy, and that it provides greater parser accuracy than an out-of-domain training corpus of the same size, but we verify experimentally that this holds for a handcrafted grammar, observing a 10---16% improvement in exact match and 5---6% improvement in dependency F-score by using a domain-matched training corpus. We also find it is possible to considerably improve parse selection accuracy through construction of even small-scale in-domain treebanks, and learning of parse selection models over in-domain and out-of-domain data. Naively adding an 11,000-token in-domain training corpus boosts dependency F-score by 2---3% over using solely out-of-domain data. We investigate more sophisticated strategies for combining data from these sources to train models: weighted linear interpolation between the single-domain models, and training a model from the combined data, optionally duplicating the smaller corpus to give it a higher weighting. The most successful strategy is training a monolithic model after duplicating the smaller corpus, which gives an improvement over a range of weightings, but we also show that the optimal value for these parameters can be estimated on a case-by-case basis using a cross-validation strategy. This domain-tuning strategy provides a further performance improvement of up to 2.3% for exact match and 0.9% for dependency F-score compared to the naive combination strategy using the same data.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.