Abstract

We present a technique to improve out-of-domain statistical parsing by reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with target domain data. We also investigate the impact of guiding out-of-domain parsing with predicted part-of-speech tags. We provide an evaluation for French, and obtain improvements in performance for both non-technical and technical target domains. Though the improvements over a strong baseline are slight, an interesting result is that the proposed techniques also improve parsing performance on the source domain, contrary to techniques such as self-training, thus leading to a more robust parser overall. We also describe new target domain evaluation treebanks, freely available, that comprise a total of about 3,000 annotated sentences from the medical domain, regional newspaper articles, French Europarl and French Wikipedia.
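The core idea of the abstract, replacing terminal symbols (word forms) with unsupervised word cluster IDs before training and parsing, can be illustrated with a minimal sketch. This is not the authors' code; the cluster lexicon, IDs, and fallback token below are hypothetical stand-ins for clusters one might learn (e.g. with Brown clustering) from a large newspaper corpus augmented with target-domain text.

```python
# Hypothetical cluster lexicon mapping word forms to cluster IDs.
# Rare or domain-specific words that share a cluster with frequent
# words inherit their distributional statistics, reducing sparseness.
CLUSTERS = {
    "hôpital": "C0110",
    "clinique": "C0110",   # same cluster as "hôpital"
    "journal": "C0111",
}

# Fallback ID for out-of-vocabulary words (an assumption of this sketch;
# a real system might back off to a suffix-based or POS-based signature).
UNKNOWN = "C_UNK"

def clusterize(tokens):
    """Replace each terminal symbol with its unsupervised cluster ID."""
    return [CLUSTERS.get(tok.lower(), UNKNOWN) for tok in tokens]

print(clusterize(["Hôpital", "clinique", "inconnu"]))
# ['C0110', 'C0110', 'C_UNK']
```

After this substitution, the PCFG-LA grammar is estimated over cluster IDs rather than raw word forms, so a medical term unseen in the newspaper training data can still receive reliable statistics through its cluster.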
