Abstract

This paper focuses on improving statistically extracted phrase lists by applying word alignment approaches to bitext. Such phrase lists serve several tasks, such as the compilation of terminology or translation databases. Our investigations are based on the assumption that word alignment favors well-formed phrase structures over irregular text segments. If this is the case, word alignment will filter out irregular structures from automatically generated phrase lists, so that a phrase list with higher precision can be compiled. Furthermore, word alignment approaches can be used to identify additional multi-word units, e.g. multi-word cognates. Our investigations focus on a Swedish/English text corpus that has been aligned with the Uppsala Word Aligner (UWA). Finally, we describe and apply three approaches to evaluating the automatically generated phrase lists: a comparison of the results with existing reference data (prior reference), an evaluation against given syntactic patterns (prior reference patterns), and a manual evaluation of sample data (posterior reference). The evaluations of the extraction of phrasal terms in English substantiate the assumption: precision improves significantly with little loss in recall.
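
The filtering idea and the prior-reference evaluation can be illustrated with a minimal sketch. It is not the UWA implementation: the function names `filter_by_alignment` and `precision_recall` are hypothetical, the phrase lists are invented toy data, and exact string matching stands in for whatever matching the aligner actually performs.

```python
def filter_by_alignment(candidates, aligned_units):
    """Keep only candidate phrases that also occur as units linked by an aligner.

    Assumption: a case-insensitive exact match stands in for the real
    alignment-based filter.
    """
    aligned = {unit.lower() for unit in aligned_units}
    return [c for c in candidates if c.lower() in aligned]


def precision_recall(extracted, reference):
    """Compare an extracted phrase list against a prior reference list."""
    extracted_set, reference_set = set(extracted), set(reference)
    correct = extracted_set & reference_set
    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    recall = len(correct) / len(reference_set) if reference_set else 0.0
    return precision, recall


# Invented toy data, for illustration only.
candidates = ["fuel tank", "of the fuel", "gear box", "tank is the", "oil filter"]
aligned_units = ["fuel tank", "gear box", "brake pedal"]
reference = ["fuel tank", "gear box", "oil filter"]

filtered = filter_by_alignment(candidates, aligned_units)

print("before filtering:", precision_recall(candidates, reference))
print("after filtering: ", precision_recall(filtered, reference))
```

In this toy case the irregular segments ("of the fuel", "tank is the") are dropped because the aligner never links them, which raises precision, while the loss of "oil filter" lowers recall.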
