Abstract

Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a great need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.


Introduction

Automatic term extraction (ATE), often referred to as automatic term recognition (ATR), is the automated process of identifying terms in specialised texts, where terms can be described as the linguistic representations of domain-specific concepts. ATE is meant to alleviate the time- and effort-consuming task of manual terminology management by providing a ranked list of candidate terms identified in a given domain-specific corpus. It has become an important pre-processing step in many natural language processing (NLP) tasks (Zhang et al. 2018), such as automatic indexing (Jacquemin and Bourigault 2003), automatic text summarisation (Zhang et al. 2004) and machine translation (Wolf et al. 2011). To evaluate ATE against human performance, a manually annotated gold standard is needed, which requires a lot of time and effort to create and often has low inter-annotator agreement due to the lack of a clear boundary between terminology and general language. Such datasets nevertheless remain invaluable for accurate evaluation and are needed as training data with the current evolution towards supervised learning and deep learning methodologies (Drouin et al. 2018a, b).
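
To make the evaluation setting concrete, the sketch below compares a ranked list of candidate terms, as produced by a generic ATE system, against a manually annotated gold-standard term set using precision, recall and F1. It is a minimal illustration of why such a gold standard is a prerequisite for evaluation, not the evaluation protocol used in this paper; the term lists and the evaluate_ate function are hypothetical examples.

```python
# Minimal sketch (hypothetical data, not the authors' evaluation protocol):
# scoring a ranked list of extracted candidate terms against a gold standard.

def evaluate_ate(candidates, gold_terms, k=None):
    """Compute precision, recall and F1 of extracted candidates against a gold set.

    candidates: ranked list of candidate terms (best-scoring first)
    gold_terms: set of manually annotated gold-standard terms
    k: if given, evaluate only the top-k candidates (a precision@k-style cut-off)
    """
    selected = candidates[:k] if k else candidates
    true_positives = [c for c in selected if c in gold_terms]
    precision = len(true_positives) / len(selected) if selected else 0.0
    recall = len(true_positives) / len(gold_terms) if gold_terms else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical ranked output of an ATE system on a domain-specific corpus
    candidates = ["myocardial infarction", "heart", "stent placement", "patient outcome"]
    # Hypothetical manually annotated gold-standard terms for the same corpus
    gold = {"myocardial infarction", "stent placement", "coronary artery"}
    p, r, f = evaluate_ate(candidates, gold, k=3)
    print(f"P@3={p:.2f}  R={r:.2f}  F1={f:.2f}")
```

Because candidate terms are ranked, evaluation is typically reported at several cut-off points, which is why the sketch exposes the k parameter.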
