Evaluation of five single-word term recognition methods on a legal English corpus

María José Marín

doi:10.3366/cor.2014.0052

Abstract

Specialised texts are characterised by, amongst other features, the presence of terminology which conveys domain-specific concepts that are essential for the specialist who is interested in analysing such texts. Automatic Term Recognition (ATR) methods are employed to identify those terms automatically, which is especially helpful given the large size of present-day corpora. However, they tend to concentrate on the identification of Multi-Word Terms (MWTs), neglecting Single-Word Terms (SWTs) to a certain extent. This might be related to the greater number of the former found in fields such as biomedicine. So far as legal English is concerned, though, testing has shown that SWTs represent 65.22 percent of the items in the specialised glossary employed for the evaluation of the ATR methods examined herein. This paper presents the evaluation of five SWT recognition methods, namely, those of Chung (2003), Drouin (2003), Kit and Liu (2008), Keywords (2008), and TF-IDF (term frequency-inverse document frequency). These were tested on the United Kingdom Supreme Court Corpus (UKSCC), a legal corpus of 2.6 million words which was compiled for this purpose. The results indicate that Drouin's TermoStat software is the best-performing method, achieving 73.45 percent precision on the top 2,000 candidate terms.
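As an illustration of the evaluation measure mentioned above, precision on the top N candidate terms can be read as the share of the N highest-ranked candidates that also appear in the reference glossary. The following Python sketch rests on that reading; the function name `precision_at_k`, the case-insensitive exact matching, and the toy data are assumptions made for illustration and are not taken from the paper or from the UKSCC.

```python
def precision_at_k(ranked_candidates, glossary, k):
    """Share of the top-k ranked candidates that appear in the gold glossary.

    Assumption: a candidate counts as a true term if it matches a glossary
    entry exactly, ignoring case. The paper may use a different criterion.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_candidates[:k]
    if not top_k:
        return 0.0
    gold = {term.lower() for term in glossary}
    hits = sum(1 for cand in top_k if cand.lower() in gold)
    return hits / len(top_k)


# Toy data, invented for illustration only (not drawn from the UKSCC).
ranked = ["appellant", "tribunal", "however", "tort", "said"]
glossary = {"appellant", "tribunal", "tort", "estoppel"}

print(f"{precision_at_k(ranked, glossary, 5):.2%}")  # prints 60.00%
```

Under this reading, a figure such as 73.45 percent on the top 2,000 candidates would correspond to calling the function with `k=2000` on a method's ranked output.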

Full Text