Semi-automatic extraction of multiword terms from domain-specific corpora

Vesna Pajić, Staša Vujičić Stanković, Ranka Stanković, Miloš Pajić

doi:10.1108/el-06-2017-0128

Abstract

Purpose: A hybrid approach is presented, which combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts.

Design/methodology/approach: The method is designed to be domain and language independent, focusing on languages with rich morphology. Here, it is used for extracting multiword terms from texts in Serbian, belonging to the agricultural engineering domain, as a use case. Predefined syntactic structures were used for multiword terms. For each structure, a finite state transducer was developed, which recognizes text sequences having that structure and outputs the sequence in a normalized form, so that different inflectional forms of the same multiword term can be counted properly. Term candidates were further filtered by their frequencies and evaluated by two domain experts.

Findings: By using language resources, such as electronic dictionaries and grammars, 928 multiword terms were extracted out of 1,523 multiword terms that were recognized as candidates from a corpus having 42,260 different simple word forms; 870 of these were new, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich the dictionary.

Originality/value: The paper presents a methodology that can significantly contribute to the development of terminology lexicons in different areas. In this particular use case, some important agricultural engineering concepts were extracted from the text, but this approach could be used for other domains and languages as well.
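As a rough illustration of the pipeline the abstract describes (match predefined syntactic structures, normalize the matched sequence, then filter by frequency), the following Python sketch may help. It is not the authors' implementation: it swaps the finite state transducers for a plain part-of-speech pattern matcher, and the tag set, the two patterns, the English toy tokens, and the function names are all assumptions made for illustration. Joining lemmas is also a simplification of the normalization step, which in a morphologically rich language would have to restore agreement between the words.

```python
from collections import Counter

# Hypothetical syntactic structures over POS tags (A = adjective, N = noun).
PATTERNS = [("A", "N"), ("N", "N")]


def extract_candidates(tagged_tokens, patterns=PATTERNS):
    """Yield a normalized (lemma-joined) string for every token window
    whose POS tags match one of the given patterns."""
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            if tuple(pos for _, _, pos in window) == pattern:
                # Simplified normalization: join the lemmas of the window.
                yield " ".join(lemma for _, lemma, _ in window)


def filter_by_frequency(candidates, min_count=2):
    """Keep only candidates seen at least min_count times."""
    counts = Counter(candidates)
    return {term: c for term, c in counts.items() if c >= min_count}


# Toy input, assumed to be already tagged: (surface form, lemma, POS).
tokens = [
    ("rotary", "rotary", "A"), ("tillers", "tiller", "N"),
    ("and", "and", "C"), ("a", "a", "D"),
    ("rotary", "rotary", "A"), ("tiller", "tiller", "N"),
    ("need", "need", "V"),
    ("soil", "soil", "N"), ("samples", "sample", "N"),
]

terms = filter_by_frequency(extract_candidates(tokens))
print(terms)
```

In this toy input, "rotary tillers" and "rotary tiller" are counted as the same candidate because both normalize to the same lemma sequence, which is the reason the abstract gives for outputting a normalized form. The surviving candidates would then go to domain experts for evaluation, the manual step that makes the approach semi-automatic.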

Journal: The Electronic Library
Publication Date: May 24, 2018
Citations: 7

Similar Papers

The C-value/NC-value domain-independent method for multi-word term extraction
Katerina T Frantzi ... Sophia Ananiadou
Journal of Natural Language Processing | VOL. 6
01 Jan 1998

Automatic extraction of Arabic multi-word terms
K Al Khatib ... A Badarneh
01 Oct 2010

Compound Terms and Their Multi-word Variants: Case of German and Russian Languages
Elizaveta Clouet ... Béatrice Daille
01 Jan 2014

French-English Terminology Extraction from Comparable Corpora
Béatrice Daille ... Emmanuel Morin
01 Jan 2004