Abstract

In this paper, we present a method for domain-specific multi-word terminology extraction. Our hybrid approach to automatic term identification combines statistical and linguistic techniques. Our main goal is to reduce the human effort required in term selection as much as possible while providing a broad and representative terminology of a domain. We emphasize the identification of multi-word terms that are verb or noun phrases, as well as neologisms and technical jargon. Our architecture applies the term frequency-inverse document frequency (TF-IDF) algorithm to a domain-specific textual corpus in order to measure the importance of each unit in it. We also filter out terms nested within a candidate term, taking into account how often each nested term occurs on its own in the corpus. In addition, the extracted terms are filtered using a stop-word list and linguistic criteria. To further reduce the number of candidate terms and obtain accurate and precise terminologies, our method automatically validates them against a general-purpose corpus. Our study, based on a small corpus from the vibration-based condition monitoring domain, shows that most extracted terms correspond well to the concepts and notions of condition monitoring.
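The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tiny stop-word list, the smoothed IDF formula, and the rule used to estimate how often a nested term occurs on its own are all assumptions made for illustration, and the linguistic (part-of-speech) criteria are omitted entirely.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption; a real list would be far longer).
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "is", "to", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into (possibly hyphenated) words."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def candidate_ngrams(tokens, n_min=2, n_max=3):
    """Yield word n-grams that neither start nor end with a stop word."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            yield " ".join(gram)


def tfidf_scores(documents):
    """Return (summed TF-IDF score, corpus frequency) for each candidate."""
    per_doc = [Counter(candidate_ngrams(tokenize(d))) for d in documents]
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())
    n_docs = len(documents)
    scores = Counter()
    for counts in per_doc:
        total = sum(counts.values()) or 1
        for term, freq in counts.items():
            # Smoothed IDF (an assumed variant, chosen to avoid zero weights).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            scores[term] += (freq / total) * idf
    return scores, sum(per_doc, Counter())


def drop_nested(freqs, min_standalone=1):
    """Drop terms that rarely occur outside a longer candidate term.

    Standalone frequency is approximated (assumption) as the term's
    frequency minus that of its most frequent containing candidate.
    """
    kept = {}
    for term, freq in freqs.items():
        containing = [f for other, f in freqs.items()
                      if other != term and f" {term} " in f" {other} "]
        if freq - max(containing, default=0) >= min_standalone:
            kept[term] = freq
    return kept


def extract_terms(domain_docs, general_docs, top_k=10):
    """Rank domain candidates, discarding nested and general-language ones."""
    scores, freqs = tfidf_scores(domain_docs)
    general = set()
    for doc in general_docs:
        general.update(candidate_ngrams(tokenize(doc)))
    kept = drop_nested(freqs)
    ranked = [(t, s) for t, s in scores.most_common()
              if t in kept and t not in general]
    return ranked[:top_k]
```

Calling `extract_terms(domain_docs, general_docs)` with two lists of strings returns the highest-scoring surviving candidates as `(term, score)` pairs. Because plain n-grams stand in for the verb and noun phrase detection, the output of this sketch will contain more noise than a linguistically filtered candidate list.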

Highlights

  • Every scientific area makes use of a special vocabulary to convey specialized concepts by means of technical language

  • We present a method for the automatic extraction of multi-word terms from machine-readable textual corpora

  • Our algorithm emphasizes the identification of multi-word terms that are verb or noun phrases

Introduction

Every scientific area makes use of a special vocabulary to convey specialized concepts by means of technical language. Terminologies, which are the lexical components of specialized languages, are valuable because they condense a mass of information into single-word, multi-word, or compound-word units. They are a crucial component of both technical and scientific writing, ensuring more effective communication. Identifying and documenting terminology is a time-consuming task that requires manual validation by humans. It is also subjective, depending on the experience and criteria of the experts who validate it.
