Abstract

The aim of the paper is to present a methodological framework for the development of an English-Lithuanian bilingual termbase in the cybersecurity domain, which can be applied as a model for other language pairs and other specialised domains. It is argued that the presented methodological approach can ensure the creation of high-quality bilingual termbases even with limited available resources. The paper addresses the methods and problems of dataset (corpora) compilation, terminology annotation, automatic bilingual term extraction (BiTE) and alignment, knowledge-rich context extraction, and linguistic linked open data (LLOD) technologies. It presents theoretical considerations as well as arguments for the effectiveness of the described methods. The theoretical analysis and a pilot study suggest that: 1) a combination of parallel and comparable corpora makes it possible to considerably expand the amount and variety of data sources that can be used for terminology extraction, which is especially important for less-resourced languages that often lack parallel data; 2) deep learning systems trained on manually annotated data (gold standard corpora) allow effective automation of the extraction of terminological data and metadata, which makes it possible to update termbases regularly with minimal manual input; 3) LLOD technologies make it possible to integrate the terminological data into the global linguistic data ecosystem and make it reusable, searchable and discoverable across the Web.

Highlights

  • The aim of the paper is to present a methodological framework for the development of an English-Lithuanian bilingual termbase in the cybersecurity domain

  • We argue that the methodology can be applied as a model for other language pairs and other specialised domains, as it ensures the creation of high-quality bilingual termbases even with limited available resources

  • The analysis of related research, as well as the pilot study on terminology extraction performed by the authors, suggests that the presented methodological framework would considerably enhance the quality of termbases because it allows:

Summary

Introduction

The aim of the paper is to present a methodological framework for the development of an English-Lithuanian bilingual termbase in the cybersecurity domain. The cybersecurity (CS) domain was chosen for several reasons. This area is highly relevant in the current information age, and the COVID-19 pandemic, which has accelerated the digital transformation of state institutions and businesses, has further increased its significance. Cyber awareness and cyber hygiene have gained utmost importance for governmental institutions and companies, and for every user of the Internet. The termbase of this domain is believed to contribute to a better understanding of cyber threats and data protection measures in Lithuania. The cybersecurity domain is dynamic: new concepts are constantly developed and receive new terminological designations, predominantly in English, and counterparts of these designations are constantly created in other languages. A termbase based on generalised empirical data is believed to help target users select the most appropriate terminology for their needs: drafting and translating official documents, technical writing, scientific and educational writing, etc.

