In the melting pot of web‐crawled texts: The challenges of extracting English words from Croatian corpora

Jasmina Jelčić Čolakovac, Mirjana Borucinsky

doi:10.1111/ijal.12485

Abstract

This paper focuses on English words and phrases used in Croatian which, unlike loanwords, have not undergone major adaptations at the orthographic, phonetic, or other levels, apart from being influenced by the inflectional system of the recipient language. A list of English words in Croatian corpora was compiled using automatic algorithmic extraction, corpus query language in Sketch Engine, and manual word list evaluation, with the end goal of publishing the first comprehensive online database of English words in Croatian. The ENGRI corpus of Croatian was created by a web crawling procedure and used together with the existing Croatian hrWaC 2.2 RFTagger corpus to produce a list of English words and phrases. In this paper, word list compilation issues are discussed in relation to both general issues encountered in the study of interlingual lexical types (such as false cognates, antonomasia, and polysemy) and Croatian‐specific language properties, such as its inflectional system and diacritical marks. In conclusion, we propose that manual evaluation is an indispensable method and a necessary complement to computational linguistic tools in the creation of word lists and databases of foreign words in other languages.

Full Text