Collection and evaluation of lexical complexity data for Russian language using crowdsourcing

Aleksei V Abramov, Vladimir V Ivanov

doi:10.22363/2687-0088-30118


Open Access

https://doi.org/10.22363/2687-0088-30118


Abstract

Estimating word complexity with binary or continuous scores is a challenging task that has been studied for several domains and natural languages. This task is commonly referred to as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Correct evaluation of word complexity can be an important step in many Lexical Simplification pipelines. Earlier works have usually presented methodologies of lexical complexity estimation with several restrictions: they relied on hand-crafted features correlated with word complexity, performed feature engineering to describe target words with features such as the number of hypernyms, the count of consonants, and the Named Entity tag, and ran evaluations with carefully selected target audiences. More recent works have investigated transformer-based models, which can also extract features from the surrounding context. However, the majority of papers have been devoted to pipelines for the English language, and few have transferred them to other languages such as German, French, and Spanish. In this paper we present a dataset of lexical complexity in context based on the Russian Synodal Bible, collected using a crowdsourcing platform. We describe a methodology for collecting the data using a 5-point Likert scale for annotation, present descriptive statistics, and compare the results with analogous work for the English language. We evaluate a linear regression model as a baseline for predicting word complexity from handcrafted features and from fastText and ELMo embeddings of target words. The result is a corpus consisting of 931 distinct words that are used in 3,364 different contexts.
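The annotation and baseline steps described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the linear mapping of Likert labels 1..5 onto [0, 1], the use of a single handcrafted feature (word length), the function names, and the toy ratings. The paper's actual baseline uses a richer feature set and embeddings.

```python
from statistics import mean


def likert_to_score(rating: int) -> float:
    """Map a 5-point Likert label (1..5) onto [0, 1].

    Assumed mapping: 1 -> 0.0, 2 -> 0.25, ..., 5 -> 1.0.
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be in 1..5, got {rating}")
    return (rating - 1) / 4


def aggregate(ratings: list[int]) -> float:
    """Average several annotators' labels into one continuous complexity score."""
    return mean(likert_to_score(r) for r in ratings)


def fit_simple_ols(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature has zero variance")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


if __name__ == "__main__":
    # Hypothetical annotations: target word -> Likert labels from several annotators.
    annotations = {
        "дом": [1, 1, 2],
        "скиния": [4, 5, 4],
        "благоволение": [3, 4, 3],
    }
    words = list(annotations)
    scores = [aggregate(annotations[w]) for w in words]
    lengths = [float(len(w)) for w in words]  # single handcrafted feature
    slope, intercept = fit_simple_ols(lengths, scores)
    for w, s in zip(words, scores):
        print(f"{w}: observed={s:.2f} predicted={slope * len(w) + intercept:.2f}")
```

Averaging normalized labels gives each word-in-context a continuous target, which is what lets a regression model (rather than a classifier) serve as the baseline.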


Journal: Russian Journal of Linguistics
Publication Date: Jun 29, 2022
Citations: 1
License type: CC BY 4.0


Similar Papers

Evaluation and Analysis of Word Embedding Vectors of English Text Using Deep Learning Technique
Jaspreet Singh ... Rajinder Singh
01 Jan 2018

AI-KU at SemEval-2016 Task 11: Word Embeddings and Substring Features for Complex Word Identification
Onur Kuru
01 Jan 2015

Predicting lexical complexity in English texts: the Complex 2.0 dataset
Matthew Shardlow ... Richard Evans
Language Resources and Evaluation | VOL. 56
23 Mar 2022

Slovene and Croatian word embeddings in terms of gender occupational analogies
Matej Ulčar ... Senja Pollak
Slovenščina 2.0: empirical, applied and interdisciplinary research | VOL. 9
06 Jul 2021
