Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform

Rafał Jaworski, Ivan Dunđer, Sanja Seljan

doi:10.3390/info14040226

Open Access

https://doi.org/10.3390/info14040226

Abstract

Parallel corpora are widely used in natural language processing and translation because they provide crucial multilingual information. They are used to train machine translation systems, compile dictionaries, and generate inter-language word embeddings. Many corpora are publicly available; however, support for some languages is still limited. In this paper, the authors present a framework for collecting, organizing, and storing corpora. The solution was originally designed to obtain data for less-resourced languages, but it also proved to work very well for collecting high-value domain-specific corpora. The scenario is based on the collective work of a group of people who are motivated through gamification. The rules of the game encourage participants to submit large resources, and a peer-review process ensures quality. More than four million translated segments have been collected so far.

Journal: Information
Publication Date: Apr 6, 2023
License type: CC BY 4.0
