In recent years, transformer models have achieved great success in natural language processing (NLP) tasks. Most current results are obtained with monolingual transformer models, where the model is pre-trained on a single-language unlabelled text corpus and then fine-tuned for the specific downstream task. However, the cost of pre-training a new transformer model is prohibitive for most languages. In this work, we propose a cost-effective transfer learning method to adapt a strong source-language model, pre-trained on a large monolingual corpus, to a low-resource target language. Using the English XLNet language model, we demonstrate performance competitive with mBERT and a pre-trained target-language model on the cross-lingual sentiment (CLS) dataset and on a new sentiment analysis dataset for the low-resource language Tigrinya. With only 10k examples of the Tigrinya sentiment analysis dataset, English XLNet achieved a 78.88% F1-score, outperforming BERT and mBERT by 10% and 7%, respectively. More interestingly, fine-tuning the English XLNet model on the CLS dataset showed promising results compared to mBERT and even outperformed mBERT on one of the Japanese datasets.
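To illustrate the transfer setup described above, the following is a minimal sketch (not the authors' code) of fine-tuning an English XLNet checkpoint for target-language sentiment classification with the Hugging Face Transformers library; the dataset files, column names, number of labels, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: fine-tune English XLNet on a target-language sentiment dataset.
# Assumes hypothetical CSV files with "text" and "label" columns (e.g. Tigrinya reviews).
from datasets import load_dataset
from transformers import (
    XLNetTokenizerFast,
    XLNetForSequenceClassification,
    TrainingArguments,
    Trainer,
)

dataset = load_dataset(
    "csv", data_files={"train": "train.csv", "test": "test.csv"}
)

# Source-language (English) pre-trained model, reused for the target language.
tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=2  # binary sentiment (assumed)
)

def tokenize(batch):
    # Tokenize target-language text with the English XLNet tokenizer.
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    )

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="xlnet-sentiment",
    num_train_epochs=3,            # illustrative values, not the paper's settings
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```

The same fine-tuning loop would apply to the CLS dataset or any other labelled target-language corpus; only the data files and the number of labels change.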