High-resource Languages Research Articles

With the advent of deep learning, Machine Translation (MT) migrated from Statistical Machine Translation (SMT), which dominated MT for a few decades, towards Neural Machine Translation (NMT) architectures. NMT gradually made its way into Indian MT research, and many works have appeared for various language pairs. Numerous NMT architectures circulate in the international and national research community, and many claim to be state of the art. Although NMT for Indic languages (ILNMT) gives better results for language pairs with large speaker populations, translation quality remains low due to a lack of significant resources. Automated machine translation models are unavailable for some less widely spoken Indic languages such as Kashmiri and Dogri. Hence, there is increasing demand in research to address the challenges of developing usable MT models even when only minuscule training data is available. Based on corpus availability, languages are categorized into High Resource Languages (HRLs), Low Resource Languages (LRLs), and Zero Resource Languages (ZRLs), and Indic languages fall into all three categories. The aim of this survey is to serve as a collective source of information on the predominant ILNMT architectures, the toolkits available for building NMT models, and the pre-trained language models needed by researchers who contribute to the ILNMT research community. The survey covers ILNMT architectures for different Indic languages, e.g., Hindi and Tamil (HRLs), Kannada and Marathi (LRLs), and Sinhala and Nepali (ZRLs). There are a few language-specific survey papers on ILNMT, and this is one of the first surveys to gather all of this information under one canopy.

In multilingual societies like India, mixing the native language with English has become common in social media conversations. Further, due to the government’s digitization push, more people from rural India are joining social media platforms, resulting in exponential growth of native-language and code-mixed content. This content appears in both positive contexts (also termed Hope Speech) and negative contexts (also termed Hate Speech). To keep social media clean and free of hate, it is important to remove the negative content using machine learning filters. Since most existing hate content prediction models are trained on high-resource languages such as English, they fail on code-mixed text because of its spelling variance and non-grammatical structure. In addition, the lack of suitable training data could be one reason behind existing models’ poor performance on code-mixed text. To address these issues and promote research in this direction, we developed a manually annotated Hinglish code-mixed corpus of 9254 comments taken from Twitter handles. We also annotated our data with the target audience and severity level. For each label, we provided a more fine-grained classification with three independent classes, yielding a multi-label, multi-class corpus for predicting the severity of hate content in Hinglish code-mixed text. Further, to validate the proposed data, we modeled various supervised classifiers for severity prediction. The proposed models employ transformers for feature extraction and different machine learning and RNN (recurrent neural network) models for classification. According to the experimental results, the target label combined with embeddings from Twitter text, classified with a BiLSTM (a variant of RNN), performed better on severity prediction, attaining an acceptable weighted F1 score.

Articles published on High-resource Languages

SexWEs: Domain-Aware Word Embeddings via Cross-Lingual Semantic Specialisation for Chinese Sexism Detection in Social Media

Cross-Lingual and Cross-Domain Crisis Classification for Low-Resource Scenarios

English–Assamese neural machine translation using prior alignment and pre-trained language model

Context-aware Emotion Detection from Low-resource Urdu Language Using Deep Neural Network

MBERT-GRU multilingual deep learning framework for hate speech detection in social media

Filtering and Extended Vocabulary based Translation for Low-resource Language Pair of Sanskrit-Hindi

CovTiNet: Covid text identification network using attention-based positional embedding feature fusion

Learning Student Intents and Named Entities in the Education Domain

Meta-Learning a Cross-lingual Manifold for Semantic Parsing

Zero-shot cross-lingual transfer language selection using linguistic similarity

A Voyage on Neural Machine Translation for Indic Languages

Multilingual Sentiment Analysis for Under-Resourced Languages: A Systematic Review of the Landscape

Division and the Digital Language Divide: A Critical Perspective on Natural Language Processing Resources for the South and North Korean Languages

A Survey of Cross-Lingual Text Classification and Its Applications on Fake News Detection

Transfer Learning Based Neural Machine Translation of English-Khasi on Low-Resource Settings

Dzongkha Handwritten Digit Recognition using Machine Learning Techniques

Pradvis vac: A socio-demographic dataset for determining the level of hatred severity in a low-resource Hinglish language

Building a Luganda-English Parallel Corpus and Training a Translation Model

Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Threatening URDU Language Detection from Tweets Using Machine Learning
