Low-resource Languages Research Articles

The spread of internet use, especially on social media platforms, has amplified the prevalence of cyberbullying and harassment. Addressing this issue involves applying natural language processing (NLP) and machine learning (ML) techniques to the automatic detection of harmful content. However, these methods encounter challenges when applied to low-resource languages such as the Chittagonian dialect of Bangla. This study compares two approaches for identifying offensive language containing vulgar remarks in Chittagonian. The first relies on basic keyword matching, while the second employs machine learning and deep learning techniques. The keyword-matching approach scans the text for vulgar words using a predefined lexicon. Despite its simplicity, this method provides a strong foundation for more sophisticated ML and deep learning approaches. One drawback is that the lexicon needs constant updates. To address this, we propose an automatic method for extracting vulgar words from linguistic data, which achieves near-human performance and adapts to evolving vulgar language. Insights from the keyword-matching method inform the optimization of the machine learning and deep learning-based techniques. These methods first train models to identify vulgar context using patterns and linguistic features from labeled datasets. Our dataset, comprising social media posts, comments, and forum discussions from Facebook, is described in detail for future reference in similar studies. The results indicate that while keyword matching provides reasonable results, it struggles to capture nuanced variations and phrases in specific vulgar contexts, which makes it less robust for practical use. This contradicts the assumption that vulgarity relies solely on specific vulgar words. In contrast, methods based on deep learning and machine learning excel at identifying deeper linguistic patterns.
SimpleRNN models using Word2Vec and fastText embeddings achieved accuracies ranging from 0.84 to 0.90, while logistic regression (LR) reached a higher accuracy of 0.91. This highlights a common issue with neural network-based algorithms: they typically require larger datasets than conventional approaches like LR to generalize adequately and perform competitively.
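The keyword-matching baseline described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon entries are hypothetical placeholders, and the function names (`tokenize`, `find_vulgar_terms`, `is_vulgar`) are assumptions made for illustration. Tokens are split on whitespace rather than with a `\w+` pattern, because a `\w+` pattern can break words written in Bengali script at vowel signs.

```python
import string

# Hypothetical placeholder lexicon; the study's actual lexicon is not reproduced here.
VULGAR_LEXICON = {"badword", "rudeword"}

# ASCII punctuation plus the Bengali danda, stripped from token edges.
EDGE_PUNCTUATION = string.punctuation + "\u0964"


def tokenize(text):
    """Split on whitespace, strip edge punctuation, and lowercase each token."""
    tokens = (tok.strip(EDGE_PUNCTUATION).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def find_vulgar_terms(text, lexicon=VULGAR_LEXICON):
    """Return the tokens of `text` that appear in the lexicon, in order."""
    return [tok for tok in tokenize(text) if tok in lexicon]


def is_vulgar(text, lexicon=VULGAR_LEXICON):
    """Flag a text as vulgar if it contains at least one lexicon entry."""
    return bool(find_vulgar_terms(text, lexicon))


if __name__ == "__main__":
    print(is_vulgar("You are a BADWORD!"))   # exact lexicon hit
    print(is_vulgar("You are a b-a-d-w-o-r-d"))  # obfuscated spelling is missed
```

The second call shows the kind of limitation the abstract points to: an exact-match lexicon misses spelling variants and context-dependent vulgarity, and it has to be extended by hand whenever new terms appear.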


Learning the inherent meaning of a word in Natural Language Processing (NLP) has motivated researchers to represent a word at various levels of abstraction, namely character-level, morpheme-level, and subword-level vector representations. Syllable-Aware Word Embedding (SAWE) can effectively handle agglutinative and fusion-based NLP tasks. However, research assessing SAWE on such extrinsic NLP tasks has been scarce, especially for low-resource languages in the context of code-mixing with English. This article proposes a model that learns SAWE to extract semantics at fine-grained subunits of a word, and assesses the representative ability of the embeddings through sentiment analysis of code-mixed Telugu-English review corpora. Multilingual societies and advancements in communication technologies have led to widespread use of code-mixed data, which renders State-of-the-Art (SOTA) sentiment analysis models developed on monolingual data ineffective. Social media users in the Indian subcontinent tend to mix English and their respective native language (using the phonetic form of English) when expressing their opinions or sentiments. Code-mixing allows users to borrow words from a foreign language, use shorthand notations, elongate vowels, and ignore syntactic and grammatical rules, all of which make sentiment analysis of code-mixed data challenging. Deep neural architectures such as Long Short-Term Memory and Gated Recurrent Unit networks have been shown to be effective in solving several NLP tasks, such as sequence labeling, named entity recognition, and machine translation. In this article, a framework to perform sentiment analysis on a code-mixed Telugu-English review corpus is implemented.
Both word embedding and SAWE are input to a unified deep neural network that contains a two-level Bidirectional Long Short-Term Memory/Gated Recurrent Unit network with Softmax as the output layer. Combining the advantages of word embedding and SAWE enables the proposed model to outperform existing SOTA code-mixed sentiment analysis models on two datasets: a Telugu-English code-mixed dataset of the International Institute of Information Technology–Hyderabad and a dataset curated by the authors. In comparison with the best-performing SOTA model, the proposed model achieves a 3% increase in F1-score and a 2% increase in accuracy on the first dataset, and a 7% increase in F1-score and a 5% increase in accuracy on the second.
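The idea of feeding both a word-level and a syllable-level representation to one network can be illustrated with a small sketch of the input features only. It rests on several assumptions that are not from the article: the vowel-based splitter `naive_syllables` is a crude stand-in for real syllabification of romanized Telugu, averaging syllable vectors stands in for the learned SAWE, and the lookup tables are toy values. The recurrent network itself is not shown.

```python
import re

# Naive assumption: a "syllable" is a run of non-vowels followed by a run of vowels.
SYLLABLE_PATTERN = re.compile(r"[^aeiou]*[aeiou]+")


def naive_syllables(word):
    """Split a romanized word into rough syllable-like units."""
    word = word.lower()
    pieces = SYLLABLE_PATTERN.findall(word)
    if not pieces:
        return [word] if word else []
    # Attach any trailing consonants to the last unit.
    trailing = word[sum(len(p) for p in pieces):]
    pieces[-1] += trailing
    return pieces


def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors (zero vector if empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def combined_features(word, word_table, syllable_table, dim):
    """Concatenate a word vector with the mean of its syllable vectors.

    Unknown words and unknown syllables fall back to a zero vector.
    """
    zero = [0.0] * dim
    word_vec = word_table.get(word.lower(), zero)
    syllable_vecs = [syllable_table.get(s, zero) for s in naive_syllables(word)]
    return list(word_vec) + mean_vector(syllable_vecs, dim)


if __name__ == "__main__":
    # Toy tables with hypothetical values.
    words = {"bagu": [0.5, 0.5]}
    syllables = {"ba": [1.0, 1.0], "gu": [3.0, 1.0]}
    print(naive_syllables("bagundi"))
    print(combined_features("bagu", words, syllables, dim=2))
```

One motivation for such a design is that a word with an elongated vowel may be missing from the word table while some of its syllable units are still known, so the syllable half of the feature vector remains informative.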



Articles published on Low-resource Languages

Homophobia and transphobia detection for low-resourced languages in social media comments

A Novel Ensemble Model for Complex Entities Identification in Low Resource Language

How a Deep Contextualized Representation and Attention Mechanism Justifies Explainable Cross-Lingual Sentiment Analysis

Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing

Neural Machine Translation from Bengali Language to English language and vice-versa

Code-switching in Tunisian Arabic: a multi-factorial random forest analysis

Meta-Learning for Neural Machine Translation

Low-Resource Language Processing Using Improved Deep Learning with Hunter–Prey Optimization Algorithm

Automatic Vulgar Word Extraction Method with Application to Vulgar Remark Detection in Chittagonian Dialect of Bangla

A Korean emotion-factor dataset for extracting emotion and factors in Korean conversations

Mountain Gazelle Optimizer with Deep Learning Driven Satirical News Classification on Low-resource Language Corpus

Efficient CRNN: Towards end-to-end low resource Urdu text recognition using depthwise separable convolutions and gated recurrent units

Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition

Semi-supervised learning and bidirectional decoding for effective grammar correction in low-resource scenarios

Building lexicon-based sentiment analysis model for low-resource languages

Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT

Bidirectional Representations for Low-Resource Spoken Language Understanding

A Comparative Study on Selecting Acoustic Modeling Units for WFST-based Mongolian Speech Recognition

Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings

Neural Arabic singular-to-plural conversion using a pretrained Character-BERT and a fused transformer

