Word N-grams Research Articles

Social media platforms have become a medium for people across the globe to voice their opinions and ideas. Because anonymity and freedom of expression make it possible to humiliate individuals and groups online without regard for social etiquette, incidents of cyberbullying and cyber hate speech have proliferated and diversified. The problem has recently attracted the attention of researchers and scholars worldwide, yet current practices for screening online content and curbing the spread of hatred do not go far enough. Contributing factors include the growing prevalence of regional languages on social media and the dearth of language resources and flexible detection approaches, particularly for low-resource languages. Most existing studies target traditional resource-rich languages, leaving a large gap for recently adopted resource-poor languages. One such language, now used worldwide and especially by South Asian users for textual communication on social networks, is Roman Urdu. It is derived from Urdu and written left to right in Roman script. Its inflections, derivations, lexical variations, and morphological richness pose numerous computational challenges for natural language preprocessing tasks. To address this problem, this research proposes a cyberbullying detection approach for textual data in Roman Urdu based on advanced preprocessing methods, voting-based ensemble techniques, and machine learning algorithms. The study extracted a large number of features, including statistical features, word N-grams, combined N-grams, and a bag-of-words (BOW) model with TF-IDF weighting, in different experimental settings using GridSearchCV and cross-validation.
The detection approach is designed to handle users' textual input by taking into account the user-specific, colloquial, and non-standard writing styles found on social media. The experimental results show that SVM with embedded hybrid N-gram features produced the highest average accuracy, around 83%. Among the voting-based ensemble techniques, XGBoost achieved the best accuracy, 79%. Both implicit and explicit Roman Urdu instances were evaluated, and severity was categorized based on prediction probabilities. Time complexity was also analyzed in terms of execution time, which indicated that LR was the fastest algorithm across different parameters and feature combinations. The results are promising with respect to standard assessment metrics and indicate the feasibility of the proposed approach for cyberbullying detection in Roman Urdu.
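To make the feature types named in the abstract more concrete, the sketch below shows one way word N-grams, combined N-grams (several orders concatenated), and a plain TF-IDF weighting could be computed. It is an illustration only, not the study's implementation: the function names, the whitespace tokenizer, the `tf * log(N / df)` formula, and the English placeholder sentences are all assumptions, and the paper's own preprocessing and weighting may differ.

```python
from collections import Counter
import math


def word_ngrams(text, n):
    """Return the word n-grams of `text` as space-joined strings."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_ngrams(text, n_min=1, n_max=2):
    """Concatenate the n-grams of every order from n_min to n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(word_ngrams(text, n))
    return grams


def tfidf(docs, n_min=1, n_max=2):
    """Weight each n-gram by tf * log(N / df), one dict per document."""
    bags = [Counter(combined_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for bag in bags for g in bag)
    n_docs = len(docs)
    return [
        {g: (c / sum(bag.values())) * math.log(n_docs / df[g])
         for g, c in bag.items()}
        for bag in bags
    ]


# Hypothetical placeholder documents (not taken from the study's dataset).
docs = ["you are so nice", "you are so rude"]
weights = tfidf(docs)
print(combined_ngrams(docs[0]))
print(weights[0])
```

With this weighting, N-grams shared by every document (such as "you are") receive a weight of zero, while N-grams unique to one document (such as "so nice") carry the signal a classifier would use.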

The rapid growth of conversational AI and user chat data has led to intensive development of dialogue management systems (DMS) for various industries. Yet very little research has been conducted for low-resource languages such as Azerbaijani. The main purpose of this work is to experiment with various DMS pipeline set-ups in order to decide on the most appropriate natural language understanding and dialogue manager settings. In our project, we designed and evaluated different DMS pipelines on conversational text data obtained from one of the leading retail banks in Azerbaijan. We investigated the two main components of a DMS: Natural Language Understanding (NLU) and the Dialogue Manager. In the first step of NLU, we used a language identification (LI) component for language detection. We investigated both built-in LI methods such as fastText and custom machine learning (ML) models trained on the domain-based dataset. The second step was a comparison of classic ML classifiers (logistic regression, neural networks, and SVM) with the Dual Intent and Entity Transformer (DIET) architecture for user intention detection. In these experiments we used different combinations of feature extractors, such as CountVectorizer, the Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer, and word embeddings, for both word-based and character n-gram-based tokens. To extract important information from the text messages, a Named Entity Recognition (NER) component was added to the pipeline. The best NER model was chosen from among a conditional random fields (CRF) tagger, deep neural network (DNN) models, and the built-in entity extraction component of the DIET architecture. The obtained entity tags were fed to the Dialogue Management module as features.
All NLU set-ups were followed by the Dialogue Management module, which contains a Rule-based Policy to handle FAQs and chitchat as well as a Transformer Embedding Dialogue (TED) Policy to handle more complex and unexpected dialogue inputs. As a result, we suggest a DMS pipeline for a financial assistant that is capable of identifying intentions, named entities, and the language of a text, followed by policies that generate a proper response (based on the designed dialogues) and suggest the best next action.
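The abstract mentions count-based feature extraction over character n-gram tokens. As a rough illustration of that idea, the following sketch extracts character n-grams inside padded word boundaries and turns them into a count vector over a fixed vocabulary. The function names, the space padding, the n-gram range, and the placeholder input are assumptions made for this example; this is not the authors' pipeline and it does not reproduce any particular library's vectorizer.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Return character n-grams of each order in [n_min, n_max].

    Each word is padded with one space on both sides so that n-grams
    mark word starts and ends but never span two words.
    """
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(n_min, n_max + 1):
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def count_vector(text, vocabulary, n_min=2, n_max=3):
    """Count how often each vocabulary n-gram occurs in `text`."""
    counts = Counter(char_ngrams(text, n_min, n_max))
    return [counts[g] for g in vocabulary]


# Hypothetical placeholder input (not from the bank dataset).
print(char_ngrams("ab", 2, 3))
print(count_vector("ab ab", ["ab", "zz"], 2, 2))
```

Character n-grams of this kind tend to be less sensitive to spelling variation than whole-word tokens, which is one plausible reason to compare both token types for intent detection on informal chat text.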

Articles published on Word N-grams

Detection of Cyberbullying Patterns in Low Resource Colloquial Roman Urdu Microtext using Natural Language Processing, Machine Learning, and Ensemble Techniques

FreCDo: A Large Corpus for French Cross-Domain Dialect Identification

Identification of offensive language in Urdu using semantic and embedding models.

Big-Data-Based Text Mining and Social Network Analysis of Landscape Response to Future Environmental Change

Optimal alphabet for single text compression

Analysis of Research Trends in Occupational Skin Diseases Using Text Mining and CONCOR Analysis

Thirty Years of Research Trends in 『주택연구』 (Housing Studies): A Text Mining Approach

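A Comparison of Regional Characteristics of Technical Barriers to Trade Associated with Firms' Overseas Expansion: Focusing on the Fisheries Industry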

ExpFinder: A hybrid model for expert finding from text-based expertise data

Similarity Identification Based on Word Trigrams Using Exact String Matching Algorithms

Keyphrase Extraction Using Enhanced Word and Document Embedding

Machine Learning Techniques, Features, Datasets, and Algorithm Performance Parameters for Sentiment Analysis: A Systematic Review

Multi-class sentiment analysis of Urdu text using multilingual BERT

Fake news spreaders profiling using N-grams of various types and SHAP-based feature selection

Big Data Analysis of Perceptions of Health and Exercise During the COVID-19 Pandemic

Deep Learning Model for Sentiment Analysis on Short Informal Texts

Improving Mandarin End-to-End Speech Recognition With Word N-Gram Language Model

Zeta revisited

Development of Dialogue Management System for Banking Services

Linguistic features evaluation for hadith authenticity through automatic machine learning
