Word Representation Method Research Articles

This paper proposes a new word representation method that emphasizes general words over specific words. The main motivation is to address the weighting bias in modern Language Models (LMs). Built on the Transformer architecture, contemporary LMs tend to emphasize specific words through the attention mechanism in order to capture the key semantic concepts in a given text. As a result, general words, including question words, are often neglected by LMs, leading to a biased representation of word significance in which specific words receive heightened weights and general words receive reduced weights. This paper presents a case study in which the semantics of general words are as important as those of specific words: abstractive answers in the Question Answering (QA) domain of Natural Language Processing (NLP). Using the selected case study datasets, two experiments are designed to test the hypothesis that "the significance of general words is highly correlated with its Term Frequency (TF) percentage across various document scales". The results of these experiments support the hypothesis, justifying the method's intent to emphasize general words over specific words at any corpus size. The output of the proposed method is a list of token (word)-weight pairs. These weights can be used to raise the significance of general words relative to specific words in suitable NLP tasks. One such task is question classification (deciding whether a question expects a factual or an abstractive answer). In this context, general words, particularly question words, are more semantically significant than specific words, because the same specific words in different questions may require different answers depending on the question words (e.g., "How many items are on sale?" versus "What items are on sale?"). By employing the general weight values produced by this method, the weights of question words and specific words can be heightened, making it easier for the classification system to differentiate between these questions. The token (word)-weight pair list is available online at https://www.kaggle.com/datasets/saliimiabbas/genwords-weight.
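The abstract does not give the method's weighting formula, so the following is only a minimal sketch of the quantity named in the hypothesis: the Term Frequency (TF) percentage of each token across a corpus, returned as token-weight pairs. The function name `tf_percentages` and the simple regular-expression tokenizer are assumptions made for illustration, not part of the published method.

```python
import re
from collections import Counter


def tf_percentages(documents):
    """Return {token: share of all token occurrences in the corpus, in percent}.

    Hypothetical helper: it illustrates TF percentage only and is not the
    weighting scheme of the paper, which the abstract does not specify.
    """
    counts = Counter()
    for text in documents:
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: 100.0 * n / total for token, n in counts.items()}


# The two example questions quoted in the abstract.
questions = ["How many items are on sale?", "What items are on sale?"]
weights = tf_percentages(questions)

# Token-weight pairs, highest TF percentage first (ties in alphabetical order).
ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
for token, weight in ranked:
    print(f"{token}\t{weight:.2f}")
```

In this two-question toy corpus the shared words ("items", "are", "on", "sale") receive a higher TF percentage than the question words, which each occur once. The hypothesis in the abstract concerns much larger document scales, so the toy corpus shows only the mechanics of the calculation.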


With the growing number and use of social media platforms, people increasingly share their experiences of products they have bought or places they have visited. Given the volume of data on these platforms, the shared reviews and experiences are likely to contain meaningful information for institutions and companies. It is therefore important to improve the methods for extracting meaningful information from this content and to know which method performs better. In this study, the classification performance of the bag of words and fastText word representation methods, both used in sentiment analysis, was compared using Turkish reviews of touristic places. The comparison also measured whether reducing words to their roots and handling negation, which are preprocessing steps of sentiment analysis, contributed to classification success. Both two-class (positive, negative) and three-class (positive, negative, neutral) sentiment analysis were performed, and six data sets were created for these comparisons. The data sets were first classified in the WEKA program using the bag of words representation with the Naive Bayes (NB), Multinomial Naive Bayes (MNB), k-Nearest Neighbor (k-NN), and Support Vector Machines (SVM) algorithms, which are frequently used in text mining. After the bag of words results had been obtained for all data sets, the fastText tests were carried out using the fastText library for Python. Classification was performed with 10-fold cross-validation, and the F-score of each classification was obtained. The bag of words method classified more successfully than fastText in two-class sentiment analysis, while fastText classified more successfully than bag of words in three-class sentiment analysis. Reducing words to their roots and handling negation did not contribute positively or negatively to classification with the fastText method, but made a minor contribution to sentiment analysis with the bag of words method. In two-class sentiment analysis, the best result was achieved by the SVM model with the bag of words representation, with an F-score of 0.91. In three-class sentiment analysis, the best result was achieved by the model built with the fastText representation, with an F-score of 0.78.
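As a rough illustration of two ingredients named in this abstract, the sketch below builds bag of words count vectors and computes an F-score for one class. It is a simplified stand-in, not the study's WEKA or fastText pipeline: the whitespace tokenizer, the function names, and the toy English reviews and labels are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocab):
    """Count vector for one text; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for tok, n in Counter(text.lower().split()).items():
        if tok in vocab:
            vector[vocab[tok]] = n
    return vector


def f1_score(y_true, y_pred, positive):
    """F-score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy reviews (the study used Turkish reviews of touristic places).
reviews = ["great view great food", "bad food"]
vocab = build_vocabulary(reviews)
vectors = [bag_of_words(r, vocab) for r in reviews]

# Hypothetical labels and predictions, used only to exercise f1_score.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "pos"]
score = f1_score(y_true, y_pred, positive="pos")
```

In a 10-fold cross-validation setup such as the one described, a score like this would be computed on each held-out fold and then averaged across folds.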



Articles published on Word Representation Method

Predictors of explicit and implicit anthropomorphism in house facades.

Sentiment analysis of coronavirus data with ensemble and machine learning methods

Investigating Computational Identity and Empowerment of The Students Studying Programming: A Text Mining Study

General Words Representation Method for Modern Language Model

TRSAv1: A new benchmark dataset for classifying user reviews on Turkish e-commerce websites

Unsupervised cross-lingual model transfer for named entity recognition with contextualized word representations.

Sentiment Classification Performance Analysis Based on Glove Word Embedding

LogClass: Anomalous Log Identification and Classification With Partial Labels

Text Classification Based on Convolutional Neural Networks and Word Embedding for Low-Resource Languages: Tigrinya

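FastText ve Kelime Çantası Kelime Temsil Yöntemlerinin Turistik Mekanlar İçin Yapılan Türkçe İncelemeler Kullanılarak Karşılaştırılması [Comparison of the fastText and Bag of Words Word Representation Methods Using Turkish Reviews of Touristic Places]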

Defect Texts Mining of Secondary Device in Smart Substation with GloVe and Attention-Based Bidirectional LSTM

Learning variable-length representation of words

Contextual Word Representation and Deep Neural Networks-based Method for Arabic Question Classification

Enriching Word Embeddings with Global Information and Testing on Highly Inflected Language

Convolution–deconvolution word embedding: An end-to-end multi-prototype fusion embedding method for natural language processing

Chinese Event Detection Based on Multi-Feature Fusion and BiLSTM

Sentiment Analysis of Comment Texts Based on BiLSTM

Enhanced news sentiment analysis using deep learning methods

Combining the Attention Network and Semantic Representation for Chinese Verb Metaphor Identification

A COMPARATIVE STUDY OF WORD REPRESENTATION METHODS WITH CONDITIONAL RANDOM FIELDS AND MAXIMUM ENTROPY MARKOV FOR BIO-NAMED ENTITY RECOGNITION
