Abstract
Representation models for text classification have recently shown impressive performance. However, these models neglect the importance of polysemous words in text. When polysemous words appear in a text, imprecise embeddings for those words produce low-quality text representations that distort the original meaning of the text. To address this problem, in this paper, we present a more effective model architecture, the polyseme-aware vector representation model (PAVRM), to generate more precise vector representations for words and texts. The PAVRM can effectively identify polysemous words in a corpus with a context clustering algorithm. Additionally, we propose two methods to construct polysemous word representations, PAVRM-Context and PAVRM-Center. Experiments conducted on three standard text classification tasks and a custom text classification task demonstrate that the proposed PAVRM can be effectively introduced into existing models to generate higher-quality word and text representations and achieve better classification performance.
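The abstract names a context clustering algorithm for identifying polysemous words but does not give implementation details here. Below is a minimal sketch of how such a step might look, assuming pretrained word vectors and a fixed context window; the function names, window size, cluster count, and silhouette threshold are all illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def context_vectors(corpus, target, embeddings, window=5):
    """Average the embeddings of the words surrounding each occurrence of `target`.
    `corpus` is a list of tokenized sentences; `embeddings` maps word -> vector."""
    vecs = []
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok != target:
                continue
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            ctx = [embeddings[w] for w in ctx if w in embeddings]
            if ctx:
                vecs.append(np.mean(ctx, axis=0))
    return np.array(vecs)

def is_polysemous(corpus, target, embeddings, k=2, threshold=0.1):
    """Heuristic (assumed, not from the paper): treat `target` as polysemous
    if its occurrence contexts split into well-separated clusters."""
    X = context_vectors(corpus, target, embeddings)
    if len(X) <= k:
        return False  # too few occurrences to cluster reliably
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    return silhouette_score(X, labels) > threshold
```

The intuition is that a word like "apple" occurs in two distinct context populations (fruit vs. company), so its context vectors cluster cleanly, whereas a monosemous word's contexts form one diffuse cloud.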
Highlights
Representation learning is a fundamental problem in natural language processing (NLP) and is crucial in text classification tasks.
Traditional representation models for text classification can be roughly divided into two types: models based on linear operations, which utilize tools for word embedding training, such as word2vec [1] or GloVe [2], to learn word-level representations that are later combined to form text representations [3]–[5]; and models based on deep neural networks, which use various neural network structures, such as convolutional neural networks (CNNs) [6]–[9], recurrent neural networks (RNNs) based on long short-term memory (LSTM) [10]–[12], neural networks based on attention mechanisms [13], generative adversarial networks (GANs) [14], [15], reinforcement learning (RL) [16], [17], graph convolutional networks (GCNs) [18]–[20], and pretrained language models [21], [22], to extract complex syntactic and semantic meaning from texts to generate text representations
PAVRM-Center denotes a variant of our proposed polyseme-aware vector representation model (PAVRM) in which the context-based algorithm for constructing polysemous word representations is replaced with one based on center vectors (see the sketch after this list).
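The excerpt does not define the center-vector algorithm, so the following is only one plausible reading, offered as a hedged sketch: cluster a polysemous word's context vectors, treat each cluster centroid as the "center" of one sense, and represent each occurrence by its nearest center. All names and the fixed sense count `k` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def center_representations(context_vecs, k=2):
    """Cluster the context vectors of a polysemous word and use each
    cluster centroid as the representation of one sense (assumed scheme)."""
    km = KMeans(n_clusters=k, n_init=10).fit(context_vecs)
    return km.cluster_centers_  # shape (k, dim): one center vector per sense

def sense_vector(occurrence_ctx, centers):
    """Pick the sense whose center is most cosine-similar to this
    occurrence's context vector."""
    sims = centers @ occurrence_ctx / (
        np.linalg.norm(centers, axis=1) * np.linalg.norm(occurrence_ctx) + 1e-9)
    return centers[np.argmax(sims)]
```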
Summary
Representation learning is a fundamental problem in natural language processing (NLP) and is crucial in text classification tasks. Traditional representation models for text classification can be roughly divided into two types: models based on linear operations, which utilize tools for word embedding training, such as word2vec [1] or GloVe [2], to learn word-level representations that are later combined to form text representations [3]–[5]; and models based on deep neural networks, which use various neural network structures, such as convolutional neural networks (CNNs) [6]–[9], recurrent neural networks (RNNs) based on long short-term memory (LSTM) [10]–[12], neural networks based on attention mechanisms [13], generative adversarial networks (GANs) [14], [15], reinforcement learning (RL) [16], [17], graph convolutional networks (GCNs) [18]–[20], and pretrained language models [21], [22], to extract complex syntactic and semantic meaning from texts to generate text representations. Although these models achieve very good performance in many text classification tasks, they still neglect an important factor: polysemous words. The following two sentences, extracted from two different sentiment classification tasks (movie reviews and baby product reviews), are taken as examples.
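The example sentences themselves are cut off in this excerpt. Independent of them, the linear-operation family mentioned above can be sketched: form a text representation by averaging its word vectors, with an optional hook for substituting sense-specific vectors for polysemous words. The `sense_lookup` hook is a hypothetical interface, not an API from the paper.

```python
import numpy as np

def text_representation(tokens, embeddings, sense_lookup=None):
    """Bag-of-embeddings baseline: average the word vectors of a text.
    `sense_lookup(token, tokens)` is a hypothetical hook returning a
    sense-specific vector for polysemous words (e.g., via center vectors);
    it returns None for ordinary words, which fall back to static embeddings."""
    vecs = []
    for tok in tokens:
        if sense_lookup is not None:
            v = sense_lookup(tok, tokens)
            if v is not None:
                vecs.append(v)
                continue
        if tok in embeddings:
            vecs.append(embeddings[tok])
    if not vecs:
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(vecs, axis=0)
```

The resulting vector can then be fed to any downstream classifier, which is how a polyseme-aware representation can be "introduced into existing models" as the abstract describes.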