Word Semantic Similarity Research Articles

Words have different meanings (i.e., senses) depending on the context. Selecting the correct sense is an important and challenging task in natural language processing. An intuitive approach is to choose the sense whose definition is most similar to the context, using the sense definitions provided by WordNet, a large lexical database of English. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting the words that the context and the sense definitions share, and those words must match exactly. Similarity should instead be computed from how words are related rather than from overlap alone, by representing the context and sense definitions in a vector space model and analyzing the distributional semantic relationships among them using latent semantic analysis (LSA). As the text corpus grows, however, LSA consumes much more memory and is not flexible enough to train on a huge corpus. A word-embedding approach has an advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space through either the skip-gram or the continuous bag-of-words (CBOW) model. Word2vec also captures semantic and syntactic word similarities from a huge corpus of text more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense definition vectors, then gives each word sense a score by computing the cosine similarity between those sentence vectors. The sense definitions are also expanded with sense relations retrieved from WordNet. If a score is not higher than a specific threshold, it is combined with the probability of that sense in the sense distribution learned from a large sense-tagged corpus, SEMCOR. The candidate answer senses are those with the highest scores. Our method's result (50.9%, or 48.7% without the probability of sense distribution) is higher than the baselines (i.e., original, simplified, adapted, and LSA Lesk) and outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
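The scoring step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for trained Word2vec embeddings, averaging word vectors into a sentence vector is an assumed composition rule, adding the sense prior to a low score is an assumed way of "combining" the two, and the sense names, priors, and threshold value are invented for the example.

```python
import math

# Toy 3-dimensional vectors standing in for trained Word2vec embeddings (invented values).
TOY_VECTORS = {
    "money":   [0.9, 0.1, 0.0],
    "deposit": [0.8, 0.2, 0.1],
    "loan":    [0.85, 0.15, 0.05],
    "river":   [0.1, 0.9, 0.1],
    "water":   [0.0, 0.8, 0.3],
    "slope":   [0.1, 0.7, 0.4],
}

def sentence_vector(words, vectors):
    """Average the vectors of the known words (assumed composition rule)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_senses(context, definitions, priors, vectors, threshold=0.5):
    """Score each sense by cosine similarity between context and definition vectors.

    A score that does not exceed the threshold has the sense prior added to it
    (assumption: the abstract does not say how the two are combined).
    """
    context_vec = sentence_vector(context, vectors)
    scores = {}
    for sense, definition_words in definitions.items():
        definition_vec = sentence_vector(definition_words, vectors)
        if context_vec is None or definition_vec is None:
            score = 0.0
        else:
            score = cosine(context_vec, definition_vec)
        if score <= threshold:
            score += priors.get(sense, 0.0)
        scores[sense] = score
    return scores

# Hypothetical senses of "bank", with definition words and corpus-derived priors.
definitions = {
    "bank_finance": ["money", "deposit", "loan"],
    "bank_river": ["river", "water", "slope"],
}
priors = {"bank_finance": 0.7, "bank_river": 0.3}
context = ["deposit", "money"]

scores = score_senses(context, definitions, priors, TOY_VECTORS)
best = max(scores, key=scores.get)
print(best, scores)
```

In a fuller version, the definition word lists would come from WordNet glosses expanded with related senses, and the priors from sense frequencies in a tagged corpus such as SEMCOR.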


Background: Text classification is a very important task in information retrieval. Its objective is to classify new text documents into a set of predefined classes using different supervised algorithms. Objectives: We focus on text classification for Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, and each is assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing their accuracy on the remaining articles. In the second approach, text classification treats words according to their semantic and syntactic similarities, assuming that a word is formed by character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately. We also analyzed the training and testing time. Results: Our results show that the bag-of-words model does better than fastText when the classification is tested on a dataset of text that is not large. FastText shows better performance when classifying multi-label text. Conclusions: News articles can serve to create a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag-of-words model, with an accuracy of 94%.
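The two representations contrasted in this abstract can be illustrated with a small, dependency-free sketch. It is not the study's setup: a hand-written multinomial naive Bayes stands in for the nine scikit-learn classifiers, the ten English "headlines" and their labels are invented, and `char_ngrams` only shows the kind of sub-word unit fastText works with rather than fastText itself.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, the sub-word unit fastText builds on."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_naive_bayes(docs):
    """Collect bag-of-words counts per label; docs is a list of (tokens, label)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log-probability under naive Bayes."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in label_counts.items():
        logp = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Laplace smoothing keeps unseen words from zeroing the probability.
            logp += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Invented toy data: (tokens, label) pairs.
data = [
    ("team wins match".split(), "sport"),
    ("player scores goal".split(), "sport"),
    ("coach praises team".split(), "sport"),
    ("match ends goal".split(), "sport"),
    ("bank raises rates".split(), "economy"),
    ("market prices fall".split(), "economy"),
    ("bank cuts prices".split(), "economy"),
    ("rates hit market".split(), "economy"),
    ("team scores goal".split(), "sport"),
    ("market rates fall".split(), "economy"),
]

# 80% of the documents for training, the remaining 20% for testing.
split = int(0.8 * len(data))
train_docs, test_docs = data[:split], data[split:]

model = train_naive_bayes(train_docs)
accuracy = sum(predict(model, tokens) == label for tokens, label in test_docs) / len(test_docs)
print("accuracy:", accuracy)
print(char_ngrams("where"))
```

In the bag-of-words model each word is an independent feature, so two words that share a stem are unrelated; the character n-grams let a sub-word model share information between such words.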


Articles published on Word Semantic Similarity

Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet

A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics

Albanian Text Classification: Bag of Words Model and Word Analogies

A Comprehensive Study of the Parameters in the Creation and Comparison of Feature Vectors in Distributional Semantic Models

A framework for intelligent question answering system using semantic context-specific document clustering and Wordnet

A survey on word embedding techniques and semantic similarity for paraphrase identification

Document-based topic coherence measures for news media text

Weighting-based semantic similarity measure based on topological parameters in semantic taxonomy

A Semantic Similarity Algorithm Based on the Nearest Common Ancestor Node

Hierarchical Rhetorical Sentence Categorization for Scientific Papers

Computational linguistic analysis applied to a semantic fluency task to measure derailment and tangentiality in schizophrenia

On the Suitability of Applying WordNet to Privacy Measurement

Word Sense Disambiguation for Arabic Exploiting Arabic WordNet and Word Embedding

Word predictability and semantic similarity show distinct patterns of brain activity during language comprehension

Multilingual semantic dictionaries for natural language processing: The case of BabelNet

Extracting Personae Interactive Relation in Chinese Microblog Based on an Improved Dependency Trigram Kernel

The Research of Chinese Words Semantic Similarity Calculation with Multi-Information

Document Classification Using N-gram and Word Semantic Similarity

On retrieving intelligently plagiarized documents using semantic similarity
