Word Semantic Similarity Research Articles

As one of the fundamental information extraction methods, topic models have been widely used in text clustering, information recommendation and other text analysis tasks. Conventional topic models mainly rely on word co-occurrence information in texts for topic inference. However, when these models are applied to short texts, it is usually hard to extract groups of words that are semantically coherent and representative, because the feature space of short texts is too sparse to provide enough co-occurrence information for topic inference. The continuing development of word embeddings offers new word representations and a more effective, concept-level measure of word semantic similarity. In this study, we first mine word co-occurrence patterns (i.e., biterms) from a short text corpus and then calculate each biterm's frequency and the semantic similarity between its two words. The result shows that a biterm with higher frequency or semantic similarity usually has more similar words in the corpus. Based on this result, we develop a novel probabilistic topic model, named Noise Biterm Topic Model with Word Embeddings (NBTMWE). NBTMWE extends the Biterm Topic Model (BTM) by introducing a noise topic with prior knowledge of the frequency and semantic similarity of biterms. Compared with BTM, NBTMWE has the following advantages: (1) it can distinguish meaningful latent topics from a noise topic that consists of commonly used words appearing in many texts of the dataset; (2) it can promote a biterm's semantically related words to the same topic during the sampling process via the generalized Pólya urn (GPU) model. Using auxiliary word embeddings trained on a large-scale corpus, we report results on two short text datasets (i.e., Sina Weibo and Web Snippets). Quantitatively, NBTMWE outperforms the state-of-the-art models in terms of coherence, topic word similarity and classification accuracy. Qualitatively, each of the topics generated by NBTMWE contains more semantically similar words and shows superior intelligibility.
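
The first step described in the abstract above (mining biterms, then computing each biterm's frequency and the semantic similarity of its two words) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, the use of cosine similarity as the similarity measure, and the choice to count each unordered word pair once per text are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def extract_biterms(tokens):
    """Return the unordered pairs of distinct words in one short text.

    Assumption: each pair is counted once per text, regardless of how
    often its words repeat inside that text.
    """
    return list(combinations(sorted(set(tokens)), 2))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def biterm_stats(corpus, embeddings):
    """Map each biterm to (corpus frequency, similarity of its two words).

    Biterms containing a word without an embedding are skipped.
    """
    freq = Counter()
    for tokens in corpus:
        freq.update(extract_biterms(tokens))
    return {
        (w1, w2): (count, cosine(embeddings[w1], embeddings[w2]))
        for (w1, w2), count in freq.items()
        if w1 in embeddings and w2 in embeddings
    }


# Toy data (hypothetical): two short texts and hand-made embeddings.
corpus = [["apple", "banana"], ["banana", "apple", "cherry"]]
embeddings = {"apple": [1.0, 0.0], "banana": [1.0, 0.0], "cherry": [0.0, 1.0]}
stats = biterm_stats(corpus, embeddings)
```

In practice the embeddings would come from a model trained on a large external corpus, as the abstract states; the dictionary above only stands in for such a lookup.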

Recently, political events, such as elections, have prompted extensive discussion on social media networks, in particular Twitter. This brings new opportunities for social scientists to address social science tasks, such as understanding what communities said or identifying whether one community has an influence on another. However, identifying these communities and extracting what they said from social media data are challenging and non-trivial tasks. We aim to make progress towards understanding 'who' (i.e. communities) said 'what' (i.e. discussed topics) and 'when' (i.e. time) during political events on Twitter. While identifying the 'who' can benefit from Twitter user community classification approaches, 'what' they said and 'when' can be effectively addressed by extracting their discussed topics using topic modelling approaches that also account for the importance of time on Twitter. To evaluate the quality of these topics, it is necessary to investigate how coherent they are to humans. Accordingly, we propose a series of approaches in this thesis. First, we investigate how to effectively evaluate the coherence of the topics generated by a topic modelling approach. A topic coherence metric evaluates topical coherence by examining the semantic similarity among the words in a topic. We argue that the semantic similarity of words in tweets can be effectively captured by word embeddings trained on a Twitter background dataset. Through a user study, we demonstrate that our proposed word embedding-based topic coherence metric can assess the coherence of topics much as humans do [1, 2]. In addition, inspired by the precision at k metric, we propose to evaluate the coherence of a topic model (containing many topics) by averaging over the top-ranked topics within the topic model [3]. Our proposed metrics can not only evaluate the coherence of topics and topic models, but also help users choose the most coherent topics.
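
The two ideas in the paragraph above, scoring a topic by the semantic similarity among its words and scoring a topic model by averaging over its top-ranked topics, can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's metric: the names `topic_coherence` and `coherence_at_k`, the use of mean pairwise cosine similarity, and the toy embeddings are all hypothetical choices.

```python
from itertools import combinations
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_coherence(top_words, embeddings):
    """Mean pairwise cosine similarity among a topic's top words.

    Assumption: pairs with a word missing from the embeddings are ignored.
    """
    pairs = [
        (a, b)
        for a, b in combinations(top_words, 2)
        if a in embeddings and b in embeddings
    ]
    if not pairs:
        return 0.0
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)


def coherence_at_k(topics, embeddings, k):
    """Average coherence of the k most coherent topics in a topic model."""
    scores = sorted((topic_coherence(t, embeddings) for t in topics), reverse=True)
    top = scores[:k]
    return sum(top) / len(top) if top else 0.0


# Toy data (hypothetical): hand-made embeddings and two topics.
embeddings = {
    "vote": [1.0, 0.0],
    "election": [1.0, 0.0],
    "ballot": [1.0, 0.0],
    "cat": [0.0, 1.0],
}
topics = [["vote", "election", "ballot"], ["vote", "cat"]]
best_topic_score = coherence_at_k(topics, embeddings, k=1)
both_topics_score = coherence_at_k(topics, embeddings, k=2)
```

Ranking topics by their individual scores before averaging is what lets the same function also serve to pick out the most coherent topics for inspection.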
Second, we aim to extract topics with high coherence from Twitter data. Such topics can be easily interpreted by humans and can help examine 'what' has been discussed and 'when'. Indeed, we argue that topics can be discussed in different time periods (see [4]) and therefore can be effectively identified and distinguished by considering their time periods. Hence, we propose an effective time-sensitive topic modelling approach that integrates the time dimension of tweets (i.e. 'when') [5]. We show that the time dimension helps to generate topics with high coherence. Hence, we argue that 'what' has been discussed and 'when' can be effectively addressed by our proposed time-sensitive topic modelling approach. Next, to identify 'who' participated in the topic discussions, we propose approaches to identify the community affiliations of Twitter users, including automatic ground-truth generation approaches and a user community classification approach. We show that the hashtags and entities mentioned in users' tweets can indicate which community a Twitter user belongs to. Hence, we argue that they can be used to generate the ground-truth data for classifying users into communities. On the other hand, we argue that different communities favour different topic discussions and that their community affiliations can be identified by leveraging the discussed topics. Accordingly, we propose a Topic-Based Naive Bayes (TBNB) classification approach to classify Twitter users based on their words and discussed topics [6]. We demonstrate that our TBNB classifier, together with the ground-truth generation approaches, can effectively identify the community affiliations of Twitter users. Finally, to show that our approaches generalise, we apply them to analyse 3.6 million tweets related to the US Election 2016 on Twitter [7]. We show that our TBNB approach can effectively identify the 'who', i.e. classify Twitter users into communities.
To investigate 'what' these communities have discussed, we apply our time-sensitive topic modelling approach to extract coherent topics. We finally analyse the community-related topics evaluated and selected using our proposed topic coherence metrics. Overall, we provide effective approaches to assist social scientists in analysing political events on Twitter. These approaches include topic coherence metrics, a time-sensitive topic modelling approach and approaches for classifying the community affiliations of Twitter users. Together they make progress towards studying and understanding the connections and dynamics among communities on Twitter. Supervisors: Iadh Ounis, Craig Macdonald, Philip Habel. The thesis is available at http://theses.gla.ac.uk/41135/

Articles published on Word Semantic Similarity

Cross-lingual embeddings with auxiliary topic models

Research of Toponyms of the Irkutsk Region Using the Method of Artificial Intelligence

Knowledge-based sentence semantic similarity: algebraical properties

An Artificial Intelligence Approach for Word Semantic Similarity Measure of Hindi Language

Building Arabic Paraphrasing Benchmark based on Transformation Rules

Semantic Similarity of Inverse Morpheme Words Based on Word Embedding

MultiGBS: A multi-layer graph approach to biomedical summarization

AWN-similarity: Towards developing free open-source frameworks for measuring Arabic semantic similarity under Windows / Linux operating systems

Modeling Literary Criticism: How to Do It and How to Teach It to Humans and Machines

Applying Clustering and Topic Modeling to Automatic Analysis of Citizens’ Comments in EGovernment

Duration of verbal fluency inter‐word intervals and risk of cognitive impairment

Improving biterm topic model with word embeddings

Understanding the spatial dimension of natural language by measuring the spatial semantic similarity of words through a scalable geospatial context window

Representation of associative and affective semantic similarity of abstract words in the lateral temporal perisylvian language regions

A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages

A Natural Language Process-Based Framework for Automatic Association Word Extraction

METHOD FOR AUTOMATIC IDENTIFICATION OF SEMANTICALLY SIMILAR FRAGMENTS OF NEWS TEXTS

SYSTEMATIZATION OF SEMANTIC RELATIONS OF DERIVING POLYSEMANTS WITH REFLECTED SYNONYMS

Analysing political events on Twitter
