Abstract

The goal of this study is to demonstrate how network science and graph theory tools and concepts can be effectively used for exploring and comparing semantic spaces of word embeddings and lexical databases. Specifically, we construct semantic networks based on word2vec representations of words, which are “learnt” from large text corpora (Google news, Amazon reviews), and “human built” word networks derived from the well-known lexical databases WordNet and Moby Thesaurus. We compare “global” (e.g., degrees, distances, clustering coefficients) and “local” (e.g., most central nodes and community-type dense clusters) characteristics of the considered networks. Our observations suggest that human built networks possess more intuitive global connectivity patterns, whereas local characteristics (in particular, dense clusters) of the machine built networks provide much richer information on the contextual usage and perceived meanings of words, which reveals interesting structural differences between human built and machine built semantic networks. To our knowledge, this is the first study that uses graph theory and network science in the considered context; therefore, we also provide illustrative examples and discuss potential research directions that may motivate further work on the synthesis of lexicographic and machine learning based tools and lead to new insights in this area.
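The construction described above can be sketched as follows: connect two words by an edge when the cosine similarity of their embedding vectors exceeds a threshold, then compute graph characteristics such as degrees and local clustering coefficients. This is a minimal illustration only; the toy vectors and the threshold value are invented for the example and do not come from the study's corpora or from real word2vec output.

```python
import math

# Toy word vectors standing in for word2vec embeddings
# (illustrative values, not learnt from any corpus).
vectors = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "tiger": [0.7, 0.3, 0.0],
    "car":   [0.0, 0.9, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Build the semantic network: link words whose similarity
# exceeds an (assumed, example-only) threshold.
THRESHOLD = 0.9
words = list(vectors)
edges = {w: set() for w in words}
for i, w in enumerate(words):
    for u in words[i + 1:]:
        if cosine(vectors[w], vectors[u]) > THRESHOLD:
            edges[w].add(u)
            edges[u].add(w)

def clustering(w):
    """Local clustering coefficient: the fraction of a node's
    neighbour pairs that are themselves connected."""
    nbrs = list(edges[w])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in edges[nbrs[i]])
    return 2 * links / (k * (k - 1))

degrees = {w: len(edges[w]) for w in words}
```

With these toy vectors, "cat", "dog", and "tiger" form a triangle (so each has clustering coefficient 1.0) while "car" remains isolated; on real embeddings one would typically use a library such as NetworkX for the graph computations.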

Highlights

  • The amount of text data generated in various domains has exploded exponentially over the past few years, and it is estimated that about 80% of all data is unstructured text-heavy data (Schneider 2016; Sumathy and Chidambaram 2013)

  • WordNet is a large lexical database of English developed by George Miller and colleagues (Miller 1995; Fellbaum 1998), in which words are grouped into sets of cognitive synonyms, each representing a distinct concept

  • Human built networks exhibit global characteristics that are more consistent with the frequency of word usage in the English language, whereas machine built networks lack such global characteristics


Introduction

The amount of text data generated in various domains has exploded over the past few years, and it is estimated that about 80% of all data is unstructured text-heavy data (Schneider 2016; Sumathy and Chidambaram 2013). It is therefore increasingly important to develop effective tools and methodologies for handling and analyzing text data. The field of text analytics comprises a set of techniques for extracting valuable knowledge from text, such as the use of natural language processing tools to convert unstructured text-rich data into a structured, machine-understandable form. Typical text analytics applications include finding/extracting relevant information from text, text categorization, document summarization, text clustering, sentiment analysis, concept extraction, and others (Gandomi and Haider 2015). Many of these tasks are addressed using various machine learning techniques. A significant challenge for text analytics is understanding language organization.
