Semantic Unsupervised Automatic Keyphrases Extraction by Integrating Word Embedding with Clustering Methods

Isabella Gagliardi, Maria Teresa Artese

doi:10.3390/mti4020030

Abstract

The web produces ever larger volumes of text, alone or associated with images, videos, and photographs, together with metadata that is indispensable for finding and retrieving them. Keywords and keyphrases that characterize the semantic content of documents should be extracted from, or associated with, those documents, either automatically or manually. This paper presents a novel method for the automatic unsupervised extraction of keywords and keyphrases from texts written in English and in Italian. The main feature of the approach is the integration of two techniques that have given interesting results: word embedding models, such as Word2Vec or GloVe, which capture the semantics of words and their context, and clustering algorithms, which identify the essence of the terms and choose the most significant one(s) to represent the contents of a text. The paper presents the datasets used, the method implemented, and the results obtained. These results are discussed and compared with those obtained in previous experiments using TextRank, Rapid Automatic Keyword Extraction (RAKE), and TF-IDF.
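The idea of combining embeddings with clustering can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration (a real pipeline would load pretrained Word2Vec or GloVe vectors), and the choice of k-means with Euclidean distance, as well as the helper names `kmeans` and `extract_keywords`, are assumptions, since this excerpt does not state which clustering algorithm is used. Candidate terms are clustered in embedding space, and the term closest to each cluster centroid is taken as the representative keyword.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# A real pipeline would load pretrained Word2Vec or GloVe vectors instead.
TOY_EMBEDDINGS = {
    "museum": (0.9, 0.1),
    "gallery": (0.8, 0.2),
    "exhibition": (0.85, 0.15),
    "recipe": (0.1, 0.9),
    "cooking": (0.2, 0.8),
    "ingredient": (0.15, 0.85),
}


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Seed the next centroid with the vector farthest from all current ones.
        centroids.append(
            max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        )
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: distance(v, centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def extract_keywords(candidates, embeddings, k):
    """Return one representative term per cluster: the term nearest its centroid."""
    terms = [t for t in candidates if t in embeddings]
    centroids = kmeans([embeddings[t] for t in terms], k)
    keywords = []
    for c in centroids:
        best = min(terms, key=lambda t: distance(embeddings[t], c))
        if best not in keywords:
            keywords.append(best)
    return keywords


if __name__ == "__main__":
    print(extract_keywords(list(TOY_EMBEDDINGS), TOY_EMBEDDINGS, k=2))
```

With these toy vectors the two clusters separate the museum-related terms from the cooking-related ones, and the sketch returns one term per cluster. With real embeddings, cosine distance is a common alternative to Euclidean distance, and the number of clusters would need to be chosen per document.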

Summary

Introduction

The web produces massive volumes of text, alone or associated with images, videos, and photographs, together with metadata that is indispensable for finding and retrieving them. Extracting or associating keywords that represent the content of texts available on the web is a very topical problem, which requires ad hoc solutions, especially for multilingual texts or specialized topics. The keywords and keyphrases of a document, if appropriately chosen, ensure that the document content is adequately and semantically represented [1,2]. Natural language processing and information retrieval rely on keyword extraction, because keywords provide compact information about a document and allow documents to be indexed, classified, summarized, and filtered automatically. Keyword extraction involves text processing methods and can be done manually by experts in the field (e.g., museum curators in the case of Cultural Heritage), or automatically by identifying and scoring words that characterize the document. The keywords/phrases obtained reflect the content and are useful to perform queries on document databases and to “suggest”

Methods

Results

Discussion

Conclusion