Abstract

Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, there are relatively few evaluation studies that shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf by default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings, such as the advantages of the lesser-known lexical specificity with respect to tf-idf, or the qualitative differences between statistical and graph-based methods. Finally, based on our findings we discuss and devise some suggestions for practitioners. Source code to reproduce our experimental results, including a keyword extraction library, is available in the following repository: https://github.com/asahi417/kex

Highlights

  • Keyword extraction has been an essential task in many scientific fields as a first step to extract relevant terms from text corpora

  • Different variants have been proposed, which we summarize in two main alternatives: tf-idf (Section 2.1.1) and lexical specificity (Section 2.1.2)

  • For mean reciprocal rank (MRR), LexSpec exhibits a very different behaviour, performing significantly better in datasets with high average number of noun phrases and high variability in the number of gold keywords
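
The last highlight compares methods by mean reciprocal rank (MRR). As a reminder of how this metric is computed over ranked keyword predictions, here is a minimal sketch; the function and example data are illustrative and not taken from the paper's evaluation code.

```python
def mean_reciprocal_rank(predictions, gold_keywords):
    """MRR over documents: for each document, take the reciprocal rank of the
    highest-ranked predicted keyword found in the gold set (0 if none match),
    then average across documents."""
    reciprocal_ranks = []
    for ranked, gold in zip(predictions, gold_keywords):
        rr = 0.0
        for rank, keyword in enumerate(ranked, start=1):
            if keyword in gold:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Example: two documents with their top-3 predicted keywords.
predictions = [["neural network", "deep learning", "graph"],
               ["keyword extraction", "tf-idf", "corpus"]]
gold = [{"deep learning"}, {"corpus", "terminology"}]
print(mean_reciprocal_rank(predictions, gold))  # (1/2 + 1/3) / 2 ≈ 0.42
```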


Summary

Introduction

Keyword extraction has been an essential task in many scientific fields as a first step to extract relevant terms from text corpora, and it remains relevant to Information Retrieval (Marcos-Pablos and García-Peñalvo, 2020) and Natural Language Processing (NLP) tasks (Riedel et al., 2017; Arroyo-Fernández et al., 2019). As each keyword phrase consists of contiguous words in the document, the task amounts to selecting and ranking candidate phrases from the document itself. As an extension of term frequency (tf), term frequency–inverse document frequency (tf-idf) (Jones, 1972) is one of the most popular and effective methods for statistical keyword extraction (El-Beltagy and Rafea, 2009), and it is still an important component in modern information retrieval applications (Marcos-Pablos and García-Peñalvo, 2020; Guu et al., 2020). Nonetheless, a statistical measure based on the hypergeometric distribution such as lexical specificity (Lafon, 1980) can perform at least as well as tf-idf, and graph-based methods initialized with tf-idf or lexical specificity perform best overall.
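
To make the two statistical weighting schemes discussed above concrete, the following is a minimal, self-contained sketch of document-level tf-idf and of lexical specificity based on the hypergeometric distribution. It assumes whitespace tokenisation, that the reference corpus contains the document, and that scipy is available; it is not the implementation shipped in the kex repository, and the function names are illustrative.

```python
import math
from collections import Counter

from scipy.stats import hypergeom


def tf_idf(documents):
    """Per-document {term: tf-idf} scores for a list of raw text documents."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        scores.append({term: freq * math.log(n_docs / doc_freq[term])
                       for term, freq in tf.items()})
    return scores


def lexical_specificity(doc_tokens, corpus_tokens):
    """{term: specificity} for one document seen as a sub-part of the corpus.

    Specificity(t) = -log10 P(X >= f), with X ~ Hypergeom(T, F, n), where
    T = corpus size, F = corpus frequency of t, n = document length and
    f = in-document frequency of t (Lafon, 1980).
    """
    doc_counts = Counter(doc_tokens)
    corpus_counts = Counter(corpus_tokens)
    T = sum(corpus_counts.values())   # tokens in the whole corpus
    n = sum(doc_counts.values())      # tokens "drawn", i.e. document length
    scores = {}
    for term, f in doc_counts.items():
        F = corpus_counts[term]       # corpus frequency (document included)
        p = hypergeom.sf(f - 1, T, F, n)  # P(at least f occurrences in the doc)
        scores[term] = -math.log10(p) if p > 0 else float("inf")
    return scores
```

In graph-based variants, weights of this kind can serve, for example, as a prior over nodes of a word co-occurrence graph before running a random-walk ranker such as personalised PageRank; the exact initialisation used by the paper's graph-based methods may differ in detail.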
