Abstract

Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, there are relatively few evaluation studies that shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf by default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings, such as the advantages of the lesser-known lexical specificity with respect to tf-idf, or the qualitative differences between statistical and graph-based methods. Finally, based on our findings we discuss and devise some suggestions for practitioners. Source code to reproduce our experimental results, including a keyword extraction library, is available in the following repository: https://github.com/asahi417/kex

Highlights

  • Keyword extraction has been an essential task in many scientific fields as a first step to extract relevant terms from text corpora

  • Different variants have been proposed, which we summarize in two main alternatives: tf-idf (Section 2.1.1) and lexical specificity (Section 2.1.2)

  • For mean reciprocal rank (MRR), LexSpec exhibits a very different behaviour, performing significantly better in datasets with high average number of noun phrases and high variability in the number of gold keywords
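
The last highlight compares methods by mean reciprocal rank (MRR). As a reminder of how this metric is computed over ranked keyword predictions, here is a minimal sketch; the function and example data are illustrative and not taken from the paper's evaluation code.

```python
def mean_reciprocal_rank(predictions, gold_keywords):
    """MRR over documents: for each document, take the reciprocal rank of the
    highest-ranked predicted keyword found in the gold set (0 if none match),
    then average across documents."""
    reciprocal_ranks = []
    for ranked, gold in zip(predictions, gold_keywords):
        rr = 0.0
        for rank, keyword in enumerate(ranked, start=1):
            if keyword in gold:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Example: two documents with their top-3 predicted keywords.
predictions = [["neural network", "deep learning", "graph"],
               ["keyword extraction", "tf-idf", "corpus"]]
gold = [{"deep learning"}, {"corpus", "terminology"}]
print(mean_reciprocal_rank(predictions, gold))  # (1/2 + 1/3) / 2 ≈ 0.42
```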


Summary

Introduction

Keyword extraction has been an essential task in many scientific fields as a first step to extract relevant terms from text corpora, and it remains relevant to Information Retrieval (Marcos-Pablos and García-Peñalvo, 2020) and Natural Language Processing (NLP) tasks (Riedel et al., 2017; Arroyo-Fernández et al., 2019). As each keyword phrase consists of contiguous words in the document, the task amounts to selecting and ranking candidate phrases from the document itself. As an extension of term frequency (tf), term frequency–inverse document frequency (tf-idf) (Jones, 1972) is one of the most popular and effective methods for statistical keyword extraction (El-Beltagy and Rafea, 2009), and it is still an important component in modern information retrieval applications (Marcos-Pablos and García-Peñalvo, 2020; Guu et al., 2020). Nonetheless, a statistical measure based on the hypergeometric distribution such as lexical specificity (Lafon, 1980) can perform at least as well as tf-idf, and graph-based methods initialized with tf-idf or lexical specificity perform best overall.
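
To make the two statistical weighting schemes discussed above concrete, the following is a minimal, self-contained sketch of document-level tf-idf and of lexical specificity based on the hypergeometric distribution. It assumes whitespace tokenisation, that the reference corpus contains the document, and that scipy is available; it is not the implementation shipped in the kex repository, and the function names are illustrative.

```python
import math
from collections import Counter

from scipy.stats import hypergeom


def tf_idf(documents):
    """Per-document {term: tf-idf} scores for a list of raw text documents."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        scores.append({term: freq * math.log(n_docs / doc_freq[term])
                       for term, freq in tf.items()})
    return scores


def lexical_specificity(doc_tokens, corpus_tokens):
    """{term: specificity} for one document seen as a sub-part of the corpus.

    Specificity(t) = -log10 P(X >= f), with X ~ Hypergeom(T, F, n), where
    T = corpus size, F = corpus frequency of t, n = document length and
    f = in-document frequency of t (Lafon, 1980).
    """
    doc_counts = Counter(doc_tokens)
    corpus_counts = Counter(corpus_tokens)
    T = sum(corpus_counts.values())   # tokens in the whole corpus
    n = sum(doc_counts.values())      # tokens "drawn", i.e. document length
    scores = {}
    for term, f in doc_counts.items():
        F = corpus_counts[term]       # corpus frequency (document included)
        p = hypergeom.sf(f - 1, T, F, n)  # P(at least f occurrences in the doc)
        scores[term] = -math.log10(p) if p > 0 else float("inf")
    return scores
```

In graph-based variants, weights of this kind can serve, for example, as a prior over nodes of a word co-occurrence graph before running a random-walk ranker such as personalised PageRank; the exact initialisation used by the paper's graph-based methods may differ in detail.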
