Abstract

We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, Boolean query term suggestion, personalized query suggestion and patient query expansion. The term scoring methods that have been proposed in the literature were each designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections with characteristics different from those they were designed for, and which method is the most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of the five term scoring methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback–Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
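The method highlighted above as most flexible, pointwise Kullback–Leibler divergence for informativeness and phraseness (KLIP, following Tomokiyo and Hurst's formulation), combines two pointwise KL terms: informativeness measures how much more probable a candidate term is in the foreground collection than in a background corpus, and phraseness measures how much more probable the words of the term are together than apart. The sketch below is a minimal illustration of that idea rather than the authors' implementation; the function name klip_scores, the add-one smoothing, the approximation of the background n-gram probability by a product of background unigram probabilities, and the gamma weight balancing the two components are assumptions made for the example.

```python
from collections import Counter
from math import log2

def klip_scores(fg_tokens, bg_tokens, n=2, gamma=0.5):
    """Rank word n-grams of the foreground collection by a weighted sum of
    pointwise KL informativeness (foreground vs. background) and pointwise
    KL phraseness (n-gram vs. its component unigrams). Simplified sketch:
    real systems add candidate filtering (stop words, part of speech) and
    better smoothing."""
    fg_uni, bg_uni = Counter(fg_tokens), Counter(bg_tokens)
    fg_ngrams = Counter(tuple(fg_tokens[i:i + n])
                        for i in range(len(fg_tokens) - n + 1))
    fg_total, bg_total = sum(fg_uni.values()), sum(bg_uni.values())
    ngram_total = sum(fg_ngrams.values())
    vocab = len(set(fg_uni) | set(bg_uni))  # shared vocabulary for add-one smoothing

    scores = {}
    for term, freq in fg_ngrams.items():
        p_fg = freq / ngram_total
        # Background n-gram probability approximated by a product of
        # (smoothed) background unigram probabilities; likewise for the
        # independent-unigram model used by the phraseness component.
        p_bg = p_ind = 1.0
        for w in term:
            p_bg *= (bg_uni[w] + 1) / (bg_total + vocab)
            p_ind *= (fg_uni[w] + 1) / (fg_total + vocab)
        informativeness = p_fg * log2(p_fg / p_bg)
        phraseness = p_fg * log2(p_fg / p_ind)
        scores[" ".join(term)] = gamma * informativeness + (1 - gamma) * phraseness
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```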

Highlights

  • Keywords or key terms are short phrases that represent the content of a document or a document collection

  • We have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose

  • We evaluate and compare five unsupervised term scoring methods from the literature on four different test collections, each with its own specific use case: (1) personal scientific document collections; terms are extracted for the purpose of author profiling; (2) news articles retrieved for Boolean queries; terms are extracted for the purpose of query term suggestion; (3) scientific articles retrieved for highly specific information needs; terms are extracted for the purpose of personalized query suggestion; (4) medical discharge summaries; terms are extracted for the purpose of automatically expanding patient queries with medical terms

Introduction

Keywords or key terms are short phrases that represent the content of a document or a document collection. Because C-Value and KLIP explicitly model multi-word phrases, we expect them to give the best results for collections and use cases where multi-word terms are important. To study the effect of collection size, we address two collections in particular: the Author Profiling collections, where we evaluate term scoring for increasing word counts, and the discharge summaries for Medical Query Expansion, where we investigate how the different methods perform on collections with a small number of words.
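The expectation that C-Value favours multi-word terms follows directly from its scoring formula: a candidate's frequency is weighted by the logarithm of its length in words, and occurrences that are nested inside longer candidate terms are discounted. The sketch below is a simplified illustration under the assumption that a dictionary of multi-word candidates (two or more words, since log2 of 1 is zero) and their frequencies has already been built; the method as described in the literature also applies linguistic filters when selecting candidates.

```python
from math import log2

def c_value(candidates):
    """candidates: dict mapping a candidate term (tuple of two or more words)
    to its frequency in the collection. Returns C-value scores, which favour
    longer, frequent phrases and discount phrases that mainly occur nested
    inside longer candidates. Simplified sketch without linguistic filtering."""
    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous sub-sequence of `longer`.
        m = len(shorter)
        return any(longer[i:i + m] == shorter
                   for i in range(len(longer) - m + 1))

    scores = {}
    for term, freq in candidates.items():
        # Frequencies of longer candidates in which this term is nested.
        nested_in = [f for other, f in candidates.items()
                     if len(other) > len(term) and contains(other, term)]
        adjusted = freq - sum(nested_in) / len(nested_in) if nested_in else freq
        scores[" ".join(term)] = log2(len(term)) * adjusted
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```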
