Abstract

Automatic key concept extraction from text is a central and challenging task in information extraction, information retrieval, digital libraries, ontology learning, and text analysis. Statistical frequency-based and topical graph-based ranking are two leading families of unsupervised approaches devised to address this problem. To exploit the potential of these approaches and improve key concept identification, a comprehensive performance analysis on datasets from different domains is needed. The objective of the study presented in this paper is to perform a comprehensive empirical analysis of selected frequency-based and topical graph-based algorithms for key concept extraction on three different datasets, and to identify the major sources of error in these approaches. For the experimental analysis, we selected TF-IDF, KP-Miner and TopicRank. Three major sources of error, namely frequency errors, syntactic errors and semantic errors, are identified, together with the factors that contribute to them. Analysis of the results reveals that the performance of the selected approaches is significantly degraded by these errors. These findings can inform the development of a more intelligent solution for key concept extraction in the future.
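As a minimal illustration of the frequency-based family named above, the sketch below scores candidate terms with a plain TF-IDF weighting. The whitespace tokenization, the smoothed IDF variant, and the toy documents are simplifying assumptions made for illustration only, not the exact pipeline evaluated in the paper.

    import math
    from collections import Counter

    def tfidf_scores(documents):
        """Return one {term: score} dict per document (raw TF x smoothed IDF)."""
        tokenized = [doc.lower().split() for doc in documents]
        n_docs = len(tokenized)
        df = Counter()                      # document frequency of each term
        for tokens in tokenized:
            df.update(set(tokens))
        scores = []
        for tokens in tokenized:
            tf = Counter(tokens)
            scores.append({term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
                           for term, count in tf.items()})
        return scores

    # Toy usage: print the three highest-scoring terms per document
    docs = ["ontology learning extracts key concepts from text",
            "key concept extraction supports information retrieval"]
    for per_doc in tfidf_scores(docs):
        print(sorted(per_doc.items(), key=lambda kv: kv[1], reverse=True)[:3])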

Highlights

  • The key concepts in an ontology of a specific domain represent a set of important entity classes or objects [1,2]

  • Digging deeper into why TopicRank performs poorly and behaves inconsistently on the SemEval-2010 and Quranic datasets, we found that the main cause lies in the way topics are generated and weighted

  • We present the overall performance of the above methods in terms of Average Precision (AP), which measures how early in the ranked list an algorithm places the correct key concepts (a minimal computation sketch follows this list)
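As a hedged sketch of how AP can be computed for a single document, the snippet below averages precision at every rank where a gold-standard key concept is hit. The ranked list and gold set are illustrative placeholders, not data from the paper's experiments.

    def average_precision(ranked, gold):
        """Mean of precision@k over the ranks k at which a gold concept appears."""
        hits, precisions = 0, []
        for k, concept in enumerate(ranked, start=1):
            if concept in gold:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / len(gold) if gold else 0.0

    ranked = ["ontology", "text analysis", "key concept", "graph ranking"]
    gold = {"ontology", "key concept"}
    print(average_precision(ranked, gold))   # hits at ranks 1 and 3 -> (1.0 + 2/3) / 2 ≈ 0.83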

Introduction

The key concepts in an ontology of a specific domain represent a set of important entity classes or objects [1,2]. Extracting these key concepts automatically is a fundamental and challenging step in ontology learning. Statistical frequency-based and topical graph-based ranking methods are among the leading unsupervised approaches to this task. To utilize their potential for improving key concept identification, we need to thoroughly analyze the performance of methods based on these approaches on datasets from different domains, and to investigate the underlying reasons and sources of error where results are poor. To gain a better understanding of these approaches by identifying their shortcomings, and to provide future research directions, we examine three state-of-the-art methods and evaluate their performance on three different datasets. We describe these datasets later in the analysis section.

Related Work
Common Extraction Steps
TF-IDF
KP-Miner
TopicRank
Comparative Analysis
Experimental Setup
Performance Measures
Individual Performance
Overall Performance
Method
Error Source Analysis
Conclusions
Findings
Limitations