A Comprehensive Evaluation of Metadata-Based Features to Classify Research Paper’s Topics

Ghulam Mustafa, Muhammad Tanvir Afzal, Muhammad Usman, Anis Koubaa, Abdul Shahid

doi:10.1109/access.2021.3115148

Abstract

Existing document classification techniques draw on different data sources, taken either from the content or from the metadata of research articles. Many journal publishers, such as Springer, Elsevier, and IEEE, do not provide open access to the content of research articles, whereas the metadata is freely available. Metadata such as title, keywords, and abstract can therefore serve as an alternative to the content in various scenarios. In the current literature, researchers have assessed the role of some metadata fields individually. We believe that the collective contribution of metadata parameters can play a significant role in classifying research papers. This paper presents a comprehensive evaluation of the role of metadata, individually as well as in combination, for research paper classification. We classify the research articles into the root categories of the ACM hierarchy (e.g., general literature, hardware, software). In this evaluation, we assess all possible combinations of metadata features against different classifiers such as Random Forest, K Nearest Neighbor, and Decision Tree. The results reveal that the title & keywords combination outperforms the other combinations, with an F-measure score of 0.88.

Highlights

Over the past several years, the volume of research available on the web has been expanding rapidly
This study presents a comprehensive evaluation of the metadata of research papers, individually and collectively, by forming different combinations to classify research papers into the categories specified by the Association for Computing Machinery (ACM)
An accurate classification model that labels research papers with categories can boost the efficiency of digital libraries and can assist the scholarly community by providing them with content for a literature review on a particular topic or domain

Summary

INTRODUCTION

Over the past several years, the volume of research available on the web has been expanding rapidly. Based on a critical analysis of contemporary approaches, we have identified that, to the best of our knowledge, none of them has combined useful metadata parameters such as abstract, title, keywords, and general terms to classify research papers using comprehensive and large datasets. This study presents a comprehensive evaluation of the metadata of research papers, individually and collectively, by forming different combinations to classify research papers into the categories specified by ACM. For this purpose, a comprehensive and large dataset from ACM, prepared by Santos et al. [9], is used. It contains different metadata parameters of research articles from the domain of Computer Science. From this data, we have extracted title, abstract, general terms, and keywords.
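
The idea of evaluating metadata fields "individually and collectively" can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the `feature_combinations` and `build_text` helpers, and the sample record are all hypothetical, and joining fields into one string is only one plausible way to form a combined input for a classifier. With the four fields named above, enumerating every non-empty subset gives 2^4 - 1 = 15 candidate inputs; the paper may count or construct its combinations differently.

```python
from itertools import combinations

# The four metadata fields named in the text; the key names are assumptions.
FEATURES = ["title", "abstract", "keywords", "general_terms"]


def feature_combinations(features):
    """Return every non-empty subset of the features, smallest subsets first."""
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


def build_text(paper, combo):
    """Join the selected metadata fields of one paper into a single string.

    Fields that are missing or empty are skipped.
    """
    return " ".join(paper[field] for field in combo if paper.get(field))


# Hypothetical record, for illustration only.
paper = {
    "title": "Sample title",
    "abstract": "Sample abstract text",
    "keywords": "classification metadata",
    "general_terms": "Algorithms",
}

all_combos = feature_combinations(FEATURES)
title_keywords = build_text(paper, ("title", "keywords"))
```

Each string produced by `build_text` could then be vectorized and passed to a classifier of choice, with one evaluation run per entry in `all_combos`.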

RELATED WORK

Results

Limitations

DATASET

INDIVIDUAL FEATURES

CONCLUSION