Improving in-text citation reason extraction and classification using supervised machine learning techniques

Imran Ihsan, Hameedur Rahman, Asadullah Shaikh, Adel Sulaiman, Khairan Rajab, Adel Rajab

doi:10.1016/j.csl.2023.101526

Abstract

Over the last decade, automatic extraction and classification of in-text citations have gained immense popularity and have become one of the most frequently used techniques for evaluating research. Because of the large volume of in-text citations in digital libraries such as Web of Science, Scopus, Google Scholar, and Microsoft Academic, machine learning models and natural language processing techniques are used to extract, classify, and analyze them. Typical automatic in-text citation classification techniques use sentiment-based classes (Positive, Negative, and Neutral). However, there are also cognitive-based schemes that classify in-text citations from the author’s perspective. In such schemes, extracting citation reasons with high recall is challenging. To address this challenge, we use the eight citation context and reason classes defined by CCRO (Citation’s Context and Reasons Ontology) to develop a machine learning model that achieves high recall without compromising precision. We work with the Association for Computational Linguistics corpus with over 7000 in-text citations, randomly annotated by experts with CCRO classes. An array of machine learning models is then trained on the annotated dataset: Support Vector Machine (SVM), Naïve Bayes (NB), and Random Forest (RF). We use various parts of speech (nouns, verbs, adverbs, and adjectives) as novel features. Our results show that our approach outperforms the three comparative models, achieving 91% accuracy.
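
The general idea of using part-of-speech words as classification features can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny word-to-tag lexicon, the example sentences, and the labels LABEL_A and LABEL_B are hypothetical placeholders (they are not CCRO class names), and a small hand-written Naive Bayes classifier with add-one smoothing stands in for the SVM, NB, and RF models named above.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon (an assumption for illustration only);
# a real pipeline would use a trained part-of-speech tagger.
TOY_POS = {
    "method": "NOUN", "results": "NOUN", "approach": "NOUN", "corpus": "NOUN",
    "use": "VERB", "extend": "VERB", "fails": "VERB",
    "directly": "ADV", "poorly": "ADV",
    "weak": "ADJ",
}
KEPT_TAGS = {"NOUN", "VERB", "ADV", "ADJ"}


def pos_features(sentence):
    """Return 'word/TAG' features for nouns, verbs, adverbs, and adjectives."""
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    return [f"{t}/{TOY_POS[t]}" for t in tokens if TOY_POS.get(t) in KEPT_TAGS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for feats, label in zip(docs, labels):
            self.feature_counts[label].update(feats)
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical citation sentences with placeholder labels.
train = [
    ("We use the method of Smith directly.", "LABEL_A"),
    ("We extend the approach on the corpus.", "LABEL_A"),
    ("The method fails and results are weak.", "LABEL_B"),
    ("Their approach performs poorly.", "LABEL_B"),
]
model = NaiveBayes().fit(
    [pos_features(s) for s, _ in train], [label for _, label in train]
)
print(model.predict(pos_features("We directly extend the corpus.")))
```

Restricting the features to content-bearing word classes keeps the feature space small; whether that choice helps on real citation data is an empirical question that this toy example does not address.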

Full Text