Fine-grained classification of social science journal articles using textual data: A comparison of supervised machine learning approaches

Joshua Eykens, Tim C E Engels, Raf Guns

doi:10.1162/qss_a_00106

Abstract

We compare two supervised machine learning algorithms (Multinomial Naïve Bayes and Gradient Boosting) to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigned to a document make this task challenging. To collect the training data, we query three discipline-specific thesauri to retrieve articles corresponding to specialties in the classification. The resulting data set consists of 113,909 records and covers 245 specialties, aggregated into 31 subdisciplines from three disciplines. Experts were consulted to validate the thesauri-based classification. The resulting multilabel data set is used to train the machine learning algorithms in different configurations. We deploy a multilabel classifier chaining model, allowing for an arbitrary number of categories to be assigned to each document. The best results are obtained with Gradient Boosting. The approach does not rely on citation data, so it can be applied in settings where such information is not available. We conclude that fine-grained text-based classification of social sciences publications at a subdisciplinary level is a hard task, for humans and machines alike. A combination of human expertise and machine learning is suggested as a way forward to improve the classification of social sciences documents.
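To make the idea of classifier chaining concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the tokenizer, the class names (`BinaryMultinomialNB`, `SimpleClassifierChain`), the two labels, and the five toy documents are all invented for illustration. It only shows the general mechanism: one binary Multinomial Naive Bayes model per category, with each model also seeing the decisions made earlier in the chain, so a document can end up with zero, one, or several categories.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class BinaryMultinomialNB:
    """Multinomial Naive Bayes for a single yes/no label, Laplace-smoothed."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of 0/1 values
        self.vocab = {tok for doc in docs for tok in doc}
        self.counts = {0: Counter(), 1: Counter()}
        self.n_docs = {0: 0, 1: 0}
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
            self.n_docs[y] += 1
        self.totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        return self

    def _log_score(self, doc, c):
        # Smoothed log prior, so a class without training documents is still defined
        total_docs = self.n_docs[0] + self.n_docs[1]
        score = math.log((self.n_docs[c] + 1) / (total_docs + 2))
        denom = self.totals[c] + len(self.vocab)
        for tok in doc:
            if tok in self.vocab:  # tokens never seen in training are ignored
                score += math.log((self.counts[c][tok] + 1) / denom)
        return score

    def predict(self, doc):
        return int(self._log_score(doc, 1) > self._log_score(doc, 0))


class SimpleClassifierChain:
    """One binary model per label; later models see earlier label decisions."""

    def __init__(self, label_names):
        self.label_names = list(label_names)
        self.models = []

    @staticmethod
    def _augment(doc, label, value):
        # Encode an earlier link's decision as an extra pseudo-token
        return doc + [f"__{label}={value}__"]

    def fit(self, texts, label_sets):
        docs = [tokenize(t) for t in texts]
        self.models = []
        for label in self.label_names:
            y = [int(label in s) for s in label_sets]
            self.models.append(BinaryMultinomialNB().fit(docs, y))
            # During training the true labels feed the next link in the chain
            docs = [self._augment(d, label, v) for d, v in zip(docs, y)]
        return self

    def predict(self, text):
        doc = tokenize(text)
        assigned = set()
        for label, model in zip(self.label_names, self.models):
            value = model.predict(doc)
            if value:
                assigned.add(label)
            # At prediction time the predicted label feeds the next link
            doc = self._augment(doc, label, value)
        return assigned


# Toy data, invented for this sketch (not taken from the paper's data set)
train_texts = [
    "labour market wages unemployment",
    "inflation monetary policy interest rates",
    "social inequality class mobility",
    "family migration community networks",
    "income inequality wages class",
]
train_labels = [
    {"economics"},
    {"economics"},
    {"sociology"},
    {"sociology"},
    {"economics", "sociology"},
]

chain = SimpleClassifierChain(["economics", "sociology"]).fit(train_texts, train_labels)
for query in ("wages unemployment monetary policy", "class inequality wages income"):
    print(query, "->", sorted(chain.predict(query)))
```

The order of the chain matters in this kind of model, because each link can only condition on the labels that come before it. In practice one would more likely use an established library implementation than hand-rolled classes like these.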

Highlights

Disciplines have long been considered as the fundamental units of division within the sciences (Stichweh, 2003).
While some of the reported indicators, such as F-scores, are relatively low, we think it is instructive to compare our results to those of the recent studies by Kandimalla et al. (2020) and Dunham et al. (2020). While these authors report better accuracy, it should be highlighted that in this paper we look at the applicability of supervised learning in the context of social sciences.
In this article we present a supervised ML approach to classify social science journal articles into multiple fine-grained disciplinary categories.
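For readers unfamiliar with how an F-score is computed when a document can carry several categories, the following sketch may help. It assumes the two common averaging schemes, micro and macro; this excerpt does not say which variant the paper reports. The function names and the four toy documents are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(true_sets, pred_sets, labels):
    """Return (micro_f1, macro_f1) over documents with sets of labels."""
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for label in labels:
        pairs = list(zip(true_sets, pred_sets))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        per_label.append(f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    micro = f1(tot_tp, tot_fp, tot_fn)      # pools all label decisions
    macro = sum(per_label) / len(labels)    # averages per-label scores equally
    return micro, macro


# Toy example: a classifier that always predicts label "A" and never "B"
true_sets = [{"A"}, {"A", "B"}, {"B"}, {"A"}]
pred_sets = [{"A"}, {"A"}, {"A"}, {"A"}]
micro, macro = multilabel_f1(true_sets, pred_sets, ["A", "B"])
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Macro averaging weights every category equally, so rare categories that are never predicted pull the score down sharply. With a scheme of hundreds of fine-grained specialties, that is one reason macro scores can look low even when frequent categories are handled well.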

Summary

Introduction

Disciplines have long been considered as the fundamental units of division within the sciences (Stichweh, 2003). These units are knowledge production and communication systems, and can as such serve important classificatory functions (Hammarfelt, 2018; Stichweh, 1992, 2003; Sugimoto & Weingart, 2015; van den Besselaar & Heimeriks, 2006). The subjects of interest for scientometricians (i.e., scientific documents) are classified according to disciplines to facilitate research into knowledge production and dissemination. Over the past few decades, we have faced continuous growth of the number of new disciplines and specialties (i.e., internal differentiation), resulting in increasing dynamism and “intensification of the interactions between [...] disciplines”. Several concerns have been raised in this regard; here we mention



Journal: Quantitative Science Studies
Publication Date: Apr 8, 2021
Citations: 12
License type: CC BY 4.0


