Disambiguation of author entities in ADS using supervised learning and graph theory methods

Helena Mihaljević, Lucía Santamaría

doi:10.1007/s11192-021-03951-w

Open Access

https://doi.org/10.1007/s11192-021-03951-w

Abstract

Disambiguation of authors in digital libraries is essential for many tasks, including efficient bibliographical searches and scientometric analyses at the level of individuals. The question of how to link documents written by the same person has been given much attention by academic publishers and information retrieval researchers alike. Common approaches rely on publications’ metadata such as affiliations, email addresses, co-authors, or scholarly topics. Lack of homogeneity in the structure of bibliographic collections and discipline-specific dissimilarities between them make the creation of general-purpose disambiguators arduous. We present an algorithm to disambiguate authorships in the Astrophysics Data System (ADS) following an established semi-supervised approach of training a classifier on authorship pairs and clustering the resulting graphs. Due to the lack of high-signal features such as email addresses and citations, we engineer additional content- and location-based features via text embeddings and named-entity recognition. We train various nonlinear tree-based classifiers and detect communities from the resulting weighted graphs through label propagation, a fast and efficient algorithm that requires no tuning. The resulting procedure reaches reasonable complexity and offers possibilities for interpretation. We apply our method to the creation of author entities in a recent ADS snapshot. The algorithm is evaluated on 39 manually labeled author blocks comprising 9545 authorships from 562 author profiles. Our best approach utilizes the Random Forest classifier and yields micro- and macro-averaged BCubed $\mathrm{F}_1$ scores of 0.95 and 0.87, respectively. We release our code and labeled data publicly to foster the development of further disambiguation procedures for ADS.
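To make the evaluation metric concrete, the following is a minimal sketch of BCubed precision, recall, and F1 for a single author block, using the standard per-item definition. It is not the authors' released code. The function names `bcubed` and `averaged_f1` are hypothetical, and the averaging convention shown (macro as the mean of per-block F1, micro as F1 of precision and recall pooled over all authorships) is an assumption that may differ from the exact definition used in the paper.

```python
from collections import Counter


def bcubed(true_labels, pred_labels):
    """Return BCubed (precision, recall, F1) for one block of authorships.

    true_labels[i] is the gold author profile of authorship i,
    pred_labels[i] is the cluster assigned to it by the algorithm.
    """
    if len(true_labels) != len(pred_labels) or len(true_labels) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(true_labels)
    overlap = Counter(zip(true_labels, pred_labels))  # items sharing class and cluster
    true_sizes = Counter(true_labels)
    pred_sizes = Counter(pred_labels)
    # Each of the c items in an overlap cell has precision c / |cluster|
    # and recall c / |class|, so the cell contributes c * c / size.
    precision = sum(c * c / pred_sizes[p] for (_, p), c in overlap.items()) / n
    recall = sum(c * c / true_sizes[t] for (t, _), c in overlap.items()) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def averaged_f1(blocks):
    """Return (micro, macro) F1 over blocks; the convention is an assumption.

    blocks is a list of (true_labels, pred_labels) pairs, one per author block.
    """
    results = [bcubed(t, p) for t, p in blocks]
    sizes = [len(t) for t, _ in blocks]
    total = sum(sizes)
    macro = sum(f1 for _, _, f1 in results) / len(results)
    pooled_p = sum(p * n for (p, _, _), n in zip(results, sizes)) / total
    pooled_r = sum(r * n for (_, r, _), n in zip(results, sizes)) / total
    micro = 2 * pooled_p * pooled_r / (pooled_p + pooled_r)
    return micro, macro


if __name__ == "__main__":
    # Toy block: one authorship of profile "a" is wrongly clustered with "b".
    print(bcubed(["a", "a", "a", "b", "b"], [0, 0, 1, 1, 1]))
```

Under this convention large blocks dominate the micro average while every block counts equally in the macro average, which is one plausible reason for the two reported scores to differ.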

Highlights

Bibliographic databases contain large compilations of research articles’ metadata as released by academic publishers
We have implemented our data processing pipelines and Machine Learning (ML) algorithms in Python 3.8; classifier training and evaluation were carried out using the scikit-learn library (Pedregosa et al 2011), and for label propagation we use the implementation in NetworkX (Hagberg et al 2008)
The following classifiers have been trained: Decision Tree (DT), Random Forest (RF), and Histogram-based Gradient Boosting Decision Tree (Hist-GBDT), whose implementation in scikit-learn is inspired by LightGBM (Ke et al 2017)

Summary

Introduction

Bibliographic databases contain large compilations of research articles’ metadata as released by academic publishers. The availability of author profiles is essential to perform effective literature research and discovery, given that direct author search is the most frequently used feature in digital libraries (Xie and Matusiak 2016). It is crucial in bibliometrics and scientometrics, since it enables analyses of scholarly data at the level of individuals, for instance studies on academic careers (Mihaljević-Brandt et al 2016), credit attribution (Caplar et al 2017), research networks (Newman 2004; Jadidi et al 2018), or migration (Moed and Halevi 2014; Sugimoto et al 2016). For the purposes of comprehensive scientometric studies spanning multiple decades, some sort of author disambiguation of past publications needs to be achieved by methods other than self-identification.

Objectives

Results

Conclusion