Author Name Disambiguation in PubMed using Ensemble-Based Classification Algorithms

Kaushal Jhawar, Debarshi Kumar Sanyal, Partha Pratim Das, Samiran Chattopadhyay, Plaban Kumar Bhowmick

doi:10.1145/3383583.3398568

Abstract

Author name ambiguity is a common problem in digital libraries. The problem occurs because multiple individuals may share the same name and the same individual may be represented by various names. Researchers have proposed various techniques for author name disambiguation (AND). In this paper, we study AND in the context of research publications indexed in the PubMed citation database. We perform an empirical study where we experiment with two ensemble-based classification algorithms, namely, random forest and gradient boosted decision trees, on a publicly available corpus of manually disambiguated author names from PubMed. Results show that random forest produces higher accuracy, precision, recall and F1-score, but gradient boosted trees perform competitively. We also determine which features are most discriminative given the feature set and the classifiers.
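
To make the four reported evaluation measures concrete, the following is a minimal sketch of how accuracy, precision, recall and F1-score can be computed for a binary classifier. It assumes that AND is framed as a binary decision over pairs of author mentions (1 = same individual, 0 = different individuals); that framing, the function name `binary_metrics`, and the example labels are illustrative assumptions, not details taken from the abstract.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    # Guard each ratio against a zero denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight author-mention pairs (illustrative data only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

With these made-up labels there are three true positives, three true negatives, one false positive and one false negative, so all four measures come out to 0.75. The same function could be applied to the predictions of any classifier, including random forest or gradient boosted trees.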

Full Text