Abstract

Author name ambiguity is a common problem in digital libraries. The problem occurs because multiple individuals may share the same name and the same individual may be represented by various names. Researchers have proposed various techniques for author name disambiguation (AND). In this paper, we study AND in the context of research publications indexed in the PubMed citation database. We perform an empirical study where we experiment with two ensemble-based classification algorithms, namely, random forest and gradient boosted decision trees, on a publicly available corpus of manually disambiguated author names from PubMed. Results show that random forest produces higher accuracy, precision, recall and F1-score, but gradient boosted trees perform competitively. We also determine which features are most discriminative given the feature set and the classifiers.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call