A supervised machine learning approach to author disambiguation in the Web of Science

Andreas Rehs

doi:10.1016/j.joi.2021.101166

Abstract

• Machine learning is used for author name disambiguation in the Web of Science.
• Supervised learning uses the Researcher ID author identifier with random forest and logistic regression classifiers.
• Name frequency-based, bibliographic, thematic, and address-based features are used and evaluated.
• Missing first name data is included to make the machine learning robust to quality changes in new data.
• Pairwise paper predictions are clustered into author profiles via the infomap graph community detection method.
• The average K-Metric cluster measure reaches values >0.78 and suggests reasonable performance of our approach.

Author-level scientometric indicators are an important tool in individual and institution-based research assessment and require high-quality author-publication profiles. To address this need, our study developed a robust supervised machine learning approach, combined with graph community detection methods, to disambiguate author names in the Web of Science publication database. We used the unique author identifier Researcher ID to retrieve true authorship data for 1,904 scientists and trained a random forest and a logistic regression classifier on 1.2 million corresponding publication pairs whose authors share the same last name and first name initial. To do this, we reviewed a large set of paper and author characteristics and randomly included missing data to make our machine learning robust to quality changes in new publication data. When applied to an unseen test set, the models achieved F1 scores of 0.82 (random forest) and 0.75 (logistic regression). Subsequently, we evaluated feature performance and applied the infomap graph community detection algorithm to identify all publications belonging to an author. The community detection yields reasonable cluster metrics (mean K-Metric = 0.78 for the logistic regression-based model and 0.81 for the random forest-based model). Finally, we tested our algorithm on a large surname-initial block (“Muller, M.”) and demonstrated its speed and predictive performance.
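The pipeline summarized above (blocking by last name and first name initial, pairwise linking, clustering, and cluster evaluation) can be illustrated with a minimal Python sketch. It is not the implementation used in the study. All function names are hypothetical, the connected-components grouping is a simple stand-in for the infomap algorithm, and the K-Metric is assumed to follow the commonly used definition as the square root of average cluster purity (ACP) times average author purity (AAP).

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def block_key(last_name, first_name):
    """Blocking key: lower-cased last name plus first-name initial.

    A missing first name yields an empty initial.
    """
    initial = first_name.strip()[:1].lower()
    return (last_name.strip().lower(), initial)


def candidate_pairs(records):
    """Yield pairs of record ids that share a blocking key.

    `records` maps a record id to a (last_name, first_name) tuple.
    """
    blocks = defaultdict(list)
    for rec_id, (last, first) in records.items():
        blocks[block_key(last, first)].append(rec_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def cluster_by_links(ids, linked_pairs):
    """Group ids by connected components over the predicted links.

    Assumption: this union-find grouping only stands in for infomap.
    """
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in linked_pairs:
        parent[find(a)] = find(b)
    return {i: find(i) for i in ids}


def k_metric(true_labels, pred_labels):
    """Return sqrt(ACP * AAP) for two dicts mapping record id -> label."""
    n = len(true_labels)
    overlap = Counter((pred_labels[i], true_labels[i]) for i in true_labels)
    cluster_size = Counter(pred_labels[i] for i in true_labels)
    author_size = Counter(true_labels.values())
    acp = sum(c * c / cluster_size[p] for (p, _), c in overlap.items()) / n
    aap = sum(c * c / author_size[t] for (_, t), c in overlap.items()) / n
    return sqrt(acp * aap)
```

In a full pipeline, a trained classifier would decide which candidate pairs count as links. Here the links are passed in by hand, so the sketch only shows how pairwise decisions turn into author clusters and how those clusters could be scored against known authorship.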

Full Text