Parallel Computing for Large-Scale Author Name Disambiguation in MEDLINE

Anyao Tang, Chengkun Wu, Jie Liu, Wei Wang, Yuting Xing, Xi Yang

doi:10.1109/hpcc/smartcity/dss.2019.00217

Abstract

Author name disambiguation (AND) is an important task in scientific data mining, and the rapid growth of academic digital libraries has made it increasingly challenging. Performing AND for a large number of authors is computationally intensive. In particular, an author's name in MEDLINE is represented by the full last name and initials, such as "Zhang S", so many identical strings actually represent different names. In this paper, we propose an efficient algorithm for parallel AND computation that mainly addresses load balancing across many computing nodes. It involves two strategies: (1) author-based load balancing, which splits the computational load among cores by author name label; and (2) a matrix-based strategy, which calculates the pairwise similarity between publications, saves the results in a matrix globally shared by all processes, and then groups the publications by breadth-first search. We combine the two strategies: the matrix-based strategy handles authors with a large number of documents, and the author-based strategy handles the remaining authors. We constructed a database of publications written by Chinese authors from MEDLINE, the largest public database of biomedical literature (abstracts). For benchmarking, we tested our algorithm on a dataset of 1 million publications on the Tianhe-2A supercomputer. First, we trained an AND classifier that achieves an F1 score of 98.1%. The serial computation time is estimated at approximately 246 hours, while the parallel execution time is approximately 66 hours with four cores on a single node (a speedup of 3.7x). Finally, we reduced the total parallel computing time for 1 million documents to about 2 hours and achieved a parallel efficiency of 65.8% using 200 cores on 90 nodes.
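
The two strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how name labels are assigned to cores, so the greedy largest-first heuristic in `assign_blocks` is an assumption, as is the cost model in `pair_count` (one classifier call per pair of publications under the same name label). The function `group_by_bfs` assumes the shared similarity matrix has already been thresholded into a symmetric boolean matrix, and all function names are hypothetical.

```python
import heapq
from collections import deque


def pair_count(n_docs):
    """Assumed cost model: pairwise comparisons within one name label."""
    return n_docs * (n_docs - 1) // 2


def assign_blocks(block_sizes, n_workers):
    """Assign name labels to workers, largest block first (assumed heuristic).

    block_sizes maps an author name label to its number of publications.
    Returns (loads, assignment), where loads[w] is the total pair count
    given to worker w and assignment[w] lists its name labels.
    """
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    loads = [0] * n_workers
    assignment = [[] for _ in range(n_workers)]
    for label, size in sorted(block_sizes.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)  # least-loaded worker so far
        assignment[w].append(label)
        loads[w] = load + pair_count(size)
        heapq.heappush(heap, (loads[w], w))
    return loads, assignment


def group_by_bfs(same_author):
    """Group publications by breadth-first search over a boolean matrix.

    same_author[i][j] is True when publications i and j were judged to
    share an author. Returns one cluster id per publication.
    """
    n = len(same_author)
    cluster = [-1] * n
    next_id = 0
    for start in range(n):
        if cluster[start] != -1:
            continue  # already reached from an earlier publication
        cluster[start] = next_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if same_author[i][j] and cluster[j] == -1:
                    cluster[j] = next_id
                    queue.append(j)
        next_id += 1
    return cluster
```

Because the pair count grows quadratically with the number of publications per label, a single very common name can dominate one worker's load under the first function alone, which is consistent with the abstract's choice to route authors with many documents to the matrix-based strategy instead.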

Full Text