Abstract
We present an algorithm for matching the authors of patents with the authors of scientific articles. The authors are given as records in Scopus and the Chinese Patents Database. This task is known as the record linkage problem, defined as finding and linking individual records from separate databases that refer to the same real-world entity. The presented solution is based on a record linkage framework combined with text feature extraction and machine learning techniques. The main challenges were low data quality, the lack of common record identifiers, and the limited number of other attributes shared by both data sources. Matching based solely on an exact comparison of authors’ names does not solve the record linkage problem because many Chinese authors share the same full name. Moreover, the English spelling of Chinese names is not standardized in the analyzed data. We propose three ideas for extending the attribute sets and improving record linkage quality: (1) fuzzy matching of names, (2) comparison of abstracts of patents and articles, and (3) comparison of scientists’ main research areas, calculated using all available metadata. The presented solution was evaluated in terms of matching quality and complexity on ≈250,000 record pairs linked by human experts. The results of numerical experiments show that the proposed strategies increase the quality of record linkage compared to typical solutions.
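To illustrate idea (1), the following is a minimal sketch of fuzzy name matching. The abstract does not state which similarity measure the authors use, so this example assumes a character-based ratio from `difflib.SequenceMatcher` in the Python standard library. The helpers `normalize_name` and `name_similarity` are hypothetical names introduced here, and the normalization rules (lowercasing, dropping hyphens, ignoring token order) are assumptions about how non-standardized romanized names might be handled.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Assumed normalization: lowercase, drop hyphens/apostrophes/periods,
    and sort the tokens so family-name-first and given-name-first agree."""
    cleaned = re.sub(r"[-'’.]", "", name.lower())
    tokens = re.findall(r"[a-z]+", cleaned)
    return " ".join(sorted(tokens))


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


if __name__ == "__main__":
    # Same person written in two conventions: identical after normalization.
    print(name_similarity("Zhang Xiao-Ming", "Xiaoming Zhang"))  # -> 1.0
    # Clearly different names score much lower.
    print(name_similarity("Zhang Xiaoming", "Li Wei"))
```

A score like this would be only one attribute among several in a linkage pipeline, since many distinct authors share the same full name.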
Highlights
Increasing amounts of collected data require the development of new effective methods for data integration, understood as the process of combining data from different sources into a unified view
The best performance was achieved by the BP-MLL model, while the worst was achieved by the label powerset (LP)-Gaussian naive Bayes (GNB) model
Both the Binary relevance (BR)-decision tree classifier (DT) and BP-MLL favored the prediction of multiple labels per instance, which is noticeable in high values of Re
Summary
Increasing amounts of collected data require the development of new effective methods for data integration, understood as the process of combining data from different sources into a unified view. Shanghai Science & Technology Talents Development Center maintains two separate databases: the Scopus database from Elsevier, containing metadata about scientific journal publications, and the Chinese Patents Database from the National Intellectual Property Administration, People’s Republic of China. Integrating these databases simplifies systems that search for experts, saves time, and reduces errors. To improve the quality of record linkage, we propose a new algorithm built on three strategies that generate new attributes and introduce new methods of attribute comparison, namely: (1) fuzzy matching of names, (2) comparison of abstracts of patents and articles, and (3) comparison of subject areas of patent inventors and authors of articles.
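Strategy (2) can be sketched as a text similarity score between a patent abstract and an article abstract. The summary does not specify the text features used, so this example assumes a plain bag-of-words representation compared with cosine similarity, using only the Python standard library. The function names `tokenize` and `cosine_similarity` are hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of the term-count vectors of two texts.

    Returns 0.0 when either text has no tokens.
    """
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    patent = "Graphene electrode for a lithium battery"
    article = "A lithium battery with a graphene based electrode"
    print(cosine_similarity(patent, article))
```

A high score suggests that the patent and the article concern the same topic, which could help separate two people who share a full name but work in different fields.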