Abstract

We present an algorithm to find corresponding authors of patents and scientific articles. The authors are given as records in Scopus and the Chinese Patents Database. This task is known as the record linkage problem, defined as finding and linking individual records from separate databases that refer to the same real-world entity. The presented solution is based on a record linkage framework combined with text feature extraction and machine learning techniques. The main challenges were low data quality, lack of common record identifiers, and a limited number of other attributes shared by both data sources. Matching based solely on an exact comparison of authors’ names does not solve the record linkage problem because many Chinese authors share the same full name. Moreover, the English spelling of Chinese names is not standardized in the analyzed data. We propose three ideas for extending the attribute sets and improving record linkage quality: (1) fuzzy matching of names, (2) comparison of abstracts of patents and articles, and (3) comparison of scientists’ main research areas calculated using all available metadata. The presented solution was evaluated in terms of matching quality and complexity on ≈250,000 record pairs linked by human experts. The results of numerical experiments show that the proposed strategies increase the quality of record linkage compared to typical solutions.

Highlights

  • Increasing amounts of collected data require the development of new effective methods for data integration, understood as the process of combining data from different sources into a unified view

  • The best performance was achieved by the BP-MLL model, while the worst was achieved by the label powerset (LP)-Gaussian naive Bayes (GNB) model

  • Both the Binary relevance (BR)-decision tree classifier (DT) and BP-MLL favored the prediction of multiple labels per instance, which is noticeable in high values of Re


Summary

Introduction

Increasing amounts of collected data require the development of new effective methods for data integration, understood as the process of combining data from different sources into a unified view. The Shanghai Science & Technology Talents Development Center maintains two separate databases: the Scopus database from Elsevier, containing metadata about scientific journal publications, and the Chinese Patents Database from the National Intellectual Property Administration, People’s Republic of China. Integrating these databases simplifies systems that search for experts, saves time, and reduces errors. To improve the quality of record linkage, we propose a new algorithm built on three strategies that generate new attributes and new methods of attribute comparison: (1) fuzzy matching of names, (2) comparison of abstracts of patents and articles, and (3) comparison of subject areas of patent inventors and authors of articles.
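The three comparison strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation. The similarity measures are assumptions chosen for brevity: difflib's ratio for fuzzy name matching, cosine similarity of raw term counts for abstracts, and Jaccard overlap for subject areas. The record fields (author, inventor, abstract, areas) and all function names are hypothetical, and the subject areas are assumed to be already available as sets of codes.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so that word order is ignored."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def name_similarity(a: str, b: str) -> float:
    """Fuzzy name score in [0, 1]; difflib's ratio is an assumed stand-in metric."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def abstract_similarity(a: str, b: str) -> float:
    """Cosine similarity between term-count vectors of two abstracts."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def area_similarity(a: set, b: set) -> float:
    """Jaccard overlap of two sets of subject-area codes (assumed precomputed)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def comparison_vector(article: dict, patent: dict) -> tuple:
    """Build the feature vector that a downstream classifier would label as match or non-match."""
    return (
        name_similarity(article["author"], patent["inventor"]),
        abstract_similarity(article["abstract"], patent["abstract"]),
        area_similarity(article["areas"], patent["areas"]),
    )
```

In a setup like this, each candidate pair of records yields one comparison vector, and a classifier trained on expert-linked pairs decides whether the two records refer to the same person. Sorting the name tokens is one simple way to tolerate the "given name first" versus "family name first" orderings that occur when Chinese names are romanized.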

Record Linkage Algorithm
Generation of Features for Comparison
Classification
Implementation Note
Evaluation of ASJC Code Prediction
Evaluation of Record Linkage
Evaluation of the Matching Complexity
Indexing Method
Evaluation of Matching Quality
Evaluation of Classification
Conclusions