Record Linkage of Chinese Patent Inventors and Authors of Scientific Articles

Robert Nowak, Zhouxian Zhang, Xin Tian, Wiktor Franus, Xu Chen, Jiarui Zhang, Xiaoyu Liu, Yue Zhu

doi:10.3390/app11188417

Open Access

https://doi.org/10.3390/app11188417

Abstract

We present an algorithm that finds corresponding authors of patents and scientific articles. The authors are given as records in Scopus and the Chinese Patents Database. This task is known as the record linkage problem, defined as finding and linking individual records from separate databases that refer to the same real-world entity. The presented solution is based on a record linkage framework combined with text feature extraction and machine learning techniques. The main challenges were low data quality, the lack of common record identifiers, and the limited number of other attributes shared by both data sources. Matching based solely on an exact comparison of authors’ names does not solve the record linkage problem because many Chinese authors share the same full name. Moreover, the English spelling of Chinese names is not standardized in the analyzed data. We propose three ideas for extending the attribute sets and improving record linkage quality: (1) fuzzy matching of names, (2) comparison of abstracts of patents and articles, and (3) comparison of scientists’ main research areas calculated using all available metadata. The presented solution was evaluated in terms of matching quality and complexity on ≈250,000 record pairs linked by human experts. The results of numerical experiments show that the proposed strategies increase the quality of record linkage compared to typical solutions.

Highlights

Increasing amounts of collected data require the development of new effective methods for data integration, understood as the process of combining data from different sources into a unified view.
The best performance was achieved by the BP-MLL model, and the worst by the label powerset (LP) method with Gaussian naive Bayes (GNB).
Both the binary relevance (BR) method with a decision tree classifier (DT) and BP-MLL favored the prediction of multiple labels per instance, which is noticeable in high values of Re.
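The last highlight can be illustrated with a small Python sketch. It assumes that Re denotes example-based recall, which this page does not state explicitly; the function names, the toy label sets, and the use of ASJC-style codes as labels are all hypothetical and are not taken from the paper's experiments.

```python
def example_based_recall(true_labels, predicted_labels):
    """Mean over instances of |true ∩ predicted| / |true| (assumed meaning of Re)."""
    scores = [len(t & p) / len(t) for t, p in zip(true_labels, predicted_labels) if t]
    return sum(scores) / len(scores) if scores else 0.0


def example_based_precision(true_labels, predicted_labels):
    """Mean over instances of |true ∩ predicted| / |predicted|."""
    scores = [len(t & p) / len(p) for t, p in zip(true_labels, predicted_labels) if p]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ground truth: subject-area codes for two authors.
truth = [{"1700"}, {"2200", "2500"}]

# A cautious classifier predicts few labels; a liberal one predicts many.
cautious = [{"1700"}, {"2200"}]
liberal = [{"1700", "2200"}, {"1700", "2200", "2500"}]

recall_cautious = example_based_recall(truth, cautious)
recall_liberal = example_based_recall(truth, liberal)
precision_cautious = example_based_precision(truth, cautious)
precision_liberal = example_based_precision(truth, liberal)
```

In this toy case the liberal classifier recovers every true label, so its recall is higher, but its precision drops because of the extra labels. This is the trade-off one would expect from a model that favors predicting multiple labels per instance.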

Summary

Introduction

Increasing amounts of collected data require the development of new effective methods for data integration, understood as the process of combining data from different sources into a unified view. The Shanghai Science & Technology Talents Development Center maintains two separate databases: the Scopus database from Elsevier, containing metadata about scientific journal publications, and the Chinese Patents Database from the National Intellectual Property Administration, People’s Republic of China. Integrating these databases simplifies systems that search for experts, saves time, and reduces errors. To improve the quality of record linkage, we propose a new algorithm built on three strategies that generate new attributes and new methods of attribute comparison, namely: (1) fuzzy matching of names, (2) comparison of abstracts of patents and articles, and (3) comparison of subject areas of patent inventors and authors of articles.
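The first two strategies can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation from the paper: the similarity measures (a `difflib` ratio for names, bag-of-words cosine for abstracts), the helper names, and the record fields `name` and `abstract` are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so that word order is ignored."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two romanized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def text_similarity(a: str, b: str) -> float:
    """Cosine similarity of the word-count vectors of two abstracts."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def comparison_vector(inventor: dict, author: dict) -> tuple:
    """Features for one candidate pair: (name similarity, abstract similarity)."""
    return (
        name_similarity(inventor["name"], author["name"]),
        text_similarity(inventor["abstract"], author["abstract"]),
    )


# Hypothetical records: the same person spelled two different ways.
inventor = {"name": "Zhang Xiao-Yu", "abstract": "A method for linking patent records"}
author = {"name": "Xiaoyu Zhang", "abstract": "Linking patent records to articles"}
features = comparison_vector(inventor, author)
```

An exact string comparison would reject this pair because the spellings differ, whereas the fuzzy score stays close to 1. In a full pipeline, such comparison vectors would be passed to a classifier that decides whether the pair is a match.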

Record Linkage Algorithm

Generation of Features for Comparison

Classification

Implementation Note

Evaluation of ASJC Code Prediction

Evaluation of Record Linkage

Evaluation of the Matching Complexity

Indexing Method

Evaluation of Matching Quality

Evaluation of Classification

Conclusions

Journal: Applied Sciences	Publication Date: Sep 10, 2021
License type: CC BY 4.0
