Incremental author name disambiguation using author profile models and self-citations

Ijaz Hussain, Sohail Asghar

doi:10.3906/elk-1806-132

Abstract

Author name ambiguity in bibliographic databases (BDs) such as DBLP is a challenging problem that degrades information retrieval quality, citation analysis, and proper attribution to authors. It occurs when several authors have the same name (homonyms) or when an author publishes under several name variants (synonyms). Traditionally, much research has focused on disambiguating the whole bibliographic database at once whenever new citations are added to it. However, this approach is time-consuming and discards the results of any manual disambiguation. Only a few incremental author name disambiguation methods have been proposed, and these methods produce fragmented clusters, which lowers their accuracy. In this paper, a method called CAND is proposed that uses author profile models and self-citations for incremental author name disambiguation. CAND introduces name indices that improve the overall system response by comparing the newly inserted references to the indexed author clusters. Author profile models are generated for the existing authors in the BDs, and these models help in disambiguating the newly inserted references. A comparator function is proposed to resolve incremental author name ambiguity; it utilizes the strongest bibliometric features, such as coauthors, titles, author profile models, and self-citations. Two real-world data sets, one from Arnetminer and the other from BDBComp, are used to validate CAND's performance. Experimental results show that CAND performs better overall than the existing state-of-the-art incremental author name disambiguation methods.
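The incremental idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' CAND comparator: the blocking key (first initial plus last name), the feature weights (0.6 and 0.4), the threshold (0.5), and all class and function names below are assumptions chosen only for illustration. The sketch shows a name index that limits comparisons to clusters with a matching name key, a simple comparator over coauthors and title words, and a self-citation shortcut.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def name_key(author_name):
    """Hypothetical blocking key: first initial plus last name, lowercased."""
    parts = author_name.lower().replace(".", " ").split()
    return f"{parts[0][0]} {parts[-1]}"


@dataclass
class Reference:
    ref_id: str
    author: str                      # the ambiguous name being resolved
    coauthors: set
    title_words: set
    cited_ids: set = field(default_factory=set)


@dataclass
class Cluster:
    """One presumed real-world author with a crude aggregated profile."""
    ref_ids: set = field(default_factory=set)
    coauthors: set = field(default_factory=set)
    title_words: set = field(default_factory=set)

    def add(self, ref):
        self.ref_ids.add(ref.ref_id)
        self.coauthors |= ref.coauthors
        self.title_words |= ref.title_words


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(ref, cluster):
    # Self-citation shortcut: the new reference cites a paper in this cluster.
    if ref.cited_ids & cluster.ref_ids:
        return 1.0
    shared_coauthor = 1.0 if ref.coauthors & cluster.coauthors else 0.0
    # Assumed weights; the paper's comparator may combine features differently.
    return 0.6 * shared_coauthor + 0.4 * jaccard(ref.title_words, cluster.title_words)


def insert_reference(ref, index, threshold=0.5):
    """Assign ref to the best indexed cluster, or open a new cluster."""
    candidates = index[name_key(ref.author)]
    best = max(candidates, key=lambda c: similarity(ref, c), default=None)
    if best is None or similarity(ref, best) < threshold:
        best = Cluster()
        candidates.append(best)
    best.add(ref)
    return best


# Toy data (invented): two distinct authors sharing the key "w wang".
index = defaultdict(list)
refs = [
    Reference("p1", "Wei Wang", {"j li"}, {"graph", "mining"}),
    Reference("p2", "W. Wang", {"j li", "k chen"}, {"graph", "clustering"}),
    Reference("p3", "Wei Wang", {"m silva"}, {"protein", "folding"}),
    Reference("p4", "W. Wang", {"a gomez"}, {"protein", "structure"}, {"p3"}),
]
for r in refs:
    insert_reference(r, index)
```

In this toy run, p2 joins p1's cluster through the shared coauthor, p3 opens a new cluster, and p4 joins p3's cluster only because it cites p3. Each new reference is compared solely against clusters under the same name key, so existing assignments are never recomputed.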

Highlights

Because the pool of names is limited and some names are very popular, different authors may share the same name; conversely, a single author's name may be represented in different ways because journals and conferences follow different naming conventions.
Baseline methods. We found three incremental AND methods in the literature: INDi [7], reducing fragmentation in incremental author name disambiguation (INDi+) [8], and INC [9].
In 1987, the whole bibliographic database (BD) was disambiguated using the batch AND algorithm proposed by us [18], whereas for subsequent loads after 1987, CAND is used for each new year.

Summary

Introduction

Because the pool of names is limited and some names are very popular, different authors may share the same name; conversely, a single author's name may be represented in different ways because journals and conferences follow different naming conventions. Author name ambiguities can cause wrong attributions and incorrect search results [1,2,3]. The problem is especially common with Asian names, particularly Chinese and Korean names. The methods that resolve these author name ambiguities are called author name disambiguation (AND) methods. The rapid growth of scientific publications has made the author name ambiguity problem much harder than in the past. Bollen et al. predicted substantial growth in the number of research articles in the coming years [4]. In 2010, Jinha estimated that 50 million research articles had been published up to that point, and that on average one article is published per minute [5].

Methods

Results

Conclusion