Abstract

Data sets of publication metadata with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements for future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.

Highlights

  • In this paper, we provide a comprehensive and detailed review of data sets used in computational author name disambiguation (AND) experiments. AND data sets are basically collections of publication headers in which author names have been annotated with unique author identifiers [Scientometrics (2017) 111:1467–1500]

  • In the section “A new AND data set from the domain of mathematics”, we provide some background information on the real-life data that we employ for the creation of our own data set, SCAD-zbMATH, and describe the quality assurance process

  • In the section “Initial naive baseline experimentation”, we describe a simple and practical procedure based on selective disambiguation, which makes it possible to retain the advantages of full disambiguation while at the same time focussing on relevant, ad-hoc subsets of authorship records

Summary

Introduction

We provide a comprehensive and detailed review of data sets used in computational author name disambiguation (AND) experiments. AND data sets are basically collections of publication headers in which author names have been annotated with unique author identifiers.

Song et al. focus on disambiguating authors by using advanced semantic topic-modelling techniques (Song et al. 2007). They create a data set of more than 750,000 publications which contains author names, titles, abstracts, keywords, and the full text of each publication’s first page.

Each tuple of author name, author name position in the author list, and unique publication identifier constitutes an authorship record (Cota et al. 2010). Using this terminology, author name disambiguation can be characterized as follows: given a set of authorship records, AND tries to determine which of these refer to the same author entity. The phenomenon of different authors sharing identical names is called name homography. Failure to distinguish between different authors with identical names will cause a merging or Mixed Citation (Lee et al. 2005) error.
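The notion of an authorship record, and the way name homography leads to merging errors, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the class `AuthorshipRecord`, the helpers `normalize` and `cluster_by_name`, the sample names, and the publication identifiers do not come from the paper, and grouping by normalized name string is not claimed to be the baseline algorithm used in the article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorshipRecord:
    """Hypothetical container for one authorship record."""
    author_name: str      # name string as it appears on the publication
    position: int         # 1-based position in the author list
    publication_id: str   # unique publication identifier


def normalize(name: str) -> str:
    """Lower-case the name, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def cluster_by_name(records):
    """Group records whose normalized name strings are identical.

    This treats every distinct name string as one author entity, so
    different people sharing a name (homographs) end up merged.
    """
    clusters = defaultdict(list)
    for record in records:
        clusters[normalize(record.author_name)].append(record)
    return dict(clusters)


# Illustrative records only; names and identifiers are made up.
records = [
    AuthorshipRecord("J. Smith", 1, "pub-001"),
    AuthorshipRecord("j smith", 2, "pub-002"),
    AuthorshipRecord("John Smith", 1, "pub-003"),
]

clusters = cluster_by_name(records)
for name, members in clusters.items():
    print(name, [m.publication_id for m in members])
```

If the two "J. Smith" records belong to different people, this grouping commits exactly the merging (Mixed Citation) error described above. Conversely, if "John Smith" is the same person as one of them, the name variant is wrongly kept apart. A manually disambiguated data set supplies the reference clustering against which such errors can be counted.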

