Author Name Disambiguation in Scholarly Research: A Bibliometric Perspective

Abstract

The rapid expansion of scholarly publishing has amplified the long-standing challenge of author name ambiguity in academic databases. This issue, manifesting as homonymy and synonymy, undermines the accuracy of bibliometric analyses, author-level metrics, and research evaluation systems. Author Name Disambiguation (AND) has thus emerged as a critical focus area in digital scholarship, with evolving strategies ranging from supervised machine learning and graph-based models to the adoption of persistent digital identifiers like ORCID. Despite notable advancements, significant challenges remain – particularly in linguistically diverse and underrepresented regions – where metadata inconsistencies, transliteration issues, and limited ORCID adoption exacerbate disambiguation errors. This study presents a comprehensive bibliometric analysis of 2,004 publications on AND from 2005 to 2024, sourced from the Scopus database. Using tools such as Biblioshiny and VOSviewer, the analysis identifies publication trends, leading authors and institutions, core sources, co-authorship networks, and thematic evolution in the field. Findings highlight increasing international collaboration, the dominance of computer science-driven methodologies, and the critical role of metadata quality and institutional frameworks. The study concludes with recommendations for inclusive, multilingual, and interoperable disambiguation systems, advocating for cross-disciplinary collaboration to ensure equitable author identification in global scholarly communication.

Similar Papers
  • Research Article
  • Cite count: 68
  • 10.1007/s11192-017-2363-5
Data sets for author name disambiguation: an empirical analysis and a new resource
  • Jan 1, 2017
  • Scientometrics
  • Mark-Christoph Müller + 2 more

Data sets of publication meta data with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements to future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.

  • Research Article
  • Cite count: 12
  • 10.1007/s00799-024-00398-1
Author name disambiguation literature review with consolidated meta-analytic approach
  • Apr 10, 2024
  • International Journal on Digital Libraries
  • Natan S Rodrigues + 2 more

Name ambiguity is a common problem in many bibliographic repositories affecting data integrity and validity. This article presents an author name disambiguation (AND) literature review using the theory of the consolidated meta-analytic approach, including quantitative techniques and bibliometric aspects. The literature review covers information from 211 documents of the Web of Science and Scopus databases in the period 2003 to 2022. A taxonomy based on the literature was used to organize the identified approaches to solve the AND problem. We identified that the most widely used AND solving approaches are author grouping associated with similarity functions and clustering methods and some works using author assignment allied to classification methods. The countries that publish most in AND are the USA, China, Germany, and Brazil with 21%, 19%, 13% and 8% of the total papers, respectively. The review results provide an overview of AND state-of-the-art research that can direct further investigation based on the quantitative and qualitative information from the AND research history.

  • Research Article
  • Cite count: 67
  • 10.1177/0165551519888605
A review of author name disambiguation techniques for the PubMed bibliographic database
  • Dec 1, 2019
  • Journal of Information Science
  • Debarshi Kumar Sanyal + 2 more

Author names in bibliographic databases often suffer from ambiguity owing to the same author appearing under different names and multiple authors possessing similar names. It creates difficulty in associating a scholarly work with the person who wrote it, thereby introducing inaccuracy in credit attribution, bibliometric analysis, search-by-author in a digital library and expert discovery. A plethora of techniques for disambiguation of author names has been proposed in the literature. In this article, we focus on the research efforts targeted to disambiguate author names specifically in the PubMed bibliographic database. We believe this concentrated review will be useful to the research community because it discusses techniques applied to a very large real database that is actively used worldwide. We make a comprehensive survey of the existing author name disambiguation (AND) approaches that have been applied to the PubMed database: we organise the approaches into a taxonomy; describe the major characteristics of each approach including its performance, strengths, and limitations; and perform a comparative analysis of them. We also identify the datasets from PubMed that are publicly available for researchers to evaluate AND algorithms. Finally, we outline a few directions for future work.

  • Research Article
  • Cite count: 1
  • 10.18438/b8md0g
Personal Publications Lists Serve as a Reliable Calibration Parameter to Compare Coverage in Academic Citation Databases with Scientific Social Media
  • Mar 15, 2017
  • Evidence Based Library and Information Practice
  • Emma Hughes

A Review of:
 Hilbert, F., Barth, J., Gremm, J., Gros, D., Haiter, J., Henkel, M., Reinhardt, W., & Stock, W.G. (2015). Coverage of academic citation databases compared with coverage of scientific social media: personal publication lists as calibration parameters. Online Information Review 39(2): 255-264. http://dx.doi.org/10.1108/OIR-07-2014-0159
 
 Abstract
 
 Objective – The purpose of this study was to explore coverage rates of information science publications in academic citation databases and scientific social media using a new method of personal publication lists as a calibration parameter. The research questions were: How many publications are covered in different databases, which has the best coverage, and what institutions are represented and how does the language of the publication play a role?
 
 Design – Bibliometric analysis.
 
 Setting – Academic citation databases (Web of Science, Scopus, Google Scholar) and scientific social media (Mendeley, CiteULike, Bibsonomy).
 
 Subjects – 1,017 library and information science publications produced by 76 information scientists at 5 German-speaking universities in Germany and Austria.
 
 Methods – Only documents which were published between 1 January 2003 and 31 December 2012 were included. In that time the 76 information scientists had produced 1,017 documents. The information scientists confirmed that their publication lists were complete and these served as the calibration parameter for the study. The citations from the publication lists were searched in three academic databases: Google Scholar, Web of Science (WoS), and Scopus; as well as three social media citation sites: Mendeley, CiteULike, and BibSonomy and the results were compared. The publications were searched for by author name and words from the title.
 
 Main results – None of the databases investigated had 100% coverage. In the academic databases, Google Scholar had the highest amount of coverage with an average of 63%, Scopus an average of 31%, and lowest was WoS with an average of 15%. On social media sites, Bibsonomy had the highest coverage with an average of 24%, Mendeley had an average coverage of 19%, and the lowest coverage was CiteULike with an average of 8%. 
 
 Conclusion – Personal publication lists are reliable calibration parameters for comparing the coverage of information scientists in academic citation databases with that in scientific social media. Academic citation databases, Google Scholar in particular, had a higher coverage of publications than scientific social media sites. The authors recommend that information scientists personally publish work on social media citation databases to increase exposure. Formulating a publication strategy may be useful to identify journals with the most exposure in academic citation databases. Individuals should be encouraged to keep personal publication lists, which can be used as calibration parameters to measure coverage in the future.

  • Research Article
  • 10.1016/j.jhin.2026.01.029
Global trends and thematic evolution in antimicrobial stewardship research: a comprehensive bibliometric and network analysis (1977-2025).
  • May 1, 2026
  • The Journal of hospital infection
  • M M E Taha + 13 more

  • Conference Article
  • Cite count: 3
  • 10.1145/3589334.3645596
Author Name Disambiguation via Paper Association Refinement and Compositional Contrastive Embedding
  • May 13, 2024
  • Dezhi Liu + 3 more

Author name disambiguation (AND) is an essential task for online academic retrieval systems. Recent models adopt representation learning for author name disambiguation. Despite achieving remarkable success, these methods may be limited in two aspects. First, the heuristically constructed paper association graphs used for representation learning contain uncertainties that may cause negative supervision. Second, existing training objectives, such as binary cross-entropy loss, used to train representation learning models may not produce sufficiently high-quality representations for AND. To tackle the above problems, we propose an association refining and compositional contrasting (ARCC) framework for AND tasks. ARCC first adopts an iterative graph structure refinement process to dynamically reduce the uncertainties in paper graphs. Then, a compositional contrastive learning method is proposed to encourage learning more discriminative representations for AND. Empirical studies on two benchmark datasets suggest that ARCC is effective for AND and outperforms the state-of-the-art models.

  • Research Article
  • Cite count: 4
  • 10.12928/telkomnika.v19i3.18877
Author identification in bibliographic data using deep neural networks
  • Jun 1, 2021
  • TELKOMNIKA (Telecommunication Computing Electronics and Control)
  • Firdaus Firdaus + 7 more

Author name disambiguation (AND) is a challenging task for scholars who mine bibliographic information for scientific knowledge. A constructive approach for resolving name ambiguity is to use computer algorithms to identify author names. Some algorithm-based disambiguation methods have been developed by computer and data scientists. Among them, supervised machine learning has been reported to produce decent to very accurate disambiguation results. This paper presents a combination of principal component analysis (PCA) for feature reduction and deep neural networks (DNNs) as a supervised algorithm for classifying AND problems. The raw data is grouped into four classes: synonyms, homonyms, homonyms-synonyms, and non-homonyms-synonyms. We tuned several hyperparameters, such as learning rate, batch size, and the number of neurons and hidden units, and analyzed their impact on the accuracy of results. To the best of our knowledge, there are no previous studies with such a scheme. The proposed DNNs are validated against other ML techniques such as Naïve Bayes, random forest (RF), and support vector machine (SVM) to produce a good classifier. Across all data, our proposed DNN classifier outperformed the other ML techniques, with accuracy, precision, recall, and F1-score of 99.98%, 97.98%, 97.86%, and 99.99%, respectively. In the future, this approach can be easily extended to any dataset and any bibliographic records provider.

  • Research Article
  • Cite count: 65
  • 10.1017/s0269888917000182
A survey of author name disambiguation techniques: 2010–2016
  • Jan 1, 2017
  • The Knowledge Engineering Review
  • Ijaz Hussain + 1 more

The content and quality of service of digital libraries are badly affected by the author name ambiguity problem in citations, which is considered one of the hardest problems faced by digital library researchers. Several techniques have been proposed in the literature for the author name ambiguity problem. In this paper, we review some recently presented author name disambiguation techniques and outline some challenges and future research directions. We analyze the recent advancements in this field and classify these techniques into supervised, unsupervised, semi-supervised, graph-based and heuristic-based techniques according to the problem formulation mainly used for author name disambiguation. A few surveys have been conducted to review different techniques for author name disambiguation. These surveys highlighted only the methodology adopted but did not critically review the shortcomings of the techniques. This survey provides a detailed review of author name disambiguation techniques available in the literature, compares these techniques at an abstract level and discusses their limitations.

  • Conference Article
  • Cite count: 1
  • 10.1109/hpcc/smartcity/dss.2019.00217
Parallel Computing for Large-Scale Author Name Disambiguation in MEDLINE
  • Aug 1, 2019
  • Anyao Tang + 5 more

Author name disambiguation (AND) is an important task in the field of scientific data mining. It has become a great challenge with the rapid growth of academic digital libraries. The task of AND for a large number of authors is computationally intensive. In particular, an author's name in MEDLINE is represented by full last name and initials, like Zhang S, which leads to a lot of identical strings that actually represent different names. In this paper, we propose an efficient algorithm for parallel AND computation. The proposed algorithm mainly addresses the load balancing issue across many computing nodes. It involves the following strategies: (1) Author-based load balancing, which splits the computation load for each core by author name labels. (2) A matrix-based strategy, which calculates the pairwise similarity between publications and saves the results in a matrix globally shared by all processes; the publications are then grouped by breadth-first search. We combine the two strategies: the second handles authors with a large number of documents, and the first handles all other authors. We constructed a database of publications by Chinese authors from MEDLINE, the biggest public database for biomedical literature (abstracts). For benchmark testing, we tested our algorithm with a dataset of 1 million publications on the Tianhe-2A supercomputer. Firstly, we trained an AND classifier that achieves an F1 score of 98.1%. The serial computation time is estimated to be approximately 246 hours, while the parallel execution time is approximately 66 hours in the case of four cores on a single node (a speedup of 3.7x). Finally, we reduced the total parallel computing time for 1 million documents to about 2 hours and achieved a parallel efficiency of 65.8% using 200 cores on 90 nodes.
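
As a rough illustration of the matrix-based strategy described above, the sketch below links two publications whenever their pairwise similarity reaches a threshold and then collects the linked publications with a breadth-first traversal. It is not the authors' implementation: the function name `group_by_bfs`, the 0.5 threshold, and the toy matrix `SIM` are all assumptions made for this example, and it runs serially rather than across processes.

```python
from collections import deque


def group_by_bfs(similarity, threshold=0.5):
    """Group publication indices into clusters with breadth-first search.

    similarity is a square list of lists of pairwise scores.
    Two publications are linked when their score is >= threshold
    (the threshold value is an illustrative assumption).
    """
    n = len(similarity)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        # Start a new cluster from the first unvisited publication.
        seen[start] = True
        queue = deque([start])
        cluster = []
        while queue:
            i = queue.popleft()
            cluster.append(i)
            # Enqueue every unvisited publication linked to i.
            for j in range(n):
                if not seen[j] and similarity[i][j] >= threshold:
                    seen[j] = True
                    queue.append(j)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical similarity scores for four publications under one name label.
SIM = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]

print(group_by_bfs(SIM))  # [[0, 1], [2, 3]]
```

Each resulting cluster is a connected component of the thresholded similarity graph, so a single strong link is enough to merge two groups; a real system would likely need a stricter merging rule.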

  • Research Article
  • Cite count: 33
  • 10.1007/s11192-016-1892-7
Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses
  • Feb 23, 2016
  • Scientometrics
  • Jan Schulz

Bibliometric analyses depend on the quality of data sets and the author name disambiguation process (ANDP), which attributes author names on papers to real persons. Errors in a data set or the ANDP result in papers being attributed to the wrong person. These errors can potentially distort the results of analyses based on such data sets. However, the general impact of data set quality on bibliometric analysis is mostly unknown, as such an assessment is costly due to the manual steps involved. This paper presents an overview of the data set qualities produced by different ANDPs and uses simulations to study the general impact of data set quality on different bibliometric analyses (author rankings and regression analysis with the number of papers as the dependent variable). The results show that rankings of authors are only valid on high quality data sets, which are typically not found directly in commercially available datasets. Both mean and individual per-person data set quality are important for valid ranking results. Regressions are not as influenced by the overall data set quality but instead by individual quality differences between authors. Different types of errors can potentially bias the regression results. The outcome of this study also shows the importance of reporting both overall and individual variation in data set quality, so that the validity of analyses based on these data sets can be assessed.
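
To make the simulation idea concrete, here is a minimal Monte Carlo sketch of how attribution errors can disturb an author ranking. It does not reproduce the paper's simulation design: the error model (each paper is independently reassigned to a uniformly random author with probability `error_rate`), the top-k overlap measure, the function name `simulate_topk_overlap`, and the synthetic paper counts are all assumptions chosen for illustration.

```python
import random


def simulate_topk_overlap(true_counts, error_rate, k=10, trials=200, seed=0):
    """Average share of the true top-k authors that remain in the observed top-k.

    Assumed error model: every paper is, independently and with probability
    error_rate, credited to a uniformly random author instead of its real one.
    """
    rng = random.Random(seed)
    n = len(true_counts)
    true_top = set(sorted(range(n), key=lambda a: -true_counts[a])[:k])
    total = 0.0
    for _trial in range(trials):
        observed = [0] * n
        for author, count in enumerate(true_counts):
            for _paper in range(count):
                if rng.random() < error_rate:
                    # Misattributed paper: credit goes to a random author.
                    observed[rng.randrange(n)] += 1
                else:
                    observed[author] += 1
        observed_top = set(sorted(range(n), key=lambda a: -observed[a])[:k])
        total += len(true_top & observed_top) / k
    return total / trials


# Synthetic paper counts for 50 hypothetical authors.
COUNTS = [1 + (i * 7) % 30 for i in range(50)]

for rate in (0.0, 0.1, 0.3):
    print(rate, simulate_topk_overlap(COUNTS, rate))
```

With an error rate of zero the observed ranking equals the true one, so the overlap is exactly 1.0; the values for non-zero rates depend on the assumed error model and the seed.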

  • Research Article
  • Cite count: 316
  • 10.1145/1552303.1552304
Author name disambiguation in MEDLINE
  • Jul 1, 2009
  • ACM Transactions on Knowledge Discovery from Data
  • Vetle I Torvik + 1 more

BACKGROUND: We recently described "Author-ity," a model for estimating the probability that two articles in MEDLINE, sharing the same author name, were written by the same individual. Features include shared title words, journal name, coauthors, medical subject headings, language, affiliations, and author name features (middle initial, suffix, and prevalence in MEDLINE). Here we test the hypothesis that the Author-ity model will suffice to disambiguate author names for the vast majority of articles in MEDLINE. METHODS: Enhancements include: (a) incorporating first names and their variants, email addresses, and correlations between specific last names and affiliation words; (b) new methods of generating large unbiased training sets; (c) new methods for estimating the prior probability; (d) a weighted least squares algorithm for correcting transitivity violations; and (e) a maximum likelihood based agglomerative algorithm for computing clusters of articles that represent inferred author-individuals. RESULTS: Pairwise comparisons were computed for all author names on all 15.3 million articles in MEDLINE (2006 baseline), that share last name and first initial, to create Author-ity 2006, a database that has each name on each article assigned to one of 6.7 million inferred author-individual clusters. Recall is estimated at ~98.8%. Lumping (putting two different individuals into the same cluster) affects ~0.5% of clusters, whereas splitting (assigning articles written by the same individual to >1 cluster) affects ~2% of articles. IMPACT: The Author-ity model can be applied generally to other bibliographic databases. Author name disambiguation allows information retrieval and data integration to become person-centered, not just document-centered, setting the stage for new data mining and social network tools that will facilitate the analysis of scholarly publishing and collaboration behavior. 
AVAILABILITY: The Author-ity 2006 database is available for nonprofit academic research, and can be freely queried via http://arrowsmith.psych.uic.edu.
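
The two error types named in this abstract, lumping and splitting, can be measured on any labeled sample. The sketch below is one possible operationalization and not the Author-ity evaluation procedure: it counts a cluster as lumped when it holds records from more than one true author, and a record as split when its true author's records are spread over more than one cluster. The helper names `block_key` and `lumping_and_splitting` and the sample labels are hypothetical.

```python
from collections import defaultdict


def block_key(last_name, first_name):
    """Blocking key: lower-cased last name plus first initial."""
    return (last_name.strip().lower(), first_name.strip()[:1].lower())


def lumping_and_splitting(predicted, truth):
    """Return (share of lumped clusters, share of split records).

    predicted[i] is the cluster assigned to record i and truth[i] is the
    real author of record i. Both definitions are assumptions of this sketch.
    """
    authors_in_cluster = defaultdict(set)
    clusters_of_author = defaultdict(set)
    for cluster, author in zip(predicted, truth):
        authors_in_cluster[cluster].add(author)
        clusters_of_author[author].add(cluster)
    lumped = sum(1 for a in authors_in_cluster.values() if len(a) > 1)
    split = sum(1 for author in truth if len(clusters_of_author[author]) > 1)
    return lumped / len(authors_in_cluster), split / len(truth)


# Hypothetical sample: author A is split over c1 and c2; B and C are lumped in c3.
PREDICTED = ["c1", "c1", "c2", "c3", "c3"]
TRUTH = ["A", "A", "A", "B", "C"]

print(block_key("Zhang", "Shu"))                # ('zhang', 's')
print(lumping_and_splitting(PREDICTED, TRUTH))  # one of three clusters lumped, three of five records split
```

Blocking on last name and first initial keeps the number of pairwise comparisons manageable, at the cost of never merging records whose names fall into different blocks.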

  • Research Article
  • Cite count: 1
  • 10.1177/01655515231193859
Automatic author name disambiguation by differentiable feature selection
  • Sep 19, 2023
  • Journal of Information Science
  • Zhijian Fang + 5 more

Author name disambiguation (AND) is the task of resolving the ambiguity problem in bibliographic databases, where distinct real-world authors may share the same name or the same author may have distinct names. The aim of AND is to split the name-ambiguous entities (articles) into the corresponding authors. Existing AND algorithms mainly focus on designing different similarity metrics between two ambiguous articles. However, most previous methods empirically select and process the features of entities, then use these features to predict the similarity with data-driven models. In this article, we are motivated by two natural questions: Which features are most useful for splitting name-ambiguous entities? Can they be automatically determined by an optimisation approach rather than heuristic feature engineering? We therefore propose a novel end-to-end differentiable feature selection algorithm that automatically searches for the optimal features for the AND task (AAND). AAND optimises the discrete feature selection by differentiable Gumbel-Softmax, leading to the joint learning of the feature selection policy and the similarity prediction model. The experiments are conducted on a benchmark data set, S2AND, which harmonises eight different AND data sets. The results show that the performance of our proposal is superior to advanced AND methods and feature selection algorithms. Meanwhile, deep insights into AND features are also given.
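
For readers unfamiliar with the Gumbel-Softmax relaxation mentioned above, the following sketch draws one relaxed sample over a set of candidate features. It shows only the sampling step in plain Python and is not the AAND architecture: the feature names, the logits, and the temperature are hypothetical, and a real model would learn the logits by gradient descent in an automatic differentiation framework.

```python
import math
import random


def gumbel_softmax(logits, tau=0.5, rng=random):
    """Draw one relaxed one-hot sample with the Gumbel-Softmax trick.

    Gumbel(0, 1) noise is added to each logit, the sum is divided by the
    temperature tau, and a softmax is applied. Lower tau pushes the
    output closer to a one-hot vector.
    """
    noise = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits]
    scaled = [(logit + g) / tau for logit, g in zip(logits, noise)]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate features and selection logits.
FEATURES = ["coauthors", "affiliation", "title words", "venue"]
LOGITS = [2.0, 0.5, 1.0, -1.0]

weights = gumbel_softmax(LOGITS, tau=0.5, rng=random.Random(0))
for name, weight in zip(FEATURES, weights):
    print(f"{name}: {weight:.3f}")
```

Because the output is a smooth function of the logits, gradients can flow through the selection step, which is what allows a selection policy and a similarity model to be trained jointly.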

  • Research Article
  • Cite count: 5
  • 10.1109/access.2022.3190088
Toward a New Paradigm for Author Name Disambiguation
  • Jan 1, 2022
  • IEEE Access
  • Ayesha Manzoor + 2 more

Author Name Disambiguation (AND) has emerged as a significant challenge in the bibliometric context with the growing volume of scientific literature. Ambiguity arises when citations written by different authors carry the same name (polysemy, or homonyms) and when one author appears under different names (synonyms, or name variants). In both cases it is difficult to associate a citation with the correct author. Polysemy and synonyms cause merging and splitting anomalies in the citations. These anomalies affect the quantification of an author’s productivity (bibliometric analysis) and the reliability and quality of the information retrieved. Many techniques for AND have been proposed in the literature; most of them do not go beyond string matching or text matching, and most do not consider the context or semantics of the terms used in the citations. In this paper, the AND problem is resolved semantically using a deep learning technique on the PubMed dataset. The experimental results show that the proposed method achieves precision, recall, and F-measure that are 11.72%, 12.5%, and 12.1% higher, respectively, than those of the pairwise class classification.

  • Research Article
  • Cite count: 33
  • 10.1007/s11192-017-2338-6
Semantic fingerprints-based author name disambiguation in Chinese documents
  • Mar 15, 2017
  • Scientometrics
  • Hongqi Han + 5 more

Author name disambiguation is an important problem that needs to be resolved in bibliometric analysis or tech mining. Many techniques have been presented; however, most of them require a long run time or additional information. A new method based on semantic fingerprints is presented to disambiguate author names without external data. A manually annotated dataset was built to test the efficiency of the presented method. Experiments using co-author features, institution features, and text fingerprints were conducted separately. We found that the first two methods had higher precision but low recall, while the text fingerprint method had higher recall and satisfactory precision. Based on these results, we integrated co-author features, institution features, and text fingerprints into semantic fingerprints for disambiguating author names, achieving better performance on the F-measure.

  • Research Article
  • 10.3126/qjmss.v7i2.87817
Social Protection in Shaping Labor Migration Decisions among Youth in South Asia: A Bibliometric Analysis
  • Dec 28, 2025
  • Quest Journal of Management and Social Sciences
  • Hemanta Panthi + 3 more

Background: Labor markets in developing countries often face structural challenges, including limited opportunities, informality, and rising inequalities, that push many young people to migrate in search of better livelihoods. As social protection systems evolve, they increasingly shape migration decisions by reducing risks, supporting mobility, and influencing how youth respond to labor market constraints. Purpose: In South Asia, labor potential exceeds fragile market structures, nudging migration. The study aims to consolidate fragmented research across labor, migration, and social policy, providing an integrated understanding of how welfare mechanisms influence migration decisions and labor mobility. Methodology: A bibliometric analysis was conducted using the Scopus database (1991–2024), following the PRISMA framework for systematic selection of studies. Of the 556 records initially identified, 254 peer-reviewed articles met the inclusion criteria. Using VOSviewer and the Biblioshiny package in R, co-authorship networks, keyword co-occurrence, and citation structures were mapped to identify intellectual patterns and thematic evolution. Findings: The bibliometric analysis shows a shift from theoretical debates on globalization and market flexibility to applied studies on migration, informality, and social protection. The collaboration pattern is fragmented but gradually expanding, signaling potential for stronger global and interdisciplinary engagement. Post-2022 research, shaped by the COVID-19 crisis, emphasizes inequality, resilience, and marginalized labor, redefining social protection as a proactive and adaptive mechanism for inclusive labor systems. Conclusion: Social protection emerges as both stabilizing and enabling, shaping migration through risk minimization and behavioral incentives. The field is maturing toward an integrated labor–migration framework, though regional and conceptual gaps persist.
Keywords: Labor migration, social protection, Youth employment, Bibliometric analysis, South Asia
