Procedure for Preparing Bibliographic Metadata Records in Author Name Disambiguation

Abstract

The quality of metadata in bibliographic records can be compromised by incompleteness and inaccuracy. One of the most significant challenges is the inaccurate representation of author names. For metadata in the MARC21 format, the MARCQuality tool, developed by researchers at the Central University “Marta Abreu” of Las Villas (UCLV), partially addresses these issues by unifying variations of the same author’s name and resolving synonymy to some extent. However, limitations remain concerning homonymy, where multiple authors share the same name. This study aims to implement a solution that optimizes the preparation of MARC21 record fields for future use in neural network models designed for author name disambiguation. As these models are not yet available, the proposed approach focuses on structuring and adapting the fields so that they can support accurate predictions once the models exist. The solution is based on disambiguation techniques using LAGOS-AND (Large, Gold Standard Dataset for Scholarly Author Name Disambiguation), as established in HFAND (Hybrid Framework for Author Name Disambiguation). Because MARC21 cataloged records differ from LAGOS-AND publications, an analysis is needed to establish correspondences between their fields. This analysis also incorporates the Dublin Core format, used by UCLV’s DSpace repository, to explore its potential integration into the MARCQuality tool. As a result, a structured procedure is developed to organize record fields for preprocessing and similarity metric calculations, facilitating their application in neural network models for author name disambiguation.
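The abstract does not give the actual field correspondences or similarity metrics, so the following is only a minimal sketch of what such a preparation step could look like. It assumes a hypothetical mapping from common MARC21 tags (100 $a main author, 700 $a added authors, 245 $a title, 650 $a topical subjects) to feature names, a simple accent-and-punctuation normalization, and Jaccard overlap as the similarity metric. The names `MARC_TO_FEATURE`, `prepare_record`, and `pair_features` are illustrative and not part of MARCQuality, LAGOS-AND, or HFAND.

```python
import re
import unicodedata

# Hypothetical mapping from (MARC21 tag, subfield) to a feature name.
# The study's real correspondence table is not given in the abstract.
MARC_TO_FEATURE = {
    ("100", "a"): "author",     # main entry, personal name
    ("700", "a"): "coauthors",  # added entry, personal name
    ("245", "a"): "title",      # title statement
    ("650", "a"): "subjects",   # subject added entry, topical term
}


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def prepare_record(marc_fields):
    """Turn (tag, subfield, value) tuples into a normalized feature dict."""
    record = {"author": "", "coauthors": set(), "title": set(), "subjects": set()}
    for tag, code, value in marc_fields:
        feature = MARC_TO_FEATURE.get((tag, code))
        if feature is None:
            continue  # field not used for disambiguation in this sketch
        clean = normalize(value)
        if feature == "author":
            record["author"] = clean
        elif feature == "title":
            record["title"].update(clean.split())  # title as a bag of words
        else:
            record[feature].add(clean)
    return record


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_features(r1, r2):
    """Similarity features for a pair of records (assumed feature set)."""
    return {
        "same_name": float(r1["author"] == r2["author"]),
        "coauthor_sim": jaccard(r1["coauthors"], r2["coauthors"]),
        "title_sim": jaccard(r1["title"], r2["title"]),
        "subject_sim": jaccard(r1["subjects"], r2["subjects"]),
    }


rec_a = prepare_record([
    ("100", "a", "Pérez, J."),
    ("245", "a", "Author name disambiguation in MARC21"),
    ("700", "a", "García, M."),
    ("650", "a", "Metadata"),
])
rec_b = prepare_record([
    ("100", "a", "Perez, J"),
    ("245", "a", "Name disambiguation with neural networks"),
    ("700", "a", "Garcia, M"),
    ("650", "a", "Metadata"),
])
print(pair_features(rec_a, rec_b))
```

A Dublin Core source would need its own mapping table in the same shape (for example, keyed on element names such as `dc.contributor.author`), which is again an assumption rather than something the abstract specifies. The resulting feature vectors are the kind of input a pairwise classifier could consume later.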

Similar Papers
  • Research Article
  • Citations: 52
  • 10.1017/s0269888917000182
A survey of author name disambiguation techniques: 2010–2016
  • Jan 1, 2017
  • The Knowledge Engineering Review
  • Ijaz Hussain + 1 more

The content and quality of service of digital libraries are badly affected by the author name ambiguity problem in citations, which is considered one of the hardest problems faced by digital library researchers. Several techniques have been proposed in the literature for the author name ambiguity problem. In this paper, we review some recently presented author name disambiguation techniques and outline some challenges and future research directions. We analyze the recent advancements in this field and classify these techniques into supervised, unsupervised, semi-supervised, graph-based and heuristic-based techniques according to the problem formulation they mainly use for author name disambiguation. A few surveys have been conducted to review different techniques for author name disambiguation. These surveys highlighted only the methodology adopted for author name disambiguation but did not critically review the shortcomings of the techniques. This survey provides a detailed review of author name disambiguation techniques available in the literature, compares these techniques at an abstract level and discusses their limitations.

  • Research Article
  • Citations: 58
  • 10.1007/s11192-017-2363-5
Data sets for author name disambiguation: an empirical analysis and a new resource
  • Jan 1, 2017
  • Scientometrics
  • Mark-Christoph Müller + 2 more

Data sets of publication metadata with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements for future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.

  • Conference Article
  • Citations: 7
  • 10.1145/3383583.3398568
Author Name Disambiguation in PubMed using Ensemble-Based Classification Algorithms
  • Aug 1, 2020
  • Kaushal Jhawar + 4 more

Author name ambiguity is a common problem in digital libraries. The problem occurs because multiple individuals may share the same name and the same individual may be represented by various names. Researchers have proposed various techniques for author name disambiguation (AND). In this paper, we study AND in the context of research publications indexed in the PubMed citation database. We perform an empirical study where we experiment with two ensemble-based classification algorithms, namely, random forest and gradient boosted decision trees, on a publicly available corpus of manually disambiguated author names from PubMed. Results show that random forest produces higher accuracy, precision, recall and F1-score, but gradient boosted trees perform competitively. We also determine which features are most discriminative given the feature set and the classifiers.

  • Research Article
  • Citations: 50
  • 10.1177/0165551519888605
A review of author name disambiguation techniques for the PubMed bibliographic database
  • Dec 1, 2019
  • Journal of Information Science
  • Debarshi Kumar Sanyal + 2 more

Author names in bibliographic databases often suffer from ambiguity owing to the same author appearing under different names and multiple authors possessing similar names. It creates difficulty in associating a scholarly work with the person who wrote it, thereby introducing inaccuracy in credit attribution, bibliometric analysis, search-by-author in a digital library and expert discovery. A plethora of techniques for disambiguation of author names has been proposed in the literature. In this article, we focus on the research efforts targeted to disambiguate author names specifically in the PubMed bibliographic database. We believe this concentrated review will be useful to the research community because it discusses techniques applied to a very large real database that is actively used worldwide. We make a comprehensive survey of the existing author name disambiguation (AND) approaches that have been applied to the PubMed database: we organise the approaches into a taxonomy; describe the major characteristics of each approach including its performance, strengths, and limitations; and perform a comparative analysis of them. We also identify the datasets from PubMed that are publicly available for researchers to evaluate AND algorithms. Finally, we outline a few directions for future work.

  • Conference Article
  • Citations: 5
  • 10.1109/intellisys.2017.8324326
LUCID: Author name disambiguation using graph Structural Clustering
  • Sep 1, 2017
  • Ijaz Hussain + 1 more

Author name ambiguity may occur in two situations: when multiple authors have the same name, or when the same author writes her name in multiple ways. The former is called a homonym and the latter a synonym. Disambiguating these authors is a non-trivial job because only a limited amount of information is available in citation data sets. In this paper, a graph structural clustering algorithm, “LUCID: Author Name Disambiguation using Graph Structural Clustering”, is proposed, which disambiguates authors by using a community detection algorithm and graph operations. In the first phase, LUCID performs some preprocessing tasks on the data set and creates blocks of ambiguous authors. In the second phase, a coauthor graph is built and “SCAN: A Structural Clustering Algorithm for Networks” is applied to detect hubs, outliers, and clusters of nodes (author communities). A hub node that intersects with many clusters is considered a homonym and is resolved by splitting across this node. Finally, the synonyms are disambiguated using the proposed hybrid similarity function. LUCID's performance is evaluated using a real data set from Arnetminer. Results show that LUCID performs better overall than baseline methods, achieving 97% pairwise precision, 74% pairwise recall, and 82% pairwise F1.
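For readers unfamiliar with pairwise metrics, the sketch below shows one common way to compute pairwise precision, recall, and F1 for a clustering result: every pair of records placed in the same predicted cluster counts as a predicted pair, and it is correct if the two records also share a gold author label. This is the usual textbook definition, not necessarily the exact evaluation protocol used for LUCID; the function names and the toy labels are illustrative.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Set of index pairs (i, j), i < j, whose items share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_prf(gold, predicted):
    """Pairwise precision, recall and F1 of a predicted clustering."""
    gold_pairs = same_cluster_pairs(gold)
    pred_pairs = same_cluster_pairs(predicted)
    correct = len(gold_pairs & pred_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: five records, two true authors, an imperfect clustering.
gold = ["a", "a", "a", "b", "b"]
predicted = [1, 1, 2, 2, 2]
print(pairwise_prf(gold, predicted))  # (0.5, 0.5, 0.5)
```

In the toy example, the predicted clustering both splits author "a" (hurting recall) and lumps one of that author's records with author "b" (hurting precision), so each metric comes out at 0.5.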

  • Research Article
  • Citations: 1
  • 10.1016/j.softx.2024.101719
ANDez: An open-source tool for author name disambiguation using machine learning
  • Apr 1, 2024
  • SoftwareX
  • Jinseok Kim + 1 more

Author name disambiguation in bibliographic data is challenging due to the same names of different authors and name variations of authors. Various machine learning (ML) methods address this, but a unified framework for comparing them is lacking. This study introduces ANDez, an open-source tool that integrates top-performing ML techniques for author name disambiguation. Developed in Python using popular ML libraries, ANDez provides a transparent system, merging complex procedures from different ML approaches. This promotes the assessment, modification, and benchmarking of ML techniques in author name disambiguation. ANDez's user-friendly design also helps researchers analyze ambiguous bibliographic data without needing advanced ML coding expertise.

  • Conference Article
  • Citations: 1
  • 10.1109/hpcc/smartcity/dss.2019.00217
Parallel Computing for Large-Scale Author Name Disambiguation in MEDLINE
  • Aug 1, 2019
  • Anyao Tang + 5 more

Author name disambiguation (AND) is an important task in the field of scientific data mining. It has become a great challenge with the rapid growth of academic digital libraries. The task of AND for a large number of authors is computationally intensive. In particular, an author's name in MEDLINE is represented by full last name and initials, like Zhang S, which leads to a lot of identical strings that actually represent different names. In this paper, we propose an efficient algorithm for parallel AND computation. The proposed algorithm mainly addresses the load balancing issue across many computing nodes. It involves the following strategies: (1) Author-based load balancing, which splits the computation load for each core by author name labels. (2) A matrix-based strategy, which calculates the pairwise similarity between publications, saves the results in a matrix globally shared by all processes, and then groups the publications by breadth-first search. We combine the two strategies: the second is used for authors with a large number of documents, and the first is applied to all other authors. We constructed a database of publications written by Chinese authors from MEDLINE, the biggest public database for biomedical literature (abstracts). For benchmark testing, we ran our algorithm on a dataset of 1 million publications on the Tianhe-2A supercomputer. First, we trained an AND classifier that achieves an F1 of 98.1%. The serial computation time is estimated to be approximately 246 hours, while the parallel execution time is approximately 66 hours in the case of four cores on a single node (a speedup of 3.7x). Finally, we reduced the total parallel computing time for 1 million documents to about 2 hours and achieved a parallel efficiency of 65.8% using 200 cores on 90 nodes.
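The matrix-based strategy can be pictured with a small single-process sketch: given a precomputed pairwise similarity matrix, publications are linked whenever their similarity reaches a threshold, and a breadth-first traversal collects each connected group. The threshold value, the function name `group_by_bfs`, and the toy matrix are assumptions made for illustration; the abstract does not state how the matrix is thresholded or shared across processes.

```python
from collections import deque


def group_by_bfs(similarity, threshold=0.5):
    """Group item indices into connected components of the graph that
    links i and j whenever similarity[i][j] >= threshold."""
    n = len(similarity)
    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        group = []
        while queue:
            i = queue.popleft()
            group.append(i)
            # Enqueue every unvisited publication similar enough to i.
            for j in range(n):
                if not seen[j] and similarity[i][j] >= threshold:
                    seen[j] = True
                    queue.append(j)
        groups.append(sorted(group))
    return groups


# Toy symmetric similarity matrix for four publications sharing one name string.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(group_by_bfs(sim))  # [[0, 1], [2, 3]]
```

Each returned group is a candidate author identity. Because membership is transitive, a single spurious high-similarity pair can merge two authors, which is one reason the threshold choice matters in practice.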

  • Research Article
  • Citations: 8
  • 10.1109/access.2020.3031112
Model Reuse in Machine Learning for Author Name Disambiguation: An Exploration of Transfer Learning
  • Jan 1, 2020
  • IEEE Access
  • Jinseok Kim + 1 more

Machine learning for author name disambiguation is usually conducted on the training and test subsets of labeled data created for a specific task. As a result, disambiguation models learned on heterogeneous labeled data are often inapplicable for other purposes that either do not use the same labeled data or do not make use of any labeled data at all. This article explores the idea of transfer learning in a new context, author name disambiguation. We focus on cases where a disambiguation task lacking labeled training data uses models trained on labeled data generated for other tasks. For this purpose, two labeled source datasets are used to train disambiguation models that are then applied to three target test datasets lacking labeled training data. Our results show that transfer learning can produce disambiguation performance similar to that achievable by traditional machine learning, in which training and test datasets come from the same labeled data source. Good performance through transfer learning is possible when training source datasets have feature distributions similar to those of the target test datasets. This study suggests that through transfer learning, rich disambiguation models in previous studies can be retained and reused across ambiguous bibliographic data from different fields and data sources, motivating further research on how to correct feature distribution differences between source and target datasets to expand the application of transfer learning in author name disambiguation beyond the model sharing explored in this research.

  • Conference Article
  • Citations: 8
  • 10.1109/icws53863.2021.00071
Multiple Features Driven Author Name Disambiguation
  • Sep 1, 2021
  • Qian Zhou + 4 more

Author Name Disambiguation (AND) has received more attention recently, accompanied by the increase in academic publications. To tackle the AND problem, existing studies have proposed many approaches based on different types of information, such as raw document features (e.g., co-author, title, and keywords), fusion features (e.g., a hybrid publication embedding based on raw document features), local structural information (e.g., a publication's neighborhood information on a graph), and global structural information (e.g., the interactive information between a node and others on a graph). However, no work so far has taken all of the above-mentioned information into account for the AND problem. To fill the gap, we propose a novel framework named MFAND (Multiple Features Driven Author Name Disambiguation). Specifically, we first employ the raw document and fusion features to construct six similarity graphs for each author name to be disambiguated. Next, the global and local structural information extracted from these graphs is fed into a novel encoder called R3JG, which integrates and reconstructs the above-mentioned four types of information associated with an author, with the goal of learning the latent information to enhance the generalization ability of MFAND. Then, the integrated and reconstructed information is fed into a binary classification model for disambiguation. Note that several pruning strategies are applied before the information extraction to remove noise effectively. Finally, our proposed framework is evaluated on two real-world datasets, and the experimental results show that MFAND performs better than all state-of-the-art methods.

  • Research Article
  • Citations: 278
  • 10.1145/1552303.1552304
Author name disambiguation in MEDLINE
  • Jul 1, 2009
  • ACM Transactions on Knowledge Discovery from Data
  • Vetle I Torvik + 1 more

BACKGROUND: We recently described "Author-ity," a model for estimating the probability that two articles in MEDLINE, sharing the same author name, were written by the same individual. Features include shared title words, journal name, coauthors, medical subject headings, language, affiliations, and author name features (middle initial, suffix, and prevalence in MEDLINE). Here we test the hypothesis that the Author-ity model will suffice to disambiguate author names for the vast majority of articles in MEDLINE. METHODS: Enhancements include: (a) incorporating first names and their variants, email addresses, and correlations between specific last names and affiliation words; (b) new methods of generating large unbiased training sets; (c) new methods for estimating the prior probability; (d) a weighted least squares algorithm for correcting transitivity violations; and (e) a maximum likelihood based agglomerative algorithm for computing clusters of articles that represent inferred author-individuals. RESULTS: Pairwise comparisons were computed for all author names on all 15.3 million articles in MEDLINE (2006 baseline), that share last name and first initial, to create Author-ity 2006, a database that has each name on each article assigned to one of 6.7 million inferred author-individual clusters. Recall is estimated at ~98.8%. Lumping (putting two different individuals into the same cluster) affects ~0.5% of clusters, whereas splitting (assigning articles written by the same individual to >1 cluster) affects ~2% of articles. IMPACT: The Author-ity model can be applied generally to other bibliographic databases. Author name disambiguation allows information retrieval and data integration to become person-centered, not just document-centered, setting the stage for new data mining and social network tools that will facilitate the analysis of scholarly publishing and collaboration behavior. 
AVAILABILITY: The Author-ity 2006 database is available for nonprofit academic research, and can be freely queried via http://arrowsmith.psych.uic.edu.

  • Research Article
  • Citations: 1
  • 10.1177/01655515231193859
Automatic author name disambiguation by differentiable feature selection
  • Sep 19, 2023
  • Journal of Information Science
  • Zhijian Fang + 5 more

Author name disambiguation (AND) is the task of resolving the ambiguity problem in bibliographic databases, where distinct real-world authors may share the same name or the same author may have distinct names. The aim of AND is to split the name-ambiguous entities (articles) into the corresponding authors. Existing AND algorithms mainly focus on designing different similarity metrics between two ambiguous articles. However, most previous methods select and process the features of entities empirically, then use those features to predict similarity with data-driven models. In this article, we are motivated by two natural questions: Which features are most useful for splitting name-ambiguous entities? Can they be automatically determined by an optimisation approach rather than heuristic feature engineering? We therefore propose a novel end-to-end differentiable feature selection algorithm that automatically searches for the optimal features for the AND task (AAND). AAND optimises the discrete feature selection by differentiable Gumbel-Softmax, leading to the joint learning of the feature selection policy and the similarity prediction model. The experiments are conducted on a benchmark data set, S2AND, which harmonises eight different AND data sets. The results show that the performance of our proposal is superior to that of advanced AND methods and feature selection algorithms. Meanwhile, deep insights into AND features are also given.

  • Research Article
  • Citations: 5
  • 10.5860/lrts.43n1.14
An Analysis of Tables of Contents in Recent English-Language Books
  • Jan 1, 1999
  • Library Resources & Technical Services
  • R Conrad Winke

A sample of 648 current English-language book publications with Library of Congress cataloging was examined to determine how many have tables of contents suitable for inclusion in bibliographic records. They were also examined to determine the number whose bibliographic records already contain contents notes (MARC field 505) supplied by the Library of Congress, the overall average length of their tables of contents, the levels of complexity or hierarchy of tables of contents, whether the tables of contents were subject-based or author/title-based, how many new author names would be added to a bibliographic record that contained an analytic table of contents note, whether books on certain subjects are more likely than others to include tables of contents, and the proportion of books with usable tables of contents that also have subject indexes which might be usable for enhancing keyword access. Finally, I examined all current bibliographic records produced by the Library of Congress in order to determine how many books in general include subject indexes and how many bibliographic records contain contents notes. It was found that 92.75% of the books examined had tables of contents that could be included in catalog records, with an average length of 67.75 words. Most tables of contents contain one or two levels of hierarchy. Author/title-based tables of contents account for 25.62% of the sample pool, with each table containing an average of 15.58 names. Finally, 1.12% of the bibliographic records currently produced by the Library of Congress include contents notes and 53.96% indicate the presence of an index.

  • Research Article
  • Citations: 39
  • 10.1016/j.joi.2015.08.004
Exploring author name disambiguation on PubMed-scale
  • Oct 1, 2015
  • Journal of Informetrics
  • Min Song + 2 more


  • Research Article
  • Citations: 8
  • 10.1002/asi.24720
LAGOS‐AND: A large gold standard dataset for scholarly author name disambiguation
  • Nov 28, 2022
  • Journal of the Association for Information Science and Technology
  • Li Zhang + 2 more

In this article, we present a method to automatically build large labeled datasets for the author ambiguity problem in the academic world by leveraging the authoritative academic resources, ORCID and DOI. Using the method, we built LAGOS‐AND, two large, gold‐standard sub‐datasets for author name disambiguation (AND), of which LAGOS‐AND‐BLOCK is created for clustering‐based AND research and LAGOS‐AND‐PAIRWISE is created for classification‐based AND research. Our LAGOS‐AND datasets are substantially different from the existing ones. The initial versions of the datasets (v1.0, released in February 2021) include 7.5 M citations authored by 798 K unique authors (LAGOS‐AND‐BLOCK) and close to 1 M instances (LAGOS‐AND‐PAIRWISE). And both datasets show close similarities to the whole Microsoft Academic Graph (MAG) across validations of six facets. In building the datasets, we reveal the variation degrees of last names in three literature databases, PubMed, MAG, and Semantic Scholar, by comparing author names hosted to the authors' official last names shown on the ORCID pages. Furthermore, we evaluate several baseline disambiguation methods as well as the MAG's author IDs system on our datasets, and the evaluation helps identify several interesting findings. We hope the datasets and findings will bring new insights for future studies. The code and datasets are publicly available.

  • Research Article
  • Citations: 2
  • 10.1016/j.knosys.2024.112624
A cross-domain transfer learning model for author name disambiguation on heterogeneous graph with pretrained language model
  • Oct 18, 2024
  • Knowledge-Based Systems
  • Zhenyuan Huang + 4 more

