A principled approach to validation of persistent identifiers

Abstract

In order to make data collections abide by the FAIR data principles, FAIRification pipelines have recently been proposed. Such pipelines typically start with an assessment phase, in which potential issues can be identified. One such issue, which negatively impacts the interoperability of data, is the incorrect usage of persistent identifiers that refer to external data sources. We address this issue by proposing a formal framework for the validation of persistent identifiers. We show that a robust implementation of this framework can be achieved by introducing group expressions: formulas whose variables refer to capture groups of a regular expression. The increase in expressivity obtained by group expressions is shown to be necessary when confronted with important validation steps such as check digit verification, prefix rules and cross-referencing. We demonstrate the potential of this framework by implementing a validation server as a REST interface and provide empirical results on three real-life datasets. Our results show that the proposed approach scales to millions of instances and provides a robust method for the validation of persistent identifiers.
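
As a concrete illustration of what a group expression can capture, the following minimal sketch (not the paper's implementation; pattern, names and example identifiers are illustrative) couples a regular expression with a predicate over its capture groups. The required URL prefix acts as a simple prefix rule, and the ISO 7064 MOD 11-2 computation used by ORCID acts as a check digit verification that a regular expression alone cannot express.

```python
# Minimal sketch of a "group expression": a regular expression whose capture
# groups feed a predicate. Pattern, names and examples are illustrative; the
# check-digit rule follows the public ORCID (ISO 7064 MOD 11-2) specification.
import re

ORCID_PATTERN = re.compile(
    r"^https?://orcid\.org/"                   # prefix rule
    r"(?P<digits>\d{4}-\d{4}-\d{4}-\d{3})"     # first 15 digits (hyphenated)
    r"(?P<check>[\dX])$"                       # check character
)

def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check character for a string of 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def validate(identifier: str) -> bool:
    """Group expression: the regex must match AND the captured check
    character must equal the one recomputed from the captured digits."""
    match = ORCID_PATTERN.match(identifier)
    if match is None:
        return False                           # syntactic failure
    digits = match.group("digits").replace("-", "")
    return match.group("check") == orcid_check_digit(digits)

print(validate("https://orcid.org/0000-0002-1825-0097"))  # True
print(validate("https://orcid.org/0000-0002-1825-0098"))  # False: bad check digit
```

The same shape generalises to other identifier schemes: the regular expression fixes the syntax and names its parts, while the predicate over the capture groups expresses constraints such as check digits, allowed prefixes or cross-references that syntax alone cannot enforce.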

Similar Papers
  • Book Chapter
  • Cited by 6
  • 10.1007/978-3-319-32025-0_26
PBA: Partition and Blocking Based Alignment for Large Knowledge Bases
  • Jan 1, 2016
  • Yan Zhuang + 3 more

The vigorous development of the semantic web has enabled the creation of a growing number of large-scale knowledge bases across various domains. As different knowledge bases contain overlapping and complementary information, automatically integrating these knowledge bases by aligning their classes and instances can improve the quality and coverage of the knowledge bases. Existing knowledge-base alignment algorithms have some limitations: (1) not scalable, (2) poor quality, (3) not fully automatic. To address these limitations, we develop a scalable partition-and-blocking based alignment framework, named PBA, which can automatically align knowledge bases with tens of millions of instances efficiently. PBA contains three steps. (1) Partition: we propose a new hierarchical agglomerative co-clustering algorithm to partition the class hierarchy of the knowledge base into multiple class partitions. (2) Blocking: we judiciously divide the instances in the same class partition into small blocks to further improve the performance. (3) Alignment: we compute the similarity of the instances in each block using a vector space model and align the instances with large similarities. Experimental results on real and synthetic datasets show that our algorithm significantly outperforms state-of-the-art approaches in efficiency, even by an order of magnitude, while keeping high alignment quality.
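
A rough feel for the blocking and alignment steps can be given with a short sketch; this is not the PBA implementation, and the blocking key, toy instances and similarity threshold are illustrative assumptions.

```python
# Hedged sketch of blocking + vector-space alignment (steps 2 and 3 above),
# not the PBA code: instances are compared only within a shared block, and
# pairs whose cosine similarity exceeds a threshold are aligned.
from collections import Counter, defaultdict
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def align(kb1, kb2, block_key, threshold=0.8):
    """kb1, kb2: {instance_id: bag-of-words Counter}; block_key maps an
    instance to its block so only co-blocked instances are compared."""
    blocks = defaultdict(lambda: ([], []))
    for iid, vec in kb1.items():
        blocks[block_key(iid, vec)][0].append(iid)
    for iid, vec in kb2.items():
        blocks[block_key(iid, vec)][1].append(iid)
    matches = []
    for left, right in blocks.values():
        for i in left:
            for j in right:
                sim = cosine(kb1[i], kb2[j])
                if sim >= threshold:
                    matches.append((i, j, sim))
    return matches

# Toy usage: block on the alphabetically first token of each description.
kb1 = {"db:Paris": Counter("paris capital france".split())}
kb2 = {"wd:Q90": Counter("paris france capital city".split())}
print(align(kb1, kb2, block_key=lambda _id, vec: min(vec)))
```

Because similarities are only computed inside a block, the number of pairwise comparisons stays far below the quadratic all-pairs cost, which is the point of the blocking step.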

  • Conference Article
  • Cited by 3
  • 10.1109/jcsse.2019.8864152
DATA++: An Automated Tool for Intelligent Data Augmentation Using Wikidata
  • Jul 1, 2019
  • Waran Taveekarn + 5 more

Technology has become a major influence on many people's lives, with artificial intelligence being one of the most influential elements. Creative feature engineering is an important part of machine learning methodology that supports and manipulates existing data to make it work more efficiently by modifying the dimensions of the data. Pulling useful information from external sources and combining it, however, is cumbersome, since data engineers need to manually find external data sources and process them. Therefore, the ability to modify and enrich existing data automatically, using external open data sources, could prove crucial to data engineers and scientists looking to enrich their datasets. In this paper, we propose a method that automatically augments a given structured dataset by inferring relevant dimensions from an external data source with respect to the target attribute. Specifically, our proposed algorithm first creates Bloom filters for every instance of data items. Such filters are then used to retrieve relevant information from the linked open data source, which is later processed into additional columns in the target dataset. A case study of three real-world datasets using Wikidata as the external data source is used to empirically validate our proposed method on both regression and classification tasks. The experimental results show that the datasets augmented by our proposed algorithm yield a correlation improvement of 23.11% on average for the regression task, and an ROC improvement of 86.50% for the classification task.
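
The Bloom-filter step lends itself to a short sketch; the code below is an assumed illustration of the general idea (one filter per data item used as a cheap pre-check against candidate values from an external source such as Wikidata), not the DATA++ implementation, and the filter size, hash count and example values are arbitrary.

```python
# Hedged illustration of per-item Bloom filters: a tiny filter built from a
# row's textual values cheaply tests whether a candidate label from an
# external source might refer to the same entity before any expensive lookup.
import hashlib

class BloomFilter:
    def __init__(self, size: int = 256, hashes: int = 3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, value: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value: str):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(value))

# One filter per dataset row, filled with the row's textual values.
row_filter = BloomFilter()
for token in ("Ada Lovelace", "1815", "mathematician"):
    row_filter.add(token)
print(row_filter.might_contain("Ada Lovelace"))  # True
print(row_filter.might_contain("Alan Turing"))   # False (with high probability)
```

Because a Bloom filter never produces false negatives, it safely narrows down which external records are worth retrieving in full.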

  • Research Article
  • 10.14778/3611479.3611535
Effective Entity Augmentation by Querying External Data Sources
  • Jul 1, 2023
  • Proceedings of the VLDB Endowment
  • Christopher Buss + 5 more

Users often want to augment and enrich entities in their datasets with relevant information from external data sources. As many external sources are accessible only via keyword-search interfaces, a user usually has to manually formulate a keyword query that extracts relevant information for each entity. This approach is challenging as many data sources contain numerous tuples, only a small fraction of which may contain entity-relevant information. Furthermore, different datasets may represent the same information in distinct forms and under different terms (e.g., different data sources may use different names to refer to the same person). In such cases, it is difficult to formulate a query that precisely retrieves information relevant to an entity. Current methods for information enrichment mainly rely on lengthy and resource-intensive manual effort to formulate queries to discover relevant information. However, in increasingly many settings, it is important for users to get initial answers quickly and without substantial investment in resources (such as human attention). We propose a progressive approach to discovering entity-relevant information from external sources with minimal expert intervention. It leverages end users' feedback to progressively learn how to retrieve information relevant to each entity in a dataset from external data sources. Our empirical evaluation shows that our approach learns accurate strategies to deliver relevant information quickly.

  • Research Article
  • Cited by 6
  • 10.3389/frma.2022.1010504
ORCID coverage in research institutions—Readiness for partially automated research reporting
  • Nov 10, 2022
  • Frontiers in Research Metrics and Analytics
  • Kathrin Schnieders + 6 more

Reporting and presentation of research activities and outcome for research institutions in official, normative standards are more and more important and are the basis to comply with reporting duties. Institutional Current Research Information Systems (CRIS) serve as important databases or data sources for external and internal reporting, which should ideally be connected with interfaces to the operational systems for automated loading routines to extract relevant research information. This investigation evaluates whether (semi-) automated reporting using open, public research information collected via persistent identifiers (PIDs) for organizations (ROR), persons (ORCID), and research outputs (DOI) can reduce effort of reporting. For this purpose, internally maintained lists of persons to whom an ORCID record could be assigned (internal ORCID person lists) of two different German research institutions—Osnabrück University (UOS) and the non-university research institution TIB—Leibniz Information Center for Science and Technology Hannover—are used to investigate ORCID coverage in external open data sources like FREYA PID Graph (developed by DataCite), OpenAlex and ORCID itself. Additionally, for UOS a detailed analysis of discipline specific ORCID coverage is conducted. Substantial differences can be found for ORCID coverage between both institutions and for each institution regarding the various external data sources. A more detailed analysis of ORCID distribution by discipline for UOS reveals disparities by research area—internally and in external data sources. Recommendations for future actions can be derived from our results: Although the current level of coverage of researcher IDs which could automatically be mapped is still not sufficient to use persistent identifier-based extraction for standard (automated) reporting, it can already be a valuable input for institutional CRIS.

  • Research Article
  • Cited by 39
  • 10.1016/j.datak.2010.02.010
Refining non-taxonomic relation labels with external structured data to support ontology learning
  • Mar 2, 2010
  • Data & Knowledge Engineering
  • Albert Weichselbraun + 2 more

  • Research Article
  • Cited by 2
  • 10.1108/ijwis-04-2018-0020
OntoGenesis: an architecture for automatic semantic enhancement of data services
  • Apr 4, 2019
  • International Journal of Web Information Systems
  • Bruno C.N Oliveira + 3 more

Purpose: This paper describes a software architecture that automatically adds semantic capabilities to data services. The proposed architecture, called OntoGenesis, is able to semantically enrich data services, so that they can dynamically provide both semantic descriptions and data representations. Design/methodology/approach: The enrichment approach is designed to intercept the requests from data services. Therefore, a domain ontology is constructed and evolved in accordance with the syntactic representations provided by such services in order to define the data concepts. In addition, a property matching mechanism is proposed to exploit the potential data intersection observed in data service representations and external data sources so as to enhance the domain ontology with new equivalence triples. Finally, the enrichment approach is capable of deriving on demand a semantic description and data representations that link to the domain ontology concepts. Findings: Experiments were performed using real-world datasets, such as DBpedia and GeoNames, as well as open government data. The obtained results show the applicability of the proposed architecture and that it can boost the development of semantic data services. Moreover, the matching approach achieved better performance when compared with other existing approaches found in the literature. Research limitations/implications: This work only considers services designed as data providers, i.e., services that provide an interface for accessing data sources. In addition, our approach assumes that both data services and external sources – used to enhance the domain ontology – have some potential for data intersection. Such an assumption only requires that services and external sources share particular property values. Originality/value: Unlike most of the approaches found in the literature, the architecture proposed in this paper is meant to semantically enrich data services in such a way that human intervention is minimal. Furthermore, an automata-based index is also presented as a novel method that significantly improves the performance of the property matching mechanism.

  • Conference Article
  • Cited by 28
  • 10.1109/mass.2017.85
Study of Internet-of-Things Messaging Protocols Used for Exchanging Data with External Sources
  • Oct 1, 2017
  • Ajay Chaudhary + 2 more

The Internet of Things (IoT) is emerging as one of the most popular technologies influencing every aspect of human life. IoT devices equipped with sensors are making every domain of the world smarter. In particular, the service sectors that benefit most are agriculture, industry, healthcare, control & automation, retail & logistics, and power & energy. The data generated in these areas is massive, requiring bigger storage and stronger compute. On the other hand, IoT devices are limited in processing and storage capabilities and cannot store and process the sensed data locally. Hence, there is a dire need to integrate these devices with external data sources for effective utilisation and assessment of the collected data. Several existing well-known message exchange protocols like Message Queuing Telemetry Transport (MQTT), Advanced Message Queuing Protocol (AMQP), and Constrained Application Protocol (CoAP) are applicable in IoT communications. However, a thorough study is required to understand their impact and suitability in the IoT scenario for the exchange of information between external data sources and IoT devices. In this paper, we designed and implemented an application layer framework to test and understand the behavior of these protocols and conducted the experiments on a realistic test bed using wired, Wi-Fi and 2/3/4G networks. The results revealed that MQTT and AMQP perform well on wired and wireless connections, whereas CoAP performs consistently well and is less network dependent. On lossy networks, CoAP generates less traffic than MQTT and AMQP. The low memory footprint of MQTT and CoAP has made them a better choice over AMQP.

  • Research Article
  • Cited by 36
  • 10.5334/egems.270
Assessing and Minimizing Re-identification Risk in Research Data Derived from Health Care Records
  • Mar 29, 2019
  • eGEMs
  • Gregory E Simon + 7 more

Background: Sharing of research data derived from health system records supports the rigor and reproducibility of primary research and can accelerate research progress through secondary use. But public sharing of such data can create risk of re-identifying individuals, exposing sensitive health information. Method: We describe a framework for assessing re-identification risk that includes: identifying data elements in a research dataset that overlap with external data sources, identifying small classes of records defined by unique combinations of those data elements, and considering the pattern of population overlap between the research dataset and an external source. We also describe alternative strategies for mitigating risk when the external data source can or cannot be directly examined. Results: We illustrate this framework using the example of a large database used to develop and validate models predicting suicidal behavior after an outpatient visit. We identify elements in the research dataset that might create risk and propose a specific risk mitigation strategy: deleting indicators for health system (a proxy for state of residence) and visit year. Discussion: Researchers holding health system data must balance the public health value of data sharing against the duty to protect the privacy of health system members. Specific steps can provide a useful estimate of re-identification risk and point to effective risk mitigation strategies.
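
The "small classes" step of the framework reduces to a counting exercise, sketched below; the quasi-identifier columns, class-size threshold and toy records are illustrative assumptions rather than the authors' choices.

```python
# Hedged sketch of the "small classes" step: count how many records share each
# combination of potentially identifying data elements and flag combinations
# whose class size falls below a chosen threshold.
from collections import Counter

def small_classes(records, quasi_identifiers, min_class_size=5):
    """records: list of dicts; quasi_identifiers: column names assumed to
    overlap with external data sources (an illustrative choice)."""
    counts = Counter(tuple(r[col] for col in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < min_class_size}

records = [
    {"health_system": "A", "visit_year": 2015, "age_group": "40-49"},
    {"health_system": "A", "visit_year": 2015, "age_group": "40-49"},
    {"health_system": "B", "visit_year": 2016, "age_group": "80+"},
]
print(small_classes(records, ["health_system", "visit_year", "age_group"], 2))
# {('B', 2016, '80+'): 1}  -> a class of size one carries re-identification risk
```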

  • Conference Article
  • Cited by 1
  • 10.1109/bigdata55660.2022.10020577
Proactive Query Expansion for Streaming Data Using External Sources
  • Dec 17, 2022
  • Farah Alshanik + 4 more

Queries used to draw data from high-volume, high-velocity social media data streams, such as Twitter, typically require a set of keywords to filter the data. When topics and conversations change rapidly, initial keywords may become outdated and irrelevant, which may result in incomplete data. We propose a novel technique that improves data collection from social media streams in two ways. First, we develop a query expansion method that identifies and adds emergent keywords to the initial query, which makes the data collection a dynamic process that adapts to changes in social conversations. Second, we develop a "predictive query expansion" method that combines keywords from the streams with external data sources, which enables the construction of new queries that effectively capture emergent events that a user may not have anticipated when initiating the data collection stream. We demonstrate the effectiveness of our approach with an analysis of more than 20.5 million Twitter messages related to the 2015 Baltimore protests. We use newspaper archives as an external data source from which we collect keywords to expand the queries built from the primary stream. Reproducibility: https://github.com/FarahAlshanik/QE

  • Research Article
  • Cited by 7
  • 10.1002/nur.22324
Using item response theory to develop a shortened practice environment scale of the nursing work index.
  • May 30, 2023
  • Research in Nursing & Health
  • Aoyjai P Montgomery + 4 more

The 31-item Practice Environment Scale of the Nursing Work Index (PES-NWI) has been frequently used globally to measure the nurse work environment. However, due to its length and subsequent respondent burden, a more parsimonious version of the PES-NWI may be desirable. Item response theory (IRT) is a statistical technique that assists in decreasing the number of items in an instrument without sacrificing reliability and validity. Two separate samples of nurses in the United States (one called the "internal data source" and the other called the "external data source"; sample sizes = 843 and 722, respectively) were analyzed. The internal data source was randomly split into training (n = 531) and validating data sets (n = 312), while a separate whole external data source was used as the final validating data set. Using IRT with the training data, we removed nine items; two additional items were removed based on recommendations from a previous study. Confirmatory factor analyses supported the validity of the measurement model with the 20-item PES-NWI in both the internal and external validation data sources. The correlations among subscales between the 31- and 20-item versions were of high magnitude for five subscales in both validation data sets (τ = 0.84-0.89). Ultimately, we identified a 20-item version of the PES-NWI which demonstrated adequate validity and reliability properties while decreasing data collection burden yet maintaining a similar factor structure to the original instrument. Additional research may be necessary to update the items themselves on the PES-NWI.

  • Research Article
  • Cited by 5
  • 10.5081/jgps.12.1.53
Stochastic Ionosphere Models for Precise GNSS Positioning: Sensitivity Analysis
  • Jun 30, 2013
  • Journal of Global Positioning Systems
  • Peiyuan Zhou + 1 more

In Global Navigation Satellite System (GNSS) positioning, ranging signals are delayed when travelling through the ionosphere, the layer of the atmosphere ranging in altitude from about 50 to 1000 km and consisting largely of ionized particles. This delay can vary from 1 meter to over 100 meters, and is still one of the most significant error sources in GNSS positioning. In precise GNSS positioning applications, ionospheric errors must be accounted for. One way to do so is to treat the unknown ionospheric delay as a stochastic parameter, which can account for the ionospheric errors in the GNSS measurements while keeping the full original information. The idea is to add ionospheric delays from external sources as pseudo-observables. In this paper, the performance of the ionosphere-weighted model is evaluated using real data sets, and the correctness of the a priori ionosphere variance is also validated.

  • Research Article
  • Cited by 2
  • 10.3897/biss.6.94310
Plant-pollinator Interaction Data: A case study of the WorldFAIR project
  • Sep 7, 2022
  • Biodiversity Information Science and Standards
  • Debora Drucker + 10 more

Biodiversity is a data-intensive science and relies on data from a large number of disciplines in order to build up a coherent picture of the extent and trajectory of life on earth (Bowker 2000). The ability to integrate such data from different disciplines, geographic regions and scales is crucial for making better decisions towards sustainable development. As the Biodiversity Information Standards (TDWG) community tackles standards development and adoption beyond its initial emphases on taxonomy and species distributions, expanding its impact and engaging a wider audience becomes increasingly important. Biological interactions data (e.g., predator-prey, host-parasite, plant-pollinator) have been a topic of interest within TDWG for many years and a Biological Interaction Data Interest Group (IG) was established in 2016 to address that issue. The IG has been working on the complexity of representing interactions data and surveying how Darwin Core (DwC, Wieczorek 2012) is being used to represent them (Salim 2022). The importance of cross-disciplinary science and data inspired the recently funded WorldFAIR project—Global cooperation on FAIR data policy and practice—coordinated by the Committee on Data of the International Science Council (CODATA), with the Research Data Alliance (RDA) as a major partner. WorldFAIR will work with a set of case studies to advance implementation of the FAIR data principles (Fig. 1). The FAIR data principles promote good practices in data management, by making data and metadata Findable, Accessible, Interoperable, and Reusable (Wilkinson 2016). Interoperability will be a particular focus to facilitate cross-disciplinary research. A set of recommendations and a framework for FAIR assessment in a set of disciplines will be developed (Molloy 2022). One of WorldFAIR's case studies is related to plant-pollinator interactions data. Its starting point is the model and schema proposed by Salim (2022) based on the DwC standard, which adheres to the diversifying GBIF data model strategy and on the Plant-Pollinator vocabulary described by Salim (2021). The case study on plant-pollinator interactions originated in the TDWG Biological Interaction Data Interest Group (IG) and within the RDA Improving Global Agricultural Data (IGAD) Community of Practice. IGAD is a forum for sharing experiences and providing visibility to research and work in food and agricultural data and has become a space for networking and blending ideas related to data management and interoperability. This topic was chosen because interoperability of plant-pollinator data is needed for better monitoring of pollination services, understanding the impacts of cultivated plants on wild pollinators and quantifying the contribution of wild pollinators to cultivated crops, understanding the impact of domesticated bees on wild ecosystems, and understanding the behaviour of these organisms and how this influences their effectiveness as pollinators. In addition to the ecological importance of these data, pollination is economically important for food production. In Brazil, the economic value of the pollination service was estimated at US$ 12 billion in 2018 (Wolowski 2019). All eleven case studies within the WorldFAIR project are working on FAIR Implementation Profiles (FIPs), which capture comprehensive sets of FAIR principle implementation choices made by communities of practice and which can accelerate convergence and facilitate cross-collaboration between disciplines (Schultes 2020). 
The FIPs are published through the FIP Wizard, which allows the creation of FAIR Enabling Resources. The FIPs creation will be repeated by the end of the project and capture results obtained from each case study in order to advance data interoperability. In the first FIP, resources from the Global Biodiversity Information Facility (GBIF) and Global Biotic Interactions (GloBI) were catalogued by the Plant-Pollinator Case Study team, and we expect to expand the existing FAIR Enabling Resources by the end of the project and contribute to plant-pollinator data interoperability and reuse. To tackle the challenge of promoting FAIR data for plant-pollinator interactions within the broad scope of the several disciplines and subdisciplines that generate and use them, we will conduct a survey of existing initiatives handling plant-pollinator interactions data and summarise the current status of best practices in the community. Once the survey is concluded, we will choose at least five agriculture-specific plant-pollination initiatives from our partners, to serve as targets for standards adoption. For data to be interoperable and reusable, it is essential that standards and best practices are community-developed to ensure adoption by the tool builders and data scientists across the globe. TDWG plays an important role in this scenario and we expect to engage the IG and other interested parties in that discussion.

  • Research Article
  • Cited by 1
  • 10.1007/s11135-022-01563-x
Quality principles of retrospective data collected through a life history calendar
  • Oct 27, 2022
  • Quality & Quantity
  • Julie Chevallereau + 1 more

To assess the quality of retrospective data, most studies using tools such as life history calendars rely on comparisons with external sources. Our research aimed to integrate quality principles into a life history calendar and test their capacity to evaluate the data quality. The purpose was to avoid reliance on external data sources because of their possible unavailability. The first quality principle was the relationship between the dating accuracy of verifiable events and the data quality of the life domains of the calendar. The second was the certainty, as self-assessed by participants through color coding, that an event took place in the quarter indicated. We designed an experiment using a paper-and-pencil life history calendar that was completed by 104 university students. Our research highlighted the relevance of using the self-assessment of certainty to assess the data quality. However, we could not establish a relationship between the dating accuracy of verifiable events and the data quality of the life domains. In addition, we present a set of qualitative findings from 20 interviews conducted with study participants explaining the approaches used to complete a life calendar and the difficulties encountered.

  • Research Article
  • Cited by 1
  • 10.15407/jai2020.02.022
Online education empowerment with artificial intelligence tools
  • Jul 15, 2020
  • Artificial Intelligence
  • Boichenko A.V + 1 more

The experience of organizing the educational process during the quarantine caused by the COVID-19 pandemic is considered, including the use of interactive technologies that allow organizing instant audio communication with a remote audience, as well as intelligent tools based on artificial intelligence that can help educational institutions work more efficiently. Examples of the successful use of artificial intelligence in distance learning are given. Particular attention is paid to the development of intelligent chatbots intended for use in communication with students of online courses on educational web portals. The use of ontology-formation technologies based on the automatic extraction of concepts from external sources is proposed, which can considerably accelerate the construction of the intellectual component of chatbots. Artificial intelligence tools can become an essential part of distance learning during this global COVID-19 pandemic. While educational institutions were closed for quarantine and many of them transitioned to distance learning, lecturers and schoolteachers, as well as students and schoolchildren, were faced with the necessity to study in this new reality. The impact of these changes depends on people's ability to learn and on the role that the education system will play in meeting the demand for quality and affordable training. The experience of organizing the educational process at the University of Education Management of the National Academy of Pedagogical Sciences of Ukraine during the quarantine caused by the COVID-19 pandemic showed that higher and postgraduate institutions were mostly ready to move to distance learning. However, most distance learning systems, on whatever platform they are organized, need to be supplemented with the ability to broadcast video (at least one-way streaming), fast transmission of various types of information, and instant feedback through voting, polls and more. The structure of each section of a training course for the online learning system should fully cover the training material and meet all the objectives of the course. Appropriate language should be used, and the wording, syntax, and presentation of tasks should be considered. One of the areas of application of artificial intelligence technologies in online learning is the use of chatbots, which are characterized by the following properties. It is advisable to use computer ontologies to ensure the intellectualization of chatbots. In this case, the metadata must be understandable to both humans and software and meet the requirements of modern standards in the field of information technology. The extraction of concepts from external data sources was carried out to build the ontology.

  • Research Article
  • 10.6084/m9.figshare.5271907.v1
Uberization of Symplectic Elements Citation Data Entries and use of Curation Bins
  • Aug 3, 2017
  • Muhammad Javed

At Cornell University Library, the primary entity of interest is scholarship, of which people and organizations are, by definition, both the creators and consumers. From this perspective, the attention is focused on aggregate views of scholarship data. In Scholars@Cornell, we use "Symplectic Elements" [1] for the continuous and automated collection of scholarship metadata from multiple internal and external data sources. For the journal articles category, Elements captures the title of the article, list of the authors, name of the journal, volume number, issue, ISSN number, DOI, publication status, pagination, external identifiers, etc. - named citation items. These citation items may or may not be available in every data source. The Crossref version may be different in some details from the Pubmed version, and so forth. Some fields may be missing from one version of the metadata but present in another. This leads to different metadata versions of the same scholarly publication - named version entries. In Elements, a user can specify his/her preferred data source for their scholarly publications, and the VIVO Harvester API [2] can be used to push the preferred citation data entries from Elements to Scholars@Cornell. In Scholars@Cornell, rather than using the VIVO Harvester API, we built an uberization module that merges the version entries from multiple data sources and creates an "uber record". For the creation of an uber record for a publication, we ranked the sources based on the experience and intuition of two senior Cornell librarians and started with the metadata from the source they considered best. The uberization module allowed us to generate and present the best of the best scholarship metadata (in terms of correctness and completeness) to the users. In addition to external sources (such as WoS, PubMed, etc.), we use the Activity Insight (AI) feed as an internal local source. Any person can manually enter scholarship metadata in AI. We use such manually entered metadata (which is error-prone) as a seed (in Elements) to harvest additional metadata from external sources. Once additional metadata is harvested, the uberization process merges these version entries and presents the best of the best scholarship metadata, which is later fed into Scholars@Cornell. Any scholarship metadata that could not pass through the validation step of the Elements-to-Scholars transition is pushed into a curation bin. Manual curation is required here to resolve the metadata issues. We believe such curation bins can also be used to enhance the scholarship metadata, such as adding ORCID iDs for the authors, GRID IDs for the organizations, adding abstracts of the articles, keywords, etc. We will briefly discuss the (VIVO-ISF ontology driven) data modelling and data architecture issues, as lessons learnt, that were encountered during the first phase of the Scholars@Cornell launch. https://scholars.cornell.edu
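
The uberization step amounts to a source-ranked, field-by-field merge, which the following sketch illustrates; the source ranking, field names and sample values are assumptions made for the sake of the example and do not reflect the librarians' actual ordering.

```python
# Hedged sketch of a source-ranked merge ("uber record"): for every metadata
# field, take the value from the highest-ranked source that provides one.
SOURCE_RANK = ["crossref", "pubmed", "wos", "activity_insight"]  # illustrative

def uber_record(version_entries: dict) -> dict:
    """version_entries: {source_name: {field: value}} for one publication."""
    merged = {}
    for source in SOURCE_RANK:
        for field, value in version_entries.get(source, {}).items():
            # Keep the first (best-ranked) non-empty value seen for each field.
            if value and field not in merged:
                merged[field] = value
    return merged

versions = {
    "pubmed":   {"title": "A study", "doi": "", "pages": "10-19"},
    "crossref": {"title": "A Study", "doi": "10.1000/xyz123"},
}
print(uber_record(versions))
# {'title': 'A Study', 'doi': '10.1000/xyz123', 'pages': '10-19'}
```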

More from: Journal of Data and Information Quality
  • Research Article
  • 10.1145/3774755
The BigFAIR Architecture: Enabling Big Data Analytics in FAIR-compliant Repositories
  • Nov 6, 2025
  • Journal of Data and Information Quality
  • João Pedro De Carvalho Castro + 3 more

  • Research Article
  • 10.1145/3770753
A GenAI System for Improved FAIR Independent Biological Database Integration
  • Oct 14, 2025
  • Journal of Data and Information Quality
  • Syed N Sakib + 3 more

  • Research Article
  • 10.1145/3770750
Ontology-Based Schema-Level Data Quality: The Case of Consistency
  • Oct 9, 2025
  • Journal of Data and Information Quality
  • Gianluca Cima + 2 more

  • Research Article
  • 10.1145/3769113
xFAIR: A Multi-Layer Approach to Data FAIRness Assessment and Data FAIRification
  • Oct 9, 2025
  • Journal of Data and Information Quality
  • Antonella Longo + 4 more

  • Research Article
  • 10.1145/3769116
FAIRness of the Linguistic Linked Open Data Cloud: an Empirical Investigation
  • Oct 9, 2025
  • Journal of Data and Information Quality
  • Maria Angela Pellegrino + 2 more

  • Research Article
  • 10.1145/3769120
Sustainable quality in data preparation
  • Oct 9, 2025
  • Journal of Data and Information Quality
  • Barbara Pernici + 12 more

  • Research Article
  • 10.1145/3769264
Editorial: Special Issue on Advanced Artificial Intelligence Technologies for Multimedia Big Data Quality
  • Sep 30, 2025
  • Journal of Data and Information Quality
  • Shaohua Wan + 3 more

  • Research Article
  • 10.1145/3743144
A Language to Model and Simulate Data Quality Issues in Process Mining
  • Jun 28, 2025
  • Journal of Data and Information Quality
  • Marco Comuzzi + 2 more

  • Research Article
  • 10.1145/3736178
Quantitative Data Valuation Methods: A Systematic Review and Taxonomy
  • Jun 24, 2025
  • Journal of Data and Information Quality
  • Malick Ebiele + 2 more

  • Research Article
  • 10.1145/3735511
Graph Metrics-driven Record Cluster Repair meets LLM-based active learning
  • Jun 24, 2025
  • Journal of Data and Information Quality
  • Victor Christen + 4 more
