A principled approach to validation of persistent identifiers
To make data collections abide by the FAIR data principles, FAIRification pipelines have recently been proposed. Such pipelines typically start with an assessment phase in which potential issues can be identified. One such issue, which negatively impacts the interoperability of data, is the incorrect usage of persistent identifiers that refer to external data sources. We address this issue by proposing a formal framework for the validation of persistent identifiers. We show that a robust implementation of this framework can be achieved by introducing group expressions: formulas whose variables refer to capture groups of a regular expression. The increase in expressivity obtained by group expressions is shown to be necessary for important validation steps such as check digit verification, prefix rules, and cross-referencing. We demonstrate the potential of this framework by implementing a validation server as a REST interface and provide empirical results on three real-life datasets. Our results show that the proposed approach scales to millions of instances and provides a robust method for the validation of persistent identifiers.
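The paper's exact formalism is not reproduced here; the following is a minimal Python sketch of the idea behind group expressions, assuming an ISSN-style check digit as the validation rule. A regular expression with named capture groups supplies the variables, and a separate predicate over those groups performs the check digit verification that a plain regular expression cannot express.

```python
import re

# Illustrative sketch (not the paper's formalism): a "group expression" pairs a
# regular expression having capture groups with a predicate over those groups.
ISSN_PATTERN = re.compile(r"^(?P<digits>\d{4}-\d{3})(?P<check>[\dX])$")

def issn_check_digit_ok(digits: str, check: str) -> bool:
    """Predicate over capture groups: weighted mod-11 check digit for ISSNs."""
    body = digits.replace("-", "")                      # the seven data digits
    total = sum(int(d) * w for d, w in zip(body, range(8, 1, -1)))
    expected = (11 - total % 11) % 11                   # a value of 10 is written as 'X'
    return check == ("X" if expected == 10 else str(expected))

def validate_issn(candidate: str) -> bool:
    """Group expression = syntactic match + predicate on the captured groups."""
    m = ISSN_PATTERN.match(candidate)
    return bool(m) and issn_check_digit_ok(m["digits"], m["check"])

assert validate_issn("2041-1723")      # valid ISSN, check digit 3
assert not validate_issn("2041-1724")  # same prefix, check digit fails
```

The same pattern extends to prefix rules (a predicate restricting the value of one group) and cross-referencing (a predicate that looks the captured value up in another source).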
20
- 10.1080/07468342.1991.11973381
- May 1, 1991
- The College Mathematics Journal
115
- 10.1038/s41597-019-0184-5
- Sep 20, 2019
- Scientific Data
3723
- 10.1080/07421222.1996.11518099
- Mar 1, 1996
- Journal of Management Information Systems
11
- 10.1016/0003-6870(80)90114-3
- Mar 1, 1980
- Applied Ergonomics
164
- 10.1038/sdata.2018.118
- Jun 26, 2018
- Scientific Data
27
- 10.1109/tfuzz.2017.2686807
- Apr 1, 2018
- IEEE Transactions on Fuzzy Systems
12
- 10.1080/00140137908924675
- Sep 1, 1979
- Ergonomics
94
- 10.3389/fdata.2022.850611
- Mar 31, 2022
- Frontiers in Big Data
5
- 10.1109/access.2022.3222786
- Jan 1, 2022
- IEEE Access
3
- 10.1016/j.ipm.2023.103522
- Oct 11, 2023
- Information Processing & Management
- Book Chapter
6
- 10.1007/978-3-319-32025-0_26
- Jan 1, 2016
The vigorous development of the Semantic Web has enabled the creation of a growing number of large-scale knowledge bases across various domains. As different knowledge bases contain overlapping and complementary information, automatically integrating them by aligning their classes and instances can improve the quality and coverage of the knowledge bases. Existing knowledge-base alignment algorithms have several limitations: (1) they are not scalable, (2) they produce poor-quality alignments, and (3) they are not fully automatic. To address these limitations, we develop a scalable partition-and-blocking based alignment framework, named Pba, which can automatically and efficiently align knowledge bases with tens of millions of instances. Pba consists of three steps. (1) Partition: we propose a new hierarchical agglomerative co-clustering algorithm to partition the class hierarchy of the knowledge base into multiple class partitions. (2) Blocking: we judiciously divide the instances in the same class partition into small blocks to further improve performance. (3) Alignment: we compute the similarity of the instances in each block using a vector space model and align the instances with large similarities. Experimental results on real and synthetic datasets show that our algorithm significantly outperforms state-of-the-art approaches in efficiency, in some cases by an order of magnitude, while maintaining high alignment quality. A minimal sketch of the blocking and alignment steps follows the abstract.
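The sketch below is illustrative only, not the authors' implementation: instances are blocked by a shared class label, and pairs whose token-based vector space similarity exceeds a threshold are aligned. The knowledge-base identifiers and labels are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Hypothetical toy instances: (identifier, class label used for blocking, text label)
KB1 = [("kb1:q1", "City", "Berlin Germany"), ("kb1:q2", "City", "Paris France")]
KB2 = [("kb2:e9", "City", "Paris, France"), ("kb2:e7", "River", "Seine")]

def vectorize(text: str) -> Counter:
    return Counter(text.lower().replace(",", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(kb1, kb2, threshold=0.6):
    # Blocking: only compare instances that share a class label.
    blocks = {}
    for src, items in (("kb1", kb1), ("kb2", kb2)):
        for ident, cls, label in items:
            blocks.setdefault(cls, {"kb1": [], "kb2": []})[src].append((ident, vectorize(label)))
    # Alignment: cosine similarity within each block.
    matches = []
    for block in blocks.values():
        for (i1, v1), (i2, v2) in product(block["kb1"], block["kb2"]):
            sim = cosine(v1, v2)
            if sim >= threshold:
                matches.append((i1, i2, round(sim, 2)))
    return matches

print(align(KB1, KB2))  # [('kb1:q2', 'kb2:e9', 1.0)]
```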
- Conference Article
3
- 10.1109/jcsse.2019.8864152
- Jul 1, 2019
Technology now shapes many aspects of human life, with artificial intelligence among its most influential elements. Creative feature engineering is an important part of machine learning methodology that supports and manipulates existing data to make it work more efficiently by modifying the dimensions of the data. Pulling useful information from external sources and combining it, however, is cumbersome, since data engineers need to manually find external data sources and process them. Therefore, the ability to modify and enrich existing data automatically, using external open data sources, could prove crucial to data engineers and scientists looking to enrich their datasets. In this paper, we propose a method that automatically augments a given structured dataset by inferring relevant dimensions from an external data source with respect to the target attribute. Specifically, our proposed algorithm first creates Bloom filters for every instance of the data items. These filters are then used to retrieve relevant information from the linked open data source, which is later processed into additional columns in the target dataset. A case study of three real-world datasets using Wikidata as the external data source is used to empirically validate our proposed method on both regression and classification tasks. The experimental results show that the datasets augmented by our proposed algorithm yield a correlation improvement of 23.11% on average for the regression task, and a ROC improvement of 86.50% for the classification task.
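The Bloom filter step can be pictured with the minimal sketch below. It is illustrative only, the Wikidata retrieval itself is omitted, and all item values are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: probabilistic set membership with no false negatives."""
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size, self.k, self.bits = size_bits, num_hashes, 0

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# One filter per data item's values; an external record would be fetched only if
# its key (e.g. an entity label) probably occurs in the filter.
row_filter = BloomFilter()
for value in ("Toyota Corolla", "1.8L", "Japan"):
    row_filter.add(value)

print("Japan" in row_filter)    # True (definitely added)
print("Germany" in row_filter)  # False with high probability
```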
- Research Article
- 10.14778/3611479.3611535
- Jul 1, 2023
- Proceedings of the VLDB Endowment
Users often want to augment and enrich entities in their datasets with relevant information from external data sources. As many external sources are accessible only via keyword-search interfaces, a user usually has to manually formulate a keyword query that extracts relevant information for each entity. This approach is challenging, as many data sources contain numerous tuples, only a small fraction of which may contain entity-relevant information. Furthermore, different datasets may represent the same information in distinct forms and under different terms (e.g., different data sources may use different names to refer to the same person). In such cases, it is difficult to formulate a query that precisely retrieves information relevant to an entity. Current methods for information enrichment mainly rely on lengthy and resource-intensive manual effort to formulate queries that discover relevant information. However, in increasingly many settings, it is important for users to get initial answers quickly and without substantial investment in resources (such as human attention). We propose a progressive approach to discovering entity-relevant information from external sources with minimal expert intervention. It leverages end users' feedback to progressively learn how to retrieve information relevant to each entity in a dataset from external data sources. Our empirical evaluation shows that our approach learns accurate strategies to deliver relevant information quickly.
- Research Article
6
- 10.3389/frma.2022.1010504
- Nov 10, 2022
- Frontiers in Research Metrics and Analytics
Reporting and presenting research activities and outcomes of research institutions in official, normative standards is increasingly important and forms the basis for complying with reporting duties. Institutional Current Research Information Systems (CRIS) serve as important databases or data sources for external and internal reporting; ideally, they are connected via interfaces to the operational systems so that automated loading routines can extract relevant research information. This investigation evaluates whether (semi-)automated reporting using open, public research information collected via persistent identifiers (PIDs) for organizations (ROR), persons (ORCID), and research outputs (DOI) can reduce the effort of reporting. For this purpose, internally maintained lists of persons to whom an ORCID record could be assigned (internal ORCID person lists) at two German research institutions, Osnabrück University (UOS) and the non-university research institution TIB - Leibniz Information Center for Science and Technology Hannover, are used to investigate ORCID coverage in external open data sources such as the FREYA PID Graph (developed by DataCite), OpenAlex, and ORCID itself. Additionally, for UOS a detailed analysis of discipline-specific ORCID coverage is conducted. Substantial differences in ORCID coverage are found between the two institutions and, for each institution, across the various external data sources. A more detailed analysis of ORCID distribution by discipline for UOS reveals disparities by research area, both internally and in the external data sources. Recommendations for future action can be derived from our results: although the current coverage of researcher IDs that could automatically be mapped is not yet sufficient to use persistent identifier-based extraction for standard (automated) reporting, it can already be a valuable input for institutional CRIS.
- Research Article
39
- 10.1016/j.datak.2010.02.010
- Mar 2, 2010
- Data & Knowledge Engineering
Refining non-taxonomic relation labels with external structured data to support ontology learning
- Research Article
2
- 10.1108/ijwis-04-2018-0020
- Apr 4, 2019
- International Journal of Web Information Systems
Purpose: This paper describes a software architecture that automatically adds semantic capabilities to data services. The proposed architecture, called OntoGenesis, is able to semantically enrich data services so that they can dynamically provide both semantic descriptions and data representations. Design/methodology/approach: The enrichment approach is designed to intercept the requests from data services. A domain ontology is constructed and evolved in accordance with the syntactic representations provided by such services in order to define the data concepts. In addition, a property matching mechanism is proposed to exploit the potential data intersection observed in data service representations and external data sources so as to enhance the domain ontology with new equivalence triples. Finally, the enrichment approach is capable of deriving on demand a semantic description and data representations that link to the domain ontology concepts. Findings: Experiments were performed using real-world datasets, such as DBpedia, GeoNames, and open government data. The obtained results show the applicability of the proposed architecture and that it can boost the development of semantic data services. Moreover, the matching approach achieved better performance when compared with other existing approaches found in the literature. Research limitations/implications: This work only considers services designed as data providers, i.e., services that provide an interface for accessing data sources. In addition, our approach assumes that both the data services and the external sources used to enhance the domain ontology have some potential for data intersection. This assumption only requires that services and external sources share particular property values. Originality/value: Unlike most approaches found in the literature, the architecture proposed in this paper is meant to semantically enrich data services in such a way that human intervention is minimal. Furthermore, an automata-based index is presented as a novel method that significantly improves the performance of the property matching mechanism.
- Conference Article
28
- 10.1109/mass.2017.85
- Oct 1, 2017
Internet-of-Things (IoT) is emerging as one of the popular technologies influencing every aspect of human life. IoT devices equipped with sensors are making every domain of the world smarter. The service sectors that benefit most are agriculture, industry, healthcare, control & automation, retail & logistics, and power & energy. The data generated in these areas is massive, requiring larger storage and stronger compute capacity. On the other hand, IoT devices are limited in processing and storage capabilities and cannot store and process the sensed data locally. Hence, there is a dire need to integrate these devices with external data sources for effective utilisation and assessment of the collected data. Several well-known message exchange protocols, such as Message Queuing Telemetry Transport (MQTT), Advanced Message Queuing Protocol (AMQP), and Constrained Application Protocol (CoAP), are applicable to IoT communications. However, a thorough study is required to understand their impact and suitability in IoT scenarios for the exchange of information between external data sources and IoT devices. In this paper, we designed and implemented an application-layer framework to test and understand the behavior of these protocols and conducted experiments on a realistic test bed using wired, WiFi, and 2/3/4G networks. The results revealed that MQTT and AMQP perform well on wired and wireless connections, whereas CoAP performs consistently well and is less network dependent. On lossy networks, CoAP generates low traffic compared to MQTT and AMQP. The low memory footprint of MQTT and CoAP makes them a better choice than AMQP.
- Research Article
36
- 10.5334/egems.270
- Mar 29, 2019
- eGEMs
Background: Sharing of research data derived from health system records supports the rigor and reproducibility of primary research and can accelerate research progress through secondary use. But public sharing of such data can create risk of re-identifying individuals, exposing sensitive health information. Method: We describe a framework for assessing re-identification risk that includes: identifying data elements in a research dataset that overlap with external data sources, identifying small classes of records defined by unique combinations of those data elements, and considering the pattern of population overlap between the research dataset and an external source. We also describe alternative strategies for mitigating risk when the external data source can or cannot be directly examined. Results: We illustrate this framework using the example of a large database used to develop and validate models predicting suicidal behavior after an outpatient visit. We identify elements in the research dataset that might create risk and propose a specific risk mitigation strategy: deleting indicators for health system (a proxy for state of residence) and visit year. Discussion: Researchers holding health system data must balance the public health value of data sharing against the duty to protect the privacy of health system members. Specific steps can provide a useful estimate of re-identification risk and point to effective risk mitigation strategies.
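The "small classes of records" check described in this framework resembles a k-anonymity style count of quasi-identifier combinations. A minimal sketch, with hypothetical records and field names, might look like this:

```python
from collections import Counter

# Hypothetical records; quasi-identifiers are fields that may also appear in an
# external data source and could be used for linkage.
records = [
    {"health_system": "A", "visit_year": 2019, "age_band": "40-49", "sex": "F"},
    {"health_system": "A", "visit_year": 2019, "age_band": "40-49", "sex": "F"},
    {"health_system": "B", "visit_year": 2020, "age_band": "80-89", "sex": "M"},
]
QUASI_IDENTIFIERS = ("health_system", "visit_year", "age_band", "sex")

def small_classes(rows, quasi_ids, k=5):
    """Return combinations of quasi-identifier values shared by fewer than k records."""
    sizes = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return {combo: n for combo, n in sizes.items() if n < k}

print(small_classes(records, QUASI_IDENTIFIERS))
# {('A', 2019, '40-49', 'F'): 2, ('B', 2020, '80-89', 'M'): 1}
```

Small classes flagged this way point to fields worth coarsening or deleting, such as the health system and visit year indicators mentioned in the abstract.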
- Conference Article
1
- 10.1109/bigdata55660.2022.10020577
- Dec 17, 2022
Queries used to draw data from high-volume, high-velocity social media data streams, such as Twitter, typically require a set of keywords to filter the data. When topics and conversations change rapidly, initial keywords may become outdated and irrelevant, which may result in incomplete data. We propose a novel technique that improves data collection from social media streams in two ways. First, we develop a query expansion method that identifies and adds emergent keywords to the initial query, which makes the data collection a dynamic process that adapts to changes in social conversations. Second, we develop a "predictive query expansion" method that combines keywords from the streams with external data sources, which enables the construction of new queries that effectively capture emergent events that a user may not have anticipated when initiating the data collection stream. We demonstrate the effectiveness of our approach with an analysis of more than 20.5 million Twitter messages related to the 2015 Baltimore protests. We use newspaper archives as an external data source from which we collect keywords to expand the queries built from the primary stream. Reproducibility: https://github.com/FarahAlshanik/QE
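A minimal sketch of the non-predictive part of such query expansion is shown below, with hypothetical seed keywords and messages: frequent terms from the matched stream that are not already in the seed set are promoted into the query.

```python
from collections import Counter
import re

# Hypothetical message window already matched by the seed keywords.
seed_keywords = {"baltimore", "protest"}
messages = [
    "Protest moving toward City Hall #Baltimore",
    "Curfew announced tonight in Baltimore after protest",
    "National Guard arriving, curfew at 10pm",
]

def emergent_keywords(msgs, seeds, top_n=2):
    """Return the most frequent non-seed terms in the current window."""
    tokens = Counter(
        tok for msg in msgs
        for tok in re.findall(r"[a-z]+", msg.lower())
        if tok not in seeds and len(tok) > 3
    )
    return [term for term, _ in tokens.most_common(top_n)]

expanded_query = seed_keywords | set(emergent_keywords(messages, seed_keywords))
print(expanded_query)  # seed terms plus emergent terms such as 'curfew'
```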
- Research Article
7
- 10.1002/nur.22324
- May 30, 2023
- Research in Nursing & Health
The 31-item Practice Environment Scale of the Nursing Work Index (PES-NWI) has been frequently used globally to measure the nurse work environment. However, due to its length and the resulting respondent burden, a more parsimonious version of the PES-NWI may be desirable. Item response theory (IRT) is a statistical technique that assists in decreasing the number of items in an instrument without sacrificing reliability and validity. Two separate samples of nurses in the United States (one called the "internal data source" and the other called the "external data source"; sample sizes = 843 and 722, respectively) were analyzed. The internal data source was randomly split into training (n = 531) and validating data sets (n = 312), while the separate, whole external data source was used as the final validating data set. Using IRT with the training data, we removed nine items; two additional items were removed based on recommendations from a previous study. Confirmatory factor analyses supported the validity of the measurement model with the 20-item version of the PES-NWI in both the internal and external validation data sources. The correlations among subscales between the 31- and 20-item versions were of high magnitude for five subscales in both validation data sets (τ = 0.84-0.89). Ultimately, we identified a 20-item version of the PES-NWI that demonstrated adequate validity and reliability while decreasing data collection burden and maintaining a factor structure similar to the original instrument. Additional research may be necessary to update the items themselves on the PES-NWI.
- Research Article
5
- 10.5081/jgps.12.1.53
- Jun 30, 2013
- Journal of Global Positioning Systems
In Global Navigation Satellite System (GNSS) positioning, ranging signals are delayed when travelling through the ionosphere, the layer of the atmosphere ranging in altitude from about 50 to 1000 km and consisting largely of ionized particles. This delay can vary from 1 meter to over 100 meters and remains one of the most significant error sources in GNSS positioning. In precise GNSS positioning applications, ionospheric errors must be accounted for. One way to do so is to treat the unknown ionospheric delay as a stochastic parameter, which accounts for the ionospheric errors in the GNSS measurements while keeping the full original information. The idea is to add ionospheric delays from external sources as pseudo-observables. In this paper, the performance of the ionosphere-weighted model is evaluated using real data sets, and the correctness of the a priori ionosphere variance is also validated.
- Research Article
2
- 10.3897/biss.6.94310
- Sep 7, 2022
- Biodiversity Information Science and Standards
Biodiversity is a data-intensive science and relies on data from a large number of disciplines in order to build up a coherent picture of the extent and trajectory of life on earth (Bowker 2000). The ability to integrate such data from different disciplines, geographic regions and scales is crucial for making better decisions towards sustainable development. As the Biodiversity Information Standards (TDWG) community tackles standards development and adoption beyond its initial emphases on taxonomy and species distributions, expanding its impact and engaging a wider audience becomes increasingly important. Biological interactions data (e.g., predator-prey, host-parasite, plant-pollinator) have been a topic of interest within TDWG for many years and a Biological Interaction Data Interest Group (IG) was established in 2016 to address that issue. The IG has been working on the complexity of representing interactions data and surveying how Darwin Core (DwC, Wieczorek 2012) is being used to represent them (Salim 2022). The importance of cross-disciplinary science and data inspired the recently funded WorldFAIR project—Global cooperation on FAIR data policy and practice—coordinated by the Committee on Data of the International Science Council (CODATA), with the Research Data Alliance (RDA) as a major partner. WorldFAIR will work with a set of case studies to advance implementation of the FAIR data principles (Fig. 1). The FAIR data principles promote good practices in data management, by making data and metadata Findable, Accessible, Interoperable, and Reusable (Wilkinson 2016). Interoperability will be a particular focus to facilitate cross-disciplinary research. A set of recommendations and a framework for FAIR assessment in a set of disciplines will be developed (Molloy 2022). One of WorldFAIR's case studies is related to plant-pollinator interactions data. Its starting point is the model and schema proposed by Salim (2022) based on the DwC standard, which adheres to the diversifying GBIF data model strategy and on the Plant-Pollinator vocabulary described by Salim (2021). The case study on plant-pollinator interactions originated in the TDWG Biological Interaction Data Interest Group (IG) and within the RDA Improving Global Agricultural Data (IGAD) Community of Practice. IGAD is a forum for sharing experiences and providing visibility to research and work in food and agricultural data and has become a space for networking and blending ideas related to data management and interoperability. This topic was chosen because interoperability of plant-pollinator data is needed for better monitoring of pollination services, understanding the impacts of cultivated plants on wild pollinators and quantifying the contribution of wild pollinators to cultivated crops, understanding the impact of domesticated bees on wild ecosystems, and understanding the behaviour of these organisms and how this influences their effectiveness as pollinators. In addition to the ecological importance of these data, pollination is economically important for food production. In Brazil, the economic value of the pollination service was estimated at US$ 12 billion in 2018 (Wolowski 2019). All eleven case studies within the WorldFAIR project are working on FAIR Implementation Profiles (FIPs), which capture comprehensive sets of FAIR principle implementation choices made by communities of practice and which can accelerate convergence and facilitate cross-collaboration between disciplines (Schultes 2020). 
The FIPs are published through the FIP Wizard, which allows the creation of FAIR Enabling Resources. The FIPs creation will be repeated by the end of the project and capture results obtained from each case study in order to advance data interoperability. In the first FIP, resources from the Global Biodiversity Information Facility (GBIF) and Global Biotic Interactions (GloBI) were catalogued by the Plant-Pollinator Case Study team, and we expect to expand the existing FAIR Enabling Resources by the end of the project and contribute to plant-pollinator data interoperability and reuse. To tackle the challenge of promoting FAIR data for plant-pollinator interactions within the broad scope of the several disciplines and subdisciplines that generate and use them, we will conduct a survey of existing initiatives handling plant-pollinator interactions data and summarise the current status of best practices in the community. Once the survey is concluded, we will choose at least five agriculture-specific plant-pollination initiatives from our partners, to serve as targets for standards adoption. For data to be interoperable and reusable, it is essential that standards and best practices are community-developed to ensure adoption by the tool builders and data scientists across the globe. TDWG plays an important role in this scenario and we expect to engage the IG and other interested parties in that discussion.
- Research Article
1
- 10.1007/s11135-022-01563-x
- Oct 27, 2022
- Quality & Quantity
To assess the quality of retrospective data, most studies using tools such as life history calendars rely on comparisons with external sources. Our research aimed to integrate quality principles into a life history calendar and to test their capacity to evaluate data quality. The purpose was to avoid reliance on external data sources because of their possible unavailability. The first quality principle was the relationship between the dating accuracy of verifiable events and the data quality of the life domains of the calendar. The second was the certainty, as self-assessed by participants through color coding, that an event took place in the quarter indicated. We designed an experiment using a paper-and-pencil life history calendar that was completed by 104 university students. Our research highlighted the relevance of using the self-assessment of certainty to assess data quality. However, we could not establish a relationship between the dating accuracy of verifiable events and the data quality of the life domains. In addition, we present a set of qualitative findings from 20 interviews conducted with study participants, explaining the approaches used to complete a life calendar and the difficulties encountered.
- Research Article
1
- 10.15407/jai2020.02.022
- Jul 15, 2020
- Artificial Intelligence
The paper considers the experience of organizing the educational process during the quarantine caused by the COVID-19 pandemic, using interactive technologies that allow instant audio communication with a remote audience, as well as intelligent tools based on artificial intelligence that can help educational institutions work more efficiently. Examples of successful use of artificial intelligence in distance learning are given. Particular attention is paid to the development of intelligent chatbots intended for use in communication with students of online courses on educational web portals. The use of technologies for ontology formation based on automatic extraction of concepts from external sources is proposed, which can considerably accelerate the construction of the intellectual component of chatbots. Artificial intelligence tools can become an essential part of distance learning during the global COVID-19 pandemic. While educational institutions were closed for quarantine and many of them transitioned to distance learning, lecturers and schoolteachers, as well as students and schoolchildren, faced the necessity of studying in this new reality. The impact of these changes depends on people's ability to learn and on the role that the education system will play in meeting the demand for quality and affordable training. The experience of organizing the educational process at the University of Education Management of the National Academy of Pedagogical Sciences of Ukraine during the quarantine caused by the COVID-19 pandemic showed that higher and postgraduate institutions were mostly ready to move to distance learning. However, most distance learning systems, on whatever platform they are organized, need to be supplemented with the ability to broadcast video (at least one-way streaming), fast transmission of various types of information, and instant feedback for voting, polls, and more. The structure of each section of a training course for an online learning system should fully cover the training material and meet all the objectives of the course. Appropriate language should be used, and the wording, syntax, and presentation of tasks should be considered. One area of application of artificial intelligence technologies in online learning is the use of chatbots. It is advisable to use computer ontologies to ensure the intellectualization of chatbots. In this case, the metadata must be understandable to both humans and software and meet the requirements of modern standards in the field of information technology. The extraction of concepts from external data sources was carried out to build the ontology.
- Research Article
- 10.6084/m9.figshare.5271907.v1
- Aug 3, 2017
At Cornell University Library, the primary entity of interest is scholarship, of which people and organizations are, by definition, both the creators and consumers. From this perspective, attention is focused on aggregate views of scholarship data. In Scholars@Cornell, we use "Symplectic Elements" [1] for the continuous and automated collection of scholarship metadata from multiple internal and external data sources. For the journal articles category, Elements captures the title of the article, the list of authors, the name of the journal, volume number, issue, ISSN, DOI, publication status, pagination, external identifiers, etc., referred to as citation items. These citation items may or may not be available in every data source. The Crossref version may differ in some details from the PubMed version, and so forth. Some fields may be missing from one version of the metadata but present in another. This leads to different metadata versions of the same scholarly publication, referred to as version entries. In Elements, a user can specify his or her preferred data source for their scholarly publications, and the VIVO Harvester API [2] can be used to push the preferred citation entries from Elements to Scholars@Cornell. In Scholars@Cornell, rather than using the VIVO Harvester API, we built an uberization module that merges the version entries from multiple data sources and creates an "uber record". To create an uber record for a publication, we ranked the sources based on the experience and intuition of two senior Cornell librarians and started with the metadata from the source they considered best. The uberization module allows us to generate and present the best of the best scholarship metadata (in terms of correctness and completeness) to users. In addition to external sources (such as WoS, PubMed, etc.), we use the Activity Insight (AI) feed as an internal local source. Any person can manually enter scholarship metadata in AI. We use such manually entered metadata (which is error-prone) as a seed in Elements to harvest additional metadata from external sources. Once the additional metadata is harvested, the uberization process merges these version entries and presents the best of the best scholarship metadata, which is later fed into Scholars@Cornell. Any scholarship metadata that does not pass the validation step of the Elements-to-Scholars transition is pushed into a curation bin. Manual curation is required here to resolve the metadata issues. We believe such curation bins can also be used to enhance the scholarship metadata, for example by adding ORCID iDs for authors, GRID IDs for organizations, abstracts of the articles, keywords, etc. We will briefly discuss the (VIVO-ISF ontology driven) data modelling and data architecture issues, as lessons learnt, that were encountered during the first phase of the Scholars@Cornell launch. https://scholars.cornell.edu
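A minimal sketch of such an uberization step is given below, with hypothetical sources, field names, and priority order: each field is taken from the highest-ranked source that provides it.

```python
# Hypothetical version entries for one publication, keyed by data source.
versions = {
    "crossref": {"doi": "10.1000/xyz123", "title": "Example Article", "issn": None},
    "pubmed":   {"doi": "10.1000/xyz123", "title": None, "issn": "1234-5678"},
    "manual":   {"doi": None, "title": "Example article (draft)", "issn": None},
}
SOURCE_PRIORITY = ["crossref", "pubmed", "manual"]  # illustrative librarian-style ranking

def uberize(version_entries, priority):
    """Merge field by field, taking each value from the highest-ranked source that has it."""
    merged = {}
    for source in priority:
        for field, value in version_entries.get(source, {}).items():
            if value is not None and field not in merged:
                merged[field] = value
    return merged

print(uberize(versions, SOURCE_PRIORITY))
# {'doi': '10.1000/xyz123', 'title': 'Example Article', 'issn': '1234-5678'}
```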