Sustainable quality in data preparation

Abstract

Data preparation is crucial for achieving good data management following the four foundational FAIR principles — Findability, Accessibility, Interoperability, and Reusability. Processing datasets to achieve high data (and metadata) quality is mandatory in modern applications. However, the data preparation activities that are needed to reach such levels may easily become unsustainable due to, for example, resource intensity or scalability challenges. Moreover, some preparation efforts may become unnecessary if they result in negligible improvements or duplicate actions. This paper examines the sustainability aspects of data preparation through the lens of a circular economy. Within the data landscape, this perspective encourages practices that minimize waste, extend the data life cycle, and maximize reuse in alignment with the FAIR principles. We explore these practices and their impact on selecting and configuring effective data preparation strategies to design sustainable, high-quality pipelines. To this end, we propose an evaluation model that integrates data quality metrics with sustainability parameters for human and computational tasks. Finally, we apply the model in a comparative analysis of key data preparation methods, demonstrating its effectiveness in assessing sustainability and quality trade-offs.
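
To make the idea of the evaluation model concrete, the following sketch scores candidate preparation steps by quality gained per unit of human and computational effort. It is a minimal illustration only: the step names, numbers, weights, and the linear aggregation are assumptions for exposition, not the model proposed in the paper.

```python
from dataclasses import dataclass

@dataclass
class PreparationStep:
    """Hypothetical summary of one data preparation activity."""
    name: str
    quality_gain: float   # improvement in an aggregate quality score, 0..1
    cpu_hours: float      # computational effort
    human_hours: float    # manual effort (curation, labeling, review)

def sustainability_score(step: PreparationStep,
                         w_cpu: float = 0.5,
                         w_human: float = 0.5) -> float:
    """Toy cost-aware score: quality gained per unit of weighted effort.

    Higher is better; steps with negligible quality gain or very high
    effort score close to zero and become candidates for removal from
    the pipeline. The weights and the linear cost model are assumptions.
    """
    effort = w_cpu * step.cpu_hours + w_human * step.human_hours
    return step.quality_gain / effort if effort > 0 else float("inf")

# Example: compare two candidate steps for the same pipeline slot.
dedup = PreparationStep("deduplication", quality_gain=0.15, cpu_hours=2.0, human_hours=0.5)
manual_review = PreparationStep("manual review", quality_gain=0.18, cpu_hours=0.1, human_hours=12.0)
best = max([dedup, manual_review], key=sustainability_score)
print(best.name)  # -> "deduplication" under these illustrative numbers
```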

Similar Papers
  • Research Article
  • Citations: 94
  • 10.3389/fdata.2022.850611
A Survey of Data Quality Measurement and Monitoring Tools
  • Mar 31, 2022
  • Frontiers in Big Data
  • Lisa Ehrlinger + 1 more

High-quality data is key to interpretable and trustworthy data analytics and the basis for meaningful data-driven decisions. In practical scenarios, data quality is typically associated with data preprocessing, profiling, and cleansing for subsequent tasks like data integration or data analytics. However, from a scientific perspective, considerable research has been published about the measurement (i.e., the detection) of data quality issues, and various generally applicable data quality dimensions and metrics have been discussed. In this work, we close the gap between data quality research and practical implementations with a detailed investigation of how data quality measurement and monitoring concepts are implemented in state-of-the-art tools. For the first time and in contrast to all existing data quality tool surveys, we conducted a systematic search, in which we identified 667 software tools dedicated to “data quality.” To evaluate the tools, we compiled a requirements catalog with three functionality areas: (1) data profiling, (2) data quality measurement in terms of metrics, and (3) automated data quality monitoring. Using a set of predefined exclusion criteria, we selected, for detailed investigation, 13 tools (8 commercial and 5 open-source) that provide the investigated features and are not limited to a specific domain. On the one hand, this survey allows a critical discussion of concepts that are widely accepted in research but hardly implemented in any tool observed, for example, generally applicable data quality metrics. On the other hand, it reveals potential for functional enhancement of data quality tools and supports practitioners in the selection of appropriate tools for a given use case.
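
As an illustration of the kind of "generally applicable" data quality metric discussed in this line of work, the snippet below computes completeness and uniqueness for a small table with pandas. The formulas are common textbook definitions, not the ones used by any particular surveyed tool.

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Fraction of non-null cells: a commonly cited, generally applicable metric."""
    return 1.0 - df.isna().to_numpy().mean()

def uniqueness(df: pd.DataFrame, key_columns: list[str]) -> float:
    """Fraction of rows whose key is not duplicated."""
    return 1.0 - df.duplicated(subset=key_columns).mean()

df = pd.DataFrame({"id": [1, 2, 2, 4], "value": [10.0, None, 3.5, 7.2]})
print(completeness(df))        # 0.875 (one null out of eight cells)
print(uniqueness(df, ["id"]))  # 0.75  (one duplicated id out of four rows)
```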

  • Research Article
  • Citations: 20
  • 10.3390/bdcc6040153
An Advanced Big Data Quality Framework Based on Weighted Metrics
  • Dec 9, 2022
  • Big Data and Cognitive Computing
  • Widad Elouataoui + 3 more

While the benefits of big data are numerous, its use nevertheless requires addressing new challenges related to data processing, data security, and especially the degradation of data quality. Despite the increased importance of data quality for big data, data quality measurement is currently limited to a few metrics. Indeed, while more than 50 data quality dimensions have been defined in the literature, the number of measured dimensions is limited to 11. Therefore, this paper aims to extend the measured dimensions by defining four new data quality metrics: Integrity, Accessibility, Ease of manipulation, and Security. Thus, we propose a comprehensive Big Data Quality Assessment Framework based on 12 metrics: Completeness, Timeliness, Volatility, Uniqueness, Conformity, Consistency, Ease of manipulation, Relevancy, Readability, Security, Accessibility, and Integrity. In addition, to ensure accurate data quality assessment, we apply data weights at three data unit levels: data fields, quality metrics, and quality aspects. Furthermore, we define and measure five quality aspects to provide a macro-view of data quality. Finally, an experiment is performed to implement the defined measures. The results show that the suggested methodology allows a more exhaustive and accurate big data quality assessment, defining a weighted quality score based on 12 metrics and achieving a best quality model score of 9/10.
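
A weighted quality score of this kind can be sketched as a weighted mean over per-metric scores. The metric names below reuse some of those listed in the abstract, but the weights, the 0-10 scale, and the aggregation rule are illustrative assumptions rather than the framework's actual definitions.

```python
def weighted_quality_score(scores: dict[str, float],
                           weights: dict[str, float]) -> float:
    """Weighted mean of per-metric scores (each on an assumed 0-10 scale)."""
    total_weight = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight

scores = {"completeness": 9.2, "timeliness": 7.5, "consistency": 8.8, "security": 6.0}
weights = {"completeness": 3.0, "timeliness": 1.0, "consistency": 2.0, "security": 1.0}
print(round(weighted_quality_score(scores, weights), 2))  # -> 8.39
```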

  • Dissertation
  • 10.17918/etd-2532
Data preparation for biomedical knowledge domain visualization
  • Jul 16, 2021
  • Marie B Synnestvedt + 1 more

This thesis presents a methodology of data preparation with probabilistic record linkage and information fusion for improving and enriching information visualizations of biomedical citation data. The problem of record linkage for citation databases, where only non-unique identifiers such as author names and document titles are available as common identifiers to be linked, was investigated. This problem in citation data parallels problems in clinical data, and Knowledge Discovery in Databases (KDD) methods from clinical data mining are evaluated. Probabilistic and deterministic (exact-match) record linkage models were developed and compared through the use of a gold standard or truth dataset. Empirical comparison with ROC analysis of record linkage models showed a significant difference (p=.000) in performance of a probabilistic model over deterministic models. The methodology was evaluated with probabilistic linkage of records from the Web of Science, Medline, and CINAHL citation databases in the knowledge domains of medical informatics, HIV/AIDS, and nursing informatics. Data quality metrics for datasets prepared with probabilistic record linkage and information fusion showed improvement in completeness of key variables and reduction in sample bias. The resulting visualizations offered a richer information space for users through an increase in terms entering the visualization. The significant contributions of this work include the development of a novel model of probabilistic record linkage for biomedical citation databases which improves upon existing deterministic models. In addition, a methodology for improving and enriching knowledge domain visualizations through a data preparation approach has been validated with analyses of multiple citation databases and knowledge domains. The data preparation methodology of probabilistic record linkage with information fusion offers a remedy for data quality problems and the opportunity to enrich visualizations with added content for user exploration, which in turn improves the utility of knowledge domain visualizations as a medium for assessing available evidence and forming hypotheses.
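
For readers unfamiliar with probabilistic record linkage, the sketch below shows a minimal Fellegi-Sunter-style match weight over two non-unique fields (author and title). The m/u probabilities, the string comparison, and the threshold are illustrative assumptions and do not reproduce the model developed in the thesis.

```python
import math

M = {"author": 0.95, "title": 0.90}   # P(fields agree | records match)
U = {"author": 0.05, "title": 0.01}   # P(fields agree | records do not match)

def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum of log-likelihood ratios over field agreements/disagreements."""
    weight = 0.0
    for field in M:
        agree = rec_a[field].strip().lower() == rec_b[field].strip().lower()
        if agree:
            weight += math.log2(M[field] / U[field])
        else:
            weight += math.log2((1 - M[field]) / (1 - U[field]))
    return weight

a = {"author": "Synnestvedt M", "title": "Data preparation for visualization"}
b = {"author": "synnestvedt m", "title": "Data preparation for visualization"}
print(match_weight(a, b) > 5.0)  # classify as a link above a chosen threshold
```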

  • Research Article
  • Citations: 8
  • 10.54623/fue.fcij.6.1.3
Data Quality Dimensions, Metrics, and Improvement Techniques
  • Jul 11, 2021
  • Future Computing and Informatics Journal
  • Menna Ibrahim Gabr + 2 more

  • Research Article
  • Citations: 8
  • 10.1002/hbm.25724
Widespread effects of dMRI data quality on diffusion measures in children
  • Nov 19, 2021
  • Human Brain Mapping
  • Nabin Koirala + 6 more

Diffusion magnetic resonance imaging (dMRI) datasets are susceptible to several confounding factors related to data quality, which is especially true in studies involving young children. With the recent trend of large-scale multicenter studies, it is increasingly critical to be aware of the varied impacts of data quality on measures of interest. Here, we investigated data quality and its effect on different diffusion measures using a multicenter dataset. dMRI data were obtained from 691 participants (5–17 years of age) from six different centers. Six data quality metrics (contrast-to-noise ratio, outlier slices, and absolute, relative, translational, and rotational motion) and four diffusion measures (fractional anisotropy, mean diffusivity, tract density, and tract length) were computed for each of 36 major fiber tracts for all participants. The results indicated that four of the six data quality metrics (all except absolute and translational motion) differed significantly between centers. Associations between these data quality metrics and the diffusion measures differed significantly across tracts and centers. Moreover, these effects remained significant after applying recently proposed harmonization algorithms that purport to remove unwanted between-site variation in diffusion data. These results demonstrate the widespread impact of dMRI data quality on diffusion measures. These tracts and measures have been routinely associated with individual differences as well as group-wide differences between neurotypical populations and individuals with neurological or developmental disorders. Accordingly, for analyses of individual differences or group effects (particularly in multisite datasets), we encourage the inclusion of data quality metrics in dMRI analyses.

  • Preprint Article
  • 10.5194/egusphere-egu22-7927
First experience with GNSS data quality monitoring in the distributed EPOS e-infrastructure
  • Mar 28, 2022
  • Fikri Bamahry + 3 more

The European Plate Observing System (EPOS) is a very large and complex European e-infrastructure that provides pre-operational access to a first set of datasets and services for Solid Earth research. The EPOS-GNSS Data Gateway provides, through an Application Program Interface (API) and a web portal, access to GNSS (Global Navigation Satellite Systems) RINEX data from a distributed infrastructure of data nodes. Currently, ten EPOS-GNSS nodes have been installed, and three of them are still in the pre-operational phase. To monitor the long-term data quality of EPOS-GNSS stations at the node level, ROB is developing a new service. The first step of this service is a web portal (www.gnssquality-epos.oma.be) that provides access to data quality metrics of the RINEX data available from the different EPOS-GNSS nodes.

The web portal presents plots of the long-term tracking performance of more than 1000 EPOS-GNSS stations. The plots focus on several data quality metrics such as the number of observed versus expected observations, the number of missing epochs, the number of observed satellites, the number of cycle slips, and multipath values on code observations. These metrics have been computed at the node level using GLASS and the Anubis software (https://gnutsoftware.com/software/anubis). The metrics provide helpful information for node managers or station users to assess an EPOS-GNSS station’s performance and detect potential degradation of the RINEX data quality. The outlook of this work is to investigate the possible usage of data quality metrics to detect data unsuitable for high-precision GNSS analysis for geophysical or meteorological applications. Here, we will present the newly developed web portal, the considered data quality metrics, and some preliminary results of this ongoing work.
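
One of the listed metrics, observed versus expected observations, can be illustrated with a toy calculation of epoch completeness for a station sampling at 30 s over one day. Real values are produced at the node level by GLASS and Anubis; the function below is only a simplified stand-in.

```python
def epoch_completeness(observed_epochs: int, interval_s: int = 30,
                       day_seconds: int = 86_400) -> float:
    """Ratio of observed to expected epochs for one station-day."""
    expected = day_seconds // interval_s   # 2880 epochs per day at 30 s
    return observed_epochs / expected

print(f"{epoch_completeness(2736):.1%}")   # -> 95.0%
```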

  • Research Article
  • Citations: 19
  • 10.1186/s43251-022-00068-9
Review of data quality indicators and metrics, and suggestions for indicators and metrics for structural health monitoring
  • Nov 3, 2022
  • Advances in Bridge Engineering
  • Nisrine Makhoul

Structural Health Monitoring (SHM) systems have been extensively implemented to deliver data support and safeguard structural safety in the context of structural integrity management. SHM relies on data that can be noisy, available in large amounts, or scarce. Little work has been done on SHM data quality (DQ). Therefore, this article suggests SHM DQ indicators and recommends deterministic and probabilistic SHM DQ metrics to address uncertainties, allowing better decision-making for structural integrity management. First, the literature on DQ indicators and measures is thoroughly examined. Second, and for the first time, the necessary SHM DQ indicators are identified and their definitions tailored. Then, simplified deterministic SHM DQ metrics are suggested and, more importantly, probabilistic metrics are offered to address the embedded uncertainties and to account for the data flow. A generic example of a bridge with permanent and occasional monitoring systems is provided; it helps to better understand the influence of SHM data flow on the choice of DQ metrics and the allocated probability distribution functions. Finally, a real case example is provided to test the feasibility of the suggested method within a realistic context.
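
The contrast between deterministic and probabilistic DQ metrics can be sketched for a simple completeness indicator of a monitoring data stream: the deterministic version is a single ratio, while a probabilistic version treats completeness as a distribution and reports a credible interval. The Beta model and the numbers below are assumptions for illustration, not the article's formulation.

```python
from scipy import stats

received, expected = 9_410, 10_000   # samples from one hypothetical sensor channel

# Deterministic metric: a single ratio.
deterministic_completeness = received / expected

# Probabilistic metric: completeness as a Beta-distributed quantity,
# yielding a 95% credible interval instead of a point value.
posterior = stats.beta(a=1 + received, b=1 + (expected - received))
low, high = posterior.interval(0.95)
print(f"{deterministic_completeness:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```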

  • Research Article
  • Citations: 142
  • 10.1111/psyp.13793
Standardized measurement error: A universal metric of data quality for averaged event-related potentials.
  • Mar 29, 2021
  • Psychophysiology
  • Steven J Luck + 3 more

Event-related potentials (ERPs) can be very noisy, and yet, there is no widely accepted metric of ERP data quality. Here, we propose a universal measure of data quality for ERP research, the standardized measurement error (SME), which is a special case of the standard error of measurement. Whereas some existing metrics provide a generic quantification of the noise level, the SME quantifies the data quality (precision) for the specific amplitude or latency value being measured in a given study (e.g., the peak latency of the P3 wave). It can be applied to virtually any value that is derived from averaged ERP waveforms, making it a universal measure of data quality. In addition, the SME quantifies the data quality for each individual participant, making it possible to identify participants with low-quality data and "bad" channels. When appropriately aggregated across individuals, SME values can be used to quantify the combined impact of the single-trial EEG noise and the number of trials being averaged together on the effect size and statistical power in a given experiment. If SME values were regularly included in published articles, researchers could identify the recording and analysis procedures that produce the highest data quality, which could ultimately lead to increased effect sizes and greater replicability across the field.
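
For scores obtained by averaging single-trial values (such as mean amplitude over a time window), the SME can be estimated as the standard error of those single-trial values, while bootstrapping covers scores with no simple analytic form. The sketch below illustrates both on simulated data; the simulated amplitudes and trial count are assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = rng.normal(loc=2.0, scale=8.0, size=60)  # simulated single-trial mean amplitudes (µV)

# SME for a mean-amplitude score: standard error of the single-trial values.
sme_analytic = trials.std(ddof=1) / np.sqrt(trials.size)

# Bootstrapped SME, usable for scores with no analytic form (e.g., peak latency);
# applied here to the mean so the two estimates can be compared.
boot_scores = [rng.choice(trials, size=trials.size, replace=True).mean()
               for _ in range(10_000)]
sme_boot = np.std(boot_scores, ddof=1)
print(round(sme_analytic, 2), round(sme_boot, 2))  # the two estimates should be close
```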

  • Research Article
  • Citations: 7
  • 10.1007/s40192-020-00167-3
Data Assessment Method to Support the Development of Creep-Resistant Alloys
  • Jan 16, 2020
  • Integrating Materials and Manufacturing Innovation
  • Madison Wenzlick + 4 more

This work introduces a methodology to assess data quality for the tensile, creep/stress relaxation, and fatigue properties of alloys (as well as metadata associated with manufacture) as a part of a project to develop new materials for extreme environments. The extreme environments in question are those found in the power generation sector. Data quality assessment is needed to ensure the reliability of data used in analytics to develop new materials for the power generation sector and to predict the performance of established materials in current use. As data quality metrics have not been standardized for material properties data, quality rating guidelines are developed here for the aspects of data completeness, accuracy, usability, and standardization. The specific design requirements for heat-resistant alloy development were considered in creating each metric. Establishing the quality of a dataset in these areas will enable robust analysis. High-quality data can be set aside to develop predictive models. Lower-quality data need not be discarded but can be used for experimental design. Determining the quality of a materials dataset will also provide additional metadata for the data resource and will promote data reusability. A sample high-quality dataset is presented to indicate the typical data attributes collected from relevant mechanical property testing results, which were considered when generating the data quality metrics. A data template of these attributes was created as a tool for data generators and collectors to promote uniformity and reusability of alloy data. The sparsity of the sample dataset was calculated in order to highlight the areas where data gaps pose a challenge for reliable prediction of creep rupture lifetime.
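
Sparsity, as used here, can be read as the fraction of missing cells in the property table. A minimal sketch with a hypothetical three-record table (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

def sparsity(df: pd.DataFrame) -> float:
    """Fraction of missing cells in a property table, as a simple gap indicator."""
    return float(df.isna().to_numpy().mean())

records = pd.DataFrame({
    "heat_id":         ["A1", "A2", "A3"],
    "creep_rupture_h": [12_500.0, np.nan, 8_900.0],
    "test_temp_C":     [650, 650, np.nan],
})
print(f"{sparsity(records):.0%}")  # -> 22% of cells missing in this toy table
```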

  • Research Article
  • Citations: 70
  • 10.1145/3148238
Requirements for Data Quality Metrics
  • Jun 30, 2017
  • Journal of Data and Information Quality
  • Bernd Heinrich + 4 more

Data quality and especially the assessment of data quality have been intensively discussed in research and practice alike. To support an economically oriented management of data quality and decision making under uncertainty, it is essential to assess the data quality level by means of well-founded metrics. However, if not adequately defined, these metrics can lead to wrong decisions and economic losses. Therefore, based on a decision-oriented framework, we present a set of five requirements for data quality metrics. These requirements are relevant for a metric that aims to support an economically oriented management of data quality and decision making under uncertainty. We further demonstrate the applicability and efficacy of these requirements by evaluating five data quality metrics for different data quality dimensions. Moreover, we discuss practical implications when applying the presented requirements.

  • Research Article
  • Citations: 15
  • 10.1136/bmjopen-2020-038174
Improving the quality of routine maternal and newborn data captured in primary health facilities in Gombe State, Northeastern Nigeria: a before-and-after study
  • Dec 1, 2020
  • BMJ Open
  • Antoinette Alas Bhattacharya + 6 more

Objectives: Primary objective: to assess nine data quality metrics for 14 maternal and newborn health data elements, following implementation of an integrated, district-focused data quality intervention. Secondary objective: to consider whether...

  • Research Article
  • Citations: 31
  • 10.1117/1.nph.5.1.015004
Motion correction for infant functional near-infrared spectroscopy with an application to live interaction data.
  • Feb 13, 2018
  • Neurophotonics
  • Katherine L Perdue + 3 more

Correcting for motion is an important consideration in infant functional near-infrared spectroscopy studies. We tested the performance of conventional motion correction methods and compared probe motion and data quality metrics for data collected at different infant ages (5, 7, and 12 months) and during different methods of stimulus presentation (video versus live). While 5-month-olds had slower maximum head speed than 7- or 12-month-olds, data quality metrics and hemodynamic response recovery errors were similar across ages. Data quality was also similar between video and live stimulus presentation. Motion correction algorithms, such as wavelet filtering and targeted principal component analysis, performed well for infant data using infant-specific parameters, and parameters may be used without fine-tuning for infant age or method of stimulus presentation. We recommend using wavelet filtering with [Formula: see text]; however, a range of parameters seemed acceptable. We do not recommend using trial rejection alone, because it did not improve hemodynamic response recovery as compared to no correction at all. Data quality metrics calculated from uncorrected data were associated with hemodynamic response recovery error, indicating that full simulation studies may not be necessary to assess motion correction performance.

  • Research Article
  • Citations: 24
  • 10.1074/mcp.o111.015446
Recommendations for Mass Spectrometry Data Quality Metrics for Open Access Data (Corollary to the Amsterdam Principles)
  • Nov 3, 2011
  • Molecular & Cellular Proteomics
  • Christopher R Kinsinger + 35 more

Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the United States National Cancer Institute convened the "International Workshop on Proteomic Data Quality Metrics" in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: 1) an evolving list of comprehensive quality metrics and 2) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data. By agreement, this article is published simultaneously in the Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics, and Proteomics Clinical Applications as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.

  • Research Article
  • Citations: 22
  • 10.1002/pmic.201100562
Recommendations for mass spectrometry data quality metrics for open access data (corollary to the Amsterdam principles)
  • Dec 14, 2011
  • PROTEOMICS
  • Christopher R Kinsinger + 35 more

Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the "International Workshop on Proteomic Data Quality Metrics" in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: (i) an evolving list of comprehensive quality metrics and (ii) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data. By agreement, this article is published simultaneously in Proteomics, Proteomics Clinical Applications, Journal of Proteome Research, and Molecular and Cellular Proteomics, as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.

  • Research Article
  • Citations: 9
  • 10.1002/prca.201100097
Recommendations for mass spectrometry data quality metrics for open access data (corollary to the Amsterdam principles)
  • Dec 1, 2011
  • PROTEOMICS – Clinical Applications
  • Christopher R Kinsinger + 35 more

Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the "International Workshop on Proteomic Data Quality Metrics" in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: (i) an evolving list of comprehensive quality metrics and (ii) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data. By agreement, this article is published simultaneously in Proteomics, Proteomics Clinical Applications, Journal of Proteome Research, and Molecular and Cellular Proteomics, as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.

More from: Journal of Data and Information Quality
  • Research Article
  • 10.1145/3774755
The BigFAIR Architecture: Enabling Big Data Analytics in FAIR-compliant Repositories
  • Nov 6, 2025
  • Journal of Data and Information Quality
  • João Pedro De Carvalho Castro + 3 more

  • Research Article
  • 10.1145/3770753
A GenAI System for Improved FAIR Independent Biological Database Integration
  • Oct 14, 2025
  • Journal of Data and Information Quality
  • Syed N Sakib + 3 more

  • Research Article
  • 10.1145/3770750
Ontology-Based Schema-Level Data Quality: The Case of Consistency
  • Oct 9, 2025
  • Journal of Data and Information Quality
  • Gianluca Cima + 2 more

  • Research Article
  • 10.1145/3769113
xFAIR: A Multi-Layer Approach to Data FAIRness Assessment and Data FAIRification
  • Oct 9, 2025
  • Journal of Data and Information Quality
  • Antonella Longo + 4 more

  • Research Article
  • 10.1145/3769116
FAIRness of the Linguistic Linked Open Data Cloud: an Empirical Investigation
  • Oct 9, 2025
  • Journal of Data and Information Quality
  • Maria Angela Pellegrino + 2 more

  • Research Article
  • 10.1145/3769120
Sustainable quality in data preparation
  • Oct 9, 2025
  • Journal of Data and Information Quality
  • Barbara Pernici + 12 more

  • Research Article
  • 10.1145/3769264
Editorial: Special Issue on Advanced Artificial Intelligence Technologies for Multimedia Big Data Quality
  • Sep 30, 2025
  • Journal of Data and Information Quality
  • Shaohua Wan + 3 more

  • Research Article
  • 10.1145/3743144
A Language to Model and Simulate Data Quality Issues in Process Mining
  • Jun 28, 2025
  • Journal of Data and Information Quality
  • Marco Comuzzi + 2 more

  • Research Article
  • 10.1145/3736178
Quantitative Data Valuation Methods: A Systematic Review and Taxonomy
  • Jun 24, 2025
  • Journal of Data and Information Quality
  • Malick Ebiele + 2 more

  • Research Article
  • 10.1145/3735511
Graph Metrics-driven Record Cluster Repair meets LLM-based active learning
  • Jun 24, 2025
  • Journal of Data and Information Quality
  • Victor Christen + 4 more
