Sustainable quality in data preparation
Data preparation is crucial for achieving good data management following the four foundational FAIR principles — Findability, Accessibility, Interoperability, and Reusability. Processing datasets to achieve high data (and metadata) quality is mandatory in modern applications. However, the data preparation activities that are needed to reach such levels may easily become unsustainable due to, for example, resource intensity or scalability challenges. Moreover, some preparation efforts may become unnecessary if they result in negligible improvements or duplicate actions. This paper examines the sustainability aspects of data preparation through the lens of a circular economy. Within the data landscape, this perspective encourages practices that minimize waste, extend the data life cycle, and maximize reuse in alignment with the FAIR principles. We explore these practices and their impact on selecting and configuring effective data preparation strategies to design sustainable, high-quality pipelines. To this end, we propose an evaluation model that integrates data quality metrics with sustainability parameters for human and computational tasks. Finally, we apply the model in a comparative analysis of key data preparation methods, demonstrating its effectiveness in assessing sustainability and quality trade-offs.
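As a rough illustration of how quality gain and sustainability cost might be weighed against each other when selecting preparation steps, here is a minimal Python sketch. The step names, the linear cost model, the weights, and the negligible-gain threshold are all illustrative assumptions and do not reproduce the paper's evaluation model.

```python
from dataclasses import dataclass

@dataclass
class PrepStep:
    name: str
    quality_gain: float   # expected increase of an aggregate DQ score in [0, 1]
    cpu_hours: float      # estimated computational effort
    human_hours: float    # estimated manual effort

def sustainability_score(step: PrepStep,
                         cpu_weight: float = 1.0,
                         human_weight: float = 5.0,
                         min_gain: float = 0.01) -> float:
    """Quality gain per weighted unit of effort; 0 if the gain is negligible."""
    if step.quality_gain < min_gain:
        return 0.0  # negligible improvement: not worth the resources
    cost = cpu_weight * step.cpu_hours + human_weight * step.human_hours
    return step.quality_gain / max(cost, 1e-9)

steps = [
    PrepStep("deduplication", quality_gain=0.12, cpu_hours=2.0, human_hours=0.5),
    PrepStep("manual curation", quality_gain=0.05, cpu_hours=0.1, human_hours=8.0),
    PrepStep("re-run of deduplication", quality_gain=0.004, cpu_hours=2.0, human_hours=0.5),
]
for s in sorted(steps, key=sustainability_score, reverse=True):
    print(f"{s.name}: {sustainability_score(s):.3f}")
```

Ranking candidate steps by gain per weighted effort makes it easy to drop steps whose improvement is negligible or merely duplicates earlier work.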
- Research Article
94
- 10.3389/fdata.2022.850611
- Mar 31, 2022
- Frontiers in Big Data
High-quality data is key to interpretable and trustworthy data analytics and the basis for meaningful data-driven decisions. In practical scenarios, data quality is typically associated with data preprocessing, profiling, and cleansing for subsequent tasks like data integration or data analytics. However, from a scientific perspective, a lot of research has been published about the measurement (i.e., the detection) of data quality issues and different generally applicable data quality dimensions and metrics have been discussed. In this work, we close the gap between data quality research and practical implementations with a detailed investigation on how data quality measurement and monitoring concepts are implemented in state-of-the-art tools. For the first time and in contrast to all existing data quality tool surveys, we conducted a systematic search, in which we identified 667 software tools dedicated to “data quality.” To evaluate the tools, we compiled a requirements catalog with three functionality areas: (1) data profiling, (2) data quality measurement in terms of metrics, and (3) automated data quality monitoring. Using a set of predefined exclusion criteria, we selected 13 tools (8 commercial and 5 open-source tools) that provide the investigated features and are not limited to a specific domain for detailed investigation. On the one hand, this survey allows a critical discussion of concepts that are widely accepted in research, but hardly implemented in any tool observed, for example, generally applicable data quality metrics. On the other hand, it reveals potential for functional enhancement of data quality tools and supports practitioners in the selection of appropriate tools for a given use case.
- Research Article
20
- 10.3390/bdcc6040153
- Dec 9, 2022
- Big Data and Cognitive Computing
While big data benefits are numerous, the use of big data requires addressing new challenges related to data processing, data security, and especially degradation of data quality. Despite the increased importance of data quality for big data, data quality measurement is actually limited to few metrics. Indeed, while more than 50 data quality dimensions have been defined in the literature, the number of measured dimensions is limited to 11. Therefore, this paper aims to extend the measured dimensions by defining four new data quality metrics: Integrity, Accessibility, Ease of manipulation, and Security. Thus, we propose a comprehensive Big Data Quality Assessment Framework based on 12 metrics: Completeness, Timeliness, Volatility, Uniqueness, Conformity, Consistency, Ease of manipulation, Relevancy, Readability, Security, Accessibility, and Integrity. In addition, to ensure accurate data quality assessment, we apply data weights at three data unit levels: data fields, quality metrics, and quality aspects. Furthermore, we define and measure five quality aspects to provide a macro-view of data quality. Finally, an experiment is performed to implement the defined measures. The results show that the suggested methodology allows a more exhaustive and accurate big data quality assessment, defining a weighted quality score based on 12 metrics and achieving a best quality model score of 9/10.
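A minimal sketch of a weighted quality score over the 12 metrics listed above, assuming each metric has already been normalized to [0, 1]; the metric values, the weight choices, and the weighted-mean aggregation rescaled to a 0..10 score are illustrative assumptions, not the framework's exact formulas or field-level weighting.

```python
# Hypothetical metric scores in [0, 1] and weights; the framework's own
# formulas and weight values are not reproduced here.
metric_scores = {
    "Completeness": 0.95, "Timeliness": 0.80, "Volatility": 0.70,
    "Uniqueness": 0.99, "Conformity": 0.90, "Consistency": 0.85,
    "Ease of manipulation": 0.75, "Relevancy": 0.88, "Readability": 0.92,
    "Security": 0.60, "Accessibility": 0.83, "Integrity": 0.97,
}
metric_weights = {m: 1.0 for m in metric_scores}
metric_weights["Security"] = 2.0   # e.g. raise the weight of a critical metric

def weighted_quality_score(scores, weights, scale=10):
    """Weighted mean of metric scores, rescaled to a 0..scale quality score."""
    total_weight = sum(weights[m] for m in scores)
    weighted_sum = sum(scores[m] * weights[m] for m in scores)
    return scale * weighted_sum / total_weight

print(f"overall quality score: {weighted_quality_score(metric_scores, metric_weights):.1f}/10")
```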
- Dissertation
- 10.17918/etd-2532
- Jul 16, 2021
This thesis presents a methodology of data preparation with probabilistic record linkage and information fusion for improving and enriching information visualizations of biomedical citation data. The problem of record linkage of citation databases, where only non-unique identifiers such as author names and document titles are available as common identifiers to be linked, was investigated. This problem in citation data parallels problems in clinical data, and Knowledge Discovery in Databases (KDD) methods from clinical data mining are evaluated. Probabilistic and deterministic (exact-match) record linkage models were developed and compared through the use of a gold standard or truth dataset. Empirical comparison with ROC analysis of record linkage models showed a significant difference (p=.000) in performance of a probabilistic model over deterministic models. The methodology was evaluated with probabilistic linkage of records from the Web of Science, Medline, and CINAHL citation databases in the knowledge domains of medical informatics, HIV/AIDS, and nursing informatics. Data quality metrics for datasets prepared with probabilistic record linkage and information fusion showed improvement in completeness of key variables and reduction in sample bias. The resulting visualizations offered a richer information space for users through an increase in terms entering the visualization. The significant contributions of this work include the development of a novel model of probabilistic record linkage for biomedical citation databases, which improves upon existing deterministic models. In addition, a methodology for improving and enriching knowledge domain visualizations through a data preparation approach has been validated with analyses of multiple citation databases and knowledge domains. The data preparation methodology of probabilistic record linkage with information fusion offers a remedy for data quality problems, and the opportunity to enrich visualizations with added content for user exploration, which in turn improves the utility of knowledge domain visualizations as a medium for assessing available evidence and forming hypotheses.
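The thesis's linkage model and parameters are not reproduced here, but probabilistic record linkage is commonly based on Fellegi-Sunter style match weights, which the hedged sketch below illustrates with made-up m- and u-probabilities and toy citation records.

```python
import math

# Illustrative m- and u-probabilities for each comparison field:
# m = P(fields agree | records truly match), u = P(fields agree | non-match).
FIELDS = {
    "author_name": {"m": 0.95, "u": 0.01},
    "title":       {"m": 0.90, "u": 0.001},
    "pub_year":    {"m": 0.98, "u": 0.10},
}

def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum of log2 likelihood ratios over the comparison fields (Fellegi-Sunter style)."""
    weight = 0.0
    for field, p in FIELDS.items():
        agree = rec_a.get(field, "").strip().lower() == rec_b.get(field, "").strip().lower()
        if agree:
            weight += math.log2(p["m"] / p["u"])
        else:
            weight += math.log2((1 - p["m"]) / (1 - p["u"]))
    return weight

a = {"author_name": "Smith J", "title": "Data quality in citation databases", "pub_year": "2005"}
b = {"author_name": "Smith J", "title": "Data Quality in Citation Databases", "pub_year": "2005"}
print(f"match weight: {match_weight(a, b):.2f}")  # compare against upper/lower decision thresholds
```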
- Research Article
8
- 10.54623/fue.fcij.6.1.3
- Jul 11, 2021
- Future Computing and Informatics Journal
Data Quality Dimensions, Metrics, and Improvement Techniques
- Research Article
8
- 10.1002/hbm.25724
- Nov 19, 2021
- Human Brain Mapping
Diffusion magnetic resonance imaging (dMRI) datasets are susceptible to several confounding factors related to data quality, which is especially true in studies involving young children. With the recent trend of large-scale multicenter studies, it is more critical to be aware of the varied impacts of data quality on measures of interest. Here, we investigated data quality and its effect on different diffusion measures using a multicenter dataset. dMRI data were obtained from 691 participants (5–17 years of age) from six different centers. Six data quality metrics (contrast-to-noise ratio, outlier slices, and absolute, relative, translational, and rotational motion) and four diffusion measures (fractional anisotropy, mean diffusivity, tract density, and tract length) were computed for each of 36 major fiber tracts for all participants. The results indicated that four of the six data quality metrics (all except absolute and translational motion) differed significantly between centers. Associations between these data quality metrics and the diffusion measures differed significantly across tracts and centers. Moreover, these effects remained significant after applying recently proposed harmonization algorithms that purport to remove unwanted between-site variation in diffusion data. These results demonstrate the widespread impact of dMRI data quality on diffusion measures. These tracts and measures have been routinely associated with individual differences as well as group-wide differences between neurotypical populations and individuals with neurological or developmental disorders. Accordingly, for analyses of individual differences or group effects (particularly in multisite datasets), we encourage the inclusion of data quality metrics in dMRI analyses.
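The study's quality metrics come from dMRI preprocessing tools, whose implementations are not reproduced here; the sketch below only illustrates, on synthetic motion parameters, how absolute and relative translational motion summaries of the kind listed above might be computed.

```python
import numpy as np

# Hypothetical per-volume rigid-body translations (mm): x, y, z for 60 volumes.
rng = np.random.default_rng(0)
translations = np.cumsum(rng.normal(scale=0.05, size=(60, 3)), axis=0)

# Absolute motion: displacement of each volume from the first (reference) volume.
absolute = np.linalg.norm(translations - translations[0], axis=1)
# Relative motion: displacement between consecutive volumes.
relative = np.linalg.norm(np.diff(translations, axis=0), axis=1)

print(f"mean absolute translation: {absolute.mean():.3f} mm")
print(f"mean relative translation: {relative.mean():.3f} mm")
```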
- Preprint Article
- 10.5194/egusphere-egu22-7927
- Mar 28, 2022
The European Plate Observing System (EPOS) is a very large and complex European e-infrastructure that provides pre-operational access to a first set of datasets and services for Solid Earth research. The EPOS-GNSS Data Gateway provides, through an Application Program Interface (API) and a web portal, access to GNSS (Global Navigation Satellite Systems) RINEX data from a distributed infrastructure of data nodes. Currently, ten EPOS-GNSS nodes have been installed, and three of them are still in the pre-operational phase. To monitor the long-term data quality of EPOS-GNSS stations at the node level, ROB is developing a new service. The first step of this service is a web portal (www.gnssquality-epos.oma.be) that provides access to data quality metrics of the RINEX data available from the different EPOS-GNSS nodes.

The web portal presents plots of the long-term tracking performance of more than 1000 EPOS-GNSS stations. The plots focus on several data quality metrics such as the number of observed versus expected observations, the number of missing epochs, the number of observed satellites, the number of cycle slips, and multipath values on code observations. These metrics have been computed at the node level using GLASS and the Anubis software (https://gnutsoftware.com/software/anubis). The metrics provide helpful information for node managers or station users to assess an EPOS-GNSS station's performance and detect potential degradation of the RINEX data quality. The outlook of this work is to investigate the possible usage of data quality metrics to detect data unsuitable for high-precision GNSS analysis for geophysical or meteorological applications. Here, we will present the newly developed web portal, the considered data quality metrics, and some preliminary results of this ongoing work.
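The portal's metrics are computed with GLASS and Anubis rather than by hand; purely as an illustration of two of the listed quantities (observed versus expected observations and missing epochs), here is a hedged Python sketch on a toy epoch table with an assumed 30-second sampling interval.

```python
from datetime import datetime, timedelta

def epoch_metrics(observed_epochs, start, end, interval_s=30):
    """Ratio of observed vs. expected epochs and count of missing epochs."""
    expected = []
    t = start
    while t <= end:
        expected.append(t)
        t += timedelta(seconds=interval_s)
    observed = set(observed_epochs)
    missing = [t for t in expected if t not in observed]
    return len(observed & set(expected)) / len(expected), len(missing)

start = datetime(2022, 3, 1, 0, 0, 0)
end = datetime(2022, 3, 1, 0, 5, 0)
# Toy observation epochs with one 30 s epoch missing.
obs = [start + timedelta(seconds=30 * i) for i in range(11) if i != 4]
ratio, n_missing = epoch_metrics(obs, start, end)
print(f"observed/expected: {ratio:.2%}, missing epochs: {n_missing}")
```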
- Research Article
19
- 10.1186/s43251-022-00068-9
- Nov 3, 2022
- Advances in Bridge Engineering
Structural Health Monitoring (SHM) systems have been extensively implemented to deliver data support and safeguard structural safety in the context of structural integrity management. SHM relies on data that can be noisy when abundant, or simply scarce. Little work has been done on SHM data quality (DQ). Therefore, this article suggests SHM DQ indicators and recommends deterministic and probabilistic SHM DQ metrics to address uncertainties, allowing better decision-making for structural integrity management. First, the literature on DQ indicators and measures is thoroughly examined. Second, and for the first time, the necessary SHM DQ indicators are identified and their definitions tailored. Simplified deterministic SHM DQ metrics are then suggested and, more essentially, probabilistic metrics are offered to address the embedded uncertainties and to account for the data flow. A generic example of a bridge with permanent and occasional monitoring systems is provided; it helps to better understand the influence of SHM data flow on the choice of DQ metrics and the allocated probability distribution functions. Finally, a real case example is provided to test the feasibility of the suggested method within a realistic context.
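The article's probabilistic DQ metrics and their distribution functions are not reproduced here. As one illustration of the general idea of attaching a probability distribution to a DQ indicator instead of a single number, the sketch below models sensor-data completeness in a monitoring window with an assumed Beta posterior; the counts and the uniform prior are hypothetical.

```python
from scipy import stats

# Hypothetical monitoring window: 1,440 expected sensor readings, 1,390 received.
expected, received = 1440, 1390

# Deterministic completeness: a single point estimate.
completeness = received / expected

# Probabilistic completeness: Beta posterior (uniform prior) over the success rate,
# so uncertainty from the finite sample is carried into decision-making.
posterior = stats.beta(received + 1, expected - received + 1)
lo, hi = posterior.ppf([0.025, 0.975])

print(f"point estimate: {completeness:.3f}")
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```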
- Research Article
142
- 10.1111/psyp.13793
- Mar 29, 2021
- Psychophysiology
Event-related potentials (ERPs) can be very noisy, and yet, there is no widely accepted metric of ERP data quality. Here, we propose a universal measure of data quality for ERP research, the standardized measurement error (SME), which is a special case of the standard error of measurement. Whereas some existing metrics provide a generic quantification of the noise level, the SME quantifies the data quality (precision) for the specific amplitude or latency value being measured in a given study (e.g., the peak latency of the P3 wave). It can be applied to virtually any value that is derived from averaged ERP waveforms, making it a universal measure of data quality. In addition, the SME quantifies the data quality for each individual participant, making it possible to identify participants with low-quality data and "bad" channels. When appropriately aggregated across individuals, SME values can be used to quantify the combined impact of the single-trial EEG noise and the number of trials being averaged together on the effect size and statistical power in a given experiment. If SME values were regularly included in published articles, researchers could identify the recording and analysis procedures that produce the highest data quality, which could ultimately lead to increased effect sizes and greater replicability across the field.
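For scores that are means of single-trial amplitudes, the SME reduces to the standard error of the mean of those single-trial values; for other scores (such as peak latency), the authors propose a bootstrapped estimate. The sketch below illustrates both on synthetic single-trial data; the trial counts, noise levels, and the choice of np.argmax as a peak-latency score are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical single-trial mean amplitudes (µV) in a measurement window, one participant.
single_trial_amplitudes = rng.normal(loc=2.0, scale=8.0, size=120)
# Analytic SME for a mean-amplitude score: SD / sqrt(number of trials).
sme = single_trial_amplitudes.std(ddof=1) / np.sqrt(len(single_trial_amplitudes))
print(f"SME (mean amplitude): {sme:.3f} µV")

# Hypothetical single-trial epochs: n_trials x n_timepoints (µV).
epochs = rng.normal(loc=0.0, scale=10.0, size=(120, 256))

def bootstrapped_sme(epochs: np.ndarray, score_fn, n_boot: int = 1000) -> float:
    """SD of a score computed from resampled averaged waveforms (bootstrapped SME)."""
    n = len(epochs)
    scores = [score_fn(epochs[rng.integers(0, n, n)].mean(axis=0)) for _ in range(n_boot)]
    return float(np.std(scores, ddof=1))

peak_latency_sme = bootstrapped_sme(epochs, score_fn=np.argmax)
print(f"bootstrapped SME (peak latency): {peak_latency_sme:.2f} samples")
```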
- Research Article
7
- 10.1007/s40192-020-00167-3
- Jan 16, 2020
- Integrating Materials and Manufacturing Innovation
This work introduces a methodology to assess data quality for the tensile, creep/stress relaxation, and fatigue properties of alloys (as well as metadata associated with manufacture) as part of a project to develop new materials for extreme environments, specifically those found in the power generation sector. Data quality assessment is needed to ensure the reliability of data used in analytics to develop new materials for the power generation sector and to predict the performance of established materials in current use. As data quality metrics have not been standardized for material properties data, quality rating guidelines are developed here for the aspects of data completeness, accuracy, usability, and standardization. The specific design requirements for heat-resistant alloy development were considered in creating each metric. Establishing the quality of a dataset in these areas will enable robust analysis. High-quality data can be set aside to develop predictive models. Lower-quality data need not be discarded but can be used for experimental design. Determining the quality of a materials dataset will also provide additional metadata for the data resource and will promote data reusability. A sample high-quality dataset is presented to indicate the typical data attributes collected from relevant mechanical property testing results, which were considered when generating the data quality metrics. A data template of these attributes was created as a tool for data generators and collectors to promote uniformity and reusability of alloy data. The sparsity of the sample dataset was calculated in order to highlight the areas where data gaps pose a challenge for reliable prediction of creep rupture lifetime.
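The article's quality rating guidelines themselves are not reproduced here; as a small illustration of the sparsity calculation mentioned above, the sketch below computes the fraction of missing cells in a toy alloy dataset with hypothetical column names, assuming the records are held in a pandas DataFrame.

```python
import pandas as pd

# Hypothetical alloy test records; None marks attributes not reported.
df = pd.DataFrame({
    "alloy_id":        ["A1", "A2", "A3", "A4"],
    "test_temp_C":     [650, 700, None, 750],
    "stress_MPa":      [100, None, 120, None],
    "rupture_life_h":  [12000, 8500, None, 4300],
    "heat_treatment":  ["solution+age", None, "as-received", None],
})

# Sparsity: fraction of empty cells over all cells (higher = more data gaps).
sparsity = df.isna().sum().sum() / df.size
print(f"dataset sparsity: {sparsity:.1%}")

# Per-column completeness highlights where gaps would hinder, e.g., creep-life models.
print((1 - df.isna().mean()).rename("completeness"))
```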
- Research Article
70
- 10.1145/3148238
- Jun 30, 2017
- Journal of Data and Information Quality
Data quality and especially the assessment of data quality have been intensively discussed in research and practice alike. To support an economically oriented management of data quality and decision making under uncertainty, it is essential to assess the data quality level by means of well-founded metrics. However, if not adequately defined, these metrics can lead to wrong decisions and economic losses. Therefore, based on a decision-oriented framework, we present a set of five requirements for data quality metrics. These requirements are relevant for a metric that aims to support an economically oriented management of data quality and decision making under uncertainty. We further demonstrate the applicability and efficacy of these requirements by evaluating five data quality metrics for different data quality dimensions. Moreover, we discuss practical implications when applying the presented requirements.
- Research Article
15
- 10.1136/bmjopen-2020-038174
- Dec 1, 2020
- BMJ Open
Objectives: Primary objective: to assess nine data quality metrics for 14 maternal and newborn health data elements, following implementation of an integrated, district-focused data quality intervention. Secondary objective: to consider whether...
- Research Article
31
- 10.1117/1.nph.5.1.015004
- Feb 13, 2018
- Neurophotonics
Correcting for motion is an important consideration in infant functional near-infrared spectroscopy studies. We tested the performance of conventional motion correction methods and compared probe motion and data quality metrics for data collected at different infant ages (5, 7, and 12 months) and during different methods of stimulus presentation (video versus live). While 5-month-olds had slower maximum head speed than 7- or 12-month-olds, data quality metrics and hemodynamic response recovery errors were similar across ages. Data quality was also similar between video and live stimulus presentation. Motion correction algorithms, such as wavelet filtering and targeted principal component analysis, performed well for infant data using infant-specific parameters, and parameters may be used without fine-tuning for infant age or method of stimulus presentation. We recommend using wavelet filtering with [Formula: see text]; however, a range of parameters seemed acceptable. We do not recommend using trial rejection alone, because it did not improve hemodynamic response recovery as compared to no correction at all. Data quality metrics calculated from uncorrected data were associated with hemodynamic response recovery error, indicating that full simulation studies may not be necessary to assess motion correction performance.
- Research Article
24
- 10.1074/mcp.o111.015446
- Nov 3, 2011
- Molecular & Cellular Proteomics
Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the United States National Cancer Institute convened the "International Workshop on Proteomic Data Quality Metrics" in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: 1) an evolving list of comprehensive quality metrics and 2) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data. By agreement, this article is published simultaneously in the Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics, and Proteomics Clinical Applications as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.
- Research Article
22
- 10.1002/pmic.201100562
- Dec 14, 2011
- PROTEOMICS
Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the "International Workshop on Proteomic Data Quality Metrics" in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: (i) an evolving list of comprehensive quality metrics and (ii) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data. By agreement, this article is published simultaneously in Proteomics, Proteomics Clinical Applications, Journal of Proteome Research, and Molecular and Cellular Proteomics, as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.
- Research Article
9
- 10.1002/prca.201100097
- Dec 1, 2011
- PROTEOMICS – Clinical Applications
Policies supporting the rapid and open sharing of proteomic data are being implemented by the leading journals in the field. The proteomics community is taking steps to ensure that data are made publicly accessible and are of high quality, a challenging task that requires the development and deployment of methods for measuring and documenting data quality metrics. On September 18, 2010, the U.S. National Cancer Institute (NCI) convened the "International Workshop on Proteomic Data Quality Metrics" in Sydney, Australia, to identify and address issues facing the development and use of such methods for open access proteomics data. The stakeholders at the workshop enumerated the key principles underlying a framework for data quality assessment in mass spectrometry data that will meet the needs of the research community, journals, funding agencies, and data repositories. Attendees discussed and agreed upon two primary needs for the wide use of quality metrics: (i) an evolving list of comprehensive quality metrics and (ii) standards accompanied by software analytics. Attendees stressed the importance of increased education and training programs to promote reliable protocols in proteomics. This workshop report explores the historic precedents, key discussions, and necessary next steps to enhance the quality of open access data. By agreement, this article is published simultaneously in Proteomics, Proteomics Clinical Applications, Journal of Proteome Research, and Molecular and Cellular Proteomics, as a public service to the research community. The peer review process was a coordinated effort conducted by a panel of referees selected by the journals.