Online process monitoring under quality data scarcity: Self-starting truncated EWMA schemes for time between events
- Research Article
8
- 10.1016/j.eng.2024.04.024
- Feb 1, 2025
- Engineering
On the data quality and imbalance in machine learning-based design and manufacturing—A systematic review
- Research Article
51
- 10.1080/00401706.2013.804437
- Jan 2, 2014
- Technometrics
As the volume and variety of available data continue to proliferate, organizations increasingly turn to analytics in order to enhance business decision-making and ultimately, performance. However, the decisions made as a result of the analytics process are only as good as the data on which they are based. In this article, we examine the data quality problem and propose the use of control charting methods as viable tools for data quality monitoring and improvement. We motivate our discussion using an integrated case study example of a real aircraft maintenance database. We include discussions of the measures of multiple data quality dimensions in this online process. We highlight the lack of appropriate statistical methods for the analysis of this type of problem and suggest opportunities for research in control chart methods within the data quality environment. This article has supplementary material online.
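The control-charting approach this abstract proposes can be illustrated with a minimal sketch: a Shewhart individuals chart applied to a daily data-quality metric such as the fraction of complete records. The metric, data, and limits below are hypothetical, chosen only to show the mechanics; they are not taken from the article.

```python
# Hypothetical sketch: Shewhart individuals chart on a daily data-quality
# metric (fraction of complete records). Data and limits are illustrative.

def individuals_chart(values):
    """Return (center, lcl, ucl, flags) for an individuals control chart.

    Sigma is estimated from the average moving range divided by d2 = 1.128,
    the standard control-chart constant for subgroups of size 2.
    """
    n = len(values)
    center = sum(values) / n
    moving_ranges = [abs(values[i] - values[i - 1]) for i in range(1, n)]
    sigma = (sum(moving_ranges) / len(moving_ranges)) / 1.128
    lcl, ucl = center - 3 * sigma, center + 3 * sigma
    flags = [not (lcl <= v <= ucl) for v in values]
    return center, lcl, ucl, flags

# Daily completeness ratios for a maintenance database (made-up values):
completeness = [0.97, 0.96, 0.98, 0.97, 0.95, 0.97, 0.96, 0.82, 0.97, 0.96]
center, lcl, ucl, flags = individuals_chart(completeness)
print(f"center={center:.3f}, limits=({lcl:.3f}, {ucl:.3f})")
print("out-of-control days:", [i for i, f in enumerate(flags) if f])
```

Here the drop to 0.82 falls below the lower control limit and would be flagged for investigation, which is the kind of signal the authors suggest using to trigger data-quality improvement.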
- Research Article
27
- 10.1080/24725854.2019.1636428
- Aug 9, 2019
- IISE Transactions
Response-surface-based design optimization has been commonly used in Robust Process Design (RPD) to seek optimal process settings for minimizing the output variability around a target value. Recently, the online RPD strategy has attracted increasing research attention, as it is expected to provide a better performance than offline RPD by utilizing online process feedback to continuously adjust process settings during process operation. However, the lack of knowledge about process model parameter uncertainty and data quality in the online RPD decisions means that this superiority cannot be guaranteed. This motivates this article to present a Bayesian approach for online RPD, which can provide systematic decisions of when and how to update the process model parameters for online process design optimization by considering data quality. The effectiveness of the proposed approach is illustrated with both simulation studies and a case study on a micro-milling process. The comparison results demonstrate that the proposed approach can achieve a better process performance than two conventional design approaches that do not consider the data quality and model parameter uncertainty.
- Research Article
64
- 10.1016/j.jwpe.2018.12.010
- Jan 2, 2019
- Journal of Water Process Engineering
Data scarcity in modelling and simulation of a large-scale WWTP: Stop sign or a challenge?
- Research Article
- 10.7250/conect.2025.002
- May 9, 2025
- CONECT. International Scientific Conference of Environmental and Climate Technologies
Integrating AI in HVAC systems is a promising approach that helps enhance energy efficiency in buildings, which leads to cost savings and provides environmental benefits. However, the effective performance of these AI models, especially in HVAC systems, depends not only on the model design but also on the data's quality, reliability, size, availability, and management. Data plays an important role in determining the accuracy and reliability of the AI model's performance. This paper analyses recent studies that apply AI models to achieve energy efficiency in HVAC systems from a data perspective, examining various aspects of data management in Deep Learning and Hybrid models applied to HVAC in buildings, such as data availability, the different data sources, type, quality issues, and data splitting methods. Through this analysis, the paper aims to provide insights into data-related challenges and recommend ways to overcome and mitigate them to develop AI models that perform more effectively. The paper highlights the importance of developing better data-handling practices to have more accurate, efficient, and reliable AI models in HVAC systems. The findings reveal that combining multiple data types can enhance model performance and generalizability. Moreover, the analysis concludes that the main data type for residential buildings is simulated data rather than real-world data; this could be due to privacy concerns. Meanwhile, commercial buildings have commonly utilized more structured and reliable dataset sources, enabling more precise modelling. The findings also indicate that data quality is overlooked by researchers in many studies, where only 31% of the analysed papers discussed quality issues, reflecting that it is not yet a standard practice in this field. Additionally, this analysis addresses the scarcity of reliable and audited data. Therefore, and in response to this issue, this paper recommends accessible and reliable data resources that can be employed in AI applications for HVAC systems in buildings.
- Conference Article
13
- 10.1145/3437963.3441747
- Mar 8, 2021
Knowledge tracing is a fundamental task in intelligent education for tracking the knowledge states of students on necessary concepts. In recent years, Deep Knowledge Tracing (DKT) utilizes recurrent neural networks to model student learning sequences. This approach has achieved significant success and has been widely used in many educational applications. However, in practical scenarios, it tends to suffer from the following critical problems due to data isolation: 1) Data scarcity. Educational data, which is usually distributed across different silos (e.g., schools), is difficult to gather. 2) Different data quality. Students in different silos have different learning schedules, which results in unbalanced learning records, meaning that it is necessary to evaluate the learning data quality independently for different silos. 3) Data incomparability. It is difficult to compare the knowledge states of students with different learning processes from different silos. Inspired by federated learning, in this paper, we propose a novel Federated Deep Knowledge Tracing (FDKT) framework to collectively train high-quality DKT models for multiple silos. In this framework, each client takes charge of training a distributed DKT model and evaluating data quality by leveraging its own local data, while a center server is responsible for aggregating models and updating the parameters for all the clients. In particular, in the client part, we evaluate data quality incorporating different education measurement theories, and we construct two quality-oriented implementations based on FDKT, i.e., FDKT-CTT and FDKT-IRT, where the means of data quality evaluation follow Classical Test Theory and Item Response Theory, respectively. Moreover, in the server part, we adopt hierarchical model interpolation to incorporate local effects for model personalization. Extensive experiments on real-world datasets demonstrate the effectiveness and superiority of the FDKT framework.
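The server-side aggregation that frameworks like FDKT build on can be sketched with the standard federated-averaging step, where the server combines client parameter vectors weighted by each silo's data size. This is only the generic scheme; FDKT's quality-based weighting and hierarchical model interpolation are not reproduced here, and the silo names and numbers are made up for illustration.

```python
# Illustrative sketch of a generic federated-averaging step: the server
# combines client parameter vectors weighted by each silo's data size.
# FDKT's quality-weighted variant is NOT reproduced here.

def federated_average(client_params, client_sizes):
    """Weighted average of client parameter vectors (lists of floats)."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        weight = size / total
        for j in range(dim):
            aggregated[j] += weight * params[j]
    return aggregated

# Two hypothetical school silos with different amounts of learning records:
silo_a = [0.2, 0.8]   # parameters from silo A (600 records, weight 0.75)
silo_b = [0.6, 0.4]   # parameters from silo B (200 records, weight 0.25)
global_params = federated_average([silo_a, silo_b], [600, 200])
print(global_params)  # -> [0.3, 0.7]
```

A quality-oriented variant such as FDKT-CTT or FDKT-IRT would replace the raw data sizes with quality-adjusted weights computed on each client.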
- Research Article
15
- 10.1016/j.heliyon.2024.e33669
- Jun 26, 2024
- Heliyon
Optimizing flood predictions by integrating LSTM and physical-based models with mixed historical and simulated data
- Discussion
3
- 10.1016/s0140-6736(22)01335-6
- Aug 1, 2022
- The Lancet
Iran's burden of disease and burden of data collection
- Conference Article
1
- 10.1109/oceans.1995.526748
- Oct 9, 1995
The Naval Oceanographic Office (NAVOCEANO) has completed its initial development of the worldwide Ocean-Temperature Temporal-Variability Model as part of its general ocean climatology program and as an adjunct to its centralized compilation of oceanographic data. The development of a temporal variability model to match the resolution of the Navy's standard climatologies (nominally a 30' grid, but sometimes as fine a resolution as a 10' grid) was beset by many problems, the most serious being the scarcity of data, data quality, and removal of spatial variability. Since individual editing of the more than 4 million vertical profiles in NAVOCEANO's data bank is unrealistic, semi-automated techniques were employed. Spatial variability was estimated using the Navy's standard temperature climatology, the Generalized Digital Environmental Model. These estimates were used to allow the sampling space to increase in regions of low spatial variability and to decrease in areas of high spatial variability, resulting in realistic estimates of variability. As new data are added to the master data bank, the variability model will be used to tag anomalous data for further editing and ultimately used to make periodic updates to the variability model itself.
- Conference Article
- 10.3997/2214-4609.202272006
- Jan 1, 2022
Summary: A key element of digital transformation is the promotion of actionable and accessible subsurface data. For most operators, scarcity of data is not a problem so much as a lack of trust in data quality and a lack of tools to access and utilize that information effectively to drive the business forward. This paper details how, via collaboration with subject matter experts at a major European operator, a rich, varied, yet historically under-utilized dataset was able to power new insights into the subsurface.
- Preprint Article
- 10.5194/egusphere-egu21-10006
- Mar 4, 2021
Glacierized catchments are of great importance for water supply sustaining diverse human livelihoods, economies, and cultures. Despite their importance, both glacierized headwaters and downstream areas remain poorly monitored. Nevertheless, a considerable amount of international and local research has dealt with hydrological models including different levels of complexity, data sources, and goals. In addition, the increasing availability of free software and powerful automatic model calibration tools facilitates the use of complex models even by non-expert users. As a result, models can show a good performance despite misconceptions. That is also true for the tropical Andes, where low data availability and quality combined with large uncertainties in glacio-hydrological and meteorological processes prevail.
Accordingly, this study aims to identify whether simple or more complex glacio-hydrological models can perform robust simulations for tropical glacier-fed basins under scarce data. The case study was carried out in the Sibinacocha (4,822 m a.s.l.) and Phinaya (4,678 m a.s.l.) catchments, both located in the headwaters of the Vilcanota-Urubamba river basin, in the Cusco region, Peru. These outer-tropical catchments are characterized by pronounced dry and wet seasons and hold a glacier extent of about 8 and 18%, respectively. Three conceptual models were implemented, in order of increasing complexity: 1) the lumped Shaman model (developed in this study), and the semi-distributed 2) HBV-light and 3) RS Minerve. All simulations were implemented on a monthly time step from 1981 to 2010. Hydroclimatological data series were obtained from the gridded PISCO dataset at 10 km spatial resolution and two local weather stations. Furthermore, changes in glacier surface were delineated for three years (1986, 1994 and 2004) using a semi-automatic NDSI approach based on satellite imagery. Finally, a comprehensive evaluation was performed using common measures of model performance, the associated flow signatures, and different runoff components.
Results show that all model complexities allow for an acceptable performance (R² > 0.65, Nash-Sutcliffe > 0.65, Nash-Sutcliffe-ln > 0.73) with small differences related to the model structure. However, more complex models require a more comprehensive calibration strategy and assessment to avoid simulations with apparently high model performance driven by inadequate assumptions. Moreover, more complex models require a better understanding of the underlying hydrological processes, which is often hampered by data scarcity, limited knowledge, and field accessibility in the Peruvian Andes. Results suggest that a careful calibration strategy, additional data collection, and the implementation of simple models can provide more robust simulations than opting for increased model complexity. For robust hydrological modeling, a comprehensive assessment of the flow signatures and runoff components is pivotal. These findings have been incorporated into a framework that aims for expert- and non-expert-conducted robust glacio-hydrological simulation under data scarcity. Nevertheless, high uncertainty and limited knowledge hamper a more thorough process understanding and the improvement of related model results, which illustrates the limitations of their predictive character. In such a context, additional data collection with local participatory approaches combined with policy-making for climate change adaptation and water management can benefit from approaches that support decision making under high uncertainty.
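The Nash-Sutcliffe efficiency quoted in the results (> 0.65) is a standard goodness-of-fit measure for hydrological models. For reference, it can be computed as one minus the ratio of squared simulation errors to the variance of observations around their mean; the discharge values below are made up, not taken from the study.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squares about the mean.

    1.0 indicates a perfect fit; 0.0 means the model is no better than
    simply predicting the mean of the observations.
    """
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    denom = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / denom

# Made-up monthly discharge values (m^3/s), for illustration only:
obs = [10.0, 14.0, 22.0, 30.0, 18.0, 12.0]
sim = [11.0, 13.0, 20.0, 28.0, 19.0, 13.0]
print(round(nash_sutcliffe(obs, sim), 3))  # -> 0.956
```

The log-transformed variant reported as Nash-Sutcliffe-ln applies the same formula to log-discharge, which weights low-flow periods more heavily.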
- Research Article
143
- 10.1016/j.artmed.2024.102861
- Mar 30, 2024
- Artificial intelligence in medicine
Challenges and strategies for wide-scale artificial intelligence (AI) deployment in healthcare practices: A perspective for healthcare organizations
- Research Article
- 10.2478/rtuect-2025-0036
- Jan 1, 2025
- Environmental and Climate Technologies
Integrating Artificial Intelligence (AI) into heating, ventilation, and air conditioning (HVAC) systems is a promising approach that helps enhance energy efficiency in buildings, which leads to cost savings and provides environmental benefits. However, the effective performance of the AI models depends not only on the model design but also on the data quality, reliability, size, availability, and management. This paper analyses recent studies that apply AI models, specifically Deep Learning and Hybrid models, to achieve energy efficiency in HVAC systems in buildings from a data perspective, examining various aspects of data management. This analysis aims to provide insights into data-related challenges in AI-driven HVAC systems and propose strategies to overcome them, ensuring more accurate, efficient, and reliable models. The findings reveal that combining multiple data types can enhance model performance and generalizability. The findings also indicate that data quality is overlooked by researchers in many studies, where only 31% of the analysed papers discussed quality issues, reflecting that it is not yet a standard practice in this field. Additionally, this analysis highlights the scarcity of reliable and audited data. Therefore, and in response to this issue, this paper recommends accessible and reliable data resources that can be employed in AI applications for HVAC systems in buildings.
- Research Article
5
- 10.1371/journal.pmed.1004638
- Jun 24, 2025
- PLoS medicine
Antimicrobial resistance (AMR) is a major global health issue that exacerbates the burden of infectious diseases and healthcare costs. However, the scarcity of national-level AMR data in African countries hampers our understanding of its scale and contributing factors in the region. To gain insights into AMR prevalence in Africa, we collected and analyzed retrospective AMR data from 14 countries. We estimated bacterial AMR prevalence, defined as the proportion of resistant human isolates tested, from antimicrobial susceptibility testing (AST) data collected retrospectively for 2016-2019 from 205 laboratories across 14 African countries. We generated 95% confidence intervals (CIs) for aggregated AMR estimates to account for data quality disparities across countries; the median data quality score was 73.1%, ranging from 56.4% to 80.8%. We assessed 819,584 culture records covering 9,266 pathogen-drug combinations, of which 187,832 (22.9%) were positive cultures with AST results. The most frequently cultured specimens were urine (32.0%) and purulent samples (28.1%), and the most frequently isolated pathogens were Escherichia coli (22.2%) and Staphylococcus aureus (15.0%). Aggregated AMR estimates did not change significantly across the years studied (p > 0.337); however, there were significant variations in AMR prevalence estimates in culture-positive samples across countries, regions, patient departments (inpatient/outpatient), and specimen sources (p < 0.05). Male sex (adjusted odds ratio [aOR] 1.15; 95% CI [1.09, 1.21]; p < 0.0001), age above 65 (aOR 1.28; 95% CI [1.16, 1.41]; p < 0.0001), and inpatient department (aOR 1.24; 95% CI [1.13, 1.35]; p < 0.0001) were associated with higher AMR prevalence among culture-positive samples. The lack of routine testing, as reflected in the low data volume from most contributing laboratories, and the absence of patient clinical information represent significant limitations of this study. Analysis of the largest retrospective AMR dataset in Africa indicates high variability in AMR prevalence across countries, coupled with differences in AMR testing capacities, data quality, and AMR estimates. Gaps in AST practices and inadequate digital infrastructures for data collection and reporting represent barriers to estimating the true AMR burden in the region. These barriers warrant large-scale investments to expand healthcare access and strengthen bacteriology laboratory capacities.
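The adjusted odds ratios reported above come from multivariable regression models, which this listing does not reproduce. As a simpler reference point, a crude odds ratio and its Woolf (log-method) confidence interval can be computed from a 2x2 table; the counts below are hypothetical, not from the study.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and 95% CI (Woolf/log method) from a 2x2 table:
    a = resistant, exposed;   b = susceptible, exposed;
    c = resistant, unexposed; d = susceptible, unexposed.
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts: resistance among inpatient vs outpatient isolates.
or_, lo, hi = odds_ratio_ci(300, 700, 250, 750)
print(f"OR={or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An adjusted OR (aOR), as in the abstract, would instead come from a logistic regression that controls for covariates such as country, specimen source, and patient department.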
- Book Chapter
- 10.4324/9780429287213-54
- Dec 23, 2021
Using information from published and unpublished materials, this paper describes how the scarcity of quality data negatively impacted the results of model-based estimates of HIV prevalence and AIDS deaths in sub-Saharan Africa from the mid-1980s to the early 2000s. It also highlights specific examples of ways in which the increased availability of good quality data, especially the population-based sero-data from the Demographic and Health Surveys program, led to an improved understanding of the epidemic, including its spatial distribution, socio-demographic correlates, and risk factors. The improvements in the availability and use of reliable data seem to have positively influenced political commitment and the development of evidence-based HIV/AIDS policies and programs in the region. Consequently, there was a gradual decrease in HIV prevalence in Africa. Analysis of good quality data has led to a correction of some wrongly held beliefs and biased assumptions about HIV/AIDS and risk behaviors in the region.