Do You Consent to the Use of Your Biological Data for Training ML and AI Models? Online Survey Targeting Clinicians and Researchers.
Aim: The majority of machine learning (ML) models in healthcare are built on retrospective data, much of which is collected without explicit patient consent for use in artificial intelligence (AI) and ML applications. The primary aim of this study was to evaluate whether clinicians and scientific researchers would themselves consent to provide their own data for the training of ML models. Materials and Methods: An anonymous online survey was conducted via Telegram, LinkedIn, and Viber. The target audience comprised international groups of clinicians and scientific researchers, primarily Russian-, German-, and English-speaking, ranging in expertise and experience from beginners to veterans. The survey centered on a single, pivotal question: “Do You Consent to the Use of Your Biological and Private Data for Training Machine Learning and AI Models?” Respondents could choose between two responses: “Yes” and “No”. Results: The survey was conducted in January 2024. A total of 119 unique, verified individuals participated. Only about half of the respondents (63 persons) consented to provide their own data for the training of ML and AI models. Conclusion: In the development of ML and AI models, particularly open-source ones, it is crucial to ascertain whether participants are willing to provide their private data. While ML algorithms can transform the nature of data, the primary owner of the data remains the individual. Our findings show that roughly half of the participants, even those from scientific research and clinical backgrounds – individuals typically accountable for ensuring data quality in AI and ML model development – do not consent to the use of their data in AI and ML settings. This highlights the need for more stringent consent processes and ethical considerations in the use of personal data in AI and ML research.
- Research Article
9
- 10.1016/j.heliyon.2023.e15143
- Apr 1, 2023
- Heliyon
Introduction: Artificial intelligence (AI) applications in healthcare and medicine have increased in recent years. To enable access to personal data, Trusted Research Environments (TREs) (otherwise known as Safe Havens) provide safe and secure environments in which researchers can access sensitive personal data and develop AI, in particular machine learning (ML), models. However, few TREs currently support the training of ML models, in part due to a gap in practical decision-making guidance for TREs in handling model disclosure. Specifically, the training of ML models creates a need to disclose new types of outputs from TREs. Although TREs have clear policies for the disclosure of statistical outputs, the extent to which trained models can leak personal training data once released is not well understood. Background: We review, for a general audience, different types of ML models and their applicability within healthcare. We explain the outputs of training an ML model and how trained ML models can be vulnerable to external attacks that seek to recover personal data encoded within the model. Risks: We present the challenges for disclosure control of trained ML models in the context of training and exporting models from TREs. We provide insights and analyse methods that could be introduced within TREs to mitigate the risk of privacy breaches when disclosing trained models. Discussion: Although specific guidelines and policies exist for statistical disclosure controls in TREs, they do not satisfactorily address these new types of output requests, i.e., trained ML models. There is significant potential for new interdisciplinary research in developing and adapting policies and tools for safely disclosing ML outputs from TREs.
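The disclosure risk the authors describe can be made concrete with a loss-threshold membership inference test, one of the standard attacks on trained models. The sketch below is illustrative only (synthetic data, a simple median threshold), not a TRE policy tool or the paper's method:

```python
# Minimal sketch of a loss-threshold membership inference check: probe
# whether a trained model "remembers" its training records. Dataset and
# threshold rule are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

def per_record_loss(m, X, y):
    # negative log-likelihood of the true label, per record
    p = np.clip(m.predict_proba(X)[np.arange(len(y)), y], 1e-12, 1.0)
    return -np.log(p)

loss_members = per_record_loss(model, X_tr, y_tr)       # training records
loss_nonmembers = per_record_loss(model, X_te, y_te)    # unseen records

# If members have systematically lower loss, an attacker can threshold on
# loss to guess membership; the gap is a rough disclosure-risk signal to
# examine before releasing a trained model.
tau = np.median(np.concatenate([loss_members, loss_nonmembers]))
tpr = (loss_members < tau).mean()     # members flagged as members
fpr = (loss_nonmembers < tau).mean()  # non-members wrongly flagged
print(f"attack advantage ~ {tpr - fpr:.3f} (0 = no leakage signal)")
```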
- Research Article
20
- 10.3390/w14101666
- May 23, 2022
- Water
Accurate estimation of reference evapotranspiration (ETo) plays a vital role in irrigation and water resource planning. The Penman–Monteith method recommended by the Food and Agriculture Organization (FAO PM56) is widely used and considered the standard for calculating ETo. However, FAO PM56 cannot be used when meteorological variables are limited, so an alternative ETo model requiring fewer variables must be chosen. This study built ten machine learning (ML) models, based on multi-function, neural network, and tree-based structures, against the FAO PM56 method. For this purpose, only monthly time-series temperature data were used to train the ML models. The developed ML models were applied to estimate ETo at different test stations, and the results were compared with the FAO PM56 method to verify and validate their performance for the selected stations. In addition, multiple statistical indicators, including root-mean-square error (RMSE), coefficient of determination (R2), mean absolute error (MAE), Nash–Sutcliffe efficiency (NSE), and correlation coefficient (r), were calculated to compare the performance of each ML model. Among the applied ML models, the tree boost (TB) model outperformed the others in estimating ETo under diverse climatic conditions: its R2, NSE, and r were the highest, while its RMSE and MAE were the lowest at the study sites. Lastly, ETo point data from the TB model were used in an interpolation process to create monthly and annual ETo maps. Based on these maps, the study suggests focusing mainly on areas with high ETo values and on proper irrigation scheduling of crops to ensure water sustainability.
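For reference, the five agreement statistics the study reports can be computed directly. The sketch below uses invented monthly ETo values (mm/day) in place of real station data, and follows the common hydrology convention of reporting R2 as the squared Pearson correlation alongside NSE:

```python
# Agreement statistics between FAO PM56 reference ETo and ML estimates.
# The toy arrays stand in for 12 monthly ETo values (mm/day).
import numpy as np

eto_pm56 = np.array([2.1, 2.8, 3.9, 5.2, 6.4, 7.0, 6.8, 6.1, 4.9, 3.6, 2.5, 2.0])
eto_ml   = np.array([2.3, 2.7, 4.1, 5.0, 6.6, 6.8, 6.9, 5.9, 5.1, 3.4, 2.6, 2.2])

err  = eto_ml - eto_pm56
rmse = np.sqrt(np.mean(err ** 2))                 # root-mean-square error
mae  = np.mean(np.abs(err))                       # mean absolute error
r    = np.corrcoef(eto_pm56, eto_ml)[0, 1]        # Pearson correlation
r2   = r ** 2                                     # R2 as squared correlation
# NSE: 1 minus residual variance relative to the variance of the reference
nse  = 1 - np.sum(err ** 2) / np.sum((eto_pm56 - eto_pm56.mean()) ** 2)

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}  r={r:.3f}  NSE={nse:.3f}")
```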
- Research Article
6
- 10.3390/sym16010128
- Jan 21, 2024
- Symmetry
In the near future, the incorporation of shared electric automated and connected mobility (SEACM) technologies will significantly transform the landscape of transportation into a sustainable and efficient mobility ecosystem. However, these technological advances raise complex scientific challenges. Problems related to safety, energy efficiency, and route optimization in dynamic urban environments are major issues to be resolved. In addition, the unavailability of realistic and varied data on such systems makes their deployment, design, and performance evaluation very challenging. As a result, to avoid the constraints of real data collection, generated artificial datasets are crucial for simulations that test and validate algorithms and models under various scenarios. These artificial datasets are used to train ML (Machine Learning) models, allowing researchers and operators to evaluate performance and predict system behavior under various conditions. To generate artificial datasets, numerous elements such as user behavior, vehicle dynamics, charging infrastructure, and environmental conditions must be considered. Across all these elements, symmetry is a core concern: in some cases asymmetry is more realistic, while in others reaching or maintaining as much symmetry as possible is a core requirement. This review paper provides a comprehensive survey of the most relevant techniques for generating synthetic datasets, with a particular focus on the shared electric automated and connected mobility context. Furthermore, the paper investigates central issues of these complex and dynamic systems regarding how artificial datasets can be used to train ML models that address the repositioning problem. Here, symmetry is a crucial consideration for ML models: generated datasets must accurately emulate the symmetry or asymmetry observed in real-world scenarios. The paper then examines the current challenges and limitations of synthetic datasets, such as the fidelity of simulations to the real world and the validation of generative models. Additionally, it explores how ML-based algorithms can be used to optimize vehicle routing, charging infrastructure usage, demand forecasting, and other important operational elements. In conclusion, this paper outlines a series of promising new research avenues concerning the generation of artificial data for SEACM systems.
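As a toy illustration of the kind of artificial dataset generation the review surveys, the following sketch draws Poisson trip counts over a zone grid with a deliberately asymmetric spatial and diurnal demand profile. All zones, rates, and peaks are invented for illustration:

```python
# Generate a synthetic trip-demand dataset for a shared-mobility
# simulation: Poisson counts per (zone, hour), with spatial and
# diurnal asymmetry baked in. All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(42)
n_zones, hours = 25, 24

# Base demand per zone, with a deliberate spatial asymmetry:
# "downtown" zones attract more trips than peripheral ones.
base = rng.gamma(shape=2.0, scale=3.0, size=n_zones)
base[:5] *= 3.0  # hypothetical downtown zones

# Diurnal profile: morning and evening peaks, asymmetric around noon.
hour = np.arange(hours)
profile = (1
           + 0.8 * np.exp(-((hour - 8) ** 2) / 8)
           + 1.1 * np.exp(-((hour - 18) ** 2) / 8))

# Poisson trip counts per (zone, hour) -- a typical starting point for
# training repositioning / demand-forecasting ML models.
demand = rng.poisson(lam=np.outer(base, profile))
print(demand.shape, demand.sum(), "synthetic trips generated")
```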
- Conference Article
4
- 10.1109/pacificvis48177.2020.1028
- May 8, 2020
Machine Learning (ML) plays a key role in various intelligent systems, and building an effective ML model for a data set is a difficult task involving various steps, including data cleaning, feature definition and extraction, algorithm development, and model training and evaluation. One of the most important steps in the process is to compare the substantial number of generated ML models to find the optimal one for deployment, and it is challenging to compare such models when they use dynamic numbers of features. This paper proposes a novel visualisation approach based on a radial net to compare ML models trained with different numbers of features of a given data set while revealing implicit dependency relations. In the proposed approach, ML models and features are represented by lines and arcs respectively. The dependence of ML models on dynamic numbers of features is encoded into the structure of the visualisation, where ML models and their dependent features are directly revealed from related line connections. ML model performance information is encoded with colour and line width. Together with the structure of the visualisation, feature importance can be directly discerned, helping users to understand the ML models.
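A rough approximation of the proposed layout (not the authors' implementation) can be sketched with matplotlib's polar axes: features sit around the circle, each model is a line bundle over the features it uses, and colour/width encode performance. The feature names and accuracies below are made up:

```python
# Rough sketch of a radial model-comparison view: features on a circle,
# models as line bundles linking the features they use, with colour and
# line width encoding accuracy. All model/feature data is invented.
import numpy as np
import matplotlib.pyplot as plt

features = ["age", "bmi", "bp", "glucose", "smoker", "hdl"]
models = {                        # model -> (feature indices used, accuracy)
    "M1": ([0, 1, 3], 0.81),
    "M2": ([0, 2, 3, 5], 0.74),
    "M3": ([1, 2, 4], 0.88),
}

theta = np.linspace(0, 2 * np.pi, len(features), endpoint=False)
fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.set_xticks(theta)
ax.set_xticklabels(features)
ax.set_yticklabels([])

cmap = plt.cm.viridis
for i, (name, (idx, acc)) in enumerate(models.items()):
    r = 0.4 + 0.2 * i                       # one ring per model
    angs = [theta[j] for j in idx] + [theta[idx[0]]]  # close the loop
    ax.plot(angs, [r] * len(angs),
            color=cmap(acc), lw=1 + 6 * (acc - 0.7),
            label=f"{name} ({acc:.2f})")

ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.show()
```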
- Research Article
7
- 10.1007/s41781-021-00061-3
- Jul 5, 2021
- Computing and Software for Big Science
Machine Learning (ML) will play a significant role in the success of the upcoming High-Luminosity LHC (HL-LHC) program at CERN. An unprecedented amount of data at the exascale will be collected by LHC experiments in the next decade, and this effort will require novel approaches to train and use ML models. In this paper, we discuss a Machine Learning as a Service pipeline for HEP (MLaaS4HEP) which provides three independent layers: a data streaming layer to read High-Energy Physics (HEP) data in their native ROOT data format; a data training layer to train ML models using distributed ROOT files; and a data inference layer to serve predictions from pre-trained ML models via the HTTP protocol. This modular design opens up the possibility of training on data at large scale by reading ROOT files from remote storage facilities, e.g., the World-Wide LHC Computing Grid (WLCG) infrastructure, and feeding the data to the user’s favorite ML framework. The inference layer, implemented as TensorFlow as a Service (TFaaS), may provide easy access to pre-trained ML models in existing infrastructure and applications inside or outside the HEP domain. In particular, we demonstrate the usage of the MLaaS4HEP architecture for a physics use case, namely the $t\bar{t}$ Higgs analysis in CMS originally performed using custom-made Ntuples. We provide details on training the ML model using distributed ROOT files, discuss the performance of the MLaaS and TFaaS approaches for the selected physics analysis, and compare the results with traditional methods.
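The data streaming layer can be pictured with uproot, the standard Python reader for ROOT files. The sketch below illustrates the chunked-reading pattern only; the file URL and flat branch names are placeholders, not the actual MLaaS4HEP code:

```python
# Illustrative sketch of a "data streaming layer": read HEP events from a
# (possibly remote) ROOT file in chunks and hand them to an ML framework.
# The URL, tree name, and branches below are hypothetical placeholders.
import numpy as np
import uproot  # native Python reader for ROOT files

branches = ["met", "ht", "njets"]  # hypothetical flat per-event branches
for chunk in uproot.iterate(
        "root://some.site//store/user/file.root:Events",  # placeholder path
        branches, step_size="100 MB", library="np"):
    # Each chunk is a dict of NumPy arrays; stack into a feature matrix.
    X = np.stack([chunk[b] for b in branches], axis=1)
    # ...feed X to the user's favourite ML framework here, e.g. a Keras
    # model.train_on_batch(X, y) call, without ever loading the full file.
```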
- Supplementary Content
3
- 10.3390/v17070882
- Jun 23, 2025
- Viruses
Advances in high-throughput technologies, digital phenotyping, and the increased accessibility of publicly available datasets offer opportunities to apply big data in infectious disease surveillance, diagnosis, treatment, and outcome prediction. Artificial intelligence (AI) and machine learning (ML) have emerged as promising tools to analyze complex clinical and molecular data. However, it remains unclear which AI or ML models are most suitable for infectious disease management, as most existing studies use non-scoping literature reviews to recommend AI and ML models for data analysis. This scoping literature review therefore examines the ML models and applications most relevant for infectious disease management, with a proposed actionable workflow for implementing ML models in clinical practice. We conducted a literature search on PubMed, Google Scholar, and ScienceDirect, including papers published in English between January 2020 and April 2024. Search keywords included AI, ML, public health, surveillance, diagnosis, prognosis, and infectious disease, to identify published studies using AI and ML in infectious disease management. Studies without public datasets or lacking descriptions of the ML models were excluded. The review included a total of 77 studies applied to surveillance, prognosis, and diagnosis. Different types of input data from infectious disease surveillance, clinical diagnosis, and prognosis required different ML and AI models to achieve maximum performance. Our findings highlight the potential of explainable AI and ensemble learning models, which can achieve high predictive accuracy, to be broadly applicable across different aspects of infectious disease management and to be integrated into clinical workflows to improve surveillance, diagnosis, and prognosis. However, as most of the studies have not been validated in different cohorts, it remains unclear whether these ML models generalize to different populations. Nonetheless, the findings encourage deploying ML and AI to complement clinicians and augment clinical decision-making.
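The pattern the review recommends, an ensemble model paired with a model-agnostic explanation step, can be sketched in a few lines of scikit-learn. The data here is synthetic, and the features merely stand in for clinical or surveillance variables:

```python
# Soft-voting ensemble plus permutation importance as a simple,
# model-agnostic explainability step. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=1)),
                ("gb", GradientBoostingClassifier(random_state=1)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft").fit(X_tr, y_tr)

print("held-out accuracy:", ensemble.score(X_te, y_te))

# Permutation importance ranks features by how much shuffling each one
# degrades held-out performance -- an explanation any clinician can read.
imp = permutation_importance(ensemble, X_te, y_te, n_repeats=10, random_state=1)
print("top feature indices:", imp.importances_mean.argsort()[::-1][:3])
```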
- Research Article
1
- 10.62487/2rm68r13
- Feb 13, 2024
- Web3 Journal: ML in Health Science
Aim: The aim of this study was to assess the acceptance among natural science specialists of the current official regulatory recommendations to avoid utilizing artificial intelligence (AI) and machine learning (ML) models that could exacerbate social disparities. Materials and Methods: An anonymous online survey was conducted using the Telegram platform, where participants were asked a single question: "Do you consider the inclusion of religious status in AI and ML models justified from the perspective of medical ethics and science?" Respondents were provided with only two response options: "Yes" or "No." The survey was specifically targeted at international groups, focusing primarily on English- and Russian-speaking clinicians and scientific researchers. Results: A total of 134 unique individuals participated in the survey. Two-thirds of the respondents (87 individuals) agreed that including religious status as a predictor in ML and AI models is inappropriate. Conclusion: Two-thirds of the healthcare practitioners and scientific researchers participating in this survey agree that categorizing individuals within healthcare settings based on their religion is inappropriate. Constructing healthcare predictive models on confounders like religion is unlikely to aid in identifying or treating any pathology or disease, while the high conflict potential of this predictor may deepen societal disparities.
- Research Article
6
- 10.1088/2632-2153/ad605f
- Jul 17, 2024
- Machine Learning: Science and Technology
Acquiring a substantial number of data points for training accurate machine learning (ML) models is a major challenge in scientific fields where data collection is resource-intensive. Here, we propose a novel approach for constructing a minimal yet highly informative database for training ML models in complex multi-dimensional parameter spaces. To achieve this, we mimic the underlying relation between the output and input parameters using Gaussian process regression (GPR). Using a set of known data, GPR provides predictive means and standard deviations for the unknown data. Given the standard deviation predicted by GPR, we select data points using Bayesian optimization to obtain an efficient database for training ML models. We compare the performance of ML models trained on databases obtained through this method with models trained on databases obtained using traditional approaches. Our results demonstrate that ML models trained on the database obtained via the Bayesian optimization approach consistently outperform those trained on the other two databases, achieving high accuracy with a significantly smaller number of data points. Our work contributes to the resource-efficient collection of data in high-dimensional, complex parameter spaces for achieving high-precision ML predictions.
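The acquisition loop described can be approximated with scikit-learn's GaussianProcessRegressor, labelling next whichever candidate has the largest predictive standard deviation. This variance-driven rule is one common acquisition choice rather than the paper's exact procedure, and the "expensive" target function below is a stand-in:

```python
# Uncertainty-driven data selection: fit a GPR on points labelled so far,
# then label next the candidate with the largest predictive std.
# The target function and budget are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x: np.sin(3 * x) + 0.5 * x          # stand-in "expensive" target
candidates = np.linspace(0, 5, 500).reshape(-1, 1)

rng = np.random.default_rng(0)
idx = list(rng.choice(len(candidates), size=5, replace=False))  # seed set

for _ in range(15):                            # budget of 15 new labels
    X = candidates[idx]
    y = f(X).ravel()
    gpr = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
    _, std = gpr.predict(candidates, return_std=True)
    std[idx] = -np.inf                         # don't re-pick known points
    idx.append(int(np.argmax(std)))            # most uncertain candidate

print(f"database built with {len(idx)} points instead of all {len(candidates)}")
```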
- Research Article
- 10.54364/aaiml.2024.43159
- Jan 1, 2024
- Advances in Artificial Intelligence and Machine Learning
Introduction: The accurate prediction of mandibular bone growth is crucial in orthodontics and maxillofacial surgery, impacting treatment planning and patient outcomes. Traditional methods often fall short due to their reliance on linear models and clinician expertise, which are prone to human error and variability. Artificial intelligence (AI) and machine learning (ML) offer advanced alternatives, capable of processing complex datasets to provide more accurate predictions. This systematic review examines the efficacy of AI and ML models in predicting mandibular growth compared to traditional methods. Methods: A systematic review was conducted following the PRISMA guidelines, focusing on studies published up to July 2024. Databases searched included PubMed, Embase, Scopus, and Web of Science. Studies were selected based on their use of AI and ML algorithms for predicting mandibular growth. A total of 31 studies were identified, with 6 meeting the inclusion criteria. Data were extracted on study characteristics, AI models used, and prediction accuracy. The risk of bias was assessed using the QUADAS-2 tool. Results: The review found that AI and ML models generally provided high accuracy in predicting mandibular growth. For instance, the LASSO model achieved an average error of 1.41 mm for predicting skeletal landmarks. However, not all AI models outperformed traditional methods; in some cases, deep learning models were less accurate than conventional growth prediction models. Discussion: The variability in datasets and study designs across the included studies posed challenges for comparing AI models’ effectiveness. Additionally, the complexity of AI models may limit their clinical applicability. Despite these challenges, AI and ML show significant promise in enhancing predictive accuracy for mandibular growth. Conclusion: AI and ML models have the potential to revolutionize mandibular growth prediction, offering greater accuracy and reliability than traditional methods. However, further research is needed to standardize methodologies, expand datasets, and improve model interpretability for clinical integration.
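As a toy version of the LASSO result highlighted above, the sketch below fits a sparse linear model to synthetic baseline measurements and reports MAE in millimetres. None of the data or coefficients reflect the reviewed studies:

```python
# Toy LASSO regressor predicting a future landmark coordinate (mm) from
# baseline cephalometric measurements. All data below is synthetic.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 12))                   # baseline measurements
true_w = np.zeros(12)
true_w[:4] = [2.0, -1.5, 1.0, 0.8]               # sparse underlying signal
y = X @ true_w + rng.normal(scale=1.0, size=200)  # future landmark offset, mm

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
model = Lasso(alpha=0.1).fit(X_tr, y_tr)

# LASSO's L1 penalty zeroes out irrelevant measurements, which is why it
# suits small orthodontic datasets with many candidate predictors.
print(f"MAE = {mean_absolute_error(y_te, model.predict(X_te)):.2f} mm")
print("non-zero coefficients:", np.flatnonzero(model.coef_))
```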
- Conference Article
25
- 10.1109/swc50871.2021.00023
- Oct 1, 2021
The majority of Internet of Things (IoT) devices are tiny embedded systems with a micro-controller unit (MCU) as their brain. The memory footprint (SRAM, Flash, and EEPROM) of such MCU-based devices is often very limited, restricting onboard Machine Learning (ML) model training for large training sets with high feature dimensions. To cope with memory limits, current edge analytics approaches train high-quality ML models on cloud GPUs (using large volumes of historical data), then deploy deeply optimized versions of the resultant models on edge devices for inference. Such approaches are inefficient in concept-drift situations where the data generated at the device level vary frequently and trained models cannot adapt when previously unseen data arrive. In this paper, we present Train++, an incremental training algorithm that trains ML models locally at the device level (e.g., on MCUs and small CPUs) using the full n samples of high-dimensional data. Train++ transforms even the most resource-constrained MCU-based IoT edge devices into intelligent devices that can locally build their own knowledge base on the fly using live data, creating smart, self-learning, autonomous problem-solving devices. The Train++ algorithm is extensively evaluated on 5 popular MCU boards, using 7 datasets of varying sizes and feature dimensions. Notable findings from the evaluation: (i) the proposed method reduces onboard binary-classifier training time by ≈ 10 - 226 s across various commodity MCUs; (ii) Train++ can run inference on MCUs for the entire test set in real time (1 ms); (iii) accuracy improved by 5.15 - 7.3%, since the incremental nature of Train++ enables loading the full n samples of high-dimensional datasets even on MCUs with only a few hundred kB of memory.
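Train++ itself targets C code on MCUs, but the incremental idea can be illustrated with a desktop Python analogue: update a classifier sample-by-sample via partial_fit instead of holding the full training set in memory at once:

```python
# Desktop analogue of incremental on-device training: a linear classifier
# updated one sample at a time with partial_fit, so memory use stays
# constant regardless of training-set size. Not the Train++ algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=5000, n_features=64, random_state=3)
clf = SGDClassifier(loss="log_loss", random_state=3)

classes = np.unique(y)
for i in range(len(X)):                  # "live" data arriving one by one
    clf.partial_fit(X[i:i + 1], y[i:i + 1], classes=classes)

print("accuracy on the seen stream:", clf.score(X, y))
```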
- Research Article
2
- 10.1200/jco.2023.41.4_suppl.70
- Feb 1, 2023
- Journal of Clinical Oncology
Background: Hepatitis C virus (HCV) is known for its oncogenic potential, especially in hepatocellular carcinoma and non-Hodgkin lymphoma. Several studies have indicated that patients with chronic hepatitis C (CHC) have an increased risk of developing colorectal cancer (CRC). We developed an artificial intelligence (AI) based tool using machine learning (ML) algorithms to help stratify these patients into a higher risk of CRC/adenomas. Methods: The study was approved by the institutional review board. We developed an AI-automated calculator uploaded to a graphical user interface (GUI), and we applied ML to train models to predict the probability and the number of adenomas detected on colonoscopy. Data collected were age, smoking history, significant alcohol consumption, aspirin intake, ethnicity, HCV status, gender, body mass index (BMI), and colonoscopy findings. The models can operate in either the presence or absence of the above parameters. Data sets were split in a 70:30 ratio for training and internal validation. Scikit-learn's StandardScaler was used to scale the values of continuous variables. We used the colonoscopy findings as the gold standard and trained six ML models for prediction: Support Vector Classifier, Random Forest, Bernoulli Naïve Bayes (BNB), Gradient Boosting Classifier (GBC), Logistic Regression, and Deep Neural Networks. Additional regression models were trained and tested to predict the number of polyps. A Flask (a customizable Python framework) application programming interface (API) was used to deploy the trained ML model with the highest accuracy as a web application. Finally, Heroku was used to deploy the web-based API to https://adenomadetection.herokuapp.com. Results: Data were collected for 415 patients, of whom 206 had colonoscopy results. On internal validation with the held-out patients, BNB predicted the probability of adenoma detection with the highest accuracy of 56%, precision of 55%, recall of 55%, and F1 measure of 54%. Support Vector Regressor (SVR) predicted the number of adenomas with the lowest mean absolute error (MAE) of 0.905. Conclusions: Our AI-based tool shows an association between CHC and colorectal adenomas. This tool can help providers stratify patients with CHC for early referral for screening colonoscopy. Along with a numerical percentage, the calculator can also indicate the number of adenomatous polyps a gastroenterologist can expect during colonoscopy, potentially prompting a higher adenoma detection rate.
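The modelling recipe described (70:30 split, StandardScaler, Bernoulli Naive Bayes) maps directly onto scikit-learn. The sketch below uses random placeholder features rather than the study's clinical variables:

```python
# Sketch of the abstract's recipe: 70:30 split, StandardScaler on
# continuous variables, Bernoulli Naive Bayes for adenoma probability.
# Features are random placeholders for age, BMI, HCV status, etc.
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=206, n_features=9, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

# Pipeline keeps scaling fitted on the training split only, avoiding
# leakage into the internal-validation set.
clf = make_pipeline(StandardScaler(), BernoulliNB()).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]          # probability of adenoma
p, r, f1, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), average="macro")
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```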
- Research Article
5
- 10.3233/sw-233511
- Jan 5, 2024
- Semantic Web
In recent years, knowledge graphs (KGs) have been considered pyramids of interconnected data enriched with semantics for complex decision-making. The potential of KGs and the demand for interpretability of machine learning (ML) models in diverse domains (e.g., healthcare) have gained more attention. The lack of model transparency negatively impacts the understanding and, in consequence, the interpretability of the predictions made by a model. Data-driven models should be empowered with the knowledge required to trace down their decisions and the transformations made to the input data, to increase model transparency. In this paper, we propose InterpretME, a tool that uses KGs to provide fine-grained representations of trained ML models. An ML model description includes data-based characteristics (e.g., feature definitions and SHACL validation) and model-based characteristics (e.g., relevant features and interpretations of prediction probabilities and model decisions). InterpretME allows for defining a model’s features over data collected in various formats, e.g., RDF KGs, CSV, and JSON. InterpretME relies on SHACL schemas to validate integrity constraints over the input data. InterpretME traces the steps of data collection, curation, integration, and prediction, and documents the collected metadata in the InterpretME KG. InterpretME is published on GitHub (https://github.com/SDM-TIB/InterpretME) and Zenodo (https://doi.org/10.5281/zenodo.8112628). The InterpretME framework includes a pipeline for enhancing the interpretability of ML models, the InterpretME KG, and an ontology describing the main characteristics of trained ML models; a PyPI library is also provided (https://pypi.org/project/InterpretME/). Additionally, a live demo (https://github.com/SDM-TIB/InterpretME_Demo) and a video (https://www.youtube.com/watch?v=Bu4lROnY4xg) demonstrating InterpretME in several use cases are available.
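The SHACL validation step can be illustrated with the pyshacl library. The tiny inline graphs below are examples of the general mechanism only, not the InterpretME KG or its actual shapes:

```python
# SHACL integrity-constraint validation with pyshacl: the shape requires
# ex:age to be an integer, so the string value in the data fails.
# Graphs are tiny inline examples, not InterpretME's shapes.
from pyshacl import validate
from rdflib import Graph

data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:patient1 ex:age "forty" .
""", format="turtle")

shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:AgeShape a sh:NodeShape ;
    sh:targetSubjectsOf ex:age ;
    sh:property [ sh:path ex:age ; sh:datatype xsd:integer ] .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)   # False: "forty" violates the integer constraint
print(report)     # human-readable validation report
```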
- Research Article
154
- 10.1016/j.conbuildmat.2019.08.042
- Aug 27, 2019
- Construction and Building Materials
A comparison of machine learning methods for predicting the compressive strength of field-placed concrete
- Research Article
27
- 10.1016/j.cardfail.2021.12.004
- Dec 20, 2021
- Journal of Cardiac Failure
Predicting 30-Day Readmissions in Patients With Heart Failure Using Administrative Data: A Machine Learning Approach
- Conference Article
3
- 10.1115/gt2023-102024
- Jun 26, 2023
Effective deployment of machine-learning (ML) models could drive a high level of efficiency in aircraft engine conceptual design. Aero-Engines AI is a user-friendly app created to deploy trained ML models for assessing aircraft engine concepts. It was built with tkinter, a GUI (graphical user interface) module included in the standard Python library. Employing tkinter greatly facilitates sharing the ML application as an executable file that runs on Windows machines (without needing Python or any library installed). The app takes user input for a turbofan design, preprocesses the input data, and deploys trained ML models to predict turbofan thrust specific fuel consumption (TSFC), engine weight, core size, and turbomachinery stage counts. The ML predictive models were built by employing supervised deep-learning and K-nearest neighbor regression algorithms to study patterns in an existing open-source database of production and research turbofan engines. They were trained, cross-validated, and tested in Keras, an open-source neural network API (application programming interface) written in Python, with TensorFlow (Google’s open-source artificial intelligence library) serving as the backend engine. The smooth deployment of these ML models through the app shows that Aero-Engines AI is an easy-to-use, time-saving tool for aircraft engine design-space exploration during the conceptual design stage. The current version of the app focuses on performance prediction for conventional turbofans, but its scope can easily be expanded to other engine types (such as turboshaft and hybrid-electric systems) once their ML models are developed. Overall, the use of a machine-learning app for aircraft engine concept assessment represents a promising area of development in aircraft engine conceptual design.
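The deployment pattern described, a tkinter window that takes design inputs and surfaces a model prediction, reduces to a few lines. The sketch below substitutes a made-up linear stub for the trained Keras/KNN models so it runs self-contained; the TSFC trend is invented:

```python
# Bare-bones tkinter deployment pattern: read a design input, run a
# "model", display the prediction. The predictor is a hypothetical stub
# standing in for the app's trained Keras / KNN models.
import tkinter as tk

def predict_tsfc(bpr: float) -> float:
    # Placeholder for a trained ML model: a made-up linear trend of
    # TSFC vs bypass ratio, purely so the sketch runs self-contained.
    return 0.60 - 0.015 * bpr

def on_predict():
    try:
        bpr = float(entry.get())
        result.set(f"Predicted TSFC ~ {predict_tsfc(bpr):.3f} lb/(lbf*h)")
    except ValueError:
        result.set("Enter a numeric bypass ratio")

root = tk.Tk()
root.title("Aero-Engines AI (sketch)")
tk.Label(root, text="Bypass ratio:").grid(row=0, column=0)
entry = tk.Entry(root)
entry.grid(row=0, column=1)
tk.Button(root, text="Predict", command=on_predict).grid(row=1, column=0, columnspan=2)
result = tk.StringVar()
tk.Label(root, textvariable=result).grid(row=2, column=0, columnspan=2)
root.mainloop()
```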