Perform Data Augmentation Research Articles

Background: In hematological malignancies, there is a growing demand for real-world, comprehensive data, including clinical and genomic information, to build powerful models that improve diagnosis, prognosis and personalized treatment choice. However, collecting such information in large patient populations is challenging, and many patient-privacy issues need to be accounted for. One approach that can circumvent these issues is the creation of synthetic data that captures the complexities of the original data set (distributions, non-linear relationships, and noise) without including any real patient information.

Aims: 1) Apply advanced synthetic data generation methods to real-world datasets of different hematological malignancies. 2) Develop a Synthetic Validation Framework to evaluate the quality of synthetic data and perform data augmentation. 3) Test the capability of synthetic data to accelerate translational research.

Methods: We implemented a Conditional Tabular Wasserstein Generative Adversarial Network (GAN) architecture with Gradient Penalty to generate synthetic data. Use cases were different cohorts of patients with myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML) with available clinical and molecular features. We created a Synthetic Validation Framework to evaluate the quality of the generated synthetic data: Clinical Synthetic Fitness (CSF) and Genomic Synthetic Fitness (GSF) scores were calculated as the average of the multiple metric tests adopted. Patients were stratified by Hierarchical Dirichlet (HD) clustering. Explainability analysis was carried out with the SHapley Additive exPlanations (SHAP) approach. Survival analyses were performed with Kaplan-Meier curves and CoxPH models (the experimental plan is reported in Figure 1).

Results: We first created a synthetic copy of an MDS cohort (n=2,043) using all the real data for training the model. We compared synthetic vs. real data, obtaining high fitness performance for both clinical and genomic features (CSF=93%; GSF=90%). HD clustering was then applied to define clusters capturing broad dependencies among genomic features, showing comparable results in synthetic vs. real data; SHAP analysis indicated that similar features drive patients' classification in both datasets. Finally, synthetic patients had survival comparable to that of real ones; when applying the conventional scoring system (IPSS-R), the probability of survival in the 5 risk categories was comparable between synthetic and real data.

In the second experimental setting, we analysed synthetic MDS datasets of different sizes generated with a model trained on a real dataset. Interestingly, when generating a synthetic augmented dataset (200%), we obtained high fitness performance for both clinical and genomic features (CSF=91%; GSF=89%). Moreover, performance showed a similar trend in a cohort of 1,002 patients with AML (CSF=92%; GSF=89%), thus providing evidence for high generalizability of the model across different clinical settings.

Finally, we investigated whether the generation of synthetic data can accelerate translational research in hematology. After the first publication on the clinical relevance of gene mutations in MDS (Leukemia 2014;28:241), it took several years to collect data in large patient populations for generating a molecular classification (JCO 2021;39:1223) and a prognostic score (IPSS-M, NEJM Evid 2022;1:7). Starting from the MDS cohort available in 2014 (n=944, Leukemia 2014;28:241), we generated a 300% augmented synthetic dataset. HD clustering was applied to the synthetic data to define genomic-based clinical entities, resulting in the identification of the same 8 subgroups described in a real cohort of 2,043 patients many years later. Moreover, we applied a CoxPH model to the synthetic dataset to generate a molecular prognostic score (IPSS-M_Syn). The model was based on molecular features similar to those of the real IPSS-M and identified 6 risk categories in which the probability of survival was similar to that of the IPSS-M risk groups (Figure 2).

Conclusion: GAN-generated synthetic data recapitulate the statistical properties and complexity of clinical and genomic features in different hematological malignancies, replicate reliable survival estimates and allow effective data augmentation. The implementation of this technology appears able to accelerate precision medicine research in hematology.
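The abstract describes the CSF and GSF scores only as "the average of multiple metric tests" and does not name the tests. As a rough illustration of that idea, the sketch below averages per-feature similarity between a real and a synthetic table. The choice of metrics (one minus the two-sample Kolmogorov-Smirnov statistic for continuous features, one minus the total variation distance for categorical ones), the function names, and the toy values are all assumptions made for this example, not the authors' framework.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def tv_distance(real, synth):
    """Total variation distance between two categorical frequency distributions."""
    freq_real, freq_synth = Counter(real), Counter(synth)
    categories = set(freq_real) | set(freq_synth)
    return 0.5 * sum(
        abs(freq_real[c] / len(real) - freq_synth[c] / len(synth)) for c in categories
    )


def fitness_score(real, synth, continuous, categorical):
    """Average per-feature similarity (1 - distance); 1.0 means identical marginals.

    Hypothetical stand-in for a CSF/GSF-style score: the real metric set is not
    given in the abstract.
    """
    scores = [1.0 - ks_statistic(real[f], synth[f]) for f in continuous]
    scores += [1.0 - tv_distance(real[f], synth[f]) for f in categorical]
    return sum(scores) / len(scores)


# Toy columns invented for illustration only (not patient data).
real = {"age": [61, 67, 72, 78, 80], "sex": ["F", "M", "M", "F", "M"]}
synth = {"age": [60, 66, 73, 77, 82], "sex": ["M", "M", "F", "F", "M"]}
print(fitness_score(real, synth, continuous=["age"], categorical=["sex"]))
```

A score built this way only compares marginal distributions feature by feature; it says nothing about correlations between features or about survival, which the abstract assesses separately through clustering, SHAP and survival analyses.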


Background: The application of machine learning to cardiac auscultation has the potential to improve the accuracy and efficiency of both routine and point-of-care screenings. The use of convolutional neural networks (CNN) on heart sound spectrograms in particular has defined state-of-the-art performance. However, the relative paucity of patient data remains a significant barrier to creating models that can adapt to a wide range of potential variability. To that end, we examined a CNN model's performance on automated heart sound classification, before and after various forms of data augmentation, and aimed to identify the optimal augmentation methods for cardiac spectrogram analysis.

Results: We built a standard CNN model to classify cardiac sound recordings as either normal or abnormal. The baseline control model achieved a PR AUC of 0.763 ± 0.047. Among the single data augmentation techniques explored, horizontal flipping of the spectrogram image improved model performance the most, with a PR AUC of 0.819 ± 0.044. Principal component analysis color augmentation (PCA) and perturbations of saturation-value (SV) of the hue-saturation-value (HSV) color scale achieved PR AUCs of 0.779 ± 0.045 and 0.784 ± 0.037, respectively. Time and frequency masking resulted in a PR AUC of 0.772 ± 0.050. Pitch shifting, time stretching and compressing, noise injection, vertical flipping, and applying random color filters negatively impacted model performance. Concatenating the best performing data augmentation technique (horizontal flip) with PCA and SV perturbations improved model performance.

Conclusion: Data augmentation can improve classification accuracy by expanding and diversifying the dataset, which protects against overfitting to random variance. However, data augmentation is necessarily domain specific. For example, methods like noise injection have found success in other areas of automated sound classification, but in the context of cardiac sound analysis, noise injection can mimic the presence of murmurs and worsen model performance. Thus, care should be taken to choose clinically appropriate forms of data augmentation to avoid negatively impacting model performance.
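Three of the augmentations named in this abstract (horizontal flipping, time masking and frequency masking) are simple enough to sketch on a toy spectrogram. The version below is a minimal sketch under stated assumptions: the spectrogram is a plain list of frequency rows rather than the image representation used in the study, masked cells are set to 0.0, and the function names and mask sizes are hypothetical choices made for this example.

```python
import random


def horizontal_flip(spec):
    """Reverse the time axis; spec is a list of frequency rows, each a list of frames."""
    return [row[::-1] for row in spec]


def time_mask(spec, width, rng):
    """Zero out `width` consecutive time frames starting at a random position."""
    start = rng.randrange(len(spec[0]) - width + 1)
    return [
        [0.0 if start <= t < start + width else value for t, value in enumerate(row)]
        for row in spec
    ]


def frequency_mask(spec, height, rng):
    """Zero out `height` consecutive frequency rows starting at a random position."""
    start = rng.randrange(len(spec) - height + 1)
    return [
        [0.0] * len(row) if start <= f < start + height else list(row)
        for f, row in enumerate(spec)
    ]


rng = random.Random(0)  # seeded so the augmentation is reproducible
# Toy spectrogram: 4 frequency bins x 6 time frames, all values non-zero.
spec = [[float(10 * f + t + 1) for t in range(6)] for f in range(4)]
augmented = frequency_mask(time_mask(horizontal_flip(spec), 2, rng), 1, rng)
```

Each function returns a new spectrogram and leaves its input unchanged, so the original recording can be kept in the training set alongside its augmented copies.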


Articles published on Perform Data Augmentation

D2BOF-COVIDNet: A Framework of Deep Bayesian Optimization and Fusion-Assisted Optimal Deep Features for COVID-19 Classification Using Chest X-ray and MRI Scans.

A pyramid input augmented multi-scale CNN for GGO detection in 3D lung CT images

Applying an Intelligent Approach to Environmental Sustainability Innovation in Complex Scenes

An End-to-End Contrastive Self-Supervised Learning Framework for Language Understanding

A machine learning-based framework for forecasting sales of new products with short life cycles using deep neural networks

Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies

An explainable COVID-19 detection system based on human sounds.

Research on the Guidance of Youth Labor Education Based on the "Combination of Education and Production Labor" Program Based on the Deep Learning Model.

A disaster classification application using convolutional neural network by performing data augmentation

On the analysis of data augmentation methods for spectral imaged based heart sound classification using convolutional neural networks

RG-GCN: A Random Graph Based on Graph Convolution Network for Point Cloud Semantic Segmentation

Data augmentation strategies for EEG-based motor imagery decoding

Deep Convolutional Neural Networks For Environmental Sound Classification

A Deep Convolutional Generative Adversarial Networks-Based Method for Defect Detection in Small Sample Industrial Parts Images

Does Minority Case Sampling Improve Performance with Imbalanced Outcomes in Psychological Research?

Artificial intelligence based detection of age-related macular degeneration using optical coherence tomography with unique image preprocessing.

A Multi-Level Optimization Framework for End-to-End Text Augmentation

A review: Data pre-processing and data augmentation techniques

Pengenalan Emosi Pembicara Menggunakan Convolutional Neural Networks (Speaker Emotion Recognition Using Convolutional Neural Networks)

Transfer-Learning-Based Approach for the Diagnosis of Lung Diseases from Chest X-ray Images.
