A VAE approach to sample multivariate extremes

Abstract

Generating accurate extremes from an observational data set is crucial when estimating the risks associated with future extremes that could be larger than those already observed. Applications range from natural disasters to financial crashes. Generative models from the machine learning (ML) community do not apply to extreme samples without careful adaptation. Meanwhile, asymptotic results from extreme value theory (EVT) provide a theoretical framework for modeling multivariate extreme events. Bridging these two fields, this paper details a variational autoencoder (VAE) approach for sampling multivariate heavy-tailed distributions, in which extremes of particularly large intensity are likely to occur. We illustrate the relevance of our approach on a synthetic data set and on a real data set of discharge measurements along the Danube river network. The latter shows the potential of our approach for flood risk assessment. In addition to outperforming the vanilla VAE on the tested data sets, we also provide a comparison with a competing EVT-based generative approach. In the tested cases, our approach better captures the dependence structure between extreme events.
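
To make "multivariate heavy-tailed" concrete, here is a minimal sketch of a standard EVT-style construction. It is not the paper's VAE, which the abstract does not detail. Each point is a heavy-tailed radius times an angle on the unit simplex, and the Hill estimator then recovers the tail index from the radii. The function names, the Pareto radius, and the uniform angular distribution are all assumptions chosen for illustration.

```python
import math
import random


def sample_heavy_tailed(n, dim, alpha, rng):
    """Draw n points X = R * Theta in the positive orthant (illustrative only).

    R is Pareto(alpha) on [1, inf), so it is the heavy-tailed L1 norm of X.
    Theta is uniform on the unit simplex and plays the role of the angle.
    """
    points = []
    for _ in range(n):
        # Inverse-CDF draw of a Pareto radius; 1 - U lies in (0, 1].
        radius = (1.0 - rng.random()) ** (-1.0 / alpha)
        # Normalised exponentials are uniformly distributed on the simplex.
        e = [rng.expovariate(1.0) for _ in range(dim)]
        total = sum(e)
        theta = [v / total for v in e]
        points.append([radius * t for t in theta])
    return points


def hill_tail_index(values, k):
    """Hill estimate of the tail index alpha from the k largest values."""
    top = sorted(values, reverse=True)[: k + 1]
    threshold = top[k]  # the (k+1)-th largest value acts as the threshold
    mean_log_excess = sum(math.log(v / threshold) for v in top[:k]) / k
    return 1.0 / mean_log_excess


rng = random.Random(0)
sample = sample_heavy_tailed(n=5000, dim=3, alpha=2.0, rng=rng)
radii = [sum(x) for x in sample]  # L1 norm of each point
alpha_hat = hill_tail_index(radii, k=500)
```

A smaller `alpha` means a heavier tail. A generator with light-tailed outputs would give a Hill estimate that keeps growing as `k` shrinks, which is one simple diagnostic for whether generated extremes are heavy-tailed.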

Similar Papers
  • Preprint Article
  • Cite Count: 2
  • 10.5194/egusphere-egu21-666
A comparison of moderate and extreme ERA-5 daily precipitation with two observational data sets
  • Mar 3, 2021
  • Pauline Rivoire + 2 more

Both mean and extreme precipitation are highly relevant, so a probability distribution that models the entire precipitation distribution provides important information. Gamma distributions are often used to model low and moderate precipitation amounts, and extreme value theory allows modeling the upper tail of the distribution. We apply the Extended Generalized Pareto Distribution (EGPD). Thanks to a transition function, this method overcomes the problem of finding a threshold between upper and lower tails. The transition cumulative distribution function of the EGPD is constrained on the upper tail and lower tail to enable a GPD behavior for both small and large extremes.

EGPD is used here to characterize ERA-5 precipitation. ERA-5 is a new ECMWF climate re-analysis dataset that provides a numerical description of the recent climate by combining a numerical weather model with observations. The data set is global with a spatial resolution of 0.25° and currently covers the period from 1979 to present. ERA-5 precipitation is computed from model forecasts and therefore needs validation against observational datasets. ERA-5 daily precipitation is compared to EOBS precipitation, a gridded dataset spatially interpolated from observations over Europe, and to CMORPH precipitation, a global satellite-based dataset. Simultaneous occurrence of extreme events is assessed with a hit rate. An intensity comparison is conducted with quantile confidence intervals and a Kullback-Leibler divergence test, both derived from the EGPD.

Overall, both good agreement and strong mismatches between ERA-5 and the observational datasets can be found, depending on the feature of interest in the precipitation data. This work highlights both. For example, extreme event occurrences in ERA-5 and the observational datasets appear to agree. The overlap between 95% confidence intervals on quantiles depends on the season and the probability of occurrence. Over Europe, the best agreement is generally reached in regions with high station density in EOBS. The global intensity comparison between ERA-5 and CMORPH shows good agreement for moderate quantiles, except for some mountainous regions, but strong disagreement in the tropics for large quantiles.
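
The transition-function idea behind the EGPD can be sketched in a few lines. The abstract does not say which transition function is used, so this sketch assumes the simplest one from the EGPD literature, G(u) = u^kappa, applied to a GPD CDF with positive shape. The parameter names and values are illustrative, not taken from the study.

```python
import random


def egpd_cdf(x, kappa, sigma, xi):
    """CDF F(x) = H(x) ** kappa, with H the GPD CDF (scale sigma, shape xi > 0)."""
    if x <= 0.0:
        return 0.0
    gpd = 1.0 - (1.0 + xi * x / sigma) ** (-1.0 / xi)
    return gpd ** kappa


def egpd_quantile(p, kappa, sigma, xi):
    """Inverse of egpd_cdf for 0 <= p < 1."""
    v = p ** (1.0 / kappa)  # undo the assumed transition G(u) = u ** kappa
    return (sigma / xi) * ((1.0 - v) ** (-xi) - 1.0)


def egpd_sample(n, kappa, sigma, xi, rng):
    """Inverse-transform sampling: push uniform draws through the quantile function."""
    return [egpd_quantile(rng.random(), kappa, sigma, xi) for _ in range(n)]


rng = random.Random(1)
rain = egpd_sample(1000, kappa=0.8, sigma=5.0, xi=0.1, rng=rng)
```

With this choice, `kappa` shapes the lower tail and `xi` the upper tail, so no threshold between moderate and extreme values has to be picked.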

  • Research Article
  • 10.1021/acsomega.5c05979
Fusion of Generative AI Techniques and Machine Learning Models to Generate and Investigate Biosignals for Glucose Sensors
  • Nov 18, 2025
  • ACS Omega
  • Kirti Sharma + 2 more

The research presents a cutting-edge and inexpensive technology to predict hematological parameters from the amperometric data set of a hand-held glucometer. The data set contains peak current (Ip in μA), time corresponding to the current (Tp in sec), hematocrit volume (Hv in %), glucose concentration (Gc in mg/dL), and blood viscosity at 12 s⁻¹ (Vis_12 in cP) and 120 s⁻¹ (Vis_120 in cP) shear rates. We deciphered an interconnection between the blood glucose concentration and hemoglobin level through the hematocrit volume of the blood by utilizing machine learning (ML) models. The ML models, such as linear regression (LR), support vector regressor (SVR), decision tree (DT), random forest regressor (RFR), extreme gradient boosting regressor (XGBoost), light gradient boosting regressor (Light GBM), and artificial neural networks (ANNs), predicted Gc, Hv, Vis_12, Vis_120, Hgb, and Occ with an acceptable accuracy corroborated through statistical metrics, namely, R-squared (R²) score, mean squared error (MSE), and root-mean squared error (RMSE). The ML models were trained with 80% of the data set and validated with the remaining 20%. Furthermore, the reliability of the models was tested via relative error (RE), the K-fold cross-validation technique, and 95% confidence intervals in the domain of predictive analytics. Moreover, five thousand synthetic data sets were generated by utilizing generative artificial intelligence (Gen AI) models such as Generative Adversarial Network (GAN), Variational Auto-Encoder (VAE), and Gaussian copula (Gcop), a multivariate distribution technique. Synthetic data sets were assessed by training the developed machine learning models on the synthetic data set and testing them on the original data set. This approach enabled validation of model performance by comparing the original data with the predicted outputs. The statistical metrics of the models trained and tested on the original data set were compared with the trained data set and tested on the synthetic data set. While XGBoost outperformed other models on the original data set, Light GBM surpassed all models, including XGBoost, on the Gcop-generated data set, making it the most reliable model for synthetic data applications. Our limitations lie in the viscosity prediction on the Gcop-generated synthetic data set, as corroborated through SHAP analysis. In future work, we aim to refine the generative process to produce feasible values for the viscosity variables.

  • Preprint Article
  • Cite Count: 1
  • 10.5194/egusphere-egu24-19266
Application of novel generative diffusion models to precipitation downscaling
  • Mar 11, 2024
  • Alex Saoulis + 4 more

Machine Learning (ML) is playing an increasingly valuable role in statistical downscaling. ML can leverage complex, non-linear relationships latent in the training data, and the community has demonstrated its significant potential to learn a downscaling mapping. Following the perfect-prognosis (PP) approach, ML models can be trained on historical reanalysis data to learn a relationship between coarse predictors and higher resolution (i.e. downscaled) predictands. Once trained, the models can then be evaluated on general circulation model (GCM) outputs to generate regional downscaled results. Due to the relatively low computational cost of training and utilising these models, they can be used to efficiently downscale large ensembles of climate models over regional to global domains. This work employs a novel diffusion algorithm to downscale climate data. Diffusion models have proven highly successful in applications such as natural image generation and super-resolution (the natural image analogue to climate downscaling). Diffusion models have been shown to significantly outperform earlier generative ML models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs); they can produce highly diverse samples, emulate fine details with high fidelity, and exhibit much more stable training than alternative ML models. This work trains and evaluates diffusion models on the Multi-Source Weighted-Ensemble Precipitation (MSWEP) observational dataset over the Colorado River Basin (USA). High resolution (10 km x 10 km) MSWEP fields are artificially coarsened to generate training data. Once trained, the models are applied to bias-corrected climate model outputs to evaluate their ability to generate realistic downscaled precipitation fields. Performance is compared with several benchmarks, including classical regression techniques as well as alternative ML models.

  • Research Article
  • Cite Count: 1
  • 10.1200/cci-25-00033
Longitudinal Synthetic Data Generation by Artificial Intelligence to Accelerate Clinical and Translational Research in Breast Cancer.
  • Nov 1, 2025
  • JCO clinical cancer informatics
  • Elena Zazzetti + 16 more

Real-world data (RWD) are critical for breast cancer (BC) research but are limited by privacy concerns, missing information, and data fragmentation. This study explores synthetic data (SD) generated through advanced generative models to address these challenges and create harmonized longitudinal data sets. A data set of 1052 patients with human epidermal growth factor receptor 2-positive and triple-negative BC from the Informatics for Integrating Biology and the Bedside (i2b2) platform was used. Advanced generative models, including generative adversarial networks (GANs), variational autoencoders (VAEs), and language models (LMs), were applied to generate synthetic longitudinal data sets replicating disease progression, treatment patterns, and clinical outcomes. The Synthetic Validation Framework (SAFE) powered by Train was used to evaluate the fidelity, utility, and privacy. SD were tested across three settings: (1) integration with i2b2 for privacy-preserving data sets; (2) multistate disease modeling to predict clinical outcomes; and (3) generation of synthetic control groups for clinical trials. The synthetic data sets exhibited high fidelity (score 0.94) and ensured privacy, with temporal patterns validated through time-series analyses and Uniform Manifold Approximation and Projection embeddings. In setting 1, SD accurately mirrored RWD on the i2b2 platform while maintaining privacy. In setting 2, incorporating SD improved the predictive performance of a multistate disease progression model, increasing the C-index by up to 10%. In setting 3, SD replicated the end points of the APT trial, demonstrating its feasibility for generating synthetic control arms with preserved statistical properties of the real data set. AI-generated longitudinal SD effectively address key challenges in RWD use in BC. This approach can improve translational research and clinical trial design while ensuring robust privacy protection. Integration with platforms such as i2b2 highlights their scalability and potential for broader applications in oncology.

  • Research Article
  • Cite Count: 67
  • 10.1007/s00500-019-04094-0
Auto-encoder-based generative models for data augmentation on regression problems
  • May 30, 2019
  • Soft Computing
  • Hiroshi Ohno

Recently, auto-encoder-based generative models have been widely and successfully used for image processing. However, there are few studies on the realization of continuous input–output mappings for regression problems. A lack of sufficient training data plagues regression problems; it is a notable problem in machine learning in general and limits its application in the field of materials science. Using variational auto-encoders (VAEs) as generative models for data augmentation, we address the issue of small data size for regression problems. VAEs are popular and powerful auto-encoder-based generative models. Generative auto-encoder models such as VAEs use multilayer neural networks to generate sample data. In this study, we demonstrate the effectiveness of multi-task learning (auto-encoding and regression tasks) for regression problems. We conducted experiments on seven benchmark datasets and on one ionic conductivity dataset as an application in materials science. The experimental results show that multi-task learning for VAEs improved the generalization performance of a multivariable linear regression model trained with augmented data.

  • Research Article
  • Cite Count: 163
  • 10.2478/popets-2019-0067
Monte Carlo and Reconstruction Membership Inference Attacks against Generative Models
  • Jul 30, 2019
  • Proceedings on Privacy Enhancing Technologies
  • Benjamin Hilprecht + 2 more

We present two information leakage attacks that outperform previous work on membership inference against generative models. The first attack allows membership inference without assumptions on the type of the generative model. Contrary to previous evaluation metrics for generative models, like Kernel Density Estimation, it only considers samples of the model which are close to training data records. The second attack specifically targets Variational Autoencoders, achieving high membership inference accuracy. Furthermore, previous work mostly considers membership inference adversaries who perform single record membership inference. We argue for considering regulatory actors who perform set membership inference to identify the use of specific datasets for training. The attacks are evaluated on two generative model architectures, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), trained on standard image datasets. Our results show that the two attacks yield success rates superior to previous work on most data sets while at the same time having only very mild assumptions. We envision the two attacks in combination with the membership inference attack type formalization as especially useful, for example, to enforce data privacy standards and to automatically assess model quality in machine-learning-as-a-service setups. In practice, our work motivates the use of GANs since they prove less vulnerable against information leakage attacks while producing detailed samples.
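
The first attack described above scores a candidate record by how many generated samples fall close to it. A toy sketch of that scoring rule follows. The distance, the radius `eps`, and the deliberately overfitted "generator" are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random


def mc_membership_score(candidate, generated, eps):
    """Fraction of generated samples within Euclidean distance eps of the candidate."""
    close = sum(1 for g in generated if math.dist(candidate, g) < eps)
    return close / len(generated)


rng = random.Random(0)
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

# Toy overfitted generator: it replays training records with small jitter.
generated = []
for i in range(300):
    x, y = train[i % len(train)]
    generated.append((x + rng.uniform(-0.05, 0.05), y + rng.uniform(-0.05, 0.05)))

member_score = mc_membership_score((1.0, 1.0), generated, eps=0.2)
non_member_score = mc_membership_score((10.0, 10.0), generated, eps=0.2)
```

A record is flagged as a likely training member when its score is high relative to other candidates. In this toy setup the member gets a clearly positive score and the distant non-member gets zero.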

  • Book Chapter
  • Cite Count: 18
  • 10.1016/b978-0-32-396126-4.00015-1
Chapter 10 - Generative adversarial networks
  • Jan 1, 2023
  • Machine Learning for Transportation Research and Applications
  • Yinhai Wang + 2 more

  • Research Article
  • Cite Count: 39
  • 10.1016/j.eswa.2022.117936
Generating realistic cyber data for training and evaluating machine learning classifiers for network intrusion detection systems
  • Jun 27, 2022
  • Expert Systems with Applications
  • Marc Chalé + 1 more

  • Research Article
  • Cite Count: 21
  • 10.1186/s12874-021-01237-6
Deep generative models in DataSHIELD
  • Apr 3, 2021
  • BMC Medical Research Methodology
  • Stefan Lenz + 2 more

Background: The best way to calculate statistics from medical data is to use the data of individual patients. In some settings, this data is difficult to obtain due to privacy restrictions. In Germany, for example, it is not possible to pool routine data from different hospitals for research purposes without the consent of the patients. Methods: The DataSHIELD software provides an infrastructure and a set of statistical methods for joint, privacy-preserving analyses of distributed data. The contained algorithms are reformulated to work with aggregated data from the participating sites instead of the individual data. If a desired algorithm is not implemented in DataSHIELD or cannot be reformulated in such a way, using artificial data is an alternative. Generating artificial data is possible using so-called generative models, which are able to capture the distribution of given data. Here, we employ deep Boltzmann machines (DBMs) as generative models. For the implementation, we use the package “BoltzmannMachines” from the Julia programming language and wrap it for use with DataSHIELD, which is based on R. Results: We present a methodology together with a software implementation that builds on DataSHIELD to create artificial data that preserve complex patterns from distributed individual patient data. Such data sets of artificial patients, which are not linked to real patients, can then be used for joint analyses. As an exemplary application, we conduct a distributed analysis with DBMs on a synthetic data set, which simulates genetic variant data. Patterns from the original data can be recovered in the artificial data using hierarchical clustering of the virtual patients, demonstrating the feasibility of the approach. Additionally, we compare DBMs, variational autoencoders, generative adversarial networks, and multivariate imputation as generative approaches by assessing the utility and disclosure of synthetic data generated from real genetic variant data in a distributed setting with data of a small sample size. Conclusions: Our implementation adds to DataSHIELD the ability to generate artificial data that can be used for various analyses, e.g., for pattern recognition with deep learning. This also demonstrates more generally how DataSHIELD can be flexibly extended with advanced algorithms from languages other than R.

  • Research Article
  • Cite Count: 10
  • 10.1007/s00382-016-3108-5
Variability of hydrological extreme events in East Asia and their dynamical control: a comparison between observations and two high-resolution global climate models
  • Apr 13, 2016
  • Climate Dynamics
  • N Freychet + 12 more

This work investigates the variability of extreme weather events (drought spells, DS15, and daily heavy rainfall, PR99) over East Asia. It particularly focuses on the large scale atmospheric circulation associated with high levels of the occurrence of these extreme events. Two observational datasets (APHRODITE and PERSIANN) are compared with two high-resolution global climate models (HiRAM and HadGEM3-GC2) and an ensemble of other lower resolution climate models from CMIP5. We first evaluate the performance of the high resolution models. They both exhibit good skill in reproducing extreme events, especially when compared with CMIP5 results. Significant differences exist between the two observational datasets, highlighting the difficulty of having a clear estimate of extreme events. The link between the variability of the extremes and the large scale circulation is investigated, on monthly and interannual timescales, using composite and correlation analyses. Both extreme indices DS15 and PR99 are significantly linked to the low level wind intensity over East Asia, i.e. the monsoon circulation. It is also found that DS15 events are strongly linked to the surface temperature over the Siberian region and to the land-sea pressure contrast, while PR99 events are linked to the sea surface temperature anomalies over the West North Pacific. These results illustrate the importance of the monsoon circulation on extremes over East Asia. The dependencies on the surface temperature over the continent and the sea surface temperature raise the question as to what extent they could affect the occurrence of extremes over tropical regions in future projections.

  • Research Article
  • Cite Count: 3
  • 10.3390/atmos14030553
Daily Precipitation and Temperature Extremes in Southern Italy (Calabria Region)
  • Mar 14, 2023
  • Atmosphere
  • Giuseppe Prete + 4 more

We apply extreme value theory (EVT) to study the daily precipitation and temperature extremes in the Calabria region (southern Italy), mainly considering a long-term observational dataset (1990–2020) and also investigating the possible use of the ERA5 (ECMWF Reanalysis v5) fields. The efficiency of the EVT applied to the available observational dataset is first assessed, both through a pointwise statistical analysis and through return-level maps. Two different EVT methods are adopted, namely the peak-over-threshold (POT) approach for the precipitation and the block-maxima (BM) approach for the temperature. The proposed methodologies appear to be suitable for describing daily extremes both in quantitative terms, considering the pointwise analysis at specific locations, and in terms of the areas most affected by extreme values, considering the return-level maps. Conversely, the analysis conducted using the reanalysis fields for the same time period highlights the limitations of using these fields for a correct quantitative reconstruction of the extremes while showing a certain consistency regarding the areas most affected by extreme events. By applying the methodology to the observed dataset but focusing on return periods of 50 and 100 years, an increasing trend of daily extreme rainfall and temperature over the whole region emerges, with specific areas more affected by these events; in particular, rainfall values up to 500 mm/day are predicted in the southeastern part of Calabria for the 50-year return period, and maximum daily temperatures up to 40 °C are expected in the next 100 years, mainly in the western and southern parts of the region. These results offer a useful perspective for evaluating the exacerbation of future extreme weather events possibly linked to climate change effects.
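
The peak-over-threshold workflow mentioned above can be sketched as follows: choose a threshold, fit a generalized Pareto distribution (GPD) to the excesses, and convert the fit into a return level. This is a minimal sketch that assumes a method-of-moments fit and synthetic exponential "daily rainfall". The study's actual estimator, threshold choice, and data are not given in the abstract.

```python
import math
import random


def fit_gpd_moments(excesses):
    """Method-of-moments GPD fit; returns (sigma, xi). Only reliable for xi < 0.5."""
    n = len(excesses)
    mean = sum(excesses) / n
    var = sum((e - mean) ** 2 for e in excesses) / (n - 1)
    ratio = mean * mean / var  # equals 1 - 2*xi for an exact GPD
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * mean * (1.0 + ratio)
    return sigma, xi


def return_level(threshold, sigma, xi, exceed_rate, obs_per_year, years):
    """Level exceeded on average once every `years` years under the fitted GPD."""
    m = years * obs_per_year * exceed_rate  # expected exceedances in the period
    if abs(xi) < 1e-9:
        return threshold + sigma * math.log(m)  # exponential-tail limit
    return threshold + (sigma / xi) * (m ** xi - 1.0)


rng = random.Random(42)
daily = [rng.expovariate(1.0 / 10.0) for _ in range(30 * 365)]  # synthetic, mean 10 mm

u = sorted(daily)[int(0.95 * len(daily))]  # empirical 95% quantile as threshold
excesses = [x - u for x in daily if x > u]
rate = len(excesses) / len(daily)

sigma, xi = fit_gpd_moments(excesses)
rl50 = return_level(u, sigma, xi, rate, obs_per_year=365, years=50)
rl100 = return_level(u, sigma, xi, rate, obs_per_year=365, years=100)
```

In practice a maximum-likelihood fit and a threshold-stability check would usually replace the moment estimator. The return-level formula stays the same.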

  • Research Article
  • Cite Count: 26
  • 10.1109/tifs.2023.3262112
Bottlenecks CLUB: Unifying Information-Theoretic Trade-Offs Among Complexity, Leakage, and Utility
  • Jan 1, 2023
  • IEEE Transactions on Information Forensics and Security
  • Behrooz Razeghi + 3 more

Bottleneck problems are an important class of optimization problems that have recently gained increasing attention in the domain of machine learning and information theory. They are widely used in generative models, fair machine learning algorithms, design of privacy-assuring mechanisms, and appear as information-theoretic performance bounds in various multi-user communication problems. In this work, we propose a general family of optimization problems, termed the complexity-leakage-utility bottleneck (CLUB) model, which (i) provides a unified theoretical framework that generalizes most of the state-of-the-art literature for the information-theoretic privacy models, (ii) establishes a new interpretation of the popular generative and discriminative models, (iii) constructs new insights for the generative compression models, and (iv) can be used to obtain fair generative models. We first formulate the CLUB model as a complexity-constrained privacy-utility optimization problem. We then connect it with the closely related bottleneck problems, namely information bottleneck (IB), privacy funnel (PF), deterministic IB (DIB), conditional entropy bottleneck (CEB), and conditional PF (CPF). We show that the CLUB model generalizes all these problems as well as most other information-theoretic privacy models. Then, we construct the deep variational CLUB (DVCLUB) models by employing neural networks to parameterize variational approximations of the associated information quantities. Building upon these information quantities, we present unified objectives of the supervised and unsupervised DVCLUB models. 
Leveraging the DVCLUB model in an unsupervised setup, we then connect it with state-of-the-art generative models, such as variational auto-encoders (VAEs), generative adversarial networks (GANs), as well as the Wasserstein GAN (WGAN), Wasserstein auto-encoder (WAE), and adversarial auto-encoder (AAE) models through the optimal transport (OT) problem. We then show that the DVCLUB model can also be used in fair representation learning problems, where the goal is to mitigate the undesired bias during the training phase of a machine learning model. We conduct extensive quantitative experiments on colored-MNIST and CelebA datasets.

  • Research Article
  • 10.1063/5.0222403
Physical discovery in representation learning via conditioning on prior knowledge
  • Aug 14, 2024
  • Journal of Applied Physics
  • Yongtao Liu + 3 more

Recent advances in electron, scanning probe, optical, and chemical imaging and spectroscopy yield bespoke data sets containing the information of structure and functionality of complex systems. In many cases, the resulting data sets are underpinned by low-dimensional simple representations encoding the factors of variability within the data. The representation learning methods seek to discover these factors of variability, ideally further connecting them with relevant physical mechanisms. However, generally, the task of identifying the latent variables corresponding to actual physical mechanisms is extremely complex. Here, we present an empirical study of an approach based on conditioning the data on the known (continuous) physical parameters and systematically compare it with the previously introduced approach based on the invariant variational autoencoders. The conditional variational autoencoder (cVAE) approach does not rely on the existence of the invariant transforms and hence allows for much greater flexibility and applicability. Interestingly, cVAE allows for limited extrapolation outside of the original domain of the conditional variable. However, this extrapolation is limited compared to the cases when true physical mechanisms are known, and the physical factor of variability can be disentangled in full. We further show that introducing the known conditioning results in the simplification of the latent distribution if the conditioning vector is correlated with the factor of variability in the data, thus allowing us to separate relevant physical factors. We initially demonstrate this approach using 1D and 2D examples on a synthetic data set and then extend it to the analysis of experimental data on ferroelectric domain dynamics visualized via piezoresponse force microscopy.

  • Supplementary Content
  • Cite Count: 1
  • 10.25394/pgs.12370451.v1
Inferential GANs and Deep Feature Selection with Applications
  • Jun 15, 2020
  • Figshare
  • Yao Chen

Deep neural networks (DNNs) have become popular due to their predictive power and flexibility in model fitting. In unsupervised learning, variational autoencoders (VAEs) and generative adversarial networks (GANs) are the two most popular and successful generative models. How to provide a unifying framework combining the best of VAEs and GANs in a principled way is a challenging task. In supervised learning, the demand for high-dimensional data analysis has grown significantly, especially in the applications of social networking, bioinformatics, and neuroscience. How to simultaneously approximate the true underlying nonlinear system and identify relevant features based on high-dimensional data (typically with the sample size smaller than the dimension, a.k.a. small-n-large-p) is another challenging task. In this dissertation, we have provided satisfactory answers for these two challenges. In addition, we have illustrated some promising applications using modern machine learning methods. In the first chapter, we introduce a novel inferential Wasserstein GAN (iWGAN) model, which is a principled framework to fuse auto-encoders and WGANs. GANs have been impactful on many problems and applications but suffer from unstable training. The Wasserstein GAN (WGAN) leverages the Wasserstein distance to avoid the caveats in the minmax two-player training of GANs but has other defects such as mode collapse and the lack of a metric to detect convergence. The iWGAN model jointly learns an encoder network and a generator network motivated by the iterative primal dual optimization process. The encoder network maps the observed samples to the latent space and the generator network maps the samples from the latent space to the data space. We establish the generalization error bound of iWGANs to theoretically justify the performance of iWGANs. We further provide a rigorous probabilistic interpretation of our model under the framework of maximum likelihood estimation. 
The iWGAN, with a clear stopping criterion, has many advantages over other autoencoder GANs. The empirical experiments show that the iWGAN greatly mitigates the symptom of mode collapse, speeds up the convergence, and is able to provide a measurement of quality check for each individual sample. We illustrate the ability of iWGANs by obtaining performance that is stable and competitive with the state of the art on benchmark datasets. In the second chapter, we present a general framework for high-dimensional nonlinear variable selection using deep neural networks under the framework of supervised learning. The network architecture includes both a selection layer and approximation layers. The problem can be cast as a sparsity-constrained optimization with a sparse parameter in the selection layer and other parameters in the approximation layers. This problem is challenging due to the sparse constraint and the nonconvex optimization. We propose a novel algorithm, called Deep Feature Selection, to estimate both the sparse parameter and the other parameters. Theoretically, we establish the algorithm convergence and the selection consistency when the objective function has a Generalized Stable Restricted Hessian. This result provides theoretical justifications of our method and generalizes known results for high-dimensional linear variable selection. Simulations and real data analysis are conducted to demonstrate the superior performance of our method. In the third chapter, we develop a novel methodology to classify electrocardiograms (ECGs) as normal, atrial fibrillation, or other cardiac dysrhythmias as defined by the Physionet Challenge 2017. More specifically, we use piecewise linear splines for the feature selection and a gradient boosting algorithm for the classifier. In the algorithm, the ECG waveform is fitted by a piecewise linear spline, and morphological features related to the piecewise linear spline coefficients are extracted. 
XGBoost is used to classify the morphological coefficients and heart rate variability features. The performance of the algorithm was evaluated on the PhysioNet Challenge database (3658 ECGs classified by experts). Our algorithm achieves an average F1 score of 81% in 10-fold cross-validation and also achieves an F1 score of 81% on the independent testing set. This score is similar to the 9th-best score (81%) in the official phase of the Physionet Challenge 2017. In the fourth chapter, we introduce a novel region-selection penalty in the framework of image-on-scalar regression to impose sparsity of pixel values and extract active regions simultaneously. This method helps identify regions of interest (ROI) associated with certain diseases, which has a great impact on public health. Our penalty combines the Smoothly Clipped Absolute Deviation (SCAD) regularization, enforcing sparsity, and the SCAD of total variation (TV) regularization, enforcing spatial contiguity, into one group, which segments contiguous spatial regions against a zero-valued background. An efficient algorithm is based on the alternating direction method of multipliers (ADMM), which decomposes the non-convex problem into two iterative optimization problems with explicit solutions. Another virtue of the proposed method is that a divide-and-conquer learning algorithm is developed, thereby allowing scaling to large images. Several examples are presented and the experimental results are compared with other state-of-the-art approaches.

  • Preprint Article
  • 10.52843/cassyni.sh36t3
Machine Learning for Space Weather and the challenge of rare and extreme events
  • Dec 11, 2025
  • Enrico Camporeale

Space weather refers to the conditions in near-Earth space driven by solar activity—solar flares, coronal mass ejections, and geomagnetic storms—that can disrupt satellites, radio communication, navigation systems, and power grids. In recent years, the field has been revolutionized by the rapid improvement in data-driven and machine learning based forecasting. Yet the events that matter most for operational decision-making are precisely those that occur least often. This creates a core scientific challenge: our models must anticipate the rare and the extreme, even though the historical record is dominated by quiet days. Classical machine learning methods struggle in this setting. In regression tasks with strongly imbalanced target distributions, “more data” often means “more of the same,” offering limited benefit. The data that capture extremes are far more informative than abundant but redundant samples. In this talk, I will give an overview on recent advances of machine learning in the broad field of space weather and space physics. Emphasis will be given on three complementary methods that address the imbalanced regression challenge. ACCRUE (Accurate and Reliable Uncertainty Estimate) is a model-agnostic post-hoc technique that converts deterministic models into probabilistic ones with calibrated and trustworthy uncertainties. PARIS (Pruning Algorithm via the Representer theorem for Imbalanced Scenarios) applies influence-aware sample attribution to identify and remove redundant or counterproductive data, improving performance specifically on rare events. ProBoost (Probabilistic Boosting) integrates the two ideas, forming an uncertainty-weighted ensemble in which each model contributes proportionally to its calibrated confidence. Although motivated by space weather prediction, these approaches are broadly applicable to any domain where rare or extreme events—not the majority of the data—drive the real-world risk.
