Wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation
As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are becoming limiting factors in the rapid iteration of designs. To mitigate these emerging constraints, multiple efforts have been undertaken to develop an ML-based surrogate model that estimates resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of over 680 000 fully connected and convolutional neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. The benchmark evaluates the performance of resource and latency predictors against several common ML model architectures, primarily originating from scientific domains, as exemplar models, and the average performance across a subset of the dataset. Additionally, we introduce GNN- and transformer-based surrogate models that predict latency and resources for ML accelerators. We present the architecture and performance of the models and find that the models generally predict latency and resources for the 75th percentile within several percent of the synthesized resources on the synthetic test dataset.
- Conference Article
- 10.2172/2549315
- Apr 3, 2025
As machine learning (ML) increasingly serves as a tool for addressing real-time challenges in scientific applications, the development of advanced tooling has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as model synthesis, are now becoming limiting factors in the rapid iteration of designs. To reduce these emerging constraints, multiple efforts are being launched toward designing an ML-based surrogate model that estimates resource usage of synthesized accelerator architectures. This model would reduce the design iteration time, especially when designing within a set of given hardware constraints. This approach shows considerable potential, but as it stands, the effort is early and would benefit from coordination and standardization to assist future work as it emerges. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of more than 100,000 fully connected neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. In addition to the resource utilization and latency data provided, the dataset includes generated artifacts and log files for many of the synthesized neural networks, in order to support future research in ML-based code generation. The benchmark evaluates the performance of resource and latency predictors against several common ML model architectures, primarily originating from scientific domains, as exemplar models, as well as the average performance across a subset of the dataset. We measure the performance of a given predictor model through multiple metrics, including $R^2$ score and SMAPE on regression tasks, as well as inference time to further characterize the estimator under test. 
Additionally, we introduce the latency/utilization inference graph neural network (lui-gnn), a surrogate model that uses a graph neural network to represent input architectures in the form of a directed graph. This graph representation allows a diverse set of model architectures to be handled effectively by a single surrogate model. We present the architecture and performance of the model, as evaluated by the newly proposed benchmark, including SMAPE, $R^2$ score, and inference times, and find that lui-gnn generally predicts latency and utilization for the 75% quantile within several percent of the synthesized resources on the synthetic test dataset, indicating that estimating resources and latency via surrogate models has promise and warrants further research.
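The two regression metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's own scoring code: the function names `smape` and `r2_score` are hypothetical, SMAPE has several variants in the literature, and the form below (mean of |p - t| / ((|t| + |p|) / 2), reported in percent) is an assumption, as is the choice to count a (0, 0) pair as zero error.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (range 0 to 200).

    Assumed variant: mean of |p - t| / ((|t| + |p|) / 2).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        # Assumed convention: a pair that is exactly (0, 0) contributes no error.
        total += 0.0 if denom == 0 else abs(p - t) / denom
    return 100.0 * total / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up LUT counts for illustration only; not values from the dataset.
    synthesized = [1200.0, 850.0, 4300.0, 610.0]
    predicted = [1150.0, 900.0, 4100.0, 640.0]
    print(f"SMAPE = {smape(synthesized, predicted):.2f}%")
    print(f"R^2   = {r2_score(synthesized, predicted):.4f}")
```

SMAPE is scale-free, which makes it convenient for comparing targets with very different magnitudes, such as latency cycles and LUT counts, while $R^2$ reports how much of the variance in the synthesized values the predictor explains.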
- Research Article
3
- 10.3390/civileng6010002
- Jan 7, 2025
- CivilEng
The concept of digital twins (DTs) enhances traditional structural health monitoring (SHM) by integrating real-time data with digital models for predictive maintenance and decision-making when combined with finite element modelling (FEM). However, the computational demand of FE modelling necessitates surrogate models for real-time performance, alongside the requirement of inverse structural analysis to infer overall behaviour from the measured response of a structure. A FEM-based machine learning (ML) model is an ideal option in this context, as it can be trained to perform those calculations instantly based on FE-based training data. However, the performance of the surrogate model depends on the ML model architecture. In this light, the current study investigates three distinct ML models to surrogate FE modelling for DTs. All models demonstrated strong performance, with the tree-based models outperforming the neural network (NN) model. The highest accuracy was achieved by the random forest (RF) model with an error of 0.000350, whilst the lowest inference time, approximately 1 millisecond, was observed with the trained XGBoost algorithm. By leveraging the capabilities of ML, FEM, and DTs, this study presents an ideal solution for implementing real-time DTs to advance the functionalities of current SHM systems.
- Research Article
1
- 10.1002/eqe.3908
- May 4, 2023
- Earthquake Engineering & Structural Dynamics
EESD special issue: AI and data‐driven methods in earthquake engineering – (Part 1)
- Research Article
5
- 10.1080/09540091.2024.2445249
- Jan 2, 2025
- Connection Science
This study aims to benchmark the performance of machine learning (ML), deep learning (DL), and generative AI (GenAI) models in categorising assessment questions based on Bloom’s Taxonomy. Previous studies have lacked comprehensive investigations into the performance of these approaches. Further, GenAI remains unexplored for this task, offering a promising avenue for investigation. Therefore, we explore the effectiveness of various ML models by incorporating domain-specific term weighting and utilising word embeddings. The study also analyses the performance of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) with and without bidirectional connections, as well as an approach that combines RNNs and CNNs. Furthermore, we evaluate several transformer-based models by fine-tuning them alongside GenAI models text-davinci-003, gpt-3.5-turbo, PaLM2, and Gemini Pro in zero-shot classification settings. The results demonstrate that ML models outperformed DL models, achieving a best accuracy of 0.871 and F1 score of 0.872. Additionally, domain-specific term weighting is found to be superior to word embeddings. Furthermore, most ML and DL models performed better than GenAI models, with GenAI models achieving a best accuracy of 0.618 and a best F1 score of 0.627. Therefore, the outcome suggests considering the ML models with domain-specific term weighting as benchmark models in future research.
- Conference Article
- 10.1145/3706628.3708827
- Feb 27, 2025
As machine learning (ML) increasingly serves as a tool for addressing real-time challenges in scientific applications, the development of advanced tooling has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as model synthesis, are now becoming limiting factors in the rapid iteration of designs. To reduce these emerging constraints, multiple efforts are being launched toward designing an ML-based surrogate model that estimates resource usage of synthesized accelerator architectures. This model would reduce the design iteration time, especially when designing within a set of given hardware constraints. This approach shows considerable potential, but as it stands, the effort is early and would benefit from coordination and standardization to assist future work as it emerges.
- Research Article
29
- 10.1371/journal.pone.0292026
- Jun 17, 2024
- PloS one
Machine learning (ML) and deep learning (DL) models are being increasingly employed for medical imagery analyses, with both approaches used to enhance the accuracy of classification/prediction in the diagnoses of various cancers, tumors and bloodborne diseases. To date however, no review of these techniques and their application(s) within the domain of white blood cell (WBC) classification in blood smear images has been undertaken, representing a notable knowledge gap with respect to model selection and comparison. Accordingly, the current study sought to comprehensively identify, explore and contrast ML and DL methods for classifying WBCs. Following development and implementation of a formalized review protocol, a cohort of 136 primary studies published between January 2006 and May 2023 were identified from the global literature, with the most widely used techniques and best-performing WBC classification methods subsequently ascertained. Studies derived from 26 countries, with highest numbers from high-income countries including the United States (n = 32) and The Netherlands (n = 26). While WBC classification was originally rooted in conventional ML, there has been a notable shift toward the use of DL, and particularly convolutional neural networks (CNN), with 54.4% of identified studies (n = 74) including the use of CNNs, and particularly in concurrence with larger datasets and bespoke features e.g., parallel data pre-processing, feature selection, and extraction. While some conventional ML models achieved up to 99% accuracy, accuracy was shown to decrease in concurrence with decreasing dataset size. Deep learning models exhibited improved performance for more extensive datasets and exhibited higher levels of accuracy in concurrence with increasingly large datasets. Availability of appropriate datasets remains a primary challenge, potentially resolvable using data augmentation techniques. 
Moreover, medical training of computer science researchers is recommended to improve current understanding of leucocyte structure and subsequent selection of appropriate classification models. Likewise, it is critical that future health professionals be made aware of the power, efficacy, precision and applicability of computer science, soft computing and artificial intelligence contributions to medicine, and particularly in areas like medical imaging.
- Research Article
147
- 10.1016/j.matt.2020.02.012
- Mar 10, 2020
- Matter
Machine-Learning-Accelerated Perovskite Crystallization
- Research Article
13
- 10.1016/j.resuscitation.2023.110049
- Nov 14, 2023
- Resuscitation
Electroencephalogram-based machine learning models to predict neurologic outcome after cardiac arrest: A systematic review
- Supplementary Content
- 10.1016/j.bpsgos.2025.100654
- Nov 17, 2025
- Biological Psychiatry Global Open Science
Electroencephalography-Based Machine and Deep Learning Approaches for the Diagnosis of Dissociative Disorders: A Comprehensive Review
- Research Article
20
- 10.1145/3575798
- Apr 20, 2023
- ACM Transactions on Embedded Computing Systems
Recently, automated co-design of machine learning (ML) models and accelerator architectures has attracted significant attention from both the industry and academia. However, most co-design frameworks either explore a limited search space or employ suboptimal exploration techniques for simultaneous design decision investigations of the ML model and the accelerator. Furthermore, training the ML model and simulating the accelerator performance is computationally expensive. To address these limitations, this work proposes a novel neural architecture and hardware accelerator co-design framework, called CODEBench. It comprises two new benchmarking sub-frameworks, CNNBench and AccelBench, which explore expanded design spaces of convolutional neural networks (CNNs) and CNN accelerators. CNNBench leverages an advanced search technique, Bayesian Optimization using Second-order Gradients and Heteroscedastic Surrogate Model for Neural Architecture Search, to efficiently train a neural heteroscedastic surrogate model to converge to an optimal CNN architecture by employing second-order gradients. AccelBench performs cycle-accurate simulations for diverse accelerator architectures in a vast design space. With the proposed co-design method, called Bayesian Optimization using Second-order Gradients and Heteroscedastic Surrogate Model for Co-Design of CNNs and Accelerators, our best CNN–accelerator pair achieves 1.4% higher accuracy on the CIFAR-10 dataset compared to the state-of-the-art pair while enabling 59.1% lower latency and 60.8% lower energy consumption. On the ImageNet dataset, it achieves 3.7% higher Top1 accuracy at 43.8% lower latency and 11.2% lower energy consumption. CODEBench outperforms the state-of-the-art framework, i.e., Auto-NBA, by achieving 1.5% higher accuracy and 34.7× higher throughput while enabling 11.0× lower energy-delay product and 4.0× lower chip area on CIFAR-10.
- Research Article
1
- 10.1002/tal.70104
- Dec 2, 2025
- The Structural Design of Tall and Special Buildings
Generalization of machine learning (ML) surrogate models across distinct databases is underexplored, despite being crucial as retraining the entire model every time new data become available is inefficient. This study proposes an incremental learning methodology to improve ML models' prediction of seismic collapse of steel moment‐resisting frames (SMRFs) across distinct datasets. Three boosting algorithms, XGBoost, LightGBM, and CatBoost, were trained on a source dataset to generate surrogate ML models that can predict the SMRF's seismic response. Thereafter, the ML models were used to predict the response on a new (target) dataset of SMRFs that differ in geometric dimensions and design approaches. Initially, boosting models trained on one dataset performed poorly on another dataset, even if the datasets displayed similar characteristics and consistent feature importance rankings. Incorporation of incremental learning improved the prediction on the target dataset, but introduced catastrophic forgetting that reduced the effectiveness of the ML model on the source dataset, a problem mitigated with a rehearsal strategy. Incremental learning with rehearsal yields results comparable to those obtained by fully retraining with both source and target datasets, resulting in an effective method for ML transferability, without having to retrain entire databases and without reducing the effectiveness of ML models on the source database.
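The rehearsal idea mentioned in this abstract can be illustrated with a small Python sketch. The abstract does not describe how the rehearsal set is chosen, so everything here is an assumption: the helper name `rehearsal_batch`, the uniform random replay of source examples, and the `rehearsal_fraction` parameter are hypothetical and are not the authors' procedure. The sketch only shows the general pattern of mixing a replay sample of old (source) data into the new (target) data before an incremental update, which is what counteracts catastrophic forgetting.

```python
import random


def rehearsal_batch(source, target, rehearsal_fraction=0.3, seed=0):
    """Return all target examples plus a random replay sample of source examples.

    rehearsal_fraction is the share of the source set that is replayed
    (an assumed, uniform-sampling strategy).
    """
    if not 0.0 <= rehearsal_fraction <= 1.0:
        raise ValueError("rehearsal_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_replay = min(len(source), round(rehearsal_fraction * len(source)))
    replay = rng.sample(list(source), n_replay)
    batch = list(target) + replay
    rng.shuffle(batch)  # avoid feeding the learner all new data first
    return batch


if __name__ == "__main__":
    # Toy (features, label) pairs standing in for frame descriptors and responses.
    source = [((i, i + 1), 0) for i in range(10)]
    target = [((100 + i, i), 1) for i in range(4)]
    mixed = rehearsal_batch(source, target, rehearsal_fraction=0.3)
    print(len(mixed), "examples in the incremental update")
```

The mixed batch would then be passed to whatever incremental update the chosen boosting library supports. A larger replay fraction protects source-set accuracy at the cost of a slower update.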
- Research Article
- 10.1182/blood-2024-211964
- Nov 5, 2024
- Blood
Systematic Review of Machine Learning Models for Myelodysplastic Syndrome Diagnosis
- Research Article
- 10.1109/tns.2025.3637553
- Jan 1, 2025
- IEEE Transactions on Nuclear Science
Machine learning (ML) models are able to process complex images, providing state-of-the-art performance in tasks such as image classification and semantic segmentation. These models can be mapped to highly efficient commercial-off-the-shelf (COTS) specialized hardware accelerators, whose reliability should be carefully evaluated before deployment. Unfortunately, given the large number of ML model architectures, possible configurations or input selections, and the numerous COTS accelerator architectures available, exhaustively testing every model-accelerator combination with beam experiments is unfeasible. Additionally, the radiation data obtained with a specific combination can hardly be extended to different configurations. In this paper, we test vision transformer (ViT) and segmentation convolutional neural network (CNN) models, in addition to several ML micro-benchmarks, on the Google Coral Edge TPU at 6 different radiation facilities, investigating particle-, software-, and hardware-dependent reliability behaviors. Our experimental results show that, while the cross section for radiation-induced silent data corruption (SDCs) can be up to 8 orders-of-magnitude higher when testing with high-LET heavy ions compared to atmospheric neutrons, the characteristics of the SDCs are similar across all types of radiation tested. Instead, the most impactful factors that lead to misclassifications in beam tests are actually the model complexity and the input selection.
These results can be leveraged to more efficiently plan and utilize the beam time available in radiation experiments, thus improving the understanding of the fault models affecting the software, while also characterizing the reliability of the underlying hardware accelerator.
- Research Article
19
- 10.1371/journal.pone.0282608
- Mar 9, 2023
- PLOS ONE
COVID-19 is highly infectious and causes acute respiratory disease. Machine learning (ML) and deep learning (DL) models are vital in detecting disease from computerized chest tomography (CT) scans. The DL models outperformed the ML models. For COVID-19 detection from CT scan images, DL models are used as end-to-end models. Thus, the performance of the model is evaluated for the quality of the extracted features and classification accuracy. There are four contributions included in this work. First, this research is motivated by studying the quality of the features extracted by the DL model by feeding these extracted features to an ML model. In other words, we proposed comparing the end-to-end DL model performance against the approach of using DL for feature extraction and ML for the classification of COVID-19 CT scan images. Second, we proposed studying the effect of fusing extracted features from image descriptors, e.g., Scale-Invariant Feature Transform (SIFT), with extracted features from DL models. Third, we proposed a new Convolutional Neural Network (CNN) to be trained from scratch and then compared to deep transfer learning on the same classification problem. Finally, we studied the performance gap between classic ML models and ensemble learning models. The proposed framework is evaluated using a CT dataset, where the obtained results are evaluated using five different metrics. The obtained results revealed that using the proposed CNN model is better than using the well-known DL model for the purpose of feature extraction. Moreover, using a DL model for feature extraction and an ML model for the classification task achieved better results in comparison to using an end-to-end DL model for detecting COVID-19 CT scan images. Of note, the accuracy rate of the former method improved by using ensemble learning models instead of the classic ML models. The proposed method achieved the best accuracy rate of 99.39%.
- Research Article
- 10.1093/clinchem/hvad097.479
- Sep 27, 2023
- Clinical Chemistry
Background: In clinical scenarios, incorrect predictions by machine learning (ML) models are inevitable. One way to reduce misleading predictions is to avoid reporting predictions that fall within a predefined “gray zone”. This method improves predictive performance by reporting only the less uncertain cases. However, the cost-effectiveness of applying the “gray zone” rule in an ML model is unclear without massive computation. Thus, this study aims to propose a novel metric to evaluate the effectiveness of using gray zones and validate the metric in real-world ML models. Methods: This study defined a statistical metric called the “discriminative index” (D-index) for evaluating the effectiveness of gray zones. To calculate the D-index, the predictive outcomes of the ML model are first transformed into two probability distributions based on the truth labels (e.g., positive or negative). The D-index is then derived from the kurtosis of these two distributions. To validate the metric, we applied the D-index to three different antibiotic susceptibility-predicting ML models (namely, convolutional neural network (CNN), random forest (RF), and XGBoost (XGB)) based on mass spectrometry data. We assessed the performance and unpredicted case numbers of each model with different gray zones and correlated the results with the proposed D-index. Results: The D-index values for the CNN, XGB, and RF models were 4.36, 0.38, and −1.66, respectively. When applying the “gray zone” rule to achieve 90% area under the receiver operating characteristic curve, the CNN, XGB and RF models retained up to 90%, 68%, and 62% of total cases, respectively. A higher D-index value indicates a more effective application of the gray zone rule. Conclusion: The D-index is a simple and statistically insightful metric for evaluating the cost-effectiveness of applying the gray zone rule in an ML model. This metric has been validated in three mass spectrometry-based predictive models and has shown promising results.
The D-index can be a useful tool for comparing and applying different ML algorithms.
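The gray zone rule described in this abstract can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the study's code: the function name `apply_gray_zone`, the zone bounds, and the 0.5 decision threshold are hypothetical, and plain accuracy is used as the performance measure for brevity where the abstract uses the area under the receiver operating characteristic curve. The D-index itself is not reproduced, because the abstract says only that it is derived from the kurtosis of the two score distributions and gives no formula.

```python
def apply_gray_zone(probs, labels, low=0.4, high=0.6, threshold=0.5):
    """Withhold predictions whose score falls inside [low, high].

    Returns (retained_fraction, accuracy_on_retained); accuracy is None
    when every case falls inside the gray zone.
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if low > high:
        raise ValueError("low must not exceed high")
    # Keep only the cases the model is comparatively certain about.
    kept = [(p, y) for p, y in zip(probs, labels) if not (low <= p <= high)]
    retained = len(kept) / len(probs)
    if not kept:
        return retained, None
    correct = sum(1 for p, y in kept if (p >= threshold) == bool(y))
    return retained, correct / len(kept)


if __name__ == "__main__":
    # Made-up scores and truth labels for illustration only.
    scores = [0.05, 0.45, 0.55, 0.92, 0.70, 0.15, 0.58, 0.88]
    truth = [0, 1, 0, 1, 0, 0, 1, 1]
    for lo, hi in [(0.5, 0.5), (0.4, 0.6), (0.2, 0.8)]:
        kept, acc = apply_gray_zone(scores, truth, low=lo, high=hi)
        print(f"zone [{lo}, {hi}]: retained {kept:.0%}, accuracy {acc}")
```

Widening the zone trades coverage for reliability: fewer cases are reported, but those that remain tend to be predicted more accurately. That trade-off is what the abstract's metric is meant to summarize without sweeping every candidate zone.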