EMamba: Efficient Acceleration Framework for Mamba Models in Edge Computing
State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data. Mamba, a recent sequence-to-sequence SSM, offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration frameworks are currently optimized for deploying it in such environments. This article presents eMamba, a comprehensive end-to-end hardware acceleration framework explicitly designed for deploying Mamba models on edge platforms. eMamba maximizes computational efficiency by replacing complex normalization layers with lightweight hardware-aware alternatives and approximating expensive operations, such as SiLU activation and exponentiation, considering the target applications. Then, it performs an approximation-aware neural architecture search (NAS) to tune the learnable parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10, and MARS, an open-source human pose estimation dataset, show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63–19.9× fewer parameters. In addition, it generalizes well to large-scale natural language tasks, demonstrating stable perplexity across varying sequence lengths on the WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology. Experimental results show 4.95–5.62× lower latency and 2.22–9.95× higher throughput, with 4.77× smaller area, 9.84× lower power, and 48.6× lower energy consumption than baseline solutions while maintaining competitive accuracy.
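The abstract notes that eMamba replaces the SiLU activation with a cheap approximation to avoid exponentiation in hardware. As a rough illustration only (not eMamba's actual scheme or its learned parameters), a piecewise-linear "hard" SiLU in the hard-swish style shows how the exponential can be traded for adds, multiplies, and clamps:

```python
import math

def silu(x):
    # Exact SiLU: x * sigmoid(x); the exponential is what makes it
    # expensive to implement directly in FPGA/ASIC logic.
    return x / (1.0 + math.exp(-x))

def hard_silu(x):
    # Piecewise-linear "hard" variant (hard-swish style): only an add,
    # a clamp, a multiply, and a constant divide.
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0

# The two stay close over a typical activation range [-6, 6]
max_err = max(abs(silu(t / 10.0) - hard_silu(t / 10.0))
              for t in range(-60, 61))
```

The residual gap (largest near |x| = 3) is what an approximation-aware NAS, as described above, could absorb by retuning learnable parameters around the substituted operator.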
- Conference Article
4
- 10.1109/wacv56688.2023.00145
- Jan 1, 2023
In this work, we propose a novel and scalable solution to address the challenges of developing efficient dense predictions on edge platforms. Our first key insight is that Multi-Task Learning (MTL) and hardware-aware Neural Architecture Search (NAS) can work in synergy to greatly benefit on-device Dense Predictions (DP). Empirical results reveal that the joint learning of the two paradigms is surprisingly effective at improving DP accuracy, achieving superior performance over both the transfer learning of single-task NAS and prior state-of-the-art approaches in MTL, all with just 1/10th of the computation. To the best of our knowledge, our framework, named EDNAS, is the first to successfully leverage the synergistic relationship of NAS and MTL for DP. Our second key insight is that the standard depth training for multi-task DP can cause significant instability and noise to MTL evaluation. Instead, we propose JAReD, an improved, easy-to-adopt Joint Absolute-Relative Depth loss, that reduces up to 88% of the undesired noise while simultaneously boosting accuracy. We conduct extensive evaluations on standard datasets, benchmark against strong baselines and state-of-the-art approaches, as well as provide an analysis of the discovered optimal architectures.
- Research Article
38
- 10.1016/j.jai.2022.100002
- Dec 1, 2022
- Journal of Automation and Intelligence
A survey on computationally efficient neural architecture search
- Research Article
- 10.3390/s25185821
- Sep 18, 2025
- Sensors (Basel, Switzerland)
Modeling mobile robots is crucial to odometry estimation, control design, and navigation. Classical state-space models (SSMs) have traditionally been used for system identification, while recent advances in deep learning, such as Long Short-Term Memory (LSTM) networks, capture complex nonlinear dependencies. However, few direct comparisons exist between these paradigms. This paper compares two multivariate modeling approaches for a differential drive robot: a classical SSM and an LSTM-based recurrent neural network. Both models predict the robot’s linear (v) and angular (ω) velocities using experimental data from a five-minute navigation sequence. Performance is evaluated in terms of prediction accuracy, odometry estimation, and computational efficiency, with ground-truth odometry obtained via a SLAM-based method in ROS2. Each model was tuned for fair comparison: order selection for the SSM and hyperparameter search for the LSTM. Results show that the best SSM is a second-order model, while the LSTM used seven layers, 30 neurons, and 20-sample sliding windows. The LSTM achieved a FIT of 93.10% for v and 90.95% for ω, with an odometry RMSE of 1.09 m and 0.23 rad, whereas the SSM outperformed it with FIT values of 94.70% and 91.71% and lower RMSE (0.85 m, 0.17 rad). The SSM was also more resource-efficient (0.00257 ms and 1.03 bytes per step) compared to the LSTM (0.0342 ms and 20.49 bytes). The results suggest that SSMs remain a strong option for accurate odometry with low computational demand while encouraging the exploration of hybrid models to improve robustness in complex environments. At the same time, LSTM models demonstrated flexibility through hyperparameter tuning, highlighting their potential for further accuracy improvements with refined configurations.
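The classical SSM the paper selects is second-order. A minimal sketch of such a discrete-time state-space predictor (the matrices below are illustrative placeholders, not the identified robot model) looks like:

```python
# Minimal discrete-time state-space predictor:
#   x[k+1] = A x[k] + B u[k],   y[k] = C x[k+1]
# Second order (two states), matching the order selected in the paper.
A = [[0.9, 0.1],
     [0.0, 0.8]]
B = [0.1, 0.2]
C = [1.0, 0.0]

def step(x, u):
    """One prediction step: returns (next_state, output)."""
    x_next = [sum(A[i][j] * x[j] for j in range(2)) + B[i] * u
              for i in range(2)]
    y = sum(C[j] * x_next[j] for j in range(2))
    return x_next, y

x = [0.0, 0.0]
outputs = []
for _ in range(50):            # constant unit command input
    x, y = step(x, 1.0)
    outputs.append(y)
# outputs settle toward the model's steady-state gain (2.0 here)
```

The per-step cost is a handful of multiply-adds on a tiny fixed state, which is exactly why the paper measures the SSM at microsecond-scale latency and a few bytes of memory per step.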
- Research Article
30
- 10.1145/3476995
- Sep 17, 2021
- ACM Transactions on Embedded Computing Systems
The increasing paradigm shift towards intermittent computing has made it possible to intermittently execute deep neural network (DNN) inference on edge devices powered by ambient energy. Recently, neural architecture search (NAS) techniques have achieved great success in automatically finding DNNs with high accuracy and low inference latency on the deployed hardware. We make a key observation, where NAS attempts to improve inference latency by primarily maximizing data reuse, but the derived solutions when deployed on intermittently-powered systems may be inefficient, such that the inference may not satisfy an end-to-end latency requirement and, more seriously, they may be unsafe given an insufficient energy budget. This work proposes iNAS, which introduces intermittent execution behavior into NAS to find accurate network architectures with corresponding execution designs, which can safely and efficiently execute under intermittent power. An intermittent-aware execution design explorer is presented, which finds the right balance between data reuse and the costs related to intermittent inference, and incorporates a preservation design search space into NAS, while ensuring the power-cycle energy budget is not exceeded. To assess an intermittent execution design, an intermittent-aware abstract performance model is presented, which formulates the key costs related to progress preservation and recovery during intermittent inference. We implement iNAS on top of an existing NAS framework and evaluate their respective solutions found for various datasets, energy budgets and latency requirements, on a Texas Instruments device. Compared to those NAS solutions that can safely complete the inference, the iNAS solutions reduce the intermittent inference latency by 60% on average while achieving comparable accuracy, with an average 7% increase in search overhead.
- Research Article
13
- 10.1145/3575798
- Apr 20, 2023
- ACM Transactions on Embedded Computing Systems
Recently, automated co-design of machine learning (ML) models and accelerator architectures has attracted significant attention from both the industry and academia. However, most co-design frameworks either explore a limited search space or employ suboptimal exploration techniques for simultaneous design decision investigations of the ML model and the accelerator. Furthermore, training the ML model and simulating the accelerator performance is computationally expensive. To address these limitations, this work proposes a novel neural architecture and hardware accelerator co-design framework, called CODEBench. It comprises two new benchmarking sub-frameworks, CNNBench and AccelBench, which explore expanded design spaces of convolutional neural networks (CNNs) and CNN accelerators. CNNBench leverages an advanced search technique, Bayesian Optimization using Second-order Gradients and Heteroscedastic Surrogate Model for Neural Architecture Search, to efficiently train a neural heteroscedastic surrogate model to converge to an optimal CNN architecture by employing second-order gradients. AccelBench performs cycle-accurate simulations for diverse accelerator architectures in a vast design space. With the proposed co-design method, called Bayesian Optimization using Second-order Gradients and Heteroscedastic Surrogate Model for Co-Design of CNNs and Accelerators, our best CNN–accelerator pair achieves 1.4% higher accuracy on the CIFAR-10 dataset compared to the state-of-the-art pair while enabling 59.1% lower latency and 60.8% lower energy consumption. On the ImageNet dataset, it achieves 3.7% higher Top1 accuracy at 43.8% lower latency and 11.2% lower energy consumption. CODEBench outperforms the state-of-the-art framework, i.e., Auto-NBA, by achieving 1.5% higher accuracy and 34.7× higher throughput while enabling 11.0× lower energy-delay product and 4.0× lower chip area on CIFAR-10.
- Research Article
2
- 10.1007/s40747-025-02019-z
- Aug 1, 2025
- Complex & Intelligent Systems
The rapid advancement of large language models (LLMs) has driven significant progress in natural language processing (NLP) and related domains. However, their deployment remains constrained by challenges related to computation, memory, and energy efficiency—particularly in real-world applications. This work presents a comprehensive review of state-of-the-art compression techniques, including pruning, quantization, knowledge distillation, and neural architecture search (NAS), which collectively aim to reduce model size, enhance inference speed, and lower energy consumption while maintaining performance. A robust evaluation framework is introduced, incorporating traditional metrics, such as accuracy and perplexity (PPL), alongside advanced criteria including latency-accuracy trade-offs, parameter efficiency, multi-objective Pareto optimization, and fairness considerations. This study further highlights trends and challenges, such as fairness-aware compression, robustness against adversarial attacks, and hardware-specific optimizations. Additionally, NAS-driven strategies are explored as a means to design task-aware, hardware-adaptive architectures that enhance LLM compression efficiency. Hybrid and adaptive methods are also examined to dynamically optimize computational efficiency across diverse deployment scenarios. This work not only synthesizes recent advancements and identifies open problems but also proposes a structured research roadmap to guide the development of efficient, scalable, and equitable LLMs. By bridging the gap between compression research and real-world deployment, this study offers actionable insights for optimizing LLMs across a range of environments, including mobile devices and large-scale cloud infrastructures.
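Among the compression techniques the survey reviews, post-training quantization is the most mechanical. A toy symmetric int8 sketch (real frameworks add per-channel scales and calibration data; this is only the core arithmetic) is:

```python
def quantize(weights, bits=8):
    # Symmetric scheme: map [-max|w|, +max|w|] onto signed integers.
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -1.27, 0.5, 0.9994, -0.31]
q, s = quantize(w)
max_abs_err = max(abs(a - b) for a, b in zip(w, dequantize(q, s)))
# Round-trip error is bounded by half the quantization step (scale / 2)
```

Metrics like the PPL and latency-accuracy trade-offs discussed above are then measured on the dequantized (or integer-kernel) model against the full-precision baseline.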
- Research Article
132
- 10.1109/tcad.2020.2986127
- Jul 12, 2019
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
We propose a novel hardware and software co-exploration framework for efficient neural architecture search (NAS). Different from existing hardware-aware NAS which assumes a fixed hardware design and explores the NAS space only, our framework simultaneously explores both the architecture search space and the hardware design space to identify the best neural architecture and hardware pairs that maximize both test accuracy and hardware efficiency. Such a practice greatly opens up the design freedom and pushes forward the Pareto frontier between hardware efficiency and test accuracy for better design tradeoffs. The framework iteratively performs a two-level (fast and slow) exploration. Without lengthy training, the fast exploration can effectively fine-tune hyperparameters and prune inferior architectures in terms of hardware specifications, which significantly accelerates the NAS process. Then, the slow exploration trains candidates on a validation set and updates a controller using reinforcement learning to maximize the expected accuracy together with the hardware efficiency. In this article, we demonstrate that the co-exploration framework can effectively expand the search space to incorporate models with high accuracy, and we theoretically show that the proposed two-level optimization can efficiently prune inferior solutions to better explore the search space. The experimental results on ImageNet show that the co-exploration NAS can find solutions with the same accuracy but 35.24% higher throughput and 54.05% higher energy efficiency compared with the hardware-aware NAS.
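The controller update described above can be caricatured as maximizing a scalarized reward over accuracy and a normalized hardware cost. The weighting below is purely illustrative, not the paper's actual reward formulation:

```python
def reward(accuracy, hw_cost, lambda_hw=0.5):
    # Higher accuracy is rewarded; normalized hardware cost is penalized.
    return accuracy - lambda_hw * hw_cost

candidates = [
    {"name": "net_a", "accuracy": 0.76, "hw_cost": 0.9},
    {"name": "net_b", "accuracy": 0.74, "hw_cost": 0.3},
]
best = max(candidates, key=lambda c: reward(c["accuracy"], c["hw_cost"]))
# The slightly less accurate but far cheaper candidate wins here
```

The fast/slow split in the paper amounts to evaluating `hw_cost` cheaply for every candidate while reserving the expensive `accuracy` estimate (training on a validation set) for survivors.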
- Research Article
2
- 10.1007/s44443-025-00060-z
- Jun 1, 2025
- Journal of King Saud University Computer and Information Sciences
HLSK-CASMamba: hybrid large selective kernel and convolutional additive self-attention mamba for hyperspectral image classification
- Conference Article
30
- 10.1109/cvpr42600.2020.01128
- Jun 1, 2020
Neural architecture search (NAS) aims to discover network architectures with desired properties such as high accuracy or low latency. Recently, differentiable NAS (DNAS) has demonstrated promising results while maintaining a search cost orders of magnitude lower than reinforcement learning (RL) based NAS. However, DNAS models can only optimize differentiable loss functions in search, and they require an accurate differentiable approximation of non-differentiable criteria. In this work, we present UNAS, a unified framework for NAS, that encapsulates recent DNAS and RL-based approaches under one framework. Our framework brings the best of both worlds, and it enables us to search for architectures with both differentiable and non-differentiable criteria in one unified framework while maintaining a low search cost. Further, we introduce a new objective function for search based on the generalization gap that prevents the selection of architectures prone to overfitting. We present extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets, and we perform search in two fundamentally different search spaces. We show that UNAS obtains the state-of-the-art average accuracy on all three datasets when compared to the architectures searched in the DARTS space. Moreover, we show that UNAS can find an efficient and accurate architecture in the ProxylessNAS search space, that outperforms existing MobileNetV2 based architectures. The source code is available at https://github.com/NVlabs/unas.
- Research Article
3
- 10.53297/18293336-2023.2-30
- Jan 1, 2023
- INFORMATION TECHNOLOGIES, ELECTRONICS, RADIO ENGINEERING
Transformer models have become a key component in many natural language processing and computer vision tasks. However, these models are often computationally intensive and require substantial resources to run efficiently. To address this challenge, this study investigates the use of TensorRT, an optimization library provided by NVIDIA, to accelerate the inference speed of transformer models on Jetson Xavier NX, a low-power and high-performance embedded platform. This research demonstrates the significant impact of TensorRT optimization on transformer models. Specifically, we present two case studies: one involving a Transformer model for text-to-speech synthesis and another featuring a Vision Transformer model for image classification. In both cases, TensorRT optimization leads to substantial improvements in inference speed, making these models highly efficient for edge device deployment. For the text-to-speech task, TensorRT optimization results in a remarkable 60% reduction in inference time while decreasing memory usage by 17%. Similarly, for image classification, the Vision Transformer model experiences over a 60% increase in inference speed with a negligible 0.1% decrease in accuracy. This study not only showcases the practical benefits of TensorRT but also highlights the potential for further optimization and deployment of transformer models on edge platforms, both in terms of performance and memory usage. This could have far-reaching implications for edge computing, allowing more applications to be deployed on low-power devices.
- Research Article
43
- 10.1007/s10922-023-09767-8
- Sep 4, 2023
- Journal of Network and Systems Management
Although satellite-terrestrial systems have advantages such as high throughput, low latency, and low energy consumption, as well as low exposure to physical threats and natural disasters and cost-effective global coverage, their integration exposes both of them to particular security challenges that can arise due to the migration of security challenges from one to another. Intrusion Detection Systems (IDS) can also be used to provide a high level of protection for modern network environments such as satellite-terrestrial integrated networks (STINs). To optimize the detection performance of malicious activities in network traffic, four hybrid intrusion detection systems for satellite-terrestrial communication systems (SAT-IDSs) are proposed in this paper. All the proposed systems exploit the sequential forward feature selection (SFS) method based on random forest (RF) to select important features from the dataset that increase relevance and reduce complexity and then combine them with a machine learning (ML) or deep learning (DL) model: Random Forest (RF), Long Short-Term Memory (LSTM), Artificial Neural Networks (ANN), and Gated Recurrent Unit (GRU). Two datasets—STIN, which simulates satellite networks, and UNSW-NB15, which simulates terrestrial networks—were used to evaluate the performance of the proposed SAT-IDSs. The experimental results indicate that selecting significant and crucial features produced by RF-SFS vastly improves detection accuracy and computational efficiency. In the first dataset (STIN), the proposed hybrid ML system SFS-RF achieved an accuracy of 90.5% after using 10 selected features, compared to 85.41% when using the whole dataset. Furthermore, the RF-SFS-GRU model achieved the highest performance of the three proposed hybrid DL-based SAT-IDS with an accuracy of 87% after using 10 selected features, compared to 79% when using the entire dataset.
In the second dataset (UNSW-NB15), the proposed hybrid ML system SFS-RF achieved an accuracy of 78.52% after using 10 selected features, compared to 75.4% when using the whole dataset. The model with the highest accuracy of the three proposed hybrid DL-based SAT-IDS was the RF-SFS-GRU model. It achieved an accuracy of 79% after using 10 selected features, compared to 74% when using the whole dataset.
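The RF-SFS pipeline above rests on sequential forward selection: greedily add the feature whose inclusion most improves a scoring function. A minimal greedy sketch (with a toy additive score standing in for the random forest's detection accuracy) is:

```python
def sfs(features, score, k):
    # Greedily grow the subset by the single feature that most improves
    # the score of the subset it joins.
    selected, remaining = [], list(features)
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy stand-in score: subset value is the sum of per-feature utilities
utility = {"dur": 3.0, "bytes": 2.0, "rate": 1.5, "ttl": 0.2}
top2 = sfs(utility, lambda subset: sum(utility[f] for f in subset), 2)
```

With a real model in the scoring function, each candidate subset triggers a cross-validated fit, which is why the paper stops at 10 selected features rather than searching all subsets.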
- Conference Article
12
- 10.1109/cvprw56347.2022.00310
- Jun 1, 2022
Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption. As such, neural architecture search (NAS) algorithms take these two constraints into account when generating a new architecture. However, efficiency metrics such as latency are typically hardware dependent, requiring the NAS algorithm to either measure or predict the architecture latency. Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process. Here we propose Microprocessor A Priori for Latency Estimation (MAPLE) that leverages hardware characteristics to predict deep neural network latency on previously unseen hardware devices. MAPLE takes advantage of a novel quantitative strategy to characterize the underlying microprocessor by measuring relevant hardware performance metrics, yielding a fine-grained and expressive hardware descriptor. The CPU-specific performance metrics are also able to characterize GPUs, resulting in a versatile descriptor that does not rely on the availability of hardware counters on GPUs or other deep learning accelerators. We provide experimental insight into this novel strategy. Through this hardware descriptor, MAPLE can generalize to new hardware via a few-shot adaptation strategy, requiring as few as 3 samples from the target hardware to yield a 6% improvement over state-of-the-art methods requiring as many as 10 samples. Experimental results showed that increasing the few-shot adaptation samples to 10 improves accuracy over the state-of-the-art methods by 12%. We also demonstrate that MAPLE identifies Pareto-optimal DNN architectures exhibiting superlative accuracy and efficiency. The proposed technique provides a versatile and practical latency prediction methodology for DNN run-time inference on multiple hardware devices while not imposing any significant overhead for sample collection.
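MAPLE's few-shot adaptation can be loosely pictured as recalibrating a source-hardware latency predictor with a handful of target-device measurements. The linear rescaling below is only a sketch under that simplifying assumption, not MAPLE's learned hardware-descriptor model:

```python
def adapt(pairs):
    # Least-squares scale k minimizing sum((actual - k * predicted)^2)
    num = sum(p * a for p, a in pairs)
    den = sum(p * p for p, _ in pairs)
    return num / den

# Three few-shot (source-prediction, target-measurement) pairs, in ms
few_shot = [(10.0, 21.0), (25.0, 49.5), (40.0, 81.0)]
k = adapt(few_shot)
estimate = k * 30.0    # recalibrated estimate for an unseen architecture
```

The point the abstract makes is that a rich hardware descriptor lets even this small a sample budget (3 measurements) transfer the predictor to a new device.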
- Book Chapter
- 10.71443/9788197282164-06
- Jun 22, 2024
Neural Architecture Search (NAS) has revolutionized the design of deep learning models by automating the exploration of neural network architectures, thereby enhancing performance across various domains. This chapter delves into the latest advancements in NAS, focusing on its application in image classification, natural language processing, autonomous systems, and hardware optimization. Key methodologies, including reinforcement learning-based and efficient NAS approaches, are explored in depth to illustrate their impact on model accuracy and computational efficiency. Through comprehensive case studies, the chapter highlights the transformative potential of NAS in generating state-of-the-art architectures, optimizing resource utilization, and addressing complex tasks with unprecedented precision. The discussion emphasizes the balance between search efficiency and model performance, providing insights into the future trajectory of NAS research. This chapter is essential for understanding the cutting-edge techniques and practical applications of NAS, offering valuable knowledge for researchers and practitioners in the field of machine learning and artificial intelligence.
- Research Article
28
- 10.1016/j.neucom.2021.12.002
- Jan 3, 2022
- Neurocomputing
NAP: Neural architecture search with pruning
- Research Article
- 10.1007/s11119-026-10323-y
- Jan 31, 2026
- Precision Agriculture
Introduction Soil nutrient management is essential for sustainable agriculture, directly affecting crop productivity and food security. Conventional laboratory-based methods for estimating soil nitrogen (N) and phosphorus (P), although accurate, are time-consuming, labor-intensive, and unsuitable for rapid or large-scale monitoring. Objectives This study aimed to develop an efficient, accurate, and scalable framework for soil nitrogen and phosphorus estimation using hyperspectral imaging integrated with deep learning techniques. Methods A total of 286 soil samples were collected from two agricultural locations in North Dakota during pre-sowing and post-harvest periods, capturing spatio-temporal variability. Laboratory chemical analyses were conducted to quantify soil N and P, and corresponding hyperspectral data were acquired in the visible and near-infrared (VNIR) and short-wave infrared (SWIR) regions. Spectral data were processed and categorized based on laboratory reference values. A convolutional neural network (CNN) model was developed for nutrient prediction, incorporating neural architecture search (NAS) and hyperparameter tuning for model optimization. The framework was evaluated using single-sensor and fused multi-sensor datasets, with spectral augmentation techniques applied to improve model robustness. Results Baseline CNN models achieved prediction accuracies of approximately 0.44, which improved to 0.68 with multi-sensor data fusion and spectral augmentation. Integration of NAS and hyperparameter tuning resulted in an additional 10–15% performance gain, achieving a final prediction accuracy of approximately 0.83 for combined nitrogen and phosphorus classification. NAS-based models showed minimal performance differences between raw and augmented datasets, while computational training time nearly doubled due to increased model search complexity. 
Applying NAS on raw hyperspectral data provided the most balanced trade-off between computational efficiency and predictive performance. Conclusions The integration of hyperspectral imaging with optimized CNN architectures and NAS enables accurate, scalable, and efficient soil nutrient prediction. This framework addresses spectral variability and environmental noise, offering a robust pathway for real-time soil nutrient monitoring and advancing data-driven precision agriculture.