HDRLPIM: A Simulator for Hyper-Dimensional Reinforcement Learning Based on Processing In-Memory
Processing In-Memory (PIM) is a data-centric computation paradigm that performs computations inside the memory, thereby mitigating the memory wall problem of traditional Von Neumann architectures. The associative processor (AP), a type of PIM architecture, allows parallel and energy-efficient operations on vectors. This architecture has proven useful in vector-based applications such as Hyper-Dimensional Computing (HDC) for Reinforcement Learning (RL). HDC is emerging as a powerful and lightweight alternative to costly traditional RL models such as Deep Q-Learning. The HDC implementation of Q-Learning encodes states in a high-dimensional representation in which calculating Q-values and finding the maximum one can be done entirely in parallel. In this article, we propose to implement the main operations of an HDC RL framework on the associative processor. This acceleration achieves up to \(152.3\times\) energy and \(6.4\times\) time savings compared to an FPGA implementation. Moreover, HDRLPIM shows that an SRAM-based AP implementation promises up to \(968.2\times\) energy-delay product gains over the FPGA implementation.
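The HDC Q-value pipeline the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the dimensionality, the random bipolar encoding, and the dot-product Q-value model are assumptions standing in for whatever encoding HDRLPIM actually uses.

```python
import numpy as np

D = 1024           # hypervector dimensionality (assumed; real systems often use ~10k)
N_ACTIONS = 4
N_FEATURES = 8

rng = np.random.default_rng(0)

# Random bipolar basis hypervectors for the state features (assumed encoding).
state_basis = rng.choice([-1, 1], size=(N_FEATURES, D))

def encode_state(features):
    """Bundle feature basis vectors weighted by feature values, then binarize."""
    return np.sign(features @ state_basis)

# One model hypervector per action; Q(s, a) is a dot-product similarity.
action_models = rng.choice([-1.0, 1.0], size=(N_ACTIONS, D))

def q_values(state_hv):
    # On an associative processor this similarity runs bit-parallel in memory;
    # here it is an ordinary matrix-vector product.
    return action_models @ state_hv / D

features = rng.random(N_FEATURES)
s = encode_state(features)
q = q_values(s)
best_action = int(np.argmax(q))   # the AP finds this maximum in parallel
```

The point of the AP mapping is that both the similarity computation and the max-search are element-parallel, which is why the encoding is chosen so that action selection reduces to an argmax over dot products.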
- Conference Article
- 10.1109/isocc.2016.7799848
- Oct 1, 2016
Since new technologies like big data and cloud computing require a tremendous number of transactions between processors and memory, a new memory system called Processing in Memory (PIM) has been suggested as a solution for such memory-intensive applications. To let software utilize the new architecture, a development environment is needed, with a tool chain and debug infrastructure supplementing extended instruction sets and target emulation. This paper introduces a way to emulate a target platform containing a PIM-architecture memory device. The emulated platform may use the PIM device as system memory or as a simple memory-mapped bus device. The actual PIM architecture is implemented in both a cycle-based memory simulator and an FPGA-based hardware platform. The emulator provides a bridge to these implementations, so the emulated target platform may be freely used for early-stage application development for the PIM architecture.
- Research Article
- 10.1108/hff-02-2025-0118
- Aug 15, 2025
- International Journal of Numerical Methods for Heat & Fluid Flow
Purpose This study aims to investigate the effectiveness of Model Predictive Control (MPC) and Reinforcement Learning (RL) approaches for active flow control over a NACA 4412 airfoil near the static stall condition at a Reynolds number of \(4 \times 10^5\). By systematically evaluating these control strategies, the research seeks to address a critical gap in optimizing excitation frequency and improving response time in flow control applications. The study contributes to a deeper understanding of the adaptability and performance of RL-based methods compared to traditional MPC in aerodynamic flow separation control. Design/methodology/approach The study employs a quantitative approach through numerical simulations of the Reynolds Averaged Navier-Stokes (RANS) equations with the Scale-Adaptive Simulation (SAS) turbulence model. Dielectric Barrier Discharge (DBD) plasma actuators, operating in dual-point excitation mode, are utilized for flow separation control. The research evaluates adaptive MPC, temporal difference reinforcement learning (TDRL) and deep Q-learning (DQL) in optimizing excitation frequency and expediting the stabilization process. Additionally, an integrated approach combining signal processing with DQL is examined to enhance control performance. Findings This study explores advanced control strategies for optimizing aerodynamic performance by managing flow separation using plasma actuators. We evaluate adaptive MPC, TDRL, DQL and DQL with signal processing, utilizing dual-point excitation via DBD plasma actuators. Adaptive MPC successfully achieved a target lift coefficient Cl of 1.60 using an excitation frequency of approximately 110 Hz, but struggled to reach higher target Cl values near the physical limits. RL methods effectively optimized excitation frequencies, achieving a lift coefficient of approximately 1.62 in under 2.5 s with an excitation frequency of 100 or 200 Hz.
Originality/value This study presents a novel comparison of RL and MPC methods for active flow control, utilizing DBD plasma actuators to mitigate flow separation and enhance aerodynamic performance. Prior approaches have primarily focused on either MPC or RL independently, often relying on offline learning with separate training and testing phases. In contrast, our research employs an online learning framework, where RL-based techniques such as TDRL, DQL and signal processing-enhanced DQL dynamically adapt to real-time aerodynamic conditions. By simultaneously evaluating adaptive MPC and RL methods in an online learning setup, this paper provides new insights into their comparative performance in dynamic environments.
- Research Article
- 10.14209/jcis.2023.15
- Jan 1, 2023
- Journal of Communication and Information Systems
The collision avoidance mechanism adopted by the IEEE 802.11 standard is not optimal. The mechanism employs a binary exponential backoff (BEB) algorithm in the medium access control (MAC) layer. Such an algorithm increases the backoff interval whenever a collision is detected to minimize the probability of subsequent collisions. However, increasing the backoff interval degrades radio spectrum utilization (i.e., wastes bandwidth). The problem worsens when the network must manage channel access for a dense set of stations, leading to a dramatic decrease in network performance. Furthermore, a wrong backoff setting increases the probability of collisions, such that stations experience numerous collisions before reaching the optimal backoff value. Therefore, to mitigate bandwidth wastage and, consequently, maximize network performance, this work proposes using reinforcement learning (RL) algorithms, namely Deep Q-Learning (DQN) and Deep Deterministic Policy Gradient (DDPG), to tackle this optimization problem. In our proposed approach, we assess two different observation metrics: the average of the normalized transmission-queue levels of all associated stations, and the probability of collisions. The overall network throughput is defined as the reward. The action is the contention window (CW) value that maximizes throughput while minimizing the number of collisions. For the simulations, the NS-3 network simulator is used along with NS3-gym, a toolkit that integrates an RL framework into NS-3. The results demonstrate that DQN and DDPG perform much better than BEB for both static and dynamic scenarios, regardless of the number of stations. Additionally, our results show that observations based on the average of the normalized transmission-queue levels perform slightly better than observations based on the collision probability.
Moreover, the performance difference with BEB is amplified as the number of stations increases, with DQN and DDPG showing a 45.52% increase in throughput with 50 stations.
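The BEB rule this abstract contrasts against RL can be sketched as follows. This is a minimal illustration using common 802.11 DCF contention-window bounds; the function names are hypothetical, and an RL agent would instead set the CW value directly from its observations.

```python
import random

CW_MIN, CW_MAX = 15, 1023  # common 802.11 DCF contention-window bounds

def beb_backoff(cw, collided):
    """Binary exponential backoff: double the window on collision, reset on success."""
    if collided:
        return min(2 * (cw + 1) - 1, CW_MAX)   # 15 -> 31 -> 63 -> ... -> 1023
    return CW_MIN

def draw_backoff_slots(cw):
    # A station defers for a uniform number of idle slots in [0, cw].
    return random.randint(0, cw)

cw = CW_MIN
for _ in range(3):                  # three consecutive collisions
    cw = beb_backoff(cw, collided=True)
assert cw == 127                    # window has doubled three times

slots = draw_backoff_slots(cw)
```

The bandwidth wastage the abstract describes comes from the last line: after repeated collisions, every station draws its wait from a large window, leaving many slots idle.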
- Research Article
- 10.30574/ijsra.2024.12.2.1471
- Aug 30, 2024
- International Journal of Science and Research Archive
This paper presents the distinct mechanisms and applications of traditional Q-learning (QL) and Deep Q-learning (DQL) within the realm of reinforcement learning (RL). Traditional Q-learning (QL) utilizes the Bellman equation to update Q-values stored in a Q-table, making it suitable for simple environments. However, its scalability is limited due to the exponential growth of state-action pairs in complex environments. Deep Q-learning (DQL) addresses this limitation by using neural networks to approximate Q-values, thus eliminating the need for a Q-table, and enabling efficient handling of complex environments. The neural network (NN), acting as the agent's decision-making brain, learns to predict Q-values through training, adjusting its weights based on received rewards. The study highlights the importance of well-calibrated reward systems in reinforcement learning (RL). Proper reward structures guide the agent towards desired behaviors while minimizing unintended actions. By running multiple environments simultaneously, the training process is accelerated, allowing the agent to gather diverse experiences and improve its performance efficiently. Comparative analysis of training models demonstrates that a well-balanced reward system results in more consistent and effective learning. The findings underscore the necessity of careful design in reinforcement learning systems to ensure optimal agent behavior and efficient learning outcomes in both simple and complex environments. Through this research, we gain valuable insights into the application of Q-learning (QL) and Deep Q-learning (DQL), enhancing our understanding of how agents learn and adapt to their environments.
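The tabular Bellman update that the abstract contrasts with DQL's neural approximation can be written in a few lines. A minimal sketch; the state/action counts and hyperparameters are illustrative.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2
alpha, gamma = 0.1, 0.9          # learning rate and discount factor

Q = np.zeros((N_STATES, N_ACTIONS))   # the Q-table DQL replaces with a network

def q_update(s, a, r, s_next):
    """Bellman update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(0, 1, r=1.0, s_next=2)
# From an all-zero table: Q[0,1] = 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```

The scalability limit is visible in the first line: the table grows with `N_STATES * N_ACTIONS`, which is exactly what DQL's function approximation avoids.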
- Research Article
- 10.1109/tase.2020.3024725
- Oct 2, 2020
- IEEE Transactions on Automation Science and Engineering
Reinforcement learning (RL) has been increasingly used for single peg-in-hole assembly, where assembly skill is learned through interaction with the assembly environment in a manner similar to skills employed by human beings. However, the existing RL algorithms are difficult to apply to multiple peg-in-hole assembly because the much more complicated assembly environment requires extensive exploration, resulting in a long training time and lower data efficiency. To this end, this article focuses on how to predict the assembly environment and how to use the predicted environment in assembly action control to improve the data efficiency of the RL algorithm. Specifically, first, the assembly environment is predicted exactly by a variable time-scale prediction (VTSP) defined as general value functions (GVFs), reducing unnecessary exploration. Second, we propose a fuzzy logic-driven variable time-scale prediction-based reinforcement learning (FLDVTSP-RL) for assembly action control to improve the efficiency of the RL algorithm, in which the predicted environment is mapped to the impedance parameter in the proposed impedance action space by a fuzzy logic system (FLS) as the action baseline. To demonstrate the effectiveness of VTSP and the data efficiency of the FLDVTSP-RL methods, a dual peg-in-hole assembly experiment is set up; the results show that FLDVTSP-deep Q-learning (DQN) decreases the assembly time by about 44% compared with DQN, and FLDVTSP-deep deterministic policy gradient (DDPG) decreases the assembly time by about 24% compared with DDPG. Note to Practitioners — The complicated assembly environment of the multiple peg-in-hole assembly results in a contact state that cannot be recognized exactly from the force sensor.
Therefore, contact-model-based methods that require tuning of the control parameters based on contact state recognition cannot be applied directly in this complicated environment. Recently, reinforcement learning (RL) methods without contact state recognition have attracted scientific interest. However, the existing RL methods still rely on numerous explorations and a long training time, and thus cannot be directly applied to real-world tasks. This article takes inspiration from the way human beings can learn assembly skills in a few trials, which relies on variable time-scale predictions (VTSPs) of the environment and an optimized assembly action control strategy. Our proposed fuzzy logic-driven variable time-scale prediction-based reinforcement learning (FLDVTSP-RL) can be implemented in two steps. First, the assembly environment is predicted by the VTSP defined as general value functions (GVFs). Second, assembly action control is realized in an impedance action space with a baseline defined by the impedance parameter mapped from the predicted environment by the fuzzy logic system (FLS). Finally, a dual peg-in-hole assembly experiment is conducted; compared with deep Q-learning (DQN), FLDVTSP-DQN decreases the assembly time by about 44%; compared with deep deterministic policy gradient (DDPG), FLDVTSP-DDPG decreases the assembly time by about 24%.
- Research Article
- 10.1145/3716873
- Jun 30, 2025
- ACM Transactions on Architecture and Code Optimization
As modern applications demand more data, processing-in-memory (PIM) architectures have emerged to address the challenges of data movement and parallelism. In this article, we propose VersaTile, a heterogeneous, fully CMOS-based tiled architecture that combines conventional out-of-order (OoO) superscalar CPUs and associative processors (APs), a type of CAM-based PIM core. Both CPUs and APs leverage the RISC-V ISA and its standard RVV vector extension. VersaTile fosters collaboration between multiple low-latency CPUs and high-throughput APs by sharing the same software stack and adopting a common CPU programming and compilation frontend. Moreover, we introduce tile stitching, a mechanism enabling the aggregation of multiple APs into a single vector super-unit with modest hardware support and no programming effort. Tile stitching allows us to configure an architecture for optimal performance across a wide range of applications. We provide a detailed case study, including a scalable floorplan example, as well as a comprehensive evaluation of various design points. Our experiments show that, when using only AP tiles, VersaTile can achieve, on average across the Phoenix benchmark suite and 3D convolution, a \(5.7\times\) speedup with respect to area-equivalent OoO CPU cores with SIMD ALUs (up to \(23\times\) ), and \(4.6\times\) with respect to an equivalent-sized monolithic AP baseline (up to \(29\times\) ). For applications with both DLP (vector) and ILP (scalar) regions, VersaTile can use APs and OoO cores collaboratively to achieve better performance than using either one of them only, up to \(4.4\times\) .
- Research Article
- 10.1145/3770756
- Oct 7, 2025
- ACM Transactions on Reconfigurable Technology and Systems
The matrix operations that underpin today’s deep learning models are routinely implemented in SIMD domain-specific accelerators [1–19]. SIMD accelerators, including GPUs and array processors, can effectively leverage parallelism in models that are compute-bound, but their effectiveness can be diminished for models that are memory-bound. Processing-in-Memory (PIM) architectures are being explored to provide better energy efficiency and scalable performance for these memory-bound models [20–33]. Modern Field Programmable Gate Arrays (FPGAs) feature hundreds of megabits of SRAM distributed across the device as disaggregated memory resources. This makes FPGAs ideal programmable platforms for developing custom processor-in/near-memory accelerators. Several PIM array-based accelerator designs [24–31] have been proposed to leverage this substantial internal bandwidth. However, results reported to date show FPGA-based PIM architectures operating at system clock frequencies well below a chip’s BRAM Fmax. Results also show that the compute densities of the designs do not scale linearly with BRAM densities. These results suggest that FPGA PIM architectures cannot be competitive with their custom Application-Specific Integrated Circuit (ASIC) counterparts. In this paper, we introduce DA-VinCi, a Deep-learning Accelerator oVerlay using in-Memory Computing. DA-VinCi is the first scalable FPGA-based PIM deep-learning accelerator overlay capable of clocking at the maximum frequency of a device’s BRAM. Further, the architecture of DA-VinCi allows the number of compute units to scale linearly up to the maximum capacity of a device’s BRAM, at the BRAM’s maximum clock frequency. The DA-VinCi overlay has a programmable Instruction Set Architecture (ISA) that allows the same synthesized design to provide low-latency inferencing of a range of memory-bound deep-learning models, including MLP, RNN, LSTM, and GRU networks.
The scalability and high clocking frequency of DA-VinCi are achieved through a new Processor-In-Memory (PIM) Tile architecture and a highly scalable system-level framework. We present results showing DA-VinCi linearly scaling the number of PEs to 100% of the BRAM capacity (over 60K PEs) on an Alveo U55 clocking at 737 MHz, the chip’s BRAM Fmax. We provide comparative studies on inference latency across multiple deep-learning applications that show DA-VinCi achieves up to a 201× improvement over a state-of-the-art PIM overlay accelerator, up to 87× over existing PIM-based FPGA accelerators, and up to 57× over custom deep-learning accelerators on FPGAs.
- Conference Article
- 10.1109/itec.2019.8790630
- Jun 1, 2019
In this paper, a novel deep Q-learning (DQL) algorithm-based energy management strategy for a series hybrid tracked electric vehicle (SHETV) is proposed. Initially, the configurations of the SHETV powertrain are introduced, its system model is established accordingly, and the energy management problem is formulated. Secondly, the energy management control policy based on the DQL algorithm is developed. Given the curse-of-dimensionality problem of the conventional reinforcement learning (RL) strategy, two deep Q-networks with identical structure and initial weights are built and trained to approximate the action-value function and improve the robustness of the whole model. The DQL-based strategy is then trained and validated using driving cycle data collected in the real world, and results show that it reduces fuel consumption by approximately 5.9% compared with the traditional RL strategy. Finally, a new driving cycle is executed on the trained DQL model and used to retrain the RL model for comparison. The result indicates that the DQL strategy consumes about 6.34% less fuel than the RL strategy, which confirms the adaptability of the DQL strategy.
- Conference Article
- 10.1109/asap.2003.1212852
- Jun 24, 2003
Motion estimation is the most time-consuming stage of MPEG family encodings, reportedly absorbing up to 90% of the total execution time of MPEG processing. Therefore, we propose a hardware/software co-design paradigm that uses a PIM module to efficiently execute motion estimation operations. We use a PIM module to reduce the memory access penalty caused by a large number of memory accesses. We segment the PIM module into small pieces so that each smaller PIM module can execute the operations in parallel fashion. However, in order to execute the operations in parallel, there are critical overheads that involve replicating a huge amount of data to many of these smaller PIM modules. Not only do these replications require a huge amount of additional memory accesses but also calculations when generating addresses. Therefore, we also present an efficient data distribution mechanism to effectively support parallel executions among these smaller PIM modules. With our paradigm, the host processor can be relieved from computationally intensive and data-intensive workloads of motion estimation. We observed up to a \(2034\times\) reduction in the number of memory accesses and up to a \(439\times\) performance improvement for the execution of motion estimation operations when using our computing paradigm.
- Research Article
- 10.1038/s41598-025-02933-9
- May 26, 2025
- Scientific Reports
Traffic congestion forecasting is one of the major elements of Intelligent Transportation Systems (ITS). Traffic congestion in urban road networks significantly influences sustainability by increasing air pollution levels. Efficient congestion management enables drivers to bypass heavily trafficked areas, reducing pollutant emissions. However, properly forecasting congestion spread remains challenging due to the complex, dynamic, and non-linear nature of traffic patterns. The advent of Internet of Things (IoT) devices has introduced valuable datasets that can support the development of intelligent and sustainable transportation for modern cities. This work presents a Deep Learning (DL) approach of Reinforcement Learning (RL)-based Bidirectional Long Short-Term Memory (BiLSTM) with an Adaptive Secretary Bird Optimizer (ASBO) for traffic congestion prediction. The approach is evaluated on the Traffic Prediction Dataset and achieved better Mean Square Error (MSE) and Mean Absolute Error (MAE), with results of 0.015 and 0.133, respectively. Compared to existing algorithms such as RL, Deep Q Learning (DQL), LSTM, and BiLSTM, the RL-BiLSTM with ASBO outperformed in MSE, RMSE, R², MAE, and MAPE by 37%, 27.44%, 26%, 33.52%, and 35.8%, respectively. This performance demonstrates that RL-BiLSTM with ASBO is well suited to predicting congestion patterns in road networks.
- Research Article
- 10.1145/3639046
- Feb 16, 2024
- Proceedings of the ACM on Measurement and Analysis of Computing Systems
Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM "nodes" or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns: AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance.
To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.
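The host-mediated AllReduce bottleneck this study analyzes can be illustrated with a minimal sketch. This is plain Python standing in for the real system, not UPMEM's actual SDK; `host_allreduce` and the buffer layout are hypothetical.

```python
# Each PIM node holds a partial sum (e.g. its shard of an embedding-table
# lookup). Because nodes cannot address each other's memory, the reduction
# must be staged through the host CPU in two transfer phases.

def host_allreduce(node_buffers):
    """Gather each node's partial to the host, sum, and scatter the result back."""
    # Phase 1: host gathers one partial vector per PIM node (node -> host copy).
    total = [0] * len(node_buffers[0])
    for buf in node_buffers:
        for i, v in enumerate(buf):
            total[i] += v
    # Phase 2: host broadcasts the reduced vector back (host -> node copy).
    # Both phases cross the host interface, which is the bottleneck.
    return [list(total) for _ in node_buffers]

partials = [[1, 2], [3, 4], [5, 6]]     # three PIM nodes' partial sums
reduced = host_allreduce(partials)
```

Every element crosses the host interface twice, once per phase, so the transfer volume grows with the number of nodes; this is the scaling cost that direct node-to-node interconnects would remove.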
- Research Article
- 10.1038/s41534-019-0201-8
- Oct 8, 2019
- npj Quantum Information
Reinforcement learning has been widely used in many problems, including quantum control of qubits. However, such problems can, at the same time, be solved by traditional, non-machine-learning methods, such as stochastic gradient descent and Krotov algorithms, and it remains unclear which one is most suitable when the control has specific constraints. In this work, we perform a comparative study on the efficacy of three reinforcement learning algorithms: tabular Q-learning, deep Q-learning, and policy gradient, as well as two non-machine-learning methods: stochastic gradient descent and Krotov algorithms, in the problem of preparing a desired quantum state. We found that overall, the deep Q-learning and policy gradient algorithms outperform others when the problem is discretized, e.g. allowing discrete values of control, and when the problem scales up. The reinforcement learning algorithms can also adaptively reduce the complexity of the control sequences, shortening the operation time and improving the fidelity. Our comparison provides insights into the suitability of reinforcement learning in quantum control problems.
- Conference Article
- 10.1109/nanoarch.2015.7180589
- Jul 1, 2015
We discuss a new approach to computing that retains the possibility of exponential growth while making substantial use of the existing technology. The exponential improvement path of Moore's Law has been the driver behind the computing approach of Turing, von Neumann, and FORTRAN-like languages. Performance growth is slowing at the system level, even though further exponential growth should be possible. We propose two technology shifts as a remedy, the first being the formulation of a scaling rule for scaling into the third dimension. This involves use of circuit-level energy efficiency increases using adiabatic circuits to avoid overheating. However, this scaling rule is incompatible with the von Neumann architecture. The second technology shift is a computer architecture and programming change to an extremely aggressive form of Processor-In-Memory (PIM) architecture, which we call Processor-In-Memory-and-Storage (PIMS). Theoretical analysis shows that the PIMS architecture is compatible with the 3D scaling rule, suggesting both immediate benefit and a long-term improvement path.
- Conference Article
- 10.1109/iccd53106.2021.00022
- Oct 1, 2021
Processing in Memory (PIM) is a recent novel computing paradigm that is still in its nascent stage of development. Therefore, there has been an observable lack of standardized and modular Instruction Set Architectures (ISA) for the PIM devices. In this work, we present the design of an ISA which is primarily aimed at a recent programmable Look-up Table (LUT) based PIM architecture. Our ISA performs the three major tasks of i) controlling the flow of data between the memory and the PIM units, ii) reprogramming the LUTs to perform various operations required for a particular application, and iii) executing sequential steps of operation within the PIM device. A microcoded architecture of the Controller/Sequencer unit ensures minimum circuit overhead as well as offers programmability to support any custom operation. We provide a case study of CNN inferences, large matrix multiplications, and bitwise computations on the PIM architecture equipped with our ISA and present performance evaluations based on this setup. We also compare the performances with several other PIM architectures.
- Research Article
- 10.3390/mi15101222
- Sep 30, 2024
- Micromachines
The rapid advancement of artificial intelligence (AI) technology, combined with the widespread proliferation of Internet of Things (IoT) devices, has significantly expanded the scope of AI applications, from data centers to edge devices. Running AI applications on edge devices requires a careful balance between data processing performance and energy efficiency. This challenge becomes even more critical when the computational load of applications dynamically changes over time, making it difficult to maintain optimal performance and energy efficiency simultaneously. To address these challenges, we propose a novel processing-in-memory (PIM) technology that dynamically optimizes performance and power consumption in response to real-time workload variations in AI applications. Our proposed solution consists of a new PIM architecture and an operational algorithm designed to maximize its effectiveness. The PIM architecture follows a well-established structure known for effectively handling data-centric tasks in AI applications. However, unlike conventional designs, it features a heterogeneous configuration of high-performance PIM (HP-PIM) modules and low-power PIM (LP-PIM) modules. This enables the system to dynamically adjust data processing based on varying computational load, optimizing energy efficiency according to the application’s workload demands. In addition, we present a data placement optimization algorithm to fully leverage the potential of the heterogeneous PIM architecture. This algorithm predicts changes in application workloads and optimally allocates data to the HP-PIM and LP-PIM modules, improving energy efficiency. To validate and evaluate the proposed technology, we implemented the PIM architecture and developed an embedded processor that integrates this architecture. We performed FPGA prototyping of the processor, and functional verification was successfully completed. 
Experimental results from running applications with varying workload demands on the prototype PIM processor demonstrate that the proposed technology achieves up to 29.54% energy savings.