Abstract

Reinforcement Learning (RL) is a technique that enables an agent to learn to behave optimally by repeatedly interacting with its environment and receiving rewards. RL is widely used in domains such as robotics, game playing, and finance. Proximal Policy Optimization (PPO) is a state-of-the-art policy optimization algorithm that achieves superior overall performance on various RL benchmarks. PPO iteratively optimizes its policy, a function that chooses actions, with each iteration consisting of two computationally intensive phases: an inference phase, in which agents infer actions to interact with the environment and collect data, and a training phase, in which agents train the policy using the collected data. In this work, we develop the first high-throughput PPO accelerator on a CPU-FPGA heterogeneous platform, targeting both phases of the algorithm for acceleration. We implement a systolic-array-based architecture coupled with a novel memory-blocked data layout that enables streaming data access in both forward and backward propagation to achieve high throughput. Additionally, we develop a novel systolic-array compute-sharing technique to mitigate the potential load imbalance in training the two networks. We develop an accurate performance model of our design, based on which we perform design space exploration to obtain optimal design points. Our design is evaluated on widely used robotics benchmarks, achieving $2.1 \times - 30.5 \times$ and $2 \times - 27.5 \times$ improvements in throughput against state-of-the-art CPU and CPU-GPU implementations, respectively.
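To make the two-phase structure of a PPO iteration concrete, the following is a minimal sketch in plain Python/NumPy, not the paper's accelerator implementation: an inference phase collects rollouts under the current policy, and a training phase applies the clipped surrogate update. The toy environment, softmax policy, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)               # policy parameters (logits over 2 actions)
EPS_CLIP, LR, BATCH = 0.2, 0.1, 64

def policy_probs(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def inference_phase(n):
    """Inference phase: agents infer actions, interact with the environment, collect data."""
    probs = policy_probs(theta)
    actions = rng.choice(2, size=n, p=probs)
    rewards = (actions == 1).astype(float)   # toy bandit: action 1 pays 1.0, action 0 pays 0.0
    old_probs = probs[actions]               # pi_old(a) recorded for the ratio later
    advantages = rewards - rewards.mean()    # simple mean baseline
    return actions, old_probs, advantages

def training_phase(actions, old_probs, advantages, epochs=4):
    """Training phase: update the policy with the PPO clipped surrogate objective."""
    global theta
    for _ in range(epochs):
        probs = policy_probs(theta)
        ratio = probs[actions] / old_probs   # pi_new(a) / pi_old(a)
        # gradient flows only where the clip does not bind
        active = ((advantages >= 0) & (ratio < 1 + EPS_CLIP)) | \
                 ((advantages < 0) & (ratio > 1 - EPS_CLIP))
        grad = np.zeros_like(theta)
        for i, a in enumerate(actions):
            if active[i]:
                dlogpi = -probs.copy()
                dlogpi[a] += 1.0             # d log pi(a) / d theta for a softmax policy
                grad += ratio[i] * advantages[i] * dlogpi
        theta += LR * grad / len(actions)    # gradient ascent on the surrogate

for it in range(20):
    batch = inference_phase(BATCH)
    training_phase(*batch)
print("final action probabilities:", policy_probs(theta))
```

In an actual PPO deployment the policy and value networks are multi-layer perceptrons, which is what makes both phases computationally intensive and amenable to the systolic-array acceleration described above.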
