Abstract

Reinforcement Learning (RL) is a technique that enables an agent to learn to behave optimally by repeatedly interacting with its environment and receiving rewards. RL is widely used in domains such as robotics, game playing, and finance. Proximal Policy Optimization (PPO) is a state-of-the-art policy optimization algorithm that achieves superior overall performance on various RL benchmarks. PPO iteratively optimizes its policy, a function that chooses actions, with each iteration consisting of two computationally intensive phases: an inference phase, in which agents infer actions to interact with the environment and collect data, and a training phase, in which agents train the policy using the collected data. In this work, we develop the first high-throughput PPO accelerator on a CPU-FPGA heterogeneous platform, targeting both phases of the algorithm for acceleration. We implement a systolic-array-based architecture coupled with a novel memory-blocked data layout that enables streaming data access in both forward and backward propagation to achieve high throughput. Additionally, we develop a novel systolic-array compute-sharing technique to mitigate the potential load imbalance in training the two networks. We develop an accurate performance model of our design, based on which we perform design space exploration to obtain optimal design points. Our design is evaluated on widely used robotics benchmarks, achieving $2.1 \times - 30.5 \times$ and $2 \times - 27.5 \times$ improvements in throughput against state-of-the-art CPU and CPU-GPU implementations, respectively.
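To make the two-phase structure of a PPO iteration concrete, the following is a minimal sketch in plain Python/NumPy, not the paper's accelerator implementation: an inference phase collects rollouts under the current policy, and a training phase applies the clipped surrogate update. The toy environment, softmax policy, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)               # policy parameters (logits over 2 actions)
EPS_CLIP, LR, BATCH = 0.2, 0.1, 64

def policy_probs(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def inference_phase(n):
    """Inference phase: agents infer actions, interact with the environment, collect data."""
    probs = policy_probs(theta)
    actions = rng.choice(2, size=n, p=probs)
    rewards = (actions == 1).astype(float)   # toy bandit: action 1 pays 1.0, action 0 pays 0.0
    old_probs = probs[actions]               # pi_old(a) recorded for the ratio later
    advantages = rewards - rewards.mean()    # simple mean baseline
    return actions, old_probs, advantages

def training_phase(actions, old_probs, advantages, epochs=4):
    """Training phase: update the policy with the PPO clipped surrogate objective."""
    global theta
    for _ in range(epochs):
        probs = policy_probs(theta)
        ratio = probs[actions] / old_probs   # pi_new(a) / pi_old(a)
        # gradient flows only where the clip does not bind
        active = ((advantages >= 0) & (ratio < 1 + EPS_CLIP)) | \
                 ((advantages < 0) & (ratio > 1 - EPS_CLIP))
        grad = np.zeros_like(theta)
        for i, a in enumerate(actions):
            if active[i]:
                dlogpi = -probs.copy()
                dlogpi[a] += 1.0             # d log pi(a) / d theta for a softmax policy
                grad += ratio[i] * advantages[i] * dlogpi
        theta += LR * grad / len(actions)    # gradient ascent on the surrogate

for it in range(20):
    batch = inference_phase(BATCH)
    training_phase(*batch)
print("final action probabilities:", policy_probs(theta))
```

In an actual PPO deployment the policy and value networks are multi-layer perceptrons, which is what makes both phases computationally intensive and amenable to the systolic-array acceleration described above.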
