Abstract

Field-programmable gate arrays (FPGAs) and other reconfigurable computing (RC) devices have been widely shown to provide order-of-magnitude performance and power improvements over microprocessors for some applications. Unfortunately, FPGA usage has largely been limited to applications exhibiting sequential memory access patterns, thereby preventing acceleration of important applications with irregular patterns (e.g., pointer-based data structures). In this paper, we present a design pattern for RC application development that serializes irregular data structure traversals online into a traversal cache, which allows the corresponding data to be efficiently streamed to the FPGA. The paper presents a generalized framework that benefits applications with repeated traversals, which we show can achieve between 7x and 29x speedup over pointer-based software. For applications whose traversals are not strictly repeated, we present application-specialized extensions that exploit similarity between traversals to improve memory bandwidth and execute multiple traversals in parallel. We show that these extensions can achieve a speedup between 11x and 70x on a Virtex4 LX100 for Barnes-Hut n-body simulation.
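
To make the core idea concrete, the following C sketch (hypothetical code, not the paper's implementation; the node layout and function names are invented for illustration) serializes a linked-list traversal into a contiguous buffer that can then be streamed sequentially to the FPGA:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical node layout; names are illustrative, not from the paper. */
    typedef struct node {
        float        payload[4];   /* data consumed by the FPGA kernel       */
        struct node *next;         /* pointer that causes irregular accesses */
    } node_t;

    /* Walk the list once in software, copying each node's payload into a flat
     * "traversal cache" buffer.  The buffer can then be streamed to the FPGA
     * with sequential, burst-friendly memory accesses.  Returns the number of
     * nodes serialized. */
    size_t serialize_traversal(const node_t *head, float *cache, size_t max_nodes)
    {
        size_t n = 0;
        for (const node_t *p = head; p != NULL && n < max_nodes; p = p->next, n++)
            memcpy(&cache[n * 4], p->payload, sizeof p->payload);
        return n;
    }

When a traversal is repeated, the same serialized buffer can simply be re-streamed instead of re-walking the pointer structure, which is where the framework's speedup for repeated traversals comes from.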

Highlights

  • Numerous studies have shown that field-programmable gate arrays (FPGAs) and other reconfigurable computing (RC) devices can achieve order of magnitude or larger performance improvements over microprocessors [1, 2] for application domains including embedded systems, digital signal processing, and scientific computing

  • The advantages of FPGAs result from the ability to implement custom circuits that exploit tremendous amounts of parallelism, often using deep pipelines with additional parallelism ranging from the bit level up to the task level

  • Irregular access patterns can arise in many different ways; in this paper, we focus on the common example of pointer-based data structure traversals, such as lists and trees, as illustrated in the sketch below
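
As a minimal, hypothetical C example of why such traversals are problematic (the types here are invented for illustration), a depth-first walk over a pointer-linked tree touches addresses that are known only at runtime, defeating sequential prefetching and burst transfers:

    #include <stddef.h>

    /* Hypothetical binary-tree node; field names are illustrative only. */
    typedef struct tree_node {
        float             value;
        struct tree_node *left;
        struct tree_node *right;
    } tree_node_t;

    /* Each dereference of left/right can land anywhere in the heap, so the
     * resulting memory accesses are data dependent and irregular, which makes
     * the structure hard to stream to an FPGA without a traversal cache. */
    float sum_tree(const tree_node_t *n)
    {
        if (n == NULL)
            return 0.0f;
        return n->value + sum_tree(n->left) + sum_tree(n->right);
    }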

Summary

Introduction

Numerous studies have shown that field-programmable gate arrays (FPGAs) and other reconfigurable computing (RC) devices can achieve order of magnitude or larger performance improvements over microprocessors [1, 2] for application domains including embedded systems, digital signal processing, and scientific computing. Although the basic traversal cache model is simple and generic, it achieves limited speedup for applications that never or only rarely repeat a traversal identically, since non-repeated traversals cause a high invalidation rate and constant thrashing. The similarity-exploiting use model has the added benefit of allowing hardware to generate and process multiple traversals in parallel, which enables a large amount of data reuse when the similarity between traversals is high, resulting in additional speedup. This approach handles identically repeated traversals as a special case, automatically executing the repeated traversals in parallel and enabling, for example, greater loop unrolling by improving access to memory.
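
As a rough sketch of the repeated-traversal case (hypothetical code assuming a simple address-signature comparison, which is only a stand-in for the framework's actual mechanism), software could reuse the serialized buffer only when a traversal visits exactly the same nodes as the previous one, and rebuild it otherwise:

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal list node (same shape as in the earlier sketch). */
    typedef struct node {
        float        payload[4];
        struct node *next;
    } node_t;

    /* Hypothetical repeat check: hash the sequence of node addresses visited.
     * If the signature matches the previous traversal, the serialized buffer
     * can be reused; otherwise the cache is invalidated and rebuilt, which is
     * the thrashing case when traversals are never repeated. */
    static uint64_t traversal_signature(const node_t *head)
    {
        uint64_t h = 1469598103934665603ULL;          /* FNV-style offset basis */
        for (const node_t *p = head; p != NULL; p = p->next) {
            h ^= (uint64_t)(uintptr_t)p;
            h *= 1099511628211ULL;                    /* FNV-style prime */
        }
        return h;
    }

    bool traversal_cache_hit(const node_t *head, uint64_t *last_sig)
    {
        uint64_t sig = traversal_signature(head);
        bool hit = (sig == *last_sig);
        *last_sig = sig;        /* remember this traversal for the next check */
        return hit;             /* a miss means the buffer must be rebuilt    */
    }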

Previous Work
Traversal Cache Framework
Exploiting Traversal Similarity: A Case Study on Barnes-Hut
General Framework
Similarity Extensions
Conclusions