Exploit Pipeline Parallelism Research Articles

There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional shared-memory coherent memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading. Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8 × greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0 ×.

Read full abstract

Multimedia and DSP applications have several computationally intensive kernels which are often off loaded and accelerated by application-specific hardware. This paper presents a speculative loop pipelining technique to overcome limitations of binary translation for hardware acceleration. Although many compilers have been developed at source level, it is desirable to translate the binary targeted to popular processors onto hardware for several practical benefits. However, the translated code can be less optimized. In particular, it is difficult to optimize memory accesses on binary to exploit pipeline parallelism since memory optimization techniques require perfect dependence information for correctness and efficiency. This information is not often available at binary level or even at the source level. Our technique synthesizes the pipeline with memory dependence speculation and postpones some phases of compilation by generating a small dependence analysis code or logic which makes use of runtime values. Such speculative optimization achieves the large amount of parallelism and does not depend on any user annotation. The experimental results show a promising speedup of up to 2.53 compared with the code in which memory accesses are not optimized in the pipeline fashion due to conservative memory analysis. In addition, we have evaluated our technique at hardware level implementation on FPGA devices and achieved comparable clock frequency and power consumption compared to a conservative method while achieving significant improvement in throughput.

Read full abstract

Exploit Pipeline Parallelism Research Articles

Related Topics

Articles published on Exploit Pipeline Parallelism

Memory-Throughput Trade-off for CNN-Based Applications at the Edge

Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Understanding parallelism in graph traversal on multi-core clusters

Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

Speculative Loop-Pipelining in Binary Translation for Hardware Acceleration

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Exploit Pipeline Parallelism Research Articles

Related Topics

Articles published on Exploit Pipeline Parallelism

Memory-Throughput Trade-off for CNN-Based Applications at the Edge

Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Understanding parallelism in graph traversal on multi-core clusters

Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

Speculative Loop-Pipelining in Binary Translation for Hardware Acceleration