Compiler Support Research Articles

With the recent trend of promoting Field-Programmable Gate Arrays (FPGAs) to first-class citizens in accelerating compute-intensive applications in networking, cloud services, and artificial intelligence, FPGAs face two major challenges in sustaining competitive advantages in performance and energy efficiency for diverse cloud workloads: (1) limited configuration capability for supporting light-weight computations and on-chip data storage to accelerate emerging search- and data-intensive applications; and (2) lack of architectural support to hide reconfiguration overhead for assisting virtualization in a cloud computing environment. In this paper, we propose a reconfigurable memory-oriented computing fabric, namely Liquid Silicon-Monona (L-Si), enabled by emerging nonvolatile memory technology (i.e., RRAM), to address these two challenges. Specifically, L-Si addresses the first challenge by virtue of a new architecture comprising a 2D array of physically identical but functionally configurable building blocks. For the first time, it extends the configuration capabilities of existing FPGAs from computation alone to the whole spectrum ranging from computation to data storage. It allows users to better customize hardware by flexibly partitioning hardware resources between computation and memory, greatly benefiting emerging search- and data-intensive applications. To address the second challenge, L-Si provides scalable multi-context architectural support to minimize reconfiguration overhead for assisting virtualization. In addition, we provide compiler support to facilitate the programming of applications written in high-level programming languages (e.g., OpenCL) and frameworks (e.g., TensorFlow, MapReduce) while fully exploiting the unique architectural capability of L-Si. Our evaluation results show that L-Si achieves 99.6% area reduction, 1.43× throughput improvement, and 94.0% power reduction on search-intensive benchmarks, as compared with the FPGA baseline. For neural network benchmarks, on average, L-Si achieves 52.3× speedup, 113.9× energy reduction, and 81% area reduction over the FPGA baseline. In addition, the multi-context architecture of L-Si reduces the context switching time to ∼10 ns, compared with ∼100 ms for an off-the-shelf FPGA, greatly facilitating virtualization.

Current de facto parallel programming models such as OpenMP and MPI make it difficult to extract task-level dataflow parallelism as opposed to bulk-synchronous parallelism. Task-parallel approaches that use point-to-point synchronization between dependent tasks in conjunction with dynamic scheduling dataflow runtimes are thus becoming attractive. Although good performance can be extracted for both shared and distributed memory using these approaches, there is little compiler support for them. In this article, we describe the design of compiler-runtime interaction to automatically extract coarse-grained dataflow parallelism in affine loop nests for both shared- and distributed-memory architectures. We use techniques from the polyhedral compiler framework to extract tasks and generate components of the runtime that are used to dynamically schedule the generated tasks. The runtime includes a distributed decentralized scheduler that dynamically schedules tasks on a node. The schedulers on different nodes cooperate with each other through asynchronous point-to-point communication, and all of this is achieved by code automatically generated by the compiler. On a set of six representative affine loop nest benchmarks, while running on 32 nodes with 8 threads each, our compiler-assisted runtime yields a geometric mean speedup of 143.6× (70.3× to 474.7×) over the sequential version and a geometric mean speedup of 1.64× (1.04× to 2.42×) over the state-of-the-art automatic parallelization approach that uses bulk synchronization. We also compare our system with past work that addresses some of these challenges on shared memory, and with an emerging runtime (Intel Concurrent Collections) that demands higher programmer input and effort in parallelizing. To the best of our knowledge, ours is also the first automatic scheme that allows for dynamic scheduling of affine loop nests on a cluster of multicores.
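The contrast the abstract draws between point-to-point synchronization and bulk synchronization can be illustrated with a small, single-threaded sketch. It is not the authors' compiler or runtime: the names `build_tile_grid` and `run_dataflow` are hypothetical, and the tiled wavefront dependence pattern is only an assumed example of what tasks extracted from an affine loop nest might look like. Each task keeps a count of unfinished predecessors and becomes ready the moment that count reaches zero, with no global barrier.

```python
from collections import deque


def build_tile_grid(n):
    """Hypothetical example: an n x n grid of tile tasks in which tile (i, j)
    depends on its upper neighbour (i-1, j) and left neighbour (i, j-1)."""
    tasks, deps = {}, {}
    for i in range(n):
        for j in range(n):
            tasks[(i, j)] = lambda: None  # placeholder for the tile's work
            preds = set()
            if i > 0:
                preds.add((i - 1, j))
            if j > 0:
                preds.add((i, j - 1))
            deps[(i, j)] = preds
    return tasks, deps


def run_dataflow(tasks, deps):
    """Run tasks in an order driven only by per-task dependence counters."""
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    successors = {t: [] for t in tasks}
    for t, preds in deps.items():
        for p in preds:
            successors[p].append(t)

    ready = deque(t for t in tasks if remaining[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()
        order.append(t)
        # Point-to-point notification: only direct successors are touched.
        for s in successors[t]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)

    if len(order) != len(tasks):
        raise ValueError("dependence cycle: some tasks never became ready")
    return order


if __name__ == "__main__":
    tasks, deps = build_tile_grid(3)
    print(run_dataflow(tasks, deps))
```

In a real runtime the ready queue would be drained by several worker threads, and on distributed memory the counter decrement for a remote successor would be triggered by an asynchronous message; both are omitted here to keep the sketch deterministic.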

Articles published on Compiler Support

Compiler Support for the Fortran 2003, 2008, TS29113, and 2018 Standards Revision 24

Compiler Support for the Fortran 2003 and 2008 Standards Revision 22

Liquid Silicon-Monona

Contention free delayed keeper for high density large signal sensing memory compiler

CAIRO

Optimization of Low-Density Parity Check decoder performance for OpenCL designs synthesized to FPGAs

Compiler Support for the Fortran 2003 and 2008 Standards Revision 21

LD

A 5.3 pJ/op approximate TTA VLIW tailored for machine learning

Static Instruction Scheduling for High Performance on Limited Hardware

Free Rider

Compiler Support for the Fortran 2003 and 2008 Standards Revision 20

Compiler Support for the Fortran 2003 and 2008 Standards Revision 19

Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory

Compiler Support for the Fortran 2003 and 2008 Standards Revision 18

A Framework for Practical Dynamic Software Updating

Refined transactional lock elision

ROOT 6 and beyond: TObject, C++14 and many cores.

Compiler Support for the Fortran 2003 and 2008 Standards Revision 17
