Roofline Model Research Articles

MOST (Method Of Splitting Tsunami) is widely used to solve the shallow water equations (SWEs) for tsunami simulation. This paper presents a high-performance, power-efficient FPGA implementation of MOST for practical tsunami simulation. The custom hardware is based on a stream computing architecture that is deeply pipelined to increase performance under limited bandwidth. We design a stream processing element (SPE) that combines computing kernels with stencil buffers. We also introduce an SPE array architecture with spatial and temporal parallelism, which further exploits the available hardware resources by implementing multiple SPEs with parallel internal pipelines. Our prototype implementation on an Arria 10 FPGA demonstrates that, by introducing non-dimensionalization, the FPGA-based design performs numerically stable tsunami simulation in single precision with real ocean-depth data. We explore the design space of SPE arrays and find that a design of six cascaded SPEs with a single pipeline achieves a sustained performance of 383 GFlops and a performance per power of 8.41 GFlops/W with a stream bandwidth of only 7.2 GB/s. These numbers are 8.6 and 17.2 times higher than those of an NVIDIA Tesla K20c GPU, and 1.7 and 7.1 times higher than those of an AMD Radeon R9 280X GPU, respectively, for the same tsunami simulation in single precision. Moreover, we propose a roofline model for stream computing with the SPE array in order to investigate the factors behind performance degradation and the possible performance improvement for given FPGAs. With the model, we estimate that an upcoming Stratix 10 GX2800 FPGA can achieve a sustained performance of at most 8.7 TFlops with our SPE array architecture for tsunami simulation.
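
As a rough illustration of the figures quoted above, the following minimal sketch applies the classic roofline bound (attainable performance is the smaller of the compute roof and bandwidth times operational intensity). It is not the stream-computing roofline model proposed in the paper, whose details the abstract does not give. The function names and the 1000 GFlops compute roof are hypothetical values chosen only for illustration; the 383 GFlops and 7.2 GB/s figures come from the abstract.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Classic roofline bound: min(compute roof, bandwidth * operational intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)


def implied_intensity(sustained_gflops, bandwidth_gbs):
    """Flops performed per byte streamed, given sustained rate and bandwidth."""
    return sustained_gflops / bandwidth_gbs


# Figures quoted in the abstract.
sustained = 383.0   # GFlops, six cascaded SPEs with a single pipeline
stream_bw = 7.2     # GB/s, stream bandwidth

intensity = implied_intensity(sustained, stream_bw)
print(f"implied intensity: {intensity:.1f} Flop/byte")

# Hypothetical compute roof, NOT a figure from the paper.
hypothetical_peak = 1000.0  # GFlops
bound = attainable_gflops(hypothetical_peak, stream_bw, intensity)
print(f"roofline bound at that intensity: {bound:.0f} GFlops")
```

Under these assumptions the implied intensity is roughly 53 Flop per streamed byte, which is consistent with the abstract's point that cascading SPEs lets the design sustain high throughput from a modest stream bandwidth.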

Execution of complex analytic queries on massive semantic graphs is a challenging problem in big-data analytics that requires high-performance parallel computing. In a semantic graph, vertices and edges carry attributes of various types, and the analytic queries typically depend on the values of these attributes. Thus, the computation must view the graph through a filter that passes only those individual vertices and edges of interest. Previous investigations have developed the Knowledge Discovery Toolbox (KDT), a sophisticated Python library for parallel graph computations. In KDT, the user can write custom graph algorithms by specifying operations between edges and vertices (semiring operations). The user can also customize existing graph algorithms by writing filters. Although the high-level language for this customization enables domain scientists to productively express their graph analytics requirements, the customized queries perform poorly due to the overhead of having to call into the Python virtual machine for each vertex and edge. In this work, we use the Selective Embedded Just-In-Time Specialization (SEJITS) approach to automatically translate semiring operations and filters defined by programmers into a lower-level efficiency language, bypassing the upcall into Python. We evaluate our approach by comparing it with the high-performance Combinatorial BLAS engine and show that our approach combines the benefits of programming in a high-level language with executing in a low-level parallel environment. We increase the system’s flexibility by developing techniques that provide users with the ability to define new vertex and edge types from Python. We also present a new Roofline model for graph traversals and show that we achieve performance that is significantly closer to the bounds suggested by the Roofline.
Finally, to further understand the complex interaction with the underlying architecture, we present an analysis using performance counters that quantifies the improvement in hardware behavior in the context of our SEJITS methodology. Overall, we demonstrate the first known solution to the problem of obtaining high performance from a productivity language when applying graph algorithms selectively on semantic graphs with hundreds of millions of edges and scaling to thousands of processors for graphs.
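
The idea of viewing a graph through a filter can be sketched in plain Python as a breadth-first search that only follows edges accepted by a user-supplied predicate. This is a minimal single-process illustration and does not use the KDT or SEJITS APIs; the function name `filtered_bfs`, the adjacency-list layout, and the example attributes are all hypothetical.

```python
from collections import deque


def filtered_bfs(adj, source, edge_filter):
    """Breadth-first search that only traverses edges passing edge_filter.

    adj maps each vertex to a list of (neighbor, attrs) pairs, where attrs
    is a dict of edge attributes. Returns a dict of vertex -> hop distance.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, attrs in adj.get(u, []):
            # The predicate is invoked once per examined edge.
            if v not in dist and edge_filter(attrs):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical attributed graph: edges carry a "type" attribute.
graph = {
    "a": [("b", {"type": "follows"}), ("c", {"type": "retweets"})],
    "b": [("d", {"type": "follows"})],
    "c": [("d", {"type": "retweets"})],
    "d": [],
}

# Only traverse "follows" edges; vertex "c" is never reached.
levels = filtered_bfs(graph, "a", lambda attrs: attrs["type"] == "follows")
print(levels)
```

The per-edge call to the Python predicate in the inner loop is the kind of overhead the abstract attributes to filters written in a high-level language, which is what translating the filter into a lower-level language is meant to avoid.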

Articles published on Roofline Model

A Multi-level Optimization Strategy to Improve the Performance of Stencil Computation

FPGA-based tsunami simulation: Performance comparison with GPUs, and roofline model for scalability analysis

A lightweight approach to performance portability with targetDP

Performance analysis of finite-difference time-domain schemes for acoustic simulation implemented on multi-core and many-core processor architectures

On Using the Roofline Model with Lower Bounds on Data Movement

Modeling the Performance of Geometric Multigrid Stencils on Multicore Computer Architectures

Parallel processing of filtered queries in attributed semantic graphs

3DyRM: a dynamic roofline model including memory latency information

Cache-aware Roofline model: Upgrading the loft

A roofline model based on working set size for embedded systems

Exploring performance and power properties of modern multi‐core chips via simple machine models

Performance Modeling for FPGAs: Extending the Roofline Model with High-Level Synthesis Tools

An efficient mixed-precision, hybrid CPU–GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model

Performance, optimization, and fitness: Connecting applications to architectures
