Abstract

Thread-level and data-level parallel architectures have become the design of choice in many of today’s energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critical in their overall efficiency. Data reuse exploration aims at reducing the pressure on the memory subsystem by exploiting the temporal locality in data accesses. In this paper, we investigate the effects on performance and energy from a data reuse methodology combined with parallelization and vectorization in multi- and many-core processors. As a test case, a full-search motion estimation kernel is evaluated on Intel® Core™ i7-4700K (Haswell) and i7-2600K (Sandy Bridge) multi-core processors, as well as on an Intel® Xeon Phi™ many-core processor (Knights Landing) with Streaming Single Instruction Multiple Data (SIMD) Extensions (SSE) and Advanced Vector Extensions (AVX) instruction sets. Results using a single-threaded execution on the Haswell and Sandy Bridge systems show that performance and EDP (Energy Delay Product) can be improved through data reuse transformations on the scalar code by a factor of ≈3× and ≈6×, respectively. Compared to scalar code without data reuse optimization, the SSE/AVX2 version achieves ≈10×/17× better performance and ≈92×/307× better EDP, respectively. These results can be improved by 10% to 15% using data reuse techniques. Finally, the most optimized version using data reuse and AVX512 achieves a speedup of ≈35× and an EDP improvement of ≈1192× on the Xeon Phi system. While single-threaded execution serves as a common reference point for all architectures to analyze the effects of data reuse on both scalar and vector codes, scalability with thread count is also discussed in the paper.

Highlights

  • The continuously-increasing computational demands of advanced scientific problems, combined with limited energy budgets, have motivated the need to reach exascale computing systems under reasonable power budgets by the year 2020

  • Most performance and energy improvements will come from heterogeneity combined with coarse-grained parallelism, through Simultaneous Multithreading (SMT) and Chip Multiprocessing (CMP), as well as fine-grained parallelism, through Single Instruction Multiple Data (SIMD) or vector units

  • We have investigated the performance and energy efficiency effects of applying data-reuse transformations on a multi-core processor running a motion estimation algorithm
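
The paper's own kernel and transformations are not reproduced on this page. As an illustration only, the sketch below shows the kind of data-reuse transformation the highlights refer to, applied to a full-search motion estimation loop: the current macroblock is copied once into a small contiguous buffer and reused across all candidate positions, instead of being re-read from the frame for every candidate. The 8×8 block size, search range, and function names are assumptions for this sketch, not the paper's configuration.

```c
#include <stdint.h>
#include <string.h>

enum { B = 8 };  /* macroblock size, an illustrative choice */

/* SAD between the current block at (cx,cy) and the reference block at
 * (rx,ry); both frames share the same row stride. */
static uint32_t sad(const uint8_t *cur, const uint8_t *ref, int stride,
                    int cx, int cy, int rx, int ry)
{
    uint32_t s = 0;
    for (int y = 0; y < B; ++y)
        for (int x = 0; x < B; ++x) {
            int d = cur[(cy + y) * stride + (cx + x)]
                  - ref[(ry + y) * stride + (rx + x)];
            s += (uint32_t)(d < 0 ? -d : d);
        }
    return s;
}

/* Naive full search: the current block is streamed from the frame
 * buffer again for every one of the (2R+1)^2 candidates. */
static uint32_t search_naive(const uint8_t *cur, const uint8_t *ref,
                             int stride, int cx, int cy, int R)
{
    uint32_t best = UINT32_MAX;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            uint32_t s = sad(cur, ref, stride, cx, cy, cx + dx, cy + dy);
            if (s < best) best = s;
        }
    return best;
}

/* Data-reuse version: the current block is copied once into a small
 * contiguous buffer that can stay resident in the nearest level of the
 * memory hierarchy and is reused for every candidate. */
static uint32_t search_reuse(const uint8_t *cur, const uint8_t *ref,
                             int stride, int cx, int cy, int R)
{
    uint8_t buf[B * B];
    for (int y = 0; y < B; ++y)
        memcpy(buf + y * B, cur + (cy + y) * stride + cx, B);

    uint32_t best = UINT32_MAX;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            uint32_t s = 0;
            for (int y = 0; y < B; ++y)
                for (int x = 0; x < B; ++x) {
                    int d = buf[y * B + x]
                          - ref[(cy + dy + y) * stride + (cx + dx + x)];
                    s += (uint32_t)(d < 0 ? -d : d);
                }
            if (s < best) best = s;
        }
    return best;
}
```

Both functions compute the same best-match SAD; only the number of times the current block's pixels travel through the memory hierarchy differs, which is exactly what the evaluated transformations target.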


Introduction

The continuously-increasing computational demands of advanced scientific problems, combined with limited energy budgets, have motivated the need to reach exascale computing systems under reasonable power budgets (below 20 MW) by the year 2020. Such systems will require huge improvements in energy efficiency at all system levels. Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) are SIMD instruction sets supported by Intel. Modern compilers do not yet have adequate auto-vectorization support for complex codes to maximize the potential of SIMD instructions [4,5]. When code efficiency is required, vectorization is therefore often written manually in assembly language or using SIMD intrinsics.
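
To illustrate what manual vectorization with intrinsics looks like, the sketch below computes the SAD of one 16-pixel row using the SSE2 `_mm_sad_epu8` intrinsic (the PSADBW instruction), which performs 16 absolute differences and their horizontal sums in a single instruction. This is a minimal example assuming an x86-64 target, not the paper's kernel.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>

/* SAD of one 16-pixel row of the current and reference blocks. */
static uint32_t sad16_row_sse(const uint8_t *cur, const uint8_t *ref)
{
    __m128i c = _mm_loadu_si128((const __m128i *)cur);
    __m128i r = _mm_loadu_si128((const __m128i *)ref);
    /* _mm_sad_epu8 leaves one partial sum in the low 16 bits of each
     * 64-bit lane; add the two lanes to get the row SAD. */
    __m128i s = _mm_sad_epu8(c, r);
    return (uint32_t)_mm_cvtsi128_si32(s)
         + (uint32_t)_mm_extract_epi16(s, 4);
}
```

A scalar equivalent needs 16 subtractions, 16 absolute values, and 15 additions per row, which is why SAD-heavy kernels such as full-search motion estimation benefit so strongly from SSE/AVX.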

