Efficient Processing of Large Data Structures on GPUs: Enumeration Scheme Based Optimisation

Marcin Gorawski, Michal Lorek

doi:10.1007/s10766-017-0515-0

Open Access

Abstract

The purpose of this paper is to highlight the performance issues of matrix transposition algorithms for large matrices that relate to the Translation Lookaside Buffer (TLB) cache. Existing optimisation techniques such as coalesced access and the use of shared memory, although necessary and beneficial, are not sufficient to neutralise the problem. As the problem size increases, these optimisations do not exploit data locality effectively enough to counteract the detrimental effects of TLB cache misses. We propose a new optimisation technique that counteracts the performance degradation of these algorithms and seamlessly complements current optimisations. Our optimisation is based on a detailed analysis of enumeration schemes that can be applied to either individual matrix entries or blocks (sub-matrices). The key advantage of these enumeration schemes is that they do not incur matrix storage format conversion, because they operate on canonical matrix layouts. In addition, several cache-efficient matrix transposition algorithms based on enumeration schemes are offered: an improved version of the in-place algorithm for square matrices, an out-of-place algorithm for rectangular matrices, and two 3D involutions. We demonstrate that the choice of enumeration schemes and their parametrisation can have a direct and significant impact on the algorithm’s memory access pattern. Our in-place version of the algorithm delivers up to 100% performance improvement over the existing optimisation techniques. Meanwhile, for the out-of-place version we observe up to 300% performance gain over NVIDIA’s algorithm. We also offer improved versions of two involution transpositions for 3D matrices that can achieve a performance increase of up to 300%. To the best of our knowledge, this is the first effective attempt to control the logical-to-physical block association through the design of enumeration schemes in the context of matrix transposition.

Highlights

– Matrix transposition is one of the fundamental operations in linear algebra, and is used in many scientific and engineering applications [9].
– The mapping of a rectangular grid of elements onto a triangular part of a matrix is achieved using various enumeration schemes and can be applied to either cores [10] or blocks, as we propose.
– We describe in detail how enumeration schemes can be used to mitigate performance problems associated with Translation Lookaside Buffer (TLB) cache misses, and how to control the memory access pattern through them.
– We offer an improved version of a thread-wise algorithm that delivers stable performance and high throughput regardless of matrix size.
– We propose a modified version of NVIDIA’s out-of-place algorithm by applying an enumeration scheme that delivers sustained high throughput for large matrices.
– We demonstrate that the 3D matrix transposition presented in [14] is susceptible to TLB cache misses.
– This paper presents improved versions of two matrix transposition algorithms.

Summary

Introduction

Matrix transposition is one of the fundamental operations in linear algebra, and is used in many scientific and engineering applications [9]. Based on the optimisation techniques described in [24], a new range of efficient 3D matrix transposition algorithms has been proposed in [14].

– We present an extended version of the concept of mapping a rectangular grid of elements onto a triangular part of a matrix.
– We propose a modified version of NVIDIA’s out-of-place algorithm by applying an enumeration scheme that delivers sustained high throughput for large matrices.

We extend the analysis of this technique described in [10] and propose an improved version of the transposition algorithm.
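
The idea of mapping a linear range of work items onto the triangular part of a matrix can be illustrated with a small sketch. The Python code below is not the paper's V1 or V1B scheme; it uses the textbook triangular-number pairing as a stand-in, and the names `tri_unrank` and `transpose_in_place` are hypothetical. Each index k in [0, n(n-1)/2) is unranked to a pair (i, j) with i > j, so every off-diagonal pair is swapped exactly once and no work item is wasted on the diagonal or the mirrored half.

```python
from math import isqrt


def tri_unrank(k):
    """Map a linear index k >= 0 to (i, j) with i > j >= 0.

    Row i is the largest integer with i*(i-1)/2 <= k; isqrt keeps the
    computation exact for large k (no floating-point rounding).
    This pairing is an illustrative assumption, not the paper's scheme.
    """
    i = (1 + isqrt(8 * k + 1)) // 2
    j = k - i * (i - 1) // 2
    return i, j


def transpose_in_place(a):
    """Transpose a square list-of-lists matrix by swapping each (i, j), (j, i) pair once."""
    n = len(a)
    for k in range(n * (n - 1) // 2):  # one "work item" per strictly-lower entry
        i, j = tri_unrank(k)
        a[i][j], a[j][i] = a[j][i], a[i][j]
    return a


if __name__ == "__main__":
    m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(transpose_in_place(m))
```

On a GPU, k would typically come from the thread or block index rather than a loop. Changing the unranking function changes the order in which matrix regions are visited without changing the storage layout, which is the property the abstract attributes to enumeration schemes.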

Prior Art

Background

Problem Definition

Transparent Block Reordering

Involution Transposition Optimisation

Enumeration Schemes

Basic Schemes

Basic Pairing Functions

V1 Scheme

Banded Schemes

V1B Scheme

Rectangular Schemes

Performance Evaluation

Findings

Conclusion

Journal: International Journal of Parallel Programming
Publication Date: Jul 4, 2017
Citations: 3
License type: open-access
