Hide Memory Latency Research Articles

It has been more than a decade since general porous applications targeted GPUs to benefit from the enormous processing power they offer. However, not all applications gain speedup running on GPUs. If an application does not have enough parallel computation to hide memory latency, running it on a GPU will degrade the performance compared to what it could achieve on a CPU. On the other hand, the efficiency that an application with high level of parallelism can achieve running on a GPU depends on how well the application’s memory and computational demands are balanced with a GPU’s resources.In this work we tackle the problem of finding a GPU configuration that performs well on a set of GPGPU applications. To achieve this, we propose two models as follows.First, we study the design space of 20 GPGPU applications and show that the relationship between the architectural parameters of the GPU and the power and performance of the application it runs can be learned by a Neural Network (NN). We propose application-specific NN-based predictors that train with 5% of the design space and predict the power and performance of the remaining 95% configurations (blind set). Although the models make accurate predictions, there exist few configurations that their power and performance are mispredicted. We propose a filtering heuristic that captures most of the predictions with large errors by marking only 5% of the configurations in the blind set as outliers.Using the models and the filtering heuristic, one will have the power and performance values for all configurations in the design space of an application. Searching the design space for a set of configurations that meet certain restrictions on the power and performance can be a tedious task as some applications have large design spaces. In the Second model, we propose to employ the Pareto Front multiobjective optimization technique to obtain a subset of the design space that run the application optimally in terms of power and performance. We show that the optimum configurations predicted by our model is very close to the actual optimum configurations. While this method gives the optimum configurations for each application, having a set of GPGPU applications, one may look for a configuration that performs well over all the applications. Therefore, we propose a method to find such a configuration with respect to different performance objectives.

As multicore and many-core architectures evolve, their memory systems are becoming increasingly more complex. To bridge the latency and bandwidth gap between the processor and memory, they often use a mix of multilevel private/shared caches that are either blocking or nonblocking and are connected by high-speed network-on-chip. Moreover, they also incorporate hardware and software prefetching and simultaneous multithreading (SMT) to hide memory latency. On such multi- and many-core systems, to incorporate various memory optimization schemes using compiler optimizations and performance tuning techniques, it is crucial to have microarchitectural details of the target memory system. Unfortunately, such details are often unavailable from vendors, especially for newly released processors. In this article, we propose a novel microbenchmarking methodology based on short elapsed-time events (SETEs) to obtain comprehensive memory microarchitectural details in multi- and many-core processors. This approach requires detailed analysis of potential interfering factors that could affect the intended behavior of such memory systems. We lay out effective guidelines to control and mitigate those interfering factors. Taking the impact of SMT into consideration, our proposed methodology not only can measure traditional cache/memory latency and off-chip bandwidth but also can uncover the details of software and hardware prefetching units not attempted in previous studies. Using the newly released Intel Xeon Phi many-core processor (with in-order cores) as an example, we show how we can use a set of microbenchmarks to determine various microarchitectural features of its memory system (many are undocumented from vendors). To demonstrate the portability and validate the correctness of such a methodology, we use the well-documented Intel Sandy Bridge multicore processor (with out-of-order cores) as another example, where most data are available and can be validated. Moreover, to illustrate the usefulness of the measured data, we do a multistage coordinated data prefetching case study on both Xeon Phi and Sandy Bridge and show that by using the measured data, we can achieve 1.3X and 1.08X performance speedup, respectively, compared to the state-of-the-art Intel ICC compiler. We believe that these measurements also provide useful insights into memory optimization, analysis, and modeling of such multicore and many-core architectures.

Hide Memory Latency Research Articles

Related Topics

Articles published on Hide Memory Latency

Kernel fusion in atomistic spin dynamics simulations on Nvidia GPUs using tensor core

Modeling Emerging Memory-Divergent GPU Applications

Multiobjective GPU design space exploration optimization

Interleaving with coroutines: a systematic and practical approach to hide memory latency in index joins

CODA

Mosaic

SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

Analysis of classic algorithms on highly-threaded many-core architectures

Static Instruction Scheduling for High Performance on Limited Hardware

Reducing Competitive Cache Misses in Modern Processor Architectures

Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications

PrORAM

Minimizing write operation for multi-dimensional DSP applications via a two-level partition technique with complete memory latency hiding

Measuring Microarchitectural Details of Multi- and Many-Core Memory Systems through Microbenchmarking

Boosting mobile GPU performance with a decoupled access/execute fragment processor

A Data Drivered Refresh with Multi-Bit Error-Correcting Power Optimize Method for Cache Based eDRAM

Software Controlled Adaptive Pre-Execution for Data Prefetching

Energy-Efficient Hardware Data Prefetching

The virtual write queue

SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Hide Memory Latency Research Articles

Related Topics

Articles published on Hide Memory Latency

Kernel fusion in atomistic spin dynamics simulations on Nvidia GPUs using tensor core

Modeling Emerging Memory-Divergent GPU Applications

Multiobjective GPU design space exploration optimization

Interleaving with coroutines: a systematic and practical approach to hide memory latency in index joins

CODA

Mosaic

SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

Analysis of classic algorithms on highly-threaded many-core architectures

Static Instruction Scheduling for High Performance on Limited Hardware

Reducing Competitive Cache Misses in Modern Processor Architectures

Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications

PrORAM

Minimizing write operation for multi-dimensional DSP applications via a two-level partition technique with complete memory latency hiding

Measuring Microarchitectural Details of Multi- and Many-Core Memory Systems through Microbenchmarking

Boosting mobile GPU performance with a decoupled access/execute fragment processor

A Data Drivered Refresh with Multi-Bit Error-Correcting Power Optimize Method for Cache Based eDRAM

Software Controlled Adaptive Pre-Execution for Data Prefetching

Energy-Efficient Hardware Data Prefetching

The virtual write queue

SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR