Available Task-Level Parallelism on the Cell BE

Alejandro Rico, Mateo Valero, Alex Ramirez

doi:10.1155/2009/741282

Abstract

There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power-efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges of the near future, and programming heterogeneous systems is even more challenging. High-level programming models that let the programmer identify parallel tasks, and leave the management of inter-task dependencies to the runtime, have been identified as suitable for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to a higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to fewer than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental factor in performance, even though the working set is not that large. We expect a significant performance impact if task management used a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.

Highlights

Power consumption and design complexity have led the computer architecture community to design chip multiprocessors (CMP).
In this paper we have evaluated the performance of Cell Superscalar applications in terms of their scalability to Cell Broadband Engine Architecture (CBEA) implementations that include more Synergistic Processor Elements (SPEs).
Because the Synergistic Processor Unit (SPU) must fit its entire working set in the Local Store, the size of the tasks it can execute is effectively bounded, which makes the task generation overhead the limiting factor for scalability with the number of processors.

Summary

Introduction

Power consumption and design complexity have led the computer architecture community to design chip multiprocessors (CMP). There are pure task-based parallel programming models such as Cell Superscalar [4] and Tagged Procedure Calls (TPC) [18]. In all these models, the task concept provides an intuitive abstraction that can be directly mapped to processing units, since it encapsulates computation and its working data set. Some applications (such as sparse linear algebra) do not use all the data in the working set, so enlarging the task size has a smaller impact. Despite this, the task execution time is proportional to the task size. Enlarging the task size to achieve more parallelism is not a viable solution for CBEA-compliant processors. This shifts our focus to decreasing the task generation overhead, which becomes the critical factor for scalability and full resource utilization.
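The relationship between task size, task generation overhead and scalability can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the paper: it assumes a single serial task generator (standing in for the PPE), independent tasks of equal duration, and no other bottlenecks. The function names and the numbers in the usage note are hypothetical.

```python
def throughput(num_workers: int, task_time_us: float, gen_overhead_us: float) -> float:
    """Tasks completed per microsecond under a simplified model.

    Assumptions (illustrative only): one serial generator that needs
    gen_overhead_us to produce each task, and num_workers identical
    workers that each need task_time_us to execute one task.
    """
    if num_workers < 1 or task_time_us <= 0 or gen_overhead_us <= 0:
        raise ValueError("num_workers must be >= 1 and times must be positive")
    worker_rate = num_workers / task_time_us   # rate at which workers can consume tasks
    generator_rate = 1.0 / gen_overhead_us     # rate at which tasks can be produced
    return min(worker_rate, generator_rate)    # the slower side bounds the system


def speedup(num_workers: int, task_time_us: float, gen_overhead_us: float) -> float:
    """Speedup relative to a single worker under the same model."""
    base = throughput(1, task_time_us, gen_overhead_us)
    return throughput(num_workers, task_time_us, gen_overhead_us) / base
```

With hypothetical values of 100 µs per task and 10 µs of generation overhead, this model scales linearly up to 10 workers and then flattens, since the generator cannot produce tasks any faster. When the task time is capped (as it is when the working set must fit in the Local Store), the only way to raise that ceiling in this model is to lower the generation overhead.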

Scalability on the Cell BE

Methodology

Scientific applications

Synthetic application

Task generation analysis

Simulation setup

Branch prediction

Cache size

Memory latency

Related work

Findings

Conclusions

Journal: Scientific Programming	Publication Date: Jan 1, 2009
Citations: 29	License type: CC BY 3.0
