Abstract

There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power-efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models, in which the programmer identifies parallel tasks and the runtime manages the inter-task dependencies, have been identified as a suitable model for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to a higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to fewer than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental factor in performance, even though the working set is not especially large. We expect a significant performance impact if task management used a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.

Highlights

  • Power consumption and design complexity have led the computer architecture community to design chip multiprocessors (CMP)

  • In this paper we have evaluated the performance of Cell Superscalar applications in terms of their scalability to Cell Broadband Engine Architecture (CBEA) implementations that include more Synergistic Processor Elements (SPEs)

  • We observe that the Synergistic Processor Unit (SPU) must fit its entire working set in the Local Store, which effectively limits the size of the tasks executed there and makes task generation overhead the limiting factor for scalability with the number of processors


Summary

Introduction

Power consumption and design complexity have led the computer architecture community to design chip multiprocessors (CMP). There are pure task-based parallel programming models such as Cell Superscalar [4] and Tagged Procedure Calls (TPC) [18]. In all these models, the task concept provides an intuitive abstraction that can be directly mapped to processing units, since it encapsulates computation and its working data set. Some applications (such as sparse linear algebra) do not use all the data in the working set, so enlarging the task size has a smaller impact. Despite this, the task execution time is proportional to the task size. Enlarging the task size to achieve more parallelism is not a possible solution for CBEA-compliant processors. This situation sets our focus on decreasing the task generation overhead, which becomes the critical factor for scalability and full resource utilization.
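To see why task generation overhead can cap scalability, the following minimal sketch may help. It is not taken from the paper. It assumes a single generator (standing in for the PPE) that creates tasks one at a time, identical workers (standing in for SPEs), and generation that overlaps fully with execution. The function names and the numbers are hypothetical and purely illustrative.

```python
def throughput(workers, task_exec_time, task_gen_time):
    """Steady-state tasks completed per unit time.

    Assumed model: one generator produces a task every `task_gen_time`,
    and each of `workers` identical workers needs `task_exec_time` per task.
    The slower of the two stages bounds the overall rate.
    """
    generation_rate = 1.0 / task_gen_time
    execution_rate = workers / task_exec_time
    return min(generation_rate, execution_rate)


def speedup(workers, task_exec_time, task_gen_time):
    """Speedup relative to a single worker under the same model."""
    base = throughput(1, task_exec_time, task_gen_time)
    return throughput(workers, task_exec_time, task_gen_time) / base


def saturation_point(task_exec_time, task_gen_time):
    """Worker count beyond which the generator becomes the bottleneck."""
    return task_exec_time / task_gen_time


if __name__ == "__main__":
    # Illustrative values only, not measurements from the paper.
    t_exec, t_gen = 40.0, 5.0
    print("saturation point:", saturation_point(t_exec, t_gen))
    for p in (1, 2, 4, 8, 16, 32):
        print(p, "workers -> speedup", round(speedup(p, t_exec, t_gen), 2))
```

Under this model, speedup grows linearly with the number of workers until it reaches the ratio of task execution time to task generation time, and then flattens. When the task size (and hence execution time) cannot grow, the only way to move the saturation point is to reduce the generation time.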

Scalability on the Cell BE
Methodology
Scientific applications
Synthetic application
Task generation analysis
Simulation setup
Branch prediction
Cache size
Memory latency
Related work
Findings
Conclusions
