Abstract

Commercial multicore central processing units (CPUs) integrate a number of processor cores on a single chip to support parallel execution of computational tasks. For independent parallel tasks, multicore CPUs can improve performance over single cores nearly linearly as long as sufficient bandwidth is available. Ideal speedup is, however, difficult to achieve when dense intercommunication between the cores or complex memory access patterns are required. This is caused by expensive synchronization and thread switching, and by insufficient latency tolerance. These factors push programmers away from straightforward parallel processing patterns toward complex and error-prone programming techniques. To address these problems, we have introduced the Thick Control Flow (TCF) Processor Architecture. A TCF is an abstraction of parallel computation that combines self-similar threads into computational entities. In this paper, we compare the performance and programmability of an entry-level TCF processor and two Intel Skylake multicore CPUs on commonly used parallel kernels to find out how well our architecture solves the issues that greatly reduce the productivity of parallel software development. Code examples are given and programming experiences recorded.

Highlights

  • Multicore Central Processing Units (CPUs) are the workhorses of modern general purpose computing devices, such as workstations, tablets and smartphones

  • We evaluated an entry-level Thick Control Flow (TCF) processor, the Thick Control Flow Processor Architecture (TPA)-16, against Intel Skylake client and server multicore CPUs, the Core i7 and the Xeon W

  • The comparison was carried out by writing similar parallel programs for all processors using popular programming solutions (Pthreads, OpenMP and the baseline TCF language), measuring execution times with a clock-accurate simulator (TPA) and on actual computers (Skylake CPUs), and counting the active code lines of the programs; a hedged sketch of one such kernel follows this list
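
To make the methodology concrete, the following is a minimal sketch of the kind of kernel written for the comparison: an element-wise matrix sum (matsum) parallelized with Pthreads. This is an illustrative assumption, not the paper's benchmark code; the names N, NTHREADS, worker and range_t are invented for the example.

/* Hedged sketch of a Pthreads matsum kernel (illustrative only; not the
 * paper's benchmark code). Each worker sums a contiguous block of rows. */
#include <pthread.h>

#define N        1024          /* matrix dimension (assumed)  */
#define NTHREADS 8             /* worker count (assumed)      */

static double a[N][N], b[N][N], c[N][N];

typedef struct { int first_row, last_row; } range_t;

static void *worker(void *arg)
{
    range_t *r = (range_t *)arg;
    for (int i = r->first_row; i < r->last_row; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
    return NULL;
}

void matsum_pthreads(void)
{
    pthread_t tid[NTHREADS];
    range_t   rng[NTHREADS];
    int rows = N / NTHREADS;   /* assumes N divisible by NTHREADS */

    for (int t = 0; t < NTHREADS; t++) {
        rng[t].first_row = t * rows;
        rng[t].last_row  = (t + 1) * rows;
        pthread_create(&tid[t], NULL, worker, &rng[t]);
    }
    for (int t = 0; t < NTHREADS; t++)   /* joining acts as the barrier */
        pthread_join(tid[t], NULL);
}

Even in this trivial kernel, explicit thread creation, work partitioning and joining add code that has no counterpart in the textbook sequential version; this is the overhead that the active-code-line counts are meant to capture.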

Summary

Introduction

Multicore Central Processing Units (CPUs) are the workhorses of modern general purpose computing devices, such as workstations, tablets and smartphones. Programmers often cannot employ natural, straightforward parallel processing patterns, but have to replace them with more complex and error-prone structures [3], as our experiments confirm. This shows up as extra code lines compared to the textbook counterparts of matmul and matsum [4, 5]; on the TCF architecture, those textbook kernels (4–6 and 5–9 active code lines, respectively) reduce in both cases to a single code line containing just a parallel statement, with no for-loops and no explicit synchronization (see the sketch below). The fibers within a TCF are executed synchronously with respect to each other in order to simplify parallel programming.
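As a point of reference for the line counts above, here is the textbook matmul kernel in C; the loop structure and names are generic assumptions rather than the code measured in the paper. The closing comment only paraphrases how the paper describes the baseline TCF language version (a single parallel statement); the syntax hinted at there is purely hypothetical.

/* Textbook dense matrix multiplication, C = A * B (illustrative sketch).
 * This is the style of kernel whose active code lines are counted. */
void matmul(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* On TPA, the paper states that the same kernel collapses to a single
 * parallel statement with no for-loops and no explicit synchronization.
 * A purely hypothetical rendering (not actual baseline TCF language
 * syntax) might read roughly as:
 *     c[i][j] = reduce(+, k, a[i][k] * b[k][j]);
 */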

Related work
Contribution
Hardware architectures
Xeon W
Programming methodologies
Thick control flows
POSIX threads
OpenMP
Comparison
Quantitative measurements
Overall tests
OpenMP and sequential notation
The effect of access patterns
Factors of efficient programming
Programming experiences
Findings
Conclusions