The introduction of multicore/multithreaded processors, comprised of a large number of hardware contexts (virtual CPUs) that share resources at multiple levels, has made process scheduling, in particular assignment of running threads to available hardware contexts, an important aspect of system performance. Nevertheless, thread assignment of applications running on state-of-the art processors is an NP-complete problem. Over the years, numerous studies have proposed heuristic-based algorithms for thread assignment. Since the thread assignment problem is intractable, it is in general impossible to know the performance of the optimal assignment, so the room for improvement of a given algorithm is also unknown. It is therefore hard to decide whether to invest more effort and time to improve an algorithm that may already be close to optimal. In this paper, we present a statistical approach to the thread assignment problem. First, we present a method that predicts the performance of the optimal thread assignment, based on the observed performance of each thread assignment in a random sample. The method is based on Extreme Value Theory (EVT), a branch of statistics that analyses extreme deviations from the population mean. We also propose sample pruning, a method that significantly reduces the time required to apply the statistical method by reducing the number of candidate solutions that need to be measured. Finally, we show that, if no suitable heuristic-based algorithm is available, a sample of several thousand random thread assignments is enough to obtain, with high confidence, an assignment with performance close to optimal. The presented approach is architecture and application independent, and it can be used to address the thread assignment problem in various domains. It is especially well suited for systems in which the workload seldom changes. An example is network systems, which typically provide a constant set of services that are known in advance, with network applications performing a similar processing algorithm for each packet in the system. In this paper, we validate our methods with an industrial case study for a set of multithreaded network applications on an UltraSPARC T2 processor. This article is an extension of our previous work [44] , which was published in Proceedings of 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-2012).
Read full abstract