Current high-end parallel systems consist of hundreds of thousands of compute cores arranged in a complex hierarchical structure; future systems will have millions of cores. Systems such as the Altix 4700, Blue Gene, Roadrunner, and Cray XT5 combine multiple compute cores (homogeneous or heterogeneous) with several levels of shared and private caches within a processor, cluster these processors into SMP nodes, and couple the nodes via a communication network into large-scale distributed systems. Developing efficient programs for such machines is extremely complex because the architectural details are exposed to the programmer. Their productive use requires highly scalable programming tools for debugging, performance analysis, and fault tolerance; in addition, new programming models might significantly ease the programmer's task.

This special issue of Concurrency and Computation: Practice and Experience is devoted to programming tools that facilitate the development of efficient programs for such large-scale architectures. It collects the best papers submitted to the International Workshop on Scalable Tools for High-End Computing (STHEC 2008), held in conjunction with the International Conference on Supercomputing on June 7, 2008, on the Greek island of Kos.

The papers present state-of-the-art tools for performance analysis and checkpointing on these machines. Performance analysis tools use measurements gathered during the execution of an application to detect portions of the code that can be further improved; they must therefore cope with the large number of processors on which the application runs. Checkpointing tools make it possible to restart an application after a system failure and likewise have to handle a large number of cores.

The selected papers present different techniques for building tools that scale to thousands of cores. HPCToolkit [1] is a profiling-based performance analysis environment that presents its data in close relation to the source code without requiring source-code instrumentation. Scalasca [2] performs a parallel replay of the execution on the application's own processors to find performance bottlenecks automatically. The combination of TAU and MRNet [3] provides a scalable infrastructure for offloading performance data; establishing the overlay network requires no additional support from the job manager or the application. Periscope [4] is based on a network of analysis agents that perform an online analysis of the application's performance behavior; when the application is started, additional processors can be allocated for the analysis agents to scale the analysis. CPPC [5] is a tool for portable checkpointing of message-passing applications. It consists of a runtime library and a compiler that assists the user by performing time-consuming tasks such as data-flow and communication analyses as well as code instrumentation.

We would like to thank the authors for their excellent contributions to this special issue. We hope that it inspires future research on tools that support programmers of high-end systems in the development of efficient programs.
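
To illustrate the measurement-based approach shared by the performance tools discussed above, the following minimal sketch intercepts MPI_Send through the standard MPI profiling interface (PMPI) and records how much time each process spends sending messages. The sketch is purely illustrative; it is not taken from HPCToolkit, Scalasca, TAU, or Periscope, and the counters and output format are assumptions made for this example.

/* Illustrative PMPI wrapper: measures time spent in MPI_Send per process.
 * Not taken from any tool in this issue; names and output are illustrative. */
#include <mpi.h>
#include <stdio.h>

static double total_send_time = 0.0;  /* accumulated time in MPI_Send */
static long   send_calls      = 0;    /* number of MPI_Send invocations */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double start = PMPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm); /* real send */
    total_send_time += PMPI_Wtime() - start;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: %ld MPI_Send calls, %.3f s total\n",
            rank, send_calls, total_send_time);
    return PMPI_Finalize();
}

Linking (or preloading) such wrappers into an MPI application yields per-process measurements without modifying the application's source code; the tools in this issue gather far richer data and do so at much larger scale, but the underlying idea of measuring during execution is the same.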