Abstract

The FOX project, funded under the ASCR X-stack I program, developed systems software and runtime libraries for a new approach to the data and work distribution for massively parallel, fault oblivious application execution. Our work was motivated by the premise that exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today’s machines. To deliver the capability of exascale hardware, the systems software must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. Our OS research has prototyped new methods to provide efficient resource sharing, synchronization, and protection in a many-core compute node. We have experimented with alternative task/dataflow programming models and shown scalability in some cases to hundreds of thousands of cores. Much of our software is in active development through open source projects. Concepts from FOX are being pursued in next generation exascale operating systems. Our OS work focused on adaptive, application tailored OS services optimized for multi → many core processors. We developed a new operatingmore » system NIX that supports role-based allocation of cores to processes which was released to open source. We contributed to the IBM FusedOS project, which promoted the concept of latency-optimized and throughput-optimized cores. We built a task queue library based on distributed, fault tolerant key-value store and identified scaling issues. A second fault tolerant task parallel library was developed, based on the Linda tuple space model, that used low level interconnect primitives for optimized communication. We designed fault tolerance mechanisms for task parallel computations employing work stealing for load balancing that scaled to the largest existing supercomputers. Finally, we implemented the Elastic Building Blocks runtime, a library to manage object-oriented distributed software components. To support the research, we won two INCITE awards for time on Intrepid (BG/P) and Mira (BG/Q). Much of our work has had impact in the OS and runtime community through the ASCR Exascale OS/R workshop and report, leading to the research agenda of the Exascale OS/R program. Our project was, however, also affected by attrition of multiple PIs. While the PIs continued to participate and offer guidance as time permitted, losing these key individuals was unfortunate both for the project and for the DOE HPC community.« less

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.