Abstract

At the request of the Department of Energy (DOE) Office of Science (SC), Lawrence Livermore National Laboratory (LLNL) hosted a two-day Risk Management Techniques and Practice (RMTAP) workshop on September 18-19 at the Hotel Nikko in San Francisco. The purpose of the workshop, which was sponsored by the SC/Advanced Scientific Computing Research (ASCR) program and the National Nuclear Security Administration (NNSA)/Advanced Simulation and Computing (ASC) program, was to assess current and emerging techniques, practices, and lessons learned for effectively identifying, understanding, managing, and mitigating the risks associated with acquiring leading-edge computing systems at high-performance computing centers (HPCCs). Representatives from fifteen high-performance computing (HPC) organizations, four HPC vendor partners, and three government agencies attended the workshop. The overall workshop findings were: (1) Standard risk management techniques and tools are, in the aggregate, applicable to projects at HPCCs and are commonly employed by the HPC community; (2) HPC projects have characteristics that necessitate a tailoring of the standard risk management practices; (3) All HPCC acquisition projects can benefit from employing risk management, but the specific choice of risk management processes and tools is less important to the success of the project; (4) The special relationship between the HPCCs and HPC vendors must be reflected in the risk management strategy; (5) Best practices include developing a prioritized risk register with special attention to the top risks, establishing a practice of regular meetings and status updates with the platform partner, supporting regular and open reviews that engage the interests and expertise of a wide range of staff and stakeholders, and documenting and sharing the acquisition/build/deployment experience; and (6) Top risk categories include system scaling issues, request for proposal/contract and acceptance testing, and vendor technical or business problems.

HPC, by its very nature, is an exercise in multi-level risk management. Every aspect of stewarding HPCCs into the petascale era, from identification of the program drivers to the details of procurement actions and simulation environment component deployments, presents unprecedented challenges and requires effective risk management. The fundamental purpose of this workshop was to go beyond risk management processes as such and learn how to weave effective risk management practices, techniques, and methods into all aspects of migrating HPCCs into the next generation of leadership computing systems. This workshop was a follow-on to the Petascale System Integration Workshop hosted by Lawrence Berkeley National Laboratory (LBNL)/NERSC last year. It was intended to leverage and extend the risk management experience of the participants by looking for common best practices and unique processes that have been especially successful. This workshop assessed the effectiveness of tools and techniques that are or could be helpful in HPCC risk management, with a special emphasis on how practice meets process. As the saying goes: "In theory, there is no difference between theory and practice. In practice, there is." Finally, the workshop brought together a network of experts who shared information as technology moves into the petascale era and beyond.

