Holistic Measurement-Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying
Fault Detection, Propagation, and Impact
Principal Investigator: William Kramer, University of Illinois at Urbana-Champaign/National Center for Supercomputing Applications/Computer Science
Co-Principal Investigators: Ravishankar Iyer, University of Illinois at Urbana-Champaign/Electrical and Computer Engineering/Computer Science; Zbigniew Kalbarczyk, University of Illinois at Urbana-Champaign/Coordinated Science Laboratory; James Brandt, SNL; Nicholas J. Wright, LBNL/NERSC; Jim Lujan, LANL
Senior Investigators: James Botts, LBNL; Jeremy Enos, University of Illinois at Urbana-Champaign/NCSA; Joseph Fullop, University of Illinois at Urbana-Champaign/NCSA; Ann Gentile, SNL; Larry Kaplan, Cray; Cindy Martin, LANL; Catello Di Martino, University of Illinois at Urbana-Champaign/Coordinated Science Laboratory.
Extreme-scale systems have billions of hardware components and hundreds of millions of lines of software that must all work in a perfectly coordinated fashion. Unfortunately, extreme-scale systems, whether the largest supercomputers, cloud systems, or other types, have complex failure modes that reduce their ability to serve their intended purposes at the desired quality and reliability. Analysis of field data on current and past generations of extreme-scale computing and data-analysis systems has revealed multiple challenges that, if not addressed, may hinder the effectiveness of future computing systems. Using data from six centers that house many of the largest high-performance computing (HPC) resources in the world, this project will provide unparalleled insight into the resilience challenges of current and future systems. Production and experimental data will be analyzed to produce a rich understanding of failure modes and failure propagation, and of the mechanisms that detect and mitigate errors and faults. These insights will form the basis for the future instrumentation and tools needed to continue scaling high-performance computing and data-analysis systems. Resiliency and usage data collected in this project will be made available to other researchers. To achieve these objectives, we have assembled a team of world-renowned experts in resilient extreme-scale computing and analysis from the University of Illinois faculty and NCSA, SNL, LANL, NERSC, and Cray.