Public Abstract

DE-SC0014328: Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact

Award Status: Inactive

Institution: Board of Trustees of the University of Illinois, Champaign, IL
UEI: Y8CWNJRCNN91
DUNS: 041544081

Most Recent Award Date: 07/17/2018
Number of Support Periods: 3
PM: Pino, Robinson

Current Budget Period: 07/15/2017 - 07/14/2019
Current Project Period: 07/15/2015 - 07/14/2019
PI: Kramer, William

Supplement Budget Period: N/A

Public Abstract

Holistic Measurement-Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation, and Impact

Principal Investigator: William Kramer, University of Illinois at Urbana-Champaign/National Center for Supercomputing Applications/Computer Science

Co-Principal Investigators: Ravishankar Iyer, University of Illinois at Urbana-Champaign/Electrical and Computer Engineering/Computer Science; Zbigniew Kalbarczyk, University of Illinois at Urbana-Champaign/Coordinated Science Laboratory; James Brandt, SNL; Nicholas J. Wright, LBNL/NERSC; Jim Lujan, LANL.

Senior Investigators: James Botts, LBNL; Jeremy Enos, University of Illinois at Urbana-Champaign/NCSA; Joseph Fullop, University of Illinois at Urbana-Champaign/NCSA; Ann Gentile, SNL; Larry Kaplan, Cray; Cindy Martin, LANL; Catello Di Martino, University of Illinois at Urbana-Champaign/Coordinated Science Laboratory.

Extreme-scale systems have billions of hardware components and hundreds of millions of lines of software that must all work in a perfectly coordinated fashion. Unfortunately, extreme-scale systems, whether the largest supercomputers, cloud systems, or other types, have complex failure modes that impair their ability to serve their intended purposes at the desired quality and reliability. Analysis of field data on current and past generations of extreme-scale computing and data-analysis systems has revealed multiple challenges that, if not addressed, may hinder the effectiveness of future computing systems. Using data from six centers that house many of the largest high-performance computing (HPC) resources in the world, this project will provide unparalleled insight into the resilience challenges of current and future systems. Production and experimental data will be analyzed to produce a rich understanding of failure modes and propagation, and of detection/mitigation mechanisms for errors and faults. These insights will form the basis for the future instrumentation and tools necessary to continue scaling high-performance computing and data-analysis systems. Resiliency and usage data collected in this project will be made available to other researchers. To achieve these objectives, we have assembled a team of world-renowned experts in resilient extreme-scale computing and analysis from the University of Illinois faculty and NCSA, SNL, LANL, NERSC, and Cray.