Public Abstract

DE-SC0022317: Reliable and Efficient Machine Learning for Leadership Facility Scientific Data Analytics

Award Status: Active

Institution: The Trustees of Columbia University in the City of New York (Morningside Campus), New York, NY
UEI: F4N1QNPB95M4
DUNS: 049179401

Most Recent Award Date: 07/27/2023
Number of Support Periods: 3
PM: Lee, Steven

Current Budget Period: 09/01/2023 - 08/31/2024
Current Project Period: 09/01/2021 - 08/31/2024
PI: Du, Qiang

Supplement Budget Period: N/A

Public Abstract

We propose to establish the mathematical foundations of the prioritized scientific machine learning (SciML) methods for extracting interpretable information from scientific data, inferring physical laws, and steering experiments toward scientific discovery. The US Department of Energy's (DOE) scientific user facilities generate a deluge of dynamic experimental data every day. However, our ability to extract interpretable information from this massive dynamic data lags far behind our ability to generate it. Advances in machine learning have had revolutionary effects on large-scale data analytics in the business world, but it is challenging to turn a machine learning method that succeeds in commercial use into an effective SciML method for scientific use. Thus, a new class of mathematically rigorous, computationally efficient, and reliable SciML methods is required for real-time scientific data analytics to expedite the pace of scientific discovery.

Scientific data analytics is deeply embedded in the scientific discovery process, which involves (1) extracting interpretable features from raw experimental data, (2) inferring unknown physics (i.e., unmasking hidden dynamics), and (3) designing and steering a series of experiments to achieve a scientific goal. Despite many promising efforts in these directions, mathematical analysis lags far behind SciML algorithm development. Moreover, scientific data analytics requires rigorous uncertainty quantification (UQ) to ensure that we obtain the right answer for the right reason and that the methods are sufficiently robust to be deployed at the user facilities. To address these challenges, this project will focus not only on developing novel SciML methods that can be used in practice to analyze massive scientific data, but also on establishing the mathematical analysis of these methods. Our research objectives include the following: (1) Develop reliable and efficient feature extraction methods for both high-frequency, high-resolution dynamic data and high-dimensional functional data collected at DOE's user facilities; (2) Develop mathematical foundations for neural network-based dynamics discovery models and for the stochastic back-propagation algorithms used to train them; (3) Develop goal-oriented data assimilation methods for dynamic experimental design, i.e., optimally designing and steering a series of experiments to achieve a desired scientific goal.

To motivate, illustrate, and evaluate our new methodologies, we will apply them to neutron scattering data generated at the Spallation Neutron Source (SNS) and High Flux Isotope Reactor (HFIR) facilities, and to in situ scanning transmission electron microscopy (STEM) data generated at the Center for Nanophase Materials Sciences (CNMS). We chose these datasets not only to demonstrate how the proposed SciML algorithms and mathematical analysis can help address current, urgent needs for advanced data analytics at DOE's user facilities, but also to show the critical role of the proposed research in establishing self-driving user facilities in the near future.