Public Abstract

DE-SC0025473: Infrastructure and Application Aware Reduction Methods for Scientific Data

Award Status: Active

Institution: New York University, New York, NY
UEI: NX9PXMKW5KW8
DUNS: 041968306

Most Recent Award Date: 09/12/2024
Number of Support Periods: 1
PM: Finkel, Hal

Current Budget Period: 09/01/2024 - 08/31/2025
Current Project Period: 09/01/2024 - 08/31/2027
PI: Vanden-Eijnden, Eric

Supplement Budget Period: N/A

Public Abstract

INFRASTRUCTURE AND APPLICATION AWARE REDUCTION METHODS FOR SCIENTIFIC DATA
R. Archibald, Oak Ridge National Laboratory (Principal Investigator)
A. Gelb, Dartmouth College (Co-Investigator)
L. Rebholz, Clemson University (Co-Investigator)
E. Vanden-Eijnden, New York University (Co-Investigator)

The objective of this project is to extend recent developments in compressive sensing, statistical
machine learning, and data assimilation to design data reduction methods uniquely suited to
the emerging US Department of Energy (DOE) infrastructure being built for interconnected science
and for the analysis of computational, experimental, and observational data. A critical feature of
scientific data, in contrast to other types of data, is that (possibly unknown) physical principles
lie at its foundation. Hence the reduction methods in this proposal seek to optimize for known
physical properties as well as to discover the underlying characteristics of unknown physical
properties. Importantly, while DOE scientific research is constantly growing in size and scale,
there are also efforts to move toward a new paradigm of more connected, collaborative,
autonomous, and real-time environments. This shift creates opportunities and challenges for data
reduction, which the proposed research aims to address.
As they relate to the proposed work, key data reduction challenges facing the DOE include: (1) the
incorporation of known or discovered scientific information into data reduction; (2) progressive
data reduction with tight error bounds near the point of generation, taking advantage of
similarities across the interconnected infrastructure to optimize information flow; (3) uncertainty
quantification for data with noise, error, or missing elements; and (4) the ability to
use new computing architectures effectively, both centralized and at the edge, in order to
accelerate analysis through computation on reduced data. We propose the following three research
thrusts for developing methods in compressed sensing (CS), statistical/machine learning (ML), and
continuous data assimilation (CDA) to address these data reduction challenges at the DOE.
Thrust 1: CS Methods for Data Reduction on Distributed Data in Scientific Ecosystems. We have
developed new CS frameworks, adapted to the challenges of scientific data reduction, that can
preserve structure and known properties (prior information). We will derive new CS methods that can
optimize compression on distributed data across the DOE complex, providing progressively ranked
information and tight error estimation that can be used to accelerate end-point analysis,
optimize network communication, order data storage, and prioritize information for streaming.
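To make the compressed sensing idea concrete, the following is a minimal sketch, not the project's method: it recovers a sparse signal from a small number of random linear measurements using iterative soft thresholding (ISTA), a standard textbook solver for the l1-regularized least-squares problem. The signal length, measurement count, sparsity level, Gaussian sensing matrix, and regularization weight are all assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=2000):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)            # gradient of the least-squares term
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(0)
n, m, k = 200, 80, 5                        # signal length, measurements, nonzeros (illustrative)

# Hypothetical sparse "signal": k nonzero entries with magnitudes in [1, 2]
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))

A = rng.standard_normal((m, n)) / np.sqrt(m)  # random Gaussian sensing matrix (assumed)
y = A @ x_true                                # only m < n numbers are stored or transmitted

x_hat = ista(A, y, lam=0.01)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"kept {m} of {n} values, relative reconstruction error = {rel_err:.3e}")
```

In this toy setting the reduced representation is the measurement vector y, and the prior information is simply sparsity; the methods described in the thrust target richer structure and distributed data.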
Thrust 2: Statistical/ML Reduction Methods for Scientific Data. This team has developed statistical
interpolation/generative models and homotopy methods that can reduce data size and dimension while
preserving the statistical properties of the original data. We will develop progressive statistical
data reduction (SDR) to address the DOE challenges of storage/transmission and to accelerate machine
learning and analysis.
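As a generic stand-in for what "progressive" reduction with a known error can look like, the sketch below uses principal component analysis computed via a truncated SVD. This is not the SDR approach proposed here; it is a classical technique, used only because its reconstruction error is known exactly: by the Eckart-Young theorem, the Frobenius-norm error of a rank-r truncation equals the square root of the sum of the squared discarded singular values. The synthetic data and the chosen ranks are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "snapshots" (assumed): 300 samples of a 50-dimensional field whose
# variance decays across directions, then mixed by a random rotation.
n_samples, n_features = 300, 50
scales = 1.0 / np.arange(1, n_features + 1)
rotation, _ = np.linalg.qr(rng.standard_normal((n_features, n_features)))
X = (rng.standard_normal((n_samples, n_features)) * scales) @ rotation

mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

def reconstruct(rank):
    """Rebuild the data from the mean and the leading `rank` components."""
    return mean + (U[:, :rank] * s[:rank]) @ Vt[:rank]

ranks = [1, 5, 10, 25, 50]
actual_errors, predicted_errors = [], []
for rank in ranks:
    actual = np.linalg.norm(X - reconstruct(rank))   # measured Frobenius error
    predicted = np.sqrt(np.sum(s[rank:] ** 2))       # error known before reconstructing
    actual_errors.append(actual)
    predicted_errors.append(predicted)
    print(f"rank {rank:2d}: actual error {actual:.4e}, predicted {predicted:.4e}")
```

Components can be sent one at a time in order of importance, and the receiver knows the remaining error at every stage from the singular values alone, which is the progressive behavior the text refers to.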
Thrust 3: CDA to Reduce Required Amount of Simulation Data. We have developed temporal/spatial CDA
methods for data reduction of streaming scientific simulations. We will develop new CDA methods
that will allow local users to interact with and analyze high-resolution simulations run at
leadership computing facilities (LCFs), given their limited data transfer and computing budgets.
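The idea behind continuous data assimilation can be illustrated with a small nudging experiment. This is a sketch under stated assumptions, not the project's method: the Lorenz-63 system stands in for a large simulation, only its x component is "transferred" to the local user, and the nudging strength mu, the time step, and the initial states are illustrative choices. The local model adds a relaxation term -mu * (x_local - x_observed) to its own x equation and is expected to synchronize with the full reference state.

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0    # classical Lorenz-63 parameters

def lorenz(v):
    """Right-hand side of the Lorenz-63 system."""
    x, y, z = v
    return np.array([SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z])

def coupled_rhs(state, mu):
    """Reference system u and nudged local copy v, integrated together."""
    u, v = state[:3], state[3:]
    du = lorenz(u)
    dv = lorenz(v)
    dv[0] -= mu * (v[0] - u[0])             # nudge using the observed x component only
    return np.concatenate([du, dv])

def rk4_step(state, dt, mu):
    """One classical fourth-order Runge-Kutta step."""
    k1 = coupled_rhs(state, mu)
    k2 = coupled_rhs(state + 0.5 * dt * k1, mu)
    k3 = coupled_rhs(state + 0.5 * dt * k2, mu)
    k4 = coupled_rhs(state + dt * k3, mu)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

dt, n_steps, mu = 0.005, 4000, 50.0         # illustrative values (final time T = 20)

# Reference starts at (1, 1, 1); the local copy starts from a deliberately wrong state.
state = np.array([1.0, 1.0, 1.0, -5.0, 5.0, 20.0])
errors = [np.linalg.norm(state[3:] - state[:3])]
for _ in range(n_steps):
    state = rk4_step(state, dt, mu)
    errors.append(np.linalg.norm(state[3:] - state[:3]))

print(f"initial state error: {errors[0]:.3e}")
print(f"final state error:   {errors[-1]:.3e}")
```

Only one of the three state variables crosses the "network" here, yet the local copy is driven toward the unobserved y and z components as well, which is the sense in which assimilation reduces the amount of simulation data that must be moved.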
This team consists of experts in these three thrust areas from Clemson University, Dartmouth
College, New York University, and Oak Ridge National Laboratory. The diverse aspects of scientific
data will require different approaches to reduction, and this proposal is designed to develop a
variety of reduction methods for this purpose. We explain in this proposal the unique
connections between the thrusts, where the methods developed in each thrust can complement each
other. We will demonstrate our methods on applications that span the domain of
scientific data: neutron and X-ray light facility data, climate science observational and
simulation data, and LCF fluid/gas/plasma simulation data.