Open and FAIR Fusion for Machine Learning Applications
Cristina Rea (Lead PI), Saskia Mordijck, Aleksandar Jelenak,
Stephanie Diem, Evdokiya Kostadinova
Massachusetts Institute of Technology, William & Mary, The HDF Group,
University of Wisconsin - Madison, Auburn University
Initial forays into leveraging the computing capabilities of Machine Learning (ML) for magnetically confined plasmas have shown tremendous potential. ML thrives on large, well-documented, curated datasets that are easily accessible. Most ML projects in Magnetic Fusion Energy (MFE) encounter the same challenges: data structures that are not equipped to handle the scales of I/O required for ML applications, limited metadata, and a lack of Open and FAIR workflows and available databases. US experimental MFE databases at various user facilities can be accessed only after signing a user agreement, and using them involves a steep learning curve, with limited documentation and limited access to existing workflows.
This research aims to reduce these challenges by developing a Fusion Data Platform for Machine Learning, with a focus on MFE data, that will explicitly adhere to Findable, Accessible, Interoperable, Reusable (FAIR) and Open Science (OS) guidelines. To develop this Data Platform, the scope of work will focus on:
- Redefining an appropriate metadata structure that matches FAIR/OS principles for MFE data and that is suitable for ML workflows,
- Developing FAIR/OS workflows to curate and augment labeled data for classification of relevant events from multiple US MFE devices,
- Making publicly available selected experimental and simulation data,
- Diversifying workforce skills through interdisciplinary education of students and junior scientists in fusion and ML tasks.
The multi-institutional team will focus on four main research topics to develop a Fusion Data Platform for Machine Learning applications: (1) MDSplusML, (2) FAIR Workflows, (3) Open Databases, and (4) Student Engagement.
MFE devices participating in this research are Alcator C-Mod, Pegasus-III, CTH and HBT-EP. An interoperable and publicly available library will be developed leveraging data from these devices. The library will have built-in pipelines for ML application design, allowing preservation of reproducible scientific results.
The team will also expand interdisciplinary student engagement by designing an intensive two-week summer school on data science and fusion for undergraduate students, followed by funded summer research. Hands-on training will leverage the databases developed for the physics-based use cases of this work. The summer school will be hosted at William & Mary and will be essential to growing a new interdisciplinary workforce for ML and fusion science.