Public Abstract

DE-SC0024559: Scalable and Resilient Modeling for Federated-Learning-Based Complex Workflows

Award Status: Active

Institution: University of Iowa, Iowa City, IA
UEI: Z1H9VJS8NG16
DUNS: 062761671

Most Recent Award Date: 08/04/2025
Number of Support Periods: 3
PM: Finkel, Hal

Current Budget Period: 07/01/2025 - 06/30/2026
Current Project Period: 07/01/2023 - 06/30/2028
PI: Li, Guanpeng

Supplement Budget Period: N/A

Public Abstract

This research proposal addresses the critical need for a scalable and resilient federated learning (FL) simulation and modeling system for edge-computing-related scientific research and exploration. Federated learning is becoming an essential technique for machine learning (ML) on edge devices because the sheer volume of raw data these devices generate requires real-time, effective processing at the edge. The processed data, which carries intelligent information, must be encrypted for privacy protection, making federated learning the best solution for building a well-trained model across decentralized smart edge devices with secure and efficient data-sharing policies. Despite the availability of open-source federated learning frameworks, understanding the scalability and robustness of federated learning systems remains non-trivial because of the complex workflow processing involved. Researchers and developers therefore need a scalable and resilient federated learning simulation and modeling system for proof-of-concept implementations and performance validation before deploying and testing their machine learning models in the real world.

In this work, we propose a scalable and resilient federated learning simulation and modeling system called SR-APPFL. Leveraging and extending the capabilities of Argonne Privacy-Preserving Federated Learning (APPFL), SR-APPFL efficiently supports the simulation and modeling of complex federated learning workflows and tackles the scalability and resilience challenges that arise during federated learning. We will significantly enhance the scalability and resilience of APPFL by leveraging state-of-the-art libraries, including Decaf, communication runtimes (gRPC and MPICH), and error-bounded lossy compression, together with our expertise in distributed systems, fault/error injection and analysis, and fault tolerance techniques.

The proposed system offers substantial benefits to researchers and developers working on real-world federated learning systems. It gives them a platform for proof-of-concept implementations and performance validation, both crucial steps before deploying and testing machine learning models in real-world scenarios. Furthermore, the proposed system will have scientific impact on DOE-mission-based applications such as scientific machine learning and critical infrastructure, where data privacy is a significant concern.
