

 

DE-SC0024559: Scalable and Resilient Modeling for Federated-Learning-Based Complex Workflows

Award Status: Active
  • Institution: University of Iowa, Iowa City, IA
  • UEI: Z1H9VJS8NG16
  • DUNS: 062761671
  • Most Recent Award Date: 05/29/2024
  • Number of Support Periods: 2
  • PM: Finkel, Hal
  • Current Budget Period: 07/01/2024 - 06/30/2025
  • Current Project Period: 07/01/2023 - 06/30/2028
  • PI: Li, Guanpeng
  • Supplement Budget Period: N/A
 

Public Abstract

This research addresses the critical need for a scalable and resilient federated learning (FL) simulation and modeling system for edge-computing-related scientific research and exploration. Federated learning is becoming an essential technique for machine learning (ML) on edge devices, because the sheer volume of raw data these devices generate requires effective, real-time processing on the devices themselves. The processed data, which carries intelligent information, must be encrypted for privacy protection, making federated learning the best solution for building a well-trained model across decentralized smart edge devices under secure and efficient data-sharing policies. Although open-source federated learning frameworks exist, understanding the scalability and robustness of federated learning systems remains non-trivial because of the complex workflow processing involved. Researchers and developers therefore need a scalable and resilient federated learning simulation and modeling system for proof-of-concept implementations and performance validation before they deploy and test their machine learning models in the real world.

 

In this work, we propose a scalable and resilient federated learning simulation and modeling system called SR-APPFL. Leveraging and extending the capabilities of Argonne Privacy-Preserving Federated Learning (APPFL), SR-APPFL efficiently supports the simulation and modeling of complex federated learning workflows. It also tackles the scalability and resilience challenges that arise during federated learning. We will significantly enhance the scalability and resilience of APPFL by leveraging state-of-the-art libraries and techniques, including Decaf, communication runtimes (gRPC and MPICH), and error-bounded lossy compression, together with our expertise in distributed systems, fault/error injection and analysis, and fault tolerance techniques. The proposed system offers substantial benefits to researchers and developers working on real-world federated learning systems: it gives them a platform for proof-of-concept implementations and performance validation, both crucial steps before deploying and testing machine learning models in real-world scenarios. Furthermore, the proposed system will have scientific impact on DOE-mission-based applications, such as scientific machine learning and critical infrastructure, where data privacy is a significant concern.


