Application Co-Design for an AI Framework for Exascale
One of the CoE RAISE goals is to design, implement, and evaluate an AI framework that is ready for future Exascale HPC systems (see ‘AI at Exascale’). This framework is an enabler for highly scalable applications that accelerate scientific discovery and advance engineering in a wide variety of domains. It is co-designed with the RAISE Use Cases from natural sciences and engineering (see ‘Use Cases’). The application co-design process of the AI framework follows established software engineering methodologies, starting with Fact Sheets and followed by a more intensive requirements analysis via Interaction Rooms.
Fact Sheets foster an initial understanding of which components are relevant in the different Use Cases, and at which level. Figure 1 shows that those components include various software aspects required by the applications, such as necessary libraries, software codes, datasets, or container technologies. The hardware infrastructure aspects complement the overview with components such as specific HPC systems, GPU/CPU setups, innovative memory hierarchies, or data storage systems. As the Fact Sheets evolve, they act as living documents in CoE RAISE. They support the software engineering process driven by multi-disciplinary teams towards designing and implementing the AI framework. In some cases, Fact Sheets are part of scientific publications [1,2] and clarify the software and hardware used. Additionally, they help to explain the application Use Cases in CoE RAISE presentations to an audience outside the RAISE consortium.
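As a rough illustration of the structure described above, a Fact Sheet can be thought of as a record with one list per software and hardware aspect. The sketch below is not a CoE RAISE format: the class name `FactSheet`, its field names, and the helper `missing_sections` are hypothetical, and the example entry is a placeholder rather than real Use Case data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FactSheet:
    """Hypothetical record of one Use Case's components (not an official schema)."""

    use_case: str
    # Software aspects named in the text
    libraries: List[str] = field(default_factory=list)
    software_codes: List[str] = field(default_factory=list)
    datasets: List[str] = field(default_factory=list)
    containers: List[str] = field(default_factory=list)
    # Hardware infrastructure aspects named in the text
    hpc_systems: List[str] = field(default_factory=list)
    processor_setups: List[str] = field(default_factory=list)
    memory_hierarchies: List[str] = field(default_factory=list)
    data_storage: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of all component lists that are still empty."""
        return [
            name
            for name, value in vars(self).items()
            if isinstance(value, list) and not value
        ]


# Placeholder entry, purely illustrative
sheet = FactSheet(use_case="remote sensing", libraries=["example-library"])
open_sections = sheet.missing_sections()
```

Because a Fact Sheet is a living document, a helper such as `missing_sections` could show a team which aspects still need to be filled in during the next iteration.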
Figure 1: Fact Sheet example of a co-design Use Case in the application area of remote sensing [1].
Another goal of CoE RAISE is knowledge and technology transfer to industry, with many commercial members in the consortium (see ‘Partners’). As shown in Figure 1, the Fact Sheets also demonstrate the bigger picture of Use Cases in delivering interoperability with cloud computing vendors and critical commercial technologies such as containers. The first iteration of Fact Sheets is almost finished. They will be available on the CoE RAISE web page and will include initial results from the Interaction Room process recently started in CoE RAISE.
The Fact Sheet process was received very positively by all members of the Use Case teams in CoE RAISE. It significantly enhanced the understanding of the Use Cases across the multi-disciplinary teams. The groups also noticed the massive complexity of designing an AI framework for Exascale, which requires skills from many domains, including software engineering, AI and HPC expertise, and application domain know-how. CoE RAISE and its Use Case teams have to overcome various communication challenges, because each area uses its own terminology and misunderstandings arise easily. Hence, CoE RAISE requires a systematic interaction methodology to succeed in the co-design process with the nine Use Cases and potentially other future external Use Cases that will adopt the same methods for moving towards ‘AI at Exascale’.
Figure 2: Interaction Room process example of using MURAL-based whiteboards in Zoom meetings.
CoE RAISE addresses the challenges mentioned above with the Interaction Room methodology, which is used in industry to design enterprise management systems. It was adapted by Book et al. for HPC environments [3] and will be advanced towards AI within the CoE RAISE project. The Interaction Room technique facilitates interdisciplinary collaboration in complex software projects and aims to improve the cooperation and communication between the Use Case experts from different disciplines.
Figure 2 shows an example of the Interaction Room process in the CoE RAISE Use Case Sound Engineering. It is driven by a collaboration between the Icelandic Acoustic and Tactile Engineering (ACUTE) lab, the Icelandic start-up company Treble, and Forschungszentrum Jülich. The Problem Canvas supports the understanding of the larger research question of the Use Case, including boundary conditions, abstractions, assumptions, quality requirements, and the goal of using AI. The use of AI varies between Use Cases and may include surrogate models, simulation intertwined with AI to learn parameter sets, post-processing, or data-intensive workflows. The Data Canvas is used to clarify available datasets for AI methods, such as datasets for training, testing, and validation, including approaches for feature engineering or data format conversions. The Model Canvas is used to identify relevant AI models in the Use Cases. This includes image-based models such as Convolutional Neural Networks (CNNs) or sequence-based models, e.g., Gated Recurrent Units (GRUs) or Long Short-Term Memory (LSTM) networks for time-series datasets. Finally, the Architecture Canvas maps the Model Canvas and Data Canvas items to specific HPC machines that provide the necessary libraries and software. CoE RAISE has just started the Interaction Room process with all the Use Case teams and will offer another news item on its results.
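The mapping step of the Architecture Canvas can be pictured as a simple matching problem: each Model Canvas or Data Canvas item needs certain libraries or capabilities, and each HPC machine provides some. The following is a minimal sketch under that assumption, not the method used in CoE RAISE. The names `CanvasItem`, `Machine`, and `map_items_to_machines`, as well as all item, machine, and capability labels, are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class CanvasItem:
    """Hypothetical Model Canvas or Data Canvas item and what it needs."""
    name: str
    needs: FrozenSet[str]


@dataclass(frozen=True)
class Machine:
    """Hypothetical HPC machine and the libraries/capabilities it provides."""
    name: str
    provides: FrozenSet[str]


def map_items_to_machines(
    items: List[CanvasItem], machines: List[Machine]
) -> Dict[str, List[str]]:
    """For each item, list the machines that provide everything it needs."""
    return {
        item.name: [m.name for m in machines if item.needs <= m.provides]
        for item in items
    }


# Placeholder canvas items and machines, purely illustrative
items = [
    CanvasItem("cnn-model", frozenset({"dl-framework", "gpu"})),
    CanvasItem("training-dataset", frozenset({"parallel-filesystem"})),
]
machines = [
    Machine("machine-a", frozenset({"dl-framework", "gpu", "parallel-filesystem"})),
    Machine("machine-b", frozenset({"parallel-filesystem"})),
]
placement = map_items_to_machines(items, machines)
```

In this toy setup, an item that ends up with an empty machine list would indicate a missing library or hardware capability to discuss in the Interaction Room.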
Selected References
[1] M. Riedel et al., Practice & Experience in using Parallel & Scalable Machine Learning with Heterogeneous Modular Supercomputing Architectures, in proceedings of IEEE IPDPS, 2021
[2] M. Riedel, G. Cavallaro, J.A. Benediktsson, Practice and Experience in Using Parallel and Scalable Machine Learning in Remote Sensing from HPC over Cloud to Quantum Computing, in proceedings of the IGARSS 2021 Conference, to appear
[3] M. Book, M. Riedel, H. Neukirchen, M. Goetz, Facilitating Collaboration in High-Performance Computing Projects with an Interaction Room, in proceedings of the 4th ACM SIGPLAN International Workshop on Software Engineering for Parallel Systems (SEPS 2017), October 22-27, 2017, Vancouver, Canada