HPC challenges for new extreme scale applications
Exascale machines are now available, based on several different arithmetics (from 64-bit down to 16- and 8-bit formats, including mixed-precision versions and some that are no longer IEEE-standard) and on different architectures (with network-on-chip processors and/or accelerators). Some execution and programming paradigms are being rehabilitated, such as data-flow models and data parallelism without shared memory. Brain-scale applications, from machine learning and AI for example, manipulate many huge graphs that lead to very sparse non-symmetric linear algebra problems, resulting in performance closer to the HPCG benchmark than to the LINPACK one.
End-users and scientists face many challenges associated with these evolutions and with the increasing size of the data. The convergence of data science (big data) and computational science to develop new applications generates important challenges.
This two-day workshop aims to bring together senior scientists in the field of HPC and in some of its major applications, to brainstorm on these challenges and propose potential research collaborations. The number of invited participants is expected to be fewer than 35 (mainly from Asia, the USA and Europe). We plan to organize panels, combined with some talks.
|Deputy to the Director of Fundamental Research at CEA, in charge of HPC||Research Professor and Research Director, USC Information Sciences Institute||Professor, Supercomputing Division, Information Technology Center, The University of Tokyo||Professor, Université de Lille; Centre de Recherche en Informatique, Signal et Automatique de Lille, CNRS|
DATE & VENUE
From March 6 to March 7, 2023
9h30 - 16h00
Hôtel Pullman Paris Montparnasse
19 Rue du Commandant René Mouchotte, 75014 Paris
|8h30 – 8h45||Welcome|
|8h45 – 10h15||Chairperson: Christophe Calvin / CEA|
|8h45||HPC challenges and new computing frontiers||Serge Petiton (U. Lille)|
|9h15||Innovative Supercomputing by Integration of Simulation/Data/Learning||Kengo Nakajima (U. Tokyo/RIKEN)|
|9h45||Intelligent Simulations Will Demand New Extreme-scale Computing Capabilities||Ian Foster (U. Chicago)|
|10h15 – 10h45||Coffee and tea break|
|10h45 – 12h45||Chairperson: Kengo Nakajima|
|10h45||Exascale challenge: An overview of the French NumPeX project and of some challenges in Astronomy, Earth and Environmental sciences||Jean-Pierre Vilotte (CNRS-INSU, IPGP)|
|11h15||A challenge of exploiting low precision computing in iterative linear solvers||Takeshi Fukaya (U. Hokkaido)|
|11h45||ML/AI Research Directions within the US Department of Energy||Osni Marques (LBNL)|
|12h15||Large-Scale Graph Neural Networks for Real-World Industrial Applications||Toyotaro Suzumura (U. Tokyo)|
|12h45 – 13h30||Lunch, on site|
|13h30 – 15h30||Chairperson: France Boillod-Cerneux / CEA|
|13h30||Modeling a novel laser-driven electron accelerator concept: Particle-In-Cell simulations at the exascale||Neïl Zaim (CEA)|
|14h00||Towards Integrated Hardware/Software Ecosystems for the Edge-Cloud-HPC Continuum: the Transcontinuum Initiative||Gabriel Antoniu (INRIA)|
|14h30||Tiny, Tiny Tasks – Huge Impact||Ivo Kabadshow (FZJ)|
|15h00||Heterogeneous system for exascale using h3-Open-SYS/WaitIO||Shinji Sumimoto (U. Tokyo)|
|15h30 – 16h00||Coffee and tea break|
|16h00 – 17h30||Chairperson: Nahid Emad / U. Paris-Saclay|
|16h00||How much can we really compress scientific data?||Franck Cappello (ANL)|
|16h30||Programming Systems for Heterogeneous Memory Architectures||Christian Terboven (RWTH)|
|17h00||FasTensor: Efficient Tensor Computation for Large-Scale Data Analysis||Kesheng Wu (LBNL)|
|19h30||Dinner « Chez Françoise »|
|09h30 – 10h00||Discussions||Moderator: Kengo Nakajima|
|10h00 – 12h00||Chairperson: France Boillod-Cerneux|
|10h00||Exascale challenges and opportunities for fundamental research||Christophe Calvin (CEA)|
|10h30||Multi-Hybrid Device Programming and Application by Uniform Language||Taisuke Boku (U. Tsukuba)|
|11h00||France within the international Exascale ecosystem||France Boillod-Cerneux (CEA)|
|11h30||Development of a heterogeneous coupling library h3-Open-UTIL/MP||Takashi Arakawa (U. Tokyo)|
|12h00 – 14h00||Lunch break (on site for in-person attendees)|
|14h00 – 16h30||Chairperson: Osni Marques|
|14h00||Towards Next JCAHPC System||Toshihiro Hanawa (U. Tokyo)|
|14h30||Hybrid AI/HPC Approaches and Linear Algebra||Nahid Emad (UVSQ)|
|15h00||Extreme Scale, Tissue Analytics and AI||Joel Saltz (Stony Brook U.)|
|15h30||Living in a Heterogeneous World: How scientific workflows bridge diverse cyberinfrastructure||Ewa Deelman (USC)|
|16h00 – 17h00||Discussions||Moderator: Serge Petiton|
HPC challenges and new computing frontiers | Serge G. Petiton, U. of Lille
Existing exascale supercomputers have been designed primarily for computational science, not for machine learning and AI. The increase in the number of nodes, on the one hand, and network-on-chip processors, on the other, add two more levels of programming: the task graph at the top level, and distributed on-chip computing at the bottom level. In addition, the recent evolution of processors and arithmetics for new applications, maturing after the convergence of big data and HPC with machine learning, will generate post-exascale computing that redefines some programming and application-development paradigms.
In this talk, I review some results obtained for sparse linear algebra, both for iterative methods and for machine learning methods. I also discuss the evolutions we may face in combining computational science, data science and machine learning on future, faster supercomputers.
Innovative Supercomputing by Integration of Simulation/Data/Learning | Kengo Nakajima, The University of Tokyo/RIKEN R-CCS
Supercomputing is shifting from traditional simulation for computational science towards integration with data science and machine learning/AI. Since 2015, the Information Technology Center of the University of Tokyo (ITC/U.Tokyo) has been working on the “Big Data & Extreme Computing (BDEC)” project, aimed at new supercomputing through the integration of “Simulation/Data/Learning (S+D+L)”. In May 2021, Wisteria/BDEC-01, the first system of the BDEC project, began operation. Wisteria/BDEC-01 has a total peak performance of 33+ PF and consists of a simulation node group (Odyssey) of 7,680 A64FX nodes and a data/learning node group (Aquarius) equipped with 360 NVIDIA A100 GPUs. Some nodes of Aquarius are directly connected to the outside, so that real-time acquisition of observation data is also possible via SINET. Since 2019, we have been developing an innovative software platform, h3-Open-BDEC, that realizes the integration of (S+D+L), with the support of a Grant-in-Aid for Scientific Research (S). Integration of (S+D+L) is now being realized on Wisteria/BDEC-01. These activities are described in the talk, together with future perspectives.
Intelligent Simulations Will Demand New Extreme-scale Computing Capabilities | Ian Foster, University of Chicago
The search for ever more accurate and detailed simulations of physical phenomena has driven decades of improvements in both supercomputer architecture and computational methods. The next several orders of magnitude of improvement are likely to come, at least in part, from the use of machine learning and artificial intelligence methods to learn approximations of complex functions and to assist in navigating complex search spaces. Without any aspiration to completeness, I will review some relevant activities in this space and suggest implications for post-exascale research.
Exascale challenge: An overview of the French NumPeX project and of some challenges in Astronomy, Earth and Environmental sciences | Jean-Pierre Vilotte, CNRS-INSU, IPGP
In this presentation we will provide a brief overview of the new French project NumPeX, which addresses a science-driven exascale software stack for exascale-capable systems. This will be illustrated by some data-driven exascale challenges in the context of the Square Kilometre Array (SKA), the large observational infrastructure in radio astronomy, and of Earth-system and environment modelling, emphasising new data-driven exascale needs in HPC/HPDA/ML.
A challenge of exploiting low precision computing in iterative linear solvers | Takeshi Fukaya, Hokkaido University
Recently, the use of low-precision computing such as FP32 and FP16 has attracted much attention, and mixed-precision methods that can exploit it efficiently have been actively investigated in the field of numerical linear algebra. In this talk, we present our attempt at developing a mixed-precision iterative linear solver that exploits low-precision computing while providing a numerical solution as accurate as that of conventional methods using only FP64. We focus on the restarted GMRES method, the so-called GMRES(m) method, and introduce low-precision computing into it based on the underlying structure of the iterative refinement scheme. Through numerical experiments, we investigate the possibility of aggressively using low-precision computing in the GMRES(m) method and discuss issues for further performance improvement.
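The iterative-refinement structure that this approach builds on can be sketched as follows: the inner solve runs in low precision, while residuals and the accumulated solution stay in FP64. This is an illustrative sketch only, not the solver from the talk; an FP32 direct solve stands in for the GMRES(m) inner iteration, and all names are hypothetical.

```python
import numpy as np

def mixed_precision_refinement(A, b, inner_solve, tol=1e-12, max_iters=20):
    """Iterative refinement: residual and solution kept in FP64,
    the inner solve performed in low precision (here FP32)."""
    x = np.zeros_like(b, dtype=np.float64)
    for _ in range(max_iters):
        r = b - A @ x                                   # residual in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = inner_solve(A.astype(np.float32), r.astype(np.float32))
        x = x + d.astype(np.float64)                    # accumulate correction in FP64
    return x

# An FP32 direct solve stands in for the restarted GMRES(m) inner solver.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = mixed_precision_refinement(A, b, lambda A32, r32: np.linalg.solve(A32, r32))
```

The inner solve only needs to reduce the residual by a modest factor per sweep; as long as the residual and the accumulated solution are kept in FP64, the final accuracy can match a full FP64 solve.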
ML/AI Research Directions within the US Department of Energy | Osni Marques, LBNL
In this presentation, I will summarize some of the research directions within the US Department of Energy (DOE), specifically targeting ML and AI for science and energy applications. The presentation will be based on the report of the Basic Research Needs Workshop for Scientific Machine Learning – Core Technologies for Artificial Intelligence, held in 2018. The report discusses opportunities for research in and beyond the exascale era. I will also summarize efforts in DOE related to mixed-precision computations.
Large-Scale Graph Neural Networks for Real-World Industrial Applications | Toyotaro Suzumura, The University of Tokyo
A graph or network is a powerful data structure that can represent relationships between any entities in both the digital world and the physical world. The way of analyzing graphs has been advancing from algorithm-based approaches to data-driven approaches with machine learning and neural networks just like other types of data such as text, image, and speech. In this talk, I will talk about how graph neural networks have emerged as a powerful learning paradigm that backs up conventional graph algorithm-based approaches, and also introduce our ongoing research projects and collaborations with industry around graph neural networks such as recommendation. I will briefly introduce a nationwide cloud computing project called “mdx” as well as a nationwide materials informatics project named ARIM (Advanced Research Infrastructure for Materials and Nanotechnology).
Modeling a novel laser-driven electron accelerator concept: Particle-In-Cell simulations at the exascale | Neïl Zaim, CEA/DRF/IRAMIS/LYDL
Intense femtosecond lasers focused on low-density gas jets can accelerate ultra-short electron bunches up to very high energies (from hundreds of MeV to several GeV) over a few millimeters or a few centimeters. However, conventional laser-driven electron acceleration schemes do not provide enough charge for most of the foreseen applications. To address this issue, we have devised a novel scheme consisting of a gas jet coupled to a solid target to accelerate substantially more charge. In 2022 we validated this concept with proof-of-principle experiments at the LOA laser facility (France), and with a large-scale Particle-In-Cell simulation campaign, carried out with the open-source WarpX code. Performing such simulations requires the use of the most powerful supercomputers in the world, as well as advanced numerical techniques such as mesh refinement, which is very challenging to implement in an electromagnetic Particle-In-Cell code, and indeed unique to the WarpX code. A work describing the technical challenges that we addressed to make these simulations possible was awarded the Gordon Bell prize in 2022. In this contribution, we will also discuss the performance portability of the WarpX code by presenting scaling tests on Frontier, Fugaku, Summit, and Perlmutter supercomputers.
Towards Integrated Hardware/Software Ecosystems for the Edge-Cloud-HPC Continuum: the Transcontinuum Initiative | Gabriel Antoniu, INRIA
Modern use cases such as autonomous vehicles, digital twins, smart buildings and precision agriculture, greatly increase the complexity of application workflows. They typically combine physics-based simulations, analysis of large data volumes and machine learning and require a hybrid execution infrastructure: edge devices create streams of input data, which are processed by data analytics and machine learning applications in the Cloud, and simulations on large, specialised HPC systems provide insights into and prediction of future system state. All of these steps pose different requirements for the best suited execution platforms, and they need to be connected in an efficient and secure way. This assembly is called the Computing Continuum (CC). It raises challenges at multiple levels: at the application level, innovative algorithms are needed to bridge simulations, machine learning and data-driven analytics; at the middleware level, adequate tools must enable efficient deployment, scheduling and orchestration of the workflow components across the whole distributed infrastructure; and, finally, a capable resource management system must allocate a suitable set of components of the infrastructure to run the application workflow, preferably in a dynamic and adaptive way, taking into account the specific capabilities of each component of the underlying heterogeneous infrastructure. This talk discusses these challenges and introduces TCI – the Transcontinuum Initiative – a European multidisciplinary collaborative action aiming to identify the related gaps for both hardware and software infrastructures to build CC use cases, with the ultimate goal of accelerating scientific discovery, improving timeliness, quality and sustainability of engineering artefacts, and supporting decisions in complex and potentially urgent situations.
Tiny, Tiny Tasks - Huge Impact | Ivo Kabadshow, FZJ
Programming today’s supercomputers and upcoming exascale hardware requires us to deal with hierarchical and heterogeneous parallelism. To harvest these FLOPs efficiently, a lot of fine-grained parallelism in the examined algorithm needs to be uncovered and exploited. This raises the question of whether we can utilize the hidden performance from high-level languages with greater abstraction possibilities, such as C++.
In this talk, we present our current efforts towards a performance-portable C++ tasking layer for tiny, inter-dependent tasks. Algorithms that cannot rely on weak scaling, such as molecular dynamics (MD), especially need to be carefully investigated when going to extreme scale.
Our goal is to reduce the per-task overhead to a minimum and allow execution of the task-dependency graph alongside the critical path of the algorithm. By utilizing ready-to-execute tasks and multiple typed task queues per core, work sharing and work stealing can be optimized on NUMA architectures with hierarchical memory.
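As a toy illustration of the scheduling idea (per-worker task queues with work sharing and work stealing), here is a sequential Python sketch. The actual tasking layer is C++ with typed queues and NUMA awareness; everything below, names included, is an illustrative assumption.

```python
import collections
import random

def run_work_stealing(task_lists):
    """Toy round-based scheduler: each worker pops tasks from the tail of its
    own deque (LIFO, cache-friendly) and, when its deque is empty, steals
    from the head of a random non-empty victim's deque (FIFO)."""
    queues = [collections.deque(tasks) for tasks in task_lists]
    executed = [[] for _ in queues]             # tasks each worker ran
    rng = random.Random(0)
    while any(queues):
        for worker, q in enumerate(queues):
            if q:
                task = q.pop()                  # own work: newest first
            else:
                victims = [v for v in queues if v]
                if not victims:
                    break                       # nothing left to steal
                task = rng.choice(victims).popleft()  # steal the oldest task
            executed[worker].append(task)
    return executed

# Worker 1 starts idle and must steal from the others.
executed = run_work_stealing([list(range(10)), [], [10, 11]])
```

Popping one's own queue LIFO keeps recently spawned (cache-hot) tasks local, while stealing FIFO takes the oldest, typically largest, subtrees of the dependency graph, which is the standard trade-off in work-stealing runtimes.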
Heterogeneous system for exascale using h3-Open-SYS/WaitIO | Shinji Sumimoto, The University of Tokyo
This talk presents h3-Open-SYS/WaitIO (WaitIO for short), a system-wide communication library that couples multiple MPI programs for heterogeneous coupling computing. WaitIO provides an inter-program communication environment among MPI programs and supports different MPI libraries with various interconnects and processor types. This talk discusses how complicated problems should be solved in such heterogeneous systems. We have developed the WaitIO communication library to realize these environments, and we present how WaitIO works and performs in such heterogeneous computing environments.
How much can we really compress scientific data? | Franck Cappello, ANL
In 2016, the lossy compression of scientific data was in its infancy: few compressors, no rigorous evaluation methodology, and few users.
Exascale made it mandatory for many applications. We are observing a rapid adoption, by many communities, of modern compressors that are much faster, more effective, and more trustworthy than the initial ones. In this talk, we will address the questions that intrigued potential new users often ask: how do modern lossy compressors work? How do other scientists use them? What are the use cases and the current results of scientific data compression for different domains? How fast can we compress? And, more importantly: how can we trust lossy compressors? We will mainly focus on the SZ lossy compressor developed during the Exascale Computing Project in the USA. Finally, while the lossy compression of scientific data has made dramatic progress, we are still observing significant performance improvements, which raises the question: how much can we really compress scientific data?
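To illustrate the principle behind error-bounded lossy compressors of this family (predict each value, then quantize the prediction error against a user-set bound), here is a minimal sketch. It uses a simple previous-value predictor and is an assumption-laden toy, not SZ's actual algorithm; all names are illustrative.

```python
import numpy as np

def compress(data, eb):
    """Sketch of prediction + error-bounded quantization: predict each value
    by the previous *reconstructed* value and quantize the prediction error
    in steps of 2*eb, so the pointwise reconstruction error stays within eb."""
    quanta = np.empty(len(data), dtype=np.int64)
    prev = 0.0
    for i, v in enumerate(data):
        q = int(round((v - prev) / (2.0 * eb)))  # quantized prediction error
        quanta[i] = q
        prev = prev + 2.0 * eb * q               # decoder-visible reconstruction
    return quanta                                 # small ints, entropy-codable

def decompress(quanta, eb):
    return np.cumsum(2.0 * eb * quanta)

data = np.sin(np.linspace(0.0, 10.0, 1000))
eb = 1e-3                                         # absolute error bound
rec = decompress(compress(data, eb), eb)
```

For smooth data the quantized prediction errors cluster near zero, which is exactly what makes the stream highly compressible by a subsequent entropy coder.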
Programming Systems for Heterogeneous Memory Architectures | Christian Terboven, RWTH
The memory subsystem is changing: the evolution of the cache hierarchy is being followed by new technologies and new kinds of memory, ranging from high-bandwidth (HBM) to large-capacity (NVM). But applications usually have to be heavily modified to use different kinds of memory. A portable, vendor-neutral view of heterogeneous memory could come in the form of a hierarchy of abstractions to cope with the variety of existing hardware. Further, as the memory with the highest bandwidth is limited in capacity, decisions about data placement in the different kinds of memory are required, also considering changes in data access patterns over time.
A hierarchy of programming abstractions to expose and manage heterogeneous memory at different levels of detail and control allows the programmer to express hints (traits) for allocations that describe how data is used and accessed. Combined with characteristics of the platforms’ memory subsystem, these traits are exploited by strategies to decide where to place data items. This talk will first present a characterization of different kinds of memory, and then evaluate support for heterogeneous memory in OpenMP and higher-level middleware.
FasTensor : Efficient Tensor Computation for Large-Scale Data Analysis | Kesheng Wu, Bin Dong, and Suren Byna, LBNL
A key strategy for processing large-scale scientific data is to parallelize the analysis tasks to harness the power of the many cores on modern CPUs and GPUs. Before the advent of big data systems, this type of parallel processing required dedicated custom hardware and software, or very expensive parallel database systems. Fortunately, along with the growth of internet businesses, a data-processing revolution emerged and big data technology quickly spread. As exemplified by the MapReduce (MR) system, these technologies enable complex data analyses without requiring users to prescribe the details of parallel execution, data management, or error handling. However, for large-scale data from scientific simulations, experiments, and observations, many common operations, such as convolution, are hard to describe and slow to execute with the current generation of big data systems.
This presentation describes a new design named FasTensor to address these challenges.
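The kind of operation that is awkward in MapReduce-style systems but natural in chunked array execution can be sketched as follows: a 1-D stencil (convolution) applied chunk by chunk with halo (ghost) cells, with a serial loop standing in for parallel workers. This is an illustrative sketch of the general technique, not FasTensor's API; the names are assumptions.

```python
import numpy as np

def chunked_stencil(data, kernel, n_chunks):
    """Apply a 1-D convolution chunk by chunk: each chunk reads a halo of
    ghost cells on each side but writes only its own output range, so the
    chunks could be processed by independent parallel workers."""
    halo = len(kernel) // 2
    padded = np.pad(data, halo, mode="edge")
    bounds = np.linspace(0, len(data), n_chunks + 1, dtype=int)
    out = np.empty_like(data, dtype=float)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        chunk = padded[lo : hi + 2 * halo]        # own range plus ghost cells
        out[lo:hi] = np.convolve(chunk, kernel, mode="valid")
    return out

rng = np.random.default_rng(0)
data = rng.standard_normal(200)
kernel = np.array([0.25, 0.5, 0.25])              # simple smoothing stencil
smoothed = chunked_stencil(data, kernel, n_chunks=4)
```

Because each chunk carries its own ghost cells, no chunk needs to see its neighbors' data at execution time, which is what makes the decomposition embarrassingly parallel.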
Exascale challenges and opportunities for fundamental research | Christophe Calvin, CEA
Multi-Hybrid Device Programming and Application by Uniform Language | Taisuke Boku, Center for Computational Sciences, University of Tsukuba
While the GPU is the most powerful player on advanced HPC platforms, it is applicable to applications or partial codes only when the dominant computation is highly parallel and regular. In particular, when performance is limited by memory capacity, we face a severe strong-scaling challenge in improving the time to solution. We have been researching the use of FPGAs together with GPUs, so that devices with different performance behaviors compensate for each other. I will introduce several programming approaches, including a single OpenACC notation to target both GPU and FPGA simultaneously, based on our original language system. In the best case, we achieved a 10x improvement in time to solution compared with the GPU-only case.
France within the international Exascale ecosystem | France Boillod-Cerneux, CEA
Development of a heterogeneous coupling library h3-Open-UTIL/MP | Takashi Arakawa, CliMTech/The University of Tokyo
"Heterogeneity" is one of the keywords of recent years in high-performance computing. In fact, the majority of the systems at the top of the TOP500 list are heterogeneous systems composed of CPUs and GPUs. This heterogeneity can be classified into two categories: intra-node heterogeneity, as in GPU machines and VE machines, and inter-node heterogeneity, such as CPU nodes + GPU nodes or CPU nodes + VE nodes. Our presentation will focus on the latter kind of system. The reason for developing such systems is that the role of HPC has expanded beyond simple simulation to large-scale data analysis and machine learning. Software that allows simulation programs to interact with data analysis/AI programs on heterogeneous systems is therefore required. Against this background, we are developing a heterogeneous coupling library, h3-Open-UTIL/MP, as part of the h3-Open-BDEC project. h3-Open-UTIL/MP is a general-purpose coupling library that can couple any simulation models and applications meeting two conditions: 1) they have uniquely numbered and time-invariant grid points, and 2) the time interval of data exchange does not change over time. In addition, it can perform coupling in heterogeneous environments in collaboration with the communication library h3-Open-SYS/WaitIO. In our presentation, we will describe the structure and functions of h3-Open-UTIL/MP and discuss the results of performance measurements. Furthermore, we will introduce one of the applications: the coupling of the atmospheric model NICAM with the machine learning library PyTorch.
Towards Next JCAHPC System | Toshihiro Hanawa, The University of Tokyo
JCAHPC (Joint Center for Advanced HPC) is a virtual organization of the Center for Computational Sciences at the University of Tsukuba (CCS) and the Information Technology Center at the University of Tokyo (ITC), established to design, operate and manage next-generation supercomputer systems. We plan to introduce the “Oakforest-PACS II” system as the successor to the Oakforest-PACS system in FY2024, targeting a peak performance of 200 PFLOPS, mostly using GPUs as accelerators. In this talk, the efforts to design the new system and to port existing applications to the GPU system will be presented.
Hybrid AI/HPC Approaches and Linear Algebra | Nahid Emad, Paris-Saclay University/Versailles
Extreme Scale, Tissue Analytics and AI | Joel Saltz, Stony Brook University
I will discuss the vision and challenge of creating models and fine-tuning pipelines capable of 1) carrying out complex pathology image classification tasks, 2) answering nuanced questions that require examples from comparable patients and citations from the scientific literature, and 3) finding and/or generating examples of similar cases that differ in subtle but important details. The tremendous success of large language models such as GPT-3 and BERT, along with the demonstrated ability to tune such models to carry out discourse (e.g. ChatGPT), suggests that this ambitious goal may be realizable over the coming few years. I will discuss the prospects and challenges associated with this ambitious project.
Living in a Heterogeneous World: How scientific workflows bridge diverse cyberinfrastructure | Ewa Deelman, USC
Pegasus (http://pegasus.isi.edu) automates the process of executing science workflows on modern cyberinfrastructure. It takes high-level, resource-independent descriptions and maps them onto the available heterogeneous resources: campus clusters, high-performance computing resources, high-throughput resources, clouds, and the edge. This talk describes the challenges and opportunities of workflow management in this heterogeneous context.