HPCW 3.0
The applications in HPCW exhibit different performance characteristics, stressing different parts of a given system. The purpose of this document is to highlight some of the characteristics of each application. It can serve as a first stop when picking a benchmark for your intended use case, or provide initial hints about common bottlenecks in the applications.
Due to differences in hardware and software between systems, these hints are not guaranteed to hold on every system. The tips in this document are based on the experience of the developers of the respective applications, as well as some performance analysis on one HPC system. The exact configuration of MPI processes, OpenMP threads, etc. also has an impact, so the following memory requirements should be seen as rough estimates of lower limits.
cloudsc-fortran-small
: < 1 GB
cloudsc-fortran-medium
: ~ 10 GB
cloudsc-fortran-big
: ~ 50 GB
cloudsc-gpu-*
: ~ 16 GB device, ~ 24 GB host (measured with a single GPU, single process)
cloudsc-mpi-*
: ~ 160 GB (measured with a single process)

The memory requirements of the MPI test cases, and by extension the GPU test cases for multi-GPU runs, depend on the number of processes. For reference, executing the MPI test cases with 8 processes instead of one leads to 200 GB of occupied memory.
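As a rough rule of thumb, you can interpolate between the two reference points above (1 process → ~160 GB, 8 processes → ~200 GB) to estimate other configurations. The sketch below assumes memory grows roughly linearly with the process count, which is only an approximation; treat the result as a lower limit.

```python
# Rough, linear estimate of cloudsc-mpi memory usage versus process count.
# Assumes memory = base + per_process * nproc, fitted to the two reference
# points quoted above; real usage will deviate, so treat this as a lower bound.

def estimate_cloudsc_mpi_memory_gb(nproc: int) -> float:
    (n0, m0), (n1, m1) = (1, 160.0), (8, 200.0)  # (processes, measured GB) from this document
    per_process = (m1 - m0) / (n1 - n0)          # ~5.7 GB per additional process
    base = m0 - per_process * n0
    return base + per_process * nproc

if __name__ == "__main__":
    for nproc in (1, 8, 16, 32):
        print(f"{nproc:3d} processes -> ~{estimate_cloudsc_mpi_memory_gb(nproc):.0f} GB")
```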
Historically, CLOUDSC is known to put heavy pressure on the registers and can suffer from register spilling, thus stress-testing the compiler's register allocation strategy in particular. This is also true for the GPU versions, where occupancy is traded off to reduce register spilling.
The initial version (cloudsc-fortran-*) is intended to run on one NUMA domain, perhaps one socket; it is not expected to scale across multiple sockets or nodes. An artificial MPI version (cloudsc-mpi-*) has been implemented that is less relevant from a scientific point of view, but mimics the behavior of CLOUDSC running on larger machines.
ecrad-small
: < 1 GB
ecrad-medium
: ~ 32 GB
ecrad-big
: ~ 55 GB

ecRad is parallelized solely via OpenMP, meaning there is no distributed-memory parallelism available. ecRad incurs a relatively large memory footprint and is often memory-bound, provided the compiler is able to vectorize the code sufficiently.
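Since ecRad is driven purely by OpenMP, runs are controlled through the standard OpenMP environment variables. The sketch below shows a minimal launcher; the executable name and input/output files are placeholders that depend on your installation and test case.

```python
# Minimal sketch of launching an OpenMP-only ecRad run from Python.
# The executable path and input files are hypothetical placeholders; only the
# OpenMP environment variables (part of the OpenMP standard) matter here.
import os
import subprocess

env = os.environ.copy()
env["OMP_NUM_THREADS"] = "16"   # number of OpenMP threads (no MPI ranks involved)
env["OMP_PLACES"] = "cores"     # pin each thread to a physical core
env["OMP_PROC_BIND"] = "close"  # keep threads close together, e.g. on one NUMA domain

# Placeholder command line -- adjust to your ecRad installation and test case.
subprocess.run(["./ecrad", "config.nam", "input.nc", "output.nc"], env=env, check=True)
```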
The memory requirements for ecTrans depend heavily on the exact configuration of MPI processes. The following numbers should serve as a lower bound; they were obtained by running with as few processes/GPUs as possible.
ectrans-cpu-small
: ~ 12 GB with 1 process
ectrans-cpu-medium
: ~ 30 GB with 3 processes
ectrans-cpu-big
: ~ 100 GB with 3 processes
ectrans-gpu-small
: ~ 15 GB device with 1 A100 (40 GB), ~ 8 GB host
ectrans-gpu-medium
: ~ 50 GB device with 4 A100s, ~ 10 GB host
ectrans-gpu-big
: ~ 240 GB device with 4*4 A100s, ~ 80 GB host

For reference, executing ectrans-cpu-big with 8 processes on a single node yielded a memory usage of 150 GB, while running with 64 nodes, 8 processes each, yielded a total memory usage of roughly 640 GB, i.e. about 10 GB per node.
The single-precision benchmark version of ecTrans loops through multiple inverse and direct spectral transforms. These transforms require a lot of global communication, stressing the network, and a large number of floating-point operations due to the Fourier and Legendre transforms.
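The sketch below illustrates the compute pattern only, not the actual ecTrans implementation: a batched FFT stands in for the Fourier stage and a dense matrix product for the Legendre stage, on a synthetic single-process grid. In a distributed run, the change of data layout between the two stages is what drives the global (all-to-all) communication.

```python
# Toy sketch of the spectral-transform compute pattern: an FFT along each
# latitude circle (Fourier stage) plus a dense matrix product over latitudes
# (Legendre stage). Sizes and the random "Legendre" matrix are synthetic; this
# only illustrates where the FLOPs come from, not the real ecTrans algorithm.
import numpy as np

nlat, nlon, nspec = 64, 128, 42                 # synthetic grid and spectral truncation
rng = np.random.default_rng(0)
field = rng.standard_normal((nlat, nlon))       # synthetic grid-point field
legendre = rng.standard_normal((nlat, nspec))   # stand-in for associated Legendre polynomials

# Direct transform: grid-point space -> spectral space
fourier = np.fft.rfft(field, axis=1)            # FFT along longitudes, one per latitude
spectral = legendre.T @ fourier                 # "Legendre" stage: dense matmul over latitudes

# Inverse transform: spectral space -> grid-point space
fourier_back = legendre @ spectral              # back to Fourier coefficients per latitude
field_back = np.fft.irfft(fourier_back, n=nlon, axis=1)

# In a distributed run, the FFT stage needs longitude-local data while the
# matmul stage needs latitude-local data; switching between the two layouts is
# what forces the all-to-all transposes mentioned above.
print(field_back.shape)  # (64, 128)
```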
icon-test-nwp-R02B04N06multi
: ~ 20 GB with 16 processes, 4 threads each (CPU only)
icon-atm-tracer-Hadley
: ~ 50 GB with 32 processes, 4 threads each (CPU only)
icon-aes-physics
: ~ 110 GB with 32 processes, 4 threads each (CPU only)
icon-NextGEMS-R2B8-2020
: ~ 1.6 TB with 224 processes, 4 threads each (CPU only)
icon-LAM
: ~ 870 GB with 240 processes, 8 threads each (CPU only)

The icon-aes-physics test case was designed to fit on a single node. After the initialization, it spends the majority of its time in the dynamical core and is mostly bound by memory bandwidth. Communication happens mostly between neighbouring processes, resulting in a diagonal communication matrix, which is only broken up by global reductions every few time steps. Overall, communication makes up around 5% of the run time, of which waiting for the completion of asynchronous communication calls is a significant portion.
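For illustration, the sketch below constructs a synthetic communication matrix with the structure described above: neighbour (halo) exchanges produce a band around the diagonal, while occasional global reductions add a faint all-to-all background. The rank count and message volumes are made up and do not correspond to measured ICON data.

```python
# Synthetic illustration of the communication pattern described above: a
# diagonal band from neighbour (halo) exchanges, plus a weak all-to-all
# background from occasional global reductions. All numbers are made up.
import numpy as np

nranks = 32
comm = np.zeros((nranks, nranks))

# Halo exchanges: each rank talks mostly to a few "neighbouring" ranks.
for offset in (1, 2):
    for r in range(nranks - offset):
        comm[r, r + offset] = comm[r + offset, r] = 100.0 / offset  # arbitrary volume

# Global reductions every few time steps touch every pair of ranks a little.
comm += 1.0
np.fill_diagonal(comm, 0.0)

# The heavy entries sit near the diagonal; the reductions only add a faint background.
band = np.abs(np.subtract.outer(range(nranks), range(nranks))) <= 2
print(f"fraction of traffic within the diagonal band: {comm[band].sum() / comm.sum():.2%}")
```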
The icon-NextGEMS-R2B8-2020 setup is similar to icon-aes-physics, but with higher resolution and some minor differences in the physics. Its communication matrix also displays a diagonal due to the data exchange between neighbouring cells/processes, plus some global communication after each time step.
The icon-LAM setup is very different from the others, as it is not used for climate simulations but for weather prediction. "LAM" stands for limited-area mode; this particular setup computes a weather forecast for a region over Germany and includes two nested domains over the Alps. This has several consequences. First, the employed physics package differs from the other setups. Second, the communication pattern is different: in addition to the main diagonal of the communication matrix described above, the LAM setup features slightly less pronounced secondary diagonals, caused by the data exchange at the nest boundaries.
The other test cases serve mainly as smoke tests, so that you can easily check whether your installation of ICON works correctly. icon-atm-tracer-Hadley is a simple Hadley test (proposed by Kent et al.). The icon-test-nwp-R02B04N06multi test case is a small nested setup with the same physics as used in the LAM test case.
TODO
nicamdc-small
: ~ 8 GB
nicamdc-medium
: ~ 185 GB
nicamdc-big
: ~ 800 GB

Due to constraints within NICAM-DC, each test case can only be executed with a fixed number of processes, namely 10, 640 and 2560 for the small, medium and big test case, respectively. NICAM-DC contains only the dynamical core of NICAM (CPU only) and is, generally speaking, memory-bandwidth-bound, with only small amounts of communication between processes.
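These fixed process counts are consistent with a decomposition into 10 base regions that are recursively split by a factor of 4 (10 * 4**l); note that this interpretation is an assumption on our part, not something specified above. A quick check:

```python
# Quick check that the fixed process counts match 10 * 4**l for some integer l
# (assuming the counts follow a 10-region decomposition recursively split by 4 --
# an interpretation, not a statement from the NICAM-DC documentation above).
counts = {"small": 10, "medium": 640, "big": 2560}

for name, nproc in counts.items():
    l = 0
    while 10 * 4**l < nproc:
        l += 1
    if 10 * 4**l == nproc:
        print(f"{name}: {nproc} processes = 10 * 4**{l}")
    else:
        print(f"{name}: {nproc} does not fit the pattern")
```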