HPCW 3.0
Performance Hints

The applications in HPCW exhibit different performance characteristics and stress different parts of a given system. The purpose of this document is to highlight some of these characteristics for each application. It can serve as a first stop when picking a benchmark for your intended use case, or as a source of initial hints about common bottlenecks in the applications.

Because hardware and software differ between systems, these hints are not guaranteed to hold on every system. They are based on the experience of the developers of the respective applications, as well as on performance analysis carried out on one HPC system. The exact configuration of MPI processes, OpenMP threads, etc. has an impact as well, so the following memory requirements should be seen as rough estimates of lower limits.

CLOUDSC

Memory requirements

  • cloudsc-fortran-small: < 1 GB
  • cloudsc-fortran-medium: ~ 10 GB
  • cloudsc-fortran-big: ~ 50 GB
  • cloudsc-gpu-*: ~ 16 GB device, ~ 24 GB host (measured with single GPU, single process)
  • cloudsc-mpi-*: ~ 160 GB (measured with single process)

The memory requirements of the MPI test cases, and by extension of the GPU test cases for multi-GPU runs, depend on the number of processes. For reference, executing the MPI test cases with 8 processes instead of one leads to 200 GB of occupied memory.
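To get a first estimate for other process counts, the two data points above can be combined into a simple linear model. The sketch below is purely illustrative (the helper name is made up) and assumes that the per-process overhead grows linearly, which is not guaranteed on every system.

    # Illustrative only: a linear fit through the two data points reported above
    # (~160 GB with 1 process, ~200 GB with 8). Actual usage depends on the
    # system, the build and the decomposition, so treat the result as a rough
    # lower-bound guess, not a guarantee.
    def estimate_cloudsc_mpi_memory_gb(num_processes: int) -> float:
        base_gb = 160.0                        # measured with a single process
        per_process_gb = (200.0 - 160.0) / 7   # slope between the two measurements
        return base_gb + per_process_gb * (num_processes - 1)

    if __name__ == "__main__":
        for n in (1, 8, 16, 32):
            print(f"{n:3d} processes -> ~{estimate_cloudsc_mpi_memory_gb(n):.0f} GB")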

Other characteristics

Historically, CLOUDSC is known to put heavy pressure on registers and can suffer from register spilling, making it a stress test for the compiler's register allocation strategy in particular. This also holds for the GPU versions, where occupancy is traded off to reduce register spilling.

The initial version (cloudsc-fortran-*) is intended to run on one NUMA domain, perhaps one socket; it is not expected to scale across multiple sockets or nodes. An artificial MPI version (cloudsc-mpi-*) has been implemented that is less relevant from a scientific point of view, but mimics the behavior of CLOUDSC running on larger machines.

ecRad

Memory requirements

  • ecrad-small: < 1 GB
  • ecrad-medium: ~ 32 GB
  • ecrad-big: ~ 55 GB

Other characteristics

ecRad is parallelized solely via OpenMP, meaning there is no distributed-memory parallelism available. It has a relatively large memory footprint and is often memory-bound, provided the compiler is able to vectorize the code sufficiently.

ecTrans

Memory requirements

The memory requirements of ecTrans depend heavily on the exact configuration of MPI processes. The following numbers should be treated as lower bounds; they were obtained by running with as few processes/GPUs as possible.

  • ectrans-cpu-small: ~ 12 GB with 1 process
  • ectrans-cpu-medium: ~ 30 GB with 3 processes
  • ectrans-cpu-big: ~ 100 GB with 3 processes
  • ectrans-gpu-small: ~ 15 GB device with 1 A100 (40GB), ~ 8 GB host
  • ectrans-gpu-medium: ~ 50 GB device with 4 A100s, ~ 10 GB host
  • ectrans-gpu-big: ~ 240 GB device with 4*4 A100s, ~ 80 GB host

For reference, executing ectrans-cpu-big with 8 processes on a single node yielded a memory usage of 150 GB, while running on 64 nodes with 8 processes each yielded a total memory usage of roughly 640 GB.

Other characteristics

The single-precision benchmark version of ecTrans loops through multiple inverse and direct spectral transforms. These require a lot of global communication, which stresses the network, as well as a large number of floating-point operations for the Fourier and Legendre transforms.
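To make the two-stage structure more concrete, the toy sketch below performs an FFT along one axis and a dense matrix transform along the other. It is a structural illustration only, not the ecTrans algorithm: a random orthogonal matrix stands in for the associated Legendre functions, and the change of data layout between the two stages marks the spot where a distributed run needs its global communication.

    # Structural toy of a spectral transform roundtrip (NOT the ecTrans code):
    # stage 1 is an FFT along the "longitude" axis, stage 2 a dense matrix
    # transform along the "latitude" axis. In a distributed run, stage 1 wants
    # the data split by latitude and stage 2 by zonal wavenumber, so the layout
    # has to be transposed between the stages; that is where the global
    # communication comes from.
    import numpy as np

    rng = np.random.default_rng(0)
    nlat, nlon = 64, 128

    # Orthogonal stand-in for the Legendre transform, so its inverse is just
    # the transpose; the real transform uses associated Legendre polynomials
    # and Gaussian quadrature weights.
    q, _ = np.linalg.qr(rng.standard_normal((nlat, nlat)))

    field = rng.standard_normal((nlat, nlon))

    # Direct transform: FFT along longitudes, then "Legendre" along latitudes.
    fourier = np.fft.rfft(field, axis=1)   # each latitude row independently
    spectral = q.T @ fourier               # each wavenumber column independently

    # Inverse transform: undo the two stages in reverse order.
    field_back = np.fft.irfft(q @ spectral, n=nlon, axis=1)

    print("roundtrip error:", np.max(np.abs(field - field_back)))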

ICON

Memory requirements

  • icon-test-nwp-R02B04N06multi: ~ 20 GB with 16 processes, 4 threads each (CPU only)
  • icon-atm-tracer-Hadley: ~ 50 GB with 32 processes, 4 threads each (CPU only)
  • icon-aes-physics: ~ 110 GB with 32 processes, 4 threads each (CPU only)
  • icon-NextGEMS-R2B8-2020: ~ 1.6 TB with 224 processes, 4 threads each (CPU only)
  • icon-LAM: ~ 870 GB with 240 processes, 8 threads each (CPU only)

Other characteristics

icon-aes-physics

This test case was designed to fit on a single node. After initialization, it spends the majority of its time in the dynamical core and is mostly bound by memory bandwidth. Communication happens mostly between neighbouring processes, resulting in a diagonal communication matrix that is only broken up by global reductions every few time steps. Overall, communication accounts for around 5% of the run time, a significant portion of which is spent waiting for the completion of asynchronous communication calls.
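This pattern, neighbour exchanges dominating with occasional global reductions in between, can be sketched generically with mpi4py. The snippet below is not ICON code, just a minimal assumed 1D halo-exchange loop that reproduces the shape of the communication matrix described above.

    # Generic sketch of the communication pattern described above (not ICON code):
    # each rank exchanges halo data with its two neighbours every step, which
    # fills the diagonal of the communication matrix, and joins a global
    # reduction every few steps. Run with e.g. `mpirun -np 4 python halo_sketch.py`
    # (the file name is arbitrary).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local = np.full(1000, float(rank))   # stand-in for one rank's partition
    left, right = (rank - 1) % size, (rank + 1) % size
    recv_left = np.empty(1, dtype=local.dtype)
    recv_right = np.empty(1, dtype=local.dtype)

    for step in range(20):
        # Neighbour-only halo exchange: point-to-point with adjacent ranks.
        comm.Sendrecv(local[-1:], dest=right, recvbuf=recv_left, source=left)
        comm.Sendrecv(local[:1], dest=left, recvbuf=recv_right, source=right)
        local += 0.001 * (recv_left + recv_right - 2.0)   # toy update

        # Occasional global reduction, breaking up the purely diagonal pattern.
        if step % 5 == 0:
            global_max = comm.allreduce(np.max(np.abs(local)), op=MPI.MAX)
            if rank == 0:
                print(f"step {step}: global max = {global_max:.3f}")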

icon-NextGEMS-R2B8-2020

This setup is similar to icon-aes-physics, but with a higher resolution and some minor differences in the physics. The communication matrix again displays a diagonal due to the data exchange with neighbouring cells/processes, and some global communication takes place after each time step.

icon-LAM

This setup is very different from the others, as it is used not for climate simulations but for weather prediction. "LAM" stands for limited-area mode; this particular setup calculates a weather forecast for a region over Germany and includes two nested domains over the Alps. This has several consequences. First, the employed physics package differs from that of the other setups. Second, the communication pattern is different: in addition to the main diagonal of the communication matrix described above, the LAM setup features slightly less pronounced off-diagonals, caused by the data exchange at the nest boundaries.

The rest

The other test cases serve mainly as smoke tests that let you easily check whether your installation of ICON works correctly. icon-atm-tracer-Hadley is a simple Hadley test (proposed by Kent et al.). The icon-test-nwp-R02B04N06multi test case is a small nested setup with the same physics as used in the LAM test case.

NEMO

TODO

NICAM-DC

Memory requirements

  • nicamdc-small: ~ 8 GB
  • nicamdc-medium: ~ 185 GB
  • nicamdc-big: ~ 800 GB

Other characteristics

Due to constraints within NICAM-DC, each test case can only be executed with a fixed number of processes, namely 10, 640 and 2560 for the small, medium and big test cases, respectively. NICAM-DC contains only the dynamical core of NICAM (CPU only) and is, generally speaking, memory-bandwidth-bound, with only small amounts of communication between processes.
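Assuming the usual icosahedral decomposition of the NICAM grid into 10 × 4^rlevel regions (an assumption, not something stated above), these fixed counts match region levels 0, 3 and 4. The snippet below only spells out that arithmetic.

    # Assumption: NICAM's icosahedral grid is split into 10 * 4**rlevel regions,
    # and the fixed process counts of the three test cases line up with region
    # levels 0, 3 and 4. This is just the arithmetic, not an official statement
    # of the NICAM-DC configuration.
    for rlevel in range(5):
        print(f"rlevel {rlevel}: {10 * 4**rlevel} regions")
    # -> 10 (small), 40, 160, 640 (medium), 2560 (big)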