Goals
To allow CloudSC to use HBM and to evaluate its performance on HBM, several additions have been made to HPCW in the context of the ESiWACE3 project. The goal of this file is to explain, step by step, which changes were made, so that others can refer to it if similar needs arise when improving other parts of HPCW. It does not necessarily reflect the current status of CloudSC in HPCW.
Summary of changes
- Add AARCH64 build toolchain
- Additional test cases for CloudSC
- Support for MAQAO profiler
- CloudSC runs with memory binding options (e.g. HBM)
- CloudSC build & runs with MPI
- MPI-aware memory binding: adding an srun wrapper script
Add AARCH64 build toolchain
Since the compilers used on our end were GNU AARCH64 compilers, we added a new environment script in the toolchains/ directory that loads the environment for building on ARM machines (bull-gnu-aarch64.env.sh).
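For illustration, such an environment script mostly loads the compiler modules of the target machine. A minimal sketch, with hypothetical module names and following the log module load pattern used by the other environment scripts (the actual bull-gnu-aarch64.env.sh uses the modules available on Spartan):
# Hypothetical AARCH64 environment sketch; module names are illustrative
# and must be adapted to the modules installed on the target ARM machine.
log module load gcc/13.2.0
export CC=gcc CXX=g++ FC=gfortran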
When building and running HPCW on AARCH64 machines, this script can simply be sourced as follows:
export HPCW_SOURCE_DIR=/path/to/hpcw
source $HPCW_SOURCE_DIR/toolchains/bull/bull-gnu-aarch64.env.sh
We tested this environment successfully with CloudSC on an Ampere Altra Q80-30 ARM CPU on the Spartan machine of Eviden.
Additional test cases for CloudSC
To diversify the computational loads that can be exercised for CloudSC through HPCW, we wanted to add small, medium and big test cases. To do so, we augmented the test-case section in the projects/cloudsc.cmake file as follows:
add_cloudsc_test(fortran-small dwarf-cloudsc-fortran "4 16384 32")
add_cloudsc_test(fortran-medium dwarf-cloudsc-fortran "4 131072 32")
add_cloudsc_test(fortran-big dwarf-cloudsc-fortran "4 524288 32")
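The quoted arguments are passed directly to the dwarf. Assuming the usual CloudSC dwarf convention of <omp-threads> <ngptot> <nproma>, the medium case is therefore equivalent to invoking the binary directly as:
# Direct invocation matching the "medium" test case above
# (assumed argument convention: <omp-threads> <ngptot> <nproma>)
dwarf-cloudsc-fortran 4 131072 32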
We then added these test cases in the bull-job-launcher.sbatch file to be able to launch them:
case "${cluster}-${partition}" in
[...]
;;
spartan*-INTEL)
[...]
case "${name}" in
[...]
cloudsc-fortran-small)
ulimit -S -s unlimited
export OMP_NUM_THREADS=4
export KMP_AFFINITY="granularity=fine,proclist=[0-$((OMP_NUM_THREADS-1))],explicit"
launcher=direct-timed
sbatch_options+=" -n 1 -c $OMP_NUM_THREADS -t 00:10:00"
;;
cloudsc-fortran-medium)
ulimit -S -s unlimited
export OMP_NUM_THREADS=4
export KMP_AFFINITY="granularity=fine,proclist=[0-$((OMP_NUM_THREADS-1))],explicit"
launcher=direct-timed
sbatch_options+=" -n 1 -c $OMP_NUM_THREADS -t 00:10:00"
;;
cloudsc-fortran-big)
ulimit -S -s unlimited
export OMP_NUM_THREADS=4
export KMP_AFFINITY="granularity=fine,proclist=[0-$((OMP_NUM_THREADS-1))],explicit"
launcher=direct-timed
sbatch_options+=" -n 1 -c $OMP_NUM_THREADS -t 00:10:00"
;;
[...]
These test cases can now be used when calling the ctest command line:
ctest -R cloudsc-fortran-big -VV
Support for MAQAO profiler
To analyze the performance of CloudSC, we wanted to add support for the MAQAO profiler in HPCW.
We first augmented the environment scripts to load the MAQAO profiler:
source maqao/x86_64/2.17.8
We then changed the bull-job-launcher.sbatch file to introduce a MAQAO flavor in the main case statement, in order to properly separate the MAQAO launchers from the others:
case "${cluster}-${partition}-${ENABLE_MAQAO}" in
[...]
We then introduced in the same file a specific launcher type for MAQAO:
case "$launcher" in
[...]
;;
"direct-timed-maqao")
\time -f "\$TIME_FORMAT" ${HPCW_MAQAO}
;;
We need to do this because MAQAO must be invoked directly on the binary being profiled. In HPCW, however, the CloudSC launch command places a script that computes the Gflop/s rate from a log file in front of the binary.
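Schematically, the arguments received by the launcher for a CloudSC test look as follows (the positions are those used by the case below; the concrete values are illustrative):
# $1 = Gflop/s computation script   $2 = run log file
# $3 = dwarf-cloudsc-fortran        $4 = "4 524288 32"
# Plain timed launch:         $1 $2 $3 $4
# Launch reworked for MAQAO:  $1 $2 maqao oneview [options] -- $(command -v $3) $4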
We circumvented that issue by redefining the command line in a specific case:
case "${cluster}-${partition}-${ENABLE_MAQAO}" in
[...]
;;
spartan*-INTEL-true)
[...]
# Ensure that MAQAO exists in the environment
command -v maqao
case "${name}" in
cloudsc-fortran-big)
ulimit -S -s unlimited
export OMP_NUM_THREADS=4
export KMP_AFFINITY="granularity=fine,proclist=[0-$((OMP_NUM_THREADS-1))],explicit"
launcher=direct-timed-maqao
sbatch_options+=" -n 1 -c $OMP_NUM_THREADS -t 00:10:00"
# Get absolute path to binary for MAQAO needs
binary=$(command -v $3)
# Here we need to rework the command line of the test because MAQAO
# needs to be called with the actual executable of the run to work
# properly. However, with CloudSC, the first and second arguments are a
# script that computes the Gflop/s of the run and a log file. This is
# why we need to rework the command line to call MAQAO at the right
# place of the command line.
HPCW_MAQAO="$1 $2 maqao oneview \
--create-report=one \
-dbg=1 \
--envv_OMP_NUM_THREADS=$OMP_NUM_THREADS \
--number-processes=1 \
--replace \
--uarch=SAPPHIRE_RAPIDS \
-xp=${PWD}/maqao-logs-${name} \
--show-program-output=on \
-- $binary $4"
;;
[...]
With these additions, setting the environment variable ENABLE_MAQAO to 1 when launching the runs with ctest activates the MAQAO profiling:
ENABLE_MAQAO=1 ctest -R cloudsc-fortran-big -VV
For now, this launcher can only be used with the newly introduced cloudsc-fortran-big test case, as it was specifically sized so that the MAQAO reports are meaningful.
CloudSC runs with memory binding options (e.g. HBM)
To be able to apply a memory binding to CloudSC jobs, we introduced in the bull-job-launcher.sbatch file a specific launcher type with memory bindings:
case "$launcher" in
[...]
;;
"direct-timed-membind")
\time -f "\$TIME_FORMAT" ${HPCW_MEMBIND:-} $@
;;
We also modified the test case launch to use this launcher for new CloudSC test cases:
case "${cluster}-${partition}-${ENABLE_MAQAO}" in
[...]
;;
spartan*-INTEL-false)
[...]
case "${name}" in
[...]
cloudsc-fortran-small)
ulimit -S -s unlimited
export OMP_NUM_THREADS=4
export KMP_AFFINITY="granularity=fine,proclist=[0-$((OMP_NUM_THREADS-1))],explicit"
launcher=direct-timed-membind
sbatch_options+=" -n 1 -c $OMP_NUM_THREADS -t 00:10:00"
;;
cloudsc-fortran-medium)
ulimit -S -s unlimited
export OMP_NUM_THREADS=4
export KMP_AFFINITY="granularity=fine,proclist=[0-$((OMP_NUM_THREADS-1))],explicit"
launcher=direct-timed-membind
sbatch_options+=" -n 1 -c $OMP_NUM_THREADS -t 00:10:00"
;;
cloudsc-fortran-big)
ulimit -S -s unlimited
export OMP_NUM_THREADS=4
export KMP_AFFINITY="granularity=fine,proclist=[0-$((OMP_NUM_THREADS-1))],explicit"
launcher=direct-timed-membind
sbatch_options+=" -n 1 -c $OMP_NUM_THREADS -t 00:10:00"
;;
[...]
With this launcher, it is now possible to set the HPCW_MEMBIND environment variable to define which type of memory binding you want (e.g. HBM / DDR) and which tool you want to use (e.g. numactl / hwloc-bind).
For our experiments on an Intel Sapphire Rapids CPU with 128GB HBM, we used the following command line:
HPCW_MEMBIND="hwloc-bind --cpubind core:0-3 --membind \$(hwloc-calc --oo --local-memory --best-memattr Bandwidth core:0-3) --" \
ctest -R cloudsc-fortran-big -VV
When binding to HBM, be sure to use a recent hwloc release that can properly apply and report bindings on such memory. For our experiments, we used hwloc 2.10.0.
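As an alternative sketch using numactl instead of hwloc-bind (the HBM NUMA node index is machine-specific; node 8 below is purely illustrative):
# Hypothetical numactl-based binding: pin to cores 0-3 and allocate memory on
# NUMA node 8, assumed here to be the HBM node attached to those cores.
HPCW_MEMBIND="numactl --physcpubind=0-3 --membind=8 --" \
ctest -R cloudsc-fortran-big -VV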
CloudSC build & runs with MPI
To enable building CloudSC with MPI, we added MPI flavor support in the projects/cloudsc.cmake file:
if(USE_SYSTEM_cloudsc)
set_property(GLOBAL PROPERTY cloudsc_depends)
else()
set_property(GLOBAL PROPERTY cloudsc_depends ecbuild hdf5)
set_property(GLOBAL PROPERTY cloudsc_depends_optional mpi)
endif()
if(cloudsc_enabled)
[...]
if(USE_SYSTEM_cloudsc)
[...]
else()
[...]
if(mpi_enabled)
list(APPEND cloudsc_depends_ep MPI::MPI_C MPI::MPI_Fortran)
endif()
ExternalProject_Add(
cloudsc
${cloudsc_revision}
DEPENDS ${cloudsc_depends_ep}
CONFIGURE_COMMAND
[...]
-DENABLE_CLOUDSC_GPU_SCC_HOIST:BOOL=${ENABLE_cloudsc_gpu}
-DENABLE_MPI:BOOL=${mpi_enabled}
-DENABLE_ACC:BOOL=${ENABLE_cloudsc_gpu} <SOURCE_DIR>
BUILD_COMMAND $(MAKE) -j${BUILD_PARALLEL_LEVEL}
[...]
After that, we added a new environment script, bull-intel+openmpi.env.sh, that loads the MPI environment needed for the build:
# Load Open MPI built with Intel compilers
log module load openmpi/4.1.6/intel/2023.2.0/ucx/1.15.0/standard
With this, we can build CloudSC with MPI as follows:
export HPCW_SOURCE_DIR=/path/to/hpcw
cd /path/to/build_dir
cmake $cmakeFlags -DENABLE_cloudsc=on -DUSE_SYSTEM_cloudsc=off \
-DENABLE_mpi=on -DUSE_SYSTEM_mpi=on $HPCW_SOURCE_DIR
make
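As a quick sanity check that MPI support was actually compiled in (assuming the resulting dwarf-cloudsc-fortran binary is in the PATH, as the job launcher later assumes when it calls command -v on it):
# The MPI libraries should appear among the dynamic dependencies of the dwarf.
ldd "$(command -v dwarf-cloudsc-fortran)" | grep -i mpi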
Finally, we added in projects/cloudsc.cmake a new test case for MPI:
if(mpi_enabled)
add_cloudsc_test(mpi dwarf-cloudsc-fortran "7 2097152 32")
endif()
And we added the launching part in the bull-job-launcher.sbatch file:
case "${cluster}-${partition}-${ENABLE_MAQAO}" in
[...]
;;
spartan*-INTEL-false)
[...]
case "${name}" in
[...]
cloudsc-mpi)
ulimit -S -s unlimited
export OMP_NUM_THREADS=7
launcher="srun"
sbatch_options+=" -N 1 -n 32 -c $OMP_NUM_THREADS -t 00:10:00"
SRUN_OPTIONS+=("--ntasks-per-node 32")
SRUN_OPTIONS+=("--cpus-per-task $OMP_NUM_THREADS")
;;
[...]
We chose to set up the test case with 32 MPI processes and 7 cores per process (224 in total). This is meant to entirely fill a dual-socket Intel Sapphire Rapids node (2 sockets x 56 cores x 2 hardware threads = 224).
With these changes, the CloudSC-MPI test can be launched as follows:
ctest -R cloudsc-mpi -VV
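Under the hood, the resulting Slurm geometry corresponds roughly to the following launch (schematic only; the real command goes through the Gflop/s wrapper script set up by HPCW):
# 32 ranks x 7 CPUs each = 224 hardware threads on one node.
srun -N 1 -n 32 --ntasks-per-node=32 --cpus-per-task=7 \
dwarf-cloudsc-fortran 7 2097152 32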
MPI-aware memory binding: adding an srun wrapper script
For the needs of HBM testing with MPI, we wanted to run with several NPROMA values in order to assess the sensitivity of HBM and DDR to this parameter. To do so, we adapted the MPI test cases in projects/cloudsc.cmake:
- We renamed the cloudsc-mpi test case to cloudsc-mpi-nproma32
- We added the cloudsc-mpi-nproma<64|128|256> test cases
if(mpi_enabled)
add_cloudsc_test(mpi-nproma32 dwarf-cloudsc-fortran "7 2097152 32")
add_cloudsc_test(mpi-nproma64 dwarf-cloudsc-fortran "7 2097152 64")
add_cloudsc_test(mpi-nproma128 dwarf-cloudsc-fortran "7 2097152 128")
add_cloudsc_test(mpi-nproma256 dwarf-cloudsc-fortran "7 2097152 256")
endif()
We also needed to bind each MPI process to its closest HBM NUMA node. To that end, we introduced a wrapper script between the srun command and the executable, so that the wrapper can apply the memory binding based on the environment propagated by Slurm, which allows each MPI process to identify its rank.
We then introduced in the bull-job-launcher.sbatch file a specific launcher type that allows specifying the HPCW_SRUN_WRAPPER environment variable to insert a wrapper script after the srun command:
case "$launcher" in
[...]
;;
"srun-wrapper")
\time -f "\$TIME_FORMAT" srun -n \${SLURM_NTASKS} ${SRUN_OPTIONS[@]} ${HPCW_SRUN_WRAPPER:-} $@
;;
We also modified the test case launch to use this launcher for new CloudSC test cases:
SCRIPT_DIR=$(dirname $(readlink -f $0))
[...]
case "${cluster}-${partition}-${ENABLE_MAQAO}" in
[...]
;;
spartan*-INTEL-false)
[...]
case "${name}" in
[...]
cloudsc-mpi*)
ulimit -S -s unlimited
export OMP_NUM_THREADS=7
launcher="srun-wrapper"
sbatch_options+=" -N 1 -n 32 -c $OMP_NUM_THREADS -t 00:10:00"
SRUN_OPTIONS+=("--ntasks-per-node 32")
SRUN_OPTIONS+=("--cpus-per-task $OMP_NUM_THREADS")
HPCW_SRUN_WRAPPER=${HPCW_SRUN_WRAPPER:-"${SCRIPT_DIR}/bull-job-wrapper-spr-hbm.sh"}
;;
[...]
We added the bull-job-wrapper-spr-hbm.sh srun wrapper script, which binds the MPI processes spawned by the test to the closest available HBM NUMA node. While it was made to be as generic as possible, our experiments were run only on an Intel Sapphire Rapids CPU with 128GB HBM.
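To give an idea of what such a wrapper does, here is a minimal sketch (illustrative only, not the actual bull-job-wrapper-spr-hbm.sh), assuming each rank owns a contiguous block of OMP_NUM_THREADS cores indexed by its node-local Slurm rank:
#!/bin/bash
# Illustrative srun wrapper: bind this MPI rank's CPUs and memory to the
# HBM NUMA node closest to its cores, then run the real command.
local_rank=${SLURM_LOCALID:?}          # node-local rank ID, exported by Slurm
cores_per_rank=${OMP_NUM_THREADS:-7}   # cores owned by each rank (assumption)
first=$(( local_rank * cores_per_rank ))
last=$(( first + cores_per_rank - 1 ))
cores="core:${first}-${last}"
# Pick the local NUMA node with the best bandwidth (i.e. HBM when present).
hbm_node=$(hwloc-calc --oo --local-memory --best-memattr Bandwidth "$cores")
exec hwloc-bind --cpubind "$cores" --membind "$hbm_node" -- "$@"
With the default wrapper in place, the NPROMA variants can then be run like the other CloudSC tests, for example:
ctest -R cloudsc-mpi-nproma128 -VV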