Running on Large Systems
There are a few extra things to keep in mind when running SWIFT on a large system (i.e. over MPI on several nodes). Here are some recommendations:
Compile and run with tbbmalloc. You can add this to the configuration of SWIFT by running configure with the --with-tbbmalloc flag. Using this allocator, over the one included in the standard library, is particularly important on systems with large core counts per node. Alternatives include jemalloc and tcmalloc, and using these other allocation tools also improves performance on single-node jobs.
Run with one MPI rank per NUMA region, usually a socket, rather than per node. Typical HPC clusters now use two chips per node; consult your local system manager if you are unsure about your system configuration. To run this way, invoke mpirun -np <NUMBER OF CHIPS> swift_mpi -t <NUMBER OF CORES PER CHIP>, where <NUMBER OF CHIPS> is the total across all nodes. You should also be careful to include this in your batch script; for example, with the SLURM batch system on a machine with two chips per node you will need to include #SBATCH --tasks-per-node=2.
Run with threads pinned. You can do this by passing the -a flag (written --pin in the batch script example) to the SWIFT binary. This ensures that threads stay on the core that spawned them, so that cache is accessed more efficiently.
Ensure that you compile with ParMETIS or METIS. These are required if you want to load balance between MPI ranks.
Your batch script should look something like the following (to run on 8 nodes each with 2x18 core processors for a total of 288 cores):
#SBATCH -N 8  # Number of nodes to run on
#SBATCH --tasks-per-node=2  # This system has 2 chips per node

mpirun -n 16 swift_mpi --threads=18 --pin parameter.yml
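The rank and thread counts in the command above follow directly from the node layout. As a minimal sketch, the arithmetic can be written out in shell. The variable names (nodes, sockets_per_node, cores_per_socket) are hypothetical and only illustrate the calculation; the values are those of the example system, and it is assumed that each socket is one NUMA region.

```shell
#!/bin/sh
# Hypothetical layout variables; replace the values with those of your system.
nodes=8               # nodes requested from the batch system
sockets_per_node=2    # chips per node, assumed to be one NUMA region each
cores_per_socket=18   # cores per chip

# One MPI rank per socket, one thread per core on that socket.
ranks=$((nodes * sockets_per_node))
total_cores=$((ranks * cores_per_socket))

# Build the launch line in the same form as the example above.
cmd="mpirun -n ${ranks} swift_mpi --threads=${cores_per_socket} --pin parameter.yml"

echo "ranks=${ranks} total_cores=${total_cores}"
echo "${cmd}"
```

The script only prints the command rather than running it, so the numbers can be checked against the batch script headers before submitting the job.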