Abstract
Cosmological simulations require a multiple time-stepping scheme. Without one, such simulations would be intractable due to their enormous dynamic range: over eleven orders of magnitude in density. This leads to a range of over four orders of magnitude in time-step size, which presents a significant load-balancing challenge. In this work, the extreme adaptivity of cosmological simulations is tackled in three main ways using the code SWIFT. First, an adaptive mesh is used to ensure that only the relevant particles are interacted in a given time-step. Second, task-based parallelism, using pthreads and SIMD vectorisation, provides efficient load balancing within a single node. Finally, a domain decomposition strategy is presented that uses the graph partitioning library METIS to split the work, rather than just the data, across nodes communicating via MPI. These three strategies are shown to give SWIFT near-perfect weak-scaling characteristics, losing only 25% performance when scaling from 1 to 4096 cores on a representative problem, whilst being more than 30x faster than the de-facto standard Gadget-2 code.
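The multiple time-stepping scheme described above is usually organised as a hierarchy of power-of-two fractions of the longest time-step, so that on any given fine step only a subset of "bins" of particles is active. The sketch below illustrates that bookkeeping; the function names and binning convention are illustrative assumptions, not SWIFT's actual implementation.

```python
def timestep_bin(dt_ideal, dt_max):
    """Round an ideal time-step down to a power-of-two fraction of dt_max.

    Bin b corresponds to a time-step of dt_max / 2**b, so larger bin
    numbers mean shorter (finer) time-steps.
    """
    bin_id = 0
    while dt_max / (1 << bin_id) > dt_ideal:
        bin_id += 1
    return bin_id


def active_bins(step, n_bins):
    """Bins whose particles must be updated on fine step `step`.

    The deepest bin (n_bins - 1) is active on every fine step, while
    bin 0 is only active when `step` is a multiple of 2**(n_bins - 1).
    """
    return [b for b in range(n_bins)
            if step % (1 << (n_bins - 1 - b)) == 0]
```

For example, with four bins, every particle is updated on step 0, but on step 1 only the particles in the finest bin are touched, which is exactly the situation where load balancing becomes hard.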
Abstract
One of the big challenges in simulations of galaxy formation is to harness the raw computing power of the largest facilities. With current gravity+hydrodynamics codes achieving less than 0.1% of the available FLOPS, it is important to revisit how we write code for such systems and how we interact with the wider HPC community. We present the next-generation open-source cosmological code SWIFT, created as a joint effort between galaxy formation scientists and computer scientists with expertise in parallel programming. Using task-based parallelism for intra-node parallelisation, asynchronous communications for inter-node communications and SIMD intrinsics for the code kernels, our code is more than 30x faster than the de-facto standard Gadget code on representative cosmological problems. The code also uses far fewer resources, allowing simulations an order of magnitude larger than is currently feasible (e.g. EAGLE, Illustris, Horizon-AGN, ...). We implement multiple SPH and unstructured hydrodynamics solvers as well as a full test suite to assess the accuracy of the schemes in various situations. This test suite is part of the code release, and we highlight good programming practices to make sure the software is usable by the community and interfaces well with other tools (e.g. yt, ParaView, ...).
Abstract
For many decades cosmological simulations have used individual time-steps for the particles (or cells) that are evolved forward in time. Given the large range in time-step sizes required for accurate integration, this allows substantial gains in time to solution compared to the technique, more widely adopted in other computational sciences, of using a single time-step for the whole computational domain. However, only limited work has been done to make domain decomposition techniques aware of this multiple time-stepping scheme. This results in domains perfectly suited for the time-steps where most particles require updating, but leads to terrible performance, with large levels of imbalance, for the steps where only a small fraction of particles are being updated. This generates poor overall strong-scaling performance and poor utilisation of the systems. In this paper, we introduce some of these shortcomings and present a solution based on task-based parallelism and a weighted graph domain decomposition algorithm, implemented in the open-source cosmological code SWIFT. By carefully weighting the different tasks in an overall cluster-wide graph, we can divide the work, and not just the data, optimally. If we further penalise the creation of communication tasks, we can achieve a domain decomposition where no MPI communication is required for the time-steps with few active particles. This effectively eliminates most imbalance issues and leads to optimal performance not just for time-steps with large numbers of updates but also for the ones with few particles to update. We achieve much better strong-scaling performance than commonly used codes and a time-to-solution 30 times smaller than the de-facto standard Gadget code.
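The core idea of the weighted-graph decomposition above is that cells become graph vertices weighted by the task work they carry, edges are weighted by the data exchanged across them, and cut edges are penalised so that the partitioner prefers layouts with little communication. SWIFT delegates the actual partitioning to METIS; the toy greedy bisection below only illustrates the weighting idea, and all names and the heuristic itself are illustrative assumptions.

```python
def bisect(work, edges, comm_penalty=10.0):
    """Greedily split vertices into two halves, balancing summed work
    and discouraging cut edges via `comm_penalty`.

    work  : {vertex: task work carried by that vertex (cell)}
    edges : {(u, v): data exchanged if u and v land on different halves}
    """
    parts = {}
    load = [0.0, 0.0]
    # Place the heaviest vertices first, as greedy packers usually do.
    for v in sorted(work, key=work.get, reverse=True):
        def cost(side):
            # Load imbalance if v were placed on `side` ...
            new_load = list(load)
            new_load[side] += work[v]
            imbalance = abs(new_load[0] - new_load[1])
            # ... plus a penalty for every edge cut by that placement.
            cut = sum(w for (a, b), w in edges.items()
                      if (a == v and parts.get(b) == 1 - side)
                      or (b == v and parts.get(a) == 1 - side))
            return imbalance + comm_penalty * cut
        side = min((0, 1), key=cost)
        parts[v] = side
        load[side] += work[v]
    return parts
```

With a large enough `comm_penalty`, tightly coupled cells end up on the same half, which is the analogue of the paper's goal of needing no MPI communication on the steps with few active particles.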
Abstract
The SWIFT simulation code has now matured enough that we can start targeting large-scale simulations using the EAGLE physics model. In this talk I will discuss the status of the code and present some ideas related to the domain decomposition that we implemented in order to tackle the challenge of deep time-step hierarchies that develop in cosmological simulations. I will also present some weak-scaling plots demonstrating the ability of the code to scale up to the largest systems in realistic scenarios.
Abstract
Traditional large HPC simulation codes rely on MPI, or MPI plus OpenMP, for their parallelization over clusters of more than 100,000 cores. An alternative, task-based parallelism, is the strategy used in SPH With Interdependent Fine-Grained Tasking, or SWIFT. This open-source cosmological code also makes use of vectorization, dynamic scheduling, task-based I/O, and more.
Abstract
In particle-based simulations, neighbour finding (i.e. finding the pairs of particles to interact within a given range) is the most time-consuming part of the computation. One of the best such algorithms, which can be used for both Molecular Dynamics (MD) and Smoothed Particle Hydrodynamics (SPH) simulations, is the pseudo-Verlet list algorithm. This algorithm, however, does not vectorize trivially, which makes it difficult to exploit SIMD-parallel architectures. In this paper, we present several novel modifications as well as a vectorization strategy for the algorithm, leading to overall speed-ups over the scalar version of 2.24x for the AVX instruction set (SIMD width of 8), 2.43x for AVX2, and 4.07x for AVX-512 (SIMD width of 16).
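The pseudo-Verlet idea is to sort the particles of two neighbouring cells along the axis joining the cells and sweep the sorted lists, stopping as soon as the 1D separation alone already exceeds the interaction range. The scalar sketch below shows that sweep; the function name and argument layout are illustrative assumptions, not the paper's actual interface.

```python
def pair_interactions(pos_i, pos_j, axis, r_max):
    """Return index pairs (i, j) with |pos_i[i] - pos_j[j]| <= r_max.

    pos_i, pos_j : lists of (x, y, z) tuples for two neighbouring cells.
    axis         : unit vector joining the cell centres.
    r_max        : interaction range.
    """
    proj = lambda p: sum(a * b for a, b in zip(p, axis))
    # Sort both cells' particles along the cell-pair axis.
    order_i = sorted(range(len(pos_i)), key=lambda i: proj(pos_i[i]))
    order_j = sorted(range(len(pos_j)), key=lambda j: proj(pos_j[j]))
    pairs = []
    for i in order_i:
        di = proj(pos_i[i])
        for j in order_j:
            dj = proj(pos_j[j])
            if dj - di > r_max:
                break  # everything further along the axis is out of range
            r2 = sum((a - b) ** 2 for a, b in zip(pos_i[i], pos_j[j]))
            if r2 <= r_max * r_max:
                pairs.append((i, j))
    return pairs
```

The early `break` is what saves distance calculations; it is also the data-dependent branch that makes a naive SIMD translation of the algorithm non-trivial, which is the problem the paper's vectorization strategy addresses.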
Abstract
We present the next-generation open-source cosmological code SWIFT. Using task-based parallelism for intra-node parallelisation, asynchronous communications for inter-node communications and SIMD intrinsics for the code kernels, our code is more than 20x faster than the de-facto standard Gadget code on representative cosmological problems. We implement multiple SPH and unstructured hydrodynamics solvers as well as a full test suite to assess the accuracy of the schemes in various situations. We expect to release the code to the public by the end of the summer. Current performance tests on 100k+ cores have been published in arXiv:1606.02738.
Abstract
We present a new open-source cosmological code, called SWIFT, designed to solve the equations of hydrodynamics using a particle-based approach (Smoothed Particle Hydrodynamics) on hybrid shared/distributed-memory architectures. SWIFT was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100 supercomputers (Tier-0 systems), without relying on architecture-specific features or specialised accelerator hardware. This performance is due to four main computational approaches:
- Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores.
- Graph-based and genetic-algorithm-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, as opposed to just the data (as is the case with most partitioning schemes), is equally distributed across all nodes.
- Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring tasks that rely on data from other nodes until it arrives.
- Explicit vectorization of the core kernels to exploit all the available FLOPS on architectures such as the Xeon Phi.
In order to use these approaches, the code had to be rewritten from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold on x86 architectures, making SWIFT more than an order of magnitude faster than current alternative software.
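The task-based approach described above amounts to expressing the computation as a graph of small units of work with explicit dependencies: a task becomes runnable only once everything it depends on has completed, and idle workers pick up whatever is ready. The single-threaded toy below illustrates that scheduling discipline; it is an assumption-laden sketch with illustrative names, not SWIFT's scheduler.

```python
def run_tasks(tasks, deps):
    """Execute `tasks` ({name: callable}) respecting `deps`
    ({name: set of task names that must finish first}).

    Returns the order in which tasks completed. A real task-based
    runtime would hand each ready task to a worker thread instead of
    running them serially.
    """
    done, order = set(), []
    pending = set(tasks)
    while pending:
        ready = [t for t in pending if deps.get(t, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependency between tasks")
        for t in ready:
            tasks[t]()
            done.add(t)
            order.append(t)
            pending.remove(t)
    return order
```

In an SPH code the dependency chain is typically sort → density → force for each cell, with communication modelled as extra tasks in the same graph, which is exactly what makes the asynchronous-communication bullet above possible.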
Abstract
In molecular dynamics (MD) and smoothed particle hydrodynamics (SPH) simulations, a set of particle interactions must be computed. Current algorithms that solve this problem use complex techniques to reduce the number of distance calculations, but these do not translate well into SIMD instructions, and compilers tend to produce inefficient auto-vectorised code. Poor performance arises predominantly from branches in the code and inefficient memory access. We present new SIMD vectorisation strategies that address these issues, implementing a pseudo-Verlet list, a particle sorting algorithm, and our own vectorisation schemes. We have applied our strategy to the cosmological code SWIFT, which solves the equations of hydrodynamics using SPH. We observe a speed-up of 6.65x for the raw particle interactions and 2.32x for the overall neighbour-finding and interaction function using AVX2.
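The branch problem mentioned above is usually solved by computing all candidate pairs in data-parallel lanes and masking out-of-range results, rather than testing each pair with an `if`. The NumPy sketch below mimics that lane-masking style in high-level code; the kernel shape and function name are illustrative assumptions, not SWIFT's actual interaction kernel.

```python
import numpy as np

def masked_density(pos_i, pos_others, h):
    """Toy SPH-style density contribution at pos_i from pos_others,
    using a simple (1 - r/h)^2 weight and a mask instead of branches.

    pos_i      : (3,) position of the particle of interest.
    pos_others : (N, 3) positions of candidate neighbours.
    h          : interaction range (smoothing length).
    """
    dx = pos_others - pos_i              # (N, 3) displacements
    r = np.sqrt(np.sum(dx * dx, axis=1))  # (N,) distances
    mask = r < h                          # which lanes are in range
    # All lanes compute the weight; out-of-range lanes are zeroed,
    # exactly as a SIMD blend/mask instruction would do.
    w = np.where(mask, (1.0 - r / h) ** 2, 0.0)
    return w.sum()
```

Keeping every lane busy and discarding results with a mask is the same trade the paper's intrinsics-level strategy makes: a few wasted arithmetic operations in exchange for branch-free, fully vectorised inner loops.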
Abstract
We present a new open-source cosmological code, called SWIFT, designed to solve the equations of hydrodynamics using a particle-based approach (Smoothed Particle Hydrodynamics) on hybrid shared/distributed-memory architectures. SWIFT was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100 supercomputers (Tier-0 systems), without relying on architecture-specific features or specialised accelerator hardware. This performance is due to three main computational approaches: (1) Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores. (2) Graph-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, as opposed to just the data (as is the case with most partitioning schemes), is equally distributed across all nodes. (3) Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring tasks that rely on data from other nodes until it arrives. In order to use these approaches, the code had to be rewritten from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Power8-based architectures.
Abstract
Simulations of galaxy formation follow the gravitational and hydrodynamical interactions between gas, stars and dark matter through cosmic time. The huge dynamic range of such calculations severely limits the strong-scaling behaviour of the community codes in use, with load imbalance, cache inefficiencies and poor vectorisation limiting performance. The new SWIFT code exploits task-based parallelism designed for many-core compute nodes, interacting via MPI using asynchronous communications to improve speed and scaling. A graph-based domain decomposition schedules interdependent tasks over the available resources. Strong-scaling tests on realistic particle distributions yield excellent parallel efficiency, and efficient cache usage provides a large speed-up compared to current codes even on a single core. SWIFT is designed to be easy to use by shielding the astronomer from computational details such as the construction of the tasks or the MPI communication. The techniques and algorithms used in SWIFT may benefit other areas of computational physics as well, for example compressible hydrodynamics.