SPH With Inter-dependent Fine-grained Tasking

13th SPHERIC International Workshop

Galway, Ireland | 26 June 2018

SWIFT: Maintaining weak-scalability with a dynamic range of 10^4 in time-step size to harness extreme adaptivity | Josh Borrow, Matthieu Schaller

Abstract

Cosmological simulations require the use of a multiple time-stepping scheme. Without such a scheme, cosmological simulations would be impossible due to their high level of dynamic range; over eleven orders of magnitude in density. Such a large dynamic range leads to a range of over four orders of magnitude in time-step, which presents a significant load-balancing challenge. In this work, the extreme adaptivity that cosmological simulations present is tackled in three main ways through the use of the code SWIFT. First, an adaptive mesh is used to ensure that only the relevant particles are interacted in a given time-step. Second, task-based parallelism is used to ensure efficient load-balancing within a single node, using pthreads and SIMD vectorisation. Finally, a domain decomposition strategy is presented, using the graph domain decomposition library METIS, that bisects the work that must be performed by the simulation between nodes using MPI. These three strategies are shown to give SWIFT near-perfect weak-scaling characteristics, only losing 25% performance when scaling from 1 to 4096 cores on a representative problem, whilst being more than 30x faster than the de-facto standard Gadget-2 code.

European Week of Astronomy and Space Science 2018

Liverpool, UK | April 2018

The next-generation cosmological code SWIFT | Matthieu Schaller

Abstract

One of the big challenges in simulations of galaxy formation is to harness the raw computing power of the largest facilities. With current codes gravity+hydrodynamics code achieving less than 0.1% of the available flops, it is important to revisit how we write code for such systems and how we interact with the wider HPC community. We present the next-generation open-source cosmological code SWIFT, created as a joint effort between galaxy formation scientists and computer scientists with expertise in parallel programming. Using task-based parallelism for intra-node parallelisation, asynchronous communications for inter-node communications and SIMD intrinsics for the code kernels, our code is more than 30x faster than the de-facto standard Gadget code on representative cosmological problems. The code also uses much fewer resources allowing to run simulations an order of magnitude larger than is currently doable (e.g. EAGLE, Illustris, Horizon-AGN,...). We implement multiple SPH and unstructured hydrodynamics solvers as well as a full test suite to assess the accuracy of the schemes in various situations. This test suite is part of the code release and we highlight good programming practices to make sure the software is usable by the community and interfaces well with other tools (e.g. yt, paraview, ...).

Supercomputing Frontiers Europe 2018

Warsaw, Poland | March 2018

Individual time-stepping in cosmological simulations: A challenge for strong scaling and domain decomposition algorithms | Matthieu Schaller

Abstract

For many decades cosmological simulations have used individual time-steps for the particles (or cells) that are evolved forward in time. Given the large range in time-step sizes required for accurate integration, this allows substantial gains in time to solution compared to the more widely adopted in science technique of using a single time-step for the whole computational domain. However, only limited work has been done to improve domain decomposition techniques to make them aware of this multiple time-stepping scheme. This results in domains perfectly suited for time-steps where most particles require updating but leads to terrible performance with large levels of imbalance for the steps where only a small fraction of particles are being updated. This generates poor overall strong-scaling performance and leads to a poor utilisation of the systems. In this paper, we will introduce some of these short-comings and present a solution based on task-based parallelism and a weighted graph domain decomposition algorithm implemented in the open-source cosmological code SWIFT. By carefully weighting the different tasks in an overall cluster-wide graph, we can divide the work and not the data optimally. If we further penalise the creation of communication tasks we can achieve a domain decomposition where no MPI communication is required for the time-steps with few active particles. This effectively eliminates most imbalance issues and leads to optimal performance not just for time-steps with large number of updates but also for the ones with few particles to update. We achieve much better strong-scaling performance than commonly used codes and achieve a time-to-solution 30 times smaller than the de-facto standard Gadget code.

Computational Galaxy Formation 2018

Ringberg Castle, Tegernsee, Germany | March 2018

First steps towards cosmological simulations with full EAGLE physics | Matthieu Schaller

Abstract

The SWIFT simulation code has now matured enough that we can start targeting large-scale simulations using the EAGLE physics model. In this talk I will discuss the status of the code and present some ideas related to the domain decomposition that we implemented in order to tackle the challenge of deep time-step hierarchies that develop in cosmological simulations. I will also present some weak-scaling plots demonstrating the ability of the code to scale up to the largest systems in realistic scenarios.

SIAM Conference on Parallel Processing for Scientific Computing 2018

Tokyo, Japan | March 2018

Using Task-Based Parallelism, Asynchronous MPI and Dynamic Workload-Based Domain Decomposition to Achieve Near-Perfect Load-Balancing for Particle-Based Hydrodynamics and Gravity | Matthieu Schaller

Intel HPC DevCon 2017

Denver, USA | November 2017

Task-based Calculation + Task-based MPI + Task-based I/O = Maximal Performance | Matthieu Schaller

Abstract

Traditional large HPC simulation codes rely on MPI or MPI plus OpenMP for their parallelization over clusters of more than 100,000 cores. This approach of task-based parallelism strategy is used in SPH With Interdependent Fine-Grained Tasking, or SWIFT. This open-source cosmological code makes use of vectorization, dynamic scheduling, task-based I/O, and more.

ParCo Conference 2017

Bologna, Italy | September 2017

An Efficient SIMD Implementation of Pseudo-Verlet Lists for Neighbour Interactions in Particle-Based Codes | James S. Willis

Abstract

In particle-based simulations, neighbour finding (i.e finding pairs of particles to interact within a given range) is the most time consuming part of the computation. One of the best such algorithms, which can be used for both Molecular Dynamics (MD) and Smoothed Particle Hydrodynamics (SPH) simulations is the pseudo-Verlet list algorithm. This algorithm, however, does not vectorize trivially, and hence makes it difficult to exploit SIMD-parallel architectures. In this paper, we present several novel modifications as well as a vectorization strategy for the algorithm which lead to overall speed-ups over the scalar version of the algorithm of 2.24x for the AVX instruction set (SIMD width of 8), 2.43x for AVX2, and 4.07x for AVX-512 (SIMD width of 16).

UK National Astronomy Meeting

Hull, UK | July 2017

The cosmological hydrodynamical code SWIFT | Matthieu Schaller

Abstract

We present the next-generation open-source cosmological code SWIFT. Using task-based parallelism for intra-node parallelisation, asynchronous communications for inter-node communications and SIMD intrinsics for the code kernels, our code is more than 20x faster than the de-facto standard Gadget code on representative cosmological problems. We implement multiple SPH and unstructured hydrodynamics solvers as well as a full test suite to assess the accuracy of the schemes in various situations. We expect to release the code to the public by the end of the summer. Current performance tests on 100k+ cores have been published in arxiv:1606.02738.

ISC Intel Booth talk

Frankfurt, Germany | June 2017

SWIFT: Using Task-Based Parallelism, Fully Asynchronous Communication and Vectorization to achieve maximal HPC performance | James S. Willis

Abstract

We present a new open-source cosmological code, called swift, designed to solve the equations of hydrodynamics using a particle-based approach (Smooth Particle Hydrodynamics) on hybrid shared / distributed-memory architectures. Swift was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100-supercomputers (Tier-0 systems), without relying on architecture-specific features or specialised accelerator hardware. This performance is due to three main computational approaches: - Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores. - Graph-based and genetic algorithm-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, as opposed to just the data, as is the case with most partitioning schemes, is equally distributed across all nodes. - Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring on tasks that rely on data from other nodes until it arrives, - Explicit vectorization of the core kernel to exploit all the available FLOPS on architectures such as Xeon Phi. In order to use these approaches, the code had to be rewritten from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold on x86 architecture making SWIFT more than an order of magnitude faster than current alternative software.

IXPUG Annual Spring Conference

Cambridge, UK | April 2017

SWIFT: An efficient SIMD implementation of pseudo Verlet-lists for neighbour interactions in particle-based codes | James S. Willis

Abstract

In molecular dynamics (MD) and smoothed particle hydrodynamics simulations (SPH), a set of particle interactions must be computed. Current algorithms that solve this problem using complex techniques to reduce the number of distance calculations do not translate well into SIMD instructions. Compilers tend to produce auto-vectorised code that is inefficient. Poor performance arises predominantly from branches in the code and inefficient memory access. We present new SIMD vectorisation strategies that address these issues. We implement a pseudo-Verlet list, a particle sorting algorithm, and our own vectorisation strategies. We have applied our strategy to the cosmological code called SWIFT, which solves the equations of hydrodynamics using SPH. We observe a speedup of 6.65x for the raw particle interactions and 2.32x for the overall neighbour find and interaction function using AVX2.

Intel HPC Devcon

Salt Lake City, US | November 2016

SWIFT: Using Task-Based Parallelism, Fully Asynchronous Communication and Vectorization to achieve maximal HPC performance | Matthieu Schaller

Abstract

We present a new open-source cosmological code, called swift, designed to solve the equations of hydrodynamics using a particle-based approach (Smooth Particle Hydrodynamics) on hybrid shared / distributed-memory architectures. Swift was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100-supercomputers (Tier-0 systems), without relying on architecture-specific features or specialised accelerator hardware. This performance is due to three main computational approaches: - Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores. - Graph-based and genetic algorithm-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, as opposed to just the data, as is the case with most partitioning schemes, is equally distributed across all nodes. - Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring on tasks that rely on data from other nodes until it arrives, - Explicit vectorization of the core kernel to exploit all the available FLOPS on architectures such as Xeon Phi. In order to use these approaches, the code had to be rewritten from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold on x86 architecture making SWIFT more than an order of magnitude faster than current alternative software.

IPCC-EMEA Fall Meeting

Toulouse, France | October 2016

SWIFT: Scheduling tasks efficiently on 256 cores - The KNL challenge | Matthieu Schaller

PASC Conference 2016

Lausanne, Switzerland | June 2016

SWIFT: Strong scaling for particle-based simulations on more than 100'000 cores | Matthieu Schaller

Abstract

We present a new open-source cosmological code, called SWIFT, designed to solve the equations of hydrodynamics using a particle-based approach (Smooth Particle Hydrodynamics) on hybrid shared/distributed-memory architectures. SWIFT was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100-supercomputers (Tier-0 systems), without relying on architecture-specific features or specialised accelerator hardware. This performance is due to three main computational approaches: (1) Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores. (2) Graph-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, as opposed to just the data, as is the case with most partitioning schemes, is equally distributed across all nodes. (3) Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring on tasks that rely on data from other nodes until it arrives. In order to use these approaches, the code had to be re-written from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Power8-based architectures.

EASC conference 2015

Edinburgh, UK | April 2015

SWIFT: task-based hydrodynamics and gravity for cosmological simulations | Tom Theuns

Abstract

Simulations of galaxy formation follow the gravitational and hydrodynamical interactions between gas, stars and dark matter through cosmic time. The huge dynamic range of such calculations severely limits strong scaling behaviour of the community codes in use, with load-imbalance, cache inefficiencies and poor vectorisation limiting performance. The new swift code exploits task-based parallelism designed for many-core compute nodes interacting via MPI using asynchronous communication to improve speed and scaling. A graph-based domain decomposition schedules interdependent tasks over available resources. Strong scaling tests on realistic particle distributions yield excellent parallel efficiency, and efficient cache usage provides a large speed-up compared to current codes even on a single core. SWIFT is designed to be easy to use by shielding the astronomer from computational details such as the construction of the tasks or MPI communication. The techniques and algorithms used in SWIFT may benefit other computational physics areas as well, for example that of compressible hydrodynamics.

Exascale Computing in Astrophysics

Ascona, Switzerland | September 2013

SWIFT: Task-based parallelism, hybrid shared/distributed-memory parallelism, and SPH simulations | Pedro Gonnet