A short-ranged n-body simulation has become the first application built on LibGeoDecomp to break the PFLOPS barrier, and it went on to scale up to 16384 of Titan's 18688 nodes. The simulation model is essentially the same as the one used in a recent paper by Andrey Vladimirov and Vadim Karpusenko, but modified to discard interactions beyond a cut-off radius, as they have little influence on the result.
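To illustrate what discarding interactions beyond a cut-off radius means, here is a minimal C++ sketch. It is not the code that ran on Titan: the `Particle` struct, the function `accumulateForces`, the `softening` parameter and the choice of units (gravitational constant set to 1) are all assumptions made for this example. For brevity it still visits every pair and merely skips the distant ones; a production code would typically use a spatial grid so that distant pairs are never visited at all.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type for illustration; not a LibGeoDecomp class.
struct Particle {
    double pos[3];
    double acc[3];
    double mass;
};

// Resets all accelerations, then accumulates pairwise gravitational-style
// accelerations (G = 1), skipping every pair whose distance exceeds `cutoff`.
// Returns the number of pair interactions that were actually evaluated.
std::size_t accumulateForces(std::vector<Particle>& particles,
                             double cutoff,
                             double softening = 1e-9)
{
    assert(cutoff > 0.0);
    const double cutoff2 = cutoff * cutoff;
    std::size_t evaluated = 0;

    for (Particle& p : particles) {
        p.acc[0] = p.acc[1] = p.acc[2] = 0.0;
    }

    for (std::size_t i = 0; i < particles.size(); ++i) {
        for (std::size_t j = i + 1; j < particles.size(); ++j) {
            double d[3];
            double r2 = 0.0;
            for (int k = 0; k < 3; ++k) {
                d[k] = particles[j].pos[k] - particles[i].pos[k];
                r2 += d[k] * d[k];
            }

            // The cut-off: distant pairs contribute little, so drop them.
            if (r2 > cutoff2) {
                continue;
            }

            const double inv = 1.0 / std::sqrt(r2 + softening);
            const double inv3 = inv * inv * inv;
            for (int k = 0; k < 3; ++k) {
                // Equal and opposite forces, hence opposite signs.
                particles[i].acc[k] += particles[j].mass * inv3 * d[k];
                particles[j].acc[k] -= particles[i].mass * inv3 * d[k];
            }
            ++evaluated;
        }
    }
    return evaluated;
}
```

With three particles on a line at x = 0, 1 and 10 and a cut-off of 2, only the first pair interacts, which shows how the cut-off removes arithmetic work per particle.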
As a side effect, the cut-off decreases the code's computational intensity. The advanced latency-hiding algorithm implemented in the HiParSimulator ensured good scalability despite each node receiving relatively little load. The CUDAStepper, which was recently added to the trunk, shifts the majority of calculations to the GPU but updates the halos on the CPU. The CPU is better suited for such irregular workloads, and handling the halos there shortens the critical (data) path. Recursive bisection was used for domain decomposition. Weak scaling efficiency was beyond 90% at 16384 nodes.
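The idea behind recursive bisection can be sketched in a few lines of C++. This is a generic illustration, not LibGeoDecomp's partitioner: the `Box` struct and the `bisect` function are hypothetical names, the domain is assumed to be a 2D grid, and all nodes are assumed to be equally fast, so each part should receive roughly the same number of cells.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned 2D region of grid cells; not a LibGeoDecomp class.
struct Box {
    int origin[2];
    int extent[2];
};

// Splits `box` into `parts` regions by repeatedly cutting the longer axis.
// The cut position is proportional to the number of parts on each side,
// so part counts that are not a power of two still get balanced areas.
void bisect(const Box& box, std::size_t parts, std::vector<Box>& out)
{
    assert(parts >= 1);
    if (parts == 1) {
        out.push_back(box);
        return;
    }

    const std::size_t leftParts = parts / 2;
    const std::size_t rightParts = parts - leftParts;

    // Cutting the longer axis keeps the pieces compact, which keeps
    // their surface (and thus the halo to be exchanged) small.
    const int dim = (box.extent[0] >= box.extent[1]) ? 0 : 1;

    const long long scaled =
        static_cast<long long>(box.extent[dim]) *
        static_cast<long long>(leftParts);
    const int leftExtent =
        static_cast<int>(scaled / static_cast<long long>(parts));

    Box left = box;
    left.extent[dim] = leftExtent;

    Box right = box;
    right.origin[dim] += leftExtent;
    right.extent[dim] -= leftExtent;

    bisect(left, leftParts, out);
    bisect(right, rightParts, out);
}
```

For example, a 64 x 32 grid split into 8 parts yields eight 16 x 16 boxes. A weighted variant would replace the part counts in the split ratio by the summed performance of the nodes on each side.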
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.