We're very happy to announce the availability of LibFlatArray 0.3.0, our C++ library for Struct-of-Arrays containers and expression templates for vectorization. This latest release represents a huge leap forward. It comprises more code, more commits, and more supported instruction set architectures (ISAs) than all previous releases. Our focus for this release was to support all architectures that are relevant for HPC and to extend the vectorization intrinsics to kernels with irregular memory access patterns and control flow.

The authors would like to acknowledge the funding of the Deutsche Forschungsgemeinschaft (DFG) through the Cluster of Excellence Engineering of Advanced Materials.

We would also like to thank Google for sponsoring Larry Xiao as part of the Google Summer of Code 2015 program. LibFlatArray participated in GSoC 2015 as part of the Ste||ar Group.

Introduction to LibFlatArray

Version  Date        Size (bytes)  MD5                               File                        Signature
0.3.0    2016.10.15  104544        ca10dd91c67f5bb1d03578d0c1d9f9f4  libflatarray-0.3.0.tar.bz2  libflatarray-0.3.0.tar.bz2.sig
0.3.0    2016.10.15  143859        3e9cb08914768ab7f66a35312424dc30  libflatarray-0.3.0.tar.gz   libflatarray-0.3.0.tar.gz.sig

LibFlatArray 0.3.0 in Numbers

  • Size: 102 KiB (bz2-packed tarball, 0.2.0 was 33 KiB)
  • Lines of code: 36k (0.2.0 was 6.5k)
  • 428 commits since 0.2.0
  • 4 contributors

Key Features New in 0.3.0

  • New ISAs: Intel AVX512 and ARM NEON. Our vector expression templates short_vec now support Intel's AVX512, which is used by the current Intel Xeon Phi coprocessor (Knights Landing) and upcoming Intel Xeon products. ARM NEON is mostly interesting for handheld devices. Further supported ISAs: SSE, AVX, QPX, MIC (Intel KNC).
  • Scatter/gather: short_vec can now perform gather loads from and scatter stores to main memory. This is useful for vectorizing kernels with irregular memory access patterns, e.g. sparse matrix operations like SpMVM (requires a C++11 compliant compiler).
  • Better CUDA support: soa_grid and soa_array can both be used with CUDA memory and support moving data to host memory.
  • Handling of conditionals: short_vec now implements comparison operators (<, <=, ==, >, >=). any() can be used to quickly check if any vector element matches the comparison. Rare/expensive conditionals can then be handled in a scalar fashion using get() for element retrieval.
  • New example: a 3D Jacobi smoother which highlights the use of vectorization, loop peeling, loop unrolling, and the transparent switch to streaming stores for large arrays. Depending on the array size, this example matches or exceeds the performance of the C99 reference code.
  • New example: a 2D Gauss filter that is applied to a 3D volume. This is similar to the Jacobi example but highlights the compile time address calculation feature of LibFlatArray.
  • New example: a smoothed particle hydrodynamics code that demonstrates vectorization of particle methods and branch harvesting for handling rare/expensive conditionals (in this case: particle interactions). It exceeds the performance of the scalar C99 reference code because the compiler fails to vectorize that reference code.
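
The gather loads and the any()/get() handling of rare conditionals described above can be illustrated with a small scalar emulation. This is a minimal sketch under stated assumptions, not LibFlatArray code: toy_vec, toy_mask, toy_any, sparse_row_dot and sum_clamped are hypothetical names made up for this illustration, and the "vector" is just a 4-element array. Only the ideas (gather, lane-wise comparison, a quick any-lane check, per-element retrieval) are taken from the feature list.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 4-wide vector type, emulated with scalar loops.
// It is NOT LibFlatArray's short_vec; all names here are illustrative.
constexpr std::size_t ARITY = 4;

struct toy_vec {
    std::array<double, ARITY> lanes{};

    // Contiguous load of ARITY values starting at ptr.
    static toy_vec load(const double* ptr) {
        toy_vec v;
        for (std::size_t i = 0; i < ARITY; ++i) v.lanes[i] = ptr[i];
        return v;
    }

    // Gather load: lane i is read from base[indices[i]].
    static toy_vec gather(const double* base, const int* indices) {
        toy_vec v;
        for (std::size_t i = 0; i < ARITY; ++i) v.lanes[i] = base[indices[i]];
        return v;
    }

    // Scalar access to a single lane.
    double get(std::size_t i) const { return lanes[i]; }
};

using toy_mask = std::array<bool, ARITY>;

// Lane-wise comparison against a scalar yields one flag per lane.
toy_mask operator<(const toy_vec& a, double limit) {
    toy_mask m{};
    for (std::size_t i = 0; i < ARITY; ++i) m[i] = a.lanes[i] < limit;
    return m;
}

// True if at least one lane of the mask is set.
bool toy_any(const toy_mask& m) {
    for (bool flag : m) {
        if (flag) return true;
    }
    return false;
}

// One row of a sparse matrix-vector product: sum of values[k] * x[columns[k]].
// The irregular reads of x are expressed as gathers.
double sparse_row_dot(const std::vector<double>& values,
                      const std::vector<int>& columns,
                      const std::vector<double>& x) {
    assert(values.size() == columns.size());
    double sum = 0.0;
    std::size_t k = 0;
    for (; k + ARITY <= values.size(); k += ARITY) {
        toy_vec v = toy_vec::load(&values[k]);
        toy_vec g = toy_vec::gather(x.data(), &columns[k]);
        for (std::size_t i = 0; i < ARITY; ++i) sum += v.get(i) * g.get(i);
    }
    // Scalar remainder for rows whose length is not a multiple of ARITY.
    for (; k < values.size(); ++k) sum += values[k] * x[columns[k]];
    return sum;
}

// Sum in which values below `lower` count as `lower`. The fix-up is treated
// as the rare case: it is only entered if toy_any() reports a matching lane.
double sum_clamped(const std::vector<double>& data, double lower) {
    double sum = 0.0;
    std::size_t k = 0;
    for (; k + ARITY <= data.size(); k += ARITY) {
        toy_vec v = toy_vec::load(&data[k]);
        if (toy_any(v < lower)) {
            // Rare path: inspect the lanes one by one via get().
            for (std::size_t i = 0; i < ARITY; ++i) {
                double e = v.get(i);
                sum += (e < lower) ? lower : e;
            }
        } else {
            // Common path: no lane needs special treatment.
            for (std::size_t i = 0; i < ARITY; ++i) sum += v.get(i);
        }
    }
    for (; k < data.size(); ++k) sum += (data[k] < lower) ? lower : data[k];
    return sum;
}
```

With a real SIMD type the common path would stay fully vectorized, and only the blocks flagged by the any-lane check would fall back to scalar code, which is the point of checking before branching.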

Further Changes

  • Boost is no longer a required dependency.
  • short_vec now supports double, float, and int as element types.
  • New cuda_array is a convenient helper class for exchanging AoS data between host and device.
  • Transparent non-temporal stores and loop unrolling: the new streaming_short_vec behaves just like a short_vec, but will do all stores with the non-temporal (no read) hint to avoid cache pollution. The new type trait estimate_optimum_short_vec_type can be used to select both the optimum store strategy and the arity of short_vec. Choosing an arity larger than the hardware's vector width results in automatic loop unrolling.
  • Loop peeling: loop_peeler() can handle the scalar iterations at the beginning and end of vectorizable loops. It is now also usable with C++14 generic lambdas (lambdas with auto parameters), which results in a much more natural code layout. Previously the loop body had to be moved to a separate class.
  • Marshaling halos: soa_grid can now load a subset of the grid from a contiguous region of memory and store it back. This is most useful for marshaling parts of the grid for exchanging halo regions (ghost zones) when using MPI or HPX for multi-node parallelization. See this unit test for how to use this feature.
  • Member types: soa_grid and soa_array can now work with member arrays and types other than built-in types (previously no c-tors and d-tors were run for members of the SoA structure, which is obviously terrible).
  • Obviously: tons of bug fixes and improved test coverage (unit tests and performance tests).
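
The loop peeling idea from the list above can be sketched in a few lines of C++14. This is an illustration under assumptions, not the library's loop_peeler(): toy_loop_peeler, peel_stats and scale_in_place are hypothetical names, alignment is judged by loop index rather than by memory address, and the "vector" body is a plain inner loop. The generic lambda receives the width as a compile-time tag, so one body serves both the scalar and the wide iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical loop peeler: runs body(i, width_tag) over [begin, end).
// Scalar calls (width 1) cover the unaligned head and the leftover tail,
// wide calls (width ARITY) cover the aligned middle part.
template<std::size_t ARITY, typename BODY>
void toy_loop_peeler(std::size_t begin, std::size_t end, BODY body) {
    assert(begin <= end);
    using scalar_tag = std::integral_constant<std::size_t, 1>;
    using vector_tag = std::integral_constant<std::size_t, ARITY>;

    std::size_t i = begin;
    // Head: scalar steps until the index is a multiple of ARITY.
    for (; i < end && (i % ARITY) != 0; ++i) body(i, scalar_tag{});
    // Main part: full blocks of ARITY elements.
    for (; i + ARITY <= end; i += ARITY) body(i, vector_tag{});
    // Tail: whatever is left, one element at a time.
    for (; i < end; ++i) body(i, scalar_tag{});
}

// Counts how often each flavor of the body was invoked (for illustration).
struct peel_stats {
    std::size_t scalar_calls;
    std::size_t vector_calls;
};

// Multiplies data[begin..end) by factor using the peeler with arity 4.
peel_stats scale_in_place(std::vector<double>& data,
                          std::size_t begin, std::size_t end, double factor) {
    assert(end <= data.size());
    peel_stats stats{0, 0};
    // C++14 generic lambda: `width` is deduced per call, so its value is
    // known at compile time inside the body.
    toy_loop_peeler<4>(begin, end, [&](std::size_t i, auto width) {
        constexpr std::size_t W = decltype(width)::value;
        for (std::size_t lane = 0; lane < W; ++lane) data[i + lane] *= factor;
        if (W == 1) {
            ++stats.scalar_calls;
        } else {
            ++stats.vector_calls;
        }
    });
    return stats;
}
```

Because the lambda is instantiated once per width, a real implementation could pick a scalar type for width 1 and a SIMD type for the wide case without duplicating the loop body in a separate class.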

Woosh ;-)


last modified: Wed Oct 26 16:51:19 2016 +0200