Stencil codes are often seen as a prime example of real-world problems where vectorization can be applied easily. After all, the same operations have to be carried out for each grid cell, and many prominent kernels, e.g. the LBM (Lattice Boltzmann Method) or RTM (Reverse Time Migration), don't even contain conditionals. It is all the more surprising that compilers still struggle to generate vectorized code automatically. Or are times changing?

A while ago Julian, a student of mine, started working on his master's thesis. His task is to devise means to facilitate auto-vectorization for user-supplied kernels within LibGeoDecomp. To get him started, I wrote some variants of the 3D Jacobi smoother (link to the code). They're all very similar, yet the subtle differences may lead to significant performance penalties, as you can see on the right. I've plotted matrix size vs. kernel performance, measured in GLUPS (giga lattice updates per second).

My excuse for the awkward kernel names: they're named that way for historical reasons. JacobiCellSimple is, unsurprisingly, the shortest Jacobi implementation one can write using LibGeoDecomp. To evaluate the new cell API, I've added JacobiCellStreakUpdate and JacobiCellStraightForward. Both implement updateLine(). While the former is a fairly sophisticated 8x unrolled SSE kernel with register shuffling to avoid unaligned loads, the latter is just a rather dumb, unrolled version of JacobiCellSimple. As expected, my measurements show that the smarter the code, the faster it runs. They also show that for large matrices memory bandwidth becomes the bottleneck. In this case the non-temporal stores (or streaming stores, as Intel calls them) used in JacobiCellStraightforwardNT can squeeze out a couple of extra GLUPS: they avoid the write allocate, which typically doubles the bandwidth required for writing data.

Fig. 1: Core 2 Quad, GCC 4.6.3

A couple of days later, I got a mail back from Julian, basically telling me: "Uhm, what should I improve?" It turned out that he had run the tests on his own system, and all curves, save for the one based on non-temporal stores, more or less converged. Trying to reproduce Julian's results, I re-ran the benchmarks on my new notebook, and what I got are the curves in Fig. 2. So, is GCC 4.7 the silver bullet, case closed? I decided that I needed more data. The plots for Nehalem (Fig. 3 and 4) are remarkably clear. As Julian suggested, GCC's tree vectorizer is doing an outstanding job. For completeness, I returned to the Core 2, just to see how well GCC 4.7 would fare on that machine. It doesn't. Fig. 5 (click to enlarge) is hardly different from Fig. 1. Terrible, terrible, terrible.

The Conclusion

My takeaway message is this: if you run a modern Intel CPU and a recent compiler, you probably don't need to worry much about vectorization. But since LibGeoDecomp needs to run on various architectures (AMD Bulldozer, IBM BG/Q) and we are sometimes tied to older compilers (GCC 4.6 because of CUDA, anyone?) or to compilers whose stencil vectorization we cannot yet judge (e.g. Cray's, or Intel's compilers for MIC), we need a more robust scheme. So again, I'm looking forward to Julian's next results -- possibly next time based on Boost.SIMD?

Fig. 2: Ivy Bridge, GCC 4.7.2

Fig. 3: Nehalem, GCC 4.6.3

Fig. 4: Nehalem, GCC 4.7.1

Fig. 5: Core 2 Quad, GCC 4.7.2


last modified: Wed Oct 26 16:51:19 2016 +0200