One point where LibGeoDecomp really shines is its powerful geometry subsystem. This part handles the domain decomposition and can adapt to virtually any partitioning scheme. The downside of this flexibility is that we have to store the coordinates of the cells allocated on a node somewhere. This is the task of the Region class, which performs a run-length compression to save memory. For an approximately cubic region of cells it needs O(n²) space, which should be negligible compared to the O(n³) bytes for the grid itself. Well, it should be... In truth, such asymptotic comparisons only hold for problem sizes which exceed certain (algorithm- and implementation-specific) constants -- something one tends to forget in times when each core handles gigabytes of data.
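
To illustrate the idea behind the run-length compression, here is a minimal sketch (the type and function names are mine, not LibGeoDecomp's actual API): a streak encodes one row of consecutive cells, so a solid n×n×n cube needs only n² streaks instead of n³ coordinates.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch, not the library's real API: a Streak run-length
// encodes one row of consecutive cells along the x axis.
struct Streak {
    int y, z;    // position of the row
    int beginX;  // first cell of the run
    int endX;    // one past the last cell of the run
};

// Encode a solid cube of edge length n: one streak per (y, z) row,
// i.e. n * n streaks for n * n * n cells.
std::vector<Streak> encodeCube(int n) {
    std::vector<Streak> streaks;
    for (int z = 0; z < n; ++z) {
        for (int y = 0; y < n; ++y) {
            streaks.push_back(Streak{y, z, 0, n});
        }
    }
    return streaks;
}
```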

I first noticed that something was odd when I ran the first scaling tests on LiMa: I could only get past a certain number of cores if the working set size per node was reduced -- otherwise the job would hang indefinitely during initialization. The bug only occurred when running on many cores, and it took almost forever to get jobs through LiMa's queue, so waiting to attach a debugger was out of the question. Since my code passed all unit tests, I decided to blame this on MPI or a quirk of the machine and moved on.

Matters became more dramatic when I discovered, the evening before our reservation on Tsubame, that my jobs would consume excessive amounts of memory on more than 30 GPUs -- while I was prepping to run on >1000 GPUs. (BTW: I'll post the results from Tsubame soon, I just need to wait a bit until they're published.) The simulation model on Tsubame used much smaller simulation cells, which is why I could run larger grids and why the bug was triggered by smaller instances. Also, the nodes did not start thrashing, but crashed immediately because of Tsubame's tighter restrictions on memory size. This allowed me to dig deeper into the code and isolate the Region class as the delinquent. While I could write a quick workaround for Tsubame, the problem itself remained unsolved until today, 12:53 GMT.

This is how it went: I first wrote a test case to measure memory consumption depending on the domain size n. It turned out that the original Region slurped a generous 160 bytes per row (Streak) of consecutive coordinates -- much more than the approximately 32 bytes I was expecting. We need to store multiple Region instances per process, and together these may easily claim more memory than the grid itself. Big O doesn't help us here: the amount of memory per core has remained relatively stable at about 2 GB in the past years, so the grids can't outgrow the geometry handling. The constants are obviously too large.

#   Test Family       Dimensions            Performance
1   RegionInsert      (128, 128, 128)       0.002968 s
2   RegionInsert      (512, 512, 512)       0.052137 s
3   RegionInsert      (2048, 2048, 2048)    0.960965 s
4   RegionIntersect   (128, 128, 128)       0.023085 s
5   RegionIntersect   (512, 512, 512)       0.395194 s
6   RegionIntersect   (2048, 2048, 2048)    6.81856 s

So what's the solution? The original Region stores parts of the information redundantly. It also uses std::map -- ergo a tree with lots and lots of pointers, which eat up precious RAM. A reimplementation was imperative. Replacing a core class of the library made me feel uneasy, and it took some days to get it right and debugged.
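
The pointer overhead of node-based containers is easy to underestimate. A back-of-the-envelope estimate (assuming a typical 64-bit red-black tree implementation; these numbers are my assumptions, not measurements of LibGeoDecomp):

```cpp
#include <cassert>
#include <cstddef>

// Rough, platform-dependent estimate: each std::map entry lives in a
// red-black tree node that adds three pointers (parent/left/right)
// plus a color flag to the payload, padded for 8-byte alignment.
// Heap allocator bookkeeping would add even more on top.
std::size_t approxMapNodeBytes(std::size_t payloadBytes) {
    std::size_t header = 3 * sizeof(void*) + sizeof(int);
    std::size_t raw = header + payloadBytes;
    return (raw + 7) / 8 * 8;  // round up to alignment
}
```

With a payload of only 16 bytes this already lands near 48 bytes per entry; keep the same information in several redundant maps and the 160 bytes per Streak observed above no longer look surprising.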

The new design strips all redundant storage and uses only three flat arrays (one per spatial dimension), interlinked via indices. The drawback is that on every insertion and deletion of streaks we have to update the indices in large parts of the arrays. The worst-case complexity went up from O(log n) to O(n²). Really, really bad. But wait -- what about those constants? Surely it's not that bad, right? At this point I have to say that I'm extraordinarily happy to have both unit tests and performance tests in place. In this case the performance tests measure inserts and intersections, the two most common operations on a Region, iteration aside. The results, however, were rather mixed. The first table above lists the results prior to the reimplementation (revision 350:d14ec73a10cd, tested with GCC 4.6.3 on an Intel Core i7-2600K)...
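
A minimal 2D sketch of this kind of layout (hypothetical types; the real class is 3D and keeps one array per spatial dimension) shows where the index-update cost comes from:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat-array layout: "rows" holds one entry per occupied
// y coordinate plus an index into "runs", where the x-ranges live.
struct Run      { int beginX, endX; };
struct RowEntry { int y; std::size_t firstRun; };  // index into runs

struct FlatRegion {
    std::vector<RowEntry> rows;  // sorted by y
    std::vector<Run>      runs;  // grouped by row, in row order

    // Insert a run for a brand-new row. Appending at the end is cheap,
    // but inserting anywhere else shifts elements of "runs" and patches
    // every later row's index -- the linear per-insert cost noted above.
    void insertRow(int y, int beginX, int endX) {
        std::size_t pos = 0;
        while (pos < rows.size() && rows[pos].y < y) ++pos;
        std::size_t runPos = (pos < rows.size()) ? rows[pos].firstRun
                                                 : runs.size();
        runs.insert(runs.begin() + runPos, Run{beginX, endX});
        rows.insert(rows.begin() + pos, RowEntry{y, runPos});
        for (std::size_t i = pos + 1; i < rows.size(); ++i) {
            ++rows[i].firstRun;  // repair indices behind the insertion point
        }
    }
};
```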

The next table lists the timings and relative speedups of the new Region implementation, revision 351:cc0339630cdf. Inserts (tests 1-3) look pretty good. This is because the tests insert only at the end of the Region, which (even with the new implementation) yields a much better insert complexity of O(log n) than the worst case. It is 4-5x faster than the original code as it usually avoids costly allocations of new memory. By contrast, the intersections (tests 4-6) are prohibitively slow. The intersection operator & was implemented as a & b == b - (b - a), and subtraction needs to insert/delete streaks at the beginning of the arrays as it operates in place -- thus O(n²). Naturally I had to reimplement this operator, too. The new algorithm traverses both regions and appends matching streaks to a new array. And those inserts are fast. The last table was measured with revision 352:112f4d2ad22d. The best thing about the new intersection operator is that it's not just no slower, but almost 10x faster than the original. Incidentally, the code now allocates just 16 bytes per Streak -- also a 10x improvement. Not bad.

#   Test Family       Dimensions            Performance   Speedup
1   RegionInsert      (128, 128, 128)       0.000591 s    5.022
2   RegionInsert      (512, 512, 512)       0.011337 s    4.599
3   RegionInsert      (2048, 2048, 2048)    0.213258 s    4.506
4   RegionIntersect   (128, 128, 128)       0.484808 s    0.048
5   RegionIntersect   (512, 512, 512)       149.2 s       0.003
6   RegionIntersect   (2048, 2048, 2048)    eternity?     zero?

#   Test Family       Dimensions            Performance   Speedup
1   RegionInsert      (128, 128, 128)       0.000646 s    4.4257
2   RegionInsert      (512, 512, 512)       0.012235 s    4.23098
3   RegionInsert      (2048, 2048, 2048)    0.227063 s    4.19547
4   RegionIntersect   (128, 128, 128)       0.00203 s     11.3015
5   RegionIntersect   (512, 512, 512)       0.039035 s    10.0636
6   RegionIntersect   (2048, 2048, 2048)    0.718114 s    9.39575
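
The merge-style intersection described above can be sketched like this (hypothetical types, not the library's API; it assumes both streak lists are sorted by row, then by start coordinate -- every insert lands at the back of the result, which is exactly what makes it fast):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One run-length encoded segment: cells [beginX, endX) in row y.
struct Streak {
    int y, beginX, endX;
};

// Merge-style intersection of two sorted streak lists: walk both in
// lockstep, append each overlap to a fresh vector. Since all inserts
// happen at the end, the whole pass is linear in the streak count.
std::vector<Streak> intersect(const std::vector<Streak>& a,
                              const std::vector<Streak>& b) {
    std::vector<Streak> result;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].y < b[j].y) { ++i; continue; }
        if (b[j].y < a[i].y) { ++j; continue; }
        // same row: emit the overlap of the two x-ranges, if any
        int lo = std::max(a[i].beginX, b[j].beginX);
        int hi = std::min(a[i].endX,   b[j].endX);
        if (lo < hi) {
            result.push_back(Streak{a[i].y, lo, hi});
        }
        // drop whichever streak ends first
        (a[i].endX < b[j].endX) ? ++i : ++j;
    }
    return result;
}
```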
last modified: Wed Oct 26 16:51:19 2016 +0200