# Fork/Join and other Techniques to Improve Performance

11/02/12

In the last few years there has been nearly no improvement in single thread performance of CPUs. On the other hand, the number of cores increases: Laptops with eight cores are common (okay, including hyperthreading, only four real cores). Even modern smartphones often have four cores. To utilize these modern beasts, you need parallel programming.

In this article, I use a simple board game as an example for a parallel algorithm and other optimization techniques, a variant of peg solitaire. The problem to solve is: How many different solutions exist for a board with n pegs on a side? The focus is on different optimization techniques, not only the Fork/Join framework. You may be surprised to find other techniques are much more efficient for these problems.

# Definition of the Problem

Lets start with a more precise definition of the problem. We play on a triangular board. A board with edge length 5 (n = 5) before any move has been done looks like this:

x x x x o x x x x x x x x x x

The middle peg of the third row is empty. A legal move is a jump over one peg in one of the six different directions. The leapfrogged peg is removed from the board. So the board could look like this after one move:

x x x x x x x o x x x o x x x

A solution is found when there is only one peg left, wherever it is located on the board. You get different results for different starting positions, see Dan O’Briens Puzzle Solution Page for some more information on the topic.

Given a Java class which can represent a position and which is capable to compute a list of all resulting positions after one move, the solver is a simple recursive function (source code as zip):

long countSolutions(Board start) { if (start.isSolution()) { return 1; } else { long count = 0; for (Board board : start.nextPositions()) { count += countSolutions(board); } return count; } } |

When you feed it with the start board with edge length five, it takes about a tenth of a second and you can see there are 1,550 solutions for n = 5. A tenth of a second is a short time, so why optimize? Let’s see bigger values, e.g. n = 6. Takes a little bit longer. Much longer. Not as long as to compute 42, but about 30 hours resulting in 29,235,690,234 (now it should be obvious why `countSolutions()`

returns a `long`

and not an `int`

).

Why is there such a huge difference for a slightly larger board? Because the number of positions for a board of size n is 2^(n * (n+1)/2). The exponent is the number of holes/pegs on the board, which increases quadratically.

# Fork/Join

When you know the Java Fork/Join framework (otherwise read the fork/join tutorial), you should see the perfect match: In each recursion level, you can fork a thread for the list of next positions. Here is the code, first the initialization of the pool and the code for starting the computation:

ForkJoinPool pool = new ForkJoinPool(numThreads); RecursiveSolver root = new RecursiveSolver(startBoard, sequential); solutions = pool.invoke(root); |

Then the implementing class:

class RecursiveSolver extends RecursiveTask<Long> { private Board start; private int sequential; public RecursiveSolver(Board start, int sequential) { this.start = start; this.sequential = sequential; } @Override protected Long compute() { int card = start.cardinality(); if (card == 1) { return Long.valueOf(1); } else if (card < sequential) { return Long.valueOf(countSolutions(start)); } else { List<Board> nextPositions = start.nextPositions(); List<Board> tasks = new ArrayList<>(nextPositions.size()); for (Board b : nextPositions) { tasks.add(new RecursiveSolver(b, sequential)); } invokeAll(tasks); long count = 0; for (RecursiveSolver rs : tasks) { count += rs.join(); } return count; } return Long.valueOf(0); } } |

The recursion of the sequential algorithm has been replaced by the creation of new instances of `RecursiveTask`

. I introduced another optimization (as proposed in the fork/join tutorial): The parallel algorithm switches back to a sequential one when there are less than `sequential`

pegs left. This avoids the overhead of task creation for small problems. After some experiments I used eight as threshold in my test runs.

Starting this, my laptop (eight cores with hyperthreading, four real ones) was unusable for the next 7 hours and 28 minutes. Compared to the 30 hours of the sequential solver, a factor of four, which matches the number of “real” cores. So why bother? Four cores, four times faster than sequential, perfect speedup.

But what about n = 7? How many solutions are there for a board with edge length seven? I didn’t run this on my laptop, neither sequential nor parallel. I assume it would exceed the lifetime of the poor machine. So let’s look for some other optimizations.

# Caching

As in most board games, there is often more than one sequence of moves which result in the same position. An obvious optimization is to store the number of solutions for already computed positions in a `HashMap`

. This is a well known technique called transposition table. As a precondition, the class `Board`

has to implement `hashCode()`

and `equals()`

. For n = 5 this doesn’t make a big difference, we get the answer in 0.07 seconds, 70% of the time needed by the simple sequential solver. For n = 6 we get a more impressing effect, only 0.4 seconds elapse before we can see the result. That’s about 270,000 times faster compared to the sequential solver and even 67,500 times faster compared to the parallel solver running with four cores.

This sounds very promising, so let’s try the next board size, n = 7. Starting this without any JVM options results in an `OutOfMemoryError`

, the `HashMap`

does not fit into the standard heap. Increasing the heap size with the well known `-Xmx`

does not help on a 32 bit JVM: The memory needed does not fit into the 32 bit address space. The next step is to use the brute force approach: 64 bit JVM and the `-d64`

option to activate the 64 bit mode.

**Stop!**

I like the `HashMap`

, it’s is one of my favorite data structures and amazingly fast. But in this case there is a simpler, more efficient data structure, the good old array. A position in the game can be represented by some bits, for n = 7 you need 7*(7+1)/2=28 bits, which fits into an integer which can be used as the index of the array. The value in the array is the number of solutions for this position, -1 for positions which have not been evaluated so far. This still doesn’t fit into the 32 bit address space for n = 7, but is more efficient (in time and space) than the `HashMap`

solution. For n = 6, we need only 0.2 seconds compared to the 0.4 seconds.

When we have a 64 bit JVM, we can attack n = 7. But for one moment let’s assume we can’t afford the amount of memory and still want to solve the problem. When you add some debugging output to your code, you will find some strange behavior for n = 7: For n = 5 or n = 6 there are a lot of different solutions, usually the algorithms finds the first solutions quite fast. Not for n = 7. When I tried this first (some years ago, with C instead of Java on an old SUN workstation), the code found no solutions running several minutes. I had a strong suspicion: The triangle peg solitaire has no solution for n = 7. So I modified the code and used only one bit for each position: 0 = position not evaluated so far, 1 = position evaluated, no solution found.

Last week, when I tried this again I was too lazy to use bits, instead I changed the array from `long`

to `byte`

, which was small enough to fit into the 32 bit address space. I could have used a Java `BitSet`

, which saves even more space, but was too lazy. It confirmed what I already knew: There is no solution for n = 7, it took 34 seconds to compute this. Using the 64 bit JVM and `long`

is a little bit slower: 37 seconds. I attribute the three seconds to worse cache locality.

# Parallelism Again

We have seen two orthogonal ways to improve performance: Parallelism and caching. Is it possible to combine the approaches? Will this be even faster? Yes, we can combine them, but it becomes uglier. The sheer elegance of the fork join is based on its simplicity: We create new tasks, invoke them in a parallel fashion, wait for the result: You need no synchronized blocks or synchronized methods, every thread works on its own data. A global data structure like a `HashMap`

or array destroys this simplicity, they both need some way of synchronization. But what is the granularity? Locking the complete array for each access? This causes two problems:

- Much of the parallelism will be destroyed because all threads compete for one resource.
- It does not solve the problem of duplicate work: After one thread sees an unevaluated position and starts to evaluate it, another thread may evaluate the same position in parallel, wasting resources.

So let’s try a more fine grained approach: Locking an entry for one position. Because we need an object as lock holder, we have to change the array of `long`

s to an array of some sort of objects:

class Value { public Value() { v = -1; } public long v; } |

The rest of the code looks similar, but with a synchronized block:

long countSolutions(Board start) { Integer startAsInt = Integer.valueOf(start.asInteger()); Value value = cache[startAsInt]; synchronized (value) { if (value.v != -1) { return value.v; } else if (start.isSolution()) { value.v = 1; return 1; } else { long count = 0; List nextPositions = start.nextPositions(); for (Board board : nextPositions) { count += countSolutions(board); } value.v = count; return count; } } // synchronized } |

With this approach, we have a separate lock for each position. A thread holds the lock until the evaluation of the position is complete. This avoids duplicate work by several threads, but limits parallelism. For this reason, you should start this algorithm with more threads than CPUs on your system.

Unfortunately the overhead caused by the value object compared to the primitive data type and the synchronization is not compensated by the parallelism: For n = 6 we need 1 second, five times slower compared to the fastest sequential solution with the array of `long`

s.

# Lessons Learned

What can we learn from this experiment? Is there anything valuable learned here you can make use of when coding enterprise applications with boring/interesting (No)SQL-Databases as back end? For me it was the first time I used the Fork/Join framework, so I learned this :-). I was surprised, it’s quite easy. The load balancing and work stealing mechanisms seem to work good, the speedup compared to the sequential algorithm was as expected. This is definitively much easier comparing to create threads manually.

The second lesson is about better algorithms. As we have seen, this can make a world of difference, not just a factor of four gained by parallelism. This is far more important than eliminating some function calls or saving a few cycles by replacing `double`

with `float`

or some other tricky programming. This is especially true for large problems, where – for example – the time complexity `n log(n)`

of a good algorithm is much smaller than a time complexity `n^2`

of a bad algorithm (hint: Sorting).

The third lesson is simple: Don’t do the work at all. At least, don’t repeat it, use caching instead of repeated expensive operations. In this example, the expensive operation was the evaluation of identical branches in the tree. In enterprise applications, accessing the database usually takes most of the time. Given a good JPA provider or application server, you don’t have to implement the caching yourself, just plug in the cache recommended/supported by your provider/server and use the saved time to find a good set of configuration parameters.

In other cases, you have to do some work yourself. But don’t implement everything, there are helping classes available. The `HashMap`

or array used in this post are no real caches, they miss the feature of forgetting entries, so they will blow up your memory at some point. But the JDK has other classes which attack this problem: A `WeakHashMap`

forgets entries automatically when the garbage collector is running, but you have no control when entries are removed or which entries are removed. So it is not recommended to implement a cache. To regain some sort of control, extend `LinkedHashMap`

and override `removeEldestEntry()`

(see javadoc for details). This gives you an LRU cache with just a few lines of code.

When you want even more control, I recommend the Google Guava Cache. It allows eviction on time base or on weight base with a user defined comparison function for the weight.

Another important lesson *not* learned here is the proper use of a profiler. It can give you valuable information where your application spends all the time. For this simple example, it was clear without a profiler.

# Epilogue

It may come as a surprise that there is no solution for n = 7. In fact, you can prove there is no solution for every `n`

where `n modulo 3 = 1`

. I will give a short sketch of the parity based proof.

First let’s place numbers on the board according to the following two patterns:

1 1 1 0 0 1 0[1]1 1[1]0 1 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 0

The field in brackets is the field with no peg at the start of a game. The parity is computed by adding all numbers of the fields with a peg and to apply modulo 2. For `n = 6`

there is an even number of ones on the board. Because the empty field has a one, too, the parity of the start position is odd. If you look at the pattern in a row or on one of the diagonals, you see a repeated sequence of `1 1 0`

. For every move in such a pattern, the parity stays the same.

Obviously, when the parity of the start position is odd (which is true for the left and right pattern), it must be odd for every position in the game, including the end position. An odd parity with one peg is possible only if this peg is on a field marked with a one.

If you record all end positions with one peg for `n = 5`

, you se it’s always on the same place, which is marked with a one in both patterns:

o o o o o o o o o o o o x o o

For n = 6 there are several fields where the last peg may end. Note all these fields are marked with a one on both boards shown above:

x o o o x o x o o x o o x o o o x o o x o

When `n modulo 3 = 1`

, the number of fields modulo three is one, too. If you extend the patterns shown above, you see there is always a one on the lower left and lower right corner. As a consequence, you have a number of `1 1 0`

groups and one additional one. Together with the empty field in the start position located on a one, this results in an even parity for the start position. Even parity with one peg left implies the last peg has to end on a field marked with zero. But whenever a field is marked zero in the left pattern, it is marked with a one in the right pattern (and vice versa). So there is no possible end position left for the last peg…

Wouldn’t it be evil to sell this game with size `n = 7`

?

## Kommentare

6. November 2012 von Ilias Tsagklis

Hey nice article!

Fork/join is quite a powerful framework for achieving parallelism in your code. You may also find a similar example (with benchmarks) here:

Java Fork/Join for Parallel Programming

6. November 2012 von roger.butenuth

Yes, the article contains more Fork/Join details. But the example Fibonacci numbers is not the best one. You can compute them without recursion and without cache much faster. This is simple enough to implement on a programmable calculatur (25 year old) and will be faster than the parallel/recursive implementation (on current hardware).

21. February 2013 von Java dev

One of the best example to learn fork join framework. Thanks a lot.

## Comment