If you compare the sequential and parallel
versions of the aggregation pattern, you see that the design of the
parallel version includes an additional step in the algorithm that
merges partial results. Figure 1, which illustrates the Parallel.ForEach and
Parallel.For methods, shows that instead of placing the accumulation in a
single, shared result, the parallel loop uses unshared local storage for
partial results (these are named subtotals in Figure 1).
Each task processes a single partition of the input values. The number
of partitions depends on the degree of parallelism that’s needed to
efficiently use the computer’s available cores. After a task finishes
accumulating the values in its assigned partition, it merges its local
result into the final, global result. Because the final result is shared by
all tasks, locks are required to ensure that updates to it are consistent.
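As a concrete illustration, here is a minimal sketch of this pattern using the overload of Parallel.For that supports task-local state. It assumes the input is a double[] named sequence; as in the example later in this article, the array's initialization is elided.

double[] sequence = ...
object lockObject = new object();
double sum = 0.0d;

Parallel.For(0, sequence.Length,
    // Each task starts with its own subtotal of zero
    // (the unshared local storage).
    () => 0.0d,
    // Accumulate within the task's partition; no lock is needed here.
    (i, loopState, subtotal) => subtotal + sequence[i],
    // Merge the subtotal into the shared result. This is the only place
    // a lock is taken, and it happens once per task, not once per element.
    subtotal =>
    {
        lock (lockObject)
        {
            sum += subtotal;
        }
    });
return sum;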
This approach is fast because it takes very few locks. In typical cases,
the number of elements to be processed is many times larger than the number
of tasks and partitions, so the cost of the serializing locks is amortized
over many individual accumulation operations.
The algorithm used by PLINQ differs slightly from what’s shown in Figure 1.
Aggregation in PLINQ does not require the developer to use locks.
Instead, the final merge step is expressed as a binary operator that
combines any two partial result values (that is, two of the subtotals)
into another partial result. Repeating this process on subsequent pairs
of partial results eventually converges on a final result. One of the
advantages of the PLINQ approach is that it requires less
synchronization, so it’s more scalable. The binary combinations do not
require the system to lock the final result; they can be performed in
parallel. (This is possible because the binary combination operation is
associative, so the way the partial results are grouped into pairs doesn't
affect the final result.)
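To make the combine step concrete, here is a sketch of the same sum written with the PLINQ Aggregate overload that takes an explicit combine function, again assuming a double[] named sequence:

double[] sequence = ...
return sequence.AsParallel().Aggregate(
    // Seed: each partition starts its subtotal at zero.
    0.0d,
    // Update function: accumulate elements within a partition.
    (subtotal, item) => subtotal + item,
    // Combine function: the binary operator that merges two partial
    // results into another partial result; no shared total, no lock.
    (subtotal1, subtotal2) => subtotal1 + subtotal2,
    // Result selector: produce the final value from the merged subtotal.
    total => total);

For a simple sum you would normally just write sequence.AsParallel().Sum(); the explicit Aggregate form is shown here only to make the binary combine step visible.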
This discussion shows that the parallel
aggregation pattern is a good example of why changes to your algorithm
are often needed when you move from a sequential to a parallel approach.
To make this point clear,
here's an example of what parallel aggregation would look like if you
simply added locks to the existing sequential algorithm. You just need
to convert the sequential for loop to Parallel.For and add one lock statement.
Note:
You can’t just add locks and expect to get good performance. You also need to think about the algorithm.
// Do not copy this code. This version will run much slower
// than the sequential version. It's included here to
// illustrate what not to do.
double[] sequence = ...
object lockObject = new object();
double sum = 0.0d;
// BUG - don't do this
Parallel.For(0, sequence.Length, i =>
{
    // BUG - don't do this
    lock (lockObject)
    {
        sum += sequence[i];
    }
});
return sum;
If you forget to add the lock
statement, this code fails to calculate the correct sum on a multicore
computer. Adding the lock statement makes this code example correct with
respect to synchronization; if you run it, it produces the expected sum.
However, it fails completely as an optimization. This code is many times
slower than the sequential version it attempted to optimize! The reason is
the cost of synchronization: because every iteration acquires the lock, the
additions still execute one at a time, and each one now carries the added
overhead of taking a lock and of contention between cores.
In contrast, the examples of the parallel
aggregation pattern that you have seen elsewhere in this article will
run much faster on multicore computers than the sequential equivalent,
and their performance also improves in approximate proportion to the
number of cores.
It might at first seem
counterintuitive that adding extra steps can make an algorithm run
faster, but it's true. If you introduce extra work, and that work
prevents data dependencies between parallel tasks, you
often benefit in terms of performance.