If you compare the sequential and parallel
versions of the aggregation pattern, you see that the design of the
parallel version includes an additional step in the algorithm that
merges partial results. Figure 1, which illustrates the Parallel.ForEach and
Parallel.For methods, shows that instead of placing the accumulation in a
single, shared result, the parallel loop uses unshared local storage for
partial results (these are named subtotals in Figure 1).
Each task processes a single partition of the input values. The number
of partitions depends on the degree of parallelism that’s needed to
efficiently use the computer’s available cores. After a task finishes
accumulating the values in its assigned partition, it merges its local
result into the final, global result. Because the final result is shared by
all tasks, locks are required to ensure that updates to it are consistent.
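As a concrete illustration, here is a minimal sketch of this pattern using the overload of Parallel.For that supports task-local state. It assumes the input is a double[] named sequence; as in the example later in this article, the array's initialization is elided.

double[] sequence = ...
object lockObject = new object();
double sum = 0.0d;

Parallel.For(0, sequence.Length,
    // Each task starts with its own subtotal of zero
    // (the unshared local storage).
    () => 0.0d,
    // Accumulate within the task's partition; no lock is needed here.
    (i, loopState, subtotal) => subtotal + sequence[i],
    // Merge the subtotal into the shared result. This is the only place
    // a lock is taken, and it happens once per task, not once per element.
    subtotal =>
    {
        lock (lockObject)
        {
            sum += subtotal;
        }
    });
return sum;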
This approach is fast because it takes very few locks. In typical cases,
the number of elements to be processed is many times larger than the number
of tasks and partitions, so the cost of the serializing locks is amortized
over many individual accumulation operations.
The algorithm used by PLINQ differs slightly from what’s shown in Figure 1.
Aggregation in PLINQ does not require the developer to use locks.
Instead, the final merge step is expressed as a binary operator that
combines any two partial result values (that is, two of the subtotals)
into another partial result. Repeating this process on subsequent pairs
of partial results eventually converges on a final result. One of the
advantages of the PLINQ approach is that it requires less
synchronization, so it’s more scalable. The binary combinations do not
require the system to lock the final result; they can be performed in
parallel. (This is possible because the binary combination operation is
associative, so the way the partial results are grouped into pairs doesn't
affect the final result.)
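To make the combine step concrete, here is a sketch of the same sum written with the PLINQ Aggregate overload that takes an explicit combine function, again assuming a double[] named sequence:

double[] sequence = ...
return sequence.AsParallel().Aggregate(
    // Seed: each partition starts its subtotal at zero.
    0.0d,
    // Update function: accumulate elements within a partition.
    (subtotal, item) => subtotal + item,
    // Combine function: the binary operator that merges two partial
    // results into another partial result; no shared total, no lock.
    (subtotal1, subtotal2) => subtotal1 + subtotal2,
    // Result selector: produce the final value from the merged subtotal.
    total => total);

For a simple sum you would normally just write sequence.AsParallel().Sum(); the explicit Aggregate form is shown here only to make the binary combine step visible.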
This discussion shows that the parallel
aggregation pattern is a good example of why changes to your algorithm
are often needed when you move from a sequential to a parallel approach.
To make this point clear,
here's an example of what parallel aggregation would look like if you
simply added locks to the existing sequential algorithm. You just need
to convert the sequential for loop to Parallel.For and add one lock statement.
Note:
You can’t just add locks and expect to get good performance. You also need to think about the algorithm.
// Do not copy this code. This version will run much slower
// than the sequential version. It's included here to
// illustrate what not to do.
double[] sequence = ...
object lockObject = new object();
double sum = 0.0d;
// BUG - don't do this
Parallel.For(0, sequence.Length, i =>
{
    // BUG - don't do this
    lock (lockObject)
    {
        sum += sequence[i];
    }
});
return sum;
If you forget to add the lock
statement, this code fails to calculate the correct sum on a multicore
computer. Adding the lock statement makes this code example correct with
respect to synchronization; if you run it, it produces the expected sum.
However, it fails completely as an optimization. This code is many times
slower than the sequential version it attempted to optimize! The reason is
the cost of synchronization: because every iteration acquires the lock, the
additions still execute one at a time, and each one now carries the added
overhead of taking a lock and of contention between cores.
In contrast, the examples of the parallel
aggregation pattern that you have seen elsewhere in this article will
run much faster on multicore computers than the sequential equivalent,
and their performance also improves in approximate proportion to the
number of cores.
It might at first seem
counterintuitive that adding extra steps can make an algorithm run
faster, but it's true. If you introduce extra work, and that work
prevents data dependencies between parallel tasks, you
often benefit in terms of performance.