MapReduce is a well-known pattern introduced in 2004 in a
paper titled “MapReduce: Simplified Data Processing on Large Clusters”
by Jeffrey Dean and Sanjay Ghemawat. The link for the document is http://labs.google.com/papers/mapreduce-osdi04.pdf.
The MapReduce pattern is designed to handle the reduction of vast
amounts of data separated across multiple computers. However, the
pattern is applicable even on a much smaller scale, such as a modern
multicore computer. The MapReduce pattern is a complex application of
data parallelism, dependencies, and reduction.
There are three collections in the MapReduce pattern. The first
collection is the input for the MapReduce pattern. It is a collection
of key and value pairs. You perform some transformation on the input
collection to create a second, intermediate collection, which is a
non-unique collection of key and value pairs. The third collection is a
reduction of the non-unique keys from the intermediate collection.
If the word apple appeared in three of the files, there would be identical entries for apple
in the intermediate list. The word “apple” would be the key, and the
value for each key would be the number of times that word (the key)
appears in each file. For the reduction, you want to reduce non-unique
keys to totals. The following diagram illustrates this example.
Note
The MapReduce class resides in the ParallelBook namespace. When you create a MapReduce object, you initialize it with a source collection. The MapReduceMapReduce.Map,
is responsible for transforming the source collection to an
intermediate collection. The first parameter is the mapping function,
which performs the transformation. The last parameter is an out parameter, which is the intermediate collection. The second method is the MapReduce.Reduce
method, which accepts and reduces the intermediate collection—its first
parameter. The next parameter is the reduction operation. The Map and Reduce methods are exposed separately in the interface to allow multiple reductions of an intermediate collection. The last parameter is the group
operation, which groups the keys of the intermediate collection. This
is important because the intermediate collection is reduced along group
boundaries. The default is to reduce by matching keys. class has only two methods. The first,
Here is the MapReduce.Map prototype.
public void Map<KEY2, VALUE2>(Func<Tuple<KEY, VALUE>,
IEnumerable<Tuple<KEY2, VALUE2>>> mapFunc,
out IEnumerable<Tuple<KEY2, VALUE2>> TupleCollection
)
And here’s the MapReduce.Reduce prototype.
public IEnumerable<Tuple<KEY2, VALUE2>> Reduce<KEY2, VALUE2>(
IEnumerable<Tuple<KEY2, VALUE2>> intermediate,
Func<KEY2, VALUE2[], VALUE2> reduceFunc,
Func<IEnumerable<Tuple<KEY2, VALUE2>>, Dictionary<KEY2, VALUE2[]>>
groupFunc= null
)
This next exercise involves using the MapReduce
class. You will create a collection of key and value pairs. The keys
are string values, and the values are integers. The intermediate
collection simply squares the values of the input collection. The
reduction will then reduce keys by summation.
Create a MapReduce class to square the values of a source collection, and then reduce the collection by summing the keys
-
Create a console project in Visual Studio 2010 for C#. Add a using statement for the System.Threading.Tasks namespace to the source code. Add a reference to the project for MapReduce.dll.
-
In the Main method, define and initialize an array of binary tuples for string and integer pairs.
Tuple<string, int>[] tuples = new Tuple<string, int>[] {
new Tuple<string, int>("a", 3),
new Tuple<string, int>("b", 2),
new Tuple<string, int>("b", 5)
};
-
Create an instance of the MapReduce class. In the constructor, initialize the object with the tuples array.
MapReduce<string, int> letters = new MapReduce<string, int>(tuples);
-
Now you will transform the source collection. First, define a
collection of tuples to hold the intermediate results. Your mapping
operation simply squares the value of each tuple and places the results
in an out variable.
IEnumerable<Tuple<string, int>> newmap;
letters.Map<string, int>((input) =>
{
return new Tuple<string, int>[] { new Tuple<string, int>(input.Item1,
input.Item2 * input.Item2) };
}, out newmap);
-
Reduce the collection with the MapReduce.Reduce method. Provide the intermediate collection as the input. In the reduction method, sum the totals of each group.
IEnumerable<Tuple<string, int>> reduction =
letters.Reduce<string, int>(newmap, (key, values) =>
{
int total = 0;
foreach (var item in values)
{
total += item;
}
return total;
});
-
Display the results, which are returned from the MapReduce.Reduce method. The answer should be a=9 and b=29.
foreach (var item in reduction)
{
Console.WriteLine("{0} = {1}", item.Item1, item.Item2);
}
Here is the entire program.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Letters
{
class Program
{
static void Main(string[] args)
{
Tuple<string, int>[] tuples = new Tuple<string, int>[] {
new Tuple<string, int>("a", 3),
new Tuple<string, int>("b", 2),
new Tuple<string, int>("b", 5) };
MapReduce<string, int> letters = new MapReduce<string, int>(tuples);
IEnumerable<Tuple<string, int>> newmap;
letters.Map<string, int>((input) =>
{
return new Tuple<string, int>[] { new Tuple<string,
int>(input.Item1, input.Item2 * input.Item2) };
}, out newmap);
IEnumerable<Tuple<string, int>> reduction = letters.Reduce<string,
int>(newmap, (key, values) =>
{
int total = 0;
foreach (var item in values)
{
total += item;
}
return total;
});
foreach (var item in reduction)
{
Console.WriteLine("{0} = {1}", item.Item1, item.Item2);
}
Console.WriteLine("Press enter to <end>.");
Console.ReadLine();
}
}
}