Parallel Programming with Microsoft Visual Studio 2010 : Using the MapReduce Pattern (part 2)

11/21/2013 7:43:25 PM

A Word Count Example

There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.

– William Shakespeare

Shakespeare is timeless. He also tends to use many of the same words in his various works. This makes Shakespeare ideal for a word count example. In addition, this section will provide a more complete demonstration of using the MapReduce class.

This example uses four common Shakespearean sonnets, which you can find in many places online. The goal is to count the instances of every word across the four sonnets. Small words, such as a, be, and we, would clutter the results, so the example excludes them. An overload of the MapReduce.Map method supports this with a Filter parameter, which is a function delegate. The Filter method accepts a key-value pair. If the method returns true, the entry is added to the intermediate collection. If it returns false, the entry is omitted.
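The filtering behavior described above can be sketched without the MapReduce class itself. The following is a minimal illustration only: ApplyFilter is a hypothetical helper that stands in for the filtering step, and the sample pairs are made up. It is not the implementation of MapReduce.Map.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterSketch
{
    // Hypothetical stand-in for the filtering step: keeps the pairs for which
    // the filter delegate returns true and omits the rest.
    public static List<Tuple<string, int>> ApplyFilter(
        IEnumerable<Tuple<string, int>> pairs,
        Func<string, int, bool> filter)
    {
        return pairs.Where(p => filter(p.Item1, p.Item2)).ToList();
    }

    public static void Main()
    {
        // Made-up key-value pairs (word, count) for illustration.
        var pairs = new[]
        {
            Tuple.Create("be", 4),
            Tuple.Create("heaven", 1),
            Tuple.Create("we", 2),
            Tuple.Create("earth", 1)
        };

        // Same rule as the example: omit words shorter than three characters.
        var kept = ApplyFilter(pairs, (key, value) => key.Length >= 3);

        foreach (var pair in kept)
        {
            Console.WriteLine("{0} {1}", pair.Item1, pair.Item2);
        }
    }
}
```

With this sample input, only the entries for heaven and earth remain, because be and we are shorter than three characters.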

The source collection consists of the name and location of each of the four sonnets and is used to initialize an instance of the MapReduce class.

Tuple<string, string>[] sonnets = new Tuple<string, string>[] {
new Tuple<string, string>("Sonnet 1.txt",@"C:\shakespeare"),
new Tuple<string, string>("Sonnet 2.txt",@"C:\shakespeare"),
new Tuple<string, string>("Sonnet 3.txt",@"C:\shakespeare"),
new Tuple<string, string>("Sonnet 4.txt",@"C:\shakespeare") };
MapReduce<string, string> wordCount = new MapReduce<string, string>(sonnets);

The MapReduce.Map method will map the file names to a word count. The mapping function performs the following steps:

Read the text from the sonnets.
Define word delimiters.
Create a Dictionary object. For each word, check whether the word is in the dictionary. If not, add the word to the dictionary and set the count to 1. Otherwise, increment the count of the existing entry. When the process completes, return the values portion of the dictionary object as the intermediate collection. The intermediate collection will have the individual count per word for each file.
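The counting steps above can be tried in isolation on an in-memory string before reading any files. This is a simplified sketch under stated assumptions: CountWords is a hypothetical helper, it stores plain int counts rather than tuples, and it uses a shorter delimiter list than the full example.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountSketch
{
    // Hypothetical helper: counts each word in one block of text.
    public static Dictionary<string, int> CountWords(string text)
    {
        // Word delimiters (a shortened list, assumed for this sketch).
        char[] delimiters = { ' ', '.', ',', ';', ':', '!', '\n', '\r' };

        // RemoveEmptyEntries skips the empty strings that adjacent
        // delimiters would otherwise produce.
        string[] words = text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

        var counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            int current;
            if (counts.TryGetValue(word, out current))
            {
                // The word is already in the dictionary: increment its count.
                counts[word] = current + 1;
            }
            else
            {
                // First occurrence: add the word with a count of 1.
                counts.Add(word, 1);
            }
        }
        return counts;
    }

    public static void Main()
    {
        var counts = CountWords("To be, or not to be: that is the question.");
        foreach (var entry in counts)
        {
            Console.WriteLine("{0} {1}", entry.Key, entry.Value);
        }
    }
}
```

The dictionary uses the default string comparer, so To and to are counted as separate words. Passing StringComparer.OrdinalIgnoreCase to the Dictionary constructor would merge them, if that is the behavior you want.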

Here is the code for the word count example.

IEnumerable<Tuple<string, int>> wordCollection;
wordCount.Map<string, int>((input) =>
{
   // Item1 is the file name and Item2 is the folder that contains it.
   string data;
   using (StreamReader reader =
      new StreamReader(Path.Combine(input.Item2, input.Item1)))
   {
      data = reader.ReadToEnd();
   }
   // Split the text into words on whitespace and punctuation.
   string[] words = data.Split(new[] { ' ', '.', ',', ';', ':', '=', '+', '-', '*',
      ')', '(', '!', '#', '$', '\n', '\r' });
   Dictionary<string, Tuple<string, int>> rawCount =
      new Dictionary<string, Tuple<string, int>>();
   foreach (var word in words)
   {
      Tuple<string, int> value;
      if (rawCount.TryGetValue(word, out value))
      {
         // The word is already present: increment its count.
         rawCount[word] = new Tuple<string, int>(word, value.Item2 + 1);
      }
      else
      {
         // First occurrence of the word in this file.
         rawCount.Add(word, new Tuple<string, int>(word, 1));
      }
   }
   return rawCount.Values;
},

The Filter function follows the mapping function as the next argument to Map. To reduce clutter, words shorter than three characters are excluded from the intermediate collection.

(key, value) =>
{
   // Keep only words of three or more characters.
   return key.Length >= 3;
},
out wordCollection);

The MapReduce.Reduce method is simple. It reduces each key grouping to a single total, which is the aggregate count of that word across the four files.

IEnumerable<Tuple<string, int>> reduction = wordCount.Reduce(
   wordCollection,
   (key, values) =>
   {
      return values.Sum();
   }
);
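To see what the reduction does to the intermediate collection, the grouping and summing can be approximated with LINQ. This is a sequential sketch only: ReduceCounts is a hypothetical stand-in, the sample data is made up, and the code assumes that the reduction groups pairs by key before calling the reducer. It is not the MapReduce.Reduce implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReduceSketch
{
    // Hypothetical stand-in for the reduction step: groups the intermediate
    // pairs by key, then calls the reducer once per key with that key's values.
    public static List<Tuple<string, int>> ReduceCounts(
        IEnumerable<Tuple<string, int>> intermediate,
        Func<string, IEnumerable<int>, int> reducer)
    {
        return intermediate
            .GroupBy(pair => pair.Item1, pair => pair.Item2)
            .Select(group => Tuple.Create(group.Key, reducer(group.Key, group)))
            .ToList();
    }

    public static void Main()
    {
        // Made-up per-file counts, as a mapping step might produce them.
        var intermediate = new[]
        {
            Tuple.Create("love", 2),   // from one file
            Tuple.Create("time", 1),
            Tuple.Create("love", 3),   // from another file
            Tuple.Create("time", 4)
        };

        // Same reducer as the example: sum the counts for each word.
        var totals = ReduceCounts(intermediate, (key, values) => values.Sum());

        foreach (var item in totals)
        {
            Console.WriteLine("{0} {1}", item.Item1, item.Item2);
        }
    }
}
```

For this sample input, both love and time reduce to a total of 5.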

Lastly, you can show the results.

foreach (var item in reduction)
{
   Console.WriteLine("{0} {1}", item.Item1, item.Item2);
}

Here is the partial output from the Word Count example.
