A Word Count Example
There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.
– William Shakespeare
Shakespeare is timeless. He also tends to use many of the same words
in his various works. This makes Shakespeare ideal for a word count
example. In addition, this section will provide a more complete
demonstration of using the MapReduce class.
This example uses four common Shakespearean sonnets. Fortunately,
you can find these sonnets in many places online. The goal is to count
the instances of every word across the four sonnets. Small words, such
as a, be, and we, would clutter the results. For that reason, exclude
small words from the list. Fortunately, there is a function for this purpose.
An overload of the MapReduce.Map method has a Filter parameter, which is a function delegate. The Filter method accepts a key-value pair. If the method returns true, the entry is added to the intermediate collection. If it returns false, the item is omitted.
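As a rough illustration of these filter semantics, the following standalone sketch applies a predicate to key-value pairs without using the MapReduce class. FilterSketch and Apply are hypothetical names introduced only for this illustration; the actual MapReduce.Map overload may be implemented differently.

```csharp
using System;
using System.Collections.Generic;

public static class FilterSketch
{
    // Hypothetical helper (not part of the MapReduce class):
    // keeps only the pairs for which the filter returns true.
    public static List<Tuple<string, int>> Apply(
        IEnumerable<Tuple<string, int>> pairs,
        Func<string, int, bool> filter)
    {
        var kept = new List<Tuple<string, int>>();
        foreach (var pair in pairs)
        {
            if (filter(pair.Item1, pair.Item2))
            {
                // true: the entry joins the intermediate collection
                kept.Add(pair);
            }
            // false: the entry is omitted
        }
        return kept;
    }
}
```

Calling FilterSketch.Apply with the predicate (key, value) => key.Length >= 3 would keep an entry for "love" and drop an entry for "be".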
The source collection consists of the name and location of each of the four sonnets. It is used to initialize an instance of the MapReduce class.
Tuple<string, string>[] sonnets = new Tuple<string, string>[] {
    new Tuple<string, string>("Sonnet 1.txt", @"C:\shakespeare"),
    new Tuple<string, string>("Sonnet 2.txt", @"C:\shakespeare"),
    new Tuple<string, string>("Sonnet 3.txt", @"C:\shakespeare"),
    new Tuple<string, string>("Sonnet 4.txt", @"C:\shakespeare") };
MapReduce<string, string> wordCount = new MapReduce<string, string>(sonnets);
The MapReduce.Map method will map the file names to a word count.
- Read the text from the sonnets.
- Define word delimiters.
- Create a Dictionary object. For each word, check whether the word is in the dictionary. If not, add the word to the dictionary and set the count to 1. Otherwise, increment the count of the existing word. When the process completes, return the values portion of the dictionary object as the intermediate collection. The intermediate collection will have the individual count per word for each file.
Here is the code for the word count example.
IEnumerable<Tuple<string, int>> wordCollection;
wordCount.Map<string, int>((input) =>
    {
        // Read the entire sonnet; the using block closes the file afterward.
        string data;
        using (StreamReader reader = new StreamReader(Path.Combine(input.Item2, input.Item1)))
        {
            data = reader.ReadToEnd();
        }
        // Split the text on whitespace and punctuation.
        string[] words = data.Split(new[] {' ', '.', ',', ';', ':', '=', '+', '-', '*',
            ')', '(', '!', '#', '$', '\n', '\r'});
        // Key: the word. Value: a (word, count) pair for this file.
        Dictionary<string, Tuple<string, int>> rawCount =
            new Dictionary<string, Tuple<string, int>>();
        foreach (var word in words)
        {
            Tuple<string, int> value;
            if (rawCount.TryGetValue(word, out value))
            {
                // Seen before: increment the count.
                rawCount[word] = new Tuple<string, int>(word, value.Item2 + 1);
            }
            else
            {
                // First occurrence: start the count at 1.
                rawCount.Add(word, new Tuple<string, int>(word, 1));
            }
        }
        return rawCount.Values;
    },
The Filter function follows the mapping function as the next argument. To keep the results concise, words shorter than three characters are excluded from the intermediate collection.
    (key, value) =>
    {
        // Keep only words of three or more characters.
        return key.Length >= 3;
    },
    out wordCollection);
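The map and filter steps can also be tried without files or the MapReduce class. The sketch below counts the words in an in-memory string and skips short words. WordCountSketch, Count, and the minLength parameter are hypothetical names for illustration, and the delimiter list is shortened; like the code above, the comparison is case-sensitive.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountSketch
{
    // Hypothetical helper: counts each word in text that has
    // at least minLength characters.
    public static Dictionary<string, int> Count(string text, int minLength)
    {
        char[] delimiters = { ' ', '.', ',', ';', ':', '!', '?', '\n', '\r' };
        var counts = new Dictionary<string, int>();
        foreach (var word in text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries))
        {
            if (word.Length < minLength)
            {
                continue; // same rule as the Filter function
            }
            int current;
            // TryGetValue leaves current at 0 when the word is not yet present.
            counts.TryGetValue(word, out current);
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

For example, WordCountSketch.Count("love is love", 3) would return a dictionary in which "love" has a count of 2 and "is" does not appear.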
The MapReduce.Reduce method is simple. The reduction method reduces each
key grouping to a single total: the count of that word across the
four files.
IEnumerable<Tuple<string, int>> reduction = wordCount.Reduce(
    wordCollection,
    (key, values) =>
    {
        return values.Sum();
    }
);
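To see what the reduction computes, here is a standalone sketch that groups intermediate pairs by key and sums the counts with LINQ. It does not use the MapReduce class; ReduceSketch and SumByKey are hypothetical names, and the actual Reduce implementation may work differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReduceSketch
{
    // Hypothetical helper: groups (word, count) pairs by word
    // and sums the counts within each group.
    public static List<Tuple<string, int>> SumByKey(
        IEnumerable<Tuple<string, int>> pairs)
    {
        return pairs
            .GroupBy(p => p.Item1)                                  // one group per word
            .Select(g => Tuple.Create(g.Key, g.Sum(p => p.Item2)))  // total per word
            .ToList();
    }
}
```

If one sonnet contributes ("love", 2) and another contributes ("love", 3), the sketch reduces them to a single pair ("love", 5).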
Lastly, you can show the results.
foreach (var item in reduction)
{
    Console.WriteLine("{0} {1}", item.Item1, item.Item2);
}
Here is the partial output from the Word Count example.