LINQ to Objects : How to Join with Data in Another Sequence (part 5) - One-to-One Join Performance Comparisons

1/14/2014 3:12:01 AM

2.6 One-to-One Join Performance Comparisons

Choosing the most appropriate join technique will yield large performance gains as collection size grows. To have some guidance on which technique makes sense, a simple experiment is necessary (results of such an analysis are summarized in Table 1). As mentioned earlier, a database management system (such as SQL Server, MySQL, and Oracle) should be used when joining large amounts of data, and although you can join in-memory using LINQ to Objects, you should exhaust other opportunities first if high performance on larger data sets is a requirement.

One-to-one join performance is primarily dependent on how many different elements will need to be looked up from the inner sequence and how often that is necessary. The Join operator builds a lookup list of all inner sequence keys ahead of time. If many of these keys are then used in the outer sequence, this step aids performance; if only a few of the inner elements will ever be eventually looked up (because there are only a few outer records), then this step actually harms performance. You will need to examine the likely number of records in both the inner and outer sequences. In the small set of test data used for this profiling example, it was evident that five records or more was the sweet-spot for building this lookup table in advance. Though if the number of records in the inner sequence grows, so will the sweet-spot—you must test given your own sequence sizes, and consider their likely growth over time.

The experimental approach used in this test was to simply run a large number of join iterations over some sample data and determine the total elapsed time. Each of the four different techniques were tested for all records in an outer sequence and a single record in the outer sequence (by adding a where clause for a single element).

To measure the elapsed time, a simple helper method was written that takes an action delegate and measures the elapsed time of calling that delegate a given number of times (iterations):

private long MeasureTime(Action action, int iterations)
{
    System.Diagnostics.Stopwatch watch =
        new System.Diagnostics.Stopwatch();

    watch.Start();

    for (int i = 0; i < iterations; i++)
        action();

    return watch.ElapsedMilliseconds;
}

This helper function is called by specifying an action inline, such as

long t = MeasureTime(delegate { q1.ToList(); }, 1000000)

In this example, q1 is our LINQ to Objects query under test. The method ToList is called on this query to ensure that the full sequence is yielded, meaning that every result element is processed avoiding deferred execution making all queries appear instant.

Each Join technique has a different performance profile for the two most basic scenarios:

The outer sequence has many records, and you want all of them—An example of this scenario is when you are binding the full sequence of results to a grid or other repeater control. For example, the following code demonstrates a one-to-one join where outer collection has many records:
```
var q1 = from outer in orders
         join inner in customers on
            outer.CustomerID equals inner.CustomerID
```
The outer sequence has only a few records, either by count or because of a strict where clause (targeting a single record is the most common)—An example of this scenario is when you are looking up a single record for editing purposes. For example, the following code demonstrates a one-to-one join where a single record is isolated in the outer collection:
```
var q1 = from outer in orders
         where outer.OrderNumber == "Order 3"
         join inner in customers on
            outer.CustomerID equals inner.CustomerID
```

The test data used for this experiment is small—up to 100 orders and 5 customers (for a full listing, see Table 2). This is an unrealistic size of data, but it does provide a stable platform for understanding relative performance characteristics. This experiment is looking at how the relationship between outer and inner sequence collection size impacts query performance. Table 3 shows the results of this experiment.

Table 3. One-to-One Join Techniques and Their Performance Comparisons Using 1M Iterations
Join Technique Iterations = 1M	Scenario 1 -Many Record Performance Outer n = 100 Inner n = 5	Many Record Performance Outer n = 50 Inner n = 5	Many Record Performance Outer n = 5 Inner n = 5	Scenario 2 - Single Record Performance Outer n = 1 Inner n = 5
Using the `join` (or `Join`) operator Listing 2	21793ms (1 - best)	11319ms	2432ms	1852ms (4 - worst)
Using a subquery in the select projection Listing 4	75897ms (3)	40464ms	4375ms	1493ms (2)
Using the `SingleOrDefault` operator in the select projection Listing 5	56335ms (2)	27412ms	3758ms	1320ms (1 - best)
Using a cross join with a `where` filter Listing 6	80829ms (4 - worst)	40424ms	5142ms	1636ms (3)

Note

The actual numbers in seconds are highly subject to hardware and available memory; they are shown here for relative purposes only. This experiment is to differentiate performance of the different join techniques.

Plotting the results for 100, 50, and 1 demonstrates how linear the performance characteristic is against outer loop count. The join syntax always wins until the outer sequence size drops to below five. However, when the outer-sequence count is less than five, all syntax options are instant unless you are running one million iterations.

From this simple experiment, some conclusions can be drawn:

Use the Join syntax for all one-to-one scenarios unless the outer sequence has less than five elements
Use the SingleOrDefault syntax when the outer-sequence size is less than five elements and the number of iterations is in the hundreds of thousands.

Figure 1. Different one-to-one Join techniques. Time taken for 1M iterations against an outer sequence of 100, 50, and 1 element count. Join syntax is faster until around an outer-sequence size of five elements.
Never use the cross join/where syntax for one-to-one joins. Enumerating every combination of element versus element only to filter out all but a few results impacts performance in a big way.
Use a relational database system for large volume join operations where possible.