2.6 One-to-One Join Performance Comparisons
Choosing
the most appropriate join technique will yield large performance gains
as collection size grows. To have some guidance on which technique
makes sense, a simple experiment is necessary (results of such an
analysis are summarized in Table 1).
As mentioned earlier, a database management system (such as SQL Server,
MySQL, and Oracle) should be used when joining large amounts of data,
and although you can join in-memory using LINQ to Objects, you should
exhaust other opportunities first if high performance on larger data
sets is a requirement.
One-to-one join performance is primarily dependent
on how many different elements will need to be looked up from the inner
sequence and how often that is necessary. The Join operator
builds a lookup list of all inner sequence keys ahead of time. If many
of these keys are then used in the outer sequence, this step aids
performance; if only a few of the inner elements will ever be
eventually looked up (because there are only a few outer records), then
this step actually harms performance. You will need to examine the
likely number of records in both the inner and outer sequences. In the
small set of test data used for this profiling example, it was evident
that five records or more was the sweet-spot for building this lookup
table in advance. Though if the number of records in the inner sequence
grows, so will the sweet-spot—you must test given your own sequence
sizes, and consider their likely growth over time.
The experimental approach used in this test was to
simply run a large number of join iterations over some sample data and
determine the total elapsed time. Each of the four different techniques
were tested for all records in an outer sequence and a single record in
the outer sequence (by adding a where clause for a single element).
To measure the elapsed time, a simple helper method
was written that takes an action delegate and measures the elapsed time
of calling that delegate a given number of times (iterations):
private long MeasureTime(Action action, int iterations)
{
System.Diagnostics.Stopwatch watch =
new System.Diagnostics.Stopwatch();
watch.Start();
for (int i = 0; i < iterations; i++)
action();
return watch.ElapsedMilliseconds;
}
This helper function is called by specifying an action inline, such as
long t = MeasureTime(delegate { q1.ToList(); }, 1000000)
In this example, q1 is our LINQ to Objects query under test. The method ToList
is called on this query to ensure that the full sequence is yielded,
meaning that every result element is processed avoiding deferred
execution making all queries appear instant.
Each Join technique has a different performance profile for the two most basic scenarios:
The
outer sequence has many records, and you want all of them—An example of
this scenario is when you are binding the full sequence of results to a
grid or other repeater control. For example, the following code
demonstrates a one-to-one join where outer collection has many records:
var q1 = from outer in orders
join inner in customers on
outer.CustomerID equals inner.CustomerID
The
outer sequence has only a few records, either by count or because of a
strict where clause (targeting a single record is the most common)—An
example of this scenario is when you are looking up a single record for
editing purposes. For example, the following code demonstrates a
one-to-one join where a single record is isolated in the outer
collection:
var q1 = from outer in orders
where outer.OrderNumber == "Order 3"
join inner in customers on
outer.CustomerID equals inner.CustomerID
The test data used for this experiment is small—up to 100 orders and 5 customers (for a full listing, see Table 2).
This is an unrealistic size of data, but it does provide a stable
platform for understanding relative performance characteristics. This
experiment is looking at how the relationship between outer and inner
sequence collection size impacts query performance. Table 3 shows the results of this experiment.
Table 3. One-to-One Join Techniques and Their Performance Comparisons Using 1M Iterations
Join Technique Iterations = 1M | Scenario 1 -Many Record Performance Outer n = 100 Inner n = 5 | Many Record Performance Outer n = 50 Inner n = 5 | Many Record Performance Outer n = 5 Inner n = 5 | Scenario 2 - Single Record Performance Outer n = 1 Inner n = 5 |
---|
Using the join (or Join) operator
Listing 2 | 21793ms (1 - best) | 11319ms | 2432ms | 1852ms (4 - worst) |
Using a subquery in the select projection
Listing 4 | 75897ms (3) | 40464ms | 4375ms | 1493ms (2) |
Using the SingleOrDefault operator in the select projection
Listing 5 | 56335ms (2) | 27412ms | 3758ms | 1320ms (1 - best) |
Using a cross join with a where filter
Listing 6 | 80829ms (4 - worst) | 40424ms | 5142ms | 1636ms (3) |
Note
The actual numbers in seconds are highly subject to
hardware and available memory; they are shown here for relative
purposes only. This experiment is to differentiate performance of the
different join techniques.
Plotting the results for 100, 50, and 1 demonstrates
how linear the performance characteristic is against outer loop count.
The join syntax always wins until the outer sequence size drops to
below five. However, when the outer-sequence count is less than five,
all syntax options are instant unless you are running one million
iterations.
From this simple experiment, some conclusions can be drawn:
Use the Join syntax for all one-to-one scenarios unless the outer sequence has less than five elements
Use the SingleOrDefault syntax when the outer-sequence size is less than five elements and the number of iterations is in the hundreds of thousands.
Never
use the cross join/where syntax for one-to-one joins. Enumerating every
combination of element versus element only to filter out all but a few
results impacts performance in a big way.
Use a relational database system for large volume join operations where possible.