How join/into Works
The join/into keyword combination in query expression syntax maps to the GroupJoin standard query operator. The GroupJoin operator returns a sequence of inner elements with a matching key value for every outer element in a sequence. Unlike the pure join
syntax, every outer-sequence element will have a represented group
sequence, but those without matching inner-sequence elements will have
an empty group. The GroupJoin extension method signature takes the following form:
public static IEnumerable<TResult>
GroupJoin<TOuter, TInner, TKey, TResult>(
this IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
IEqualityComparer<TKey> comparer // optional
)
The pseudo-code for this operator is similar to the
Join operator, except the groupings, rather than the individual
elements in each group, are returned. The pseudo-code for GroupJoin is
Create
a grouping on the inner sequence for all elements that share the same
inner-key selector function value (as determined by the innerKeySelector argument). Call this list lookup.
Iterate the outer sequence. Call each current element item.
Create
a grouping from the key of the lookup sequence for all elements that
share the same outer-key selector function result each item returns (as determined by the outerKeySelector argument, and the comparer equality comparer argument if specified).
Yield the result by passing each item and the grouping to the resultSelector argument.
The returned sequence will be in the order of the outer-sequence elements, although it can be changed by any combination of order by
clauses as required. The order of the elements in each group will be
the same as the inner sequence, and if you want to change the order of
the inner-sequence elements, consider using a subquery in the select
projection with specific order by clauses instead of the GroupJoin syntax.
The key selection expression for the outer and inner sequence can be any expression that evaluates to a comparable type.
Using the extension method syntax for joins is almost never required (almost never—if you need a custom equality comparer, you have no choice). To demonstrate how a join/into
query expression maps to the equivalent extension method syntax code,
the following code is functionally equivalent to that shown in Listing 7:
var q1a = customers
.GroupJoin(
orders,
c => c.CustomerID,
o => o.CustomerID,
(c, o) => new
{
LastName = c.LastName,
Orders = o
}
);
One-to-Many Join Using a Subquery
A subquery in a query’s select projection can
achieve a one-to-many join. If targeting a single record (or a small
number, say less than five outer records), this technique performs
adequately and is easy to read. Listing 8 demonstrates using a subquery for a one-to-many join, and this code produces the Console output shown in Output8.
Listing 8. One-to-Many Join using a subquery in the select projection—see Output 8
var customers = Customer.SampleData(); var orders = Order.SampleData();
var q2 = from c in customers select new { LastName = c.LastName, Orders = from o in orders where o.CustomerID == c.CustomerID select o };
foreach (var customer in q2) { Console.WriteLine("Last name: {0}", customer.LastName); foreach (var order in customer.Orders) Console.WriteLine(" - Order number: {0}", order.OrderNumber); }
|
Output 8.
Last name: Gottshall - Order number: Order 1 - Order number: Order 2 Last name: Valdes Last name: Gauwain Last name: Deane - Order number: Order 3 - Order number: Order 4 Last name: Zeeman - Order number: Order 5
|
One-to-Many Join Using the ToLookup Operator
The ToLookup standard query operator builds an efficient lookup list based on a key selection function. The result is similar to a Hashtable, however the ToLookup
operator returns a sequence for each key value rather than for a single
element. Many of Microsoft’s standard query operators use ToLookup
internally to perform their tasks, and it is a convenient helper method
in all manner of programming problems. It was surprising in testing to
find that by using this operator, a major performance gain was achieved
over the other one-to-many techniques. An added advantage of creating
your inner-sequence grouping first is that the result can be used
multiple times across multiple queries. Listing 9 demonstrates how to achieve a one-to-many join using the ToLookup operator; this code produces the identical output to that shown in Output 9.
Listing 9. One-to-Many Join using the ToLookup Operator—see Output 9
var customers = Customer.SampleData(); var orders = Order.SampleData();
// build a lookup list for the inner sequence var orders_lookup = orders.ToLookup(o => o.CustomerID);
var q3 = from c in customers select new { LastName = c.LastName, Orders = orders_lookup[c.CustomerID] };
foreach (var customer in q3) { Console.WriteLine("Last name: {0}", customer.LastName); foreach (var order in customer.Orders) Console.WriteLine(" - Order number: {0}", order.OrderNumber); }
|
One-to-Many Join Performance Comparisons
As
seen in one-to-one joins, choosing the correct join technique is
critical when working with large collection sizes, and the same holds
true for one-to-many joins. The guidance offered here is in not a
substitute for running tests on your own queries and data, however the
guidance offered in Tables 4 and 5
show some critical join types to avoid. A database system is the best
tool for undertaking joins over larger records sets; small data sizes
work fine, but if performance is a critical requirement, exhaust all
options of joining the data at the source database before using LINQ to
Objects.
To derive the performance guidance for one-to-many
joins, do a simple experiment of using the three different join
techniques over a large number of iterations and measuring the total
time. Each join technique has a different performance profile for the two most basic scenarios:
The
outer sequence has many records—An example of this scenario is when you
are binding the full sequence of results to a grid or other repeater
control and summarizing the results of a related set of records. For
example, the following code demonstrates a one-to-many join where outer
collection has many records:
var q1 = from c in customers
join o in orders on
c.CustomerID equals o.CustomerID
into cust_orders
The outer sequence has only a few records, either by count or because of a strict where
clause (targeting a single record is the most common)—An example of
this scenario is when you are looking up a single record for editing
purposes. For example, the following code demonstrates a one-to-many
join where a single record is isolated in the outer collection:
var q1 = from c in customers
where c.CustomerID == "DEA1"
join o in orders on
c.CustomerID equals o.CustomerID
into cust_orders
Table 5. One-to-Many Join Techniques and Their Performance Comparisons Using 1M Iterations
Join Technique Iterations = 1Million times | Many Record Performance Outer n = 100 Inner n = 5 | Many Record Performance Outer n = 50 Inner n = 5 | Many Record Performance Outer n = 5 Inner n = 5 | Single Record Performance Outer n = 1 Inner n = 5 |
---|
Using the join/into combination (or GroupJoin) operator
Listing 7 | 21562ms (3 - worst), although improves as inner sequence count grows larger | 12284ms | 2410ms | 1964ms (3 - worst) |
Using a subquery in the select projection
Listing 8 | 20179ms (2) | 9953ms | 1222ms | 712ms (2) |
Using the ToLookup operator
Listing 9 | 17928ms (1 - best) | 9514ms | 1218ms | 694ms (1 - best) |
Conclusions can be drawn from these simple experiments:
Use the ToLookup syntax for all one-to-many scenarios, unless the inner query is complex and easier to achieve with a subquery.
The join/into
syntax is slower than the other techniques and should only be used with
small inner sequence counts (it gets superior as inner sequence size
grows) or when aggregation operators are going to be used over the join
result.
Use a relational database system for large volume join operations where possible.