LINQ to Objects : How to Join with Data in Another Sequence (part 7)

1/14/2014 3:19:24 AM

How join/into Works

The join/into keyword combination in query expression syntax maps to the GroupJoin standard query operator. The GroupJoin operator returns a sequence of inner elements with a matching key value for every outer element in a sequence. Unlike the pure join syntax, every outer-sequence element will have a represented group sequence, but those without matching inner-sequence elements will have an empty group. The GroupJoin extension method signature takes the following form:

public static IEnumerable<TResult>
     GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer // optional
)

The pseudo-code for this operator is similar to the Join operator, except the groupings, rather than the individual elements in each group, are returned. The pseudo-code for GroupJoin is

Create a grouping on the inner sequence for all elements that share the same inner-key selector function value (as determined by the innerKeySelector argument). Call this list lookup.
Iterate the outer sequence. Call each current element item.
1. Create a grouping from the key of the lookup sequence for all elements that share the same outer-key selector function result each item returns (as determined by the outerKeySelector argument, and the comparer equality comparer argument if specified).
2. Yield the result by passing each item and the grouping to the resultSelector argument.

The returned sequence will be in the order of the outer-sequence elements, although it can be changed by any combination of order by clauses as required. The order of the elements in each group will be the same as the inner sequence, and if you want to change the order of the inner-sequence elements, consider using a subquery in the select projection with specific order by clauses instead of the GroupJoin syntax.

The key selection expression for the outer and inner sequence can be any expression that evaluates to a comparable type.

Using the extension method syntax for joins is almost never required (almost never—if you need a custom equality comparer, you have no choice). To demonstrate how a join/into query expression maps to the equivalent extension method syntax code, the following code is functionally equivalent to that shown in Listing 7:

var q1a = customers
          .GroupJoin(
               orders,
               c => c.CustomerID,
               o => o.CustomerID,
               (c, o) => new
               {
                   LastName = c.LastName,
                   Orders = o
               }
         );

One-to-Many Join Using a Subquery

A subquery in a query’s select projection can achieve a one-to-many join. If targeting a single record (or a small number, say less than five outer records), this technique performs adequately and is easy to read. Listing 8 demonstrates using a subquery for a one-to-many join, and this code produces the Console output shown in Output8.

Listing 8. One-to-Many Join using a subquery in the select projection—see Output 8

var customers = Customer.SampleData();
var orders = Order.SampleData();

var q2 = from c in customers
         select new
         {
             LastName = c.LastName,
             Orders = from o in orders
                      where o.CustomerID == c.CustomerID
                      select o
         };

foreach (var customer in q2)
{
    Console.WriteLine("Last name: {0}", customer.LastName);
    foreach (var order in customer.Orders)
        Console.WriteLine(" - Order number: {0}", order.OrderNumber);
}

Output 8.

Last name: Gottshall
 - Order number: Order 1
 - Order number: Order 2
Last name: Valdes
Last name: Gauwain
Last name: Deane
 - Order number: Order 3
 - Order number: Order 4
Last name: Zeeman
 - Order number: Order 5

One-to-Many Join Using the ToLookup Operator

The ToLookup standard query operator builds an efficient lookup list based on a key selection function. The result is similar to a Hashtable, however the ToLookup operator returns a sequence for each key value rather than for a single element. Many of Microsoft’s standard query operators use ToLookup internally to perform their tasks, and it is a convenient helper method in all manner of programming problems. It was surprising in testing to find that by using this operator, a major performance gain was achieved over the other one-to-many techniques. An added advantage of creating your inner-sequence grouping first is that the result can be used multiple times across multiple queries. Listing 9 demonstrates how to achieve a one-to-many join using the ToLookup operator; this code produces the identical output to that shown in Output 9.

Listing 9. One-to-Many Join using the `ToLookup` Operator—see Output 9

var customers = Customer.SampleData();
var orders = Order.SampleData();

// build a lookup list for the inner sequence
var orders_lookup = orders.ToLookup(o => o.CustomerID);

var q3 = from c in customers
         select new
         {
             LastName = c.LastName,
             Orders = orders_lookup[c.CustomerID]
         };

foreach (var customer in q3)
{
    Console.WriteLine("Last name: {0}", customer.LastName);
    foreach (var order in customer.Orders)
        Console.WriteLine(" - Order number: {0}", order.OrderNumber);
}

One-to-Many Join Performance Comparisons

As seen in one-to-one joins, choosing the correct join technique is critical when working with large collection sizes, and the same holds true for one-to-many joins. The guidance offered here is in not a substitute for running tests on your own queries and data, however the guidance offered in Tables 4 and 5 show some critical join types to avoid. A database system is the best tool for undertaking joins over larger records sets; small data sizes work fine, but if performance is a critical requirement, exhaust all options of joining the data at the source database before using LINQ to Objects.

To derive the performance guidance for one-to-many joins, do a simple experiment of using the three different join techniques over a large number of iterations and measuring the total time. Each join technique has a different performance profile for the two most basic scenarios:

The outer sequence has many records—An example of this scenario is when you are binding the full sequence of results to a grid or other repeater control and summarizing the results of a related set of records. For example, the following code demonstrates a one-to-many join where outer collection has many records:
```
var q1 = from c in customers
         join o in orders on
            c.CustomerID equals o.CustomerID
         into cust_orders
```
The outer sequence has only a few records, either by count or because of a strict where clause (targeting a single record is the most common)—An example of this scenario is when you are looking up a single record for editing purposes. For example, the following code demonstrates a one-to-many join where a single record is isolated in the outer collection:
```
var q1 = from c in customers
         where c.CustomerID == "DEA1"
         join o in orders on
            c.CustomerID equals o.CustomerID
         into cust_orders
```

Table 5. One-to-Many Join Techniques and Their Performance Comparisons Using 1M Iterations
Join Technique Iterations = 1Million times	Many Record Performance Outer n = 100 Inner n = 5	Many Record Performance Outer n = 50 Inner n = 5	Many Record Performance Outer n = 5 Inner n = 5	Single Record Performance Outer n = 1 Inner n = 5
Using the `join`/`into` combination (or `GroupJoin`) operator Listing 7	21562ms (3 - worst), although improves as inner sequence count grows larger	12284ms	2410ms	1964ms (3 - worst)
Using a subquery in the select projection Listing 8	20179ms (2)	9953ms	1222ms	712ms (2)
Using the `ToLookup` operator Listing 9	17928ms (1 - best)	9514ms	1218ms	694ms (1 - best)

Conclusions can be drawn from these simple experiments:

Use the ToLookup syntax for all one-to-many scenarios, unless the inner query is complex and easier to achieve with a subquery.
The join/into syntax is slower than the other techniques and should only be used with small inner sequence counts (it gets superior as inner sequence size grows) or when aggregation operators are going to be used over the join result.
Use a relational database system for large volume join operations where possible.