Programming Microsoft SQL Server 2005: Using the Data Mining Wizard and Data Mining Designer (part 7) - Nested Tables

1/14/2011 5:43:54 PM

Nested Tables

Nested tables are perhaps one of the most confusing features of SSAS data mining. Once you understand them, however, they will likely become an integral component of your data mining toolset. Nested tables allow you to include transactional data in your mining models without changing your data structure.

For example, suppose we want to include data about the types of products customers bought in our profitability analysis. Adventure Works sells four categories of products: bikes, clothing, accessories, and components. What kinds of products do the high-profit customers buy? Are bike buyers high-profit customers? What about clothing, accessory, or component buyers? Although we could add four new columns to the case table to flag the types of products a customer bought, what if Adventure Works later decided to sell 10 categories of products? Or what if we want to know whether particular models of bikes, of which there are 50, predict profitability? Adding distinct columns to our case table for each of these values would clearly be inefficient, both in terms of storage for the case table and the ETL work needed to build it. Nested tables eliminate these inefficiencies, allowing you to leave intact the transactional data’s existing structure.

To understand this a little better in the context of the mining structure and models we’ve built so far, let’s look at the create statement for the vCustomerPurchases view shown in Listing 1.

Listing 1. A view with transaction data

CREATE VIEW [dbo].[vCustomerPurchases]
AS
SELECT
  f.SalesOrderNumber
  ,f.SalesOrderLineNumber
  ,f.CustomerKey
  ,t.FullDateAlternateKey
  ,pc.EnglishProductCategoryName AS ProductCategory
  ,psc.EnglishProductSubcategoryName AS ProductSubcategory
  ,isnull(p.[ModelName], p.[EnglishProductName]) AS [Model]
  ,f.OrderQuantity
  ,Row_Number ( ) OVER (
    Partition BY f.CustomerKey
    ORDER BY t.FullDateAlternateKey, f.SalesOrderLineNumber)
  AS PurchaseSequence
FROM [dbo].[FactInternetSales] AS f
INNER JOIN dbo.dimProduct AS p
ON f.ProductKey = p.ProductKey
INNER JOIN dbo.dimProductSubCategory AS psc
ON p.ProductSubCategoryKey = psc.ProductSubcategoryKey
INNER JOIN dbo.dimProductCategory AS pc
ON psc.ProductCategoryKey = pc.ProductCategoryKey
INNER JOIN dbo.dimTime AS t
ON f.OrderDateKey = t.TimeKey

The view is constructed with a simple SELECT statement from the fact table with a join to a few dimension tables to pick up product names and product categories. (We’ll discuss the PurchaseSequence column, which uses the new T-SQL Row_Number() function, in the next section.)

Table 1 shows customer attributes along with transactional data about customer purchases. Suppose we want to identify Melvin as an Accessory Buyer and Jennifer as both a Bike Buyer and an Accessory Buyer in our mining models without explicitly adding the attributes “Bike Buyer” and “Accessory Buyer” to our customer profile data (which in our analysis comes from vCustomerProfitability). Adding the transactional data (in our analysis, vCustomerPurchases) as a nested table to the mining structure will do the trick.

Table 1. Nested Data
					Customer Purchases
CustKey	Name	Profit	Gender	Income Group	Product Category	Product Name	Seq
11580	Melvin	Low	M	Low	Accessories	Sport-100 Helmet, Blue	1
11036	Jennifer	Low	F	Moderate	Bikes	Mountain-200 Silver, 46	1
					Accessories	Sport-100 Helmet, Black	2
11376	Lance	Low	M	Moderate	Clothing	Half-Finger Gloves, M	1
					Clothing	Short-Sleeve Classic Jersey, M	2
11136	Brianna	High	F	Moderate	Bikes	Mountain-200 Black, 38	1
					Accessories	Fender Set, Mountain	2
					Clothing	Classic Vest, M	3

To use vCustomerPurchases as nested table in our mining structure, we must first add it to our DSV:

Double-click AdventureWorks DW.dsv in Solution Explorer to bring it up in the data source view designer.

Click the Add/Remove Objects toolbar button (first from the left), right-click the designer window and choose Add/Remove Tables, or choose Add/Remove Tables from the Data Source View menu. Any of these three actions opens the Add/Remove Tables dialog box.

Locate vCustomerPurchases in the Available Objects list, move it to the Included Objects list, and then click OK.

To complete our DSV changes, we must establish the relationship between vCustomerPurchases and vCustomerProfitability. Click the CustomerKey column in vCustomerPurchases and drag it onto the CustomerKey column in vCustomerProfitability.

Save your changes. The data source view designer should appear as shown in Figure 21.

Figure 21. Transactional data on customer purchases in the DSV will be used as the source table for a nested table in the mining model.

Back in the Mining Structure designer, click the Mining Structure tab.

Locate, but do not click, the Add A Mining Structure Nested Table Column toolbar button (fourth from the left).

Note the term “nested table column.” SSAS considers the nested table to be an additional column in the mining structure—that is, it considers it to be a columntable.

that has a “data type” of

Note

In keeping with the SSAS mapping of nested tables to a special “data type” of table, note that in Table 20-1 we portrayed vCustomerPurchases as just another customer attribute (albeit a hierarchical one), along with Profit, Gender, and Income Group. We did this because we knew we would use vCustomerPurchases as a nested table column.

Click the Add A Mining Structure Nested Table Column toolbar button. This brings up the Select A Nested Table Key Column dialog box.

Determining the key of a nested table can be counterintuitive initially; your first instinct might be to select CustomerKey, SalesOrderNumber, or a combination of SalesOrderNumber and SalesOrderLineItem as the nested table key. In fact, none of these is an appropriate key for the nested table.

When you specify a column as the nested table key column, the distinct values in that column become usable attributes in your mining model. As such, the correct key for the nested table in this scenario is ProductCategory, which has the possible values “Bike,” “Clothing,” and “Accessory.”

Select vCustomerPurchases from the Nested Table list, and then select ProductCategory from the Source Column list. Click OK.

As you will see after we modify and process the mining models, selecting ProductCategory as the nested table key column adds attributes such as vCustomerPurchases(Bike) and vCustomerPurchases(Clothing) to our mining models.

Notice that the nested table has been added to the mining structure tree view as a column of the CustomerProfitCategory structure (appearing with the name “v Customer Purchases”). Unlike other columns in the mining structure, it has a plus sign next to it.

10.

Click the plus sign to expand the table. Now expand the columns of the nested table. The mining structure tab will appear as shown in Figure 22.

Figure 22. A mining structure with a nested table column

When you add new columns to an existing mining structure, they will by default be ignored by existing mining models. Therefore, you must specify the usage of the new columns in the models where you want to include them:

1.	Click the Mining Models tab and then click the plus sign next to v Customer Purchases. You will see a new row for the Product Category child column, which is also ignored by all models.
2.	Click the cell corresponding to the intersection of the v Customer Purchases column (the “parent” row) and the CustomerProfitCategory_CL model.
3.	Click the drop-down arrow in the cell to view the entries in the drop-down list, and select Input. The setting for the Product Category child column is automatically changed from Ignore to Key.
4.	Repeat the process for each of the other mining models in the structure. Afterwards, the Mining Models tab should appear as shown in Figure 23. Figure 23. Mining models in a mining structure with a nested table column

Deploy and process your mining models. This might take several minutes. When processing is complete, click the Mining Model Viewer tab.

Click the Cluster Diagram tab of the CustomerProfitCategory_CL5 model (the model with five clusters). Using the Shading Variable drop-down list (note the three variables that correspond to product category), change the shading variable to v Customer Purchases(Bikes). In the State drop-down list, the two possible states of the variable are Existing and Missing. If the Bikes key exists in the nested table, the customer is a bike buyer; if the Bikes key is missing, the customer is not a bike buyer. Select Existing.

Take note of Cluster 1 and Cluster 5, which have the highest concentration of bike buyers. Change the shading variable to Is New Customer, and select the Y state. Cluster 5 has more new customers than Cluster 1. Change the shading variable to ProfitCategory, select the High state, and note that the two clusters with the highest concentration of bike buyers also have the highest concentration of high-profit customers.

Figure 24 compares the five clusters on the Cluster Profiles tab. When the key of a nested table column is included in a mining model, the distinct values found in the key column form the basis of existence attributes that are used to train the mining model. The value of the key column represented by the existence attribute is shown in parentheses after the nested table column name. Cluster 1 and Cluster 5 have a markedly higher proportion of bike buyers than do the other clusters. The proportion of accessory and clothing buyers is similar across all clusters.

Figure 24. Viewing the results of a mining model that includes a nested table column input

Use the mining model viewers to explore other mining models. Take note of the input attributes added to the models as a result of adding the nested table to the models. Close the project and exit Visual Studio.