Programming Microsoft SQL Server 2005: Using Data Mining Extensions (part 1) - Data Mining Modeling Using DMX

1/16/2011 9:13:57 AM

We will see how to use SQL Data Mining Extensions (DMX) to do those same tasks. DMX is an extension to the SQL language for creating and working with data mining models in SSAS.

As with conventional SQL, DMX has two kinds of statements: data definition language (DDL) statements and data manipulation language (DML) statements. DDL statements are used to create mining structures and models. DML statements are used to train, browse, and predict using mining models.

It is true that you can avoid writing DDL queries in DMX by using the SSAS designers. It is also true that DML prediction queries in DMX can be autogenerated using those tools. But as with relational databases, it is better not to completely rely on tools when a powerful, declarative language is available to perform these same tasks. We will now cover writing DMX code to add a mining model to an existing mining structure, create and train sequence clustering and association rules models, and perform prediction queries against trained mining models.

Data Mining Modeling Using DMX

DMX can be used to create, modify, and train the same mining structures and models as those created using the Data Mining Wizard and modified using the Data Mining Designer:

1.	In SQL Server Management Studio, show the Object Explorer window if is not visible. If your SSAS server does not already appear in Object Explorer, add a connection to it using the Connect drop-down list or the corresponding toolbar button.
2.	Drill down on the Databases node and then the node for Chapter20, our sample database. Notice that Object Explorer provides a dedicated node for mining structures. Drill down on the Mining Structures node and verify that you can see a node for CustomerProfitCategory, our mining structure. Drill down on this node and then drill down farther on its child Mining Models node and verify that nodes appear for each of the four models we created previously.
3.	Click the Analysis Services DMX Query button on Management Studio’s SQL Server Analysis Services Editors toolbar (the fourth button from the left). After you work through the Connect to Analysis Services dialog box, a special query window will open in which you can execute DMX statements. Much like the Multidimensional Expressions (MDX) query window, which in its Tools pane lists cubes, dimensions, measures, and MDX functions, the DMX query window has a Tools pane containing a drop-down list of all models in the database (for all mining structures), a Metadata tab that lists all columns in the currently selected model, and a Functions tab that lists DMX functions. In addition, Template Explorer offers a number of DMX query templates. (Template Explorer will be discussed in detail later.)
4.	For this article, we will review the code in several DMX scripts, contained in the Chapter20DMX project, belonging to Management Studio’s sample code solution Chapter20.smsssln. Double-click the Chapter20.smsssln file from Windows Explorer. Doing so will open the sample solution in SQL Server Management Studio. Next, open the first five script files in the project (the five top-most scripts in Solution Explorer under the Queries node within the Chapter20DMX project). Your Management Studio environment should appear similar to that shown in Figure 1. Figure 1. SQL Server Management Studio with a DMX query window open

Using DMX, you can either create a mining structure and then add a mining model to it, or you can create a mining structure and a model in a single step. The scripts CreateMiningStructure.dmx, AddModelWithDrillthrough.dmx, and MapTrain.dmx, respectively, show how to create a mining structure, add a mining model to the structure, and process the model. The scripts CreateStructureAndModel.dmx and MapTrainOneStep.dmx show how to create a mining structure and model in one step and then process the structure and model.

Why would you bother running three scripts when you could do the same work using two? To begin with, you might want to create a mining structure that supports many models, where each model will use only a subset of the columns in the structure. The columns defined in the mining structure form the domain of columns from which you can choose to build your individual mining models. But creating a mining structure and model in one step forces the structure and the model to share a set of columns.

Another reason for using separate DMX statements for creating mining structures and models is simply to get a deeper understanding of how SSAS thinks of mining structures and models.

The simplest form of the CREATE MINING STRUCTURE statement is as follows:

CREATE MINING STRUCTURE <structure>
(
    <column name> <data type> <content type>
)

You must specify a name for the mining structure and define the columns for the structure. For each column in the mining structure, you must specify a column name, the data type of the column (for example, Text or Long), and the content type of the column (Key, Continuous, or Discrete). An example of a simple CREATE MINING STRUCTURE statement is given in the CreateMiningStructure.dmx script. Open it and follow these steps:

1.	Click the Execute toolbar button or shortcut menu command, choose Execute from the Query menu, or press F5 or CTRL+E. You should see a message saying that the execution is complete, and the status bar in the lower-left corner of the query window should indicate that the query executed successfully.
2.	Go to the Chapter20 database node in Object Browser and refresh its Mining Structures node by using the Refresh command on its shortcut menu. You should see the CarOwner_DMX structure you just created. Expand the structure, and you will see a Mining Models node indented beneath it. If you expand that node, you will discover that it is empty; this is because the mining structure does not have any mining models associated with it yet. To add a mining model to an existing mining structure, you must use an ALTER MINING STRUCTURE statement with an ADD MINING MODEL clause. Let’s take a look at a simple form of the statement: ALTER MINING STRUCTURE <structure> ADD MINING MODEL <model>( <column name> [<prediction>] ) USING <algorithm> [(<parameter list>)] [WITH DRILLTHROUGH] The script AddModelWithDrillthrough.dmx gives a specific example of an ALTER MINING STRUCTURE statement: ALTER MINING STRUCTURE CarOwner_DMX ADD MINING MODEL CarOwner_DT( [Customer Key] ,[Income Group] ,[Gender] ,[Marital Status] ,[Is Car Owner] PREDICT ,[Region] ) USING Microsoft_Decision_Trees WITH DRILLTHROUGH This example adds a Decision Trees model for predicting car ownership to the mining structure. Note that the column list for the mining structure specifies the data type and content type for the columns. The data type and content type of columns are common among all the mining models that are part of the same mining structure. In contrast, column usage is model specific. Column usage—whether the column is an input column or predictable column—is specified in the mining model column list. By default, the columns are input columns; therefore, you need to identify only the columns that are predictable. In the example, the [Is Car Owner] column is marked as a predictable column because we want to use the Decision Trees algorithm to predict car ownership. Let’s continue the example.
3.	Execute the query.
4.	Refresh and expand the Mining Models node below the CarOwner_DMX mining structure node. You should see the CarOwner_DT model. Right-click the model and choose Browse. You will receive an error message saying that the model must be processed. Right-click the model in Object Browser and choose Process.
5.	In the Process Mining Model dialog box, click OK. You will see an error message similar to that shown in Figure 2. Figure 2. The error message that appears when you try to process a mining structure for which column bindings have not been specified This error message appears because SSAS does not know the source data for the mining structure. So far, all we have done is specify some column names in a mining structure and model—there has been no mention of the source of the data for the structure or model. Binding the columns of a mining structure to data is done using an INSERT INTO DMX statement. Here is a simple form of the statement: INSERT INTO MINING STRUCTURE <structure> ( <mining structure columns> ) OPENQUERY(<named datasource>, <query syntax>) To run your own INSERT INTO query, close the error message and the Process Mining Model dialog box. Close the empty viewer window and open the DMX script file MapTrain.dmx by double-clicking its node in Solution Explorer. An example of an INSERT INTO statement to bind data columns to the columns is contained in the script: INSERT INTO CarOwner_DMX( [Customer Key] ,[Income Group] ,[Gender] ,[Marital Status] ,[Is Car Owner] ,[Region] ) OPENQUERY([Adventure Works DW], 'SELECT CustomerKey ,IncomeGroup ,Gender ,MaritalStatus ,IsCarOwner ,Region FROM AdventureWorksDW.dbo.vCustomerProfitability' ) The column list consists of the columns that were specified in the CREATE MINING STRUCTURE statement. The named data source was defined in the SSAS database—in our case, [Adventure Works DW]. Our data source is a SQL Server database, so the code within the OPENQUERY expression is actually a Transact-SQL (T-SQL) statement. To continue with the example:
6.	Execute the script. The statement binds the mining structure columns and processes the mining model.
7.	Go back to the Object Browser window, right-click the CarOwner_DT model’s node, and choose Browse. This brings up the Tree Viewer in a new document window. The Tree Viewer is identical to the one you used in Visual Studio.
8.	Right-click a node of the decision tree and choose Drill Through. This brings back the rows from vCustomerProfitability that belong to the node that was right-clicked. (The drill-through feature was enabled by the WITH DRILLTHROUGH clause that we specified in the script AddModelWithDrillthrough.dmx.)
9.	Close the viewer window and any open DMX Query windows.

Moving on, let’s explore how to create a mining structure and model in one step. We do so by specifying the column name, data type, content type, and usage in the column list:

CREATE MINING MODEL <model>(
  <column name> <data type> <Content Type> [<prediction>]
)
USING <algorithm> [(<parameter list>)] [WITH DRILLTHROUGH]

The DMX code in CreateStructureAndModel.dmx is an example of this statement:

CREATE MINING MODEL CarOwner_DT_OneStep(
  [Customer Key] LONG KEY
  ,[Income Group] TEXT DISCRETE
  ,[Gender] TEXT DISCRETE
  ,[Marital Status] TEXT DISCRETE
  ,[Is Car Owner] TEXT DISCRETE PREDICT
  ,[Region] TEXT DISCRETE
) USING Microsoft_Decision_Trees WITH DRILLTHROUGH

Take these steps next:

1.	Execute the query and then refresh the Mining Structures node in the Object Explorer window. You should see a new node for a structure with the name CarOwner_DT_OneStep_Structure. SSAS appends the Structure suffix to the name specified in the CREATE MINING MODEL statement in order to distinguish the mining structure from the mining model.
2.	Expand the structure’s node and then its child mining models node. You should see the CarOwner_DT_OneStep model; this model and its parent structure were created by the single DMX statement.
3.	Open and execute the script MapTrainOneStep.dmx to process the CarOwner_DMX model. Browse it if you want, and then close the viewer and all open DMX Query windows.

Here’s the DMX code from the MapTrainOneStep.dmx script:

INSERT INTO CarOwner_DT_OneStep(
  [Customer Key]
  ,[Income Group]
  ,[Gender]
  ,[Marital Status]
  ,[Is Car Owner]
  ,[Region]
)
OPENQUERY([Adventure Works DW],
  'SELECT
    CustomerKey
    ,IncomeGroup
    ,Gender
    ,MaritalStatus
    ,IsCarOwner
    ,Region
   FROM AdventureWorksDW.dbo.vCustomerProfitability'
)

This query is similar to the query in MapTrain.dmx, which was used to process the CarOwner_DMX mining structure. In MapTrain.dmx, a mining structure was specified in the INSERT INTO statement; the DMX code in MapTrainOneStep.dmx specifies a mining model. Specifying a mining model in the INSERT INTO DMX statement processes that specific mining model. Specifying a mining structure in the statement processes the mining structure and all of the models in that structure. The mining structure we created has only one mining model, so the result is the same regardless of whether we specify the structure or the model.

DMX statements for creating and processing mining models with nested tables are slightly more complex than for models without nested tables. As mentioned in the previous section, SSAS views a nested table as a column; it is a column with a “data type” of table. The concept of a nested table being a “column” is confusing. The concept becomes more clear, however, when you consider DMX statements that create mining structures and models that, in turn, use nested tables. Here is the syntax for adding a mining model that uses a nested table to an existing mining structure:

ALTER MINING STRUCTURE <structure>
ADD MINING MODEL <model>(
  <column name>  [<prediction>],
  <table name> [<prediction>] (
    <column name>  [<prediction>]
  )
)
USING <algorithm> [(<parameter list>)] [WITH DRILLTHROUGH]

An example of adding a mining model with a nested table is shown in the CreateModelWithNestedTable.dmx script file:

ALTER MINING STRUCTURE CustomerProfitCategory
ADD MINING MODEL ProductBuyer_DT(
  [Customer Key]
  ,[Age]
  ,[Commute Distance]
  ,[English Education]
  ,[Gender]
  ,[Income Group]
  ,[Is Car Owner]
  ,[Region]
  ,[v Customer Purchases] PREDICT
    (
      [Product Category]
    )
) USING Microsoft_Decision_Trees

CreateModelWithNestedTable.dmx adds a decision tree model—for predicting the category of products that customers buy—to the CustomerProfitCategory mining structure .

1.	Execute CreateModelWithNestedTable.dmx and then check that the ProductBuyer_DT model was added to the mining structure in Object Browser.
2.	Right-click the model and choose Browse. You will receive an error message saying that you cannot browse the model until it has been processed.
3.	To fix the problem, execute the DMX code in the script file TrainPreMapped.dmx. This query provides an example of using the INSERT INTO statement to process a mining model when the column bindings for the structure have already been defined: INSERT INTO ProductBuyer_DT The CustomerProfitCategory structure was defined using the Data Mining designer. When you create a structure using this designer, you select columns from your data source view for the structure; the mining structure columns are bound to the source data at that time. When the mining structure columns are already bound to the source data, the DMX statement to process the model is very simple: INSERT INTO <model>
4.	After you execute TrainPreMapped.dmx, go back to Object Browser. Right-click the ProductBuyer_DT model in the CustomerProfitCategory structure, and choose Browse. This time the Tree Viewer appears with no error messages.
5.	Use the viewer to explore the model. The model predicts whether customers are bike buyers, accessory buyers, or clothing buyers. Figure 3 shows the dependency network tab for the model. Figure 3. Browsing a decision tree mining model using the Tree Viewer in Management Studio

The query in the file CreateStructureWithNestedTable.dmx creates a mining structure and model that uses the Association Rules algorithm to predict the bundles of products customers have purchased:

CREATE MINING MODEL PurchaseAssociations(
  [Customer Key] LONG KEY
  ,[Customer Purchases] TABLE PREDICT
    (
      [Product Subcategory] TEXT KEY
     )
) USING Microsoft_Association_Rules

The Association Rules algorithm is used to find combinations of items that frequently occur together. The most common application of the Association Rules algorithm is in market-basket analysis. By understanding the bundles of products that customers have purchased, you can make recommendations for future purchases.

The script file TrainMapStructureWithNestedTable.dmx contains code that shows how to bind the columns of a mining structure that has a nested table to source data. The source query that specifies the columns from the source data to map to the mining structure columns uses the SHAPE command. At its simplest, the SHAPE command takes the following form:

SHAPE {<parent table query>}
APPEND ({ <nested table query> }
     RELATE <master column> TO <child column>)
          AS <nested table column name>

The SHAPE command, originally added to SQL to support ActiveX Data Objects (ADO) hierarchical (or “shaped”) RecordSet objects, lets you combine data from multiple SELECT statements into a single hierarchical result set (a result set with an embedded table).

More Info

For detailed information on the SHAPE command and the MSDataShape OLE DB provider, see Microsoft Knowledge Base article Q189657 at http://support.microsoft.com/kb/q189657.

A result set with an embedded table is precisely what a mining structure with a nested table column requires. Use of the SHAPE command with DMX requires that the parent table query and nested table query each be ordered by the column that relates the parent table to the nested table. For the query in TrainMapStructureWithNestedTable.dmx, this is the CustomerKey column. The query combines the use of the DMX INSERT INTO command and the SQL SHAPE command to bind the columns in a mining structure with a nested table.

Here is the generalized form for such queries:

INSERT INTO MINING STRUCTURE <structure> (
  <mining structure column> | [SKIP],
  <nested table> (
    <nested table column> | [SKIP]
  )
)
SHAPE {<parent table query>}
APPEND ({ <child table query> }
     RELATE <parent column> TO <child column>)
          AS <nested table column name>

The actual code from TrainMapStructureWithNestedTable.dmx is as follows:

INSERT INTO PurchaseAssociations(
  [Customer Key]
  ,[Customer Purchases]
    (
      SKIP
      ,[Product Subcategory]
    )
)
SHAPE{
  OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey
     FROM AdventureWorksDW.dbo.vCustomerProfitability
     ORDER BY CustomerKey')
  }
APPEND ({
  OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, ProductSubcategory
     FROM AdventureWorksDW.dbo.vCustomerPurchases
     ORDER BY CustomerKey'
)}
RELATE CustomerKey TO CustomerKey) AS CustomerPurchases

The mining structure columns specified in the INSERT INTO clause of TrainMapStructureWithNestedTable.dmx are the structure’s column names. The column list after the INSERT INTO clause uses the column names with the spaces—these are the names that were specified in the CREATE MINING MODEL statement. (Note that the ordinal position of the columns in the list corresponds to the columns returned by the source data query, which in this case is the result set returned by the SHAPE command and its APPEND and RELATE clauses.) If the source data query returns columns that are not required by the mining model, you must use the SKIP keyword to indicate that the column should be skipped.

The query in TrainMapStructureWithNestedTable.dmx skips the CustomerKey column from the child table query. You are required to include the column that relates the child (nested) table to the parent table in the parent and child table queries. Both the parent table query and the child table queries in TrainMapStructureWithNestedTable.dmx select the CustomerKey column; however, the [Customer Key] column appears only once in the mining structure. Therefore, you must use the SKIP keyword to skip the CustomerKey column for the nested table.

Execute CreateStructureWithNestedTable.dmx followed by TrainMapStructureWithNestedTable.dmx, and explore the PurchaseAssociation model by right-clicking its node in the Object Browser window and choosing Browse.

The queries in the script files CreateModelCustomClusterCount.dmx and MapTrainSkipColumns.dmx together provide another example of using DMX to create a mining model with a nested table and then binding its structure to the source data. CreateModelCustomClusterCount.dmx creates a mining structure with a clustering sequence mining model:

CREATE MINING MODEL PurchaseSequence(
  [Customer Key] LONG KEY
  ,[Profit] LONG CONTINUOUS
  ,[Income Group] TEXT DISCRETE
  ,[Customer Purchases] TABLE
    (
      [Purchase Sequence] LONG KEY SEQUENCE
      ,[Product Subcategory] TEXT DISCRETE PREDICT
     )
) USING Microsoft_Sequence_Clustering (Cluster_Count = 0)

In the Clustering Sequence algorithm, not only are the combinations of items frequently occurring together of interest, but the order in which they occur is of interest. In our example, the model clusters customers according to the sequence of the product subcategory of their purchases. The model sets the Cluster_Count parameter of the algorithm to 0. This has the effect of forcing SSAS to determine the optimal number of clusters to create. By default, the algorithm creates 10 clusters. Execute the queries in CreateModelCustomClusterCount.dmx and MapTrainSkipColumns.dmx, in order, and then browse the model using the Cluster Viewer.

Programming Microsoft SQL Server 2005: Using Data Mining Extensions (part 2) - Data Mining Predictions Using DMX

Other

SQL Server 2008 : Working with DML Queries - Using the UPDATE Statement (part 2)

SQL Server 2008 : Working with DML Queries - Using the UPDATE Statement (part 1)

Azure Programming Considerations

Programming with SQL Azure : Record Navigation in WCF Data Services

Defensive Database Programming with SQL Server : TRY...CATCH Gotchas (part 2) - TRY...CATCH blocks cannot catch all errors

Defensive Database Programming with SQL Server : TRY...CATCH Gotchas (part 1) - Re-throwing errors

SQL Server 2008 : Working with DML Queries - Using the INSERT Statement (part 2)

SQL Server 2008 : Working with DML Queries - Using the INSERT Statement (part 1) - Using the INSERT Statement with the VALUES Clause

Programming Microsoft SQL Server 2005: Using the Data Mining Wizard and Data Mining Designer (part 7) - Nested Tables

Programming Microsoft SQL Server 2005: Using the Data Mining Wizard and Data Mining Designer (part 6) - Validating and Comparing Mining Models

Programming Microsoft SQL Server 2005: Using Data Mining Extensions (part 1) - Data Mining Modeling Using DMX

Data Mining Modeling Using DMX

Figure 1. SQL Server Management Studio with a DMX query window open

Figure 2. The error message that appears when you try to process a mining structure for which column bindings have not been specified

Figure 3. Browsing a decision tree mining model using the Tree Viewer in Management Studio