We will see how to use SQL Data Mining Extensions
(DMX) to do those same tasks. DMX is an extension to the SQL language
for creating and working with data mining models in SSAS.
As with conventional SQL,
DMX has two kinds of statements: data definition language (DDL)
statements and data manipulation language (DML) statements. DDL
statements are used to create mining structures and models. DML
statements are used to train, browse, and predict using mining models.
It is true that you can
avoid writing DDL queries in DMX by using the SSAS designers. It is also
true that DML prediction queries in DMX can be autogenerated using those tools. But as with
relational databases, it is better not to completely rely on tools when a
powerful, declarative language is available to perform these same
tasks. We will now cover writing DMX code to add a mining model to an
existing mining structure, create and train sequence clustering and
association rules models, and perform prediction queries against trained
mining models.
Data Mining Modeling Using DMX
DMX can be used to
create, modify, and train the same mining structures and models as those
created using the Data Mining Wizard and modified using the Data Mining
Designer:
1. | In
SQL Server Management Studio, show the Object Explorer window if is not
visible. If your SSAS server does not already appear in Object
Explorer, add a connection to it using the Connect drop-down list or the
corresponding toolbar button.
|
2. | Drill
down on the Databases node and then the node for Chapter20, our sample
database. Notice that Object Explorer provides a dedicated node for
mining structures. Drill down on the Mining Structures node and verify
that you can see a node for CustomerProfitCategory, our mining
structure. Drill down on this node and then drill down farther on its
child Mining Models node and verify that nodes appear for each of the
four models we created previously.
|
3. | Click
the Analysis Services DMX Query button on Management Studio’s SQL
Server Analysis Services Editors toolbar (the fourth button from the
left).
After you work through the Connect to Analysis Services dialog
box, a special query window will open in which you can execute DMX
statements. Much like the Multidimensional Expressions (MDX) query
window, which in its Tools pane lists cubes, dimensions, measures, and
MDX functions, the DMX query window has a Tools pane containing a
drop-down list of all models in the database (for all mining
structures), a Metadata
tab that lists all columns in the currently selected model, and a
Functions tab that lists DMX functions. In addition, Template Explorer
offers a number of DMX query templates. (Template Explorer will be
discussed in detail later.)
|
4. | For
this article, we will review the code in several DMX scripts, contained
in the Chapter20DMX project, belonging to Management Studio’s sample
code solution Chapter20.smsssln. Double-click the Chapter20.smsssln file
from Windows Explorer. Doing so will open the sample solution in SQL
Server Management Studio. Next, open the first five script files in the
project (the five top-most scripts in Solution Explorer under the
Queries node within the Chapter20DMX project). Your Management Studio
environment should appear similar to that shown in Figure 1.
|
Using DMX, you can either
create a mining structure and then add a mining model to it, or you can
create a mining structure and a model in a single step. The scripts
CreateMiningStructure.dmx, AddModelWithDrillthrough.dmx, and
MapTrain.dmx, respectively, show how to create a mining structure, add a
mining model to the structure, and process the model. The scripts
CreateStructureAndModel.dmx and MapTrainOneStep.dmx show how to create a
mining structure and model in one step and then process the structure
and model.
Why would you bother
running three scripts when you could do the same work using two? To
begin with, you might want to create a mining structure that supports
many models, where each model will use only a subset of the columns in
the structure. The columns defined in the mining structure form the
domain of columns from which you can choose to build your individual
mining models. But creating a mining structure and model in one step
forces the structure and the model to share a set of columns.
Another reason for
using separate DMX statements for creating mining structures and models
is simply to get a deeper understanding of how SSAS thinks of mining
structures and models.
The simplest form of the CREATE MINING STRUCTURE statement is as follows:
CREATE MINING STRUCTURE <structure>
(
<column name> <data type> <content type>
)
You must specify a
name for the mining structure and define the columns for the structure.
For each column in the mining structure, you must specify a column
name, the data type of the column (for example, Text or Long), and the
content type of the column (Key, Continuous, or Discrete). An example of
a simple CREATE MINING STRUCTURE statement is given in the CreateMiningStructure.dmx script. Open it and follow these steps:
1. | Click
the Execute toolbar button or shortcut menu command, choose Execute
from the Query menu, or press F5 or CTRL+E. You should see a message
saying that the execution is complete, and the status bar in the
lower-left corner of the query window should indicate that the query
executed successfully.
|
2. | Go
to the Chapter20 database node in Object Browser and refresh its Mining
Structures node by using the Refresh command on its shortcut menu. You
should see the CarOwner_DMX structure you just created. Expand the
structure, and you will see a Mining Models node indented beneath it. If
you expand that node, you will discover that it is empty; this is
because the mining structure does not have any mining models associated
with it yet.
To add a mining model to an existing mining structure, you must use an ALTER MINING STRUCTURE statement with an ADD MINING MODEL clause. Let’s take a look at a simple form of the statement:
ALTER MINING STRUCTURE <structure> ADD MINING MODEL <model>( <column name> [<prediction>] ) USING <algorithm> [(<parameter list>)] [WITH DRILLTHROUGH]
The script AddModelWithDrillthrough.dmx gives a specific example of an ALTER MINING STRUCTURE statement:
ALTER MINING STRUCTURE CarOwner_DMX ADD MINING MODEL CarOwner_DT( [Customer Key] ,[Income Group] ,[Gender] ,[Marital Status] ,[Is Car Owner] PREDICT ,[Region] ) USING Microsoft_Decision_Trees WITH DRILLTHROUGH
This example adds
a Decision Trees model for predicting car ownership to the mining
structure. Note that the column list for the mining structure specifies
the data type and content type for the columns. The data type and
content type of columns are common among all the mining models that are
part of the same mining structure. In contrast, column usage is model
specific. Column usage—whether the column is an input column or
predictable column—is specified in the mining model column list. By
default, the columns are input columns; therefore, you need to identify
only the columns that are predictable. In the example, the [Is Car
Owner] column is marked as a predictable column because we want to use
the Decision Trees algorithm to predict car ownership. Let’s continue
the example.
|
3. | Execute the query.
|
4. | Refresh
and expand the Mining Models node below the CarOwner_DMX mining
structure node. You should see the CarOwner_DT model. Right-click the
model and choose Browse. You will receive an error message saying that
the model must be processed. Right-click the model in Object Browser and
choose Process.
|
5. | In the Process Mining Model dialog box, click OK. You will see an error message similar to that shown in Figure 2.
This error message appears because SSAS does not know the source
data for the mining structure. So far, all we have done is specify some
column names in a mining structure and model—there has been no mention
of the source of the data for the structure or model. Binding the
columns of a mining structure to data is done using an INSERT INTO DMX statement. Here is a simple form of the statement:
INSERT INTO MINING STRUCTURE <structure> ( <mining structure columns> ) OPENQUERY(<named datasource>, <query syntax>)
To run your own INSERT INTO
query, close the error message and the Process Mining Model dialog box.
Close the empty viewer window and open the DMX script file MapTrain.dmx
by double-clicking its node in Solution Explorer. An example of an INSERT INTO statement to bind data columns to the columns is contained in the script:
INSERT INTO CarOwner_DMX( [Customer Key] ,[Income Group] ,[Gender] ,[Marital Status] ,[Is Car Owner] ,[Region] ) OPENQUERY([Adventure Works DW], 'SELECT CustomerKey ,IncomeGroup ,Gender ,MaritalStatus ,IsCarOwner ,Region FROM AdventureWorksDW.dbo.vCustomerProfitability' )
The column list consists of the columns that were specified in the CREATE MINING STRUCTURE
statement. The named data source was defined in the SSAS database—in
our case, [Adventure Works DW]. Our data source is a SQL Server
database, so the code within the OPENQUERY expression is actually a Transact-SQL (T-SQL) statement.
To continue with the example:
|
6. | Execute the script. The statement binds the mining structure columns and processes the mining model.
|
7. | Go
back to the Object Browser window, right-click the CarOwner_DT model’s
node, and choose Browse. This brings up the Tree Viewer in a new
document window. The Tree Viewer is identical to the one you used in
Visual Studio.
|
8. | Right-click
a node of the decision tree and choose Drill Through. This brings back
the rows from vCustomerProfitability that belong to the node that was
right-clicked. (The drill-through feature was enabled by the WITH DRILLTHROUGH clause that we specified in the script AddModelWithDrillthrough.dmx.)
|
9. | Close the viewer window and any open DMX Query windows.
|
Moving on, let’s explore
how to create a mining structure and model in one step. We do so by
specifying the column name, data type, content type, and usage in the
column list:
CREATE MINING MODEL <model>(
<column name> <data type> <Content Type> [<prediction>]
)
USING <algorithm> [(<parameter list>)] [WITH DRILLTHROUGH]
The DMX code in CreateStructureAndModel.dmx is an example of this statement:
CREATE MINING MODEL CarOwner_DT_OneStep(
[Customer Key] LONG KEY
,[Income Group] TEXT DISCRETE
,[Gender] TEXT DISCRETE
,[Marital Status] TEXT DISCRETE
,[Is Car Owner] TEXT DISCRETE PREDICT
,[Region] TEXT DISCRETE
) USING Microsoft_Decision_Trees WITH DRILLTHROUGH
Take these steps next:
1. | Execute
the query and then refresh the Mining Structures node in the Object
Explorer window. You should see a new node for a structure with the name
CarOwner_DT_OneStep_Structure. SSAS appends the Structure suffix to the name specified in the CREATE MINING MODEL statement in order to distinguish the mining structure from the mining model.
|
2. | Expand
the structure’s node and then its child mining models node. You should
see the CarOwner_DT_OneStep model; this model and its parent structure
were created by the single DMX statement.
|
3. | Open
and execute the script MapTrainOneStep.dmx to process the CarOwner_DMX
model. Browse it if you want, and then close the viewer and all open DMX
Query windows.
|
Here’s the DMX code from the MapTrainOneStep.dmx script:
INSERT INTO CarOwner_DT_OneStep(
[Customer Key]
,[Income Group]
,[Gender]
,[Marital Status]
,[Is Car Owner]
,[Region]
)
OPENQUERY([Adventure Works DW],
'SELECT
CustomerKey
,IncomeGroup
,Gender
,MaritalStatus
,IsCarOwner
,Region
FROM AdventureWorksDW.dbo.vCustomerProfitability'
)
This query is similar to
the query in MapTrain.dmx, which was used to process the CarOwner_DMX
mining structure. In MapTrain.dmx, a mining structure was specified in the INSERT INTO statement; the DMX code in MapTrainOneStep.dmx specifies a mining model. Specifying a mining model in the INSERT INTO
DMX statement processes that specific mining model. Specifying a mining
structure in the statement processes the mining structure and all of
the models in that structure. The mining structure we created has only
one mining model, so the result is the same regardless of whether we
specify the structure or the model.
DMX statements for
creating and processing mining models with nested tables are slightly
more complex than for models without nested tables. As mentioned in the
previous section, SSAS views a nested table as a column; it is a column
with a “data type” of table. The concept of a nested table being a
“column” is confusing. The concept becomes more clear, however, when you
consider DMX statements that create mining structures and models that,
in turn, use nested tables. Here is the syntax for adding a mining model
that uses a nested table to an existing mining structure:
ALTER MINING STRUCTURE <structure>
ADD MINING MODEL <model>(
<column name> [<prediction>],
<table name> [<prediction>] (
<column name> [<prediction>]
)
)
USING <algorithm> [(<parameter list>)] [WITH DRILLTHROUGH]
An example of adding a mining model with a nested table is shown in the CreateModelWithNestedTable.dmx script file:
ALTER MINING STRUCTURE CustomerProfitCategory
ADD MINING MODEL ProductBuyer_DT(
[Customer Key]
,[Age]
,[Commute Distance]
,[English Education]
,[Gender]
,[Income Group]
,[Is Car Owner]
,[Region]
,[v Customer Purchases] PREDICT
(
[Product Category]
)
) USING Microsoft_Decision_Trees
CreateModelWithNestedTable.dmx
adds a decision tree model—for predicting the category of products that
customers buy—to the CustomerProfitCategory mining structure .
1. | Execute
CreateModelWithNestedTable.dmx and then check that the ProductBuyer_DT
model was added to the mining structure in Object Browser.
|
2. | Right-click
the model and choose Browse. You will receive an error message saying
that you cannot browse the model until it has been processed.
|
3. | To fix the problem, execute the DMX code in the script file TrainPreMapped.dmx. This query provides an example of using the INSERT INTO statement to process a mining model when the column bindings for the structure have already been defined:
INSERT INTO ProductBuyer_DT
The CustomerProfitCategory structure was defined using the Data
Mining designer. When you create a structure using this designer, you
select columns from your data source view for the structure; the mining
structure columns are bound to the source data at that time. When the
mining structure columns are already bound to the source data, the DMX
statement to process the model is very simple:
|
4. | After
you execute TrainPreMapped.dmx, go back to Object Browser. Right-click
the ProductBuyer_DT model in the CustomerProfitCategory structure, and
choose Browse. This time the Tree Viewer appears with no error messages.
|
5. | Use
the viewer to explore the model. The model predicts whether customers
are bike buyers, accessory buyers, or clothing buyers. Figure 3 shows the dependency network tab for the model.
|
The
query in the file CreateStructureWithNestedTable.dmx creates a mining
structure and model that uses the Association Rules algorithm to predict
the bundles of products customers have purchased:
CREATE MINING MODEL PurchaseAssociations(
[Customer Key] LONG KEY
,[Customer Purchases] TABLE PREDICT
(
[Product Subcategory] TEXT KEY
)
) USING Microsoft_Association_Rules
The Association Rules
algorithm is used to find combinations of items that frequently occur
together. The most common application of the Association Rules algorithm
is in market-basket analysis. By understanding the bundles of products
that customers have purchased, you can make recommendations for future
purchases.
The script file
TrainMapStructureWithNestedTable.dmx contains code that shows how to
bind the columns of a mining structure that has a nested table to source
data. The source query that specifies the columns from the source data
to map to the mining structure columns uses the SHAPE command. At its simplest, the SHAPE command takes the following form:
SHAPE {<parent table query>}
APPEND ({ <nested table query> }
RELATE <master column> TO <child column>)
AS <nested table column name>
The SHAPE command, originally added to SQL to support ActiveX Data Objects (ADO) hierarchical (or “shaped”) RecordSet objects, lets you combine data from multiple SELECT statements into a single hierarchical result set (a result set with an embedded table).
A result set with an embedded table is precisely what a mining structure with a nested table column requires. Use of the SHAPE
command with DMX requires that the parent table query and nested table
query each be ordered by the column that relates the parent table to the
nested table. For the query in TrainMapStructureWithNestedTable.dmx,
this is the CustomerKey column. The query combines the use of the DMX INSERT INTO command and the SQL SHAPE command to bind the columns in a mining structure with a nested table.
Here is the generalized form for such queries:
INSERT INTO MINING STRUCTURE <structure> (
<mining structure column> | [SKIP],
<nested table> (
<nested table column> | [SKIP]
)
)
SHAPE {<parent table query>}
APPEND ({ <child table query> }
RELATE <parent column> TO <child column>)
AS <nested table column name>
The actual code from TrainMapStructureWithNestedTable.dmx is as follows:
INSERT INTO PurchaseAssociations(
[Customer Key]
,[Customer Purchases]
(
SKIP
,[Product Subcategory]
)
)
SHAPE{
OPENQUERY([Adventure Works DW],
'SELECT CustomerKey
FROM AdventureWorksDW.dbo.vCustomerProfitability
ORDER BY CustomerKey')
}
APPEND ({
OPENQUERY([Adventure Works DW],
'SELECT CustomerKey, ProductSubcategory
FROM AdventureWorksDW.dbo.vCustomerPurchases
ORDER BY CustomerKey'
)}
RELATE CustomerKey TO CustomerKey) AS CustomerPurchases
The mining structure columns specified in the INSERT INTO clause of TrainMapStructureWithNestedTable.dmx are the structure’s column names. The column list after the INSERT INTO clause uses the column names with the spaces—these are the names that were specified in the CREATE MINING MODEL
statement. (Note that the ordinal position of the columns in the list
corresponds to the columns returned by the source data query, which in
this case is the result set returned by the SHAPE command and its APPEND and RELATE clauses.) If the source data query returns columns that are not required by the mining model, you must use the SKIP keyword to indicate that the column should be skipped.
The query in
TrainMapStructureWithNestedTable.dmx skips the CustomerKey column from
the child table query. You are required to include the column that
relates the child (nested) table to the parent table in the parent and
child table queries. Both the parent table query and the
child table queries in TrainMapStructureWithNestedTable.dmx select the
CustomerKey column; however, the [Customer Key] column appears only once
in the mining structure. Therefore, you must use the SKIP keyword to skip the CustomerKey column for the nested table.
Execute
CreateStructureWithNestedTable.dmx followed by
TrainMapStructureWithNestedTable.dmx, and explore the
PurchaseAssociation model by right-clicking its node in the Object
Browser window and choosing Browse.
The queries in the
script files CreateModelCustomClusterCount.dmx and
MapTrainSkipColumns.dmx together provide another example of using DMX to
create a mining model with a nested table and then binding its
structure to the source data. CreateModelCustomClusterCount.dmx creates a
mining structure with a clustering sequence mining model:
CREATE MINING MODEL PurchaseSequence(
[Customer Key] LONG KEY
,[Profit] LONG CONTINUOUS
,[Income Group] TEXT DISCRETE
,[Customer Purchases] TABLE
(
[Purchase Sequence] LONG KEY SEQUENCE
,[Product Subcategory] TEXT DISCRETE PREDICT
)
) USING Microsoft_Sequence_Clustering (Cluster_Count = 0)
In the Clustering
Sequence algorithm, not only are the combinations of items frequently
occurring together of interest, but the order in which they occur is of
interest. In our example, the model clusters customers according to the
sequence of the product subcategory of their purchases. The model sets
the Cluster_Count
parameter of the algorithm to 0. This has the effect of forcing SSAS to
determine the optimal number of clusters to create. By default, the
algorithm creates 10 clusters. Execute the queries in
CreateModelCustomClusterCount.dmx and MapTrainSkipColumns.dmx, in order,
and then browse the model using the Cluster Viewer.