Editing and Adding Mining Models
With your mining
structure and first mining model created, you are ready to create
additional models within the structure. On the Mining Models tab of the
data mining structure designer, you can view the definition of your
mining model, as shown in Figure 5.
You will see the columns in your mining structure configured according
to the choices you previously made for how those columns should be used
in your clustering model.
Editing a Mining Model
You can also edit the
definition of your mining model from the Mining Models tab. Profit
Category is specified as a PredictOnly column in the mining model. We
want to change the usage type from PredictOnly to Predict. To edit
column usage, take these steps:
1. On the Mining Models tab of the data mining structure designer, click the cell next to Profit Category.
2. In the drop-down list that appears, change the column usage to Predict, as shown in Figure 6.
Changing
the column usage from PredictOnly to Predict allows customer profit
category to be a factor in how the clusters are determined, and not just
a descriptive characteristic analyzed after the fact.
Adding a Mining Model
By default, the
Microsoft Clustering algorithm creates 10 clusters from your training
data. As a thoughtful data analyst, you might be concerned that users
will have trouble grasping the nuances of the differences in so many
clusters. You might therefore want to create a second clustering model
that segments your customers into only five clusters. To do this, follow
these steps:
1. Right-click anywhere on the Mining Models tab and then choose New Mining Model.
2. In the New Mining Model dialog box, type CustomerProfitCategory_CL5 in the Model Name text box and then select Microsoft Clustering from the Algorithm Name drop-down list. Click OK. A new mining model is added to the right of the existing model.
3. To change the number of clusters created from 10 to 5, right-click the CustomerProfitCategory_CL5 column header or any of its column cells and then choose Set Algorithm Parameters.
4. In the Algorithm Parameters dialog box that appears (Figure 7), you can set algorithm parameters that are specific to the selected mining model. This can be a bit perplexing at first: the designers and wizards hide much of the complexity of data mining behind their UI, whereas this dialog box cracks open the black box and lets you fine-tune your model. Click the various parameters in the parameter list and review the descriptions in the Description area below.
5. When you’re done exploring the parameters, click the cell at the intersection of the CLUSTER_COUNT row and the Value column, type 5, and then click OK.
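To get a feel for what a cluster-count setting controls, here is a minimal Python sketch. It uses plain k-means as a stand-in, which is an assumption made for illustration and not a description of how Microsoft Clustering works internally. The `kmeans` function, the `cluster_count` argument, and the randomly generated customer data are all hypothetical.

```python
import random

def kmeans(points, cluster_count, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns one cluster index per point."""
    rng = random.Random(seed)
    # Start from randomly chosen data points as the initial centers.
    centers = rng.sample(points, cluster_count)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center (squared distance).
        assignment = [
            min(range(cluster_count),
                key=lambda c: (p[0] - centers[c][0]) ** 2
                              + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Move each center to the mean of its members (keep it if empty).
        for c in range(cluster_count):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return assignment

# Hypothetical customers as (age, number of products) pairs.
data_rng = random.Random(42)
customers = [(data_rng.uniform(18, 80), data_rng.uniform(1, 10))
             for _ in range(200)]

ten = kmeans(customers, cluster_count=10)   # comparable to the default of 10
five = kmeans(customers, cluster_count=5)   # comparable to CLUSTER_COUNT = 5
```

Each customer receives exactly one cluster index in both runs; the only difference is how many segments the same customers are divided into, which is the trade-off between nuance and readability described above.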
Adding a Model That Uses a Different Algorithm
We now have two
clustering models in our mining structure. To validate the accuracy of
one model in a mining structure, it is often useful to create an
additional model using a different algorithm. We’ll add a new model
using the Decision Trees algorithm. This can be done with surprisingly
little effort:
1. Right-click anywhere on the Mining Models tab and choose New Mining Model.
2. In the New Mining Model dialog box, name your model CustomerProfitCategory_DT, select Microsoft Decision Trees as the algorithm (it should be selected by default), and then click OK. The CustomerProfitCategory_DT model appears to the right of the existing clustering models with the same columns and usage.
Changing Column Usage
Adding the model is
helpful, but it would be even better if we could use it to predict the
number of products people buy (in addition to predicting profit
category). To change the column usage for NumProductGroup in the
CustomerProfitCategory_DT mining model, click the cell in the
NumProductGroup row under CustomerProfitCategory_DT and select Predict
from the drop-down list. By setting the usage of both
the ProfitCategory and the NumProdGroup columns to Predict, we are
allowing each to be an input in the decision tree of the other. If we
had instead set the usage of NumProdGroup to PredictOnly, the
number of products purchased would not be a factor in creating the
ProfitCategory decision tree; that is, the number of products purchased
would not be considered in the splits of the ProfitCategory tree.
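The difference between Predict and PredictOnly can be sketched in a few lines of Python. This models only the rule described above and is not SSAS code. The `tree_inputs` function and the usage dictionaries are hypothetical, and treating Age as an input column of this model is an assumption made for illustration.

```python
def tree_inputs(usage):
    """Map each predictable column to the columns allowed in its tree's splits."""
    predictable = [col for col, u in usage.items()
                   if u in ("Predict", "PredictOnly")]
    # Input and Predict columns can act as inputs; PredictOnly and Ignore cannot.
    usable = [col for col, u in usage.items() if u in ("Input", "Predict")]
    # A column is never an input to its own tree.
    return {target: [col for col in usable if col != target]
            for target in predictable}

# Hypothetical usage settings for the decision tree model.
both_predict = {"Age": "Input",
                "NumProdGroup": "Predict",
                "ProfitCategory": "Predict"}
one_predict_only = {"Age": "Input",
                    "NumProdGroup": "PredictOnly",
                    "ProfitCategory": "Predict"}

print(tree_inputs(both_predict))
print(tree_inputs(one_predict_only))
```

With both columns set to Predict, each appears in the input list of the other's tree. With NumProdGroup set to PredictOnly, it still gets its own tree but drops out of the inputs for the ProfitCategory tree.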
More Info
Splits
are the branches at each node of a decision tree. Each node of a
decision tree represents a subset of the population. This subset is
characterized by the parentage of the node. Splits at each node are
determined by identifying the input characteristic by which the
distribution of the predicted variable differs most for the subset
defined at that node. When the distribution of the predicted variable at
a node does not vary significantly by any input characteristic, there
are no further splits and the branch ends. A node from which there are
no splits is referred to as a leaf.
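The idea of choosing a split can be illustrated with a small Python sketch. It uses entropy-based information gain as one common way to measure how much the distribution of the predicted variable differs by an input; this is an assumption for illustration, since the scoring method that the Decision Trees algorithm actually uses is not described here. The function names, the `min_gain` threshold, and the sample rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting rows on attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def best_split(rows, attributes, target, min_gain=1e-9):
    """Return the attribute to split on, or None if the node is a leaf."""
    gains = {a: information_gain(rows, a, target) for a in attributes}
    best = max(gains, key=gains.get)
    return best if gains[best] > min_gain else None

# Hypothetical customers at one node of the tree.
rows = [
    {"AgeGroup": "Under 30", "NumProdGroup": "1", "ProfitCategory": "Low"},
    {"AgeGroup": "Under 30", "NumProdGroup": "2+", "ProfitCategory": "High"},
    {"AgeGroup": "Age 30 through 35", "NumProdGroup": "1", "ProfitCategory": "Low"},
    {"AgeGroup": "Age 30 through 35", "NumProdGroup": "2+", "ProfitCategory": "High"},
]
inputs = ["AgeGroup", "NumProdGroup"]

root_split = best_split(rows, inputs, "ProfitCategory")
child_rows = [r for r in rows if r["NumProdGroup"] == "1"]
child_split = best_split(child_rows, inputs, "ProfitCategory")
```

In this toy data the profit category differs completely by NumProdGroup and not at all by AgeGroup, so the root splits on NumProdGroup. Within the child node every customer has the same profit category, so no input separates them further and the node is a leaf.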
Mining Models and Data Types
Let’s continue building our mining structure by adding another model, this time using the Naïve Bayes algorithm:
1. Right-click the CustomerProfitCategory_CL model, and choose New Mining Model.
2. Name your model CustomerProfitCategory_NB, select the Naïve Bayes algorithm, and click OK.
A message box
will appear explaining that the Age column will be ignored because the
Naïve Bayes algorithm does not support working with continuous columns.
Click Yes. Notice that the usage for the Age column is set to
Ignore, the usage for the NumProd column is set to Input, and the
usage for the ProfitCategory column is set to Predict. For this
model, we want to predict only ProfitCategory, so no usage
modifications are necessary.
Important
Because
we created the Naïve Bayes model by right-clicking the
CustomerProfitCategory_CL model and choosing New Mining Model, the
CustomerProfitCategory_CL usage settings were used in the new
model, making ProfitCategory the only predicted column. Had we instead
right-clicked the CustomerProfitCategory_DT model and chosen New Mining
Model, that model’s usage settings would have been used, and
NumProd, in addition to ProfitCategory, would have been a predicted
column.
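The behavior described above can be modeled with a short Python sketch. This is a conceptual illustration and not how the designer is implemented. The `new_model_usage` function and the dictionaries are hypothetical, and the assumption that Age is an Input column in both existing models is made only for the example.

```python
def new_model_usage(template_usage, content_types, supports_continuous):
    """Copy usage from the template model, ignoring unsupported columns."""
    usage = {}
    for column, setting in template_usage.items():
        if content_types[column] == "Continuous" and not supports_continuous:
            # The algorithm cannot work with this column, so it is ignored.
            usage[column] = "Ignore"
        else:
            usage[column] = setting
    return usage

# Hypothetical structure and model definitions.
content_types = {"Age": "Continuous",
                 "NumProd": "Discrete",
                 "ProfitCategory": "Discrete"}
cl_usage = {"Age": "Input", "NumProd": "Input", "ProfitCategory": "Predict"}
dt_usage = {"Age": "Input", "NumProd": "Predict", "ProfitCategory": "Predict"}

# Naïve Bayes does not support continuous columns.
nb_from_cl = new_model_usage(cl_usage, content_types, supports_continuous=False)
nb_from_dt = new_model_usage(dt_usage, content_types, supports_continuous=False)
```

Starting from the clustering model yields a single predicted column, whereas starting from the decision tree model would carry over NumProd as a second predicted column. In both cases Age is set to Ignore.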
We mentioned earlier that
age is ignored in the Naïve Bayes model because it is a continuous
variable. If age is a significant determinant of whether a customer is a
high-profit or low-profit customer, the Naïve Bayes model will appear
to perform worse than other models where age is included. Therefore, we
want to include at least some indication of age in the Naïve Bayes
model. To do that, we must add a “discretized” version of the age column
to our mining structure and include it in our Naïve Bayes model.
The Mining Models tab supports the deletion of columns from a mining structure, but to add a column to the structure, we need to go back to the Mining Structure tab.
On the Mining Structure tab, right-click the tree view and choose Add A Column.
In the Select a Column dialog box (Figure 8), select the AgeGroup column in the Source Column list and then click OK to add AgeGroup to the mining structure.
Note
The
AgeGroup column in vCustomerProfitability categorizes customers into
groups such as “Under 30,” “Age 30 through 35,” and “Age 36 through 45”
by using a CASE statement. SSAS also has a column content type called
“discretized” that can be used to categorize a continuous attribute.
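The kind of bucketing that such a CASE statement performs can be sketched in Python as follows. Only the three group labels quoted above come from the view; the `age_group` function, the use of whole-number ages, and the "Over 45" label are assumptions, because the view's remaining groups are not listed here.

```python
def age_group(age):
    """Map a whole-number age to a discrete group label."""
    if age < 30:
        return "Under 30"
    if age <= 35:
        return "Age 30 through 35"
    if age <= 45:
        return "Age 36 through 45"
    # Hypothetical catch-all; the actual groups above 45 are not shown.
    return "Over 45"

for age in (22, 30, 35, 36, 45, 60):
    print(age, age_group(age))
```

Because every age maps to exactly one label, the resulting column is discrete and can be used as an input by algorithms that do not accept continuous columns.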
Return
to the Mining Models tab. You will see that AgeGroup appears in the
mining structure, although its usage is set to Ignore in all of the
defined models. To include it in the Naïve Bayes model, click in the
cell corresponding to the AgeGroup column under the Naïve Bayes model
and change the usage to Input. Your Mining Models tab should appear as
shown in Figure 9.