Validating and Comparing Mining Models
We
have processed the mining models and analyzed their predictive
contents, which, though similar, are not identical. Although it is easy
to conclude that being a long-time customer is an important determinant
of customer profitability, the models differ somewhat in their
conclusions about which other customer characteristics or buying
behaviors are important indicators of profitability. Given these
differences, how can you determine which of these secondary predictions
are correct? You can do so by using the data mining structure designer’s
built-in “lift charts” and classification matrix facilities, which make
it easy to validate the relative accuracy of various mining models in
your mining structure and compare their predictive ability.
Click the Mining
Accuracy Chart tab. It has three subtabs—Column Mapping, Lift Chart, and
Classification Matrix—which you can use to compare the accuracy of your
models based on supplied test data.
We need to tell SSAS how to
map the columns from the test set (vCustomerProfitability in our
example) to the mining models. The correct mappings are shown in Figure 18.
The Column Mapping tab
has three sections. In the top section, you specify the table that has
the test data set and how the columns of the test data set will map to
the columns of your mining models. In the middle section, you can
specify filters, if any, for your test data. This is useful if you want
to test the accuracy of your mining models for a particular subset of
data. Lastly, you specify the models, predictable columns, and
predictable column values from which lift charts and classification
matrices will be created.
Starting at the top
section of the tab, verify that the CustomerProfitCategory mining
structure is selected in the Mining Structure mini-window on the left
side. If it is not, click the Select Structure button in the Mining
Structure mini-window, select the CustomerProfitCategory mining
structure, and click OK. In the Select Input Table(s) mini-window
farther to the right, click the Select Case Table button to specify the
test set. In the resulting Select Table dialog box, select
vCustomerProfitability and click OK.
Columns in the input
table will be mapped to columns in the mining structure that have the
same name. To fine-tune these mappings, right-click anywhere on the
Column Mapping tab and choose Modify Connections (or choose Modify
Connections from the Mining Model menu, which is available only when the
Column Mapping tab is active and when you have clicked it). This brings
up the Modify Connections dialog box, which allows editing of the
mappings between the mining structure and the input table. Click a cell
below the Table Column header in the dialog box to reveal a drop-down
list of all the columns in the input table. You can change a default
mapping, if necessary, by clicking a cell and selecting the desired
column from the drop-down list. The proper settings for our mining
structure and test data table are (partially) shown in Figure 19. After you verify that the mappings are correct, click OK.
More Info
The Column Mapping tab’s input filtering grid (in the center pane, below the column mappings) allows you to create a WHERE
clause for your input table. This is useful if you are interested in
testing the mining model for only a particular subset of the test data.
For example, to test the model for customers in North America only, you
can select vCustomerProfitability as the Source, select Region as the
Field, and type = ‘North America’ as the Criteria/Argument. |
You can create two types of
lift charts on the Mining Accuracy Chart tab, one of which allows you
to specify the state of the predictable column. (The other type of lift
chart does not allow you to do this.) If you specify the state of the
predictable column, the lift chart shows how well each model predicts
that particular state. If you do not specify the state of the
predictable column, the chart shows the relative accuracy of the models
in predicting the column value overall.
By default, all
mining models in the mining structure are selected, and no state is
specified for the predictable column. Click any cell in the Predict
Value column of the grid in the bottom pane of the Column Mapping tab,
select the High option from the drop-down list that appears, and then
click the Lift Chart tab. After the chart has finished loading, make
sure the Lift Chart option is selected from the Chart Type drop-down
list. The result, which compares how well each mining model predicts
high-profit customers, is shown in Figure 20.
The
x-axis of the lift chart shows the percentage of the overall
population, and the y-axis shows the percentage of the population that
is high profit. Recall that we defined high-profit customers as those in
the top quartile of customers. In an ideal model, in which you can
predict perfectly who the high-profit customers are, you need to look at
only 25 percent of the overall population to identify 100 percent of
the high-profit customers. The ideal model is represented by the
salmon-colored line in the figure. Note how it hits the top of the chart
at the 25-percent mark along the x-axis. The cornflower blue line in
the figure represents random guessing. By randomly guessing, you would
need to look at 100 percent of the overall population to identify 100
percent of the high-profit customers. The mining models are an
improvement over random guessing; the decision tree model (represented
by the mustard-colored line) is the most accurate, followed by the Naïve
Bayes model (represented by the light blue line). The colors of lines
may differ in your version of the graph, but the results will be the
same.