Programming Microsoft SQL Server 2005: Using the Data Mining Wizard and Data Mining Designer (part 1) - Creating a Mining Structure

1/14/2011 5:29:27 PM

Start by adding to your project a data source view (DSV) called CustomerProfitability that contains the vCustomerProfitability view.

Important

Normally, instead of using the vCustomerProfitability view in the DSV, you would use the TrainingCustomers and TestCustomers tables that were created by the SSIS package. Then you would use TrainingCustomers to train your mining models. If we were to use the TrainingCustomers table now, however, the results of the mining models would be not quite the same as those shown in the figures. The Percentage Sampling Task generates random samples, making the results slightly different from those presented in this chapter. Using vCustomerProfitability, which includes all customers, makes the examples in this chapter completely reproducible.

For the customer profitability analysis, we want to find out the common characteristics of high-profit and low-profit customers. To do this, we must add a calculated column to our DSV that defines “high profit.” We don’t want to store this in our database because it is a definition specific to the analysis we are doing.

To create a calculated column in the DSV, take the following steps:

1.	Right-click the vCustomerProfitability view on the DSV design tab and choose New Named Calculation.
2.	In the Named Calculation dialog box, give your named calculation the name ProfitCategory, enter the description High-profit customers in the top quartile of profit, and type the following in the Expression box: CASE WHEN Profit>1077 THEN 'High' ELSE 'Low' END This creates two categories of customers: the “High” category for customers in the top quartile of profit amounts and the “Low” category for all other customers.
3.	Click OK. You will see your newly created column in the DSV with a calculator icon to the left of it.

Now we’ll browse the data within the DSV:

1.	Right-click vCustomerProfitability and choose Explore Data. The Data Explorer window opens. You can choose to view either the top x rows in your table or a random sample of x rows from your table. (The value of x is configurable.) By default, the Data Explorer window shows data for the top 5000 rows in your data source (the exact meaning of “top” will vary, depending on how the data is sorted in your table).
2.	To change this default, click the Sampling Options button (the one with the properties icon) in the upper-right corner of the window. This opens the Data Exploration Options dialog box.
3.	Click the Random Sample option button, leaving the Sample Count at 5000, and click OK.
4.	To implement this configuration change and see the data for your random sample of records from vCustomerProfitability, you must refresh the data. To do so, click the Resample Data button (the one with the refresh icon) in the upper-right corner of the Data Explorer window. Now scroll to the rightmost column of the table. You will see the named calculation you just created. It will look no different from any other column of your table.

Using the other tabs in the Data Explorer window, you can create PivotTables and charts or simple data distribution charts. To use the distribution chart feature, follow these steps:

1.	Click the Chart tab. By default, column charts are displayed for a subset of the columns in your DSV. You can click anywhere on the drop-down control at the top of the window to modify the view to display charts for any combination of columns from your table.
2.	Select the ProfitCategory column from the drop-down list, and either press ENTER or click the Chart tab—or click somewhere in the chart area—to dismiss the column list. A column chart for Profit Category should appear alongside the other charts.
3.	You can change the chart type to a (horizontal) bar chart or pie chart. To change your column charts to pie charts, click the pie chart button in the upper-left corner of the Data Explorer window, or right-click a blank area of the chart and choose Pie Chart. Figure 1 shows pie charts of several customer characteristics. Figure 1. The Chart tab of the Data Explorer window allows you to see how your data is distributed.

You can see details about the data represented by a particular item on the Chart tab by hovering the mouse pointer over the item. For example, hovering over a pie slice shows you the count and percentage of records that are represented by that slice (as shown in Figure 20-9).

Data Explorer allows you to see the distribution of your data quickly and easily without your needing to write a single line of code. This can be useful during the startup phase of your data mining project, before any models have been created.

Creating a Mining Structure

Mining models in SSAS are contained within mining structures. A mining structure consists of a subset of columns from your DSV and one or more mining models that use the columns. The content type of each column is set separately by the mining model—for each mining model in your mining structure, you specify whether the column is a key, an input, or a predictable column. (We’ll discuss the content types in more detail later.)

Because you can also indicate that a column in the structure should be ignored for a particular mining model, you can conceivably have two mining models within the same structure that do not share any input or predictable columns. But it makes more sense to group, under the same structure, mining models that answer the same business question and that have some commonality in columns from your DSV. This is because mining models created within the same structure can be compared using the Mining Accuracy Chart functionality of SSAS, which we will examine shortly. Furthermore, mining models that share a structure are processed in parallel, with a single pass through the data.

Getting back to our data mining goals, we want to know whether particular characteristics or buying behaviors can help us predict who the high-profit customers are. We also want to cluster customers according to their demographic characteristics and buying behaviors. To that end, we will use the Data Mining Editor in Visual Studio to create a mining structure with models that use the Clustering, Decision Tree, and Naïve Bayes algorithms. (Later we will discuss how each of these algorithms works, how they differ from each other, and their appropriate use for specific prediction scenarios.)