DMX in 2008: Model Filtering (Part 1 of 2)
This is the first post in a small series describing my favorite data mining feature in SQL Server 2008, model filtering. Filtering allows you to take full advantage of the mining structure / mining model dichotomy, separating the data staging (in the structure) from the modeling (in the mining model).
SQL Server 2008 allows specifying, for each mining model, a filter to be applied on the training data. The filter acts as a partitioning mechanism inside the mining structure and it is applied on top of any training/testing partitioning already existing at the structure level. Furthermore, filters can be applied to data in nested tables, but more about this in the next post. Filters allow you to build different models for various partitions of the data with minimal effort, either to obtain better overall accuracy (think of it as manually specifying the first split in a decision tree), or to compare the patterns between different partitions.
So, let’s start with a simple scenario, using the same old Adventure Works data - building a mining model that predicts how likely a customer is to purchase a bike based on various demographic information. Let’s also assume that you know, based on your experience with the data that the Geography factor has in important role and you don’t want your North American patterns to hide (because of volume) the peculiarities of the Pacific market.
Suppose you start with a mining structure defined like below:
CREATE MINING STRUCTURE TestStruct
(
CustomerKey LONG KEY,
Region TEXT DISCRETE,
YearlyIncome DOUBLE DISCRETIZED,
EnglishOccupation TEXT DISCRETE,
CommuteDistance TEXT DISCRETE,
BikeBuyer BOOLEAN DISCRETE
) WITH HOLDOUT( 30 PERCENT )
Then, train the mining structure with an INSERT INTO statement:
INSERT INTO MINING STRUCTURE TestStruct
(
CustomerKey,
Region,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer
)
OPENQUERY([Adventure Works DW], ‘SELECT
CustomerKey,
Region,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer FROM dbo.vTargetMail‘)
Now, the modeling part - add a mining model to the mining structure, using only the data for the Pacific area:
ALTER MINING STRUCTURE TestStruct
ADD MINING MODEL DT_Pacific
(
CustomerKey,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH FILTER( Region=’Pacific’ ), DRILLTHROUGH
The new syntax element is the FILTER construct, which instructs the model to use in training only those cases where the Region column contains the ‘Pacific’ value.
Similarly, add a new model for the North America region:
ALTER MINING STRUCTURE TestStruct
ADD MINING MODEL DT_NorthAmerica
(
CustomerKey,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH FILTER( Region=’North America’ ), DRILLTHROUGH
A note on the syntax: the WITH keyword was available in SQL 2005 (and used to specify the DRILLTHROUGH option when creating a mining model). WITH can be used to introduce various model options, currently drill-through and filters. Consequently, DRILLTHROUGH and FILTER may appear in any order and are not conditioned by each other.
As you notice, FILTER is syntactically a function, taking a Boolean expression. The Boolean expression may be a simple predicate (the kind of Column=Value), an EXISTS predicate or a Boolean combination (using AND, OR and NOT operators) of multiple predicates.
The Column part of a simple predicate is always the name of a mining structure column.
The respective mining structure column does not have to be included in the mining model. In the examples above, Region is no longer an interesting attribute when all the training cases seen by a model have the same value (e.g. Pacific), therefore it is not part of the model.
An EXISTS predicate is a Boolean expression that is applied on a nested table, and I will discuss it in detail in my next post.
A simple predicate is defined as <Column> <Operator> <Value>, where the accepted set of operators depends on the content type of the column, and Value must always be a constant. Value generally has to have the same type as the column but can also be the NULL constant.
If the column is:
- - A case level KEY - then no operator can be applied (no key filters are supported)
- - DISCRETE or DISCRETIZED - supports the = and <> operators
- - CONTINUOUS - supports the =, <>, <, <=, >, >= operators
XML serialization of the floating point double numbers as well as rounding errors may result in losses of less significant digits, therefore using an = or <> operator with a double value may result in unexpected results, so it is safer to use a combination of other predicates.
In case of discretized (numeric or date/time) values, the = and <> operators have a semantic similar with the PREDICTION JOIN operator: they evaluate the bucket that contains the value and then apply the predicate to the bucket index. Therefore, for a discretized YearlyIncome column with buckets (0, 10000], (10000, 20000] etc. the filters below mean exactly the same thing:
- - (YearlyIncome=50000)
- - (YearlyIncome=9000)
- - (YearlyIncome=999.999)
Applying a filter on a mining model means, effectively, training the mining model with the data returned by a query like below:
SELECT … FROM MINING STRUCTURE TestStruct.CASES WHERE IsTrainingCase() AND (<FilterHere>)
For those of you that used OLAP mining models, filters are somehow similar to defining a cube slice on a mining structure - but at the model level.
If your filtered model has drillthrough enabled, then all the structure cases that match the filter will be linked with a content node, not only the training cases from the structure. Of course, only the structure training cases matching the filter will be used in training.
Therefore, the following queries are equivalent
SELECT … FROM MINING MODEL DT_NorthAmerica.CASES WHERE IsTrainingCase()
SELECT … FROM MINING STRUCTURE TestStruct.CASES WHERE IsTrainingCase()AND (Region=’North America’)
In my previous post on Test and Training data, I talked about accuracy stored procedures and how they can be applied on various data partitions (test or training). Filters add a new dimension to the data partitions, which can now be filtered test data or filtered training data.
Specifically, I was mentioning that the accuracy stored procedures have a syntax like below
CALL SystemGetLiftTable ( BikeBuyerClass, 2, ‘BikeBuyer’, true) // - to get the lift chart for the TRUE state of the BikeBuyer attribute
CALL SystemGetClassificationMatrix( BikeBuyerClass, 2, ‘BikeBuyer’) // - to get the full classification matrix for the BikeBuyer attribute
And the second parameter (2) is a bitmask identifying the data partition(s) to be used in the accuracy stored procedure. Here are some bit values:
- 1 identifies the training partition
- 2 identifies the test partition
- 3 ( bitwise OR combination of 1 and 2) identifies ALL the data in the mining structure (training + test)
Filters introduce another bit in the bitmask, with a value of 4. Therefore, here are some new values for the second parameter of the accuracy stored procedures:
- 5 (4 bitwise OR 1) identifies the filtered training data seen by the model in training
- 6 (4 bitwise OR 2) identifies the filtered test data - those test cases from the mining structure where the model filter applies
- 7 (4 bitwise OR 1 bitwise OR 2) identifies all the cases (training or test) in the structure where the model filter applies
In my next post I will talk about nested table filters and some new scenarios enabled by those.
Discussion Area - Leave a Comment