DMX in 2008: Model Filtering (Part 1 of 2)
This is the first post in a small series describing my favorite data mining feature in SQL Server 2008, model filtering. Filtering allows you to take full advantage of the mining structure / mining model dichotomy, separating the data staging (in the structure) from the modeling (in the mining model).
SQL Server 2008 allows specifying, for each mining model, a filter to be applied on the training data. The filter acts as a partitioning mechanism inside the mining structure and it is applied on top of any training/testing partitioning already existing at the structure level. Furthermore, filters can be applied to data in nested tables, but more about this in the next post. Filters allow you to build different models for various partitions of the data with minimal effort, either to obtain better overall accuracy (think of it as manually specifying the first split in a decision tree), or to compare the patterns between different partitions.
So, let’s start with a simple scenario, using the same old Adventure Works data - building a mining model that predicts how likely a customer is to purchase a bike based on various demographic information. Let’s also assume that you know, based on your experience with the data that the Geography factor has in important role and you don’t want your North American patterns to hide (because of volume) the peculiarities of the Pacific market.
Suppose you start with a mining structure defined like below: