Querying Rules and Itemsets (like the Data Mining Viewers do)
I will try to continue the series started by Jamie, presenting the other set of queries issued by the Microsoft Association Rules viewer. Recently, a question on these queries appeared on the MSDN Data Mining Forums and the poster raised a very good point: while the stored procedures were intended as internal calls for the built-in viewers, external applications and viewers may want to employ them.
So, here is how the rest of the Association Rules viewer works.
Once the viewer is loaded, the first call is something like:
CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetStatistics(’Customers’)
The single parameter of this stored procedure is the mining model name. The result is a one-row table containing the following columns:
Column |
Sample Value |
Comments |
MAX_PAGE_SIZE | 2000 | The maximum server supported page size for fetching rule and itemsets. This parameter ensures the the viewer will not make requests which will make the server go out of memory, details later. |
MIN_SUPPORT | 89 | Minimum actual support for rules detected by the model |
MAX_SUPPORT | 2439 | Maximum actual support for rules detected by the model |
MIN_ITEMSET_SIZE | 0 | Minimum itemset size |
MAX_ITEMSET_SIZE | 3 | Maximum itemset size |
MIN_RULE_PROBABILITY | 0.401529636711281 | minimum actual rule probability |
MAX_RULE_PROBABILITY | 0.993975903614458 | maximum actual rule probability |
MIN_RULE_LIFT | 0.514182044237125 | minimum actual rule importance |
MAX_RULE_LIFT | 2.13833283242171 | maximum actual rule importance |
Now, this information is cached by the algorithm during training and saved as a text field in the root content node. It can be retrieved with a simple DMX statement like:
SELECT TOP 1 NODE_DESCRIPTION FROM Customers.CONTENT
The query will return a big string like below:
“Association Rules Model; ITEMSET_COUNT=378; RULE_COUNT=119; MIN_SUPPORT=89; MAX_SUPPORT=2439; MIN_ITEMSET_SIZE=0; MAX_ITEMSET_SIZE=3; MIN_PROBABILITY=0.401529636711281; MAX_PROBABILITY=0.993975903614458; MIN_LIFT=0.514182044237125; MAX_LIFT=2.13833283242171”
The stored procedure guarantees that the string information is parsed according to the whatever method the server uses to cache the information. So, for instance, if Microsoft decides to add some new statistics in the description string, you can safely rely on the stored procedure to fill the right columns with the appropriate information.
Once these statistics are retrieved, some pieces of the Rules viewer are already populated:
For instance, the Up-Down controls for Minimum probability an minimum importance have a minimum and a maximum value.
Extracting the rules
The next viewer call fetches some rules for the rules viewer.
The call looks like below:
CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetRules(’Customers’, 0, 1999, 1, 0.4, 0.51, ”, True)
Here is a list of parameters:
Parameter ordinal |
Comment |
1 | string — The name of the mining model |
2 | integer — The index of the first rule to be retrieved (0 means start with the first rule) |
3 | integer — The index of the last rule to be retrieved (1999, together with 0 for the previous parameter, makes a total of 2000 rules, i.e. the rules page size from statistics) |
4 | integer — The sort order for the rules, more details below |
5 | double –the minimum probability for returned rules |
6 | double — the minimum importance for returned rules |
7 | a string filter to be applied on returned rules (may be empty) |
8 | a boolean flag (true/false) flag indicating whether the long name or short name of a rule should be returned – the state of the “Show Long Name” checkbox in the viewer |
The parameter order is somehow confusing. The stored procedure works like this:
- scans all the rules detected during the model training. These rules can be obtained with a query like
SELECT * FROM Customers.CONTENT WHERE NODE_TYPE = 8
Such a query will typically take long time to execute and even longer time to interpret the results. More details on the content information for rule will be presented in a future post. If you don’t want to wait, look at page 355 in the “Data Mining with Microsoft SQL Server 2008“ book. Browsing the rules using the server side Adomd.net object model is far more efficient than executing the query, and this is one of the major reasons for the viewers to use the stored procedures.
- for each rule, apply the filters described by the other parameters (minimum probability, minimum importance and, if not empty, the string filter).
- insert the qualified rules (which pass all filters) in a heap data structure for sorting, based on the sort order argument
- once the rules browsing is complete, traverse the heap and return the rules starting with the start index, up to the end index
The sort order parameter is an integer representation of the enumeration below:
public enum AssociationSortOrder
{
AscendingProbability = 0,
DescendingProbability = 1,
AscendingLift = 2,
DescendingLift = 3,
AscendingSupport = 4,
DescendingSupport = 5,
AscendingItemsetLength = 6,
DescendingItemsetLength = 7,
AscendingLexicographical = 8,
DescendingLexicographical = 9,
}
The default sort order is 1 (AscendingProbability). You can change the sort order by clicking one of the rules table headers. The only enumeration options that are valid for rules are 0, 1, 2, 3, 8 and 9 (respectively, ascending and descending order by probability, importance or lexicographically). The other values in the enumeration are reserved for retrieving the itemsets.
The string filter parameter can be any .Net regular expression to be applied to the rule description. In particular, it can also be an item name (which is a valid .Net regular expression). If this filter is not empty, then only rules matching the regular expression will be returned. With a bit of experimentation, you can build rather sophisticated rule filters using the RegEx mechanism.
Now, the result of the stored procedure is a non-normalized representation of lots of information describing the rules. It has the following columns, which have different semantics depending on the table row:
Column |
What it may hold |
NODE_UNIQUE_NAME | If this is not empty, then the current row is a rule. Otherwise, the current row contains additional information |
NODE_CAPTION | For rule rows, this is the rule caption |
NODE_SUPPORT | For the first row in the result, the total number of returned rules For rule rows, the rule support. Empty for rows containing additional information |
NODE_PROBABILITY | For rule rows, the rule probability. Empty otherwise |
NODE_LIFT | For rule rows, the rule importance. Empty otherwise |
NODE_SIZE | For rule rows, the size of the itemset defining the rule. An example: for a rule like (A,B)->C, this column will contain 3 |
ATTRIBUTE_NAME | For rule rows, as well as for the first row, this is empty. For the other rows, this contains the name of an attribute that participates in a rule |
ATTRIBUTE_VALUE | For rule rows, as well as for the first row, this is empty. For the other rows, this contains the value of an attribute that participates in a rule |
Here is a result sample:
NODE_UNIQUE_NAME |
NODE_CAPTION |
NODE_SUPPORT |
NODE_PROBABILITY |
NODE_LIFT |
NODE_SIZE |
ATTRIBUTE_NAME |
ATTRIBUTE_VALUE |
119 | |||||||
1378 |
Attack of the Clones = Existing, Return of the Jedi = Existing -> Empire Strikes Back = Existing |
165 |
0.993975903614458 |
1.35796… |
3 | ||
Attack of the Clones | Existing | ||||||
Return of the Jedi | Existing | ||||||
Empire Strikes Back | Existing | ||||||
1485 |
Alien = Existing –> Aliens = Existing |
103 |
0.515 |
1.1578… |
2 | ||
Alien | Existing | ||||||
Aliens | Existing |
The first row tells the viewer that 119 rows have been returned in this result set. This row always populates only the NODE_SUPPORT column with the number of returned rules.
The second row is the first actual rule being returned:
“Attack of the Clones = Existing, Return of the Jedi = Existing -> Empire Strikes Back = Existing”
This rule corresponds to the content node with the NODE_UNIQUE_NAME of 1378, has a support of 165, a probability of 0.993… and an importance of 1.35796…. It consists of 3 attributes.
The next 3 rows will contain the attribute names and their values. By convention, the last row in this batch will be the rule result. Therefore, the next 3 rows contain, respectively:
- the two predicates forming the left hand side of the rule : “Attack of the Clones” = “Existing” and “Return of the Jedi” = “Existing”
- the predicate on the right hand side of the rule: “Empire Strikes Back” = “Existing”
Row 5 of the result table contains the next rule (from the content node with the ID of 1485) and the next two rows contain the components of this rule. And so on, and so forth.
The decomposition of rule attributes and values allows changing on the fly the rendering:”Show attribute and value”, “Show attribute only” or “Show value only”.
You are probably already familiar with the resulting view:
Extracting the itemsets
The itemset extraction is very similar to the rule extraction.
The stored procedure call looks like:
CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetItemsets(’Customers’, 0, 1999, 5, 0, 89, ”, True)
Here is a list of parameters:
Parameter ordinal |
Comment |
1 | string — The name of the mining model |
2 | integer — The index of the first itemset to be retrieved (0 means start with the first itemset) |
3 | integer — The index of the last itemset to be retrieved (1999, together with 0 for the previous parameter, makes a total of 2000 itemsets, i.e. the rules page size from statistics) |
4 | integer — The sort order for the itemsets, more details below |
5 | integer –the minimum itemset size |
6 | integer — the minimum itemset support |
7 | a string filter to be applied on returned itemsets (may be empty) |
8 | a boolean flag (true/false) flag indicating whether the long name or short name of an itemset should be returned – the state of the “Show Long Name” checkbox in the viewer |
The sort order parameter is an integer representation of the enumeration described in the rules section,AssociationSortOrder.
The default sort order is 5 (DecendingSupport). You can change the sort order by clicking one of the itemset table headers. The only enumeration options that are valid for rules are 4,5,6,7, 8 and 9 (respectively, ascending and descending order by support, itemset size or lexicographically).
The result of the GetItemset stored procedure is very similar to the result of the GetRules stored procedure.
That’s all! If you write a nice viewer using the stored procedures, send us a demo (or at least a snapshot!)
Bogdan.
Excellent. This is exactly the kind of stuff of which I would like to see more.
Can i ask? What is query to get this:
Attack of the Clones Existing
Return of the Jedi Existing
Empire Strikes Back Existing
I always get’s only first line in NODE_DISTRIBUTION:
Movies (Star Wars Episode V: Empire Strikes Back) Existing
Thank you
Got it