Excel data mining cell functions
In the previous post I presented a not-so-documented cell function installed in Excel by the data mining add-ins. Jamie also gave some cool demos featuring these functions. To complete this topic, here is a full description of the cell functions and how they can be used.
The Excel Data Mining Client add-in includes 3 cell functions: DMPREDICT, DMPREDICTTABLEROW and DMCONTENTQUERY. These functions are included in the add-ins for SQL Server 2005, but not documented or supported. However, they will be both documented and supported in the add-ins for SQL Server 2008.
To use these functions, you will need a mining model available on the server. For the examples below, I created a Microsoft Decision Trees mining model, built by running the Classify task in the DM Client add-in over the sample Table Analysis Tools Sample spreadsheet in the sample workbook installed by the add-ins.
The model is named PurchasedBike and it predicts (based on demographic information) if a customer is likely (or not) to purchase a bike. The single predictable column of the model is named [Purchased Bike] and it take Yes and No as values.
For simplicity, all the cell functions are invoked in the same spreadsheet with the training data (the Table Analysis Tools Sample sheet).
DMPREDICT (presented in the previous post)
DMPREDICT can take a variable number of arguments, the minimum being 3.
- the first parameter is the Analysis Services connection to be used. An empty string refers the current (active) connection.
- the second parameter is the name of the mining model that will execute the prediction
- the third parameter, is the requested predicted entity (predictable column, in general, but could also be any prediction function)
The function may also take up to 32 pairs of arguments. Each such pair contains the value and the name of an input (in this order, i.e. value followed by name). The name of each input must match the name of an input column of the model, because DMPREDICT is executed as a NATURAL PREDICTION JOIN statement. The execution of a DMPREDICT statement effectively translates to the execution of a DMX statement like below:
SELECT
<Entity — parameter 2>
FROM
<Model — parameter 3>
NATURAL PREDICTION JOIN
(
SELECT <Parameter4> AS [<Parameter5>],
SELECT <Parameter6> AS [<Parameter7>],
…
) AS T
Both the values and the names can be hardcoded (e.g. …0, “Cars”…) or references in the Excel spreadsheet.
Example:
DMPREDICT( “”, “PurchasedBike”, “[Purchased Bike]”, A2, “Cars”)
translates to:
SELECT
[Purchased Bike]
FROM
[PurchasedBike]
NATURAL PREDICTION JOIN
(
SELECT <Value of Cell A2> AS [Cars]
…
) AS T
Well, actually, the values are passed as a rowset parameter and not as a singleton query (more details here), but
this does not make much of a difference.
DMPREDICTTABLEROW
This is a flavor of DMPREDICT intended to be used on Excel List objects. It can take 4 or 5 parameters.
- the first parameter is the Analysis Services connection to be used. An empty string refers the current (active) connection.
- the second parameter is the name of the mining model that will execute the prediction
- the third parameter, is the requested predicted entity (predictable column, in general, but could also be any prediction function)
- the fourth parameter is a range of cells to be passed as inputs
- the fifth parameter (optional) is a comma-separated list of column names to be used as names for the inputs.
If the range of input cells (the 3rd parameter) belongs to an Excel List Object, then the fifth parameter is not required. The function will use the Excel List’s column names to map the cell values to model columns in a NATURAL PREDICTION JOIN statement. An example: in cell O4 of the Table Analysis Tools Sample spreadsheet, type the formula:
=DMPREDICTTABLEROW(”",”PurchasedBike”, “[Purchased Bike]”, A4:M4,”")
The evaluation triggers the execution of a DMX statement which predicts the value of [Purchased Bike] by applying the PurchasedBike model patterns against the data in the A4:M4 range. As the model is built on the same Excel List object, the names associated with the range cells will match the model column names.
Note: The NATURAL PREDICTION JOIN syntax does not raise an error if one column (or all columns) is not matched in the mining model. Therefore, such a statement may yield unexpected results if the column mapping is not done correctly.
The prediction entity does not have to be a predictable column. Here is another example:
In cell P4 of the spreadsheet, type the formula:
=DMPREDICTTABLEROW(”",”PurchasedBike”,”PredictProbability([Purchased Bike], ‘Yes’)”, A4:M4,”")
The cell now contains the predicted probability for buying a bike
DMCONTENTQUERY
This function allows querying the content of a mining model. It takes 4 parameters. The first three are just like in the other cell functions:
- the first parameter is the Analysis Services connection to be used. An empty string refers the current (active) connection.
- the second parameter is the name of the mining model that will execute the prediction
- the third parameter, is the requested content column
- the fourth parameter is a WHERE clause to be appended to the content query
The function translates to a DMX statement like
SELECT
<Content Column — parameter 3>
FROM
<Model — parameter 2>.CONTENT
WHERE
<Condition — parameter 4>
This cell function requires good understanding of the algorithm-specific model content and can generate some spectacular results. For example, it can explain the reason for a certain prediction in the PurchasedBike model!
Decision Trees is a very intuitive algorithm - after various splits, the prediction is based on the distribution of a certain tree node. The Microsoft Decision Trees algorithm allows identifying that particular node by using the PredictNodeId prediction function. In the content, each node is represented as a row. A content column, NODE_DESCRIPTION, contains a textual description of all the splits from the root to the current node.
Retrieving the Node ID is easy, using the previous functions:
DMPREDICTTABLEROW(”",”PurchasedBike”,”PredictNodeId([Purchased Bike])”,A4:M4,”")
Using DMCONTENTQUERY and DMPREDICTTABLEROW, you can create a single expression which explains the prediction
In cell Q4 of the spreadsheet, type the expression below:
=DMCONTENTQUERY(
”",
”PurchasedBike”,
”NODE_DESCRIPTION”,
CONCATENATE(
”[NODE_UNIQUE_NAME]=’”,
DMPREDICTTABLEROW(”",”PurchasedBike”,”PredictNodeId([Purchased Bike])”,A4:M4,”"),
”‘”) )
A bit more complex than the first DMPREDICT, but not really hard to read. The cell will immediately display the reason for prediction and, for my model, the result looks like: “Age >= 39 and Cars not = 1″
Now, this is pretty much it!
Enjoying playing with DM cell functions, and drop a comment here if you find some other interesting ways of using the cell functions in your spreadsheets
Hello!
Very Interesting post! Thank you for such interesting resource!
PS: Sorry for my bad english, I’v just started to learn this language
See you!
Your, Raiul Baztepo