Regression accuracy: Excel’s regression vs. the SSDM algorithm
A recent post on the MSDN Forums raised an interesting issue: Excel Data Analysis’s Linear Regression and SSDM were returning different results. Specifically, the SSDM results were much worse.
The issue turned out to be a data modeling issue (columns were not mapped properly). However, during the investigation I had to compare Excel’s linear regression with the SSDM regression algorithm(s). Thought this might be interesting, so here is one way to compare the results from the two implementations.
I started with some simple (X,Y) data (available for download as a CSV file). First step - run Excel’s Data Analysis regression tool. The results are displayed typically in a separate spreadsheet, and the interesting part is the Coefficients sections:
Therefore, the Excel regression formula is
Y = 2.37486095661847*X -0.310344827586214
Next thing — apply Excel’s regression coefficients to the existing data. I did this by adding a column in my data spreadsheet and populating it with a formula:
Next thing, I created a Microsoft Linear Regression mining model on the same data. There is a variety of ways to do this, such as exporting data to a table, connecting directly to the Excel spreadsheet or, the simplest way, by using the Excel add-ins.
To get the model’s predictions in Excel, I used one of the functions exposed by the Excel add-ins, DMPREDICT. If you do not have the add-ins you can always execute a prediction query in SQL Server Management Studio or BI Dev Studio.
However, with the add-ins’s function, getting the prediction results is really easy: