Hanalei, Hawaii 2010-03-18

LINQ Gymnastics: Creating A Predictive Query With LINQ

As you may have guessed, I've been using LINQ a whole lot with Adventure Works lately. It's been.

As you may have guessed, I've been using LINQ a whole lot with Adventure Works lately. It's been... interesting (the Adventure Works part) but it's great for working out those weak LINQ muscles that tend to atrophy after a while. On the heals of last night's nested query discovery, I decided to see what other kinds of things I could get LINQ to do...

 

Predictive Modeling and You
The selling doesn't stop when a user pops an item into their basket - this is something I learned when I was steward of the Commerce Starter Kit. Once a user commits to putting something into their cart, you're 80% there in terms of selling them something else. Don't show the basket after someone adds an item- show more products!

But what do you show?

In English, what you want to show are items that other people bought, most often, when they bought the product the user just added to their basket. This is a Predictive Cross-sell, and it's not the easiest query to write.

When you install the Adventure Works examples, one of the things you can do is install the SQL Analytics bits - an OLAP cube that's capable of running some high-end analytics that pretty much nail this stuff. There is an example in the old Adventure Works store sample site, but suffice to say that this is soooooooo out of the realm of possibility for most people (having runtime access to a cube, the memory and personnel needed to run said cube, the stones to run a Data Mining model real time, etc). Let's just say this isn't the best idea.

No worries - you can run this query yourself, with a little LINQ Gymnastics.

 

The Code
I wrote about this before, but the best way to approach a difficult query is to cut it up into pieces. It really helps in this case, since what we're doing is statistical and therefore may seem simple, but isn't really. So let's break it out:

  • We need to know all the Orders that our Product was part of
  • We need all of the Products that were part of those Orders, omitting the one we just bought
  • We need to roll these up by the SUM of the quantity of the products bought, and order them in descending order
  • We need to take the top 5 of this SUM, and query the Products table for the related Products

Breaking that down into LINQ is actually not that hard, if you write a LINQ query per requirement:

First, pull all the Orders for a given Product (let's assume we just added product 719 to our basket):

        //load our context
        AdventureWorks.DB db=new DB();

        //first, we need to know all of the SalesOrderIDs that had this product
        var salesWithProduct = from s in db.SalesOrderHeaders
                               join sd in db.SalesOrderDetails on s.SalesOrderID equals sd.SalesOrderID
                               where sd.ProductID == 719
                               select s.SalesOrderID;

Two things are very important here:

  1. This query doesn't fire until we iterate over it (call ToList() or foreach for example). What it does do is build an IQueryable Expression that we can use in other queries.
  2. You MUST use "select s.SalesOrderID" here - you can't say "select new{s.SalesOrderID}" because that would return an Anonymous type and make our next query barf as it uses "Contains()".

Next, all of the products that were part of those Orders:

        //now we need to get all the productIDs from those sales,
        //where the productID != our original
        var productsInSales = from sd in db.SalesOrderDetails
                              where salesWithProduct.Contains(sd.SalesOrderID)
                              && sd.ProductID!=719
                              select new
                              {
                                  sd.ProductID,
                                  sd.OrderQty
                              };

 

Notice the "Contains" here - it's a little inverted, but basically we're treating IQueryable as a Collection (which it is) by using the Contains() statement, we're creating an "IN"-style query.

Next, Rollup and order by SUM of the OrderQty:

        //this should give us a massive list of products
        //what we want here is to gorup these by the SUM of the order quantity
        //as we want only those products that others have bought a lot of
        var topOrders = from p in productsInSales
                          group p by p.ProductID into g
                          orderby g.Sum(x => x.OrderQty) descending
                          select new
                          {
                              ProductID = g.Key,
                              OrderSum = g.Sum(x => x.OrderQty)

                          };

Here, "g.Key" is our ProductID, and that's all we need for our final query...

        //create a query with a JOIN
        //faster than an IN
        var topProducts = from p in db.Products
                          join t in topOrders on p.ProductID equals t.ProductID
                         select p;

        //our result holder
        //take on the top 5
        List<Product> result= topProducts.Take(5).ToList<Product>();

This executes very, very quickly and produces this SQL:

SELECT [t0].[ProductID], [t0].[Name], [t0].[ProductNumber], [t0].[MakeFlag], [t0].[FinishedGoodsFlag], 
[t0].[Color], [t0].[SafetyStockLevel], [t0].[ReorderPoint], [t0].[StandardCost], [t0].[ListPrice], [t0].[Size], 
[t0].[SizeUnitMeasureCode], [t0].[WeightUnitMeasureCode], [t0].[Weight], [t0].[DaysToManufacture], [t0].[ProductLine], 
[t0].[Class], [t0].[Style], [t0].[ProductSubcategoryID], [t0].[ProductModelID], [t0].[SellStartDate], 
[t0].[SellEndDate], [t0].[DiscontinuedDate], [t0].[rowguid], [t0].[ModifiedDate]
FROM [Production].[Product] AS [t0]
INNER JOIN (
    SELECT SUM(CONVERT(Int,[t1].[OrderQty])) AS [value], [t1].[ProductID]
    FROM [Sales].[SalesOrderDetail] AS [t1]
    WHERE (EXISTS(
        SELECT NULL AS [EMPTY]
        FROM [Sales].[SalesOrderHeader] AS [t2]
        INNER JOIN [Sales].[SalesOrderDetail] AS [t3] ON [t2].[SalesOrderID] = [t3].[SalesOrderID]
        WHERE ([t2].[SalesOrderID] = [t1].[SalesOrderID]) AND ([t3].[ProductID] = @p0)
        )) AND ([t1].[ProductID] <> @p1)
    GROUP BY [t1].[ProductID]
    ) AS [t4] ON [t0].[ProductID] = [t4].[ProductID]
ORDER BY [t4].[value] DESC

 

Which is sort of Fugly, but it works and is very speedy.

 

Reality Check
In "Keepin it Real Yo" I think it's worth thinking this approach through. I can rewrite that query into a Sproc, lose 60% of the SQL, and be done with this whole thing in very short order. In addition, all of the LINQ statements amount to a lot of extra code just to build a SQL statement. However...

If you break our your queries into steps and add some comments, like I did here, you understand a whole heck of a lot more what I, the stats wonk, am trying to do.  If you were to stumble accross a predictive Sproc in the DB, you would no doubt spend a whole lot of time trying to figure just WTF I was thinking.

In addition, you can throw a debugger on each of the LINQ statements above, run the resulting SQL, and see if those results make sense.

I like SQL a lot, but I know that I'm very alone in that. LINQ is a nice alternative for complicated SQL, as long as you take the time to make readable.