Hanalei, Hawaii 9/2/2010
438 Posts and Counting

LINQ Gymnastics: Creating A Predictive Query With LINQ

Wednesday, February 27, 2008 -

As you may have guessed, I've been using LINQ a whole lot with Adventure Works lately. It's been... interesting (the Adventure Works part) but it's great for working out those weak LINQ muscles that tend to atrophy after a while. On the heals of last night's nested query discovery, I decided to see what other kinds of things I could get LINQ to do...

 

Predictive Modeling and You
The selling doesn't stop when a user pops an item into their basket - this is something I learned when I was steward of the Commerce Starter Kit. Once a user commits to putting something into their cart, you're 80% there in terms of selling them something else. Don't show the basket after someone adds an item- show more products!

But what do you show?

In English, what you want to show are items that other people bought, most often, when they bought the product the user just added to their basket. This is a Predictive Cross-sell, and it's not the easiest query to write.

When you install the Adventure Works examples, one of the things you can do is install the SQL Analytics bits - an OLAP cube that's capable of running some high-end analytics that pretty much nail this stuff. There is an example in the old Adventure Works store sample site, but suffice to say that this is soooooooo out of the realm of possibility for most people (having runtime access to a cube, the memory and personnel needed to run said cube, the stones to run a Data Mining model real time, etc). Let's just say this isn't the best idea.

No worries - you can run this query yourself, with a little LINQ Gymnastics.

 

The Code
I wrote about this before, but the best way to approach a difficult query is to cut it up into pieces. It really helps in this case, since what we're doing is statistical and therefore may seem simple, but isn't really. So let's break it out:

  • We need to know all the Orders that our Product was part of
  • We need all of the Products that were part of those Orders, omitting the one we just bought
  • We need to roll these up by the SUM of the quantity of the products bought, and order them in descending order
  • We need to take the top 5 of this SUM, and query the Products table for the related Products

Breaking that down into LINQ is actually not that hard, if you write a LINQ query per requirement:

First, pull all the Orders for a given Product (let's assume we just added product 719 to our basket):

        //load our context
        AdventureWorks.DB db=new DB();

        //first, we need to know all of the SalesOrderIDs that had this product
        var salesWithProduct = from s in db.SalesOrderHeaders
                               join sd in db.SalesOrderDetails on s.SalesOrderID equals sd.SalesOrderID
                               where sd.ProductID == 719
                               select s.SalesOrderID;

Two things are very important here:

  1. This query doesn't fire until we iterate over it (call ToList() or foreach for example). What it does do is build an IQueryable Expression that we can use in other queries.
  2. You MUST use "select s.SalesOrderID" here - you can't say "select new{s.SalesOrderID}" because that would return an Anonymous type and make our next query barf as it uses "Contains()".

Next, all of the products that were part of those Orders:

        //now we need to get all the productIDs from those sales,
        //where the productID != our original
        var productsInSales = from sd in db.SalesOrderDetails
                              where salesWithProduct.Contains(sd.SalesOrderID)
                              && sd.ProductID!=719
                              select new
                              {
                                  sd.ProductID,
                                  sd.OrderQty
                              };

 

Notice the "Contains" here - it's a little inverted, but basically we're treating IQueryable as a Collection (which it is) by using the Contains() statement, we're creating an "IN"-style query.

Next, Rollup and order by SUM of the OrderQty:

        //this should give us a massive list of products
        //what we want here is to gorup these by the SUM of the order quantity
        //as we want only those products that others have bought a lot of
        var topOrders = from p in productsInSales
                          group p by p.ProductID into g
                          orderby g.Sum(x => x.OrderQty) descending
                          select new
                          {
                              ProductID = g.Key,
                              OrderSum = g.Sum(x => x.OrderQty)

                          };

Here, "g.Key" is our ProductID, and that's all we need for our final query...

        //create a query with a JOIN
        //faster than an IN
        var topProducts = from p in db.Products
                          join t in topOrders on p.ProductID equals t.ProductID
                         select p;

        //our result holder
        //take on the top 5
        List<Product> result= topProducts.Take(5).ToList<Product>();

This executes very, very quickly and produces this SQL:

SELECT [t0].[ProductID], [t0].[Name], [t0].[ProductNumber], [t0].[MakeFlag], [t0].[FinishedGoodsFlag], 
[t0].[Color], [t0].[SafetyStockLevel], [t0].[ReorderPoint], [t0].[StandardCost], [t0].[ListPrice], [t0].[Size], 
[t0].[SizeUnitMeasureCode], [t0].[WeightUnitMeasureCode], [t0].[Weight], [t0].[DaysToManufacture], [t0].[ProductLine], 
[t0].[Class], [t0].[Style], [t0].[ProductSubcategoryID], [t0].[ProductModelID], [t0].[SellStartDate], 
[t0].[SellEndDate], [t0].[DiscontinuedDate], [t0].[rowguid], [t0].[ModifiedDate]
FROM [Production].[Product] AS [t0]
INNER JOIN (
    SELECT SUM(CONVERT(Int,[t1].[OrderQty])) AS [value], [t1].[ProductID]
    FROM [Sales].[SalesOrderDetail] AS [t1]
    WHERE (EXISTS(
        SELECT NULL AS [EMPTY]
        FROM [Sales].[SalesOrderHeader] AS [t2]
        INNER JOIN [Sales].[SalesOrderDetail] AS [t3] ON [t2].[SalesOrderID] = [t3].[SalesOrderID]
        WHERE ([t2].[SalesOrderID] = [t1].[SalesOrderID]) AND ([t3].[ProductID] = @p0)
        )) AND ([t1].[ProductID] <> @p1)
    GROUP BY [t1].[ProductID]
    ) AS [t4] ON [t0].[ProductID] = [t4].[ProductID]
ORDER BY [t4].[value] DESC

 

Which is sort of Fugly, but it works and is very speedy.

 

Reality Check
In "Keepin it Real Yo" I think it's worth thinking this approach through. I can rewrite that query into a Sproc, lose 60% of the SQL, and be done with this whole thing in very short order. In addition, all of the LINQ statements amount to a lot of extra code just to build a SQL statement. However...

If you break our your queries into steps and add some comments, like I did here, you understand a whole heck of a lot more what I, the stats wonk, am trying to do.  If you were to stumble accross a predictive Sproc in the DB, you would no doubt spend a whole lot of time trying to figure just WTF I was thinking.

In addition, you can throw a debugger on each of the LINQ statements above, run the resulting SQL, and see if those results make sense.

I like SQL a lot, but I know that I'm very alone in that. LINQ is a nice alternative for complicated SQL, as long as you take the time to make readable.

Related


Gravatar
Roger Jennings - Wednesday, February 27, 2008 - Rob:

!Viva LINQ!

Great job _and_ post!

--rj

See http://oakleafblog.blogspot.com/2008/02/linq-and-entity-framework-posts-for_26.html
Gravatar
Steven Rogers - Wednesday, February 27, 2008 - LINQ is cool. Like, you know, "wow, neat". I, personally, can't see when I'd find a use for it. I like SQL too much. :)
Gravatar
adminjew - Wednesday, February 27, 2008 - Rob Nice! But when are we getting some of these cool stuff in SubSonic (You must be up to something:)
Gravatar
Jamie MacLennan - Thursday, February 28, 2008 - Using Analysis Services is hardly out of the range of possiblility for most people. You don't need a cube, it doesn't take any particular amount of memory (I run models with 750,000 products on my laptop), and you can even build the models in Excel.

Once the model is built, the query to retrieve the predictions is as simple as

SELECT Predict(Products, 5) FROM MyCrossSellModel
NATURAL PREDICTION JOIN
( SELECT (SELECT 'Apples' AS Product UNION
SELECT 'Bananas AS Product) AS Products ) as t

Soooooooo out of reach? I don't think so. Oh, and if you want to create the model in code rather than Excel, you can do so with a simple CREATE MINING MODEL statement and populate with an INSERT INTO.

These queries are executable through OLE/DB using the Analysis Services provider (which happens to ship with Excel) or in .Net using the ADO.Net provider or (better) the ADOMD.Net provider - the code looks the same either way. Also - say you happened to be developing on a platform that didn't have a provider (e.g. Windows Mobile), you can use XML for Analysis over http to query from any device - e.g. cross sales on your phone!

Now if you can show me LINQ on top of Analysis Services DMX queries - then I'll be impressed...

-Jamie
Gravatar
Rob Conery - Thursday, February 28, 2008 - @Jamie: OLAP isn't built to be hit on par with transactional systems. The shear volume of data just does not scale, at all. It's a bad model, no matter what you do. In terms of soooo out of reach - let me repeat back to you some snips of your comment: >>>you can even build the models in Excel. What's a model? >>>Once the model is built, the query to retrieve the predictions is as What do models have to do with queries? >>>simple CREATE MINING MODEL statement and populate with an INSERT INTO CREATE MINING MODEL is simple? What does it do? Where is it? What's the perf and scaling? API? Constraints? >>>executable through OLE/DB using the Analysis Services provider That simple eh? >>>XML for Analysis over http to query from any device This one I know - and what I know is that I never want to use it again :). Homey this is obviously your expertise. Others don't know what you know :).
Gravatar
Jamie MacLennan - Thursday, February 28, 2008 - I'll quip back

* I'm not talking about OLAP - I'm talking about data mining. Analysis Services does both and they are not the same thing. AS can respond to 1,000,000,000 (yes 1 Billion) cross sell queries per day from a catalog of 750,000 products across 15,000,000 transactions. How many times can you fire that LINQ query on a problem that size?

* A (data mining) model is like a table

* You query a model like you query a table

* CREATE MINING MODEL is like CREATE TABLE - see BOL

* If ADO.Net is too complicated, what can I say?

It doesn't take a lot of expertise, only familiarity.
Gravatar
Rob Conery - Thursday, February 28, 2008 - Didn't mean to be quippy :)... apologies. ADO's not complicated; I was referring to OLE DB and the DB's thereof - most people don't use it is all.

I'll sidestep the AS scalability discussion and agree with you - I think you have more exprience then me on that front. I'd never do it as I have brought applications to their knees while waiting on AS to figure out what was going on. Moreover I needed to run impersonation to get it to work right - not something I'm too fond of.

In terms of a LINQ query - it just creates SQL and does a TOP 5. On an indexed table you can beat on it all day.

>>>It doesn't take a lot of expertise, only familiarity.

Which is precisely my point. As they say here: "If Can, Can. If No Can, No Can". 95% of the shops out there don't have expertise in Data Mining or OLAP.

I don't need a DBA or mining wonk to run this query. It's SQL Server and some C# - all done.
Gravatar
Augusto - Friday, February 29, 2008 - I'm not a C# developer but I'm curious, the LINQ SQL like statements are they reserved keyword? Is "from" for example a reserve keyword? What "select" "join". etc?
Gravatar
Jamie MacLennan - Friday, February 29, 2008 - No biggie - if you want to see the SQL Server Data Mining recommendations in action launch the sample at http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=81. You can add movies and get recommendations and see the queries sent to Analysis Services to get the results.

This site runs a CTP of SQL Server 2008 Analysis Services, but the application was written using 2005 (actually the application was written on a beta around 2003, but that's not so relevant). The site's been up almost continuously since 10/2004 with server restarts only for maintenance (like when we changed the machine type of maintenance).

If you want to gain familiarity you can poke around the site for other demos or download the data mining addins for Office from the site as well - these are "no brainer" data mining tools for any user.

Again - I have to disagree with you regarding the "No Can Can" statement - how many developers had "expertise" in LINQ before they tried it. Go ahead - try out the data mining addins - it's hella simpler than you imagine. (and if configuration is an issue, there's an article on the SQLServerDataMining.com site explaining what to do as well as a video tutorial - no excuses...)

-Jamie
Gravatar
Jon Kruger - Friday, February 29, 2008 - OK, I'm curious. What if you took that big long SQL query that LINQ spit out and rewrote it as a stored procedure, and then checked the execution plans of the two at the same time. Which one would be faster, and by how much? Like you said, the LINQ code is much more readable than SQL. Not only that, it's compiled, type safe code.
Gravatar
Joe Chung - Monday, March 03, 2008 - Jamie and Rob, what I would love to see is LINQ to SSAS! It doesn't have to be "either/or"!
Gravatar
SimonTeW - Monday, March 03, 2008 - Rob,

You mention that perhaps a stored proc would be the way to go but that breaking the query up as LINQ made it clearer. Why not do both? Sub-queries in a stored proc, with a comment at the top of each sub-query and suitable formatting (ie indenting), are quite easy to understand.

On the other hand, I've found that many developers have only a very basic knowledge of SQL (for example, the other day I found myself having to explain outer joins to a bloke I consider a C guru). So maybe LINQ will help them.

It's like algrebra at high school. I remember reading Disney comics and Archie and Jughead as a young kid, and "algebra" kept on being mentioned as this terrible torture at school. Obviously a really difficult subject. It wasn't until I had almost finished high school that I discovered the easy stuff we had been doing for years was actually algebra. SQL seems to be the same for many developers I've met. "Ooh SQL. A declarative language. Too difficult to grasp the concept." Give them this really cool new query language called LINQ, though, and I bet they'll be playing around and getting to grips with it in no time.

Cheers
Simon
Gravatar
Bebel - Thursday, March 06, 2008 - Rob,
I am one of the people who usually read every article you post here. I have been using SubSonic for a while now. I have realised you have been writing a little bit more about Linq.
My concern/question is on the furture of SubSonic/Linq. Do you still recommend some of us to still be using SubSonic for new projects? would you rather have us get going using Linq? I know people have asked this before. But allow me to ask again, What role SubSonic will play now that there is Linq? I once read your article when you got hired by Microsoft and I believed it was SubSonic was going to be used by the ASP.NET MVC. However, so far there is no sign of SubSonic in the MVC framework. Do you mind writing about the current role of SubSonic within the ASP.NET team? or any article that will help some of us understand where things are going.
I just needed an expert advice from you.

Thank you!!!

--B.
Gravatar
Pete Hurst - Thursday, March 06, 2008 - @Steven Rogers...

Linq's power is not just in the way it can generate SQL.

You can use Linq to perform operations on *any* collection, not just DB tables.

I've taken 6 or 7 line functions that manipulated collections, and reduced them to a *single line* lambda expression query, using Linq.

The power of this stuff, is that you can write SQL-like code *inline* for ANY purpose, AND have it syntax and type-checked by the compiler.

The thing is, once you're querying any and all collections using this same expressive syntax, why also use a *second* syntax (disconnected from the main flow of your program) by writing SQL code? (Except for more heavyweight processing that is more efficiently run as an SP.)

The more I use it, the more I'm finding out it can do...
Gravatar
Pete Hurst - Thursday, March 06, 2008 - @Bebel (and Rob)

I've been kind of wondering this as well. I migrated my app from SubSonic to LinqToSql, which was a massive overhaul, because they both use fundamentally different conventions. (The worst one is SubSonic's use of constructors for performing database retrieves. That ended up being pretty nasty to refactor...)

Then, when Rob's talking about all the new MVC stuff, SubSonic seems to be strongly integrated.

However, in terms of DAL functionality, Linq has completely replaced SubSonic (IMO). I don't really see where SubSonic fits into the .NET 3.5 / MVC / Linq world.
Gravatar
pommatt - Sunday, March 09, 2008 - can i use subsonic and linq to sql on anther dbms for example Mysql sysbase right!!
Gravatar
Deepak - Tuesday, March 25, 2008 - Its really helpful to new bebbie in LINQ.. gud stuff and nice explanation..

Thanks..