Monday, May 19, 2008 -
As with all projects, you hit roadblocks. And I've hit one head-on!
Linq Happens
I got a great comment on my blog the other day from David Jade (no link...):
It seems to me (from watching SQL Profiler) that LazyList doesn't really work. There seems to be two issues here:
First, from looking at stack traces it seems that somewhere in the depths of Linq to SQL, the LazyList members are being touched which causes the inner IQueryable to be executed while constructing the model objects.
Second, it seems that sometimes a query in a repository that generates a model object that contains LazyList members seems to get wildly complex and pre-load a lot of data which should be lazily loaded. I say sometimes since it seems to happen for the first LazyList model member but not subsequent LazyList model members (in fact if you rearrange the model members, it will switch which one gets pre-loaded). This is not necessary related to using SingleOrDefault() either as it happen when returning IQueryable as well.
I had seen this happening already, but figured that I would tune this out with some perf optimizations in the coming weeks. But my spidey senses started tingling...
Is Perf A Feature?
A lot has been written about Premature Optimization being the Root of All Evil but I think the point that is missed is that if you have something (like I do here) that isn't working at all the way you want, you may want to take a look at it :) in case you start smelling smoke, as I am right now. This is echoed by Robert O'Callahan:
There's a folklore quote "premature optimization is the root of all evil", attributed to Tony Hoare and Donald Knuth...
"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.". Unfortunately --- and I'm not the first to note this --- this advice, taken out of context and followed slavishly, often leads people into deep trouble. The problem is that it collides with another well-known feature of software development: it gets more expensive to make changes to the system later in its development...
If you leave optimization to late in development, then profile it and find that fixing the performance bottleneck requires a major change in your design, or forces major module interface changes, then you have a serious problem
Completely agree with this. I've been there :).
In my case, I'm using a pattern that's not been tested before, and kind of running with scissors. To some people, 100 queries in one call may not be a big deal. To me, well it's chocolate-coated negligence.
I've elected to stop what I'm doing and dive into this mess. And what a mess it is! This is why you haven't seen another screencast, I'm neck deep in code right now, trying to figure out why I can't get my LazyList to work like it looks like it should!
Please Read This Before Providing Your Comment
I probably didn't put this high enough in the post - I know many are on their way down the page here writing an "I told you SO!" and while you might have a point - I'm not giving up yet. There has to be a way - and I don't want to give up yet as I think there's a lot of value in making this work.
This post is my attempt at being perfectly transparent with trying to solve a very aggravating problem (as you'll see). I'd love to hear some thoughts other than "give up" :).
Is This List Lazy Or Not!
In the SQLCatalogRepository, here is the code for grabbing Category and Products:
public IQueryable<Category> GetCategories() { var culturedName = from ct in db.CategoryCultureDetails where ct.Culture.LanguageCode == System. Globalization. CultureInfo. CurrentUICulture.TwoLetterISOLanguageName select new { ct.CategoryName, ct.CategoryID }; return from c in db.Categories join cn in culturedName on c.CategoryID equals cn.CategoryID select new Category { ID = c.CategoryID, Name = cn.CategoryName, ParentID = c.ParentID ?? 0, Products = new LazyList<Product>( from p in GetProducts() join cp in db.Categories_Products on p.ID equals cp.ProductID where cp.CategoryID == c.CategoryID select p ) }; }
You can see the call to LazyList<Product> above, where the IQueryable definition for Product is passed in to the constructor. This is supposed to stay "as a definition" until the List is enumerated - calling the query then.
Here's the code for GetProducts():
public IQueryable<Product> GetProducts() { var cultureDetail = from cd in db.ProductCultureDetails where cd.Culture.LanguageCode == System. Globalization. CultureInfo. CurrentUICulture. TwoLetterISOLanguageName select cd; var result = from p in db.Products join detail in cultureDetail on p.ProductID equals detail.ProductID select new Product { ID = p.ProductID, Name = p.ProductName, Description = detail.Description, ShortDescription = detail.ShortDescription, Price = detail.UnitPrice ?? p.BaseUnitPrice, Manufacturer = p.Manufacturer, ProductCode=p.ProductCode, //reviews Reviews=new LazyList<ProductReview>(from r in GetReviews() where r.ProductID==p.ProductID select r), //images Images=new LazyList<ProductImage>(from i in GetProductImages() where i.ProductID == p.ProductID select i), }; return result; }
You can see I'm doing the same here for Images and Reviews.
What Works, What Doesn't.
If I comment out the call to Reviews and Images (which are just very basic IQeryable definitions), LazyList works perfectly and all is good - one call to the DB:
Uncomment the constructors in GetProducts() for Images and Reviews, and BOOM:
100+ queries. Now it's clear to me that something is tripping IQueryable - but what? I know it's something inside the Product definition. Time to Dive Deeper.
Starting At The Beginning
If LazyList() works, then I should be able to do this without hitting the DB:
ICatalogRepository rep = new SqlCatalogRepository(); LazyList<Product> prods = new LazyList<Product>(rep.GetProducts());
... and indeed that's the case. My Profiler confirms this with a nice blank trace. So now I know LazyList works! I'm just using it wrong somewhere.
If I ask for the count, it should hit the DB:
ICatalogRepository rep = new SqlCatalogRepository(); LazyList<Product> prods = new LazyList<Product>(rep.GetProducts()); int productCount = prods.Count(); Assert.AreEqual(productCount, 51);
... and the Profiler confirms this. Unfortunately it hits it 100 times :). It doesn't do this for Category when it gets enumerated (without Images/Reviews).
Calling Count()
I've traced the problem to Count. Something is calling Count on the property which is tripping the enumerator. I've changed my test to this, so I can be sure that I'm not inadvertantly calling every product:
ICatalogRepository rep = new SqlCatalogRepository(); LazyList<Product> prods = new LazyList<Product>(rep.GetProducts()); Product p = prods[0];
This kicks the enumerator off for my Product list, which in turn populates the list with Products, which then (evidently) need to load every Image/Review. This is wrong.
If I look at the Call Stack for LazyList when this happens - check this out. Count is being called for the ProductReview IList internally, which is firing it's enumerator:
After Count.get() is called, the next step is to the CopyTo() method. If I'm understanding things here, the list items of the Inner list are being copied to the outer (derived) LazyList, which is causing the iteration craziness.
Why Does Category Work? Or Does It...
I found the answer, and it's weird. Add this to your "101 Strange Things About LINQ To Sql" notebook :). Let's take a look at the IQueryable definition for Category again:
select new Category { ID = c.CategoryID, Name = cn.CategoryName, ParentID = c.ParentID ?? 0, Products = new LazyList<Product>( from p in GetProducts() join cp in db.Categories_Products on p.ID equals cp.ProductID where cp.CategoryID == c.CategoryID select p ) };
Notice here that for LazyList<Product>, I'm using a join (because I have to) on a Many to Many relationship. Linq To Sql translates this as a set of nested INNER and LEFT joins (I won't hurt your eyes with the SQL). Just know that this whole thing is filled in ONE SQL CALL. Which is awesome for Linq To Sql, and kudos to the team for that.
The upshot here is that it only APPEARED to work (the Product List not getting filled until I called it)- Linq To SQL just outsmarted me :) and indeed filled the Product list - from one call. If I change that IQueryable join pattern for Products, it wouldn't work at all no doubt.
Perhaps I can do the same for Images/Reviews. If I rework the definition for Product/Images to include a join (which I don't need):
Images=new LazyList<ProductImage> (from i in GetProductImages() join p1 in db.Products on i.ProductID equals p1.ProductID where p1.ProductID == p.ProductID select i),
It works the same way! Only ONE CALL to the DB to load up the Images along with the Product :). Now that's some progress!
BOOM! Not So Fast
If I reset both Images and Reviews to use joins, it stops working again - I guess the join structure was a tad too much for it :). Oh well - it was sort of haackish anyway :).
Summary
The problem appears to be the Count property. More specifically it's the CopyTo. I don't know if there's a way around this as the Count property of the IList property needs to have a number to it.
For the Linq To Sql guys out there - Linq To Sql objects don't have a way around this either. Even if you specify LoadOptions for the Context, they will still be called ONE AT A TIME.
I have ways out of this, to be sure. But I really want LazyList to work and I've asked Ayende to check out what I'm doing. Hopefully he will give me his usual "Oh this is really simple stuff, it surprises me that you don't know this...".
I would really love some ideas here; hit me!
Technorati Tags: aspnetmvc
This is a great chance to refactor with NHibernate 1.2 and take advantage of a more mature ORM with a great community of support and fellow developers.
It's not too late - show us those refactoring skills now of how it is possible to swap out your data layer.
I recommend Spring.NET and NHibernate - mixed in with some NHibernate Query Generator. If your really stuck on LINQ you can get Ayende's Linq over NHibernate.
:)
How will NHibernate solve this problem?
NHibernate has better and more transparent lazy loading (at least for one-to-one relations) and good eager loading options. The Linq provider is not yet at the same level as Linq To SQL though.
I am pretty sure NHibernate can handle what you are trying to do, but it could be kind of tricky if you are new the ICritiera API.
It would be a big refactoring, you probably have to abandon your query repostories and instead use the real/normal repository pattern.
ECO (CapableObjects.com) solves this problem by loading so called object IDs ONLY but not all the objects.
For example, if you have a customer object with number of OrderItems; calling
customer.OrderItems.Count
will execute SQl like this:
SELECT ID FROM OrderItems as i WHERE i.CustomerID = {customerId}
Then Count can easily know the Count :)
Advantages:
1. Lazzy load doesn't load unneeded objects one by one just to "count".
2. Count is available.
3. IDs of related items are already known.
Disadvantage:
1. All IDs are in memory. Sometimes might be a problem with memory (probably with billions of records - ID=8 bytes+internal stuff).
Regards,
Dmitriy.
SteveO - have you actually *USED* LINQ to NHibernate?
@Dmitry: I don't see how any problem is solved here - I want POCOs, not an artifact of some system (like Linq To Sql or SubSonic's ActiveRecord). That is "bleeding" the data access into my model - again not what I want.
Also - in your example, you still have to call that SELECT statement for every order item - that's not lazy loading.
Hi Rob,
=POCOs
If you use any O/R mapping framework you rarely have really POCO objects.
Let's have a look at NHibernate. You should use ISet in your object if you want to have a bag.
EF or Linq2Sql - you have lots of internal stuff there: IPropertyChange(ing) (it's IPOCO), events to handle changed properties and code to set appropriate IDs for association attributes (not event IPOCO already).
And it's another question WHY you really need POCO. It is very unlikely you are going to change data access layer in the middle of the project. So let's not worry about POCOs. Why should we care about it if doesn't make life easier.
=Problem IS solved
If I have followed you correctly you have 100 queries when you call prods.Count(). And that's because of every one product is loaded (one by one) just to calculate Count. This is wrong approach. We didn't ask to load products, but rather get the count.
If you do the same with ECO (prods.Count) it will NOT load products one-by-one, but rather load its IDs.
When you are going to enumerate all of them they will be loaded one by one. And here's lazzy loading.
In this case it is just needed to PRELOAD all of the products.
So we'll end up with 2 queries:
1) prods.Count - SELECT ID FROM ...
2) Preload and enumerate products SELECT Name, Description... FROM ....
= Lazy Loading
You say "you still have to call that SELECT statement for every order item - that's not lazy loading.".
That's exactly lazy loading which is "a design pattern to defer initialization of an object until the point at which it is needed.". Or what is your definition of lazy loading to be on the same page?
The problem with lazy loading is that it should be used with care to avoid 100 SQLs. Preloading should be used in such cases.
My point is that it is just ridiculous that EF loads list elements ONE-BY-ONE to return the COUNT.
It should be smart enough to perform one of these actions:
1) Load list with non-initialized objects to have Count.
2) Load the whole list with ONE SQL.
It's not a rocket science.
I hope I have just missed something.
Cheers.
>>>If you use any O/R mapping framework you rarely have really POCO objects.<<<
True enough :).
>>>And it's another question WHY you really need POCO. It is very unlikely you are going to change data access layer in the middle of the project. So let's not worry about POCOs. Why should we care about it if doesn't make life easier.<<<
Ahh but you miss the point :). My model is not my data access - this is why some people are up in arms over EF; the data access works it's way deeply into your application. Now in many cases I think you're right - no need to make life hard.
But to answer your question - YOU may not want to change your DA, but in 3 years? In my case, I absolutely MUST be agnostic on this. Not only in terms of Data Access, but also storage in general.
>>>In this case it is just needed to PRELOAD all of the products.<<<
An in-memory cache does not solve any problem. Inventory? Concurrency? Discounts/Price changes?
>>>My point is that it is just ridiculous that EF loads list elements ONE-BY-ONE to return the COUNT.<<<
Agreed. But as I'm going to show in my next webcast L2S is actually pretty smart about this stuff :).
RE your Lazy Loading - your definition is correct, your examples are not. I'll shorten this discussion by saying that with SubSonic I've ventured deeply into this. I know something about it :).
Hi Rob,
== Model & Data Access
"My model is not my data access".
I have never said model Model=DAL.
I follow Model Driven Architecture and my interpretation of Model closely equals to Conceptual Model that "captures ideas in a problem domain". That's the whole point of model.
Data Access layer is responsible for (really?) data access :)
That's completely different area. But when you use a model you will probably implicitly access data via DAL.
You understand it like this:
Model (poor objects with no logic) --> Services (logic) --> Repositories (just a proxy) --> Data Access (EF, NH, ECO, SubSonic...) --> Storage (DB).
I prefer to interpret it like this:
Model (with all the logic) --> Data Access (ECO) --> Storage (DB, Memory).
It eliminates unneeded complexity and all your business logic is located in the model. Model is the center of the world. Modeled objects are not POCOs and they interact with DAL through interfaces but have no idea what storage they are working (if they are) against.
== In-Memory Cache
"Inventory? Concurrency? Discounts/Price changes?"
ECO greatly solves these problems with so called subscriptions and other techniques (like optimistic locking).
It notifies all dependent properties, associations, derived attributes if data they rely on has been changed.
You never have to worry about the internals. en.wikipedia.org/.../CodeGear_ECO
== YOU may not want to change your DA, but in 3 years?
I can say that you will rewrite the system using new technologies in 3 year :) With high probability.
== Lazy Loading definition & sample
I believe you are highly experienced in this area.
I was referring to the example:
1. products.Count (SELECT ID FROM ...)
2. products[0] (SELECT Name, Image.. FROM... WHERE ID={0}) - lazy load.
3. products[1] (SELECT Name, Image.. FROM... WHERE ID={1}) - lazy load.
...
k. products[k] (SELECT Name, Image.. FROM... WHERE ID={k}) - lazy load.
This is what I meant.
To avoid k queries it is needed to preload all k Products (single line of code in ECO).