NoSQL In The Wild: How We Do It at Tekpub
- Wednesday, May 19, 2010
I've received a flood of emails since we launched the revamped Tekpub site (using Rails and MongoDb) a few months back. Many people have been surpised to find out that we use both MongoDb *and* MySQL. Here's the story of how we did it and why.
Let's Get This Out Of The Way First
Some people might read this as "Rob's defecting" or "Rob's a 'Rails Guy' now". If you're looking for drama and sassy posts on The One True Framework - there's a site for that. Our choice to move to Rails/NoSQL was largely financial - you can read more about that here. What follows can easily be translated to whatever framework you work with - assuming a Mongo driver exists for it.
We split out the duties of our persistence story cleanly into two camps: reports of things we needed to know to make decisions for the business and data users needed to use our site. Ironically these two different ways of storing data have guided us to do what's natural: put the application data into a high-read, high-availability environment (MongoDb) - put the historical, reporting data into a system that is built to answer questions: a relational data store.
It's remarkable at how well this all fits together into a natural scheme. The high-read stuff (account info, productions and episode info) is perfect for a "right now" kind of thing like MongoDb. The "what happened yesterday" stuff is perfect for a relational system. There are other benefits as well and one of them is a very big thing for us is
We don't want to run reports on our live server. You don't know how long they will take - nor what indexing they will work over (limiting the site's perf). Bad. Bad-bad.
The Verdict It works perfectly. I could *not* be happier with our setup. It's incredibly low maintenance, we can back it up like any other solution, and we have the data we need when we need it.
The actors in this play are making their way to the stage now... as the curtain rises...
- Ruby on Rails 2.3.5
- Ruby 1.8.7
- MongoMapper gem from John Nunemaker
- MySQL 5.1
- DataMapper gem
- ... and of course MongoDb
The server that we're running in Ubuntu 9.1 up on Amazon EC2 - large instance with prepay.
The Application Theater
If you look over Tekpub you'll notice it's not terribly deep, in terms of "programmable concepts". We have all of 6 classes that we store as documents in MongoDb:
- Account - the thing with the subscription info and who you are
- Coupon - we hand these out from time to time
- Production - one of our core classes: "Mastering ASP.NET MVC" for instance
- Episode - a Production can have one or more of these: "Episode 6: The View", for instance
- Order - this is a "shopping cart" if you will, not a historical record
- Payment - a temporary holder class for pings from PayPal (yes, I know we need to get away from them)
Classes using ActiveRecord are typically spartan - letting the database define their structure. For Java/C# devs this might be weird - there are no properties:
class Production < ActiveRecord::Base #validators, etc end
With MongoMapper, this is done differently - you specify the keys explicitly:
class Production include MongoMapper::Document key :title, String, :required => true key :slug, String, :unique => true, :required => true, :index => true key :description, String key :notes, String key :price, BigDecimal, :numeric => true key :released_at, Date, :default => Date.today key :default_height, String, :default => '600' key :default_width, String, :default => '1000' key :quotes, String #royalty info key :producers, String timestamps! end
For a lot of .NET/Java devs this will look "messy" - you shouldn't elevate "data concerns" into your model. This argument makes good sense for a large, complex site - that you're building in C# or Java. Typically Ruby focuses on the straight, narrow path and with that comes a dramatic turn towards "doing what you need to do... and no more". This resonates with me - and moreover if I had to change later on it's just a function of changing a few bits of text (you'll see that in a second)
If you wanted to do the same thing with C#/.NET you can - but you don't need to use specific wording to get it to work. If you use NoRM (the MongoDb driver that I lended a hand to) - it translates simple objects into documents, so you can rest easy :).
The Reporting Theater
We use DataMapper for our access to our relational system: MySQL 5.1. We thought about going with PostGres but decided we (James and myself) knew MySQL pretty well and since it wasn't doing any serious heavy lifting - we could just leave it at that.
DataMapper is appealing to me as I hate migrations. Rails has such a fluid flow to it - from testing with Cucumber, tweaking code, altering your views a bit - all of it is so fluid and simple that when it comes time to alter your database... well it comes to a screetching halt.
DataMapper solves this by a handy feature called "auto_migrate". You can fire this into your console, or if you're using Vim:
The classes look a lot like MongoMapper (indeed I think they were the inspiration - given the "Mapper" suffix):
class OrderLog include DataMapper::Resource property :id, Serial property :order_number, String, :required => true property :total, BigDecimal, :precision => 10, :scale => 2 property :paypal_user_name, String property :transaction_id, String, :required => true property :slug, String property :month, Integer property :year, Integer property :created_at, Date, :required => true, :default => Date.today property :user_name, String, :length => 500 property :gift_code, String property :description, String, :length => 1200, :required => true end
"Serial" is a way to tell DataMapper to make this an Identity column (with auto-increment) - also notice the specific instructions given for how the data should be stored. I tried to do this with SubSonic (SimpleRepository) and I got pretty close to this functionality.
Something to note here is the denormalization. People familiar with reporting and data warehouses are used to seeing columns spread out and denormalized - the feeling here is that the relational system already dealt with the ACID stuff, and the ETL process (extraction, transformation, and loading) process squeezed out bad or invalid data. There's no reason to keep things in a relational structure when they're living inside an analytical one - which we are now in.
Specifically: note the "month" and "year" breakouts. This is very common in an OLAP scenario - some engines are smart enough to parse this for you, but on larger warehouses if you take the time to split these during ETL, you'll save your self a couple of hours during the processing of the cube.
In fact right there is the whole reason for denormalizing analytical systems: speed. People hate waiting for business reports and that waiting can cost you money. This doesn't apply to us right now - but you never know. It's a solid plan, so I went with it:
Each reporting table should answer one or more questions on its own. If you need a join, you're doing it wrong - default to denormalization.
I know that will raise some eyebrows. To fully grok this you'll need to understand that you're writing a historical record here - a "history book" of your application that focuses on a single "aspect" of the story. In the same way that the books in your school overlapped and discussed the same subjects - your data will overlap (in a denormalized way). It's OK - this stuff doesn't (and shouldn't) change.
More specifically: Orders are orders. If you had a relational system and changed the category of a product - it would change the category of the order. That's not accurate, historically speaking. The user bought that product when it was categoryX - not categoryY and even if you *think* that it should have been categoryY - it wasn't. The good news is that if you have to change this, you can - it's just some update queries.
Deciding What Goes Where
This part might seem difficult - but it can be solved with some simple questions. Questions you should be asking yourself and your client right up front:
- How will the client know the site's working (from a business standpoint)?
- How will the client know if I addressed the business goal?
- How will the client know if users like the site?
- How can I help the client justify their changes in the future?
These questions should be asked right up front - they will pull you to the goal line. For instance - let's take the first question and apply it to Tekpub:
How do we know if Tekpub is providing users with the value we pitched to them?
Putting our business hats on we could answer that in a number of ways:
- Subscribers are signing up
- People are buying our productions
- Subscribers aren't cancelling
So how do we answer these questions in a Relational DB way? Good question! Pulling these notions apart, they resolve down to some specific technical needs:
- A table that tracks subscription "movement" - signups, cancellations, what subscription, coupons/discount, and how much was paid with the dates of each.
- A table that tracks what was sold, when, for how much, discount, coupon used, and dates
This is enough to get rolling. I have 2 tables I need to create: subscription_log and production_sale - so I open up DataMapper:
class SubscriptionLog include DataMapper::Resource property :id, Serial property :user_name, String, :length => 600 property :created_at, Date property :counter, Integer property :description, String property :month, Integer property :year, Integer property :sub_type, String end class ProductionSale include DataMapper::Resource property :id, Serial property :slug, String property :user_name, String, :length => 500 property :title, String property :amount, BigDecimal, :precision => 10, :scale => 2 property :discount, BigDecimal, :precision => 10, :scale => 2 property :created_at, Date property :month, Integer property :year, Integer end
And there you have it. Running auto_migrate will push these tables to the database - and I'm ready to start pushing data in.
Hooking Up The Reporting Bits
There are a couple of approaches to this - and it depends how much traffic and activity your site gets. Given the example above, I have 2 choices:
- Add an entry to the subscription_log on each signup/cancellation
- Add an entry to the production_sale table on each sale (or refund)
- Run a "scraper" to pull the data on a periodic basis
If you sell a lot of stuff then it might not make sense to add a couple extra calls to the process, slowing things down. On the other hand - the reports are constantly up to date.
If you're patient (unlike me) - you could run these jobs periodically - nightly if you like - and load up your tables in a detached process. This is completely up to you.
One thing you can do with EF/LinqToSQL if you're using .NET is to override the partial methods OnSave and write out to various logs that way. This is a handy way of doing it as it keeps the hooks on your model - not in your Controller or service code.
There are a ton of ways to pull this information out and show it to clients. Here are my top choices:
- Use Excel. A lot of people don't know that they can hook Excel right up to SQL Server, or to a web page! Web Reports are a hidden gem, and you can output your data to a web page (in a table) and Excel will read that right in. It will also update it when you ask it to. Your clients can then do all kinds of fun analytics with it - without asking you for specific reports. They can even run pivots!
- Dedicated Reporting App If you use SQL Server you can hook up, for free, SQL Reporting Services (which comes with SQL Express). You'll have to write the reports yourself - but it's pretty simple. There are others out there - Telerik also has a set of tools.
- Homespun. There are ton of free charting apps out there - including Google's chart control and Open Chart - a flash-based chart builder. This might be the simplest thing to do if you need to present your data to your client in specific ways. Be forewarned, however, that you need to be able to add some interactivity here - working with data will spawn questions, and keep you on the hook for constant report writing.
- Bootstrap. If you're in a startup (like Tekpub) and you need some reports run - it's often easier to just kick up a SQL tool and write it by hand - saving the query and offloading it to CSV or just printing out the results. James and I have done screenshares over Skype to scan over the data - both asking questions, both firing up the SQL (seeing who can run the query fastest - James always uses the designer surface).
If you're wanting to see examples of this in ASP.NET MVC/C# - the Tekpub ASP.NET MVC starter site is divided out exactly this way:
- an ISession specifically for high-read queries
- an ISession specifically for reporting
- an ISession specifically for Unit of Work
I'll be updating it in a week or so to add some EF love to it and also an InMemorySession you can use for testing.
Understanding that your application data story is not a consistent, homogenous thing will save you some serious pain as your app grows. You will inevitably need to add perf tweaks here and there - do that with a dedicated agent. You might also (well... actually you *will*) need to send data off to a separate reporting system at some point. You could do that on a systems level - using replication and painful vbscripting transformations - or you could ask your app to "drop the data off on the way home from work".
It's a bit of a shift from what we're used to - but it's working for me, in the wild, on a site that my livelihood depends on.