Thursday, September 20, 2007 -
In my last post I talked about some simple approaches to mining text using Full Text Indexing in SQL Server 2005 and LinqToSql . Admittedly, the LinqToSql bits we pretty light and some even asked "why is Linq even in the title?". I'll fix all that with this post as I dive a lot deeper into the analytics behind Text Mining, and also (hopefully) show you some neat tricks you can use when reporting on data like this. Blowin The Lid The techniques I'm about to show you are very much like those that the big players in the Text Mining industry use, without the fluff. Just 4 years ago, you could expect to pay a Business Intelligence company some healthy 6-digit money to come to your work and tell you absolutely nothing. Yep - bold statement, but it's true. Things may have changed since I left the industry, but I can tell you that there was a lot of Snake Oil (on par with SEO) back then. If you've ever been involved with any sort of Business Intelligence software, you know that it's a massive investment. The licenses can be extreme (Cognos costs more than most houses) and moreover, you usually end up contracting with their Professional Services team to "help you get set up". In my last post a commenter asked:
If the market is growing that fast, then why did InXight and ClearForest need to sell themselves, why is Temis for sale? (or so it’s rumored) , why did Nstein change direction?
I don't know the particulars about these companies, but in general companies in a massively rising market "pop and drop" because most of the time they're all promises, and very little substance. In summary form, I can bill myself as a mind reader, and charge you a lot of money to read your mind. In order to do it properly, however, I'll need to consult with you for six months and get to know you. All of these companies have a problem in the "prove it" department. In other words, their clients want these companies to
tell me something I don't know
These BI companies spend a massive amount of time getting their software setup in your office, importing your data, training your employees to use the software, then making up excuses as to why they couldn't give you any answers anyway. It's not even the 80/20 rule - it's more like 90/10! Software can't tell you anything - it only reports back based on things that it's been told. What I hope to show you in this post is how to write an application that will do this, with minimal effort. I'll also show you how you can wow your boss or client by using a tool they know and love - Excel.
What We're Not Going To Cover A lot of text mining software (including SQL Analysis Services new Text Miner) works from the "discovery" point of view - in other words you load up some text and want to "grok" it without prior knowledge (using auto-categorization, etc). There are all kinds of statistical and analytical ways to do this (K-Means, Clustering, etc), but that's a job for Data Warehousing and analytics, and is far beyond what I want to go into here. What I do want to go into is how you can probe text data to find out more about issues you know exist. Things like "why does the engine blow up", "where is the oil leak starting", and "why can't people install SubSonic".
Get On With It! Ok Ok! Sometimes a little explanation can really help. But to keep it concise, I want to focus this post really simplistically: A Problem Statement and A Solution Statment:
Problem Statement: I want to know why people using SubSonic are having problems setting it up
Solution Statement: I am going to use Full Text Indexing and SQL Server 2005 to categorize the thread-starter posts in my forums database. I'm then going to use OpenNLP (covered in Part 1) to parse out sentences and words from these posts and tag them structurally. Using this data, I am going to analyze word-occurrences and relationships to help me find the clues I need to solve this problem.
Step 1: Database Design The design is pretty simple and I think the diagram is pretty self-explanatory. You'll notice that I've renamed and moved things around since the last post - this is just to simplify reading and design, and I also had some better ideas in the intervening days:
Importing the data is much the same as in Part 1, if you'd like to read more about that, you can view it here. The main difference is that I'm not using a Phrases table anymore.
Step 2: Run The Investigation To run the investigation, we need to loop over the investigations in the Investigation table, using the keywords to run our analysis. I currently only have one, which is my problem statement above (why people are having setup issues). I've set the DB up, however, to handle any number of investigations: 
Step 3: The Analysis This is the biggy, this is where we're going to deliver value for our software! The analysis is going to do a number of things for us:
Knowing just these two things will lead us (hopefully) on our path to discovery - or else to redefining our query so we can "ask a better question". Normally it might be tempting to put this all in a database, but since BigWonderfulIdea Inc. (me) has
we're going to put this into code, and not the database: 
Note the "let" keyword in the LINQ query - this is a nifty feature that allows you to explicitly declare a type for the return. If you have issues with var abuse, this is a nice way to be a little clearer with your code. The primary thing to take away from this (very long) method is that we're using in-memory lists a lot. Text Mining can be a very, very memory-intense thing, and there are some things you can do to increase performance:
To make things a little more compact, I've created a helper class to hold the collection of InvestigationResults called "ResultRollup" and added a method to it to help with rolling up counts: 
Step 4: The Result This really didn't take all that much code to do - so let's see some results! Running from the console: 
This took 30ms to run (nice and fast!). But showing this to my Unreasonable Client (me), he asks:
The output is pretty definitive: people are having issues setting up the project against their database. That's wonderful, but why! I want to know why! I want an Oompa Loompa!
Well hrrmph. The good news is that we now have a little more to go on here. The trick is to look at the words where the co-occurring word count matches as closely as possible to the main word count. These words are
In addition, the word "project" and "connection" repeat in this list. It's pretty safe to say that the "why" here is the connection string: people can't get SubSonic to see their database. OK we're getting closer! There's still something else we can do however... When OpenNLP scanned our data, it tagged all the words with their structural types - Nouns, verbs, Adjectives, etc. Initially when I ran the analysis, all I wanted was the "What" - what was causing the setup problem. Now I know - it's connecting SubSonic to the database.
Now I want to know "how", and I can know this by changing the analysis to look at VERBS
This is simple to do, by passing in a different argument to our RunInvestigation() method - setting wordType to "VB" - we now have an output of all the verbs: 
This return isn't as direct as the previous, but again what I want to know here is what the user was doing - "how" did they experienced the connection failing. It might seem obvious in this context (and since we're all geeks), but when chasing other issues - knowing the "how" is critical. Let's see what this list tells me:
So it seems pretty straightforward: Users are having problems connecting to their databases either initially or when they change their connection strings (this could happen when going live for instance). They are finding this out during the build, or when they go to generate their code.
This is great information and is actionable - I can do a number of things now (which I will for the next release of SubSonic actually), including:
Step 5: Results and The Unreasonable Client So now that the demo is done, and my client (me) is happy with the action statement, I find him staring at the report saying "so when do you think you'll have this done?" Don't ever fool yourself - it's all about the UI isn't it :).
When you do any kind of data work like this, you need to be ready to hand over the data, NOT the application.
This doesn't mean and MDF file or Access Database - it means CSV or some other nice way to shove the data into Excel. This presents a "Vegas" problem since data that's in Excel "stays in Excel" for the most part. There is a really nice way to set this application that will make our Unreasonable Client VERY happy, and allow them to work with the data on an ongoing basis, not just read the results on the console or in a web page.
A Quick Tangent: Excel is very under-used by developers. If you think about it, we're used to pumping out some kind of web form or Win App, and we do our best to put in some kind of UI that allows a user to work with the data. Sometimes this makes sense, but if you're building an application that is data-intensive (like this one is), then do yourself a favor and return the results to Excel using the technique I'm about to show you. Your client will shit bricks when they see that not only can they see the data, they can PLAY with the data. Most corporate types absolutely love Excel - and they'll love you too if you give them this option.
The Web Query Most people don't know that you can hook up Excel to a web page (since Office 2000), have it parse out a table, and pull the bits back in so you can work on it like it's any other range of data. We're going to do just this for our new Text Miner. The first thing we need to do is create a web site, and reference our Analytics project with all of our lovely Analytics code in it. Once that's done, we need to work up a simple UI that will allow a user to specify a search phrase, an exception list, and the type of word they want to query on. We then "hook this up" to our code, so that when a user supplies the right information they can see the results: 
Notice the URL - I've set it up so that when you click on the "go" button, you are redirected back to the same page, and all of the data is kept in the query string. This is important! We want to make it very easy for our Excel users to tap into the data and also save a reference to their query. Next, open up Excel (I'm using 2007), and click on "Data" (for pre-2007 it's in the Data menu, under "Web Query") - select "From Web":
This will open up a dialog that looks just like a browser. We then enter the URL of our TextMiner web site. Initially you see the blank page with the prompts we saw above:
Notice the yellow arrows? Excel is parsing out all the "parsable" content that is sees on the page - in other words what it can shove into a cell for you. We don't have any data just yet, so let's enter our search terms like we did before, and click "go":
Our results have come back (in a nice Gridview with no formatting since it messes up Excel's formatting) and I've clicked the yellow arrow next to the table that holds all my data. I click import, and BOOM! I have some data to play with (including charts and graphs, etc): 
The best part about using a Web Query, is that Excel remembers that it's from the web and does NOT do a bulk import of data - rather it's a stored "Query" range, and you can refresh it any time you like! In fact you have Excel refresh the query every time the page loads, or on a set interval - this means it will go back to the web page and make the exact same request and parse the data in the exact same way. To see this - right click any where in the range of sells and selec "Data Range Properties" - there are all sorts of good things in there to play with.
My Client Is Stoked At this point, my Unreasonable Client is very, very stoked since he can do his own data mining, right from inside Excel. He can wonk-out on the results, and work up reports for his boss that will make him look like a champion. I just wish I did this stuff 4 years ago. I might be rich by now :). Have you ever worked with Text Mining? What's your experience been like?