Prolog and Teaser This is going to be a long post. More than that, this post is going to show you something (if you don't already know it) that could have very easily netted you $20 - $50 Million in funding just 4 years ago (who knows - maybe this stuff might even be worth this today). Text Analysis is a HUGE deal these days: the ability to "divine" knowledge from a large body of text entries is of extreme value to the government (think Homeland Security) as well as private corporations wanting to know what's going on inside and outside their company. From KDNuggets (emphasis mine):
The unstructured data space in Europe is exploding, it is set to be worth $15 billion by 2010 according to industry expert Alessandro Zanasi... Text mining technologies have already kick started in the US and now the trend is starting to gain traction across the globe. But it's the European market where all the eyes are turning right now, with a flurry of acquisitions taking place.
Disclosure I worked for a text mining company for 5 years (3 of them on contract) and my non-compete clause expired 6 months ago.
I want to state clearly, up-front, that NONE of the things I'm going to go into in this post, in any way, discusses "company secrets" or comprimises any confidentiality (enforceable or not) - my integrity wouldn't let me do that :). My former company kept their stuff under tight lock and key - and I never ever saw what they did. Moreover they were all about Open Source and their engine was written in Java. One thing I can say, however, is that their work is largely based on an Open Source project called OpenNLP, which I'll talk about and get into later in this post.
Isn't It All Just Full Text Indexing? It might be tempting to think that some well-written FT queries or Regular Expression wizardry can tell you a lot about your unstructured data (aka "text"). But Text Mining is a lot more than that. It's detective work, slogging through millions and millions of text records to find relatable patterns and derivative words. There is no automation of this stuff - it's investigation with a goal in mind. Used to be this stuff took months - but armed with some cool technology, it can now take weeks or even days. Here's a typical Text Mining scenario:
- Boss asks Joe Miner to find some information on why a certain model of car engine blows up
- Joe Miner gathers all the warranty claim information he can pertaining to the model of car
- Joe Miner performs some intelligent searches to find keywords that describe this issue
- Joe Miner uses these keywords to then search the information, pulling out "tokens" (or nouns) that are associated with these keywords
- Joe then rolls these tokens up in a pareto list, and runs some more analytics (which I'll describe in detail) which associates these words together into a phrase
- Joe pores over these token phrases and speaks to a company engineer who says "why are those words there" and hopefully says "aha! I wonder if [thing x] is causing [thing y] to fail!"
In most cases, a good text mining tool will simply drop clues for engineers or "Knowledge Experts" to geek on - hopefully sparking some insight.
The goal: effectively clean, parse and load your data so that engineers can geek out on it.
Chunks of OpenNLP Love OpenNLP (NLP is "Natural Language Processing") is an Open Source project framework built on Java. It's absolutely massive, and very very cool. In summary, OpenNLP allows you to pass text to it, and it will parse out sentences, phrases, and word "parts" and then tag them with the appropriate structural tag (noun, noun phrase, verb phrase, etc). If I pass OpenNLP the above sentence, it will return this to me:
[NP OpenNLP/NNP (/-LRB- NL/NNP P/NNP ] [VP is/VBZ ] "/`` [NP Natural/JJ Language/JJ Processing/NNP ] "/'' )/-RRB- [VP is/VBZ ] [NP an/DT Open/NNP Source/NN project/NN framework/NN ] [VP built/VBN ] [PP on/IN ] [NP Java/NNP ] ./. [NP It/PRP ] [VP 's/VBZ ] [ADJP absolutely/RB massive/JJ ] ,/, and/CC [VP very/RB very/RB cool/VB ] ./. [PP In/IN ] [NP summary/NN ] ,/, [NP OpenNLP/NNP ] [VP allows/VBZ ] [NP you/PRP ] [VP to/TO pass/VB ] [NP text/NN ] [PP to/TO ] [NP it/PRP ] ,/, and/CC [NP it/PRP ] [VP will/MD parse/VB ] [PRT out/RP ] [NP sentences/NNS ] ,/, [NP phrases/NNS ] ,/, and/CC [NP word/NN "/`` parts/NNS ] "/'' and/CC then/RB [VP tag/VB ] [NP them/PRP ] [PP with/IN ] [NP the/DT appropriate/JJ structural/JJ tag/NN ] [PP (/-LRB- ] [NP noun/NN ] ,/, [NP noun/JJ phrase/NN ] ,/, [NP verb/JJ phrase/NN ] ,/, [NP etc/FW )/-RRB- ] ./.
OpenNLP was ported (sort of) to the .NET framework by Richard Northedge a few years back, and you can see some of the very cool stuff he did with it up at CodeProject.com. The code he has there is released under LGPL.
Let's Mine Some Text One of the hardest parts of any text mining dev job is getting your hands on some relevant data, and you need LOTS of it. Statistics only work if you have a large and relevant source data set so for this example, I'm going to mine the SubSonic Forums to see if I can find some neat stuff to help me solve some problems. The first thing we need to do is go down the road that's well trodden in the analytics world (but surprisingly hard to find for some people!): ETL.
ETL stands for "Extraction, Transformation and Loading" and is the CORE of data analysis.
Extraction refers to pulling the data you want - this is no easy task as many times you don't want it all. A lot of research needs to be done to make sure you know what data you have, and how relevant it is. In my case I don't want every forum post - that won't help me. I only want the ones that describe an issue - the thread starters. So my first step is to extract all the thread-starting posts from the forums into a new database where I can move on to step 2 which is...
Transformation refers to cleaning your data and changing it's structure for reporting (among other things). You might have to add a field that shows weight in pounds, time in GMT, or price in UK Pounds. You will most likely need to validate and transform the text to make sure it contains more than a threshold amount of words, and that all the HTML is stripped and replaced. I'll show you how I did that for the forums below.
Loading refers to adding the data to your analytical system - the Data Warehouse. I won't go into the details here about Warehouses and Data Marts, nor about their structures. Suffice to say that your data will most likely be in a completely different structure when you load it, and you'll have to add some data about the loading itself so you know what you did to it and when. In my case I'm going to load the forums into a bland table named "TextEntry" (more on that below).
Extracting the Forum Data Companies like Cognos, DataObjects, and Microsoft sell ETL tools that can automate this stuff for you. The good old COM workhorse Microsoft DTS (Data Transformation Services) usually was able to do 90% of what's needed, but if you're not working with SQL Server, you were sort of stuck. Since my needs are pretty simple, I'm going to write my own ETL tool. I should caution you that this is NEVER a good idea since there are things about ETL that you probably don't know, and by the time you find out it will be too late. Transactions, scheduled loading, timestamping, legacy fields, scripted transforms - all of these things are handled nicely by DTS and SSIS (which some people rightfully hate) so before you follow me off the cliff - make sure you look at the Microsoft Business Intelligence Stuff. That said, all I really want is a Pepsi so I'm going to use LinqToSql to push one set of data into my DB.
- The first thing I do is setup a view on my forums that only shows me the Thread primary key, the post data, the subject, and the text. That's all i want.
- Next, I need to create the structure of my TextMining DB, which I'm calling TextMiner:

- OpenNLP will do a few things of interest for me: it will parse my forum posts into sentences, so I want to keep those intact as sentences. Next, it will parse out phrases (noun phrases, verb phrases, etc) and tag them with the type - I'll want these to. Finally, it will parse out each word and tag it with it's structural type (noun, verb, adjective, adverb, etc). I want all the words for sure!
- Next, I work up some Full Text indexing (which is available on SQL Express - this entire example was built on Express) on TextEntrySentence and base in on SentenceText (the un-OpenNLP'd sentence).
Note that I've added a table called "NoiseWords" - this table has 500 "standard" noise words such as "me, you, we, like, the" etc. I'm going to make sure that no words are saved that fall in that list. Finally, I add in two LinqToSql Classes - one for the forums, the other for the TextMiner DB and I'm ready to write some code!
Transforming the Forum Data
I'll outline what I'm going to do here, then I'll show you some code:
- I need to get all the thread starters from the forum
- then I need to "Chunk" them using OpenNLP - "Chunking" is the parsing process that tags the data for me
- Once each post is Chunked, I'm going to run some more parsing to lay out the sentences, phrases and words
- I'll validate the words using some custom code and then save it down.
Import Forums:
//get the data from the forums and load it up!
static void ImportText()
{
//load up all the threads from the Forums DB
Forums.ForumsDB db = new Forums.ForumsDB();
TextMiner.TextMinerDB textDB = new TextMiner.TextMinerDB();
//remove the existing data first
DeleteTextEntries();
Console.WriteLine("Getting thread list...");
var threadResult = from threads in db.ThreadViews
select threads;
foreach (var thread in threadResult)
{
//load up a new text blob
TextMiner.Data.TextEntry entry = new TextMiner.Data.TextEntry();
Console.WriteLine("Loading Thread " + thread.ThreadID.ToString()+": "+thread.Subject);
entry.DateEntered = DateTime.Now;
entry.Descriptor = thread.Resolution;
entry.ReferenceKey = thread.ThreadID.ToString();
entry.Subject = thread.Subject;
entry.TextBlob = thread.PostText;
entry.TextDate = thread.CreatedOn;
textDB.TextEntries.Add(entry);
}
//save it!
Console.WriteLine("Saving changes to DB... This could take a minute or twelve...");
textDB.SubmitChanges();
Console.WriteLine("Done!");
Console.ReadLine();
}
//remove entries is prep for load
static void DeleteTextEntries()
{
TextMiner.TextMinerDB textDB = new TextMiner.TextMinerDB();
Console.WriteLine("Deleting all text entries");
var allTextEntries = textDB.TextEntries;
textDB.TextEntries.RemoveAll(allTextEntries);
}
Next, parse and load the Sentences/Phrases and Words. This is a lot of code but LinqToSql saves us some nice time:
static void ParsePhrases()
{
//run the chunker on each Text Entry
TextMiner.TextMinerDB textDB = new TextMiner.TextMinerDB();
Console.WriteLine("Cleaning out all phrases, sentences and words");
//clean up the DB
textDB.CleanWordsAndPhrases();
Console.WriteLine("Pulling all text entries");
var result = from entries in textDB.TextEntries
select entries;
int textCounter = 1;
int recordCount=0;
//loop all forum threads...
foreach (var item in result)
{
if(recordCount==0)
recordCount=result.Count();
//let the user know what's going on...
if (item.TextBlob.Length > 9)
{
Console.WriteLine("Chunking "+textCounter.ToString()+" of "+
recordCount.ToString()+" - " + item.TextBlob.Substring(0, 10) + "...");
}
else
{
Console.WriteLine("Chunking " + textCounter.ToString() + " of " +
recordCount.ToString() + " - " + item.TextBlob);
}
//chunk the data using OpenNLP from Richard Northedge
string chunkedBits = TextMiner.Parser.Sentence.ChunkIt(item.TextBlob);
string[] cleanSentences = item.TextBlob.Split(new char[] { '.' },
StringSplitOptions.RemoveEmptyEntries);
//the first thing the chunker does is divide out sentences
//this will be helpful for reporting later, to identify what
//the FullText engine thought was important
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(@"(?<=^s).*?(?=.)", options);
MatchCollection sentences = regex.Matches(chunkedBits);
int thisIndex = 0;
foreach(Match m in sentences){
TextMiner.Data.TextEntrySentence sentence = new TextMiner.Data.TextEntrySentence();
sentence.TextEntryID = item.TextEntryID;
//if it fails, swalow it - we're more interested in the words anyway
try
{
sentence.SentenceText = cleanSentences[thisIndex];
}
catch
{
}
sentence.RawSentenceText = m.Value;
//add the new Sentence in
textDB.TextEntrySentences.Add(sentence);
//save it so we can get the ID
textDB.SubmitChanges();
//now pull out the parsed stuff, and store each bit as a phrase
Regex regPhrases = new Regex(@"(?<=[).*?(?=])");
MatchCollection phrases = regPhrases.Matches(m.Value);
//the parsed bits look like this:
//[VP will/MD be/VB scrambling/VBG ]
//so all the notation in between is what we need
//split on spaces, the first telling us what type of phrase it is
foreach (Match phraseMatch in phrases)
{
//split out the phrases bases on spacing
string[] phraseParts = phraseMatch.Value.Split(new char[] { ' ' },
StringSplitOptions.RemoveEmptyEntries);
if (phraseParts.Length > 1)
{
TextMiner.Data.TextEntryPhrase phrase = new TextMiner.Data.TextEntryPhrase();
//parse back the phrase elements so we have a clean, unmarked phrase
//this is for reference
StringBuilder sbPhrase = new StringBuilder();
foreach (string word in phraseParts)
{
if (word.Contains("/"))
{
sbPhrase.Append(word.Substring(0, word.IndexOf("/")).Trim() + " ");
}
}
phrase.PhraseType = phraseParts[0];
phrase.PhraseText = sbPhrase.ToString().Trim();
phrase.RawPhraseText = phraseMatch.Value.Trim();
phrase.SentenceID = sentence.SentenceID;
textDB.TextEntryPhrases.Add(phrase);
//save it now so we have a new ID to work with
textDB.SubmitChanges();
//finally, save the words
foreach (string word in phraseParts)
{
TextMiner.Data.TextEntryWord w = new TextMiner.Data.TextEntryWord();
//split this on "/"
string[] wordParts = word.Split('/');
//make sure this is a word/TYPE split
if (wordParts.Length ==2)
{
//validate it - we don't want garbage...
string theWord = wordParts[0];
//strip the HTML
theWord = StripHTML(theWord);
//make sure it's valid and not NOISE
if (IsValidWord(theWord))
{
w.PhraseID = phrase.PhraseID;
w.WordText = wordParts[0];
w.WordType = wordParts[1];
textDB.TextEntryWords.Add(w);
}
}
}
//save down the words
textDB.SubmitChanges();
}
}
thisIndex++;
}
textCounter++;
}
Console.WriteLine("Done!");
Console.ReadLine();
}
Here are the validators:
#region Validators and Cleaners
//static holder for all noise words
static List<TextMiner.Data.NoiseWord> NoiseWords;
//Singleton NoiseWord loader
static List<TextMiner.Data.NoiseWord> GetNoiseWords(){
if (NoiseWords == null)
{
TextMiner.TextMinerDB textDB = new TextMiner.TextMinerDB();
var result = from noiseWords in textDB.NoiseWords
select noiseWords;
NoiseWords = result.ToList<TextMiner.Data.NoiseWord>();
}
return NoiseWords;
}
//found in SubSonic if you need it
static string StripHTML(string sWord)
{
string pattern = @"<(.|n)*?>";
string sOut = Regex.Replace(sWord, pattern, "");
sOut = sOut.Replace(" ", "");
sOut = sOut.Replace("&", "&");
sOut = sOut.Replace(">", ">");
sOut = sOut.Replace("<", "<");
return sOut;
}
//tests the word to make sure we load
static bool IsValidWord(string sWord){
bool result = true;
//make sure the length is more that 2 chars
result = sWord.Length > 2;
if (result)
{
//make sure it's not a noise word
foreach (var word in GetNoiseWords())
{
if (word.Word.ToLower().Trim().Equals(sWord.ToLower().Trim()))
{
result = false;
break;
}
}
}
return result;
}
#endregion
Running this code takes some time since the Chunker isn't very fast. For 2000 records it took about 6 hours on my PC. This will vary with the text you're using.
Mining the Data
To mine this data, I'm going to use SQL Server's Full Text Index to help me wrap some context here. This is where analytics goes from the Geek's office to the Board Room - defining what you're looking for is absolutely critical. If you don't ask the right question, you won't get a relevant answer. I can't emphasize this enough! In my case, let's pretend I've handed out a survey to people who've decided to ditch SubSonic and go with another ORM tool like NHibernate. I want to know why - so I ask a few questions and some nice people take the time to tell me and (let's pretend) the answer is pretty blunt: installing it sucks.
So here's our problem statement that we're going to work from: "Installing SubSonic Sucks - find out why" (it's actually quite easy - i'm making this up).
Normally, since I read the forums daily, I know what's going on with SubSonic. But for fun let's pretend that SubSonic has turned into a multi-billion dollar enterprise and has 20 million users and a forum with 100s of millions of posts (Eeggggggzzelleeeent). The first thing I need to do is to define some keywords that will help me nail posts regarding SubSonic installation problems. The order of the words in the query is important, as is the choice of words. Since I'm using Full Text Indexing, I need to think in terms of the words offered by users. In this case most users won't say "install" since that's not something you do necessarily - instead I'll focus on the word "setup". But I'm not going to read all the posts (there could be thousands!) that this query returns - instead what I want to read are the words and word associations for each post as well as the relative score those words produced. This is why I parsed out all the words to begin with. If a word keeps coming up in my word list, then maybe it will help me to solve my installation problems. To make this all happen in one step, I'm going to create a Table-valued Function in SQL that handles the Full Text query. I'm then going to use that table value in a roll-up query on the TextEntryWord table. Here's the function:
ALTER FUNCTION dbo.GetSentenceFullText ( @query nvarchar(255) ) RETURNS @table TABLE (pk int, score int) AS BEGIN INSERT INTO @table (pk,score) SELECT TextEntrySentence.SentenceID,RANK ---my indexed table FROM TextEntrySentence, ---I can use CONTAINS here but this is fine for now FREETEXTTABLE(TextEntrySentence,*,@query) FT WHERE FT.[KEY]=TextEntrySentence.SentenceID ---only want rank 50 or above AND RANK>50 ORDER BY FT.RANK DESC RETURN END
And here's the query that I'll use the function in:
SELECT TOP (20).TextEntryWord.WordText, TextEntryWord.WordType, COUNT(dbo.TextEntryWord.WordID) AS WordCount, SUM(GetSentenceFullText_1.score) AS Relevance FROM TextEntryWord INNER JOIN TextEntryPhrase ON dbo.TextEntryWord.PhraseID = dbo.TextEntryPhrase.PhraseID INNER JOIN TextEntrySentence ON dbo.TextEntryPhrase.SentenceID = dbo.TextEntrySentence.SentenceID INNER JOIN GetSentenceFullText('setup problem') AS GetSentenceFullText_1 ON dbo.TextEntrySentence.SentenceID = GetSentenceFullText_1.pk WHERE (NOT (dbo.TextEntryWord.WordText LIKE N'subsonic%')) AND (NOT (TextEntryWord.WordText LIKE N'problem%'))AND (NOT (TextEntryWord.WordText LIKE N'setup%')) GROUP BY TextEntryWord.WordText, dbo.TextEntryWord.WordType HAVING (TextEntryWord.WordType = N'NN') ORDER BY WordCount DESC, [Relevance] DESC
Notice that in this query I need to make sure the words returned aren't part of the original query - in other words I did a Full Text rollup on the words "setup" and "problem" - I don't want these two words to be tops in my list, so I make sure to filter them out. I also remove "SubSonic" since it shows up in here too. Also - I want to see only nouns, which are encoded NN - since that will tell me more specific issues 
Looking at this result set, on it's own, doesn't really suggest too much. However if you put on your Sherlock hat and "muse" over these words, you can begin to see a larger picture. The words "database", "web", "file", "client", and "error" all suggest something pretty specific: Connection Strings!
And indeed - connection string issues are the NUMBER ONE installation support issue we have, so it looks like my report is taking me somewhere.
Let's see if we can get some additional context here. If I adjust the query a bit to return the top Phrases instead of the Word Counts (just remove the COUNT roll-up and add in TextEntryPhrase.PhraseText to the result set) - here's what we get
: 
You'll notice that my cleaning routines are failing me a little - but that's OK (normally it's not - you'd have to do a full rerun). It's difficult in this situation because users post a lot of code and my simple scrubber routines aren't catching it all... Anyway- this is beginning to confirm the intial finding about connection string issues. But there's more here - the entry for "new Windows App" appeared pretty high in the list there and indeed -
setting up non Website Project (i.e. Web Application Projects and Windows Apps) is NUMBER TWO on our list of support issues for setup :). Hey this stuff's working! Looks like I better get to writing up those docs a bit!
Moving Forward and Automation
Now that I have a handle on "why installation sucks", I can do something about it and setup automatic tracking of this in my reporting application. I'll want to know how this trends from this point on (as I write more docs and try to make the install procedure easier). Instead of writing this query every time I want to investigate something, I can define a table in the database that will hold my "Definitions" for me. I don't want to call this table "Definition" because, well, that doesn't mean anything :) so instead I'm going to put my reporting hat on and think of how I'd like to read this in a report. The context of these definitions is an "Issue", and what I'm going to define are Issue Categories: "Setup Issues" e.g. I don't want to use the word "Issue" though, since I might want to bucket something more positive - so I'll just call the table "Category" - which makes perfect sense. This table is pretty simple - just a name and a set of keywords that we'll use in reporting later. In addition, I'm going to add a mapping table (many to many) to the Sentences table so I can keep track of the results of the Full Text query. I'm adding a column to this mapping table called "Score" that will track the RANK score returned by the Full Text Query as well: 
Finally, whenever I run a Extraction/Transform/Load, I can call a Stored Procedure that will "bucket" this stuff for me:
CREATE PROCEDURE [dbo].[TagSentenceCategories] ( @categoryID int ) AS DELETE FROM Sentence_Category_Map WHERE categoryID=@categoryID Declare @query nvarchar(1500) SELECT @query=keywords FROM Category WHERE CategoryID=@CategoryID INSERT INTO Sentence_Category_Map (CategoryID,SentenceID,score) SELECT @categoryID, TextEntrySentence.SentenceID, RANK FROM TextEntrySentence, FREETEXTTABLE(TextEntrySentence,*,@query) FT WHERE FT.[KEY]=TextEntrySentence.SentenceID AND RANK>50 ORDER BY FT.RANK DESC RETURN RETURN
By doing it this way, we've saved our problem definitions into the database so we can now track them over time. You'll notice, I'm sure, that I'm wholesale deleting out all the many to many data in the MAP table every time this query is run - that's not scalable nor is it smart - but for now this works as I need it to. I'm going to need to change this when I create my Data Warehouse, but my boss hasn't told me to do that yet - when he does I'll be sure to come back to this blog and read Part 2!
Part 1 Summary
We're just getting off the ground with Text Mining with this blog post, but hopefully you're starting to see the potential using tools like OpenNLP and Full Text Indexing. In the next part of this series, I'll dive into a Warehouse structure and how you can start to track these things over time using some Data Warehousing tricks and Text Mining investigative techniques. In addition, I'll show you some OLAP tricks to do "unguided" exploration with pivot tables using the best, most powerful, and most overlooked data analysis tool out there: Excel.
Corporations pay consultants thousands and thousands of dollars to try and figure this stuff out :) and it's not really that difficult. It just takes the right tools, some time, and a willingness for a geek like me to blog about it :). If you have anything in particular you'd like to see in the next post - let me know. I'll also be sure to include a lot more LinqToSql stuff :).
