Octopodial Chrome

Stuff that Made Sense at the Time

The Personal Weblog of Bob Uhl


Friday, 04 May 2007

Hyperspatial Text Classification

While reading the docs for CRM114 (a text classification engine; text classification can be used to determine whether an email is spam, whether a log entry is important, or whether a newspaper article is worth reading), I discovered that it supports a hyperspatial classifier. It’s a pretty neat idea: a document is broken into its component features (e.g. phrases and individual words; this step is pretty standard for classifiers); each feature is then hashed to a 32-bit integer value; the document is then considered to be a point in a 2^32-dimensional space, where the value along each dimension is the number of times the corresponding feature occurs: one if the feature is present once, two if twice, and so forth.
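To make the hashing step concrete, here is a minimal sketch in Python. It is not CRM114's code: the feature set (single words plus adjacent word pairs), the use of CRC-32 as the 32-bit hash, and the names features and to_point are all assumptions of mine, picked only for illustration.

```python
import zlib
from collections import Counter

def features(text):
    """Yield single words and adjacent word pairs (an assumed feature set)."""
    words = text.lower().split()
    yield from words
    for first, second in zip(words, words[1:]):
        yield first + " " + second

def to_point(text):
    """Map a document to a sparse point: {32-bit feature hash: count}.

    CRC-32 is only a stand-in for whatever hash a real classifier uses.
    Dimensions that are absent from the mapping have the value zero.
    """
    return Counter(zlib.crc32(f.encode("utf-8")) for f in features(text))

point = to_point("spam spam eggs")
```

Only the dimensions that are actually occupied get stored, so even a long document is a few thousand entries rather than four billion.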

So documents are points in this 4,294,967,296-dimensional space; what does this buy us? Well, imagine that every already-classified document is a star emitting light, and that an unknown-class document is a planet receiving light from all stars. One simply adds up the light each class sheds on the planet (nearer stars are brighter; those farther away are dimmer); whichever class sheds the most light is the class assigned to the document in question.
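The star-and-planet picture can be sketched in a few lines of Python. I am assuming an inverse-square falloff over Euclidean distance here, since that is how real light behaves; the actual brightness function in CRM114 may well differ, and squared_distance, classify and epsilon are hypothetical names of my own.

```python
from collections import defaultdict

def squared_distance(p, q):
    """Squared Euclidean distance between two sparse points (dicts)."""
    keys = p.keys() | q.keys()
    return sum((p.get(k, 0) - q.get(k, 0)) ** 2 for k in keys)

def classify(unknown, labelled, epsilon=1e-9):
    """Return the class whose stars shed the most light on the unknown point.

    labelled is a list of (point, class_name) pairs. The inverse-square
    brightness is an assumption; epsilon avoids dividing by zero when the
    unknown document coincides with a known one.
    """
    light = defaultdict(float)
    for point, label in labelled:
        light[label] += 1.0 / (squared_distance(unknown, point) + epsilon)
    return max(light, key=light.get)

known = [({1: 1, 2: 1}, "text"), ({3: 1, 4: 1}, "html")]
guess = classify({1: 1, 2: 1, 5: 1}, known)
```

In the toy data above, the unknown point shares two features with the "text" star and none with the "html" star, so the "text" class contributes far more light.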

This sounds very complex, but it turns out to be easy to represent and calculate. A document is represented by a sorted list of integers; each integer is the hash of a particular feature; only those features which are present are listed (this saves space since the vast majority of the 4,294,967,296 possible features are absent in any one document). To calculate the difference between two documents, just walk two indices along their lists, keeping track of features found in one, the other or both.
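The two-index walk is essentially the merge step of a merge sort. Here is a rough sketch, assuming each document is already a sorted list of feature hashes; the function name compare and the choice to return three counts are mine, and how those counts are turned into a distance is left open.

```python
def compare(a, b):
    """Walk two sorted hash lists in step.

    Returns (only_a, only_b, both): how many entries appear only in a,
    only in b, or in both lists.
    """
    i = j = 0
    only_a = only_b = both = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            both += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            only_a += 1
            i += 1
        else:
            only_b += 1
            j += 1
    # Whatever is left over in either list has no partner in the other.
    only_a += len(a) - i
    only_b += len(b) - j
    return only_a, only_b, both

counts = compare([1, 3, 5, 9], [3, 4, 9, 10, 11])
```

Each index only ever moves forward, so the comparison takes time proportional to the combined length of the two lists, no matter how many dimensions the space nominally has.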

I’ve already got some working code which I’m training to recognise plain text versus HTML. We’ll see how good I can get it…



