What’s it all about, Alfie

Posted on 04/12/09 | in ideas, news, play

I’ve just launched a new tool at Hatmandu.net: a text content and keyword analyser. In theory it’s useful for search engine optimisation, but it can also give you the general gist of a text. From the notes:

This text content and keyword analyser is intended to give a more precise indication of a text’s most important words than other tools available. Most keyword analysers use simple word frequency (which is also shown here anyway), but that doesn’t relate the specific text to the language in general – common terms such as ‘people’ and ‘time’, for example, appear in many documents, but do not necessarily indicate the essence of the particular text being analysed. This analyser uses the TF-IDF statistical method to relate the frequencies of words in the specific text to their general frequencies in the British National Corpus. I am indebted to Adam Kilgarriff’s version of the BNC, which I have adapted considerably for this tool. This analyser mainly uses the nouns in the BNC, on the basis that these are the parts of speech that best indicate the subject matter of a text. (At some point I hope to produce a version using an American English corpus, though I’d be surprised if the results were very different.)
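For the curious, here is a minimal Python sketch of the general idea, not the code the tool actually runs. It assumes a reference table of word frequencies per million words; the `REFERENCE_PER_MILLION` figures below are invented for illustration rather than taken from the BNC, and the function names and the `DEFAULT_PER_MILLION` floor for unlisted words are likewise my own assumptions. Each word's share of the text is multiplied by the log of its rarity in the reference table, which is one common TF-IDF-style weighting among several.

```python
import math
import re
from collections import Counter

# Hypothetical reference frequencies (occurrences per million words).
# These numbers are made up for illustration; they are NOT real BNC figures.
REFERENCE_PER_MILLION = {
    "people": 1200.0,
    "time": 1500.0,
    "bicycle": 10.0,
    "ride": 30.0,
}
DEFAULT_PER_MILLION = 1.0  # assumed floor for words missing from the table


def tokenize(text):
    """Lower-case the text and return runs of letters (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_scores(text, reference=REFERENCE_PER_MILLION):
    """Score each word as (share of the text) * log(rarity in the reference)."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words)
    scores = {}
    for word, count in counts.items():
        tf = count / total  # term frequency within this text
        per_million = reference.get(word, DEFAULT_PER_MILLION)
        rarity = math.log(1_000_000 / per_million)  # IDF-like weight
        scores[word] = tf * rarity
    return scores


def top_keywords(text, n=5):
    """Return the n highest-scoring words, best first."""
    scores = keyword_scores(text)
    return sorted(scores, key=scores.get, reverse=True)[:n]


sample = "people ride bicycle bicycle time people"
print(top_keywords(sample, 3))
```

In the sample, ‘people’ and ‘bicycle’ each appear twice, but ‘bicycle’ scores higher because it is much rarer in the (made-up) reference table. Note that in this sketch any word missing from the table falls back to the assumed floor and so scores very highly, which is one reason a real tool needs a carefully prepared reference list.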

It works with Twitter accounts (though it only reads the last 200 tweets, which may not form a usefully large body of text), and with URLs where my humble scraping tool is able to extract the text successfully. Most useful is the ‘paste text’ field, which will accept up to 1MB of text (about 200,000 words), so it will analyse entire books if desired. LiveJournal users can enter their URL (http://username.livejournal.com), assuming their account is public.

It’s a bit experimental at the moment, but hopefully might migrate from ‘possibly fun’ to ‘possibly useful’ in due course!

2 Comments on “What’s it all about, Alfie”

  1. Pete G Says:

    Interesting. It’s almost surprising nobody’s done this before.

    Immediate observation on the Twitter version is that it ought to filter out usernames, or at the very least the target’s own username (in one I just tried, there were 20 occurrences of the target username, with just 3 of the next most common word).

    How do you weight new coinages?

    I guess the most obvious extension is to include bigrams. Does the form you have the BNC in allow you to obtain reference frequencies for those, too? Even at its simplest, this would allow you to distinguish between “bike ride” and “fairground ride”, which are obviously different keywords. More generally, you can get useful information by looking at clusters of words. As something to aspire to, what would it take to correctly identify the keyword ‘can’ for http://en.wikipedia.org/wiki/Can_%28band%29 ? (Incidentally, I’m curious about how your current lexer decides that ;· is a word…)

  2. hatmandu Says:

    Hi, and thanks for the comments.

    Twitter usernames: yes, others have raised this. I probably will do that (though in a sense the commonness of other people’s usernames shows how important they are to the writer, I guess). To be honest, the Twitter option of this is not ideal anyway as the text samples aren’t huge (up to 200 tweets) – but it helps spread the word!

    New coinage: um, good point. The BNC isn’t massively current. There are internet-based corpora, so maybe for longevity one of those might ultimately be better.

    I’d like to implement something with bigrams in the future, as and when I get time. Definitely agree they provide more sophisticated analysis. This current tool is inevitably rough and ready.

    Oh, and ‘;•’ – yeah, somehow that’s slipped through the net, and I’m going to zap it!
