More on Technorati's Desi Blog Ranking

Today is/was Singapore’s 40th Independence Day. Other than fireworks and flags, it meant that it was a holiday. Sitting at home and reading all your comments, I decided to revisit the blog rankings.

My main aim was to clean up the code so it could potentially be useful elsewhere. So I rewrote it to be extensible (geeks: OOPSified it), a bit faster and 99% automatic. I still have to upload the results to the server. :-)

I thought it wouldn’t take more than an hour but it took most of the day. You see, I went and harvested all the links from indianbloggers and sambharmafia and then it turns out, there are some bloggers writing in Hindi and Tamil! Well, I knew it happens, but just didn’t anticipate it in this context. You see, my program couldn’t handle non-English data!

This being the first time I’ve had to deal with language encoding issues, most of the day was spent researching Unicode and coaxing my parsers to respect encodings. At the end of this learning exercise, I present to you the better, bigger and badder Desi Blog Rankings!
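For the curious, here is roughly what "respecting encodings" can look like. This is a minimal Python sketch and not the actual ranking code: the function name `decode_page` and the fallback order (declared encoding, then UTF-8, then Latin-1) are my assumptions for illustration.

```python
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes, trying the declared encoding first, then UTF-8."""
    candidates = [declared] if declared else []
    candidates.append("utf-8")
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown encoding name, or bytes that don't fit it: try the next one.
            continue
    # Last resort: Latin-1 maps every byte to a character, so it never fails,
    # although the text may come out garbled.
    return raw.decode("latin-1")
```

The point is simply to never assume the bytes are English/ASCII: a Hindi or Tamil blog title that arrives as UTF-8 bytes survives even if the page declares the wrong encoding.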

These are built from this list of blog URLs. As of now, it’s got nearly four hundred URLs. Adding new blogs is just a matter of appending to this file and running an update.
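Reading such a list is simple enough. Here is a small Python sketch, assuming a plain text file with one URL per line; the function name `parse_url_list` and the handling of `#` comment lines are my own additions for illustration, not necessarily how the real file works.

```python
from typing import Iterable, List


def parse_url_list(lines: Iterable[str]) -> List[str]:
    """Return the unique, non-empty URLs from a one-URL-per-line listing."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        # Skipping '#' comment lines is an assumption, not part of the
        # original file format.
        if not url or url.startswith("#"):
            continue
        # Drop duplicates but keep the order in which blogs were added.
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Since an open file is just an iterable of lines, you could pass the file object straight in. Dropping duplicates means appending a blog that is already on the list does no harm.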

The URLs in the file have to be exactly the blog’s URL and must not point to some entry in the blog or to the domain root. Only when we query Technorati with the correct URL does it send back the correct information, viz. the blog’s name, rank, links, etc. With an incorrect query URL, Technorati returns just the inbound links. So it is vital that the blog URL be accurate.
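One way to spot a suspect URL is to check whether a result came back with any blog-level details at all. The Python sketch below assumes the result has already been parsed into a dictionary; the keys `name` and `rank` are hypothetical placeholders and not Technorati’s actual field names.

```python
from typing import Any, Mapping

# Hypothetical field names; the real response format may differ.
BLOG_FIELDS = ("name", "rank")


def looks_like_wrong_url(result: Mapping[str, Any]) -> bool:
    """True when a result carries no blog-level details, only link data."""
    return all(result.get(field) in (None, "") for field in BLOG_FIELDS)
```

A result with a name but no rank is not flagged, since a missing rank alone does not prove the URL is wrong.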

Even with an accurate url, Technorati likes to act naughty and we sometimes get incorrect results. For example, the query for India Uncut always fails although the web based query works just fine. I can’t figure out why; I’ve filed a bug with Technorati, let’s see what comes out of it.

Since not all queries yield a cosmos rank, we can’t take that as an accurate quantitative indication of popularity. In general, inbound blogs and inbound links are reported correctly, so they are the best measures at the moment.
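In code, that ordering might look like the Python sketch below. The `BlogStats` class and its field names are made up for illustration, and breaking ties on inbound links is my assumption; the rank is carried along but deliberately left out of the sort key.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlogStats:
    name: str
    inbound_blogs: int
    inbound_links: int
    rank: Optional[int] = None  # not always returned, so not used for ordering


def order_by_inbound(blogs: List[BlogStats]) -> List[BlogStats]:
    """Sort by inbound blogs, then inbound links, both descending."""
    return sorted(
        blogs,
        key=lambda b: (b.inbound_blogs, b.inbound_links),
        reverse=True,
    )
```

A blog with no rank at all still lands in a sensible place, because only the two inbound counts decide the order.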

Also, I am not scraping sites to look for links to RSS/Atom files. If Technorati returns a link, I include it. So don’t ask me to manually add RSS links.

As far as I am concerned, this ranking stuff is feature complete. I’ll probably add detection and marking of inactive blogs but other than that, I can’t think of what else could be done.

Now go blog about it and make it worth my while. ;-)