Web publishing, online research, stats, webmining and search engines.

Monday, August 28, 2006

FactBites getting fatter


I've just put up a new index for our site FactBites. It now fields 2 million keywords. More features to come.

Thursday, August 24, 2006

Can you handle the truth?

Or "If I say something weird, will you leave me and never come back?"

A big issue with running stat sites is providing relevant, useful info that users understand, without cutting down what they see so far that they can no longer question you.

On NationMaster and StateMaster, there are thousands of stats, so you're always confronted with contradictions. Figures seemingly don't make sense when you put them all together. Our competitors like the CIA World Factbook have so few stats that inconsistencies can't arise. It's the easy option.

Our correlations feature wasn't as much of a hit as I'd hoped, because people looked at the data and said "Wa? Murder rate correlates with gun ownership, that makes sense. But look, it correlates even more strongly with orange juice consumption!"
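To make the orange juice problem concrete, here is a minimal sketch of ranking candidate stats by the strength of their correlation with a target stat. It is not the code behind the correlations feature; the function names and all the figures are made up for illustration, and it assumes a plain Pearson correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

def rank_correlates(target, candidates):
    """Sort candidate stats by absolute correlation with the target, strongest first."""
    scored = [(name, pearson(target, values)) for name, values in candidates.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical figures for five countries, invented for this example only.
murder_rate = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "gun_ownership": [2.0, 1.0, 4.0, 3.0, 5.0],
    "orange_juice": [1.1, 2.0, 2.9, 4.2, 5.0],
}

ranking = rank_correlates(murder_rate, candidates)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

With thousands of stats and only a couple of hundred countries, some unrelated pair is bound to line up this neatly by chance, which is exactly what makes the top of such a list look weird.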

In a world without sites like NationMaster, we just leave it to experts to select which stats are most relevant. Of course an expert is by definition someone more knowledgeable about the domain, so they will be able to digest the info more readily. But when more statistically significant figures are lying around and are not used, everyone should have access to them so they can ask why.

With SEO Sleuth, I chose to show every search going to every site. Now, any webmaster can tell you that people come to your site looking for pretty different things to what the site offers. And looking at the terms as a whole may give you a distorted view of what the site's about (but perhaps a good view of what parts of the site are of interest to searchers). But yeah, we're left with the same problem: people give a quick "that doesn't look right" and leave the site. One thing I considered was linking to the actual searches to prove it, but I didn't want to be republishing such sensitive data. So once again, the quandary.

Wednesday, August 23, 2006

G'day World

Back in February when I was in San Fran, an American friend of mine told me about this show. And it was only last weekend that I got round to giving it a listen.

I've been listening to podcasts for a while: a fair few from IT Conversations, some of which can be good, but I often feel a little hyped out. I sometimes get this queasy feeling that it's all not real.

So it's really refreshing to discover Cameron Reilly and Ben Barren from G'day World. Industry news with that certain intangible Australian sensibility: the cynicism, the good ole Aussie accent. Keep it up guys.

Speaking of saying hello to the world, here's my Technorati Profile.

Monday, August 21, 2006

What turns me on

Blogging as a medium is supposed to allow people to be a little more personal. While I don't feel much need to talk about my private life, I would really like to talk about why I do what I do.

I've been programming since I was 5 and generating stats for communities since I was 13 (back in the days of BBS's). Since then I've worked for a publisher, a web developer and an internet marketing agency. Now almost 30, and 4 years into running my own business, the same thing has always driven me.

For me, I have an introverted motivation and an extraverted one.

On the internal side of things, it's all about that eureka moment when something new arises out of nothing. When the machine gives you back something more intelligent and sophisticated than the code you wrote to discover it.

For the outer world, it's about knowledge; helping more people see the facts for themselves, without relying on any elite to digest them first.

NationMaster and StateMaster are referenced thousands of times on the web. Whenever you're in a discussion or debate, you know there's one site you can go to in order to compare countries/states on just about anything. Of course you can generally use stats to support either side of an argument, but it does increase the quality of the debate and encourages people to find common ground in some area approaching reality.

Politics is very polarised these days, particularly in the US, and the mass media landscape is more bombarded by sophisticated PR and marketing than ever. Our society in general is growing exponentially more complex. The reality of looking at a stat site like NationMaster is that you're going to be confronted with figures that don't fit nicely with your belief system, or even rival belief systems. Stats aren't perfect, but they bring people closer to the complexity of reality. That's something I really hope we can get on top of in this coming information age.

Climbing over the Great Firewall


Got this email recently about our Qwika product which includes Wikipedia mirrors. This is the sort of mail that makes it all worthwhile: 
I am in china .maybe you know ,the wikipedia was blocked in china . I visited your web site and i find that i can read the articles from wikipedia by clicking the "cached "in the searching results .it is wonderful !now i can read the full text but no images . i very appreciate your work !

luck and happy !
It's also a good argument for free content being available through many sources, which is the spirit of the GFDL anyway. 

Sunday, August 20, 2006

Announcing SEO Sleuth

Tuesday 8th August was an adrenaline-pumping day for me. Sitting on the ferry I was going through newly downloaded files on my laptop's desktop and came across these large gzips a mate had sent me that morning. I was thinking it was going to be that Google N-gram data. But browsing through it, it was clear these were searches, millions of them! I let out an involuntary audible expletive and then spent the day wandering round the city thinking up things to do with the data.

The most obvious site, it seemed, would be one that allows ordinary people to read this goldmine of personal information, instead of just people who knew how to read large files. (I was surprised by a number of news stories on the subject where the journalist admitted that they hadn't seen the data themselves.) But I thought that was a bit unethical. As it turns out, there's no shortage of such sites out there now, with dontdelete.com and aolsearchdatabase.com among the more popular.

I also started to see more research-oriented sites crop up that looked at it completely in aggregate - what proportion of people click which ranked result across all terms, how long the tail is and such.
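As a rough sketch of that kind of aggregate analysis, the snippet below works out what share of all clicks went to each result rank. The record layout (query, rank clicked, site) and the sample rows are assumptions made up for illustration, not the format of the actual files.

```python
from collections import Counter

def click_share_by_rank(clicks):
    """Return {rank: share of all clicks} for (query, rank, site) records."""
    counts = Counter(rank for _query, rank, _site in clicks)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {rank: counts[rank] / total for rank in sorted(counts)}

# Hypothetical click records, invented for this example.
sample_clicks = [
    ("cheap flights", 1, "a.example"),
    ("cheap flights", 1, "a.example"),
    ("cheap flights", 2, "b.example"),
    ("weather", 1, "c.example"),
]

shares = click_share_by_rank(sample_clicks)
for rank, share in shares.items():
    print(f"rank {rank}: {share:.0%} of clicks")
```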

With my background in search, what was really interesting for me was the possibility of generating search engine reports for any site. Except for the rare occasion when log reports are made publicly available, we really don't know what people search for to find our competitors' sites.

And the results are certainly interesting. For example, The Open Directory Project's incoming keywords are predominantly adult. The top term is actually "adult" itself.

Next thing I did was a transpose: for a given keyword, it will tell you which sites get the traffic and where they rank. Obviously simply performing a search and seeing what ranks best is the traditional way of doing this, but very often lower ranking sites get more traffic. Knowing both where the results rank and how much traffic they got, and being able to look at the original search (just click on the keyword title), you can then use the tool to work out which sites had the best titles and descriptions (or perhaps brand power).
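The transpose can be sketched in a few lines. This is only an illustration under assumed inputs, not SEO Sleuth's actual code: the (query, rank clicked, site) records, the function name and the sample data are all hypothetical.

```python
from collections import defaultdict

def sites_for_keyword(clicks):
    """Group click records by keyword, then by site.

    Returns {keyword: [(site, click_count, average_rank), ...]} with the
    sites that got the most clicks listed first.
    """
    # keyword -> site -> [click count, sum of ranks]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for query, rank, site in clicks:
        entry = stats[query][site]
        entry[0] += 1
        entry[1] += rank

    report = {}
    for query, sites in stats.items():
        rows = [(site, n, rank_sum / n) for site, (n, rank_sum) in sites.items()]
        # Most clicks first; ties broken by better (lower) average rank.
        rows.sort(key=lambda row: (-row[1], row[2]))
        report[query] = rows
    return report

# Hypothetical click records, invented for this example.
sample_clicks = [
    ("cheap flights", 1, "a.example"),
    ("cheap flights", 3, "b.example"),
    ("cheap flights", 3, "b.example"),
    ("weather", 1, "c.example"),
]

report = sites_for_keyword(sample_clicks)
for site, n, avg_rank in report["cheap flights"]:
    print(f"{site}: {n} clicks, average rank {avg_rank:.1f}")
```

In the made-up sample, the site at rank 3 gets more clicks than the one at rank 1, which is the sort of case where you'd go and compare titles and descriptions.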

Anyway, check it out. The site has had some success with mentions on Threadwatch and the New Scientist blog.