robots.txt is a file that ninety-nine percent of all search engines download from the root of a webserver and use as instructions for what – and what not – to index.
This is my robots.txt file:
User-Agent: *
Disallow: /fun/mp3s.html
Disallow: /comment
Disallow: /trackback
Disallow: /logging
Disallow: /attachment
Disallow: /search
Disallow: /archive
See that last one? That’s the odd one out. It’s going to take a while to have any effect (the top rule has been there for a couple of months now, was only removed for two weeks, and searches for MP3s account for most of my search traffic), but I’ve blocked Google from my date-based archives.
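If you want to sanity-check rules like these, Python’s standard urllib.robotparser module understands the same syntax compliant crawlers do. A quick sketch (the article path here is made up purely for illustration):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from above, fed straight to the parser.
rules = """\
User-Agent: *
Disallow: /fun/mp3s.html
Disallow: /comment
Disallow: /trackback
Disallow: /logging
Disallow: /attachment
Disallow: /search
Disallow: /archive
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Disallow matching is prefix-based, so everything under /archive
# is now off-limits to any compliant crawler...
print(parser.can_fetch("Googlebot", "/archive/2003/01/"))   # False

# ...while an individual (hypothetical) article page is still fair game.
print(parser.can_fetch("Googlebot", "/entries/some-post"))  # True
```

Since there is no Googlebot-specific section, the `*` group applies to every user agent.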
Why? Have I gone insane? Not quite. I’m currently plagued by incorrect search results. Until earlier this week, This Page was the top match on Google for the phrase “I Hate Dominos”. When I mentioned this a couple of days ago, that page became the top match within hours. This is stupid. Not only is Aquarionics definitely not about my hatred of Dominos, I didn’t even say I hated them – some random anonymous commenter did.
Part of the problem is that every article gets indexed by Google twice – once in the daily archive and once as its own single-item page – and its first 200 words get indexed a third time via the monthly archives, which only show extracts or descriptions. Multiply all that by the number of domains I get spidered under, now down to just one from six last week. This means that not only do people search for random things and get my website; when they search for things I do talk about, they get the monthly page, where the phrase might be fifteen folds down.
So I’ve blocked search engines from the date-based archives, and instead made sure there’s a big list of links to every single entry in each section, so the engines can still find everything but will now only index the page-per-article versions instead of keeping four copies of every item.
Which is neat.