Google Inc today lost a copyright fight launched by Belgian French-language newspapers which demanded the web search service remove their stories, claiming it infringed copyright laws. … They complained that the search engine’s “cached” links offered free access to archived articles that the papers usually sell on a subscription basis. It was unclear if Google would have to pay a fine.
— Wire story: Google loses case against Belgian papers
That’s just stupid. You don’t need to go around suing search engines to stop your stuff getting into their databases. Every web developer who knows anything about this knows you just need to drop a robots.txt file onto your web site and it stops all search engines and archivers stone dead.
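For anyone who hasn't seen one: a minimal robots.txt that asks every well-behaved crawler to stay out of the entire site is just two lines, served from the site root (e.g. http://example.com/robots.txt):

```
# Applies to all crawlers that honour the robots exclusion standard
User-agent: *
Disallow: /
```

The path example.com is a placeholder; the point is that the file must live at the top level of the host, and compliant engines check it before fetching anything else.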
To ignore that and send in the lawyers instead just looks like you're not looking for a solution, you're looking for money.
I believe their problem isn't the fact that Google analyses, determines, and provides search matches – the problem is purely that it also takes a copy of the pages in question and provides a hosted free copy (and is thus violating copyright). Furthermore, THE big issue is that Google quite often somehow manages to cache copies of pages that are subscription-based. Really, how many of us have, for years now, used the Google cache to get at articles that otherwise deny access without a paid subscription? I do it all the time, demonstrating on each and every occasion the very infringement Google is being accused of (and I guess I'm committing it myself by taking advantage of it?).
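If the cached copy is the real objection, there's an even finer-grained control than robots.txt: the robots meta tag. Putting this in a page's head tells Google (and other compliant engines) that it may index the page but must not serve an archived copy of it – the search result still appears, but the "Cached" link goes away:

```html
<meta name="robots" content="noarchive">
```

That would let the papers keep their search traffic while closing exactly the loophole described above, without touching indexing at all.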
robots.txt is certainly more effective these days than it has been in the past – in years past it was only partially effective. However, to this day there are still various search engines, mostly spiders, that don't give a rat's about robots.txt. I've still got quite a number of IP bans on my server to keep out idiotic engines that have absolutely no regard for the load they put on servers by pulling in hundreds upon hundreds of PHP-generated pages in very short amounts of time. Without those bans, assuming you end up indexed by such a site, there are times of day when your site becomes virtually unresponsive due to the insane load (not to mention I'm stuffed if I'm paying for all that data).
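For crawlers that ignore robots.txt, the ban has to happen at the server level. As one sketch of the approach, with Apache a few lines of .htaccess will turn a misbehaving spider away before it ever reaches a PHP page (the addresses below are documentation examples, not real offenders):

```
# Block misbehaving crawlers by IP address
# (203.0.113.x / 198.51.100.x are example ranges, substitute the real offenders)
Order Allow,Deny
Allow from all
Deny from 203.0.113.45
Deny from 198.51.100.0/24
```

Denied requests get a cheap 403 instead of a dynamically generated page, which is what saves the server load and the bandwidth bill.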
ArsTechnica has more on this story, and judging from that it's still not clear why they didn't use robots.txt. I've certainly seen other web sites use it effectively to keep Google out.
Yeah Chris, I’ve noted some sites/spiders burning enormous amounts of bandwidth. The big names are usually better behaved.