
I am new to web crawling and have been testing my crawlers on various sites. I forgot about the robots.txt file during my tests.

I just want to know what will happen if I don't follow the robots.txt file, and what the safe way of crawling is.

yannis

3 Answers


The Robots Exclusion Standard is purely advisory; it's completely up to you whether you follow it or not, and if you aren't doing anything nasty, chances are nothing will happen if you choose to ignore it.

That said, when I catch crawlers not respecting robots.txt on the various websites I support, I go out of my way to block them, regardless of whether they are troublesome or not. Even legitimate crawlers may bring a site to a halt with too many requests to resources that aren't designed to handle crawling, so I'd strongly advise you to reconsider and adjust your crawler to fully respect robots.txt.
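For what it's worth, Python's standard library can do the honouring for you. Here is a minimal sketch of a polite fetcher; the site URL and user-agent string are placeholders, and requests is a third-party dependency you'd install separately:

    import time
    import urllib.robotparser
    from urllib.parse import urljoin

    import requests  # third-party: pip install requests

    BASE = "https://example.com"      # placeholder site
    USER_AGENT = "MyTestCrawler/0.1"  # hypothetical bot name

    # Fetch and parse the site's robots.txt once, up front.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(BASE, "/robots.txt"))
    rp.read()

    # Honour an explicit Crawl-delay if the site declares one,
    # otherwise fall back to a polite one-second pause.
    delay = rp.crawl_delay(USER_AGENT) or 1.0

    def fetch(path):
        url = urljoin(BASE, path)
        if not rp.can_fetch(USER_AGENT, url):
            print(f"robots.txt disallows {url}; skipping")
            return None
        time.sleep(delay)  # throttle so we don't hammer the server
        return requests.get(url, headers={"User-Agent": USER_AGENT})

    print(fetch("/some/page"))

Checking can_fetch before every request, and sleeping between requests, covers both halves of the complaint above: crawling forbidden paths and crawling too fast.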

yannis

Most sites won't impose any repercussions.

However, some sites have crawler traps: links hidden from normal users but plainly visible to crawlers.

These traps can get your crawler IP-blocked, or do anything else to try to thwart the crawler.
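To make that concrete, here is a minimal sketch of one defensive heuristic: skip links that are hidden with inline CSS or marked rel="nofollow". The trap markup and class name are made up for illustration, and real traps can hide links in ways this doesn't catch (external stylesheets, off-screen positioning), so it only reduces the risk:

    from html.parser import HTMLParser

    # A hypothetical trap: invisible to humans, but a naive crawler
    # that follows every <a href> walks straight into it.
    #   <a href="/trap/4861" style="display:none">do not follow</a>

    class CautiousLinkExtractor(HTMLParser):
        """Collect hrefs, skipping links that are obviously hidden inline."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            attrs = dict(attrs)
            style = (attrs.get("style") or "").replace(" ", "").lower()
            rel = (attrs.get("rel") or "").lower()
            # Crude heuristics only: ignore inline-hidden or
            # nofollow links.
            if "display:none" in style or "visibility:hidden" in style:
                return
            if "nofollow" in rel:
                return
            if attrs.get("href"):
                self.links.append(attrs["href"])

    page = '<a href="/real">ok</a><a href="/trap/4861" style="display: none">x</a>'
    parser = CautiousLinkExtractor()
    parser.feed(page)
    print(parser.links)  # ['/real']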

ratchet freak

There are no legal repercussions that I'm aware of. If a webmaster notices you crawling pages that they told you not to crawl, they might contact you and tell you to stop, or even block your IP address from visiting, but that's a rare occurrence. It's possible that one day new laws will be created that add legal sanctions, but I don't think this will become a very big factor. So far, internet culture has preferred the technical way of solving things, with "rough consensus and running code", rather than asking lawmakers to step in. It would also be questionable whether any law could work very well given the international nature of IP connections.

(In fact, my own country is in the process of creating new legislation specifically targeted at Google for re-publishing snippets of online news! The newspapers could easily bar Google from spidering them via robots.txt, but that's not what they want. They want to be crawled, because that brings page hits and ad money; they just want Google to pay them royalties on top! So you see, sometimes even serious, money-grubbing businesses are more upset about not being crawled than about being crawled.)

Kilian Foth