
We are noticing that a significant amount of our web traffic comes from content scrapers (identified by their crawling pattern). They are useless visitors to us but consume a lot of our resources (bandwidth, CPU). Is there any application or firewall that can detect content scrapers and block them?

Search engine crawlers are the exception; those are not useless to us.

Note: I would prefer to use an existing solution. I believe this is a common problem, so there should already be one.


1 Answer


The best way to do this is to block the traffic using netfilter/iptables, since that is far more efficient than blocking in apache2 or PHP. The problem is that you need to know the IP address or hostname of the content scrapers.
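For example (a minimal sketch; `203.0.113.42` is a placeholder for a scraper address you have already identified, and the rules assume plain HTTP on port 80):

```bash
# Drop everything from a known scraper address before the web server sees it
iptables -A INPUT -s 203.0.113.42 -j DROP

# Or, more narrowly, drop only its HTTP traffic and keep other services reachable
iptables -A INPUT -s 203.0.113.42 -p tcp --dport 80 -j DROP
```

Because the packets are dropped in the kernel, apache2 never accepts the connection, so it spends no CPU or bandwidth on those requests.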

A possible extension is to detect content scrapers based on their behaviour (statistical methods, e.g. requests per minute), or to look for a missing user agent or other headers a normal browser would send, and then deny them access. Of course you could also add the IP/hostname to iptables from PHP (or whatever environment you use) so it gets blocked. But that normally requires root permission, and it is NOT a good idea to give root permission to your apache2.
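As a sketch of the requests-per-minute idea, iptables can do a crude rate check on its own with the `recent` module, without any application code. The threshold here (20 new connections per 60 seconds) is an assumed value, not a recommendation:

```bash
# Record every new HTTP connection per source address in a list named "scrapers"
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate NEW \
  -m recent --name scrapers --set

# Drop sources that opened 20 or more new connections within the last 60 seconds
# (note: --hitcount is capped at 20 unless ip_pkt_list_tot is raised when the
# xt_recent module is loaded)
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate NEW \
  -m recent --name scrapers --update --seconds 60 --hitcount 20 -j DROP
```

If you detect scrapers in PHP instead, a common workaround for the root-permission problem is to have PHP only append the offending address to a file, and let a root-owned cron job or daemon read that file and apply the actual iptables rules.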