5

I am trying to build a system for my company to detect unusual/abusive usage patterns (mainly web scrapers).

Currently, the logic I have implemented parses the HTTP access logs and takes the following parameters into account to estimate how likely a user is to be a scraper or bot:

  1. It checks the ratio of HTTP POST vs. GET requests for each IP

  2. It calculates the ratio of unique URLs to the total number of hits (sparsity) for each IP (a rough sketch of how both parameters are computed is shown below)
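For context, a simplified version of the computation looks roughly like this (assuming a combined-log-format access log; the field positions and the thresholds at the end are illustrative, not the exact values we use):

    import collections

    hits = collections.Counter()          # total requests per IP
    posts = collections.Counter()         # POST requests per IP
    gets = collections.Counter()          # GET requests per IP
    urls = collections.defaultdict(set)   # distinct URLs requested per IP

    # Assumes combined log format:
    # ip - - [timestamp] "METHOD /path HTTP/1.1" status size "referer" "user-agent"
    with open("access.log") as log:
        for line in log:
            parts = line.split()
            if len(parts) < 7:
                continue  # skip malformed lines
            ip, method, path = parts[0], parts[5].lstrip('"'), parts[6]
            hits[ip] += 1
            urls[ip].add(path)
            if method == "POST":
                posts[ip] += 1
            elif method == "GET":
                gets[ip] += 1

    for ip in hits:
        post_get_ratio = posts[ip] / max(gets[ip], 1)
        sparsity = len(urls[ip]) / hits[ip]   # unique URLs / total hits
        # Illustrative thresholds only; real values need tuning per site.
        if post_get_ratio < 0.01 and sparsity > 0.9 and hits[ip] > 1000:
            print(ip, "looks like a crawler")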

Based on the above two parameters, we try to block any IP showing unusual behaviour, but these two parameters alone have not been sufficient for bot detection. Thus I would like to know:

  1. Are there any other parameters which can be included to improve the detection?

  2. I found a paper in the ACM Digital Library that uses a Bayesian approach to detect crawlers. Has anyone used this approach? How effective is it?

  3. Stack Overflow and other high-traffic sites have systems like this deployed. What logic do they follow to keep unwanted spammers/crawlers away in real time?

dimo414
  • 403

5 Answers

6

What are you trying to protect against? Is the concern that bots will use excessive bandwidth, or that they will get a copy of all your website content?

In either case, analyzing the log file after the fact will do nothing to prevent either one. If you are concerned about someone stealing your content, what good does it do to know that someone did it last night? It's a little like locking the door after you have been robbed.

It is much better to simply implement bandwidth throttling: limit the number of pages per unit of time (minute, hour, whatever) that your website will deliver to a specific IP address, or better still, to a block of IP addresses.
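A rough sketch of the idea in Python (the window length and page limit are made-up numbers, and keying on the /24 block is just one way of covering a range of addresses):

    import time
    from collections import defaultdict

    WINDOW_SECONDS = 60      # length of the counting window
    MAX_REQUESTS = 120       # pages allowed per block per window (illustrative)

    window_start = defaultdict(float)
    request_count = defaultdict(int)

    def block_of(ip):
        """Key on the /24 block so a scraper can't dodge the limit
        by rotating through addresses in one subnet."""
        return ".".join(ip.split(".")[:3])

    def allow_request(ip, now=None):
        """Return True if this request should be served, False if throttled."""
        now = now if now is not None else time.time()
        key = block_of(ip)
        if now - window_start[key] >= WINDOW_SECONDS:
            window_start[key] = now      # start a new counting window
            request_count[key] = 0
        request_count[key] += 1
        return request_count[key] <= MAX_REQUESTS

Whatever layer fronts the application can call allow_request() and return a 429/503 (or just drop the connection) whenever it comes back False.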

Remember that someone trying to steal your content may be very clever. They will most likely use multiple IP addresses.

Also be aware that there are hardware appliances that can be installed in a data center to do this in real-time.

JonnyBoats
  • 1,793
1

Just embed some invisible links in your HTML. Anyone who follows one is a robot or scraper.
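For example, something along these lines (the /trap URL, the hidden-link markup, and the Flask handler are just placeholders for whatever your stack uses):

    from flask import Flask, request

    app = Flask(__name__)
    flagged_ips = set()

    # In every page template, include a link a human never sees, e.g.:
    #   <a href="/trap" style="display:none" rel="nofollow">do not follow</a>
    # and list /trap as disallowed in robots.txt so polite crawlers skip it.

    @app.route("/trap")
    def trap():
        # Only a client that blindly follows links ends up here.
        flagged_ips.add(request.remote_addr)
        return "", 204

Anything in flagged_ips can then be blocked or throttled by whatever mechanism you already have.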

ddyer
  • 4,078
1

Examine the frequency at which requests are coming in, and if it is exceedingly high, throttle the requests. This way, you are not blocking anyone, and yet nobody can consume too much bandwidth.
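One way to do that without rejecting anything is a per-IP token bucket that delays over-eager callers instead of blocking them (a rough sketch; the rate is arbitrary):

    import time
    from collections import defaultdict

    MAX_RPS = 5                              # requests per second before slowing down
    tokens = defaultdict(lambda: MAX_RPS)    # token bucket per IP
    last_seen = defaultdict(time.time)

    def throttle(ip):
        """Refill the bucket for the time elapsed, then serve or sleep."""
        now = time.time()
        tokens[ip] = min(MAX_RPS, tokens[ip] + (now - last_seen[ip]) * MAX_RPS)
        last_seen[ip] = now
        if tokens[ip] >= 1:
            tokens[ip] -= 1                       # under the rate: no delay
        else:
            wait = (1 - tokens[ip]) / MAX_RPS     # time until one token refills
            time.sleep(wait)                      # slow the caller down
            tokens[ip] = 0
            last_seen[ip] = time.time()           # don't double-count the wait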

Mike Nakis
  • 32,803
1

I like requests-per-session-per-second, sessions-per-IP, and request pace over time.

The first - requests-per-session-per-second - will almost invariably be different between humans and bots.

The second - sessions-per-IP - might appear to be easy to do, but you probably won't be able to tell the difference between a large number of users behind a NAT/firewall and a multithreaded bot. It's probably a good "additional indicator", however.

The third - request pace over time - requires a little explaining. Bots tend to have their own analysis pace, processing "lag", and turnaround time between page requests. Depending upon what they're doing, a bot can retrieve and parse tens-of-kilobytes of webpage content without flinching, and turn around and make yet another request. However, this doesn't differ from what a human might do when, say, they immediately see a link they want, and click on it before the rest of the page loads.

However, a human - even one that frequently visits your site - will likely only remember how to quickly navigate the first few levels of your site this way. After a few levels, the human will likely "slow down" and read more content/take more time to process what they've requested. A bot, on the other hand, will continue at its original pace throughout its entire interaction with your site.

Based upon this, I'd say any session that quickly (more-than-humanly-possible?) processes the retrieved content should initially be categorized as a bot, but not cut off. If, after two or perhaps three levels of navigation into your site, the session still continues to make "faster-than-human" requests, definitively call it a bot and cut it off.
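A rough sketch of that rule (the timing threshold and grace count are invented; you'd tune them against your own traffic):

    import time
    from collections import defaultdict

    MIN_HUMAN_GAP = 2.0     # seconds a human plausibly needs between page requests
    GRACE_REQUESTS = 3      # how many "too fast" requests are tolerated up front

    fast_requests = defaultdict(int)
    last_request = {}
    confirmed_bots = set()

    def observe(session_id, now=None):
        """Record one page request for a session; return True once the
        session is confidently classified as a bot."""
        now = now if now is not None else time.time()
        gap = now - last_request.get(session_id, 0.0)
        last_request[session_id] = now
        if gap < MIN_HUMAN_GAP:
            fast_requests[session_id] += 1
        else:
            fast_requests[session_id] = 0   # a human-like pause resets the count
        if fast_requests[session_id] > GRACE_REQUESTS:
            confirmed_bots.add(session_id)  # still faster than human after the grace period
        return session_id in confirmed_bots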

If a human can actually achieve such a high and sustained interaction with your site, you probably have to redesign your site anyway (lol), and either give the user shortcuts to deep portions of your site, or "flatten" your site altogether.

ka9cql
  • 306
0

This isn't an answer to the question, since there is no feasible way to distinguish between ethical and deliberately unethical access to web pages.

What can be done to avoid flooding is to split the website's content into static and non-static parts, deliver the static content from a CDN, and serve the non-static content from a cluster of web servers that scales dynamically according to the load on the nodes.
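As a minimal illustration of that split, assuming a Flask app sitting behind a CDN (the paths, the max-age, and the placeholder response body are all hypothetical):

    from flask import Flask, make_response, send_from_directory

    app = Flask(__name__)

    @app.route("/assets/<path:filename>")
    def static_content(filename):
        # Long-lived caching headers let the CDN serve this without
        # touching the application tier at all.
        resp = make_response(send_from_directory("assets", filename))
        resp.headers["Cache-Control"] = "public, max-age=86400"
        return resp

    @app.route("/search")
    def dynamic_content():
        # Per-request results: marked non-cacheable so every hit lands on
        # the (auto-scaling) web server cluster rather than the CDN.
        resp = make_response("search results rendered per request")
        resp.headers["Cache-Control"] = "no-store"
        return resp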