
Trying to get the following behavior working in nginx:

A default rate limit of 1r/s for each IP when using a browser, a rate limit of 10r/s for the Bing and Google spiders, and rejection of bad bots.

Unfortunately Google doesn't publish IP addresses for Googlebot, so I'm limited to matching on the user agent.

So far this gets close:

http { 
  # Rate limits
  map $http_user_agent $uatype {
    default 'user';
    ~*(google|bing|msnbot) 'okbot';
    ~*(slurp|nastybot) 'badbot';
  }

  limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
  limit_req_zone $binary_remote_addr zone=two:10m rate=10r/s;

  ...

  server {
    ...

    location / {
      if ($uatype = 'badbot') {
        return 403;
      }

      limit_req zone=one burst=5 nodelay;
      if ($uatype != 'user') {
        limit_req zone=two burst=10 nodelay;
      }

      ...
    }

  ...
  }
}

BUT - 'if' isn't allowed to do this.

$ nginx -t

nginx: [emerg] "limit_req" directive is not allowed here in /etc/nginx/nginx.conf
nginx: configuration file /etc/nginx/nginx.conf test failed

There are so many untested suggestions on the nginx forums, and most don't even pass a config test.

One that looks promising is "Nginx Rate Limiting by Referrer?". The downside of that version is that all of the configuration is repeated for each different limit (and I have many rewrite rules).

Anyone got something good?

Ali W

4 Answers


Unfortunately you can't make it dynamic this way; the limit_req module doesn't support this.

The link you found is probably the only way to achieve this. Use the include directive to "avoid" repeating your configuration.
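
For instance, a minimal sketch of the include idea (the snippet path is only a placeholder, and the two locations just stand in for however you end up splitting the traffic between the zones from your question):

# /etc/nginx/snippets/common-rules.conf holds the rewrite rules and other
# directives that would otherwise be copy-pasted into every rate-limited block.

location / {
    limit_req zone=one burst=5 nodelay;
    include /etc/nginx/snippets/common-rules.conf;
}

location /api/ {
    limit_req zone=two burst=10 nodelay;
    include /etc/nginx/snippets/common-rules.conf;
}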

But what if a third-party crawler suddenly impersonates a good bot's user agent?

Xavier Lucas

Today I was able to implement rate limiting on a user agent basis; try this:

map $http_user_agent $bad_bot {
    default 0;
    ~*(foo|bar) 1;
}

map $http_user_agent $nice_bot {
    default "";
    ~*(baz|qux) 1;
}

limit_req_zone $nice_bot zone=one:10m rate=1r/s;
limit_req_status 429;

server {
    ...
    location / {
        limit_req zone=one nodelay;
        if ($bad_bot) {
            return 403;
        }
        ...
    }
}
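
As a quick sanity check (assuming the vhost answers on localhost and that foo/baz are replaced with real bot patterns), you can force the User-Agent with curl:

$ curl -I -A "baz" http://localhost/   # matches $nice_bot, counted in zone "one"; repeat quickly to see a 429
$ curl -I -A "foo" http://localhost/   # matches $bad_bot, should return 403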
hvelarde

*** ANSWER 1 of 2 ***

Very belated answer. But since I spent most of today studying this problem I thought I'd show what worked for me. The short of it is using map and restarting the service instead of simply signaling a reload.

Here's what I have near the top of my sites-available/default file:

# User agent strings get pretty long
map_hash_bucket_size 256;

Detect the basic agent type (bot or not):

map $http_user_agent $agent_type {

    # Not a bot
    default  0;

    # All of these below are bots that should be rate-limited
    "~*(Amazonbot|ClaudeBot|DataForSeoBot|GPTBot|SemrushBot)"  1;
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0 Safari/537.36"  1;
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"  1;
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43"  1;

}

Define our rate-limiting zones for use later:

map $agent_type $default_zone_key {
    default  '';
    0        $binary_remote_addr;
}
limit_req_zone $default_zone_key zone=default_zone:20m rate=5r/s;

map $agent_type $bot_zone_key {
    default  '';
    1        $binary_remote_addr;
}
limit_req_zone $bot_zone_key zone=bot_zone:10m rate=6r/m;

Note that in addition to some bots that announce themselves (e.g. ClaudeBot and GPTBot), I added a few "exactly this string" cases for some botnets that don't announce themselves. I may be accidentally punishing some real human users too, but I'm just happy to have something that works for now.

Then in the server section I've got this:

server {
    server_name  somedomain.com;
    ...

    limit_req zone=default_zone burst=50 nodelay;
    limit_req zone=bot_zone burst=10 nodelay;
    limit_req_status 429;  # Too many requests

    ...

    # Want to see whether you're seen as a bot? Uncomment the following:
    #add_header  X-Routing-Agent-Type "$agent_type";
}

If you're not following what's going on, I'll explain; as a relative novice with NGINX I can relate. limit_req_zone lets you supply a key, and if that key is blank, the zone effectively ignores the request. In our case we're feeding it an IP address. More specifically, we have two zones: one whose key is the client IP address for requests identified as bots (by user agent), and one whose key is the client IP address for everything else. Those map $agent_type <key> blocks populate those two IP-address-or-blank keys.
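
As a concrete illustration of the blank-key behaviour (the address below is my own example, not from the config above), here is what the maps produce for a single request:

# request from 203.0.113.9 with a bot user agent:
#   $agent_type       = 1
#   $bot_zone_key     = (binary form of 203.0.113.9)  -> counted in bot_zone
#   $default_zone_key = ''                            -> default_zone ignores it
#
# same address with a normal browser user agent:
#   $agent_type       = 0
#   $bot_zone_key     = ''                            -> bot_zone ignores it
#   $default_zone_key = (binary form of 203.0.113.9)  -> counted in default_zone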

It's worth noting that you could add more $agent_type values if you wanted a finer gradation than simply "bot" and "everything else"; you'd just need to add more of the map blocks, limit_req_zone definitions, and limit_req lines.
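
For example, a sketch of a third tier for especially aggressive bots (the "2" value, the zone name, and the 1r/m rate below are made up for illustration):

# in the $agent_type map above, give the aggressive bots their own value:
#     "~*(SomeVeryAggressiveBot)"  2;

map $agent_type $aggressive_zone_key {
    default  '';
    2        $binary_remote_addr;
}
limit_req_zone $aggressive_zone_key zone=aggressive_zone:10m rate=1r/m;

# and in the server block:
#     limit_req zone=aggressive_zone burst=2 nodelay;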

I was so frustrated today because so much of what I was trying out was not working. Only near the end did it dawn on me that NGINX was not loading my configuration as expected. I was using nginx -s reload. Everything started working when I instead used service nginx restart.
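
For anyone else stuck at the same point (assuming a system where nginx is managed with service, as on Debian/Ubuntu), the sequence that finally picked up the changes was:

$ sudo nginx -t                # check the configuration parses cleanly
$ sudo service nginx restart   # full restart rather than "nginx -s reload"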


*** ANSWER 2 of 2 ***

After a few more hours of tinkering I discovered what seems to be a more elegant approach. One problem I noticed with my earlier solution is that it did not account well for botnets. Amazonbot, for example, was employing dozens of separate bots, all reporting the same user agent, but my earlier solution was granting a separate quota to each of them rather than to all of them collectively.

I realized then that the key you give to limit_req_zone is a grouping key. When you give it an IP address, you are saying that each IP address gets its own quota. I'm fairly certain that's also why you declare a shared memory size (e.g. 10m): each unique key takes up space in that shared memory, and stale ones fall off when more memory is needed for fresh requests.
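
In other words, the key decides what shares a counter. For illustration (these two zones are not the ones used below):

limit_req_zone $binary_remote_addr zone=per_ip:10m rate=5r/s;  # one counter per client IP
limit_req_zone $http_user_agent    zone=per_ua:10m rate=6r/m;  # one counter per user agent string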

So what if we use two zones? One is the default zone and applies to everything that's not a known bot; there the keys we plug in are the remote IP addresses, so each address has its own quota. The other zone is for bots, where each bot has a unique name and that name, not an IP address, is the key. Here's how I did it. The first part goes up top:

# User agent strings get pretty long
map_hash_bucket_size 256;

Define our rate-limiting zones for use later. Each one of these bots (including networks) gets one shared zone with its own limited rate:

map $http_user_agent $bot_name {
    default           '';
    '~Amazonbot'      'Amazonbot';
    '~YandexBot'      'YandexBot';
    '~ClaudeBot'      'ClaudeBot';
    '~DataForSeoBot'  'DataForSeoBot';
    '~DotBot'         'DotBot';
    '~GPTBot'         'GPTBot';
    '~*MJ12bot'       'MJ12bot';
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43'  'Unknown1';
    'Mozilla/5.0 (Windows NT 9.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43'   'Unknown2';
}
limit_req_zone $bot_name zone=named_bot_zone:30m rate=6r/m;

Everything not a bot is in the default zone:

map $bot_name $default_zone_key {
    default  '';
    ''       $binary_remote_addr;
}
limit_req_zone $default_zone_key zone=default_zone:30m rate=5r/s;

And then down in the specific server block I have:

server {
    server_name  some_domain_name.com;
    ...

    limit_req zone=default_zone burst=50 nodelay;
    limit_req zone=named_bot_zone burst=10 nodelay;
    limit_req_status 429;  # Too many requests

    ...
}

This does generally appear to work as I expected. I'm watching a few bots nibbling away, but I do notice that the bigger botnets seem to get in a few more requests per minute than others do. For example, Amazonbot is presently getting up to 9 per minute when it should only be getting 6. There may be other factors I'm overlooking, but this is definitely better than the much higher number I was seeing earlier when the limit was per IP address. Perhaps the burst=10 allowance accounts for it, since it lets short spikes of extra requests through on top of the configured rate.

I'm keeping my original answer too, mainly because it's also valid and it's a different approach. I think some people might still prefer to give each bot machine its own request quota, while others might prefer this whole-bot-network quota approach.