We run nginx as a reverse proxy, load balancing across two application servers. These servers were defined in an upstream block like so:
upstream app_backends {
    server 1.1.1.1:8080 max_fails=1 fail_timeout=120s;
    server 1.1.1.2:8080 max_fails=1 fail_timeout=120s;
}
We had a significant outage when a client sent requests with a very large cookie header. The uwsgi application choked on the header and closed the connection early. nginx recorded a failure against that backend and immediately retried the request on the second backend, which choked in exactly the same way. With both backends marked down, nginx answered every request from every client with a 502 for the next two minutes.
Once we understood the problem, we fixed it easily by setting max_fails=0. The client with the large cookie header still gets 502s, but all other clients can continue to use the application without issue. Of course, this also means nginx no longer offers any protection against failures in our backends.
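For reference, the workaround looks roughly like the sketch below. It assumes everything else in the upstream block stayed as it was and only max_fails changed; as far as I understand the documentation, a value of zero disables the accounting of failed attempts, so fail_timeout no longer causes a server to be marked unavailable.

```nginx
upstream app_backends {
    # max_fails=0 disables failure accounting, so nginx never marks
    # these servers as unavailable, whatever fail_timeout is set to.
    server 1.1.1.1:8080 max_fails=0 fail_timeout=120s;
    server 1.1.1.2:8080 max_fails=0 fail_timeout=120s;
}
```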
We have this same configuration across a number of different applications, and I'm trying to understand what the safest general configuration is for our setup.
The default values in nginx for these two settings are max_fails=1 and fail_timeout=10s. Our problem was obviously exacerbated by the fact that we had fail_timeout=120s, but even with 10s, our application would still have been taken down completely for 10 seconds at a time whenever this particular client with a large cookie header made a request.
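To make the comparison concrete, this is how I understand the defaults: a server line with no parameters should behave the same as one with the default values written out. The two upstream names below are made up purely for illustration.

```nginx
# Hypothetical upstream names, for illustration only.

# No parameters given, so nginx applies its defaults.
upstream app_backends_implicit {
    server 1.1.1.1:8080;
    server 1.1.1.2:8080;
}

# The same behaviour with the defaults spelled out: one failed attempt
# within 10 seconds marks the server unavailable for 10 seconds.
upstream app_backends_explicit {
    server 1.1.1.1:8080 max_fails=1 fail_timeout=10s;
    server 1.1.1.2:8080 max_fails=1 fail_timeout=10s;
}
```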
It seems like a bad pattern in general that a single failed request, which may be a special-case request like ours was, leads to a whole backend being taken offline. This is especially true when we have no idea whether the same error will hit all backends equally, as it did in this case.
What I'm asking is: would it generally be safer for our setup to use max_fails=0 for all our apps, rather than the nginx defaults of max_fails=1 fail_timeout=10s? And if so, is this potentially an argument for nginx to change its defaults?