How to track down company-wide connection issues to a specific web location?

Question

I've tried asking on StackOverflow without success, so I hope this community can help me track down this issue. We have a web app that many people in the company need to access. Occasionally the web app seems to stop responding to requests.

For example, if a resource index page (e.g. the orders table) tries to refresh its list during an outage, it requests the data via an API, but the request silently fails after a while. During an outage, which lasts several minutes, requests hang for just about everyone at the company at the same time and the app is effectively unreachable. Accessing the app from another network (e.g. mobile data) during the same period works, and other websites don't seem to be affected.

The browser network tab shows the requests failing after 20-40s with no status code. The status text of a selected request is failed net::ERR_CONNECTION_TIMED_OUT. If I don't click the request while it's in progress and open its details later, the timing tab says it got stuck in the Stalled phase. If I open the details while the request is still in progress, it says it got stuck in the Initial connection phase instead. This makes the timing tab seem unreliable, since what it shows depends on whether I was inspecting the request while it was in progress.
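Because the browser's timing tab is ambiguous here, one way to get a second opinion is to time the same request from the command line during an outage. The sketch below is a minimal example, not a definitive diagnostic: the function name `probe_once`, the timeout values and the output labels are all assumptions made for illustration, and the URL has to be replaced with the real API endpoint.

```shell
#!/bin/sh
# probe_once URL: make one request with curl and print how long each phase
# took (DNS lookup, TCP connect, TLS handshake, first byte, total).
# The 10s connect timeout and 45s overall limit are arbitrary choices.
probe_once() {
  curl --silent --output /dev/null \
       --connect-timeout 10 --max-time 45 \
       --write-out 'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total} code=%{http_code}\n' \
       "$1"
}

# Usage (hypothetical URL):
#   probe_once https://app.example.com/api/orders
```

If `connect` stays at zero and `code` is `000` while `total` is close to the connect timeout, the TCP handshake probably never completed, which would point at the network path or the server's connection handling rather than at the application.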

Server setup:

The server doesn't seem to show major overload during this time - max 30% CPU/memory usage. The server is running on a Digital Ocean droplet and using nginx to host the Laravel app.

What I considered / tried: Company connections come from the same IP. But while the app itself does have throttling enabled, it's bound to the user ID, returns a "Too many attempts" error message and the 429 status code. If this is a case of throttling it should not be at the app level because throttling there is recognizable by the error message and status code.

I tried inspecting nginx configs to find any throttling enabled but it doesn't seem to be explicitly enabled unless nginx enforces some sort of a default. But even if enabled nginx should also return 429/503 as far as I understand from what I read. But in our case it seems like no errors or codes are returned.
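nginx does not rate-limit unless a `limit_req`, `limit_conn` or `limit_rate` directive is configured, and rejected requests get a 503 by default, so a quick way to rule it out is to search the full effective configuration, including every included file. A minimal sketch, assuming `nginx -T` can be run on the server; the helper name `find_rate_limits` is made up for this example:

```shell
#!/bin/sh
# find_rate_limits: read an nginx configuration dump on stdin and print every
# line (with its line number) that starts with a limit_req*, limit_conn* or
# limit_rate* directive. Directives that share a line with other text before
# them are not matched by this simple pattern.
find_rate_limits() {
  grep -nE '^[[:space:]]*limit_(req|conn|rate)' || echo "no limit_* directives found"
}

# Usage: dump the effective configuration and filter it.
#   sudo nginx -T 2>/dev/null | find_rate_limits
```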

I've tried contacting both DigitalOcean and the company ISP and they both claim not to be using any sort of throttling/rate-limiting mechanism. The company network admin also says that there isn't such a mechanism running.

What can I do to debug / investigate where the issue is coming from? From what I understand, the issue can be anywhere from the nginx configuration to ISP throttling. At the moment I think this is some kind of throttling, but could I be missing something?

suchislife · Answer 1 · 2023-11-02T22:46:53.040

Use diagnostic tools to identify bottlenecks or errors in various parts of your infrastructure (nginx, Digital Ocean, internal network). Record data during the outage to analyze later.

# nginx logs
tail -f /var/log/nginx/access.log
tail -f /var/log/nginx/error.log

# Network diagnostics (replace x.x.x.x with server IP)
traceroute x.x.x.x
mtr --report --report-cycles=10 x.x.x.x

# Laravel logs
tail -f /path/to/laravel/storage/logs/laravel.log

# Digital Ocean droplet metrics
# Check droplet metrics via Digital Ocean dashboard

This will help you pinpoint whether the issue is within your nginx setup, Digital Ocean droplet, internal network, or elsewhere. Logs and network diagnostics can provide clues.
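Since the outages only last a few minutes, it can help to have a recorder already running from a machine on the company network, so the data exists when the next one happens. The following is a rough sketch of that idea; `record_every` is a hypothetical helper, and the interval, count and log file name in the usage line are placeholders.

```shell
#!/bin/sh
# record_every INTERVAL COUNT LOGFILE COMMAND [ARGS...]
# Run COMMAND COUNT times, INTERVAL seconds apart, and append each line of
# its output (stdout and stderr) to LOGFILE with a UTC timestamp in front.
record_every() (
  interval=$1
  count=$2
  logfile=$3
  shift 3
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@" 2>&1 | while IFS= read -r line; do
      printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done >> "$logfile"
    sleep "$interval"
    i=$((i + 1))
  done
)

# Usage: one mtr report every 30 seconds, 120 times, into outage.log
#   record_every 30 120 outage.log mtr --report --report-cycles=10 x.x.x.x
```

The timestamps make it possible to line the client-side log up with the nginx and Laravel logs afterwards.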

Reply to Comment

• To check whether any traffic shaping or throttling rules are applied on the server, which could be influencing the flow of network traffic, use the tc command:

# Display the traffic control (qdisc) settings for one interface:
tc qdisc show dev [interface-name]
# Example for the eth0 interface:
tc qdisc show dev eth0

If there are specific traffic control rules applied, they'll be listed here. They can be further analyzed to determine if they're contributing to the reported timeouts.
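On a server with several interfaces the output can be long, so a small filter may help to spot the entries worth a closer look. This is only a sketch: `find_shapers` is a made-up helper, and the set of qdisc kinds it flags (`tbf`, `htb`, `hfsc`, `cbq`, `netem`) is an assumption about which ones usually indicate deliberate shaping or injected delay and loss.

```shell
#!/bin/sh
# find_shapers: read `tc qdisc show` output on stdin and print only the
# qdiscs whose kind suggests rate limiting or artificial delay/loss.
# The list of kinds is this sketch's choice and is not exhaustive.
find_shapers() {
  grep -E '^qdisc (tbf|htb|hfsc|cbq|netem) ' || echo "no shaping qdisc found"
}

# Usage: omit "dev" to list every interface, then filter.
#   tc qdisc show | find_shapers
```

Running `tc -s qdisc show dev eth0` additionally prints per-qdisc statistics, including dropped packets, which is worth comparing before and after an outage.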
