We recently moved some of our production infrastructure to Kubernetes. Many pods are exposed through a LoadBalancer service on AWS, which creates an ELB, registers each node in the cluster with the ELB, and configures a node port to map ELB ports to pods. Our applications can connect through the load balancer, but the number of BackendConnectionErrors (as reported by CloudWatch) is 5-7x higher than the request count. I'm not sure how to debug this.
The reported backend connection errors do not correlate with any application-layer error metrics, which leads me to conclude that this is some kind of infrastructure problem, perhaps amplified by retries.
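
For concreteness, this is roughly how I'm pulling the two ELB metrics I'm comparing (a minimal sketch, assuming boto3, the classic ELB namespace `AWS/ELB`, and placeholder region/load balancer names):

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials/region are configured for this account

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region


def metric_sum(metric_name, lb_name, hours=1):
    """Sum a classic-ELB CloudWatch metric over the last `hours` hours (5-minute buckets)."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric_name,
        Dimensions=[{"Name": "LoadBalancerName", "Value": lb_name}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])


lb = "my-elb-name"  # placeholder: the ELB created by the LoadBalancer service
errors = metric_sum("BackendConnectionErrors", lb)
requests = metric_sum("RequestCount", lb)
print(f"errors={errors:.0f} requests={requests:.0f} ratio={errors / max(requests, 1):.1f}x")
```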
My hypothesis is one or more of these:
- Some AWS connection-management setting that is missing or misconfigured on the ELB
- Nodes in the cluster have a sysctl setting or other networking config that limits the number of connections coming in through the ELB (see the sketch after this list)
- Some intermediate piece of networking infrastructure messing with the connections.
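
To sanity-check the second point, I've been dumping the node-level settings that seem most likely to matter. This is a sketch, and the list of sysctls is just my guess at the relevant ones, not an authoritative set:

```python
from pathlib import Path

# Sysctls I suspect could limit or drop inbound connections from the ELB.
# This list is a guess, not an authoritative set.
SYSCTLS = [
    "net.core.somaxconn",
    "net.ipv4.tcp_max_syn_backlog",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_tw_reuse",
    "net.netfilter.nf_conntrack_max",
]


def read_sysctl(name):
    """Read a sysctl value from /proc/sys; return None if it is not present on this node."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


for name in SYSCTLS:
    print(f"{name} = {read_sysctl(name)}")
```

Reading straight from /proc/sys avoids depending on the sysctl binary being present in a debug container.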
My question is: how can I debug/trace TCP/networking-related metrics on the instances in the cluster?
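
The closest I've gotten to node-level visibility is polling the kernel's TCP counters while traffic flows. This is a sketch that reads /proc/net/netstat and /proc/net/snmp and watches the counters I assume are relevant (listen queue overflows/drops, failed connection attempts, retransmits):

```python
import time


def read_counters(path):
    """Parse /proc/net/netstat or /proc/net/snmp into {protocol: {counter: value}}."""
    with open(path) as f:
        lines = f.read().splitlines()
    counters = {}
    # Both files use alternating header/value lines, e.g.
    # "TcpExt: SyncookiesSent ListenOverflows ..." followed by "TcpExt: 0 42 ..."
    for header, values in zip(lines[::2], lines[1::2]):
        proto = header.split(":")[0]
        names = header.split()[1:]
        nums = [int(v) for v in values.split()[1:]]
        counters[proto] = dict(zip(names, nums))
    return counters


# Counters I assume are the interesting ones for dropped/failed inbound connections.
WATCH = [
    ("/proc/net/netstat", "TcpExt", "ListenOverflows"),
    ("/proc/net/netstat", "TcpExt", "ListenDrops"),
    ("/proc/net/snmp", "Tcp", "AttemptFails"),
    ("/proc/net/snmp", "Tcp", "RetransSegs"),
]

prev = {}
while True:
    for path, proto, name in WATCH:
        value = read_counters(path)[proto].get(name, 0)
        delta = value - prev.get((path, name), value)
        prev[(path, name)] = value
        print(f"{proto}.{name}: total={value} delta={delta}")
    print("---")
    time.sleep(5)
```

If ListenOverflows/ListenDrops climb in step with the ELB errors, that would point at the backlog on the node-port side rather than the ELB itself, but I'd like to know if there is a better way to trace this.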
More info about the CloudWatch metrics in question.