We recently moved some of our production infrastructure to Kubernetes. Many pods are exposed through a LoadBalancer service on AWS, which creates an ELB, registers each node in the cluster with the ELB, and configures a node port to map ELB ports to pods. Our applications can connect through the load balancer, but the number of BackendConnectionErrors (as reported by CloudWatch) is 5-7x higher than the request count. I'm not sure how to debug this.
The reported backend connection errors do not correlate with any application-layer error metrics, which leads me to conclude that this is some kind of infrastructure problem, perhaps amplified by retries.
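
For concreteness, this is roughly how I'm pulling the two ELB metrics I'm comparing (a minimal sketch, assuming boto3, the classic ELB namespace `AWS/ELB`, and placeholder region/load balancer names):

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials/region are configured for this account

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region


def metric_sum(metric_name, lb_name, hours=1):
    """Sum a classic-ELB CloudWatch metric over the last `hours` hours (5-minute buckets)."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric_name,
        Dimensions=[{"Name": "LoadBalancerName", "Value": lb_name}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])


lb = "my-elb-name"  # placeholder: the ELB created by the LoadBalancer service
errors = metric_sum("BackendConnectionErrors", lb)
requests = metric_sum("RequestCount", lb)
print(f"errors={errors:.0f} requests={requests:.0f} ratio={errors / max(requests, 1):.1f}x")
```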
My hypothesis is one or more of these:
- Some AWS connection-management setting that is missing or misconfigured on the ELB
- Nodes in the cluster have a sysctl setting or other networking config that limits the number of connections coming in through the ELB (see the sketch after this list)
- Some intermediate piece of networking infrastructure messing with the connections.
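
To sanity-check the second point, I've been dumping the node-level settings that seem most likely to matter. This is a sketch, and the list of sysctls is just my guess at the relevant ones, not an authoritative set:

```python
from pathlib import Path

# Sysctls I suspect could limit or drop inbound connections from the ELB.
# This list is a guess, not an authoritative set.
SYSCTLS = [
    "net.core.somaxconn",
    "net.ipv4.tcp_max_syn_backlog",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_tw_reuse",
    "net.netfilter.nf_conntrack_max",
]


def read_sysctl(name):
    """Read a sysctl value from /proc/sys; return None if it is not present on this node."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


for name in SYSCTLS:
    print(f"{name} = {read_sysctl(name)}")
```

Reading straight from /proc/sys avoids depending on the sysctl binary being present in a debug container.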
My question is: how can I debug/trace TCP/networking-related metrics on the instances in the cluster?
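
The closest I've gotten to node-level visibility is polling the kernel's TCP counters while traffic flows. This is a sketch that reads /proc/net/netstat and /proc/net/snmp and watches the counters I assume are relevant (listen queue overflows/drops, failed connection attempts, retransmits):

```python
import time


def read_counters(path):
    """Parse /proc/net/netstat or /proc/net/snmp into {protocol: {counter: value}}."""
    with open(path) as f:
        lines = f.read().splitlines()
    counters = {}
    # Both files use alternating header/value lines, e.g.
    # "TcpExt: SyncookiesSent ListenOverflows ..." followed by "TcpExt: 0 42 ..."
    for header, values in zip(lines[::2], lines[1::2]):
        proto = header.split(":")[0]
        names = header.split()[1:]
        nums = [int(v) for v in values.split()[1:]]
        counters[proto] = dict(zip(names, nums))
    return counters


# Counters I assume are the interesting ones for dropped/failed inbound connections.
WATCH = [
    ("/proc/net/netstat", "TcpExt", "ListenOverflows"),
    ("/proc/net/netstat", "TcpExt", "ListenDrops"),
    ("/proc/net/snmp", "Tcp", "AttemptFails"),
    ("/proc/net/snmp", "Tcp", "RetransSegs"),
]

prev = {}
while True:
    for path, proto, name in WATCH:
        value = read_counters(path)[proto].get(name, 0)
        delta = value - prev.get((path, name), value)
        prev[(path, name)] = value
        print(f"{proto}.{name}: total={value} delta={delta}")
    print("---")
    time.sleep(5)
```

If ListenOverflows/ListenDrops climb in step with the ELB errors, that would point at the backlog on the node-port side rather than the ELB itself, but I'd like to know if there is a better way to trace this.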
More info about the CloudWatch metrics in question.