
I'll start by saying I'm not really an infrastructure person, but I'm required to maintain our infrastructure until we hire someone better suited to it, so apologies if I get any terminology wrong.

We have a Kubernetes cluster (AKS) in Azure. Currently we are running only one node in the cluster, with four namespaces configured within it, one per environment. To properly support our production environment, we ideally want at least three nodes, with the ability to scale appropriately. However, whenever we try to scale up, services deployed to different nodes cannot communicate with one another.
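For reference, the scale-up we attempt is roughly equivalent to the following Azure CLI call (the resource group and cluster name below are placeholders, not our real names):

    az aks scale --resource-group my-aks-rg --name my-aks-cluster --node-count 3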

The unfortunate thing is I have no idea how to even begin debugging this. I know we have some custom nginx rules that Microsoft gave us, but even setting an allow-everything rule as the first rule does not seem to resolve the issue. How do I begin debugging this? What information would even be helpful for this issue?

Richard Slater

3 Answers


With help from a Microsoft Engineer, we were able to resolve our issue.

The trick to it was that the node that could communicate with the outside world was always the one the Kubernetes tunnel pod was running on, which pointed to a problem with communication between nodes. The traceroute showed that this traffic, with no network security group restrictions in place, bounced from the node1 IP, to the load balancer IP, and back to the node2 IP.

We had a security rule that restricted all traffic that was not on our specifically approved list.

One of the debugging steps I had taken was to open all traffic from the load balancer IP to the load balancer IP. However, this failed to account for the fact that the traffic was not from the load balancer to the load balancer, but from node X to the load balancer, and then from the load balancer to node Y.

We added inbound and outbound rules that allow traffic from the subnet IPs the nodes can possibly be on to the load balancer, as well as within the same subnet for good measure. This resolved the issue.
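For anyone who wants to script this, a rough sketch of what one such rule could look like via the Azure CLI is below (the resource group, NSG name, subnet CIDR and load balancer IP are placeholders for your own environment, and you would add a matching Outbound rule as well):

    # Allow traffic from the node subnet to the load balancer IP (placeholder values).
    az network nsg rule create \
      --resource-group my-aks-rg \
      --nsg-name my-aks-nsg \
      --name AllowNodeSubnetToLoadBalancer \
      --priority 200 \
      --direction Inbound \
      --access Allow \
      --protocol '*' \
      --source-address-prefixes 10.240.0.0/16 \
      --destination-address-prefixes 10.240.0.100 \
      --destination-port-ranges '*'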

If that's confusing, I can provide images that show the new rules and explain it a little more clearly, but I'm hoping that is clear enough for anyone who experiences a similar issue.


We had the same issue; pods on one node couldn't communicate with pods on another.

Testing Connectivity

We first confirmed our theory that pods couldn't communicate by using kubectl exec -it <podname> -- /bin/bash to get a shell in a pod on one node (related docs). From that pod we then tried to connect to open ports on pods on the same node and on different nodes; we could reach those on the same node, but not those on a different node, confirming our theory.

Since most pods don't have utility tooling installed, we used the (timeout 1 bash -c '</dev/tcp/10.244.1.45/8089' && echo PORT OPEN || echo PORT CLOSED) 2>/dev/null approach to test connectivity (where 10.244.1.45 is the IP of the target pod, and 8089 the listening port). Related notes
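If you have several pod IPs to test, a small sketch of how that check can be wrapped up is below (the pod IPs and port here are just illustrative placeholders):

    # Report whether a TCP port is reachable from inside the current pod.
    check_port() {
      local ip=$1 port=$2
      (timeout 1 bash -c "</dev/tcp/${ip}/${port}") 2>/dev/null \
        && echo "${ip}:${port} OPEN" \
        || echo "${ip}:${port} CLOSED"
    }

    # One pod on the same node, one on a different node (placeholder IPs).
    for ip in 10.244.0.12 10.244.1.45; do
      check_port "$ip" 8089
    done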

Resolving the issue

In our case the issue was with the Route Table associated with our Kubernetes cluster's subnet.

  • Open the Azure Portal
  • Navigate to the Kubernetes service
  • Navigate: Settings > Networking > Virtual Network Integration
  • Click on the subnet
  • Click on the route table associated with your subnet
  • Under routes you'll see each route is named after the node to which it relates. The Next hop IP address is the IP of that node. The Address prefix is the CIDR under which all pods hosted on that node should sit.
  • This was the issue: we found that the address prefix associated with one node was different from the pod addresses on that node; i.e. the address prefix was 10.244.0.0/24, but the pods on that node were in the range 10.244.1.0/24 (e.g. 10.244.1.45).
  • We resolved the issue by adding a new route with the correct information (we gave it the same name as the invalid route, just suffixing _fix on the end; we could probably have just corrected the invalid entry, but adding a new route felt lower risk). A rough sketch of equivalent Azure CLI commands follows after this list.
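If you prefer the CLI to the portal, a hedged sketch of how the routes can be inspected and a corrected route added is below (the resource group, route table name, route name, CIDR and node IP are placeholders; on our cluster the next hop for these routes was the node's IP):

    # List the existing routes so you can compare each address prefix
    # with the pod IPs actually running on that node.
    az network route-table route list \
      --resource-group my-aks-rg \
      --route-table-name my-aks-routetable \
      --output table

    # Add a corrected route pointing the pod CIDR at the right node IP.
    az network route-table route create \
      --resource-group my-aks-rg \
      --route-table-name my-aks-routetable \
      --name aks-node-1-route_fix \
      --address-prefix 10.244.1.0/24 \
      --next-hop-type VirtualAppliance \
      --next-hop-ip-address 10.240.0.5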

Hope that helps others who may hit this issue.

JohnLBevan

What about creating a new managed k8s cluster and starting from scratch? This is what I do when the architecture is a maze; a managed k8s cluster on Azure or GCP is only a couple of clicks away.
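On Azure, for example, spinning up a fresh cluster is roughly the following (a sketch only; the resource group, cluster name and node count are placeholders):

    az aks create \
      --resource-group my-new-rg \
      --name my-new-aks-cluster \
      --node-count 3 \
      --generate-ssh-keys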

030