
I have an in-house 5-node bare-metal cluster running Calico. The cluster was working for 22 days, but then it suddenly stopped working. After investigating the problem, I found that service-to-pod communication is broken, even though all the components are up and kubectl works without a problem.

From within the cluster (from component A), if I curl another component (bridge) by its pod IP, it works:

$ curl -vvv http://10.4.130.184:9998
* Rebuilt URL to: http://10.4.130.184:9998/
*   Trying 10.4.130.184...
* TCP_NODELAY set
* Connected to 10.4.130.184 (10.4.130.184) port 9998 (#0)
> GET / HTTP/1.1
> Host: 10.4.130.184:9998
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Accept-Ranges: bytes
< Cache-Control: public, max-age=0
< Last-Modified: Mon, 08 Apr 2019 14:06:42 GMT
< ETag: W/"179-169fd45c550"
< Content-Type: text/html; charset=UTF-8
< Content-Length: 377
< Date: Wed, 23 Oct 2019 09:56:35 GMT
< Connection: keep-alive
< 
<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

    <title>Bridge</title>

    <meta content='width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0' name='viewport' />
    <meta name="viewport" content="width=device-width" />
</head>
<body>
    <h1>Bridge</h1>
</body>

</html>
* Connection #0 to host 10.4.130.184 left intact

nslookup for the service also works (it resolves to the service IP):

$ nslookup bridge
Server:    10.5.0.10
Address 1: 10.5.0.10 kube-dns.kube-system.svc.k8s.local

Name:      bridge
Address 1: 10.5.160.50 bridge.170.svc.k8s.local

But service-to-pod communication is broken, and when I curl the service name it fails most of the time (60-70%), either hanging or failing to resolve:

$ curl -vvv http://bridge:9998
* Rebuilt URL to: http://bridge:9998/
* Could not resolve host: bridge
* Closing connection 0
curl: (6) Could not resolve host: bridge
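
To separate a DNS failure from a ClusterIP routing failure, it can help to query CoreDNS directly and to curl the service's ClusterIP, bypassing DNS entirely. This is only a sketch; the IPs and the FQDN are taken from the outputs above:

$ nslookup bridge.170.svc.k8s.local 10.5.0.10   # query CoreDNS directly, bypassing the resolv.conf search path
$ curl -m 5 -vvv http://10.5.160.50:9998        # hit the ClusterIP itself; if this also hangs, the problem is not (only) DNS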

When I check the endpoints of that service, I can see that the pod's IP is there:

$ kubectl get ep -n 170 bridge
NAME     ENDPOINTS                                               AGE
bridge   10.4.130.184:9226,10.4.130.184:9998,10.4.130.184:9226   11d
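
As a cross-check (a sketch, using the io.kompose.service=bridge selector shown in the service description below), the endpoint IP can be compared against the pod that actually backs the service:

$ kubectl get pods -n 170 -l io.kompose.service=bridge -o wide   # the pod IP should match 10.4.130.184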

But as I said, curl (and any other method) that uses the service name is not working. This is the service description:

$ kubectl describe svc -n 170 bridge
Name:              bridge
Namespace:         170
Labels:            io.kompose.service=bridge
Annotations:       Process: bridge
Selector:          io.kompose.service=bridge
Type:              ClusterIP
IP:                10.5.160.50
Port:              9998  9998/TCP
TargetPort:        9998/TCP
Endpoints:         10.4.130.184:9998
Port:              9226  9226/TCP
TargetPort:        9226/TCP
Endpoints:         10.4.130.184:9226
Port:              9226-udp  9226/UDP
TargetPort:        9226/UDP
Endpoints:         10.4.130.184:9226
Session Affinity:  None
Events:            <none>
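
Since kube-proxy is the component that translates the ClusterIP 10.5.160.50 into the endpoint 10.4.130.184, one sanity check (a sketch, assuming kube-proxy runs in the default iptables mode; with IPVS mode, ipvsadm -Ln would be the equivalent) is to look on a node for the NAT rules it programmed for this service:

$ sudo iptables-save -t nat | grep 10.5.160.50
# healthy output shows KUBE-SVC-* entries for 9998/TCP, 9226/TCP and 9226/UDP that jump to KUBE-SEP-* entries for 10.4.130.184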

This problem is not limited to this one component; it is the same for all of them.

I restarted CoreDNS (deleted its pods), but the behavior is still the same. I faced this problem before; at the time I thought it was related to Weave Net, which I was using, and since I needed the cluster I tore it down and rebuilt it with Calico. Now I am fairly sure this is not related to the CNI and is something else.
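
For reference, restarting CoreDNS as described here amounts to deleting its pods and letting the Deployment recreate them, roughly as follows (assuming the default kubeadm label k8s-app=kube-dns):

$ kubectl -n kube-system delete pods -l k8s-app=kube-dns
$ kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide   # wait for the new pods to become Running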

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:27:17Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: This is a bare-metal cluster of 5 nodes, 1 master and 4 workers. All nodes are running Ubuntu 18.04 and they are connected to the same subnet.

  • OS (e.g: cat /etc/os-release):

NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
Linux serflex-argus-1 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Kubeadm

  • Network plugin and version (if this is a network-related bug): Calico "cniVersion": "0.3.1"

Update

After deleting all the kube-proxy pods, the problem seems to be solved, but I would still like to know what caused it. By the way, I didn't see any errors in the kube-proxy logs.
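
The kube-proxy restart mentioned above amounts to something like the following (assuming the default kubeadm label k8s-app=kube-proxy; the DaemonSet recreates one pod per node):

$ kubectl -n kube-system delete pods -l k8s-app=kube-proxy
$ kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide   # one Running pod per node once they come back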

Update 2

journalctl also doesn't show any strange activity; it just shows some entries from when I deleted the kube-proxy pod:

giant-7:~$ sudo journalctl --since "4 days ago" | grep kube-proxy

Oct 23 12:24:16 serflex-giant-7 kubelet[31145]: I1023 12:24:16.751108   31145 reconciler.go:181] operationExecutor.UnmountVolume started for volume "kube-proxy" (UniqueName: "kubernetes.io/configmap/02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea-kube-proxy") pod "02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea" (UID: "02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea")
Oct 23 12:24:16 serflex-giant-7 kubelet[31145]: I1023 12:24:16.751237   31145 reconciler.go:181] operationExecutor.UnmountVolume started for volume "kube-proxy-token-4t5tq" (UniqueName: "kubernetes.io/secret/02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea-kube-proxy-token-4t5tq") pod "02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea" (UID: "02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea")
Oct 23 12:24:16 serflex-giant-7 kubelet[31145]: W1023 12:24:16.765057   31145 empty_dir.go:421] Warning: Failed to clear quota on /var/lib/kubelet/pods/02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea/volumes/kubernetes.io~configmap/kube-proxy: ClearQuota called, but quotas disabled
Oct 23 12:24:16 serflex-giant-7 kubelet[31145]: I1023 12:24:16.782557   31145 operation_generator.go:831] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap/02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea-kube-proxy" (OuterVolumeSpecName: "kube-proxy") pod "02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea" (UID: "02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea"). InnerVolumeSpecName "kube-proxy". PluginName "kubernetes.io/configmap", VolumeGidValue ""
Oct 23 12:24:16 serflex-giant-7 kubelet[31145]: I1023 12:24:16.840793   31145 operation_generator.go:831] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea-kube-proxy-token-4t5tq" (OuterVolumeSpecName: "kube-proxy-token-4t5tq") pod "02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea" (UID: "02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea"). InnerVolumeSpecName "kube-proxy-token-4t5tq". PluginName "kubernetes.io/secret", VolumeGidValue ""
Oct 23 12:24:16 serflex-giant-7 kubelet[31145]: I1023 12:24:16.851656   31145 reconciler.go:301] Volume detached for volume "kube-proxy-token-4t5tq" (UniqueName: "kubernetes.io/secret/02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea-kube-proxy-token-4t5tq") on node "serflex-giant-7" DevicePath ""
Oct 23 12:24:16 serflex-giant-7 kubelet[31145]: I1023 12:24:16.851679   31145 reconciler.go:301] Volume detached for volume "kube-proxy" (UniqueName: "kubernetes.io/configmap/02a1a1fb-2411-4f0e-98e1-ef2dbd6149ea-kube-proxy") on node "serflex-giant-7" DevicePath ""
Oct 23 12:24:25 serflex-giant-7 kubelet[31145]: I1023 12:24:25.973757   31145 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-proxy" (UniqueName: "kubernetes.io/configmap/4e7f5d97-fd49-461b-ae38-6bc5e3f0462b-kube-proxy") pod "kube-proxy-qpj4h" (UID: "4e7f5d97-fd49-461b-ae38-6bc5e3f0462b")
Oct 23 12:24:25 serflex-giant-7 kubelet[31145]: I1023 12:24:25.973826   31145 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "lib-modules" (UniqueName: "kubernetes.io/host-path/4e7f5d97-fd49-461b-ae38-6bc5e3f0462b-lib-modules") pod "kube-proxy-qpj4h" (UID: "4e7f5d97-fd49-461b-ae38-6bc5e3f0462b")
Oct 23 12:24:25 serflex-giant-7 kubelet[31145]: I1023 12:24:25.973958   31145 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-proxy-token-4t5tq" (UniqueName: "kubernetes.io/secret/4e7f5d97-fd49-461b-ae38-6bc5e3f0462b-kube-proxy-token-4t5tq") pod "kube-proxy-qpj4h" (UID: "4e7f5d97-fd49-461b-ae38-6bc5e3f0462b")
Oct 23 12:24:25 serflex-giant-7 kubelet[31145]: I1023 12:24:25.974027   31145 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "xtables-lock" (UniqueName: "kubernetes.io/host-path/4e7f5d97-fd49-461b-ae38-6bc5e3f0462b-xtables-lock") pod "kube-proxy-qpj4h" (UID: "4e7f5d97-fd49-461b-ae38-6bc5e3f0462b")
Oct 23 12:24:26 serflex-giant-7 systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/pods/4e7f5d97-fd49-461b-ae38-6bc5e3f0462b/volumes/kubernetes.io~secret/kube-proxy-token-4t5tq.
Oct 23 12:24:26 serflex-giant-7 kubelet[31145]: E1023 12:24:26.645571   31145 kuberuntime_manager.go:920] PodSandboxStatus of sandbox "ff4c1ba15b8c11a4fe86974286d81ebe4870d9e670226234fae6c7c21ce36c1d" for pod "kube-proxy-qpj4h_kube-system(4e7f5d97-fd49-461b-ae38-6bc5e3f0462b)" error: rpc error: code = Unknown desc = Error: No such container: ff4c1ba15b8c11a4fe86974286d81ebe4870d9e670226234fae6c7c21ce36c1d
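
For completeness, the kube-proxy container logs (as opposed to the kubelet entries shown by journalctl above) can be pulled with something like the following, again assuming the default kubeadm label; the startup lines also reveal whether kube-proxy is running in iptables or IPVS mode:

$ kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -iE "proxier|error"
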
AVarf

1 Answer


It sounds like the problem has been resolved by deleting all of the kube-proxy pods, which is an aggressive solution, but if it works, good work!

Best I can offer you is a collection of links about diagnosing Kubernetes problems:

Not a great answer I'm afraid, but we may have missed the opportunity to diagnose the issue.

Richard Slater