We are running the Confluent Helm chart on a number of clusters. Occasionally broker pods fail after some time and never self-heal afterwards. The error in the broker log is usually the following:
[main-SendThread(confluent-{env}-cp-zookeeper-headless:{port})] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server confluent-{env}-cp-zookeeper-headless/{ip:port}. Will not attempt to authenticate using SASL (unknown error)
[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [confluent-{env}-cp-zookeeper-headless:{port}].
Is anybody else experiencing this, and is there any guidance on how to automate handling it or fix the behavior? The only thing I can tell is that ZooKeeper is unresponsive for some reason, yet the ZooKeeper pods are up while the broker pods are down. In one of the ZooKeeper pods I see the following:
[2020-06-15 13:02:18,902] WARN Connection broken for id 2, my id = 2, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-06-15 13:03:07,010] WARN Connection broken for id 2, my id = 2, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-06-15 13:03:07,010] WARN Connection broken for id 1, my id = 2, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-06-15 13:03:08,244] ERROR Unhandled scenario for peer sid: 3 fall back to use snapshot (org.apache.zookeeper.server.quorum.LearnerHandler)
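Given those quorum warnings, one thing I'm considering is a liveness probe on the ZooKeeper containers themselves, so that a pod whose server stops answering gets restarted instead of sitting in Running while broken. This is only a sketch of a container-spec fragment, not something the chart necessarily exposes: it assumes the image has a shell and nc, that the client port is the default 2181, and that the ruok four-letter command is enabled (newer ZooKeeper versions require whitelisting it via 4lw.commands.whitelist).

```yaml
# Sketch: restart a ZooKeeper container whose server no longer answers "ruok".
# Assumptions: sh and nc exist in the image, client port is 2181, "ruok" is allowed.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # "ruok" should return "imok" from a healthy server; anything else fails the probe
      - '[ "$(echo ruok | nc -w 2 127.0.0.1 2181)" = "imok" ]'
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 3
```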
We are on Kubernetes 1.14.3, running the Confluent chart at version cp-helm-charts-0.4.1.
My assumption is that after some disruption the connection between the two sides is lost and no reconnect is ever attempted; is that correct?
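If that is the case, the mitigation I'm leaning towards is an exec liveness probe on the broker container so Kubernetes restarts brokers that get stuck in this "timed out waiting for ZooKeeper" state instead of leaving them down. This is only a sketch: it assumes the stock cp-kafka image, which ships the cub utility behind the ClusterStatus check shown in the log above, and the service name and port are the same placeholders as above. I don't know whether chart version 0.4.1 exposes a values hook for this, so it's written as a raw container-spec fragment.

```yaml
# Sketch: fail the probe (and restart the broker) if ZooKeeper is unreachable.
# Assumptions: cub is on the PATH in the cp-kafka image; connect string is a placeholder.
livenessProbe:
  exec:
    command:
      - cub
      - zk-ready
      - "confluent-{env}-cp-zookeeper-headless:{port}"
      - "30"                  # seconds cub waits for a ZooKeeper connection
  initialDelaySeconds: 120    # give the broker time to start before probing
  periodSeconds: 60
  timeoutSeconds: 45          # longer than the cub timeout above
  failureThreshold: 3
```

Is a probe along those lines a reasonable workaround, or does it just mask the underlying quorum problem?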
There is also a related GitHub issue in the vendor's repository: https://github.com/confluentinc/cp-helm-charts/issues/446