We are running the Confluent Helm chart on a number of clusters. Occasionally broker pods fail after some time and never self-heal afterwards. The error in the broker log is usually the following:
[main-SendThread(confluent-{env}-cp-zookeeper-headless:{port})] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server confluent-{env}-cp-zookeeper-headless/{ip:port}. Will not attempt to authenticate using SASL (unknown error)
[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [confluent-{env}-cp-zookeeper-headless:{port}].
Is anybody else experiencing this, and is there any guidance on how to automate handling it or fix the behavior? The only thing I can tell is that ZooKeeper is unresponsive for some reason, yet the ZooKeeper pods are up while the broker pods are down. In one of the ZooKeeper pods I see the following:
[2020-06-15 13:02:18,902] WARN Connection broken for id 2, my id = 2, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-06-15 13:03:07,010] WARN Connection broken for id 2, my id = 2, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-06-15 13:03:07,010] WARN Connection broken for id 1, my id = 2, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-06-15 13:03:08,244] ERROR Unhandled scenario for peer sid: 3 fall back to use snapshot (org.apache.zookeeper.server.quorum.LearnerHandler)
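Given those quorum warnings, one thing I'm considering is a liveness probe on the ZooKeeper containers themselves, so that a pod whose server stops answering gets restarted instead of sitting in Running while broken. This is only a sketch of a container-spec fragment, not something the chart necessarily exposes: it assumes the image has a shell and nc, that the client port is the default 2181, and that the ruok four-letter command is enabled (newer ZooKeeper versions require whitelisting it via 4lw.commands.whitelist).

```yaml
# Sketch: restart a ZooKeeper container whose server no longer answers "ruok".
# Assumptions: sh and nc exist in the image, client port is 2181, "ruok" is allowed.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # "ruok" should return "imok" from a healthy server; anything else fails the probe
      - '[ "$(echo ruok | nc -w 2 127.0.0.1 2181)" = "imok" ]'
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 3
```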
We are on Kubernetes 1.14.3, running the Confluent chart at version cp-helm-charts-0.4.1.
My assumption is that after some disruption the connection between the two sides is lost and no reconnect is ever attempted; is that correct?
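If that is the case, the mitigation I'm leaning towards is an exec liveness probe on the broker container so Kubernetes restarts brokers that get stuck in this "timed out waiting for ZooKeeper" state instead of leaving them down. This is only a sketch: it assumes the stock cp-kafka image, which ships the cub utility behind the ClusterStatus check shown in the log above, and the service name and port are the same placeholders as above. I don't know whether chart version 0.4.1 exposes a values hook for this, so it's written as a raw container-spec fragment.

```yaml
# Sketch: fail the probe (and restart the broker) if ZooKeeper is unreachable.
# Assumptions: cub is on the PATH in the cp-kafka image; connect string is a placeholder.
livenessProbe:
  exec:
    command:
      - cub
      - zk-ready
      - "confluent-{env}-cp-zookeeper-headless:{port}"
      - "30"                  # seconds cub waits for a ZooKeeper connection
  initialDelaySeconds: 120    # give the broker time to start before probing
  periodSeconds: 60
  timeoutSeconds: 45          # longer than the cub timeout above
  failureThreshold: 3
```

Is a probe along those lines a reasonable workaround, or does it just mask the underlying quorum problem?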
There is also a related GitHub issue in the vendor's repository: https://github.com/confluentinc/cp-helm-charts/issues/446