Background: When running Consul agents in client mode on AWS, EC2 instances are constantly being replaced, which leaves a lot of dead client nodes in the Consul UI and in the output of consul members.

I have already tried enabling reconnect_timeout, leave_on_terminate (it defaults to true on clients, but I set it explicitly anyway), and Autopilot's cleanup_dead_servers, to no avail.

The delay also seems inconsistent: some nodes were reaped after 4 to 7 days, while others were never reaped and had to be removed with consul force-leave.

Casper

1 Answer

As a best practice, you should gracefully deregister the node. Consul then knows that the node has left and removes it from the cluster. Otherwise, Consul cannot distinguish between a temporary failure, an agent crash, a network partition, etc.
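
A minimal sketch of such a graceful leave, assuming it is run on the instance itself just before termination (for example from a shutdown script). The function name leave_cluster and the log messages are hypothetical; only consul leave is the actual CLI command.

```shell
#!/bin/sh
# Hypothetical shutdown hook: ask the local Consul agent to leave gracefully
# so the cluster marks this node as "left" instead of "failed".
leave_cluster() {
    if consul leave; then
        echo "left cluster gracefully"
    else
        # Without a graceful leave the node is expected to linger as "failed".
        echo "consul leave failed; node may linger as failed" >&2
        return 1
    fi
}
```

How this gets triggered (an init system stop hook, an instance shutdown script, etc.) depends on the setup and is left open here.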

There is also a GitHub issue related to this.

Quoting a few important points from the issue:

The nodes should automatically reap out after 72 hours (not yet configurable, but soon). Otherwise, the best route is to issue a graceful leave before destroying the nodes (consul leave), so that they can be reaped immediately. They are kept around for that long since without a graceful leave, Consul cannot distinguish between a temporary failure, agent crash, network partition, etc.

.

force-leave should push them into the "left" state. Nodes are not reaped until they are in failed for 72h or in the left state. "force-leave" just moves a node from the "failed" -> "left" state. They are not removed from the members list for 24 or 72h.
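
For nodes that were already destroyed without a graceful leave, the force-leave step can be scripted. This is only a sketch: it assumes the usual consul members table layout (a header line, node name in the first column, status in the third), and the function names failed_nodes and reap_failed_nodes are hypothetical.

```shell
#!/bin/sh
# Print the names of members whose Status column reads "failed".
# Assumes `consul members` output: header line first, Node in column 1,
# Status in column 3.
failed_nodes() {
    awk 'NR > 1 && $3 == "failed" { print $1 }'
}

# Move every failed member to the "left" state with force-leave.
reap_failed_nodes() {
    consul members | failed_nodes | while read -r node; do
        consul force-leave "$node"
    done
}
```

Running reap_failed_nodes from a machine with a working agent would then issue one force-leave per failed node. As the quote says, this only moves nodes to "left"; they can still show up in the members list for a while.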

Samit