Scaling up is not too hard.
You will need to elect some group leaders.
Also, while it is both more powerful and higher overhead
than your current setup, you might find the notion of
reliable multicast
or virtual synchrony to be of interest.
Ken Birman did great work on this in the 1980s.
A host becomes a member of sequentially numbered "views",
which may shrink in membership due to power failure or network partition,
and may grow when partitions are later merged.
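
To make the "views" idea concrete, here is a minimal bookkeeping sketch. The `View` class and its method names are my own invention for illustration, and the numbering rule used on merge is an assumption. A real virtual synchrony implementation also runs an agreement protocol so that all surviving members install the same view, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """One sequentially numbered membership snapshot (hypothetical type)."""
    number: int
    members: frozenset

    def without(self, failed):
        """Next view after some members drop out (power failure, partition)."""
        return View(self.number + 1, self.members - frozenset(failed))

    def merged_with(self, other):
        """Next view after two partitions rejoin.

        Assumption: the merged view takes a number higher than either side's.
        """
        return View(max(self.number, other.number) + 1,
                    self.members | other.members)
```

For example, `View(1, frozenset({"a", "b", "c"})).without({"c"})` yields view 2 with members `a` and `b`.
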
Goal: You want both the traffic through a given WAN link
and the messages seen by a given node to stay bounded even as the number of nodes grows.
- Keep sending mDNS messages as you currently do, at random intervals of roughly 24 hours, plus once soon after bootup.
- Every node has a unique (mostly persistent) ID, perhaps derived from MAC address or a GUID rolled at software install time.
- Define some "regions" of network topology. If you think about ARP broadcasts, a region corresponds to an IPv4 subnet prefix, e.g. 10.0.2.0/23. You might instead use "nearest leader node" based on ping round-trip time, or administrative groupings, or even a SHA3 hash of the MAC address modulo K.
- Region leaders send frequent mDNS advertisements, perhaps once per minute. Upon bootup a node pauses for a random delay, sends an advertisement, considers itself an auxiliary leader, and waits for messages. It should soon hear one from the region's leader, containing the node's ID, at which point the node demotes itself to being an ordinary participant.
- Leaders send "current state of the world" messages that list all of a region's node IDs, each with the timestamp at which the leader last heard from that node. These messages also include administrivia: (A) the region's auxiliary leader(s), and (B) other regions along with their leaders.
- Nodes periodically send "I'm alive!" unicast messages to leaders and aux leaders.
- Auxiliary leaders send messages similar to the leader's, but less frequently. If the leader becomes unreachable, the lowest-numbered aux leader promotes itself to become leader.
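
Two of the bullets above are easy to sketch in Python: assigning a node to a region by hashing its MAC address modulo K, and the "lowest-numbered aux leader promotes itself" rule. The function names, the integer node IDs, and the timeout parameter are all assumptions of mine, not part of your current code.

```python
import hashlib
from typing import Mapping


def region_for(mac: str, k: int) -> int:
    """Map a MAC address to one of k regions via SHA3-256 modulo k."""
    normalized = mac.lower().replace("-", ":")
    digest = hashlib.sha3_256(normalized.encode("ascii")).digest()
    return int.from_bytes(digest, "big") % k


def should_promote(my_id: int,
                   aux_last_heard: Mapping[int, float],
                   leader_last_heard: float,
                   now: float,
                   timeout: float) -> bool:
    """Decide whether this aux leader should promote itself to leader.

    aux_last_heard maps each *other* aux leader's ID to the time we last
    heard from it. All times are seconds on the same clock.
    """
    if now - leader_last_heard <= timeout:
        return False  # the leader is still reachable, nothing to do
    # Only aux leaders heard from recently are considered candidates.
    candidates = [node_id for node_id, heard in aux_last_heard.items()
                  if now - heard <= timeout]
    candidates.append(my_id)
    return my_id == min(candidates)
```

Hashing gives evenly sized regions but ignores topology, so it trades WAN locality for simplicity. The subnet-prefix approach makes the opposite trade.
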
This is an asymmetric gossip protocol.
The idea is that everyone can see what the protocol believes
the state of the world is, and typically they silently agree with that.
Seeing yourself listed in an advertisement from the leader squelches
your own advertisement. In some ways this resembles
the spanning tree protocol used by bridges.
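
The squelch rule might look roughly like this. `StateOfWorld` is a hypothetical stand-in for whatever the leader's message turns out to contain, and `max_age` is an assumed staleness threshold. Treat it as a sketch of the decision, not of the wire format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StateOfWorld:
    """Hypothetical payload of a leader's periodic message."""
    leader_id: int
    sent_at: float
    # node ID -> time the leader last heard from that node
    last_heard: Dict[int, float] = field(default_factory=dict)


def should_advertise(my_id: int,
                     latest: Optional[StateOfWorld],
                     now: float,
                     max_age: float) -> bool:
    """Return True if this node should send its own advertisement."""
    if latest is None or now - latest.sent_at > max_age:
        return True  # no recent word from a leader, so speak up
    # Being listed by the leader means the world already knows about us.
    return my_id not in latest.last_heard
```

A node that is missing from the leader's list advertises, and so does a node that has not heard from any leader lately. Everyone else stays quiet.
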
The leader and aux leaders will have overhead proportional to region size.
Everyone else wins.
For less overhead, use a greater number of smaller regions.
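
A back-of-envelope estimate shows the tradeoff. This assumes evenly sized regions and a fixed "I'm alive!" interval. The function name and the numbers in the example are made up for illustration.

```python
def leader_msgs_per_hour(n_nodes: int, k_regions: int, heartbeat_s: float) -> float:
    """Approximate "I'm alive!" messages one leader receives per hour.

    Assumes nodes are spread evenly across regions and each node sends
    one heartbeat to its leader every heartbeat_s seconds.
    """
    region_size = n_nodes / k_regions
    return region_size * 3600 / heartbeat_s
```

With 10,000 nodes and a 300 second heartbeat, 20 regions means each leader hears about 6,000 messages per hour, while 100 regions brings that down to about 1,200. The cost of more regions is a longer list of "other regions along with their leaders" in each state message.
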
Very important: jitter all timers.
Add some random amount of delay.
Otherwise nodes will synchronize with one another and send in bursts.
I have seen this in production with MUA clients hitting a POP3 mail server.
The BGP community was impacted by this badly enough that timer jitter is now written into the protocol spec (RFC 4271).
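
Jitter is cheap to add. Here is one possible helper, where the name `jittered` and the default of plus or minus 25% are my own choices rather than anything a spec requires.

```python
import random


def jittered(base_s: float, fraction: float = 0.25, rng=random) -> float:
    """Return base_s scaled by a random factor in [1 - fraction, 1 + fraction].

    Pass a random.Random instance as rng for reproducible tests.
    """
    return base_s * rng.uniform(1.0 - fraction, 1.0 + fraction)
```

Call it every time you re-arm a timer, e.g. `delay = jittered(60.0)` for the once-per-minute leader advertisement, rather than computing one jittered value at startup and reusing it.
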