Which method can I use to automate DNS-failover with monitoring?

Question

We're operating multiple redundant servers across the world for latency reasons. Currently if one site goes down, our only way to let another site take over that region is through DNS.

We would like to automate this process, for example by replacing/modifying the zone files if a site is detected as having failed through a monitoring tool.

My Google skills only turned up companies offering this as a service, but we'd prefer our own solution. For monitoring we currently use Nagios, our nameserver is Bind.

Is there any tool/method out there to accomplish this?

score 5 · Accepted Answer · answered Feb 15 '15 at 16:43

Of course there is; that's what those services are doing as well. :-)

It depends a bit on how you're currently redirecting/distributing your users globally. Let's assume the effective result is that some users are redirected from www.example.com to www.eu.example.com and others to www.oc.example.com or www.am.example.com respectively.

You could set up your monitoring solution so that when www.am.example.com becomes unresponsive, it triggers not only a normal alert but also a DNS update that points www.am.example.com to www.eu.example.com instead.
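As a rough sketch of the monitoring side, a Nagios event handler could decide when to trigger the change. The script name, the argument order and the echoed action are assumptions for illustration; `$SERVICESTATE$` and `$SERVICESTATETYPE$` are the standard Nagios macros you would pass in from the command definition.

```shell
#!/bin/sh
# Hypothetical Nagios event handler, assumed to be invoked as:
#   dns-failover-handler.sh "$SERVICESTATE$" "$SERVICESTATETYPE$"

# should_failover STATE STATETYPE
# Succeeds only for a confirmed (HARD) CRITICAL state, so that a single
# failed check (SOFT state) does not repoint DNS.
should_failover() {
    [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]
}

if should_failover "${1:-}" "${2:-}"; then
    # Placeholder action: this is where the DNS update would be sent.
    echo "failover: repoint www.am.example.com to www.eu.example.com"
fi
```

Acting only on HARD states is a design choice to avoid flapping; you may want stricter conditions before changing DNS.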

A clean way to do this is dynamic update, a method for adding, replacing or deleting records on a master server by sending it a special form of DNS message. The format and meaning of those messages are specified in RFC 2136.

Dynamic update is enabled by including an allow-update or an update-policy clause in the zone statement. For more info check the Bind Administrator Reference Manual.

The cleanest approach is probably to use both IP-based access controls and DNS keys (TSIG).

Create the key-pair:

dnssec-keygen -a HMAC-MD5 -b 512 -n USER nagios.example.com.

This should result in two files, Knagios.example.com.NNNN.private and Knagios.example.com.NNNN.key. HMAC-MD5 is a shared-secret algorithm, so both files hold the same secret rather than a true private/public pair.

Update your Bind config:

key nagios.example.com. {
    algorithm HMAC-MD5;
    secret "<string with contents from Knagios.example.com.NNNN.key>";
};

zone "am.example.com"
{
    type master;
    file "/etc/bind/zone/am.example.com";
    allow-update { key nagios.example.com.; };
    ...
};

Then write a script that does the following with the BIND nsupdate utility when an alert is raised:

cat <<EOF | /usr/bin/nsupdate -k Knagios.example.com.NNNN.private -v
server ns1.example.com
zone am.example.com
update delete www.am.example.com. A
update add www.am.example.com. 60 A <ip-address-of-www.eu.example.com>
send
EOF
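You will probably also want to point the record back once the site recovers. A minimal sketch that covers both directions with one helper follows; the function name, the `failover`/`failback` arguments and the two IP addresses (documentation ranges) are assumptions, and the key file name is the same placeholder as above.

```shell
#!/bin/sh
# build_update NAME TTL IP
# Prints the nsupdate commands that replace the A record for NAME.
build_update() {
    cat <<EOF
server ns1.example.com
zone am.example.com
update delete $1 A
update add $1 $2 A $3
send
EOF
}

# Hypothetical addresses; substitute the real ones for your sites.
EU_IP="192.0.2.10"      # www.eu.example.com
AM_IP="198.51.100.10"   # www.am.example.com (its own address)

# First argument selects the direction; anything else does nothing.
case "${1:-}" in
    failover)
        build_update www.am.example.com. 60 "$EU_IP" |
            /usr/bin/nsupdate -k Knagios.example.com.NNNN.private -v
        ;;
    failback)
        build_update www.am.example.com. 60 "$AM_IP" |
            /usr/bin/nsupdate -k Knagios.example.com.NNNN.private -v
        ;;
esac
```

Keeping the command text in a separate function lets you print and inspect the update before anything is sent to the server.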

I'm not sure whether you are allowed to use dynamic update for anything besides A records.

score 1 · Answer 2 · answered Feb 15 '15 at 16:43

None. Your approach is broken.

You seem to be under the delusion that you can change the DNS like that. It does not work like this. Even if you set a low TTL, some providers will ignore it, and your old value will still be used. You effectively have no control over DNS expiration beyond "within a day or two".

Any high availability based on DNS changes is thus fundamentally flawed.
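If you want to check how a particular resolver treats your TTL, you can look at the TTL it hands back. This is only a sketch: the helper name and the resolver address in the comment are made up, and the sample line just mimics the columns that `dig +noall +answer` prints.

```shell
#!/bin/sh
# ttl_from_answer: reads `dig +noall +answer` output on stdin and prints
# the TTL (second column) of every A record in it.
ttl_from_answer() {
    awk '$4 == "A" { print $2 }'
}

# Sample line in dig's answer format: name, TTL, class, type, data.
printf 'www.am.example.com. 60 IN A 192.0.2.10\n' | ttl_from_answer

# Against a live resolver (hypothetical resolver address):
#   dig +noall +answer www.am.example.com A @192.0.2.53 | ttl_from_answer
```

A cached TTL that is higher than the one you configured would suggest that the resolver is applying its own minimum.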
