Availability at risk due to one offline Domain Name Server?

Question

A domain can have several name servers registered at its domain registrar. The name servers are picked more or less at random rather than, as one might expect, primary first, secondary second, and so on.

Knowing that, does this mean that when one name server is down, there is a 50% chance that a visitor whose query goes to the offline name server will never reach your site, while the other 50% can browse it just fine, so that the availability of the site suffers?

Lastly, why would clients not by default question the next name server in the list when one is down?

The same applies to IPv4 and IPv6. If one of the name servers supports only IPv6 and a user without IPv6 connectivity ends up querying that specific name server, the site will be unreachable, I suppose.

To be clear, I'm asking specifically about how the authoritative server is picked, and how a failure is handled when the picked authoritative server is unavailable due to downtime or an IPv4/IPv6 mismatch between client and server.

score 6 · Accepted Answer · edited Oct 07 '21 at 07:34

Lastly, why would clients not by default question the next name server in the list when one is down?

That is exactly what recursive servers do when talking to authoritative servers. RFC 1035 §7.2 describes the overall process if you're interested, but the following excerpts are the most immediately relevant:

The key algorithm uses the state information of the request to select the next name server address to query, and also computes a timeout which will cause the next action should a response not arrive. The next action will usually be a transmission to some other server, but may be a temporary error to the client.

[snip]

If a resolver gets a server error or other bizarre response from a name server, it should remove it from SLIST, and may wish to schedule an immediate transmission to the next candidate server address.

A few other factors also feed into the selection of the authoritative server, such as the response times observed in earlier exchanges. The same section of the RFC covers them.
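To make the excerpts concrete, here is a minimal sketch of that failover loop. It is not any real resolver's implementation: the function name, the `send_query` callback, the hostnames, and the response-time figures are all hypothetical, and a `None` return stands in for a timeout or server error.

```python
from typing import Callable, Dict, List, Optional


def resolve_with_failover(
    observed_rtt: Dict[str, float],
    send_query: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try candidate servers, fastest known first, dropping any that fail."""
    # Build the candidate list (SLIST in RFC 1035 terms), ordered by the
    # response time seen in earlier exchanges. Lower is better.
    slist: List[str] = sorted(observed_rtt, key=lambda name: observed_rtt[name])
    while slist:
        server = slist.pop(0)        # take the best remaining candidate
        answer = send_query(server)  # None models a timeout or server error
        if answer is not None:
            return answer
        # The failed server is already removed from slist; move on to the next.
    return None  # every candidate failed: temporary error to the client


def fake_send_query(server: str) -> Optional[str]:
    # Hypothetical transport: ns1 is down, ns2 answers.
    responses = {"ns1.example.com": None, "ns2.example.com": "192.0.2.10"}
    return responses[server]


history = {"ns1.example.com": 0.020, "ns2.example.com": 0.045}
result = resolve_with_failover(history, fake_send_query)
print(result)
```

Even though the offline server is preferred on past response time, the lookup still succeeds because the loop moves on to the next candidate instead of giving up.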

The key to ensuring that you are not impacted by nameserver unreachability is covered by BCP 16 (RFC 2182). In particular, Section 3.1 states:

Secondary servers must be placed at both topologically and geographically dispersed locations on the Internet, to minimise the likelihood of a single failure disabling all of them.

That is, secondary servers should be at geographically distant locations, so it is unlikely that events like power loss, etc, will disrupt all of them simultaneously. They should also be connected to the net via quite diverse paths. This means that the failure of any one link, or of routing within some segment of the network (such as a service provider) will not make all of the servers unreachable.

This is to account for the fact that the resiliency of your domain is severely impacted by single points of failure on the network, or on the physical site. The ideal state is to have multiple authoritative nameservers that are not impacted by any change in network or physical state experienced by the others.

score 4 · Answer 2 · answered Jun 21 '17 at 19:07

I would say that the answer to the overall sentiment of the question is "no".

First off, the client machine traditionally only has a stub resolver, blindly sending all queries (with "recursion desired" set) to some configured nameserver address (resolv.conf).

Your question really applies to the next step, when that nameserver processes the recursive request and makes iterative queries until it reaches the authoritative servers.

And while there is some degree of implementation-specific behavior, the resolver is absolutely expected to work its way through the authoritative nameservers until it finds one that is responsive.
The caveat is that there will be some overall timeout, so there is a risk that it cannot finish in time.
That said, it is also common to keep tabs on which servers are working and which aren't, increasing the chances that subsequent queries will succeed in a timely fashion. And of course queries for already cached data will not require any communication with the authoritative servers.

All in all, no, you should not expect 50% chance of user-visible error if there are two nameservers and one is down. More likely the first lookup in a completely cold-cache scenario will just be slightly slow.
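A small back-of-the-envelope sketch of that claim, assuming the resolver picks its first server uniformly at random and fails over on timeout. The 800 ms timeout and 50 ms round trip are made-up illustrative figures, not values taken from any particular resolver.

```python
from itertools import permutations
from typing import Dict, Tuple


def cold_cache_outcomes(
    servers_up: Dict[str, bool], timeout_ms: int, rtt_ms: int
) -> Tuple[float, float]:
    """Enumerate every query order; return (failure rate, mean latency in ms)."""
    orders = list(permutations(servers_up))
    failures = 0
    total_latency = 0
    for order in orders:
        elapsed = 0
        answered = False
        for name in order:
            if servers_up[name]:
                elapsed += rtt_ms      # a live server answers after one round trip
                answered = True
                break
            elapsed += timeout_ms      # a dead server costs one full timeout
        if not answered:
            failures += 1
        total_latency += elapsed
    return failures / len(orders), total_latency / len(orders)


# Two nameservers, one of them offline (hypothetical names and timings).
failure_rate, mean_latency = cold_cache_outcomes(
    {"ns1": False, "ns2": True}, timeout_ms=800, rtt_ms=50
)
print(failure_rate, mean_latency)
```

Under these assumptions no query order ends in failure; the only cost is the extra timeout paid when the dead server happens to be tried first.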

score 0 · Answer 3 · answered Jun 21 '17 at 13:28

It is not accurate to say that there is a 50% chance that visitors whose query goes to the offline name server will never reach your site. In the Linux resolver manual (man resolv.conf), under the section that describes the nameserver option, you can read:

If there are multiple servers, the resolver library queries them in the order listed. If no nameserver entries are present, the default is to use the name server on the local machine. The algorithm used is to try a name server, and if the query times out, try the next, until out of name servers, then repeat trying all the name servers until a maximum number of retries are made.

So they will be tried in the order specified in the config file. That does not necessarily mean that all resolvers behave in the same way.
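The algorithm in the man page excerpt can be sketched roughly like this. It is an illustration only, not the actual resolver library code: `stub_resolve`, the `query` callback, and the addresses are hypothetical, and the `attempts` parameter is loosely modelled on the `attempts` option in resolv.conf.

```python
from typing import Callable, List, Optional


def stub_resolve(
    nameservers: List[str],
    query: Callable[[str], Optional[str]],
    attempts: int = 2,
) -> Optional[str]:
    """Query nameservers in listed order, repeating the list up to `attempts` times."""
    for _ in range(attempts):
        for server in nameservers:
            answer = query(server)  # None stands for a query that timed out
            if answer is not None:
                return answer
    return None  # out of retries: report an error to the caller


tried: List[str] = []


def fake_query(server: str) -> Optional[str]:
    # Hypothetical transport: the first listed server never answers.
    tried.append(server)
    return "203.0.113.7" if server == "192.0.2.2" else None


answer = stub_resolve(["192.0.2.1", "192.0.2.2"], fake_query)
print(answer, tried)
```

The first server is always tried first here, so a dead first entry slows every lookup by one timeout but does not make it fail.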
