Is it possible to completely avoid a single-point-of-failure in a web back-end?

Question

It seems like you're always dependent on some hosting provider being available. Even if your servers are geo-redundant across data centers, you still have a DNS record that points to some IP address, and it will be resolved by some DNS server that can disappear at any second. Is there a solution for this? I've seen people suggest DNS load-balancing with some mechanism for detecting downtime and doing failover. Which DNS provider offers this? And does it still rely on one of its data centers not being down?

Assuming everything behind our first line of contact (the LB proxy) is already geo-redundant, is there really a feasible way to take care of that last step?

score 3 · Answer 1 · answered May 22 '11 at 05:36

Actually, several DNS servers can serve the same domain. Take a look at stackoverflow.com:

$ nslookup -type=ns stackoverflow.com
Server:     192.168.0.1
Address:    192.168.0.1#53

Non-authoritative answer:
stackoverflow.com   nameserver = ns3.serverfault.com.
stackoverflow.com   nameserver = ns1.serverfault.com.
stackoverflow.com   nameserver = ns2.serverfault.com.

Authoritative answers can be found from:

$

The domain names under stackoverflow.com can be resolved by any of three name servers, so even if one or two of them go down, the names can still be resolved.
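To illustrate why several name servers remove that single point of failure, here is a minimal Python sketch of the failover a resolver performs: try each server in turn and take the first answer. It is not how any particular resolver is implemented. The function `resolve_with_failover`, the stand-in `fake_query`, the `example.net` server names and the `203.0.113.10` address are all made up for illustration, and no real DNS traffic is sent.

```python
from typing import Callable, Iterable, Optional


def resolve_with_failover(
    name: str,
    nameservers: Iterable[str],
    query: Callable[[str, str], str],
) -> Optional[str]:
    """Return the first answer any name server gives, or None if all fail."""
    for server in nameservers:
        try:
            return query(server, name)
        except OSError:
            # This server is unreachable or timed out; try the next one.
            continue
    return None


# Hypothetical stand-in for a real DNS query: ns1 and ns2 are "down".
def fake_query(server: str, name: str) -> str:
    if server in {"ns1.example.net", "ns2.example.net"}:
        raise TimeoutError(f"{server} did not respond")
    return "203.0.113.10"  # documentation-range address, not a real record


servers = ["ns1.example.net", "ns2.example.net", "ns3.example.net"]
answer = resolve_with_failover("www.example.com", servers, fake_query)
print(answer)
```

With two of the three servers failing, the lookup still succeeds through the third. Only when every listed server is unreachable does the function return `None`.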

score 2 · Answer 2 · answered May 22 '11 at 05:55

The RFCs that make recommendations for DNS servers suggest using at least three name servers placed in logically and geographically diverse locations to avoid exactly that problem. The IP addresses published for those servers can also be set up with IP anycast, so servers at a variety of locations can share the same IP address. Routing around failures is pretty much automatic when the proper routing is used (i.e. when one location tied to that IP goes down, traffic is simply directed to another). The root DNS servers and many of the major TLDs are set up this way to resist failure and be resilient against DDoS attacks. This is how services such as OpenDNS have close to 100% uptime even when serving billions of queries.
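The anycast behaviour described above can be pictured with a toy Python model: several sites announce the same IP address, and traffic goes to the reachable site with the cheapest path. This only sketches the idea. Real anycast failover happens in the routing layer, not in application code, and the `Site` class, the `route` function, the site names and the `path_cost` values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    name: str
    path_cost: int  # hypothetical routing metric; lower is preferred
    up: bool = True


def route(sites: List[Site]) -> Optional[Site]:
    """Pick the reachable site with the lowest path cost."""
    reachable = [s for s in sites if s.up]
    return min(reachable, key=lambda s: s.path_cost, default=None)


# Three sites announcing the same anycast address.
sites = [Site("amsterdam", 10), Site("new-york", 40), Site("tokyo", 90)]

first = route(sites)
print(first.name)  # amsterdam

# The preferred site fails; traffic shifts to the next-best one.
sites[0].up = False
second = route(sites)
print(second.name)  # new-york
```

The client never changes the address it talks to. Only the choice of site behind that address changes, which is why the failover looks automatic from the outside.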

Companies have spent millions of dollars on redundant infrastructure to reduce downtime, but failures can still happen, often in unexpected ways related to the human factors involved rather than the technological factors.
