319

We received an interesting "requirement" from a client today.

They want 100% uptime with off-site failover on a web application. From our web application's viewpoint, this isn't an issue. It was designed to be able to scale out across multiple database servers, etc.

However, from a networking issue I just can't seem to figure out how to make it work.

In a nutshell, the application will live on servers within the client's network. It is accessed by both internal and external people. They want us to maintain an off-site copy of the system that in the event of a serious failure at their premises would immediately pick up and take over.

Now we know there is absolutely no way to resolve it for internal people (carrier pigeon?), but they want the external users to not even notice.

Quite frankly, I haven't the foggiest idea of how this might be possible. It seems that if they lose Internet connectivity then we would have to do a DNS change to forward traffic to the external machines... Which, of course, takes time.

Ideas?

UPDATE

I had a discussion with the client today and they clarified on the issue.

They stuck by the 100% number, saying the application should stay active even in the event of a flood. However, that requirement only kicks in if we host it for them. They said they would handle the uptime requirement if the application lives entirely on their servers. You can guess my response.

gWaldo
  • 12,027
ChrisLively
  • 3,782

27 Answers

370

Here is Wikipedia's handy chart of the pursuit of nines:

[Image: Wikipedia's availability table, mapping each number of nines to the corresponding allowed downtime per year, month, week, and day]

Interestingly, only 3 of the top 20 websites were able to achieve the mythical 5 nines or 99.999% uptime in 2007. They were Yahoo, AOL, and Comcast. In the first 4 months of 2008, some of the most popular social networks didn't even come close to that.

From the chart, it should be evident how ridiculous the pursuit of 100% uptime is...
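
If you want to reproduce the chart's downtime figures yourself, they follow directly from the percentages. A minimal Python sketch (the year length is approximated):

    # Allowed downtime per year for each "nines" level.
    SECONDS_PER_YEAR = 365.25 * 24 * 3600

    for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                       ("four nines", 99.99), ("five nines", 99.999),
                       ("six nines", 99.9999)]:
        downtime = SECONDS_PER_YEAR * (1 - pct / 100)
        print(f"{label} ({pct}%): {downtime / 3600:.2f} h "
              f"({downtime:.0f} s) of downtime per year")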

Skyhawk
  • 14,230
GregD
  • 8,753
192

Ask them to define 100%, how it will be measured, and over what time period. They probably mean as close to 100% as they can afford. Give them the costings.

To elaborate: I've been in discussions with clients over the years about supposedly ludicrous requirements. In every case they were actually just using imprecise language.

Quite often they frame things in ways that appear absolute, like 100%, but in actual fact, on deeper investigation, they are reasonable enough to do the cost/benefit analysis that is required once they are presented with costings and risk-mitigation data. Asking them how they will measure availability is a crucial question. If they don't know, then you are in the position of having to suggest to them that this needs to be defined first.

I would ask the client to define what would happen in terms of business impact/costs if the site went down in the following circumstances:

  • At their busiest hours for x hours
  • At their least busy hours for x hours

And also how they will measure this.

In this way you can work with them to determine the right level of '100%'. I suspect that by asking these kinds of questions they will be able to better prioritise their other requirements. For example, they may want to pay for a certain level of SLA and compromise other functionality in order to achieve it.
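
To make the "over what time period" question concrete: the same availability target tolerates very different outages depending on the measurement window. A quick illustration in Python, using approximate month and year lengths:

    # Downtime budget for the same target, measured per month vs. per year.
    MINUTES_PER_MONTH = 730 * 60      # ~365.25 * 24 / 12 hours
    MINUTES_PER_YEAR = 8766 * 60      # ~365.25 * 24 hours

    for target in (99.9, 99.99, 99.999):
        allowed = 1 - target / 100
        print(f"{target}%: up to {allowed * MINUTES_PER_MONTH:.1f} min/month "
              f"or {allowed * MINUTES_PER_YEAR:.1f} min/year of downtime")

A client who would shrug at a single 43-minute outage in a quiet month may feel very differently about it landing in their busiest hour, which is exactly why the measurement period belongs in the SLA.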

Preet Sangha
  • 2,787
137

Your clients are crazy. 100% uptime is impossible no matter how much money you spend on it. Plain and simple - impossible. Look at Google, Amazon, etc. They have nearly endless amounts of money to throw at their infrastructure and yet they still manage to have downtime. You need to deliver that message to them, and if they continue to insist on 100%, make it clear that the demand is unreasonable. If they still don't recognize that some amount of downtime is inevitable, then ditch 'em.

That said, you seem to have the mechanics of scaling/distributing the application itself covered. The networking portion will need to involve redundant uplinks to different ISPs, getting an ASN and your own IP allocation, and getting neck-deep in BGP and real routing gear so that your IP address space can move between ISPs if need be.

This is, quite obviously, a very terse answer. You haven't had experience with applications requiring this degree of uptime, so you really need to get a professional involved if you want to get anywhere close to the mythical 100% uptime.

EEAA
  • 110,608
54

Well, that's definitely an interesting one. I'm not sure I would want to get myself contractually obligated to 100% uptime, but if I had to I think it would look something like this:

Start with the public IP on a load balancer completely outside the client's network, and build at least two of them so that one can fail over to the other. A program like Heartbeat can help with the automatic failover between them.

Varnish is primarily known as a caching solution, but it does some very decent load balancing as well. Perhaps that would be a good choice to handle the load balancing. It can be set up with 1 to n backends, optionally grouped in directors, which will load balance either randomly or round-robin. Varnish can be made smart enough to check the health of every backend and drop unhealthy backends out of the loop until they come back online. The backends do not have to be on the same network.

I'm kind of in love with the Elastic IPs in Amazon EC2 these days so I would probably build my load balancers in EC2 in different regions or at least in different availability zones in the same region. That would give you the option of manually (god forbid) spinning up a new load balancer if you had to and moving the existing A record IP to the new box.

Varnish cannot terminate SSL, though, so if that is a concern you may want to look at something like Nginx instead.

You could have most of your backends in your client's network and one or more outside it. I believe, but am not 100% sure, that you can prioritize the backends so that your client's machines receive priority until all of them become unhealthy.
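
I can't squeeze a full Varnish VCL example in here, but the priority idea above is easy to sketch in plain Python. The backend URLs below are purely hypothetical and the health check is just an HTTP probe; Varnish directors do this more cleverly, this only illustrates the decision logic:

    # Prefer on-site backends; fall back to off-site ones only when every
    # on-site backend fails its health check.
    import urllib.request

    ONSITE = ["http://10.0.0.11/health", "http://10.0.0.12/health"]    # hypothetical
    OFFSITE = ["http://203.0.113.10/health"]                           # hypothetical

    def healthy(url, timeout=2):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def pick_backend():
        for group in (ONSITE, OFFSITE):        # the on-site group has priority
            live = [u for u in group if healthy(u)]
            if live:
                return live[0]                 # or round-robin within the group
        raise RuntimeError("no healthy backends")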

That's where I would start if I had this task and undoubtedly refine it as I go along.

However, as @ErikA states, it's the Internet and there are always going to be parts of the network that are outside your control. You'll want to make sure your legal agreement only ties you to things that are under your control.

jdw
  • 3,955
29

No problem - slightly revised contract wording though:

... guarantee an uptime of 100% (rounded to zero decimal places).

26

If Facebook and Amazon can't do it, then you can't. It's as simple as that.

Paperjam
  • 139
Mike
  • 22,748
25

To add oconnore's answer from Hacker News:

I don't understand what the issue is. The client wants you to plan for disaster, and they aren't math oriented, so asking for 100% probability sounds reasonable. The engineer, as engineers are prone to do, remembered his first day of prob&stat 101, without considering that the client might not. When they say this, they aren't thinking about nuclear winter, they are thinking about Fred dumping his coffee on the office server, a disk crashing, or an ISP going down. Furthermore, you can accomplish this. With geographically distinct, independent, self monitoring servers, you will basically have no downtime. With 3 servers operating at an independent(1) three 9 reliability, with good failover modes, your expected downtime is under a second per year(2). Even if this happens all at once, you are still within a reasonable SLA for web connections, and therefore the downtime practically does not exist. The client still has to deal with doomsday scenarios, but Godzilla excluded, he will have a service that is "always" up.

(1) A server in LA is reasonably independent from the server in Boston, but yes, I understand that there is some intersection involving nuclear war, Chinese hackers crashing the power grid, etc. I don't think your client will be upset by this.

(2) DNS failover may add a few seconds. You are still in a scenario where the client has to retry a request once a year, which is, again, within a reasonable SLA, and not typically considered in the same vein as "downtime". With an application that automatically reroutes to an available node on failure, this can be unnoticeable.
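
As a quick sanity check on the "under a second per year" figure, assuming three fully independent sites each at three-nines availability:

    # Probability that all three independent 99.9%-available sites are down
    # at once, and the expected simultaneous downtime per year.
    SECONDS_PER_YEAR = 365.25 * 24 * 3600

    per_site_unavailability = 1 - 0.999        # 0.001
    all_down = per_site_unavailability ** 3    # 1e-9
    print(f"P(all three down) = {all_down:.1e}")
    print(f"Expected simultaneous downtime: {all_down * SECONDS_PER_YEAR:.3f} s/year")

The independence assumption is doing all the work here, which is exactly the point of footnote (1).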

17

You are being asked for something impossible.

Review the other answers here, sit down with your client, and explain WHY it's impossible, and gauge their response.

If they still insist on 100% uptime, politely inform them that it cannot be done and decline the contract. You will never meet their demand, and if the contract doesn't totally suck you'll get skewered with penalties.

voretaq7
  • 80,749
13

Price accordingly, and then stipulate in the contract that any downtime past the SLA will be refunded at the rate they are paying.

The ISP at my last job did that. We had the choice of a "regular" DSL line at 99.9% uptime for $40/mo, or a bonded trio of T1s at 99.99% uptime for $1100/mo. There were frequent outages of 10+ hours per month, which brought their uptime well below that of the $40/mo DSL line, yet we were only refunded around $15 or so, because that's what rate-per-hour times hours came out to. They made out like bandits on the deal.

If you bill $450,000 a month for 100% uptime and you only hit 99.999%, you'll need to refund them roughly $4.50 - the pro-rata price of the 26 seconds or so of downtime. I'm willing to bet the infrastructure costs to hit 99.999% are in the neighborhood of $45,000 a month, assuming fully distributed colos, multiple tier 1 uplinks, fancypants hardware, etc.
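
The refund arithmetic, for anyone who wants to check it (a 730-hour month assumed):

    # Pro-rata refund owed for landing at 99.999% instead of 100%,
    # on a hypothetical $450,000/month bill.
    monthly_bill = 450_000
    hours_per_month = 730
    achieved = 0.99999

    downtime_hours = (1 - achieved) * hours_per_month            # ~26 seconds
    refund = downtime_hours * monthly_bill / hours_per_month
    print(f"Downtime: {downtime_hours * 3600:.0f} s, refund: ${refund:.2f}")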

Bryan B
  • 379
10

If professionals question whether 99.999% availability is ever a practical or financially viable possibility, then 99.9999% availability is even less possible or practical - let alone 100%.

You will not meet a 100% availability goal for an extended period of time. You may get away with it for a week or a year, but then something will happen and you will be held responsible. The fallout can range from a damaged reputation (you promised, you didn't deliver) to bankruptcy caused by contractual fines.

Paweł Brodacki
  • 6,591
10

There are two types of people who ask for 100% uptime:

  1. People with absolutely no knowledge about computers, computer systems, or the Internet.*
  2. People who are intentionally making an ass of themselves, either to test your ability to say No (Google "the Orange Juice Test"), or to gain some kind of contractual SLA leverage in order to get out of paying you later.

My advice, having suffered both of these types of clients on many occasions, is to not take this client. Let them drive someone else insane.

*This same person might also inquire, without any embarrassment, about Faster-than-Light travel, Perpetual Motion, Cold Fusion, etc.

Irving
  • 146
8

I would communicate with the client to establish exactly what 100% uptime means to them. It is possible they don't really see a distinction between 99% uptime and 100% uptime. To most people (i.e. not server admins), those two numbers are the same.

jhocking
  • 181
6

100% uptime?

Here's what you need:

Multiple (and redundant) DNS servers, pointing to multiple sites all over the world, with proper SLAs with each ISP.

Make sure the DNS servers are set up properly, with TTLs that are low enough and actually honoured by resolvers.

A T
  • 397
6

This is easy. The Amazon EC2 SLA clearly states:

“Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.”

http://aws.amazon.com/ec2-sla/

Just define 'uptime' to be relative to the entire bundle of service you can actually keep operational 100% of the time, and you should have no problems.

Also, it's worth pointing out that the entire point of an SLA is to define what your obligations are and what happens if you can't meet them. It doesn't matter if the client asks for 3 nines or 5 nines or a million nines - the question is what they get when/if you can't deliver. The obvious answer is to provide a line item for 100% uptime at 5x the price you want to charge, and then they get a 4x refund if you miss that target. You might score!
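
The economics of that line item are easy to check. A tiny sketch, with base_price a purely hypothetical figure for what you originally wanted to charge:

    # Expected revenue of "charge 5x, refund 4x on a miss", as a function of
    # how often you actually miss the 100% target.
    base_price = 10_000    # hypothetical monthly price

    for p_miss in (0.0, 0.5, 1.0):
        expected = 5 * base_price - p_miss * 4 * base_price
        print(f"P(miss)={p_miss:.0%}: expected revenue ${expected:,.0f} "
              f"(vs ${base_price:,} at the honest price)")

Even if you miss every single month you still collect the price you wanted in the first place, which is why the penalty terms matter more than the number of nines the client insists on.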

fields
  • 720
5

DNS changes only take time if they are configured to take time. You can set the TTL on a record to one second - your only issue would be to ensure that you provide a timely response to DNS queries, and that the DNS servers can cope with that level of queries.

This is exactly how GTM works on F5 BIG-IP - the DNS TTL defaults to 30 seconds, and if one member of the cluster needs to take over, the DNS is updated and the new IP is picked up almost immediately. That's a maximum of 30 seconds of outage, and that's the edge case; the average would be about 15 seconds.
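
If you want to see what TTL clients are actually being handed, and therefore your worst-case stale window after a DNS failover, here is a short probe with the third-party dnspython library (the hostname is a placeholder):

    # Check the served TTL for an A record.  Requires "pip install dnspython".
    import dns.resolver

    answer = dns.resolver.resolve("www.example.com", "A")    # placeholder name
    ttl = answer.rrset.ttl
    print(f"TTL: {ttl} s")
    print(f"Worst-case stale window after failover: ~{ttl} s, average ~{ttl / 2:.0f} s")

Bear in mind that some resolvers and ISPs clamp or ignore very low TTLs, so the value you publish isn't always the value clients respect.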

Paul
  • 1,349
5

You know this is impossible.

No doubt the client is focused on seeing "100%", so the best you can do is promise 100%, except for [all reasonable causes that aren't your fault].

Marcin
  • 154
4

While I doubt 100% is possible, you may want to consider Azure (or something with a similar SLA) as a possibility. Here's what goes on:

Your servers are virtual machines. If there's ever a hardware issue on one host, your virtual machine is moved to a new machine. The load balancer takes care of the redirection, so the customer should not see any downtime (though I'm not sure how your session state would be affected).

That said, even with this fail-over, the difference between 99.999 and 100 borders on insanity.

You'll have to have full control over the following factors:

  • Human factors, both internal and external, covering both malice and incompetence. An example is somebody pushing something to production that brings down a server. Even worse, what about sabotage?
  • Business issues. What if your provider goes out of business, forgets to pay its electric bills, or simply decides to stop supporting your infrastructure without sufficient warning?
  • Nature. What if unrelated tornadoes simultaneously hit enough data centers to overwhelm your backup capacity?
  • A completely bug-free environment. Are you sure there isn't an edge case in some third-party or core system component that hasn't manifested itself yet but still could in the future?
  • Even if you have full control over all of the above, are you sure whatever software or person is monitoring this won't give you false negatives when checking whether your system is up?

JSWork
  • 151
4

Honestly, 100% is completely insane without at least a waiver covering things like hacking attacks. Your best bet is to do what Google and Amazon do and use a geo-distributed hosting solution, with your site and DB replicated across multiple servers in multiple geographic locations. That will cover you in anything but a major disaster, such as the internet backbone being cut to a region (which does happen from time to time) or something nearly apocalyptic.

I would put in a clause for just such cases (DDOS, internet backbone cutting, apocalyptic terrorist attack or a big war, etc).

Other than that, look into Amazon S3 or Rackspace cloud services. Essentially a cloud setup will offer not just redundancy in each location but also scalability and geo-distribution of traffic, along with the ability to redirect around failed geo-areas. Though my understanding is that geo-distribution costs more money.

Patrick
  • 190
3

I just wanted to add another voice to the "it can (theoretically) be done" party.

I wouldn't take on a contract that had this specified no matter how much they paid me, but as a research problem it has some rather interesting solutions. I'm not familiar enough with networking to outline the steps, but I imagine some combination of network configuration, electrical/hardware failover and software failover could, in some configuration or other, actually pull it off.

There's almost always a single point of failure somewhere in any configuration, but if you work hard enough you can push that point of failure to be something that can be repaired "live" (e.g. the root DNS goes down, but the values are still cached everywhere else, so you have time to fix it).

Again, I'm not saying it's feasible. I just didn't like that not a single answer addressed the fact that it isn't "way out there" - it's just not something the client actually wants if they think it through.

3

Re-think your methodology of measuring availability then work with your customer to set meaningful targets.

If you are running a large website, uptime is not useful at all. If you drop queries for 10 minutes when your customers need them most (traffic peak), it could be more damaging to the business than an hour-long outage at 3 AM on a Sunday.

Sometimes large web companies measure availability, or reliability, using the following metrics:

  1. percentage of queries that are answered successfully, without a server-side error (HTTP 500s).
  2. percentage of queries that are answered below a certain target latency.
  3. dropped queries should count against your stats (see below).

Availability should not be measured using sample probes, which is all that external entities such as Pingdom and Pingability are able to report. Don't rely solely on that. If you want to do it right, every single query should count. Measure your availability by looking at your actual, perceived success rate.

The most efficient way is to collect logs or stats from your load-balancer and calculate the availability based on the metrics above.

The percentage of dropped queries should also count against your stats. They can be counted in the same bucket as server-side errors. If there are problems with the network or with other infrastructure such as DNS or the load balancers, you can use simple math to estimate how many queries you lost. If you expected X queries for that day of the week but you got X-1000, you probably dropped 1000 queries. Plot your traffic as queries-per-minute (or per-second) graphs; if gaps appear, you dropped queries. Use basic geometry to measure the area of those gaps, which gives you the total number of dropped queries.
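
Here is a sketch of that log-based calculation in Python. The log format (timestamp, status code, latency in ms per line) is an assumption about a generic load balancer; adapt the parsing to whatever yours actually emits:

    # Query-based availability: 5xx responses, over-latency responses and an
    # estimate of dropped queries all count as failures.
    def availability(log_lines, latency_target_ms=500, estimated_dropped=0):
        total, bad = 0, 0
        for line in log_lines:
            _, status, latency_ms = line.split()
            total += 1
            if status.startswith("5") or float(latency_ms) > latency_target_ms:
                bad += 1
        total += estimated_dropped      # dropped queries are failures too
        bad += estimated_dropped
        return 100.0 * (total - bad) / total if total else 0.0

    sample = ["2023-01-01T00:00:00Z 200 120",
              "2023-01-01T00:00:01Z 500 80",
              "2023-01-01T00:00:02Z 200 900"]
    print(f"{availability(sample, estimated_dropped=1):.1f}% available")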

Discuss this methodology with your customer and explain its benefits. Set a base-line by measuring their current availability. It will become clear to them that 100% is an impossible target.

Then you can sign a contract based on improvements over that baseline. Say, if they are currently experiencing 95% availability, you could promise to cut their downtime roughly threefold by getting them to 98.5%.

Note: there are disadvantages to this way of measuring availability. First, collecting the logs, processing them and generating the reports yourself may not be trivial unless you use existing tools to do it. Second, application bugs may hurt your availability figures. If the application is low quality, it will serve more errors. The solution is to count only the 500s created by the load balancer rather than those coming from the application.

Things may get a bit complicated this way, but it's one step beyond measuring just your server uptime.

3

While some people here have noted that 100% is insane or impossible, they somehow missed the real point. They argued that the reason is that even the best companies/services cannot achieve it.

Well, it's a lot simpler than that. It's mathematically impossible.

Everything has a probability. There could be a simultaneous earthquake at every location where you store your servers, destroying all of them. Admittedly that's a ridiculously small probability, but it's not 0. All your internet providers could face a simultaneous terrorist or cyber attack. Again, not very probable, but not zero either. Whatever you provide, there is a non-zero-probability scenario that brings the whole service down. Because of this, your uptime cannot be 100% either.

2

Go grab a book on manufacturing quality control using statistical sampling. A general discussion in such a book, the concepts of which any manager would have been exposed to in a general statistics course in college, is that the cost of going from 1 exception in a thousand, to 1 in ten thousand, to 1 in a million, to 1 in a billion rises exponentially. Essentially, hitting 100% uptime would cost an almost unlimited amount of money, rather like the amount of fuel required to push an object to the speed of light.

From a performance engineering perspective I would reject the requirement as both untestable and unreasonable; this expression is more of a desire than a true requirement. With the dependencies that exist outside of any application (networking, name resolution, routing, defects propagated from underlying architectural components or development tools), it becomes a practical impossibility for anyone to guarantee 100% uptime.

1

I don't think the customer is actually asking for 100% uptime, or even 99.999% uptime. If you look at what they're describing, they're talking about picking up where they left off if a meteor takes out their on-site datacenter.

If the requirement is that external people don't even notice, how drastic does the failover have to be? Would making an Ajax request retry and show a spinner to the end user for 30 seconds be acceptable?

Those are the kinds of things the customer cares about. If the customer was actually thinking of precise SLAs, then they would know enough to express it as 99.99 or 99.999.

1

My 2 cents: I was responsible for a very popular web site for a Fortune 5 company that would take out ads during the Super Bowl. I had to deal with huge spikes in traffic, and the way I solved it was to use a service like Akamai. I do not work for Akamai, but I found their service extremely good. They have their own, smarter DNS system that knows when a particular node/host is either under heavy load or down, and can route traffic accordingly.

The neat thing about their service was that I didn't really have to do anything very complicated in order to replicate content from the servers in my own data center to theirs. Additionally, I know from working with them that they made heavy use of Apache HTTP servers.

While not 100% uptime, you may want to consider such options for dispersing content around the world. As I understood it, Akamai also had the ability to localize traffic, meaning that if I was in Michigan I got content from a Michigan/Chicago server, and if I was in California I got content from a server based in California.

Kilo
  • 1,574
0

Instead of off-site failover, just run the application from two locations simultaneously, one internal and one external, and synchronise the two databases. Then if the internal site goes down, the internal people will still be able to work, and external people will still be able to use the application. When the internal site comes back online, synchronise the changes. You can have two DNS entries for one domain name, or even a network router doing round robin.
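
Note that round-robin DNS only looks seamless if clients actually try the other address when the first one is down (browsers mostly do). A minimal sketch of that retry behaviour in Python, with the hostname and port as placeholders:

    # With two A records for one name, a client that walks through all the
    # resolved addresses keeps working as long as either site is up.
    import socket

    def connect_any(host, port, timeout=3):
        last_err = None
        for *_, sockaddr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(sockaddr[:2], timeout=timeout)
            except OSError as err:
                last_err = err           # this address is down; try the next one
        raise last_err if last_err else OSError(f"could not resolve {host}")

    # conn = connect_any("app.example.com", 443)    # placeholder hostname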

Christian
  • 836
0

For externally hosted sites, the closest you'll get to 100% uptime is hosting your site on Google's App Engine and using its high replication datastore (HRD), which automatically replicates your data across at least three data centers in real time. Likewise, the App Engine front-end servers are auto scaled/replicated for you.

However, even with all of Google's resources and the most sophisticated platform in the world, the App Engine SLA uptime guarantee is only "99.95% of the time in any calendar month."

espeed
  • 159
0

Simple and direct: Anycast

http://en.wikipedia.org/wiki/Anycast

This is what Cloudflare, Google, and other big companies use to do redundant, low-latency, cross-continental failover/balancing.

But also keep in mind that it's impossible to have 100% uptime, and that the cost of going from 99.999% to 99.9999% is MUCH bigger.