Robert Važan

How to set up a high availability configuration

High availability means that the service never goes down. But what should the application guard against? What threats is the application facing? Is high availability even possible?

I am going to describe how to do it on Linux. The Microsoft stack virtually guarantees startup failure, so why bother supporting it? There are two parts to availability: the easy part - hardware, and the hard part - software.

DNS

There's a simple DNS-based HA technique where you just publish two IP addresses for one DNS record and run an identical service on both IPs. Browsers will try one of the two IP addresses at random, then fail over to the other one if they cannot establish a connection within 15-20 seconds. If the failure is permanent, you can switch DNS records and see the change propagate to clients within 24 hours. Some people set a low TTL on the DNS record, but this is malpractice, because it kills DNS performance and it takes admin staff ages to react to a failure anyway. DNS HA is simple and cheap, but those 15-20 seconds will translate to maybe 30-90% traffic loss, so it's very limited in effectiveness. The other downside is that your servers might be reachable yet suffering internal issues, e.g. just returning 500 pages. That's a failure from the user's point of view, but the browser happily accepts it. There is no health check with DNS-based HA.
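For illustration, the zone records for such a setup might look something like this (example.com, the addresses, and the TTL are placeholders):

    ; two A records for the same name, one per server, with an ordinary TTL
    www.example.com.   3600   IN   A   203.0.113.10
    www.example.com.   3600   IN   A   198.51.100.20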

Cloud

Ideally we shouldn't need to mess around with DNS at all and instead somehow fail over to another server without changing our public IP address. That's why cloud exists. I mean the real cloud. There are cheapo VPS services that will rent you a $5/mo VPS, but they have no HA offering. They might be willing to SAN-protect your storage and migrate your VPS within the datacenter while letting you keep your IP address, but (1) this is rare and expensive and (2) it won't protect you from the datacenter going down.

So how does cloud handle it? A cloud is big enough to have multiple datacenters within every region (a region being something big like western Europe or the eastern US). Clouds have their own "autonomous system", which allows them to publish IP routing information globally. When a datacenter goes down, they just notify routers in the region to redirect all traffic to the remaining datacenters. Routing information can be updated in a couple of seconds, and once that's done, your users cannot see any difference. At API/UI level, clouds present this feature as availability zones. Essentially, you get 2-3 availability zones and each of them lands in a different datacenter within the region. You get one public IP address that normally load-balances traffic over your availability zones. When a datacenter (and with it one of your availability zones) goes down, your public IP address is rerouted to the live datacenters that contain your remaining 1-2 availability zones. That's something you can get only from a cloud, and that's why cloud is more expensive than simple VPS rental. Note that the expense doesn't come from a higher cost of operating such a service, but rather from limited competition, since cloud is a big-business thing and there are only a handful of companies providing the service. The only downside I see is that the other 1-2 availability zones will be momentarily overloaded and allocation of more servers within the cloud might be halted, because failover during peak hours quickly exhausts the cloud's reserve capacity.

Keepalived

Whether you go with the cloud or DNS HA, you might be running multiple servers within every availability zone. You don't want the whole availability zone to go down just because of a single server failure. Clouds often provide you with load balancers, but you can do it on your own with keepalived. It will float your public IP address within a single LAN to whichever server is up at the moment. It reacts in milliseconds. You can also switch it manually when you know a server is going down for maintenance.
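A minimal keepalived sketch might look like the following; the interface name, shared secret, and virtual IP are placeholders, and the backup node carries the same block with state BACKUP and a lower priority:

    vrrp_instance VI_1 {
        state MASTER            # the peer node uses BACKUP
        interface eth0          # LAN interface that carries the floating IP
        virtual_router_id 51
        priority 100            # the peer gets a lower priority, e.g. 90
        advert_int 1            # seconds between VRRP advertisements
        authentication {
            auth_type PASS
            auth_pass s3cret    # placeholder shared secret
        }
        virtual_ipaddress {
            203.0.113.10        # the floating public IP
        }
    }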

HAProxy

Every one of those servers then runs something like HAProxy and load-balances the traffic over your other servers. HAProxy runs health checks against your servers, so it will only load-balance over the healthy ones. It can also provide affinity, i.e. one client always accesses one server unless some server goes up or down and HAProxy has to shuffle traffic to rebalance the cluster. That's important for performance, and it also increases cache hit rate, which helps when your DB goes down, because you maintain at least partial availability.
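A rough haproxy.cfg sketch with health checks and source-IP affinity could look like this (the backend name, addresses, and the /health endpoint are assumptions):

    frontend www
        bind *:80
        default_backend app

    backend app
        balance source              # affinity: hash of the client IP picks the server
        option httpchk GET /health  # assumed health-check endpoint
        server web1 10.0.0.11:8080 check
        server web2 10.0.0.12:8080 check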

Stateful servers

DBs and other stateful servers are tricky to keep running, because you have to ensure consistency. This is usually done with a quorum, i.e. every DB server knows how many servers are in the group and it will only take action if it has the consent of a majority of them. That means you need at least 3 servers, because with only 2 servers a single failure or network split leaves no majority. Unless you want to go down the "eventually consistent" route, which you probably don't.
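To make the arithmetic concrete, here's a tiny sketch of the majority rule (a generic illustration, not any particular database's implementation):

    // A node may act only if it can reach a strict majority of the cluster,
    // counting itself.
    function hasQuorum(reachableNodes: number, clusterSize: number): boolean {
        return reachableNodes >= Math.floor(clusterSize / 2) + 1;
    }

    console.log(hasQuorum(2, 3)); // true: a 3-node cluster tolerates one failure
    console.log(hasQuorum(1, 2)); // false: in a 2-node cluster the majority is 2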

Client-side failover

Routing traffic to the DB and other internal servers can also be done with the keepalived/HAProxy combo, but oftentimes these services have client libraries that perform much simpler and faster client-side failover. Client-side failover is partially possible with HTTP too. You can have client-side javascript that monitors download errors for static resources (images and the like) and AJAX calls and retries the download from other servers in case of failure. However, the main HTML page download cannot be controlled by javascript, so this is a very limited failover option. There is HTML5 offline website functionality, though. An unavailable HTML page gets redirected to an offline failure page that the browser prefetches along with the rest of the offline content. You can put a javascript retry loop there and restore the failed page as soon as the server becomes available, or redirect the user to a backup server (perhaps on a different subdomain). IMO this is seriously crappy failover. It complicates applications, burdens the client, and it only works if the user has previously visited the site and enabled offline content in their browser, which might require explicit per-site consent in some browsers (read: I didn't test it).
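For the AJAX part, a minimal sketch of the retry idea might look like this; the host list and the /api/data path are made up:

    // Hypothetical list of mirrors serving the same API.
    const hosts = ["https://a.example.com", "https://b.example.com"];

    async function fetchWithFailover(path: string): Promise<Response> {
        let lastError: unknown = new Error("no hosts configured");
        for (const host of hosts) {
            try {
                const response = await fetch(host + path);
                if (response.ok) return response;
                lastError = new Error(`HTTP ${response.status} from ${host}`);
            } catch (error) {
                lastError = error; // network error, try the next host
            }
        }
        throw lastError;
    }

    // Usage: fetchWithFailover("/api/data").then(r => r.json()).then(console.log);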

Software bugs

Okay, that was the easy part. Now how about software failure, i.e. bugs? The worst-case nightmare is a software bug that triggers on Friday afternoon on all your servers at once and isn't discovered until Monday morning. Debugging and fixing it then takes several days, and then the deploy fails, resulting in more days of debugging. Now THAT is a serious problem. Imagine one week of downtime.

Software vulnerabilities

Sounds like it cannot get any worse? Now imagine a security flaw in your software or servers. Some Russian or Chinese hacker (who gets $300 for the job or just does it out of hatred for "the West") uses some semi-automated tool to compromise your whole IT infrastructure, including your dev machines, your back office, your servers, your switches, the BIOS on your servers, your account with your cloud provider, everything... That mess is going to take several weeks to clean up. Couldn't be any worse? Several weeks is enough time to get everything up again, isn't it? Perhaps the hacker does it for profit and places links to online gambling sites, porn, and some malware on your site. Search engines detect it and blacklist your site. It might take as much as 6 months to get off that blacklist. Funny, eh? And when it's all over, you are going to be fired, so the problem doesn't end for you after 6 months anyway.

Software failure is actually MUCH more common than hardware failure. There are ways to safeguard against software failure and human error and to minimize damage and speed up recovery when shit happens, but there's no silver bullet and definitely no perfect solution. It's a big topic and I am not going to discuss it now. I will have to write another blog post about this one.

Humans and (the lack of) organization

Software failure is still a pretty lightweight problem compared to human failure. You build all these fancy safety mechanisms at the application level to ensure everything is super reliable. Then you come to the sysadmin to ask what data, exactly, goes into the offline backup and how often, and you are told that, well, there is no offline backup... Oops. One hacker attack and years of work are gone. Everything gets deleted, including your online backups. While often dismissed as an extremely unlikely disaster, this kind of total data loss actually happens, especially in the cloud-happy startup world. You can never be sure that people are doing what they should be doing.

Practical advice

Most businesses can actually run everything they have on one server, which might be split into several virtual servers for various reasons, but the total size of the operation is no bigger than one physical server. Single-server failure rate is low, and cloud providers (even cheap VPS providers) have proactive hardware health monitoring and will migrate your virtual server to another physical server before you even notice the problem. This is why smaller operations should be hosted entirely in the cloud. In this scenario, there is absolutely no need for hardware-level redundancy (other than whatever is transparently provided by the cloud), because the added complexity would actually increase the overall failure rate.

In smaller operations, your focus should be entirely on downtime caused by software and human error. You need a scripted server setup that can redeploy any server fully automatically within 15-30 minutes. You have to test it from time to time. You might wish to run HAProxy, not for its intended purpose of load balancing and high availability, but rather as a lever you can pull to quickly move traffic to newly deployed servers, as sketched below. This will allow you to bring fixed software online without having to wait for DNS timeouts.
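As a hedged sketch of that lever, assuming HAProxy is configured with an admin-level stats socket at /var/run/haproxy.sock and has a backend named app with servers old and new, the runtime API lets you shift traffic without a restart:

    # Drain the old server (serve existing sessions, accept no new ones)
    # and enable the freshly deployed one. Names and paths are assumptions.
    echo "set server app/old state drain" | socat stdio /var/run/haproxy.sock
    echo "set server app/new state ready" | socat stdio /var/run/haproxy.sock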

Versioning is essential, because it lets you return to a known good configuration when something goes wrong. Immutable servers provide natural versioning, because you can redirect traffic back to the old server if the new one is broken. Server setup scripts should be versioned, so that you have a natural log of configuration changes and can go back to a previous server version. If your deployment scripts support fast rsync-based deployment of new application versions in addition to full server redeployment, you need to make sure the server keeps older versions of the application around, so that you can switch back to them in case the new version does not work.
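One common way to keep old versions around is a per-release directory plus a "current" symlink; the paths and version numbers below are just an illustration:

    # Upload the new build into its own release directory...
    rsync -a ./build/ /srv/app/releases/2.3.1/
    # ...and activate it by repointing the symlink the service runs from.
    ln -sfn /srv/app/releases/2.3.1 /srv/app/current
    # Rollback is just pointing the symlink back at the previous release.
    ln -sfn /srv/app/releases/2.3.0 /srv/app/current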

In order to know that you need to fix something, you must have monitoring with alerts, including external pings, centralized logging, and metrics. In order to know the application is not ready for deployment, you need automated tests.

When everything else fails, you have frequent online backups and less frequent offline backups. Both backups and server setup scripts need to be tested regularly, and the best way to do that is to use them to periodically recreate the whole test environment from scratch.