April 10, 2015

How to set up a high availability configuration?

High availability means that the service never goes down. But what should the application guard against? What threats is the application facing? Is high availability even possible?

I am going to describe how to do it on Linux. The Microsoft stack virtually guarantees startup failure, so why bother supporting it? There are two parts to availability: the easy part - hardware, and the hard part - software.

There's a simple DNS-based HA technique where you just publish two IP addresses for one DNS record and run an identical service on both IPs. Browsers will try one of the two IP addresses at random, then fail over to the other one if they cannot establish a connection within 15-20 seconds. If the failure is permanent, you can switch the DNS records and see the change propagate to clients within 24 hours. Some people set a low TTL on the DNS records, but this is malpractice, because it kills DNS performance and it takes ages for admin staff to react to a failure anyway. DNS HA is simple and cheap, but those 15-20 seconds translate to maybe 30-90% traffic loss, so it's very limited in effectiveness. The other downside is that your servers might be reachable but experiencing internal issues, e.g. just returning 500 pages. That's a failure from the user's point of view, but the browser is happy to accept it. There is no health check with DNS-based HA.
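To sketch the idea, the zone file side of this is just two A records for the same name (the hostname and addresses below are placeholders):

    ; BIND-style zone file fragment: one name, two addresses
    ; clients pick one at random and fall back to the other if the connection fails
    www.example.com.    3600    IN    A    203.0.113.10
    www.example.com.    3600    IN    A    198.51.100.20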

Ideally we shouldn't need to mess around with DNS at all and should just somehow fail over to another server without changing our public IP address. That's why the cloud exists. I mean the real cloud. There are cheapo VPS services that will rent you a $5/mo VPS, but they have no HA offering. They might be willing to SAN-protect your storage and migrate your VPS within the datacenter while letting you keep your IP address, but (1) this is rare and expensive and (2) it won't protect you from the datacenter going down.

So how does the cloud handle it? The cloud is big enough to have multiple datacenters within every region (a region being something big like western Europe or the eastern US). Cloud providers have their own "autonomous system", which allows them to publish IP routing information globally. When a datacenter goes down, they just notify routers in the region to redirect all traffic to the remaining datacenters. Routing information can be updated in a couple of seconds and once that's done, your users cannot see any difference. At the API/UI level, clouds present this feature as availability zones. Essentially, you get 2-3 availability zones and each of them lands in a different datacenter within the region. You get one public IP address that normally load-balances traffic over your availability zones. When a datacenter (and one of your availability zones) goes down, your public IP address is rerouted to the live datacenters that contain your remaining 1-2 availability zones. That's something you can get only from the cloud and that's why the cloud is more expensive than simple VPS renting. Note the expense doesn't come from the higher cost of operating such a service, but rather from limited competition, since cloud is a big-business thing and there are only a handful of companies providing the service. The only downside I see is that the other 1-2 availability zones will be momentarily overloaded, and allocating more servers within the cloud might be halted, because the failover quickly exhausts the cloud's reserve capacity if it happens during peak hours.
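As one concrete illustration (AWS here is my example, not something the setup requires), a load balancer spanning two availability zones would be created roughly like this; the names are made up, and this assumes EC2-Classic style networking (in a VPC you'd pass subnets instead of zones):

    # hypothetical example: a classic ELB spread over two availability zones
    aws elb create-load-balancer \
        --load-balancer-name my-frontend \
        --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
        --availability-zones us-east-1a us-east-1b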

Whether you go with the cloud or DNS HA, you might be running multiple servers within each availability zone. You don't want the whole availability zone to go down just because of a single server failure. Clouds often provide load balancers, but you can do it on your own with keepalived. It will float your public IP address within a single LAN to whichever server is up at the moment. It reacts in milliseconds. You can also switch it manually when you know a server is going down for maintenance. Each of those servers then runs something like HAProxy and load-balances the traffic over your other servers. HAProxy runs health checks against your servers, so it will only load-balance over healthy ones. It can also provide affinity, i.e. one client always accesses one server unless some server goes up or down and HAProxy has to shuffle traffic to rebalance the cluster. That's important for performance and it also increases cache hit rate, which helps when your DB goes down, as you maintain at least partial availability.
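A minimal sketch of both pieces follows; the interface name, addresses, and health-check path are placeholders I picked for illustration:

    # /etc/keepalived/keepalived.conf - floats the shared IP to the highest-priority live server
    vrrp_instance VI_1 {
        state MASTER              # BACKUP on the second server
        interface eth0
        virtual_router_id 51
        priority 100              # lower on the backup, e.g. 90
        advert_int 1
        virtual_ipaddress {
            192.0.2.100
        }
    }

    # /etc/haproxy/haproxy.cfg fragment - health-checked backend with client affinity
    frontend www
        bind 192.0.2.100:80
        default_backend app
    backend app
        balance source                 # same client keeps hitting the same server until the pool changes
        option httpchk GET /health     # only servers passing the check receive traffic
        server app1 10.0.0.11:8080 check
        server app2 10.0.0.12:8080 check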

DBs and other stateful servers are a tricky thing to keep running, because you have to ensure consistency. This is usually done with a quorum, i.e. every DB server knows how many servers are in the group and it will only take action if it has majority consensus, also called quorum. That means you need at least 3 servers: with only 2, the survivor of a failure holds just half of the votes, so no majority can ever be reached. Unless you want to go down the "eventually consistent" route, which you probably don't.
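The post doesn't tie this to any particular database; as one concrete illustration, a 3-member MongoDB replica set (hostnames made up) is set up like this and keeps electing a primary as long as 2 of the 3 members can talk to each other:

    // mongo shell: three voting members, so a majority of 2 survives any single failure
    rs.initiate({
        _id: "rs0",
        members: [
            { _id: 0, host: "db1.internal:27017" },
            { _id: 1, host: "db2.internal:27017" },
            { _id: 2, host: "db3.internal:27017" }
        ]
    })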

Routing traffic to the DB and other internal servers can also be done with the keepalived/HAProxy combo, but oftentimes these services come with client libraries that perform much simpler and faster client-side failover. Client-side failover is partially possible with HTTP too. You can have client-side javascript that monitors download errors for static resources (images and stuff) and for AJAX calls and attempts the download from other servers in case of failure. However, the main HTML page download cannot be controlled by javascript, so this is a very limited failover option. There is HTML5 offline website functionality though. An unavailable HTML page will be redirected to an offline failure page that is prefetched along with other offline content by the browser. You can put a javascript retry loop there and restore the failed page as soon as the server becomes available, or redirect the user to a backup server (with perhaps a different subdomain). IMO this is seriously crappy failover. It complicates applications, burdens the client, and it will only work if the user has previously visited the site and enabled offline content in his browser, which might require explicit per-site consent in some browsers (read: I didn't test it).
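For the static-resource part, a rough javascript sketch (the hostnames are invented and real code would need more care):

    // retried images: wire up as <img src="..." onerror="retryFromBackup(this)">
    function retryFromBackup(img) {
        if (img.dataset.retried) return;   // give up after one retry to avoid a loop
        img.dataset.retried = "1";
        img.src = img.src.replace("//www.example.com/", "//backup.example.com/");
    }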

Okay, that was the easy part. Now how about software failure? The worst-case nightmare is a software bug that triggers on Friday afternoon on all your servers at once and is not discovered until Monday morning. Debugging and fixing it then takes several days, and then the deploy fails, resulting in more days of debugging. Now THAT is a serious problem. Imagine one week of downtime. Sounds like it cannot get any worse? Now imagine a security flaw in your software or servers. Some Chinese hacker (who gets $100 for the job or just does it out of hatred for "the West") will compromise your whole IT infrastructure, including your dev machines, your back office, your servers, your switches, the BIOS on your servers, your account with your cloud provider, everything... That mess is going to take several weeks to clean up. Couldn't be any worse? Several weeks is enough time to get everything up again, isn't it? Perhaps the hacker is paid to do the job and he places links to online gambling sites, porn, and some malware on your site. Search engines detect it and blacklist your site. It might take as much as 6 months to get off that blacklist. Funny, eh? And when it's all over, you are going to be fired, so the problem doesn't end for you after 6 months anyway.

Software failure is actually MUCH more common than hardware failure. There are ways to safeguard against software failure and human error and to minimize damage and speed up recovery when shit happens, but there's no silver bullet and definitely no perfect solution. It's a big topic and I am not going to discuss it now. I will have to write a separate blog post about it.

Software failure is still a pretty lightweight problem compared to human failure. You build all these fancy safety mechanisms at the application level to ensure everything is super reliable. Then you go to the sysadmin to ask what data, exactly, goes to offline backup and how often, and you are told that, well, there's no offline backup... Oops. One hacker attack and years of work are gone. Everything gets deleted, including your online backups. While often disregarded as an extremely unlikely disaster, this kind of total data loss actually happens, especially in the cloud-happy startup world. You can never be sure that people are doing what they should be doing.
