May 5, 2015

Who needs partition tolerance anyway?

Nobody takes the CAP theorem seriously anymore. Except a couple dozen NoSQL database vendors who justify their core architecture with it. The theory goes that in choosing two out of three, RDBMS would traditionally sacrifice partition tolerance while NoSQL databases would sacrifice consistency in the name of availability and partition tolerance. I however suspect that partition tolerance in NoSQL is motivated by ease of implementation more than by consideration of application requirements.

Actually the story about NoSQL isn't really about CAP or SQL. It's about complexity. RDBMS are complexity monsters that consequently evolve very slowly. Their architecture is decades old by now and it's showing its age. RDBMS are growing increasingly unsuitable for the needs of modern applications.

The idea of NoSQL was to shed some of the burden, mostly the query language and schema, but more often than not also ACID, in order to create breathing space for innovation. Thus NoSQL conquered the high-performance and distributed systems territory in a couple of years while RDBMS still look the same as they did in the '90s.

I have no problem with the lack of a query language or schema. All those nifty features can be built on top of a suitable core data processing capability. ACID is an entirely different question. You really need it. If the database isn't ACID, then ACID guarantees have to be implemented in the application layer, a much more expensive undertaking given the oceans of app code out there. If the application layer doesn't provide ACID either, you have an unreliable system and you need to deal with failures on the organizational level, i.e. keep sending apology emails to customers.

Surprisingly enough, the latter is the approach adopted by most businesses. Welcome to the future where nothing works. Logical types like myself, however, like to work on top of well-defined primitives, so that the software can be made 100% perfect and I don't have to return to it again and again with yet another patch to handle yet another corner case. I find ACID handy.

Of course, RDBMS didn't really provide any ACID either. Caching, sharding, replication, branch-local or device-local databases, cross-database commits, not to mention sloppy app code, all these layers on top of RDBMS completely disregarded ACID, rendering RDBMS-level ACID guarantees useless. So NoSQL rightfully dumped this unnecessary burden.

But then, ACID wasn't entirely useless. Most of the layers above RDBMS that killed ACID were either performance-related, assumed a very unreliable Internet, or were designed for fringe cases. Most NoSQL databases are intrinsically fast and scalable. The Internet is quickly becoming a dependable utility like electricity. And there's no point killing all ACID semantics because of a few fringe cases, mostly dealing with interaction with external systems.

So why no ACID in distributed NoSQL databases? Some people argue, still relying on the CAP theorem, that you must keep the app online (i.e. you need availability) and you must handle server failure automatically (i.e. you need partition tolerance), so you have to drop consistency. While availability is easy to understand, I find the partition tolerance argument cheesy.

Firstly, tolerance of server failure doesn't really require partition tolerance. Servers generally don't fail in droves. They fail one at a time. Quorum systems, where a majority vote is required to proceed, are able to provide consistency and availability in the face of individual server failures.
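To make the majority rule concrete, here is a minimal Python sketch. The function names are made up for illustration and don't come from any particular database; the only assumption is that every node knows the fixed cluster size and can count how many nodes (itself included) it can currently reach.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if the reachable nodes (including this one) form a strict majority."""
    return reachable > cluster_size // 2


def tolerated_failures(cluster_size: int) -> int:
    """Largest number of simultaneous node failures that still leaves a majority."""
    return (cluster_size - 1) // 2


# A 3-node cluster keeps working with one node down, but not with two.
print(has_quorum(2, 3), has_quorum(1, 3))
# Going from 3 to 4 nodes does not buy any extra failure tolerance.
print(tolerated_failures(3), tolerated_failures(4), tolerated_failures(5))
```

Note that an even split never has a quorum on either side, which is why odd cluster sizes are the usual choice.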

Servers sometimes do fail in droves. The idea of partition tolerance is that a small group of servers, even a single server, can keep working when disconnected from the larger community. But how useful is this behavior?

Let's think of modern application architecture. The application runs in one or more regions (essentially datacenters) and every region has a flock of virtual machines running the application. These virtual machines are themselves of the high availability variety. They have redundant SAN storage that is immortal as long as the datacenter as a whole keeps running. VMs themselves can be rapidly migrated to another node and reattached to their SAN storage with minimal downtime. Hardware failure looks like a single reboot to the affected VM. VMs are immortal as long as the datacenter as a whole keeps running.

So what if the datacenter goes down completely? This reportedly happens sometimes and it's usually software-related. Most datacenters have a backup power supply and redundant connectivity. Nothing short of an earthquake can stop them at the hardware level. So what if someone accidentally flips the red switch that kills the datacenter? If all database nodes are in the same datacenter, then no amount of partition tolerance will help you. The whole app is down in any case.

What about a globally distributed database? In this case the above-mentioned quorum will save the admin's ass, assuming the app is in at least three datacenters, so that losing any one of them still leaves a majority. Still no need for full partition tolerance. While it is possible for more than half of the planet's datacenters to go down at the same time, it is very unlikely and nobody expects the average application to function in that case.

You might argue that a datacenter going down is not the only kind of partition. The datacenter might be just isolated. It might be disconnected from peers, but it might keep serving clients. I find this scenario unlikely. While the database might remain connected to app servers within the same datacenter, I very much doubt the app would be reachable by end users if the connection to peers in other datacenters is lost. This might happen if the datacenter becomes isolated during war, but then again, nobody expects your application to work under such circumstances.

Then there's latency. Latency is effectively a partial network partition. You don't see what happens at the other end of the world until about 100ms later. Yet you are expected to execute transactions in mere milliseconds. You are effectively forced to execute the transaction locally, in temporary isolation, without any coordination with other nodes.

This is, however, a well-studied problem that has some solid solutions. You essentially use multi-version concurrency control to provide the illusion of halted time for reads. Then you permit writes to execute speculatively, giving them a tentative transaction confirmation. The definitive transaction confirmation arrives perhaps 200-500ms later.
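The combination of snapshot reads and tentative writes can be sketched roughly as follows. This is a single-process toy in Python, not how any real database does it: the `VersionedStore` class and its method names are hypothetical, there is no networking and no conflict detection, and the "global" decision is simply whoever calls `confirm` or `reject`.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    # key -> list of (commit_version, value) in ascending version order
    history: dict = field(default_factory=dict)
    version: int = 0                               # latest definitively confirmed version
    tentative: dict = field(default_factory=dict)  # txn id -> pending writes
    next_txn: int = 1

    def snapshot(self) -> int:
        """Freeze time for a reader: remember the current confirmed version."""
        return self.version

    def read(self, key, snapshot: int):
        """Newest value committed at or before the snapshot, or None."""
        result = None
        for committed_at, value in self.history.get(key, []):
            if committed_at > snapshot:
                break
            result = value
        return result

    def write_tentative(self, writes: dict) -> int:
        """Accept writes speculatively; they stay invisible to readers for now."""
        txn = self.next_txn
        self.next_txn += 1
        self.tentative[txn] = dict(writes)
        return txn

    def confirm(self, txn: int) -> int:
        """Definitive confirmation: publish the writes as a new version."""
        writes = self.tentative.pop(txn)
        self.version += 1
        for key, value in writes.items():
            self.history.setdefault(key, []).append((self.version, value))
        return self.version

    def reject(self, txn: int) -> None:
        """Global commit failed: discard the tentative writes."""
        self.tentative.pop(txn)


store = VersionedStore()
store.confirm(store.write_tentative({"balance": 100}))
snap = store.snapshot()                      # reader's view of halted time
store.confirm(store.write_tentative({"balance": 70}))
print(store.read("balance", snap), store.read("balance", store.snapshot()))
```

A reader holding an old snapshot keeps seeing the old value no matter how many writes get confirmed in the meantime, which is the "halted time" part.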

While this tentative commit seems like a complexity nightmare for applications, it is in fact pretty easy to design app frameworks that translate the delayed confirmation into UI-level behavior, perhaps displaying the confirmation immediately, then reverting the UI to an error report 500ms later in those 0.01% of cases where the global commit fails.
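As a rough illustration of such a framework hook, here is a small asyncio sketch in Python. Everything in it is an assumption made for the example: `submit` stands for the framework code, `ui_log` stands in for the screen, and `fake_global_commit` merely simulates the definitive answer arriving after a delay.

```python
import asyncio


async def fake_global_commit(succeeds: bool, delay: float = 0.01) -> bool:
    """Stand-in for the definitive commit result that arrives later."""
    await asyncio.sleep(delay)
    return succeeds


async def submit(ui_log: list, label: str, global_commit) -> bool:
    """Show success right away, revert only if the global commit fails."""
    ui_log.append(f"{label}: confirmed")          # tentative confirmation
    ok = await global_commit                      # definitive confirmation
    if not ok:
        ui_log.append(f"{label}: reverted, global commit failed")
    return ok


log = []
asyncio.run(submit(log, "order 1", fake_global_commit(True)))
asyncio.run(submit(log, "order 2", fake_global_commit(False)))
print("\n".join(log))
```

The application code only hands the transaction to `submit`; the revert path lives in one place instead of being repeated in every screen.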

Things get more interesting when the local node operates in caching mode, i.e. it doesn't have all the data. Should we wait for a full fetch from remote servers or should we return whatever we have now? RDBMS choose the former while NoSQL databases choose the latter. I think this all boils down to the attempt of many databases to answer all the questions in one big query.

It's a much better idea to let the app query data in smaller chunks. Locally available chunks will be served immediately while remote chunks will be delayed. App logic can be formulated in parallel constructs, so the total latency of the individual queries is proportional to their "depth", i.e. the longest chain of query dependencies, not the total number of queries. This is generally 2-3 roundtrips in the worst case and typically 0 roundtrips with effective caching.
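The difference between "depth" and the total number of queries is easy to show with a small Python sketch. The query names and the dependency map below are invented for the example, and the function assumes the dependencies form no cycles and that a locally cached chunk costs no roundtrip at all.

```python
from functools import lru_cache
from typing import AbstractSet, Mapping, Sequence


def query_depth(deps: Mapping[str, Sequence[str]],
                cached: AbstractSet[str] = frozenset()) -> int:
    """Roundtrips needed when independent queries are issued in parallel.

    deps maps each query to the queries whose results it needs first.
    """
    @lru_cache(maxsize=None)
    def depth(query: str) -> int:
        if query in cached:
            return 0                      # served locally, no roundtrip
        return 1 + max((depth(d) for d in deps[query]), default=0)

    return max((depth(q) for q in deps), default=0)


# Hypothetical page: five queries, but the longest dependency chain is three.
page = {
    "user": [],
    "orders": ["user"],
    "friends": ["user"],
    "avatar": ["user"],
    "order_items": ["orders"],
}
print(len(page), query_depth(page))
print(query_depth(page, cached={"user", "orders"}))
print(query_depth(page, cached=set(page)))
```

Issued one after another, these five queries would cost five roundtrips; issued in parallel they cost as many as the longest chain, and caching the head of the chain shortens it further.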

Now that the app knows which chunks are available and which ones are not, it can make intelligent high-level decisions on how to handle the situation. It would typically display available data along with a visual indicator that retrieval is still in progress.

Above I said that the latency to settle global transactions is likely under 500ms. This holds as long as the network is lightly loaded. Networks do get overloaded, for example under a DDoS attack, when VM allocation is backlogged due to datacenter capacity constraints, when billing/safety limits take effect, or simply when some user or admin submits a huge transaction. Even though clouds permit huge scaling of bandwidth, there's still some limit somewhere. Exceeding available bandwidth or processing capacity directly translates into longer latencies.

There are, however, more intelligent ways to deal with this problem than just dropping all consistency guarantees. When the problem is caused by a single huge transaction, it's better to let this transaction proceed in the background while smaller transactions get processed in realtime. When there are too many small transactions, it is better to drop some of these transactions rather than half-completing every transaction. Preferably, we should drop transactions of service abusers first, followed by low-priority transactions. Applications can easily implement high-level controls that disable non-essential features when overload conditions are detected.
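One possible shape of such an admission policy is sketched below in Python. It is only an illustration under made-up assumptions: the `Txn` fields, the cost units, and the `huge_cost` threshold are all hypothetical, and a real system would estimate cost and detect abuse in far more involved ways.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    txn_id: str
    cost: int              # estimated processing cost, arbitrary units
    priority: int          # higher means more important
    abusive: bool = False  # flagged as coming from a service abuser


def admit(txns, capacity: int, huge_cost: int):
    """Split incoming transactions into (realtime, background, dropped).

    Huge transactions go to the background. Small ones are admitted whole
    until capacity runs out; abusers are the first to be left out, then
    low-priority work. Nothing is ever half-completed.
    """
    background = [t for t in txns if t.cost >= huge_cost]
    small = [t for t in txns if t.cost < huge_cost]
    # Non-abusive before abusive, then higher priority first (stable sort).
    ranked = sorted(small, key=lambda t: (t.abusive, -t.priority))
    realtime, dropped, used = [], [], 0
    for t in ranked:
        if used + t.cost <= capacity:
            realtime.append(t)
            used += t.cost
        else:
            dropped.append(t)
    return realtime, background, dropped


incoming = [
    Txn("a", cost=1, priority=5),
    Txn("b", cost=1, priority=1),
    Txn("c", cost=1, priority=9, abusive=True),
    Txn("d", cost=100, priority=5),
    Txn("e", cost=1, priority=3),
]
realtime, background, dropped = admit(incoming, capacity=2, huge_cost=50)
print([t.txn_id for t in realtime],
      [t.txn_id for t in background],
      [t.txn_id for t in dropped])
```

With capacity for only two small transactions, the abuser's request loses out despite its high self-declared priority, and the huge transaction is neither dropped nor allowed to block the others.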

So who really needs true partition tolerance? Perhaps people who run truly distributed systems, with databases scattered in remote locations or on end-user devices. That's not the case for the average cloud app, where NoSQL databases are most likely to be used. In this light, disregarding consistency in the name of partition tolerance sounds like a cheap excuse to avoid the hard work of providing consistency guarantees.
