Vaccinating applications. Resilience in distributed systems.

L. Peter Deutsch wrote down the seven fallacies of distributed computing as early as 1994. One may wonder how, more than 20 years later, they are still so often forgotten. Given the current popularity of public clouds and microservice architectures, such ignorance looks like asking for trouble. So what should we do to avoid these problems? How do we ensure system resilience?

Communication across divides

Communication is certainly the basic mechanism used in distributed systems. Peter Deutsch enumerates two false assumptions in this respect:

  • the network is reliable,
  • topology doesn’t change.

Of course we know that the network is not reliable. Everyone who has used a database cluster or a distributed cache has, at least once, experienced problems with the stability of a network connection. This does not mean, however, that errors occur all the time and that we lose hundreds of packets every second. Some people even factor temporary instability of the network layer into the availability level of the services they provide. This is where the second false assumption, the unchangeability of topology, interferes with the decision-making process. In classical distributed systems, which are released in cycles and are not dynamically scalable, topology does not change very often. However, with cloud solutions or microservices, which are by definition autonomous in terms of changes, it is a completely different game. Here, topology changes can be caused by a number of factors:

  • dynamic scalability (increasing and decreasing the number of instances),
  • migration of applications between virtual machines, resulting in a change of IP addresses,
  • temporary unavailability of the application due to e.g. deployment of a new version.

As you can see, a topology change is no longer an unlikely event to be lumped in with the usual instability of the transport layer.

Service discovery

A load balancer seems to be the first line of defence against topology changes. Instead of communicating directly with one another, services communicate via the load balancer. An application only needs to know the address of the component that divides traffic between particular machines. If the number of instances changes, you only need to update the configuration of the balancer and the entire environment keeps working. This solution, however, is not perfect. The balancer becomes a single point of failure: if it is unavailable, integration of components is impossible even when both the client and the producer operate correctly. Performance also suffers, because an additional network hop is needed (application->LB->application instead of application->application).

Service discovery is another way to solve this problem. Simply put, a service discovery component (Consul being one example) registers all applications operating in the system. When launched, an application reports to the service discovery to get registered, and it contacts it again to deregister before being shut down. In the meantime the registry checks the availability of each application, making sure that the list stays up to date. Now, if system A needs to communicate with system B, it asks the registry for a list of current instances of B and then sends traffic directly to a selected instance.

But isn't the registry still a single point of failure? It is not, for two reasons. First, such a registry can be distributed across several instances, and failure of one of them does not prevent the others from being used. Second, even if all registry instances are unavailable, applications can still communicate with one another, because each of them has cached the last known list of its collaborators. Or at least they can as long as the topology does not change significantly.
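The sketch below shows how such a lookup with a local cache might look in Java. The registry endpoint (/services/<name>), the comma-separated response format and the class names are illustrative assumptions, not the API of any particular tool such as Consul.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: ask a registry for current instances of a service and keep the
// last successful answer, so lookups still work when the registry is down.
public class DiscoveryClient {

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();
    private final Map<String, List<String>> lastKnown = new ConcurrentHashMap<>();
    private final String registryUrl;

    public DiscoveryClient(String registryUrl) {
        this.registryUrl = registryUrl;
    }

    public List<String> instancesOf(String serviceName) {
        try {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(registryUrl + "/services/" + serviceName))
                    .timeout(Duration.ofSeconds(2))
                    .GET()
                    .build();
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            // Assumed response format: comma-separated "host:port" entries.
            List<String> instances = Arrays.asList(response.body().split(","));
            lastKnown.put(serviceName, instances);   // refresh the local cache
            return instances;
        } catch (Exception registryUnavailable) {
            // Registry down: fall back to the last known list of collaborators.
            return lastKnown.getOrDefault(serviceName, List.of());
        }
    }
}
```

The important design point is the local cache: while the registry answers, the client refreshes its list of instances; when the registry is unavailable, it keeps talking to the last known collaborators.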

Try again

We managed to address the issue of migrating instances, but we still did nothing about temporary unavailability or instability of the network. Fortunately, both of these problems can be tackled at the same time. The easiest way to deal with a temporary problem is to try again. How many of you have re-run a failing test, hoping that the second time it would pass? Let's treat communication the same way: if the response to the first request is a timeout or any other exception, simply send it again. Very often that is enough. You can of course keep retrying at various intervals, sending the request to different instances.

Unfortunately, as usual, there is a trap: idempotence of communication. Idempotence is simply the property that an operation can be performed repeatedly without changing the result. In the world of synchronous REST, the GET, PUT and DELETE methods can be repeated without a problem. If you ask for user data (GET) repeatedly, the state of the system does not change. By the same token, if you DELETE user data from the system, doing it five times will not make the data "more deleted". The PUT method changes the state of a resource, e.g. a user's e-mail address; changing the address to x@y.com a couple of times does not make it "more changed". The POST method, however, is a problem. By definition it creates a new resource. If you use it to top up your phone, repeating the request will, in theory, result in multiple top-ups.

To ensure idempotence in such a case, unique request identifiers are usually used. This of course means the service has to remember the list of messages it has already handled, which in turn means it is no longer fully stateless. What is more, in some cases checking whether a particular message has already been handled can take longer than actually processing it. Still, being able to retry communication and thereby solve two serious problems is worth the price, especially if we take a pragmatic approach: a narrow, say 5-minute, idempotence window is usually enough, because the odds that someone will repeat a request after a couple of months are low.
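Below is a minimal Java sketch of such a retry loop. It reuses one request identifier across all attempts so the server can recognise and ignore duplicates; the URL, the X-Request-Id header name and the status-code handling are illustrative assumptions, not a specific API.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.UUID;

// Minimal sketch: retry a POST a few times with one idempotency key per logical request.
public class RetryingTopUpClient {

    private static final int MAX_ATTEMPTS = 3;
    private final HttpClient http = HttpClient.newHttpClient();

    public boolean topUp(String phoneNumber, int amount) throws InterruptedException {
        String requestId = UUID.randomUUID().toString();   // same id for every attempt
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://billing.example.com/top-ups"))
                .header("X-Request-Id", requestId)
                .timeout(Duration.ofSeconds(2))
                .POST(HttpRequest.BodyPublishers.ofString(phoneNumber + ":" + amount))
                .build();

        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                int status = http.send(request, HttpResponse.BodyHandlers.ofString()).statusCode();
                if (status == 201) {
                    return true;    // created (or a duplicate the server already handled)
                }
                if (status < 500) {
                    return false;   // client error: retrying will not help
                }
            } catch (IOException timeoutOrNetworkError) {
                // temporary failure: fall through and try again
            }
            Thread.sleep(200L * attempt);   // simple back-off between attempts
        }
        return false;   // all attempts failed; time for the fallback described below
    }
}
```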

Always have a plan C

Plans A and B often fail. If the initial communication fails and retries do not help either, it is time for plan C: the emergency procedure. This is where the idea of "design for failure", widely used when designing distributed systems, comes to the rescue. When implementing any communication in such a system, we must immediately think about what we will do if it fails. When integrating a sales module with a fraud detection system, we plan up front what happens if the anti-fraud module is unavailable. Should we reject all customers? Accept all of them? Or maybe accept returning customers as long as the value of their order does not exceed PLN 500? Whatever decision we take, it comes with additional business risk, and taking such risks is not the responsibility of IT. In other words: programmers point out the need for such a fallback, but it is the business that decides how it should work.
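A minimal sketch of such a fallback in Java is shown below. The FraudDetectionClient interface, the exception type and the "returning customer under PLN 500" rule are illustrative assumptions; the actual rule must come from the business, not from IT.

```java
import java.math.BigDecimal;

// Minimal sketch of a business-defined fallback for an unavailable anti-fraud service.
public class FraudCheckWithFallback {

    private final FraudDetectionClient fraudDetection;   // hypothetical remote client

    public FraudCheckWithFallback(FraudDetectionClient fraudDetection) {
        this.fraudDetection = fraudDetection;
    }

    public boolean acceptOrder(Order order) {
        try {
            return fraudDetection.isLegitimate(order);          // plan A (retries inside)
        } catch (ServiceUnavailableException antiFraudDown) {
            // Plan C: degrade gracefully according to the rule agreed with the business.
            return order.customerIsReturning()
                    && order.value().compareTo(new BigDecimal("500")) <= 0;
        }
    }

    // Illustrative collaborators, sketched as nested types:
    interface FraudDetectionClient { boolean isLegitimate(Order order); }
    record Order(boolean customerIsReturning, BigDecimal value) { }
    static class ServiceUnavailableException extends RuntimeException { }
}
```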

Please don’t wait. We’ll call you back

As some readers may have noticed, most of the problems described above concern synchronous communication. Isn't it better to use asynchronous communication? Of course it is! What is more, it is the default form of communication we should consider. We should reach for synchronous calls only when the model of operation is: "I need this information now, and if I don't get it, I can manage differently (fallback)". The main advantage of integration using queues is that the two sides do not have to be alive at the same time. If system A sends a message while system B is unavailable, B will process it as soon as it comes back up, even if system A is already down by then. Only the queue has to be working. A queue is, like a database, treated as an infrastructure element, so it tends to be more available and updated less frequently than the applications themselves.
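A minimal JMS sketch of the producing side could look as follows, assuming an ActiveMQ broker at tcp://localhost:61616 and a queue named "orders" (both illustrative). The consumer does not need to be running when the message is sent; the broker holds it until the consumer comes back up.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

// Minimal JMS producer sketch: system A drops a message on a queue and moves on.
public class OrderPublisher {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("orders");
            MessageProducer producer = session.createProducer(queue);

            TextMessage message = session.createTextMessage("order-12345");
            producer.send(message);   // fire and forget: system B will pick it up later
        } finally {
            connection.close();
        }
    }
}
```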

Distributed monolith

A distributed system that lacks the mechanisms described above is simply a monolith distributed over several machines, even if we call it "microservices". Monolithic architecture assumes that components, since they work within one application, are available simultaneously. Carrying the same assumption into microservices means that the real availability of the platform will be closer to a turn signal (it works, it doesn't work, it works, it doesn't work) than to enterprise-class software. Applying the practices described above makes it possible to migrate from a distributed monolith to a properly functioning distributed system.

About Jakub Kubrynski

Jakub is a DevSkiller co-founder. He is a software developer for whom building software is a way of life as well as a passion. He is focused on continuously improving software delivery processes by introducing new technologies and refining Lean methodologies. For over 12 years of his professional career, he worked as a software developer, architect, team leader, and manager. He gained experience working on both sides of the delivery process, as a vendor and as a client.

https://twitter.com/jkubrynski