Safe to fail vs fail safe - DevSkiller Tech Blog

There are two types of drivers in the world: those who try to avoid accidents by driving very carefully in extra safe cars, and those who prepare to deal with accidents when they happen by taking out a robust insurance policy. And it isn’t only drivers. These contrasting approaches are found in many different fields, and IT is no exception.

An IT project is like a satellite

For a moment, let’s leave Earth and enter outer space. A satellite floating up here is bombarded by a variety of energetic particles. Sped up to a few thousand kilometers per hour, these particles wreak havoc on any electronic system we decide to put into orbit.

We all know computers don’t use text to process and store data; they use only ones and zeros. Any piece of information, the number 23 (in the decimal system) for example, can be represented by a specific combination of ones and zeros (in the binary system).

Take for instance the binary number 10111. As we can see, we need five memory cells to write this information. The memory cell, although it is not visible to the naked eye, is a physical object and has its own size. Not only that, it’s large enough to be hit by a wandering alpha particle. If that happens, a bit-flip occurs, which means a 0 turns into a 1 or the reverse.

So a circuit holding the number 10111 suddenly holds 10011 after colliding with a particle. In the decimal system, a 23 becomes a 19. If this number specifies how long an engine fires, for example, we have a serious problem. A problem that we have to resolve.
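The flip from 10111 to 10011 can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper name `flip_bit` is an assumption made for this example, and a real particle strike obviously does not call a function.

```python
def flip_bit(value: int, position: int) -> int:
    """Return `value` with the bit at `position` inverted (0 = least significant bit)."""
    # XOR with a mask that has a single 1 at `position` inverts exactly that bit.
    return value ^ (1 << position)


original = 0b10111                 # 23 in decimal
corrupted = flip_bit(original, 2)  # the third bit from the right flips from 1 to 0

print(f"{original:05b} -> {corrupted:05b}")  # 10111 -> 10011
print(f"{original} -> {corrupted}")          # 23 -> 19
```

A single inverted memory cell is enough to change the stored value by 4, which is why one stray particle can matter so much.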

Fail-Safe vs. Safe-to-Fail: which is better?

We can solve our engine firing problem in two ways. The first method is to prevent errors from occurring. In our example, this could be done by employing sophisticated, expensive, and hefty shielding to protect our spacecraft’s sensitive equipment. We call this the Fail-Safe approach.

The second method, called Safe-to-Fail, requires us to design our solution in such a way that any potential errors will result in no ill effects. If we know that an error can be introduced while the trajectory is being calculated, the easiest way to stay on the correct course is to use statistics. If we perform the same calculation multiple times, we can surmise that the most common result is the desired one.

This is the same sort of approach taken by SpaceX. Instead of putting all of their critical systems into one processor, they use three dual-core processors. As a direct result, each calculation is performed 6 times, which allows the spacecraft to resolve any problems caused by stray particles.
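The idea of trusting the most common result can be sketched as a simple majority vote. This is a minimal illustration, not SpaceX's actual implementation: the function `vote` and the sample list `burn_seconds` are hypothetical names invented for this example.

```python
from collections import Counter
from typing import Sequence


def vote(results: Sequence[int]) -> int:
    """Return the value reported by a strict majority of the redundant runs.

    Raises ValueError when no value wins more than half of the votes,
    because then we cannot tell which result is the correct one.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError(f"no majority among results: {list(results)}")
    return value


# Six redundant runs of the same calculation; one of them suffered a bit-flip.
burn_seconds = [23, 23, 19, 23, 23, 23]
print(vote(burn_seconds))  # 23
```

Refusing to answer when there is no clear majority is a deliberate choice in this sketch: a wrong engine burn is worse than a repeated calculation.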

Is the Safe-to-Fail approach better?

Is it possible to determine which way is the best? Of course not, as we have to consider the context. Let’s instead analyze the strengths and weaknesses of both approaches. The first approach appears ingenious in its simplicity: we add a shield and the problem is solved. But what happens if the shield is less effective than we assumed, or some space debris damages it? Checkmate, and we are left with nothing.

When we built our shield, we did everything possible to avoid an error, yet if one occurs anyway, the results can be fatal. But if we assume from the start that an error may occur, we can minimize the effects of any unforeseen occurrence. Why does it work? Even if one core fails, we are left with 5 redundant ones, ensuring the continued functioning of our satellite.

Fail-Safe in the real world

Now let’s come back down to Earth and look at a few real-world examples. A common concern we all have is ensuring the proper functioning of our code. If we use the Fail-Safe approach, we will build our deployment pipeline in such a way that when our newest version enters production, it will be free of errors. This is possible thanks to a robust approach to testing, both manual and automated, which is unfortunately costly and time-consuming.

Not only that, an error found in the test phase often leads to a new version that resolves it. At the same time, that new version can introduce bugs into a feature which has already passed its tests.

Taking this approach means that we are very reluctant to make serious design decisions, such as reverting a problematic merge or introducing another major change. This is because we would then need to repeat all the procedures we had already gone through for the last error.

A similar issue contributed to the Challenger disaster. The project managers didn’t believe the engineer when he described a problem that could lead to a catastrophe, as they were convinced that such a major issue would’ve been discovered in an earlier phase. Making such a verification mandatory would’ve pushed the project back even further.

What would a Safe-to-Fail approach have looked like in the previous example? Above all, we have to differentiate between two kinds of functionality in our system: critical paths and all the rest. For critical features, we need to create a standard testing path which we can verify end-to-end with every deployment. It turns out that, in practice, these critical features make up only about 20% of a given project. Focusing on this 20% of features is a much more realistic goal for test automation. Still, it doesn’t change the fact that we’ve left out the remaining 80%. But have we actually done that?

Test-Driven Development covers your entire project

In 1999, Kent Beck suggested the idea of Test-First as one of the pillars of Extreme Programming. Several years later it was fleshed out as an essential software development technique known as Test-Driven Development (TDD). When implemented correctly, TDD ensures that right from the start, your codebase includes robust unit-test coverage. This gives strong assurance that the system works as the developer intended.

Despite its advantages, TDD doesn’t necessarily address the experience of end-users when using the software. As we already know, our goal is to reduce the negative effects of errors. The first step, ensuring that critical systems are properly protected, has already been addressed. The next thing we’ll do is limit the lifetime of errors in production.

Even a minor error that exists for several days or weeks may irritate our end-users, resulting in poor reviews of the system. However, the same error, if addressed within 30 minutes of occurring, will be quickly forgotten by the users.

What can we do in order to be able to prepare and deploy a fix in such a short time?

Above all, such a strong response will not always be warranted. The best approach would be to break up our effort into several phases depending on our capabilities, as follows:

Time to detect an error - the easier it is for our users to create a ticket, and the more quickly our system can respond, the shorter this time becomes
Time to diagnose the problem - the fewer changes are bundled together in each deployment, the less time it takes to diagnose the problem, because it is easier to identify which particular commit contains the erroneous code
Time to fix - the better our tests and code quality, the more easily we can identify the problem, reproduce it, and fix it
Time to deploy - every increase in automation shortens the time a fix needs to pass through our deployment pipeline and reach production
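The four phases above add up to the total lifetime of an error in production, so it can help to track them separately and see which one dominates. The sketch below is only an illustration: the class `ErrorLifetime`, its method names, and the durations are all made-up assumptions, not measurements from a real project.

```python
from dataclasses import dataclass, fields
from datetime import timedelta


@dataclass
class ErrorLifetime:
    """Time spent in each phase between an error appearing and its fix going live."""
    detect: timedelta
    diagnose: timedelta
    fix: timedelta
    deploy: timedelta

    def total(self) -> timedelta:
        # The lifetime of the error is the sum of all four phases.
        return sum((getattr(self, f.name) for f in fields(self)), timedelta())

    def slowest_phase(self) -> str:
        # The longest phase is the first candidate for improvement.
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Hypothetical incident with invented durations.
incident = ErrorLifetime(
    detect=timedelta(minutes=4),
    diagnose=timedelta(minutes=12),
    fix=timedelta(minutes=9),
    deploy=timedelta(minutes=5),
)

print(incident.total() <= timedelta(minutes=30))  # True
print(incident.slowest_phase())                   # diagnose
```

In this invented example the 30-minute target is just met, and diagnosis is the phase where smaller deployments would pay off most.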

In summary:

Understanding the Fail-Safe and Safe-to-Fail approaches, their consequences, and associated risks, along with our time and resource constraints, allows us to deliver a high-quality software product.

However, while Safe-to-Fail seems to be the better idea, you should first ask yourself whether the current quality of your system allows you to adopt it easily. What’s interesting is that even a negative answer to this question does not mean you can’t use it. With the help of the right techniques, like Canary Deployments or rollback procedures, you can always improve the situation. However, this is a topic for another article.