Testing REST and Messaging with Spring Cloud Contract at Devskiller

Here at Devskiller, we have been working with Spring Cloud Contract for a long time now, using it to simplify API design and to test our system while keeping the communication between services working correctly. In fact, members of our team were among the creators, first users, and first contributors of the Accurest project, which was later adopted into Spring Cloud as Spring Cloud Contract. Over time, we have both adjusted the way we work with contracts to the specifics of our team and our system, and leveraged the contracts to make substantial changes in our architecture. This article covers some of the things we have learnt while working with this tool and the improvements we were able to introduce to our system thanks to adopting the Consumer-Driven Contracts approach.

Working with internal APIs

A Local TDD Workflow

It is a common approach for companies to have teams centered around microservices, or to have each team assume ownership of one or more applications that the company develops. Aside from its many benefits, this approach results in focusing on a single context instead of on the functionality being implemented. So we chose the “feature approach”: while working on a new functionality or a change to an existing one, an engineer introduces changes in all the microservices affected by the modification. On the rare occasions when this work is shared, one person will usually still work on both sides of the communication, i.e. the consumer-side client and the producer-side API, and leave stubbed service methods to be implemented by someone else.

Whoever is working on the changes to the API and its consumer will usually check out both repositories locally and start working on the client first: adding changes to the producer repo, generating stubs with the tests skipped, then testing and improving the client. Once a change has been implemented on this side, they can then move to the producer side: running tests, adding the implementation, and refactoring. They then proceed with more, similar iterations for the remaining changes.
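In practice, a local iteration often boils down to something like this (a rough sketch of the commands; the exact goals depend on how the Spring Cloud Contract plugin is wired into the producer build):

    # in the producer repository: build the stubs and install them in the local
    # Maven repository, skipping the generated verification tests for now
    ./mvnw clean install -DskipTests

    # in the consumer repository: run the client tests against the locally
    # installed stubs (Stub Runner pointed at the local repository)
    ./mvnw test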

This allows us, in practice, to adopt the TDD red-green-refactor approach at the API and architecture level. Even though we always start changing the APIs on the consumer side, we can allow for flexibility and adjust the changes on the go; for example, when we see at the stage of changing the producer endpoints that something is overly entangled or unnecessarily complex. The only thing to keep in mind here is that while skipping the producer tests when working locally can be quite useful, it should never, ever be done in deployment pipelines.

Local Flow

The Contracts’ Scope

Similar to E2E tests, which we only create for our most critical business scenarios, we don’t add contracts for every possible use case or corner case. Granted, we cover many more scenarios with the contract tests than with the E2E tests. However, if we have an endpoint for a functionality that is used once every few months, does not bring us considerable revenue, and for which our clients would not immediately experience difficulties, we might skip adding contracts for it.

Likewise, we don’t define contracts for all the possible values. For instance, if we are adding a new value to an enum that is used in communication between services, rather than adding a new contract for the new value, we simply swap it into an existing one. That’s because the enum names usually don’t change, and adding a separate contract for every value would simply be overkill.
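To illustrate, a minimal contract sketch in the Groovy DSL (the endpoint, the status field, and its values are hypothetical) can pin down a single representative enum value, or use a loose matcher, instead of one contract per value:

    // a hypothetical contract for fetching a result; only one representative
    // enum value is asserted, so new values don't require new contracts
    org.springframework.cloud.contract.spec.Contract.make {
        request {
            method 'GET'
            url '/results/42'
        }
        response {
            status 200
            headers {
                contentType(applicationJson())
            }
            body(
                id: 42,
                // alternatively, a matcher such as $(regex('PASSED|FAILED|TIMED_OUT'))
                // keeps a single contract valid for the whole enum
                status: 'PASSED'
            )
        }
    }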

Backwards Compatibility Checks and Deferred Updates

An important advantage of working with Spring Cloud Contract is being able to leverage the contracts as an easy mechanism for API backwards-compatibility checks. It’s fairly simple: before any deployment, we test the producer applications not only against their current contracts but also against the contracts currently deployed to production. This allows us to be sure that we are not breaking the API with backwards-incompatible changes.

API Backwards Compatibility Checks

By the way, if you are interested in introducing this approach to your deployment pipelines, I suggest you have a look at the Spring Cloud Pipelines project that provides this functionality out of the box.

Spring Cloud Pipelines Flow

Source: https://spring.io/blog/2016/10/18/spring-cloud-pipelines

In Spring Cloud Pipelines, after the project is built and the API compatibility is confirmed, it is deployed (including database schema migrations) and tested. After that, a rollback test is executed by deploying and running tests against the production version of the application. After deployment to stage, end-to-end tests can be triggered, followed by a zero-downtime rolling deployment to production. As the backwards compatibility of both the API and the database setup has been verified, the pipeline also allows for an easy rollback to the application’s previous version in case any serious issues are found in the production code.

No Shared DTOs

In the beginning, while developing the system, we widely used DTOs and small libraries with canonical objects shared across various services in order to stick to the DRY principle. It seemed clean and coherent at first, but soon enough we learnt that in a distributed system, avoiding coupling and ensuring autonomous implementation and deployment for each microservice play a much bigger role than avoiding repetition.

Thanks to contract testing, we don’t need to use the same data structures to pass around and store data in any of the collaborating services - in fact, as the services operate within different bounded contexts, it would probably be an anomaly if we needed them to be identical everywhere - we only need to know that the data is passed correctly during communication. So, after adopting the contracts, we stopped using shared DTOs altogether.

No Versioning

As we stopped sharing DTOs as dependencies, and as we ensured that all our API changes were always backwards compatible, we realised we do not really need to version our services either. We can just work on the most recent version and, since we can always be sure it is backwards compatible with the way our system currently works in production, we can deploy new versions at any time and at will - other services will be able to work with the newest version of our service out of the box. For that reason, we have removed service versioning as well.

Unlike with public APIs, where we usually would not make any major changes without a version change, for internal APIs we might want to switch things up from time to time. So, when testing backwards compatibility, we don’t want to ensure that modifications are never made - we just want to be positive that they won’t break production. How do we, for instance, replace the buildTime field, which holds the times of our builds measured in seconds, with a buildTimeMillis field holding the values expressed in milliseconds?

We use a technique usually referred to as a deferred update. First, we add the buildTimeMillis field, alongside the already existing buildTime, both to the contracts and to the producer application. We can now work simultaneously with consumer instances that use either the old or the new field. Second, we remove the old field from the contracts. Now, when engineers work on the API clients and execute their tests against the producer stubs, the tests will fail, and they will see that they need to switch to the buildTimeMillis field in their applications to make the tests pass. They will not have to rely on someone remembering to inform them that they need to adjust their services to a change in the API. Finally, when all the services have been adjusted and their fresh versions deployed to production, we can remove the buildTime field from the producer application as well. The update is complete.

Deployment | Contracts’ fields                           | Application fields
0          | { "buildTime": 5 }                          | { "buildTime": 5 }
1          | { "buildTime": 5, "buildTimeMillis": 5000 } | { "buildTime": 5, "buildTimeMillis": 5000 }
2          | { "buildTimeMillis": 5000 }                 | { "buildTime": 5, "buildTimeMillis": 5000 }
3          | { "buildTimeMillis": 5000 }                 | { "buildTimeMillis": 5000 }
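At deployment 1, the relevant part of the contract might look roughly like the sketch below (the endpoint and the request are made up for illustration; only the two response fields follow the table above). Because both fields are present, consumers asserting either one keep passing their tests against the generated stubs:

    // transitional contract for the deferred update: old and new field side by side
    org.springframework.cloud.contract.spec.Contract.make {
        request {
            method 'GET'
            url '/builds/42'
        }
        response {
            status 200
            headers {
                contentType(applicationJson())
            }
            body(
                buildTime: 5,          // old field, dropped from the contracts at deployment 2
                buildTimeMillis: 5000  // new field introduced by the deferred update
            )
        }
    }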

Working with public APIs

A good public API should return the data that was agreed upon and should have documentation that is easy to use and tightly linked with the code. We should have tests to ensure that the API is not being modified by mistake, and we should be certain that our documentation reflects exactly what our code does. We have all faced situations where an external API provider handed us human-written documentation and, after we implemented an API client according to it, it turned out that there were discrepancies between the document and how the API really works. It’s so frustrating.

We definitely did not want the consumers of our APIs to ever have to face these kinds of problems with us, so we have used Spring REST Docs to ensure that our API works correctly and is well documented. This neat solution allows you to first write the tests that define how your API should work and verify that the implementation is, indeed, correct, and then to generate the API documentation automatically based on those tests. MockMvc, WebTestClient, and REST Assured are supported for writing these tests, and there are additional methods available to pass descriptions and other contextual information that will be included in the documentation. The documentation is generated directly from the tests, so we can be sure that our API works as expected and the documentation reflects the available functionality.
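As a rough sketch (written here in Groovy; the endpoint, field names, and snippet identifier are made up for illustration), such a MockMvc-based test both verifies the behaviour and emits the documentation snippets:

    import static org.springframework.restdocs.mockmvc.MockMvcRestDocumentation.document
    import static org.springframework.restdocs.payload.PayloadDocumentation.fieldWithPath
    import static org.springframework.restdocs.payload.PayloadDocumentation.responseFields
    import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get
    import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status

    // inside a test class configured with the Spring REST Docs extension and MockMvc
    mockMvc.perform(get('/api/builds/{id}', 42))
            .andExpect(status().isOk())
            // fails the test if the documented fields and the real response diverge,
            // and produces the documentation snippets for this interaction
            .andDo(document('get-build',
                    responseFields(
                            fieldWithPath('id').description('Identifier of the build'),
                            fieldWithPath('buildTimeMillis').description('Build duration in milliseconds'))))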

Why am I writing about this in an article dedicated to contracts? One of the cool features here is that you can actually generate stubs and/or contracts from your Spring REST Docs setup. And why would you need stubs for your public API? Many of your public API consumers might not implement Spring Boot-based clients or wish to write integration tests using stubs; maybe they are just working on a frontend and using a completely different stack… The good news is that they can still use the stubs to, for instance, prototype their UI against them, with only a minimal setup required on their side. They only need a command-line terminal, Maven installed, and a few entries added to their Maven settings.xml file, as described in the plugin documentation. It takes no more than a few minutes and then they can just run mvn spring-cloud-contract:run -Dstubs="com.devskiller:sample-service", and Stub Runner will be downloaded for them and launched with a WireMock instance running underneath, populated with the stubs that you have previously published to an artifact repository.

If you want to use Spring Cloud Contract with technologies other than the JVM, you can also consider the Spring Cloud Contract Docker images, which allow you to work with consumer-driven contracts in the stack of your choice.

Safe to fail vs fail safe

There are two types of drivers in the world: those who try to avoid accidents by driving very carefully in extra-safe cars, and those who prepare to deal with accidents when they happen by taking out a robust insurance policy. And it isn’t only drivers. These contrasting approaches are found in many different fields, and IT is no exception.

An IT project is like a satellite

For a moment, let’s leave Earth and enter outer space. A satellite floating up there is bombarded by a variety of energetic particles. Travelling at enormous speeds, these particles wreak havoc on any electronic system we decide to put into orbit.

We all know computers don’t use text to process and store data; they use only ones and zeros. Any piece of information, the number 23 (in the decimal system) for example, can be represented by a specific combination of ones and zeroes (in the binary system).

Take for instance the binary number 10111. As we can see, we need five memory cells to write this information. The memory cell, although it is not visible to the naked eye, is a physical object and has its own size. Not only that, it’s large enough to be hit by a wandering alpha particle. If that happens, a bit-flip occurs, which means a 0 turns into a 1 or the reverse.

So a circuit holding the number 10111, after colliding with a particle, suddenly contains 10011. Using the decimal system, we could say a 23 becomes a 19. If this number specifies, for example, the duration during which an engine fires, we have a serious problem. A problem that we have to resolve.
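A quick sketch of that single bit-flip in Groovy (purely illustrative):

    // 10111 (23) with its third-lowest bit flipped becomes 10011 (19)
    int original = 0b10111             // 23
    int flipped  = original ^ 0b00100  // XOR toggles exactly that one bit
    assert flipped == 0b10011          // 19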

Fail-Safe vs. Safe-to-Fail: which is better?

We can solve our engine firing problem in two ways. The first method is to prevent errors from occurring. In our example, this could be done by employing sophisticated, expensive, and hefty shielding to protect our spacecraft’s sensitive equipment. We call this the Fail-Safe approach.

The second method, called Safe-to-Fail, requires us to design our solution in such a way that any potential errors will result in no ill effects. If we know that an error can creep into the calculation of a trajectory, the easiest way to stay on the correct course is the use of statistics: if we perform the same calculation multiple times, we can surmise that the most common result is the desired one.

This is the same sort of approach taken by SpaceX. Instead of putting all of their critical systems into one processor, they use three dual-core processors. As a direct result, each calculation is performed six times, which allows the spacecraft to resolve any problems caused by stray particles.

Is the Safe-to-Fail approach better?

Is it possible to determine which approach is best? Of course not, as we have to consider the context. Let’s instead analyze the strengths and weaknesses of both approaches. The first one appears ingenious in its simplicity: we add a shield and the problem is solved. But what happens if the shield is less effective than we assumed, or some space debris damages it? Checkmate, and we are left with nothing.

When we built our shield, we did everything possible to avoid an error, yet if one occurs, the results can be fatal. But if we assume that an error may occur, we can minimize the effects of any unforeseen occurrence. Why does it work? Even if one processor core fails, we are left with five redundant ones, ensuring the continued functioning of our satellite.

Fail-Safe in the real world

Now let’s come back down to Earth and look at a few real-world examples. A common concern we all have is ensuring the proper functioning of our code. If we use the Fail-Safe approach, we will build our deployment pipeline in such a way that when our newest version enters production, it will be free of errors. This is possible thanks to a robust approach to testing, both manual and automated, which is unfortunately costly and time-consuming.

Not only that, it often happens that a new version deployed to production to resolve an error found in the test phase will, at the same time, introduce bugs into a feature which has already passed the tests.

Taking this approach means that we become very reluctant to take serious design decisions, such as unmerging a problematic project or introducing another major change, because we would then need to repeat all the procedures we had already gone through after the last error.

A similar issue caused the Challenger disaster. The project managers didn’t believe the engineer when he described a problem that could lead to a catastrophe, as they were convinced that such a major issue would’ve been discovered in an earlier phase. Making such a verification mandatory would’ve pushed the project back even further.

What would a Safe-to-Fail approach have looked like in the previous example? Above all, we have to differentiate between two kinds of functionality in our system: the critical paths and all the rest. For the critical paths, we need to create a standard testing path which we can verify end-to-end with every deployment. It turns out that in practice, these critical features make up only about 20% of a given project, and focusing on that 20% is a much more realistic goal for test automation. Still, it doesn’t change the fact that we’ve left out the remaining 80%. But have we actually done that?

Test Driven Development covers your entire project

In 1999, Kent Beck suggested the idea of Test-First as one of the pillars of Extreme Programming. Several years later it was fleshed out as an essential software development technique known as Test-Driven Development (TDD). When implemented correctly, TDD ensures that, right from the start, your codebase includes robust unit-test coverage. This guarantees that the system will work as the developer intended.

Despite its advantages, TDD doesn’t necessarily address the experience of end-users when using the software. As we already know, our goal is to reduce the negative effects of errors. The first step, ensuring that critical systems are properly protected, has already been addressed. The next thing we’ll do is limit the lifetime of errors in production.

Even a minor error that exists for several days or weeks may cause our end users irritation, resulting in poor reviews of the system. However, the same error, if addressed within 30 minutes of occurring, will be quickly forgotten by the users.

What can we do in order to be ready to prepare and deploy a fix in such a short time?

Above all, such a fast response will not always be warranted. The best approach is to break the time to recovery down into several phases, depending on our capabilities, as follows:

  • Time to detect an error - the easier it is for our users to report a problem and the faster our system can alert us, the shorter this time becomes
  • Time to diagnose the problem - the fewer changes are bundled into each deployment, the less time it takes to diagnose the problem and the easier it is to identify which particular commit contains the erroneous code
  • Time to fix - the better our tests and code quality, the more easily we can identify the problem, reproduce it, and fix it
  • Time to deploy - the more automated our deployment pipeline, the shorter the time needed to get the fix into production

In summary:

Understanding the Fail-Safe and Safe-to-Fail approaches, their consequences, and associated risks, along with our time and resource constraints, allows us to deliver a high-quality software product.

However, while Safe-to-Fail seems to be the better idea, you should first ask yourself whether the current quality of the implemented system allows you to adopt it easily. Interestingly, even a negative answer to this question does not mean you can’t use it. With the help of the right techniques, like canary deployments or rollback procedures, you can always improve your situation. However, this is a topic for another article.
