Patterns for Distributed Transactions

Distributed transactions consist of multiple independent units of work that are coordinated or orchestrated together. They are expected to uphold the ACID properties (atomicity, consistency, isolation, durability). In a microservices context, we can define a distributed transaction as a set of organized state changes spanning several services. We can also express these state changes as a sequence of business logic steps.

Let's assume that we have a social media application. When a user creates a post, it first goes to a content analysis service for abuse screening. If the post contains an advertisement, the user is redirected to the payment process, the payment service is triggered, and the advertisement appears in users' timelines once the payment is completed. If the post is noncommercial, the content is listed in the timeline, and all subscribers of that user are notified via the notification service. We therefore have two business scenarios that follow different steps: one for a commercial user, with a payment step and no notification step, and one for a regular user, with a notification step and no payment step.

In the regular posting scenario, we can start by introducing a message bus. When a user submits a post, a new message goes to the message queue, and the content screening service, as a consumer, picks up the message and runs the screening. It then publishes a new message with the screening result to the message bus. The timeline service and the notification service, as subscribers, receive that message and do the required work. This design looks workable at first sight, but it covers only the happy path. For instance, if the timeline service receives the message and completes its work but the notification service fails to perform its task, the transaction is incomplete and needs to be canceled or rolled back to a previous state.
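The gap in the happy path can be reproduced with a small in-memory sketch. The `MessageBus` class, the `post.screened` topic, and the handler functions are hypothetical names that stand in for a real broker and real services; the notification failure is simulated.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """Tiny in-memory stand-in for a real message broker (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> List[str]:
        """Deliver to every subscriber and return the names of handlers that failed."""
        failed = []
        for handler in self._subscribers[topic]:
            try:
                handler(message)
            except Exception:
                failed.append(handler.__name__)
        return failed


timeline: List[str] = []  # posts currently listed in the timeline


def timeline_service(message: dict) -> None:
    timeline.append(message["post_id"])


def notification_service(message: dict) -> None:
    # Simulated outage: the notification step never completes.
    raise RuntimeError("notification service unavailable")


def screening_service(bus: MessageBus, message: dict) -> List[str]:
    # Hypothetical screening rule: every post passes.
    return bus.publish("post.screened", {**message, "approved": True})


bus = MessageBus()
bus.subscribe("post.screened", timeline_service)
bus.subscribe("post.screened", notification_service)

failed_handlers = screening_service(bus, {"post_id": "p-1"})
# The post is already listed, yet one step failed and nothing undoes the listing.
print(timeline, failed_handlers)
```

The bus itself has no notion of a transaction, so the listed post stays in the timeline even though the notification step failed.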

There are multiple aspects to consider when planning transactions. The primary concerns of distributed transactions are message routing, transaction state, and failure compensation. Messages should be able to follow conditional routing paths based on alternative business logic needs. Transactions must be easy to cancel or roll back. Every service on the transaction path should also have a failure compensation plan to handle errors. This article covers two commonly used patterns that address these concerns: the saga pattern and the routing slip pattern.

Saga Pattern

The saga pattern is a design pattern for managing distributed transactions, and it is especially popular in microservices systems with event-driven or message-oriented elements. It was originally designed for database transactions and data store management scenarios, but today it is also used as a tool for state management, message routing, and failure compensation.

A saga starts by breaking down the distributed transaction into a sequence of smaller states or units of work, and the transaction states are controlled in a centralized location. For our social media application example, the states of a post submission process might be: post submitted, content screened, payment completed, listed in the timeline, and users notified. When the saga process is initiated, a record with a unique id (usually called a correlation id) is created in the data store of a central controller service, which is responsible for tracking the changes in the transaction states (in our example, we can name it the Post State Service). When a service completes its responsibility (either its regular work or its failure compensation task), the state of the corresponding transaction is updated based on the correlation id. This way, we can move forward or backward across the distributed services according to the defined business logic, using asynchronous communication (in our example, a message bus).
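As a rough illustration of the central controller, the sketch below keeps saga state in a dictionary keyed by correlation id. The `PostStateService` class, its method names, and the state strings are assumptions made for this example, and the dictionary stands in for the controller's data store.

```python
import uuid
from typing import Dict


class PostStateService:
    """Hypothetical central controller that tracks saga state per correlation id."""

    def __init__(self) -> None:
        # Stands in for the controller's data store.
        self._states: Dict[str, str] = {}

    def start(self) -> str:
        """Create a saga record and return its correlation id."""
        correlation_id = str(uuid.uuid4())
        self._states[correlation_id] = "post_submitted"
        return correlation_id

    def on_step_completed(self, correlation_id: str, new_state: str) -> None:
        """Called when a service reports that its work (or compensation) is done."""
        if correlation_id not in self._states:
            raise KeyError(f"unknown saga: {correlation_id}")
        self._states[correlation_id] = new_state

    def state_of(self, correlation_id: str) -> str:
        return self._states[correlation_id]


controller = PostStateService()
saga_id = controller.start()
controller.on_step_completed(saga_id, "content_screened")
controller.on_step_completed(saga_id, "listed_in_timeline")
print(controller.state_of(saga_id))
```

In a real system the completion reports would arrive as messages carrying the correlation id rather than as direct method calls.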

A saga is usually modeled as a state machine, a well-known behavioral design pattern. It consists of states, and transitions between states are driven by receiving and publishing messages. The order of states can be changed and rearranged without making changes to the services.
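One simple way to picture such a state machine is a transition table keyed by the current state and the received message. The state and event names below are illustrative assumptions based on the social media example, not part of any particular saga library.

```python
from typing import Dict, Iterable, Tuple

# (current state, received event) -> next state; all names are illustrative.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("post_submitted", "ScreeningPassed"): "content_screened",
    ("content_screened", "PaymentCompleted"): "payment_completed",   # commercial post
    ("payment_completed", "TimelineUpdated"): "listed_in_timeline",
    ("content_screened", "TimelineUpdated"): "listed_in_timeline",   # regular post
    ("listed_in_timeline", "SubscribersNotified"): "users_notified",
}


def next_state(state: str, event: str) -> str:
    """Look up the transition; reject events that are invalid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def run(events: Iterable[str], state: str = "post_submitted") -> str:
    """Feed a sequence of events through the state machine."""
    for event in events:
        state = next_state(state, event)
    return state


regular_final = run(["ScreeningPassed", "TimelineUpdated", "SubscribersNotified"])
commercial_final = run(["ScreeningPassed", "PaymentCompleted", "TimelineUpdated"])
print(regular_final, commercial_final)
```

Because the ordering lives in the table, reordering or adding states means editing `TRANSITIONS` rather than the services that emit the events.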

While handling happy paths, we should also plan for failure and cancellation paths. We can implement failure compensation by rolling back the distributed transaction: for example, when an error occurs, we can go backward one step at a time. A saga state machine can also execute steps in parallel, but this can add considerable complexity to the system.
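Going backward one step at a time can be sketched as follows, assuming each step is paired with a compensating action. The `run_saga` function and the step functions are hypothetical, and the calls are synchronous for brevity, whereas a real saga would be driven by messages.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: done")
    return log


timeline: List[str] = []


def no_op() -> None:
    pass


def list_post() -> None:
    timeline.append("p-1")


def unlist_post() -> None:
    timeline.remove("p-1")


def notify() -> None:
    # Simulated outage in the notification service.
    raise RuntimeError("notification service unavailable")


saga_log = run_saga([
    ("screen", no_op, no_op),
    ("list", list_post, unlist_post),
    ("notify", notify, no_op),
])
print(saga_log, timeline)
```

After the failed notification, the listing is undone, so the timeline ends up in the state it had before the saga started.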

Routing Slip Pattern

The routing slip pattern is a design pattern developed to manage complex business processes in distributed systems, such as microservices architectures. Services interact by passing a set of instructions, called a routing slip, from one to another. When a service receives a routing slip that contains a work item, it completes that work as indicated in the routing slip and then passes the slip to the next service.

Rather than relying on centralized state control like the saga, the routing slip pattern records the processing sequence on a routing slip that is attached to a message. Each service can add data to the routing slip. The routing slip lets us determine the sequence of services, and which steps to run, dynamically at runtime. We can include or exclude services from a transaction by modifying the routing slip, which also enables conditional routing. Failure compensation and cancellation are also relatively simple with the routing slip pattern. If a failure occurs, the message can be delivered backward along the sequence in the routing slip, performing compensating operations at each step to restore the system to its initial state.
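A minimal sketch of a routing slip, assuming a simple in-process loop in place of real message delivery. The `RoutingSlip` class, `build_slip`, `process`, and the step names are all hypothetical, and the notification failure is simulated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RoutingSlip:
    itinerary: List[str]                                    # steps still to run, in order
    completed: List[str] = field(default_factory=list)      # steps already done
    data: Dict[str, object] = field(default_factory=dict)   # data added by services


Handler = Callable[[RoutingSlip], None]


def build_slip(is_advertisement: bool) -> RoutingSlip:
    """Conditional routing: the itinerary is chosen at runtime."""
    steps = ["screen"]
    steps += ["payment", "timeline"] if is_advertisement else ["timeline", "notify"]
    return RoutingSlip(itinerary=steps)


def process(slip: RoutingSlip, execute: Dict[str, Handler],
            compensate: Dict[str, Handler]) -> bool:
    """Pass the slip forward; on failure, walk the completed steps backward."""
    while slip.itinerary:
        step = slip.itinerary[0]
        try:
            execute[step](slip)
        except Exception:
            while slip.completed:
                compensate[slip.completed.pop()](slip)
            return False
        slip.completed.append(slip.itinerary.pop(0))
    return True


def screen(slip: RoutingSlip) -> None:
    slip.data["approved"] = True


def pay(slip: RoutingSlip) -> None:
    slip.data["paid"] = True


def list_in_timeline(slip: RoutingSlip) -> None:
    slip.data["listed"] = True


def notify(slip: RoutingSlip) -> None:
    # Simulated outage in the notification service.
    raise RuntimeError("notification service unavailable")


execute = {"screen": screen, "payment": pay, "timeline": list_in_timeline, "notify": notify}
compensate = {
    "screen": lambda slip: slip.data.pop("approved", None),
    "payment": lambda slip: slip.data.pop("paid", None),
    "timeline": lambda slip: slip.data.pop("listed", None),
    "notify": lambda slip: None,
}

regular = build_slip(is_advertisement=False)
regular_ok = process(regular, execute, compensate)  # notify fails, earlier steps are undone
ad = build_slip(is_advertisement=True)
ad_ok = process(ad, execute, compensate)            # no notify step on this itinerary
print(regular_ok, regular.data, ad_ok, ad.completed)
```

The slip carries everything needed to continue or to unwind, so no central component has to hold the transaction state.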

Comparison

While the saga pattern and the routing slip pattern both provide solutions for distributed workflows, each has some strengths and weaknesses in different areas.

The routing slip can be a solution when centralized transaction state management becomes a performance issue, especially in complex scenarios. However, if the business case requires centralized state management, the saga is the better fit. From a routing perspective, if conditional routing is a requirement, the routing slip pattern can be the better option. Failure compensation is another important consideration when making a design decision. For simple cases, a saga state machine that handles both the happy path and the failure compensation path can achieve the desired results, but failure handling is more straightforward with a routing slip. Another important point is the need for parallelization. If the business case requires parallel processing, the saga pattern can support it, whereas the routing slip pattern cannot.

| Use Case | Saga | Routing Slip |
| --- | --- | --- |
| State Management | Good if centralized management is required | State changes are stored in the routing slip |
| Routing | Good for simple scenarios | Adaptive to complex scenarios |
| Failure Compensation | Good for simple scenarios | Straightforward |
| Parallelization | Possible | Not applicable |

When making design decisions for a large microservices system, we don't have to choose one or the other. Saga and the routing slip can coexist in our system, and we can manage our distributed transactions with both. We can also have some system parts without managed communications and transactions.