(Fire-and-Forget Considered Harmful)

Many enterprises have a strong love affair with messaging software, such as JMS providers (e.g., IBM® WebSphere® MQ and TIBCO Enterprise Message Service™), and admittedly there is a lot to love, like temporal decoupling and guaranteed delivery. What is often overlooked, however, is the chain-of-custody problem that is introduced when the fire-and-forget (a.k.a. In-Only or Out-Only) message exchange pattern is used.

Maintaining the chain-of-custody in a messaging environment is vital when one system delegates message delivery accountability to another system. It involves two parts:

  1. Ensuring that a message is successfully processed by the intended recipient or an appropriate error is returned to the sending application;
  2. Ensuring that chronological documentation is recorded of the receipt, transfer, transformation, and processing of the message.

The main focus of this article is part 1, ensuring message delivery, rather than part 2, its chronological documentation.

Fire-and-forget allows the sending application to simply send its messages and defer delivery accountability to the messaging software; it doesn’t care when the target application receives a message, so long as it is received. (Actually, this is always a lie: the sending application expects the message to arrive and be processed by the target application in a timely fashion, just not immediately.) The goal is to achieve eventual consistency between the applications, rather than immediate consistency. This works fine most of the time, but starts to fall down when errors occur, and this is where the chain-of-custody problem rears its ugly head.

When the message cannot be delivered successfully to the target application (e.g., the message is malformed, the message data is invalid, or the message expires), the messaging software (and the layers of orchestration between the applications) often does not have sufficient context to handle the error appropriately, and is limited to three options:

  1. Discard the message and ignore the error;
  2. Send the message (and/or the error) to an error log or queue, which is monitored by people;
  3. Send the message (and/or the error) to an error queue, which is processed by the sending application.

Option 1 obviously breaks the chain-of-custody. Option 2, besides being slow and expensive, breaks the chain-of-custody if you cannot guarantee that every error log or error queue entry is appropriately resolved. Since people are involved, mistakes will happen, breaking the chain-of-custody. Also, in our experience these logs/queues are rarely monitored effectively, making option 2 little better than option 1. In option 3, we are no longer discussing fire-and-forget, but instead a form of request-reply, specifically Robust-In-Only or Robust-Out-Only. If the response is synchronous, then there is no longer a chain-of-custody problem, as the sending application does not delegate delivery accountability. If the response is asynchronous, then accountability is still delegated and the chain-of-custody can be broken if the response (for whatever reason) cannot be successfully processed by the original sending application.

The use of fire-and-forget results in message loss and an overall system that is not robust. In turn, we have seen this result in ICC teams becoming the first point of contact when anything goes wrong, before the other teams even check their own outbound message queues. We have even seen one team earn the unfortunate moniker of ‘North Korea’ because, like people going into North Korea, messages would enter the integration layer and never come out.

To deal with this issue, we often see teams try to implement more effective monitoring. When errors occur, they are recorded in an incident management system or a workflow management system for manual resolution, using a four-eyes process to reduce human error. However, as the integration environment is always evolving, there are always new fire-and-forget message types that aren’t well monitored. Chain-of-custody is yet again broken when the error cannot be processed into the incident or workflow management system; and while four-eyes processes do reduce human error, they do not eliminate it. We’ve seen incidents where more than a thousand messages were incorrectly resolved, despite the use of a four-eyes process.

The easier way to deal with this issue is to stop using fire-and-forget when delivery must be guaranteed. More specifically, the sending application should not defer delivery accountability to other systems.

Some REST- and SOAP-based systems do not have a chain-of-custody problem, as delivery accountability is not delegated. Instead, either success or failure is returned to the sender via the HTTP status code, or a transport failure occurs. The sender is then able to handle the error appropriately, given the context under which the request was sent. If it is not clear whether the target application received the message (e.g., the request was sent, but the transport failed before a response was received), well-designed REST-based systems can easily recover by issuing an appropriate GET. Similarly, well-designed SOAP-based systems can easily recover by invoking an appropriate query/retrieve operation.
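The recovery path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FakeOrderService`, `TransportError`, and `send_order` are hypothetical names, the in-memory service stands in for a real HTTP API, and the target is assumed to accept an idempotent PUT keyed by a sender-chosen identifier.

```python
from dataclasses import dataclass, field


class TransportError(Exception):
    """Raised when no response arrives, so the outcome is unknown."""


@dataclass
class FakeOrderService:
    """Hypothetical in-memory stand-in for the target application's HTTP API."""
    orders: dict = field(default_factory=dict)
    drop_next_response: bool = False

    def put(self, order_id: str, body: dict) -> int:
        if "amount" not in body:
            return 400  # invalid data: rejected, nothing is stored
        self.orders[order_id] = body  # idempotent: same id, same result
        if self.drop_next_response:
            self.drop_next_response = False
            raise TransportError("connection lost before the response arrived")
        return 201

    def get(self, order_id: str) -> int:
        return 200 if order_id in self.orders else 404


def send_order(service, order_id, body, max_attempts=3):
    """Return 'delivered' or 'rejected'; raise if delivery cannot be confirmed."""
    for _ in range(max_attempts):
        try:
            status = service.put(order_id, body)
        except TransportError:
            # Outcome unknown: ask the target what state it is in.
            if service.get(order_id) == 200:
                return "delivered"
            continue  # not stored: safe to retry because PUT is idempotent
        if status in (200, 201):
            return "delivered"
        if 400 <= status < 500:
            return "rejected"  # the sender has the context to handle this
        # Any other status (e.g., 5xx) falls through to another attempt.
    raise RuntimeError(f"could not confirm delivery of {order_id}")
```

In every branch the sender ends up knowing the outcome, or knowing that it does not know and must escalate. At no point is that knowledge handed to an intermediary.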

There are situations where fire-and-forget is appropriate; however, they are limited to cases where the business value of the message is relatively short-lived (such as the sending of stock ticks, where the next tick renders the previous one obsolete).

Does this also mean we should stop using JMS (and similar, such as TIBCO Rendezvous) altogether? Definitely not. These types of messaging products provide an excellent way to defer delivery responsibility. However, in order to maintain delivery accountability, the sending application must use either synchronous or asynchronous request-reply, and in the asynchronous case it must additionally be able to detect and handle situations where a response is not successfully processed within a specified timeframe. In either case, the sending system must also be able to issue appropriate queries against the target application(s), so that it can handle errors that leave it not knowing what state the target system is in.
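For the asynchronous case, one way to detect a missing response is for the sender to keep its own record of outstanding requests, keyed by a correlation identifier. The Python sketch below is a minimal illustration, not a prescribed design: `PendingRequests`, `resolve_overdue`, `query_target`, and `resend` are hypothetical names, time is passed in explicitly to keep the example deterministic, and the target is assumed to deduplicate resent messages by correlation identifier.

```python
from dataclasses import dataclass, field


@dataclass
class PendingRequests:
    """Tracks sent messages until they are settled or their deadline passes."""
    timeout_seconds: float
    _deadlines: dict = field(default_factory=dict, init=False)

    def sent(self, correlation_id: str, now: float) -> None:
        """Record a send (or resend) and start its response clock."""
        self._deadlines[correlation_id] = now + self.timeout_seconds

    def settle(self, correlation_id: str) -> bool:
        """Mark a request as resolved; return False if it was not outstanding."""
        return self._deadlines.pop(correlation_id, None) is not None

    def overdue(self, now: float) -> list:
        """Return the correlation ids whose deadline has passed, sorted."""
        return sorted(cid for cid, due in self._deadlines.items() if due <= now)


def resolve_overdue(pending, now, query_target, resend):
    """For each overdue request, query the target and resend only if needed."""
    outcomes = {}
    for cid in pending.overdue(now):
        if query_target(cid):
            # The target processed the message; only the reply went missing.
            pending.settle(cid)
            outcomes[cid] = "confirmed"
        else:
            resend(cid)
            pending.sent(cid, now)  # restart the clock for the new attempt
            outcomes[cid] = "resent"
    return outcomes
```

A reply handler would call `settle` with the correlation identifier of each response it processes successfully, and a scheduled job would call `resolve_overdue` periodically. A real implementation would also need to persist the pending set and cap the number of resends.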

Fire-and-forget is initially attractive, as it allows developers to ignore the various error scenarios that may occur. But, in most cases, this approach is fundamentally flawed, as shown above. Well-designed REST- and SOAP-based systems force the developer to consider and handle the errors, as there is never a situation where they can claim “we sent you the message, it’s your fault it didn’t get through,” because delivery accountability always resides with them.