In a recent blog post, Sharding Kafka for Increased Scale and Reliability, the CrowdStrike Engineering Site and Reliability Team shared how it overcame scaling limitations within Apache Kafka so that it could quickly and effectively process trillions of events daily. In this post, we focus on the other side of this equation: What happens when one of those messages inevitably fails?
When a message cannot be processed, it becomes what is known as a “dead letter.” The service attempts to process the message by normal means several times to eliminate intermittent failures. However, when all of those attempts fail, the message is ultimately “dead lettered.” In highly scalable systems, these failed messages must be dealt with so that processing can continue on subsequent messages. To retain the dead letter’s information and continue processing messages, the message is stored so that it can be later addressed manually or by an automated tool.
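The retry-then-dead-letter flow described above can be sketched in a few lines. This is a minimal illustration, not our production code: the `Processor` and `DeadLetter` names, the in-memory `dead_letters` list and the default of three attempts are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DeadLetter:
    """A message that exhausted its retries, kept with the last error seen."""
    payload: bytes
    error: str
    attempts: int


@dataclass
class Processor:
    handler: Callable[[bytes], None]  # hypothetical business logic
    max_attempts: int = 3             # assumed retry budget
    dead_letters: List[DeadLetter] = field(default_factory=list)

    def process(self, payload: bytes) -> bool:
        """Return True if handled, False if the message was dead lettered."""
        last_error = ""
        for _ in range(self.max_attempts):
            try:
                self.handler(payload)
                return True
            except Exception as exc:  # this sketch retries on any failure
                last_error = repr(exc)
        # Every attempt failed: keep the message and move on to the next one.
        self.dead_letters.append(
            DeadLetter(payload, last_error, self.max_attempts)
        )
        return False
```

A real service would likely wait between attempts and write the dead letter to durable storage rather than a list, but the control flow stays the same: retry, capture, continue.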
In Best Practices: Improving Fault-Tolerance in Apache Kafka Consumer, we go into great detail about the different failure types and techniques for recovery, which include redriving and dead letters. Here our aim is to solidify those terms and expound upon the processes surrounding these mechanisms.
Processing dead letters can be a fairly time-consuming and error-prone process. So what can be done to expedite this task and improve its outcome? Here we explore three steps organizations can take to develop the code and infrastructure needed to more effectively and efficiently capture, investigate and redrive dead letter messages.
Dead Letter Basics
What is a message? A message is the record of any communication between two or more services.
Why does a message fail? Messages can fail for a variety of reasons, some of the most common being incompatible message format, unavailable dependent services, or a bug in the service processing the message.
Why does it matter if a message fails? In most cases, a message is being sent because it is sharing important information with another service. Without that knowledge, the service that should be receiving the message can have outdated or inaccurate information and make bad decisions or be completely unable to act.
Three Best Practices for Resolving Dead Letter Messages
1. Define the infrastructure and code to capture and redrive dead letters
As explained above, a dead letter occurs when a service cannot process a message. Most systems have some mechanism in place, such as a log or object storage, to capture the message, review it, identify the issue, resolve the issue and then retry the message once it’s more likely to succeed. This act of replaying the message is known as “redriving.”
To enable the redrive process, organizations need two basic things: 1) the necessary infrastructure to capture and store the dead letter messages, and 2) the right code to redrive that message.
Since there could potentially be hundreds of millions of dead letters that need to be stored, we recommend using a storage option that meets these four criteria: low cost (especially critical as your data scales), abundant space (no concerns around running out of storage space), durability (no data loss or corruption) and availability (the data is available to restore during disaster recovery). We use Amazon S3.
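Whatever storage is chosen, each dead letter needs a predictable key and a record that preserves the original bytes exactly. The sketch below shows one possible shape. The key layout, the field names and the `encode_dead_letter` and `decode_payload` helpers are hypothetical, and it deliberately stops short of calling any storage API.

```python
import base64
import json
from datetime import datetime


def dead_letter_key(service: str, message_id: str, failed_at: datetime) -> str:
    """Build a date-partitioned object key (hypothetical layout)."""
    return f"dead-letters/{service}/{failed_at:%Y/%m/%d}/{message_id}.json"


def encode_dead_letter(
    message_id: str, payload: bytes, error: str, failed_at: datetime
) -> str:
    """Serialize a dead letter; base64 keeps the payload byte-for-byte intact."""
    record = {
        "message_id": message_id,
        "failed_at": failed_at.isoformat(),
        "error": error,
        "payload_b64": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(record, sort_keys=True)


def decode_payload(body: str) -> bytes:
    """Recover the original message bytes from a stored record."""
    return base64.b64decode(json.loads(body)["payload_b64"])
```

Partitioning keys by service and date is one way to keep listing and cleanup cheap as the number of stored dead letters grows. The string returned by `encode_dead_letter` would be the object body written under the key from `dead_letter_key`.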
For short-term storage and alerting, we recommend using a message queue technology that allows the user to send messages to be processed at a later point. Then your service can be configured to read from the message queue to begin processing the redrive messages. We use Amazon SQS and Kafka as our message queues.
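As a rough sketch of the consuming side, the function below drains a redrive queue and reports which messages failed again. Python's in-process `queue.Queue` stands in for a real message queue here, and `drain_redrive_queue` and `handler` are hypothetical names used only for illustration.

```python
import queue
from typing import Callable, List


def drain_redrive_queue(
    redrive_queue: queue.Queue, handler: Callable[[bytes], None]
) -> List[bytes]:
    """Reprocess every queued message; return the ones that failed again."""
    failed_again: List[bytes] = []
    while True:
        try:
            payload = redrive_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to redrive
        try:
            handler(payload)
        except Exception:
            # Still failing: hand it back to the caller for another look.
            failed_again.append(payload)
    return failed_again
```

Returning the repeat failures, rather than silently dropping or requeueing them, keeps a message that is still broken from looping forever.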
2. Put tooling in place to make remediation foolproof
The process outlined above can be very error-prone when done manually, as it involves many steps: finding the message, copying its contents, pasting it into a new message and submitting that message to the queue. If the user misses even one character when copying the message, then it will fail again — and the process will need to be repeated. This process must be done for every failed message, making it potentially time-consuming as well.
Because the steps are the same for every dead letter, the process can be automated. To that end, organizations should develop a command-line tool to automate common actions with dead letters, such as viewing the dead letter, putting the message in the redrive queue and having the service consume messages from the queue for reprocessing. When every engineer uses the same tool to diagnose and resolve dead letters, the work is done the same way each time, which reduces the risk of human error.
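To make the idea concrete, here is a minimal sketch of such a tool. It is an assumption-laden stand-in, not our internal tooling: the `deadletter` program name, the `view` and `redrive` subcommands, and the local `dead-letters/` and `redrive/` directories (substituting for real storage and a real queue) are all hypothetical.

```python
import argparse
import shutil
import sys
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="deadletter")
    parser.add_argument(
        "--root", type=Path, default=Path("."),
        help="directory holding dead-letters/ and redrive/ (local stand-ins)",
    )
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("view", "redrive"):
        sub.add_parser(name).add_argument("message_id")
    return parser


def main(argv=None) -> int:
    """Run one command and return a process exit code."""
    args = build_parser().parse_args(argv)
    source = args.root / "dead-letters" / f"{args.message_id}.json"
    if not source.is_file():
        print(f"no dead letter with id {args.message_id}", file=sys.stderr)
        return 1
    if args.command == "view":
        print(source.read_text())
    else:
        target_dir = args.root / "redrive"
        target_dir.mkdir(parents=True, exist_ok=True)
        # Copy the stored file as-is so nothing is retyped or pasted by hand.
        shutil.copyfile(source, target_dir / source.name)
    return 0
```

Calling `main(["view", "abc"])` prints the stored record, and `main(["redrive", "abc"])` copies it into the redrive directory unchanged. Because the tool moves the stored message itself, the copy-and-paste step where a single missed character causes another failure disappears.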
3. Standardize and document the process to ensure ease-of-use
Our third best practice concerns standardization. Because not all engineers will be familiar with the organization's process for dealing with dead letter messages, it is important to document all aspects of the procedure. Some basic questions your documentation should address include:
- How does the organization know when a dead letter message occurs? Is an alert set up? Will an email be sent?
- How does the team investigate the root cause of the error? Is there a specific phrase they can search for in the logs to find the errors associated with a dead letter?
- Once the failure has been investigated and a fix has been deployed, how is the message reprocessed or redriven?
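One way to make the second question easy to answer is to log every dead letter with a fixed, searchable phrase and the message's identifier. The snippet below is only a sketch of that idea. The `DEAD_LETTER` marker, the `message_id` field and the `log_dead_letter` helper are hypothetical names, not an established convention.

```python
import logging

# Hypothetical fixed phrase that the documentation tells engineers to search for.
DEAD_LETTER_MARKER = "DEAD_LETTER"


def log_dead_letter(
    logger: logging.Logger, message_id: str, error: Exception
) -> None:
    """Emit one error line that ties a dead letter to its cause."""
    logger.error(
        "%s message_id=%s error=%r", DEAD_LETTER_MARKER, message_id, error
    )
```

With a convention like this, the documentation can simply say to search the logs for the marker plus the message ID, and alerting can key off the same phrase.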
Documenting and standardizing the process in this way ensures that anyone on the team can pick up, solve and redrive dead letters. Ideally, the documentation will be relatively short and intuitive, outlining the following steps:
- How to read the content of the message and review the logs to help figure out what happened
- How to run the commands for your dead letter tool
- How to put the message in the redrive queue to be reprocessed
- What to do if the message is rejected again
It’s important to have this “cradle-to-grave” mentality when dealing with dead letter messages — pun intended — since a disconnect anywhere within the process could prevent the organization from successfully reprocessing the message.
While many organizations focus on processing massive amounts of messages and scaling those capabilities, it is equally important to ensure errors are captured and solved efficiently and effectively.
In this post, we shared our three best practices for organizations to develop the infrastructure and tooling to ensure that any engineer can properly manage a dead letter. But we certainly have more to share! We would be happy to address any specific questions or explore related topics of interest to the community in future blog posts.
Got a question, comment or idea? Feel free to share your thoughts for future posts on social media via @CrowdStrike.