Our client operates in the financial industry, where rapid and secure processing of a massive volume of transactions is critical. To meet the demands of their business scale, they adopted a microservices architecture, and for efficient data management and communication between these services, they integrated Apache Kafka into their ecosystem.
- Dead Letter Topic (DLT) Handling: When a Kafka payload fails processing on the consumer's end, it is redirected to a Dead Letter Topic (DLT). Managing this process seamlessly posed a considerable challenge.
- Diverse Failure Causes: Payload processing failures stemmed from a number of causes, including network disruptions, code-level errors on the consumer's end, temporary glitches, and more. Addressing these diverse causes demanded a comprehensive approach.
- Zero-Data Loss Objective: Once these issues are resolved, the data in the DLT must be reprocessed to achieve zero data loss, i.e., no payload goes unprocessed on the consumer's end.
To address the client's challenges, we designed a Microservice-based Java Spring Boot Application integrated with Apache Kafka, Docker, Kubernetes, Grafana, and Prometheus.
Categorized Failures: Whenever a consumer fails to process a payload, the failure is categorized into one of two categories: retriable and non-retriable.
- Example of non-retriable data: a required value missing from the payload, which no retry can fix
- Example of retriable data: a temporary network failure, a programming issue on the consumer's end that can be corrected, etc.
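This categorization step can be sketched as a simple classifier that maps a processing exception to a category. The exception types used here are assumptions for illustration; the real consumers would register their own domain-specific exception classes.

```java
import java.util.Set;

// Sketch: map a processing failure to a retriable / non-retriable category.
public class FailureClassifier {

    public enum Category { RETRIABLE, NON_RETRIABLE }

    // Failures no retry can fix, e.g. a payload missing a required value.
    // (IllegalArgumentException as a stand-in for a validation failure.)
    private static final Set<Class<? extends Exception>> NON_RETRIABLE =
            Set.of(IllegalArgumentException.class);

    public static Category classify(Exception e) {
        if (NON_RETRIABLE.contains(e.getClass())) {
            return Category.NON_RETRIABLE;
        }
        // Network disruptions, temporary glitches, and fixable consumer
        // bugs fall through here: re-delivery after the fix can succeed.
        return Category.RETRIABLE;
    }
}
```

A default of "retriable" is the safer choice for the zero-data-loss objective: an uncertain failure stays eligible for replay rather than being parked permanently.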
- Effective DLT Handling: Non-retriable data is routed to a single designated Dead Letter Topic (DLT), while a separate DLT is maintained for each topic that generates retriable data.
- Notification System: From the DLTs, notification alerts are generated, which send the data to Grafana for monitoring, email administrators, and trigger other notification systems. These notifications are shipped at 5-minute intervals (configurable) so that alerts do not overflow, given that payloads arrive at a much faster rate.
- Retriable Data Recovery: Along with notifications, retriable payloads are reprocessed once the issue is fixed on the consumer's end. Dedicated Java-based Spring Boot APIs were written for this purpose; they are triggered manually with the topic name and other information as parameters.
- Dynamic Consumer Engagement: Based on the topic name, these APIs dynamically start consumers that read the messages from the associated DLT and re-publish them to Kafka, so the original consumers process the data again. If processing fails again, the same cycle repeats.
- Controlled DLT Message Flow: Once all messages in a DLT have been re-published to Kafka, an API can be called to stop consuming from that particular DLT.
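The DLT routing rule described above can be sketched as a small naming function: one shared DLT for non-retriable data, and one DLT per source topic for retriable data. The topic-name suffixes below are assumptions for illustration, not the client's actual naming convention.

```java
// Sketch of the DLT routing described in the case study.
public class DltRouter {

    public enum Category { RETRIABLE, NON_RETRIABLE }

    // Single designated topic for data that can never be replayed.
    private static final String NON_RETRIABLE_DLT = "non-retriable.DLT";

    public static String destination(String sourceTopic, Category category) {
        if (category == Category.NON_RETRIABLE) {
            return NON_RETRIABLE_DLT;   // one shared DLT for all topics
        }
        return sourceTopic + ".DLT";    // one retry DLT per source topic
    }
}
```

Keeping a dedicated DLT per retriable source topic is what later lets the replay APIs drain one topic's backlog at a time.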
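The interval-based alerting described above can be sketched as a throttle: no matter how fast DLT records arrive, at most one notification is shipped per window. The 5-minute default mirrors the configurable interval mentioned in the notification bullet; the class and method names are illustrative.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: ship at most one alert per configurable window.
public class AlertThrottle {

    private final Duration window;
    private final AtomicReference<Instant> lastSent =
            new AtomicReference<>(Instant.MIN);

    public AlertThrottle(Duration window) {
        this.window = window;
    }

    /** Returns true if an alert may be shipped now, and records the send. */
    public boolean tryAcquire(Instant now) {
        Instant prev = lastSent.get();
        return Duration.between(prev, now).compareTo(window) >= 0
                && lastSent.compareAndSet(prev, now);
    }
}
```

Alerts suppressed inside the window are not lost information: the underlying payloads remain in the DLT, so the next shipped notification still reflects the full backlog.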
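The start/stop control exposed by the replay APIs can be sketched as a registry of per-DLT run flags. In the real system these calls would resume or pause Spring Kafka listener containers keyed by DLT topic; the plain map below stands in for that registry, so its structure and names are illustrative assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: per-DLT run flags consulted by the replay consumers.
public class DltReplayRegistry {

    private final Map<String, AtomicBoolean> running = new ConcurrentHashMap<>();

    /** Called by the start API with the DLT topic to drain. */
    public void start(String dltTopic) {
        running.computeIfAbsent(dltTopic, t -> new AtomicBoolean()).set(true);
    }

    /** Called by the stop API once all DLT messages are re-published. */
    public void stop(String dltTopic) {
        running.computeIfAbsent(dltTopic, t -> new AtomicBoolean()).set(false);
    }

    /** Replay consumers poll only while their topic's flag is set. */
    public boolean isRunning(String dltTopic) {
        AtomicBoolean flag = running.get(dltTopic);
        return flag != null && flag.get();
    }
}
```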
This case study demonstrates the value of Kafka-driven notifications and reprocessing for failed messages. Through careful categorization, efficient issue resolution, and dynamic reprocessing, the project effectively reduced data loss. Leveraging Java Spring Boot, Apache Kafka, and other tools, the client gained the ability to manage substantial transaction volumes with confidence. This success reflects our commitment to building robust systems that ensure data integrity and operational stability in the ever-changing financial landscape.