Dead Letter Queue (DLQ)

Work in Progress

This documentation section is still a draft that needs to be reviewed.

Overview

This document describes the Dead Letter Queue (DLQ) mechanism in Hexkit, which provides robust error handling for event processing failures in Kafka-based event systems. The DLQ support allows services to gracefully handle failed event processing by redirecting problematic events to a dedicated queue for later analysis or reprocessing, rather than crashing the service.

For information about general event subscription, see Event Driven Architecture.

DLQ Lifecycle

Apache Kafka doesn't provide out-of-the-box DLQ functionality, so Hexkit bridges that gap. The Kafka event subscriber provider in Hexkit can be configured to handle a DLQ event flow automatically with only minor changes to the standard usage. When an event consumer encounters an exception while processing an event, the system can:

  1. Retry the event processing (if configured)
  2. When retries are exhausted, publish the failed event to a dedicated DLQ topic
  3. Continue processing other events instead of crashing
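
The steps above can be sketched in plain Python. This is an illustrative sketch only, not Hexkit's internal code; the names process_with_dlq, handle, and publish_to_dlq are hypothetical:

```python
import asyncio


async def process_with_dlq(event, handle, publish_to_dlq, max_retries=3, retry_backoff=2):
    """Sketch of the retry-then-DLQ flow (hypothetical, not hexkit's actual code)."""
    for attempt in range(max_retries + 1):
        try:
            await handle(event)  # the service's event-processing logic
            return  # success: nothing more to do
        except Exception as error:
            if attempt == max_retries:
                # retries exhausted: divert the event instead of crashing
                await publish_to_dlq(event, error)
                return
            # exponential backoff before the next attempt
            await asyncio.sleep(retry_backoff * 2**attempt)
```

The key point is the final except branch: instead of letting the exception propagate and crash the consumer, the event is published to the DLQ and the loop returns, so the next event can be consumed.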

DLQ Lifecycle Diagram

The flow diagram above demonstrates the general DLQ lifecycle:

  1. Failed events are retried a configurable number of times.
  2. Upon final failure, they are published to the configured DLQ topic.
  3. Events in the DLQ are manually reviewed through an external solution.
  4. DLQ events are eventually retried by publishing them to a service-specific retry-* topic.
  5. Upon consuming an event from the retry topic, the consumer restores the original topic name and proceeds with the normal request flow.
  6. The event is consumed again, this time successfully.
  7. If the event fails again for some reason, the DLQ process restarts.

DLQ Event Structure

When an event is published to the DLQ, it maintains the original payload, type, and key, but includes additional headers with information about the failure:

Header           Description
event_id         Service, topic, partition, and offset of original event
original_topic   The topic where the event was originally published
exc_class        The class name of the exception that caused the failure
exc_msg          The error message from the exception
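
As an illustration, these extra headers could be assembled as follows. This is a sketch following the table above, not Hexkit's internal code, and the function name is hypothetical:

```python
def build_dlq_headers(*, headers, service, topic, partition, offset, error):
    """Return the original event headers augmented with DLQ failure metadata."""
    return {
        **headers,  # keep the original type_, correlation_id, etc.
        "event_id": f"{service},{topic},{partition},{offset}",
        "original_topic": topic,
        "exc_class": type(error).__name__,
        "exc_msg": str(error),
    }
```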

Example DLQ Event

Assume the following event is consumed from a users topic and results in an error:

{
  "payload": {"user_id": "abc123"},
  "key": "abc123",
  "headers": {
    "type_": "user_registered",
    "correlation_id": "a648c68c-f14b-4d0a-8fc8-31824987613c",
  }
}

That event then becomes the following when published to the DLQ topic:

{
  "payload": <original event payload>,
  "key": <original event key>,
  "headers": {
    "type_": <original event type>,
    "correlation_id": <original correlation ID>,
    "event_id": "my-service,users,0,101",
    "original_topic": "users",
    "exc_class": "ValueError",
    "exc_msg": "Invalid data format"
  }
}

You can see that the DLQ event carries extra header information that can be used to triage and resolve the event later.
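
Step 5 of the lifecycle (restoring the original event from its DLQ headers before reprocessing) could look like this in plain Python. Again a sketch, not Hexkit's code; prepare_for_retry is a hypothetical name:

```python
def prepare_for_retry(dlq_event):
    """Strip DLQ metadata and recover the original topic for reprocessing."""
    headers = dict(dlq_event["headers"])
    original_topic = headers.pop("original_topic")
    for dlq_header in ("event_id", "exc_class", "exc_msg"):
        headers.pop(dlq_header, None)
    return {
        "payload": dlq_event["payload"],
        "key": dlq_event["key"],
        "headers": headers,  # back to just type_, correlation_id, etc.
        "topic": original_topic,  # where the event should be consumed again
    }
```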

Consuming From the DLQ

While Hexkit makes it easy to divert problematic events to a DLQ topic and to reintroduce them once resolved, it does not provide a comprehensive toolbox for DLQ event resolution. Monitoring and maintaining the DLQ topic and its events is entirely the user's responsibility, although Hexkit does provide some facilitating classes; GHGA, for example, uses a dedicated DLQ Service.

For consuming events from the DLQ, Hexkit provides the DLQSubscriberProtocol. This protocol extends the standard EventSubscriberProtocol with additional parameters for accessing the DLQ metadata.
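
To illustrate the general shape of such an interface (this is a hypothetical sketch, not the actual signature of Hexkit's DLQSubscriberProtocol), a DLQ-aware handler might receive the DLQ headers alongside the usual event fields:

```python
from typing import Any, Protocol


class DLQAwareSubscriber(Protocol):
    """Hypothetical DLQ-aware handler interface (not hexkit's actual API)."""

    async def consume(
        self,
        *,
        payload: dict[str, Any],
        type_: str,
        topic: str,
        key: str,
        dlq_headers: dict[str, str],  # event_id, original_topic, exc_class, exc_msg
    ) -> None:
        """Process one DLQ event, with access to its failure metadata."""
        ...
```

A concrete implementation could, for example, log exc_class and exc_msg for triage, or decide based on original_topic whether an event is safe to requeue.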

Configuration Parameters

The DLQ functionality is controlled by several configuration parameters in the KafkaConfig class:

Parameter            Description                                                     Default  Example
kafka_max_retries    Maximum number of times to retry failed events                  0        3
kafka_enable_dlq     Toggle to enable/disable DLQ functionality                      False    True
kafka_dlq_topic      Topic name for the Dead Letter Queue                            "dlq"    "my-dlq"
kafka_retry_backoff  Base seconds to wait before retrying (doubles with each retry)  0        2

Minimal Example Configuration:

config = KafkaConfig(
    service_name="my-service",
    service_instance_id="instance-1",
    kafka_servers=["kafka:9092"],
    kafka_max_retries=3,
    kafka_enable_dlq=True,
    kafka_dlq_topic="dlq",
    kafka_retry_backoff=2
)

Retry Mechanism

The retry mechanism operates alongside and independently of the DLQ feature, but the two are intended to be used in concert. The DLQ filters events that would otherwise cause a service to crash, but oftentimes those failures are due to transient issues like database connection interruptions. The retry mechanism can help prevent clogging the DLQ with transient failures that don't represent genuine errors. The basic retry logic is straightforward:

  1. If retries are enabled (kafka_max_retries > 0), the failed event is retried before it is considered for the DLQ
  2. Each retry attempt uses exponential backoff based on the kafka_retry_backoff setting
  3. The backoff time doubles with each retry attempt: backoff_time = retry_backoff * 2^(retry_number - 1)
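
For example, with kafka_retry_backoff=2 the waits for the first three retries are 2, 4, and 8 seconds. The formula from step 3 can be written directly (the function name here is illustrative):

```python
def backoff_time(retry_backoff: int, retry_number: int) -> int:
    """Seconds to wait before the given retry (1-based), doubling each time."""
    return retry_backoff * 2 ** (retry_number - 1)


# With kafka_retry_backoff=2, the first three retries wait 2, 4, and 8 seconds:
waits = [backoff_time(2, n) for n in (1, 2, 3)]
print(waits)  # [2, 4, 8]
```

Note that with the default kafka_retry_backoff of 0, every retry happens without any delay.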