Event-Driven Architecture

This document provides an overview of the asynchronous communication methods between microservices. It offers both a high-level explanation of how information is sent and received and an examination of the layers of abstraction implemented within hexkit. The document also outlines how the outbox pattern is employed to solve certain challenges.

In the following, we assume that Apache Kafka is used as the event streaming infrastructure, since this is the primary and currently only infrastructure that hexkit supports. However, the idea behind hexkit is that it could be extended to support other event streaming infrastructures, such as Apache Pulsar, or, in principle, message brokers or event bus services.

Event-Driven Architecture with Apache Kafka

See the Kafka documentation for in-depth information on Kafka's inner workings.

Event-driven architecture refers to a communication pattern where information is exchanged indirectly and often asynchronously by interacting with a broker. Rather than making direct API calls from one service to another, one service, called the producer, publishes information to a broker in the form of an event (also called a message), and is unaware of all downstream processing of the event. It is essentially "fire and forget". Other services, called consumers, can consume the event once it has been published. Similarly, just as the producer is unaware of an event's fate, consumers are unaware of its origin. This has important implications for design.

In Kafka, events have metadata headers, a timestamp, a key, and a payload. Hexkit adds a type field and an optional event ID field in the metadata headers to further distinguish events. Events are organized into configurable topics, and further organized within each topic by key. A topic can have any number of producers and consumers, and messages are not deleted after consumption (enabling rereads).
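The anatomy of an event described above can be sketched as a simple data structure. This is an illustrative model only; the field names are hypothetical and do not mirror hexkit's internal classes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class Event:
    """Illustrative shape of an event: key, payload, plus hexkit's
    type and event ID headers and a timestamp."""

    topic: str    # configurable topic the event is published to
    key: str      # used for per-key ordering and topic compaction
    payload: dict  # the actual message body
    type_: str    # hexkit's type header, e.g. "user_created"
    event_id: str = field(default_factory=lambda: str(uuid4()))
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


event = Event(
    topic="users",
    key="user-123",
    payload={"name": "Ada", "email": "ada@example.org"},
    type_="user_created",
)
print(event.topic, event.type_, event.key)
```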

In the context of microservices, event-driven architecture provides some important benefits:

  1. Decoupling of Service Dependencies: Services communicate by emitting events, rather than direct requests, reducing service coupling. Keeping contractual definitions (i.e. payload schemas) in a dedicated library means services can evolve with minimal change propagation.
  2. Asynchronous Communication: Kafka facilitates asynchronous data processing, allowing services to operate independently without waiting for responses. This improves performance and fault tolerance, as services can continue functioning even if one component is slow or temporarily unavailable.
  3. Consistency and Order Guarantee: Kafka maintains the order of events within a partition for a given key, crucial for consistency in processing events that are related or dependent upon one another.

However, event-driven architecture also introduces challenges:

  1. Duplicate processing: Because event consumers have no way to know whether an event is a duplicate or a genuinely distinct event with the same payload, consumers must be designed to be idempotent.
  2. Data Duplication: If service A needs to access data maintained by service B, the options are essentially an API call (which introduces security, performance, and coupling concerns) or duplicating service B's data for service A. Hexkit mitigates this issue through controlled data duplication using the outbox pattern, described later in this document.
  3. Tracing: A single request flow can span many services with a multitude of messages being generated along the way. The agnostic nature of producers and consumers makes it harder to tag requests than with traditional API calls, and several consumers can process a given message in parallel, sometimes leading to complex request flows. Hexkit enables basic tracing by generating and propagating a correlation ID (also called a request ID) as an event header.
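The duplicate-processing challenge in point 1 is typically handled by deduplicating on an identifier rather than on the payload. A minimal sketch, assuming events carry the optional event ID header mentioned earlier (the class below is illustrative, not part of hexkit):

```python
class IdempotentConsumer:
    """Skips events whose ID has already been seen, so redelivery of the
    same event has no additional effect."""

    def __init__(self):
        self._seen: set[str] = set()
        self.processed: list[dict] = []

    def consume(self, event_id: str, payload: dict) -> bool:
        """Return True if the event was processed, False if deduplicated."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.processed.append(payload)
        return True


consumer = IdempotentConsumer()
assert consumer.consume("evt-1", {"user": "ada"}) is True
assert consumer.consume("evt-1", {"user": "ada"}) is False  # duplicate ignored
assert len(consumer.processed) == 1
```

Note that this only distinguishes redeliveries of the same event; two genuinely distinct events with identical payloads still carry different IDs and are both processed.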

Abstraction Layers

Hexkit features protocols and providers to simplify Kafka usage in services. Rather than implementing this functionality from scratch in every new project, hexkit provides the KafkaEventPublisher and KafkaEventSubscriber provider classes, along with a KafkaConfig Pydantic configuration class.

Producers

When a service publishes an event, it uses the KafkaEventPublisher provider class along with a service-defined translator (a class implementing the EventPublisherProtocol). The translator's purpose is to give the service's core a domain-bound API for publishing events without exposing any provider-specific details. The core calls a method like publish_user_created() on the translator, and the translator supplies the requisite payload, topic, type, and key to the KafkaEventPublisher. The translator's module usually features a config class defining the topics and event types used. Note that a translator is not strictly required in order to use the KafkaEventPublisher: as long as the required configuration is provided, nothing stops the core from using it directly. However, bypassing the translator makes switching to a different infrastructure (one that hexkit might support in the future) more difficult.
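The translator pattern can be sketched as follows. The interface and class names here are hypothetical stand-ins (the real contract is hexkit's EventPublisherProtocol); an in-memory publisher takes the place of the KafkaEventPublisher so the example is self-contained.

```python
import asyncio
from abc import ABC, abstractmethod


class EventPublisherPort(ABC):
    """Hypothetical port standing in for hexkit's EventPublisherProtocol."""

    @abstractmethod
    async def publish(self, *, payload: dict, type_: str, topic: str, key: str):
        ...


class UserEventTranslator:
    """Domain-bound API: the core calls publish_user_created() and never
    sees topics, types, or keys."""

    def __init__(self, publisher: EventPublisherPort, topic: str = "users"):
        self._publisher = publisher
        self._topic = topic

    async def publish_user_created(self, user_id: str, name: str):
        await self._publisher.publish(
            payload={"user_id": user_id, "name": name},
            type_="user_created",
            topic=self._topic,
            key=user_id,  # keyed by user ID for per-user ordering
        )


class InMemoryPublisher(EventPublisherPort):
    """Test double taking the place of the KafkaEventPublisher."""

    def __init__(self):
        self.published = []

    async def publish(self, *, payload, type_, topic, key):
        self.published.append((topic, type_, key, payload))


publisher = InMemoryPublisher()
asyncio.run(UserEventTranslator(publisher).publish_user_created("u1", "Ada"))
print(publisher.published)
```

Because the core only depends on the port, swapping Kafka for another infrastructure means providing a different publisher implementation, with no changes to the core.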

Consumers

While the KafkaEventPublisher can be used directly, the KafkaEventSubscriber always requires a translator to deliver event payloads to the service's core components. The translator is defined in the service code, not in hexkit itself, and implements the EventSubscriberProtocol. The provider passes the event to the translator through the protocol's consume method, which in turn calls an abstract method implemented by the translator. The translator receives the event and examines its attributes (the topic, payload, key, headers, etc.) to determine how to process it. Naturally, the actual processing logic varies, but usually the payload is validated against the expected schema and used to call a method in the service's core. When control is returned to the KafkaEventSubscriber, the underlying consumer (from AIOKafka) saves its offsets, declaring that it has successfully handled the event and is ready for the next.
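The consumer-side flow can be sketched in the same spirit. Again, the class and method names below are illustrative (hexkit's EventSubscriberProtocol defines the real interface): a concrete entry point receives the event and delegates to an abstract method that the service's translator implements.

```python
import asyncio
from abc import ABC, abstractmethod


class EventSubscriberPort(ABC):
    """Hypothetical stand-in for hexkit's EventSubscriberProtocol."""

    async def consume(self, *, payload: dict, type_: str, topic: str, key: str):
        # The provider calls this entry point, which delegates to the
        # abstract method implemented by the service's translator.
        await self._handle(payload=payload, type_=type_, topic=topic, key=key)

    @abstractmethod
    async def _handle(self, *, payload, type_, topic, key):
        ...


class Core:
    """Minimal stand-in for the service's core."""

    def __init__(self):
        self.users = {}

    def register_user(self, user_id: str, name: str):
        self.users[user_id] = name


class UserEventSubscriber(EventSubscriberPort):
    def __init__(self, core: Core):
        self._core = core

    async def _handle(self, *, payload, type_, topic, key):
        if type_ == "user_created":  # dispatch on the event type header
            self._core.register_user(payload["user_id"], payload["name"])


core = Core()
asyncio.run(UserEventSubscriber(core).consume(
    payload={"user_id": "u1", "name": "Ada"},
    type_="user_created", topic="users", key="u1",
))
assert core.users == {"u1": "Ada"}
```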

[Diagram: the process of event publishing and consuming, starting with the microservice, through the Kafka abstraction layers]

Outbox Pattern

Sometimes services need to access data owned by another service. The challenge of enabling this kind of linkage involves ensuring data consistency and avoiding coupling. The outbox pattern provides a way to mitigate these risks by publishing persisted data as Kafka events for other services to consume as needed. Some advantages include:

  • Events can be republished from the database as needed (fault tolerance)
  • The coupling between services is minimized
  • The data producer remains the single source of truth
  • Kafka events no longer need to be backed up separately from the database

Topic Compaction & Idempotence

Topic compaction means retaining only the latest event per key in a topic. Why do this? In hexkit's implementation of the outbox pattern, published events are categorized as one of two event types: upserted or deleted. Every event represents the entire state of the object, insofar as the producer exposes it. Consumers interpret the event type and act according to their needs (typically, at minimum, upserting or deleting the corresponding record in their own database). Combined with topic compaction, this means consumers only have to consume the latest event per key to be up to date with the originating service.
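The interplay of compaction and the upserted/deleted event types can be simulated in a few lines. The event shapes below are illustrative; compaction is modeled simply as "later event per key wins".

```python
# A log of outbox-style events: every event carries the full object state.
log = [
    {"key": "u1", "type": "upserted", "payload": {"name": "Ada"}},
    {"key": "u2", "type": "upserted", "payload": {"name": "Grace"}},
    {"key": "u1", "type": "upserted", "payload": {"name": "Ada L."}},
    {"key": "u2", "type": "deleted", "payload": {}},
]


def compact(log):
    """Keep only the latest event per key, as topic compaction does."""
    latest = {}
    for event in log:  # later events overwrite earlier ones per key
        latest[event["key"]] = event
    return list(latest.values())


# A consumer replaying the compacted log ends up with the current state:
store = {}
for event in compact(log):
    if event["type"] == "upserted":
        store[event["key"]] = event["payload"]
    else:
        store.pop(event["key"], None)

assert store == {"u1": {"name": "Ada L."}}  # u2 was deleted
```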

Implications

One requirement of using the outbox pattern as implemented in hexkit is that consumers must be idempotent. The event state is stored in the database and can be republished as needed, so services must be prepared to consume the same event multiple times. Event republishing can occur at any point, which means that sequence-dependent events might be re-consumed out of order. Depending on the complexity of the microservice landscape, it can be difficult to foresee all cases where out-of-order events could cause problems. This is why it is essential to perform robust testing and use features like topic compaction where necessary.
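One common safeguard against out-of-order replays, sketched below as an illustration rather than a hexkit feature, is to attach a monotonically increasing version (or timestamp) to upserts and ignore any event older than the state already stored, so a stale republished event cannot overwrite newer data.

```python
# Last-write-wins upsert keyed on a version number (illustrative only).
store: dict[str, dict] = {}


def upsert(key: str, version: int, payload: dict) -> bool:
    """Apply the upsert only if it is newer than the stored state."""
    current = store.get(key)
    if current is not None and current["version"] >= version:
        return False  # stale replayed event: drop it
    store[key] = {"version": version, "payload": payload}
    return True


assert upsert("u1", 2, {"name": "Ada L."}) is True
assert upsert("u1", 1, {"name": "Ada"}) is False  # out-of-order replay ignored
assert store["u1"]["payload"] == {"name": "Ada L."}
```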

Dead Letter Queue

Event consumers are sometimes unable to process received events due to validation errors, data inconsistencies, or temporary service failures. Since service consumers are expected to be idempotent and long-lived, individual event processing failures cannot be allowed to crash the service. Hexkit addresses this challenge through a Dead Letter Queue (DLQ) mechanism, which automatically handles failed events.

When an event consumer encounters an exception while processing an event, the KafkaEventSubscriber can be configured to:

  1. Retry the event processing a configurable number of times
  2. If retries are exhausted, publish the failed event to a dedicated DLQ topic
  3. Continue processing other events instead of crashing

Events in the DLQ retain their original payload, type, and key but include additional metadata headers with failure information (service name, original topic, exception details, etc.). This enables manual review and corrective action. Failed events can then be republished to a service-specific retry topic, where they are reprocessed with the original topic name restored.
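The retry-then-DLQ flow can be sketched as follows. The configuration knob and header names here are illustrative stand-ins (the real settings live in hexkit's KafkaConfig), but the control flow matches the three steps above.

```python
dlq: list[dict] = []  # stand-in for the dedicated DLQ topic


def process_with_dlq(event: dict, handler, *, max_retries: int = 2,
                     service: str = "my-service") -> bool:
    """Try the handler; after exhausting retries, route the event to the
    DLQ with failure metadata instead of crashing the consumer."""
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            handler(event)
            return True
        except Exception as error:
            last_error = error
    dlq.append({
        **event,  # original payload, type, and key are retained
        "headers": {
            "service": service,
            "original_topic": event["topic"],
            "exc_msg": str(last_error),
        },
    })
    return False  # the consumer moves on to the next event


def flaky_handler(event):
    raise ValueError("schema validation failed")


ok = process_with_dlq({"topic": "users", "key": "u1", "payload": {}},
                      flaky_handler)
assert ok is False
assert dlq[0]["headers"]["original_topic"] == "users"
```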

For a comprehensive understanding of the DLQ mechanism, configuration options, and recovery procedures, see the Dead Letter Queue documentation.

Tracing

Request flows are traced with a correlation ID, or request ID, which is generated at the request origin. The correlation ID is propagated through subsequent services by adding it to the metadata headers of Kafka events. When a service consumes an event, the provider automatically extracts the correlation ID and sets it as a context variable before calling the translator's consume() method. In the case of the outbox mechanism, the correlation ID is stored in the document metadata. When initial publishing occurs, the correlation ID is already set in the context. For republishing, however, the correlation ID is extracted from the stored metadata and set in the context before the event is published, thereby maintaining traceability across otherwise disjointed request flows.
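The propagation mechanism can be sketched with Python's contextvars module, which mirrors the behavior described above (hexkit's actual helpers differ in detail; the function names here are illustrative).

```python
import contextvars
from uuid import uuid4

correlation_id = contextvars.ContextVar("correlation_id", default=None)


def consume(event: dict, handler):
    # The provider extracts the ID from the event headers and sets it in
    # the context before handing control to the translator.
    token = correlation_id.set(event["headers"]["correlation_id"])
    try:
        handler(event["payload"])
    finally:
        correlation_id.reset(token)


def publish(payload: dict) -> dict:
    # Any event published while handling the request inherits the ID
    # from the context, propagating the trace downstream.
    return {"headers": {"correlation_id": correlation_id.get()},
            "payload": payload}


seen = []


def handler(payload):
    seen.append(publish({"derived_from": payload}))


consume({"headers": {"correlation_id": "abc-123"}, "payload": {"x": 1}},
        handler)
assert seen[0]["headers"]["correlation_id"] == "abc-123"
```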