In a traditional request-response architecture, Service A calls Service B, waits for a response, and processes it. This works well until:
Service B is slow (Service A blocks).
Service B is down (Service A fails).
Service C and Service D also need to react to the same event (you add more synchronous calls).
Traffic spikes cause cascading timeouts across the entire service mesh.
Event-driven architecture (EDA) solves these problems by decoupling services through events. Instead of Service A calling Service B directly, Service A publishes an event ("OrderCreated") to a message broker, and any interested service (B, C, D) subscribes to that event and processes it independently. Service A doesn't wait, doesn't know who's listening, and doesn't fail if a downstream service is temporarily unavailable.
This guide covers practical EDA implementation: choosing between Kafka and NATS, designing events, handling failures, and the patterns that make event-driven systems work in production.
When Event-Driven Architecture Makes Sense
EDA is not always the right choice. Use it when:
Multiple services need to react to the same event. "OrderCreated" triggers: inventory reservation, payment processing, email confirmation, analytics tracking, and fraud detection. In request-response, the order service calls all five services sequentially or in parallel. In EDA, it publishes one event and each service processes it independently.
Services should be decoupled. If the email service is down for 5 minutes, should order creation fail? No. With EDA, the email service processes the event when it comes back up — the order service never knew or cared.
You need to handle traffic spikes. A flash sale generates 100x normal order volume. In request-response, every downstream service must scale simultaneously. In EDA, the message broker absorbs the spike, and downstream services process events at their own pace.
You need an audit trail. Events stored in Kafka are an immutable log of everything that happened in your system. You can replay events to rebuild state, debug issues, or populate new services with historical data.
Apache Kafka: The Enterprise Standard
Kafka is the dominant event streaming platform for enterprise workloads. It provides durable, ordered, partitioned event logs that can retain events for days, weeks, or forever. Kafka guarantees at-least-once delivery and exactly-once semantics (with transactions).
// Event design best practices
// Events should be self-contained (no need to call back to the source)
interface OrderCreatedEvent {
  // Metadata
  eventId: string;       // Unique event identifier (UUID)
  eventType: 'OrderCreated';
  eventVersion: '1.0';
  timestamp: string;     // ISO 8601
  source: 'order-service';
  correlationId: string; // Links related events across services

  // Payload — contains ALL data consumers need
  data: {
    orderId: string;
    customerId: string;
    customerEmail: string; // Included so the email service doesn't need to look it up
    items: Array<{
      productId: string;
      productName: string;
      quantity: number;
      unitPrice: number;
      currency: string;
    }>;
    totalAmount: number;
    currency: string;
    shippingAddress: {
      street: string;
      city: string;
      country: string;
      postalCode: string;
    };
  };
}
// Kafka producer (Node.js with kafkajs)
import { Kafka, Partitioners } from 'kafkajs';
import { randomUUID } from 'crypto';

const kafka = new Kafka({
  clientId: 'order-service',
  brokers: ['kafka-1:9092', 'kafka-2:9092', 'kafka-3:9092'],
});

const producer = kafka.producer({
  createPartitioner: Partitioners.DefaultPartitioner,
  idempotent: true, // Prevent duplicate events on retries
});

// Connect once at startup, before the first send
await producer.connect();

async function publishOrderCreated(order: Order) {
  const event: OrderCreatedEvent = {
    eventId: randomUUID(),
    eventType: 'OrderCreated',
    eventVersion: '1.0',
    timestamp: new Date().toISOString(),
    source: 'order-service',
    correlationId: order.correlationId,
    data: {
      orderId: order.id,
      customerId: order.customerId,
      customerEmail: order.customerEmail,
      items: order.items,
      totalAmount: order.totalAmount,
      currency: order.currency,
      shippingAddress: order.shippingAddress,
    },
  };

  await producer.send({
    topic: 'orders.created',
    messages: [{
      key: order.customerId, // Partition by customer for ordering
      value: JSON.stringify(event),
      headers: {
        'event-type': 'OrderCreated',
        'event-version': '1.0',
      },
    }],
  });
}
NATS: The Lightweight Alternative
NATS is a simpler, faster messaging system that excels at real-time communication and request-reply patterns. With NATS JetStream (the persistence layer), it also handles durable event streaming. NATS is ideal when you need lower latency, simpler operations, and don't need Kafka's full feature set.
Choose Kafka when: You need long-term event retention (days/weeks/forever), exactly-once processing semantics, high-throughput event streaming (millions of events per second), or event replay capability.
Choose NATS when: You need low-latency real-time messaging (sub-millisecond), request-reply patterns alongside pub/sub, simpler operations (NATS is a single binary), or lightweight edge deployment.
Event Sourcing: Events as the Source of Truth
Event sourcing takes EDA further: instead of storing the current state of an entity (a row in a database), you store the sequence of events that led to the current state. The current state is derived by replaying events from the beginning.
For example, instead of a bank account table with a "balance" column, you store events: AccountOpened($0), MoneyDeposited($100), MoneyWithdrawn($30), MoneyDeposited($50). The current balance ($120) is calculated by replaying these events. This provides a complete audit trail, enables temporal queries ("What was the balance on March 15?"), and allows rebuilding state from scratch.
Event sourcing adds significant complexity. Use it when: you need a complete audit trail (financial systems, healthcare, legal), you need temporal queries, or different services need different views of the same data (each service builds its own read model from the events). Don't use it for simple CRUD applications where the current state is all you need.
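The bank-account example above can be sketched as a pure fold over the event log. The event names and shapes here are illustrative, not any particular event-store library's API:

```typescript
// Minimal event-sourcing sketch: current state is derived by replaying events.
type AccountEvent =
  | { type: 'AccountOpened'; initialBalance: number }
  | { type: 'MoneyDeposited'; amount: number }
  | { type: 'MoneyWithdrawn'; amount: number };

interface AccountState {
  balance: number;
}

// Fold the full event log into the current state.
function replay(events: AccountEvent[]): AccountState {
  return events.reduce<AccountState>((state, event) => {
    switch (event.type) {
      case 'AccountOpened':
        return { balance: event.initialBalance };
      case 'MoneyDeposited':
        return { balance: state.balance + event.amount };
      case 'MoneyWithdrawn':
        return { balance: state.balance - event.amount };
    }
  }, { balance: 0 });
}

const log: AccountEvent[] = [
  { type: 'AccountOpened', initialBalance: 0 },
  { type: 'MoneyDeposited', amount: 100 },
  { type: 'MoneyWithdrawn', amount: 30 },
  { type: 'MoneyDeposited', amount: 50 },
];

console.log(replay(log).balance); // 120
```

A temporal query ("What was the balance on March 15?") is then just a replay over the prefix of events up to that timestamp.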
Handling Failures in Event-Driven Systems
EDA introduces failure modes that don't exist in request-response systems:
Duplicate events: Network failures and retries cause duplicate events. Every consumer must be idempotent — processing the same event twice should produce the same result as processing it once. Use the event ID for deduplication: before processing, check if the event ID was already processed (store processed event IDs in a database or cache).
Out-of-order events: Events for the same entity may arrive out of order, especially in partitioned systems. Design consumers to handle this: include a sequence number or timestamp in events and reject events that are older than the last processed event for that entity.
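A minimal per-entity guard might look like this (tracking state in memory for illustration; a real consumer would persist the last sequence number alongside the entity):

```typescript
// Reject stale events using a per-entity sequence number.
const lastSeq = new Map<string, number>();

function acceptEvent(entityId: string, seq: number): boolean {
  const last = lastSeq.get(entityId) ?? 0;
  if (seq <= last) return false; // Older than last processed: drop or log
  lastSeq.set(entityId, seq);
  return true;
}

const r1 = acceptEvent('order-1', 1); // accepted
const r2 = acceptEvent('order-1', 3); // accepted (gaps from retries can occur)
const r3 = acceptEvent('order-1', 2); // rejected: arrived out of order
console.log(r1, r2, r3); // true true false
```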
Poison messages: Events that consistently cause processing failures. Without handling, they block the consumer forever. Implement a dead letter queue (DLQ): after N failed processing attempts, move the event to a DLQ for manual investigation and continue processing subsequent events.
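The retry-then-DLQ flow can be sketched like this (shown synchronously for brevity; a real consumer handler would be async, and the DLQ would be another topic rather than an array):

```typescript
// After maxAttempts failures, route the event to a dead letter queue
// instead of blocking the consumer forever.
interface BrokerEvent { eventId: string; payload: unknown }

const deadLetterQueue: BrokerEvent[] = [];

function processWithDlq(
  event: BrokerEvent,
  handler: (e: BrokerEvent) => void,
  maxAttempts = 3,
): void {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(event);
      return; // Success: move on to the next event
    } catch {
      // Swallow and retry until attempts are exhausted
    }
  }
  deadLetterQueue.push(event); // Park for manual investigation
}

// A "poison" handler that always throws
const poison = () => { throw new Error('cannot parse payload'); };
processWithDlq({ eventId: 'evt-9', payload: null }, poison);
console.log(deadLetterQueue.length); // 1
```

The key property is that the function always returns, so the consumer keeps making progress past the poison message.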
Consumer lag: If a consumer falls behind (can't keep up with the event rate), the lag grows until events expire or the broker runs out of storage. Monitor consumer lag and alert when it exceeds a threshold. Scale consumers horizontally (add more instances processing from the same consumer group) to catch up.
Event Naming and Topic Design
Good event names describe what happened, not what should happen. OrderCreated (past tense, what happened) is better than CreateOrder (imperative, what should happen). Events are facts about the past — they describe state changes that already occurred.
Topic naming convention: entity.action, or domain.entity.action in larger systems — e.g., orders.created, payments.completed, users.email-changed. Use dots as separators for hierarchical filtering. Keep topics granular — one topic per event type makes it easy for consumers to subscribe to exactly what they need.
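A small guard can enforce the convention at publish time. The regex below is one possible encoding (lowercase, optionally hyphenated segments, two or three dot-separated parts), not a broker requirement:

```typescript
// Validate topic names against the entity.action / domain.entity.action convention.
const SEGMENT = '[a-z]+(?:-[a-z]+)*';
const TOPIC_PATTERN = new RegExp(`^${SEGMENT}\\.${SEGMENT}(?:\\.${SEGMENT})?$`);

function isValidTopic(topic: string): boolean {
  return TOPIC_PATTERN.test(topic);
}

console.log(isValidTopic('orders.created'));      // true
console.log(isValidTopic('users.email-changed')); // true
console.log(isValidTopic('OrdersCreated'));       // false: no separator, uppercase
```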
ZeonEdge helps companies design and implement event-driven architectures with Kafka, NATS, and cloud-native messaging services. Contact our architecture team.
Marcus Rodriguez
Lead DevOps Engineer specializing in CI/CD pipelines, container orchestration, and infrastructure automation.