Tags: Event-Driven, Kafka, Schema Registry, Microservices, Data Streaming, Avro

Event-Driven Architecture with Kafka & Schema Registry

A practical guide to building event-driven systems with Apache Kafka and Confluent Schema Registry — covering topic design, partition strategies, Avro schema evolution, consumer group patterns, dead letter queues, exactly-once semantics, and production hardening.

2026-04-13

Why Events Change Everything

Request-response is the default mental model for most backend engineers. Service A calls Service B, waits for the response, and continues. It works — until it doesn't. As systems grow, synchronous coupling creates cascading failures, deployment bottlenecks, and invisible dependencies that make every change risky.

Event-driven architecture inverts the model. Instead of services calling each other, they publish facts about what happened — order placed, payment confirmed, inventory reserved — and other services react to those facts independently. The publisher doesn't know or care who's listening. The result is a system where services can be deployed, scaled, and evolved independently.

Apache Kafka is the de facto backbone for event-driven systems at scale. Paired with Confluent Schema Registry, it gives you durable event streaming with enforced contracts between producers and consumers. This article covers the practical decisions you'll face when building with both.

Kafka Fundamentals That Matter

Kafka is a distributed commit log. Producers append records to topics, consumers read them in order, and records are retained for a configurable duration (or indefinitely). Three concepts determine how your system behaves:

Topics & Partitions

A topic is a named stream of events. Each topic is split into partitions — the unit of parallelism and ordering. Records within a partition are strictly ordered. Records across partitions have no ordering guarantee. Choose your partition key carefully: it determines which records land together and which consumers process them.

Consumer Groups

A consumer group is a set of consumer instances that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group. This gives you parallel processing while guaranteeing that, within a group, each record is handled by a single consumer. Multiple groups can independently consume the same topic — the order service and the analytics pipeline each get their own view.
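The assignment logic can be sketched in a few lines — a simplified, round-robin-style stand-in for Kafka's real assignors (the actual protocol involves a group coordinator and rebalance rounds; the consumer names are illustrative):

```typescript
// Simplified partition assignment: each partition goes to exactly one
// consumer in the group, spread round-robin across the members.
function assignPartitions(partitions: number, consumers: string[]): Map<string, number[]> {
  const assignment = new Map<string, number[]>(
    consumers.map((c): [string, number[]] => [c, []])
  );
  for (let p = 0; p < partitions; p++) {
    const owner = consumers[p % consumers.length];
    assignment.get(owner)!.push(p);
  }
  return assignment;
}

// 6 partitions across 2 consumers: each consumer owns 3 partitions,
// and no partition is owned by more than one consumer.
const assignment = assignPartitions(6, ["consumer-a", "consumer-b"]);
```

A third consumer added to this group would trigger a rebalance and receive partitions taken from the first two; a seventh consumer would sit idle, which is why partition count caps parallelism.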

Offsets & Retention

Every record in a partition has an offset — a monotonically increasing integer. Consumers track their position by committing offsets. If a consumer crashes, it resumes from the last committed offset. Kafka retains records based on time or size, not consumption. A new consumer group can replay the entire topic history from offset zero.
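The commit-and-resume mechanics can be modeled with a tiny in-memory stand-in for a single partition — this is an illustration of the semantics, not the real client API:

```typescript
// In-memory model of one partition: an append-only log plus a committed offset.
class PartitionLog {
  private records: string[] = [];
  private committed = 0; // next offset to read after a restart

  append(record: string): number {
    this.records.push(record);
    return this.records.length - 1; // offset of the appended record
  }

  // Read everything from a given offset — a new group passes 0 to replay history.
  readFrom(offset: number): string[] {
    return this.records.slice(offset);
  }

  commit(offset: number): void {
    this.committed = offset;
  }

  // After a crash, the consumer resumes from the last committed offset.
  resume(): string[] {
    return this.readFrom(this.committed);
  }
}

const log = new PartitionLog();
["order-1", "order-2", "order-3"].forEach((r) => log.append(r));
log.commit(2);                  // offsets 0 and 1 are processed
const remaining = log.resume(); // only "order-3" is left to process
```

Note that retention is independent of this: the records stay in the log after `commit`, which is what makes full replays from offset zero possible.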

Topic Design Patterns

Topic design is the schema design of event-driven systems. Get it wrong, and you'll fight ordering issues, hot partitions, and consumer bottlenecks for the life of the system.

One Topic Per Event Type

The cleanest pattern: each business event gets its own topic. orders.placed, orders.shipped, payments.confirmed. Consumers subscribe only to what they need. Schema evolution is per-topic, so upgrading one event type doesn't affect others.

Choosing Partition Keys

The partition key determines record placement. Use the entity ID that requires ordering — order_id for order events, customer_id for customer events. All events for the same entity land in the same partition, guaranteeing processing order.

// Producer configuration — Java
Properties props = new Properties();
props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092");
props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://schema-registry:8081");
props.put("acks", "all");
props.put("enable.idempotence", "true");

KafkaProducer<String, OrderPlaced> producer = new KafkaProducer<>(props);

// Partition key = orderId → all events for this order go to the same partition
ProducerRecord<String, OrderPlaced> record = new ProducerRecord<>(
    "orders.placed",
    orderPlaced.getOrderId(),  // partition key
    orderPlaced                // value
);

producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        log.error("Failed to produce event", exception);
    } else {
        log.info("Event sent to partition {} at offset {}",
            metadata.partition(), metadata.offset());
    }
});

Note

Avoid using high-cardinality random keys (like UUIDs for every event) if ordering matters. Conversely, avoid low-cardinality keys (like country codes) that create hot partitions. The ideal key has even distribution and meaningful ordering semantics.

Partition Count Decisions

Partition count sets your maximum consumer parallelism — you can't have more active consumers than partitions in a group. Start with the number of consumers you expect at peak load, then round up. You can increase partitions later, but you cannot decrease them, and increasing changes the key-to-partition mapping for new records.
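The remapping effect falls directly out of hash-then-modulo placement. Kafka's default partitioner hashes the key bytes with murmur2; the hash below is a simple stand-in used only to show that `hash(key) % partitions` changes when the partition count does:

```typescript
// Simplified stand-in for Kafka's key hashing (the real client uses murmur2).
function keyHash(key: string): number {
  let h = 5381;
  for (const ch of key) h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  return h;
}

function partitionFor(key: string, partitions: number): number {
  return keyHash(key) % partitions;
}

// The same key always maps to the same partition — until the count changes.
const before = partitionFor("ord-123", 6);
const after = partitionFor("ord-123", 12);
// `before` and `after` may differ: records written earlier stay where they
// are, while new records for the same key can land elsewhere, so per-key
// ordering across the resize point is no longer guaranteed.
```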

Schema Registry — Contracts for Events

Without schema enforcement, Kafka is a pipe that moves bytes. Any producer can write any shape of data to any topic, and consumers discover incompatibilities at runtime — usually in production, usually at 2 AM. Schema Registry solves this by making schemas first-class citizens in your event pipeline.

How It Works

Schema Registry is a standalone service that stores and serves schemas. When a producer serializes a record, the serializer registers the schema (or validates it against the existing one) and embeds the schema ID in the record payload. When a consumer deserializes, it fetches the schema by ID and uses it to decode the bytes. Schemas are cached aggressively — the registry is not in the hot path after the first call.
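The framing the serializer embeds is small enough to show directly. Confluent's wire format is a single magic byte (0), a 4-byte big-endian schema ID, then the encoded payload — a minimal encoder/decoder sketch:

```typescript
// Confluent wire format: [magic byte 0x00][4-byte big-endian schema ID][payload].
const MAGIC_BYTE = 0;

function frame(schemaId: number, payload: Buffer): Buffer {
  const header = Buffer.alloc(5);
  header.writeUInt8(MAGIC_BYTE, 0);
  header.writeInt32BE(schemaId, 1);
  return Buffer.concat([header, payload]);
}

function unframe(record: Buffer): { schemaId: number; payload: Buffer } {
  if (record.readUInt8(0) !== MAGIC_BYTE) {
    throw new Error("Unknown magic byte — not Confluent-framed data");
  }
  return { schemaId: record.readInt32BE(1), payload: record.subarray(5) };
}

// Round trip: the consumer reads the ID, fetches the schema once, caches it.
const framed = frame(42, Buffer.from("avro-bytes-here"));
const { schemaId, payload } = unframe(framed);
```

Those five header bytes are why the registry stays out of the hot path: the consumer only calls it the first time it sees a given schema ID.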

Format       | Binary Size | Schema Evolution | Ecosystem        | Best For
Avro         | Compact     | Excellent        | JVM, Python, Go  | Kafka-native pipelines
Protobuf     | Compact     | Good             | All languages    | Polyglot / gRPC
JSON Schema  | Large       | Basic            | All languages    | Debug-friendly topics

Apache Avro is the default choice for Kafka-native systems. It has the tightest integration with Schema Registry, the best schema evolution rules, and compact binary encoding. Protocol Buffers are the better choice if you're already using gRPC or need broad language support beyond the JVM.

Schema Evolution Without Breaking Consumers

Events are immutable, and consumers are deployed independently. A producer might deploy a new schema version on Monday while a consumer team deploys their update on Thursday. Schema Registry enforces compatibility rules to prevent breaking changes from reaching the topic.

Compatibility Modes

BACKWARD (default)

New schemas can read data written by the previous schema. You can add fields with defaults and remove optional fields. This is the right mode when consumers upgrade before producers.

FORWARD

Old schemas can read data written by the new schema. You can remove fields with defaults and add optional fields. This is the right mode when producers upgrade before consumers.

FULL

Both backward and forward compatible. The safest option for independent deployments. The trade-off is that only additive changes with defaults are allowed.

// Avro schema — v1
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.datasops.events",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_cents", "type": "long"},
    {"name": "currency", "type": "string"},
    {"name": "placed_at", "type": "long", "logicalType": "timestamp-millis"}
  ]
}

// Avro schema — v2 (FULL compatible)
// Added: shipping_method with default
// Added: items array with default empty
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.datasops.events",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_cents", "type": "long"},
    {"name": "currency", "type": "string"},
    {"name": "placed_at", "type": "long", "logicalType": "timestamp-millis"},
    {"name": "shipping_method", "type": "string", "default": "standard"},
    {"name": "items", "type": {"type": "array", "items": "string"}, "default": []}
  ]
}
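The rule the registry applies for backward compatibility can be approximated in a few lines. Real Avro schema resolution also handles type promotion, aliases, and unions; this sketch only covers added and removed fields with defaults, which is enough to verify the v1 → v2 change above:

```typescript
interface Field { name: string; hasDefault: boolean }

// BACKWARD: the new (reader) schema can decode data written with the old schema.
// Every reader field must either exist in the old schema or carry a default.
function isBackwardCompatible(oldFields: Field[], newFields: Field[]): boolean {
  const oldNames = new Set(oldFields.map((f) => f.name));
  return newFields.every((f) => oldNames.has(f.name) || f.hasDefault);
}

const v1: Field[] = [
  { name: "order_id", hasDefault: false },
  { name: "customer_id", hasDefault: false },
  { name: "total_cents", hasDefault: false },
  { name: "currency", hasDefault: false },
  { name: "placed_at", hasDefault: false },
];

const v2: Field[] = [
  ...v1,
  { name: "shipping_method", hasDefault: true }, // default: "standard"
  { name: "items", hasDefault: true },           // default: []
];

const ok = isBackwardCompatible(v1, v2);   // additions carry defaults
const bad = isBackwardCompatible(v1, [    // a required field with no default
  ...v1,                                  // would break old-data reads
  { name: "tax_cents", hasDefault: false },
]);
```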

Note

Always set the compatibility mode at the subject level, not globally. Different topics evolve at different rates. Use FULL for core domain events and BACKWARD for internal analytics events where you control all consumers.

Consumer Patterns for Production

Idempotent Processing

Kafka guarantees at-least-once delivery. Consumer crashes, rebalances, and retry logic all mean your handler may see the same record more than once. Every consumer must be idempotent — processing the same event twice should produce the same result as processing it once.

// Idempotent consumer pattern — TypeScript
import { Kafka, EachMessagePayload } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'order-service',
  brokers: ['kafka-1:9092', 'kafka-2:9092', 'kafka-3:9092'],
});

const consumer = kafka.consumer({ groupId: 'order-fulfillment' });

await consumer.subscribe({ topic: 'orders.placed', fromBeginning: false });

await consumer.run({
  eachMessage: async ({ topic, partition, message }: EachMessagePayload) => {
    const event = JSON.parse(message.value!.toString());
    const eventId = message.headers?.['event-id']?.toString();

    // Idempotency check — skip if already processed
    const alreadyProcessed = await db.query(
      'SELECT 1 FROM processed_events WHERE event_id = $1',
      [eventId]
    );
    if (alreadyProcessed.rowCount > 0) return;

    // Process within a transaction
    await db.transaction(async (tx) => {
      await tx.query(
        'INSERT INTO processed_events (event_id, processed_at) VALUES ($1, NOW())',
        [eventId]
      );
      await fulfillOrder(tx, event);
    });
  },
});

Dead Letter Queues

Some messages will fail processing — malformed data, downstream outages, business rule violations. A dead letter queue (DLQ) captures these failures so the consumer can continue processing healthy records instead of blocking or looping forever.

// DLQ pattern — route failed messages to a separate topic
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
type Handler = (message: KafkaMessage) => Promise<void>;

async function processWithDLQ(message: KafkaMessage, handler: Handler) {
  const MAX_RETRIES = 3;
  let attempt = 0;
  let lastError: unknown;

  while (attempt < MAX_RETRIES) {
    try {
      await handler(message);
      return; // success
    } catch (error) {
      lastError = error;
      attempt++;
      if (attempt < MAX_RETRIES) {
        await sleep(Math.pow(2, attempt) * 1000); // exponential backoff
      }
    }
  }

  // All retries exhausted — send to DLQ with enough context to debug later
  await producer.send({
    topic: 'orders.placed.dlq',
    messages: [{
      key: message.key,
      value: message.value,
      headers: {
        ...message.headers,
        'dlq-reason': 'max-retries-exceeded',
        'dlq-error': String(lastError),
        'dlq-original-topic': 'orders.placed',
        'dlq-timestamp': Date.now().toString(),
      },
    }],
  });
}

Note

Monitor your DLQ topics. A growing DLQ backlog is an early warning signal. Set up alerts on DLQ message rate and review failed messages daily. Most DLQ entries indicate bugs, not transient failures.
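A minimal version of that alert is just a message rate over a sliding window — the threshold and window below are illustrative placeholders, not recommendations:

```typescript
// Flag when DLQ throughput over a sliding window exceeds a threshold.
function dlqAlert(
  timestampsMs: number[], // produce timestamps of recent DLQ messages
  nowMs: number,
  windowMs: number,
  maxPerWindow: number,
): boolean {
  const inWindow = timestampsMs.filter((t) => nowMs - t <= windowMs);
  return inWindow.length > maxPerWindow;
}

const now = Date.now();
// 5 DLQ messages in the last minute against a threshold of 3 → alert fires.
const firing = dlqAlert(
  [now - 50_000, now - 40_000, now - 30_000, now - 20_000, now - 10_000],
  now, 60_000, 3,
);
```

In practice you would drive this from the DLQ topic's produce-rate metric rather than raw timestamps, but the shape of the check is the same.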

Exactly-Once Semantics

Kafka supports exactly-once semantics (EOS) for produce-consume-produce workflows through transactions. When a consumer reads from one topic, processes the record, and writes to another topic, the entire operation can be wrapped in a Kafka transaction — either all writes commit (including the consumer offset) or none do.

// Transactional producer — Java
// Consumers of the output topic should set isolation.level=read_committed
// so they never see records from aborted transactions.
props.put("transactional.id", "order-processor-1");
KafkaProducer<String, OrderEvent> producer = new KafkaProducer<>(props);
producer.initTransactions();

try {
    producer.beginTransaction();

    // Consume + process + produce in one atomic operation
    ConsumerRecords<String, OrderPlaced> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, OrderPlaced> record : records) {
        OrderShipped shipped = processOrder(record.value());
        producer.send(new ProducerRecord<>("orders.shipped", shipped.getOrderId(), shipped));
    }

    // Commit consumer offsets as part of the transaction
    producer.sendOffsetsToTransaction(
        currentOffsets(records),
        consumer.groupMetadata()
    );
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
    throw e;
}

EOS has a throughput cost — roughly 10-20% lower than at-least-once. Use it for workflows where duplicates cause real business problems (payments, inventory decrements). For analytics and logging, at-least-once with idempotent consumers is sufficient and faster.

Testing Event-Driven Systems

Event-driven systems are harder to test than request-response because effects are asynchronous and distributed. A comprehensive testing strategy operates at three levels:

Schema Compatibility Tests

Run in CI on every PR that touches a schema file. The Schema Registry Maven plugin or the confluent CLI can validate compatibility against the live registry without deploying.

Consumer Contract Tests

Producers generate sample events, consumers verify they can deserialize and process them. Pact supports async message contracts for this exact use case. These tests catch logic errors that schema checks miss.

Integration Tests with Testcontainers

Spin up real Kafka and Schema Registry instances in Docker for end-to-end tests. Testcontainers makes this reproducible across CI environments. Test the full produce-consume cycle, including serialization, schema validation, and consumer offset commits.

// Integration test with Testcontainers — Java
@Testcontainers
class OrderEventIntegrationTest {

    @Container
    static KafkaContainer kafka = new KafkaContainer(
        DockerImageName.parse("confluentinc/cp-kafka:7.6.0")
    );

    @Container
    static GenericContainer<?> schemaRegistry = new GenericContainer<>(
        DockerImageName.parse("confluentinc/cp-schema-registry:7.6.0")
    )
    .withExposedPorts(8081)
    .withEnv("SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS", kafka.getBootstrapServers())
    .dependsOn(kafka);

    @Test
    void orderPlacedEvent_roundTrips_throughKafka() {
        // Produce an OrderPlaced event
        OrderPlaced event = OrderPlaced.newBuilder()
            .setOrderId("ord-123")
            .setCustomerId("cust-456")
            .setTotalCents(4999L)
            .setCurrency("USD")
            .setPlacedAt(Instant.now())
            .build();

        producer.send(new ProducerRecord<>("orders.placed", event.getOrderId(), event));

        // Consume and verify
        ConsumerRecord<String, OrderPlaced> consumed = pollOne("orders.placed");
        assertThat(consumed.value().getOrderId()).isEqualTo("ord-123");
        assertThat(consumed.value().getTotalCents()).isEqualTo(4999L);
    }
}

Production Hardening

Running Kafka in production is straightforward. Running it well in production requires attention to configuration, monitoring, and operational procedures.

Producer Configuration

Set acks=all to ensure records are replicated before acknowledgment. Enable enable.idempotence=true to prevent duplicate writes from retries. Set min.insync.replicas=2 on the broker to guarantee durability even if one replica fails.

Consumer Lag Monitoring

Consumer lag — the difference between the latest offset and the consumer's committed offset — is the single most important metric. Rising lag means consumers are falling behind. Use Burrow or your cloud provider's native monitoring to track lag per partition and alert when it exceeds thresholds.
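The computation itself is simple subtraction per partition, usually alerted on as a sum or a max — a sketch over hypothetical offset maps:

```typescript
// Per-partition lag: log-end offset minus the group's committed offset.
function consumerLag(
  endOffsets: Record<number, number>,  // latest offset per partition
  committed: Record<number, number>,   // committed offset per partition
): Record<number, number> {
  const lag: Record<number, number> = {};
  for (const [partition, end] of Object.entries(endOffsets)) {
    lag[Number(partition)] = end - (committed[Number(partition)] ?? 0);
  }
  return lag;
}

const lag = consumerLag(
  { 0: 1_500, 1: 1_480, 2: 2_100 },
  { 0: 1_500, 1: 1_200, 2: 1_000 },
);
const totalLag = Object.values(lag).reduce((a, b) => a + b, 0);
```

Watching per-partition lag rather than just the total also surfaces hot partitions: one partition racing ahead of the rest usually points back to a skewed partition key.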

Observability

Instrument producers and consumers with metrics: records produced/consumed per second, serialization errors, consumer rebalance frequency, and end-to-end latency (produce timestamp to consume timestamp). Export to Prometheus and visualize in Grafana. Kafka's JMX metrics cover broker health; your application metrics cover business health.

Note

Enable compression on producers (compression.type=zstd). Zstandard typically delivers strong compression ratios on JSON payloads and meaningful savings even on already-compact Avro, reducing network bandwidth and storage costs with minimal CPU overhead. This is close to free performance.

Anti-Patterns to Avoid

  • Using Kafka as a database. Kafka is a log, not a query engine. If you need to look up events by arbitrary fields, index them into a database or search engine downstream.
  • Mega-topics. Putting all event types in one topic sacrifices independent schema evolution, consumer filtering, and partition key semantics. One topic per event type.
  • Skipping Schema Registry. "We'll just use JSON" works for the first month. Then someone adds a field, misspells it, or changes a type, and three downstream services break silently.
  • Synchronous request-response over Kafka. If you need a response, use HTTP or gRPC. Kafka is for fire-and-forget events and async workflows, not for replacing synchronous APIs.
  • Ignoring consumer group rebalancing. Rebalances cause processing pauses. Use the cooperative sticky assignor (CooperativeStickyAssignor) to minimize disruption during deployments and scaling.

Building Confidence in Event-Driven Systems

Event-driven architecture with Kafka and Schema Registry gives you loose coupling, independent deployability, and a durable record of everything that happens in your system. But the benefits only materialize if you treat events as first-class contracts:

  • Design topics around business events, not implementation details
  • Enforce schemas with Schema Registry and run compatibility checks in CI
  • Build idempotent consumers and implement dead letter queues from day one
  • Monitor consumer lag as your primary health signal
  • Test the full produce-consume cycle with real Kafka instances, not mocks

The operational investment is real — Kafka is not a fire-and-forget dependency. But for systems that need to scale independently, evolve safely, and maintain a reliable audit trail of business events, it remains the strongest foundation available.

Building an event-driven platform or migrating from synchronous services?

We help teams design and implement event-driven architectures with Kafka, Schema Registry, and stream processing — from topic modeling to production operations. Let’s talk.

