Kafka Streaming at Scale: Lessons from 100K Messages/Hour

Feb 2026 · 6 min read

The Challenge

Our team needed to build a microservice that could consume over 100,000 Kafka messages per hour, transform them, and load the results into Snowflake — all with 99.9% uptime. The messages represented healthcare claims data from multiple upstream systems, and any data loss or significant delay would directly impact downstream reporting and compliance dashboards.

Choosing the Right Consumer Pattern

We evaluated three consumer patterns: single-threaded sequential processing, multi-threaded consumer groups, and micro-batching with flush intervals. We settled on micro-batching because it gave us the best throughput-to-cost ratio. The consumer accumulates messages into an in-memory buffer and flushes to Snowflake every 5 seconds or when the buffer hits 1,000 records, whichever comes first. This approach reduced the number of Snowflake INSERT operations by 95% compared to row-by-row insertion.
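The flush-on-size-or-interval logic can be sketched in plain Java. This is a minimal illustration of the pattern, not our production code; the class name, thresholds, and flush target are placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal micro-batch buffer: flushes when the batch reaches maxSize
// records, or when maxAgeMillis has elapsed since the first buffered record.
class MicroBatchBuffer<T> {
    private final int maxSize;
    private final long maxAgeMillis;
    private final Consumer<List<T>> flushTarget; // e.g. a bulk Snowflake insert
    private final List<T> buffer = new ArrayList<>();
    private long firstRecordAt = -1;

    MicroBatchBuffer(int maxSize, long maxAgeMillis, Consumer<List<T>> flushTarget) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
        this.flushTarget = flushTarget;
    }

    synchronized void add(T record, long nowMillis) {
        if (buffer.isEmpty()) firstRecordAt = nowMillis;
        buffer.add(record);
        if (buffer.size() >= maxSize || nowMillis - firstRecordAt >= maxAgeMillis) {
            flush();
        }
    }

    synchronized void flush() {
        if (buffer.isEmpty()) return;
        flushTarget.accept(new ArrayList<>(buffer)); // hand off a copy, then reset
        buffer.clear();
        firstRecordAt = -1;
    }
}
```

In production the time-based flush is driven by a scheduled task rather than piggybacking on `add`, but the two triggers (size, age) are the same.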

Building with Java Spring Boot

The service is built on Java Spring Boot with the Spring Kafka library. We use a ConcurrentKafkaListenerContainerFactory with a concurrency of 3 (matching our topic's partition count) to parallelize consumption within a single instance. Each listener thread writes to a thread-safe ConcurrentLinkedQueue, and a scheduled task drains the queue and bulk-loads into Snowflake using the Snowflake JDBC driver's batch insert API.
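A sketch of the listener container configuration described above. It assumes Spring Kafka on the classpath; the bootstrap server, group id, and bean names are illustrative placeholders, not our actual values:

```java
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
class ClaimsConsumerConfig {

    @Bean
    ConsumerFactory<String, String> consumerFactory() {
        return new DefaultKafkaConsumerFactory<>(Map.of(
            ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092",   // placeholder
            ConsumerConfig.GROUP_ID_CONFIG, "claims-loader",         // placeholder
            ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class,
            ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class));
    }

    @Bean
    ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
        factory.setConsumerFactory(consumerFactory());
        factory.setConcurrency(3); // one listener thread per topic partition
        return factory;
    }
}
```

Setting concurrency above the partition count buys nothing: the extra listener threads would sit idle, since Kafka assigns each partition to at most one consumer in a group.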

Error Handling and Dead Letter Queues

Robust error handling was non-negotiable. We implemented a three-tier strategy: (1) transient errors trigger automatic retries with exponential backoff (up to 3 attempts), (2) deserialization errors route the raw message to a Dead Letter Queue (DLQ) topic for later inspection, and (3) Snowflake connection failures trigger a circuit breaker that pauses consumption and alerts the on-call engineer via PagerDuty. This layered approach means we never silently lose a message.
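The three tiers can be expressed as a small triage function. This is a simplified, plain-Java illustration: the exception-to-tier mapping shown here (SQL errors → circuit breaker, argument errors → DLQ, everything else → retry) is an assumption standing in for the real classification rules:

```java
import java.io.IOException;
import java.sql.SQLException;

// Illustrative triage of the three error tiers; the mapping from
// exception type to tier is a simplification of the real rules.
class ErrorTriage {
    enum Action { RETRY, DEAD_LETTER, OPEN_CIRCUIT }

    static Action triage(Exception e, int attempt, int maxRetries) {
        if (e instanceof SQLException) {
            return Action.OPEN_CIRCUIT;  // tier 3: Snowflake failure -> pause + page
        }
        if (e instanceof IllegalArgumentException) {
            return Action.DEAD_LETTER;   // tier 2: malformed message -> DLQ topic
        }
        // tier 1: transient error -> retry with backoff until attempts exhausted
        return attempt < maxRetries ? Action.RETRY : Action.DEAD_LETTER;
    }

    static long backoffMillis(int attempt, long baseMillis) {
        return baseMillis << attempt;    // exponential: base * 2^attempt
    }
}
```

Note the fall-through: a transient error that exhausts its retries still lands in the DLQ rather than being dropped, which is what makes the loss-free guarantee hold.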

Monitoring and Observability

We instrument the service with Micrometer + Prometheus metrics exposed on a /actuator/prometheus endpoint. Key metrics include messages consumed per second, batch flush duration, Snowflake insert latency (p50/p99), consumer lag per partition, and DLQ message count. All metrics feed into Grafana dashboards with alerting rules. Consumer lag above 10,000 triggers a warning; above 50,000 triggers a page.
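The lag thresholds map to a simple two-level alert policy, sketched below in plain Java (the class and enum names are illustrative; in practice this lives in Grafana alerting rules, not application code):

```java
// Consumer-lag alert thresholds from the dashboard rules:
// lag > 10,000 raises a warning, lag > 50,000 pages the on-call.
class LagAlerts {
    enum Level { OK, WARNING, PAGE }

    static Level forLag(long lag) {
        if (lag > 50_000) return Level.PAGE;
        if (lag > 10_000) return Level.WARNING;
        return Level.OK;
    }
}
```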

Results

After three months in production, the service achieved 99.96% uptime, processed an average of 2.4 million messages per day, and reduced our previous Snowflake ingestion costs by 30% thanks to the batching strategy. The DLQ has captured only 12 malformed messages out of over 200 million processed — a testament to the quality controls we added upstream.

💡 Key Takeaways

1. Micro-batching with flush intervals reduces write operations by 95% and dramatically lowers warehouse costs.
2. Match consumer concurrency to your Kafka partition count for optimal parallelism.
3. A three-tier error strategy (retry → DLQ → circuit breaker) ensures zero silent data loss.
4. Monitor consumer lag, flush latency, and DLQ counts as your primary health signals.
5. Invest in observability from day one — it pays for itself during the first production incident.