I'm working on a real-time analytics pipeline processing ~2M events/hour from Kafka and I keep running into issues with late-arriving data.
Current Setup
- Source: Kafka topic with customer events
- Processing: Spark Structured Streaming
- Sink: Delta Lake on S3
The Problem
About 3-5% of events arrive more than 10 minutes late due to mobile client buffering. Our current watermark of 5 minutes drops these events entirely.
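To make the dropping rule concrete, here's a tiny pure-Python model of the watermark cutoff (simplified: in reality Spark tracks the max event time seen per trigger and evicts whole windows, but the effect on a single event is the same):

```python
# Simplified model of Spark's watermark rule: an event is dropped once
# it falls behind (max event time seen - watermark delay).
from datetime import datetime, timedelta

WATERMARK = timedelta(minutes=5)

def is_dropped(event_time: datetime, max_event_time_seen: datetime) -> bool:
    """True once the event is older than the watermark cutoff."""
    return event_time < max_event_time_seen - WATERMARK

now = datetime(2024, 1, 1, 12, 0, 0)
on_time = now - timedelta(minutes=3)
buffered = now - timedelta(minutes=12)   # held back by a mobile client

assert not is_dropped(on_time, now)
assert is_dropped(buffered, now)         # the 3-5% we lose today
```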
```python
from pyspark.sql.functions import window, count

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "customer_events")
    .load())

# (JSON parsing of `value` into an `event_time` column elided)

# Current watermark - too aggressive
windowed = (df
    .withWatermark("event_time", "5 minutes")
    .groupBy(window("event_time", "1 hour"))
    .agg(count("*").alias("event_count")))
```

What I've Tried
- Increasing the watermark to 30 minutes -- this works but adds too much latency
- A separate batch job to reconcile late data -- works but feels hacky
- Writing late events to a dead letter queue for reprocessing
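For the dead-letter route, the version I sketched looks roughly like this (untested sketch; the S3 paths, the 10-minute threshold, and the `parsed` DataFrame -- the decoded event stream from the snippet above -- are all placeholders/assumptions):

```python
# Sketch: instead of silently losing late events, tee them into a
# separate Delta path from the same stream via foreachBatch.
from pyspark.sql import functions as F

def route_batch(batch_df, batch_id):
    # Anything older than 10 minutes goes to the late-events path.
    threshold = F.current_timestamp() - F.expr("INTERVAL 10 MINUTES")
    on_time = batch_df.filter(F.col("event_time") >= threshold)
    late = batch_df.filter(F.col("event_time") < threshold)
    on_time.write.format("delta").mode("append").save("s3://bucket/events/")
    late.write.format("delta").mode("append").save("s3://bucket/late_events/")

parsed.writeStream.foreachBatch(route_batch).start()
```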
Has anyone implemented a hybrid approach that balances latency with completeness? I've seen some posts about using Delta Lake's MERGE for late-arriving data but haven't found a clean pattern.
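In case it helps frame answers, here is the MERGE pattern as far as I understand it (a sketch, not working code: assumes delta-spark is installed, the aggregate table already exists at a placeholder path, and an active `spark` session; the late stream would run with a generous watermark and feed this via `foreachBatch`):

```python
# Sketch: upsert late-arriving counts into the existing hourly aggregate
# using Delta Lake's MERGE, keyed on the same 1-hour window.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def merge_late_counts(batch_df, batch_id):
    # Re-aggregate just the late events by the same 1-hour window.
    late_counts = (batch_df
        .groupBy(F.window("event_time", "1 hour").alias("window"))
        .agg(F.count("*").alias("event_count")))

    target = DeltaTable.forPath(spark, "s3://bucket/hourly_counts/")  # placeholder
    (target.alias("t")
        .merge(late_counts.alias("s"), "t.window = s.window")
        .whenMatchedUpdate(set={
            "event_count": F.col("t.event_count") + F.col("s.event_count")})
        .whenNotMatchedInsertAll()
        .execute())

# e.g. late_stream.writeStream.foreachBatch(merge_late_counts).start()
```

The appeal is that the main aggregate keeps its short watermark for latency, while the MERGE job folds stragglers in behind it -- but I'd like to hear whether this holds up in practice.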
Any help is appreciated! Here are the relevant Spark docs: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Also found this useful article: https://databricks.com/blog/watermarking-in-structured-streaming