Databolt
PINNED DISCUSSION · ETL Pipeline Optimization

Optimizing dbt models for incremental loads - share your patterns!

dbt_master · Analytics Engineer Lead
Jan 28, 2026 · 892 views

Let's collect best practices for dbt incremental models. I've been through a few iterations and want to share what works.

My Key Learnings

1. Always use a reliable updated_at column

Don't rely on `id` ordering to detect changed rows. Use a proper timestamp:

```sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
  )
}}

SELECT *
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```

2. Use merge strategy over delete+insert for large tables

The merge strategy is significantly faster for tables > 100M rows:

  • delete+insert: ~45 min for 500M row table
  • merge: ~8 min for same table

3. Partition your target tables wisely

Partitioning by date works well for time-series data, but consider your query patterns first.
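On BigQuery, for instance, the partition can be declared right in the model config alongside the incremental settings. A sketch only: it assumes BigQuery as the target warehouse and a hypothetical `order_date` column:

```sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={'field': 'order_date', 'data_type': 'date'}
  )
}}
```

With the partition declared, the merge can prune to only the affected date partitions instead of scanning the whole target table.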

What patterns have worked for you? Drop your configs below!

More details on incremental strategies: https://docs.getdbt.com/docs/build/incremental-models

Tags: dbt · incremental · optimization · sql


3 Replies

ACCEPTED ANSWER
stream_king · Staff Engineer @ DataCo
Feb 3, 2026

Great question! We solved this with a two-layer approach that's been running in production for 8 months:

Layer 1: Streaming with generous watermark

We use a 15-minute watermark for the "hot" path. This captures ~97% of events with acceptable latency.

```python
# Hot path - streaming aggregation with a 15-minute watermark
from pyspark.sql.functions import count, window

hot_df = df \
    .withWatermark("event_time", "15 minutes") \
    .groupBy(window("event_time", "1 hour")) \
    .agg(count("*").alias("event_count"))

hot_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .start("s3://data/hot_metrics/")
```

Layer 2: Batch reconciliation with Delta MERGE

A scheduled job runs every 4 hours and merges late-arriving data:

```python
# Cold path - batch reconciliation
from delta.tables import DeltaTable

# Re-read recent history from Kafka; parsing the payload and aggregating
# late events into per-window late_count rows is omitted here
late_events = spark.read.format("kafka") \
    .option("startingOffsets", four_hours_ago) \
    .load()

delta_table = DeltaTable.forPath(spark, "s3://data/metrics/")
delta_table.alias("target").merge(
    late_events.alias("source"),
    "target.window_start = source.window_start"
).whenMatchedUpdate(set={
    "event_count": "target.event_count + source.late_count"
}).whenNotMatchedInsertAll().execute()
```

The key insight is that your hot path doesn't need to be perfect - it just needs to be fast. The batch reconciliation handles correctness.

More on this pattern: https://delta.io/blog/merge-late-arriving-data

dataeng_sarah · Senior Data Engineer
Feb 3, 2026

This is exactly what I was looking for! The two-layer approach makes total sense. Quick question - how do you handle exactly-once semantics in the merge step? Do you use idempotent writes?

stream_king · Staff Engineer @ DataCo
Feb 3, 2026

Yes! We use the merge condition as our idempotency key. Since Delta Lake's MERGE is atomic, you get exactly-once semantics for free as long as your merge condition is deterministic.

The trick is using a composite key: window_start + source_partition
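To make the replay-safety concrete, here is a pure-Python toy (not the Delta API; all names hypothetical) of an upsert keyed on that composite key. Note it models overwrite-style updates, which is what makes re-applying the same batch a no-op:

```python
def merge_batch(target, batch):
    """Upsert rows into target keyed by (window_start, source_partition)."""
    for row in batch:
        key = (row["window_start"], row["source_partition"])
        target[key] = row["event_count"]  # overwrite, not increment
    return target

batch = [
    {"window_start": "2026-02-03T00:00", "source_partition": 0, "event_count": 42},
    {"window_start": "2026-02-03T00:00", "source_partition": 1, "event_count": 7},
]

state = merge_batch({}, batch)
replayed = merge_batch(dict(state), batch)  # same batch applied again
assert replayed == state  # replay is a no-op: effectively exactly-once
```

If the merge instead accumulates counts (as in the `target.event_count + source.late_count` update above), replays would double-count, so deduplicating the source batch before merging matters in that variant.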

pipeline_ninja · Data Platform Lead
Feb 3, 2026

Another approach worth considering: Flink's event-time processing handles this more natively than Spark.

In Flink you can configure allowed lateness separately from the watermark:

```java
DataStream<Event> stream = env
    .addSource(kafkaSource)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
            .withIdleness(Duration.ofSeconds(30))
    )
    .keyBy(Event::getCustomerId)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .allowedLateness(Time.minutes(30))  // Accept late data up to 30 min
    .sideOutputLateData(lateOutputTag)  // Capture anything beyond that
    .aggregate(new EventAggregator());
```

The allowedLateness parameter keeps the window state around so late events can still update results, while the side output catches anything that's really late.

Not saying you should switch to Flink, but it's worth knowing the tradeoffs!

data_wizard
Feb 3, 2026

We use a simpler approach for our use case:

  • Write all events (including late ones) to a raw Delta table with no aggregation
  • Run materialized views on top for the aggregated metrics
  • Delta Lake's time travel lets us recompute any window

This works well if your query patterns are flexible. Not ideal if you need sub-second query latency though.
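A pure-Python stand-in (no Spark or Delta involved; all names hypothetical) for this append-plus-recompute pattern, showing why late arrivals need no special handling:

```python
from collections import Counter

raw_events = []  # append-only stand-in for the raw Delta table

def ingest(event):
    raw_events.append(event)  # late events are appended like any other

def counts_by_window():
    # the "materialized view": recompute the aggregate from the raw rows
    return Counter(e["window_start"] for e in raw_events)

ingest({"window_start": "00:00", "event_time": "00:05"})
ingest({"window_start": "01:00", "event_time": "01:10"})
ingest({"window_start": "00:00", "event_time": "00:59"})  # late arrival
assert counts_by_window()["00:00"] == 2  # late event counted automatically
```

Because the aggregate is always derived from the raw rows, correctness falls out for free; the cost is paying for the recompute at query time.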