
Building a Cost-Efficient ETL Pipeline at Scale

How we reduced our Spark cluster costs by 73% while improving throughput

Sarah Chen

Staff Data Engineer @ Stripe

February 10, 2026 · 8 min read

When our data team was tasked with processing 10TB of e-commerce data daily, we knew the standard approach wouldn't cut it. After weeks of experimentation during the ETL Speed Race challenge on Databolt, we discovered a combination of techniques that dramatically changed our cost profile.

The Problem

Our initial pipeline was running on a fixed cluster of 20 r5.4xlarge instances, costing us roughly $4,200/day. The pipeline took 6 hours to complete and had poor resource utilization — most executors were idle during the ingestion phase.
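As a back-of-envelope check using only the figures above, the fixed cluster's all-in rate works out to roughly $8.75 per instance-hour (well above raw EC2 on-demand pricing for r5.4xlarge, so this presumably bundles platform and storage fees):

```python
instances = 20
daily_cost = 4200    # USD/day, as stated above
hours_per_day = 24   # the cluster ran around the clock

rate = daily_cost / (instances * hours_per_day)
print(f"~${rate:.2f} per instance-hour all-in")  # ~$8.75 per instance-hour all-in
```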

You can see the original cost breakdown in our challenge notebook: https://databolt.dev/work/work-1

The Solution

We applied three key patterns from the Databolt community:

  1. Dynamic Allocation with spot instances (inspired by the Salted Join Pattern)
  2. Incremental Loading instead of full refreshes
  3. Partition pruning using Delta Lake's Z-ordering
```python
# Enable dynamic allocation and adaptive query execution
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "40")
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Z-order the output table on the most queried columns
delta_table.optimize().executeZOrderBy("customer_id", "order_date")
```
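Pattern 2 (incremental loading) isn't shown above; the core idea is watermark tracking: process only records newer than the last successful run. A minimal sketch in plain Python, where the watermark store and record shape are hypothetical stand-ins for a metadata table:

```python
from datetime import datetime, timezone

def filter_new_records(records, watermark):
    """Keep only records that arrived after the last successful run."""
    return [r for r in records if r["updated_at"] > watermark]

# Hypothetical watermark: in production this lives in a metadata table.
watermark = datetime(2026, 2, 9, tzinfo=timezone.utc)

records = [
    {"id": 1, "updated_at": datetime(2026, 2, 8, tzinfo=timezone.utc)},   # already loaded
    {"id": 2, "updated_at": datetime(2026, 2, 10, tzinfo=timezone.utc)},  # new
]

fresh = filter_new_records(records, watermark)
# Advance the watermark only after the batch commits successfully.
watermark = max(r["updated_at"] for r in fresh)
```

The key design choice is advancing the watermark only after a successful commit, so a failed run reprocesses the same window instead of silently dropping it.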

Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Daily Cost | $4,200 | $1,134 | -73% |
| Runtime | 6 hours | 1.5 hours | -75% |
| Resource Utilization | 34% | 87% | +156% |

The results were significant — our daily cost dropped to $1,134 and the pipeline completes in 1.5 hours.
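The percentage column can be sanity-checked with quick arithmetic:

```python
def pct_change(before, after):
    """Signed percent change from before to after, rounded to whole percent."""
    return round((after - before) / before * 100)

print(pct_change(4200, 1134))  # -73  (daily cost)
print(pct_change(6, 1.5))      # -75  (runtime, hours)
print(pct_change(34, 87))      # 156  (utilization)
```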

Key Takeaways

  • Don't over-provision. Dynamic allocation + spot instances is almost always cheaper.
  • Profile your pipeline before optimizing. We were surprised that I/O, not compute, was the bottleneck.
  • Use Delta Lake statistics — Z-ordering alone cut our scan time by 60%.

Read more about the patterns referenced in the sidebar, and check out the full notebook where I walk through each optimization step: https://databolt.dev/my-work/work-1

Tags: spark, cost-optimization, delta-lake, etl

COMMENTS (3)

Alex Rivera · 26d ago

Great write-up, Sarah! We did something similar at Netflix but used Graviton instances instead of spot for more predictable availability. The cost savings were slightly lower (~60%), but we never had a pipeline fail due to spot reclamation.

One thing I'd add — if you're using Delta Lake, the VACUUM command is critical for keeping storage costs down after Z-ordering. We were surprised by how fast our S3 bill grew.

```python
delta_table.vacuum(retentionHours=168)  # retain 7 days of history
```

Referenced: Graviton Migration Pattern
Priya Sharma · 26d ago

This aligns perfectly with what I covered in my AQE post. The combination of AQE + dynamic allocation is really the sweet spot for most Spark workloads.

Have you tried enabling spark.sql.adaptive.skewJoin.enabled? For e-commerce data with heavy-hitter customers, it can make a huge difference on the join stages.
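For context on this suggestion: AQE's skew-join handling is governed by a handful of settings. A `spark-defaults.conf` fragment with illustrative values (defaults may differ across Spark versions):

```properties
# Let AQE split oversized partitions on the fly during sort-merge joins
spark.sql.adaptive.enabled                                  true
spark.sql.adaptive.skewJoin.enabled                         true
# A partition counts as skewed if it exceeds factor x median size AND the byte threshold
spark.sql.adaptive.skewJoin.skewedPartitionFactor           5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes 256MB
```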

Referenced: Understanding Spark's AQE

Jordan Lee · 25d ago

I'm curious about the incremental loading piece. Did you use watermark-based tracking or CDC from the source? We've been experimenting with both and finding CDC more reliable but harder to set up.

The Incremental Loading pattern in the library was a great starting point for us.

Referenced: Incremental Loading


REFERENCED IN THIS POST

PATTERN: Salted Join Pattern
PATTERN: Incremental Loading
CHALLENGE: ETL Speed Race
WORK: Customer ETL Notebook
