
Building a Cost-Efficient ETL Pipeline at Scale

How we reduced our Spark cluster costs by 73% while improving throughput

Sarah Chen

Staff Data Engineer @ Stripe

February 10, 2026 · 8 min read

When our data team was tasked with processing 10TB of e-commerce data daily, we knew the standard approach wouldn't cut it. After weeks of experimentation during the ETL Speed Race challenge on Databolt, we discovered a combination of techniques that dramatically changed our cost profile.

The Problem

Our initial pipeline was running on a fixed cluster of 20 r5.4xlarge instances, costing us roughly $4,200/day. The pipeline took 6 hours to complete and had poor resource utilization — most executors were idle during the ingestion phase.
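As a back-of-envelope check using only the figures above, the fixed cluster's all-in rate works out to roughly $8.75 per instance-hour (well above raw EC2 on-demand pricing for r5.4xlarge, so this presumably bundles platform and storage fees):

```python
instances = 20
daily_cost = 4200    # USD/day, as stated above
hours_per_day = 24   # the cluster ran around the clock

rate = daily_cost / (instances * hours_per_day)
print(f"~${rate:.2f} per instance-hour all-in")  # ~$8.75 per instance-hour all-in
```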

You can see the original cost breakdown in our challenge notebook: https://databolt.dev/work/work-1

The Solution

We applied three key patterns from the Databolt community:

  1. Dynamic Allocation with spot instances (inspired by the Salted Join Pattern)
  2. Incremental Loading instead of full refreshes
  3. Partition pruning using Delta Lake's Z-ordering
```python
# Enable dynamic allocation and adaptive query execution
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "40")
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Z-order the output table on the most queried columns
delta_table.optimize().executeZOrderBy("customer_id", "order_date")
```
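Pattern 2 (incremental loading) isn't shown above; the core idea is watermark tracking: process only records newer than the last successful run. A minimal sketch in plain Python, where the watermark store and record shape are hypothetical stand-ins for a metadata table:

```python
from datetime import datetime, timezone

def filter_new_records(records, watermark):
    """Keep only records that arrived after the last successful run."""
    return [r for r in records if r["updated_at"] > watermark]

# Hypothetical watermark: in production this lives in a metadata table.
watermark = datetime(2026, 2, 9, tzinfo=timezone.utc)

records = [
    {"id": 1, "updated_at": datetime(2026, 2, 8, tzinfo=timezone.utc)},   # already loaded
    {"id": 2, "updated_at": datetime(2026, 2, 10, tzinfo=timezone.utc)},  # new
]

fresh = filter_new_records(records, watermark)
# Advance the watermark only after the batch commits successfully.
watermark = max(r["updated_at"] for r in fresh)
```

The key design choice is advancing the watermark only after a successful commit, so a failed run reprocesses the same window instead of silently dropping it.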

Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Daily Cost | $4,200 | $1,134 | -73% |
| Runtime | 6 hours | 1.5 hours | -75% |
| Resource Utilization | 34% | 87% | +156% |

The results were significant — our daily cost dropped to $1,134 and the pipeline completes in 1.5 hours.
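The percentage column can be sanity-checked with quick arithmetic:

```python
def pct_change(before, after):
    """Signed percent change from before to after, rounded to whole percent."""
    return round((after - before) / before * 100)

print(pct_change(4200, 1134))  # -73  (daily cost)
print(pct_change(6, 1.5))      # -75  (runtime, hours)
print(pct_change(34, 87))      # 156  (utilization)
```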

Key Takeaways

  • Don't over-provision. Dynamic allocation + spot instances is almost always cheaper.
  • Profile your pipeline before optimizing. We were surprised that I/O, not compute, was the bottleneck.
  • Use Delta Lake statistics — Z-ordering alone cut our scan time by 60%.

Read more about the patterns referenced in the sidebar, and check out the full notebook where I walk through each optimization step: https://databolt.dev/my-work/work-1

Tags: spark, cost-optimization, delta-lake, etl

COMMENTS (3)

Alex Rivera · 26d ago

Great write-up, Sarah! We did something similar at Netflix but used Graviton instances instead of spot for more predictable availability. The cost savings were slightly lower (~60%), but we never had a pipeline fail due to spot reclamation.

One thing I'd add — if you're using Delta Lake, the VACUUM command is critical for keeping storage costs down after Z-ordering. We were surprised by how fast our S3 bill grew.

```python
delta_table.vacuum(retentionHours=168)  # retain 7 days of history
```

Referenced: Graviton Migration Pattern
Priya Sharma · 26d ago

This aligns perfectly with what I covered in my AQE post. The combination of AQE + dynamic allocation is really the sweet spot for most Spark workloads.

Have you tried enabling spark.sql.adaptive.skewJoin.enabled? For e-commerce data with heavy-hitter customers, it can make a huge difference on the join stages.
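For context on this suggestion: AQE's skew-join handling is governed by a handful of settings. A `spark-defaults.conf` fragment with illustrative values (defaults may differ across Spark versions):

```properties
# Let AQE split oversized partitions on the fly during sort-merge joins
spark.sql.adaptive.enabled                                  true
spark.sql.adaptive.skewJoin.enabled                         true
# A partition counts as skewed if it exceeds factor x median size AND the byte threshold
spark.sql.adaptive.skewJoin.skewedPartitionFactor           5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes 256MB
```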

Referenced: Understanding Spark's AQE

Jordan Lee · 25d ago

I'm curious about the incremental loading piece. Did you use watermark-based tracking or CDC from the source? We've been experimenting with both and finding CDC more reliable but harder to set up.

The Incremental Loading pattern in the library was a great starting point for us.

Referenced: Incremental Loading


REFERENCED IN THIS POST

PATTERN: Salted Join Pattern
PATTERN: Incremental Loading
CHALLENGE: ETL Speed Race
WORK: Customer ETL Notebook
