When our data team was tasked with processing 10TB of e-commerce data daily, we knew the standard approach wouldn't cut it. After weeks of experimentation during the ETL Speed Race challenge on Databolt, we discovered a combination of techniques that dramatically changed our cost profile.
## The Problem
Our initial pipeline was running on a fixed cluster of 20 r5.4xlarge instances, costing us roughly $4,200/day. The pipeline took 6 hours to complete and had poor resource utilization — most executors were idle during the ingestion phase.
You can see the original cost breakdown in our challenge notebook: https://databolt.dev/work/work-1
## The Solution
We applied three key patterns from the Databolt community:
- Dynamic Allocation with spot instances (inspired by the Salted Join Pattern)
- Incremental Loading instead of full refreshes
- Partition pruning using Delta Lake's Z-ordering
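Of the three, incremental loading needs no Spark tuning at all; it is mostly bookkeeping: persist a watermark and only process rows newer than it. A minimal sketch in plain Python (the row shape and field names are illustrative, not our actual schema):

```python
def load_incrementally(rows, last_watermark):
    """Keep only rows newer than the stored watermark and return
    them along with the new watermark to persist for the next run."""
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows),
                        default=last_watermark)
    return new_rows, new_watermark

# The first run processes everything; later runs touch only the delta.
batch = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 250},
    {"id": 3, "updated_at": 300},
]
rows, wm = load_incrementally(batch, last_watermark=200)
# rows -> records 2 and 3; wm -> 300
```

In the real pipeline the same idea applies at table scale: filter the source on the watermark column, MERGE the result into the target, then store the new watermark.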
```python
# Let Spark scale executors with the workload instead of holding a
# fixed fleet; shuffle tracking is needed when no external shuffle
# service is available.
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "40")
# Adaptive Query Execution re-optimizes join strategies and
# partition counts at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Z-order the output table on the most queried columns
delta_table.optimize().executeZOrderBy("customer_id", "order_date")
```

## Results
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Daily Cost | $4,200 | $1,134 | -73% |
| Runtime | 6 hours | 1.5 hours | -75% |
| Resource Utilization | 34% | 87% | +156% |
In short, daily cost dropped 73% to $1,134, and the pipeline now finishes in 1.5 hours instead of 6.
## Key Takeaways
- Don't over-provision. Dynamic allocation + spot instances is almost always cheaper.
- Profile your pipeline before optimizing. We were surprised that I/O, not compute, was the bottleneck.
- Use Delta Lake statistics — Z-ordering alone cut our scan time by 60%.
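Why does Z-ordering help so much? Delta Lake keeps min/max statistics per file, and a Z-order (Morton) sort interleaves the bits of several columns so that rows with nearby values in any of them land in the same files, keeping those statistics tight. A toy sketch of the space-filling curve in plain Python (an illustration of the idea, not Delta Lake's actual implementation):

```python
def z_order_key(x, y, bits=8):
    """Interleave the bits of two column values into a single
    Z-order (Morton) key: x's bits go to even positions, y's to odd.
    Sorting by this key keeps nearby (x, y) pairs physically close."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Sorting by the interleaved key clusters both dimensions at once,
# so a filter on either column alone can still skip most files.
points = [(3, 1), (0, 0), (2, 2), (1, 3)]
points.sort(key=lambda p: z_order_key(*p))
```

A plain sort on one column would give perfect skipping for that column and none for the other; the Z-order curve trades a little locality in each dimension for useful locality in both.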
Read more about the patterns referenced in the sidebar, and check out the full notebook where we walk through each optimization step: https://databolt.dev/my-work/work-1