Performance tuning & optimization for Databricks → BigQuery
Spark tuning habits (shuffle optimization, partition overwrite, cluster sizing) don’t translate directly. We tune BigQuery layout, queries, and capacity so pruning works, scanned bytes stay stable, and refresh SLAs hold under real concurrency.
- Input: Databricks performance tuning & optimization logic
- Output: BigQuery equivalent (validated)
Why this breaks
Databricks performance is dominated by cluster runtime and Spark execution patterns (shuffles, partition sizing, file compaction). In BigQuery, cost and runtime are driven by bytes scanned, slot time, and how well queries prune partitions. After cutover, teams often keep Spark-era query shapes and lose pruning—leading to scan blowups, slow dashboards, and unpredictable spend.
Common post-cutover symptoms:
- Queries scan entire facts because partition filters aren’t pushed down
- Heavy joins reshuffle large datasets; BI refresh becomes slow and expensive
- Incremental MERGE/apply touches too much history each run
- Repeated parsing/extraction of semi-structured fields becomes expensive
- Concurrency spikes cause slot contention and tail latency
Optimization replaces “Spark tuning” with BigQuery-native pruning, layout, and governance—backed by baselines and regression gates.
How conversion works
- Baseline the top workloads: identify the most expensive and most business-critical queries/pipelines (dashboards, marts, MERGE/upserts).
- Diagnose root causes: scan bytes, join patterns, partition pruning, repeated transforms, and apply scope.
- Tune table layout: partitioning and clustering aligned to real access paths and refresh windows.
- Rewrite for pruning and reuse: pruning-first filters, pre-aggregation/materializations, and typed extraction boundaries for semi-structured fields.
- Capacity & cost governance: on-demand vs reservations posture, concurrency controls, and guardrails for expensive queries.
- Regression gates: store baselines and enforce thresholds so improvements persist.
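The "bounded apply" lever above can be sketched as a partition-scoped MERGE. Table and column names (`proj.mart.fact_orders`, `proj.stage.orders_delta`, `event_ts`, `order_id`) are illustrative, and the 3-day late-arrival window is a business assumption, not a fixed rule:

```sql
-- Illustrative: bound the MERGE to a late-arrival window so only
-- recent partitions are scanned and rewritten on each run.
DECLARE apply_start DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY);

MERGE `proj.mart.fact_orders` AS tgt
USING (
  SELECT * FROM `proj.stage.orders_delta`
  WHERE DATE(event_ts) >= apply_start        -- prune the source
) AS src
ON  tgt.order_id = src.order_id
AND DATE(tgt.event_ts) >= apply_start        -- prune the target, too
WHEN MATCHED THEN UPDATE SET revenue = src.revenue
WHEN NOT MATCHED THEN INSERT (order_id, event_ts, country, revenue)
  VALUES (src.order_id, src.event_ts, src.country, src.revenue);
```

The explicit predicate on `tgt.event_ts` is what keeps the MERGE from scanning the full target history.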
Supported constructs
Representative tuning levers we apply for Databricks → BigQuery workloads.
| Source | Target | Notes |
|---|---|---|
| Spark shuffle-heavy joins | Pruning-first joins + layout alignment | Reduce scanned bytes and stabilize runtime under BI refresh. |
| Delta partition overwrite habits | Bounded apply windows + partition-scoped MERGE | Avoid touching more history than necessary each run. |
| Compaction/OPTIMIZE habits | Partitioning/clustering + materialization strategy | Keep pruning effective as data grows. |
| Semi-structured parsing in Spark | Typed extraction boundaries + reuse | Extract once, cast once, reuse everywhere. |
| Cluster sizing for performance | Slots/reservations + concurrency posture | Stabilize SLAs under peak usage and control spend. |
| Ad-hoc expensive queries | Governance guardrails + cost controls | Prevent scan blowups and surprise bills. |
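The "typed extraction boundaries" row can be sketched as an extract-once pattern: parse and cast semi-structured payloads in one staging step so downstream queries read typed, partitioned columns instead of re-parsing JSON. Dataset and field names are assumptions:

```sql
-- Illustrative: extract once, cast once, reuse everywhere.
CREATE OR REPLACE TABLE `proj.stage.events_typed`
PARTITION BY DATE(event_ts)
AS
SELECT
  TIMESTAMP(JSON_VALUE(payload, '$.event_ts'))           AS event_ts,
  JSON_VALUE(payload, '$.country')                       AS country,
  SAFE_CAST(JSON_VALUE(payload, '$.revenue') AS NUMERIC) AS revenue
FROM `proj.raw.events_json`;
```

Downstream queries then filter and aggregate the typed columns and never touch the raw payload again.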
How workload changes
| Topic | Databricks / Delta | BigQuery |
|---|---|---|
| Primary cost driver | Cluster runtime | Bytes scanned + slot time |
| Tuning focus | Shuffle patterns, partition sizing, file layout | Partitioning/clustering + pruning-first SQL |
| Incremental apply | Partition overwrite and reprocessing windows common | Bounded MERGE/apply with explicit late windows |
| Concurrency planning | Cluster autoscaling and job scheduling | Slots/reservations + concurrency policies |
Examples
Illustrative BigQuery optimization patterns after Databricks migration: enforce pruning, pre-aggregate for BI, scope MERGEs, and store baselines for regression gates.
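A pre-aggregation sketch for repeated BI scans (the materialized view name and daily/country grain are assumptions; BigQuery can transparently rewrite matching queries against it):

```sql
-- Illustrative: pre-aggregate the grain BI dashboards actually query,
-- so refreshes read the aggregate instead of rescanning the fact.
CREATE MATERIALIZED VIEW `proj.mart.mv_daily_revenue`
AS
SELECT
  DATE(event_ts) AS event_date,
  country,
  SUM(revenue)   AS rev
FROM `proj.mart.fact_orders`
GROUP BY event_date, country;
```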
```sql
-- Pruning-first query shape (fact table partitioned by DATE(event_ts))
SELECT
  country,
  SUM(revenue) AS rev
FROM `proj.mart.fact_orders`
WHERE DATE(event_ts) BETWEEN @start_date AND @end_date
GROUP BY 1;
```
Common pitfalls
- Defeating pruning: wrapping partition columns in functions/casts in WHERE clauses.
- Partitioning by the wrong key: choosing partitions that don’t match common filters and refresh windows.
- Clustering by folklore: clustering keys chosen without evidence from predicates and join keys.
- Unbounded MERGE/apply: incremental jobs touch too much history, causing scan bytes spikes.
- Over-materialization: too many intermediates without controlling refresh cost.
- Ignoring concurrency: BI refresh peaks overwhelm slots/reservations and create tail latency.
- No regression gates: performance improves once, then regresses silently after the next release.
Validation approach
- Baseline capture: runtime, bytes scanned, slot time, and output row counts for top queries/pipelines.
- Pruning checks: confirm partition pruning and predicate pushdown on representative parameters and boundary windows.
- Before/after evidence: demonstrate improvements in runtime and scan bytes; document tradeoffs.
- Correctness guardrails: golden queries and KPI aggregates ensure tuning doesn’t change semantics.
- Regression thresholds: define alerts (e.g., +25% bytes scanned or +30% runtime) and enforce via CI or scheduled checks.
- Operational monitors: dashboards for scan bytes, slot utilization, failures, and refresh SLA adherence.
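The baseline and threshold checks above can be sketched against BigQuery's job metadata. The `workload` label key, the baseline table `proj.ops.perf_baselines`, the one-day lookback, and the +25% threshold are all assumptions:

```sql
-- Illustrative regression check: compare the last day's scan bytes for
-- a labeled workload against a stored baseline (+25% alert threshold).
SELECT
  SUM(j.total_bytes_processed) AS bytes_scanned,
  b.baseline_bytes,
  SUM(j.total_bytes_processed) > 1.25 * b.baseline_bytes AS regression
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS j
CROSS JOIN `proj.ops.perf_baselines` AS b
WHERE j.creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(j.labels) AS l
              WHERE l.key = 'workload' AND l.value = 'daily_revenue')
  AND b.workload = 'daily_revenue'
GROUP BY b.baseline_bytes;
```

A scheduled query or CI step can run this shape per workload and alert when `regression` is true.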
Migration steps
- 01
Identify top cost and SLA drivers
Rank queries and pipelines by bytes scanned, slot time, and business criticality (dashboards, batch windows). Select a tuning backlog with clear owners.
- 02
Create baselines and targets
Capture current BigQuery job metrics (runtime, scan bytes, slot time) and define improvement targets. Freeze golden outputs so correctness doesn’t regress.
- 03
Tune layout: partitioning and clustering
Align partition keys to common filters and refresh windows. Choose clustering keys based on observed predicates and join keys—not guesses.
- 04
Rewrite for pruning and reuse
Apply pruning-aware SQL rewrites, reduce reshuffles, scope MERGEs/applies to affected partitions, and pre-aggregate where BI patterns repeatedly scan large facts.
- 05
Capacity posture and governance
Set reservations/on-demand posture, tune concurrency for BI refresh peaks, and implement guardrails to prevent scan blowups from new queries.
- 06
Add regression gates
Codify performance thresholds and alerting so future changes don’t reintroduce high scan bytes or missed SLAs. Monitor post-cutover metrics continuously.
We identify your highest-cost migrated workloads, tune pruning and table layout, and deliver before/after evidence with regression thresholds—so performance improves and stays stable.
Get an optimization backlog, tuned partitioning/clustering, and performance gates (runtime/bytes/slot thresholds) so future releases don’t reintroduce slow dashboards or high spend.