Databricks ETL pipelines to BigQuery
Re-home Delta MERGE/upsert semantics—staging, dedupe tie-breakers, late data policy, and orchestration—into BigQuery with an explicit run contract and validation gates that prevent KPI drift.
- Input: Databricks ETL / pipeline migration logic
- Output: Validated BigQuery equivalent
Why this breaks
Delta pipelines encode correctness in operational behavior: MERGE logic, partition overwrite/reprocessing windows, and implicit retry/idempotency assumptions. BigQuery can produce the same business outcomes, but only if the run contract is made explicit—keys, ordering, dedupe rules, and late-arrival policy.
Common symptoms after cutover:
- Duplicate or missing updates because dedupe keys and tie-breakers were implicit
- Late events double-counted or silently ignored because reprocessing windows weren’t recreated
- SCD dimensions drift during backfills and late updates
- Costs spike because MERGE/apply touches too much history (unbounded scans)
- Orchestration changes retries/ordering, turning failures into silent data issues
How conversion works
- Inventory & classify pipelines: Delta tables, MERGE targets, orchestration jobs, schedules, and dependencies.
- Extract the run contract: business keys, deterministic ordering/tie-breakers, dedupe rule, late-arrival window policy, and restart semantics.
- Re-home transformations to BigQuery-native staging (landing → typed staging → dedupe → apply) with partitioning/clustering aligned to load windows.
- Implement late-data behavior explicitly: reprocessing windows and staged re-apply so outcomes match without full-table rewrites.
- Re-home orchestration (Composer/Airflow, dbt, or your runner) with explicit DAG dependencies, retries, alerts, and concurrency posture.
- Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late injections) and rollback-ready criteria.
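The late-data step above can be sketched as a scoped re-apply. This is a minimal illustration, not your pipeline: the table names, the `order_id` key, the column set, and the 3-day reprocessing window are all placeholder assumptions.

```sql
-- Illustrative late-window re-apply: only partitions that can still receive
-- late events are recomputed, instead of rewriting the whole target.
-- Assumptions: `proj.*` names, `order_id` key, 3-day late window.
DECLARE reprocess_start DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY);

MERGE `proj.mart.orders` T
USING (
  SELECT order_id, status, amount, event_ts
  FROM `proj.stage.orders_deduped`
  WHERE DATE(event_ts) >= reprocess_start      -- scope source to the late window
) S
ON T.order_id = S.order_id
   AND DATE(T.event_ts) >= reprocess_start     -- scope target partitions too
WHEN MATCHED AND S.event_ts > T.event_ts THEN
  UPDATE SET status = S.status, amount = S.amount, event_ts = S.event_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, status, amount, event_ts)
  VALUES (S.order_id, S.status, S.amount, S.event_ts);
```

The point of the shape is the two scoping predicates: without them, every apply run touches the full target history.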
Supported constructs
Representative Databricks/Delta ETL constructs we commonly migrate to BigQuery (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Delta MERGE INTO (upsert) | BigQuery MERGE (staging + apply) | Match keys, casts, and update predicates made explicit and testable. |
| Partition overwrite / reprocessing windows | Explicit late-window policy + staged re-apply | Backfills and late updates validated as first-class scenarios. |
| ROW_NUMBER-based dedupe patterns | Deterministic dedupe with explicit tie-breakers | Prevents nondeterministic drift under retries. |
| SCD Type-1 / Type-2 apply logic | MERGE + current-flag/end-date patterns | Validated during backfills and late corrections. |
| Schema evolution in Delta | Typed staging + drift policy (widen/quarantine/reject) | Auditability for changing upstream payloads. |
| Spark job orchestration | Composer/dbt orchestration with explicit DAG contracts | Retries, concurrency, and alerts modeled and monitored. |
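The dedupe row in the table above is the one most often gotten wrong in practice. A sketch of the deterministic form, with illustrative column names (`source_seq`, `file_name`, `line_no` are assumptions standing in for whatever uniquely orders your feed):

```sql
-- Deterministic dedupe: every ORDER BY level is an explicit tie-breaker,
-- ending in a unique tail, so retries always pick the same survivor row.
CREATE OR REPLACE TABLE `proj.stage.orders_deduped` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    s.*,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY event_ts DESC,       -- business recency
               source_seq DESC,     -- upstream sequence number
               file_name, line_no   -- unique tail: breaks all remaining ties
    ) AS rn
  FROM `proj.stage.orders_typed` s
)
WHERE rn = 1;
```

If the `ORDER BY` stops before a unique column, two runs over the same duplicates can keep different rows, which is exactly the retry drift described above.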
How workload changes
| Topic | Databricks / Delta | BigQuery |
|---|---|---|
| Incremental correctness | Often relies on partition overwrite and implicit reprocessing | Explicit late-window policy + staged apply with integrity gates |
| Idempotency under retries | Emerges from job structure, not always tested | Proven via rerun simulations and applied-window markers |
| Cost predictability | Cluster runtime + shuffle patterns | Bytes scanned + slot time + pruning effectiveness |
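On the cost-predictability row: one concrete lever is to make pruning non-optional at the table level. A hedged sketch with placeholder names and columns:

```sql
-- Illustrative guard for scan-cost predictability: the target rejects any
-- query (including MERGE sources) that does not filter on the partition column.
CREATE TABLE IF NOT EXISTS `proj.mart.orders` (
  order_id STRING    NOT NULL,
  status   STRING,
  amount   NUMERIC,
  event_ts TIMESTAMP NOT NULL
)
PARTITION BY DATE(event_ts)
CLUSTER BY order_id
OPTIONS (require_partition_filter = TRUE);
```

This turns an unbounded scan from a silent cost spike into a hard query error, which is the posture the integrity gates assume.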
Examples
Canonical BigQuery apply pattern for incremental ETL: stage → dedupe deterministically → MERGE with scoped partitions + applied-window tracking. Adjust keys, partitions, and casts to your model.
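A sketch of the apply step paired with the applied-window marker defined below, so a retried run either records both the MERGE and the marker or neither. All `proj.*` names, the window literals, and the column set are illustrative assumptions:

```sql
-- One apply run, made restartable: DML and window marker committed together.
-- Window bounds are placeholder literals; an orchestrator would supply them.
DECLARE run_window_start TIMESTAMP DEFAULT TIMESTAMP '2024-01-01 00:00:00';
DECLARE run_window_end   TIMESTAMP DEFAULT TIMESTAMP '2024-01-01 01:00:00';

BEGIN TRANSACTION;

MERGE `proj.mart.orders` T
USING `proj.stage.orders_deduped` S
ON T.order_id = S.order_id
WHEN MATCHED AND S.event_ts > T.event_ts THEN
  UPDATE SET status = S.status, amount = S.amount, event_ts = S.event_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, status, amount, event_ts)
  VALUES (S.order_id, S.status, S.amount, S.event_ts);

INSERT INTO `proj.control.applied_windows`
  (job_name, window_start, window_end, applied_at)
VALUES ('orders_apply', run_window_start, run_window_end, CURRENT_TIMESTAMP());

COMMIT TRANSACTION;
```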
```sql
-- Applied-window tracking (restartability)
CREATE TABLE IF NOT EXISTS `proj.control.applied_windows` (
  job_name     STRING    NOT NULL,
  window_start TIMESTAMP NOT NULL,
  window_end   TIMESTAMP NOT NULL,
  applied_at   TIMESTAMP NOT NULL
);
```
Common pitfalls
- Dedupe instability: ROW_NUMBER-based dedupe without a fully deterministic ORDER BY causes nondeterministic drift under retries.
- Late-arrival policy left implicit: Delta relied on partition overwrite; the BigQuery port becomes append-only, so late corrections are never applied.
- Unbounded applies: MERGE touches full targets because apply windows aren’t scoped to affected partitions.
- Type drift in staging: implicit casts in Spark become explicit failures or silent coercions in BigQuery.
- Schema evolution surprises: upstream types widen; downstream typed tables break without a drift policy.
- Orchestration mismatch: concurrency and dependency ordering changes create freshness and correctness drift.
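The type-drift and schema-evolution pitfalls share one mitigation: type at the staging boundary and quarantine what fails. A sketch under assumed names (`amount_raw`, `event_ts_raw`, and the `proj.land`/`proj.stage` datasets are illustrative):

```sql
-- Drift policy sketch: SAFE_CAST yields NULL instead of failing the load;
-- rows that fail the critical cast are quarantined for audit, not dropped.
CREATE OR REPLACE TABLE `proj.stage.orders_typed` AS
SELECT
  order_id,
  SAFE_CAST(amount_raw AS NUMERIC)     AS amount,
  SAFE_CAST(event_ts_raw AS TIMESTAMP) AS event_ts
FROM `proj.land.orders_raw`
WHERE SAFE_CAST(event_ts_raw AS TIMESTAMP) IS NOT NULL;

CREATE OR REPLACE TABLE `proj.stage.orders_quarantine` AS
SELECT *, CURRENT_TIMESTAMP() AS quarantined_at
FROM `proj.land.orders_raw`
WHERE SAFE_CAST(event_ts_raw AS TIMESTAMP) IS NULL;
```

Whether a given drift is widened, quarantined, or rejected is a per-column policy decision; the pattern just makes the decision point explicit.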
Validation approach
- Execution checks: pipelines run reliably under representative volumes and schedules.
- Structural parity: partition-level row counts and column profiles (null/min/max/distinct) for key tables.
- KPI parity: aggregates by key dimensions for critical marts and dashboards.
- Incremental integrity (mandatory):
- Idempotency: rerun same micro-batch → no net change
- Late-arrival: inject late updates → only expected rows change
- Backfill safety: replay historical windows → stable SCD and dedupe
- Dedupe stability: duplicates eliminated consistently under retries
- Cost/performance gates: pruning verified; scan bytes/runtime thresholds set for top jobs.
- Operational readiness: retry/alerting tests and rollback criteria defined before cutover.
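The idempotency gate above can be probed with a symmetric-difference query. A minimal sketch, assuming a pre-rerun snapshot table (`proj.audit.orders_snapshot_before` is a placeholder name):

```sql
-- Idempotency probe: after rerunning the same micro-batch, both directions
-- of EXCEPT DISTINCT must be empty, i.e. no net change to the target.
SELECT 'only_in_before' AS side, COUNT(*) AS n FROM (
  SELECT * FROM `proj.audit.orders_snapshot_before`
  EXCEPT DISTINCT
  SELECT * FROM `proj.mart.orders`
)
UNION ALL
SELECT 'only_in_after' AS side, COUNT(*) AS n FROM (
  SELECT * FROM `proj.mart.orders`
  EXCEPT DISTINCT
  SELECT * FROM `proj.audit.orders_snapshot_before`
);
-- Gate passes when n = 0 on both sides.
```

The same shape serves the structural-parity gate when run against golden outputs instead of a rerun snapshot.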
Migration steps
- 01
Inventory pipelines, schedules, and dependencies
Extract pipeline graph: MERGE targets, upstream feeds, orchestration DAGs, schedules, and SLAs. Identify business-critical marts and consumers.
- 02
Formalize the run contract
Define high-water marks, business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival window policy, and backfill boundaries. Make restartability explicit.
- 03
Rebuild transformations on BigQuery-native staging
Implement landing → typed staging → dedupe → apply with partitioning/clustering aligned to load windows. Define schema evolution policy (widen/quarantine/reject).
- 04
Re-home orchestration and operations
Implement DAGs in Composer/Airflow or your orchestrator: dependencies, retries, alerts, and concurrency. Add applied-window tracking and failure handling.
- 05
Run parity and incremental integrity gates
Golden outputs + KPI aggregates, idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
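Step 04's applied-window tracking implies a guard the orchestrator runs before each apply. A sketch against the control table defined earlier, with illustrative job name and window literals:

```sql
-- Restart guard: skip the apply if this exact window is already recorded.
SELECT COUNT(*) > 0 AS already_applied
FROM `proj.control.applied_windows`
WHERE job_name     = 'orders_apply'
  AND window_start = TIMESTAMP '2024-01-01 00:00:00'
  AND window_end   = TIMESTAMP '2024-01-01 01:00:00';
```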
We inventory your Databricks pipelines, formalize MERGE/late-data semantics, migrate a representative pipeline end-to-end, and produce parity evidence with cutover gates—without scan-cost surprises.
Get an actionable migration plan with incremental integrity tests (reruns, late data, backfills), reconciliation evidence, and cost/performance baselines—so pipeline cutover is controlled and dispute-proof.