Databricks ETL pipelines to Snowflake
Re-home Delta MERGE/upsert semantics—staging, deterministic dedupe, late data policy, and restartability—into Snowflake with an explicit run contract and validation gates that prevent KPI drift.
- Input: Databricks ETL / pipeline migration logic
- Output: validated Snowflake equivalent
Why this breaks
Delta pipelines encode correctness in operational behavior: MERGE logic, partition overwrite/reprocessing windows, and implicit retry/idempotency assumptions. When migrated naïvely, teams recreate “mechanics” but lose the correctness contract—keys, tie-breakers, and late-arrival rules—so KPIs drift after cutover.
Common symptoms after migration:
- Duplicates or missing updates because dedupe keys and tie-breakers were implicit
- Late events double-counted or silently ignored because reprocessing windows weren’t recreated
- MERGE behavior changes because match keys/casts/predicates weren’t made explicit
- SCD dimensions drift during backfills and late updates
- Credit spikes because MERGE/apply touches too much history (full-target scans)
A successful migration extracts the run contract and implements a Snowflake-native staged apply with integrity gates.
How conversion works
- Inventory & classify pipelines: Delta tables, MERGE targets, sources, schedules, and orchestration DAGs.
- Extract the run contract: business keys, deterministic ordering/tie-breakers, dedupe rule, late-arrival window policy, and restart semantics.
- Re-home transformations into Snowflake staging (landing → canonicalize/cast → dedupe → apply) with explicit delete semantics where needed.
- Bound apply scope: stage apply windows and pruning-aware strategies so MERGEs don’t become full-target scans.
- Re-home orchestration (Airflow/DBT/Scheduler) with explicit dependencies, retries, alerts, and concurrency posture (warehouse isolation).
- Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, late injections, backfills) and rollback-ready criteria.
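The staging flow in step three can be sketched in Snowflake SQL. All object and column names here (RAW.ORDERS_LANDING, STG.*, order_id) are illustrative assumptions, not a fixed schema:

```sql
-- Canonicalize: the implicit casts Spark performed become explicit here,
-- so type intent survives the migration.
CREATE OR REPLACE TEMPORARY TABLE STG.ORDERS_CANON AS
SELECT
    CAST(order_id   AS NUMBER)        AS order_id,
    CAST(amount     AS NUMBER(18,2))  AS amount,
    CAST(updated_at AS TIMESTAMP_NTZ) AS updated_at,
    batch_id
FROM RAW.ORDERS_LANDING
WHERE batch_id = :batch_id;  -- bind or substitute the current run's batch id

-- Deterministic dedupe: the ORDER BY must include stable tie-breakers;
-- updated_at alone is not enough when two copies share a timestamp.
CREATE OR REPLACE TEMPORARY TABLE STG.ORDERS_DEDUP AS
SELECT *
FROM STG.ORDERS_CANON
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY updated_at DESC, batch_id DESC  -- stable tie-breaker chain
) = 1;
```

Because the tie-breaker chain is total, reruns of the same batch select the same survivor row every time, which is what makes the downstream apply idempotent.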
Supported constructs
Representative Databricks/Delta ETL constructs we commonly migrate to Snowflake (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Delta MERGE INTO (upsert) | Snowflake MERGE (staging + apply) | Match keys, casts, and update predicates made explicit and testable. |
| Partition overwrite / reprocessing windows | Explicit late-window policy + staged re-apply | Backfills and late updates validated as first-class scenarios. |
| ROW_NUMBER-based dedupe patterns | Deterministic dedupe with explicit tie-breakers | Prevents nondeterministic drift under retries. |
| Deletes and tombstones | Explicit delete semantics in MERGE (hard/soft) | Delete behavior validated with replay and edge cohorts. |
| SCD Type-1 / Type-2 apply logic | MERGE + current-flag/end-date patterns | Validated during backfills and late corrections. |
| Spark job orchestration | Airflow/dbt orchestration with explicit DAG contracts | Retries, warehouse isolation, and alerts modeled and monitored. |
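As one sketch of the SCD Type-2 row in the table above: a close-then-insert pattern with a current flag and end date. Table and column names (DIM_CUSTOMER, STG_CUSTOMER_CHANGES, attr_hash) are illustrative assumptions:

```sql
-- Step 1: end-date current rows whose attributes genuinely changed.
UPDATE DIM_CUSTOMER d
SET is_current = FALSE,
    valid_to   = s.effective_at
FROM STG_CUSTOMER_CHANGES s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND d.attr_hash <> s.attr_hash;   -- no-op changes are skipped

-- Step 2: insert new current versions; the NOT EXISTS guard makes a
-- rerun of the same staging batch a no-op (idempotent backfills).
INSERT INTO DIM_CUSTOMER
  (customer_id, name, tier, attr_hash, valid_from, valid_to, is_current)
SELECT customer_id, name, tier, attr_hash, effective_at, NULL, TRUE
FROM STG_CUSTOMER_CHANGES s
WHERE NOT EXISTS (
    SELECT 1
    FROM DIM_CUSTOMER d
    WHERE d.customer_id = s.customer_id
      AND d.is_current
      AND d.attr_hash = s.attr_hash
);
```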
How workload changes
| Topic | Databricks / Delta | Snowflake |
|---|---|---|
| Incremental correctness | Often relies on partition overwrite and implicit reprocessing | Explicit late-window policy + staged MERGE apply with integrity gates |
| Idempotency under retries | Emerges from job structure, not always tested | Proven via rerun simulations + applied-batch markers |
| Cost predictability | Cluster runtime + shuffle patterns | Warehouse credits + pruning effectiveness |
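Pruning effectiveness can be monitored directly. A sketch against SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY (the target-table filter is an illustrative assumption) surfaces MERGEs that degrade into full-target scans:

```sql
-- Rank recent MERGEs on the target by how much of it they scanned.
SELECT query_id,
       partitions_scanned,
       partitions_total,
       partitions_scanned / NULLIF(partitions_total, 0) AS scan_ratio
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text ILIKE 'MERGE INTO MART.ORDERS%'   -- illustrative target
  AND start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY scan_ratio DESC;
```

A scan_ratio near 1.0 on a large target is the credit-spike signature described above; it usually means the apply window predicate is missing or not pruning-aligned.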
Examples
Canonical Snowflake incremental apply pattern: stage → dedupe deterministically → MERGE with bounded scope + applied-batch tracking. Adjust keys, offsets, and casts to your model.
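A minimal sketch of the apply step, assuming a deterministically deduped staging table STG.ORDERS_DEDUP and the CONTROL.APPLIED_BATCHES control table defined next (all object names are illustrative). The orchestrator first checks the control table and skips the run if the batch was already applied:

```sql
-- Restartability guard (run by the orchestrator before the apply):
--   SELECT 1 FROM CONTROL.APPLIED_BATCHES
--   WHERE job_name = 'orders_apply' AND batch_id = :batch_id;
-- A hit means this batch is done; skip the MERGE entirely.

BEGIN;

MERGE INTO MART.ORDERS t
USING (
    SELECT *
    FROM STG.ORDERS_DEDUP
    WHERE batch_id = :batch_id
      AND updated_at >= :window_start   -- bounded apply scope, not full target
) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (order_id, amount, updated_at)
    VALUES (s.order_id, s.amount, s.updated_at);

-- Record the batch in the same transaction as the apply.
INSERT INTO CONTROL.APPLIED_BATCHES (job_name, batch_id, applied_at)
VALUES ('orders_apply', :batch_id, CURRENT_TIMESTAMP());

COMMIT;
```

Committing the batch marker together with the MERGE is what makes a retry safe: either both landed or neither did.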
```sql
-- Applied-batch tracking (restartability)
CREATE TABLE IF NOT EXISTS CONTROL.APPLIED_BATCHES (
    job_name   STRING        NOT NULL,
    batch_id   STRING        NOT NULL,
    applied_at TIMESTAMP_NTZ NOT NULL,
    PRIMARY KEY (job_name, batch_id)
);
```

Common pitfalls
- Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers causes drift under retries.
- Late-arrival policy implicit: Delta relied on partition overwrite; Snowflake version becomes append-only.
- Full-target MERGE: missing apply boundaries causes large scans and credit spikes.
- Delete semantics lost: tombstones/soft deletes not modeled; downstream facts diverge.
- Type drift in staging: implicit casts in Spark must become explicit in Snowflake to preserve intent.
- Schema evolution surprises: upstream fields widen; typed targets break without a drift policy.
- Warehouse contention: BI and batch share warehouses; concurrency creates tail latency and cost spikes.
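The delete-semantics pitfall in particular deserves an explicit pattern. One sketch, assuming tombstones arrive as staged rows with an is_deleted flag (names illustrative):

```sql
-- Model tombstones as soft deletes instead of silently dropping them.
MERGE INTO MART.ORDERS t
USING STG.ORDERS_DEDUP s
ON t.order_id = s.order_id
WHEN MATCHED AND s.is_deleted THEN
    UPDATE SET t.is_active = FALSE, t.deleted_at = s.updated_at
WHEN MATCHED AND NOT s.is_deleted AND s.updated_at > t.updated_at THEN
    UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED AND NOT s.is_deleted THEN
    INSERT (order_id, amount, updated_at, is_active)
    VALUES (s.order_id, s.amount, s.updated_at, TRUE);
```

Hard deletes are the same shape with `WHEN MATCHED AND s.is_deleted THEN DELETE`; either way, the behavior should be validated with replay and edge cohorts rather than assumed.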
Validation approach
- Execution checks: pipelines run reliably under representative volumes and schedules.
- Structural parity: window/partition-level row counts and column profiles (null/min/max/distinct) for key tables.
- KPI parity: aggregates by key dimensions for critical marts and dashboards.
- Incremental integrity (mandatory):
- Idempotency: rerun same micro-batch → no net change
- Late-arrival: inject late updates → only expected rows change
- Backfill safety: replay historical windows → stable SCD and dedupe
- Dedupe stability: duplicates eliminated consistently under retries
- Cost/performance gates: bounded MERGE scope verified; credit/runtime thresholds set for top jobs.
- Operational readiness: retry/alerting tests, canary gates, and rollback criteria defined before cutover.
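The idempotency gate can be made concrete with a before/after hash comparison (names illustrative; a minimal sketch, not a full reconciliation harness):

```sql
-- Snapshot row-level hashes of the target before rerunning the batch.
CREATE OR REPLACE TEMPORARY TABLE QA.ORDERS_BEFORE AS
SELECT order_id,
       HASH(amount, updated_at, is_active) AS row_hash
FROM MART.ORDERS;

-- ... rerun the same micro-batch apply here ...

-- Gate: zero rows may differ after the rerun.
SELECT COUNT(*) AS drifted_rows
FROM MART.ORDERS t
JOIN QA.ORDERS_BEFORE b USING (order_id)
WHERE HASH(t.amount, t.updated_at, t.is_active) <> b.row_hash;
-- Pass condition: drifted_rows = 0, plus equal total row counts.
```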
Migration steps
- 01 Inventory pipelines, schedules, and dependencies
  Extract the pipeline graph: MERGE targets, upstream feeds, orchestration DAGs, schedules, and SLAs. Identify business-critical marts and consumers.
- 02 Formalize the run contract
  Define business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival window policy, delete semantics, and backfill boundaries. Make restartability explicit.
- 03 Rebuild transformations on Snowflake-native staging
  Implement landing → cast/normalize → dedupe → apply with bounded MERGE scope. Add a schema evolution policy (widen/quarantine/reject) where feeds change.
- 04 Re-home orchestration and operations
  Implement DAGs and operational controls: retries, alerts, concurrency posture, and warehouse isolation. Add applied-batch tracking and failure handling.
- 05 Run parity and incremental integrity gates
  Compare golden outputs and KPI aggregates; run idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
We inventory your Databricks pipelines, formalize MERGE/late-data semantics, migrate a representative pipeline end-to-end, and produce parity evidence with cutover gates—without credit spikes.
Get an actionable migration plan with incremental integrity tests (reruns, late data, backfills), reconciliation evidence, and cost/performance baselines—so pipeline cutover is controlled and dispute-proof.