Workload

Databricks ETL pipelines to Snowflake

Re-home Delta MERGE/upsert semantics—staging, deterministic dedupe, late data policy, and restartability—into Snowflake with an explicit run contract and validation gates that prevent KPI drift.

At a glance
Input
Databricks ETL / pipeline migration logic
Output
Snowflake equivalent (validated)
Common pitfalls
  • Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers causes drift under retries.
  • Late-arrival policy left implicit: Delta relied on partition overwrite; the Snowflake version silently becomes append-only.
  • Full-target MERGE: missing apply boundaries cause large scans and credit spikes.
Context

Why this breaks

Delta pipelines encode correctness in operational behavior: MERGE logic, partition overwrite/reprocessing windows, and implicit retry/idempotency assumptions. When migrated naïvely, teams recreate “mechanics” but lose the correctness contract—keys, tie-breakers, and late-arrival rules—so KPIs drift after cutover.

Common symptoms after migration:

  • Duplicates or missing updates because dedupe keys and tie-breakers were implicit
  • Late events double-counted or silently ignored because reprocessing windows weren’t recreated
  • MERGE behavior changes because match keys/casts/predicates weren’t made explicit
  • SCD dimensions drift during backfills and late updates
  • Credit spikes because MERGE/apply touches too much history (full-target scans)

A successful migration extracts the run contract and implements a Snowflake-native staged apply with integrity gates.

Approach

How conversion works

  1. Inventory & classify pipelines: Delta tables, MERGE targets, sources, schedules, and orchestration DAGs.
  2. Extract the run contract: business keys, deterministic ordering/tie-breakers, dedupe rule, late-arrival window policy, and restart semantics.
  3. Re-home transformations into Snowflake staging (landing → canonicalize/cast → dedupe → apply) with explicit delete semantics where needed.
  4. Bound apply scope: stage apply windows and pruning-aware strategies so MERGEs don’t become full-target scans.
  5. Re-home orchestration (Airflow/dbt/scheduler) with explicit dependencies, retries, alerts, and concurrency posture (warehouse isolation).
  6. Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, late injections, backfills) and rollback-ready criteria.
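As a sketch of steps 2–3, deterministic dedupe in Snowflake might look like the following; table names, the `order_id` business key, and the tie-breaker columns are illustrative assumptions, not a prescription:

```sql
-- Deterministic dedupe: business key plus explicit, stable tie-breakers.
-- The ORDER BY must define a total ordering so exactly one winner per key
-- survives, even when the same batch is retried.
CREATE OR REPLACE TEMPORARY TABLE STG.ORDERS_DEDUPED AS
SELECT *
FROM STG.ORDERS_CANONICAL
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id                                         -- business key
  ORDER BY event_ts DESC, source_file DESC, source_row_num DESC -- tie-breakers
) = 1;
```

Without the trailing tie-breakers, two rows with the same `event_ts` can swap ranks between runs, which is exactly the nondeterministic drift the run contract is meant to rule out.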

Supported constructs

Representative Databricks/Delta ETL constructs we commonly migrate to Snowflake (exact coverage depends on your estate).

  • Delta MERGE INTO (upsert) → Snowflake MERGE (staging + apply): match keys, casts, and update predicates made explicit and testable.
  • Partition overwrite / reprocessing windows → explicit late-window policy + staged re-apply: backfills and late updates validated as first-class scenarios.
  • ROW_NUMBER-based dedupe patterns → deterministic dedupe with explicit tie-breakers: prevents nondeterministic drift under retries.
  • Deletes and tombstones → explicit delete semantics in MERGE (hard/soft): delete behavior validated with replay and edge cohorts.
  • SCD Type-1 / Type-2 apply logic → MERGE + current-flag/end-date patterns: validated during backfills and late corrections.
  • Spark job orchestration → Airflow/dbt orchestration with explicit DAG contracts: retries, warehouse isolation, and alerts modeled and monitored.
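To illustrate the delete-semantics row, tombstones arriving as soft deletes can be re-expressed as conditional MERGE branches. A minimal sketch, assuming an `is_deleted` flag on the staged rows and illustrative table/column names:

```sql
-- Soft deletes: tombstone rows flip a flag on the target instead of
-- physically removing rows, so downstream facts can reconcile.
MERGE INTO CORE.CUSTOMERS t
USING STG.CUSTOMERS_DEDUPED s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.is_deleted THEN
  UPDATE SET t.is_active = FALSE, t.deleted_at = s.event_ts
WHEN MATCHED AND NOT s.is_deleted THEN
  UPDATE SET t.name = s.name, t.email = s.email,
             t.is_active = TRUE, t.updated_at = s.event_ts
WHEN NOT MATCHED AND NOT s.is_deleted THEN
  INSERT (customer_id, name, email, is_active, updated_at)
  VALUES (s.customer_id, s.name, s.email, TRUE, s.event_ts);
```

Snowflake MERGE allows multiple `WHEN MATCHED ... AND` branches, which is what makes the hard/soft delete decision explicit and testable rather than implicit in job mechanics.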

How workload changes

  • Incremental correctness: Databricks/Delta often relies on partition overwrite and implicit reprocessing; Snowflake uses an explicit late-window policy plus staged MERGE apply with integrity gates. Correctness becomes auditable and repeatable under retries and backfills.
  • Idempotency under retries: in Databricks/Delta it emerges from job structure and is not always tested; in Snowflake it is proven via rerun simulations and applied-batch markers, and enforced as a cutover gate.
  • Cost predictability: Databricks cost tracks cluster runtime and shuffle patterns; Snowflake cost tracks warehouse credits and pruning effectiveness. Bounded MERGE scope and warehouse isolation keep credit burn stable.

Examples

Canonical Snowflake incremental apply pattern: stage → dedupe deterministically → MERGE with bounded scope + applied-batch tracking. Adjust keys, offsets, and casts to your model.

-- Applied-batch tracking (restartability)
CREATE TABLE IF NOT EXISTS CONTROL.APPLIED_BATCHES (
  job_name STRING NOT NULL,
  batch_id STRING NOT NULL,
  applied_at TIMESTAMP_NTZ NOT NULL,
  -- Note: Snowflake does not enforce PRIMARY KEY constraints;
  -- uniqueness must be guaranteed by the apply logic itself.
  PRIMARY KEY (job_name, batch_id)
);
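A hedged sketch of the apply step this control table supports, with the orchestrator checking the marker before running; `$batch_id`, the window bounds, and all table/column names are illustrative assumptions:

```sql
-- Apply one micro-batch inside a transaction: the MERGE is bounded to the
-- affected date window so it prunes instead of scanning the full target,
-- and the batch marker makes a retry of the same batch a no-op.
BEGIN;

MERGE INTO CORE.FACT_SALES t
USING (
  SELECT * FROM STG.SALES_DEDUPED WHERE batch_id = $batch_id
) s
  ON  t.sale_id = s.sale_id
  AND t.sale_date BETWEEN $window_start AND $window_end   -- bounded scope
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount, t.updated_at = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (sale_id, sale_date, amount, updated_at)
  VALUES (s.sale_id, s.sale_date, s.amount, s.event_ts);

-- Record the batch; the guard keeps duplicate markers out under retries.
INSERT INTO CONTROL.APPLIED_BATCHES (job_name, batch_id, applied_at)
SELECT 'fact_sales_apply', $batch_id, CURRENT_TIMESTAMP()
WHERE NOT EXISTS (
  SELECT 1 FROM CONTROL.APPLIED_BATCHES
  WHERE job_name = 'fact_sales_apply' AND batch_id = $batch_id
);

COMMIT;
```

The window predicate on `sale_date` is what keeps the MERGE pruning-aware: without it, every apply becomes a full-target scan and the credit profile drifts with table size.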
Avoid

Common pitfalls

  • Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers causes drift under retries.
  • Late-arrival policy left implicit: Delta relied on partition overwrite; the Snowflake version silently becomes append-only.
  • Full-target MERGE: missing apply boundaries cause large scans and credit spikes.
  • Delete semantics lost: tombstones/soft deletes not modeled; downstream facts diverge.
  • Type drift in staging: implicit casts in Spark must become explicit in Snowflake to preserve intent.
  • Schema evolution surprises: upstream fields widen; typed targets break without a drift policy.
  • Warehouse contention: BI and batch share warehouses; concurrency creates tail latency and cost spikes.
Proof

Validation approach

  • Execution checks: pipelines run reliably under representative volumes and schedules.
  • Structural parity: window/partition-level row counts and column profiles (null/min/max/distinct) for key tables.
  • KPI parity: aggregates by key dimensions for critical marts and dashboards.
  • Incremental integrity (mandatory):
    • Idempotency: rerun same micro-batch → no net change
    • Late-arrival: inject late updates → only expected rows change
    • Backfill safety: replay historical windows → stable SCD and dedupe
    • Dedupe stability: duplicates eliminated consistently under retries
  • Cost/performance gates: bounded MERGE scope verified; credit/runtime thresholds set for top jobs.
  • Operational readiness: retry/alerting tests, canary gates, and rollback criteria defined before cutover.
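One way to make the idempotency gate concrete is to fingerprint the target before and after rerunning the same micro-batch. A sketch using Snowflake's `HASH_AGG` aggregate; the target table name is illustrative:

```sql
-- Snapshot a content fingerprint of the target, re-execute the same
-- micro-batch apply, then assert the fingerprint is unchanged.
SET before_hash = (SELECT HASH_AGG(*) FROM CORE.FACT_SALES);

-- ... rerun the identical micro-batch apply here ...

SELECT IFF($before_hash = (SELECT HASH_AGG(*) FROM CORE.FACT_SALES),
           'IDEMPOTENT', 'DRIFT DETECTED') AS idempotency_check;
```

The same before/after pattern extends to the late-arrival and backfill gates: inject the scenario, then assert that only the expected rows changed.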
Execution

Migration steps

A sequence that keeps pipeline correctness measurable and cutover controlled.
  1. Inventory pipelines, schedules, and dependencies

     Extract the pipeline graph: MERGE targets, upstream feeds, orchestration DAGs, schedules, and SLAs. Identify business-critical marts and consumers.

  2. Formalize the run contract

     Define business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival window policy, delete semantics, and backfill boundaries. Make restartability explicit.

  3. Rebuild transformations on Snowflake-native staging

     Implement landing → cast/normalize → dedupe → apply with bounded MERGE scope. Add a schema evolution policy (widen/quarantine/reject) where feeds change.

  4. Re-home orchestration and operations

     Implement DAGs and operational controls: retries, alerts, concurrency posture, and warehouse isolation. Add applied-batch tracking and failure handling.

  5. Run parity and incremental integrity gates

     Golden outputs + KPI aggregates, idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
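The quarantine half of the schema evolution policy in step 3 can be sketched with Snowflake's `TRY_CAST`, routing rows that no longer fit the typed target instead of failing the load. Table names, the `raw` VARIANT column, and the `amount` field are illustrative assumptions:

```sql
-- Feeds land loosely typed (VARIANT); rows whose fields cast cleanly flow
-- to canonical staging, while rows that do not are quarantined with a
-- reason so widening decisions stay explicit rather than breaking loads.
INSERT INTO STG.ORDERS_QUARANTINE (raw_record, reason, seen_at)
SELECT raw,
       'amount not castable to NUMBER(18,2)',
       CURRENT_TIMESTAMP()
FROM LANDING.ORDERS_RAW
WHERE raw:amount IS NOT NULL
  AND TRY_CAST(raw:amount::STRING AS NUMBER(18,2)) IS NULL;
```

A matching `INSERT ... WHERE TRY_CAST(...) IS NOT NULL` feeds the canonical staging table, so the drift policy (widen, quarantine, or reject) is a reviewable query rather than an implicit Spark cast.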

Workload Assessment
Migrate Delta pipelines with the run contract intact

We inventory your Databricks pipelines, formalize MERGE/late-data semantics, migrate a representative pipeline end-to-end, and produce parity evidence with cutover gates—without credit spikes.

Migration Acceleration
Cut over pipelines with proof-backed gates

Get an actionable migration plan with incremental integrity tests (reruns, late data, backfills), reconciliation evidence, and cost/performance baselines—so pipeline cutover is controlled and dispute-proof.

FAQ

Frequently asked questions

Is Databricks ETL migration just rewriting Spark jobs?
No. The critical work is preserving the run contract: keys, ordering/tie-breakers, dedupe rules, late-arrival policy, delete semantics, and restartability. Job translation is only one part.
How do you preserve Delta MERGE behavior in Snowflake?
We formalize match keys and ordering, dedupe deterministically, and implement staged apply with bounded MERGE scope. Then we prove idempotency and late-data behavior with simulations as cutover gates.
What if our pipeline relies on partition overwrite semantics?
We convert it into an explicit late-arrival and reprocessing policy and implement staged re-apply so outcomes match without full-table reprocessing.
How do you avoid Snowflake credit spikes for ETL?
We design pruning-aware staging boundaries, bound MERGE scope to affected windows, and isolate batch warehouses from BI. Validation includes credit/runtime baselines and regression thresholds for top jobs.